Empirical tests show that the rank order of items derived from the calculated mean of the individual decision-makers' weights tends to be an adequate predictor of the rank order of items negotiated by the group under the following conditions:
The allocations derived from the quadratic programming method may be checked by similar procedures as used for checking individual and team weights, together with the 'model fit' index described earlier.
Though reliability tests may be sufficient for many decision-making purposes, more stringent validity tests will sometimes be required (Carmines and Zeller, 1980). As noted above, the dynamic simulation model embodied in Budget Monitoring provides a means of making validation tests of the whole system.
The judgement analysis method has been validated for ratio scaling of sensations like the brightness of light, distance of perceived objects, the heaviness of a lifted weight, the sweetness or sourness of a taste, the acridness of an odour, the loudness of a sound (Saaty, 1980). Similar criterion tests can be used to validate the application of the method to decision issues on more complex phenomena, since the logic, measures and procedures for quantifying physical stimuli on a sensory continuum are virtually identical to those used for judgement analysis applied to decision issues (Lodge, 1981).
Those who have worked hard learning how to derive patterns of judgement and decision from statistically noisy data initially find it hard to accept that judgement analysis can produce relationships which are as law-like as scales for physical sensations. Human observers are capable of using numbers to make proportional judgements of stimulation levels, and their judgements follow the power law. This is the law-like linear relationship that equal stimulus ratios produce equal subjective ratios, which is one of the best-evidenced laws of human judgement in psychophysics (Stevens, 1957; Cliff, 1973).
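The ratio property of the power law can be illustrated with a minimal sketch. It assumes the usual form of the law, in which perceived magnitude equals a constant times the stimulus raised to an exponent; the function name, the constant k and the exponent beta are illustrative choices, not values taken from the studies cited.

```python
def perceived_magnitude(stimulus, k=1.0, beta=0.5):
    """Power-law response: k * stimulus ** beta (k and beta are illustrative)."""
    return k * stimulus ** beta

# Two stimulus pairs with the same physical ratio (2:1) at different levels.
ratio_low = perceived_magnitude(20.0) / perceived_magnitude(10.0)
ratio_high = perceived_magnitude(200.0) / perceived_magnitude(100.0)

# Equal stimulus ratios give equal subjective ratios, whatever the level.
print(ratio_low, ratio_high)
```

Both ratios equal 2 raised to the power beta, which is why the relationship plots as a straight line on logarithmic coordinates.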
However, the potential validity of judgement analysis methods may be further tested by comparison with the results produced when other multi-variate analysis techniques are used. An example is given here which focuses on professional social workers' judgements of the relative seriousness of various kinds of juvenile crimes on which they are required to make recommendations to the Juvenile Court*.
Profiles of nine representative offences selected from the 1977 records of one Inner London Juvenile Court were presented to 30 social workers randomly selected from staff of a Social Services Organisation.
As Sellins and Wolfgang (1964) detected no significant variation in the perceived seriousness of crimes for different ages of offenders, the judges were asked to assume the offences were committed by a 14-year-old boy. In a paired comparison exercise, they judged the perceived relative seriousness of the offences. The mean value for each offence was calculated from the weights assigned by the individual decision-makers.
* This example is based on research undertaken in conjunction with C. Gostick (1983).
An eigenvector scale (Algie and Foster, 1980), a magnitude scale (Sellins and Wolfgang, 1964), a Thurstone scale (Torgeson, 1958) and a Walker scale (Walker, 1978) were produced from their judgements, allowing relevant comparisons to be made with the priority (i.e. the geometric mean) scale (Algie and Foster, 1980).
To obtain the magnitude scale (Table 4, Column 3), one offence ("theft or shop-lifting of goods, money or property valued at up to £10") was randomly selected as a "standard" and assigned an arbitrary score of 20 units. The judges estimated the seriousness of this standard. So that individual weightings could be compared directly, regardless of the original estimations, they were converted to standard weights, giving universally positive weights with a maximum of two digits per integer. The overall weightings were then produced by calculating the geometric mean value for each of the nine offences. The relative variability of the individual offence weightings fluctuated between 15 and 30 percent. (As would be anticipated, the standard deviation of each item broadly increases in proportion to the mean scale value.)
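The aggregation step for the magnitude scale can be sketched as follows. This is a minimal illustration only: the estimates are hypothetical, and "relative variability" is assumed here to mean the coefficient of variation (sample standard deviation divided by the arithmetic mean), which the text does not state explicitly.

```python
import math
from statistics import mean, stdev

def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def relative_variability(values):
    """Coefficient of variation in percent (an assumed reading of the text)."""
    return 100.0 * stdev(values) / mean(values)

# Hypothetical magnitude estimates by four judges for one offence,
# expressed relative to a standard offence scored at 20 units.
estimates = [10.0, 20.0, 40.0, 20.0]

scale_value = geometric_mean(estimates)   # overall weighting for the offence
spread = relative_variability(estimates)  # dispersion among the judges
print(scale_value, spread)
```

The geometric mean is the natural average for ratio judgements because doubling one estimate and halving another leaves the scale value unchanged.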
To obtain the Thurstone Comparative Scale (Table 4, Column 4), the judges compared each pair of items in turn, in random sequence, to determine which they perceived to be the more serious, scoring each item 0 or 1 depending on which of the two was judged the more severe. This procedure continued until each item had been compared against every other item.
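The text does not say how the 0/1 scores were converted into scale values. A common choice, assumed here, is Thurstone's Case V solution, in which the proportion of judges preferring one item to another is converted to a unit normal deviate and the deviates are averaged for each item. The function name, the clipping rule for unanimous judgements and the data are all illustrative.

```python
from statistics import NormalDist

def thurstone_case_v(wins, n_judges):
    """Case V scale values from a paired comparison tally.

    wins[i][j] is the number of judges who rated item i more serious
    than item j (the diagonal is ignored).
    """
    k = len(wins)
    inv = NormalDist().inv_cdf
    low, high = 0.5 / n_judges, 1.0 - 0.5 / n_judges
    scale = []
    for i in range(k):
        z_sum = 0.0
        for j in range(k):
            if i == j:
                continue  # an item compared with itself contributes z = 0
            p = wins[i][j] / n_judges
            # Clip unanimous proportions so the normal deviate stays finite.
            p = min(max(p, low), high)
            z_sum += inv(p)
        scale.append(z_sum / k)
    return scale

# Hypothetical tally for three offences judged by ten judges.
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
scale = thurstone_case_v(wins, n_judges=10)
print(scale)
```

The resulting values form an interval scale centred on zero; only the differences between items are meaningful.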
The Walker scale (Table 4, Column 5) was derived by simply counting the occasions on which any one item was judged more severe than any other item. This weight may be used directly or it may be converted into a percentage of the maximum possible score on the basis of:
max score = n(k - 1)
where k is the number of scale items and n is the number of respondents in the sample. In this case, the maximum score for any item was 240. The relationship between the magnitude scale and both the Thurstone and Walker scales is linear, as confirmed in Figure 1, with an R² of 0.946 and 0.964 respectively. (The relationship between the magnitude scale and the Thurstone scale is also linear, with R = 0.989).
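The Walker score and its maximum can be checked with a short sketch. With 30 respondents and nine items the maximum is 30 × (9 − 1) = 240, as stated. The function names and the small tally are illustrative and are not the study's data.

```python
def walker_scores(wins):
    """Number of times each item was judged more severe than another item.

    wins[i][j] is the number of respondents who judged item i more
    severe than item j (the diagonal is ignored).
    """
    return [sum(v for j, v in enumerate(row) if j != i)
            for i, row in enumerate(wins)]

def max_walker_score(n_respondents, k_items):
    """Maximum possible score for one item: n(k - 1)."""
    return n_respondents * (k_items - 1)

def as_percentages(scores, n_respondents, k_items):
    """Scores as a percentage of the maximum possible score."""
    top = max_walker_score(n_respondents, k_items)
    return [100.0 * s / top for s in scores]

# The study's design: 30 respondents, nine offences.
study_max = max_walker_score(30, 9)

# Hypothetical tally for three offences judged by ten respondents.
wins = [[0, 8, 9],
        [2, 0, 7],
        [1, 3, 0]]
scores = walker_scores(wins)
percentages = as_percentages(scores, n_respondents=10, k_items=3)
print(study_max, scores, percentages)
```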
The scale values derived from this method are likely to be a power function of the underlying scale itself, because of the operation of Stevens' "psychophysical law" (Stevens, 1957) on the estimation process (Saaty, 1980). This allows the relationship between the eigenvector and magnitude scales to be predicted as:
y1 = a + x^b
where x represents the eigenvector scale, and y the magnitude scale. Converted to linearity, this becomes:
y1 = a + b (log x)
This can be tested by a straightforward linear regression. The relationship between the eigenvector and the magnitude scales was analysed by plotting data on logarithmic/arithmetic coordinates, as shown in Figure 2. This gives an R² of 0.990. A similar logarithmic relationship also holds between the eigenvector mean scales and both the Thurstone and Walker scales, with R² = 0.974 and 0.988 respectively. The strong correlation between the geometric mean and the eigenvector scales where judges have attained Level III consistency has been established elsewhere (W. Foster et al., 1982).
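The regression test can be sketched as an ordinary least-squares fit of y on log x, reporting the coefficient of determination. The base of the logarithm (10) is an assumption, since the text does not specify it, and the data are hypothetical values constructed to lie exactly on a line; they are not the study's scale values.

```python
import math

def fit_semilog(x, y):
    """Least-squares fit of y = a + b * log10(x); returns (a, b, r_squared)."""
    lx = [math.log10(v) for v in x]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(y) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((v - (a + b * u)) ** 2 for u, v in zip(lx, y))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return a, b, 1.0 - ss_res / ss_tot

# Hypothetical scale values lying exactly on y = 2 + 3 * log10(x).
x = [1.0, 10.0, 100.0, 1000.0]
y = [2.0, 5.0, 8.0, 11.0]
a, b, r2 = fit_semilog(x, y)
print(a, b, r2)
```

Changing the base of the logarithm rescales the slope b but leaves the R² unchanged, so the goodness of fit does not depend on this assumption.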
Despite the small size of the sample, the data fit the predicted equation remarkably well. Overall, all five comparative scales produced from the data show a high level of association with the original geometric mean scales. However, since only the geometric mean method provides adequately validated consistency and coherence measures, it would seem more appropriate to use this method as the base scale against which other scales are tested for validity.
The tendency for decision-makers to become inconsistent when using many traditional methods of decision-making, especially when dealing with very small or very large relativities, renders the use of consistency measures particularly important as checks (Wallenstein and Budesch, 1983). The fact that, once they have made their minds up, so many decision-makers are eventually able to attain acceptable consistency within the tabulated decision standards on decisions in their area of competence and experience itself provides some evidence of the reliability of judgement analysis methods (Mitra and Phelps, 1984). Thus, while the alternative scaling methods may provide confirmation of a decision-maker's priority weights where they align with those obtained by the geometric mean method, any significant discrepancy in the comparative results does not provide disconfirmation of the weights obtained by the geometric mean method. In this case, the consistency measures are particularly significant. However, it should be noted that the decision-maker's use of any alternative method to arrive at his/her judgements may itself provide a cue or prompt for mental developments in his/her judgements (as would any systematic re-thinking of an issue), and hence for revisions in his/her priority weights. Consequently, the use of alternative methods of decision-making on significant issues is desirable (as well as providing a check for method-independence of results when these are confirmed by two or more decision methods). However, the 'acceptability' or otherwise of the final results is only determinable by means of a consistency measure.
CONSISTENCY CHECKS
The trace index can similarly be further checked against results obtained from other consistency measures such as the eigenvector index, the geometric index, and a comparison algebra.
The eigenvector index is:
n = 0 if and only if the relevant reciprocal matrix A formed from the decision-makers' judgements is consistent, so that
This index is only applicable to consistent reciprocal matrices. While the index may be heuristically acceptable for such matrices, it remains mathematically questionable.
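As a rough illustration of an eigenvector-based check, the sketch below assumes the index takes the form of Saaty's (1980) consistency index, (λmax − n)/(n − 1), where λmax is the principal eigenvalue of the n × n reciprocal matrix; this form is an assumption and may differ from the index intended here. The principal eigenvalue is estimated by power iteration, and the function names and matrices are illustrative.

```python
def principal_eigen(matrix, iterations=200):
    """Estimate the principal eigenvalue and eigenvector by power iteration."""
    n = len(matrix)
    w = [1.0 / n] * n
    lam = float(n)
    for _ in range(iterations):
        v = [sum(matrix[i][j] * w[j] for j in range(n)) for i in range(n)]
        lam = sum(v)              # w sums to 1, so sum(A w) estimates lambda_max
        w = [x / lam for x in v]  # renormalise so the weights sum to 1
    return lam, w

def consistency_index(matrix):
    """Assumed form of the index: (lambda_max - n) / (n - 1)."""
    n = len(matrix)
    lam, _ = principal_eigen(matrix)
    return (lam - n) / (n - 1)

# A consistent matrix built from the weights 0.6, 0.3, 0.1 (a_ij = w_i / w_j).
consistent = [[1.0, 2.0, 6.0],
              [0.5, 1.0, 3.0],
              [1.0 / 6.0, 1.0 / 3.0, 1.0]]

# A hypothetical inconsistent matrix: a_13 = 8, although a_12 * a_23 = 4.
inconsistent = [[1.0, 2.0, 8.0],
                [0.5, 1.0, 2.0],
                [0.125, 0.5, 1.0]]

ci_consistent = consistency_index(consistent)
ci_inconsistent = consistency_index(inconsistent)
print(ci_consistent, ci_inconsistent)
```

Under this assumed form the index is zero for a consistent matrix and positive otherwise, since the principal eigenvalue of a positive reciprocal matrix is never less than n.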
The geometric index is calculable as follows. As described, a skew-symmetric matrix was obtained from the decision-maker's judgements. This skew-symmetric matrix lies at a standard Euclidean distance (in the vector space of skew-symmetric matrices) from the subspace of matrices which satisfy the condition of consistency. The measure of this distance provides a geometric index of consistency.
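The distance calculation can be sketched as follows, under two assumptions that the text does not spell out: that the skew-symmetric matrix is formed by taking logarithms of the reciprocal judgement matrix, and that distance is measured with the Frobenius (entry-wise Euclidean) norm. The function name and matrices are illustrative.

```python
import math

def geometric_index(matrix):
    """Euclidean distance from the log-judgement matrix to the consistent subspace.

    Consistent skew-symmetric matrices have entries u_i - u_j; the nearest
    one in the least-squares sense uses the row means of the log matrix.
    """
    n = len(matrix)
    # The log of a reciprocal matrix is skew-symmetric: log(1/a) = -log(a).
    logs = [[math.log(matrix[i][j]) for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in logs]
    squared = sum((logs[i][j] - (row_means[i] - row_means[j])) ** 2
                  for i in range(n) for j in range(n))
    return math.sqrt(squared)

# A consistent matrix built from the weights 0.6, 0.3, 0.1 (a_ij = w_i / w_j).
consistent = [[1.0, 2.0, 6.0],
              [0.5, 1.0, 3.0],
              [1.0 / 6.0, 1.0 / 3.0, 1.0]]

# A hypothetical inconsistent matrix: a_13 = 8, although a_12 * a_23 = 4.
inconsistent = [[1.0, 2.0, 8.0],
                [0.5, 1.0, 2.0],
                [0.125, 0.5, 1.0]]

gi_consistent = geometric_index(consistent)
gi_inconsistent = geometric_index(inconsistent)
print(gi_consistent, gi_inconsistent)
```

The exponentials of the row means, once normalised, are the geometric mean priority weights, so under these assumptions the index measures how far the judgements lie from the weights that the geometric mean method assigns.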
However, while the alternative methods of computing consistency may provide confirmation of a decision-maker's consistency where these indices align with the trace index, any significant discrepancy in the comparative results does not provide disconfirmation of consistency as measured by the trace index and checked by the comparison algebra. The 'acceptability' or otherwise of the final consistency is ultimately determinable by the trace index with its tabulated test statistics and confidence levels, as checked by the comparison algebra. This is important in that the degree of any error in measuring consistency exponentially increases as consistency measures are further used as a basis for obtaining more sophisticated measures of 'coherence' and 'comprehensibility'.
The comparison algebra involved is associative, commutative and discriminatory. The items on the judgement scale were tokens which may be converted into fuzzy algebraic 'objects'. From computerized 'fuzzy set' analysis of the judgements made by any decision-maker, the program is able to derive the fuzzy algebra with which the decision-maker is implicitly operating when making his/her comparative decisions. This algebra models the way the decision-maker is combining or concentrating his/her comparisons using the standard scale. The comparison algebra thus obtained is unique to every individual. It provides a check on how far the decision-maker is operating 'multiplicatively' in making his/her judgements, and hence a basis for adjusting the priority weights as necessary to reflect accurately the decision-maker's concatenations of the comparisons.