The results produced from RN may be treated in one of three ways by decision-makers:
In the first case, the decision-makers typically use the results to carry through their own customary problem-solving processes on what has been rendered a formally specified problem, and/or to iterate the RN process to arrive at the trial or the finalised solutions. In the second case, they may make further pragmatic adjustments to the results prior to retesting the effect by iteratively re-running the RN exercise (or parts of it) to arrive at the finalised resolution. However, in the third case, decision-makers may well wish to ensure that the results attain a standard of precision, accuracy and rigour beyond that which is required for a trial solution. In these circumstances, reliability tests may be required.
Various reliability tests may be used at each point of Resources Now. As has been described, several such tests are built in as automatic self-checking procedures with each run of the model - the consistency, coherence and concordance measures with their tabulated test statistics and empirical standards; the 'model fit' measures; the operational priority, operational unit cost and operational demand and supply level measures; and the predictive accuracy measures associated with the simulation modelling elements of the system. However, additional reliability tests may also be used either as a regular feature of applying the approach, or as tests external to the approach. Some of these are now described.
Comparability of items
Any two items are potentially comparable in terms of a common third term (e.g. cost, profit, 'felt' benefit), provided the criteria for the comparison are specified (D. Armstrong, 1978). However, the relative comparability of items in a given list may still be questionable, especially when grouping of budget items is involved.
Heuristic indicators of item non-comparability are usually given by:
- decision-makers' inability or reluctance to rate items in a paired comparison exercise,
- consistently high inconsistency scores over several iterations,
- excessively different weighting of one item of the set,
- significantly variant weightings produced from paired comparison exercises on the whole set of items, a relative subset of the items, and the weightings for a subset of the items as calculated from weightings of the whole set.
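The last indicator can be illustrated with a minimal sketch. It assumes that weights are taken as normalised row geometric means of a reciprocal paired comparison matrix, and it simplifies the check by reusing the whole-set judgements for the subset rather than running a separate paired comparison exercise on the subset. The function names and the sample matrix are hypothetical.

```python
import math

def geometric_mean_weights(matrix):
    """Normalised row geometric means of a positive pairwise comparison matrix."""
    n = len(matrix)
    means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(means)
    return [m / total for m in means]

def subset_discrepancy(matrix, subset):
    """Largest absolute gap between subset weights derived in two ways:
    (a) whole-set weights renormalised over the subset, and
    (b) weights computed from the judgements restricted to the subset.
    """
    whole = geometric_mean_weights(matrix)
    subset_total = sum(whole[i] for i in subset)
    from_whole = [whole[i] / subset_total for i in subset]
    sub_matrix = [[matrix[i][j] for j in subset] for i in subset]
    from_subset = geometric_mean_weights(sub_matrix)
    return max(abs(a - b) for a, b in zip(from_whole, from_subset))

# Hypothetical, deliberately inconsistent judgements on three items.
judgements = [[1.0, 3.0, 1 / 2],
              [1 / 3, 1.0, 4.0],
              [2.0, 1 / 4, 1.0]]
print(subset_discrepancy(judgements, [0, 1]))
```

For a perfectly consistent matrix the two routes agree and the discrepancy is zero, so a large value is a warning sign rather than a proof of non-comparability.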
The relative comparability of items may be more rigorously checked, in terms of relative mutual comprehension of items among a team, relative mutual independence of items, relative mutual exclusiveness of items, relative collective exhaustiveness of items, and relative aggregatability of items. Judgement analysis procedures may be applied to elicit the degree of 'discriminateable difference' between items. This may be further checked if necessary by 'fuzzy set' methods (Wang and Chang, 1980), standard cluster analysis (Everitt, 1974), cladistic analysis, repertory and sociogrid techniques (Kelly, 1955; Thomas, 1970). A further check may be made by cross-matrix correlations of decision-makers' results on different though inter-translatable classifications of the items (e.g. a market-oriented and a product-oriented classification of budget items in the private sector, or a consumer and a service-oriented classification in the public sector) using Kendall and Spearman concordance and factor analysis tests to measure coherence in respect of the two sets of results. These tests also allow a measure of 'comprehensibility' to be applied to priority decisions. Comprehensibility is a significantly high level of cross-matrix correlation of weights, with consistency and coherence maintained.
Three points should be noted about the priority scale as a prelude to examining further checks and refinements which may be used to ensure reliability of results based on the scale.
First, it is possible to realise a given set of weights by a set of decisions. Suppose it is known that the normalized weights of certain items are $w = (w_1, \dots, w_n)$. Then a "consistent" set of decisions $A = (a_{ij})$ can be found, such that the weights obtained are $w' = (w'_1, \dots, w'_n)$, and that $w$ and $w'$ are "close" within a given tolerance.
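This first point can be sketched as follows, under stated assumptions: decisions are restricted to the 9-point scale and its reciprocals, each decision is the scale point nearest to the ratio of the two target weights, and weights are recovered as normalised row geometric means. The helper names, the target weights and the rounding rule are illustrative only, and a matrix built this way is not guaranteed to be consistent in general.

```python
import math

def geometric_mean_weights(matrix):
    """Normalised row geometric means of a positive pairwise comparison matrix."""
    n = len(matrix)
    means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(means)
    return [m / total for m in means]

def snap_to_scale(ratio):
    """Nearest point on the 1..9 scale, or its reciprocal for ratios below 1."""
    if ratio < 1.0:
        return 1.0 / snap_to_scale(1.0 / ratio)
    return float(min(9, max(1, math.floor(ratio + 0.5))))

def decisions_from_weights(weights):
    """Reciprocal matrix whose entries approximate w_i / w_j on the 9-point scale."""
    n = len(weights)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = snap_to_scale(weights[i] / weights[j])
            matrix[j][i] = 1.0 / matrix[i][j]
    return matrix

def max_abs_difference(u, v):
    """Largest absolute componentwise difference between two weight vectors."""
    return max(abs(a - b) for a, b in zip(u, v))

target = [0.55, 0.30, 0.15]            # hypothetical known weights
decisions = decisions_from_weights(target)
recovered = geometric_mean_weights(decisions)
gap = max_abs_difference(target, recovered)
print(decisions, recovered, gap)
```

Whether the gap counts as "close" depends on the tolerance adopted; the sketch only reports it.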
Second, a given set of weights cannot be realised by two qualitatively different sets of decisions. If an agreed standard for a "consistent" set of decisions is stipulated, it is not possible under the judgement analysis method to make two "different" sets of decisions which are both consistent, while obtaining the same weights.
Third, the weights are stable under small perturbations. The individual decision-maker's decisions can be adjusted by a very small amount without significantly changing the weights obtained or the consistency. The 9-point scale allows for this since it has a certain amount of "fuzziness" built into it. Thus a choice of m, subsequently changed to m ± 1, may not change the result significantly. However, the number of changes of ± 1 which can be made to a particular judgement is limited.
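The stability claim can be explored with a small sketch that moves one judgement by one scale point and reports the largest resulting shift in the weights. It assumes geometric mean weights and integer scale values of 1 or more for the judgement being moved; the matrix and function names are hypothetical, and no consistency index is computed here.

```python
import math

def geometric_mean_weights(matrix):
    """Normalised row geometric means of a positive pairwise comparison matrix."""
    n = len(matrix)
    means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(means)
    return [m / total for m in means]

def shift_judgement(matrix, i, j, step):
    """Copy of the matrix with a_ij moved by `step` scale points.

    Assumes a_ij is an integer scale value of 1 or more; the reciprocal
    entry a_ji is updated to match.
    """
    moved = matrix[i][j] + step
    if not 1 <= moved <= 9:
        raise ValueError("judgement would leave the 9-point scale")
    shifted = [row[:] for row in matrix]
    shifted[i][j] = float(moved)
    shifted[j][i] = 1.0 / moved
    return shifted

def largest_weight_shift(matrix, i, j, step):
    """Largest absolute change in any weight after moving a_ij by `step`."""
    before = geometric_mean_weights(matrix)
    after = geometric_mean_weights(shift_judgement(matrix, i, j, step))
    return max(abs(a - b) for a, b in zip(before, after))

decisions = [[1.0, 2.0, 4.0],
             [1 / 2, 1.0, 2.0],
             [1 / 4, 1 / 2, 1.0]]
for step in (+1, -1):
    print(step, largest_weight_shift(decisions, 0, 2, step))
```

The bound check in `shift_judgement` reflects the remark that only a limited number of ± 1 changes can be made to any one judgement.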
The reliability of the (consistent) decision-makers' priority weights was confirmed by 'two stage sampling', i.e. by choosing a random reciprocal matrix, then comparing the weights obtained from this with the weights obtained by sampling from this matrix. However, in addition to the built-in reliability of the weights, decision-makers may want to check reliability for themselves. The methods for doing this, which are now outlined, provide additional confirmation of the results if this is felt to be required, and/or alternative ways by which decision-makers may come to appreciate the reliability of the results.
The implicit rankings and explicit weightings on the priority scale obtained by judgement analysis may be heuristically tested against a post-hoc intuitive ranking and weighting. An alternative procedure is to test them against the priorities produced by using the Budget Constraints procedure in the manner described earlier. Our experiments confirm the tendency of results derived from ranking and rating scales to progressively approximate results derived from paired comparison scales over successive iterations (Scheibe et al., 1975). This post-hoc intuitive check confirms results which are intuitively plausible, typically on budgetary issues where the decision-maker's judgement is already coherent. Where the decision-maker is handling an issue about which he is significantly uncertain, this check is not always applicable since 'counter-intuitive' results may quite appropriately emerge. Even in this case, experienced decision-makers are usually able to recognize the soundness of claims of inference implicit in their starting assumptions, and of implicit consequences of their views (White, 1975).
However, decision-makers' priority weights of items may be further refined, checked and tested, by various systematic techniques used alone or in combination. One systematic refinement and check is provided by the Priority-Criteria matrix, which is undertaken once a criterion scale has been elicited. The items are judged on each criterion in turn as if it were the sole criterion. A set of consistent criterion-by-criterion priority scales is thereby produced. The results are combined, using the geometric mean with concordance tests. This provides the basis for modifying item weights in the light of the criteria analysis. Further, by comparing the weights obtained from direct judgement of the items ("all things considered") with those obtained from judgements made on a criterion-by-criterion basis, a measure of coherence is derived by applying the standard Kendall and Spearman concordance and other tests (e.g. of Euclidean distance) to the two sets of results. Coherence involves three-fold consistency in option priorities, criteria weights, and in the application of criteria to options.
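A minimal sketch of this comparison is given below. It assumes that the criterion-by-criterion scales are combined by a geometric mean weighted by the criterion weights (with equal criterion weights this reduces to the plain geometric mean), and it uses Spearman's rank correlation without ties and Euclidean distance as the two comparison measures. All numbers and function names are hypothetical.

```python
import math

def combine_by_geometric_mean(scales_by_criterion, criterion_weights):
    """Weighted geometric mean of item weights across criteria, renormalised.

    scales_by_criterion[c][i] is the weight of item i judged on criterion c alone.
    """
    n_items = len(scales_by_criterion[0])
    combined = []
    for i in range(n_items):
        log_sum = sum(cw * math.log(scale[i])
                      for cw, scale in zip(criterion_weights, scales_by_criterion))
        combined.append(math.exp(log_sum))
    total = sum(combined)
    return [c / total for c in combined]

def ranks(values):
    """Rank 1 for the largest value; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(u, v):
    """Spearman rank correlation for untied data."""
    n = len(u)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(u), ranks(v)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def euclidean_distance(u, v):
    """Straight-line distance between two weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

direct = [0.45, 0.35, 0.20]                  # "all things considered" weights
by_criterion = [[0.5, 0.3, 0.2],             # items judged on criterion 1 alone
                [0.2, 0.5, 0.3]]             # items judged on criterion 2 alone
criterion_weights = [0.7, 0.3]
derived = combine_by_geometric_mean(by_criterion, criterion_weights)
print(derived, spearman(direct, derived), euclidean_distance(direct, derived))
```

A high rank correlation together with a small distance would count towards coherence; the thresholds to apply are not fixed by this sketch.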
This method may be used in judgement analysis as a means of meeting conditions demanded by decision analysts (Schoemaker, 1981) and some Delphi analysts (Jillson, 1975) by treating 'utility' and 'probability' or 'desirability' and 'feasibility', as two key organising criteria, and obtaining item weightings accordingly. However, it should be noted that these two criteria are rarely exhaustive of the decision-makers' realm of values and concerns (Algie, 1975).
Another refinement and check involves eliciting the individual decision-maker's stimulus scale in respect of the items, using discriminant analysis (Klecka, 1980). An adequate indication of the basic dimensions of items judged, for example the relative cutback discontinuities of budget items in the light of a potential overall cut, is obtained in terms of a discrimination scale (e.g. Table 3 in which the dimensions are 'preservability' or 'cuttability').
TABLE 3: A SAMPLE DISCRIMINATION SCALE
(For preservability-cuttability of budget items)
The discrimination scale is obtained by the same procedure used to check relative comparability of items. It is further checked by a 'fuzzy set test' by which membership functions of scalar descriptions are obtained (Wang and Chang, 1980; Hersh and Caramazza, 1976). The basic dimensions for each item are judged in respect of each point of the discrimination scale. If necessary, the weights and allocations obtained by the geometric mean and quadratic programming techniques are then adjusted to ensure that the basic dimensions of items as measured on the discrimination scale are preserved. However, it should be noted that the discrimination scale is only a delineation of one particular criterion in the criteria analysis described above (e.g. 'preservability' where that is the basic reference point of the stimulus scale).
A further refinement is to elicit and use the individual decision-maker's response scale. This is obtained by means of a magnitude estimation exercise (Stevens, 1966; Lodge, 1980). The results are used to calculate the individual decision-maker's implicit (self-anchoring) response ratio scale. This is then checked against the response ratio scale implicit in his use of the standard judgement scale in the judgement analysis procedure. Where variations occur, the judgement analysis weightings are corrected so as to preserve the magnitudes implicit in the decision-maker's self-anchoring response scale as derived from magnitude estimation, the relativities implicit in the same decision-maker's weightings as derived from judgement analysis, and the level of consistency attained as measured by the trace index. Again, it should be noted that the response scale derived by magnitude estimation may sometimes merely articulate one criterion already dealt with in the criteria analysis.
Although magnitude estimation provides an adequate means of obtaining a decision-maker's self-anchoring scale, it cannot reliably be used alone. Unless this self-anchoring scale is allied with some validated measure of the decision-maker's consistency, there is no guarantee that the decision-maker is behaving logically and rationally. There is no comparison of decision-makers' consistency by judgements on the same issue, and no common scale onto which the self-anchoring judgements of different decision-makers can be mapped. Therefore magnitude estimation needs to be combined with judgement analysis to obtain results which are justifiably usable.
On all tests to date, the results from magnitude estimation exercises have aligned significantly with the results from judgement analysis exercises undertaken by the same decision-maker on the same issues when the decision-makers are consistent at Level 3 (though they frequently do not when decision-makers are inconsistent). It is therefore clear that the magnitude estimation method can be eliminated in favour of judgement analysis if decision-makers only have the time and energy to do one or the other, especially since judgement analysis takes less time than magnitude estimation (Lodge, 1981). The reverse cannot hold, given that no measures of consistency are available for magnitude estimation.
The stimulus ratio scale from the fuzzy set technique may be used to calculate the judge's implicit response ratio scale as a variational problem in fuzzy set theory. This in turn is checked against the judge's response ratio scale derived from magnitude estimation. If necessary, further adjustment may then be made to the item weightings to ensure the 'best fit' representations of the decision-maker's stimulus scale and his self-anchoring response ratio scale, as constrained by the measured consistency and coherence tests associated with the geometric weights. (On the relationship between discrimination and magnitude scales, see Eisler, 1962.)
The stimulus and response scales thus derived provide a context within which a decision-maker's priority scale may be located, hence a basis for refining and checking the geometric weights on his/her priority scale. They also provide a means of checking how far:
- the individual decision-maker's self-anchoring scales align with his/her use of the standard judgement scale,
- the geometric weights produced from his/her judgement analysis are scale independent,
- the geometric weights reflect the ideas of relative magnitude generated in the decision-maker's mind by the comparisons he/she is asked to make,
- the decision-maker behaved multiplicatively when undertaking judgement analysis.
Any modifications necessary may be calculated in the light of these checks. Some experiments suggest that there tends to be little significant variance in results whether self-anchoring or prescribed standard scales are used (Lefcowitz and Walton, 1973). Note that the scale values derived from use of the standard judgement scale are a power function of the underlying scale itself because of the operation of Stevens' "psychophysical law" (Stevens, 1957) on the estimation process (Saaty, 1980).
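The power-function relationship can be examined with a simple log-log regression. The sketch below fits y = k * x**b by ordinary least squares on logarithms, where x stands for the underlying magnitudes and y for the scale values obtained; this pairing, the function name and the data are assumptions made for illustration, and the data are synthetic rather than experimental.

```python
import math

def fit_power_function(x, y):
    """Least-squares fit of y = k * x**b on log-log axes; returns (k, b)."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((a - mean_x) ** 2 for a in log_x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    exponent = sxy / sxx
    constant = math.exp(mean_y - exponent * mean_x)
    return constant, exponent

# Synthetic data generated from y = 2 * x**0.5, for illustration only.
underlying = [1.0, 2.0, 4.0, 8.0]
observed = [2.0 * v ** 0.5 for v in underlying]
k, b = fit_power_function(underlying, observed)
print(k, b)
```

An exponent near 1 would indicate that the two scales are proportional; any other stable exponent indicates a power-law distortion that could be corrected for.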
Tests for method independence of the results are conducted by checking the judgement analysis weights against those obtained from any two other methods taken jointly, e.g. decision analysis (Schoemaker, 1980), outranking (Ray, 1980), social judgement theory (Hammond, 1976), eigenvalue analysis (Saaty, 1980), Walker scaling (Walker, 1978), Thurstone scaling (Torgerson, 1958), etc. Since the measured consistency, coherence and scale independence checks available in judgement analysis are not available with these other methods, the check requires that correlative results are attained jointly from two other methods (after allowance is made for variance produced from different methodological assumptions).
In general, the 'best fit' model may be calculated for any two sets of priority weights produced, as constrained by measured consistency, coherence and concordance. Decision-makers may iterate the procedures. Variations in weightings over successive iterations provide a measure of the decision-maker's uncertainty about the issue (as compared with his ratings on a 9-point confidence scale of the same basic form as the standard judgement scale and amenable to the same reliability tests).
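One simple way to quantify variation over iterations is the per-item range of the weights. The text does not specify the statistic to be used, so the range and its mean across items are assumptions made here for illustration, as are the function names and the sample weights.

```python
def weight_ranges(iterations):
    """Per-item range (largest minus smallest weight) across iterations.

    iterations[t][i] is the weight of item i at iteration t.
    """
    return [max(column) - min(column) for column in zip(*iterations)]

def mean_range(iterations):
    """Average of the per-item ranges, as a single summary figure."""
    ranges = weight_ranges(iterations)
    return sum(ranges) / len(ranges)

# Hypothetical weights for three items over three successive iterations.
history = [[0.50, 0.30, 0.20],
           [0.52, 0.29, 0.19],
           [0.51, 0.30, 0.19]]
print(weight_ranges(history), mean_range(history))
```

Small ranges suggest settled priorities; wide ranges on particular items point to where the decision-maker's uncertainty is concentrated.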
Variations in measured consistency, coherence and comprehensibility over successive iterations provide a measure of the decision-maker's judgement capability in respect of the issue on hand, provided his assessments of items on the relevant discrimination scale remain constant. Experiments to date have indicated significant correlation with measures of capability obtained from other comparative tests (Isaac and O'Connor, 1969; Stamp, 1983). It is relevant to note that mentally handicapped people invariably fail to achieve consistency when using judgement analysis, and indeed, tend to be consistently inconsistent in a quite random way.
The calculated team ranking and weightings may be tested against post-hoc intuitively negotiated rankings and weightings. The team weights may be further tested by using factor analysis to cluster the individual judgement matrices (Kim and Mueller, 1979), and against results obtained using game theoretic methods (Luce and Raiffa, 1957; Algie and Hall, 1973).