The Standards for Educational and Psychological Testing (1999; hereafter called the Standards), established by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education, are intended to provide a comprehensive basis for evaluating tests. These new standards reflect several important changes from earlier versions, especially in the area of validity. This paper summarizes several key standards applicable to most test evaluation situations.
There must be a clear statement of recommended uses, the theoretical model or rationale for the content, and a description of the population for which the test is intended.
The principal question to ask when evaluating a test is whether it is appropriate for the intended purposes. The use intended by the test developer must be justified by the publisher on technical or theoretical grounds. Questions to ask:
The sample used to collect evidence of the validity of score interpretation and norming must be of adequate size and must be sufficiently representative to substantiate validity claims, to establish appropriate norms, and to support conclusions regarding the intended use of the scores.
The individuals in the samples used for collecting validity evidence and for norming should represent the population of potential examinees in terms of age, gender, ethnicity, psychopathology, or other dimensions relevant to score interpretation. Questions to ask:
Test scores should yield valid and reliable interpretations. All sources of validity evidence should support the intended interpretations and uses of the test scores.
The current Standards suggest there is only one dimension of validity: construct validity. A variety of methods may be used to support validity arguments related to the intended use and interpretation of test scores. Such evidence may be obtained by systematically examining the content of the test, by considering how the test scores relate to other measures, or by examining whether the relationships between test scores and other variables are consistent with the theoretical predictions on which test development was based. One element of invalidity is the extent to which scores reflect systematic, rather than random, error. Systematic error contributes to invalid score interpretations, whereas random error reflects unreliability, which also undermines valid interpretation. Thus, a test may produce scores that are reliable (little or no random error) yet invalid because of some systematic error, and scores that are unreliable may not be validly interpreted.
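To make the distinction concrete, the brief Python sketch below simulates scores that carry a constant systematic error plus a small random error; the scenario, sample size, and error magnitudes are invented solely for illustration. The test-retest correlation comes out high (the scores are reliable), yet every score overestimates the construct, so the intended interpretation would be invalid.

    import numpy as np

    rng = np.random.default_rng(0)

    n_examinees = 500
    true_scores = rng.normal(loc=50, scale=10, size=n_examinees)  # latent construct

    systematic_error = 8.0   # constant bias, e.g., a miskeyed scale (assumed value)
    random_error_sd = 3.0    # unreliability from content sampling, lapses, etc.

    # Two administrations of the same biased test
    form_1 = true_scores + systematic_error + rng.normal(0, random_error_sd, n_examinees)
    form_2 = true_scores + systematic_error + rng.normal(0, random_error_sd, n_examinees)

    reliability = np.corrcoef(form_1, form_2)[0, 1]   # high: little random error
    bias = form_1.mean() - true_scores.mean()         # large: systematic error remains

    print(f"test-retest correlation ~ {reliability:.2f}")
    print(f"average overestimate    ~ {bias:.1f} points")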
Evidence that the content of the test is consistent with the intended interpretation of test scores (e.g., a test to be used to assist in job selection should have content that is clearly job-related) may be collected in a variety of ways. One method often used to examine evidence of score validity for achievement tests is to have content experts compare the test content to the test specifications. Content consistency may also be demonstrated by analyses such as factor analysis (exploratory, confirmatory, or both).
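As a hedged illustration of the factor-analytic approach, the sketch below fits a two-factor exploratory model to simulated item responses using scikit-learn; the items, the number of factors, and the loading pattern are assumptions made for the example, not prescriptions from the Standards. In practice, the recovered loadings would be compared with the test specifications.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(1)

    # Hypothetical responses: 300 examinees, 6 items intended to tap 2 content areas
    latent = rng.normal(size=(300, 2))
    loadings = np.array([[0.8, 0.0], [0.7, 0.1], [0.9, 0.0],   # items 1-3: area A
                         [0.0, 0.8], [0.1, 0.7], [0.0, 0.9]])  # items 4-6: area B
    items = latent @ loadings.T + rng.normal(scale=0.4, size=(300, 6))

    fa = FactorAnalysis(n_components=2, rotation="varimax")
    fa.fit(items)

    # Each row is an item; large loadings should line up with the test specifications
    print(np.round(fa.components_.T, 2))

Questions to ask include: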
Some tests are designed to assist in making decisions about an examinee's future performance. Validity evidence for such tests might include the content-related information described above and, in addition, statistical information regarding the correlation (or other similar statistical relationship) between the test used as a predictor and some relevant criterion variable.
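A minimal sketch of such predictor-criterion evidence, assuming a hypothetical selection test and a later supervisor-rated performance criterion (both simulated here), might look like the following.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # Hypothetical data: selection test scores and later job performance ratings
    test_scores = rng.normal(100, 15, size=200)
    job_performance = 0.4 * (test_scores - 100) / 15 + rng.normal(0, 1, size=200)

    r, p_value = stats.pearsonr(test_scores, job_performance)
    print(f"predictive validity coefficient r = {r:.2f} (p = {p_value:.3f})")

Questions to ask include: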
There are many other forms of evidence that may be provided to assist the test user in making a decision about the appropriateness of the test for the user's purposes. A few such forms of evidence are noted here and additional questions proposed. A multitrait-multimethod matrix, which examines the relationships between scores from the test being reviewed and scores from other tests that are both similar in intent and dissimilar in measurement strategy, may provide evidence of the nature of score interpretations. Similarly, a test that is designed to assess a particular construct (e.g., creativity) may be shown to produce scores that correlate with scores from measures of other related constructs (e.g., artistic ability). Other strategies may include experimental studies that aid in theoretical confirmation of the use of the scores. For example, a test designed to aid in diagnosing a psychological problem may provide evidence of validity by showing that individuals known to have the targeted pathology obtain predictably higher (or lower) scores than individuals known not to have it. Many other methods for providing validity evidence are described in the Standards.
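As one hedged illustration of the known-groups strategy just described, the sketch below compares simulated scale scores for a hypothetical clinical group and a hypothetical comparison group; the data and the resulting effect size are fabricated purely to show the form of the analysis.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)

    # Hypothetical diagnostic scale scores for two known groups
    clinical_group = rng.normal(65, 10, size=80)     # examinees with the target pathology
    comparison_group = rng.normal(50, 10, size=80)   # examinees without it

    t_stat, p_value = stats.ttest_ind(clinical_group, comparison_group)
    pooled_sd = np.sqrt((clinical_group.var(ddof=1) + comparison_group.var(ddof=1)) / 2)
    cohens_d = (clinical_group.mean() - comparison_group.mean()) / pooled_sd

    print(f"t = {t_stat:.2f}, p = {p_value:.3g}, d = {cohens_d:.2f}")

Questions to ask include: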
The test scores are sufficiently reliable to permit stable estimates of the construct being measured.
Fundamental to the evaluation of any instrument is the degree to which test scores are free from (random) measurement error and are consistent from one occasion to another. Sources of measurement error fall into three broad categories: error related to factors intrinsic to the test, error related to factors intrinsic to the examinee, and error related to factors extrinsic to both the test and the examinee. Test factors include such things as unclear instructions, ambiguous questions, and insufficient questions to cover the domain of the construct of interest. Factors intrinsic to the individual may include the examinee's health at the time of testing, fatigue, nervousness, and willingness to take risks. Extrinsic factors may include such things as extraneous noise or other distractions and misentering a response choice onto an answer sheet. These illustrations are not intended to represent an exhaustive list of factors that influence the reliability of examinee scores on a test.
Different types of reliability estimates should be used to estimate the contribution of different sources of measurement error. Tests whose scores are obtained by using raters must provide evidence that raters interpret the scoring guide in essentially the same way. Tests that may be administered on multiple occasions to examine change must provide evidence of the stability of scores over time. Tests that have more than one form that can be used to make the same decision must demonstrate the comparability (content and statistical) of the multiple forms. Almost all measures, except those that are speeded, should provide an estimate of internal consistency reliability as an estimate of the error associated with content sampling.
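For the internal consistency case, a common estimate is coefficient alpha. The sketch below computes it from a simulated persons-by-items score matrix; the formula is standard, but the data and the number of items are invented for illustration.

    import numpy as np

    def cronbach_alpha(item_scores: np.ndarray) -> float:
        """Coefficient alpha for a persons x items matrix of scores."""
        k = item_scores.shape[1]
        item_variances = item_scores.var(axis=0, ddof=1).sum()
        total_variance = item_scores.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances / total_variance)

    rng = np.random.default_rng(4)
    ability = rng.normal(size=(250, 1))
    items = ability + rng.normal(scale=0.8, size=(250, 10))   # 10 hypothetical items

    print(f"coefficient alpha ~ {cronbach_alpha(items):.2f}")

Questions to ask: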
Detailed and clear instructions outline appropriate test administration procedures.
Statements concerning test validity and the accuracy of the norms can only generalize to testing situations that replicate the conditions used to establish validity and obtain normative data. Test administrators need detailed and clear instructions to replicate these conditions.
All test administration specifications, including instructions to test takers, time limits, use of reference materials and calculators, lighting, equipment, seating, monitoring, room requirements, testing sequence, and time of day, should be fully described. Questions to ask:
The methods used to report test results, including scaled scores, subtest results, and combined test results, are described fully along with the rationale for each method.
Test results should be presented in a manner that will help users (e.g., teachers, clinicians, and employers) make decisions that are consistent with appropriate uses of test results. Questions to ask:
The test is not biased or offensive with regard to race, sex, native language, ethnic origin, geographic region, or other factors.
Test developers are expected to exhibit sensitivity to the demographic characteristics of test takers. Steps can be taken during test development, validation, standardization, and documentation to minimize the influence of cultural dependency, to use statistics to identify differential item difficulty, and to examine the comparative accuracy of predictions for different groups. Some traits are manifested differently by different cultural groups. Explicit statements of when it is inappropriate to use a test for particular groups must be provided.
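As a hedged sketch of using statistics to identify differential item difficulty, the code below screens one simulated dichotomous item for differential item functioning with a Mantel-Haenszel common odds ratio computed across total-score strata; the groups, the item, and the amount of bias are all fabricated for the example, and operational analyses would typically use dedicated DIF procedures.

    import numpy as np

    rng = np.random.default_rng(5)

    # Simulated data: 1000 examinees, a matching total score, and one studied item (0/1)
    group = rng.integers(0, 2, size=1000)            # 0 = reference, 1 = focal (hypothetical)
    ability = rng.normal(size=1000)
    total_score = np.clip((ability * 6 + 20).round(), 0, 40)   # stand-in matching variable
    # The studied item is made slightly harder for the focal group at equal ability
    p_correct = 1 / (1 + np.exp(-(ability - 0.5 * group)))
    item = rng.binomial(1, p_correct)

    # Mantel-Haenszel common odds ratio across total-score strata
    num = den = 0.0
    for s in np.unique(total_score):
        in_s = total_score == s
        a = np.sum(in_s & (group == 0) & (item == 1))   # reference correct
        b = np.sum(in_s & (group == 0) & (item == 0))   # reference incorrect
        c = np.sum(in_s & (group == 1) & (item == 1))   # focal correct
        d = np.sum(in_s & (group == 1) & (item == 0))   # focal incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n

    print(f"MH common odds ratio ~ {num / den:.2f} (values above 1.0 favor the reference group)")

Questions to ask: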