What Makes an Assessment Valid and Reliable?

A valid assessment measures what it claims to measure. That sounds simple, but validity isn’t a single checkbox. It’s a body of evidence built from multiple angles, each one addressing a different question: Does the content match the skill being tested? Do the scores predict what they should? Are the results fair across different groups of people? The more evidence you gather across these dimensions, the stronger your case that the assessment is actually doing its job.

The Five Sources of Validity Evidence

The most widely accepted framework for validity comes from the Standards for Educational and Psychological Testing, published jointly by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education. The Standards identify five sources of validity evidence, each addressing a different aspect of whether an assessment works as intended.

Content evidence asks whether the test items adequately represent the subject or skill being measured. A final exam in biology that only covers two out of twelve chapters has weak content evidence. This is typically evaluated by having subject-matter experts review each item and judge whether it’s relevant and representative of the full domain.

Response process evidence looks at what’s actually happening when someone takes the test. Are test-takers using the reasoning skills the assessment is designed to measure, or are they relying on shortcuts like eliminating obviously wrong answers? This often involves think-aloud studies where people describe their thought process while completing items.

Internal structure evidence examines whether the parts of the test relate to each other in ways that match the intended design. If an assessment claims to measure three distinct skills, statistical analysis should confirm that the items cluster into those three groups rather than blending together into one undifferentiated score.
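
One simple way to probe internal structure, sketched below with made-up data, is to check that each item correlates more strongly with the total of its own intended subscale (with the item itself removed) than with the totals of the other subscales. Full analyses typically rely on factor analysis, but the item-level check is easy to illustrate. The response data, item assignments, and loadings here are all hypothetical.

```python
import numpy as np

# Hypothetical data: 200 test-takers, 6 items. Items 0-2 are meant to measure
# skill A, items 3-5 skill B. Real item responses would replace this simulation.
rng = np.random.default_rng(0)
traits = rng.normal(size=(200, 2))                      # two latent skills
loadings = np.array([[1, 1, 1, 0, 0, 0],
                     [0, 0, 0, 1, 1, 1]], dtype=float)  # which skill each item taps
responses = traits @ loadings + rng.normal(scale=0.8, size=(200, 6))

subscales = {"A": [0, 1, 2], "B": [3, 4, 5]}
for name, items in subscales.items():
    for item in items:
        own = [i for i in items if i != item]           # own subscale, item removed
        other = [i for s, idx in subscales.items() if s != name for i in idx]
        r_own = np.corrcoef(responses[:, item], responses[:, own].sum(axis=1))[0, 1]
        r_other = np.corrcoef(responses[:, item], responses[:, other].sum(axis=1))[0, 1]
        print(f"item {item}: r with own subscale {r_own:.2f}, with other subscale {r_other:.2f}")
```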

Evidence based on relationships to other variables checks whether scores correlate with outside measures the way you’d expect. A new anxiety questionnaire should correlate strongly with established anxiety measures (convergent evidence) and weakly with unrelated traits like mathematical ability (discriminant evidence). For correlations with other measures, values of 0.5 or above are generally considered strong, 0.3 to 0.5 moderate, and below 0.3 weak.
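
A minimal sketch of that check in Python, using made-up score vectors: the new questionnaire’s scores are correlated with an established anxiety measure (convergent evidence) and with a math test (discriminant evidence).

```python
import numpy as np

# Hypothetical scores for the same 8 respondents on three measures.
new_anxiety = np.array([12, 18, 25, 9, 30, 22, 15, 27])
established_anxiety = np.array([14, 20, 28, 10, 33, 21, 13, 29])  # should correlate strongly
math_ability = np.array([71, 55, 62, 80, 58, 66, 74, 60])         # should correlate weakly

convergent = np.corrcoef(new_anxiety, established_anxiety)[0, 1]
discriminant = np.corrcoef(new_anxiety, math_ability)[0, 1]
print(f"convergent r = {convergent:.2f}, discriminant r = {discriminant:.2f}")
```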

Consequential evidence considers the real-world impact of using test scores. Are decisions based on the assessment producing the intended outcomes? Are there unintended negative effects on certain groups? This is the most debated source of evidence, but it reflects the idea that validity isn’t just about measurement precision. It’s also about whether the assessment serves its stated purpose without causing harm.

How Content Validity Gets Measured

Content validity often relies on structured expert review rather than purely statistical methods. One common approach uses the Content Validity Ratio (CVR), developed by C. H. Lawshe, where a panel of experts independently rates each item as essential, useful but not essential, or not necessary. The ratio is computed from the number of experts who rate an item essential relative to half the panel size, yielding a value from -1 (no one rates it essential) to +1 (everyone does).

The threshold for an acceptable ratio depends on the size of the expert panel. With a panel of 8 experts, Lawshe’s critical value is a CVR of 0.75, which requires 7 of the 8 to rate the item essential; with 7 or fewer experts, essentially unanimous agreement is needed. As the panel grows larger, the threshold drops because statistical confidence increases with more raters: a panel of 20 experts needs a CVR of roughly 0.42, and a panel of 40 needs about 0.29. Items that fall below these thresholds get revised or removed.
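
A minimal sketch of the calculation, assuming the usual form of Lawshe’s formula, CVR = (n_e - N/2) / (N/2), where n_e is the number of panelists rating the item essential and N is the panel size. The item names, ratings, and cutoff below are illustrative.

```python
def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """Lawshe's CVR: -1 if no one rates the item essential, +1 if everyone does."""
    half = n_panelists / 2
    return (n_essential - half) / half

# Hypothetical panel of 8 experts rating three items as "essential" (True) or not.
ratings = {
    "item_1": [True, True, True, True, True, True, True, False],
    "item_2": [True, True, True, False, False, True, True, False],
    "item_3": [True, True, True, True, True, True, True, True],
}

CUTOFF = 0.75  # Lawshe's critical value for an 8-person panel
for item, votes in ratings.items():
    cvr = content_validity_ratio(sum(votes), len(votes))
    verdict = "keep" if cvr >= CUTOFF else "revise or drop"
    print(f"{item}: CVR = {cvr:+.2f} -> {verdict}")
```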

Construct Validity Ties Everything Together

Psychologist Samuel Messick argued that validity is fundamentally a unified concept, not a collection of separate types. In his framework, all validity evidence is really evidence about construct validity: whether the assessment captures the underlying trait or ability it’s designed to measure. He broke this into six facets that overlap with the five sources above but add useful nuance.

The substantive facet asks whether there’s a solid theoretical reason to expect that certain tasks would reveal the trait being measured. The structural facet checks whether the scoring system matches the complexity of what’s being assessed. If creativity is multidimensional, a single score won’t capture it well. The generalizability facet examines whether scores hold up across different populations, settings, and versions of the test. An assessment validated on college students in one country may not work the same way for working professionals in another.

The external facet is where convergent and discriminant evidence live. If your test measures what it claims, scores should line up with other good measures of the same thing and diverge from measures of different things. The consequential facet brings in questions of bias, fairness, and whether the assessment’s use in decision-making produces equitable outcomes.

Why Reliability Is Necessary but Not Enough

Reliability and validity are deeply connected but not interchangeable. Reliability means the assessment produces consistent results. Validity means those results are accurate. You can have one without the other, but only in one direction: a test can be perfectly reliable while measuring the wrong thing. A bathroom scale that always reads 10 pounds too heavy is reliable (consistent) but not valid (not accurate).

The reverse, however, doesn’t work. If results bounce around randomly each time someone takes the test, you can’t trust those results to reflect anything meaningful. There’s even a mathematical relationship: the maximum possible validity of a test is approximately the square root of its reliability coefficient. So if a test’s reliability is 0.79, its validity can’t exceed about 0.89 no matter how well designed the content is. This means improving reliability directly raises the ceiling on how valid an assessment can be.
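
The ceiling comes from the attenuation relationship in classical test theory: an observed validity correlation cannot exceed the square root of the test’s reliability. The arithmetic for the example above:

```python
import math

reliability = 0.79
print(f"validity ceiling = {math.sqrt(reliability):.2f}")  # about 0.89
```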

Validity in Clinical and Diagnostic Testing

In medical and clinical settings, validity takes on a more concrete form through sensitivity and specificity. These measure how well a diagnostic test identifies people who have a condition and correctly rules out people who don’t.

Sensitivity is the percentage of people with the condition who test positive. A highly sensitive test catches nearly everyone who’s affected, meaning very few cases slip through. Specificity is the percentage of healthy people who correctly test negative. A highly specific test rarely flags someone who doesn’t have the condition.

In practice, there’s usually a tradeoff. Making a test more sensitive (catching more true cases) tends to increase false positives, lowering specificity. The right balance depends on what the test is being used for. Screening tests for serious diseases prioritize sensitivity because missing a case is more dangerous than a false alarm. Confirmatory tests prioritize specificity because you want to be sure before starting treatment.

Positive predictive value adds another layer. It tells you what percentage of people who test positive actually have the condition. This depends not just on the test’s accuracy but on how common the condition is in the population being tested. A test with excellent sensitivity and specificity can still have a low positive predictive value if the condition is rare, because even a small false-positive rate generates many false alarms when most people being tested are healthy.
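
A minimal sketch of that interaction, using Bayes’ rule with made-up test characteristics: a test that is 95% sensitive and 95% specific looks very different depending on how common the condition is in the tested population.

```python
def positive_predictive_value(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Share of positive results that are true positives, via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical screening test: 95% sensitive, 95% specific.
for prevalence in (0.20, 0.05, 0.001):
    ppv = positive_predictive_value(0.95, 0.95, prevalence)
    print(f"prevalence {prevalence:>6.1%}: PPV = {ppv:.1%}")
```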

Fairness as a Validity Concern

An assessment that works differently for different demographic groups has a validity problem. Differential Item Functioning analysis identifies test items where people of equal ability but different backgrounds (gender, ethnicity, language) have unequal chances of answering correctly. When this happens, the item may be measuring something other than the intended skill, like cultural familiarity or language fluency.

Modern approaches to detecting bias go beyond simple statistical significance. Researchers now focus on the magnitude of the difference using effect sizes, because with large enough samples, even trivially small differences become statistically significant. Items flagged for meaningful bias get reviewed by content experts to determine whether the difference reflects a genuine flaw in the item or a legitimate difference in the skill being measured.
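
One widely used DIF statistic is the Mantel-Haenszel common odds ratio, computed across ability strata and often converted to the ETS delta scale (delta = -2.35 × ln of the odds ratio), where larger absolute values indicate larger DIF. The counts below are hypothetical, and a real analysis would add significance tests and the standard effect-size classification rules.

```python
import math

# Hypothetical counts per ability stratum (test-takers matched on total score):
# (ref_correct, ref_incorrect, focal_correct, focal_incorrect)
strata = [
    (40, 10, 30, 20),   # low scorers
    (60, 10, 45, 25),   # middle scorers
    (80,  5, 70, 15),   # high scorers
]

num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
alpha_mh = num / den                    # common odds ratio; 1.0 means no DIF
delta_mh = -2.35 * math.log(alpha_mh)   # ETS delta scale; |delta| >= 1.5 is usually called large
print(f"alpha_MH = {alpha_mh:.2f}, delta_MH = {delta_mh:+.2f}")
```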

Face Validity and Its Limits

Face validity is whether an assessment looks like it measures what it claims, based on a surface-level judgment. It’s the simplest form of validity and the weakest. A math test full of word problems about baseball might appear irrelevant to someone who doesn’t follow sports, even if it measures math skills perfectly well.

Despite its limitations, face validity matters for practical reasons. If the people taking an assessment don’t see it as relevant or sensible, they’re less likely to engage with it seriously. This is especially important in healthcare, where patient-reported outcome measures need to feel meaningful to the people filling them out. Many widely used instruments were developed primarily by researchers and clinicians, with limited input from the patients who actually complete them. Including end users in the development process improves both face validity and content validity, catching blind spots that experts alone might miss.

Real-World Generalizability

Ecological validity addresses whether assessment results translate to real-life performance. A memory test administered in a quiet lab might not predict how well someone remembers things in a noisy, distracting home environment. This is particularly relevant in neuropsychological testing, where clinicians want to know how a patient will function in daily life, not just in controlled conditions.

Unlike other forms of validity evidence, ecological validity is largely a judgment call rather than a computed statistic. You can calculate a correlation between test scores and real-world performance measures, but the broader question of whether results generalize to naturalistic settings depends on how closely the assessment conditions mirror the situations where the skills actually matter. The closer the match, the more confident you can be that scores mean something outside the testing room.