What Is Validity in Psychology? Types & Examples

Validity in psychology is the extent to which a measurement actually captures what it claims to measure. A depression questionnaire, for example, is valid only if its scores genuinely reflect how depressed someone is, not just how tired or stressed they feel. This sounds straightforward, but establishing validity is one of the trickiest challenges in psychological research and testing, because the things psychologists measure (intelligence, personality, anxiety) can’t be observed directly the way height or weight can.

Validity is often discussed alongside reliability, and the two are easy to confuse. Reliability means a measure gives consistent results. Validity means those results are meaningful. A bathroom scale that always reads ten pounds too heavy is perfectly reliable but not valid. In psychology, a test can be extremely reliable yet have no validity whatsoever. Reliability is necessary for validity, but it isn’t sufficient on its own.
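The biased bathroom scale can be sketched in a few lines of Python. Everything here is illustrative (the bias, the noise level, the true weight are all made up), but it shows how a measure can be highly consistent while systematically wrong:

```python
import random

random.seed(0)  # fixed seed so the example is reproducible

def biased_scale(true_weight):
    """A scale that always reads 10 lb heavy, plus a tiny bit of noise."""
    return true_weight + 10 + random.gauss(0, 0.1)

true_weight = 150.0
readings = [biased_scale(true_weight) for _ in range(5)]

# Reliability: the readings agree closely with one another.
spread = max(readings) - min(readings)

# Validity: every reading is far from the true value.
avg_error = sum(r - true_weight for r in readings) / len(readings)

print(f"spread across readings: {spread:.2f} lb")  # small -> reliable
print(f"average error: {avg_error:.2f} lb")        # large -> not valid
```

The spread is a fraction of a pound (consistent, hence reliable), while the average error sits near ten pounds (systematically off, hence not valid).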

Construct Validity

Most things psychologists measure are constructs: abstract ideas like self-esteem, anxiety, or working memory that can’t be directly seen or touched. Construct validity asks whether a test truly reflects the construct it’s designed to measure. It’s the broadest and most fundamental type of validity, and it’s evaluated through two main forms of evidence.

Convergent validity checks whether a new measure lines up with existing measures of the same or similar constructs. If you develop a new self-esteem questionnaire, its scores should correlate strongly with scores on established self-esteem scales like the Rosenberg Self-Esteem Scale. If they don’t, something is off.

Discriminant validity works in the opposite direction. It checks that the test does not correlate with measures of unrelated constructs. A self-esteem scale might reasonably have some relationship to body image, but the correlation should be smaller than with other self-esteem measures, because the two are related but distinct concepts. If your self-esteem test correlates just as strongly with an anxiety scale as it does with other self-esteem scales, it may be measuring something broader than, or different from, what was intended.

Content and Face Validity

Content validity asks whether a test’s items adequately and appropriately cover the full scope of the construct. A math test meant to assess fourth-grade math skills needs to include questions on all the topics fourth graders are expected to know, not just multiplication. Establishing content validity typically involves expert evaluation: academic or industry experts review the test items and judge whether they represent the construct thoroughly. One common method is a card-sorting exercise, where experts categorize test items without being told which construct each one belongs to. If more than 90% of items get sorted into the correct category, that’s strong evidence of content validity.
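Scoring a card-sort comes down to computing the agreement rate between the intended categories and the experts' sorting. This sketch uses made-up items and categories for a hypothetical fourth-grade math test:

```python
# Hypothetical card-sort results: each item's intended topic vs. the
# category an expert sorted it into (without seeing the intended labels).
intended = ["arithmetic", "fractions", "geometry", "fractions", "arithmetic",
            "geometry", "arithmetic", "fractions", "geometry", "arithmetic"]
sorted_as = ["arithmetic", "fractions", "geometry", "fractions", "arithmetic",
             "geometry", "fractions", "fractions", "geometry", "arithmetic"]

matches = sum(a == b for a, b in zip(intended, sorted_as))
agreement = matches / len(intended)

print(f"agreement: {agreement:.0%}")  # one mismatch out of ten items -> 90%
```

Here one item intended to test arithmetic was sorted as fractions, giving 90% agreement, right at the benchmark described above.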

Face validity is simpler and weaker. It’s just whether a test looks like it measures what it’s supposed to measure, based on a surface-level impression. A questionnaire about anxiety that asks about worry, restlessness, and sleep trouble has obvious face validity. Face validity matters mostly for practical reasons: if a test doesn’t look relevant to the people taking it, they may not engage with it seriously. But it’s not considered strong evidence on its own, because appearances can be misleading.

Criterion-Related Validity

Criterion-related validity evaluates a test by comparing its scores to some real-world outcome or established measure. It comes in two forms, distinguished by timing.

Concurrent validity compares test scores to a criterion measured at roughly the same time. A new depression questionnaire has concurrent validity if its scores match up with an already-validated depression inventory when both are given to the same group of people. Similarly, a reading assessment has concurrent validity if its scores correspond to demonstrated reading abilities observed in the classroom.

Predictive validity checks whether test scores predict a future outcome. The classic example is SAT scores predicting first-year college GPA. If a test measuring aggressive tendencies in school children has predictive validity, children who score higher should display more observed aggressive behavior in the weeks and months that follow. The gap between the test and the outcome it predicts is what separates predictive validity from concurrent validity and makes it especially useful for screening and selection decisions.

Internal Validity in Research

Internal validity applies to research studies rather than individual tests. It refers to how confidently a study can establish a cause-and-effect relationship. When a study has strong internal validity, you can trust that changes in the outcome were actually caused by the variable being studied, not by something else.

Several design choices strengthen internal validity. Random assignment puts participants into groups by chance, eliminating systematic differences between groups. Blinding keeps participants (and sometimes researchers) unaware of which treatment they’re receiving, preventing expectations from influencing results. Strict study protocols ensure every participant is treated the same way, so differences in outcomes reflect the treatment rather than inconsistencies in procedure.
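Random assignment itself is mechanically simple. This minimal sketch (with made-up participant IDs and an arbitrary group size) shuffles the participant pool and splits it in half:

```python
import random

random.seed(7)  # fixed seed so the example is reproducible

# Hypothetical participant IDs.
participants = [f"P{i:02d}" for i in range(1, 21)]

# Random assignment: shuffle the pool, then split into two equal groups.
# Every participant has the same chance of landing in either group, so
# pre-existing differences are spread evenly across groups on average.
random.shuffle(participants)
treatment = participants[:10]
control = participants[10:]

print("treatment:", treatment)
print("control:  ", control)
```

The point of the shuffle is that group membership is decided by chance alone, so any systematic difference between the groups before treatment is due to luck rather than selection.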

Threats to internal validity are the alternative explanations that creep in when these safeguards are missing. Some of the most common include:

  • History: Events outside the study happen between measurements and influence the results. A stress-reduction study that runs during final exams might show worse outcomes that have nothing to do with the intervention.
  • Maturation: Participants naturally change over time. In a study lasting months or years, people may improve on their own regardless of treatment, simply through development or practice.
  • Testing effects: Taking a pretest can itself change performance on later tests. IQ scores, for instance, tend to increase 3 to 5 points on a second administration just from familiarity with the test format.
  • Attrition: When participants drop out, the remaining group may no longer represent the original sample. Those who stay might be more motivated or healthier, skewing results.
  • Selection bias: If comparison groups differ in important ways from the start, any differences in outcomes could reflect those pre-existing differences rather than the treatment.
  • Regression to the mean: When participants are selected because of extreme scores, their scores tend to move toward the average on retesting, which can look like improvement even without any real change.
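Regression to the mean is easy to demonstrate by simulation. In this sketch (all parameters are assumptions: stable ability around 100, measurement noise on each test), a group selected for extreme low scores on a first test scores closer to the average on a second test even though nothing about them has changed:

```python
import random
import statistics

random.seed(1)
n = 1000

# True ability is stable; each test score is ability plus measurement noise.
ability = [random.gauss(100, 10) for _ in range(n)]
test1 = [a + random.gauss(0, 10) for a in ability]
test2 = [a + random.gauss(0, 10) for a in ability]

# Select only the extreme low scorers on the first test...
extreme = [i for i in range(n) if test1[i] < 85]
mean1 = statistics.mean(test1[i] for i in extreme)
mean2 = statistics.mean(test2[i] for i in extreme)

print(f"extreme group, test 1: {mean1:.1f}")
print(f"same group,   test 2: {mean2:.1f}")  # closer to 100, with no real change
```

The group's second-test average moves toward 100 because part of their extreme first scores was just unlucky measurement noise, which doesn't repeat. An uncontrolled study could easily mistake this shift for a treatment effect.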

External and Ecological Validity

External validity asks whether a study’s results apply beyond the specific people, setting, and time period in which it was conducted. A finding from a lab study on American college students may not hold for older adults in a different culture. Researchers strengthen external validity by conducting studies in natural settings rather than laboratories, clearly defining who is included in the study, and using cover stories or other techniques to make participants experience the study as realistic so they behave naturally.

Ecological validity is a closely related concept that focuses specifically on whether a research setting resembles real-world environments. A study of childhood aggression has stronger ecological validity if it observes actual conflicts during free play on a schoolyard rather than asking children to press buttons in a lab. The cognitive psychologist Ulric Neisser argued in the late 1970s that traditional laboratory memory research, with its artificial word lists and recognition tasks, had neglected important questions and used materials so far removed from people’s natural environments that the findings had limited practical value. His critique helped push the field toward more ecologically valid study designs.

Why Validity Matters in Practice

Validity isn’t just an academic concern. Psychological tests are used to diagnose mental health conditions, place students in educational programs, screen job applicants, and make custody recommendations in court. When a test lacks validity, decisions based on its scores can be wrong in ways that affect people’s lives. A hiring assessment that doesn’t actually predict job performance screens candidates on irrelevant criteria and can unfairly exclude qualified applicants. A clinical screening tool that doesn’t truly capture depression severity can lead to missed diagnoses or unnecessary treatment.

The joint standards published by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education represent the gold standard for test development and use. These standards treat validity not as a single property a test either has or doesn’t have, but as a body of evidence that accumulates over time. No single study proves a test is valid. Instead, researchers build a case through multiple forms of evidence: content coverage, relationships with other measures, predictive accuracy, and the consequences of using the test in practice. The stronger and more diverse that evidence, the more confidence you can place in what the scores mean.