Yes, hair color is categorical data. It is one of the most common textbook examples of a categorical variable, appearing in introductory statistics courses as a go-to illustration of data that falls into distinct groups rather than existing on a numerical scale. When you record someone’s hair color as black, brown, blonde, red, or gray, you’re assigning them to a category, not measuring a quantity.
What Makes Hair Color Categorical
Data generally falls into two broad types: categorical (also called qualitative) and quantitative (also called numerical). Categorical data describes qualities or characteristics using words or labels. Quantitative data involves numbers that result from counting or measuring something. Hair color fits squarely into the categorical camp because the values are descriptive labels like “dark brown” or “blonde,” not numbers you can add, subtract, or average.
A quick test: can you meaningfully calculate the average of your data? You can average heights, ages, or test scores. You cannot average “brown” and “red.” That’s a reliable signal you’re working with categorical data.
Hair Color Is Nominal, Not Ordinal
Categorical data has its own subtypes. The two main ones are nominal and ordinal. Ordinal data has categories with a logical ranking, like education level (high school, bachelor’s, master’s, doctorate) or pain severity (mild, moderate, severe). Nominal data has categories with no inherent order.
Hair color is nominal. There is no agreed-upon way to rank blonde as “higher” or “lower” than brown. You could alphabetize them, but that’s an arbitrary choice, not a property of the data itself. Other classic nominal examples include blood type, eye color, and zip code. (A zip code is written in digits, but it is still a label: averaging two zip codes tells you nothing.) Compare that to something like clothing size (S, M, L, XL), where the order is built into the meaning.
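The difference shows up as soon as you try to sort the values. Here is a minimal Python sketch; the lists and the names SIZE_ORDER, sizes, and hair_colors are made up for illustration.

```python
# Ordinal: the ranking is part of the variable's definition,
# so we can sort by an explicit, meaningful order.
SIZE_ORDER = ["S", "M", "L", "XL"]
sizes = ["L", "S", "XL", "M"]
sorted_sizes = sorted(sizes, key=SIZE_ORDER.index)
print(sorted_sizes)  # ['S', 'M', 'L', 'XL']

# Nominal: no ranking exists, so the only sort available is alphabetical,
# which says nothing about the hair colors themselves.
hair_colors = ["red", "brown", "black", "blonde"]
alphabetical = sorted(hair_colors)
print(alphabetical)  # ['black', 'blonde', 'brown', 'red']
```

The size list needs an order supplied from outside the strings, and that order carries meaning. The hair color list has no such order to supply.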
How to Work With Hair Color in Analysis
Because hair color is categorical, the statistical tools you can use with it are different from those for numerical data. You can count how many people fall into each category, calculate percentages, and report the mode (the most frequent category), but you can’t compute a mean, median, or standard deviation. The most common way to summarize hair color data is with a frequency table, a bar chart, or a pie chart. A bar chart with one bar per color, where the height of each bar represents the count or percentage, is the standard visualization.
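As a rough illustration, a frequency table can be built with nothing more than Python’s standard library. The responses below are invented, and the variable names are only for this sketch.

```python
from collections import Counter

# Hypothetical survey responses; the values are made up for illustration.
hair_colors = ["brown", "black", "brown", "blonde", "red", "brown", "black", "blonde"]

counts = Counter(hair_colors)
total = len(hair_colors)

# Frequency table: (category, count, percentage), most common first.
frequency_table = [
    (color, n, 100 * n / total) for color, n in counts.most_common()
]

for color, n, pct in frequency_table:
    print(f"{color:<8}{n:>3}{pct:>7.1f}%")

# The most frequent category is the one summary of a "typical" value
# that makes sense for unordered labels.
mode_color = counts.most_common(1)[0][0]
print("Most common:", mode_color)
```

The same counts are what a bar chart would plot, one bar per category.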
For statistical testing, you’d typically use methods designed for categorical variables. A chi-square test of independence, for example, can tell you whether hair color is distributed differently between two or more groups (say, comparing hair color frequencies across regions).
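To make the idea concrete, the sketch below computes the Pearson chi-square statistic by hand for a small table of regions by hair color. The counts are invented, and the code stops at the statistic and degrees of freedom. In practice you would likely reach for a statistics library (for example, SciPy’s chi2_contingency) to get a p-value as well.

```python
# Made-up counts: rows are two hypothetical regions, columns are hair colors.
colors = ["black", "brown", "blonde", "red"]
observed = [
    [30, 40, 20, 10],  # region A
    [20, 30, 40, 10],  # region B
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count if hair color were independent of region:
# row total * column total / grand total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
dof = (len(observed) - 1) * (len(colors) - 1)

print(f"chi-square = {chi_square:.2f}, degrees of freedom = {dof}")
```

With these invented counts the statistic comes out to about 10.10 on 3 degrees of freedom, which is above the usual 5% critical value of 7.815, so the two made-up regions would be judged to differ in their hair color distribution.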
Using Hair Color in Machine Learning
Most machine learning models require numerical input, so categorical variables like hair color need to be converted into numbers before a model can use them. This is where people sometimes get tripped up. You might think assigning blonde = 1, brown = 2, red = 3, and black = 4 solves the problem, but it doesn’t. The model would interpret those numbers as having a meaningful order and scale, treating black as “four times” blonde and brown as “closer” to blonde than black is, which is nonsense.
The standard solution is one-hot encoding. Each possible hair color gets its own column that contains a 1 or a 0. A person with brown hair would have a 1 in the “brown” column and 0s everywhere else. This way, the model learns a separate weight for each color without imposing a fake ranking. Google’s machine learning documentation uses color variables as a primary example of when one-hot encoding is necessary.
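A minimal sketch of one-hot encoding in plain Python follows. The category list and the helper names one_hot and squared_distance are assumptions made for this example, not part of any particular library. The integer codes are the ones from the blonde = 1 example above.

```python
# Assumed category list; a real dataset would define its own.
CATEGORIES = ["black", "blonde", "brown", "red"]

def one_hot(color, categories=CATEGORIES):
    """Return a list with a 1 in the position for `color` and 0 elsewhere."""
    if color not in categories:
        raise ValueError(f"unknown hair color: {color!r}")
    return [1 if color == c else 0 for c in categories]

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

print(one_hot("brown"))  # [0, 0, 1, 0]

# Integer codes imply that some colors are "closer" than others.
integer_codes = {"blonde": 1, "brown": 2, "red": 3, "black": 4}
gap_blonde_brown = abs(integer_codes["brown"] - integer_codes["blonde"])  # 1
gap_blonde_black = abs(integer_codes["black"] - integer_codes["blonde"])  # 3

# One-hot vectors put every pair of distinct colors at the same distance.
d_blonde_brown = squared_distance(one_hot("blonde"), one_hot("brown"))  # 2
d_blonde_black = squared_distance(one_hot("blonde"), one_hot("black"))  # 2
```

The equal distances are the point: no pair of colors is treated as more alike than any other.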
When the number of possible categories gets very large, one-hot encoding becomes unwieldy because it creates a huge number of columns, most of which are zeros. In those cases, techniques like embeddings or hashing can reduce the number of dimensions while still preserving the categorical nature of the data. Hair color with its handful of common categories doesn’t usually run into this problem.
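One way to picture the hashing approach is the sketch below, which maps any label to one of a fixed number of columns. It is a simplified illustration, not a production encoder: the function names and the choice of 8 buckets are assumptions, and a stable hash from hashlib is used because Python’s built-in hash() for strings varies between runs.

```python
import hashlib

def hashed_bucket(label, n_buckets=8):
    """Map a category label to a column index in range(n_buckets)."""
    digest = hashlib.sha256(label.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets

def hash_encode(label, n_buckets=8):
    """Return a fixed-length 0/1 vector with a single 1 at the hashed index."""
    vector = [0] * n_buckets
    vector[hashed_bucket(label, n_buckets)] = 1
    return vector

# Any label, even one never seen before, gets a vector of the same length.
for label in ["auburn", "strawberry blonde", "salt and pepper"]:
    print(label, hash_encode(label))
```

The trade-off is that two different labels can land in the same bucket, so the encoding gives up some distinctness in exchange for a fixed, small number of columns.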
When Hair Color Can Be Treated as Quantitative
There is one exception worth knowing about. In certain research contexts, hair color is converted into a numerical score rather than kept as a label. For example, researchers studying how hair affects medical sensor readings have used a 0-to-4 scale for hair color based on darkness, effectively turning it into an ordinal variable that is handled as a number. This works because the underlying property being measured is really melanin concentration, which does exist on a spectrum. Darker hair absorbs more near-infrared light than lighter hair, and that absorption can be quantified.
Similarly, if you measured hair color using the wavelength of reflected light or the concentration of specific pigments, you’d have quantitative data. But in everyday statistics, surveys, and data science projects, hair color is recorded as a label and treated as categorical. The context determines the data type: if you wrote down “auburn,” it’s categorical. If you measured light reflectance at 650 nanometers, it’s quantitative. The variable itself isn’t inherently one type. It depends on how you collected the data.
Quick Reference for Common Examples
- Nominal categorical: hair color, eye color, blood type, zip code, country of birth
- Ordinal categorical: education level, satisfaction rating, pain severity, income bracket
- Quantitative (continuous): height, weight, temperature, blood pressure
- Quantitative (discrete): number of siblings, number of pets, shoe size
Hair color sits firmly in the nominal categorical group. Unless you’re measuring pigment concentrations in a lab, treat it as a set of unordered labels, visualize it with bar charts or pie charts, and encode it properly before feeding it into any model.