A synthetic population is a computer-generated dataset designed to mirror the real population of a specific geographic area. It is a full representation of every individual and household, complete with detailed demographic and socioeconomic attributes. Think of it as a digital twin of a city or region, populated by artificial individuals whose collective characteristics—age, income, household size, and more—statistically match the actual community. These datasets are not simple averages; they capture the granular details of a population. For example, a synthetic population would contain distinct, fictional households, each with its own specific income level and number of residents, rather than just the average income of a neighborhood.
Purpose of Creating Synthetic Populations
The primary driver for creating synthetic populations is the need to protect individual privacy and ensure data confidentiality. Many rich datasets, such as national censuses or detailed health surveys, contain sensitive personal information that cannot be publicly released. Sharing such data, even after removing obvious identifiers like names and addresses, carries a disclosure risk: individuals could still be re-identified from the combination of attributes that remains.
Synthetic populations address this problem by providing an alternative that is statistically similar to the real population but contains no records of actual people. Because the individuals and households in the dataset are artificially generated, no record corresponds directly to a real person, which greatly reduces the disclosure risk, although it does not remove it entirely when the synthetic records are derived from real microdata. This allows researchers and planners to access the detailed, individual-level data necessary for their work without compromising the privacy of actual citizens. The synthetic dataset acts as a privacy-preserving proxy, enabling complex analysis and evidence-based decision-making.
The Generation Process
The creation of a synthetic population begins with two types of real-world data. The first is aggregate data, typically from a census, which provides summary tables for specific geographic areas. These tables, known as marginals, show details like the total number of people in different age groups or income brackets and provide the high-level constraints the synthetic population must match.
The second ingredient is a disaggregated sample of the population, such as the Public Use Microdata Sample (PUMS). This sample contains anonymous, individual-level records showing how different characteristics are combined within actual households. While only covering a small fraction of the total population, this microdata reveals the internal relationships between variables, like how age, income, and education level are correlated.
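The core of the generation process is an algorithm that uses the microdata sample as a template to build a full population that aligns with the aggregate totals. Methods like Iterative Proportional Fitting (IPF) or combinatorial optimization are used to solve this puzzle. IPF repeatedly rescales the weights of the sample households so that the weighted totals match each marginal in turn, while combinatorial optimization swaps households in and out of a candidate population to reduce the mismatch. In both cases, households from the sample are then replicated and placed into geographical zones until the synthetic population’s statistics match the known census totals as closely as possible.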
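The fitting step can be illustrated with a minimal IPF sketch in plain Python. It assumes a two-way table of age group by income bracket; the seed counts, the marginal totals, and the function name ipf are hypothetical values chosen for illustration, not figures from any real census. The seed table plays the role of the microdata cross-tabulation, and the targets play the role of the census marginals.

```python
def ipf(seed, row_targets, col_targets, max_iter=1000, tol=1e-9):
    """Rescale a seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(max_iter):
        # Row step: scale each row so it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                factor = target / total
                table[i] = [cell * factor for cell in table[i]]
        # Column step: scale each column so it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                factor = target / total
                for row in table:
                    row[j] *= factor
        # Columns now match exactly; stop once the rows match too.
        worst = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if worst < tol:
            break
    return table


# Hypothetical sample cross-tab: rows are age groups, columns are income brackets.
seed = [[10.0, 5.0], [4.0, 6.0], [2.0, 8.0]]
# Hypothetical census marginals for one zone (both sum to 1000 people).
age_totals = [400.0, 350.0, 250.0]
income_totals = [450.0, 550.0]

fitted = ipf(seed, age_totals, income_totals)
for row in fitted:
    print([round(cell, 1) for cell in row])
```

The fitted cells are fractional expected counts. Turning them into whole households, for example by drawing sample records in proportion to the fitted weights, is a separate step that this sketch leaves out.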
Applications in Research and Planning
The applications of synthetic populations span numerous fields, enabling detailed simulations that would be unfeasible with real populations. In urban planning and transportation, these datasets are used to model the daily movements of a city’s residents. Planners can simulate traffic flow, estimate demand for new public transit lines, or develop emergency evacuation strategies.
In public health, epidemiologists use synthetic populations to model the spread of infectious diseases. By simulating interactions between synthetic individuals in households, schools, and workplaces, researchers can test the effectiveness of different interventions. They can evaluate the impact of vaccination campaigns or social distancing measures before implementing these policies.
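A toy version of such a simulation might look like the sketch below. Everything in it is an assumption made for illustration: the six-person population, the field names, the function simulate_outbreak, and the simplified rule that each day every unvaccinated, uninfected member of a household or workplace containing an infected person becomes infected with a fixed probability. Real epidemiological models are considerably richer.

```python
import random
from collections import defaultdict


def simulate_outbreak(people, initial_infected, p_transmit, rng, days=30):
    """Return the set of person ids infected after `days` of daily contact."""
    # Index people by the groups they share (household and workplace).
    groups = defaultdict(set)
    for pid, person in people.items():
        groups[("household", person["household"])].add(pid)
        groups[("workplace", person["workplace"])].add(pid)

    infected = set(initial_infected)
    for _ in range(days):
        newly = set()
        for members in groups.values():
            if not (members & infected):
                continue  # nobody infectious in this group today
            for pid in members - infected:
                if people[pid]["vaccinated"]:
                    continue  # assumption: vaccination blocks infection fully
                if rng.random() < p_transmit:
                    newly.add(pid)
        infected |= newly
    return infected


def with_vaccinated(people, ids):
    """Copy the population, marking the given ids as vaccinated."""
    return {pid: {**p, "vaccinated": pid in ids} for pid, p in people.items()}


# Hypothetical synthetic individuals: three households, three workplaces.
people = {
    0: {"household": 0, "workplace": "A", "vaccinated": False},
    1: {"household": 0, "workplace": "B", "vaccinated": False},
    2: {"household": 1, "workplace": "A", "vaccinated": False},
    3: {"household": 1, "workplace": "C", "vaccinated": False},
    4: {"household": 2, "workplace": "B", "vaccinated": False},
    5: {"household": 2, "workplace": "C", "vaccinated": False},
}

baseline = simulate_outbreak(people, {0}, 0.3, random.Random(1))
campaign = simulate_outbreak(with_vaccinated(people, {1, 2}), {0}, 0.3, random.Random(1))
print(len(baseline), len(campaign))
```

Comparing the two runs is the point of the exercise: the same synthetic population is simulated with and without the intervention, and the difference in outcomes estimates its effect.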
Governments and economists also use this tool to forecast the effects of social and economic policies. Before introducing a new tax credit, for example, analysts can apply it to a synthetic population to see how it might affect household incomes across different demographics. This allows for pre-implementation analysis to identify a policy’s potential benefits or unintended consequences.
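The tax credit example can be sketched in a few lines. The households, the credit rule (a fixed amount per resident, paid only below an income cap), the bracket boundaries, and all function names here are hypothetical, chosen only to show the shape of such an analysis.

```python
from collections import defaultdict

# Hypothetical synthetic households.
households = [
    {"size": 1, "income": 18_000},
    {"size": 4, "income": 32_000},
    {"size": 2, "income": 55_000},
    {"size": 3, "income": 90_000},
    {"size": 1, "income": 150_000},
]


def credit(household, per_person=500, income_cap=60_000):
    """Hypothetical policy: a fixed credit per resident below an income cap."""
    if household["income"] < income_cap:
        return per_person * household["size"]
    return 0


def bracket(income):
    """Assign an income bracket (boundaries are illustrative assumptions)."""
    if income < 30_000:
        return "low"
    if income < 80_000:
        return "middle"
    return "high"


def average_credit_by_bracket(households):
    """Average credit received per household in each income bracket."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for h in households:
        b = bracket(h["income"])
        totals[b] += credit(h)
        counts[b] += 1
    return {b: totals[b] / counts[b] for b in counts}


result = average_credit_by_bracket(households)
print(result)  # {'low': 500.0, 'middle': 1500.0, 'high': 0.0}
```

Because each record carries both household size and income, the analysis can show who benefits rather than only the total cost, which is what aggregate tables alone cannot provide.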
Assessing Validity and Limitations
To check that a synthetic population is a reliable proxy, its creators perform validation tests. They statistically compare the synthetic dataset against the original, confidential data at multiple levels of detail. This involves checking whether the distributions of attributes like age and income, and the relationships between them, are reproduced closely enough in the synthetic version for analyses run on it to be trusted.
A part of this process is ensuring joint distributions are correct—for example, that the number of high-income, single-person households in a specific area matches known patterns. This validation gives researchers confidence that the model’s structure is statistically congruent with the real population.
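One simple way to check a joint distribution is to count households in each combination of attributes and sum the absolute differences from reference counts. The sketch below assumes hypothetical attribute names, records, and reference counts; total absolute error is one of several possible fit measures and is used here only because it is easy to read.

```python
from collections import Counter


def joint_counts(households, keys):
    """Count households in each combination of the given attributes."""
    return Counter(tuple(h[k] for k in keys) for h in households)


def total_absolute_error(synthetic, reference):
    """Sum of absolute count differences over every cell in either table."""
    cells = set(synthetic) | set(reference)
    return sum(abs(synthetic.get(c, 0) - reference.get(c, 0)) for c in cells)


# Hypothetical synthetic households for one area.
synthetic_households = [
    {"income": "high", "size": 1},
    {"income": "high", "size": 1},
    {"income": "low", "size": 1},
    {"income": "low", "size": 2},
]

# Hypothetical reference counts for the same area, keyed by (income, size).
reference = {("high", 1): 2, ("low", 1): 1, ("low", 3): 1}

synthetic = joint_counts(synthetic_households, ["income", "size"])
error = total_absolute_error(synthetic, reference)
print(error)  # 2: one extra (low, 2) household and one missing (low, 3)
```

Here the high-income, single-person cell matches exactly, while the mismatch shows up in the low-income cells, which is the kind of detail a check on marginals alone would miss.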
Despite their utility, synthetic populations have limitations tied to their input data. The model is only as good as the information used to build it and may struggle to accurately represent very small or rare subgroups if they were not well-captured in the initial sample data. Complex social networks or behaviors not present in the source datasets will also be absent from the synthetic version. While not a perfect replica of reality, these datasets are a valuable tool when direct access to sensitive data is not an option.