Claims data is the digital record generated every time a healthcare provider bills an insurance company for a service. Each claim captures who received care, what was done, who provided it, and how much it cost. Collectively, these billions of individual billing records form one of the largest and most widely used sources of healthcare information in the United States, powering everything from insurance reimbursement to population-level research.
What a Single Claim Contains
A healthcare claim is more than just a bill. It’s a structured digital document with dozens of standardized fields that together paint a detailed picture of a healthcare encounter. The core data elements fall into four categories: patient information, provider information, clinical codes, and cost details.
Patient information includes basics like name, date of birth, and a unique identification number assigned by the insurer. Provider information is more layered than you might expect. A single claim can identify the billing provider (the entity sending the bill), the servicing provider (the clinician who actually delivered the care), the admitting provider (the doctor who checked you into the hospital), and the referring provider (the one who sent you there). Each of these is identified by a National Provider Identifier (NPI), a unique 10-digit number assigned to healthcare professionals and organizations across the country.
Clinical codes tell the story of what happened medically. Diagnosis codes describe why you needed care. Procedure codes describe what was done. Cost fields capture the amount the provider billed, the maximum amount the insurer considers “allowable,” and the amount actually paid. For patients with both Medicare and Medicaid, the claim also tracks how much went toward deductibles and coinsurance.
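The four categories of fields described above can be pictured as a single record. The sketch below is a hypothetical, heavily simplified layout: the class name, field names, and sample values are assumptions made for illustration and do not follow any real claim format, and the sample NPIs are placeholders rather than valid identifiers.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal
from typing import List, Optional


@dataclass
class Claim:
    """Hypothetical, simplified claim record (not a real claim layout)."""

    # Patient information
    member_id: str                 # insurer-assigned identifier
    patient_name: str
    date_of_birth: date
    # Provider information (each value is an NPI)
    billing_npi: str
    servicing_npi: str
    admitting_npi: Optional[str] = None
    referring_npi: Optional[str] = None
    # Clinical codes
    diagnosis_codes: List[str] = field(default_factory=list)
    procedure_codes: List[str] = field(default_factory=list)
    # Cost details
    billed_amount: Decimal = Decimal("0")    # what the provider charged
    allowed_amount: Decimal = Decimal("0")   # maximum the insurer allows
    paid_amount: Decimal = Decimal("0")      # what the insurer actually paid

    def allowed_not_paid(self) -> Decimal:
        """Portion of the allowed amount the insurer did not pay."""
        return self.allowed_amount - self.paid_amount


# Made-up example values for a knee replacement claim.
example = Claim(
    member_id="M0001",
    patient_name="Jane Doe",
    date_of_birth=date(1960, 5, 17),
    billing_npi="1234567890",
    servicing_npi="0987654321",
    diagnosis_codes=["M17.11"],
    procedure_codes=["27447"],
    billed_amount=Decimal("1200.00"),
    allowed_amount=Decimal("900.00"),
    paid_amount=Decimal("720.00"),
)
print(example.allowed_not_paid())  # 180.00
```

Amounts are stored as `Decimal` rather than `float` so that currency arithmetic stays exact.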
The Coding Systems Behind Every Claim
Claims data relies on standardized code sets so that a knee replacement billed in Oregon means exactly the same thing as one billed in Florida. Four coding systems do most of the heavy lifting.
- ICD-10 codes identify diagnoses and inpatient hospital procedures. ICD-10 replaced ICD-9 in the United States in October 2015 and offers far greater specificity, with terminology that reflects modern clinical practice. ICD-10-CM codes (the “clinical modification” version) are used by providers in every setting to record diagnoses. ICD-10-PCS codes cover inpatient hospital procedures specifically.
- CPT codes (technically HCPCS Level I) identify services and procedures across six categories: evaluation and management, anesthesia, surgery, radiology, pathology and laboratory, and medicine. These are maintained by the American Medical Association.
- HCPCS Level II codes fill the gaps that CPT codes don’t cover, including durable medical equipment, prosthetics, orthotics, ambulance services, and certain drugs. When a new product hits the market before a specific code exists, suppliers can use miscellaneous codes to begin billing immediately.
- NDC codes (National Drug Codes) identify specific prescription drugs dispensed through pharmacies.
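Each of these code sets has a recognizable shape, which analysts sometimes use as a quick sanity check on a claims extract. The sketch below is only a rough format check under simplifying assumptions: the patterns are deliberately loose, the CPT pattern covers only five-digit numeric codes, ICD-10-CM codes are assumed to be written with their decimal point, and the function names are invented for this example. Matching a pattern does not mean a code actually exists in the code set.

```python
import re

# Simplified shape checks; a match means "plausible format", not "valid code".
PATTERNS = {
    # Letter, digit, alphanumeric, then an optional dot and 1-4 more characters
    "ICD-10-CM": re.compile(r"[A-Z]\d[0-9A-Z](\.[0-9A-Z]{1,4})?"),
    # Seven alphanumeric characters, never the letters I or O
    "ICD-10-PCS": re.compile(r"[0-9A-HJ-NP-Z]{7}"),
    # Five digits (alphanumeric CPT codes are ignored in this sketch)
    "CPT": re.compile(r"\d{5}"),
    # One letter followed by four digits
    "HCPCS Level II": re.compile(r"[A-Z]\d{4}"),
    # Ten digits in a 4-4-2, 5-3-2, or 5-4-1 layout
    "NDC": re.compile(r"\d{4}-\d{4}-\d{2}|\d{5}-\d{3}-\d{2}|\d{5}-\d{4}-\d{1}"),
}


def plausible_systems(code: str) -> list:
    """Return every code system whose format the given code could match."""
    cleaned = code.strip().upper()
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(cleaned)]


def ndc_to_11(ndc: str) -> str:
    """Pad a hyphenated 10-digit NDC to the 11-digit 5-4-2 billing layout."""
    labeler, product, package = ndc.split("-")
    return labeler.zfill(5) + product.zfill(4) + package.zfill(2)


print(plausible_systems("E11.9"))    # ['ICD-10-CM']
print(plausible_systems("99213"))    # ['CPT']
print(plausible_systems("J1100"))    # ['HCPCS Level II']
print(ndc_to_11("1234-5678-90"))     # 01234567890
```

The function returns a list because shape alone can be ambiguous. An ICD-10-CM code written without its decimal point, for instance, can look like a HCPCS Level II code, so in practice the claim field a code appears in determines which system it belongs to.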
Professional vs. Institutional Claims
Not all claims look the same. Professional claims (known as 837P in electronic format, or CMS-1500 on paper) are submitted by individual healthcare professionals and suppliers: your doctor’s office visit, a lab draw, an outpatient specialist appointment. Institutional claims (837I electronically, UB-04 on paper) come from facilities like hospitals, skilled nursing centers, and home health agencies. Institutional claims tend to include additional details like admission and discharge dates, room charges, and facility-specific cost breakdowns. The distinction matters because the two formats capture different slices of a patient’s care, and researchers or analysts working with claims data often need both to get a complete picture.
Pharmacy Claims: A Separate Stream
Pharmacy claims follow their own standard, maintained by the National Council for Prescription Drug Programs (NCPDP). When a pharmacist fills your prescription, the resulting claim captures the National Drug Code identifying the exact medication, the quantity dispensed, the estimated days supply, the fill number (whether it’s a new prescription or a refill), the date the prescription was written, the dispensing fee, and the gross amount due. For compound medications, additional fields record each ingredient, its cost, and the route of administration. Pharmacy claims are transmitted in real time at the point of sale, making them one of the most timely sources of medication utilization data available.
How a Claim Gets Processed
Once a provider submits a claim, it moves through a four-stage adjudication process before anyone gets paid. The initial review checks basic details: Is the patient’s name correct? Are the diagnosis and service codes present? Is the treatment location documented? Claims that pass this screening move to an automatic review, where software checks the claim against the patient’s coverage rules, benefit limits, and medical policies.
Claims that the automated system can’t resolve cleanly get flagged for manual review. A human examiner goes through the claim in detail and may request additional documentation, like medical records, to verify that the services were appropriate for the patient’s situation. Once all reviews are complete, the insurer makes a final payment decision: the claim is paid in full, paid partially, or denied.
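The four stages can be sketched as a small pipeline. This is an illustrative toy, not how any particular insurer's system works: the class and function names, the single `benefit_limit` rule, and the choice to treat a failed initial review as a denial are all assumptions made for the example. Manual review is modeled as a callable supplied by the caller, standing in for the human examiner.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Callable, List, Optional, Set, Tuple


class Decision(Enum):
    PAID_IN_FULL = "paid in full"
    PAID_PARTIALLY = "paid partially"
    DENIED = "denied"


@dataclass
class SubmittedClaim:
    patient_name: str
    diagnosis_codes: List[str]
    service_codes: List[str]
    place_of_service: str
    billed_amount: Decimal


def initial_review(claim: SubmittedClaim) -> bool:
    """Stage 1: are the basic details present?"""
    return bool(claim.patient_name and claim.diagnosis_codes
                and claim.service_codes and claim.place_of_service)


def automatic_review(claim: SubmittedClaim, covered: Set[str],
                     benefit_limit: Decimal) -> Optional[Decimal]:
    """Stage 2: apply coverage rules; None means 'cannot resolve cleanly'."""
    if not all(code in covered for code in claim.service_codes):
        return None
    return min(claim.billed_amount, benefit_limit)


def adjudicate(claim: SubmittedClaim, covered: Set[str], benefit_limit: Decimal,
               manual_review: Callable[[SubmittedClaim], Decimal]
               ) -> Tuple[Decision, Decimal]:
    if not initial_review(claim):
        return Decision.DENIED, Decimal("0")   # assumption: incomplete = denied
    payable = automatic_review(claim, covered, benefit_limit)
    if payable is None:
        payable = manual_review(claim)         # Stage 3: human examiner
    # Stage 4: final payment decision
    if payable <= 0:
        return Decision.DENIED, Decimal("0")
    if payable >= claim.billed_amount:
        return Decision.PAID_IN_FULL, claim.billed_amount
    return Decision.PAID_PARTIALLY, payable


visit = SubmittedClaim("Jane Doe", ["E11.9"], ["99213"], "office", Decimal("100"))
print(adjudicate(visit, {"99213"}, Decimal("80"), lambda c: Decimal("0")))
```

In this example the service is covered but the billed amount exceeds the hypothetical benefit limit, so the claim is paid partially without ever reaching manual review.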
Claims Data vs. Clinical Records
Claims data and electronic health records (EHRs) both describe healthcare encounters, but they capture fundamentally different things. Claims data reflects an insurance plan’s coverage decisions and utilization management. EHR data reflects clinicians’ decisions and practice patterns. The practical differences are significant.
Claims data captures nearly all billable encounters across every provider a patient sees, regardless of health system. EHR data, by contrast, is typically limited to care delivered within a single health system. One study comparing the two for rheumatoid arthritis patients found that EHR records substantially undercounted emergency department visits (4% of patients in EHR vs. 11.2% in claims), X-rays (4% vs. 22%), and CT scans (5.1% vs. 7.3%). The EHR simply didn’t have records from care delivered elsewhere.
Where claims data falls short is clinical depth. It contains no lab results, no blood pressure readings, no imaging findings, no pathology reports. If a piece of clinical information isn’t relevant to getting the bill paid, it generally doesn’t appear in the claim. Diagnosis codes match the actual medical record diagnosis roughly 70% of the time, with accuracy dropping for outpatient visits, milder conditions, and primary care settings compared to hospitals and serious diseases.
How Claims Data Is Used in Research
Despite its limitations, claims data has become one of the most valuable tools in health services research. Because it’s population-based rather than drawn from a single hospital or study cohort, it dramatically reduces selection bias. The sample sizes are enormous, often covering millions of people over many years, enabling studies that would be impossible with traditional clinical trials. Researchers can track medication use in fine detail, calculate real-world treatment costs, and study rare conditions that might produce only a handful of cases in a conventional study.
The tradeoffs are real, though. Because claims data was built for billing rather than research, it is prone to interpretation errors that researchers must navigate carefully. Providers sometimes add diagnosis codes that aren’t directly related to the patient’s actual condition, because certain codes are required to justify coverage. Researchers can’t apply standard diagnostic criteria the way they would in a clinical study, so they have to create “operational definitions” using combinations of codes, visit patterns, and medication fills as proxies for a true diagnosis. And when patients receive reduced copayments through assistance programs, their increased use of services can skew prevalence and incidence rates compared to other groups.
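To make the idea of an operational definition concrete, here is a minimal sketch of one common rule shape: count a patient as a case if they have at least one inpatient claim, or outpatient claims on at least two distinct dates, carrying a qualifying diagnosis code. The thresholds, field names, and the choice of the ICD-10-CM prefixes M05 and M06 (rheumatoid arthritis) are assumptions for illustration, not a validated case definition.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ClaimLine:
    member_id: str
    service_date: date
    setting: str                      # "inpatient" or "outpatient"
    diagnosis_codes: Tuple[str, ...]


def meets_definition(claims: Iterable[ClaimLine],
                     code_prefixes: Tuple[str, ...] = ("M05", "M06"),
                     min_outpatient_dates: int = 2) -> bool:
    """Apply a hypothetical '1 inpatient or 2+ outpatient dates' rule."""
    outpatient_dates = set()
    for claim in claims:
        # Skip claims that carry no qualifying diagnosis code.
        if not any(dx.startswith(code_prefixes) for dx in claim.diagnosis_codes):
            continue
        if claim.setting == "inpatient":
            return True
        outpatient_dates.add(claim.service_date)
    # Two visits on the same day count once, to reduce rule-out coding noise.
    return len(outpatient_dates) >= min_outpatient_dates


history = [
    ClaimLine("M0001", date(2023, 1, 10), "outpatient", ("M06.9",)),
    ClaimLine("M0001", date(2023, 4, 2), "outpatient", ("M06.9", "I10")),
]
print(meets_definition(history))  # True
```

Requiring repeat outpatient codes on different dates is a design choice aimed at the accuracy problem described above: a single outpatient code may reflect a suspected or ruled-out condition rather than a confirmed one.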
Data Lag and Completeness
Even when individual claims are transmitted quickly, a claims dataset is never truly “real-time.” After a service is delivered, the provider has to submit the claim, the insurer has to process it, and any resubmissions or adjustments need time to work through the system. This gap is known as the run-out period, and it directly affects how reliable a dataset is for analysis.
Enrollment records stabilize quickly, with very little run-out needed. Service utilization and cost data take much longer. Preliminary research files from Medicaid typically include at least six months of run-out for every month of data. Final research files wait for at least 12 months of run-out to capture late-arriving claims and adjustments. For anyone analyzing claims data, using a dataset that hasn’t fully “matured” risks undercounting utilization and underestimating costs.
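A simple way to apply these thresholds is to compute how many months of run-out a given service month has accumulated by the time the data is extracted. The sketch below assumes run-out is counted in complete calendar months after the service month, which is a simplification, and the function names and labels are invented for the example. The six- and twelve-month defaults come from the thresholds described above.

```python
from datetime import date


def run_out_months(service_month: date, extract_date: date) -> int:
    """Complete calendar months between the service month and the extract.

    Neither the service month itself nor the (possibly partial) extract
    month is counted.
    """
    elapsed = ((extract_date.year - service_month.year) * 12
               + (extract_date.month - service_month.month))
    return max(0, elapsed - 1)


def maturity(service_month: date, extract_date: date,
             preliminary: int = 6, final: int = 12) -> str:
    """Label a service month by how much run-out it has accumulated."""
    months = run_out_months(service_month, extract_date)
    if months >= final:
        return "final"
    if months >= preliminary:
        return "preliminary"
    return "immature"


# January 2023 services, viewed from three different extract dates.
jan_2023 = date(2023, 1, 1)
print(maturity(jan_2023, date(2023, 7, 15)))  # immature
print(maturity(jan_2023, date(2023, 8, 1)))   # preliminary
print(maturity(jan_2023, date(2024, 2, 1)))   # final
```

An analysis could use a check like this to drop or flag the most recent service months, where utilization and cost totals are still likely to grow as late claims arrive.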
Sharing Claims Data Through Modern Standards
Historically, claims data lived in siloed insurer databases with little standardization for sharing. That’s changing. CMS now requires the use of HL7 FHIR (Fast Healthcare Interoperability Resources) Release 4.0.1 as the technical backbone for exchanging health data electronically. Under the CARIN Consumer Directed Payer Data Exchange standard, insurers must make explanation of benefits data available to patients through APIs, allowing people to access their own claims and encounter data and share it with applications of their choosing. Medicare’s Blue Button program, which lets beneficiaries download their Parts A, B, and D claims data, is built on this framework. These requirements are part of broader CMS interoperability rules designed to give patients and their care teams easier access to the billing data that, for decades, only insurers could see.
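In FHIR, a claim's explanation of benefits is exchanged as an ExplanationOfBenefit resource, typically serialized as JSON. The sketch below reads the claim-level totals from such a resource. The sample document is made up and trimmed far below what a conformant resource requires, and it uses bare category codes (`submitted`, `eligible`, `benefit`) from the base FHIR adjudication code set; the codes and code systems a real payer API returns may differ, so treat the mapping to billed, allowed, and paid amounts as an assumption.

```python
import json
from decimal import Decimal

# Made-up, heavily trimmed sample; not a valid or complete FHIR resource.
SAMPLE_EOB = """
{
  "resourceType": "ExplanationOfBenefit",
  "id": "example-eob-1",
  "status": "active",
  "patient": {"reference": "Patient/example-patient"},
  "billablePeriod": {"start": "2023-03-01", "end": "2023-03-01"},
  "total": [
    {"category": {"coding": [{"code": "submitted"}]},
     "amount": {"value": 150.00, "currency": "USD"}},
    {"category": {"coding": [{"code": "eligible"}]},
     "amount": {"value": 100.00, "currency": "USD"}},
    {"category": {"coding": [{"code": "benefit"}]},
     "amount": {"value": 80.00, "currency": "USD"}}
  ]
}
"""


def totals_by_category(eob: dict) -> dict:
    """Map each total's category code to its amount."""
    if eob.get("resourceType") != "ExplanationOfBenefit":
        raise ValueError("expected an ExplanationOfBenefit resource")
    totals = {}
    for total in eob.get("total", []):
        amount = Decimal(total["amount"]["value"])
        for coding in total["category"].get("coding", []):
            totals[coding["code"]] = amount
    return totals


# parse_float=Decimal avoids binary floating-point rounding on money values.
eob = json.loads(SAMPLE_EOB, parse_float=Decimal)
totals = totals_by_category(eob)
print(totals["submitted"], totals["eligible"], totals["benefit"])
```

In a real application the JSON would come from an authorized call to a payer's patient access API rather than an inline string, and the parser would need to handle line-item adjudication details as well as the claim-level totals shown here.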