Failure rate is calculated by dividing the number of failures by the total operating time. If 5 components fail over a combined 10,000 hours of operation, the failure rate is 0.0005 failures per hour. That basic ratio is the foundation, but the math gets more nuanced depending on whether your failure rate changes over time, what distribution fits your data, and how much confidence you need in the result.
The Basic Formula
At its simplest, failure rate (often represented by the Greek letter lambda, λ) is:
λ = Number of failures ÷ Total operating time
Total operating time means the cumulative hours (or cycles, miles, or any unit of use) across all items being tracked. If you’re testing 100 light bulbs and they collectively run for 50,000 hours before you stop counting, that’s your denominator. The numerator is however many burned out during that window.
This gives you a rate with units like “failures per hour” or “failures per cycle.” A failure rate of 0.001 per hour means you’d expect, on average, one failure for every 1,000 hours of operation.
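The arithmetic is a one-liner. A minimal sketch in Python, using the numbers from the opening example (the function name is just illustrative):

```python
def failure_rate(failures, total_operating_time):
    """Basic failure rate: number of failures divided by cumulative
    operating time across all tracked items. Units follow the inputs
    (hours in, failures per hour out)."""
    return failures / total_operating_time

# 5 failures over 10,000 combined hours of operation
lam = failure_rate(5, 10_000)
print(lam)  # 0.0005 failures per hour
```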
Converting Between Failure Rate and MTBF
Mean Time Between Failures (MTBF) is the flip side of failure rate. They use identical data, just presented differently. The conversion is straightforward:
- MTBF = 1 ÷ λ
- λ = 1 ÷ MTBF
So a failure rate of 0.0005 per hour translates to an MTBF of 2,000 hours. This inverse relationship only holds cleanly when the failure rate is constant, which is an important assumption we’ll come back to. In practice, many reliability engineers prefer MTBF because it’s more intuitive: “this component lasts about 2,000 hours on average” is easier to communicate than a small decimal rate.
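The inverse relationship is easy to verify in code. A small sketch (function names are illustrative), which only holds under the constant-rate assumption noted above:

```python
def mtbf_from_rate(lam):
    """MTBF = 1 / lambda (valid only when the failure rate is constant)."""
    return 1 / lam

def rate_from_mtbf(mtbf):
    """lambda = 1 / MTBF (same constant-rate assumption)."""
    return 1 / mtbf

print(mtbf_from_rate(0.0005))  # 2000.0 hours
print(rate_from_mtbf(2000))    # 0.0005 per hour
```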
When Failure Rate Isn’t Constant
The simple formula assumes failures happen at a steady, predictable pace. In reality, most products follow what reliability engineers call the bathtub curve, a pattern with three distinct phases.
During early life (sometimes called infant mortality), the failure rate starts high and decreases. Weak units with manufacturing defects fail quickly, and once those are weeded out, the remaining population is more robust. This phase is why manufacturers often “burn in” components before shipping them.
The middle stretch is the useful life period. Here the failure rate is low and roughly constant. Failures happen randomly rather than from any systematic cause. This is the phase where the simple λ = failures ÷ time formula works best, and it’s the period that most reliability testing targets.
Eventually, components enter the wear-out phase, where the failure rate climbs as materials degrade, parts fatigue, and age catches up. Calculating failure rate during this phase with a simple average will underestimate the risk for older components and overestimate it for newer ones.
Using the Weibull Distribution
When your failure rate isn’t constant, the Weibull distribution is the standard tool. It uses a shape parameter (commonly called beta, β) that controls whether the failure rate is decreasing, constant, or increasing over time.
- β < 1: Failure rate decreases over time (infant mortality phase)
- β = 1: Failure rate is constant (the Weibull simplifies to an exponential distribution, and the basic formula applies)
- β > 1: Failure rate increases over time (wear-out phase)
The Weibull hazard function calculates the instantaneous failure rate at any given point in time, not just a flat average. You need failure time data from your components to estimate the shape and scale parameters, typically using statistical software or specialized reliability tools. Once those parameters are estimated, you can predict the failure rate at any age, which is far more useful than a single average when planning maintenance schedules or warranty periods.
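The hazard function itself is simple to evaluate once you have parameters. A sketch of the two-parameter Weibull hazard, h(t) = (β/η)(t/η)^(β−1), where η is the scale parameter (characteristic life); the parameter values below are illustrative, not fitted to any dataset:

```python
def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate at age t for a two-parameter Weibull:
    h(t) = (beta / eta) * (t / eta) ** (beta - 1).
    beta is the shape parameter, eta the scale (characteristic life)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

# beta > 1: hazard rises with age (wear-out phase)
print(weibull_hazard(500, beta=2.0, eta=1000))   # 0.001
print(weibull_hazard(1000, beta=2.0, eta=1000))  # 0.002

# beta = 1: hazard is constant at 1/eta (the exponential special case)
print(weibull_hazard(500, beta=1.0, eta=1000))   # 0.001
```

Note how with β = 1 the age argument drops out entirely, recovering the constant-rate case where the basic formula applies.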
Handling Items That Haven’t Failed Yet
A common real-world complication: not everything in your dataset has failed by the time you need an answer. Some components are still running, some were removed from service for unrelated reasons, and some tests end before all units fail. These are called censored observations, and ignoring them will skew your calculation.
In a classic example from integrated circuit testing, researchers ran 4,154 units under stress conditions and stopped the test at 1,370 hours. Only 26 had failed. The other 4,128 units were still working. Simply dividing 26 by the test duration would throw away the information those surviving units provide. Instead, each unit contributes its individual operating time to the denominator, whether it failed or not. A unit that ran the full 1,370 hours without failing still adds 1,370 hours to the total time at risk.
For more complex situations where the failure rate changes over time, statistical methods like maximum likelihood estimation handle censored data more rigorously. Most reliability software packages do this automatically.
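The pooled-time approach for constant-rate data can be sketched directly. The survivor count and censoring time below match the IC example above; the individual failure times are hypothetical placeholders, since the source doesn't list them:

```python
def failure_rate_censored(failure_times, n_survivors, censor_time):
    """Pooled failure rate with right-censored units.
    Each unit contributes its own operating time to the denominator:
    failed units contribute their time-to-failure, survivors contribute
    the full test duration."""
    total_time = sum(failure_times) + n_survivors * censor_time
    return len(failure_times) / total_time

# Hypothetical failure times for the 26 failed units (50 h .. 1,300 h);
# 4,128 survivors each contribute the full 1,370-hour test duration.
failure_times = [50 * (i + 1) for i in range(26)]
lam = failure_rate_censored(failure_times, n_survivors=4128, censor_time=1370)
print(lam)  # roughly 4.6e-6 failures per hour
```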
Adding Confidence Intervals
A single failure rate number can be misleading, especially with small sample sizes. Five failures in 10,000 hours gives you a point estimate of 0.0005 per hour, but the true rate could reasonably be higher or lower. Confidence intervals quantify that uncertainty.
The standard approach uses the chi-square distribution. You need two inputs: the total operating time (T) and the number of failures (r). For a 90% confidence interval, you calculate upper and lower bounds on MTBF using chi-square values, then invert those bounds to get the failure rate interval. The formula for the MTBF bounds is:
- Lower MTBF bound = 2T ÷ (upper-tail chi-square value, i.e. the 95th percentile for a 90% interval, with 2(r + 1) degrees of freedom)
- Upper MTBF bound = 2T ÷ (lower-tail chi-square value, the 5th percentile, with 2r degrees of freedom)
Flipping those MTBF bounds gives you the confidence interval for λ. Chi-square tables are available in any statistics reference, and most spreadsheet software has a built-in chi-square inverse function. The fewer failures you’ve observed, the wider your confidence interval will be, which is a useful reality check on how much you should trust a failure rate calculated from limited data.
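Those bounds can be computed with any chi-square inverse function. A sketch using SciPy's `scipy.stats.chi2.ppf` (the function name `failure_rate_ci` and the time-terminated-test form of the bounds are assumptions of this example):

```python
from scipy.stats import chi2

def failure_rate_ci(total_time, failures, confidence=0.90):
    """Two-sided confidence interval for a constant failure rate,
    using the chi-square bounds on MTBF for a time-terminated test,
    then inverting them."""
    alpha = 1 - confidence
    T, r = total_time, failures
    mtbf_lower = 2 * T / chi2.ppf(1 - alpha / 2, 2 * (r + 1))  # upper tail
    mtbf_upper = 2 * T / chi2.ppf(alpha / 2, 2 * r)            # lower tail
    # Flip the MTBF bounds to get the failure-rate interval.
    return 1 / mtbf_upper, 1 / mtbf_lower

# 5 failures in 10,000 hours: point estimate 0.0005 per hour
low, high = failure_rate_ci(10_000, 5)
print(f"lambda 90% CI: ({low:.6f}, {high:.6f})")
```

With only 5 failures, the interval spans roughly a factor of five, which illustrates the point about limited data.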
The FIT Unit for Very Reliable Components
When components are extremely reliable, failure rates become inconveniently small numbers. The electronics industry uses a unit called FIT (Failures in Time), defined as one failure per one billion device-hours. So 10 FIT means 10 failures are expected for every 10⁹ hours of cumulative operation. To convert a per-hour failure rate to FIT, multiply by 10⁹. A failure rate of 0.00000001 per hour equals 10 FIT, which is much easier to compare across components on a spec sheet.
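The conversion is a fixed scale factor. A tiny sketch (function names are illustrative):

```python
def per_hour_to_fit(lam):
    """Convert a per-hour failure rate to FIT (failures per 1e9 device-hours)."""
    return lam * 1e9

def fit_to_per_hour(fit):
    """Convert FIT back to a per-hour failure rate."""
    return fit / 1e9

print(per_hour_to_fit(1e-8))  # ~10 FIT
```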
Change Failure Rate in Software
If you landed here from a software engineering context, “failure rate” often refers to something different: the Change Failure Rate, one of the four DORA metrics used to measure software delivery performance. This is calculated as:
Change Failure Rate = Deployments requiring immediate intervention ÷ Total deployments
A “failure” here means a deployment that caused an incident, needed a rollback, or required an urgent hotfix. If your team deployed 50 times last month and 3 of those deployments needed immediate remediation, your change failure rate is 6%. Elite-performing teams typically keep this below 5%. Unlike hardware failure rates, this metric is a simple ratio with no time component, making it straightforward to track in your CI/CD pipeline.
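The metric above is a plain ratio, easy to compute from deployment records. A sketch using the example numbers from the text:

```python
def change_failure_rate(failed_deployments, total_deployments):
    """DORA Change Failure Rate: deployments needing immediate
    remediation (rollback, hotfix, incident) over total deployments."""
    return failed_deployments / total_deployments

cfr = change_failure_rate(3, 50)  # 3 remediated out of 50 deployments
print(f"{cfr:.0%}")  # 6%
```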