Elements of a Reliability Program, Part One
This is part one of a two-part series in which I outline the basic elements of creating and supporting a reliability program.
Gather Requirements and Set Reliability Goals
Reliability Goals start with understanding customer needs and the competitive situation surrounding those needs. However, this information must also be viewed through the lens of marketing strategy. Setting a new standard in reliability is a proven way to gain market share. Low warranty costs may also allow a lower price than the competition.
Reliability Goals for complex products have at least two dimensions: how frequently random-in-time failures will be tolerated, and how long the device must last.
The first of these is usually expressed as a failure rate (e.g. Annual Failure Rate, or AFR, in % per year) when speaking to customers, or as a Mean Time Between Failures (MTBF) when speaking to engineers.
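To make the relationship between the two figures concrete, here is a minimal sketch that converts between AFR and MTBF. It assumes a constant failure rate (the exponential model) and a unit that runs 24 hours a day, 8760 hours a year. Neither assumption comes from a specific product, and the function names are purely illustrative.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumption: the unit is powered 24 hours a day, 365 days a year


def afr_from_mtbf(mtbf_hours: float, hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Fraction of units expected to fail in one year, given a constant failure rate."""
    # 1 - exp(-t / MTBF), written with expm1 to stay accurate for small rates
    return -math.expm1(-hours_per_year / mtbf_hours)


def mtbf_from_afr(afr: float, hours_per_year: float = HOURS_PER_YEAR) -> float:
    """Inverse of afr_from_mtbf; afr is a fraction (0.02 means 2% per year)."""
    return -hours_per_year / math.log1p(-afr)


if __name__ == "__main__":
    for mtbf in (50_000, 500_000, 1_000_000):
        print(f"MTBF {mtbf:>9,} h -> AFR {100 * afr_from_mtbf(mtbf):.2f}% per year")
```

For a product that is used only a few hours a day, pass the actual operating hours per year instead of the 8760-hour default.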
The second of these is often expressed as the L1, L10, or L50 life: the point at which 1%, 10%, or 50% of the devices, respectively, have failed in an end-of-life (wear-out) failure mode. This life can be expressed in time (hours or months), number of cycles, takeoffs plus landings, sorties, and so on (e.g., L10 = 1000 cycles).
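As a rough illustration of how L1, L10, and L50 relate to one another, the sketch below assumes the wear-out mode follows a Weibull distribution. That is a common modeling choice, but it is an assumption here and not something the goals themselves require. The parameters eta (characteristic life) and beta (shape) are hypothetical values.

```python
import math


def weibull_l_life(percent_failed: float, eta: float, beta: float) -> float:
    """Time (or cycles) by which percent_failed % of units have worn out.

    eta  : Weibull characteristic life, in the same units as the result
    beta : Weibull shape; values above 1 describe wear-out
    """
    p = percent_failed / 100.0
    if not 0.0 < p < 1.0:
        raise ValueError("percent_failed must be between 0 and 100, exclusive")
    # Solve F(t) = 1 - exp(-(t / eta) ** beta) = p for t
    return eta * (-math.log1p(-p)) ** (1.0 / beta)


if __name__ == "__main__":
    eta_cycles, beta = 5000.0, 3.0  # hypothetical wear-out parameters
    for pct in (1, 10, 50):
        print(f"L{pct} = {weibull_l_life(pct, eta_cycles, beta):.0f} cycles")
```

The steeper the wear-out (larger beta), the closer together L1, L10, and L50 fall.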
These two dimensions of reliability are independent. It is possible to have an MTBF of millions of hours, a situation where random-in-time failures are rare, yet have most units fail before five years because certain components inevitably wear out.
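The following sketch puts numbers on that situation. It treats random-in-time failures as a constant failure rate and wear-out as a Weibull distribution, and assumes the two mechanisms act independently on a unit that runs continuously. All parameter values are hypothetical and chosen only to illustrate the point.

```python
import math

HOURS_PER_YEAR = 8760.0  # assumption: continuous operation


def fraction_failed(t_hours, mtbf_hours, eta_hours, beta):
    """Fraction failed by t_hours from either random or wear-out failures."""
    survive_random = math.exp(-t_hours / mtbf_hours)              # constant failure rate
    survive_wearout = math.exp(-((t_hours / eta_hours) ** beta))  # Weibull wear-out
    # Independent mechanisms: a unit survives only if it survives both
    return 1.0 - survive_random * survive_wearout


if __name__ == "__main__":
    five_years = 5 * HOURS_PER_YEAR
    mtbf = 2_000_000.0          # hypothetical: random failures are rare
    eta, beta = 35_000.0, 4.0   # hypothetical: a component that wears out at about 4 years

    random_only = 1.0 - math.exp(-five_years / mtbf)
    total = fraction_failed(five_years, mtbf, eta, beta)
    print(f"Failed by 5 years, random failures only: {100 * random_only:.1f}%")
    print(f"Failed by 5 years, including wear-out:   {100 * total:.1f}%")
```

With these values the MTBF figure alone looks excellent, while the wear-out term dominates the five-year outcome.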
Subassemblies and components that have wear-out failure modes which cannot economically be pushed to last long enough are often called “service items.” Examples are rechargeable batteries and automobile tires.
Sometimes these service items can be combined with a consumable item or grouped with other service items in logical periodic replacement cartridges. Examples are putting the drum of a laser printer in the cartridge with the toner, or selling defibrillator pads with battery packs for periodic replacement.
Indeed, service strategy is one of the major inputs to the product design. Sometimes sterility considerations or shelf life make it logical to break the product into a long-lived portion and a “disposable” portion. An example of this is an electronic thermometer with a disposable sheath for each use. Other considerations from the service plan may be to design a low-cost, unserviceable, sealed unit, or to perform on-site repairs because of the product's size. If the intention is to use loaner units and bring all units to one centralized location for repair, the device must be rugged enough for multiple shipments.
Risk Control Measures (mitigations) from Risk Management (safety) are another rich source of Reliability Goals. Safety may dictate an architectural change to the product to achieve desired reliability. An example of this is the dual-diagonal braking system on automobiles, which is now standard.
Other sources of Reliability Goals are external standards (e.g. ISO, IEC) and the manufacturing plan. For example, if manufacturing screening is to be utilized, good design margin is needed to ensure the product’s fatigue life will be only slightly consumed during manufacturing.
Allocate Reliability Goals to Subassemblies and Key Components
If the product contains redundancy or the ability to partially function in the presence of failures, a Reliability Block Diagram should be constructed showing how the reliabilities of the individual pieces combine to produce the top-level reliability. Based on field experience with previous models, competitive information, and engineering judgment, the top-level goal for random-in-time failures should be allocated to subassemblies and key components. This will typically result in a Pareto of (unequal) individual failure rates.
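A Reliability Block Diagram reduces to two basic rules, sketched below under the assumption that blocks fail independently of one another. The product layout and the reliability values are hypothetical; they stand in for whatever the real block diagram contains.

```python
from math import prod


def series(*reliabilities: float) -> float:
    """All blocks must work: reliabilities multiply."""
    return prod(reliabilities)


def parallel(*reliabilities: float) -> float:
    """Redundant blocks: the group fails only if every block fails."""
    return 1.0 - prod(1.0 - r for r in reliabilities)


if __name__ == "__main__":
    # Hypothetical product: a controller in series with two redundant power
    # supplies and a sensor. Values are illustrative, not field data.
    controller, supply, sensor = 0.995, 0.97, 0.99
    without_redundancy = series(controller, supply, sensor)
    with_redundancy = series(controller, parallel(supply, supply), sensor)
    print(f"Single supply:     {without_redundancy:.4f}")
    print(f"Redundant supply:  {with_redundancy:.4f}")
```

Nesting the two functions is enough to evaluate any diagram made only of series and parallel groups; more complex structures (for example, k-out-of-n voting) need additional rules.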
Service strategy (e.g. annual preventive maintenance, service based on mileage) will combine with field experience, competitive data, and engineering judgment to set the end-of-life goals for the “service items.” For example, it may make no sense to increase product cost by pushing out the end-of-life for one service item, if another service item requires annual maintenance and both can be replaced simultaneously.
Develop Analytical Models and Reliability Predictions
Depending on the nature of each key component or subassembly, develop an estimate of reliability using one or more of the following: Finite Element Analysis (FEA), comparison to similar designs, Physics of Failure (PoF), parts count prediction techniques, or supplier test data. Be cautious, however, when using supplier data. If your application is more stressful than the conditions of the supplier test, additional testing will be necessary.
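Of the methods listed, a parts count prediction is the easiest to show in a few lines. The sketch below is a simplified form: it sums quantity times a generic failure rate for each part type, and it assumes constant failure rates and that any single part failure fails the assembly. Published prediction handbooks add quality and environment factors that are omitted here, and the parts list and FIT values are invented for illustration.

```python
FIT_PER_HOUR = 1e-9  # 1 FIT = one failure per billion device-hours


def parts_count_fit(bill_of_materials):
    """Sum quantity * generic failure rate (in FIT) over every line of the parts list."""
    return sum(qty * fit for _name, qty, fit in bill_of_materials)


def mtbf_hours(total_fit):
    """Convert a total failure rate in FIT to an MTBF in hours."""
    return 1.0 / (total_fit * FIT_PER_HOUR)


if __name__ == "__main__":
    # (part type, quantity, generic failure rate in FIT) -- hypothetical values
    bom = [
        ("ceramic capacitor", 40, 1.0),
        ("resistor", 60, 0.5),
        ("microcontroller", 1, 20.0),
        ("connector", 4, 5.0),
    ]
    total = parts_count_fit(bom)
    print(f"Predicted failure rate: {total:.0f} FIT")
    print(f"Predicted MTBF: {mtbf_hours(total):,.0f} hours")
```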
By comparing the allocated Reliability Goal with the various analytical models or reliability predictions for each subassembly or key component, one can see where to focus the reliability engineering effort. It may be possible to re-allocate the individual goals in light of the analytical results. If the gap is large, an architectural change in the product, such as specifically targeted redundancy, should be considered.
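The comparison can be as simple as a ratio of predicted to allocated failure rate, sorted so the largest gap comes first. The sketch below assumes a series system with small failure rates, so the subassembly allocations add up to the top-level goal. The weights, subassembly names, and predictions are hypothetical.

```python
def allocate(top_level_rate, weights):
    """Split a top-level failure-rate goal in proportion to the given weights."""
    total = sum(weights.values())
    return {name: top_level_rate * w / total for name, w in weights.items()}


def rank_gaps(allocated, predicted):
    """Return (name, predicted / allocated) pairs, worst gap first."""
    ratios = {name: predicted[name] / allocated[name] for name in allocated}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    goal_afr_percent = 2.0  # hypothetical top-level goal, % per year
    # Weights reflect judgment about where failures are expected to concentrate
    weights = {"power supply": 5, "display": 3, "main board": 2}
    predicted = {"power supply": 0.8, "display": 1.5, "main board": 0.3}

    allocated = allocate(goal_afr_percent, weights)
    for name, ratio in rank_gaps(allocated, predicted):
        status = "over budget" if ratio > 1.0 else "within budget"
        print(f"{name:<13} allocated {allocated[name]:.2f}  "
              f"predicted {predicted[name]:.2f}  ratio {ratio:.2f}  ({status})")
```

A ratio well above 1 marks a subassembly that needs engineering attention or a revised allocation; ratios below 1 show where budget could be given up to help elsewhere.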
Begin Long Term Life Tests
Often called “cycle testing,” this means subjecting components which rotate, flex, or receive repetitive electrical or mechanical stresses to several worst-case lifetimes of wear. Depending on the product’s intended usage pattern, it may be possible to accomplish this quickly by increasing the frequency of cycling. For example, switches, cables, and connectors can usually be thoroughly tested over a weekend. But if the component is intended to run 24/7, such as a disk drive’s spindle, it may require an Accelerated Life Test (ALT), where the stress is increased, to see the relevant failure modes.
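A quick calculation shows why frequency compression works for intermittently used parts and not for continuously running ones. The usage figures below are hypothetical, and the sketch assumes the wear per cycle is the same at the faster test rate as in the field.

```python
def cycle_test_hours(cycles_per_day, life_years, lifetimes, test_cycles_per_second):
    """Hours of bench time needed to apply the given number of lifetimes."""
    field_cycles = cycles_per_day * 365 * life_years
    return field_cycles * lifetimes / test_cycles_per_second / 3600.0


if __name__ == "__main__":
    # Hypothetical switch: pressed 20 times a day for 10 years, tested to
    # 3 lifetimes at one actuation per second.
    switch = cycle_test_hours(20, 10, 3, 1.0)
    print(f"Switch: {switch:.1f} hours of testing")

    # Hypothetical part that already cycles once per second around the clock:
    # testing at the same rate gives no time compression at all.
    continuous = cycle_test_hours(86_400, 5, 1, 1.0)
    print(f"Continuous-duty part: {continuous:,.0f} hours of testing")
```

The switch needs roughly 61 hours, which fits in a weekend. The continuous-duty part needs the full five years at its normal rate, which is why increased stress becomes the only practical lever.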
Whether the frequency or the stress or both are increased, the goal is to discover relevant failures. It is possible to miss failure modes because some mechanisms depend on elapsed time rather than on the number of cycles, so they do not speed up with faster cycling. An example of this would be copper migration and subsequent oxidation in a connector. It is also possible to produce “foolish failures” which will not occur in the customer environment. These are test artifacts, which result from the accelerating stress. Reliability engineering requires experience and judgment.
Wear-out failure modes are improved by identifying the “reservoir” of material that is being consumed (or transformed) by a “process.” One then increases the reservoir and/or slows down the process to push out the occurrence of failure until satisfactory life is achieved. If this is not possible, preventive maintenance is used to restore the reservoir of material by replacing a service item.
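The reservoir-and-process idea can be sketched with the simplest possible model, in which material is consumed at a constant rate. Real wear processes are often nonlinear, so treat the linear model, the friction-pad example, and all the numbers as illustrative assumptions.

```python
def wearout_life(reservoir, minimum_usable, consumption_rate):
    """Time until the reservoir is drawn down to its minimum usable level."""
    if consumption_rate <= 0:
        raise ValueError("consumption_rate must be positive")
    return (reservoir - minimum_usable) / consumption_rate


if __name__ == "__main__":
    # Hypothetical friction pad: 12 mm thick, unusable below 2 mm,
    # wearing at 2 mm per year.
    baseline = wearout_life(12.0, 2.0, 2.0)
    bigger_reservoir = wearout_life(14.0, 2.0, 2.0)   # thicker pad
    slower_process = wearout_life(12.0, 2.0, 1.5)     # harder material
    print(f"Baseline life:        {baseline:.1f} years")
    print(f"Larger reservoir:     {bigger_reservoir:.1f} years")
    print(f"Slower wear process:  {slower_process:.1f} years")
```

If neither lever reaches the life goal, the computed life becomes the upper bound on the preventive maintenance interval for that service item.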
Part two continues the discussion of the elements of a basic reliability program.