Reposted from: 七月, 可靠性工程师SRE (Reliability Engineer SRE), 2018-10-20
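In theory, this may be true. In practice, the so-called collection of real field data is rather skewed. MIL-HDBK-217 is clearly not based on collected field data, because it has not been updated in more than twenty years. Likewise, IEC 62380 has not been updated since it was published in 2004, and that standard also rests on old field data. What about the others, such as SR-332, FIDES, or SN29500? They are indeed updated regularly, but the data behind those updates comes from very few sources, and that is a fatal flaw.
Misconception #3: The bathtub curve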
The traditional “bathtub curve”—in which reliability is described by a declining failure rate (quality), a steady-state failure rate (operational life), and an increasing failure rate (robustness)—is based on a misconception. (Image source: DfR Solutions)
The actual failure rate of products in the field is based on a combination of decreasing failure rates due to quality, increasing failure rates due to wear-out, and a very small number of truly random occurrences. (Image source: DfR Solutions)
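Misconception #4: Reliability physics cannot be used to predict a product's operating life
So let us return to the root reason these standards are still in use: there is no method that can replace them. At least, that is how it looks for now. But if you scratch the surface, other forces are at work. The first is that it is human nature not to want to be asked to do more. Moving from traditional empirical life prediction to physics-of-failure analysis takes a great deal of work: it means going from a simple method to understanding every failure mode and building algorithmic models that can include complicated hyperbolic tangents (no idea what those are), and it may also require knowledge of circuit simulation, finite element analysis, and other unfamiliar techniques. Reliability engineers with classic reliability training were taught only the traditional methods, so these newer techniques can be intimidating.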
For decades, the electronics industry has been stuck using the obsolete and inaccurate MIL-HDBK-217 to make reliability predictions that are required by the top of the supply chain (Department of Defense, FAA, Verizon, etc.). Each of these handbooks (some of which have not been updated in more than two decades) assigns a constant failure rate to every component. It then arbitrarily applies modifiers (“lambdas”) based on temperature, humidity, quality, electrical stresses, etc. This simplistic approach was appropriate back in the '50s and '60s, when the method was first developed. It can no longer be justified, however, given the rapid improvement in simulation tools and the extensive access to component data.
So, why do some people in the electronics industry keep using these approaches? Four key misconceptions seem to breathe life into these archaic documents even after they have been proven wrong over and over and over again.
#1: Empirical handbooks are based on actual field failures.
Theoretically, this could be true. In practice, the process is a little more muddled. MIL-HDBK-217 is clearly not based on actual field failures because it has not been updated in over 20 years. Same with IEC 62380, which was published in 2004 and is based on even older field data. What about the rest, like SR-332 or FIDES or SN29500? Yes, they are updated on a more regular basis, but their fatal flaw is their very limited source of information. There are indications that the number of companies submitting field failure information into these documents is less than 10 and sometimes less than 5. How relevant is failure data from 5 companies for the other 120,000 electronic OEMs in the world? Not very.
And it gets even worse the deeper you go. Most of these companies do not identify the specific failure location on all of their field failures. Failure analysis when there is a high number of failures? Yes. Failure analysis on high value products? Yes. The rest of the stuff? Repair and replace or just throw it away. This results in a very teeny, tiny number of samples being the basis for these “etched in stone” failure rates. And what if this arbitrary self-selection causes some components to not have any field failure information? Only two options: Keep the old failure rate number or make up a new one.
It should give all of us some pause. The reliability of airplanes, satellites, and telephone networks could be, in some very loose way, based on an arbitrary set of filtered data from a self-selected group of three companies.
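To make the handbook approach concrete, here is a minimal sketch of a parts-count style calculation: a constant base failure rate per component, scaled by modifiers, then summed. The part names, base rates, and multiplier values are invented for illustration and are not taken from MIL-HDBK-217 or any other handbook.

```python
from math import prod

# Hypothetical parts: (name, base failure rate in failures per 1e6 hours,
# multipliers standing in for temperature, quality, and electrical stress).
PARTS = [
    ("microcontroller", 0.050, [2.0, 1.0, 1.5]),
    ("ceramic capacitor", 0.002, [1.2, 3.0, 1.0]),
    ("resistor", 0.001, [1.1, 3.0, 1.0]),
]


def part_failure_rate(base, multipliers):
    """Constant failure rate for one part: base rate times every modifier."""
    return base * prod(multipliers)


def system_mtbf_hours(parts):
    """Series system: sum the constant rates; MTBF is the reciprocal."""
    total_per_million_hours = sum(part_failure_rate(b, m) for _, b, m in parts)
    return 1e6 / total_per_million_hours


if __name__ == "__main__":
    print(f"Predicted MTBF: {system_mtbf_hours(PARTS):,.0f} hours")
```

Note that nothing in this calculation knows how any part actually fails; the result depends only on the tabulated rates and the chosen multipliers.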
#2: Past performance is an indication of future results.
That disclaimer on mutual funds is there for a reason. Less than 0.3% of mutual funds deliver top 25% returns four years in a row. Have you ever thought about why mutual funds are unable to consistently deliver? It’s the same reason why handbooks are unable to consistently deliver. Both are unable to capture the true underlying behavior that drives success and failure. Critical details, like how companies treat their customers or their R&D pipeline, are fundamental to the success of companies, but are often not accounted for by mutual fund managers because it's “too hard” and “too expensive.”
The same rationale is used by engineers who rely on handbooks. The reason why one product had a mean time between failure (MTBF) of 100 years and another had an MTBF of 10 years may have nothing to do with temperature or quality factors or number of transistors or electrical derating. If you really want to understand and predict reliability, you have to know all the ways the product will fail. Yes, this is hard. And yes, this is really hard with electronics. How hard? Let’s run through a scenario.
A standard piece of electronics will have approximately 200 unique part numbers and 1,000 components. About 20 of these 200 unique parts will be integrated circuits. Off the top of my head, each integrated circuit will have up to 12 possible ways to fail in the field (ignoring defects). These include dielectric breakdown over time, electromigration, hot carrier injection, bias temperature instability, EOS/ESD, EMI, wire bond corrosion, wire bond intermetallic formation, solder fatigue (thermal cycling), solder fatigue (vibration), solder failure (shock), and metal migration (on the PCB). This means you would have to calculate 240 combinations of part and failure mechanisms. Each one requires geometry information, material information, environmental information, etc. And that’s just for integrated circuits!
But these are the true reliability fundamentals of electronics. And, just like stock pickers, if you capture the true fundamentals, you will get it right every single time.
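The count of 240 in the integrated-circuit scenario is simply 20 parts times 12 mechanisms. A small sketch of that enumeration follows; the reference designators (U1 to U20) are hypothetical placeholders, while the mechanism names are the twelve listed in the text.

```python
from itertools import product

# Hypothetical reference designators for the 20 integrated circuits.
INTEGRATED_CIRCUITS = [f"U{i}" for i in range(1, 21)]

# The 12 field failure mechanisms listed in the text.
MECHANISMS = [
    "dielectric breakdown over time",
    "electromigration",
    "hot carrier injection",
    "bias temperature instability",
    "EOS/ESD",
    "EMI",
    "wire bond corrosion",
    "wire bond intermetallic formation",
    "solder fatigue (thermal cycling)",
    "solder fatigue (vibration)",
    "solder failure (shock)",
    "metal migration (on the PCB)",
]


def analysis_cases(parts, mechanisms):
    """Every (part, mechanism) pair that would need its own physics model."""
    return list(product(parts, mechanisms))


if __name__ == "__main__":
    cases = analysis_cases(INTEGRATED_CIRCUITS, MECHANISMS)
    print(len(cases))  # 20 parts x 12 mechanisms = 240
```

Each pair in the list would still need its own geometry, material, and environmental inputs, which is where the real effort lies.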
#3: The bathtub curve exists.
There is a belief that the reliability of any product can be described by a declining failure rate (quality), a steady-state failure rate (operational life), and an increasing failure rate (robustness). If this is truly the behavior of a fielded product, one can understand the motivation for handbooks that calculate an MTBF. To avoid the portion of life at which the failure rate declines, companies will screen their products. To avoid the portion of life where the failure rate increases, companies will overdesign their products. If both activities are done well, the only thing to worry about is the middle of the bathtub curve. Right?
Wrong! The first, and biggest, problem is this concept of “random” failures that occur during the operational lifetime. If the failures are truly random, the rate at which they occur should be independent of the design of the product. And if they are independent of the design, why would you try to calculate the failure rate based on the design? One slightly extreme example would be the failures of utility meters because a cat decided to urinate on the box. This failure is truly random and, because it is random, it has nothing to do with the design. (Side note: No one ever got fired because the rate of these truly “random” events was too high.) This failure mode may be partially dependent on the housing/enclosure, but housings and enclosures are not considered in empirical prediction handbooks.
The reality is that failure rate during operational life is a combination of decreasing failure rates due to quality, increasing failure rates due to wear-out, and a very small number of truly random occurrences. The wear-out portion is increasing in frequency and becoming harder to identify because the shrinking features of the current generation of integrated circuits are causing wear-out behavior earlier than ever before. IC wear-out behavior is different from the wear-out seen with moving parts and interconnect fatigue. Most failure mechanisms associated with integrated circuits have very mild wear-out behavior (Weibull slopes of 1.2 to 1.8). This means that it can be really hard to see these failures in the warranty returns, but they're there.
Misconception #4: Reliability Physics cannot be used to predict operating life performance.
So, now we get to the real reason why these handbooks are still around: There is nothing available to replace them. At least, that can be the mentality. If you scratch the surface, however, there are other forces at play. The first is that human nature is to not ask for more work. Switching from empirical prediction to reliability physics will be more work. The activity goes from simple addition (failure rate 1 + failure rate 2 + failure rate 3 + …) to algorithms that can contain hyperbolic tangents (say that three times fast) and may require knowledge of circuit simulation, finite element, and a lot of other crazy stuff. For reliability engineers trained in classic reliability, which teaches you to use the same five techniques regardless of product or industry, this can be daunting.
The second is that the motivation to change practices is not there. In many organizations and industries, traditional reliability prediction can be a “check the box” activity without realizing the damaging influence it has on design, time to market, and warranty returns. Companies end up implementing very conservative design practices, such as military grade parts or excessive derating, because these activities are rewarded in the empirical prediction world. Many times, design teams guided toward these practices have no idea of the original motivation (i.e., “we have always done it this way”). If reliability prediction becomes a check-the-box activity, design is forced to go through the laborious design-test-fix process (also known as reliability growth, though it is more wasting time than growing anything). Finally, since handbook reliability prediction is divorced from the real world, the eventual cost of warranty returns can experience wild swings in magnitude for each product. These costs are not expected or predicted by the product group.
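The combination of decreasing, constant, and mildly increasing failure rates can be sketched with Weibull hazard functions, assuming the mechanisms act as independent competing risks so that their hazards add. The shape and scale values below are invented for illustration; only the idea that IC wear-out has a slope between 1.2 and 1.8 comes from the text.

```python
def weibull_hazard(t, beta, eta):
    """Failure rate h(t) = (beta / eta) * (t / eta) ** (beta - 1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)


# Hypothetical populations: (label, shape beta, scale eta in hours).
POPULATIONS = [
    ("quality defects", 0.5, 2.0e6),  # beta < 1: decreasing rate
    ("random events", 1.0, 5.0e6),    # beta = 1: constant rate
    ("IC wear-out", 1.5, 3.0e5),      # beta > 1: mildly increasing rate
]


def combined_hazard(t):
    """Independent competing risks: the individual hazards simply add."""
    return sum(weibull_hazard(t, beta, eta) for _, beta, eta in POPULATIONS)


if __name__ == "__main__":
    for hours in (100, 1_000, 10_000, 50_000):
        detail = ", ".join(
            f"{name}={weibull_hazard(hours, beta, eta):.2e}"
            for name, beta, eta in POPULATIONS
        )
        print(f"t={hours:>6} h  total={combined_hazard(hours):.2e}/h  ({detail})")
```

With a shape of 1.5, the wear-out rate grows only with the square root of time, which is one way to see why such failures are easy to miss in warranty data.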
Surely not: obsolete before I have even learned it? There is no progress without building on what came before, is there?