Reposted from: 七月, 可靠性工程师SRE, 2018-10-20
Translator's note: I strongly agree with one of the author's points: times and technology both move forward, and the past cannot predict the future. A standard takes enormous effort to create, but if it cannot keep pace with the times, it will be left behind. Reliability engineering must also keep up: studying new reliability assessment methods, new failure modes, and new failure mechanisms, and folding them into day-to-day reliability work. Put another way, trusting the standards blindly is worse than having no standards at all.
For decades, because customers at the top of the supply chain (the U.S. Department of Defense, the FAA, Verizon, and so on) have required it, the electronics industry has had to keep using the obsolete and inaccurate MIL-HDBK-217 to make reliability predictions. Handbooks of this kind (some of which have not been updated in many years) all assume a constant failure rate for electronic components, and then adjust that constant failure rate (λ) with modifiers for temperature, humidity, component quality, and electrical stress. This simplistic approach was reasonable when it first appeared in the 1950s and 1960s, but given the rapid advances in modeling tools and the much broader access to component data, it can no longer be justified.
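To make the handbook approach concrete, here is a minimal sketch of a parts-stress style calculation: a constant base failure rate per part, multiplied by a few pi-factors and summed in series. The structure follows the handbooks' general form, but the part list and every numeric value below are illustrative placeholders, not figures from MIL-HDBK-217.

```python
# Minimal sketch of a handbook-style "parts stress" prediction, the kind of
# calculation the article criticizes: a constant base failure rate per part,
# multiplied by pi-factors and summed in series. All values are illustrative
# placeholders, not figures from MIL-HDBK-217.

FIT = 1e-9  # 1 FIT = 1 failure per 1e9 device-hours

parts = [
    # (name, base failure rate in FITs, pi_T, pi_Q, pi_E)
    ("microcontroller",   20.0, 1.8, 1.0, 2.0),
    ("ceramic capacitor",  1.0, 1.2, 1.0, 2.0),
    ("chip resistor",      0.5, 1.1, 1.0, 2.0),
]

total_fits = 0.0
for name, lambda_b, pi_t, pi_q, pi_e in parts:
    lambda_p = lambda_b * pi_t * pi_q * pi_e   # "modified" constant failure rate
    total_fits += lambda_p                     # series system: rates simply add

mtbf_hours = 1.0 / (total_fits * FIT)          # constant-rate assumption: MTBF = 1/lambda
print(f"total failure rate = {total_fits:.1f} FIT, predicted MTBF = {mtbf_hours:,.0f} hours")
```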
But why do so many of us in the electronics industry keep using these methods, even though they have been proven wrong again and again and again? Four key misconceptions keep them on life support.
Misconception #1: These handbooks are based on actual field data.
In theory, this could be true. In practice, the collection of that "actual field data" is rather more muddled. MIL-HDBK-217 is clearly not based on collected field data, because it has not been updated in more than twenty years. The same goes for IEC 62380, which has not been updated since it was published in 2004 and is itself based on even older field data. What about the others, such as SR-332, FIDES, or SN29500? Yes, they are updated on a regular basis, but their fatal flaw is that their sources of updated data are extremely limited.
The updates draw on field data from fewer than 10, and sometimes fewer than 5, companies. How can data from that handful of companies represent more than 120,000 electronics manufacturers? It essentially cannot.
And the deeper you dig, the worse it gets. These companies cannot trace every field failure, let alone the specific details of each one. It can be done for components with high failure rates, and for products valuable enough to justify expensive failure analysis, but for everything else it is very hard, because failed parts may simply be repaired, replaced, or thrown away. The result is that only a small number of samples actually feed into the failure-rate estimates. Worse still, if a component has no field failure data at all, the only options are to fall back on an old failure rate or simply make one up.
It should give us pause that the failure rates used for aircraft, satellites, and telecommunication networks are pieced together from data filtered, rather arbitrarily, by this small group of companies.
Misconception #2: Past performance predicts future results.
There is a reason mutual funds carry their famous disclaimer. Fewer than 0.3% of mutual funds deliver top-quartile (top 25%) returns four years in a row. Why can't mutual funds deliver consistently? For the same reason the prediction handbooks cannot consistently deliver accurate reliability predictions: neither captures the underlying factors that actually drive success or failure. Critical details, such as how a company treats its customers or the strength of its product development pipeline, are fundamental to a company's success, but for fund managers they are "too hard" or "too expensive" to quantify, so they go unaccounted for.
The same rationale applies to engineers who rely on the handbooks. The reason one product has an MTBF of 100 years while another has an MTBF of 10 years may have little to do with temperature, quality factors, transistor count, or applied stress. If you really want to understand a product and assess its life, you have to know all the ways it can fail. Admittedly, that is hard. But how hard, exactly? Let's run through a scenario.
A typical piece of electronics has roughly 200 unique part numbers and about 1,000 components, and about 20 of those 200 unique parts are integrated circuits. Off the top of one's head, each IC has up to a dozen ways to fail in the field (ignoring component defects): time-dependent dielectric breakdown (TDDB), electromigration, hot carrier injection, bias temperature instability, EOS/ESD, EMI, wire-bond corrosion, wire-bond intermetallic growth, solder fatigue under thermal cycling, solder fatigue under vibration, solder failure under mechanical shock, and metal migration on the PCB. That means 240 part-and-mechanism combinations, and each one needs its own board geometry, material, and environmental stress information. And that is just the integrated circuits; the other 180 unique parts have not even been considered yet.
And these are the true reliability fundamentals of electronics. Just like a fund manager: if you capture the true fundamentals, you get it right every single time.
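As a rough illustration of the bookkeeping involved, under the article's own numbers and with hypothetical reference designators, the 20 ICs and 12 field-failure mechanisms alone already produce 240 part/mechanism pairs to characterize:

```python
# The article's scenario, written out: 20 ICs times 12 candidate field-failure
# mechanisms is already 240 part/mechanism pairs, before the other 180 unique
# part numbers are even touched. The reference designators (U1..U20) are hypothetical.

mechanisms = [
    "TDDB", "electromigration", "hot carrier injection",
    "bias temperature instability", "EOS/ESD", "EMI",
    "wire-bond corrosion", "wire-bond intermetallic growth",
    "solder fatigue (thermal cycling)", "solder fatigue (vibration)",
    "solder failure (shock)", "metal migration (PCB)",
]

ics = [f"U{i}" for i in range(1, 21)]  # the 20 integrated circuits

pairs = [(ic, mech) for ic in ics for mech in mechanisms]
print(len(pairs))  # 240 -- and each pair needs geometry, material, and environment data
```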
Misconception #3: The bathtub curve.
Everyone believes (this translator included) that any product's failures can be described by three stages: a declining early failure rate (driven by product quality), a steady failure rate in the middle (the operational life), and a rising failure rate at the end (driven by robustness). If this really were how fielded products fail, we could understand why these handbooks calculate an MTBF. To avoid the declining-failure-rate stage, manufacturers typically apply stress screening; to avoid the rising-failure-rate stage, they add design margin. If both are done well, all that is left to worry about is the constant-failure-rate middle of the bathtub curve, right?
The traditional “bathtub curve”—in which reliability is described by a declining failure rate (quality), a steady-state failure rate (operational life), and an increasing failure rate (robustness)—is based on a misconception. (Image source: DfR Solutions)
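For reference, the idealized bathtub assumption can be written down explicitly as the sum of three Weibull hazard rates: one declining, one constant, one rising. The sketch below does exactly that; every parameter value is invented, chosen only to make the familiar shape visible.

```python
# The idealized bathtub curve, written explicitly as the sum of three Weibull
# hazard rates: infant mortality (beta < 1), a constant "random" rate (beta = 1),
# and wear-out (beta > 1). All parameters are invented, for shape only.

def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate h(t) of a two-parameter Weibull."""
    return (beta / eta) * (t / eta) ** (beta - 1)

def bathtub_hazard(t_years):
    infant  = weibull_hazard(t_years, beta=0.5, eta=200.0)  # declining (quality)
    random_ = weibull_hazard(t_years, beta=1.0, eta=50.0)   # constant (the handbook's lambda)
    wearout = weibull_hazard(t_years, beta=4.0, eta=25.0)   # rising (robustness)
    return infant + random_ + wearout

for year in (0.1, 1, 5, 10, 20, 30):
    print(f"t = {year:>4} yr : h(t) = {bathtub_hazard(year):.3f} per year")
# High early, roughly flat in the middle, rising at the end: the picture the
# handbooks rely on. The article's point is that real field data rarely looks like this.
```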
Wrong! The first and biggest problem is the notion of "random" failures during the operational lifetime. If the failures really are random, they do not depend on the product's design. And if they are independent of the design, why calculate a failure rate based on the design? One slightly extreme example is a utility meter that fails because a cat decided to urinate on it. That failure really is random and has nothing to do with the design. (Translator: I have reservations here; one could, for instance, design a meter that survives a cat urinating on it.) (Side note: no one ever got fired because the rate of these truly "random" failures was too high.) Whether the meter fails does depend in part on its housing, but protective housings are generally not covered by the life-prediction handbooks.
The actual failure rate of products in the field is based on a combination of decreasing failure rates due to quality, increasing failure rates due to wear-out, and a very small number of truly random occurrences. (Image source: DfR Solutions)
In reality, the failure rate during operational life is a combination of a decreasing failure rate due to quality, an increasing failure rate due to late-life wear-out, and a small number of truly random events. The wear-out portion is growing in frequency yet getting harder to recognize as wear-out, mainly because the shrinking feature sizes of today's ICs bring wear-out on far earlier than before. IC wear-out differs from the fatigue of moving parts or interconnects: most IC failure mechanisms wear out quite gently, with Weibull slopes (β) of only 1.2 to 1.8. That means these failures are very hard to spot among warranty returns, but they are there. (Translator: this point is not easy to grasp.)
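A quick calculation shows why a Weibull slope of only 1.2 to 1.8 is so easy to miss. Assuming, purely for illustration, a characteristic life of 15 years and a one-year warranty, a mild wear-out mechanism produces only a thin trickle of returns:

```python
# Why a mild Weibull slope is easy to miss: cumulative fraction failed inside a
# one-year warranty, for an assumed characteristic life of 15 years (both
# numbers are purely illustrative).

import math

def fraction_failed(t_years, beta, eta_years):
    """Two-parameter Weibull CDF: fraction of units failed by time t."""
    return 1.0 - math.exp(-((t_years / eta_years) ** beta))

eta, warranty = 15.0, 1.0
for beta in (1.0, 1.5, 3.0):
    print(f"beta = {beta}: {fraction_failed(warranty, beta, eta) * 100:.2f}% failed in warranty")
# With beta = 1.5 (mild wear-out), under 2% of units fail inside the warranty
# window, so the mechanism barely shows up in returns even though it is real;
# a steep beta = 3.0 mechanism shows almost nothing, then fails en masse later.
```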
Misconception #4: Reliability physics cannot be used to predict a product's operating life.
So we come to the real reason these handbooks are still in use: there is nothing available to replace them. At least, that is the mentality for now. Scratch the surface, though, and other forces are at work. The first is that human nature does not ask for more work. Moving from traditional empirical life prediction to reliability physics takes a great deal more effort: instead of a simple method, you must understand every failure mechanism and build models whose algorithms may contain hyperbolic tangents (whatever those are, the translator admits), and which may call for circuit simulation, finite element analysis, and plenty of other unfamiliar techniques. For reliability engineers trained only in the classical methods, this can be daunting.
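Before getting to the second reason, here is a glimpse of what a reliability-physics model actually looks like, using Black's equation for electromigration as the example. The equation's form is standard; the prefactor, current exponent, and activation energy below are illustrative placeholders, since real values are technology-specific and come from test data.

```python
# A glimpse of reliability physics, as a contrast to adding up constant failure
# rates: Black's equation for electromigration, MTTF = A * J^-n * exp(Ea / kT).
# The prefactor A, current exponent n, and activation energy Ea are illustrative.

import math

BOLTZMANN_EV = 8.617e-5  # Boltzmann constant, eV/K

def em_mttf_hours(j_a_per_cm2, temp_c, a=1e5, n=2.0, ea_ev=0.9):
    """Median time to failure from electromigration via Black's equation."""
    temp_k = temp_c + 273.15
    return a * j_a_per_cm2 ** (-n) * math.exp(ea_ev / (BOLTZMANN_EV * temp_k))

# The same interconnect at two junction temperatures: life depends on a specific
# mechanism, a geometry-driven current density, and an Arrhenius temperature
# term, not on a one-size-fits-all pi-factor.
for temp_c in (85, 125):
    print(f"{temp_c} C: MTTF ~ {em_mttf_hours(5e5, temp_c):.2e} hours")
```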
The second is that there is no motivation to change current practice. In many organizations and industries, traditional reliability prediction is just ticking boxes on a form; it does not affect the product design, the development schedule, or the actual field return rate. Companies also tend to design very conservatively, for example by choosing military-grade parts and adding extra derating, and such choices score well in a traditional prediction. Design teams often follow these guidelines without knowing where they came from ("we have always done it this way"). When reliability prediction is reduced to box-ticking, the design instead relies on the laborious design-test-fix cycle (so-called reliability growth, though much of the time nothing grows and time is simply wasted). In the end the prediction standards drift away from reality, warranty costs swing from product to product, and the development team has no way to anticipate or assess the true repair cost.
For decades, the electronics industry has been stuck using the obsolete and inaccurate MIL-HDBK-217 to make reliability predictions that are required by the top of the supply chain (Department of Defense, FAA, Verizon, etc.). Each of these handbooks (some of which have not been updated in more than two decades) assigns a constant failure rate to every component. It then arbitrarily applies modifiers (“lambdas”) based on temperature, humidity, quality, electrical stresses, etc. This simplistic approach was appropriate back in the '50s and '60s, when the method was first developed. It can no longer be justified, however, given the rapid improvement in simulation tools and the extensive access to component data.
So, why do some people in the electronics industry keep using these approaches? Four key misconceptions seem to breathe life into these archaic documents even after they have been proven wrong over and over and over again.
Misconception #1: Empirical handbooks are based on actual field failures.
Theoretically, this could be true. In practice, the process is a little more muddled. MIL-HDBK-217 is clearly not based on actual field failures because it has not been updated in over 20 years. Same with IEC 62380, which was published in 2004 and is based on even older field data. What about the rest, like SR-332 or FIDES or SN29500? Yes, they are updated on a more regular basis, but their fatal flaw is their very limited source of information. There are indications that the number of companies submitting field failure information into these documents is less than 10 and sometimes less than 5. How relevant is failure data from 5 companies for the other 120,000 electronic OEMs in the world? Not very.
And it gets even worse the deeper you go. Most of these companies do not identify the specific failure location on all of their field failures. Failure analysis when there is a high number of failures? Yes. Failure analysis on high value products? Yes. The rest of the stuff? Repair and replace or just throw it away. This results in a very teeny, tiny number of samples being the basis for these “etched in stone” failure rates. And what if this arbitrary self-selection causes some components to not have any field failure information? Only two options: Keep the old failure rate number or make up a new one.
It should give all of us some pause. The reliability of airplanes, satellites, and telephone networks could be, in some very loose way, based on an arbitrary set of filtered data from a self-selected group of three companies.
Misconception #2: Past performance is an indication of future results.
That disclaimer on mutual funds is there for a reason. Less than 0.3% of mutual funds deliver top 25% returns four years in a row. Have you ever thought about why mutual funds are unable to consistently deliver? It's the same reason why handbooks are unable to consistently deliver. Both are unable to capture the true underlying behavior that drives success and failure. Critical details, like how companies treat their customers or their R&D pipeline, are fundamental to the success of companies, but are often not accounted for by mutual fund managers because it's “too hard” and “too expensive.”
The same rationale is used by engineers who rely on handbooks. The reason why one product had a mean time between failure (MTBF) of 100 years and another had an MTBF of 10 years may have nothing to do with temperature or quality factors or number of transistors or electrical derating. If you really want to understand and predict reliability, you have to know all the ways the product will fail. Yes, this is hard. And yes, this is really hard with electronics. How hard? Let's run through a scenario.
A standard piece of electronics will have approximately 200 unique part numbers and 1,000 components. About 20 of these 200 unique parts will be integrated circuits. Off the top of my head, each integrated circuit will have up to 12 possible ways to fail in the field (ignoring defects). These include dielectric breakdown over time, electromigration, hot carrier injection, bias temperature instability, EOS/ESD, EMI, wire bond corrosion, wire bond intermetallic formation, solder fatigue (thermal cycling), solder fatigue (vibration), solder failure (shock), and metal migration (on the PCB). This means you would have to calculate 240 combinations of part and failure mechanisms. Each one requires geometry information, material information, environmental information, etc. And that’s just for integrated circuits!
But these are the true reliability fundamentals of electronics. And, just like stock pickers, if you capture the true fundamentals, you will get it right every single time.
Misconception #3: The bathtub curve exists.
There is a belief that the reliability of any product can be described by a declining failure rate (quality), a steady state failure rate (operational life), and an increasing failure rate (robustness). If this is truly the behavior of a fielded product, one can understand the motivation for handbooks that calculate an MTBF. To avoid the portion of life at which the failure rate declines, companies will screen their products. To avoid the portion of life where the failure rate increases, companies will overdesign their products. If both activities are done well, the only thing to worry about is the middle of the bathtub curve. Right?
Wrong! The first, and biggest problem, is this concept of “random” failures that occur during the operational lifetime. If the failures are truly random, the rate at which they occur should be independent of the design of the product. And if they are independent of the design, why would you try to calculate the failure rate based on the design? One slightly extreme example would be the failures of utility meters because a cat decided to urinate on the box. This failure is truly random and, because it is random, it has nothing to do with the design. (Side note: No one ever got fired because the rate of these truly “random” events was too high.) This failure mode may be partially dependent on the housing/enclosure, but housings and enclosures are not considered in empirical prediction handbooks.
The reality is that failure rate during operational life is a combination of decreasing failure rates due to quality, increasing failure rates due to wear-out, and a very small number of truly random occurrences. The wear-out portion is increasing in frequency and becoming harder to identify because the shrinking features of the current generation of integrated circuits are causing wear-out behavior earlier than ever before. IC wear-out behavior is different than wear-out seen with moving parts and interconnect fatigue. Most failure mechanisms associated with integrated circuits have very mild wear-out behavior (Weibull slopes of 1.2 to 1.8). This means that it can be really hard to see these failures in the warranty returns, but they're there.
Misconception #4: Reliability Physics cannot be used to predict operating life performance.
So, now we get to the real reason why these handbooks are still around: There is nothing available to replace them. At least, that can be the mentality. If you scratch the surface, however, there are other forces at play. The first is that human nature is to not ask for more work. Switching from empirical prediction to reliability physics will be more work. The activity goes from simple addition (failure rate 1 + failure rate 2 + failure rate 3 + …) to algorithms that can contain hyperbolic tangents (say that three times fast) and may require knowledge of circuit simulation, finite element, and a lot of other crazy stuff. For reliability engineers trained in classic reliability, which teaches you to use the same five techniques regardless of product or industry, this can be daunting.
The second is that the motivation to change practices is not there. In many organizations and industries, traditional reliability prediction can be a “check the box” activity without realizing the damaging influence it has on design, time to market, and warranty returns. Companies end up implementing very conservative design practices, such as military grade parts or excessive derating, because these activities are rewarded in the empirical prediction world. Many times, design teams guided toward these practices have no idea of the original motivation (i.e., “we have always done it this way”). If reliability prediction becomes a check the box activity, design is forced to go through the laborious design-test-fix process (also known as reliability growth, though it is more wasting time than growing anything). Finally, since handbook reliability prediction is divorced from the real world, the eventual cost of warranty returns can experience wild swings in magnitude for each product. These costs are not expected or predicted by the product group.
Original article:
https://www.designnews.com/electronics-test/end-near-mil-hdbk-217-and-other-outdated-handbooks/138268218059056
A very good article. It is just like the material on the reliability forums today: much of it is ten-plus years old. Are these standards and theories really always correct? Do they really never need updating?
If the underlying principle is wrong, then any method derived from that principle must be wrong too.
Until there is a more effective prediction method, using the existing ones is understandable. Most of the reliability software tools on the market build their prediction modules on these very standards. Why? Dig into the details and everyone has long known there are problems, so why do we keep doing it this way? Because there is no more practical and efficient alternative.
No way. I have not even learned this yet and it is already obsolete? Without inheritance there is no development, right?