First, let's look at how failures are defined.

Failure Definitions

Hardware failure

Industry commonly describes hardware failures with the "bathtub curve", shown in the figure below. A hardware product's life cycle is generally divided into three phases:
1) The first part is a decreasing failure rate, known as early failures
2) The second part is a constant failure rate, known as random failures
3) The third part is an increasing failure rate, known as wear-out failures
Figure 1 Bathtub curve
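The three phases above can be illustrated with the Weibull hazard function, whose shape parameter controls whether the failure rate falls, stays constant, or rises over time. This is my own sketch, not from the original text; a real bathtub curve is typically a mixture of all three regimes:

```python
def weibull_hazard(t, shape, scale=1.0):
    """Instantaneous failure rate h(t) of a Weibull distribution.

    shape < 1  -> decreasing rate (early failures)
    shape == 1 -> constant rate (random failures)
    shape > 1  -> increasing rate (wear-out failures)
    """
    return (shape / scale) * (t / scale) ** (shape - 1)

# Compare the failure rate at t=1 vs t=2 in each regime
print(weibull_hazard(1, 0.5), weibull_hazard(2, 0.5))  # falls
print(weibull_hazard(1, 1.0), weibull_hazard(2, 1.0))  # flat
print(weibull_hazard(1, 3.0), weibull_hazard(2, 3.0))  # rises
```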
Software failure

Software failures can be measured as the number of defects per thousand lines of code (Defects/KLOC), a metric known as defect density:

Defect Density = Number of Defects / KLOC

The main factors affecting defect density are:
1) Software process (code reviews, unit testing, etc.)
2) Software complexity
3) Software size
4) Development team experience
5) Proportion of reused (battle-tested) code
6) Testing before product delivery
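The defect density formula above is a one-liner in code (the 45-defect / 30,000-line figures below are made up for illustration):

```python
def defect_density(num_defects, lines_of_code):
    """Defects per thousand lines of code (Defects/KLOC)."""
    return num_defects / (lines_of_code / 1000.0)

# e.g. 45 defects found in a 30,000-line code base
print(defect_density(45, 30_000))  # 1.5 defects/KLOC
```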
Metrics

Mean Time Between Failures (MTBF)

MTBF is, as the name implies, the average operating time between two consecutive failures, and is a measure of a product's reliability.

Failure Rate

The following is quoted from Wikipedia, to avoid losing accuracy in translation:
Failure rate is the frequency with which an engineered system or component fails, expressed, for example, in failures per hour. It is often denoted by the Greek letter λ (lambda) and is important in reliability engineering.

The failure rate of a system usually depends on time, with the rate varying over the life cycle of the system. For example, an automobile's failure rate in its fifth year of service may be many times greater than its failure rate during its first year of service. One does not expect to replace an exhaust pipe, overhaul the brakes, or have major transmission problems in a new vehicle.

In practice, the mean time between failures (MTBF, 1/λ) is often reported instead of the failure rate. This is valid and useful if the failure rate may be assumed constant – often used for complex units / systems, electronics – and is a general agreement in some reliability standards (Military and Aerospace). It does in this case only relate to the flat region of the bathtub curve, also called the "useful life period". Because of this, it is incorrect to extrapolate MTBF to give an estimate of the service life time of a component, which will typically be much less than suggested by the MTBF due to the much higher failure rates in the "end-of-life wear out" part of the "bathtub curve".
To make this concrete, here's an example: if 100 disks are running and 2 failures occur within a year, the failure rate is 0.02 failures per year (per disk).

The relationship between MTBF and failure rate described above is worth pondering. In the real world, hardware vendors do indeed prefer to advertise MTBF (my guess: an MTBF of hundreds of thousands or even millions of hours is simply more eye-catching). Since the failure rate changes over a product's life cycle, the following relationship holds only on the flat bottom of the aforementioned bathtub curve (colloquially, during the product's "prime of life"):

MTBF = 1/λ
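Using the 100-disk example above, the reciprocal relationship works out as follows (a sketch; it assumes a constant failure rate, i.e. the flat region of the bathtub curve):

```python
failures = 2          # failures observed in one year
population = 100      # drives under observation
device_years = population * 1

failure_rate = failures / device_years    # 0.02 failures per drive-year
mtbf_years = 1 / failure_rate             # 50 years
mtbf_hours = mtbf_years * 365 * 24        # 438,000 hours

print(failure_rate, mtbf_years, mtbf_hours)
```

Note how a very ordinary-sounding failure rate already yields an MTBF of hundreds of thousands of hours.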
Mean Time To Repair (MTTR)

MTTR is, as the name implies, the average time needed to bring a product from a failed state back to a working state. In engineering, MTTR is a measure of a product's maintainability; it is common in maintenance contracts, where it serves as the basis for service pricing.
Figure 2 Estimating hardware MTTR

Figure 3 Estimating software MTTR
Availability

GB/T3187-97 defines availability as: the ability of a product, provided the required external resources are guaranteed, to be in a state to perform its required functions under given conditions at a given instant or over a given time interval. It is a combined reflection of the product's reliability, maintainability, and maintenance support.

The availability formula, Availability = MTBF / (MTBF + MTTR), is easy to understand, so I won't elaborate on it here. Availability is usually characterized by a number of nines, e.g. 99.9% (3-nines availability) or 99.999% (5-nines availability).
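Assuming the standard steady-state formula Availability = MTBF / (MTBF + MTTR), a quick sketch (the numbers are illustrative):

```python
def availability(mtbf, mttr):
    """Steady-state availability; mtbf and mttr must share the same time unit."""
    return mtbf / (mtbf + mttr)

# A system that runs 999 hours between failures and takes 1 hour to repair
print(availability(999, 1))   # 0.999, i.e. 3-nines availability
```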
Downtime

As the name suggests, downtime is the time a machine is out of service due to failure. It is worth mentioning here because expressing system availability as downtime per year matches intuition better and is easier to grasp.

Figure 4 Mapping between availability and downtime
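The availability-to-downtime mapping of Figure 4 can be reproduced in a few lines (my own sketch):

```python
MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes

# From 2 nines (99%) up to 5 nines (99.999%)
for nines in range(2, 6):
    avail = 1 - 10 ** (-nines)
    downtime_min = (1 - avail) * MINUTES_PER_YEAR
    print(f"{avail:.{nines}%} availability -> {downtime_min:10.2f} min/year")
```

For example, 3 nines (99.9%) allows about 525.6 minutes (8.76 hours) of downtime per year, while 5 nines allows only about 5.26 minutes.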
Further Thoughts

Is MTBF unreliable?

Generally, vendors quote MTBF figures of a million hours or more for a server's major components. For example: motherboard, CPU, and disk at 1,000,000 hours each, and memory at 4,000,000 hours per stick (so 4 sticks amount to roughly 1,000,000 hours). From these one can estimate an overall server MTBF of about 250,000 hours (roughly 30 years), i.e. an annual failure rate of about 3%; in other words, out of 100 servers, a few are bound to fail every year.
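The server estimate above follows from the fact that in a series system (any component failure takes the machine down) the component failure rates add up. A sketch using the figures quoted in the text:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760

# Vendor-quoted MTBFs in hours: motherboard, CPU, disk at 1,000,000 h each;
# four 4,000,000 h memory sticks behave like one 1,000,000 h unit in series.
component_mtbf = [1_000_000, 1_000_000, 1_000_000, 1_000_000]

# Failure rates (1/MTBF) add in a series system
lambda_sys = sum(1 / m for m in component_mtbf)
mtbf_sys = 1 / lambda_sys               # 250,000 hours, ~28.5 years
afr = HOURS_PER_YEAR / mtbf_sys         # ~3.5% per year

print(mtbf_sys, mtbf_sys / HOURS_PER_YEAR, afr)
```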
The theoretical calculation above looks fine on the surface and seems reasonable enough. But viewed from another angle, something feels off: if the MTBF is about 30 years, does that mean we can expect the server to stay in service for 30 years? Let's first see how **'s engineers explain it:
It is common to see MTBF ratings between 300,000 to 1,200,000 hours for hard disk drive mechanisms, which might lead one to conclude that the specification promises between 30 and 120 years of continuous operation. This is not the case! The specification is based on a large (statistically significant) number of drives running continuously at a test site, with data extrapolated according to various known statistical models to yield the results.

Based on the observed error rate over a few weeks or months, the MTBF is estimated and not representative of how long your individual drive, or any individual product, is likely to last. Nor is the MTBF a warranty – it is representative of the relative reliability of a family of products. A higher MTBF merely suggests a generally more reliable and robust family of mechanisms (depending upon the consistency of the statistical models used). Historically, the field MTBF, which includes all returns regardless of cause, is typically 50-60% of projected MTBF.
Reading this alongside the earlier discussion of failure rate, have you spotted the trick? Put plainly, it is simple: what these vendors actually measure is the product's failure rate during its healthy "prime of life" period, and from the reciprocal relationship with MTBF they derive MTBF figures of a million hours or more. In the real world, however, the failure rate of these products climbs rapidly in "middle and old age", so such MTBF figures cannot reflect a product's true service life. As the quoted text notes, ** also recognized these drawbacks of MTBF and switched to AFR (Annualized Failure Rate).

In fact, as early as 2007, Google and CMU each published a paper at FAST '07 discussing disk failures in detail:

CMU: "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?"
Google: "Failure Trends in a Large Disk Drive Population"

Google collected data from more than 100,000 consumer-grade HDDs in its fleet (SATA and PATA, 5400 and 7200 RPM, 7 different vendors, 9 different models, capacities from 80 GB to 400 GB) and arrived at the following:
Google found that disks had an annualized failure rate (AFR) of 3% for the first three months, dropping to 2% for the first year. In the second year the AFR climbed to 8% and stayed in the 6% to 9% range for years 3-5.