
 
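Is "Reliability" the Same as "Availability"?
Source: 可靠性技術交流 | Author: cszhouwei博客 | Published: 2021-07-21 | 1751 views
Readers who opened this article have surely come across terms like "high reliability" and "high availability" at some point. Explanations of them, however, tend to be either vague or a bare list of jargon (MTBF, MTTR, ...). So how should we describe a system's reliability and availability quantitatively, and what do these impressive-sounding terms actually mean?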



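First, let's look at how failures are defined:

Failure Definitions

Hardware failure

Industry commonly uses the "bathtub curve" to describe hardware failures, as shown in Figure 1. Specifically, the hardware life cycle is generally divided into three periods: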

1)  The first part is a decreasing failure rate, known as early failures

2)  The second part is a constant failure rate, known as random failures

3)  The third part is an increasing failure rate, known as wear-out failures

 

Figure 1: Bathtub curve
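The three periods can be illustrated with a small sketch. It assumes a Weibull hazard function, which is one common way to model each region of the bathtub curve but is not something the article itself specifies; the shape and scale values are arbitrary illustrative choices.

```python
def weibull_hazard(t, shape, scale=1.0):
    """Instantaneous failure rate h(t) of a Weibull distribution.

    shape < 1 gives a decreasing rate, shape == 1 a constant rate,
    and shape > 1 an increasing rate.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


# Illustrative time points (arbitrary units) and shape parameters.
times = [0.5, 1.0, 2.0]
early = [weibull_hazard(t, shape=0.5) for t in times]     # early failures
random_ = [weibull_hazard(t, shape=1.0) for t in times]   # random failures
wear_out = [weibull_hazard(t, shape=3.0) for t in times]  # wear-out failures

print("early:   ", early)
print("random:  ", random_)
print("wear-out:", wear_out)
```

A full bathtub curve would combine the three regimes over the product's life; the sketch only shows that each regime has the trend named in the list above.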

 

Software failure

Software failures can be measured by the number of defects per thousand lines of code (Defects/KLOC), known as defect density:

Defect Density = Number of Defects / KLOC
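As a quick illustration of the formula, here is a minimal sketch. The function name and the sample numbers (30 defects in 15,000 lines) are hypothetical and not taken from the article.

```python
def defect_density(defects: int, lines_of_code: int) -> float:
    """Return defects per thousand lines of code (Defects/KLOC)."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    kloc = lines_of_code / 1000
    return defects / kloc


# Hypothetical project: 30 defects found in 15,000 lines of code.
density = defect_density(30, 15_000)
print(f"{density:.1f} defects/KLOC")
```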

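The main factors that affect defect density are:

1) Software process (code review, unit testing, etc.)

2) Software complexity

3) Software size

4) Experience of the development team

5) Proportion of reused (battle-tested) code

6) Testing before product delivery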

 

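Metrics

Mean Time Between Failures (MTBF)

Full name: Mean Time Between Failures. As the name suggests, it is the average operating time between two consecutive failures, and it is a measure of a product's reliability.

Failure Rate

The following text is quoted from Wikipedia in the original English to avoid distortion in translation: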

Failure rate is the frequency with which an engineered system or component fails, expressed, for example, in failures per hour. It is often denoted by the Greek letter λ (lambda) and is important in reliability engineering.

The failure rate of a system usually depends on time, with the rate varying over the life cycle of the system. For example, an automobile's failure rate in its fifth year of service may be many times greater than its failure rate during its first year of service. One does not expect to replace an exhaust pipe, overhaul the brakes, or have major transmission problems in a new vehicle.

In practice, the mean time between failures (MTBF, 1/λ) is often reported instead of the failure rate. This is valid and useful if the failure rate may be assumed constant – often used for complex units / systems, electronics – and is a general agreement in some reliability standards (Military and Aerospace). It does in this case only relate to the flat region of the bathtub curve, also called the "useful life period". Because of this, it is incorrect to extrapolate MTBF to give an estimate of the service life time of a component, which will typically be much less than suggested by the MTBF due to the much higher failure rates in the "end-of-life wear out" part of the "bathtub curve".

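To make this concrete: if 100 hard disks in operation suffer 2 failures within one year, the failure rate is 0.02 failures per disk per year.

The relationship between MTBF and failure rate described above deserves careful thought. In practice, hardware vendors do prefer to print MTBF on their products (my personal guess is that this is because MTBF values often reach hundreds of thousands or even millions of hours, which is eye-catching). The failure rate changes over the product life cycle, so the following relationship holds only on the flat bottom of the bathtub curve described earlier (loosely speaking, the product's "prime of life"):

MTBF = 1/λ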
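The hard disk example and the reciprocal relationship can be worked through in a few lines. This is a minimal sketch that assumes a constant failure rate and 8,760 operating hours per year; the function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming continuous operation


def failure_rate_per_year(failures: int, units: int, years: float) -> float:
    """Failures per unit per year, observed over `years` across `units` devices."""
    return failures / (units * years)


def mtbf_hours(rate_per_year: float) -> float:
    """MTBF = 1/lambda, converted from years to hours.

    Only meaningful while the failure rate is constant
    (the flat bottom of the bathtub curve).
    """
    return HOURS_PER_YEAR / rate_per_year


# The article's example: 100 disks, 2 failures in 1 year.
rate = failure_rate_per_year(failures=2, units=100, years=1.0)
mtbf = mtbf_hours(rate)
print(f"failure rate: {rate:.2f} per disk-year")
print(f"MTBF: about {mtbf:,.0f} hours ({mtbf / HOURS_PER_YEAR:.0f} years)")
```

A failure rate of 0.02 per year therefore corresponds to an MTBF of roughly 438,000 hours, or 50 years, which already hints at why MTBF should not be read as a service life.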

 

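Mean Time To Repair (MTTR)

Full name: Mean Time To Repair. As the name suggests, it is the average repair time needed to bring a product from a failed state back to a working state. In engineering, MTTR measures a product's maintainability. It is common in maintenance contracts, where it serves as a basis for service fees.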

 

Figure 2: Hardware MTTR estimation

 

Figure 3: Software MTTR estimation

 

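Availability

GB/T 3187-97 defines availability as: the ability of a product to be in a state to perform a required function under given conditions at a given instant of time or over a given time interval, assuming that the required external resources are provided. It is a combined reflection of a product's reliability, maintainability, and maintenance support performance.

 

The formula for availability is easy to understand and needs no further explanation here. Availability is commonly expressed as a number of nines, for example 99.9% (3-nines availability) or 99.999% (5-nines availability).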
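The formula itself has not survived in this copy of the article. The sketch below assumes the widely used steady-state definition, Availability = MTBF / (MTBF + MTTR), which may or may not be exactly what the original figure showed. The sample MTBF and MTTR values are hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is working.

    Assumes the common definition MTBF / (MTBF + MTTR).
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical system: fails on average every 999 hours, takes 1 hour to repair.
a = availability(mtbf_hours=999, mttr_hours=1)
print(f"availability: {a:.3%}")
```

Under this definition, availability can be raised either by failing less often (higher MTBF) or by recovering faster (lower MTTR).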

 

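Downtime

As the name suggests, this is the time a machine is out of service because of a failure. Downtime is mentioned here because expressing system availability as downtime per year is more intuitive and easier to understand.

 

Figure 4: Correspondence between availability and downtime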
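Since the figure is not reproduced here, the correspondence can be recomputed directly. This is a minimal sketch assuming a 365-day year; the set of availability levels shown is an illustrative choice, not necessarily the one in the original figure.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability (0..1)."""
    return (1 - availability) * HOURS_PER_YEAR


# Two nines through five nines.
for a in (0.99, 0.999, 0.9999, 0.99999):
    hours = downtime_hours_per_year(a)
    print(f"{a:.3%} -> {hours:.2f} h/year ({hours * 60:.1f} min/year)")
```

Three nines allow about 8.76 hours of downtime per year, while five nines allow only a little over 5 minutes.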

 

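Further Thoughts

Is MTBF unreliable?

Generally, vendors quote MTBF values of more than one million hours for the main components of a server. For example: 1,000,000 hours each for the motherboard, CPU, and hard disk, and 4,000,000 hours for a memory module (so 4 modules together come to about 1,000,000 hours). From this, the MTBF of the whole server works out to about 250,000 hours (roughly 30 years), with an annual failure rate of about 3%. In other words, out of every 100 servers, a few will fail each year.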
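The estimate above can be reproduced with a short sketch. It assumes a series model (the server fails as soon as any one component fails) and constant, independent failure rates, so that component failure rates simply add. Those modeling assumptions are implied by the arithmetic rather than stated in the article.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def series_mtbf(component_mtbfs):
    """MTBF of a series system: failure rates (1/MTBF) add up."""
    total_rate = sum(1.0 / m for m in component_mtbfs)
    return 1.0 / total_rate


# Vendor-quoted values from the text: motherboard, CPU, disk, and 4 memory modules.
components = [1_000_000, 1_000_000, 1_000_000] + [4_000_000] * 4

server_mtbf = series_mtbf(components)
annual_failures = HOURS_PER_YEAR / server_mtbf  # expected failures per server-year

print(f"server MTBF: {server_mtbf:,.0f} h ({server_mtbf / HOURS_PER_YEAR:.1f} years)")
print(f"expected failures per server per year: {annual_failures:.1%}")
```

The result is about 250,000 hours (28.5 years) and roughly 3.5 expected failures per 100 servers per year, consistent with the rounded figures in the text.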

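The calculation above looks fine and seems fairly plausible. From another angle, though, something feels off: with an MTBF of about 30 years, can we really expect the server to stay in service for 30 years? Let's first see how an engineer from ** explains it: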

It is common to see MTBF ratings between 300,000 to 1,200,000 hours for hard disk drive mechanisms, which might lead one to conclude that the specification promises between 30 and 120 years of continuous operation. This is not the case! The specification is based on a large (statistically significant) number of drives running continuously at a test site, with data extrapolated according to various known statistical models to yield the results.

Based on the observed error rate over a few weeks or months, the MTBF is estimated and not representative of how long your individual drive, or any individual product, is likely to last. Nor is the MTBF a warranty - it is representative of the relative reliability of a family of products. A higher MTBF merely suggests a generally more reliable and robust family of mechanisms (depending upon the consistency of the statistical models used). Historically, the field MTBF, which includes all returns regardless of cause, is typically 50-60% of projected MTBF.

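Having read this, and recalling the earlier discussion of failure rate, I wonder whether readers have spotted the trick. Put simply, what these vendors actually measure is the failure rate during the product's healthy "prime of life". They then use the reciprocal relationship with MTBF to arrive at MTBF figures of a million hours or so. In the real world, the failure rate of these products rises quickly in "middle and old age", so these MTBF figures cannot reflect a product's actual service life. The text also mentions that ** itself recognized the shortcomings of MTBF and switched to AFR (Annualized Failure Rate).

In fact, as early as 2007, Google and CMU each published a paper at FAST '07 discussing hard disk failures in detail:

CMU: Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?

Google: Failure Trends in a Large Disk Drive Population

Google collected data from more than 100,000 of its consumer-grade hard disk drives (SATA and PATA, 5400 rpm and 7200 rpm, from 7 vendors, 9 models, with capacities ranging from 80 GB to 400 GB) and arrived at the following figures: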

Google found that disks had an annualized failure rate (AFR) of 3% for the first three months, dropping to 2% for the first year. In the second year the AFR climbed to 8% and stayed in the 6% to 9% range for years 3-5.
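To see how far these observations are from a datasheet figure, the sketch below converts an MTBF of 1,000,000 hours into the AFR it implies and compares it with the percentages quoted above. It assumes a constant failure rate (exponential lifetime) and 8,760 power-on hours per year; the conversion formula is a standard one and is not taken from the article.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming the drive runs continuously


def afr_from_mtbf(mtbf_hours: float) -> float:
    """Probability that a unit fails within one year, given a constant failure rate."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


implied = afr_from_mtbf(1_000_000)

# AFR values as quoted in the passage above.
observed = [
    ("first three months", 0.03),
    ("first year", 0.02),
    ("second year", 0.08),
]

print(f"AFR implied by a 1,000,000 h MTBF: {implied:.2%}")
for period, afr in observed:
    print(f"{period}: observed {afr:.0%}, about {afr / implied:.1f}x the implied value")
```

A 1,000,000-hour MTBF implies an AFR below 1%, so every quoted observation is several times higher than the datasheet figure would suggest.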

















