In today’s world, where data and the drives that store it are critical resources, managing HDDs in operation and proactively detecting potential failures are of paramount importance. A failure that is not detected in time can severely disrupt the products and services of many businesses.
Traditionally, when a disk showed signs of trouble, the primary approach was on-site, software-based repair. This method is costly and time-intensive, however, because it involves transferring data off the drive, isolating the drive, running diagnostics, and then reintegrating it into the operational workflow.
We at ULINK have therefore created an advanced machine learning (ML) system designed to predict the likelihood of drive failure. Millions of drives are deployed across cloud servers, data centers, and enterprise servers. These in turn generate billions of rows of SMART (Self-Monitoring, Analysis and Reporting Technology) data, repair logs, Online Vendor Diagnostics (OVD) or Field Accessible Reliability Metrics (FARM) logs, and manufacturing data about each disk drive.
That is a large number of parameters to track and monitor for every single HDD in order to predict the likelihood of drive failure accurately. Given the abundance of raw data, we had to distill the most pertinent features to preserve the precision and efficiency of our machine learning models.
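To make the idea of distilling features concrete, here is a minimal sketch of one very simple filter: dropping attributes whose values barely change across drives, since they carry little signal. This is only an illustration and not a description of how DA Drive Analyzer selects features. The function name `drop_near_constant`, the `min_variance` threshold, and the attribute names in the sample rows are all hypothetical.

```python
from statistics import pvariance


def drop_near_constant(rows, min_variance=1e-9):
    """Return the feature names whose values vary across the given rows.

    rows: list of dicts mapping a feature name to a numeric value.
    The threshold is an illustrative assumption, not a tuned value.
    """
    if not rows:
        return []
    kept = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        # A feature that is (almost) identical on every drive cannot help
        # separate healthy drives from failing ones.
        if pvariance(values) > min_variance:
            kept.append(name)
    return kept


# Hypothetical sample: one row per drive, made-up attribute names.
sample = [
    {"reallocated_sectors": 0, "firmware_flag": 1, "read_error_rate": 12},
    {"reallocated_sectors": 8, "firmware_flag": 1, "read_error_rate": 15},
    {"reallocated_sectors": 3, "firmware_flag": 1, "read_error_rate": 11},
]
print(drop_near_constant(sample))
```

In practice a filter like this would only be a first pass; ranking the remaining features against observed failures is a separate step.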
Google, in collaboration with Seagate, its HDD original equipment manufacturer (OEM) partner for its data centers, has also developed an ML system for predicting frequent HDD problems. The system was built on Google Cloud to forecast the probability of a recurring failing disk, defined as a disk that fails or has experienced three or more problems in 30 days.
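The "recurring failing disk" definition can be expressed as a short labeling rule. The sketch below is our own reading of that definition, not Google's or Seagate's implementation: it assumes "in 30 days" means any rolling span shorter than 30 days, and the function name `is_recurring_failing` and its parameters are hypothetical.

```python
from datetime import date, timedelta


def is_recurring_failing(problem_dates, failed=False,
                         window_days=30, min_problems=3):
    """Label a disk as recurring failing.

    True if the disk has failed outright, or if at least `min_problems`
    problems fall inside some rolling window of `window_days` days.
    The rolling-window interpretation is an assumption.
    """
    if failed:
        return True
    dates = sorted(problem_dates)
    window = timedelta(days=window_days)
    start = 0
    for end in range(len(dates)):
        # Advance the left edge until the window spans less than 30 days.
        while dates[end] - dates[start] >= window:
            start += 1
        if end - start + 1 >= min_problems:
            return True
    return False


# Hypothetical problem log for one disk.
events = [date(2023, 1, 1), date(2023, 1, 10), date(2023, 1, 25)]
print(is_recurring_failing(events))
```

Sorting the dates first lets a single pass with two indices check every window, which matters when the same rule is applied to millions of drives.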
What distinguishes DA Drive Analyzer’s ML system from the aforementioned collaboration is that the data we collect comes not from the HDDs of a single manufacturer but from multiple HDD manufacturers, including Seagate, WDC, Toshiba, and HGST. As a result, we created a drive failure prediction system that is broadly applicable to today’s consumer market. Read this post on how we ranked SATA HDDs using long latency read data.
Given the sheer quantity of drives in a modern enterprise data center, relying solely on human effort to monitor these devices is infeasible. Furthermore, with the higher accuracy provided by the ML models used in ULINK’s DA Drive Analyzer, engineers get a larger window in which to identify and manage failing drives. As a result, they can not only cut expenses but also proactively avert drive problems before they impact end users.