Maintaining reliable, healthy compute resources is essential for large-scale cloud providers. System failures can cause reboots and performance degradation, which in turn lead to downtime and lost revenue. Providers increasingly emphasize minimal runtime interruptions and quick fault diagnosis that does not impact workload performance or service continuity.
Recently, at the OCP Storage Tech Talk 2025, Microsoft presented its Azure Failure Detection and Prediction (AFPD) system that focuses on early isolation of potential node failures, proactive mitigation to prevent customer disruption, and swift, non-disruptive repair actions for workload continuity. We will take a close look at it and see how it compares with the ULINK DA Drive Analyzer technology.
Detection Lifecycle
AFPD’s workflow spans three integrated domains. This layered approach increases reliability while minimizing false positives in detection:
– Online Monitoring: Real-time fault prediction, detection, and preventative actions
– Offline Repair: Stress testing and failure diagnosis post-isolation
– Root Cause Analysis after removal of faulty SSDs: Bug identification, ML labeling, and remediation using historical failure data
AFPD System Architecture
The system architecture follows an end-to-end telemetry-driven loop:
– Telemetry Collection (Online & Offline): Logs and sensor data are ingested daily. The collection frequency depends on the rate of change in telemetry attributes.
– Log Parsing & Detection Rules: Logs are filtered through detection algorithms. Rule-based approaches work best for deterministic signals.
– Machine Learning (ML) Engine: An ML approach is used when rule-based approaches are insufficient. The ML engine is trained on labeled data for fault classification.
– Decision Logic: Contextual bandits are used to select optimal mitigation strategies.
– Mitigation & Remediation: Key steps include live migration, customer notification, and smart repair via spare node management.
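To make the Decision Logic step more concrete, here is a minimal sketch of a contextual bandit choosing among mitigation actions. It is not AFPD's implementation: the epsilon-greedy strategy, the action names, the two fault contexts, and the reward table are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

# Hypothetical mitigation actions, loosely named after the steps listed above.
ACTIONS = ["live_migrate", "notify_customer", "repair_with_spare"]

# Synthetic reward table (assumption): how well each action works per fault context.
TRUE_REWARD = {
    ("pcie_errors", "live_migrate"): 0.9,
    ("pcie_errors", "notify_customer"): 0.2,
    ("pcie_errors", "repair_with_spare"): 0.5,
    ("smart_warning", "live_migrate"): 0.4,
    ("smart_warning", "notify_customer"): 0.3,
    ("smart_warning", "repair_with_spare"): 0.8,
}


class EpsilonGreedyBandit:
    """Tabular contextual bandit: one running mean reward per (context, action)."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)
        self.means = defaultdict(float)

    def best_action(self, context):
        # Greedy choice: the action with the highest estimated reward so far.
        return max(self.actions, key=lambda a: self.means[(context, a)])

    def select(self, context):
        # Explore with probability epsilon, otherwise exploit the current estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.best_action(context)

    def update(self, context, action, reward):
        # Incremental mean update for the chosen (context, action) pair.
        key = (context, action)
        self.counts[key] += 1
        self.means[key] += (reward - self.means[key]) / self.counts[key]


def simulate(bandit, true_reward, rounds=2000):
    """Train the bandit against a fixed reward table, alternating contexts."""
    contexts = sorted({ctx for ctx, _ in true_reward})
    for i in range(rounds):
        ctx = contexts[i % len(contexts)]
        action = bandit.select(ctx)
        bandit.update(ctx, action, true_reward[(ctx, action)])
    return bandit
```

Calling `simulate(EpsilonGreedyBandit(ACTIONS), TRUE_REWARD)` and then `best_action("smart_warning")` shows the idea: the preferred mitigation depends on the observed context and is learned from feedback, not hard-coded. A production system would use richer context features and noisy, delayed rewards.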
Fault Prevention Strategy
AFPD captures a diverse spectrum of failure signals:
– Online Failures: Azure service, PCIe, and kernel errors
– Offline Failures: HW diagnostics and telemetry from burn-in testing
– Failure Detection Signals: SMART logs, controller logs, vendor-specific metrics, environmental data
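Signals such as SMART logs lend themselves to the deterministic, rule-based detection mentioned earlier. The sketch below shows what such rules might look like for an NVMe health log. The field names follow the NVMe SMART / Health Information log, but the rules themselves and the `MAX_MEDIA_ERRORS` limit are assumptions for illustration, not AFPD's actual detection rules.

```python
# Hypothetical limit; a real deployment would tune this per drive model.
MAX_MEDIA_ERRORS = 10


def evaluate_rules(smart):
    """Return the list of rule violations for one SMART health snapshot (a dict)."""
    reasons = []
    # Any non-zero critical warning bit is treated as a deterministic fault signal.
    if smart.get("critical_warning", 0) != 0:
        reasons.append("critical warning set")
    # Spare capacity has dropped below the drive's own reported threshold.
    if smart.get("available_spare", 100) < smart.get("available_spare_threshold", 0):
        reasons.append("available spare below threshold")
    # Media error count exceeds the (assumed) acceptable limit.
    if smart.get("media_errors", 0) > MAX_MEDIA_ERRORS:
        reasons.append("media errors above limit")
    return reasons


healthy = {"critical_warning": 0, "available_spare": 100,
           "available_spare_threshold": 10, "media_errors": 0}
degraded = {"critical_warning": 1, "available_spare": 5,
            "available_spare_threshold": 10, "media_errors": 42}
```

Here `evaluate_rules(healthy)` returns an empty list, while `evaluate_rules(degraded)` reports all three violations. Snapshots that pass every rule but still look unusual are the kind of case an ML model would handle.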
Failure data feeds the labeling pipeline to train the prediction model. The approach creates a feedback loop to refine telemetry needs and isolate hardware-level faults proactively.
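As a rough illustration of how failure data can feed a labeling pipeline, the sketch below marks a telemetry sample as positive when its node failed within a fixed horizon after the sample was taken. The function name, the data shapes, and the 7-day horizon are assumptions made for this example; the source does not describe AFPD's labeling scheme.

```python
from datetime import date, timedelta


def label_samples(samples, failure_dates, horizon_days=7):
    """Label each (node_id, day) sample 1 if the node failed within the horizon.

    samples: list of (node_id, date) telemetry snapshots.
    failure_dates: dict mapping node_id to the date of its recorded failure.
    """
    horizon = timedelta(days=horizon_days)
    labeled = []
    for node_id, day in samples:
        failed_on = failure_dates.get(node_id)
        # Positive only if the failure falls on or after the sample day
        # and no later than horizon_days after it.
        positive = (
            failed_on is not None
            and timedelta(0) <= failed_on - day <= horizon
        )
        labeled.append((node_id, day, int(positive)))
    return labeled


samples = [
    ("node-a", date(2025, 1, 1)),   # 9 days before the failure: outside the horizon
    ("node-a", date(2025, 1, 8)),   # 2 days before the failure: inside the horizon
    ("node-b", date(2025, 1, 8)),   # node never failed
]
failures = {"node-a": date(2025, 1, 10)}
labels = label_samples(samples, failures)
```

The choice of horizon is a trade-off: a longer window yields more positive examples but blurs the link between the telemetry and the eventual failure.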
AFPD automates the fault detection lifecycle with ML and contextual logic, which dramatically improves operational speed and precision. AFPD nearly doubles efficacy and raises precision to about 90%, at a recall rate of about 37%. In other words, roughly nine in ten flagged nodes are genuine failures, while a little over a third of all actual failures are flagged.
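To see what a precision of about 90% and a recall of about 37% mean in practice, the following sketch derives confusion counts and the F1 score from those two figures. The 1,000-failure population is a made-up number for the arithmetic only, and the helper names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def confusion_counts(actual_failures, precision, recall):
    """Derive expected counts from precision, recall, and the number of real failures."""
    true_pos = recall * actual_failures      # failures that were flagged
    flagged = true_pos / precision           # everything the system flagged
    return {
        "true_positives": true_pos,
        "false_positives": flagged - true_pos,          # false alarms
        "false_negatives": actual_failures - true_pos,  # missed failures
    }


counts = confusion_counts(actual_failures=1000, precision=0.90, recall=0.37)
f1 = f1_score(0.90, 0.37)
```

For 1,000 actual failures this gives about 370 caught, about 41 false alarms, and about 630 missed, with an F1 score of roughly 0.52. The profile is a conservative one: few false alarms, at the cost of missing most failures.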
In contrast, ULINK DA Drive Analyzer’s failure prediction technology, which focuses on prediction of failures in storage devices rather than entire nodes, offers a competitive level of predictive performance for consumer devices. It processes data, including SMART indicators, host notifications, HDD logs, and manufacturing data from multiple HDD manufacturers, such as Seagate, WDC, Toshiba, and HGST. ULINK DA Drive Analyzer has been publicly available since December 2021 and has been at the forefront of drive failure prediction using ML models.
Related Reading
Google’s ML Model for HDD Reliability Management in Data Centers
Meta’s Perspective on HDD Health in Data Centers – Part 2
DA SmartQuest: DA Drive Analyzer’s AI-Powered Drive Failure Prediction for Windows