International Journal of Engineering Technology and Construction, 2026, 7(1); doi: 10.38007/IJETC.2026.070101.

Engineering Study of Disaster Recovery and Fault Self-Healing Mechanisms for Distributed Systems under Cross-Regional Deployment Conditions

Author(s)

Xiao Ma

Corresponding Author

Xiao Ma

Affiliation(s)

Cloud Data Technologies, eBay, San Jose, 95125, California, United States

Abstract

Introduction: Although cross-regional service deployment improves availability, it can also amplify failure cascades and recovery uncertainty. Methodology: We introduce CRASH, a self-healing and disaster recovery solution with three components: (i) multi-signal anomaly detection with impact scoring over dependencies, (ii) policy-guarded remediation formulated as a Markov decision process, and (iii) adaptive failover with replication-lag estimation and confidence-sensitive decision thresholds. Findings: In a geo-distributed microservice testbed with injected region, network, and storage failures, CRASH reduces MTTR by 34-60 percent and RPO by 38-75 percent compared to Kubernetes-native and chaos-oriented baselines, whilst improving availability and reducing performance overhead. Two-sided Welch's t-tests showed statistically significant improvements in MTTR (p<0.001), with all 95 percent confidence intervals favouring CRASH. Final Remarks: CRASH offers an engineering recipe for reliable cross-region recovery that combines observability, safe automation, and data-driven decision-making.
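The significance claim above rests on Welch's unequal-variance t-test. As a minimal sketch of how such a comparison could be computed (not the author's evaluation code), the snippet below derives the t statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. The function name `welch_t` and the MTTR samples are hypothetical and purely illustrative; they are not data from the paper.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on samples a and b."""
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    # t statistic: difference of means over the combined standard error.
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Hypothetical MTTR samples in seconds (illustrative only, not from the paper).
baseline_mttr = [120.0, 135.0, 128.0, 140.0, 131.0]
candidate_mttr = [70.0, 82.0, 76.0, 79.0, 73.0]

t_stat, dof = welch_t(baseline_mttr, candidate_mttr)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

A two-sided p-value then follows from the Student t distribution with `dof` degrees of freedom. If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and reports the p-value directly.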

Keywords

Disaster Recovery; Cross-Region Deployment; Self-Healing; Automated Remediation; Geo-Replication

Cite This Paper

Xiao Ma. Engineering Study of Disaster Recovery and Fault Self-Healing Mechanisms for Distributed Systems under Cross-Regional Deployment Conditions. International Journal of Engineering Technology and Construction (2026), Vol. 7, Issue 1: 1-7. https://doi.org/10.38007/IJETC.2026.070101.
