A major incident at an OVH data center in France caused massive downtime and potential data loss for many customers, according to reports in Data Center World.
“On Wednesday, 10 March 2021 a fire broke out in a room at the SBG-2 OVHcloud data center in Strasbourg, France. The fire reportedly destroyed SBG-2 and damaged four of 12 rooms in the adjacent SBG-1 data center. Two adjacent data centers, SBG-3 and SBG-4, were not damaged but were shut down during the event, requiring a massive, time-consuming reboot of all their systems.”
According to the reports, even 14 days after the event, OVH is still struggling to bring customer services back online. This might be due to the volume of hardware affected and the amount of data involved.
Today, together with our CTO Daniel Ananthan, a specialist in data center design, we explore this incident so that our readers can learn from it.
As more applications and databases move to the cloud and organizations adopt a cloud-first approach to their business, they are becoming more dependent on cloud service providers (CSPs). CSPs are becoming a critical element in business continuity. Even the most mature CSPs, such as AWS and Azure, have experienced large service outages.
What should CIOs do to minimize the impact of such outages?
Reviewing service level agreements (SLAs) and business impact assessments will be an important activity for CIOs going forward. If you depend on cloud service providers to host business-critical systems, you should review the following elements of the service levels:
- Level of Redundancy offered – Server / Pod Level, Data Center level, Network Redundancy
- Measuring the Delivered SLA against the Agreed SLA – CIOs should have a strong focus on monitoring service availability and comparing it with what the contract promises.
- Having the Backups – While most CSPs offer backups, it’s always safer to keep your own copy, including for SaaS services such as Office 365. Many backup platforms now provide native support for cloud service providers. Tasks such as backup and replication are often the customer's responsibility under the shared responsibility model of the cloud.
- Studying and minimizing application dependencies – Many applications still have a monolithic architecture and depend on specific resources such as VMs. If applications can be deployed using a decoupled, cloud-native approach that tolerates failures, the impact of events such as this one can be minimized. Technologies such as Kubernetes and microservices help here.
- Evaluating the Cybersecurity Controls – We have observed a growing risk from malware, including ransomware, which has caused many incidents of business loss. Going forward, CIOs and CSOs will have to evaluate the information security controls that service providers have in place against such attacks.
- Conducting tabletop disaster recovery drills and calculating the RPOs and RTOs (recovery point and recovery time objectives) using the available tools and automation systems
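Two of the checks above lend themselves to simple arithmetic. The sketch below is illustrative only: the function names, the 99.9% target, and the timestamps are assumptions made for the example, not figures from any provider's contract. It shows how measured availability could be compared with an agreed SLA, and how the recovery point and recovery time achieved in a drill could be derived from three timestamps.

```python
from datetime import datetime, timedelta


def measured_availability(period: timedelta, outages: list[timedelta]) -> float:
    """Return availability over the period as a percentage."""
    downtime = sum(outages, timedelta())
    return 100.0 * (1 - downtime / period)


def sla_met(measured_pct: float, agreed_pct: float) -> bool:
    """True when the measured availability reaches the agreed target."""
    return measured_pct >= agreed_pct


def achieved_rpo_rto(
    last_good_backup: datetime, failure: datetime, service_restored: datetime
) -> tuple[timedelta, timedelta]:
    """Return (data-loss window, time to restore) observed in a drill."""
    rpo = failure - last_good_backup   # data written in this window is lost
    rto = service_restored - failure   # how long the service was unavailable
    return rpo, rto


# Hypothetical month: 30 days with two outages of 20 and 45 minutes.
availability = measured_availability(
    timedelta(days=30), [timedelta(minutes=20), timedelta(minutes=45)]
)
print(f"Measured availability: {availability:.3f}%")
print("Agreed 99.9% SLA met:", sla_met(availability, 99.9))

# Hypothetical drill: backup at 02:00, failure at 09:30, restored at 15:00.
rpo, rto = achieved_rpo_rto(
    datetime(2021, 3, 10, 2, 0),
    datetime(2021, 3, 10, 9, 30),
    datetime(2021, 3, 10, 15, 0),
)
print("Achieved RPO:", rpo, "| Achieved RTO:", rto)
```

Comparing the achieved values with the targets agreed with the business shows whether the backup frequency and the recovery procedures need to change.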
Having said that, many organizations still depend on their own data centers for their services. They run business-critical data centers and might not be able to move to a cloud-native architecture in the near future. How can such customers, who run on-premises data centers or use colocation facilities, minimize the impact of such data center outages?
Our CTO Daniel Ananthan highlights a few important points.
- An early fire detection and suppression system is a crucial component that most people miss during design.
- During the design of the data center's physical infrastructure, all criticality and risk analysis criteria should be considered.
- Electrical power distribution and cooling are another aspect to examine, since both can be a cause of fire.
- Power capacity planning and a predictive alert system, delivered through Data Center Infrastructure Management (DCIM), are other key components of data center operations. Equipment is evolving toward a smaller footprint with more processing power, which demands more power.
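To make the idea of a predictive power alert concrete, here is a minimal sketch. It assumes equally spaced power readings in kW (for example, monthly averages per rack or per room) and an alert threshold of 80% of rated capacity. Both the threshold and the function names are assumptions for illustration; a real DCIM platform would use its own data model and forecasting method.

```python
from typing import Optional


def linear_trend(samples: list[float]) -> tuple[float, float]:
    """Least-squares slope and intercept for equally spaced samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_alert(
    samples: list[float], capacity_kw: float, alert_fraction: float = 0.8
) -> Optional[float]:
    """Periods until the projected load crosses the alert threshold.

    Returns 0.0 if the latest reading is already at or above the threshold,
    and None if the fitted trend is flat or falling.
    """
    threshold = capacity_kw * alert_fraction
    if samples[-1] >= threshold:
        return 0.0
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    # Measure from the most recent sample, never reporting a negative value.
    return max(0.0, crossing - (len(samples) - 1))


# Hypothetical readings for a room rated at 100 kW.
readings_kw = [40.0, 42.0, 44.0, 46.0, 48.0]
print("Periods until 80% of capacity:", periods_until_alert(readings_kw, 100.0))
```

A straight-line fit is the simplest possible forecast. It is used here only to show the shape of the calculation, and it will underestimate growth when dense new equipment is installed in batches.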
VS ONE is helping a number of clients optimize their data centers through our data center advisory services. This involves multiple assignments, including analysis of the existing power and cooling arrangements, identification of data center hotspots, and analysis of fire controls, including fire alarms and fire suppression solutions.