Uptime of EquipmentCloud® in 2020: 100% Availability!
In 2020, Kontron AIS achieved 100% availability for all EquipmentCloud® users, excluding planned maintenance of less than 1%, with 0 hours of unplanned downtime. A look behind the scenes of the IIoT service solution EquipmentCloud® shows why this is no coincidence and should be a matter of course: several coordinated factors work together.
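The headline figure can be reproduced with simple arithmetic. The following is a minimal sketch, assuming that availability is measured against scheduled service time (total hours minus planned maintenance). The article does not state the exact formula, and the 40 maintenance hours are a purely illustrative value below 1% of the year.

```python
HOURS_IN_2020 = 366 * 24  # 2020 was a leap year: 8784 hours


def availability(total_hours, planned_hours, unplanned_hours):
    """Availability in percent, relative to scheduled service time.

    Assumption: planned maintenance is excluded from the service time,
    so only unplanned downtime reduces the result.
    """
    scheduled = total_hours - planned_hours
    return 100.0 * (scheduled - unplanned_hours) / scheduled


# Upper bound for "less than 1% planned maintenance" in 2020, in hours
max_planned = 0.01 * HOURS_IN_2020

# Hypothetical example: 40 hours of planned maintenance, no unplanned downtime
result = availability(HOURS_IN_2020, 40.0, 0.0)
print(result)  # 100.0
```

With this definition, any unplanned downtime lowers the figure, while planned maintenance only shortens the period against which it is measured.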
1. Second-generation Oracle Cloud Infrastructure (OCI) as a backbone
Oracle has specialized in databases and system stability for over 30 years. Data storage and backup of EquipmentCloud® customer systems are based on Oracle Cloud Infrastructure (OCI) and take place in three data centers around Frankfurt am Main. The three locations provide mirroring, redundancy, and hardened security for the data, ensuring a high level of reliability and availability even if individual components fail. Oracle dedicates the network connection and the CPUs exclusively to these systems, so resources cannot be overbooked. As an Oracle ISV partner, Kontron AIS has long-standing contact with Oracle at development level and uses it for regular exchange on current and security-relevant topics. Kontron AIS always applies the latest APEX front-end updates and system patches provided by Oracle, so the required downtimes can be scheduled well in advance.
2. Continuous performance analysis
As a further building block for high system stability, Kontron AIS continuously analyzes system performance during operation and rollouts. The front end and back end are fundamentally separated from each other. In addition, access to the system is strictly limited: only system administrators from the Cloud OPS team have access. Access is currently protected by an SSH key and password and is continuously monitored. Key system parameters include main memory utilization, web page rendering times, and system response times. They are derived with statistical methods from logs of system actions that are not tied to individual users. In addition, support is automatically notified 24/7 in the event of serious system errors, so that countermeasures can be taken at an early stage.
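The article does not describe which statistical methods are used. As one possible illustration, the sketch below flags a metric when its latest value lies more than three standard deviations above its recent mean and then calls a notification hook. The function names, the metric name `response_ms`, and the three-sigma threshold are assumptions for this example, not the actual EquipmentCloud® implementation.

```python
from statistics import mean, stdev


def is_anomalous(history, latest, sigmas=3.0):
    """Return True if `latest` exceeds mean(history) + sigmas * stdev(history)."""
    if len(history) < 2:
        return False  # too few samples to build a baseline
    return latest > mean(history) + sigmas * stdev(history)


def check_metrics(histories, latest_values, notify):
    """Check each metric against its history and notify on anomalies.

    `notify` is a hypothetical callback, for example one that alerts support.
    Returns the names of all metrics that were flagged.
    """
    flagged = []
    for name, latest in latest_values.items():
        if is_anomalous(histories.get(name, []), latest):
            flagged.append(name)
            notify(f"{name} out of range: {latest}")
    return flagged


# Hypothetical response times in milliseconds
histories = {"response_ms": [120, 130, 125, 118, 127]}
messages = []
check_metrics(histories, {"response_ms": 400}, messages.append)
```

A simple threshold like this is easy to reason about, but it assumes that the metric is roughly stable over the history window; metrics with daily or weekly patterns would need a baseline per time slot.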
3. Preventive training measures
To be prepared for emergencies, possible events are simulated with the entire team. This minimizes unplanned downtime and lets the team rehearse countermeasures and maintenance routines for specific scenarios. These include complete system failure, simultaneous repair of the new system on the backup container, front-end monitoring of the system load, identification of front-end errors, start-up of databases, run-throughs of the procedure for urgently required updates, cloning of the production environment, and the fully automatic creation and rollout of the front end.
4. Automatic testing and automated rollout process
Kontron AIS relies on automatic build processes (continuous integration) and automated testing in product development to complete updates within a three-week sprint. The test and rollout process consists of several stages. New versions are rolled out gradually but automatically across all stages, from the test server through the mirror system to the production system, provided that the tests at each stage have been successful. Developers refer to such multi-stage rollout strategies as a code freeze. The process is based on scripted infrastructure and regression tests with risk assessment. It also includes the tracking and registration of updates as well as a complete overview of all installed versions. Kontron AIS also follows the rule that no updates are rolled out before public holidays or weekends, in order to avoid possible disruptions.
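The staged rollout described above can be sketched as a simple gate: a version only moves to the next stage if the tests on the current stage pass, and nothing starts on a weekend, on a public holiday, or on the day before one. The stage names follow the article; the functions `run_tests` and `deploy` are hypothetical callbacks, and reading "before" as "the day before" is an assumption of this example.

```python
from datetime import date, timedelta

STAGES = ["test", "mirror", "production"]


def rollout_allowed(day, holidays=frozenset()):
    """Allow a rollout only if neither `day` nor the next day is a weekend or holiday."""
    def is_day_off(d):
        return d.weekday() >= 5 or d in holidays  # 5 = Saturday, 6 = Sunday

    return not (is_day_off(day) or is_day_off(day + timedelta(days=1)))


def staged_rollout(version, run_tests, deploy, day, holidays=frozenset()):
    """Deploy `version` stage by stage and stop at the first failing test run.

    Returns the list of stages whose tests passed.
    """
    if not rollout_allowed(day, holidays):
        return []
    passed = []
    for stage in STAGES:
        deploy(version, stage)
        if not run_tests(version, stage):
            break  # do not promote to later stages
        passed.append(stage)
    return passed


# Hypothetical run: the tests fail on the mirror system, so production is never touched
deployed = []
stages_ok = staged_rollout(
    "1.2.3",
    run_tests=lambda version, stage: stage != "mirror",
    deploy=lambda version, stage: deployed.append(stage),
    day=date(2020, 3, 4),  # a Wednesday
)
```

In this run, `stages_ok` contains only the test stage and `deployed` stops at the mirror system, which mirrors the condition in the text that promotion depends on successful tests.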
These are four ways in which we at Kontron AIS ensure high availability of the SaaS IIoT service solution EquipmentCloud® throughout the year. Would you like to learn more about this topic?