Applying AI and Machine Learning to improve system recoveries and replications.
With the constant rise in cybercrime, the ability to recover systems, or fail over to a clean replica, is vital to minimise the impact of downtime and, ideally, avoid the need to pay any ransom demands. Although modern disaster recovery software tools can automate recovery and replication processes, there are still many scenarios that will cause a recovery or replication to fail. Recovery failure can have a major financial impact during a cyberattack or other outage. Cristie Software has been at the forefront of recovery and replication software development since 2008 and is improving recovery and replication processes by harnessing AI and machine learning to automate the resolution of recovery failures. These new capabilities are underpinned by a recent UK patent award, which describes the methods that the Cristie Software development team has invented to enable these features.
Taking the hard work out of system recovery failure.
With the advent of cloud computing and hybrid cloud/physical models, it is common for recovery or replication targets to be hosted in different operating environments from their source systems. This factor alone can cause incompatibility, particularly where platform-specific drivers are required during system startup. Cristie Software recovery and replication solutions allow complete flexibility regarding the direction of backup and recovery between physical, virtual and cloud-based systems. In addition, boot-critical drivers or software elements required when migrating between operating environments are automatically inserted during the recovery or replication process through Cristie Software’s dissimilar hardware technology. This feature can save systems administrators significant time and effort by removing the need to intervene manually during the recovery process to search for, download and install missing components.
In addition to platform-specific requirements, a variety of other technical and human-induced failures occur during recovery testing. The following are some of those most frequently reported by our software.
– Bad backup file(s): Often a corrupted backup file is detected that needs rectification and replacement.
– Incomplete backup job: Incomplete file/folder selection during configuration of the backup job is another common reason for failure. This can be just an administrative oversight or, in some cases, poor internal communication between departments.
– Network configuration: General network configuration errors are common, especially when moving between physical and virtual targets.
The Cristie Software suite provides extensive log files detailing both successful operations and recovery/replication failures. This includes the Cristie Virtual Appliance (VA), which controls the Recovery (BMR) solutions and is an optional tool for use alongside CloneManager replication. Our most recent UK patent award details our techniques for deriving actionable insights from information-rich log files through machine learning, with the goal of providing automated resolution of many common failure scenarios.
Improving system recovery through advanced log parsing algorithms and machine learning.
Log files containing runtime information are a vital tool for root cause analysis following a recovery failure. However, they can be extensive files containing unstructured information that requires manual filtering to find specific entries, as well as the knowledge to determine which reported events are significant when working out a resolution.
Cristie Software is implementing compute-efficient log parsing algorithms to transform raw log messages into structured log messages, which can then be compared against ‘known good’ recovery runtime outputs to train the VA in anomaly detection and categorization.
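To make the idea of turning raw log messages into structured ones more concrete, here is a minimal Python sketch. It is not the parsing algorithm used in the VA, which the article does not specify. It assumes a hypothetical line layout of timestamp, level and message, and it masks variable fields (IP addresses, hex values, numbers) with placeholders so that lines describing the same event share one template. The names `MASKS`, `LINE` and `parse_line` are illustrative only.

```python
import re

# Hypothetical masking rules: each pattern replaces a variable field
# with a placeholder so similar events collapse to the same template.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

# Assumed raw layout: "<timestamp> <LEVEL> <free-text message>".
LINE = re.compile(r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")


def parse_line(raw):
    """Return a structured record for one raw log line, or None if it does not fit."""
    match = LINE.match(raw.strip())
    if match is None:
        return None
    record = match.groupdict()
    template = record["message"]
    for pattern, placeholder in MASKS:
        template = pattern.sub(placeholder, template)
    record["template"] = template
    return record


record = parse_line("2024-01-05T10:00:00 ERROR eth0 failed to reach 10.0.0.5 after 3 retries")
print(record["level"], "|", record["template"])
```

Once lines are reduced to templates, two runs can be compared event by event even when addresses, counters and timings differ between them.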
The recovery process and anomaly detection take place over several phases. The first step is the creation of a virtual environment in which to restore the system. The virtual environment is created according to a stored configuration and the system is restored from a backup or from a replica into the virtual environment. An attempt is then made to boot the restored system in the virtual environment with runtime logs collected during system startup.
The VA then parses the log files from the recovery simulation using machine learning rules based on known-good recoveries to identify any anomalies. The machine learning rules derived from the known-good runtime outputs train the VA on which log file entries to ignore, so that only failed operations are collected for analysis. This training data is continually updated and reanalysed to ensure that accuracy improves with each new software iteration.
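The principle of learning what to ignore from known-good runs can be illustrated with a deliberately simple stand-in. The Python sketch below does not use machine learning and is not the VA's method. It only records the set of message templates seen in known-good runs and flags anything in a new run that falls outside that set. All function names and log messages are invented for illustration.

```python
import re

_VARIABLE = re.compile(r"\d+")


def to_template(message):
    """Mask digits so that timings and counters do not create false anomalies."""
    return _VARIABLE.sub("<*>", message)


def learn_baseline(known_good_runs):
    """Collect every template observed in the known-good recovery runs."""
    baseline = set()
    for run in known_good_runs:
        for message in run:
            baseline.add(to_template(message))
    return baseline


def find_anomalies(baseline, run):
    """Return the messages whose template never appeared in a known-good run."""
    return [message for message in run if to_template(message) not in baseline]


known_good = [
    ["Mounted /dev/sda1 in 12 ms", "Service sshd started"],
    ["Mounted /dev/sda1 in 15 ms", "Service sshd started"],
]
new_run = [
    "Mounted /dev/sda1 in 40 ms",
    "Driver vmxnet3 not found",
    "Service sshd started",
]

baseline = learn_baseline(known_good)
print(find_anomalies(baseline, new_run))
```

In this toy version, adding more known-good runs simply grows the baseline set, which mirrors the idea that accuracy should improve as the training data is updated.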
The next phase categorizes any runtime anomalies found in order to create structured log groups containing matched log events. Log groups could include errors such as connectivity, network configuration, dependent systems and user permissions. As an example, one system may be a web application server, which will not work correctly without a database server (i.e., the web application server depends on the database server).
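As a rough illustration of grouping anomalies into log groups, the following Python sketch assigns each message to the first category whose keywords it contains. The category names come from the paragraph above. The keyword lists and the sample messages are assumptions made for this example, and a trained classifier would replace the keyword lookup in practice.

```python
from collections import defaultdict

# Hypothetical keyword rules; the first matching category wins.
CATEGORY_KEYWORDS = {
    "connectivity": ("unreachable", "timed out", "connection refused"),
    "network configuration": ("dhcp", "gateway", "dns", "ip address"),
    "dependent systems": ("database", "upstream", "dependency"),
    "user permissions": ("permission denied", "access denied", "unauthorized"),
}


def categorize(message):
    """Return the log group for one anomaly message."""
    text = message.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "uncategorized"


def group_anomalies(messages):
    """Build structured log groups: category name -> list of matched messages."""
    groups = defaultdict(list)
    for message in messages:
        groups[categorize(message)].append(message)
    return dict(groups)


anomalies = [
    "Connection refused by 10.0.0.8:5432",
    "No DHCP lease obtained on eth0",
    "Web app cannot reach database server db01",
    "Permission denied opening /etc/app.conf",
]
for category, messages in group_anomalies(anomalies).items():
    print(category, "->", messages)
```

Grouping matters because each group tends to map to a different kind of fix: a network configuration group points at the recovery profile, while a dependent systems group points at recovery ordering.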
In the final phase, the VA will determine any modifications required to the stored configuration of the recovery profile in the virtual test environment based on the outcome of the log file analysis, and will attempt to make those changes automatically. The recovery simulation will then be repeated, and the resulting runtime logs will be parsed again to check for anomalies. This cycle will be repeated until the recovery simulation succeeds without errors, in line with the domain knowledge provided by the baseline known-good runtime information. If the VA determines that manual intervention is necessary to resolve a recovery failure, the ML algorithms will attempt to provide as much guidance as possible to the system administrator via the failure analysis panel within the VA user interface.
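The simulate, analyse and reconfigure cycle described above can be sketched as a simple loop. This Python example is an assumption-laden outline and not the VA's implementation. The callables `simulate`, `find_anomalies` and `propose_fix` are hypothetical hooks, the `fake_*` functions are toy stand-ins, and the `max_attempts` cap is added here as a safety limit that the article does not mention.

```python
def run_until_clean(simulate, find_anomalies, propose_fix, config, max_attempts=5):
    """Repeat the recovery simulation until no anomalies remain or no fix is known."""
    attempt = 0
    anomalies = []
    for attempt in range(1, max_attempts + 1):
        logs = simulate(config)             # restore and boot in the test environment
        anomalies = find_anomalies(logs)    # compare against the known-good baseline
        if not anomalies:
            return {"status": "recovered", "attempts": attempt, "config": config}
        new_config = propose_fix(anomalies, config)
        if new_config is None:              # no automatic fix available
            break
        config = new_config
    # Hand the remaining anomalies to the administrator as guidance.
    return {"status": "manual intervention", "attempts": attempt,
            "config": config, "guidance": anomalies}


# Toy stand-ins so the loop can be exercised without a real environment.
def fake_simulate(config):
    logs = ["Service sshd started"]
    if config.get("nic_driver") != "vmxnet3":
        logs.append("Driver vmxnet3 not found")
    return logs


def fake_find_anomalies(logs):
    return [line for line in logs if "not found" in line]


def fake_propose_fix(anomalies, config):
    if any("Driver vmxnet3" in anomaly for anomaly in anomalies):
        return {**config, "nic_driver": "vmxnet3"}
    return None


result = run_until_clean(fake_simulate, fake_find_anomalies, fake_propose_fix,
                         {"nic_driver": "e1000"})
print(result["status"], "after", result["attempts"], "attempts")
```

Returning the unresolved anomalies alongside the status keeps the two outcomes symmetrical: either the caller gets a working configuration, or it gets the evidence an administrator needs to continue by hand.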
Fewer recovery failures and faster problem resolution.
Automating system recovery, replication and migration has been the core focus of the Cristie Software suite since inception, driven by innovative techniques and the latest advances in computing. All major cloud and virtualization platforms are supported as replication or recovery targets, and specific extensions are available to enhance system recovery from backup solutions such as Dell Technologies Avamar, Dell Technologies NetWorker, IBM Spectrum Protect, Cohesity DataProtect, and Rubrik Security Cloud. Visit the CloneManager® and System Recovery product pages or contact the Cristie Software team for more information regarding the Cristie Software suite of solutions for system recovery, replication and migration.