High Availability and Disaster Recovery

The High Availability (HA) architecture is the right option when the document management becomes business-critical and requires minimal downtime and continuous availability.

The same architecture is also used to implement a Disaster Recovery (DR) where the purpose is to guarantee the survival of the service even in the face of a serious event on the main node.

In both cases, the solution is to use two nodes: the main one which is normally used to serve the traffic and the secondary one to which requests are delegated only if the main node is out of service.

For the HA both nodes are in the same LAN while for the DR they are located in distinct geographic areas.

High Availability

Each node has its own storage and database, but both are continuously replicated from the main node to the secondary one. In this way as soon as the main node becomes unavailable, the system will continue to work thanks to the secondary one that starts with the data from the last replication.

Set up the run level

Because of in this approach there is no communication between both applications, there is no need to activate the clustering feature.

The main node should be configured to work in Default run level as usual, while the secondary should work in the Slave run level. By default, most of the aspects are disabled in the slave, a good solution for a consultation only system. If you want a fully functional system, you may switch to the Default run level as soon as the secondary node takes the control.

Slave run level

Replication policies

The two metrics normally used to evaluate HA (and also Disaster Recovery (DR)) are the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO).

There is a difference between RTOs and RPOs that can be obtained to support high availability versus disaster recovery. With HA, data replication can be synchronous because the redundant components are located in the LAN environment. Active and standby databases can be updated at the same time, enabling full, automatic and real-time recoveries to satisfy the most demanding RTOs and RPOs. As a result, the standby instance is "hot" and in sync with the active instance, so it is ready to take over immediately in the event of a failure.

However, to recover systems, software, and data in the event of a disaster, the redundant components must be on a wide area network (WAN). This is important because you need to keep the redundant components in a geographic location away from the active instance. But with a WAN, data replication is asynchronous to avoid a negative impact on throughput performance. This means that updates to standby instances will delay updates made to the active instance, causing a delay during the restore process. Since disasters are rare, some delay can be tolerable and depends on (a) how critical it is for your business to achieve the lowest possible RTO and RPO and (b) how much budget you can allocate to achieve the best RTO and RPO.

Depending on the RTO and RPO you want to reach, you should regulate how frequent is the replication of the data from the main node to the secondary one.