Standard-Size your SIEM HA and DR Environments

A common decision made when designing a highly available or disaster recovery enabled SIEM solution is to under-size the secondary environment with fewer resources than the main production environment. Many believe that losing a server to a hardware failure or application due to corruption is highly unlikely, and if such a situation were to occur, a system with fewer resources can suffice while the primary system is down. Thus, with a minute probability of having a server or application failure, many believe they can get away with fewer RAM, cores, and disk space on the HA or DR server(s). After all, none of us want to bet on something that isn’t likely to happen.

For many organizations, this may be an acceptable risk for many reasons, including budgetary restrictions, other security compensating controls, risk appetite, and others. For example, a small company simply may not have the budget for a highly available SIEM solution. For others, it may be that their other security applications provide compensating controls, where their analysts can obtain log data from other sources.

For organizations looking to implement high availability in some fashion, it’s important to understand how small differences can have a major impact.

To highlight what can happen, let’s use Company A as an example. Company A is designing a new SIEM estimated to process 10,000 EPS. The company has requested additional budget for HA capabilities but want to save some of the security budget for another investment. They find an unused server in the data centre that has fewer RAM, cores, and disk space, and decide to use it as the DR server. They ask their SIEM vendor for their thoughts, and the vendor replies that the reduced system resources should only result in slower search response times, and the 2 TB hard drive should provide just over four days of storage (given 10,000 EPS X 2,500 bytes per normalized event X 86,400 seconds/day X 80% compression = 432GB/day after compression). Company A accepts the risk, thinking that any system issue should be addressed within their 24-hour hardware SLA, and that they should have the application back up in two days at most.

A few months down the road, the Production SIEM system fails. The operations staff at Company A quickly reconfigure the SIEM Processing Layer (Connectors/Collectors/Forwarders) to point to the DR server. Log data begins ingesting into the DR server and is searchable by the end users. The security team pats themselves on the back for averting disaster.

However, things go sour when the server admins learn that the hardware vendor won’t be able to meet the 24-hour SLA. Three days pass and the main Production server still remains offline. While the DR server is still processing data and searches are slower but completing, the security team becomes anxious as the 2TB hard drive approaches 90% utilization on the fourth day.

When disk capacity is fully utilized early the following morning, the SIEM Analytics Layer (where data is stored) begins purging data greater than four days old, diminishing the security team’s visibility. The purging jobs are also adding stress to the already fully utilized cores, which are also now causing searches to time out. The Analytics Layer begins refusing new data, forcing the SIEM Processing Layer to cache data locally on the server. That is also a concern for the security team since the single Processing Layer server only has 500GB of disk space, and the 400GB of available cache at this rate would be fully utilized in roughly 4 hours (given 10,000 EPS X 2,500 bytes per normalized event X 14,400 seconds = 360 GB uncompressed) as the SIEM’s processing application can’t compress data (this SIEM can only compress at the Analytics Layer).

As you can see, overlooking a somewhat minor design consideration, making assumptions, relying on SLAs, and so on, can have major impacts on your environment and reduce the utility of your SIEM in a disaster. As SIEMs consist of multiple applications (Connectors, Analytics), the required high availability should be considered for the various components. Forgoing high availability may need to be done for budgetary or other reasons, but it’s critical to ensure that the environment is aligned to your requirements and risk appetite.