A SIEM Odyssey: How Albert Einstein Would Have Designed Your SIEM Architecture

Albert Einstein taught us that there are four dimensions: the three physical dimensions plus time. The light generated by the sun exists the moment it leaves the sun, but it takes about eight minutes to reach the earth and exist in our environment. Many of the lights you see in the night sky were generated by stars millions of years ago, and those stars may no longer exist today.

The four dimensions of spacetime can teach us a lot about the universe, and they also offer a good lesson in SIEM architecture design. In SIEM environments, log data passes through various layers, introducing a delay between the data source and the destination. In a properly designed SIEM architecture, that delay should be minimal: a few minutes at most. But in an undersized SIEM architecture, delays between source and destination can be high, and in the worst cases, data may not reach the destination at all.

SIEM environments have three main layers. The first is the data sources: the various Windows servers, firewalls, and security tools that either send data to your SIEM or have data pulled from them by your SIEM. The second is the Processing Layer, which consists of applications (Connectors/Forwarders/Ingestion Nodes) designed to process and structure log data and forward it on. The final layer is the Analytics Layer, where log data is stored, security analytics are performed, and end users search for data.
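To make the flow concrete, here's a minimal, vendor-neutral sketch in Python. The event fields and function names are illustrative assumptions, not any SIEM product's API.

```python
# A minimal sketch of the three SIEM layers as a data flow.
from dataclasses import dataclass

@dataclass
class Event:
    source: str  # e.g. "firewall-01" (the data source)
    raw: str     # the original log line

def processing_layer(event: Event) -> dict:
    # Connector/Forwarder role: parse and structure the raw log, then forward it.
    return {"source": event.source, "message": event.raw.strip()}

analytics_store: list[dict] = []

def analytics_layer(structured: dict) -> None:
    # Analytics Layer role: store the structured event so analysts can search it.
    analytics_store.append(structured)

for raw_line in ["DENY tcp 203.0.113.7:443", "ALLOW udp 198.51.100.2:53"]:
    analytics_layer(processing_layer(Event(source="firewall-01", raw=raw_line)))

print(analytics_store)
```

Every hop in this chain takes time, which is where the spacetime analogy bites: the event "exists" at the source well before it exists in your analysts' searches.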

To highlight the risk an undersized SIEM architecture introduces into your organization, we'll use a DDoS attack as an example. The Bad Guyz Group has found a clever way to funnel millions out of your organization. Before they initiate the fraud scheme, they want to distract your organization from what is actually going on to buy themselves time, so they launch a large-scale DDoS attack against your web servers.

Your DDoS protections begin sending out alerts, notifying your SOC that an attack is underway but that there's no cause for concern at the moment. The amount of traffic directed at your web servers is increasing, but is nowhere near a level that would overwhelm your DDoS protections. Your SOC notifies leadership that even though there's an active attack in progress, there's nothing to worry about.

While your DDoS protection is working as expected, your SIEM Processing Layer is being flooded with four times its normal volume of firewall and proxy traffic. The Processing Layer was designed to handle only 10,000 events per second (EPS), and is now struggling to process a surge of 40,000 EPS. Cache files start to appear within minutes, growing at a rate of 100GB per hour, which will exhaust the available cache space within eight hours.
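For intuition, here's the back-of-the-envelope math behind those figures. The average event size and total cache capacity are assumptions, chosen so the numbers line up with the 100GB-per-hour and eight-hour figures in the story.

```python
# Rough math behind the surge scenario (EPS figures from the story above).
designed_eps = 10_000                 # what the Processing Layer was sized to handle
surge_eps = 40_000                    # EPS arriving during the DDoS
excess_eps = surge_eps - designed_eps # 30,000 events/sec spill into local cache

avg_event_bytes = 926                 # assumed average event size (~0.9 KB)
cache_growth_gb_per_hr = excess_eps * avg_event_bytes * 3600 / 1e9
print(f"cache growth: {cache_growth_gb_per_hr:.0f} GB/hour")    # ~100 GB/hour

cache_capacity_gb = 800               # assumed total cache across the layer
hours_to_exhaustion = cache_capacity_gb / cache_growth_gb_per_hr
print(f"cache exhausted in: {hours_to_exhaustion:.1f} hours")   # ~8 hours
```

The exact sizes will differ in your environment, but the shape of the problem is the same: once the inbound rate exceeds design capacity, the cache clock starts ticking.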

Hours later, SOC Analysts notice that the timestamps on most of the log data are several hours behind. They email the SIEM Engineers, asking them to check whether anything is wrong with the SIEM application. When the SIEM Engineers get out of their project meeting a few hours later, they log in to the servers and find the cache files, extremely high EPS rates, and maxed-out RAM/CPU usage. They then determine that the surge in data is coming from firewall and proxy logs. After a conversation with the SOC, the SIEM Engineers are finally informed of the DDoS attack that began earlier in the day.

Later in the evening, the SIEM Engineers warn leadership that the Processing Layer is dropping cached log data and refusing new connections, resulting in data loss. The average log data delay now stands at eight hours as the DDoS attack continues.

Fortunately, the DDoS attack stops the following morning, and the SIEM Processing Layer begins working through the backlog of cache files on the servers. The SIEM Engineers anticipate that the cache files should be completely cleared by 5 PM.

Later that morning, the SOC Manager gets a call from the Fraud team, asking if the SOC can see traffic to several IP addresses. The SOC Analysts begin searching, but response times are very slow and the latest data available is from last night. Just as the SIEM Engineers expected, all cache files are cleared by 5 PM, and analysts are searching data in real time again. They find only one hit from the IPs provided. The Fraud team insists there should be more than that, but the SIEM Engineers note that the other hits may have been dropped while the Processing Layer was refusing new connections during the surge in data. Leadership isn't happy, and calls for an immediate review of the SIEM environment.

The bad news is that many SIEM environments are not sized appropriately to deal with such scenarios, or with legitimate data surges in general. These situations can leave your organization blind to an attack in progress, as the data required for an investigation exists but is not yet available to your analysts, or worse, has been dropped from existence.

The good news is that you can significantly reduce both the probability and the severity of this scenario. While SIEM environments can be expensive, the costly part is typically the Analytics Layer, and for many organizations over-sizing that layer isn't an option. The Processing Layer, however, tends to be much less expensive; in some cases, expanding it costs little more than the additional physical or virtual servers required.

A SIEM Processing Layer should be sized significantly above your sustained average EPS rate. While this number determines SIEM application licensing costs, many organizations make the mistake of sizing their entire architecture to it. In addition to spikes, the traffic your SIEM receives during the day is likely to be much higher than at night. If your sustained average is 20,000 EPS, your daytime rate may run at 30,000 EPS while the overnight rate drops to the 5,000-10,000 EPS range. A spike in traffic during the day can then turn that 30,000 EPS into 60,000 EPS. In many SIEM environments, this would cause the Processing Layer to quickly exhaust its caches and begin dropping data. Supporting a large spike can be as simple as adding more devices (Connectors/Forwarders/Ingestion Nodes) to the Processing Layer; the added processing power and overall cache availability reduce the risk of log transmission delays and data loss.
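As a rough illustration, here's what that sizing logic looks like in Python. The 5,000 EPS per-node capacity is an assumed figure; real per-node throughput varies by product, parsing load, and hardware.

```python
# Sizing sketch: license metric vs. realistic design target.
import math

sustained_eps = 20_000    # the licensing metric -- don't size to this alone
daytime_eps = 30_000      # typical daytime peak
spike_factor = 2          # a surge can double the daytime rate
design_eps = daytime_eps * spike_factor        # 60,000 EPS design target

node_capacity_eps = 5_000 # assumed throughput of one Connector/Forwarder
nodes_for_license_rate = math.ceil(sustained_eps / node_capacity_eps)
nodes_for_design_rate = math.ceil(design_eps / node_capacity_eps)

print(f"sized to sustained average: {nodes_for_license_rate} nodes")  # 4
print(f"sized for daytime peak + spike: {nodes_for_design_rate} nodes")  # 12
```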

In addition to reducing the probability of data delay and loss, an over-sized Processing Layer brings high-availability benefits and can make migrations and upgrades easier. With a single point of failure, you can lose your Processing Layer entirely if that device fails. With only enough Connectors/Forwarders to handle your sustained EPS rate, you risk the scenario above whenever one device in the Processing Layer fails, as the remaining devices must absorb the extra load. And if you need to upgrade your Processing Layer, the extra devices let you take nodes out of service one at a time without operational impact.
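Continuing the sketch above, a quick N+1 check shows why the extra devices matter; the node counts and 5,000 EPS per-node capacity remain assumed figures.

```python
# N+1 check: can the Processing Layer still carry its peak with one node down?
def survives_node_loss(nodes: int, node_capacity_eps: int, peak_eps: int) -> bool:
    """True if the layer can absorb its peak EPS with one node failed."""
    return (nodes - 1) * node_capacity_eps >= peak_eps

# Sized only for the sustained average (4 nodes x 5,000 EPS = 20,000 EPS):
print(survives_node_loss(4, 5_000, 30_000))   # False -- one failure means caching/loss
# Sized with headroom (12 nodes = 60,000 EPS):
print(survives_node_loss(12, 5_000, 30_000))  # True -- upgrades can roll node by node
```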

While we may have solved the issue at the Processing Layer, a surge in data can also cause transmission delays, data loss, slower end-user search response times, and system stability issues at the Analytics Layer. It may seem logical to build an Analytics Layer that supports double the sustained EPS rate, but that can be cost-prohibitive for many organizations. An adequately sized Processing Layer can assist during surges by aggregating data (combining similar events into one, for SIEM products that support aggregation), caching it locally, and limiting the EPS-out rate to the Analytics Layer. Your SIEM Engineers can also prioritize some data over others: if there's a dire need for a particular data source, staff can throttle the remaining sources so the pertinent data reaches the Analytics Layer first.
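Here's a simplified sketch of that prioritization idea: cap the EPS-out rate toward the Analytics Layer and spend the budget on the most critical sources first. The source names, rates, and cap are all illustrative assumptions.

```python
# Priority-based forwarding during a surge: throttle EPS-out, cache the rest.
analytics_eps_cap = 10_000  # assumed maximum EPS the Analytics Layer should receive

# Incoming EPS per source during the surge; lower number = higher priority.
sources = [
    {"name": "authentication", "eps": 2_000,  "priority": 1},
    {"name": "edr",            "eps": 3_000,  "priority": 1},
    {"name": "proxy",          "eps": 20_000, "priority": 2},
    {"name": "firewall",       "eps": 15_000, "priority": 3},
]

budget = analytics_eps_cap
for src in sorted(sources, key=lambda s: s["priority"]):
    forwarded = min(src["eps"], budget)  # forward what the budget allows
    cached = src["eps"] - forwarded      # the rest waits in local cache
    budget -= forwarded
    print(f'{src["name"]:>14}: forward {forwarded:>6} EPS, cache {cached:>6} EPS')
```

In this sketch the authentication and EDR logs your analysts need right now flow through in real time, while the noisy proxy and firewall surge waits safely in cache instead of being dropped.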

In summary, there's a strong return on investment in building an adequate SIEM Processing Layer, given the low cost, risk reduction, and invaluable security benefits. Even with a minimally sized Analytics Layer, a properly sized Processing Layer will be able to absorb a surge of data, cache what can't be forwarded, prevent data loss, and reduce the risks caused by large increases in log data. Make the investment and leave the spacetime issues to the physicists!

Please like, share, or comment if you enjoyed this article. Thank you!