The Million Dollar SIEM Question: To Parse or Not To Parse
Given that SIEMs process and store data, one of the major requirements of a successful SIEM environment is proper and sufficient storage. Depending on your organization’s SIEM requirements, the cost of storage alone for your SIEM can exceed application licensing costs.
The most common omission in a SIEM product selection analysis is differentiating how the proposed applications process and store data. SIEMs process and store data differently, and thus will all produce different storage requirements. Given that storage is a major cost of your environment, how the application processes and stores data can alter your storage costs significantly.
Traditional SIEM products, as well as some newer ones, are designed to parse data as it’s ingested, and thus store each event in parsed form. Products also parse the same data into different field sets: a Windows event in SIEM Product A can be parsed into 200 fields, while SIEM Product B will parse it into 193 fields. This results in a different event size for the same data. Other newer SIEM products do not parse data as it’s ingested; they store the data raw and parse it only when required, e.g. when you run a query or a report.
There are advantages and disadvantages to both. Parsing (or normalizing, enhancing, or enriching) the data gives it structure by adding applicable metadata. For example, the log entry ‘jsmith, 10.1.1.1, failed login’ would appear parsed as ‘username=jsmith, IPAddress=10.1.1.1, event name=failed login’. While this makes the data more organized and allows for more refined searches, it also makes the log entry bigger. A 500 byte raw event can turn into a 1,000 byte parsed event, doubling the storage requirement for this event.
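To make the size difference concrete, here is a minimal sketch that parses the sample log entry into key=value form and compares byte counts. The field names and the helper functions `parse_event` and `render_parsed` are hypothetical and only mirror the example above; real SIEM products use their own schemas and storage formats.

```python
# Illustrative field names taken from the example; real SIEM schemas differ.
FIELD_NAMES = ["username", "IPAddress", "event name"]


def parse_event(raw: str) -> dict:
    """Split a comma-separated raw log line into named fields."""
    values = [part.strip() for part in raw.split(",")]
    return dict(zip(FIELD_NAMES, values))


def render_parsed(event: dict) -> str:
    """Render the parsed event as key=value pairs."""
    return ", ".join(f"{key}={value}" for key, value in event.items())


raw = "jsmith, 10.1.1.1, failed login"
parsed = render_parsed(parse_event(raw))

raw_size = len(raw.encode("utf-8"))
parsed_size = len(parsed.encode("utf-8"))

print(parsed)
print(f"raw: {raw_size} bytes, parsed: {parsed_size} bytes")
```

In this toy case the added field names alone double the size of the entry (30 bytes raw, 60 bytes parsed), which is the same effect the 500 byte versus 1,000 byte example describes.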
While parsing increases the size of the event, the SIEM tool does gain the ability to manipulate the data. Many SIEMs that structure data can aggregate events, taking multiple similar events and combining them into one. For example, the event ‘source address=10.1.1.1, source port=9022, event name=deny, destination address=123.45.67.89, destination port=443’ that occurs 10 times can be combined into one, with an extra field added, e.g. event count=10, to indicate how many times the event occurred. Thus, 10 events at 1,000 bytes each use roughly 1,000 bytes of storage with the structured SIEM application instead of 10,000 bytes.
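Aggregation of this kind can be sketched in a few lines. The `aggregate` function below is a hypothetical illustration, not how any particular SIEM implements it: it treats events as identical only when every field matches and assumes fields arrive in a consistent order. Real products typically aggregate within a time window and on a configurable subset of fields.

```python
from collections import Counter


def aggregate(events):
    """Collapse identical events into one record with an 'event count' field."""
    # Each event becomes a hashable tuple of (field, value) pairs so it can be counted.
    counts = Counter(tuple(event.items()) for event in events)
    aggregated = []
    for fields, count in counts.items():
        record = dict(fields)
        record["event count"] = count
        aggregated.append(record)
    return aggregated


deny_event = {
    "source address": "10.1.1.1",
    "source port": "9022",
    "event name": "deny",
    "destination address": "123.45.67.89",
    "destination port": "443",
}

events = [dict(deny_event) for _ in range(10)]  # the same event seen 10 times
result = aggregate(events)

print(len(events), "events in,", len(result), "event out")
print(result[0])
```

Ten stored records shrink to one, at the cost of a single extra field.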
To highlight how this can affect your organization, let’s look at Company A as an example. The technical staff at Company A have been reading about the benefits of SIEM from some guy on the Internet, and decide they want to implement one. They want 2 months of online data followed by 10 months offline, and for the data to be highly available. Their storage team can provide high-speed storage for $5,000 per terabyte, and lower-speed storage for $2,000 per terabyte.
Next, they’ve invited a couple of vendors in for a product overview, and they’re making each complete the SIEM Storage Requirements and Costs spreadsheet they created.
First up for Company A is Product A. Product A does not parse event data as it’s ingested and stores logs in raw format. It gets up to 50% compression on live data, and 85% compression on archived data. The product can replicate data and thus easily meet the high availability requirement. The tech staff at Company A also went the extra mile and determined that the average log event size from all their systems is 700 bytes.
A summary of the requirements and weights so far:
Again, since Product A stores data only in raw format, the Average Normalized Event Size is 0, as it does not store a parsed/normalized event. Product A cannot aggregate data, so there is no Aggregation Benefit. The product will create a copy of each event, bringing the replication factor to 2.
Next, we’re going to determine the average sustained events per second rate from the numbers the tech staff provided. Based on the total number of devices, the SIEM will need sufficient storage to store 5,000 Total Average EPS (events per second) at 700 bytes per event. Using the SIEM Storage Requirements and Costs spreadsheet, we get the following table.
The total daily uncompressed storage requirement, counting both copies of each event, is 607 GB. At 50% compression, the Total Online Storage Requirement for two months is approx. 18 TB. At 85% compression, the Total Offline Storage Requirement for ten months is approx. 28 TB. Before rounding, the online storage at $5,000 per TB costs approx. $91,000, and the offline storage at $2,000 per TB adds another $55K, bringing the total to approx. $146,000.
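The arithmetic behind these figures can be reproduced with a short calculation. This is a sketch under stated assumptions, not the spreadsheet itself: `storage_estimate` is a hypothetical helper, terabytes are decimal (10^12 bytes), and two months and ten months are taken as 60 and 305 days. The results land close to the figures above but not exactly on them, since the spreadsheet's own day counts and units are not given here.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def storage_estimate(eps, event_bytes, copies,
                     online_days, offline_days,
                     online_compression, offline_compression,
                     online_cost_per_tb, offline_cost_per_tb):
    """Return daily, online and offline storage in TB, plus the storage cost."""
    daily_tb = eps * event_bytes * SECONDS_PER_DAY * copies / BYTES_PER_TB
    online_tb = daily_tb * online_days * (1 - online_compression)
    offline_tb = daily_tb * offline_days * (1 - offline_compression)
    cost = online_tb * online_cost_per_tb + offline_tb * offline_cost_per_tb
    return {"daily_tb": daily_tb, "online_tb": online_tb,
            "offline_tb": offline_tb, "cost": cost}


# Product A at Company A: raw events, no aggregation, replication factor 2.
product_a = storage_estimate(
    eps=5_000, event_bytes=700, copies=2,
    online_days=60, offline_days=305,  # assumed day counts for 2 + 10 months
    online_compression=0.50, offline_compression=0.85,
    online_cost_per_tb=5_000, offline_cost_per_tb=2_000,
)

for name, value in product_a.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the sketch gives roughly 0.60 TB per day, 18.1 TB online, 27.7 TB offline, and a cost of about $146,000.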
Up next is Product B, which is a SIEM tool that’s designed to work with structured data. It will parse log events and create an Average Normalized Event Size of 2,000 bytes, and will be able to reduce events by 40% through aggregation. It can’t replicate data, but the Connector/Forwarder will be configured to send to 2 destinations. And by complete chance, it has the exact same compression ratios.
Product B is going to process the exact same raw EPS, but the Aggregation Benefit drops the sustained Total Average EPS to 3,000.
The total daily uncompressed storage requirement is approx. 1 TB. At 50% compression, the total Online Storage Requirement is 31.2 TB. At 85% compression, the total Offline Storage Requirement is 47.6 TB. 31.2 TB at $5,000 per TB brings the cost to $156,000, and 47.6 TB at $2,000 adds another $95K, bringing the total to approx. $251,000.
As you can see, for Company A, the sample insurance company, Product A produces a significantly different storage cost than Product B due to the way the products process and store data. The tech staff at Company A like Product B better, but know that it will be tough to sell the VPs a solution that will cost roughly $500,000 more over the next five years.
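One way to see why Product B loses here is to ask how much aggregation it would need just to break even. If both products keep the same number of copies and compress equally well, storage is proportional to stored EPS times event size, so the parsed product matches the raw one when its aggregation benefit equals 1 minus the ratio of raw to normalized event size. The function below is a hypothetical helper illustrating that relationship under those assumptions.

```python
def break_even_aggregation(raw_event_bytes, normalized_event_bytes):
    """Aggregation fraction at which parsed storage equals raw storage.

    Assumes the same replication factor and compression for both products.
    """
    if normalized_event_bytes <= raw_event_bytes:
        return 0.0  # the parsed event is no larger, so no aggregation is needed
    return 1 - raw_event_bytes / normalized_event_bytes


# Company A: 700 byte raw events versus 2,000 byte normalized events.
needed = break_even_aggregation(700, 2_000)
achieved = 0.40  # Product B's stated Aggregation Benefit

print(f"break-even aggregation: {needed:.0%}, achieved: {achieved:.0%}")
```

For Company A the break-even point is 65%, well above the 40% Product B is expected to achieve, which is why its storage bill comes out higher.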
However, the result could be completely different at Company B, which has different requirements than Company A. Company B is a telco that wants their IT staff to monitor their network infrastructure with a SIEM. The tech staff at Company A were generous enough to share the SIEM Storage Requirements and Costs spreadsheet with their buddies at Company B.
Company B decides to bring in Product A first and has them fill out the spreadsheet. As the environment mainly consists of network devices, the Average Raw Event Size is going to be 450 bytes. Again, since Product A doesn’t parse log data, there is no Average Normalized Event Size or Aggregation Benefit.
Next, we can calculate a total daily uncompressed storage requirement of 1.2 TB from the 15,780 EPS rate. That will produce an online storage requirement of 37 TB and offline of 56 TB.
That will bring the total yearly storage cost to approx. $300,000.
Next, the tech staff at Company B bring in Product B for an overview.
The environment, made up mostly of network devices, will produce an Average Normalized Event Size of 1,500 bytes, and the Aggregation Benefit will be very high for the firewall data, for an estimated overall benefit of 80%.
Through the strong aggregation benefit, the total EPS is reduced from nearly 16,000 to just over 3,000. As a result, the storage requirements for Product B are 24.5 TB for online, and 37.4 TB for offline, respectively.
So for Product B, the yearly storage cost sits at approx. $200,000.
The tech staff at Company B like Product A better for some reason, but as at Company A, they know it will be tough to justify the extra $500,000 in storage costs over the next five years.
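The four scenarios can be put side by side with a compact calculation. As before, this is a sketch and not the spreadsheet: `yearly_cost` is a hypothetical helper that assumes decimal terabytes, 60 online days, 305 offline days, the 50% and 85% compression figures, the $5,000 and $2,000 per TB prices, and two copies of each event at both companies (the daily totals quoted above imply two copies at Company B as well).

```python
SECONDS_PER_DAY = 86_400


def yearly_cost(raw_eps, event_bytes, aggregation=0.0, copies=2,
                online_days=60, offline_days=305):
    """Estimated storage cost in dollars for one year of retained data."""
    stored_eps = raw_eps * (1 - aggregation)
    daily_tb = stored_eps * event_bytes * SECONDS_PER_DAY * copies / 1e12
    online_tb = daily_tb * online_days * (1 - 0.50)    # 50% compression
    offline_tb = daily_tb * offline_days * (1 - 0.85)  # 85% compression
    return online_tb * 5_000 + offline_tb * 2_000


scenarios = {
    ("Company A", "Product A"): yearly_cost(5_000, 700),
    ("Company A", "Product B"): yearly_cost(5_000, 2_000, aggregation=0.40),
    ("Company B", "Product A"): yearly_cost(15_780, 450),
    ("Company B", "Product B"): yearly_cost(15_780, 1_500, aggregation=0.80),
}

for (company, product), cost in scenarios.items():
    print(f"{company} / {product}: ${cost:,.0f} per year")

# Five-year difference between the two products at each company.
five_year_gap = {
    company: 5 * abs(scenarios[(company, "Product A")]
                     - scenarios[(company, "Product B")])
    for company in ("Company A", "Company B")
}
for company, gap in five_year_gap.items():
    print(f"{company}: ${gap:,.0f} difference over five years")
```

Under these assumptions the cheaper product flips between the two companies, and the five-year gap comes out near $500,000 in both cases.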
So as you can see, To Parse Or Not To Parse is not some Shakespearean-sounding cliché by some geek on the Internet. Requirements such as storing two copies of each event, storing both the raw and normalized event (add the storage costs for Product A and B together, roughly!), or retaining two years of log data can create tremendous storage cost differences over the tenure of your SIEM environment.
The best product for your company will be that which meets your requirements best, and storage costs alone should not be the deciding factor for which product is selected for your organization. There are many advantages and disadvantages to leaving data in raw format or parsing it, but as the data shows, you can’t ignore a question that can really be in the millions!
For the record, the lads at Company A have shared the SIEM Storage Requirements and Costs spreadsheet.