The Pros and Cons of Structuring Log Data at Ingestion Time with SIEMs
Another important but often overlooked part of a SIEM architecture and design or product analysis exercise is whether the product structures the data as it’s ingested/processed, and how that can affect your organization’s environment. This seemingly minuscule piece of functionality can have a significant effect on your SIEM environment, and can even introduce risk into your organization.
Let’s start with the advantages of structuring (parsing) log events at ingestion time. In general, structuring data as it’s ingested/processed gives you many opportunities to manipulate the data to your advantage.
Advantages of Structuring Log Data at Ingestion Time
1. Ability to Aggregate
Aggregation gives the SIEM tool the ability to combine multiple similar events into a single event. The biggest advantages of this are the reduction in EPS rates processed by the SIEM and the reduced storage requirements. Ten firewall deny events from the same source and destination can be combined into one, using 1,000 bytes of storage instead of 10,000 bytes. SIEM tools can typically aggregate events over a window of thirty seconds or more, so it’s common to see aggregation successfully combine hundreds of events into one. Use caution with aggregation settings, however: the longer the aggregation window (the maximum amount of time the Connector/Log Processor will wait for similar events to combine into one), the higher the memory requirement.
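To make the mechanics concrete, here is a minimal Python sketch of how ingest-time aggregation might work, assuming events are buffered and keyed on a few fields; the field names, the thirty-second window, and the 100-event cap are illustrative choices, not any particular vendor’s connector configuration.

```python
import time
from collections import defaultdict

# Illustrative aggregation key: events sharing these fields are treated as "similar".
AGG_FIELDS = ("action", "src_ip", "dst_ip", "dst_port")
AGG_WINDOW_SECONDS = 30      # maximum time to wait for similar events
MAX_EVENTS_PER_BUCKET = 100  # cap on how many events one aggregated record represents

buckets = defaultdict(lambda: {"count": 0, "first_seen": None, "sample": None})

def ingest(event, now=None):
    """Buffer a parsed event; emit one aggregated event when its bucket closes."""
    now = now if now is not None else time.time()
    key = tuple(event.get(f) for f in AGG_FIELDS)
    bucket = buckets[key]
    if bucket["count"] == 0:
        bucket["first_seen"] = now
        bucket["sample"] = event
    bucket["count"] += 1

    # A real connector would also flush buckets on a timer, not only on arrival.
    if now - bucket["first_seen"] >= AGG_WINDOW_SECONDS or bucket["count"] >= MAX_EVENTS_PER_BUCKET:
        aggregated = dict(bucket["sample"], event_count=bucket["count"])
        del buckets[key]
        return aggregated    # one event stored instead of many
    return None              # still buffering inside the window
```

Note how every open bucket is held in memory until it closes, which is exactly why a longer aggregation window raises the Connector/Log Processor’s memory requirement.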
2. Ability to Standardize Casing
Most SIEM tools can easily standardize the casing of all fields parsed by the Connector/Log Processor. This is an often-overlooked benefit of structuring data: the various data sources in your environment will log in various casings, which introduces a potential security risk, namely that your security analysts may get null search results when the data they need is actually there.
In several investigations, SOC analysts were having issues getting hits for a particular user’s data. Upon closer inspection, we found the desired data, and discovered that the initial searches were coming up null because the casing in the searches did not match that of the log data. The SOC analysts were searching for “frank,” but the SIEM tool was configured to be case-sensitive, and the Windows logs were being logged in uppercase as “FRANK.” Thus, by simply having the Connector standardize casing for particular fields, you can minimize the above scenarios.
Standardizing casing can also increase search performance. When the SIEM tool only has to search in one case, it has fewer string variations to consider. A case-sensitive search for “FRANK” only requires the tool to match one exact string instead of all thirty-two case variations of a five-letter name (FRANK, frank, Frank, FRank, FRAnk, and so on).
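As a rough illustration, a connector-side normalization step could be as simple as the following Python sketch, which lowercases a configurable set of fields at parse time; the field names are hypothetical, and which fields you normalize is a design decision for your environment.

```python
# Fields whose values vary in casing across data sources (illustrative list).
CASE_NORMALIZED_FIELDS = {"user_name", "host_name", "domain"}

def normalize_casing(event):
    """Lowercase selected fields so 'FRANK', 'Frank', and 'frank' are stored identically."""
    return {
        field: value.lower()
        if field in CASE_NORMALIZED_FIELDS and isinstance(value, str)
        else value
        for field, value in event.items()
    }

# A Windows source logging "FRANK" and a Linux source logging "frank"
# now both store "frank", so a single case-sensitive search matches both.
print(normalize_casing({"user_name": "FRANK", "event_id": "4624"}))
```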
Why not simply disable case-sensitivity for searches? This seems like the best option, as the risk of missing data is mitigated. However, the major disadvantage is the extra processing power required for the searches. For smaller environments this is practical and the effects will likely be negligible, but in larger environments where the SIEM is processing several thousand events per second and serving many end users, the performance impact can be noticeable.
A practical workaround is to disable case-sensitivity for particular searches at the discretion of the analyst. Many SIEM tools offer a per-search toggle for this very reason.
A best practice that mitigates missing data due to casing issues while maximizing performance is to start with a generic case-insensitive search, and once you get hits on the data you’re searching for, switch to the casing you see. For example, if you’re looking for user Frank’s Windows logs, start with a short (e.g., a few minutes) case-insensitive search for “frank,” and once you see that the Windows logs record it as “Frank,” switch to the proper casing and then expand the search. This is a practical option that helps analysts avoid missing data without requiring you to configure your tool to be case-insensitive.
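The workflow looks roughly like the following Python sketch, which stands in for whatever query language your SIEM exposes; the sample log lines and the regex-based probe are purely illustrative.

```python
import re

sample_logs = [
    "01-11-2018 14:12:22 WIN-DC01 Logon succeeded for user Frank",
    "01-11-2018 14:12:30 fw01 deny tcp 10.1.1.1 -> 10.2.2.2",
]

# Step 1: short, case-insensitive probe to discover how the source logs the name.
probe_hits = [line for line in sample_logs if re.search("frank", line, re.IGNORECASE)]
print(probe_hits)  # reveals the source logs the user as "Frank"

# Step 2: expand the time range using the exact casing observed, which is cheaper to match.
exact_hits = [line for line in sample_logs if "Frank" in line]
print(exact_hits)
```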
Regardless of how you choose to configure case-sensitivity, simply ensure your staff understand how your environment works and the best practices for searching your data.
3. Ability to Add and Modify Fields
Many SIEMs can append data to existing fields, overwrite fields with new data, and modify the values placed into fields. A common nuisance when searching log data is that some systems log their FQDN (e.g. server01.ca.companya.com) while others log only the server name (server01). This carries a risk similar to the case-sensitivity issue: SOC analysts search for Device Host Name =”server01” but get no results because the server appears in the logs as “server01.ca.companya.com”. This forces the SOC analyst to fall back to a wildcard search such as Device Host Name =”server01*”, which ultimately requires more processing power from the SIEM.
When data is structured/parsed at ingestion time, the SIEM can simply look for the first period, strip everything that follows it, and place that remainder into another field. Using “server01.ca.companya.com” as an example, the parser can leave server01 in the host name field and put the stripped ca.companya.com into a domain field. Analysts then know they only need to search the host name field for the server name, and the domain field if they want to know the server’s domain.
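Here is a minimal Python sketch of that split, assuming hypothetical device_host_name and device_domain field names; a real parser would express the same logic in the vendor’s own parser syntax.

```python
def split_fqdn(value):
    """Split 'server01.ca.companya.com' at the first period into host name and domain."""
    host, _, domain = value.partition(".")
    return host, domain       # domain is "" when the source logged only the short name

event = {"device_host_name": "server01.ca.companya.com"}
host, domain = split_fqdn(event["device_host_name"])
event["device_host_name"] = host       # analysts search "server01" here
if domain:
    event["device_domain"] = domain    # "ca.companya.com" lands in its own field
print(event)  # {'device_host_name': 'server01', 'device_domain': 'ca.companya.com'}
```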
Now that we’ve fallen in love with the advantages of structuring data at ingestion time, let’s look at the disadvantages before we leave for the honeymoon.
The Disadvantages of Structuring Log Data at Ingestion Time
1. Increased Event Size
The first, and potentially most costly, aspect of structuring the data at ingestion time is the increase in event size. When you structure the data, you increase the size of the event, in many cases doubling it or more. Please see my related article, The Million Dollar SIEM Question: To Parse or Not To Parse, for more on this.
2. Potential Data Loss and Integrity Issues
Because your parser is instructed to place values into specific fields (for example, taking the value after the second comma and putting it into the user name field), you face potential data loss and integrity issues if the parser is not updated at the same time a data source changes its logging format.
Let’s take a look at a sample log event:
01-11-2018 14:12:22, 10.1.1.1, frank, authentication, interactive login, successful
The parser takes the timestamp from the characters before the first comma, the IP address after the first comma, the user name after the second comma, the type of event after the third comma, the type of login after the fourth comma, and finally the outcome after the fifth comma. All is well until the vendor decides to add a new field, an event code, and change the order of the fields:
01-11-2018 14:12:22, 4390000, 10.1.1.1, authentication, interactive login, successful, frank
This is a simple change, but your parser needs to be updated to ensure the values land in the correct fields and to capture the new field. Should the new format go live without a corresponding parser update, not only will data end up in the wrong fields, we will also not know who performed the login, because the value “frank” now falls outside the positions the parser reads.
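A minimal Python sketch makes the failure mode visible, assuming a simple positional parser written for the original format; the field names are illustrative.

```python
# Parser written for the original format:
# timestamp, ip, user, event_type, login_type, outcome
FIELD_ORDER = ["timestamp", "src_ip", "user_name", "event_type", "login_type", "outcome"]

def parse(line):
    """Positional CSV parse; zip silently drops any values beyond the expected six."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELD_ORDER, values))

old = "01-11-2018 14:12:22, 10.1.1.1, frank, authentication, interactive login, successful"
new = "01-11-2018 14:12:22, 4390000, 10.1.1.1, authentication, interactive login, successful, frank"

print(parse(old)["user_name"])  # "frank" -- correct
print(parse(new)["user_name"])  # "10.1.1.1" -- wrong field, and "frank" is dropped entirely
```

The parser raises no error at all; the data simply lands in the wrong fields and the user name disappears, which is why parser updates must ship alongside logging-format changes.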
3. Increased System Requirements
The more modifications the parser has to make to each event, the more processing power the Connector/Forwarder/Processor will consume. Ensure sufficient system resources are available to handle the required modifications.
A Summary of the Pros and Cons of Structuring Data at Ingestion Time: