What’s the big deal with log data?

So what’s the big deal with all this log data, and why on earth should I spend a large chunk of my budget to collect it? Aren’t the other security tools I have good enough? What exactly is in all this log data, anyway?

Log data is exactly what its name says: data, one of today’s most valuable assets. Google, Twitter and Facebook collect enough data on people to detect flu outbreaks faster than medical professionals can. GPS data gave a software company the opportunity to become the world’s largest taxi service without owning a single taxi. A computer algorithm can recommend a movie you’d like to watch and spare you from having to read critics’ reviews. Amazon can tell you what book you’d like to read next or what household products you may be running low on.

In the context of cyber security, log data contains records of activity from your various IT systems. These records can help you understand what goes on inside your network. They can show you which user accounts are being used. They can show you which users are consistently visiting blocked websites. They can show you the suspicious files being blocked by your endpoint protection application. They can highlight suspicious processes running on your servers. They can tell you which exploits your web servers are vulnerable to and if anyone is trying to attack them. Ultimately, they can uncover activity in your network that is adding risk to your organization.

Log data is typically output to a file or database, where it was traditionally used for troubleshooting purposes. If someone couldn’t log into a particular application, the system admins would check the log files to see if they could find out why. If a customer application was down, the support team would check the log files to see if they could find out the cause of the crash.

As the amount of log data grew, many saw that the files sitting on their servers contained invaluable data. Many applications were born to manage all of this data, helping organizations search through it and detect issues before they became outages. In the early 2000s, some programmers with a security mindset thought of creating an application that would act as a centralized repository of log data for security investigators and alert in real time when particular values or suspicious patterns were detected in the log data. The result was the birth of SIEM: Security Information and Event Management.

Let’s take a quick peek at some log data. Here’s a small sample of authentication activity, in which a user fails to log in twice and then successfully logs into their workstation.

-May 1 2018 1:00PM, IP=10.1.1.1, User=Bob, Message=login failure
-May 1 2018 1:01PM, IP=10.1.1.1, User=Bob, Message=login failure
-May 1 2018 1:02PM, IP=10.1.1.1, User=Bob, Message=login success
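To make the sample above concrete, here is a minimal sketch of how such a line could be split into named fields. The function name `parse_line` and the assumption that every field after the timestamp is a comma-separated `key=value` pair are mine, based only on the sample format shown here; real products use their own parsers.

```python
from datetime import datetime

def parse_line(line):
    """Split one sample log line into a dict of fields.

    Assumes the illustrative format used in this article:
    "-<timestamp>, key=value, key=value, ..."
    """
    parts = [p.strip() for p in line.lstrip("-").split(",")]
    record = {"timestamp": datetime.strptime(parts[0], "%b %d %Y %I:%M%p")}
    for part in parts[1:]:
        key, _, value = part.partition("=")
        # Lowercase the key so "Message" and "message" land in the same field.
        record[key.strip().lower()] = value.strip()
    return record

record = parse_line("-May 1 2018 1:00PM, IP=10.1.1.1, User=Bob, Message=login failure")
print(record["user"], record["message"], record["timestamp"].hour)
```

Once the line is a dictionary, questions such as "which user?" or "from which IP?" become simple field lookups instead of text searches.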

Most log files will at minimum answer who, what, when, where, why, and how. Given the advent of SIEMs, most vendors now provide detailed logging for their applications, and some even allow you to customize what is output.

Here you can see a couple of punctual users logging into their company network in the morning, generating VPN login data:

-May 1 2018 8:50AM, IP=23.91.128.44, User=John, message=VPN Login Success
-May 1 2018 8:54AM, IP=23.95.148.12, User=Bob, message=VPN Login Success

Log files can also be specific to an application. Here we have some startup activity on the billing server:

-May 1 2018 9:54AM, hostname=billingserver01, message:NOTICE: Application starting
-May 1 2018 9:55AM, hostname=billingserver01, message:NOTICE: Running startup scripts

That’s great, you may think, but why should you devote resources to collect and manage this data? Let’s expand the above entries and see what the big deal is.

Using the authentication activity again:

-May 1 2018 1:00PM, IP=10.1.1.1, User=asmith, Message=login failure
-May 1 2018 1:01PM, IP=10.1.1.1, User=bsmith, Message=login failure
-May 1 2018 1:02PM, IP=10.1.1.1, User=csmith, Message=login failure
-May 1 2018 1:03PM, IP=10.1.1.1, User=dsmith, Message=login failure
-May 1 2018 1:04PM, IP=10.1.1.1, User=esmith, Message=login failure
-May 1 2018 1:05PM, IP=10.1.1.1, User=fsmith, Message=login failure
-May 1 2018 1:06PM, IP=10.1.1.1, User=gsmith, Message=login failure
-May 1 2018 1:07PM, IP=10.1.1.1, User=hsmith, Message=login failure

These log entries become interesting now that someone is trying to log into the billing server using incremental variations of “smith.” This small story could be many things: a developer testing something, a script running in the background, or someone trying to guess a username in an attempt to gain unauthorized access to the server.
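The pattern in these entries, one source failing against many different usernames, can be sketched as a simple check. The function name `spraying_sources` and the threshold of five distinct users are illustrative assumptions, not settings from any particular SIEM.

```python
from collections import defaultdict

def spraying_sources(events, min_users=5):
    """Return source IPs that failed logins for at least min_users distinct users."""
    users_by_ip = defaultdict(set)
    for ip, user, message in events:
        if message == "login failure":
            users_by_ip[ip].add(user)
    return {ip: sorted(users)
            for ip, users in users_by_ip.items()
            if len(users) >= min_users}

# The eight failures from the sample, plus one unrelated failure for contrast.
events = [("10.1.1.1", c + "smith", "login failure") for c in "abcdefgh"]
events.append(("10.1.1.2", "Bob", "login failure"))

print(spraying_sources(events))
```

A check like this cannot tell a test script from an attacker, which is why the result is a lead for an analyst rather than a verdict.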

Let’s take a look at the VPN log again:

-May 1 2018 8:50AM, IP=23.91.128.44, User=Bob, message=VPN Login Success
-May 8 2018 8:55AM, IP=23.91.128.44, User=Bob, message=VPN Login Success
-May 15 2018 8:52AM, IP=23.91.128.44, User=Bob, message=VPN Login Success
-May 22 2018 8:59AM, IP=23.91.128.44, User=Bob, message=VPN Login Success
-May 29 2018 8:44AM, IP=23.91.128.44, User=Bob, message=VPN Login Success
-May 29 2018 9:30PM, IP=62.176.64.51, User=Bob, message=VPN Login Success

Nothing unusual about Bob being his punctual self logging into work, except that “he” logged in from Bulgaria at about 9:30PM on May 29. Scenarios like this could be Bob on a business trip, or not Bob at all.
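A minimal sketch of the idea behind spotting that last login: remember which source IPs each user has logged in from, and flag the first appearance of a new one. The function name `new_source_logins` is hypothetical, and mapping an IP address to a country would need a geolocation lookup that is not shown here.

```python
def new_source_logins(logins):
    """Return logins from an IP the user has not used before.

    The very first login of each user is treated as the baseline
    and is not flagged.
    """
    seen = {}
    flagged = []
    for when, ip, user in logins:
        known = seen.setdefault(user, set())
        if known and ip not in known:
            flagged.append((when, ip, user))
        known.add(ip)
    return flagged

logins = [
    ("May 1 2018 8:50AM", "23.91.128.44", "Bob"),
    ("May 8 2018 8:55AM", "23.91.128.44", "Bob"),
    ("May 15 2018 8:52AM", "23.91.128.44", "Bob"),
    ("May 22 2018 8:59AM", "23.91.128.44", "Bob"),
    ("May 29 2018 8:44AM", "23.91.128.44", "Bob"),
    ("May 29 2018 9:30PM", "62.176.64.51", "Bob"),
]

print(new_source_logins(logins))
```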

Finally, let’s take a look at some file executions in a log file. Here is a sample of a system updating itself, but for some reason the last file executed doesn’t look like a standard update file, which could be indicative of a malicious file being executed.

-May 4 2018 1:10AM, hostname=billingserver01, msg=file “update_01.exe” executed
-May 4 2018 1:13AM, hostname=billingserver01, msg=file “update_02.exe” executed
-May 4 2018 1:15AM, hostname=billingserver01, msg=file “update_03.exe” executed
-May 4 2018 1:50AM, hostname=billingserver01, msg=file “A2.exe” executed
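One simple way to surface the odd entry is to compare executed file names against the naming pattern you expect. The pattern `update_<digits>.exe` below is an assumption drawn only from this sample; a real environment would need its own list of known-good names.

```python
import re

# Hypothetical naming convention inferred from the sample entries.
EXPECTED = re.compile(r"update_\d+\.exe")

def unexpected_files(names):
    """Return executed file names that do not match the expected pattern."""
    return [name for name in names if not EXPECTED.fullmatch(name)]

executed = ["update_01.exe", "update_02.exe", "update_03.exe", "A2.exe"]
print(unexpected_files(executed))
```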

As you can see, log data can contain invaluable data that can help your organization investigate suspicious activity and detect attacks in real time. Log data can indicate issues brewing in your systems that can be caught before an outage or breach occurs. SIEM is a technology that centralizes log data, makes it available for searching, allows staff to alert on suspicious activity, and ultimately enhances the efficiency and effectiveness of your organization’s security operations.

If Milton Friedman Created Your SIEM Team

When you mix an economist with the Godfather, you get an offer you can’t understand. But when you mix the philosophy of a famous economist with your SIEM team, you can create a high-performing team that continuously improves the environment, plans accordingly, creates better use cases, and ultimately reduces the probability of your phone ringing on a Friday afternoon for a SIEM issue.

Milton Friedman was one of the 20th Century’s most influential economists. Without going into detail or starting a debate on economic policy, he argued that a single owner would take better care of something than multiple entities or an unclear entity. The single owner likely has a direct interest in the value of it and will maintain it better than an entity that doesn’t. And thus his famous quote:

“When everybody owns something, nobody owns it, and nobody has a direct interest in maintaining or improving its condition.”
– Milton Friedman

A SIEM is likely one of your more complicated security products to manage, and needs far more customization than the other black-boxed security applications your vendors manage for you. Not only do you need to manage the content and use cases, you need to manage the data feeds, ensure data is parsing correctly, troubleshoot issues with the application, support SIEM end-users, and plan for growth. All this effort requires input from various teams within your organization. Given the multiple teams involved, it’s critical to establish accountability and know who is responsible for what part of the environment.

SIEM Environment Requirements

The first requirement of any SIEM solution is clear, single ownership: an entity that has a direct interest in improving and maintaining the overall SIEM environment, and is ultimately accountable for its entire operation. Without clear ownership, staff and end users will be discouraged from escalating issues. Teams will not have a dispute mechanism, and instead of resolving issues, they will point the finger at each other. Those issues will then be brushed under the rug, and will result in a major outage or security issue down the road for leadership to deal with. Work will not be distributed appropriately, and highly skilled staff who are overworked will leave, taking valuable knowledge and training investments with them. Relationships between the teams will be strained, and ultimately entropy will overrun your environment, at which point significant investment will be needed to turn it around.

The second requirement of a SIEM solution is a healthy, teamwork-oriented environment. Given that many teams are involved in the implementation and operation of your organization’s SIEM, positive and open communication between the teams is required for issues to be raised, work to be assigned to the appropriate teams, and for knowledge to be shared. Healthy teams will raise pertinent issues to leadership and resolve them quicker than teams that don’t. Healthy teams share valuable knowledge and train each other. All of this contributes to a work environment that retains staff, and attracts new talent into the team.

The third requirement of a SIEM solution is a strong skillset. SIEM environments are complicated, and you’ll need many skills to manage them, from architecture and design planning, parser development, and rule logic development to the social skills required to obtain and maintain data from other teams. Before making investments in your SIEM skillset, the first two requirements should be met, or else you risk losing highly skilled staff who are hard to find and retain.

The fourth requirement of a SIEM solution is documented roles and responsibilities. Many mistake this for the first requirement, but a RACI, for example, will not be followed or enforced if the first three requirements are not met. If your staff don’t have the proper skillset, one or two employees may end up doing everyone else’s work, and leave when they burn out. If your teams have poor communication with each other, issues may end up going unresolved and unnoticed by leadership, leading to an outage down the road.

Where practical, entire SIEM teams should be under one VP or line of business. Having one VP accountable for both the implementation and operation of your SIEM gives that VP an incentive to ensure the solution isn’t rushed into production, and that it has adequate resources for operations. Compare this with an organization that makes one VP accountable for the implementation only, and another VP for supporting it. Such a split can incentivize the implementer VP to get the SIEM in as quickly and cheaply as possible and leave the support VP with a mess. Given that SIEMs can take years to fully implement, this should be avoided at all costs. The single VP also acts as a single escalation point and can’t deflect an issue to another VP or line of business. When there are two VPs and the roles and responsibilities aren’t clear, disputes can arise or the issue can be ignored. Again, it’s ideal to have your entire SIEM environment under a single VP, but in organizations with a good working environment, having different parts of it owned by different VPs or lines of business can work out well. Some roles and responsibilities, such as server and storage administration, also commonly sit outside of your security organization.

RACI Matrix Overview

One of the industry’s most common roles and responsibilities documents is a RACI Matrix, which stands for Responsible, Accountable, Consulted, and Informed. The goal of a RACI is to list all stakeholders involved in the solution and the required tasks, and then assign one or more of those four values to each relevant stakeholder for each task.

While a RACI is designed to document roles and responsibilities, it has another valuable benefit: quantifying work efforts. Once you see all the various tasks involved in your SIEM environment, you can see how much work effort the various stakeholders are assigned. For example, if Engineering is responsible for Parser Management, and they spend 20 hours per week maintaining the 40 custom parsers, they can justify the half of an FTE they’re requesting.
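The arithmetic behind that example can be sketched as follows. The 40-hour full-time week is an assumption on my part; substitute whatever your organization treats as one FTE.

```python
def fte_required(hours_per_week, full_time_hours=40.0):
    """Convert a weekly work effort into a fraction of a full-time employee."""
    return hours_per_week / full_time_hours

parsers = 40
hours_per_week = 20

print(fte_required(hours_per_week))   # fraction of an FTE for Parser Management
print(hours_per_week / parsers)       # average weekly hours spent per parser
```

With the figures from the example, 20 hours against a 40-hour week works out to 0.5 FTE, or half an hour per parser per week.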

It’s easy for a SIEM RACI to span several hundred lines given the number of tasks and teams involved, and I’d thus recommend keeping it as high level as possible. The objective should be to assign tasks to the teams, and then leave the teams responsible for figuring out how work is managed. This avoids the SIEM Owner having to resolve disputes within teams. The SIEM Owner should have a single point of contact within each of the teams to work with directly.

A SIEM environment should have at minimum an overall RACI that defines the roles and responsibilities of all stakeholders. Additionally, each team may want to create an internal RACI that clarifies who within the team is doing what. This is optional, but highly recommended, as it can help employees understand their tasks, assist management in understanding the required tasks and work efforts, and most importantly, establish accountability. For example, if you have 100 correlation rules and leave it up to “the team” to manage them, you may find that the task of keeping the rules relevant is being ignored. When you break up the rules so that the first 40 are “owned” by Bill, the next 40 by Bob, and the final 20 by Joe, who also gets to own reporting, you may find rule updates happening more frequently. There is accountability, and you can follow up with Bill, Bob, and Joe to check the status of the tasks. If there isn’t progress, you can further narrow down the issue, whether it’s a skillset gap or work overload, and then coach the employee accordingly.

Many argue that assigning work to an individual rather than a team introduces a skillset gap when that employee leaves. Against that risk, the advantages of assigning it to an individual are a better understanding of the task via specialization and repetition, better documentation of the task as a result of that understanding, and ultimately a position from which the individual can improve the condition of the SIEM relating to the task, for example correlation rule updates. Having a group manage something that is not well understood leads to the team ignoring the task, something they can do when no one is accountable for it. A group that doesn’t understand the task cannot document it properly or improve its condition. There’s nothing more frustrating than working on something you don’t understand.

An overall RACI is a requirement for any SIEM environment, but as all organizations are different, how a team manages tasks within itself should be at the discretion of leadership.

Sample SIEM RACI

We’ll walk through a sample SIEM RACI to give you an idea on what it may look like in your organization. The RACI will be divided into subsections below by Category and commented on individually. A link to the full RACI Matrix is available at the end of the article.

The Stakeholders in this sample RACI are the SIEM Owner, Engineering, the Content Team, and Incident Response, who all fall under the Security Operations team. The Server Support and Storage Support teams fall under a different line of business, Infrastructure Services.

The first Category is Governance, and you can clearly see how the SIEM Owner is both Accountable and Responsible for the overall SIEM solution, dealing with the vendor, and internal escalations from any stakeholders.

The second Category is Architecture and Design, in which the SIEM Owner is also Accountable and Responsible, but Consults the Engineering, Content, and Incident Response teams. The SIEM Owner needs to work with them to make sure their requirements are met, the search speeds are adequate, and the required data sources are available. Where the SIEM solution does not yet meet these requirements, the SIEM Owner makes sure they are built into future versions.

For the Logging Configuration category, the SIEM Owner needs to make sure not only are the required log sources logging to the SIEM, but that they are logging the correct data. Engineering needs to be Consulted to ensure correct parsing, and the Content and Incident Response teams need to make sure the data they need is available within the logs.

The SIEM Owner is also Accountable and Responsible for leading all new projects, and ensuring the SIEM solution is compliant with the organization’s compliance and governance standards. You can also see at this point the SIEM Owner isn’t a mere decision maker; he or she will be active in the management of your company’s SIEM.

The Engineers are Accountable and Responsible for the health and stability of the SIEM solution, and for ensuring data feeds are integrated into the SIEM correctly. They do everything from application support to patching. The only two support-related tasks that they are not Accountable and Responsible for are Server and Storage Support, for which they will be Consulted when necessary.

The Content Team are the SIEM end users, and are strictly Accountable and Responsible for developing and maintaining rules and reports. They are also active in providing input for new use cases, but the Accountability and Responsibility for that task falls on the SIEM owner.

The Incident Response Team is Accountable and Responsible for responding to the alerts generated by the correlation rules, and reviewing reports. They are also Accountable and Responsible to provide tuning recommendations for the rules and reports based on their investigations and observations.

The Engineers tried to get the Content Team to manage user accounts, but lost the battle and ended up with the task themselves.
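If you keep your RACI in a machine-readable form, a small check can catch tasks that nobody, or more than one party, is Accountable for. The sketch below is an assumption-laden illustration: the dictionary layout, the function name `tasks_without_single_owner`, and the task names are mine, paraphrased from the walkthrough above rather than copied from the sample matrix. It applies the common RACI convention of exactly one Accountable party per task.

```python
# Each task maps stakeholders to their RACI letters ("AR" = Accountable and Responsible).
raci = {
    "Architecture and Design": {
        "SIEM Owner": "AR",
        "Engineering": "C",
        "Content Team": "C",
        "Incident Response": "C",
    },
    "Rule and report development": {"Content Team": "AR"},
    "Alert response": {"Incident Response": "AR"},
}

def tasks_without_single_owner(matrix):
    """Return tasks that do not have exactly one Accountable stakeholder."""
    return [task for task, roles in matrix.items()
            if sum("A" in letters for letters in roles.values()) != 1]

print(tasks_without_single_owner(raci))

# A task that two teams believe they own is flagged.
raci["User account management"] = {"Engineering": "AR", "Content Team": "AR"}
print(tasks_without_single_owner(raci))
```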

 

As you can see, a RACI is a simple document that can clarify who is responsible for what part of the SIEM environment. It can also be used by leadership to quantify work efforts, assist in understanding the various tasks employees do, and identify areas that require improvement. Issues can be raised and be visible to leadership on Monday morning instead of Friday afternoon, or during a breach.

A RACI is not practical without three other major requirements: clear ownership, a teamwork-oriented environment, and a strong SIEM skillset. Clear ownership gives the owner an incentive to maintain and improve the SIEM, and prevents issues from being ignored or assuming they’re the responsibility of another entity. A high-performing team maintains and improves the environment, retains highly-skilled staff, and attracts new talent into the team. A strong SIEM skillset allows staff to execute the required tasks. All of this contributes to a better return on investment the SIEM will provide your organization, and ultimately a better security posture.

 

Link to a sample SIEM RACI Matrix: Sample_SIEM_RACI


A SIEM Odyssey: How Albert Einstein Would Have Designed Your SIEM Architecture

Albert Einstein taught us that there are four dimensions: the three physical dimensions plus time. The light being generated by the sun exists, but it will take about eight minutes to reach the earth before it exists in our environment. Much of the starlight you see in the sky at night left its star hundreds or even thousands of years ago, and some of those stars may no longer exist today.

The four dimensions of spacetime can teach us a lot about the universe, and now a good lesson on SIEM architecture design. In SIEM environments, log data is sent through various layers, introducing a delay between the data source and destination. In a properly designed SIEM architecture, the delay between the source and destination should be minimal, a few minutes at most. But in an undersized SIEM architecture, delays between the source and destination can be high, and in worst cases, data may not reach the destination at all.

SIEM environments have three main layers. The first is the data sources, the various Windows servers, firewalls, and security tools that will either send data to your SIEM, or your SIEM will pull data from. The second layer is the Processing Layer, which consists of applications (Connectors/Forwarders/Ingestion Nodes) designed to process and structure log data, and forward it. The final layer is the Analytics Layer, which is where log data is stored, security analytics is performed, and end users search for data.

To highlight the risk introduced into your organization by an undersized SIEM architecture, we’ll use a DDoS attack against your organization as an example. The Bad Guyz Group has found a clever way to funnel millions out of your organization. Before they initiate the fraud scheme, they want to distract your organization from what is actually going on in order to buy themselves time, and thus launch a large-scale DDoS against your web servers.

Your DDoS protections begin sending out alerts, notifying your SOC of the attack and that there’s no concern at the moment. The amount of traffic directed at your web servers seems to be increasing, but is not near a level that will void your DDoS protections. Your SOC notifies leadership that even though there’s an active attack in progress, there’s nothing to be worried about.

While your DDoS protection is working as expected, your SIEM Processing Layer is being flooded by a surge in firewall and proxy traffic. Your SIEM Processing Layer was only designed to process 10,000 events per second (EPS), and is now struggling to process 40,000 EPS, four times that capacity. Cache files start to appear within minutes, growing at a rate of 100GB per hour, which will exhaust cache space within eight hours.
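The numbers in this scenario can be checked with a little arithmetic. The 800GB cache size below is not stated in the story; it is simply what a growth rate of 100GB per hour and an eight-hour exhaustion time imply. The per-event size is likewise a rough derived figure that assumes every event above the 10,000 EPS design limit is cached with no overhead.

```python
def hours_until_cache_full(cache_gb, growth_gb_per_hour):
    """Hours before the cache is exhausted at a constant growth rate."""
    return cache_gb / growth_gb_per_hour

def implied_event_bytes(growth_gb_per_hour, surplus_eps):
    """Average bytes per cached event implied by the cache growth rate."""
    return growth_gb_per_hour * 1e9 / (surplus_eps * 3600)

designed_eps = 10_000
surge_eps = 40_000
surplus_eps = surge_eps - designed_eps  # events per second that cannot be processed

print(hours_until_cache_full(800, 100))              # hours of headroom
print(round(implied_event_bytes(100, surplus_eps)))  # approximate bytes per event
```

Running the same calculation with your own cache size and surge estimates gives a quick sense of how long your Processing Layer can absorb a spike before it starts dropping data.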

Hours later, SOC Analysts notice that the timestamps on most of the log data are several hours behind. They send an email to the SIEM Engineers, asking whether there’s anything wrong with the SIEM application. When the SIEM Engineers get out of their project meeting a few hours later, they log in to the servers and notice the cache files, extremely high EPS rates, and maxed-out RAM/CPU usage. They then notice that the surge in data is being caused by firewall and proxy logs. After a conversation with the SOC, the SIEM Engineers are informed of the DDoS attack that began earlier in the day.

Later in the evening, the SIEM Engineers warn leadership that the Processing Layer is dropping cache files of log data and is refusing new connections, resulting in data loss. The average log data delay now stands at eight hours as the DDoS attack continues.

Fortunately, the DDoS attack stopped the following morning, and the SIEM Processing Layer began reducing the amount of cache files on the servers. The SIEM Engineers anticipate that cache files should be completely cleared by 5PM.

Later that morning, the SOC Manager gets a call from the Fraud team, asking if they can see traffic to several IP addresses. The SOC Analysts begin searching, but the response times are very slow and the latest data available is from last night. Just as the SIEM Engineers expected, by 5PM all cache files were cleared, and analysts were searching data in real time again. They found only one hit from one of the IPs provided. The Fraud team insists there should be more than that, but the SIEM Engineers note that the other hits may have been dropped when the Processing Layer was refusing new connections during the surge in data. Leadership isn’t happy, and calls for an immediate review of the SIEM environment.

The bad news is that many SIEM environments are not sized appropriately to deal with such scenarios, or with legitimate data surges in general. These situations can leave your organization blind to an attack in progress, as the data required for an investigation exists, but is not yet available to your analysts, or worse is being dropped from existence.

The good news is that you can significantly reduce the probability and severity of this scenario. While SIEM environments can be expensive, the costly part is typically the Analytics Layer, and for many organizations over-sizing this layer isn’t an option. However, the Processing Layer tends to be much less expensive, and in some cases would simply result in the cost of the physical or virtual servers required.

A SIEM Processing Layer should be sized for significantly more than your sustained average events per second rate. While this number is required to determine SIEM application licensing costs, many organizations make the mistake of sizing their SIEM according to this metric alone. In addition to spikes, the amount of traffic received by your SIEM during the day is likely to be much higher than at night. If your sustained rate is 20,000 EPS, then it’s possible for your rate during the day to be 30,000 EPS, and at night to be in the 5,000-10,000 EPS range. If you receive a spike in traffic during the day, the 30,000 EPS can turn into 60,000 EPS. In many SIEM environments, this would cause the Processing Layer to quickly exhaust caches and begin dropping data. Supporting a large spike in traffic could simply be done by adding more devices (Connectors/Forwarders/Ingestion Nodes) within the Processing Layer. The increase in processing power and overall cache availability would reduce the risk of log transmission delays and dropped data.
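To see how much the sizing target matters, here is a rough sketch that converts each of those rates into a device count. The figure of 10,000 EPS per device is purely an assumption for illustration; actual per-device throughput depends on your product, hardware, and parsers.

```python
import math

def nodes_needed(eps, eps_per_node):
    """Smallest number of Processing Layer devices that can handle the given EPS."""
    return math.ceil(eps / eps_per_node)

EPS_PER_NODE = 10_000  # hypothetical capacity of one Connector/Forwarder

for label, eps in [("sustained", 20_000), ("daytime", 30_000), ("daytime spike", 60_000)]:
    print(label, nodes_needed(eps, EPS_PER_NODE))
```

Under that assumption, sizing for the sustained rate gives two devices, while a daytime spike calls for six, three times as many.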

In addition to reducing the probability of data delay and loss, an over-sized Processing Layer brings high availability benefits, and can also make migrations and upgrades easier. If you have a single point of failure, you can lose your Processing Layer entirely if the device fails. If you have only enough Connectors/Forwarders to process your sustained EPS rate, you risk the above scenario when one of the devices within the Processing Layer fails, as the others have to absorb its share of the EPS. If you need to upgrade your Processing Layer, the extra devices can make the upgrade smoother and transparent to operations.
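The high availability point can be expressed as a simple check: with one device down, can the remaining devices still carry the load? As before, the function name and the 10,000 EPS per-device figure are illustrative assumptions.

```python
def survives_node_failure(nodes, eps_per_node, load_eps):
    """True if the remaining devices can carry the load after one device fails."""
    return (nodes - 1) * eps_per_node >= load_eps

# Two devices sized exactly for a sustained 20,000 EPS cannot lose one...
print(survives_node_failure(2, 10_000, 20_000))
# ...but a third device provides the headroom.
print(survives_node_failure(3, 10_000, 20_000))
```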

While we may have solved the issue with the Processing Layer, the surge in data can also result in transmission delays, data loss, slower end-user search response times, and system stability issues on the Analytics Layer. However, while it may seem logical to build an Analytics Layer to support double the sustained EPS rates, it can be cost prohibitive for many organizations. An adequately sized Processing Layer can assist during surges by aggregating data (combining similar events into one, for SIEM products that support aggregation), caching it locally, and limiting the EPS-out rates to the Analytics Layer. Your SIEM Engineers can also prioritize which data is sent first, so if there’s a dire need for a particular data source, your staff can throttle other data sources so that the pertinent ones reach the Analytics Layer first.

In summary, there’s a strong return on investment for building an adequate SIEM Processing Layer given the low costs, risk reduction, and invaluable security benefits. Even with a minimal Analytics Layer, a properly-sized Processing Layer will be sufficiently able to process a surge of data, cache data that can’t be forwarded, prevent data loss, and reduce risks caused by large increases in log data. Make the investment and leave spacetime issues for the physicists!


The Pros and Cons of Structuring Log Data at Ingestion Time with SIEMs

Another important but often overlooked part of a SIEM architecture and design or product analysis exercise is whether the product structures the data as it’s ingested/processed, and how that can affect your organization’s environment. This seemingly minuscule functionality can have a significant effect on your SIEM environment, and can even introduce risk into your organization.

Let’s start with the advantages of structuring (parsing) log events at ingestion time. In general, structuring data as it’s ingested/processed can give you many opportunities to manipulate the data in a positive way.

Advantages of Structuring Log Data at Ingestion Time

1. Ability to Aggregate

Aggregation gives the SIEM tool the ability to combine multiple, similar events into one single event. The biggest advantage of this is the reduction in EPS rates processed by the SIEM and the reduced storage requirements. Ten firewall deny events from the same source and destination can be combined into one, using 1,000 bytes of storage instead of 10,000 bytes. SIEM tools typically have the ability to aggregate hundreds of events over a period of thirty seconds or more, so it’s common to see aggregation successfully combine hundreds of events into one. Caution must be used with aggregation settings as the higher the aggregation window (the maximum amount of time the Connector/Log Processor will wait for similar events to combine into one), the higher the memory requirement for it.
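The following is a minimal sketch of the aggregation idea, not any vendor’s implementation. It assumes events are already parsed into dictionaries with hypothetical `ts` (seconds), `src`, `dst`, and `action` fields, and it uses fixed time buckets rather than the sliding windows some products use.

```python
def aggregate(events, window_seconds=30):
    """Combine events with the same source, destination, and action
    that fall into the same fixed time window into one event with a count."""
    buckets = {}
    for e in events:
        key = (e["ts"] // window_seconds, e["src"], e["dst"], e["action"])
        if key in buckets:
            buckets[key]["count"] += 1
            buckets[key]["last_ts"] = e["ts"]
        else:
            buckets[key] = {
                "src": e["src"], "dst": e["dst"], "action": e["action"],
                "first_ts": e["ts"], "last_ts": e["ts"], "count": 1,
            }
    return list(buckets.values())

# Ten firewall deny events from the same source to the same destination.
events = [{"ts": t, "src": "10.0.0.5", "dst": "10.0.0.9", "action": "deny"}
          for t in range(10)]

print(aggregate(events))
```

Every open bucket has to be held in memory until its window closes, which is why a longer aggregation window raises the memory requirement.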

2. Ability to Standardize Casing

Most SIEM tools can easily standardize casing of all fields parsed by the Connector/Log Processor. This is often an overlooked benefit that comes with structuring data, as the various data sources in your environment will log in various casings, and thus introduce a potential security risk. The security risk that can be introduced is that your security analysts may be getting null search results when the data they need is actually there.

In several investigations, SOC analysts were having issues getting hits for a particular user’s data. Upon closer inspection, we found the desired data, and discovered that the initial searches were coming up null because the casing in the searches did not match that of the log data. The SOC analysts were searching for “frank,” but the SIEM tool was configured to be case-sensitive, and the Windows logs were being logged in uppercase as “FRANK.” Thus, by simply having the Connector standardize casing for particular fields, you can minimize the above scenarios.
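As a minimal sketch of what standardizing casing at ingestion looks like, the function below lowercases selected fields of an already parsed event. The function name `normalize_case` and the field names are hypothetical, and lowercase is an arbitrary choice; uppercase works just as well as long as it is applied consistently.

```python
def normalize_case(record, fields=("user",)):
    """Return a copy of the record with the chosen fields lowercased."""
    return {key: (value.lower() if key in fields else value)
            for key, value in record.items()}

windows_event = {"user": "FRANK", "message": "Login Success"}
stored = normalize_case(windows_event)

# A case-sensitive search for the normalized value now finds the event.
print(stored["user"] == "frank")
```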

Standardizing casing can also increase search performance. When the SIEM tool only has to search in one case, it reduces the number of variants it needs to consider. A simple search for “FRANK” only requires the tool to match one spelling instead of thirty-two possible casing variants (FRANK, frank, Frank, FRank, FRAnk, etc).
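The figure of thirty-two comes from two casing choices for each of the five letters, which can be verified by enumerating the variants. The helper name `case_variants` is mine.

```python
from itertools import product

def case_variants(word):
    """Return every upper/lower casing combination of a word."""
    return {"".join(chars)
            for chars in product(*((c.lower(), c.upper()) for c in word))}

variants = case_variants("frank")
print(len(variants))
print(sorted(variants)[:4])
```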

Why not simply disable case-sensitivity for searches? This is seemingly the best option, as the risk of missing data is mitigated. However, the major disadvantage of this option is that it increases the processing power required for the searches. For smaller environments, this is practical and the effects will likely be negligible. However, in larger environments where the SIEM is processing several thousand events per second and there are several end users, the results can be noticeable.

A practical work around can be to disable case-sensitivity for particular searches at the discretion of the analyst. Many SIEM tools offer this option for this very reason. There’s typically an option before the search to disable case-sensitivity.

A best practice to mitigate missing data due to casing issues while maximizing performance is to start with a generic case insensitive search, and once you get hits on the data you’re searching for, switch to the casing you see. For example, if you’re looking for user Frank’s Windows logs, start with a small, e.g. few minute, case-insensitive search “frank”, and once you see that the Windows logs are logging it as “Frank,” switch to the proper casing and then expand the search. This is a practical option that will help analysts avoid missing data, and will not require you to configure your tool to be case insensitive.

Regardless of how you choose to configure case-sensitivity, ensure your staff understand how your environment works and the best practices for searching your data.

3. Ability to Add and Modify Fields

Many SIEMs can append data to existing fields, override fields with new data, and modify values put into fields. A common nuisance when searching log data is that some systems log their FQDN (e.g. server01.ca.companya.com) while others simply log the server name (server01). This creates a risk similar to case-sensitivity, where SOC analysts search for Device Host Name = “server01” but get no results because the server appears in the logs as “server01.ca.companya.com”. This forces the SOC analyst to do a wildcard search such as Device Host Name = “server01*”, which ultimately requires more processing power from the SIEM.

When data is structured/parsed at ingestion time, the SIEM can look for, e.g., the first period, strip whatever follows it, and put the stripped portion into another field. Using “server01.ca.companya.com” as an example, the parser can leave server01 in the host name field, and then put the stripped ca.companya.com in, e.g., the domain field. Analysts then know that they only need to search for the server name in the host name field, and to search the domain field if they want to know the domain of the server.
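The host name split described above could be sketched in Python as follows. The function name split_fqdn is hypothetical, and the IP address check is an added assumption (not mentioned in the text) so that a value like 10.1.1.1 is not mistaken for a host name and domain.

```python
import ipaddress
from typing import Tuple


def split_fqdn(host: str) -> Tuple[str, str]:
    """Split a logged host value into (host name, domain) at the first period."""
    try:
        # Assumption: leave IP addresses untouched instead of splitting them.
        ipaddress.ip_address(host)
        return host, ""
    except ValueError:
        pass
    name, _, domain = host.partition(".")
    return name, domain


host_name, domain = split_fqdn("server01.ca.companya.com")
short_name, no_domain = split_fqdn("server01")
```

Both server01.ca.companya.com and server01 end up with server01 in the host name field, so an exact-match search finds either form without a wildcard.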

 

Now that we’ve fallen in love with the advantages of structuring data at ingestion time, let’s look at the disadvantages before we leave for the honeymoon.

The Disadvantages of Structuring Log Data at Ingestion Time

1. Increased Event Size

The first, and potentially most costly aspect of structuring the data at ingestion time is the increase in event size. When you structure the data, you increase the size of the event, in many cases doubling its size or more. Please see my related article, The Million Dollar SIEM Question: To Parse or Not To Parse, for more on this.

2. Potential Data Loss and Integrity Issues

Given that your parser is instructed to place values in specific fields, for example taking the value after the second comma and putting it into the user name field, there are potential integrity and data loss issues if your parser is not updated at the same time the logging format for a particular data source changes.

Let’s take a look at a sample log event:

01-11-2018 14:12:22, 10.1.1.1, frank, authentication, interactive login, successful

The parser takes the timestamp from the characters before the first comma, the IP address after the first comma, the user name after the second comma, the type of event after the third comma, the type of login after the fourth comma, and finally the outcome after the fifth comma. All is well until the vendor decides to add a new field, an event code, and change the order of the fields:

01-11-2018 14:12:22, 4390000, 10.1.1.1, authentication, interactive login, successful, frank

This is a simple change, but your parser needs to be updated to ensure that the values are put into the correct fields, and to add in the new field. Should this new log format be rolled out without a corresponding parser change, we’re not only going to have data in the wrong fields, we’re not going to know who did the login, as the value “frank” will not be visible to the parser.
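To make the failure concrete, here is a minimal Python sketch of a purely positional parser applied to both sample events. The field names and the function parse_positional are hypothetical; real parsers are usually more elaborate, but one that maps values to fields by position fails in the same way.

```python
# Field order the parser was written for (the original log format).
FIELDS = ["timestamp", "ip_address", "user_name",
          "event_type", "login_type", "outcome"]


def parse_positional(line: str) -> dict:
    """Map comma-separated values to fields purely by position."""
    values = [value.strip() for value in line.split(",")]
    # zip() stops at the shorter sequence, so extra trailing values are dropped.
    return dict(zip(FIELDS, values))


old_line = ("01-11-2018 14:12:22, 10.1.1.1, frank, authentication, "
            "interactive login, successful")
new_line = ("01-11-2018 14:12:22, 4390000, 10.1.1.1, authentication, "
            "interactive login, successful, frank")

parsed_old = parse_positional(old_line)
parsed_new = parse_positional(new_line)
```

With the new format, the event code lands in the IP address field, the IP address lands in the user name field, and the seventh value, frank, is dropped because the parser only knows about six fields.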

3. Increased System Requirements

The more modifications the parser has to make to the event, the more processing power the Connector/Forwarder/Processor will consume. Ensure sufficient system resources are available to process the required modifications.

 

A Summary of the Pros and Cons of Structuring data at ingestion time:

The Million Dollar SIEM Question: To Parse or Not To Parse

Given that SIEMs process and store data, one of the major requirements of a successful SIEM environment is proper and sufficient storage. Depending on your organization’s SIEM requirements, the cost of storage alone for your SIEM can exceed application licensing costs.

The most common omission in a SIEM product selection analysis is differentiating how the proposed applications process and store data. SIEMs process and store data differently, and thus will all produce different storage requirements. Given that storage is a major cost of your environment, how the application processes and stores data can alter your storage costs significantly.

Traditional SIEM products, as well as some newer ones, are designed to parse data as it’s ingested, and thus store data as parsed events. Additionally, SIEM products parse the data into different field sets; a Windows event in SIEM Product A can be parsed into 200 fields, while SIEM Product B will parse it into 193 fields. This will result in a different event size for the same data. Other newer SIEM products do not parse data as it’s ingested; they store the data raw and only parse it when required, e.g. when you run a query or report.

There are advantages and disadvantages to both. Parsing (or normalizing/enhancing/enriching) the data structures it by adding applicable metadata. For example, the log entry ‘jsmith, 10.1.1.1, failed login’ would appear parsed as ‘username=jsmith, IPAddress=10.1.1.1, event name=failed login’. While this makes the data more organized and allows for more refined searches, it makes the log entry larger. A 500 byte raw event can turn into a 1000 byte parsed event, doubling the storage requirement for this event.
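The size increase can be seen even on the tiny jsmith example. The Python sketch below builds a key=value version of the entry and compares byte counts. The exact key=value layout is an assumption for illustration; each SIEM product stores parsed events in its own format, so real overhead will differ.

```python
raw_event = "jsmith, 10.1.1.1, failed login"

# Field names added by the parser are the metadata that makes the event larger.
parsed_fields = {
    "username": "jsmith",
    "IPAddress": "10.1.1.1",
    "event name": "failed login",
}
parsed_event = ", ".join(f"{key}={value}" for key, value in parsed_fields.items())

raw_size = len(raw_event.encode("utf-8"))
parsed_size = len(parsed_event.encode("utf-8"))
growth_factor = parsed_size / raw_size
```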

While parsing increases the size of the event, the SIEM tool does gain the ability to manipulate the data. Many SIEMs that structure data have the ability to aggregate events, which takes multiple similar events and combines them into one. For example, the event ‘source address=10.1.1.1, source port=9022, event name=deny, destination address=123.45.67.89, destination port=443’ that occurs 10 times can be combined into one, with an extra field added, e.g. event count=10, to indicate how many times the event occurred. Thus, 10 events at 1000 bytes each use roughly 1000 bytes of storage with the structured SIEM application instead of 10,000 bytes.
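Aggregation of this kind could be sketched in Python as follows. The function aggregate and the field name event_count are hypothetical, and the sketch only merges events whose fields are all identical; real products typically let you choose which fields must match and over what time window.

```python
from collections import Counter


def aggregate(events: list) -> list:
    """Combine identical events into one event carrying an event_count field."""
    # Turn each event into a hashable key so identical events can be counted.
    counts = Counter(tuple(sorted(event.items())) for event in events)
    aggregated = []
    for key, count in counts.items():
        event = dict(key)
        event["event_count"] = count
        aggregated.append(event)
    return aggregated


deny_event = {
    "source address": "10.1.1.1",
    "source port": 9022,
    "event name": "deny",
    "destination address": "123.45.67.89",
    "destination port": 443,
}
firewall_events = [dict(deny_event) for _ in range(10)]
stored_events = aggregate(firewall_events)
```

Ten identical deny events go in and one event with event_count set to 10 comes out, which is where the storage saving comes from.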

To highlight how this can affect your organization, let’s look at Company A as an example. The technical staff at Company A have been reading about the benefits of SIEM from some guy on the Internet, and decide they want to implement one. They want 2 months of online data followed by 10 months offline, and for the data to be highly available. Their storage team can provide high-speed storage for $5,000 per terabyte, and lower-speed storage for $2,000 per terabyte.

Next, they’ve invited a couple of vendors in for a product overview, and they’re making each complete the SIEM Storage Requirements and Costs spreadsheet they created.

First up for Company A is Product A. Product A does not parse event data as it’s ingested and stores logs in raw format. It gets up to 50% compression on live data, and 85% compression on archived data. The product can replicate data and thus easily meet the high availability requirement. The tech staff at Company A also went the extra mile and determined that the average log event size from all their systems is 700 bytes.

A summary of the requirements and weights so far:

Again, since Product A stores data only in raw format, the Average Normalized Event Size is 0, as it does not store a parsed/normalized event. Product A cannot aggregate data, so there is no Aggregation Benefit. The product will create a copy of each event, bringing the replication factor to 2.

Next, we’re going to determine the average sustained events per second rate from the numbers the tech staff provided. Based on the total number of devices, the SIEM will need sufficient storage to store 5,000 Total Average EPS (events per second) at 700 bytes per event. Using the SIEM Storage Requirements and Costs spreadsheet, we get the following table.

The total daily uncompressed storage requirement is 607 GB. At 50% compression, the Total Online Storage Requirement is approx. 18 TB. At 85% compression, the Total Offline Storage Requirement is approx. 28 TB. The online storage at $5,000 per TB brings the cost to approx. $91,000, and the offline storage at $2,000 per TB adds another $55K, bringing the total to approx. $146,000.
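The arithmetic behind these figures can be approximated with a short Python sketch. The spreadsheet’s actual formulas are not shown here, so this is a reconstruction under stated assumptions: 2 months is taken as 60 days, 10 months as 305 days, 1 TB as 10^12 bytes, and the function name storage_costs is hypothetical. The results land close to, but not exactly on, the figures above.

```python
SECONDS_PER_DAY = 86_400
BYTES_PER_TB = 1e12  # Assumption: decimal terabytes


def storage_costs(eps, event_bytes, copies, online_days, offline_days,
                  online_compression, offline_compression,
                  online_cost_per_tb, offline_cost_per_tb):
    """Estimate yearly storage needs (TB) and costs ($) for a SIEM."""
    daily_tb = eps * event_bytes * SECONDS_PER_DAY * copies / BYTES_PER_TB
    online_tb = daily_tb * online_days * (1 - online_compression)
    offline_tb = daily_tb * offline_days * (1 - offline_compression)
    total_cost = online_tb * online_cost_per_tb + offline_tb * offline_cost_per_tb
    return {
        "daily_tb": daily_tb,
        "online_tb": online_tb,
        "offline_tb": offline_tb,
        "total_cost": total_cost,
    }


# Product A at Company A: raw events only, two copies of each event.
product_a = storage_costs(
    eps=5_000, event_bytes=700, copies=2,
    online_days=60, offline_days=305,  # Assumed lengths of 2 and 10 months
    online_compression=0.50, offline_compression=0.85,
    online_cost_per_tb=5_000, offline_cost_per_tb=2_000,
)
```

Under these assumptions the sketch gives roughly 0.6 TB per day, about 18 TB online, about 28 TB offline and a total of about $146,000. Swapping in another product’s stored event size, effective EPS and number of copies gives its estimate the same way.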

Up next is Product B, which is a SIEM tool that’s designed to work with structured data. It will parse log events and create an Average Normalized Event Size of 2,000 bytes, and will be able to reduce events by 40% through aggregation. It can’t replicate data, but the Connector/Forwarder will be configured to send to 2 destinations. And by complete chance, it has the exact same compression ratios.

Product B is going to process the exact same raw EPS, but the Aggregation Benefit drops the sustained Total Average EPS to 3,000.

The total daily uncompressed storage requirement is approx. 1 TB. At 50% compression, the total Online Storage Requirement is 31.2 TB. At 85% compression, the total Offline Storage Requirement is 47.6 TB. 31.2 TB at $5,000 per TB brings the cost to $156,000, and 47.6 TB at $2,000 adds another $95K, bringing the total to approx. $251,000.

As you can see, for the sample insurance company, Product A produces a significantly different storage cost than Product B due to the way the products process and store data. The tech staff at Company A like Product B better, but know that it will be tough to sell the VPs a solution that will cost $500,000 more over the next five years.

However, the result could be completely different at Company B, which has different requirements than Company A. Company B is a telco that wants their IT staff to monitor their network infrastructure with a SIEM. The tech staff at Company A were generous enough to share the SIEM Storage Requirements and Costs spreadsheet with their buddies at Company B.

Company B decides to bring in Product A first and has them fill out the spreadsheet. As the environment mainly consists of network devices, the Average Raw Event Size is going to be 450 bytes. Again, since Product A doesn’t parse log data, there is no Average Normalized Event Size or Aggregation Benefit.

Next, we can calculate a total daily uncompressed storage requirement of 1.2 TB from the 15,780 EPS rate. That will produce an online storage requirement of 37 TB and offline of 56 TB.

That will bring the total yearly storage cost to approx. $300,000.

Next, the tech staff at Company B bring in Product B for an overview.

Because the environment consists mostly of network devices, Product B will produce an Average Normalized Event Size of 1500 bytes, and the Aggregation Benefit will be very high for the firewall data, reaching an overall benefit estimated at 80%.

Through the strong aggregation benefit, the total EPS is reduced from nearly 16,000 to just over 3,000. As a result, the storage requirements for Product B are 24.5 TB for online and 37.4 TB for offline.

So for Product B, the storage costs sit at approx. $200,000 per year.

The tech staff at Company B like Product A better for some reason, but like at Company A, they know it will be tough to justify the extra $500,000 in storage costs over the next five years.

So as you can see, To Parse Or Not To Parse is not some Shakespearean-sounding cliché by some geek on the Internet. Requirements such as storing two copies of each event, storing both the raw and normalized event (add the storage costs for Product A and B together, roughly!), or retaining two years of log data can create tremendous storage cost differences over the tenure of your SIEM environment.

The best product for your company will be that which meets your requirements best, and storage costs alone should not be the deciding factor for which product is selected for your organization. There are many advantages and disadvantages to leaving data in raw format or parsing it, but as the data shows, you can’t ignore a question that can really be in the millions!

 

For the record, the lads at Company A have shared the SIEM Storage Requirements and Costs spreadsheet.