Managing Your SIEM Internally or Outsourcing

Given the advent of cloud computing, many companies are now outsourcing at least some parts of their IT infrastructure and applications to third parties, allowing them to focus on their core business. Many organizations don’t want to invest in an IT department or data center, and can’t match the speed and efficiency that Amazon AWS or Microsoft Azure can provide.

While your SIEM environment is likely one of your more complicated security applications to manage, you can outsource one or more parts of it to third parties, including content development, application management, and infrastructure management. For example, you can have your security analysts develop content internally, have the application set up and maintained by the vendor or another third party, and have the infrastructure hosted in AWS or Azure.

Before outsourcing one or more parts of your SIEM environment, there are some major considerations to address.

Determining which parts to outsource

In order to determine which parts of your SIEM environment to keep or outsource, you’ll need to understand your competencies. Do you have a significant number of staff with sufficient security competencies, but not SIEM specifically? Do you have a SOC that is proficient in both SIEM content development and investigations? Are you willing to invest in obtaining and retaining scarce SIEM staff? Does your line of business and its regulations force you to keep all data internally? Do you want to keep your SIEM staff internal for a few years and then consider outsourcing it? Can your data center rack-and-stack a server within a day, or does it take six months?

While these can be difficult questions to answer, the good news is that you’ll be able to outsource any part of your SIEM environment. Many of the major consulting firms can offer you SIEM content developers, engineers, investigators, and application experts to manage everything from your alerts to your SIEM application.

A summary of the SIEM environment functions that can be outsourced:
Content Development
-SIEM correlation rules, reports, metrics, dashboards, and other analytics.
Investigations
-Performing security investigations, working with the outputs of alerts, reports, etc.
Engineering
-Integrating data into the SIEM, parser development.
Application Management & Support
-Installing, updating, and maintaining the SIEM application.
Infrastructure
-Setup and maintenance of SIEM hardware, servers, and storage.

Privacy Standards and Regulations

Depending on what country and jurisdiction your organization falls under, you may be subject to laws that restrict the processing and storing of data in particular geographic regions. Azure and AWS have expanded significantly and have infrastructure available in many countries, allowing your data to be stored “locally.” Additionally, Azure and AWS can provide secure, private networks, segregating your data from other organizations.

While it can seem risky to store your data within another entity, many organizations take advantage of Amazon’s and Microsoft’s network and security team rather than building the required teams internally.

Internet Pipe

You’ll need an adequate, fault-tolerant Internet pipe between your organization and your hosting company. The amount of bandwidth needed depends on the SIEM architecture. For example, you may want to collect log data locally at the Processing Layer, where it will be collected, structured/modified, and then sent to the Analytics Layer. If we’re using Product A, which structures log data into an average event size of 2,000 bytes, and our sustained EPS rate is 5,000, then we’ll need 10 MB/s (80 Mbps) of available bandwidth in order to forward over the link without any latency.
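The arithmetic above can be sketched as a quick calculation. The figures are the hypothetical example values from the text (a 2,000-byte average event size for Product A and a 5,000 EPS sustained rate), not vendor specifications:

```python
# Estimate the bandwidth needed to forward structured events without latency.
# Inputs are the hypothetical example figures: 2,000-byte events at 5,000 EPS.
AVG_EVENT_BYTES = 2_000
SUSTAINED_EPS = 5_000

bytes_per_second = AVG_EVENT_BYTES * SUSTAINED_EPS        # 10,000,000 B/s
megabytes_per_second = bytes_per_second / 1_000_000       # 10 MB/s
megabits_per_second = bytes_per_second * 8 / 1_000_000    # 80 Mbps

print(f"{megabytes_per_second:.0f} MB/s ({megabits_per_second:.0f} Mbps)")
```

Remember that this is the sustained average; you’ll want headroom above it for peak-hour rates.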

A SIEM Application That Supports Your Outsourced Architecture

Let’s say for example that you want to collect and consolidate log data locally, and then have that consolidated data sent to a SIEM in the cloud, and another copy to your local, long-term data store. Does the application that will be collecting and consolidating log data locally support this model? Is there a limitation that it can only send to one destination, and thus can’t meet the requirement to send to two destinations?

Other Considerations:
Ownership
-Regardless of how you build and operate your SIEM environment, ensure that there’s an owner with a direct interest in maintaining and improving its condition, especially if working with multiple vendors. Multiple parties can point the finger at each other if there are issues within the environment, so it’s critical to have an entity that can prevent stalemates, resolve issues, improve the SIEM’s condition, and ultimately increase the value it provides to your organization.
Contract Flexibility
-If you’re going to be using the RFP process, understand that the third parties you’re submitting the proposal to are there to win business and compete on price. As a result, some third parties may under-size a solution to create a price attractive to the purchaser. While this can simply be considered business as usual, it’s important to understand that your environment may need to be augmented or adjusted during its tenure, and the service provider may ask for additional funds to ensure the environment’s health. Additionally, requirements can change significantly in a short period of time, which can change the infrastructure required for the environment.

Overall, there are many ways to structure your SIEM environment to take advantage of outsourcing. There are many organizations that can help you manage any part of it. How to do it best will depend on your organization’s requirements, line of business, relationships with third parties, competencies, strategy, and growth.

Selecting a SIEM Storage Medium

Given that SIEMs can process tremendous amounts of log data, a critical foundation of your SIEM environment is storage. In order to provide analytical services and fast search response times, SIEMs need dedicated, high-performing storage. Inadequate storage will result in poor query response times, application instability, frustrated end users, and ultimately make the SIEM a bad investment for your organization.

Before running out and buying the latest and greatest storage medium, understanding your retention requirements should be the first step. Does your organization need the typical three months of online and nine months of offline data? Or do you have larger requirements, such as six months of online followed by two years of offline data? Do you want all of your data “hot”? Answering these questions first is critical to keep costs as low as possible, especially if you have large storage requirements.

Once we understand the storage requirements, we can then better understand which medium(s) to use for the environment. If we have a six-month online retention requirement for a 50,000 event per second processing rate, we’re going to need dedicated, high-speed storage to be able to handle the throughput. While we definitely need to meet the IOPS requirements vendors specify, we also need to ensure the storage medium is dedicated. Even if the storage has the required IOPS, if the application can’t access the storage medium when required, the IOPS will be irrelevant. Thus, if using a technology such as SAN, ensure that the application is dedicated to the required storage and that the SAN is configured accordingly.

Another factor to consider when designing your storage architecture for your SIEM environment is what storage will be used per SIEM layer. The Processing Layer (Connectors/Collectors/Forwarders) typically doesn’t store data locally unless the Analytics Layer (where data is stored) is unavailable. However, when the Analytics Layer is unavailable, the Processing Layer should have the appropriate storage to meet the processing requirements. Dedicated, high-speed storage should be used to process large EPS rates, and should have the required capacity to meet caching requirements.

To save on storage costs, slower, shared storage can be used to meet offline retention requirements. When needing access to historical data, the data can be copied back locally to the Analytics Layer for searching.

Ensuring you have the right storage for your SIEM environment is a simple but fundamental task. As SIEMs can take years to fully implement and equally long to change, selecting the correct storage is critical. For medium-to-large enterprises, dedicated, high-speed storage should be used to obtain fast read and write performance. While smaller organizations should also make the same investment, there are cases where slower, more cost effective storage can be used for low processing rates and minimal end user usage of the SIEM.

Understanding Your License Model

SIEM license models can vary significantly. Some are simply based on the average ingested data per day, while others can have multiple factors such as ingested data per day, the number of end users, and the number of devices it collects data from. Regardless of the license model, it’s critical to understand how it works to ensure you allocate sufficient funds for it. A misunderstanding of your license model can unexpectedly consume more security budget than anticipated, and thus increase risk to your organization by limiting resources available for both the SIEM and other security services.

Additionally, as most companies are constantly growing and changing, it’s pivotal to understand how the license model can be augmented, changed, and what the penalties are for any violations.

While the simpler the license model the better, there’s nothing wrong with a license model with various factors as long as it’s well understood and meets your organization’s requirements. After a requirements gathering exercise, you should be able to tell your vendor the expected ingestion rates per day, how many users there will be, and the expected growth rates.

There are other less-obvious factors that can also significantly affect license models. Two often overlooked factors are how the vendor charges for filtering/dropping unneeded data, and if the ingested data rates are based on raw or aggregated/coalesced amounts. For example, if you’re planning on dropping a significant amount of data by the Processing Layer, Product A (which doesn’t charge for dropped data) would have lower license costs than Product B (which can drop data, but includes the dropped amount in license costs), all else equal. Product C, which aggregates/coalesces data and determines license costs based on the aggregated/coalesced EPS, would have lower license costs than Product D, which aggregates/coalesces data but determines license costs based on raw EPS rates, all else equal.
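The impact of these two factors can be made concrete with a rough comparison. All figures here are illustrative assumptions (a 30% drop ratio and a 4:1 aggregation ratio), and the products are the hypothetical ones from the example:

```python
# Compare license-relevant daily event counts under different billing models.
# Illustrative assumptions: 5,000 raw EPS, 30% of events dropped at the
# Processing Layer, and 4 raw events coalesced into each aggregated event.
RAW_EPS = 5_000
DROP_RATIO = 0.30
AGG_RATIO = 4
SECONDS_PER_DAY = 86_400

raw_events_per_day = RAW_EPS * SECONDS_PER_DAY

# Product A: dropped data is free, so only forwarded events count.
product_a_billable = raw_events_per_day * (1 - DROP_RATIO)

# Product B: can drop data, but the dropped amount still counts.
product_b_billable = raw_events_per_day

# Product C: licensed on the post-aggregation event count.
product_c_billable = raw_events_per_day / AGG_RATIO

print(product_a_billable, product_b_billable, product_c_billable)
```

Even with identical ingest, the billable volume differs by a factor of four between the friendliest and least friendly model here, which is why these clauses deserve scrutiny during contract review.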

If you’re comparing different SIEMs, you should ensure that you’re performing an accurate comparison, as SIEMs can vary significantly. A license model for a full SIEM solution from Company A is likely to be more expensive than a log management-only solution from Company B.

SIEMs can be expensive and consume a significant portion of your security budget. Misunderstanding your requirements and then signing a contract with a license model that’s unclear or difficult to understand is a major risk. Reduce that risk by spending the resources necessary to understand it and choose one that aligns best with your organization.

Timestamp Management

One of the most critical components of a SIEM environment is accurate timestamps. In order to perform any investigation, analysts will need to know when all the applicable events occurred. Given that SIEM applications can have many timestamps, it’s critical for staff to know which are used for what. Improper use of timestamps can have many repercussions and add risk to your organization.

The two most important timestamps in an event are the time which the event was generated on the data source, and the time the event was received by the SIEM. The time which the event was generated on the data source is commonly known as the Event Time. The time the event was received by the SIEM is commonly known as the Receipt Time.

A common question posed by junior staff is why they can’t find the events they’re looking for. Once I rule out a case-sensitivity searching issue, I then check to see if the correct dates/times are being used in the search. For example, a brute-force alert was generated after a user generated 50 failed logins. While the alert was generated in the past 24 hours, analysts can’t find any failed logins for the user that triggered the alert. Upon closer inspection, we see that the alert was generated off the Receipt Time, but the Event Time timestamps are from one week ago. After a quick investigation, we discover that the system owner took the system offline last week, and just reconnected it to the network 24 hours ago.

Timestamp discrepancies can be even more severe when using dashboards or monitoring for alerts. For example, a SOC uses a dashboard that monitors for particular IPS alerts that were set up to detect an exploit attempt of a new, high-priority vulnerability. The SIEM has been experiencing caching lately, and events are currently delayed 28 hours. The dashboard the SOC analysts are using is configured to use the Event Time timestamp and shows events from the past 24 hours. An alert that occurred 26 hours ago finally clears the cache queue but fails to show up on the dashboard because it doesn’t match the timestamp criteria. Thus, the critical alert goes unnoticed.

While this can be a severe issue, the fix is simple. The SOC analysts could configure the dashboard to use the Receipt Time timestamp instead, so all alerts received in the past 24 hours would show up and be noticeable regardless of when they were generated. Dashboards in general should have both timestamps shown, and staff in general should be aware of the various timestamps used by the SIEM. Another advantage of using both timestamps is allowing staff to be aware when there are delays in receiving log data. Minor delays can be expected, but significant delays can be caught immediately and actioned before large caches are formed.
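The missed-alert scenario can be reduced to a few lines. This is a language-neutral sketch, not any vendor’s dashboard logic; the field names mirror the Event Time/Receipt Time terms used above:

```python
# Show why a 24-hour window keyed on Event Time misses a delayed alert,
# while the same window keyed on Receipt Time catches it.
from datetime import datetime, timedelta

now = datetime(2024, 1, 10, 12, 0)   # arbitrary reference point
window_start = now - timedelta(hours=24)

# The cached alert: generated 26 hours ago, received just now
# after the 28-hour cache queue finally cleared.
alert = {
    "event_time": now - timedelta(hours=26),
    "receipt_time": now,
}

visible_by_event_time = alert["event_time"] >= window_start      # False: missed
visible_by_receipt_time = alert["receipt_time"] >= window_start  # True: shown
```

Showing both fields side by side on the dashboard makes the 26-hour gap between them obvious, which is itself the early warning sign of a caching problem.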

Some timestamp discrepancies can be normal, especially in large organizations. You’re bound to have at least some out of thousands of servers with misconfigured dates, development machines logging to the SIEM when they shouldn’t be, or network outages that cause transmission delays. Having staff aware of the various timestamps and potential discrepancies can reduce the risk that they turn into larger issues in your organization.

Monitor for Caching

Caching is a sign that the system is unable to keep up with the volume of data. While some caching can be expected and considered normal, frequent occurrences are an indication of an undersized architecture or application misconfigurations. Excessive caching can result in major delays in receiving log data and ultimately data loss.

Given the risks of data caching, most SIEMs come with monitoring capabilities to alert when caching occurs. These should be implemented to alert when caching is beyond what is considered acceptable. For example, you may expect some minor caching during the day at peak hours, and thus don’t need alerts during this time, but alerts should be generated whenever there is caching outside this period.

Caching can also be detected from the server operating system, where you would see cache files build up in the applicable application directory. Thus, if your SIEM application doesn’t support alerting when caching occurs, you should be able to detect and alert via the OS.
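An OS-level check can be as simple as summing file sizes under the cache directory. The path and threshold below are assumptions for illustration; substitute your product’s actual cache location and an alerting threshold that reflects your normal peak-hour caching:

```python
# Minimal OS-level cache detection: total the bytes in the SIEM's cache
# directory and flag when they exceed a threshold.
import os

CACHE_DIR = "/opt/siem/cache"       # hypothetical cache directory
THRESHOLD_BYTES = 1_000_000_000     # illustrative ~1 GB alerting threshold

def cache_bytes(path):
    """Sum the size of all files under the given directory tree."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file was flushed between listing and stat; skip it
    return total

def should_alert(path, threshold=THRESHOLD_BYTES):
    return cache_bytes(path) > threshold
```

A script like this run from cron or your monitoring agent gives you caching visibility even when the SIEM application itself offers none.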

Regardless of how it’s implemented, ensure your environment has appropriate alerting when caching is detected.

Calculate and Configure Caches

Until someone invents a technology that guarantees one hundred percent uptime, we’ll need to accept that at some point in a SIEM environment there will be an application or system failure. Additionally, we’ll need to take the application offline at least a few times per year for scheduled maintenance and upgrades. While most SIEM applications have caching capabilities built into them, it’s critical to ensure the environment has appropriate cache sizes configured and sufficient storage. Insufficient storage or misconfigured cache configurations can result in data loss.

Typically in SIEM environments, the Processing Layer (Connectors/Collectors/Parser Layer) is designed to send to the Analytics Layer via TCP, and if it’s unavailable, data will be cached locally on the Processing Layer servers until the Analytics Layer is available again. Thus, the Processing Layer servers will need sufficient local disk space to house the expected caches.

In order to determine what an appropriate cache size is, we need to look at your organization’s requirements, SLAs, and other factors that will help us determine how long an outage can last, and how long it typically takes to resolve issues within your IT department. If you’re certain an outage would last no longer than 3 days, then we need to ensure the Processing Layer servers can support 3 days’ worth of cached log data. Caches can also grow quickly, as they typically contain raw, uncompressed data.

To calculate how much storage we’ll need for caching, we can simply take the Average Sustained 24h EPS rate, and then multiply it by the average event size and the number of seconds per day. For example, if your Average Sustained 24h EPS is 5,000, and your normalized event size is 2,000 bytes, then we’ll need about 864 GB of space per day. So if we have 2 servers in the Processing Layer and we expect an outage to last no longer than 3 days, then we’ll need 1.3 TB of free storage per server to meet the cache space requirements (864 GB/day × 3 days = 2.6 TB, or 1.3 TB across 2 servers).
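The worked example above can be captured as a reusable calculation. The inputs mirror the text’s figures (5,000 EPS, 2,000-byte events, a 3-day worst-case outage, 2 Processing Layer servers):

```python
# Cache sizing: EPS x average event size x seconds/day x outage days,
# split evenly across the Processing Layer servers.
SUSTAINED_EPS = 5_000
AVG_EVENT_BYTES = 2_000
OUTAGE_DAYS = 3
SERVERS = 2
SECONDS_PER_DAY = 86_400

bytes_per_day = SUSTAINED_EPS * AVG_EVENT_BYTES * SECONDS_PER_DAY
gb_per_day = bytes_per_day / 1_000_000_000     # 864 GB/day
total_tb = gb_per_day * OUTAGE_DAYS / 1_000    # ~2.6 TB for the full outage
tb_per_server = total_tb / SERVERS             # ~1.3 TB per server

print(f"{gb_per_day:.0f} GB/day, {total_tb:.2f} TB total, {tb_per_server:.2f} TB/server")
```

Note the assumption that load is balanced evenly across the servers; if one server collects a disproportionate share of sources, size its cache for that share instead.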

We’ll also need to ensure the application is configured to use the appropriate cache size as well. Many SIEM applications are configured with a default cache size, which may not be sufficient for your environment.

Log Source Verification

A critical function of any SIEM environment is verifying that the intended systems are logging successfully. Systems that are believed to be logging to a SIEM but aren’t pose a significant risk to an organization, creating a false sense of security and limiting the amount of data available for an investigation.

A common mistake that can be made while verifying if a particular system is logging to a SIEM is using the incorrect fields for confirmation. For example, most SIEMs have several IP address and hostname fields, such as source IP address, destination IP address, and device IP address. Given the multiple fields, it can be confusing to know which is used for what, especially for new staff. This leaves a possibility that staff are pulling incorrect data and providing inaccurate results when performing verification or searching in general.

As an example, Company A is implementing a new Linux server, and the SOC is being asked if they can see logs coming from it, 172.16.2.1. The SIEM application has three IP address fields: Device IP Address, Source IP Address, and Destination IP Address. The Device IP Address field contains the IP address of the server generating the log event. The Source IP Address is the source of the event, and the Destination IP Address field is the target of the event.

One of the SOC analysts performs the verification, searches for “172.16.2.1,” and gets one result:

Event Name: Accept
Source IP Address: 10.1.1.1
Source Port: 22
Device IP Address: 172.16.50.25
Destination IP Address: 172.16.2.1
Destination Port: 22

Without paying attention to the field names, the analyst mistakenly reports that the new Linux server is logging, when in fact they’re looking at an accept traffic event generated by firewall 172.16.50.25, not an event from the new server. The project to implement the new server is now considered complete, and Company A has a security gap.

While this can be a major issue and add risk to an organization, a simple process can be followed to show staff which fields to use for verification. Your SIEM vendor can also easily tell you which fields to use as well. Learning and education sessions on searching can also be used to address this and ensure staff know how to search effectively.
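The difference between the misleading and the correct check can be sketched in a few lines. The field names mirror the example above; the event is the single firewall result the analyst found:

```python
# Log source verification: searching "anywhere" vs. checking the field that
# identifies the generating device.
events = [
    {
        "event_name": "Accept",
        "source_ip": "10.1.1.1",
        "device_ip": "172.16.50.25",   # the firewall that generated the event
        "destination_ip": "172.16.2.1",
    },
]

def appears_anywhere(events, ip):
    """Misleading check: matches the IP in any field."""
    return any(ip in e.values() for e in events)

def is_logging(events, ip):
    """Correct check: the system must appear as the generating device."""
    return any(e["device_ip"] == ip for e in events)

print(appears_anywhere(events, "172.16.2.1"))  # True  -- the misleading result
print(is_logging(events, "172.16.2.1"))        # False -- the server isn't logging
```

Building the field-specific check into a documented verification procedure removes the guesswork for new staff.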

Alerting On Quiet Log Sources

Data sources that stop logging to your SIEM put your organization at risk. If one of your organization’s firewalls stops logging to the SIEM, your SOC will be blind to malicious traffic traversing it. If your endpoint protection application stops logging, your analysts won’t be able to see if malicious files are being executed on one of your billing servers.

In a perfect world, your SIEM should alert when any data source stops logging to your SIEM. While this is feasible in smaller organizations, it can become daunting in large organizations. It’s easier for your SOC to follow up with one system owner who sits a few cubicles over than with 100 system owners from different lines of business. The task of remediating several hundred systems not logging to a SIEM can easily consume an entire resource. In large organizations, network outages, system upgrades and maintenance windows can be a regular occurrence. Should you alert on any data source that stops logging to your SIEM in a predefined period, you could easily end up flooding your SOC, and in a worst case scenario, your analysts will develop a practice of ignoring these alerts.

As a best practice, especially in large organizations, a SIEM should be configured to alert when critical data sources stop logging. The data sources should at minimum include critical servers to the business (e.g. client-facing applications), firewalls, proxies, and security applications. A threshold of less than an hour in your organization may generate excessive alerts, as some sources that are file-based may be delayed by design, for example by copying the file to the SIEM every 30 minutes. However, data sources that haven’t logged in one hour may warrant an alert in your organization.
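The core of such an alert is a per-source “last seen” check. The source names and the one-hour threshold below are illustrative; in practice you’d loosen the threshold for file-based sources that are copied on a schedule:

```python
# Quiet log source detection: flag critical sources whose most recent event
# is older than a per-source threshold.
from datetime import datetime, timedelta

now = datetime(2024, 1, 10, 12, 0)   # arbitrary reference point

# Hypothetical inventory of critical sources and their expected cadence.
sources = {
    "edge-firewall":  {"last_seen": now - timedelta(minutes=5),  "threshold": timedelta(hours=1)},
    "web-proxy":      {"last_seen": now - timedelta(hours=3),    "threshold": timedelta(hours=1)},
    "billing-server": {"last_seen": now - timedelta(minutes=45), "threshold": timedelta(hours=1)},
}

def quiet_sources(sources, now):
    """Return the names of sources that have been silent past their threshold."""
    return sorted(
        name for name, s in sources.items()
        if now - s["last_seen"] > s["threshold"]
    )

print(quiet_sources(sources, now))
```

Restricting the inventory to critical sources, as recommended above, is what keeps the alert volume manageable in a large environment.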

Another thing to consider when remediating systems not logging to the SIEM is that malware experts and threat intelligence specialists may not be the best resources to chase system owners down. While they may not mind the odd alert for this, they’re not likely going to have time to chase down and get 100 system owners to configure their systems properly, or have the patience to continuously follow up with them. Thus, in larger organizations, project management may be a good fit for this task.

Having all your systems log to your SIEM is a critical part of reducing your organization’s risk. Having a practical, manageable task for remediating systems that stop logging will ensure the process is followed and the risk is reduced.

Disable Unused Content

When building new SIEM environments or working with existing ones, one of the quickest ways you can improve the performance and stability of the environment is to remove unused content. While this may seem obvious to experienced SIEM resources, it’s common to find reports or rules running in the background that don’t serve a purpose. In some environments, unused content can be slowing the system down and contributing to application instability. Unused content is especially common in environments that don’t have enough staff to manage the SIEM.

Default rules provided by the vendor are often enabled but unused. The first indication that a rule is unused is that it has no action assigned and isn’t used for informational purposes. If a rule isn’t alerted to the attention of an investigator or SIEM engineer, it may be that the rule is simply running in the background consuming system resources. A rule to trigger an alert when someone logs into the SIEM may be useful, but an ad-hoc report to obtain the same information may suffice. A significant number of inefficient rules that match a large percentage of events can adversely affect the performance of the environment.

Reports can be another source of unused content. In many environments, I find reports that were originally set up to be used temporarily, but are no longer being used by the recipient. It’s common for the recipient to forget to follow up with the SIEM staff to note the reports are no longer required. Over a period of several years, this can easily amount to several dozen reports running on a regular basis, putting a significant strain on the system for no benefit.

All SIEM environments are different, and there’s no set of content that must be enabled or disabled. But there’s very likely content in your environment that can be disabled, and the system resources can instead be used to provide security analysts better search response times. So on a regular basis or whenever there’s a complaint about search response times or application instability, determine if there’s any content that can be safely disabled.

Effective Searching

There are two critical reasons end users should learn how to search their SIEM effectively. Ineffective searching is a risk to your organization, where end users can produce inaccurate data, and thus provide inaccurate investigation results. Ineffective searching can also degrade the SIEM’s performance, increasing the amount of time required for analysts to obtain data, while affecting the overall stability of the system.

If a security analyst is asked to perform an investigation and searches incorrectly, the results for a query on malicious traffic may return null when in fact there are matches. A compromised user account may be generating significant log data, but your analysts can’t obtain logs for it because they are searching for “jsmith” instead of a case-sensitive “JSmith.” End users can also match on incorrect fields, believing they are finding the correct data when they are not.

Ineffective queries can lengthen the amount of time required to complete them, and increase the system resources used by the SIEM. Many queries can be improved to significantly increase their performance, making the end user happier with a faster response time, and a healthier system that has more CPU and RAM to work on other tasks. A simple rule of thumb is to match as early in the query as possible to limit the amount of data the system searches through. Searching for data in particular fields rather than searching all fields is also a way to reduce the amount of processing the system must do. Additionally, some SIEM tools allow you to easily check for poorly performing queries. For example, Splunk’s Search Job Inspector can not only show you which queries are taking the longest, but even which parts of the query are taking longer than others, allowing you to optimize accordingly.

It’s also common for security analysts to get requests for excessive data. In many cases requestors will ask for more information than is required in order to let them drill-down into the information they need, instead of having to submit multiple requests for data. For example, there may be a request to pull log data on a user for the past two months, when all that is required is some proxy traffic for a few days. These types of requests can be resource-intensive on SIEMs, especially if there are multiple queries running simultaneously. The impact can be more severe when the queries are scheduled reports. Scheduling multiple, large, inefficient queries on a regular basis can consume a significant amount of system resources. With a few inquiries to the requestor, the security analyst may be able to significantly reduce the amount of data searched for.

While ineffective searching is a risk, it’s a simple one to reduce. Training sessions, lunch-and-learns, or workshops can significantly reduce the risks of analysts searching incorrectly and consuming unnecessary system resources. I find a simple three-page deck can provide enough information to assist analysts with searching, highlighting the tool’s case sensitivity, common fields, and sample queries. Nearly all SIEM vendors offer complimentary documentation that will show you how to search best with their product. Thus, a few hours of effort can reduce searching risks while optimizing your SIEM environment.