Category: SIEM

Splunk Risk Analysis Framework

One of the most useful features of Splunk Enterprise Security is the Risk Analysis Framework. It allows you to assign a risk score to your use cases that you can use for alerting, threat hunting, and other analytics. Larger organizations will find significant value in the Framework, as it is difficult for most to monitor and triage the alert volume produced by their SIEM.

The Risk Analysis Framework is a simple index that stores each of your alerts (notables) with the entity that triggered the alert (called the “risk object,” an IP or username) and a numeric score. You can set a threshold suitable for your organization, and alert when a particular user or system reaches a particular risk score based on the use cases they’ve triggered. So instead of configuring each of your use cases to trigger an investigation, you can run them in the background and only trigger an investigation when the user or system reaches the threshold.

The Risk Analysis Framework can help with the following common issues organizations have with their SIEM environments:

High alert volume and false positive rate
The top SIEM issue that continues to this day. Organizations spend a significant amount of time and resources monitoring alerts from their SIEM use cases, many of which flag suspicious activity, outliers, and rarities, but end up being false positives. Given that these use cases can be recommended by many SIEM vendors and incident response providers, organizations can be reluctant to turn many of them off.

Compliance or Audit Requirements
Some regulated organizations are mandated to implement and monitor particular use cases, which the security team may consider a low-value activity.

Excessive tuning to make a use case feasible
To make many use cases feasible, organizations often tune out service accounts, certain network segments, and more, which risks tuning out true-positives and reducing the value of the use case.

Narrow, short-term view of the incident and related activity
Many use cases only look back an hour, giving the analyst only a short-term view of the user or entity that generated the alert. Given an analyst’s workload, they may put off the extended queries that would paint a better picture of the threat actor as other incidents arise.

Can’t reach a consensus to disable use cases
It can be easier to implement a use case than to disable one. Many organizations require approvals from multiple teams and lines of business to turn off a use case. Since no stakeholder wants to be held liable in the event a disabled use case could have detected an attack, they defer to leaving it enabled.


The Risk Analysis Framework can assist with all of the above.

Have a noisy use case with a high false positive rate? Let it run in the background and add to the user’s or system’s risk score rather than tie up an analyst. If the user or system continues to trigger it consistently, or triggers other use cases, then start investigating.

Compliance team giving you a use case that you think isn’t a good use of the security team’s time? Work with them to implement it as a non-alerting rule that adds to the entity’s risk score. I’ve worked with many audit and compliance teams, and they were all open to using the risk scoring method for many of the use cases they proposed.

Unable to implement a use case without filtering service accounts? Create one rule that excludes service accounts and generates an alert, and another that includes service accounts and only adds to the user’s risk score.

Want the security team to take a longer-term view of the user or IP that generated the alert? The Risk Analysis Framework is a quick and easy way to provide that visibility. The analyst can quickly see what other use cases the user or IP has triggered in the past 6 months or more.

Can’t reach an agreement to disable a use case? Leave it enabled but convert it to a risk scoring use case.

How the Risk Analysis Framework works

The Risk Analysis Framework is an index (data store) for alerts (notables). To create a risk score for a Correlation Search, add a “Risk Analysis” action.

There are three options when configuring a Risk Analysis action:

1. Risk Object Field: (e.g. Username, source_IP_Address, destination_IP_Address, etc.)

2. Risk Object: User, System (e.g. hostname, IP), Other

3. Risk Score: (e.g. 20, 40, 60 etc.)

Most use cases would use “User” as the Risk Object, but some would be more appropriate for “System” (e.g. a password spraying use case).

Once your use cases are configured with a Risk Analysis action, you can create an alert that sums the risk scores for all risk objects over a particular time period. Using the below table as an example, we could create an alert when a risk object has a score of 250 or greater in the past 6 months. Once the user “Frank” triggered the last alert below, his score would reach the threshold and an alert would be generated.

| Date                | Use Case            | Risk Object | Use Case Score | Cumulative Score |
| 09-23-2021 13:44:22 | Rare domain visited | Frank       | 100            | 100              |
| 09-25-2021 09:02:45 | Account Locked Out  | Frank       | 50             | 150              |
| 09-27-2021 21:33:12 | Rare File Executed  | Frank       | 100            | 250              |
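The threshold logic in this example can be sketched in a few lines of Python (an illustration of the concept only, not Splunk's implementation; the event tuples are hypothetical):

```python
from collections import defaultdict

# Hypothetical risk index entries: (risk_object, use_case, score), as in the table above
risk_events = [
    ("Frank", "Rare domain visited", 100),
    ("Frank", "Account Locked Out", 50),
    ("Frank", "Rare File Executed", 100),
]

def risk_objects_over_threshold(events, threshold):
    """Sum scores per risk object and return those meeting or exceeding the threshold."""
    totals = defaultdict(int)
    for risk_object, _use_case, score in events:
        totals[risk_object] += score
    return {obj: total for obj, total in totals.items() if total >= threshold}

print(risk_objects_over_threshold(risk_events, 250))  # {'Frank': 250}
```

A real implementation would do this with a scheduled search over the risk index; the sketch just shows the aggregation. Counting the number of distinct use cases per risk object is a similarly small aggregation.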

You can also get creative with the Risk alert by including not only the total risk score, but also the number of unique use cases triggered by the entity.

What’s an appropriate risk score and alert criteria? That depends on many factors, including your organization’s size, line of business, and use cases. Generally speaking, you should look for high scores that are rare in your organization, as well as users that trigger multiple use cases. The average user should not have a risk score in the hundreds or be triggering multiple security use cases.

Does the Risk Analysis Framework need tuning just like your other use cases? The answer is likely yes for your organization. To maximize its value, you should consider omitting particular users (e.g. default account names), use cases that can make it noisy, and other factors that would reduce its value. You should avoid “dumping” use cases into the Risk Framework simply to obtain a check mark, and prevent it from becoming filled with excessively noisy and low-value use cases.

So if you’re running Enterprise Security, the Risk Analysis Framework is an easy way to help with many of the common problems in SIEM environments. It’s easy to use, there’s little to maintain, and it can provide significant security value.

Step into the ring with SIEM heavyweight Sumo Logic

While it has been around for over a decade, Sumo Logic is still unknown to many information security departments. Its absence from the Gartner SIEM Magic Quadrant has likely contributed to its SIEM popularity challenges, but changes in the industry may be in its favour as many organizations migrate to the cloud.

Sumo Logic has mainly been a log management-only solution with limited event management capabilities, but it now offers a full SIEM solution via the JASK acquisition. In addition to common SIEM capabilities, Sumo Logic also provides infrastructure monitoring and business analytics, giving organizations the opportunity to use it for multiple business functions. With a client base of over 2,000 organizations and many looking to build in the cloud, this “Continuous Intelligence” platform is definitely worth consideration.

I was able to demo Sumo Logic and explore many of the features of its base log management product.

Here are some of the things I liked about it:

Cloud-focused SIEM

One of the things that stood out with Sumo Logic was its direct integrations with many common cloud vendors. Setting up its integrations with AWS, Netskope, and Cloudflare took just a few clicks, and data was being ingested within minutes.

Practical Searching

Sumo Logic has a Lucene-like search language that makes it easy to obtain common security search results. Aggregations and common security searches for IPs, hostnames, and usernames were easy to learn. If you’re familiar with Splunk, picking up Sumo Logic’s search syntax will be easy.

Option to structure data during ingestion or search-time

Structuring data is a critical function of a SIEM. Some SIEMs parse data during ingestion, while others do so at search time. It’s debatable which approach is better, but Sumo Logic has taken an approach that gives you the best of both worlds. Sumo Logic is designed to parse at search time, but you can parse up to fifty fields during ingest. This allows you to structure and quickly search your commonly used fields such as IPs, hostnames, URLs, and usernames, giving you fast search response times while limiting the number of parsers you need to create and maintain. Any other fields can be parsed as needed at search time.

Infrastructure Monitoring and Business Intelligence Capabilities

While providing many SIEM capabilities, Sumo Logic also provides infrastructure monitoring and business analytics. You can monitor and alert on system resource utilization on your servers and applications, and its query language makes it easy to calculate sales, profits and other common business metrics, and turn them into charts and other visualizations.

So if you’re looking for infrastructure monitoring or business analytics in addition to a SIEM, or simply want to consolidate applications, Sumo Logic can be used for all three functions.

Good Documentation

Whether I wanted to know about data source integrations, search operators, or using lookup tables, I found the Sumo Logic documentation helpful, accurate, and up to date.

Common SIEM functionality

  • Real-time and scheduled alerts
    You can create your use cases via a search, and then schedule it on a real-time basis or at a regular interval (hourly, daily). The results of a search can be emailed, sent via a Webhook, written to an index, or forwarded to a SOAR.
  • Ability to export a significant amount of data to a file
    During a security incident, you may need to export a significant amount of data to a CSV file for further analysis, or to share with other staff. With Sumo Logic you can export up to 100,000 search results from the web interface, and up to 200,000 results via the API.
  • Supports lookup tables
    Lookup tables are commonly used by security staff to compare large numbers of IPs, usernames, and hostnames in firewall, proxy, and other data. Security teams often receive large lists of suspicious IPs/domains that need to be correlated against network traffic. You can import lookup tables in Sumo Logic directly via the web interface, and then use them in your searches.
  • Manually importing data from a file (csv, text)
    A common use case is to perform an ad-hoc analysis of a log file from another security application. With Sumo Logic you can import a file directly via the web interface, and then analyze it using common search functions and aggregations, as well as correlate the data against other data sources.
  • Useful search operators (parse, parse regex, JSON)
    Sumo Logic has practical search operators that allow you to extract data for matching, counting, and sorting. You can search via regex, and two particularly useful operators are parse (which lets you easily match on any characters) and the JSON operator, which makes it easy to extract JSON values.
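Several of these functions, the lookup-table workflow in particular, boil down to a set membership check against your logs. A generic Python sketch of the idea (illustrative only, not Sumo Logic code; the fields and values are made up):

```python
import csv
import io

# Hypothetical threat-intel lookup table (in Sumo Logic this would be imported via the web UI)
lookup_csv = """ip,description
203.0.113.7,known scanner
198.51.100.9,botnet C2
"""

# Hypothetical firewall events to correlate against the lookup
events = [
    {"src_ip": "10.0.0.5", "action": "allow"},
    {"src_ip": "203.0.113.7", "action": "allow"},
]

# Build a set of suspicious IPs from the lookup, then keep only matching events
suspicious = {row["ip"] for row in csv.DictReader(io.StringIO(lookup_csv))}
matches = [e for e in events if e["src_ip"] in suspicious]
print(matches)  # [{'src_ip': '203.0.113.7', 'action': 'allow'}]
```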

Concerns/Considerations:

Doesn’t support many legacy systems

This would only be a “concern” if you have legacy systems and a significant on-prem presence. However, on-prem clients don’t appear to be part of Sumo Logic’s strategy.

Maximum 99.9% Uptime

A 99.9% SLA permits roughly nine hours of downtime a year. That may not seem like much, but it can be an eternity if those nine hours happen to fall during an incident or other critical event.

Base product is more of a log management tool than SIEM

The base product comes close to being a full SIEM, but it lacks a basic incident management app and provides a limited use case library. So if you’re only going with the base product, you’ll have to use another app, Excel, or your SOAR for case management.

While there are dashboards for many of the supported data sources, there doesn’t appear to be a significant library of real-time alerts, so much of it would have to be developed internally by your security team or Sumo Logic professional services.

Base product and SIEM product are separate apps

The base log management and SIEM applications are different products, so data has to be forwarded from the Collectors to both.


Summary

Overall, I found the product stable and intuitive, the integrations easy to set up, and the query language easy to learn. The product provided fast search response times in general, and even better performance on searches against fields parsed during ingestion. Common security functions such as lookup tables, data exports to a file, and manually uploading a log file were all intuitive and can be done directly via the web interface.

So if your next SIEM is going to be in the cloud, be sure to check out Sumo Logic.

Azure Sentinel Lists and Rules

One of the first questions I had about Azure Sentinel was if it supports “Lists.” Lists are available in most (if not all) SIEMs, and how they work in each differs. Lists can help end users create use cases, store selected data outside of retention policies, blacklist/whitelist, and more. You can read more about the utility of SIEM Lists in a previous post here.

Regarding Sentinel, the answer is yes, it supports two main types of lists: temporary lists that are created and used in queries, and external lists (e.g. CSV files hosted in Azure Storage) that can be used for lookups. You can also create a custom log source via the CEF Connector and use that as a pseudo list.

In this post we’ll create a couple of lists and analytics rules that will trigger off values in the lists. We’ll use the data generated from the custom CEF Connector created in a previous post here.

The first use case will detect when the same user account has logged in from two or more different IP addresses within 24 hours, a common use case to flag potential account sharing or compromised accounts. The second use case will trigger when a login is detected from an IP found in a list of malicious IP addresses.

First, let’s create a query to put the users that are logging in from 2 or more different IP addresses into a list called ‘suspiciousUsers’.

Next, let’s take the users from the list and then query the same log source to show the applicable events generated by those users. The results show us all the “Login Success” events generated by the users in the list. We could also use this list to query other data sources in Sentinel.

Query:

let suspiciousUsers =
CommonSecurityLog
| where TimeGenerated >= ago(1d)
| where DeviceProduct == "Streamlined Security Product"
| where Message == "Login_Success"
| summarize dcount(SourceIP) by DestinationUserName
| where dcount_SourceIP > 1
| summarize make_list(DestinationUserName);
CommonSecurityLog
| where TimeGenerated >= ago(1d)
| where DeviceProduct == "Streamlined Security Product"
| where Message == "Login_Success"
| where DestinationUserName in (suspiciousUsers)

So instead of adding the applicable events to the list as they occur and then have a rule query the list, we are simply creating the list in real-time and then using the results in another part of the query. Since the list is temporary, the major thing to consider here is ensuring your retention policies are in line with your use case. This is not an issue with this use case as we are only looking at the past 24 hours, but if you would like to track e.g. RDP authentication events over 6 months, you would need 6 months of online data.

For the next list, we’ll use our CEF Connector to ingest a list of malicious IPs from a threat intelligence feed. We’ll use a simple Python script to write the values in the file to the local Syslog file on a Linux server, which will then be forwarded to Sentinel by the CEF Connector. The IPs in the file were randomly generated by me.

The CSV file has three columns: Vendor, Product, and IP. The values look as follows:

Using an FTP Client (e.g. WinSCP), copy the CSV file to the server.
Next, let’s create a file, give it execute permissions, and then open it.

touch process_ti.py
chmod +x process_ti.py
vi process_ti.py

Paste the script found here into the file, save and close, then run it.

./process_ti.py
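The linked script isn’t reproduced here, but a minimal version might look like the following sketch. It assumes the three CSV columns described above (Vendor, Product, IP) with no header row; the CEF event class ID and name are my own placeholders:

```python
#!/usr/bin/python
# Hypothetical sketch: read a CSV of (Vendor, Product, IP) rows and write one
# CEF-formatted line per entry to the local Syslog file.
import csv

def row_to_cef(vendor, product, ip):
    """Build a CEF message for one threat-intel entry (event ID/name are placeholders)."""
    return "CEF:0|%s|%s|1.0|2000|Malicious IP|10|src=%s" % (vendor, product, ip)

def process(path):
    """Write every row of the CSV at 'path' to Syslog in CEF format."""
    import syslog  # Unix-only module; imported here so the helper above stays portable
    syslog.openlog(facility=syslog.LOG_LOCAL7)
    with open(path) as f:
        for vendor, product, ip in csv.reader(f):
            syslog.syslog(syslog.LOG_NOTICE, row_to_cef(vendor, product, ip))
```

Calling `process("your_file.csv")` would then push each entry through the CEF Connector on its next Syslog read.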

Let’s check that there are 300 entries from our CSV file:

CommonSecurityLog
| where DeviceVendor == "Open Threat Intel Feed"
| summarize count()

Now that we’ve confirmed the ingestion was successful, let’s make a list named ‘maliciousIPs’. We’ll use this list to match IPs found in the Streamlined Security Product logs.

let maliciousIPs =
CommonSecurityLog
| where TimeGenerated >= ago(1d)
| where DeviceVendor == "Open Threat Intel Feed"
| summarize make_list(SourceIP);
CommonSecurityLog
| where TimeGenerated >= ago(1d)
| where DeviceProduct == "Streamlined Security Product"
| where SourceIP in (maliciousIPs)

Output should look as follows, showing the authentication events from IPs in the ‘maliciousIPs’ list.


Now that we can lookup the data with the queries, let’s create a couple of analytics rules that will detect these use cases in near real-time.

From the Analytics menu, select ‘Create’, then ‘Scheduled query rule’.


Enter a name and description, then select ‘Next: Set rule logic >’.


Enter the query used for the first list (suspiciousUsers), and then we’ll map the DestinationUserName field to the ‘Account’ Entity Type, and SourceIP field to the ‘IP’ Entity Type. You need to click ‘Add’ in order for it to be added to the rule query. Once it’s added the column value will say ‘Defined in query’.


For Query scheduling, run the query every five minutes, and lookup data from the last hour. Set the alert threshold to greater than 0, as the threshold for this use case is already set in the query (2 or more IPs for the same user). We’ll leave suppression off.

One of the nice things about creating rules in Sentinel is that it shows you how many hits your rule would have triggered based on your parameters. The graph saves you from doing this analysis yourself, which you would likely otherwise do when creating a use case.

We’ll leave the default settings for the ‘Incident settings (Preview)’ and ‘Automated response’ tabs, and then click ‘Create’ on the ‘Review and create’ tab.


Once the rule is created, we can go to the ‘Incidents’ tab to see triggered alerts. We can see that the rule was already triggered by three user accounts.


Next, let’s create a rule that triggers when a user logs in from an IP in the ‘maliciousIPs’ list we created.


We’ll add the query and Map entities as we did in the prior rule.


We’ll schedule the query and set the threshold as follows.


We’ll leave the default settings for the ‘Incident settings (Preview)’ and ‘Automated response’ tabs, and then click ‘Create’ on the ‘Review and create’ tab.


Once the rule is created, we can go back to the ‘Incidents’ page and see that it has already been triggered by three users.


As you can see, lists are easy to create and can be useful when writing queries and developing use cases. You can also use an external file hosted in Azure Storage and access it directly within a query. For further reading on this topic, there are some helpful posts available on the Microsoft site here, and here.

SIEM Lists and Design Considerations


Those familiar with creating use cases in SIEMs have likely at some point worked with “Lists.” These are commonly known in SIEMs as “Active Lists” (ArcSight), “Reference Sets” (QRadar), “Lookups” (Splunk), “Lookup Tables” (Securonix, Devo), and similar in other tools. Lists are essentially tables of data, and you can think of them as an Excel-like table with multiple rows and columns. Lists are different in each of the SIEMs on the market. Some are simply a single column which you can use for e.g. IP Addresses, and others are up to 20 columns that can support a significant amount of data. Log retention policies typically don’t apply to Lists, so you can keep them for as long as needed.

SIEM Lists have three main drivers: limitations with real-time rule engines, limited retention policies, and external reference data.

Limitations with Real-Time Rule Engines

SIEMs with real-time rule engines have the advantage of triggering your use cases as data is ingested (versus running a query every 5 minutes). But the real-time advantage turns into a disadvantage when a use case spans a greater timeframe. The longer the timeframe of the use case, the more system resources used by the real-time engine, making some use cases impossible. For example, real-time rule engines can easily detect 10 or more failed logins in 24 hours, but not over three months, as that would be far too much data to keep in memory. To compensate, Lists can store the data required by use cases that can’t be handled by the real-time rule engine. A List can store, for example, RDP logins over a much longer period, e.g. one year, including the login type, username, hostname, IP address, and possibly more depending on your SIEM. You can then trigger an alert when the number of entries in the List for a particular user reaches the desired threshold.
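The pattern can be sketched generically in Python (an illustration of the concept, not any particular SIEM's API; the entries and field layout are hypothetical):

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical persistent List of RDP logins: (timestamp, username, hostname, ip)
rdp_list = [
    (datetime(2021, 7, 1), "frank", "host1", "10.0.0.1"),
    (datetime(2021, 8, 15), "frank", "host2", "10.0.0.2"),
    (datetime(2021, 9, 20), "frank", "host3", "10.0.0.3"),
]

def users_over_threshold(entries, window, threshold, now):
    """Count List entries per user inside the window; return users at/over the threshold."""
    cutoff = now - window
    counts = Counter(user for ts, user, _host, _ip in entries if ts >= cutoff)
    return sorted(u for u, c in counts.items() if c >= threshold)

# A one-year window far exceeds what a real-time rule engine could hold in memory
print(users_over_threshold(rdp_list, timedelta(days=365), 3, datetime(2021, 9, 30)))
# ['frank']
```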

Limited Retention Policies

Limited retention policies were also a large driver for Lists. Most SIEM environments only have 3 months of online data. In order to access data older than that, it must be e.g. restored from an archive/backup, which is typically inconvenient enough that an analyst won’t even ask for it. With Lists, you can store selected data outside of your retention policies. If you want to store RDP logins for longer than your retention policy allows, you can simply add the values to a List.

External Reference Data

SIEMs are extremely effective at matching data in log files. The advent of threat intelligence data brought security teams massive lists of malicious IP addresses, hostnames, email addresses, and file hashes that needed to be correlated with firewall, proxy, and endpoint protection logs. These threat intel lists can be easily put into a List and then correlated with all applicable logs. Most (if not all) SIEM products support these types of Lists.

Other List Uses

Lists can often enhance your use case capabilities. If your SIEM product can’t meet all of the conditions of a use case with its real-time engine or query, you can sometimes use Lists to compensate. For example, you can put the result of the first use case into a List, and then use a second use case that uses both the real-time engine and values in the List.

Lists can be useful for whitelisting or suppressing duplicate alerts. For example, you can add the IP, username or hostname of the event that triggered an alert to a List (e.g. users/domains that are already being investigated), and use the List to suppress subsequent alerts from the same IP, username, or hostname.
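A suppression List of this kind reduces to a simple membership check; sketched generically in Python (the entities and rule names are made up):

```python
# Hypothetical suppression List: entities already under investigation
suppression_list = {"frank", "203.0.113.7"}

# Hypothetical freshly triggered alerts
alerts = [
    {"entity": "frank", "rule": "Rare domain visited"},
    {"entity": "alice", "rule": "Account Locked Out"},
]

# Only alerts whose entity is not already on the List generate new notifications
new_alerts = [a for a in alerts if a["entity"] not in suppression_list]
print(new_alerts)  # [{'entity': 'alice', 'rule': 'Account Locked Out'}]
```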

Lists can also help simplify long and complicated queries. Instead of writing a single query, you can put the results of the first part of a query into a List, and then have the second query run against the values in the List.


As you can see, Lists can be very useful for SIEM end users. Overlooking List functionality during a SIEM design can have profound impacts. While List functionality differs per SIEM, it’s important to understand how your SIEM works and ensure it meets your requirements.

Integrate Custom Data Sources with Azure Sentinel’s CEF Connector

Microsoft Azure Sentinel allows you to ingest custom data sources with its CEF Connector. For those not familiar with CEF (Common Event Format), it was created to standardize logging formats. Different applications can log in wildly different formats, leaving SIEM engineers to spend a large portion of their time writing parsers and mapping them to different schemas. Thus, CEF was introduced to help standardize the format in which applications log, encourage vendors to follow the standard, and ultimately reduce the amount of time SIEM resources spend writing and updating parsers. You can find more information on CEF on the Micro Focus website.

With Azure Sentinel, you can ingest custom logs by simply writing in CEF format to the local Syslog file. Many data sources already support Sentinel’s CEF Connector, and given how simple it is, I’m sure your developers or vendors wouldn’t mind logging in this format if asked. Once the data source is logging in CEF and integrated with Sentinel, you can use the searching, threat hunting, and other functionality provided by Sentinel.

To highlight this, we’re going to write to the Syslog file on a default Azure Ubuntu VM, and then query the data in Sentinel. This activity is simple enough to be done with basic Azure and Linux knowledge, and can be done with Azure’s free subscription, so I would encourage anyone to try it.

Requirements:
-Azure subscription (Free one suffices)
-Basic Azure knowledge
-Basic Linux knowledge
-Azure Sentinel instance (default)
-Azure Ubuntu VM (A0 Basic with 1 CPU and 0.75 GB RAM suffices)

Once you have an Azure subscription, the first step is to create an Azure Sentinel instance. If you don’t already have one, see the “Enable Azure Sentinel” section on the Microsoft Sentinel website.

Once you have an Azure Sentinel instance running, create an Ubuntu VM. Select ‘Virtual Machines’ from the Azure services menu.

Select ‘Add’.

Add in the required parameters:

At the bottom of the page, select ‘Next: Disks’.

Leave all default values for the Disk, Networking, Management, Advanced, and Tags sections, and then select ‘Create’ on the ‘Review and create’ tab.

Add a firewall rule that will allow you to SSH to the server. For example, you can add your IP to access the server on port 22.

Next, select ‘Data connectors’, then the ‘Common Event Format (CEF)’ Connector from the list, then ‘Open connector page’.

Copy the command provided. This will be used on the Ubuntu server.

Connect to the Ubuntu server using an SSH client (e.g. Putty).

Once logged in, paste the command, then press Enter.

Wait for the install to complete. As noted on the CEF Connector page, it can take up to 20 minutes for data to be searchable in Sentinel.

You can check if the integration was successful by searching for ‘Heartbeat’ in the query window.

Next, we’re going to use the Linux logger command to generate a test authentication CEF message before using a script to automate the process. We’re going to use the standard CEF header fields, as well as the extension fields ‘src’ (source address), ‘dst’ (destination address), and ‘msg’ (message). You can add additional fields as listed in the CEF guide linked at the beginning of this post. Command:

logger "CEF:0|Seamless Security Solutions|Streamlined Security Product|1.0|1000|Authentication Event|10|src=10.1.2.3 dst=10.4.5.6 duser=Test msg=Test"

where the header format is:
CEF:Version|Device Vendor|Device Product|Device Version|Device Event Class ID|Name|Severity|
followed by the extension fields (src, dst, duser, msg).
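Getting the pipe-delimited header right is easier with a small helper; the following is my own sketch, not part of any Sentinel or CEF SDK (it relies on Python 3.7+ keyword-argument ordering for the extension fields):

```python
def cef_message(vendor, product, version, event_id, name, severity, **extensions):
    """Build a CEF:0 message: pipe-delimited header followed by key=value extensions."""
    header = "CEF:0|%s|%s|%s|%s|%s|%s|" % (vendor, product, version, event_id, name, severity)
    extension = " ".join("%s=%s" % (k, v) for k, v in extensions.items())
    return header + extension

msg = cef_message("Seamless Security Solutions", "Streamlined Security Product",
                  "1.0", "1000", "Authentication Event", "10",
                  src="10.1.2.3", dst="10.4.5.6", duser="Test", msg="Test")
print(msg)  # identical to the logger example above
```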

The event appears as expected when searching the ‘CommonSecurityLog’, where events ingested from the CEF Connector are stored:

Now we’re going to use the Python script at the end of this post that will generate a sample authentication event every 5 minutes. Simply create a file, give it execute permissions, then open it with vi. Be sure to put the file in a more appropriate location if you plan on using it longer-term.

mkdir /tmp/azure
touch /tmp/azure/CEF_generator.py
chmod +x /tmp/azure/CEF_generator.py
vi /tmp/azure/CEF_generator.py

Paste the script into the file by pressing ‘i’ to insert, and then paste. When finished, exit by pressing Esc, and then save and exit, ‘:wq’.

Run the script in the background:

nohup python /tmp/azure/CEF_generator.py &

As expected, events generated on the Ubuntu server are now searchable in Sentinel:

In less than an hour, you now have searchable data in a standard format from a custom application using Sentinel’s CEF Connector. Additionally, you can set up alerts based on the ingested events, and work with the data in other Sentinel tools such as the incident management and playbook apps.

CEF generator Python script link here.

CEF Event Generator

This is a sample CEF generator Python script that will log sample authentication events to the Syslog file. Created and tested on an Azure Ubuntu 18.04 VM. Please check indentation when copying/pasting.

#!/usr/bin/python
# Simple Python script designed to write to the local Syslog file in CEF format on an Azure Ubuntu 18.04 VM.
# Frank Cardinale, April 2020

# Importing the libraries used in the script
import random
import syslog
import time

# Simple list that contains usernames that will be randomly selected and then output to the "duser" CEF field.
usernames = ['Frank', 'John', 'Joe', 'Tony', 'Mario', 'James', 'Chris', 'Mary', 'Rose', 'Jennifer', 'Amanda', 'Andrea', 'Lina']

# Simple list that contains authentication event outcomes that will be randomly selected and then output to the CEF "msg" field.
message = ['Login_Success', 'Login_Failure']

# Endless loop that will run the below every five minutes.
while True:

    # Assigning a random value from the above lists to the two variables that will be used to write to the Syslog file.
    selected_user = random.choice(usernames)
    selected_message = random.choice(message)

    # Assigning random integer values from 1-255 that will be appended to the IP addresses written to the Syslog file.
    ip = str(random.randint(1, 255))
    ip2 = str(random.randint(1, 255))

    # The full Syslog message that will be written.
    syslog_message = "CEF:0|Seamless Security Solutions|Streamlined Security Product|1.0|1000|Authentication Event|10|src=167.0.0." + ip + " dst=10.0.0." + ip2 + " duser=" + selected_user + " msg=" + selected_message

    # Writing the event to the Syslog file.
    syslog.openlog(facility=syslog.LOG_LOCAL7)
    syslog.syslog(syslog.LOG_NOTICE, syslog_message)

    # Pausing the loop for five minutes.
    time.sleep(300)

# End of script

Standard-Size your SIEM HA and DR Environments

A common decision made when designing a highly available or disaster-recovery-enabled SIEM solution is to under-size the secondary environment with fewer resources than the main production environment. Many believe that losing a server to a hardware failure, or an application to corruption, is highly unlikely, and that if such a situation were to occur, a system with fewer resources would suffice while the primary system is down. Thus, with a minute probability of a server or application failure, many believe they can get away with less RAM, fewer cores, and less disk space on the HA or DR server(s). After all, none of us want to bet on something that isn’t likely to happen.

For many organizations, this may be an acceptable risk for many reasons, including budgetary restrictions, other security compensating controls, risk appetite, and others. For example, a small company simply may not have the budget for a highly available SIEM solution. For others, it may be that their other security applications provide compensating controls, where their analysts can obtain log data from other sources.

For organizations looking to implement high availability in some fashion, it’s important to understand how small differences can have a major impact.

To highlight what can happen, let’s use Company A as an example. Company A is designing a new SIEM estimated to process 10,000 EPS. The company has requested additional budget for HA capabilities but wants to save some of the security budget for another investment. They find an unused server in the data centre that has less RAM, fewer cores, and less disk space, and decide to use it as the DR server. They ask their SIEM vendor for their thoughts, and the vendor replies that the reduced system resources should only result in slower search response times, and the 2 TB hard drive should provide just over four days of storage (given 10,000 EPS X 2,500 bytes per normalized event X 86,400 seconds/day = 2.16 TB/day uncompressed, which 80% compression reduces to 432GB/day). Company A accepts the risk, thinking that any system issue should be addressed within their 24-hour hardware SLA, and that they should have the application back up in two days at most.

A few months down the road, the Production SIEM system fails. The operations staff at Company A quickly reconfigure the SIEM Processing Layer (Connectors/Collectors/Forwarders) to point to the DR server. Log data begins ingesting into the DR server and is searchable by the end users. The security team pats themselves on the back for averting disaster.

However, things go sour when the server admins learn that the hardware vendor won’t be able to meet the 24-hour SLA. Three days pass and the main production server remains offline. The DR server is still processing data, and searches, though slower, are completing; but the security team grows anxious as the 2 TB hard drive approaches 90% utilization on the fourth day.

When disk capacity is fully utilized early the following morning, the SIEM Analytics Layer (where data is stored) begins purging data older than four days, diminishing the security team’s visibility. The purging jobs add stress to the already fully utilized cores, which now also causes searches to time out. The Analytics Layer then begins refusing new data, forcing the SIEM Processing Layer to cache data locally. That is a further concern for the security team: the single Processing Layer server has only 500 GB of disk space, and because this SIEM can only compress at the Analytics Layer, the cached data is uncompressed. At this rate, the 400 GB of available cache would be fully utilized in roughly four hours (10,000 EPS × 2,500 bytes per normalized event × 14,400 seconds = 360 GB uncompressed).
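The cache-exhaustion window can be estimated the same way. This sketch assumes, as in the example, that the Processing Layer caches events uncompressed at 2,500 bytes each:

```python
# Time for the Processing Layer's local cache to fill once the
# Analytics Layer refuses new data. Events are cached uncompressed
# because, in this example, compression only happens at the Analytics Layer.

def hours_until_cache_full(eps, bytes_per_event, cache_bytes):
    """Hours until the local cache is fully utilized."""
    bytes_per_hour = eps * bytes_per_event * 3_600
    return cache_bytes / bytes_per_hour

hours = hours_until_cache_full(10_000, 2_500, 400e9)  # 400 GB of free cache
print(round(hours, 1))  # ~4.4 hours
```

In other words, once the Analytics Layer stops accepting data, the team has well under half a business day before events start being dropped entirely.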

As you can see, overlooking a seemingly minor design consideration, making assumptions, and relying on SLAs can have major impacts on your environment and reduce the utility of your SIEM in a disaster. As SIEMs consist of multiple applications (Connectors, Analytics), high availability should be considered for each component. Forgoing high availability may be necessary for budgetary or other reasons, but it’s critical to ensure that the environment is aligned with your requirements and risk appetite.

Managing Your SIEM Internally or Outsourcing

Given the advent of cloud computing, many companies now outsource at least some of their IT infrastructure and applications to third parties, allowing them to focus on their core business. Many organizations don’t want to invest in an IT department or data center, and can’t match the speed and efficiency that Amazon AWS or Microsoft Azure can provide.

While your SIEM environment is likely one of your more complicated security applications to manage, you can outsource one or more parts of it to third parties, including content development, application management, and infrastructure management. For example, you can have your security analysts develop content internally, have the application set up and maintained by the vendor or another third party, and have the infrastructure hosted in AWS or Azure.

Before embarking on an outsourcing initiative, there are some major considerations to weigh.

Determining which parts to outsource

In order to determine which parts of your SIEM environment to keep or outsource, you’ll need to understand your competencies. Do you have a significant number of staff with strong security skills, but not SIEM skills specifically? Do you have a SOC that is proficient in both SIEM content development and investigations? Are you willing to invest in obtaining and retaining scarce SIEM staff? Do your line of business and its regulations force you to keep all data internally? Do you want to keep your SIEM staff internal for a few years and then consider outsourcing? Can your data centre rack-and-stack a server within a day, or does it take six months?

While these can be difficult questions to answer, the good news is that you can outsource virtually any part of your SIEM environment. Many of the major consulting firms can offer SIEM content developers, engineers, investigators, and application experts to manage everything from your alerts to the SIEM application itself.

A summary of the SIEM environment functions that can be outsourced:
Content Development
-SIEM correlation rules, reports, metrics, dashboards, and other analytics.
Investigations
-Performing security investigations, working with the outputs of alerts, reports, etc.
Engineering
-Integrating data into the SIEM, parser development.
Application Management & Support
-Installing, updating, and maintaining the SIEM application.
Infrastructure
-Setup and maintenance of SIEM hardware, servers, and storage.

Privacy Standards and Regulations

Depending on what country and jurisdiction your organization falls under, you may be subject to laws that restrict the processing and storing of data in particular geographic regions. Azure and AWS have expanded significantly and have infrastructure available in many countries, allowing your data to be stored “locally.” Additionally, Azure and AWS can provide secure, private networks, segregating your data from other organizations.

While it can seem risky to store your data with another entity, many organizations take advantage of Amazon’s and Microsoft’s network and security teams rather than building the required teams internally.

Internet Pipe

You’ll need an adequate, fault-tolerant Internet pipe between your organization and your hosting company. The amount of bandwidth needed depends on the SIEM architecture. For example, you may want to collect log data locally at the Processing Layer, where it will be collected, structured/modified, and then sent to the Analytics Layer. If we’re using Product A, which structures log data into an average event size of 2,000 bytes, and our sustained rate is 5,000 EPS, then we’ll need 10 MB/s (80 Mbps) of available bandwidth to forward over the link without introducing latency.
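The link-sizing arithmetic is straightforward: multiply the sustained event rate by the structured event size, then convert bytes to bits. This sketch uses the Product A figures above and leaves out protocol overhead and headroom, which a real design would add on top:

```python
# Sustained bandwidth needed to forward structured events from the
# Processing Layer to a remote Analytics Layer. Excludes TLS/TCP
# overhead and burst headroom, which real designs should budget for.

def required_bandwidth_mbps(eps, bytes_per_event):
    """Sustained link speed in megabits per second."""
    return eps * bytes_per_event * 8 / 1e6

mbps = required_bandwidth_mbps(5_000, 2_000)
print(mbps)  # 80.0 Mbps, i.e. 10 MB/s sustained
```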

A SIEM Application That Supports Your Desired Outsourced Architecture

Let’s say for example that you want to collect and consolidate log data locally, and then have that consolidated data sent to a SIEM in the cloud, and another copy to your local, long-term data store. Does the application that will be collecting and consolidating log data locally support this model? Is there a limitation that it can only send to one destination, and thus can’t meet the requirement to send to two destinations?

Other Considerations:
Ownership
-Regardless of how you build and operate your SIEM environment, ensure that there’s an owner with a direct interest in maintaining and improving its condition, especially if working with multiple vendors. Multiple parties can point the finger at each other if there are issues within the environment, so it’s critical to have an entity that can prevent stalemates, resolve issues, improve the SIEM’s condition, and ultimately increase the value it provides to your organization.
Contract Flexibility
-If you’re going to be using the RFP process, understand that the third parties you’re submitting the proposal to are there to win business and compete on price. As a result, some may under-size a solution to create a price attractive to the purchaser. While this can simply be considered business as usual, it’s important to understand that your environment may need to be augmented or adjusted over its lifetime, and the service provider may ask for additional funds to keep the environment healthy. Additionally, requirements can change significantly in a short period of time, which can change the infrastructure the environment requires.

Overall, there are many ways to structure your SIEM environment to take advantage of outsourcing. There are many organizations that can help you manage any part of it. How to do it best will depend on your organization’s requirements, line of business, relationships with third parties, competencies, strategy, and growth.

Selecting a SIEM Storage Medium

Given that SIEMs can process tremendous amounts of log data, a critical foundation of your SIEM environment is storage. In order to provide analytical services and fast search response times, SIEMs need dedicated, high-performing storage. Inadequate storage will result in poor query response times, application instability, frustrated end users, and ultimately make the SIEM a bad investment for your organization.

Before running out and buying the latest and greatest storage medium, understanding your retention requirements should be the first step. Does your organization need the typical three months of online and nine months of offline data? Or do you have larger requirements, such as six months of online followed by two years of offline data? Do you want all of your data “hot”? Answering these questions first is critical to keep costs as low as possible, especially if you have large storage requirements.

Once we understand the storage requirements, we can better determine which storage media to use for the environment. If we have a six-month online retention requirement at a 50,000 event-per-second processing rate, we’re going to need dedicated, high-speed storage to handle the throughput. While we certainly need to meet the IOPS figures vendors specify, we must also ensure the storage medium is dedicated: even if the storage has the required IOPS, if the application can’t access it when required, the IOPS are irrelevant. Thus, if using a technology such as SAN, ensure the application has dedicated access to the required storage and that the SAN is configured accordingly.
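To see why the 50,000 EPS, six-month scenario demands serious storage, we can size it roughly. The 2,500 bytes/event and 80% compression figures are assumptions carried over from the earlier Company A example, not vendor numbers:

```python
# Rough online-storage sizing for a six-month retention requirement.
# Event size and compression ratio are assumed, not vendor-specified.

def online_storage_tb(eps, bytes_per_event, days, compression_reduction):
    """Compressed online storage needed, in terabytes."""
    raw = eps * bytes_per_event * 86_400 * days
    return raw * (1 - compression_reduction) / 1e12

tb = online_storage_tb(50_000, 2_500, 180, 0.8)
print(round(tb))  # ~389 TB of compressed online storage
```

Note the sustained write rate alone (50,000 × 2,500 bytes = 125 MB/s, before search reads) is why shared or contended storage tends to fall over at these rates.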

Another factor to consider when designing your storage architecture for your SIEM environment is what storage will be used per SIEM layer. The Processing Layer (Connectors/Collectors/Forwarders) typically doesn’t store data locally unless the Analytics Layer (where data is stored) is unavailable. However, when the Analytics Layer is unavailable, the Processing Layer should have the appropriate storage to meet the processing requirements. Dedicated, high-speed storage should be used to process large EPS rates, and should have the required capacity to meet caching requirements.

To save on storage costs, slower, shared storage can be used to meet offline retention requirements. When needing access to historical data, the data can be copied back locally to the Analytics layer for searching.

Ensuring you have the right storage for your SIEM environment is a simple but fundamental task. As SIEMs can take years to fully implement and equally long to change, selecting the correct storage is critical. Medium-to-large enterprises should use dedicated, high-speed storage to obtain fast read and write performance. Smaller organizations should ideally make the same investment, but there are cases where slower, more cost-effective storage can suffice for low processing rates and minimal end-user usage of the SIEM.

Understanding Your License Model

SIEM license models can vary significantly. Some are based simply on average ingested data per day, while others combine multiple factors such as ingested data per day, number of end users, and the number of devices data is collected from. Regardless of the model, it’s critical to understand how it works to ensure you allocate sufficient funds for it. A misunderstanding of your license model can unexpectedly consume more security budget than anticipated, increasing risk to your organization by limiting the resources available for both the SIEM and other security services.

Additionally, as most companies are constantly growing and changing, it’s pivotal to understand how the license model can be augmented, changed, and what the penalties are for any violations.

While the simpler the license model the better, there’s nothing wrong with a license model with various factors as long as it’s well understood and meets your organization’s requirements. After a requirements gathering exercise, you should be able to tell your vendor the expected ingestion rates per day, how many users there will be, and the expected growth rates.

There are other less-obvious factors that can also significantly affect license models. Two often overlooked factors are how the vendor charges for filtering/dropping unneeded data, and if the ingested data rates are based on raw or aggregated/coalesced amounts. For example, if you’re planning on dropping a significant amount of data by the Processing Layer, Product A (which doesn’t charge for dropped data) would have lower license costs than Product B (which can drop data, but includes the dropped amount in license costs), all else equal. Product C, which aggregates/coalesces data and determines license costs based on the aggregated/coalesced EPS, would have lower license costs than Product D, which aggregates/coalesces data but determines license costs based on raw EPS rates, all else equal.
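The Product A–D comparison above can be made concrete with a small model. The product behaviors (whether dropped data is billed, and whether billing uses raw or aggregated volume) come from the text; the daily volumes, the 30% filter rate, and the 0.6 aggregation ratio are hypothetical illustration values:

```python
# Illustrative comparison of billable volume under different license
# policies. Volumes and ratios below are made-up example numbers.

def billable_gb(ingested_gb, dropped_gb, charges_dropped,
                aggregation_ratio=1.0, bills_raw=True):
    """Billable GB/day under a given license policy.

    aggregation_ratio: aggregated size / raw size (1.0 = none).
    bills_raw: if True, aggregation doesn't reduce the billed amount.
    """
    kept = ingested_gb if charges_dropped else ingested_gb - dropped_gb
    if not bills_raw:
        kept *= aggregation_ratio
    return kept

raw, dropped = 1_000, 300  # GB/day ingested, GB/day filtered out
product_a = billable_gb(raw, dropped, charges_dropped=False)   # drops free
product_b = billable_gb(raw, dropped, charges_dropped=True)    # drops billed
product_c = billable_gb(raw, 0, False, aggregation_ratio=0.6,
                        bills_raw=False)                       # bills aggregated
product_d = billable_gb(raw, 0, False, aggregation_ratio=0.6,
                        bills_raw=True)                        # bills raw
print(product_a, product_b, product_c, product_d)  # 700 1000 600.0 1000
```

All else equal, Products A and C bill 30–40% less volume than B and D here, which is exactly the kind of difference that’s easy to miss in a side-by-side price quote.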

If you’re comparing different SIEMs, you should ensure that you’re performing an accurate comparison, as SIEMs can vary significantly. A license model for a full SIEM solution from Company A is likely to be more expensive than a log management-only solution from Company B.

SIEMs can be expensive and consume a significant portion of your security budget. Misunderstanding your requirements and then signing a contract with a license model that’s unclear or difficult to understand is a major risk. Reduce that risk by spending the resources necessary to understand the model and choosing one that aligns best with your organization.