Category: Best Practices

Splunk Risk Analysis Framework

One of the most useful features of Splunk Enterprise Security is the Risk Analysis Framework. It allows you to assign a risk score to your use cases that you can use for alerting, threat hunting, and other analytics. Larger organizations will find significant value in the Framework, as it is difficult for most to monitor and triage the alert volume produced by their SIEM.

The Risk Analysis Framework is a simple index that stores each of your alerts (notables) with the entity that triggered the alert (called the “risk object,” e.g. an IP or username) and a numeric score. You can set a threshold suitable for your organization and alert when a particular user or system reaches a given risk score based on the use cases they’ve triggered. So instead of configuring each of your use cases to trigger an investigation, you can run them in the background and only trigger an investigation when the user or system reaches the threshold.

The Risk Analysis Framework can help with the following common issues organizations have with their SIEM environments:

High alert volume and false positive rate
The top SIEM issue that continues to this day. Organizations spend a significant amount of time and resources monitoring alerts from their SIEM use cases, many of which flag suspicious activity, outliers, and rarities, but end up being false positives. Given that these use cases can be recommended by many SIEM vendors and incident response providers, organizations can be reluctant to turn many of them off.

Compliance or Audit Requirements
Some regulated organizations are mandated to implement and monitor particular use cases, which the security team may consider a low-value activity.

Excessive tuning to make a use case feasible
To make many use cases feasible, organizations often tune out service accounts, certain network segments, and more, which risks tuning out true-positives and reducing the value of the use case.

Narrow, short-term view of the incident and related activity
Many use cases only look back an hour, giving the analyst a short-term view of the user or entity that generated the alert. Given an analyst’s workload, they may put off the extended queries that would paint a better picture of the threat actor as other incidents arise.

Can’t reach a consensus to disable use cases
It can be easier to implement a use case than to disable it. Many organizations require approvals from multiple teams and lines of business to turn off a use case. As no stakeholder wants to be left liable in the event the use case could have detected an attack, they default to leaving the use case enabled.

The Risk Analysis Framework can assist with all of the above.

Have a noisy use case with a high false positive rate? Let it run in the background and add to the user’s or system’s risk score rather than tie up an analyst. If the user or system continues to trigger it consistently or other use cases, then start investigating.

Compliance team giving you a use case that you think isn’t a good use of the security team’s time? Work with them to implement it as a non-alerting rule that adds to the entity’s risk score. I’ve worked with many audit and compliance teams, and they were all open to using the risk scoring method for many of the use cases they proposed.

Unable to implement a use case without filtering service accounts? Create one alerting rule that excludes service accounts, and a second rule that includes them and only adds to the user’s risk score.

Want the security team to take a longer-term view of the user or IP that generated the alert? The Risk Analysis Framework is a quick and easy way to provide that visibility. The analyst can quickly see what other use cases the user or IP has triggered in the past 6 months or more.

Can’t reach an agreement to disable a use case? Leave it enabled but convert it to a risk scoring use case.

How the Risk Analysis Framework works

The Risk Analysis Framework is an index (data store) for alerts (notables). To create a risk score for a correlation search, add a “Risk Analysis” action.

There are three options when configuring a Risk Analysis action:

1. Risk Object Field: (e.g. Username, source_IP_Address, destination_IP_Address, etc.)

2. Risk Object: User, System (e.g. hostname, IP), Other

3. Risk Score: (e.g. 20, 40, 60 etc.)

Most use cases would use “User” as the Risk Object, but some would be more appropriate for “System” (e.g. a password spraying use case).

Once your use cases are configured with a Risk Analysis action, you can create an alert that sums the risk scores for all risk objects over a particular time period. Using the below table as an example, we could create an alert when a risk object has a score of 250 or greater in the past 6 months. Once the user “Frank” triggered the last alert below, his score would exceed the threshold and an alert would be generated.

Date                 Use Case             Risk Object   Use Case Score   Cumulative Score
09-23-2021 13:44:22  Rare domain visited  Frank         100              100
09-25-2021 09:02:45  Account Locked Out   Frank         50               150
09-27-2021 21:33:12  Rare File Executed   Frank         100              250

You can also get creative with the risk alert by including not only the total risk score, but the number of unique use cases triggered by the entity as well.
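As a minimal sketch (not Splunk’s implementation — the record layout is illustrative, and the 250 threshold comes from the example above), the alert logic amounts to summing scores per risk object and counting distinct use cases:

```python
from collections import defaultdict

# Example risk events, mirroring the table above: (date, use_case, risk_object, score)
risk_events = [
    ("09-23-2021 13:44:22", "Rare domain visited", "Frank", 100),
    ("09-25-2021 09:02:45", "Account Locked Out", "Frank", 50),
    ("09-27-2021 21:33:12", "Rare File Executed", "Frank", 100),
]

THRESHOLD = 250  # total risk score that triggers an investigation

def risk_alerts(events, threshold=THRESHOLD):
    """Sum risk scores per risk object and return those at/over the threshold,
    along with the number of unique use cases each object triggered."""
    totals = defaultdict(int)
    use_cases = defaultdict(set)
    for _date, use_case, risk_object, score in events:
        totals[risk_object] += score
        use_cases[risk_object].add(use_case)
    return {
        obj: {"total": total, "unique_use_cases": len(use_cases[obj])}
        for obj, total in totals.items()
        if total >= threshold
    }

# Frank's cumulative score reaches 250 across 3 unique use cases,
# so he appears in the alert output.
print(risk_alerts(risk_events))
```

In a real deployment this aggregation would be a scheduled search over the risk index rather than in-memory code, but the logic is the same.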

What’s an appropriate risk score and alert criteria? That depends on many factors, including your organization’s size, line of business, and use cases. Generally speaking, you should look for high scores that are rare in your organization, as well as users that trigger multiple use cases. The average user should not have a risk score in the hundreds or be triggering multiple security use cases.

Does the Risk Analysis Framework need tuning just like your other use cases? The answer is likely yes for your organization. To maximize its value, you should consider omitting particular users (e.g. default account names), use cases that can make it noisy, and other factors that would reduce its value. You should avoid “dumping” use cases into the Risk Framework simply to obtain a check mark, and prevent it from becoming filled with excessively noisy and low-value use cases.

So if you’re running Enterprise Security, the Risk Analysis Framework is an easy way to help with many of the common problems in SIEM environments. It’s easy to use, there’s little to maintain, and it can provide significant security value.

Timestamp Management

One of the most critical components of a SIEM environment is accurate timestamps. In order to perform any investigation, analysts will need to know when all the applicable events occurred. Given that SIEM applications can have many timestamps, it’s critical for staff to know which are used for what. Improper use of timestamps can have many repercussions and add risk to your organization.

The two most important timestamps in an event are the time the event was generated on the data source, commonly known as the Event Time, and the time the event was received by the SIEM, commonly known as the Receipt Time.

A common question posed by junior staff is why they can’t find the events they’re looking for. Once I rule out a case-sensitivity searching issue, I then check to see if the correct dates/times are being used in the search. For example, a brute-force alert was generated after a user generated 50 failed logins. While the alert was generated in the past 24 hours, analysts can’t find any failed logins for the user that triggered the alert. Upon closer inspection, we see that the alert was generated off the Receipt Time, but the Event Time timestamps are from one week ago. After a quick investigation, we discovered that the system owner took the system offline last week, and just reconnected it to the network 24 hours ago.

Timestamp discrepancies can be even more severe when using dashboards or monitoring for alerts. For example, a SOC uses a dashboard that monitors for particular IPS alerts that were set up to detect an exploit attempt of a new, high-priority vulnerability. The SIEM has been experiencing caching lately, and events are currently delayed 28 hours. The dashboard the SOC analysts are using is configured to use the Event Time timestamp and shows events from the past 24 hours. An alert that occurred 26 hours ago finally clears the cache queue but fails to show up on the dashboard because it doesn’t match the timestamp criteria. Thus, the critical alert goes unnoticed.

While this can be a severe issue, the fix is simple. The SOC analysts could configure the dashboard to use the Receipt Time timestamp instead, so all alerts received in the past 24 hours would show up and be noticeable regardless of when they were generated. Dashboards in general should show both timestamps, and staff should be aware of the various timestamps used by the SIEM. Another advantage of using both timestamps is that staff can see when there are delays in receiving log data. Minor delays can be expected, but significant delays can be caught immediately and actioned before large caches form.
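The delay check described above boils down to comparing the two timestamps. A small sketch, with hypothetical event dictionaries (the field names are assumptions, not any specific SIEM’s schema):

```python
from datetime import datetime, timedelta

def delayed_events(events, max_delay=timedelta(hours=1)):
    """Flag events whose Receipt Time lags their Event Time by more than
    max_delay -- an early warning that caches are forming."""
    flagged = []
    for e in events:
        lag = e["receipt_time"] - e["event_time"]
        if lag > max_delay:
            flagged.append((e["name"], lag))
    return flagged

events = [
    {"name": "failed_login", "event_time": datetime(2021, 9, 27, 10, 0),
     "receipt_time": datetime(2021, 9, 27, 10, 1)},   # normal, ~1 minute lag
    {"name": "ips_alert", "event_time": datetime(2021, 9, 26, 8, 0),
     "receipt_time": datetime(2021, 9, 27, 12, 0)},   # 28-hour lag
]

print(delayed_events(events))  # only the 28-hour-delayed IPS alert is flagged
```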

Some timestamp discrepancies can be normal, especially in large organizations. You’re bound to have at least some out of thousands of servers with misconfigured dates, development machines logging to the SIEM when they shouldn’t be, or network outages that cause transmission delays. Having staff aware of the various timestamps and potential discrepancies can reduce the risk that they turn into larger issues in your organization.

Monitor for Caching

Caching is a sign that the system is unable to keep up with the volume of data. While some caching can be expected and considered normal, frequent occurrences are an indication of an undersized architecture or application misconfigurations. Excessive caching can result in major delays in receiving log data and ultimately data loss.

Given the risks of data caching, most SIEMs come with monitoring capabilities to alert when caching occurs. These should be implemented to alert when caching is beyond what is considered acceptable. For example, you may expect some minor caching during the day at peak hours, and thus don’t need alerts during this time, but alerts should be generated whenever there is caching outside this period.

Caching can also be detected from the server operating system, where you would see cache files build up in the applicable application directory. Thus, if your SIEM application doesn’t support alerting when caching occurs, you should be able to detect and alert via the OS.
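An OS-level check along these lines can be as simple as watching the cache directory’s size. A hedged sketch — the directory path and threshold below are placeholders for whatever your SIEM actually uses:

```python
import os

CACHE_DIR = "/opt/siem/cache"        # placeholder: your SIEM's cache directory
MAX_CACHE_BYTES = 500 * 1024 * 1024  # placeholder threshold: 500 MB

def cache_usage_bytes(path):
    """Total size of files directly inside the cache directory."""
    total = 0
    for entry in os.scandir(path):
        if entry.is_file():
            total += entry.stat().st_size
    return total

def check_cache(path=CACHE_DIR, limit=MAX_CACHE_BYTES):
    """Return an alert string when cache files exceed the acceptable limit."""
    usage = cache_usage_bytes(path)
    if usage > limit:
        # In practice, fire an alert here (email, ticket, monitoring hook, etc.)
        return f"ALERT: cache at {usage} bytes exceeds {limit}"
    return "OK"
```

Scheduled via cron or your monitoring agent, a check like this gives you caching alerts even when the SIEM application itself can’t provide them.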

Regardless of how it’s implemented, ensure your environment has appropriate alerting when caching is detected.

Log Source Verification

A critical function of any SIEM environment is verifying that the intended systems are logging successfully. Systems that are believed to be logging to a SIEM but aren’t pose a significant risk to an organization, creating a false sense of security and limiting the amount of data available for an investigation.

A common mistake that can be made while verifying if a particular system is logging to a SIEM is using the incorrect fields for confirmation. For example, most SIEMs have several IP address and hostname fields, ranging from source IP address, destination IP address, device IP address, and others. Given the multiple fields, it can be confusing to know which is used for what, especially for new staff. This leaves a possibility that staff are pulling incorrect data and providing inaccurate results when performing verification or searching in general.

As an example, Company A is implementing a new Linux server, and the SOC is asked if they can see logs coming from it. The SIEM application has three IP address fields: Device IP Address, Source IP Address, and Destination IP Address. The Device IP Address field contains the IP address of the server generating the log event. The Source IP Address is the source of the event, and the Destination IP Address is the target of the event.

One of the SOC analysts performs the verification, searches for the new server’s IP address, and gets one result:

Event Name: Accept
Source IP Address:
Source Port: 22
Device IP Address:
Destination IP Address:
Destination Port: 22

Without paying attention to the field names, the analyst mistakenly reports that the new Linux server is logging, when in fact what he’s looking at is an accept traffic event generated by a firewall, not an event from the new server. The project to implement the new server is now considered complete, and Company A has a security gap.

While this can be a major issue and add risk to an organization, a simple process can show staff which fields to use for verification. Your SIEM vendor can also tell you which fields to use. Learning and education sessions on searching can address this and ensure staff know how to search effectively.
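The verification step can be made mechanical by checking only the field that identifies the event’s origin. A small sketch — the field names and IP addresses are hypothetical, standing in for your SIEM’s schema:

```python
def is_source_logging(events, server_ip):
    """Confirm a system is logging by matching the Device IP Address field
    (the system that *generated* the event) -- not Source/Destination IP,
    which merely describe the traffic inside the event."""
    return any(e.get("device_ip") == server_ip for e in events)

# A firewall event that merely *mentions* the new server as the traffic source:
events = [
    {"event_name": "Accept",
     "source_ip": "10.0.0.50",       # hypothetical: the new Linux server, as seen by the firewall
     "device_ip": "10.0.0.1",        # hypothetical: the firewall that generated the event
     "destination_ip": "10.0.0.99"},
]

print(is_source_logging(events, "10.0.0.50"))  # False -- the server itself isn't logging yet
```

Matching on device_ip alone would have caught the mistake in the example above: the only result came from the firewall, not the new server.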

Alerting On Quiet Log Sources

Data sources that stop logging to your SIEM put your organization at risk. If one of your organization’s firewalls stops logging to the SIEM, your SOC will be blind to malicious traffic traversing it. If your endpoint protection application stops logging, your analysts won’t be able to see if malicious files are being executed on one of your billing servers.

In a perfect world, your SIEM would alert when any data source stops logging. While this is feasible in smaller organizations, it can become daunting in large ones. It’s easier for your SOC to follow up with one system owner who sits a few cubicles over than with 100 system owners from different lines of business. The task of remediating several hundred systems not logging to a SIEM can easily consume an entire resource. In large organizations, network outages, system upgrades, and maintenance windows can be a regular occurrence. Should you alert on any data source that stops logging in a predefined period, you could easily flood your SOC, and in the worst case, your analysts will develop a practice of ignoring these alerts.

As a best practice, especially in large organizations, a SIEM should be configured to alert when critical data sources stop logging. The data sources should at minimum include critical servers to the business (e.g. client-facing applications), firewalls, proxies, and security applications. A threshold of less than an hour in your organization may generate excessive alerts, as some sources that are file-based may be delayed by design, for example by copying the file to the SIEM every 30 minutes. However, data sources that haven’t logged in one hour may warrant an alert in your organization.
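The quiet-source check above can be sketched as a last-seen lookup over a curated list of critical sources. The source names and one-hour threshold are illustrative only:

```python
from datetime import datetime, timedelta

# Hypothetical critical sources; in practice this list is curated per organization.
CRITICAL_SOURCES = {"fw-edge-01", "proxy-01", "edr-console"}

def quiet_sources(last_seen, critical, now, threshold=timedelta(hours=1)):
    """Return critical sources whose most recent event is older than the
    threshold (or that have never logged at all)."""
    quiet = set()
    for source in critical:
        seen = last_seen.get(source)
        if seen is None or now - seen > threshold:
            quiet.add(source)
    return quiet

now = datetime(2021, 9, 27, 12, 0)
last_seen = {
    "fw-edge-01": datetime(2021, 9, 27, 11, 55),  # healthy, seen 5 minutes ago
    "proxy-01": datetime(2021, 9, 27, 9, 0),      # silent for 3 hours
}

# Flags proxy-01 (stale) and edr-console (never logged); fw-edge-01 is fine.
print(quiet_sources(last_seen, CRITICAL_SOURCES, now))
```

Restricting the check to the critical list is what keeps the alert volume manageable in a large environment; file-based sources with known delivery delays would get a longer threshold.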

Another thing to consider when remediating systems not logging to the SIEM is that malware experts and threat intelligence specialists may not be the best resources to chase system owners down. While they may not mind the odd alert for this, they’re not likely going to have time to chase down and get 100 system owners to configure their systems properly, or have the patience to continuously follow up with them. Thus, in larger organizations, project management may be a good fit for this task.

Having all your systems log to your SIEM is a critical part of reducing your organization’s risk. Having a practical, manageable task for remediating systems that stop logging will ensure the process is followed and the risk is reduced.

Disable Unused Content

When building new SIEM environments or working with existing ones, one of the quickest ways you can improve the performance and stability of the environment is to remove unused content. While this may seem obvious to experienced SIEM resources, it’s common to find reports or rules running in the background that don’t serve a purpose. In some environments, unused content can be slowing the system down and contributing to application instability. Unused content is especially common in environments that don’t have enough staff to manage the SIEM.

Default rules provided by the vendor are often enabled but unused. The first indication that a rule is unused is that it has no action and isn’t used for informational purposes. If a rule isn’t bringing anything to the attention of an investigator or SIEM engineer, it may simply be running in the background consuming system resources. A rule to trigger an alert when someone logs into the SIEM may be useful, but an ad-hoc report may suffice to obtain the same information. A significant number of inefficient rules that match a large percentage of events can adversely affect the performance of the environment.

Reports can be another source of unused content. In many environments, I find reports that were originally set up to be used temporarily but are no longer used by the recipient. It’s common for the recipient to forget to tell the SIEM staff that the reports are no longer required. Over several years, this can easily amount to several dozen reports running on a regular basis, putting a significant strain on the system for no benefit.

All SIEM environments are different, and there’s no set of content that must be enabled or disabled. But there’s very likely content in your environment that can be disabled, and the system resources can instead be used to provide security analysts better search response times. So on a regular basis or whenever there’s a complaint about search response times or application instability, determine if there’s any content that can be safely disabled.

Effective Searching

There are two critical reasons end users should learn how to search their SIEM effectively. Ineffective searching is a risk to your organization, where end users can produce inaccurate data, and thus provide inaccurate investigation results. Ineffective searching can also degrade the SIEM’s performance, increasing the amount of time required for analysts to obtain data, while affecting the overall stability of the system.

If a security analyst is asked to perform an investigation and searches incorrectly, the results for a query on malicious traffic may return null when in fact there are matches. A compromised user account may be generating significant log data, but your analysts can’t obtain logs for it because they are searching for “jsmith” instead of a case-sensitive “JSmith.” End users can also match on incorrect fields, believing they are finding the correct data when they are not.

Ineffective queries take longer to complete and increase the system resources used by the SIEM. Many queries can be improved to significantly increase their performance, making the end user happier with a faster response time and leaving the system more CPU and RAM for other tasks. A simple rule of thumb is to match as early in the query as possible to limit the amount of data the system searches through. Searching particular fields rather than all fields also reduces the amount of processing the system must do. Additionally, some SIEM tools let you check for poorly performing queries. For example, Splunk’s Search Job Inspector can show you not only which queries are taking the longest, but which parts of a query take longer than others, allowing you to optimize accordingly.
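The “match early” rule of thumb can be demonstrated with a toy pipeline: filtering on cheap criteria before the expensive per-event work touches far fewer records. The record layout and parsing cost here are illustrative only:

```python
def parse(record):
    """Stand-in for an expensive per-event operation (field extraction, etc.).
    Counts its own calls so we can compare the two approaches."""
    parse.calls += 1
    return dict(field.split("=") for field in record.split())

records = [f"user=u{i} action={'fail' if i % 10 == 0 else 'ok'}"
           for i in range(1000)]

# Inefficient: parse every record, then filter -- 1000 expensive parses.
parse.calls = 0
late = [r for r in (parse(x) for x in records) if r["action"] == "fail"]
late_cost = parse.calls

# Efficient: cheap raw-text match first, parse only survivors -- 100 parses.
parse.calls = 0
early = [parse(x) for x in records if "action=fail" in x]
early_cost = parse.calls

print(late_cost, early_cost)  # 1000 100
assert late == early  # identical results, a tenth of the work
```

A SIEM query behaves the same way: putting the most selective term first means the engine extracts fields from far fewer events.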

It’s also common for security analysts to get requests for excessive data. In many cases requestors will ask for more information than is required in order to let them drill-down into the information they need, instead of having to submit multiple requests for data. For example, there may be a request to pull log data on a user for the past two months, when all that is required is some proxy traffic for a few days. These types of requests can be resource-intensive on SIEMs, especially if there are multiple queries running simultaneously. The impact can be more severe when the queries are scheduled reports. Scheduling multiple, large, inefficient queries on a regular basis can consume a significant amount of system resources. With a few inquiries to the requestor, the security analyst may be able to significantly reduce the amount of data searched for.

While ineffective searching is a risk, it’s a simple one to reduce. Training sessions, lunch-and-learns, or workshops can significantly reduce the risks of analysts searching incorrectly and consuming unnecessary system resources. I find a simple three-page deck can provide enough information to assist analysts with searching, highlighting the tool’s case sensitivity, common fields, and sample queries. Nearly all SIEM vendors offer complimentary documentation that will show you how to search best with their product. Thus, a few hours of effort can reduce searching risks while optimizing your SIEM environment.