
CI/CD Detection Engineering: Failing, Part 3

It was over a month ago that I promised we would tie together Splunk Security Content and the Splunk Attack Range to automatically test detections. Ultimately, using these projects together in a Continuous Integration / Continuous Delivery (CI/CD) workflow with CircleCI brings the rigors of software development to the SOC and truly treats 🛡detection as code.

Well, I want to share how we have failed at achieving this goal. Not many in our industry talk about failures, but in my opinion, if you are not failing then you are not making progress. Let me share what our original plan was, how we realized it was going to fail in the long term, and why we decided to scrap it.

In "CI/CD Detection Engineering: Splunk's Security Content, Part 1" we shared how the Splunk Security Content project can be used as a repository for treating Splunk detections as code. In "CI/CD Detection Engineering: Splunk's Attack Range, Part 2" we discussed how the Attack Range allowed us to test these detections in a replicable environment. Our original goal for part 3 of this series was to tie these two projects together using the newly released Attack Range test files and eventually test detections in a CI/CD workflow. Spoiler alert: 🚨 we failed.

Here are the three main reasons why the approach failed:

  1. Testing detections per Pull Request caused CI jobs to queue up and rendered the testing CI pipeline unusable.
  2. Putting all tests together in a nightly job made the run so long that it surpassed the CircleCI job timeout limit.
  3. When multiple test job executions failed, Attack Range components were not properly cleaned up, which caused us to hit AWS resource limits.
     

Let’s dig into how our first approach was architected. First, a new argument was added to the Attack Range that would ingest a test file with predefined configurations. You can see an example of a test file below:

The key arguments specifically are:

  • Target: the Attack Range target to attack
  • Simulation Technique: the technique to launch
  • Detections: an array of detections to test, each with a pass/fail condition
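
As a rough illustration of the three key arguments listed above, here is a minimal Python sketch of how such a test file might be modeled once loaded. The class names, field names, target name, and condition string are all assumptions made for illustration; they are not the actual Attack Range test file schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DetectionCheck:
    """One detection to run after the simulation (hypothetical shape)."""
    name: str            # hypothetical detection name
    pass_condition: str  # hypothetical condition checked against the results


@dataclass
class AttackRangeTest:
    """Hypothetical in-memory model of a test file."""
    target: str                # Attack Range target to attack
    simulation_technique: str  # technique to launch
    detections: List[DetectionCheck] = field(default_factory=list)


# Illustrative values only; none of these come from a real test file.
example = AttackRangeTest(
    target="attack-range-windows-host",
    simulation_technique="T1003",
    detections=[
        DetectionCheck(name="Example Detection", pass_condition="count > 0"),
    ],
)
```

Keeping the detections as a list mirrors the idea that one simulated technique can exercise several detections in a single run.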
     

We created a few of these test files under their respective MITRE ATT&CK technique in the security content repo as we slowly tested them.

The Attack Range was modified to ingest these test files and run through the following testing process: build an environment, simulate the technique associated with the detection using Atomic Red Team, run the detection, and evaluate its results against the pass condition. Below is a visual representation of this process:
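
The four steps described above can be sketched in Python as follows. This is only an outline under stated assumptions: the function name `run_test` and its callable parameters are hypothetical stand-ins for the real build, simulation, and search logic, which live inside the Attack Range and are not shown here.

```python
from typing import Callable, Dict


def run_test(
    technique: str,
    detections: Dict[str, Callable[[int], bool]],
    build: Callable[[], None],
    simulate: Callable[[str], None],
    run_detection: Callable[[str], int],
) -> Dict[str, bool]:
    """Build, simulate, detect, evaluate. All callables are hypothetical hooks."""
    build()              # 1. build an environment
    simulate(technique)  # 2. simulate the technique
    results = {}
    for name, passes in detections.items():
        result_count = run_detection(name)    # 3. run the detection
        results[name] = passes(result_count)  # 4. evaluate the pass condition
    return results


# Usage with stub hooks that only record what was called.
log = []
results = run_test(
    technique="T1003",
    detections={"Example Detection": lambda count: count > 0},
    build=lambda: log.append("build"),
    simulate=lambda t: log.append(f"simulate {t}"),
    run_detection=lambda name: 3,  # pretend the search returned 3 results
)
```

Passing the steps in as callables keeps the sketch runnable without any infrastructure; in practice each step would take minutes rather than microseconds.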

The final piece in our plans was generating CircleCI jobs for each of these test files that executed the above process. For this, we created a simple script called ci-generate.py that would read in every file under the /test folder in Security Content and create a CircleCI task from the file under the CircleCI job test-detections. The task looks like this:
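
A minimal sketch of what a generator in the spirit of ci-generate.py might do is shown here. It is not the original script: the function names are hypothetical, the command template is a placeholder supplied by the caller rather than the Attack Range's real command line, and the job dictionary omits required parts of a CircleCI job such as the executor.

```python
from pathlib import Path
from typing import Dict, List


def generate_steps(test_dir: Path, command_template: str) -> List[Dict]:
    """Return one CircleCI `run` step per test file found under test_dir."""
    steps = []
    for path in sorted(p for p in test_dir.rglob("*") if p.is_file()):
        steps.append({
            "run": {
                "name": f"test {path.stem}",
                # The template is a placeholder, not a real Attack Range command.
                "command": command_template.format(path=path.as_posix()),
            }
        })
    return steps


def build_test_detections_job(steps: List[Dict]) -> Dict:
    """Wrap the generated steps in a job named test-detections (fragment only)."""
    return {"test-detections": {"steps": steps}}
```

A caller would serialize the resulting dictionary into the CircleCI configuration file; that serialization step is left out of this sketch.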


The First Failure ⏲❌

We could not run each of these detection tests per PR, since execution took over 30 minutes for each test file.

To circumvent this we first started queuing incoming jobs per PR, but that quickly became unusable: we had more than 10 jobs queued with a test wait time of 16 hours. On our second attempt, we decided to test the detections nightly instead of per PR. To run our detection tests daily, we added a workflow to our CircleCI configuration file that runs the test-detections job on a schedule. The workflow definition looks like this:
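
For readers unfamiliar with scheduled workflows, the general shape of a CircleCI workflow with a cron trigger can be sketched as a Python dictionary. The workflow name, cron expression, and branch name are assumptions for illustration and are not taken from the project's actual configuration.

```python
def nightly_workflow(cron: str, branch: str) -> dict:
    """Build a config fragment for a scheduled workflow running test-detections."""
    return {
        "workflows": {
            "nightly": {  # hypothetical workflow name
                "triggers": [
                    {
                        "schedule": {
                            "cron": cron,
                            "filters": {"branches": {"only": [branch]}},
                        }
                    }
                ],
                "jobs": ["test-detections"],
            }
        }
    }


# Assumed values: run at midnight UTC against a branch called "develop".
config = nightly_workflow(cron="0 0 * * *", branch="develop")
```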

In short, we planned to have our Threat Research team (or anyone in the community) make a PR for new detections with their corresponding test files. After merging the PR, we would run the ci-generate.py script and update the /.circleci/config file with a new task under the test-detections job for the corresponding test file, to be executed in the nightly workflow. Note that each task just executes the Attack Range with our newly created test flag. The overall logical process that we expected was:

The Second Failure ⏰💥

When we started building our library of tested detections, it became obvious that our current approach would not scale. After 12 detection files, our nightly testing-detection CI job started failing consistently. This particular one tells the full story of why:

It took 5 full hours to run the job and only 10 detections were tested before the job timed out. We learned that day that CircleCI has a maximum job time limit of 5 hours. After much analysis 🤔, I was content with calling this approach a failure at this point, but the truth was that we were not done dealing with issues.
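
The numbers reported above line up with simple arithmetic. The short calculation below uses only figures from this post, treating 30 minutes per test file as a lower bound; the variable names are ours.

```python
JOB_LIMIT_MINUTES = 5 * 60  # CircleCI maximum job time limit: 5 hours
MINUTES_PER_TEST = 30       # each test file took over 30 minutes
TEST_FILES = 12             # size of the test library when failures began

# How many sequential tests fit inside one job at best.
max_tests = JOB_LIMIT_MINUTES // MINUTES_PER_TEST
print(max_tests)  # 10

# Minimum time needed to run the whole library in one job.
needed_minutes = TEST_FILES * MINUTES_PER_TEST
print(needed_minutes, needed_minutes > JOB_LIMIT_MINUTES)  # 360 True
```

With every test running sequentially inside a single job, the limit caps the library at roughly 10 test files, no matter how the job is scheduled.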

The Third Failure 🧟‍♂️

An after-effect of moving to nightly jobs was that we did not catch when things had gone wrong until our next working day ☀️. When nightly jobs ran, there were occasions when a test would crash or fail and the next test would begin anyway. Each failed or crashed test left behind a tainted Attack Range environment on AWS ⛈. After several job failures, our AWS account started hitting limits ❌ on available resources like VPCs, EIPs, and EC2 instances allowed in the region. These zombie Attack Ranges were extremely labor-intensive to clean up: an engineer had to manually remove all the pieces created by Terraform during the build process. To circumvent this we added a reaping job that only executed if a test failed or crashed. This reaper job ran at the end of all the tests using the condition when: on_fail. You can see an example below:
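
As a sketch of the idea, a cleanup step carrying the when: on_fail condition can be appended after the test steps. The helper name, step name, and cleanup command are placeholders and assumptions; the actual reaper command used by the team is not shown in this post.

```python
from typing import Dict, List


def with_reaper(test_steps: List[Dict], reap_command: str) -> List[Dict]:
    """Return the test steps followed by a cleanup step that runs only on failure."""
    reaper = {
        "run": {
            "name": "reap attack range",  # hypothetical step name
            "command": reap_command,      # placeholder cleanup command
            "when": "on_fail",            # run only if an earlier step failed
        }
    }
    return list(test_steps) + [reaper]


# Usage with placeholder commands.
steps = with_reaper(
    [{"run": {"name": "test a", "command": "placeholder-test a"}}],
    reap_command="placeholder-destroy-command",
)
```

Placing the reaper last means it sees the outcome of every test step before deciding whether to run.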

Lessons Learned 📚

Even after addressing the zombie 🧟‍♀️ Attack Ranges and moving to nightly jobs to avoid exploding 💣 our job queue, we still could not get around the CircleCI maximum job time limit of 5 hours. At this point, we realized that our attempt at using CircleCI to automate our tests was a failure and started thinking of a better solution. Along the way, we also learned a few lessons on how to improve the stability of jobs and their execution time.

In part 4 of this series, we will share how we solved the problem above by drastically changing how we approached our testing workflow. For starters, we decided to leave behind the idea of needing a complete Splunk Attack Range for every test and instead broke off attack data generation into its own 🧩 project. During Splunk .conf20 in October, Patrick Bareiss and I announced the Splunk Attack Data Repository 🧱 in our talk "SEC1392C Simulated Adversary Techniques Datasets for Splunk." If you want a preview of the next part of this series, I highly recommend watching it. Splunk Threat Research is now in the process of testing this new service to work out its bugs. Stay tuned for part 4 of this series after testing is completed 😁.

José is a Principal Security Researcher at Splunk. He started his professional career at Prolexic Technologies (now Akamai), fighting DDoS attacks from “anonymous” and “lulzsec” against Fortune 100 companies. As an engineering co-founder of Zenedge Inc. (acquired by Oracle Inc.), José helped build technologies to fight bots and web-application attacks. While working at Splunk as a Security Architect, he built and released an auto-mitigation framework that has been used to automatically fight attacks in large organizations. He has also built security operation centers and run a public threat-intelligence service. Although security information has been the focus of his career, José has found that his true passion is in solving problems and creating solutions. As an example, he built an underwater remote-control vehicle called the SensorSub, which was used to test and measure toxicity in Miami's waterways.
