
Splunking AWS ECS Part 2: Sending ECS Logs To Splunk

Welcome to part 2 of our blog series, where we go through how to forward container logs from Amazon ECS and Fargate to Splunk.

In part 1, "Splunking AWS ECS Part 1: Setting Up AWS And Splunk," we focused on understanding what ECS and Fargate are, along with how to get AWS and Splunk ready for log routing to Splunk’s Data-to-Everything platform. In this segment of the series, we'll focus on building an ECS cluster, defining tasks and deploying a simple container that routes its application logs to Splunk with Firelens.

As a quick recap, in part 1 we configured a CloudWatch log group and two IAM roles that will be required for this walkthrough along with an HTTP Event Collector and index within Splunk. In order to follow the remainder of this post in the series, you will need the following information that we defined in the last part:

  • AWS Region: US-East-1 (the region you’re working in)
  • AWS CloudWatch Log Group: SplunkECS (the name of the log group) 
  • AWS ECS Instance Role: ecsInstanceRole (the name of the role to run container instances)
  • AWS Task Execution Role: ecsTaskExecutionRole (the name of the role to run ECS tasks)
  • Splunk HEC server (Splunk Cloud): https://http-inputs-stackname.splunkcloud.com
  • Splunk HEC server (Splunk Enterprise): https://stackname.example.com
  • Splunk HEC Port: 8088 or 443
  • Splunk HEC Index: scratch (the name of the index you configured in your HEC)
  • Splunk HEC Token: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
     

Configuring ECS Prerequisites

ECS is fairly straightforward to configure, but we’ll be relying on a few components to help manage our EC2 Container Instances. Specifically, we’ll need a key-pair for SSH access to our instances, a viable security group and an S3 storage bucket.

Creating A New Key Pair

Creating a key pair will allow us to access the container instance OS should we need to do any manual configuration. Although not required to run an ECS container instance, it’s a good idea to have one on hand in case the Docker (or, in future, containerd) runtime has problems.

  1. First, log in to your EC2 console
  2. From there, under Network & Security > Key Pairs, create a new key pair.
  3. In most cases we’ll be connecting via SSH, so choose PEM format if that’s your tool of choice, or choose PPK if you’re using a PuTTY terminal in Windows.
  4. Add a tag if desired (it’s optional, but best practice)
  5. When you create the key pair, the .pem or .ppk file should download automatically. Be careful, as this is the only time it will download. If you lose this file, you will need to create a new key pair.
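If you’d rather script this step, here’s a minimal boto3 sketch of the same operation. The key name (ECSDemo) and region are this walkthrough’s values; substitute your own.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create the key pair; the private key material is returned exactly once.
resp = ec2.create_key_pair(KeyName="ECSDemo")

# Save it immediately -- just like the console download, there's no second chance.
with open("ECSDemo.pem", "w") as f:
    f.write(resp["KeyMaterial"])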
     


Now that we have our private key, we need to create a new AWS security group for access to and from the container instance.

Creating A New Security Group

Next, we will configure a new security group for web and SSH access to and from the container instance VM.

  1. From the EC2 console, select Security Groups under the Network & Security heading.
  2. Create a new security group, providing the following:
  3. Name
  4. Description
  5. VPC - This will have to be the same VPC we’ll be using for our tasks, storage buckets, etc. If you’re not sure, leave this as your default.
     

Add the following inbound rules:

  1. HTTP: Anywhere
  2. HTTPS: Anywhere
  3. SSH: Your workstation IP, domain or location. Using ‘Anywhere’ here is risky, but if you’re having trouble connecting to your instance via SSH, this can be changed later.

Save and make note of the security group name, as we’ll be needing it later. Security Group: “ECS Container Instance SG”
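The same group and rules can be created with boto3. Here’s a hedged sketch: the group name matches our walkthrough, and the SSH CIDR (203.0.113.10/32) is a placeholder for your workstation IP.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create the group in your default VPC (pass VpcId= to target another VPC).
sg = ec2.create_security_group(
    GroupName="ECS Container Instance SG",
    Description="Web and SSH access for ECS container instances",
)

ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},        # HTTP: Anywhere
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},        # HTTPS: Anywhere
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "203.0.113.10/32"}]},  # SSH: your workstation IP (placeholder)
    ],
)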


Now, on to storage!

Creating An S3 Storage Bucket

The last thing we’ll need to configure before building our ECS cluster is an S3 bucket to hold our configurations. S3 buckets are more efficient to run than EFS file systems and have tighter integration with CloudWatch. Although an EFS file system can be used with an ECS container instance, opt for S3 whenever possible. In the case of ECS clusters, because we have a dedicated VM hosting our tasks, we can access S3 buckets without much issue. In the next part of this blog series, we’ll configure an EFS file store for Fargate.

  1. Log in to the S3 management console
  2. From here, create a new bucket and give it a name.
  3. Provide the region where you want the bucket to live - it’s important that this is the same region where you’ll be building your ECS cluster and defining your tasks.


For security reasons, it’s important to block all public access. We’ll be using our IAM roles created in part 1 of this series to grant S3 access to our instance and containers.

  1. (Optional) Enable bucket versioning.
  2. Once everything is set, create the new bucket.
  3. After the bucket is created, open the bucket and modify the permissions. The S3 log delivery group needs at minimum read and list permissions. If you intend to export data from your container to your S3 bucket, then write permissions are also required.
  4. Save the changes and review the bucket settings.
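For reference, here’s a minimal boto3 sketch of the bucket setup, assuming the bucket name splunkecs and the us-east-1 region from our configuration items (bucket names are globally unique, so yours will differ):

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# us-east-1 requires no LocationConstraint; other regions do.
s3.create_bucket(Bucket="splunkecs")

# Block all public access -- our IAM roles from part 1 grant what we need.
s3.put_public_access_block(
    Bucket="splunkecs",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# (Optional) Enable bucket versioning.
s3.put_bucket_versioning(
    Bucket="splunkecs",
    VersioningConfiguration={"Status": "Enabled"},
)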
     

Interlude: A Little Bit About FluentD And Firelens

Before we get too far down the road, it’s best to talk a little bit about what Firelens and FluentD are, and why they’re relevant. Traditionally, logging application events was fairly straightforward with monolithic applications deployed to static servers: all we had to do was redirect STDOUT to a file in a location mounted on the server. With the advent of container technology and code abstraction, a new problem emerged: it became instantly more difficult to understand what applications were doing when there was nowhere permanent to write logs.

A few years ago (2011), a really cool technology stack called FluentD came on the scene to help with this. FluentD is a platform that runs either as a sidecar or a daemonset and unifies the logging layer for all services running in a containerized format. All of a sudden, developers no longer needed to worry about how their data exhaust was interacting with other services and could instead focus on writing everything that’s meaningful to STDOUT. FluentD taps the STDOUT logging plane and then aggregates and re-routes the data output to somewhere more meaningful — like Splunk!

That brings us to Firelens. AWS quickly realized, when writing their own orchestration platform in ECS (and more recently Fargate), that there was a need to assist in routing logs and data exhaust from applications as well. The added challenge in this case is that the containers are further decoupled compared to Kubernetes or OpenShift. AWS developed a simple and lightweight platform in Firelens that is highly scalable and can run on a serverless platform — like Fargate.

The combination of Firelens and FluentD (and later FluentBit) is extremely powerful and sets the stage nicely for shipping data off to Splunk in a structured and consistent way.

Building The ECS Cluster

For the sake of simplicity, we’ll be creating a simple ‘cluster’ with only a single node. Adding nodes to the cluster is as straightforward as increasing the instance count. We’ll also be using an autoscaling group to make the management of ECS container instances fast and simple.

  1. From the ECS management console, select Create Cluster.
  2. Use EC2 Linux + Networking
  3. In the Instance Configuration section:
  4. Decide on a Cluster Name and select an On-Demand Instance.
  5. Select an EC2 Instance type that’s available in your zone. I’m using t2.medium as it’s widely available. This will be the EC2 class that’s used for all container instances in the autoscaling group.
  6. For now set the number of instances to 1 (this can be scaled later), and choose your desired storage space. For this example, I’m leaving the volume at 30 GB since my container images are quite small. Leave the EC2 AMI ID as the default.
  7. For the Key Pair, select the key pair we created in the first segment of this post.
  8. In the Networking section:
  9. Choose your default VPC. The name of this is unique to your account. Alternatively, you can create a new VPC, but we want to use the VPC that’s standard for EC2 access. The choice depends on whether you want your services completely isolated or able to reach resources hosted elsewhere. Since our S3 bucket lives in the same region as our default VPC, this makes things easy to route.
  10. Select at least 2 subnets (for availability). You’ll need to make sure that both subnets support the EC2 instance type selected in the last step.
  11. (Optional) Enable Auto assign public IP if you wish to access the container instances from elsewhere. This isn’t necessary for accessing container endpoints, however, because we’ll be creating external routing later on with a Load Balancer.
  12. Select the security group we created in the earlier segment of this post. This security group grants all the required access to and from the container instances.
  13. Now this is important: select the container instance IAM role (ecsInstanceRole) we created in part 1 of this blog series. This is critical for all access to the containers, buckets and logs to work properly.
  14. Finally, enable CloudWatch Container Insights to log all metric data. This will be handy later in the series when we want to track resource utilization of our cluster.
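Note that the console wizard above also provisions the autoscaling group and EC2 container instances for you. If you script it, only the cluster resource itself maps to a single call; here’s a minimal boto3 sketch with Container Insights enabled (instance provisioning would have to be handled separately):

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Creates the cluster resource only; container instances running the ECS
# agent (with ecsInstanceRole) register themselves into it separately.
ecs.create_cluster(
    clusterName="SplunkECS",
    settings=[{"name": "containerInsights", "value": "enabled"}],
)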
     

Creating A Load Balancer For Task Target Groups

Now that we have our cluster built, we need a way to access the containers running on the container instances themselves. There are a number of ways to manage this, but the easiest by far is to create an application load balancer. This has the added advantage of being more secure, and it provides its own meaningful metrics and logs.

  1. From the EC2 management console, select Load Balancing > Load Balancers.
  2. Create a new Application Load Balancer.
  3. Provide the new load balancer a name, and make the scheme internet facing.
  4. Create a single listener on port 80.
  5. Under availability zones:
  6. Select the same VPC you used in the ECS Cluster definition. To keep things consistent, I’ll be using my default AWS VPC.
  7. Select the same two availability zones (subnets) we configured for the ECS cluster container instances.
  8. Next, move on to Configure Security Settings > Configure Security Groups.
  9. Select the same security group we used when we created our ECS Cluster.
  10. In Configure Routing
  11. Create a new target group (we’ll call ours nginx-web).
  12. Select Instance target type
  13. Use the HTTP Protocol on Port 80. This will be the same port that our nginx container will be listening on later.
  14. Select HTTP1, and leave all of the remaining settings as the default.
  15. In Register Targets, DO NOT register any targets at this time. We will be handling this later when we create our ECS service.
  16. Review and Create
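If you want to script the load balancer instead, here’s a hedged boto3 sketch; the subnet, security group and VPC IDs are placeholders for the values you selected above.

import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

lb = elbv2.create_load_balancer(
    Name="SplunkECS",
    Scheme="internet-facing",
    Type="application",
    Subnets=["subnet-PLACEHOLDER1", "subnet-PLACEHOLDER2"],  # your two AZs
    SecurityGroups=["sg-PLACEHOLDER"],                       # ECS Container Instance SG
)["LoadBalancers"][0]

tg = elbv2.create_target_group(
    Name="nginx-web",
    Protocol="HTTP",
    Port=80,                 # same port our nginx container will listen on
    VpcId="vpc-PLACEHOLDER",
    TargetType="instance",
)["TargetGroups"][0]

# Single listener on port 80, forwarding to the (still empty) target group.
elbv2.create_listener(
    LoadBalancerArn=lb["LoadBalancerArn"],
    Protocol="HTTP",
    Port=80,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg["TargetGroupArn"]}],
)

print(lb["DNSName"])  # the fully qualified address we'll note down later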
     

Building A Task Definition

Task definitions are the main way to create deployments. The concepts we’ll be covering in this section will also apply when we move to Fargate in the next part of this series. 

Task definitions are important because they essentially define the deployment and act as a proxy for more traditional YAML-style configurations. AWS offers flexibility with task definitions and allows users to either configure them as JSON code or through a web user interface in the ECS management console. Both are viable options and both offer version control which makes changes and updates easier and less risky.

For our first task definition, we’ll deploy a simple NGINX server that serves up the basic boilerplate landing page. Although we’re deploying a single container in our task definition (plus one sidecar, but we’ll get to that when we discuss log routing), there are some great tutorials on how to deploy much richer NGINX web apps.

  1. From the ECS management console, select Task Definitions > Create new Task Definition > EC2 Definition
  2. Name: Splunk-ECS (name of your choice)
  3. Task Role: ecsTaskExecutionRole (This is the role we configured in part 1 of this series)
  4. Network Mode: <default>
  5. For the task execution IAM Role make sure to use the ecsTaskExecutionRole. This role has S3 storage bucket access that we configured in part 1 of this series.
  6. Under container definitions, add a container:
  7. Container name: nginx-web
  8. Image: nginx:latest
  9. Memory Limit: 256
  10. Port Mappings:
  11. Host: 80
  12. Container: 80
  13. Protocol: tcp
  14. (Optional): Enable CloudWatch Logs
  15. Save the task definition
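The same definition can also be registered as code. Here’s a minimal boto3 sketch mirroring the values above (the execution role is referenced by name; depending on your account you may need the full role ARN):

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.register_task_definition(
    family="Splunk-ECS",
    executionRoleArn="ecsTaskExecutionRole",  # or the full IAM role ARN
    requiresCompatibilities=["EC2"],
    containerDefinitions=[
        {
            "name": "nginx-web",
            "image": "nginx:latest",
            "memory": 256,
            "portMappings": [
                {"hostPort": 80, "containerPort": 80, "protocol": "tcp"}
            ],
        }
    ],
)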
     

Note: Although version control is enabled by default for task definitions, it is highly recommended to back up the JSON of the task. Copy the JSON output of the task definition and store it in a private repository for recoverability.

The last thing we should do is take stock of our configuration items from earlier, and add our newly configured parameters (Note: Don’t forget to replace these with your own values!)

  • AWS Region: US-East-1 (the region you’re working in)
  • AWS CloudWatch Log Group: SplunkECS (the name of the log group) 
  • AWS ECS Instance Role: ecsInstanceRole (the name of the role to run container instances)
  • AWS Task Execution Role: ecsTaskExecutionRole (the name of the role to run ECS tasks)
  • Splunk HEC server (Splunk Cloud): https://http-inputs-stackname.splunkcloud.com
  • Splunk HEC server (Splunk Enterprise): https://stackname.example.com (Splunk Enterprise)
  • Splunk HEC Port: 8088 or 443
  • Splunk HEC Index: scratch (the name of the index you configured in your HEC)
  • Splunk HEC Token: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
  • AWS Instance Key Pair: ECSDemo
  • AWS S3 bucket: SplunkECS
  • AWS ECS Cluster Name: SplunkECS
  • AWS Load Balancer Name: SplunkECS
  • AWS Load Balancer Target Group: nginx-web
  • AWS Load Balancer DNS Address (Found in the Load Balancer Description): <Your Fully Qualified Address>
  • AWS Task Definition Name: Splunk-ECS
     

Running The Task On Our ECS Cluster

Now that we have all of our AWS components configured, we need to actually run our first iteration of the task definition. Typically I run the first version of the task definition to make sure my apps and containers are configured correctly. Once I’m happy with how they’re running, I’ll add the log routing, which we’ll cover in the last two segments of this article.

  1. From the ECS management console, select the ECS cluster we configured earlier. Under Services, create a new service. Services will handle the work of creating the requisite networking and will manage the deployment for us in one step.
  2. Under configure service:
  3. Launch type: EC2
  4. Task definition: Splunk-ECS
  5. Revision: latest
  6. Cluster: SplunkECS
  7. Service name: SplunkECS
  8. Service type: Replica
  9. Number of tasks: 1
  10. Minimum healthy percent: 100
  11. Maximum percent: 200
  12. Click Next Step
  13. Under load balancing:
  14. Load balancer type: Application
  15. Service IAM role: ecsServiceRole
  16. Load balancer name: SplunkECS
  17. Container to load balance: nginx-web:80:80
  18. Add to load balancer
  19. Target group name: nginx-web (the rest should automatically backfill and populate since we configured the load balancer earlier).
  20. Next step > do not adjust the service’s desired count > next step
  21. Create Service
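For completeness, the same service can be created with boto3. This is a sketch under the walkthrough’s values; the target group ARN is a placeholder you’d copy from the EC2 console.

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.create_service(
    cluster="SplunkECS",
    serviceName="SplunkECS",
    taskDefinition="Splunk-ECS",   # latest ACTIVE revision by default
    desiredCount=1,
    launchType="EC2",
    role="ecsServiceRole",
    deploymentConfiguration={
        "minimumHealthyPercent": 100,
        "maximumPercent": 200,
    },
    loadBalancers=[{
        "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:ACCOUNT:targetgroup/nginx-web/PLACEHOLDER",
        "containerName": "nginx-web",
        "containerPort": 80,
    }],
)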
     

It may take a while for the service and task to start up. Under the Tasks tab in the cluster summary page, you can monitor the state of the task as it starts up. If you find that the task is stuck in PENDING or keeps stopping unexpectedly, you can always click on the task ARN and view the details to see why. If you need to make changes to your task definition, you can always create a new task revision. As always, be sure to take a backup of your work.

Once your service is up and running, you can verify that the NGINX web server in our ECS container is alive by visiting the DNS address associated with our load balancer endpoint.


Even though this landing page is nothing special, it actually provides a very interesting test bed for us. Since the page is now being hosted live in AWS, anyone in the world can access it. With the engine running, behind the scenes our NGINX server is writing both its access and error logs to a standard-out buffer, which is currently going to CloudWatch. With everything in place, the only thing left to do is have some fun and learn about event routing and getting our data to Splunk!

Creating A FluentD Configuration

Now that we have an ECS cluster built, serving up a container with a boilerplate website, the next task is to configure a sidecar container running FluentD to manage the logs that will be routed with Firelens.

We’ll be storing our configuration in the S3 bucket we created earlier in this post.

  1. Navigate to the S3 storage management console
  2. Select the bucket we created earlier (splunkecs)
  3. Create a directory called fluentd
  4. In a text editor, create a new file called fluent.conf
  5. Paste the following code block in the text file and replace the variables with the ones specified in our configuration items.
     
<system>
  log_level info
</system>
<match **>
  @type splunk_hec
  protocol <http or https, depending on your hec global settings>
  hec_host <Splunk HEC server>
  hec_port <Splunk HEC port>
  hec_token <Splunk HEC token>
  index <Splunk HEC index>
  host_key ec2_instance_id
  source_key ecs_cluster
  sourcetype_key ecs_task_definition
  insecure_ssl <true or false, depending on whether you have a valid cert>
  <fields>
    container_id
    container_name
    ecs_task_arn
    source
  </fields>
  <format>
    @type single_value
    message_key log
    add_newline false
  </format>
</match>
  6. In the S3 storage bucket, navigate to the fluentd/ directory
  7. Upload the fluent.conf file
  8. Select the fluent.conf file in the S3 bucket to view its details
  9. Make note of the Amazon Resource Name (ARN) for the fluentd/fluent.conf file in your storage bucket. We’ll need this when we configure our Firelens router:
  • fluent.conf ARN: arn:aws:s3:::splunkecs/fluentd/fluent.conf
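Uploading the file can also be scripted; here’s a one-line boto3 sketch, assuming fluent.conf is in your working directory and the bucket is splunkecs:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Lands at s3://splunkecs/fluentd/fluent.conf, matching the ARN noted above.
s3.upload_file("fluent.conf", "splunkecs", "fluentd/fluent.conf")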

A quick recap on what we’re doing in the fluent.conf file: the match stanza is a blind match on everything we are logging at the info level. Inside it, we define the FluentD plugin we’re using with the @type, along with the details about the Splunk HEC. We’re telling FluentD to use certain metadata to classify where the logs are coming from as the host, source and sourcetype. We’re also colouring in more metadata fields with the container information and the task ARN. Finally, we’re telling FluentD how to format the log and where to break the events.

Putting It All Together: Running Firelens To Route Logs To FluentD And Splunk

At this point we have everything we need to start routing the logs to Splunk via a simple Firelens configuration and FluentD.

By this point, we’ve configured our ECS cluster with a container instance, a simple NGINX web server running in a container, a task definition and an S3 bucket location for our configuration files. The last task ahead is, of course, putting it all together to route the output of the NGINX logs to Splunk. Since NGINX writes two logs to STDOUT, we’ll be trapping those events with the Firelens router and using FluentD to aggregate and send them directly to Splunk over an HTTP push.

For this segment, we’ll be making the configuration in the task definition descriptor, which is a JSON object. This is another way of configuring ECS task definitions and offers a bit of flexibility we need for this part of the tutorial.

  1. Open the ECS management console > Task Definitions
  2. Select the task definition we created for the NGINX deployment (Splunk-ECS)
  3. Select the latest revision (e.g., Splunk-ECS:1)
  4. Create new revision
  5. Scroll down to the bottom of the dialog window and select Configure via JSON
  6. The first thing we need to do is define the sidecar container with the FluentD configuration. Add the following block to the containerDefinitions[] array:
     
{
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": “<AWS CloudWatch Log Group>”,
          "awslogs-region": "<AWS Region>",
          "awslogs-stream-prefix": "<Some prefix of your choice>"
        }
      },
      "image": "splunk/fluentd-hec:1.2.0",
      "firelensConfiguration": {
        "type": "fluentd",
        "options": {
          "config-file-type": "s3",
          "config-file-value": "<fluent.conf ARN>"
        }
      },
      "essential": true,
      "name": "log_router",
      "memory": "256",
      "memoryReservation": "128"
    }

The log driver in the FluentD container is configured as the awslogs log driver. The reason for this is that we need a fallback location to send FluentD’s own logs in case the container has problems sending data to Splunk. Next, we need to tell the NGINX container to route its logs via Firelens, rather than through the awslogs driver. Locate your NGINX container definition and replace the logConfiguration{} stanza with the following:

"logConfiguration": {
        "logDriver": "awsfirelens"
      },
  7. Save the JSON file and check for any validation errors.
  8. The finalized JSON configuration, with all null parameters removed, should look similar to this example:
{
  "containerDefinitions": [
    {
      "logConfiguration": {
        "logDriver": "awsfirelens"
      },
      "entryPoint": null,
      "portMappings": [
        {
          "hostPort": 80,
          "protocol": "tcp",
          "containerPort": 80
        }
      ],
      "memory": 256,
      "image": "nginx:latest",
      "name": "nginx-web"
    },
    {
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "debug-fluentd",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "splunk-ecs"
        }
      },
      "image": "splunk/fluentd-hec:1.2.0",
      "firelensConfiguration": {
        "type": "fluentd",
        "options": {
          "config-file-type": "s3",
          "config-file-value": "arn:aws:s3:::splunkecs/fluentd/fluent.conf"
        }
      },
      "essential": true,
      "name": "log_router",
      "memory": "256",
      "memoryReservation": "128"
    }
  ],
  "family": "Splunk-EC2",
  "requiresCompatibilities": [
    "EC2"
  ]
}
  9. Navigate back to ECS management console > Clusters and select the ECS cluster running the existing Splunk-ECS task
  10. Although it is possible to update and force a re-deploy of a task definition, we want to make sure we drain all instances of any active connections from the load balancer.
  11. In the Splunk-ECS cluster, select the Splunk-ECS service > Delete
  12. Next, we have to deregister the old container targets for the load balancer. Navigate to the EC2 management console > Target Groups
  13. Select the target group associated with the SplunkECS load balancer (nginx-web) > Targets
  14. Select all targets > Deregister (see the sketch below for a scripted equivalent)
  15. In a web browser, revisit the DNS address associated with our load balancer endpoint. Make sure you have a fresh browser cache. Once you see a 503 message, your targets have successfully been deregistered. The 503 in this case means that the DNS entry for our load balancer is valid, but the load balancer has nowhere to send the traffic.
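As referenced in step 14, if you’d rather drain the targets programmatically, here’s a hedged boto3 sketch; the target group ARN is a placeholder for the nginx-web group’s real ARN.

import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

tg_arn = "arn:aws:elasticloadbalancing:us-east-1:ACCOUNT:targetgroup/nginx-web/PLACEHOLDER"

# Look up whatever is currently registered, then deregister it all.
health = elbv2.describe_target_health(TargetGroupArn=tg_arn)
targets = [d["Target"] for d in health["TargetHealthDescriptions"]]

if targets:
    elbv2.deregister_targets(TargetGroupArn=tg_arn, Targets=targets)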
     

  16. Navigate back to ECS management console > Clusters and select our ECS cluster
  17. Verify that the previous service and tasks are not running, and Create a new service.
  18. Under configure service:
  19. Launch type: EC2
  20. Task definition: Splunk-ECS
  21. Revision: latest (this should be the newest version with the FluentD container)
  22. Cluster: SplunkECS
  23. Service name: SplunkECS
  24. Service type: Replica
  25. Number of tasks: 1
  26. Minimum healthy percent: 100
  27. Maximum percent: 200
  28. Click Next Step
  29. Under load balancing:
  30. Load balancer type: Application
  31. Service IAM role: ecsServiceRole
  32. Load balancer name: SplunkECS
  33. Container to load balance: nginx-web:80:80
  34. Add to load balancer
  35. Target group name: nginx-web (the rest should automatically backfill and populate since we configured the load balancer earlier).
  36. Next step > do not adjust the service’s desired count > next step
  37. Create Service
  38. Once your service is created, you can review the status of the tasks to make sure everything is running correctly. Since we enabled awslogs logging for the log router, any issues that arise should be visible in the CloudWatch log group we created previously.
  39. Once again, in a new browser window, navigate to the DNS address of the SplunkECS load balancer. Just like before, you should be greeted with the NGINX welcome message.
  40. Before we head back to Splunk to check our logs, try adding a deliberately invalid URI path, e.g., /splunk-firelens-test.php. This should give us a 404 Not Found error (see the sketch below for a scripted check).
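Here’s a small Python sketch of that check, assuming the requests library is installed; the hostname below is a placeholder for your load balancer’s DNS name.

import requests

lb_dns = "SplunkECS-000000000.us-east-1.elb.amazonaws.com"  # placeholder

resp = requests.get(f"http://{lb_dns}/splunk-firelens-test.php")
print(resp.status_code)  # expect 404 -- NGINX logs it, and Firelens ships it to Splunk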
     

Now, from our Splunk instance, we should see the bad request in our scratch index!

It all works! At this point you can now start analyzing your application logs, rename sourcetypes and do whatever awesome things you love to do with Splunk and all the data.


In part 3 of this series, we will be performing a similar task, but with containers running in an AWS Fargate profile. Fargate is similar to ECS with regards to task management but since we don’t have access to S3 storage buckets or container instances our approach will be a bit different.

Don’t forget to check out Splunk.com for the latest updates, downloads and events for everything Splunk!

Until then, happy Splunking!

----------------------------------------------------
Thanks!
Andrij Demianczuk

Splunk