Our next-gen architecture is built to help you make sense of your ever-growing data. Watch a 4-min demo video!

A Practical Guide to Logstash: Parsing Common Log Patterns with Grok

  • Frank Kane
  • January 19, 2021
Share article
logstash parsing with grok

In a previous post, we explored the basic concepts behind using Grok patterns with Logstash to parse files. We saw how versatile this combo is and how it can be adapted to process almost anything we want to throw at it. But the first few times you use something, it can be hard to figure out how to configure for your specific use case. Looking at real-world examples can help here, so let’s learn how to use Grok patterns in Logstash to parse common logs we’d often encounter, such as those generated by Nginx, MySQL, Elasticsearch, and others.

First, Some Preparation

We’ll take a look at a lot of example logs and Logstash config files in this post so, if you want to follow along, instead of downloading each one at each step, let’s just copy all of them at once and place them in the “/etc/logstash/conf.d/logstash” directory.

First, install Git if it’s not already installed:

sudo apt update && sudo apt install git

Now let’s download the files and place them in the directory:

sudo git clone https://github.com/coralogix-resources/logstash.git /etc/logstash/conf.d/logstash

NGINX Access Logs

NGINX and Apache are the most popular web servers in the world. So, chances are, we will often have contact with the logs they generate. These logs reveal information about visits to your site like file access requests, NGINX responses to those requests, and information about the actual visitors, including their IP, browser, operating system, etc. This data is helpful for general business intelligence, but also for monitoring for security threats by malicious actors. 

Let’s see how a typical Nginx log is structured.

We’ll open the following link in a web browser https://raw.githubusercontent.com/coralogix-resources/logstash/master/nginx/access.log and then copy the first line. Depending on your monitor’s resolution, the first line might actually be broken into two lines, to fit on the screen (otherwise called “line wrapping”). To avoid any mistakes, here is the exact content of the line we will copy:

73.44.199.53 - - [01/Jun/2020:15:49:10 +0000] "GET /blog/join-in-mongodb/?relatedposts=1 HTTP/1.1" 200 131 "https://www.techstuds.com/blog/join-in-mongodb/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"

Original image link here

Next, let’s open the Grok Debugger Tool at https://grokdebug.herokuapp.com/ to help us out. In the first field, the input section, we’ll paste the previously copied line.

Original image link here

Now let’s have a look at the Logstash config we’ll use to parse our Nginx log: https://raw.githubusercontent.com/coralogix-resources/logstash/master/nginx/nginx-access-final.conf

From here, we’ll copy the Grok pattern from the “match” section. This is the exact string we should copy:

%{IPORHOST:remote_ip} - %{DATA:user_name} \[%{HTTPDATE:access_time}\] \"%{WORD:http_method} %{DATA:url} HTTP/%{NUMBER:http_version}\" %{NUMBER:response_code} %{NUMBER:body_sent_bytes} \"%{DATA:referrer}\" \"%{DATA:agent}\"

Original image link 

We go back to the https://grokdebug.herokuapp.com/ website and paste the Grok pattern in the second field, the pattern section. We’ll also tick the “Named captures only” checkbox and then click the “Go” button.

Note: For every line you copy and paste, make sure there are no empty lines before (or after) the actual text in the pattern field. Depending on how you copy and paste text, sometimes an empty line might get inserted before or after the copied string, which will make the Grok Debugger fail to parse your text. If this happens, just delete the empty line(s).

Original image link here

This tool is useful to test if our Grok patterns work as intended. It makes it convenient to try out new patterns, or modify existing ones and see in advance if they produce the desired results.

Now that we’ve seen that this correctly separates and extracts the data we need, let’s run Logstash with the configuration created specifically to work with the Nginx log file:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/nginx/nginx-access-final.conf

The job should finish in a few seconds. When we notice no more output is generated, we can close Logstash by pressing CTRL+C.

Now let’s see how the log file has been parsed and indexed:

curl -XGET "http://localhost:9200/nginx-access-logs-02/_search?pretty" -H 'Content-Type: application/json' -d'{
  "size": 1, 
  "track_total_hits": true,
  "query": {
    "bool": {
      "must_not": [
        {
          "term": {
            "tags.keyword": "_grokparsefailure"
          }
        }
      ]
    }
  }
}'

We’ll see a response similar to the following:

      {
        "_index" : "nginx-access-logs-02",
        "_type" : "_doc",
        "_id" : "vvhO2XIBB7MjzkVPHJhV",
        "_score" : 0.0,
        "_source" : {
          "access_time" : "01/Jun/2020:15:49:10 +0000",
          "user_name" : "-",
          "url" : "/blog/join-in-mongodb/?relatedposts=1",
          "path" : "/etc/logstash/conf.d/logstash/nginx/access.log",
          "body_sent_bytes" : "131",
          "response_code" : "200",
          "@version" : "1",
          "referrer" : "https://www.techstuds.com/blog/join-in-mongodb/",
          "http_version" : "1.1",
          "read_timestamp" : "2020-06-21T23:54:33.738Z",
          "@timestamp" : "2020-06-21T23:54:33.738Z",
          "agent" : "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36",
          "http_method" : "GET",
          "host" : "coralogix",
          "remote_ip" : "73.44.199.53"
        }

We can see the fields and their associated values neatly extracted by the Grok patterns.

IIS Logs

While we’ll often see Apache and Nginx web servers on the Linux operating system, Microsoft Windows has its own web server included in IIS (Internet Information Services). These generate their own logs that can be helpful to monitor the state and activity of applications. Let’s learn how to parse logs generated by IIS.

Just as before, we will take a look at the sample log file and extract the first useful line: https://raw.githubusercontent.com/coralogix-resources/logstash/master/iis/u_ex171118-sample.log.

We’ll ignore the first few lines starting with “#” as that is a header, and not actual logged data. The line we’ll extract is the following:

2017-11-18 08:48:20 GET /de adpar=12345&gclid=1234567890 443 - 149.172.138.41 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/62.0.3202.89+Safari/537.36+OPR/49.0.2725.39 - https://www.google.de/ www.site-logfile-explorer.com 301 0 0 624 543 46

Original image link here

Once again, to take a closer look at how our specific Grok patterns will work, we’ll paste our log line into the Grok Debugger Tool tool, in the first field, the input section.

Original image link here

The config file we’ll use to parse the log can be found at https://raw.githubusercontent.com/coralogix-resources/logstash/master/iis/iis-final-working.conf.

Original image link here 

Once again, let’s copy the Grok pattern within:

%{TIMESTAMP_ISO8601:time} %{WORD:method} %{URIPATH:uri_requested} %{NOTSPACE:query} %{NUMBER:port} %{NOTSPACE:username} %{IPORHOST:client_ip} %{NOTSPACE:http_version} %{NOTSPACE:user_agent} %{NOTSPACE:cookie} %{URI:referrer_url} %{IPORHOST:host} %{NUMBER:http_status_code} %{NUMBER:protocol_substatus_code} %{NUMBER:win32_status} %{NUMBER:bytes_sent} %{NUMBER:bytes_received} %{NUMBER:time_taken}

…and paste it to the second field in the https://grokdebug.herokuapp.com/ website, the pattern section:

Original Image Link here

Let’s run Logstash and parse this IIS log:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/iis/iis-final-working.conf

As usual, we’ll wait for a few seconds until the job is done and then press CTRL+C to exit the utility.

Let’s look at the parsed data:

curl -XGET "http://localhost:9200/iis-log/_search?pretty" -H 'Content-Type: application/json' -d'{
  "size": 1, 
  "track_total_hits": true,
  "query": {
    "bool": {
      "must_not": [
        {
          "term": {
            "tags.keyword": "_grokparsefailure"
          }
        }
      ]
    }
  }
}'

A response similar to the following shows us that everything is neatly structured in the index.

      {
        "_index" : "iis-log",
        "_type" : "_doc",
        "_id" : "6_i62XIBB7MjzkVPS5xL",
        "_score" : 0.0,
        "_source" : {
          "http_version" : "HTTP/2.0",
          "query" : "adpar=12345&gclid=1234567890",
          "bytes_received" : "543",
          "read_timestamp" : "2020-06-22T01:52:43.628Z",
          "user_agent" : "Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/62.0.3202.89+Safari/537.36+OPR/49.0.2725.39",
          "uri_requested" : "/de",
          "username" : "-",
          "time_taken" : "46",
          "referrer_url" : "https://www.google.de/",
          "client_ip" : "149.172.138.41",
          "http_status_code" : "301",
          "bytes_sent" : "624",
          "time" : "2017-11-18 08:48:20",
          "cookie" : "-",
          "method" : "GET",
          "@timestamp" : "2017-11-18T06:48:20.000Z",
          "protocol_substatus_code" : "0",
          "win32_status" : "0",
          "port" : "443"
        }

MongoDB Logs

While not as popular as MySQL, the MongoDB database engine still has a fairly significant market share and is used by many leading companies. The MongoDB logs can help us track the database performance and resource utilization to help with troubleshooting and performance tuning. 

Let’s see how a MongoDB log looks like: https://raw.githubusercontent.com/coralogix-resources/logstash/master/mongodb/mongodb.log.

Original image link

We can see fields are structured in a less repetitive and predictable way than in a typical Nginx log.

Let’s copy the first line from the log and paste it into the first field of the Grok Debugger Tool website.

2019-06-25T10:08:01.111+0000 I CONTROL  [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'

Original image link

The config file we will use for Logstash, to parse our log, can be found at https://raw.githubusercontent.com/coralogix-resources/logstash/master/mongodb/mongodb-final.conf.

Original image link

And here is the Grok pattern we need to copy:

%{TIMESTAMP_ISO8601:timestamp}\s+%{NOTSPACE:severity}\s+%{NOTSPACE:component}\s+(?:\[%{DATA:context}\])?\s+%{GREEDYDATA:log_message}

As usual, let’s paste it to the second field in the https://grokdebug.herokuapp.com/ website.

Original image link 

Let’s run Logstash:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/mongodb/mongodb-final.conf

When the job is done, we press CTRL+C to exit the program and then we can take a look at how the data was parsed:

curl -XGET "http://localhost:9200/mongo-logs-01/_search?pretty" -H 'Content-Type: application/json' -d'{
  "size": 1, 
  "track_total_hits": true,
  "query": {
    "bool": {
      "must_not": [
        {
          "term": {
            "tags.keyword": "_grokparsefailure"
          }
        }
      ]
    }
  }
}'

The output should be similar to the following:

      {
        "_index" : "mongo-logs-01",
        "_type" : "_doc",
        "_id" : "0vjo2XIBB7MjzkVPS6y9",
        "_score" : 0.0,
        "_source" : {
          "log_message" : "Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'",
          "@timestamp" : "2020-06-22T02:42:58.604Z",
          "timestamp" : "2019-06-25T10:08:01.111+0000",
          "context" : "main",
          "component" : "CONTROL",
          "read_timestamp" : "2020-06-22T02:42:58.604Z",
          "@version" : "1",
          "path" : "/etc/logstash/conf.d/logstash/mongodb/mongodb.log",
          "host" : "coralogix",
          "severity" : "I"
        }

User Agent Mapping and IP to Geo Location Mapping in Logs

Very often, when a web browser requests a web page from a web server, it also sends a so-called “user agent”. This can contain information such as the operating system used by a user, the device, the web browser name and version and so on. Obviously, this can be very useful data in certain scenarios. For example, it can help you find out if users of a particular operating system are experiencing issues.

Web servers also log the IP addresses of the visitors. While that’s useful to have in raw logs, those numbers themselves are not always useful to humans. They might be nice to have when trying to debug connectivity issues, or block a class of IPs, but for statistics and charts, it might be more relevant to have the geographic location of each IP, like country/city and so on.

Logstash can “transform” user agents like

Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/62.0.3202.89+Safari/537.36+OPR/49.0.2725.39

to the actual names of the specific operating system, device and/or browser that was used, and other info which is much more easy to read and understand by humans. Likewise, IP addresses can be transformed to estimated geographical locations. The technical term for these transformations is mapping.

Let’s take a look at an Apache access log: https://raw.githubusercontent.com/coralogix-resources/logstash/master/apache/access_log.

Original image link

We notice IP addresses and user agents all throughout the log. Now let’s see the Logstash config we’ll use to do our mapping magic with this information: https://raw.githubusercontent.com/coralogix-resources/logstash/master/apache/apache-access-enriched.conf.

The interesting entries here can be seen under “useragent” and “geoip“.

Original image link 

In the useragent filter section, we simply instruct Logstash to take the contents of the agent field, process them accordingly, and map them back to the agent field.

In the geoip filter, we instruct Logstash to take the information from the clientip field, process it, and then insert the output in a new field, called geoip.

Let’s run Logstash with this config and see what happens:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/apache/apache-access-enriched.conf

We’ll need to wait for a longer period of time for this to be done as there are many more lines the utility has to process (tens of thousands). As usual, when it’s done, we’ll press CTRL+C to exit.

Now let’s explore how this log was parsed and what was inserted to the index:

curl -XGET "http://localhost:9200/apache-logs/_search?pretty" -H 'Content-Type: application/json' -d'{
  "size": 1,
  "track_total_hits": true,
  "query": {
  "bool": {
    "must_not": [
      {
        "term": {
          "tags.keyword": "_grokparsefailure"
        }
      }
    ]
  }
  }
}'

The output will be similar to the following:

      {
        "_index" : "apache-logs",
        "_type" : "_doc",
        "_id" : "4vgC2nIBB7MjzkVPhtPl",
        "_score" : 0.0,
        "_source" : {
          "verb" : "GET",
          "host" : "coralogix",
          "response" : "200",
          "agent" : {
            "name" : "Firefox",
            "build" : "",
            "device" : "Other",
            "os" : "Windows",
            "major" : "34",
            "minor" : "0",
            "os_name" : "Windows"
          },
          "clientip" : "178.150.5.107",
          "ident" : "-",
          "bytes" : "5226",
          "geoip" : {
            "continent_code" : "EU",
            "timezone" : "Europe/Kiev",
            "country_code3" : "UA",
            "country_name" : "Ukraine",
            "location" : {
              "lat" : 50.4547,
              "lon" : 30.5238
            },
            "region_name" : "Kyiv City",
            "city_name" : "Kyiv",
            "country_code2" : "UA",
            "ip" : "178.150.5.107",
            "postal_code" : "04128",
            "longitude" : 30.5238,
            "region_code" : "30",
            "latitude" : 50.4547
          },
          "referrer" : "\"-\"",
          "auth" : "-",
          "httpversion" : "1.1",
          "read_timestamp" : "2020-06-22T03:11:37.715Z",
          "path" : "/etc/logstash/conf.d/logstash/apache/access_log",
          "@timestamp" : "2017-04-30T19:16:43.000Z",
          "request" : "/wp-login.php",
          "@version" : "1"
        }
      }

Looking good. We can see the newly added geoip and agent fields are very detailed and very easy to read.

Elasticsearch Logs

We explored many log types, but let’s not forget, Elasticsearch generates logs too which helps us troubleshooting issues, like for example, figuring out why a node hasn’t started. Let’s look at a sample: https://raw.githubusercontent.com/coralogix-resources/logstash/master/elasticsearch_logs/elasticsearch.log.

Original link here

Now, this is slightly different from what we’ve worked with up until now. In all the other logs, each line represented one specific log entry (or message). That meant we could process them line by line and reasonably expect that each logged event is contained within a single line, in its entirety. 

Here, however, we sometimes encounter multi-line log entries. This means that a logged event can span across multiple lines, not just one. Fortunately, though, Elasticsearch clearly signals where a logged event begins and where it ends. It does so by using opening [ and closing ] square brackets. If you see that a line opens a square bracket [ but doesn’t close it on the same line, you know that’s a multi-line log entry and it ends on the line that finally uses the closing square bracket ].

Logstash can easily process these logs by using the multiline input codec.

Let’s take a look at the Logstash config we’ll use here: https://raw.githubusercontent.com/coralogix-resources/logstash/master/elasticsearch_logs/es-logs-final.conf.

Original link here

In the codec => multiline section of our config, we define the pattern that instructs Logstash on how to identify multiline log entries. Here, we use a RegEx pattern, but of course, we can also use Grok patterns when we need to.

With negate set to true, a message that matches the pattern is not considered a match for the multiline filter. By default, this is set to false and when it is false, a message that matches the pattern is considered a match for multiline.

what” can be assigned a value of “previous” or “next“. For example, if we have a match, negate is set to false, and what has a value set to previous, this means that the current matched line belongs to the same event as the previous line.

In a nutshell, what we are doing for our scenario here is telling Logstash that if a line does not start with an opening square bracket [ then the line in the log file is a continuation of the previous line, so these will be grouped in a single event. Logstash will apply a “multiline” tag to such entries, which can be useful for debugging, or other similar purposes if we ever need to know which entry was contained in a single line, and which on multiple lines.

In the filter section we use a typical Grok pattern, just like we did many times before, and replace the message field with the parsed content.

Finally, a second Grok pattern will process the content in the message field even further, extracting things like the logged node name, index name, and so on.

Let’s run Logstash and see all of this in action:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/elasticsearch_logs/es-logs-final.conf

After the program does its job, we press CTRL+C to exit.

Logstash has now parsed both single-line events and multiline events. We will now see how useful it can be that multiline events have been tagged appropriately. Because of this tag, we can now search entries that contain only single-line events. We do this by specifying in our cURL request that the matches must_not contain the tags called multiline.

curl -XGET "http://localhost:9200/es-test-logs/_search?pretty" -H 'Content-Type: application/json' -d'{
  "size": 1, 
  "query": {
    "bool": {
      "must_not": [
        {
          "match": {
            "tags": "multiline"
          }
        }
      ]
    }
  }
}'

The output will look something like this:

      {
        "_index" : "es-test-logs",
        "_type" : "_doc",
        "_id" : "9voa2nIBB7MjzkVP7ULy",
        "_score" : 0.0,
        "_source" : {
          "node" : "node-1",
          "source" : "o.e.x.m.MlDailyMaintenanceService",
          "host" : "coralogix",
          "@timestamp" : "2020-06-22T03:38:16.842Z",
          "@version" : "1",
          "message" : "[node-1] triggering scheduled [ML] maintenance tasks",
          "timestamp" : "2020-06-15T01:30:00,000",
          "short_message" : "triggering scheduled [ML] maintenance tasks",
          "type" : "elasticsearch",
          "severity" : "INFO",
          "path" : "/etc/logstash/conf.d/logstash/elasticsearch_logs/elasticsearch.log"
        }

Now let’s filter only the multiline entries:

curl -XGET "http://localhost:9200/es-test-logs/_search?pretty" -H 'Content-Type: application/json' -d'{
  "size": 1, 
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "tags": "multiline"
          }
        }
      ]
    }
  }
}'

Output should look similar to this:

      {
        "_index" : "es-test-logs",
        "_type" : "_doc",
        "_id" : "Kfoa2nIBB7MjzkVP7UPy",
        "_score" : 0.046520013,
        "_source" : {
          "node" : "node-1",
          "source" : "r.suppressed",
          "host" : "coralogix",
          "@timestamp" : "2020-06-22T03:38:16.968Z",
          "@version" : "1",
          "message" : "[node-1] path: /.kibana/_count, params: {index=.kibana}\norg.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:551) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:580) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:393) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:223) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:288) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.7.0.jar:7.7.0]\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]\n\tat java.lang.Thread.run(Thread.java:832) [?:?]",
          "timestamp" : "2020-06-15T17:13:35,457",
          "short_message" : "path: /.kibana/_count, params: {index=.kibana}\norg.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:551) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:580) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:393) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction.lambda$performPhaseOnShard$0(AbstractSearchAsyncAction.java:223) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.action.search.AbstractSearchAsyncAction$2.doRun(AbstractSearchAsyncAction.java:288) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:692) [elasticsearch-7.7.0.jar:7.7.0]\n\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.7.0.jar:7.7.0]\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]\n\tat java.lang.Thread.run(Thread.java:832) [?:?]",
          "type" : "elasticsearch",
          "severity" : "WARN",
          "tags" : [
            "multiline"
          ],
          "path" : "/etc/logstash/conf.d/logstash/elasticsearch_logs/elasticsearch.log"
        }

Elasticsearch Slow Logs

Elasticsearch can also generate another type of logs, called slow logs and are used to optimize Elasticsearch search and indexing operations. These are easier to process since they don’t contain multiline messages.

Let’s take a look at a slow log: https://raw.githubusercontent.com/coralogix-resources/logstash/master/elasticsearch_slowlogs/es_slowlog.log.

Original image link 

As we did in previous sections, let’s copy the first line and paste it into the first (input) field of the https://grokdebug.herokuapp.com/ website.

[2018-03-13T00:01:09,810][TRACE][index.search.slowlog.query] [node23] [inv_06][1] took[291.9micros], took_millis[0], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[105], source[{"size":1000,"query":{"has_parent":{"query":{"bool":{"must":[{"terms":{"id_receipt":[234707456,234707458],"boost":1.0}},{"term":{"receipt_key":{"value":6799,"boost":1.0}}},{"term":{"code_receipt":{"value":"TKMS","boost":1.0}}}],"disable_coord":false,"adjust_pure_negative":true,"boost":1.0}},"parent_type":"receipts","score":false,"ignore_unmapped":false,"boost":1.0}},"version":true,"_source":false,"sort":[{"_doc":{"order":"asc"}}]}],

Original image link

Now let’s take a look at the Logstash config we’ll use: https://raw.githubusercontent.com/coralogix-resources/logstash/master/elasticsearch_slowlogs/es-slowlog-final.conf.

Original image link

Let’s copy the Grok pattern within this config and paste it to the second (pattern) field of the https://grokdebug.herokuapp.com/ website.

%{TIMESTAMP_ISO8601:timestamp}\]\[%{LOGLEVEL:level}\]\[%{HOSTNAME:type}\]%{SPACE}\[%{HOSTNAME:[node_name]}\]%{SPACE}\[%{WORD:[index_name]}\]%{NOTSPACE}%{SPACE}took\[%{NUMBER:took_micro}%{NOTSPACE}\]%{NOTSPACE}%{SPACE}%{NOTSPACE}%{NOTSPACE}%{SPACE}%{NOTSPACE}%{NOTSPACE}%{SPACE}%{NOTSPACE}%{NOTSPACE}%{SPACE}search_type\[%{WORD:search_type}\]%{NOTSPACE}%{SPACE}total_shards\[%{NUMBER:total_shards}\]%{NOTSPACE}%{SPACE}source%{GREEDYDATA:query}\Z

Original image link 

Now that we saw how this Grok pattern works, let’s run Logstash with our new config file.

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/elasticsearch_slowlogs/es-slowlog-final.conf

As usual, once the parsing is done, we press CTRL+C to exit the application.

Let’s see how the log file was parsed and added to the index:

curl -XGET "http://localhost:9200/es-slow-logs/_search?pretty" -H 'Content-Type: application/json' -d'{  "size": 1}'

The output will look something like this:

      {
        "_index" : "es-slow-logs",
        "_type" : "_doc",
        "_id" : "e-JzvHIBocjiYgvgqO4l",
        "_score" : 1.0,
        "_source" : {
          "total_shards" : "105",
          "message" : """[2018-03-13T00:01:09,810][TRACE][index.search.slowlog.query] [node23] [inv_06][1] took[291.9micros], took_millis[0], types[], stats[], search_type[QUERY_THEN_FETCH], total_shards[105], source[{"size":1000,"query":{"has_parent":{"query":{"bool":{"must":[{"terms":{"id_receipt":[234707456,234707458],"boost":1.0}},{"term":{"receipt_key":{"value":6799,"boost":1.0}}},{"term":{"code_receipt":{"value":"TKMS","boost":1.0}}}],"disable_coord":false,"adjust_pure_negative":true,"boost":1.0}},"parent_type":"receipts","score":false,"ignore_unmapped":false,"boost":1.0}},"version":true,"_source":false,"sort":[{"_doc":{"order":"asc"}}]}], """,
          "node_name" : "node23",
          "index_name" : "inv_06",
          "level" : "TRACE",
          "type" : "index.search.slowlog.query",
          "took_micro" : "291.9",
          "timestamp" : "2018-03-13T00:01:09,810",
          "query" : """[{"size":1000,"query":{"has_parent":{"query":{"bool":{"must":[{"terms":{"id_receipt":[234707456,234707458],"boost":1.0}},{"term":{"receipt_key":{"value":6799,"boost":1.0}}},{"term":{"code_receipt":{"value":"TKMS","boost":1.0}}}],"disable_coord":false,"adjust_pure_negative":true,"boost":1.0}},"parent_type":"receipts","score":false,"ignore_unmapped":false,"boost":1.0}},"version":true,"_source":false,"sort":[{"_doc":{"order":"asc"}}]}], """,
          "search_type" : "QUERY_THEN_FETCH"
        }
      }

MySQL Slow Logs

MySQL can also generate slow logs to help with optimization efforts. However, these will log events on multiple lines so we’ll need to use the multiline codec again.

Let’s look at a log file: https://raw.githubusercontent.com/coralogix-resources/logstash/master/mysql_slowlogs/mysql-slow.log.

Original image link here

Now let’s look at the Logstash config file: https://raw.githubusercontent.com/coralogix-resources/logstash/master/mysql_slowlogs/mysql-slowlogs.conf.

Original image link

In the multiline codec configuration, we use a Grok pattern. Simply put, we instruct Logstash that if the line doesn’t begin with the “# Time:” string, followed by a timestamp in the TIMESTAMP_ISO8601 format, then this line should be grouped together with previous lines in this event. This makes sense, since all logged events in this slow log begin with that specific timestamp, and then describe what has happened at that time, in the next few lines. Consequently, whenever a new timestamp appears, it signals the end of the current logged event and the beginning of the next.

Let’s run Logstash with this config:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/mysql_slowlogs/mysql-slowlogs.conf

As always, after the parsing is done, we press CTRL+C to exit the utility.

Let’s look at how the slow log was parsed:

curl -XGET "http://localhost:9200/mysql-slowlogs-01/_search?pretty" -H 'Content-Type: application/json' -d'{

  "size":1,
  "query": {
    "bool": {
    "must_not": [
      {
        "term": {
          "tags.keyword": "_grokparsefailure"
        }
      }
    ]
  }
  }

}'

The output should look like this:

      {
        "_index" : "mysql-slowlogs-01",
        "_type" : "_doc",
        "_id" : "Zfo42nIBB7MjzkVPGUfK",
        "_score" : 0.0,
        "_source" : {
          "tags" : [
            "multiline"
          ],
          "host" : "localhost",
          "user" : "root",
          "lock_time" : "0.000000",
          "timestamp" : "2020-06-03T06:04:09.582225Z",
          "read_timestamp" : "2020-06-22T04:10:08.892Z",
          "message" : " Time: 2020-06-03T06:04:09.582225Z  User@Host: root[root] @ localhost []  Id:     4  Query_time: 3.000192  Lock_time: 0.000000 Rows_sent: 1  Rows_examined: 0 SET timestamp=1591164249; SELECT SLEEP(3);",
          "query_time" : "3.000192",
          "rows_examined" : "0",
          "path" : "/etc/logstash/conf.d/logstash/mysql_slowlogs/mysql-slow.log",
          "sql_id" : "4",
          "@version" : "1",
          "rows_sent" : "1",
          "@timestamp" : "2020-06-22T04:10:08.892Z",
          "command" : "SELECT SLEEP(3)"
        }
      }

AWS ELB

AWS Elastic Load Balancer is a popular service that intelligently distributes traffic across a number of instances. ELB provides access logs that capture detailed information about requests sent to your load balancer. Each ELB log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses.

Let’s look at an example of such a log: https://raw.githubusercontent.com/coralogix-resources/logstash/master/aws_elb/elb_logs.log

Original image link here

Once again, let’s copy the first line of this log and paste it into the first (input) field of the https://grokdebug.herokuapp.com/ website.

2020-06-14T17:26:04.805368Z my-clb-1 170.01.01.02:39492 172.31.25.183:5000 0.000032 0.001861 0.000017 200 200 0 13 "GET http://my-clb-1-1798137604.us-east-2.elb.amazonaws.com:80/ HTTP/1.1" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36" - -

Original image link here

The Logstash config we’ll use is this one: https://raw.githubusercontent.com/coralogix-resources/logstash/master/aws_elb/aws-elb.conf.

Original image link here 

From this config, we can copy the Grok pattern and paste it into the second (pattern) field of the https://grokdebug.herokuapp.com/ website.

%{TIMESTAMP_ISO8601:timestamp} %{NOTSPACE:loadbalancer} %{IP:client_ip}:%{NUMBER:client_port} (?:%{IP:backend_ip}:%{NUMBER:backend_port}|-) %{NUMBER:request_processing_time} %{NUMBER:backend_processing_time} %{NUMBER:response_processing_time} (?:%{NUMBER:elb_status_code}|-) (?:%{NUMBER:backend_status_code}|-) %{NUMBER:received_bytes} %{NUMBER:sent_bytes} \"(?:%{WORD:verb}|-) (?:%{GREEDYDATA:request}|-) (?:HTTP/%{NUMBER:httpversion}|-( )?)\" \"%{DATA:userAgent}\"( %{NOTSPACE:ssl_cipher} %{NOTSPACE:ssl_protocol})?

Original image link here

Let’s run Logstash:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/aws_elb/aws-elb.conf

We press CTRL+C once it finishes its job and then take a look at the index to see how the log has been parsed:

curl -XGET "http://localhost:9200/aws-elb-logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 1,
  "query": {
    "bool": {
      "must_not": [
        {
        "term": {
          "tags": {
            "value": "_grokparsefailure"
          }
        }
      }
      ]
    }
  }
}'

The output should look similar to this:

      {
        "_index" : "aws-elb-logs",
        "_type" : "_doc",
        "_id" : "avpQ2nIBB7MjzkVPIEc-",
        "_score" : 0.0,
        "_source" : {
          "request_processing_time" : "0.000032",
          "timestamp" : "2020-06-14T17:26:05.145274Z",
          "sent_bytes" : "232",
          "@version" : "1",
          "userAgent" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36",
          "elb_status_code" : "404",
          "ssl_protocol" : "-",
          "path" : "/etc/logstash/conf.d/logstash/aws_elb/elb_logs.log",
          "response_processing_time" : "0.000016",
          "backend_processing_time" : "0.002003",
          "client_port" : "39492",
          "verb" : "GET",
          "received_bytes" : "0",
          "backend_ip" : "172.31.25.183",
          "backend_status_code" : "404",
          "client_ip" : "170.01.01.02",
          "backend_port" : "5000",
          "host" : "coralogix",
          "loadbalancer" : "my-clb-1",
          "request" : "http://my-clb-1-1798137604.us-east-2.elb.amazonaws.com:80/favicon.ico",
          "ssl_cipher" : "-",
          "httpversion" : "1.1",
          "@timestamp" : "2020-06-22T04:36:23.160Z"
        }
      }

AWS ALB

Amazon also offers an Application Load Balancer that generates its own logs. These are very similar to the ELB logs and we can see an example here: https://raw.githubusercontent.com/coralogix-resources/logstash/master/aws_alb/alb_logs.log.

Original image link here

The config file we will use can be seen here: https://raw.githubusercontent.com/coralogix-resources/logstash/master/aws_alb/aws-alb.conf.

Original image link here

If you want to test things out in the https://grokdebug.herokuapp.com/ website, the input line you can copy and paste into the first field is the following:

h2 2015-11-07T18:45:33.575333Z elb1 195.142.179.105:55857 10.0.2.143:80 0.000025 0.0003 0.000023 200 200 0 3764 "GET http://example.com:80/favicons/favicon-160x160.png HTTP/1.1" "Mozilla/5.0 (Linux; Android 4.4.2; GT-N7100 Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/30.0.0.0 Mobile Safari/537.36" - - arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067 "Root=1-58337262-36d228ad5d99923122bbe354"

And the Grok pattern is:

%{NOTSPACE:request_type} %{TIMESTAMP_ISO8601:log_timestamp} %{NOTSPACE:alb-name} %{NOTSPACE:client}:%{NUMBER:client_port} (?:%{IP:backend_ip}:%{NUMBER:backend_port}|-) %{NUMBER:request_processing_time} %{NUMBER:backend_processing_time} %{NOTSPACE:response_processing_time:float} %{NOTSPACE:elb_status_code} %{NOTSPACE:target_status_code} %{NOTSPACE:received_bytes:float} %{NOTSPACE:sent_bytes:float} %{QUOTEDSTRING:request} %{QUOTEDSTRING:user_agent} %{NOTSPACE:ssl_cipher} %{NOTSPACE:ssl_protocol} %{NOTSPACE:target_group_arn} %{QUOTEDSTRING:trace_id}

Original image link here

Once again, let’s run Logstash with the new config:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/aws_alb/aws-alb.conf

We’ll press CTRL+C, once it’s done, and then take a look at how the log has been parsed and imported to the index:

curl -XGET "http://localhost:9200/aws-alb-logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 1,
  "query": {
    "bool": {
      "must_not": [
        {"term": {
          "tags": {
            "value": "_grokparsefailure"
          }
        }
      }
      ]
    }
  }
}'

The output should look something like this:

      {
        "_index" : "aws-alb-logs",
        "_type" : "_doc",
        "_id" : "dvpZ2nIBB7MjzkVPF0ex",
        "_score" : 0.0,
        "_source" : {
          "client" : "78.164.152.56",
          "path" : "/etc/logstash/conf.d/logstash/aws_alb/alb_logs.log",
          "client_port" : "60693",
          "ssl_protocol" : "-",
          "target_group_arn" : "arn:aws:elasticloadbalancing:us-west-2:123456789012:targetgroup/my-targets/73e2d6bc24d8a067",
          "backend_port" : "80",
          "trace_id" : "\"Root=1-58337262-36d228ad5d99923122bbe354\"",
          "backend_processing_time" : "0.001005",
          "response_processing_time" : 2.6E-5,
          "@timestamp" : "2020-06-22T04:46:09.813Z",
          "@version" : "1",
          "request_processing_time" : "0.000026",
          "received_bytes" : 0.0,
          "sent_bytes" : 33735.0,
          "alb-name" : "elb1",
          "log_timestamp" : "2015-11-07T18:45:33.578479Z",
          "request_type" : "h2",
          "user_agent" : "\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36\"",
          "request" : "\"GET http://example.com:80/images/logo/devices.png HTTP/1.1\"",
          "elb_status_code" : "200",
          "ssl_cipher" : "-",
          "host" : "coralogix",
          "backend_ip" : "10.0.0.215",
          "target_status_code" : "200"
        }
      }

AWS CloudFront

Amazon’s CloudFront content delivery network generates useful logs to help ensure availability, performance, and in security audits. 

Here is a sample log: https://raw.githubusercontent.com/coralogix-resources/logstash/master/aws_cloudfront/cloudfront_logs.log.

Original image link here

The Logstash config file can be viewed here: https://raw.githubusercontent.com/coralogix-resources/logstash/master/aws_cloudfront/aws-cloudfront.conf.

Original image link

Once again, If you want to test how things work, in the https://grokdebug.herokuapp.com/ website, the input line you can copy and paste into the first field is this one:

2020-06-16	11:00:04	MAA50-C2	7486	2409:4073:20a:8398:c85d:cc75:6c7a:be8b	GET	dej1k5scircsp.cloudfront.net	/css/style/style.css	200	http://dej1k5scircsp.cloudfront.net/	Mozilla/5.0%20(X11;%20Linux%20x86_64)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/77.0.3865.75%20Safari/537.36	-	-	Miss	P9QcGJ-je6GoPCt-1KqOIgAHr6j05In8FFJK4E8DbZKHFyjp-dDfKw==	dej1k5scircsp.cloudfront.net	http	376	0.102	-	-	-	Miss	HTTP/1.1	-	-	38404	0.102	Miss	text/css	7057	-	-

And the Grok pattern is:

%{DATE:date}[ \t]%{TIME:time}[ \t]%{DATA:x_edge_location}[ \t](?:%{NUMBER:sc_bytes}|-)[ \t]%{IP:c_ip}[ \t]%{WORD:cs_method}[ \t]%{HOSTNAME:cs_host}[ \t]%{NOTSPACE:cs_uri_stem}[ \t]%{NUMBER:sc_status}[ \t]%{GREEDYDATA:referrer}[ \t]%{NOTSPACE:user_agent}[ \t]%{GREEDYDATA:cs_uri_query}[ \t]%{NOTSPACE:cookie}[ \t]%{WORD:x_edge_result_type}[ \t]%{NOTSPACE:x_edge_request_id}[ \t]%{HOSTNAME:x_host_header}[ \t]%{URIPROTO:cs_protocol}[ \t]%{INT:cs_bytes}[ \t]%{NUMBER:time_taken}[ \t]%{NOTSPACE:x_forwarded_for}[ \t]%{NOTSPACE:ssl_protocol}[ \t]%{NOTSPACE:ssl_cipher}[ \t]%{NOTSPACE:x_edge_response_result_type}[ \t]%{NOTSPACE:cs_protocol_version}[ \t]%{NOTSPACE:fle_status}[ \t]%{NOTSPACE:fle_encrypted_fields}[ \t]%{NOTSPACE:c_port}[ \t]%{NOTSPACE:time_to_first_byte}[ \t]%{NOTSPACE:x_edge_detailed_result_type}[ \t]%{NOTSPACE:sc_content_type}[ \t]%{NOTSPACE:sc_content_len}[ \t]%{NOTSPACE:sc_range_start}[ \t]%{NOTSPACE:sc_range_end}

Original image link here

Now let’s run Logstash:

sudo /usr/share/logstash/bin/logstash -f /etc/logstash/conf.d/logstash/aws_cloudfront/aws-cloudfront.conf

As always, we press CTRL+C once it finishes its job.

Once again, let’s take a look at how the log has been parsed and inserted into the index:

curl -XGET "http://localhost:9200/aws-cloudfront-logs/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must_not": [
        {"term": {
          "tags": {
            "value": "_grokparsefailure"
          }
        }
      }
      ]
    }
  }
}'

Part of the output should be similar to the following:

{
        "_index" : "aws-cloudfront-logs",
        "_type" : "_doc",
        "_id" : "Da1s4nIBnKKcJetIb-p9",
        "_score" : 0.0,
        "_source" : {
          "time_to_first_byte" : "0.000",
          "cs_uri_stem" : "/favicon.ico",
          "x_edge_request_id" : "vhpLn3lotn2w4xMOxQg77DfFpeEtvX49mKzz5h7iwNXguHQpxD6QPQ==",
          "sc_bytes" : "910",
          "@version" : "1",
          "cs_host" : "dej1k5scircsp.cloudfront.net",
          "c_ip" : "2409:4073:20a:8398:c85d:cc75:6c7a:be8b",
          "user_agent" : "Mozilla/5.0%20(X11;%20Linux%20x86_64)%20AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Chrome/77.0.3865.75%20Safari/537.36",
          "sc_range_start" : "-",
          "c_port" : "57406",
          "x_edge_result_type" : "Error",
          "referrer" : "http://dej1k5scircsp.cloudfront.net/",
          "x_edge_location" : "MAA50-C2",
          "path" : "/etc/logstash/conf.d/logstash/aws_cloudfront/cloudfront_logs.log",
          "cs_protocol" : "http",
          "time_taken" : "0.001",
          "x_forwarded_for" : "-",
          "time" : "10:58:07",
          "cookie" : "-",
          "sc_status" : "502",
          "date" : "20-06-16",
          "sc_range_end" : "-",
          "x_edge_detailed_result_type" : "Error",
          "ssl_cipher" : "-",
          "cs_method" : "GET",
          "x_host_header" : "dej1k5scircsp.cloudfront.net",
          "sc_content_len" : "507",
          "ssl_protocol" : "-",
          "fle_status" : "-",
          "@timestamp" : "2020-06-23T18:24:15.784Z",
          "fle_encrypted_fields" : "-",
          "cs_bytes" : "389",
          "x_edge_response_result_type" : "Error",
          "host" : "coralogix",
          "cs_uri_query" : "-",
          "sc_content_type" : "text/html",
          "cs_protocol_version" : "HTTP/1.1"
        }
      }

Cleaning Up

Before continuing with the next lesson, let’s clean up the resources we created here.

First, we’ll delete the directory where we stored our sample log files and Logstash configurations:

sudo rm -r /etc/logstash/conf.d/logstash/

Next, let’s delete all the new indices we created:

curl -XDELETE localhost:9200/nginx-access-logs-02

curl -XDELETE localhost:9200/iis-log

curl -XDELETE localhost:9200/mongo-logs-01

curl -XDELETE localhost:9200/apache-logs

curl -XDELETE localhost:9200/es-test-logs

curl -XDELETE localhost:9200/es-slow-logs

curl -XDELETE localhost:9200/mysql-slowlogs-01

curl -XDELETE localhost:9200/aws-elb-logs

curl -XDELETE localhost:9200/aws-alb-logs

curl -XDELETE localhost:9200/aws-cloudfront-logs

Conclusion

I hope this arsenal of Grok patterns for common log types are useful for most of your future Logstash needs. Keep in mind that if the log you encounter is just slightly different, only slight changes need to be made to these patterns, which you can use as your starting templates.

Observability and Security
that Scale with You.