Archive Query with AWS Athena
Coralogix can be configured to automatically and dynamically archive logs to an S3 bucket. This saves Coralogix customers money, but there are times when the data needs to be reindexed, and reindexed logs count against the daily quota. Customers often want to narrow their search to the exact logs that need reindexing, or to query the logs outside of Coralogix altogether.
Since the logs reside in an S3 bucket owned by the customer, they can be queried with any tool or method that can access S3. This post will show how to use AWS Athena to query these logs.
Note: In order to perform the operations described in this post, you’ll need to have access to an AWS console with the correct permissions.
Finding your S3 Archive folder
First, you’ll need the name of the bucket used for Coralogix archiving. You can find it by going to Data Flow > Setup Archive.
There are two types of archives:
CSV - stores the archived logs in CSV format
cx-data - stores the archived logs in Parquet format
Either one of them or both can be active.
Access the S3 bucket through the AWS console. It will have a parent folder called 'cx' → 'csv'/'parquet' → 'v1' (for parquet) or 'v2' (for csv) → team_id → dt → hr. Under each of these there are log files divided into 10-minute chunks. All hours are UTC. Each log file will be in CSV or Parquet format and include log lines with:
CSV: timestamp, metadata information, and the log itself as text (JSON fields are not parsed into separate CSV columns).
Parquet: event_labels__methodname, event_labels__classname, event_labels__category, src_obj__event_labels, event_labels__applicationname, event_metadata__logid, src_obj__user_data, event_labels__subsystemname, event_labels__ipaddress, event_metadata__severity, event_labels__threadid, event_metadata__priorityclass, event_labels__computername, src_obj__event_metadata, team_id, dt, hr. The following columns contain relevant data:
src_obj__event_labels - JSON object with metadata fields
src_obj__user_data - Log text
src_obj__event_metadata - JSON object with metadata fields
team_id - Company ID
dt - date in the format: YYYY-MM-DD
hr - hour
AWS Athena
Athena is an AWS service that makes it easy to query log files using standard SQL syntax. Using Athena commands, you create a DB and a schema and then query the logs.
Open the Athena service in your AWS account.
Before running queries, you need to set up a query result location in Amazon S3. Click the setup link at the top of the query editor and enter your S3 bucket name.
Once the configuration is complete, run the following commands in the Athena query editor:
Database creation
This command creates a database:
Example: CREATE DATABASE reindex_queries;
More information about the CREATE DATABASE command can be found in the AWS Athena documentation.
Table creation
This command creates a table called <table_name> within the database <db_name>. The table's entries are mapped to the Coralogix logs from the archive, so it is important not to change the order of the fields in the first section of the command. 'text' is the field that contains the log as it appears in the logs screen.
CSV:
CREATE EXTERNAL TABLE IF NOT EXISTS <some-db-name>.<some-table-name> (
`timestamp` string,
`severity` string,
`text` string,
`applicationname` string,
`subsystemname` string,
`category` string,
`classname` string,
`computername` string,
`methodname` string,
`threadid` string,
`ipaddress` string)
PARTITIONED BY (
`team_id` string,
`dt` string,
`hr` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://your-bucket-name/cx/csv/v2'
TBLPROPERTIES (
'classification'='csv',
'skip.header.line.count'='1'
)
Parquet:
CREATE EXTERNAL TABLE IF NOT EXISTS <some-db-name>.<some-table-name> (
`event_labels__methodname` string,
`event_labels__classname` string,
`event_labels__category` string,
`src_obj__event_labels` string,
`event_labels__applicationname` string,
`event_metadata__logid` string,
`src_obj__user_data` string,
`event_labels__subsystemname` string,
`event_metadata__timestamp` bigint,
`event_labels__ipaddress` string,
`event_metadata__severity` string,
`event_labels__threadid` string,
`event_metadata__priorityclass` string,
`event_labels__computername` string,
`src_obj__event_metadata` string
)
PARTITIONED BY (
`team_id` string,
`dt` string,
`hr` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://your-bucket-name/cx/parquet/v1'
Following a successful run of the ‘CREATE TABLE’ command, you will see the new table listed on the left side. Clicking on it shows its fields.
More information about the CREATE TABLE command can be found in the AWS Athena documentation.
Update partitions
When you need to query new data (from a time period you haven't queried before), you need to update the table with the new date/hour partitions.
To do this, click “Load partitions”.
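In the query editor, clicking “Load partitions” on a table typically runs the equivalent of the MSCK REPAIR TABLE command, so you can also load new partitions with a query. A minimal sketch, assuming the table created above and the database selected on the left:

-- Scan the table's S3 location and register any new team_id/dt/hr partitions
MSCK REPAIR TABLE <table_name>;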
SELECT query
Using the ‘SELECT’ statement, you can query the logs.
CSV examples:
Example 1:
SELECT
substr(timestamp, 1, 23) as timestamp,
applicationname,
subsystemname,
severity,
threadid,
try(json_extract(text, '$.kubernetes.host')) as kubernetes_host,
text
FROM
"<db_name>"."<table_name>"
WHERE
severity='Info' AND
applicationname='prod' AND
subsystemname!='abcde' AND
team_id='XXX' AND
dt = '2021-01-26' AND
hr >= '03' AND
hr < '04' AND
json_extract(text, '$.kubernetes.host') IS NOT NULL
LIMIT 100;
The above command will bring the first 100 logs that fit the query. It retrieves the first 23 characters of the timestamp field plus the other indicated fields and, in addition, extracts the kubernetes.host field from the text field (the log as it appears in Coralogix’s logs screen), returning its value in a result column called kubernetes_host. The query parameters are under the WHERE section; ‘IS NOT NULL’ makes sure that every returned log actually contains that field with a value.
Note that if the database is selected on the left side, you don't need to specify it in the query.
Example 2:
In this query, we are searching for the free text "throttler" in the log:
SELECT
substr(timestamp, 1, 23) as timestamp,
applicationname,
subsystemname,
severity,
text
FROM
"coralogix_table"
WHERE
applicationname='ip_log' AND
team_id='12345' AND
dt = '2021-03-17' AND
hr >= '21' AND
hr < '23' AND
text LIKE '%throttler%'
LIMIT 100
The query above returns the first 100 records with a timestamp between 21:00 and 23:00 on 2021-03-17, application name ip_log, any subsystem name, any severity, and text containing the word "throttler". The team_id 12345 is used only as a filter and is not displayed in the results.
Example 3:
SELECT
try(json_extract(text, '$.clientIp')) as clientIp,
try(json_extract(text, '$.userId')) as userId,
COUNT(*) as count
FROM
"<db_name>"."<table_name>"
WHERE
team_id='xxxxx' AND
dt = '2021-03-22' AND
hr >= '14' AND
hr < '17' AND
json_extract(text, '$.clientIp') IS NOT NULL AND
json_extract(text, '$.userId') IS NOT NULL AND
text LIKE '%503 service unavailable%' AND
date_parse(substr(timestamp, 1, 19),'%Y-%m-%dT%H:%i:%s') BETWEEN TIMESTAMP '2021-03-22 14:30:00' AND TIMESTAMP '2021-03-22 16:10:00'
GROUP BY text
ORDER BY count DESC;
The query above returns the clientIp and userId fields and a count of their combinations. This is done by COUNT(*) as count in the SELECT section and by GROUP BY text. The results are ordered by the count field; ASC and DESC determine whether results are sorted in ascending or descending order. The hr field only allows filtering in 1-hour intervals, which was not precise enough here, so the timestamp field was also added to the WHERE section. Athena's date_parse uses MySQL-style format strings, so substr takes the first 19 characters of the timestamp and date_parse converts them, using the format %Y-%m-%dT%H:%i:%s, into a timestamp that can be compared with the TIMESTAMP literals.
Example 4:
SELECT
try(json_extract(text,'$.message')) AS message,
COUNT(*) as count
FROM
"<db_name>"."<table_name>"
WHERE
applicationname='prod'
AND team_id='xxxxx'
AND dt >= '2021-04-01'
AND dt <= '2021-05-28'
AND hr >= '00'
AND hr < '24'
AND CAST(json_extract(text,'$.message') AS varchar) LIKE '%reset%'
AND CAST(json_extract(text,'$.message') AS varchar) LIKE '%machine%'
GROUP BY text
ORDER BY count DESC;
This query is similar to the one above but shows additional ways of filtering: it limits the results by a date range and by text ("reset" and "machine") that must appear within the message field. It returns the message field and a count of how many times each distinct message appears.
PARQUET examples:
Example 1:
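The query for this example is not shown here; a minimal sketch that matches the description below, assuming the Parquet table created earlier, would be:

-- Preview the archive and list the available columns
SELECT * FROM "<db_name>"."<table_name>" LIMIT 10;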
The above command will bring the first 10 logs. You can use this query to test whether you can reach the archive and to see the fields available for more detailed queries.
Example 2:
SELECT
try(json_extract(src_obj__event_labels, '$.applicationname')) as applicationname,
try(json_extract(src_obj__event_labels, '$.subsystemname')) as subsystemname,
src_obj__user_data as log,
try(json_extract(src_obj__event_metadata, '$.timestamp')) as time
FROM "<db_name>"."<table_name>"
WHERE
(JSON_EXTRACT_SCALAR(src_obj__event_labels, '$.applicationname'))='prod'
AND (JSON_EXTRACT_SCALAR(src_obj__event_labels, '$.subsystemname'))='nginx'
AND dt >= '2022-11-01'
AND dt <= '2022-11-05'
AND (JSON_EXTRACT_SCALAR(src_obj__event_metadata, '$.severity'))='Debug'
AND (JSON_EXTRACT_SCALAR(src_obj__user_data, '$.message')) LIKE '%Connected%';
The query above returns four columns: applicationname, subsystemname, log, and time. It selects debug logs whose message field contains the word "Connected," where the application name is prod and the subsystem name is nginx, created between 2022-11-01 and 2022-11-05.
Query with Save
This command, also known as CTAS (Create Table As Select), is a variation on the ‘CREATE TABLE’ command mentioned above.
Query data and save the results in a specific location (this also creates a helper table that needs to be deleted afterwards; see the cleanup sketch after the explanation below):
CREATE TABLE reindex_queries.ctas
WITH (
format='TEXTFILE',
field_delimiter=',',
external_location='s3://coralogix-athena-results-test/reindex-testing/query_1/'
) AS
SELECT
substr(timestamp, 1, 23) as timestamp,
try(json_extract(json_parse(text), '$.log.kind')) AS kind,
applicationname,
subsystemname,
severity,
threadid,
text
FROM
"coralogix_example"."coralogix_reindex_testing_1"
WHERE
applicationname='auditing' AND
team_id='XXX' AND
dt = '2021-01-26' AND
hr >= '03' AND
hr < '04' AND
CAST(json_extract(text, '$.log.kind') AS varchar)='Event' AND
text like '%Started%'
LIMIT 1000;
The first part of the command indicates that a new table called ‘ctas’ will be created in the database ‘reindex_queries’. The table data will be stored as text files with ‘,’ as the delimiter, in the S3 location given by external_location.
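Since the CTAS command leaves a helper table behind, you may want to remove it once the results have been written to S3. A minimal sketch, using the table name from the example above (in Athena, dropping the table only removes its metadata; the exported files stay in the S3 location):

-- Remove the helper table created by the CTAS command
DROP TABLE IF EXISTS reindex_queries.ctas;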
Download the CSV file with results
Athena puts the query results in the S3 bucket location defined in its configuration, but you can also download the CSV file directly from the query editor.
Having the query results in CSV format in an S3 bucket allows you to either send them to Coralogix or access them with another tool (Excel, for example).
Filtering and aggregating on stringified JSON data using AWS Athena
This SQL query, applied directly to the S3 archive bucket using AWS Athena, lets you filter by fields that are part of a stringified JSON field in the logs and perform aggregations (group-by) on top of them (see the aggregation sketch after the field list below).
SELECT try(json_extract(src_obj__user_data, '$.timestamp')) as timestamp,
try(json_extract(src_obj__user_data, '$.code')) as code,
JSON_EXTRACT_SCALAR(JSON_EXTRACT_SCALAR(src_obj__user_data, '$.request_str'), '$.method') as method,
JSON_EXTRACT_SCALAR(JSON_EXTRACT_SCALAR(src_obj__user_data, '$.request_str'), '$.merchant_id') as merchant_id,
JSON_EXTRACT_SCALAR(JSON_EXTRACT_SCALAR(src_obj__user_data, '$.request_str'), '$.client_ip') as client_ip,
JSON_EXTRACT_SCALAR(JSON_EXTRACT_SCALAR(src_obj__user_data, '$.request_str'), '$.uri') as uri
FROM "<db_name>"."<db_table>"
WHERE dt >= '<start_date>' AND
try(JSON_EXTRACT_SCALAR(src_obj__user_data, '$.code')) ='ADMIN_LOGGED_IN_AS_MERCHANT' AND
JSON_EXTRACT_SCALAR(JSON_EXTRACT_SCALAR(src_obj__user_data, '$.request_str'), '$.method') = 'POST'
In this query:
request_str is a stringified JSON field in the log.
method, merchant_id, client_ip, and uri are fields within request_str.
The nested use of JSON_EXTRACT_SCALAR (i.e., applying it more than once) gives access to fields inside the stringified JSON.
Filters are applied to the extracted JSON fields to narrow down the results.
How to send CSV data to Coralogix
1. Click the following link: https://serverlessrepo.aws.amazon.com/applications/eu-central-1/597078901540/Coralogix-Athena-S3
2. Click the Deploy button.
3. Fill in the parameter table:
S3BucketName - The name of the S3 bucket storing the CSV files.
ApplicationName - A mandatory metadata field that is sent with each log and helps to classify it.
CoralogixRegion - Select the region associated with your Coralogix domain. This is a Coralogix parameter and does not relate to your AWS region.
PrivateKey - Your Send-Your-Data API key
SubsystemName - A mandatory metadata field that is sent with each log and helps to classify it.
S3KeyPrefix and S3KeySuffix should be adjusted based on your configuration. S3KeyPrefix is a folder in your bucket (for example, the default Athena directory Unsaved/) and S3KeySuffix is the extension of the files to read (.csv by default).
4. Put a tick next to "I acknowledge that this app creates custom IAM roles and resource policies." and click the Deploy button. That's all!
Once a new file is uploaded to the bucket, it will be read by the application and the data will be sent to Coralogix.
Athena pricing
There is a cost associated with Athena usage, on the order of $5 per TB of data scanned; a query that scans roughly 200 GB of archive data would therefore cost in the region of $1. You can see more about Athena pricing on the AWS Athena pricing page.
Support
Need help?
Our world-class customer success team is available 24/7 to walk you through your setup and answer any questions that may come up.
Feel free to reach out to us via our in-app chat or by emailing [email protected].