How to query VPC flow logs¶
The UIS DevOps division enables VPC Flow Logs for all VPCs across all accounts and supported AWS regions. These logs are stored centrally in an S3 bucket within the Log Archive account.
This guide explains how to query the VPC Flow Logs using AWS Athena.
Steps to execute a query¶
- Log into the AWS console for the Log Archive account.
- Usually you would do this by assuming the
OrganizationAccountAccessRole
role from the management account.
- Usually you would do this by assuming the
- Navigate to the Athena service.
-
Run a metadata refresh query to load new data:
MSCK REPAIR TABLE vpc_flow_logs_parquet
This command updates the Athena catalog metadata for the
vpc_flow_logs_parquet
table, registering any new partitions (i.e., new log files). It may take a while to complete depending on how long it's been since it was last run. -
Query the log data.
With the metadata refreshed, you can now run queries on the full set of VPC logs. For example, the following query filters for logs from a specific AWS account, on a specific date, and only includes egress traffic. It limits the output to 100 records:
SELECT * FROM vpc_flow_logs_parquet WHERE "aws-account-id"='111111111111' AND year=2025 AND month=6 AND day=18 AND flow_direction='egress' ORDER BY day, hour ASC LIMIT 100
Recreating the vpc_flow_logs_parquet
table¶
The vpc_flow_logs_parquet
table was originally created using the SQL statement below via the
Athena query console. While it's unlikely you'll ever need to recreate the table, this section
documents the command for reference should the need arise.
CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs_parquet (
version int,
account_id string,
interface_id string,
srcaddr string,
dstaddr string,
srcport int,
dstport int,
protocol bigint,
packets bigint,
bytes bigint,
start bigint,
`end` bigint,
action string,
log_status string,
vpc_id string,
subnet_id string,
instance_id string,
tcp_flags int,
type string,
pkt_srcaddr string,
pkt_dstaddr string,
region string,
az_id string,
sublocation_type string,
sublocation_id string,
pkt_src_aws_service string,
pkt_dst_aws_service string,
flow_direction string,
traffic_path int
)
PARTITIONED BY (
`aws-account-id` string,
`aws-service` string,
`aws-region` string,
`year` int,
`month` int,
`day` int,
`hour` int
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://ucam-devops-log-archive-ed69d78f/OrganizationVPCLogs/AWSLogs/'
TBLPROPERTIES (
'EXTERNAL'='true',
'skip.header.line.count'='1'
)