Skip to content

How to query VPC flow logs

The UIS DevOps division enables VPC Flow Logs for all VPCs across all accounts and supported AWS regions. These logs are stored centrally in an S3 bucket within the Log Archive account.

This guide explains how to query the VPC Flow Logs using AWS Athena.

Steps to execute a query

  1. Log into the AWS console for the Log Archive account.
    • Usually you would do this by assuming the OrganizationAccountAccessRole role from the management account.
  2. Navigate to the Athena service.
  3. Run a metadata refresh query to load new data:

    MSCK REPAIR TABLE vpc_flow_logs_parquet
    

    This command updates the Athena catalog metadata for the vpc_flow_logs_parquet table, registering any new partitions (i.e., new log files). It may take a while to complete depending on how long it's been since it was last run.

  4. Query the log data.

    With the metadata refreshed, you can now run queries on the full set of VPC logs. For example, the following query filters for logs from a specific AWS account, on a specific date, and only includes egress traffic. It limits the output to 100 records:

    SELECT *
    FROM vpc_flow_logs_parquet
    WHERE "aws-account-id"='111111111111'
      AND year=2025
      AND month=6
      AND day=18
      AND flow_direction='egress'
    ORDER BY day, hour ASC
    LIMIT 100
    

Recreating the vpc_flow_logs_parquet table

The vpc_flow_logs_parquet table was originally created using the SQL statement below via the Athena query console. While it's unlikely you'll ever need to recreate the table, this section documents the command for reference should the need arise.

CREATE EXTERNAL TABLE IF NOT EXISTS vpc_flow_logs_parquet (
  version int,
  account_id string,
  interface_id string,
  srcaddr string,
  dstaddr string,
  srcport int,
  dstport int,
  protocol bigint,
  packets bigint,
  bytes bigint,
  start bigint,
  `end` bigint,
  action string,
  log_status string,
  vpc_id string,
  subnet_id string,
  instance_id string,
  tcp_flags int,
  type string,
  pkt_srcaddr string,
  pkt_dstaddr string,
  region string,
  az_id string,
  sublocation_type string,
  sublocation_id string,
  pkt_src_aws_service string,
  pkt_dst_aws_service string,
  flow_direction string,
  traffic_path int
)
PARTITIONED BY (
  `aws-account-id` string,
  `aws-service` string,
  `aws-region` string,
  `year` int,
  `month` int,
  `day` int,
  `hour` int
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://ucam-devops-log-archive-ed69d78f/OrganizationVPCLogs/AWSLogs/'
TBLPROPERTIES (
  'EXTERNAL'='true',
  'skip.header.line.count'='1'
  )