AWS Athena
Boto3
Dataframe
get_query_results
Python

How to Create Dataframe from AWS Athena using Boto3 get_query_results method

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Athena lets you query files in S3 with SQL, and boto3 gives you the low-level API to run those queries from Python. If you want a pandas DataFrame, the missing step is converting Athena's tabular response format into rows and column names.

The important detail is that get_query_results returns metadata and rows separately, and the first row often contains column headers. You also need to wait for query completion and page through results when the output is larger than a single response.

Run the Athena Query and Wait for Completion

The normal flow is:

  1. call start_query_execution
  2. poll get_query_execution
  3. once the query succeeds, call get_query_results
  4. transform the result into a DataFrame

Here is a working example:

python
1import time
2import boto3
3import pandas as pd
4
5athena = boto3.client("athena", region_name="us-east-1")
6
7query = """
8SELECT order_id, customer_id, total_amount
9FROM analytics.orders
10LIMIT 5
11"""
12
13response = athena.start_query_execution(
14    QueryString=query,
15    QueryExecutionContext={"Database": "analytics"},
16    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/query-results/"}
17)
18
19query_execution_id = response["QueryExecutionId"]
20
21while True:
22    status_response = athena.get_query_execution(QueryExecutionId=query_execution_id)
23    state = status_response["QueryExecution"]["Status"]["State"]
24
25    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
26        break
27
28    time.sleep(1)
29
30if state != "SUCCEEDED":
31    reason = status_response["QueryExecution"]["Status"].get("StateChangeReason", "Unknown error")
32    raise RuntimeError(f"Athena query failed: {reason}")

You must wait for SUCCEEDED. Calling get_query_results too early will either fail or return incomplete state.

Convert get_query_results Output to Rows

Athena returns rows as a list of Data cells. The simplest approach is to read the column names from ResultSetMetadata, then extract each row into a Python list and build a DataFrame.

python
1def athena_results_to_dataframe(client, query_execution_id):
2    paginator = client.get_paginator("get_query_results")
3    pages = paginator.paginate(QueryExecutionId=query_execution_id)
4
5    column_info = None
6    rows = []
7    first_data_row_seen = False
8
9    for page in pages:
10        result_set = page["ResultSet"]
11
12        if column_info is None:
13            column_info = result_set["ResultSetMetadata"]["ColumnInfo"]
14            columns = [col["Name"] for col in column_info]
15
16        for row in result_set["Rows"]:
17            values = [cell.get("VarCharValue") for cell in row["Data"]]
18
19            # Athena includes a header row as the first row of data.
20            if not first_data_row_seen:
21                first_data_row_seen = True
22                if values == columns:
23                    continue
24
25            rows.append(values)
26
27    return pd.DataFrame(rows, columns=columns)
28
29
30df = athena_results_to_dataframe(athena, query_execution_id)
31print(df)

That produces a normal pandas table with string values. Athena returns text for most fields through this API, so you may want to convert types afterward.

Convert Data Types Explicitly

If a column should be numeric or datetime, cast it in pandas after loading:

python
df["total_amount"] = pd.to_numeric(df["total_amount"], errors="coerce")

You can also inspect ColumnInfo if you want to map Athena data types to pandas types more systematically. That is helpful for production ingestion jobs where type correctness matters.

For example:

python
for column in athena.get_query_results(QueryExecutionId=query_execution_id)["ResultSet"]["ResultSetMetadata"]["ColumnInfo"]:
    print(column["Name"], column["Type"])

This is useful when your query returns decimals, timestamps, or nullable fields that need careful handling.

When to Use This Approach

Using get_query_results is appropriate when you want direct API control, pagination, explicit polling, or custom error handling. If all you need is a DataFrame, higher-level tools such as awswrangler are more convenient, but boto3 keeps dependencies minimal and exposes the raw Athena model.

It is also helpful when you need to integrate query execution into an application that already uses boto3 for IAM, S3, or Glue operations.

Common Pitfalls

  • Forgetting to wait for query completion before calling get_query_results.
  • Not configuring an S3 output location that Athena can write to.
  • Treating the first returned row as data when it is actually the header row.
  • Assuming numeric columns come back as numeric pandas types automatically. Most values arrive as strings.
  • Ignoring pagination and accidentally loading only the first page of a large result set.

Summary

  • Start the Athena query, poll until it reaches SUCCEEDED, then fetch the results.
  • Read column names from ResultSetMetadata and build row lists from Rows.
  • Skip the header row when Athena includes it in the result data.
  • Convert important columns to numeric or datetime types explicitly in pandas.
  • Use get_query_results when you want fine-grained control over the Athena workflow in Python.

Course illustration
Course illustration

All Rights Reserved.