AWS Athena
SQL
Structs
Arrays
Data Querying

AWS Athena Querying by an attributes of a struct with an array

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Introduction

Athena can query nested data directly, but arrays of structs require a different style than flat SQL tables. If the value you want to filter on lives inside a struct that is itself inside an array, the two main tools are UNNEST and Athena's array functions.

Understanding the Shape of the Data

Suppose each row stores an order with an items array. Each item is a struct, and each item contains nested metadata.

json
1{
2  "order_id": "O-1001",
3  "items": [
4    {"sku": "A1", "qty": 2, "meta": {"color": "red", "size": "M"}},
5    {"sku": "B9", "qty": 1, "meta": {"color": "blue", "size": "L"}}
6  ]
7}

A matching Athena table might look like this:

sql
1CREATE EXTERNAL TABLE orders_raw (
2  order_id string,
3  items array<struct<
4    sku:string,
5    qty:int,
6    meta:struct<color:string,size:string>
7  >>
8)
9ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
10LOCATION 's3://example-bucket/orders/';

Once the schema is correct, querying nested attributes becomes much easier.

Use UNNEST for Element-Level Filtering

If you want to query each item inside the array as if it were a row, flatten the array with UNNEST.

sql
1SELECT o.order_id, i.sku, i.meta.color
2FROM orders_raw o
3CROSS JOIN UNNEST(o.items) AS t(i)
4WHERE i.meta.color = 'red';

This is the most direct approach when you need one result row per matching nested item. It is conceptually similar to exploding the array first and then filtering the exploded rows.

Use EXISTS When You Only Need the Parent Row

Sometimes you do not want every matching item. You only want the parent order if any nested item satisfies a condition. In that case, EXISTS is cleaner than returning exploded rows and deduplicating later.

sql
1SELECT o.order_id
2FROM orders_raw o
3WHERE EXISTS (
4  SELECT 1
5  FROM UNNEST(o.items) AS t(i)
6  WHERE i.meta.size = 'L'
7);

This pattern is useful when the query is logically about orders, not about individual items.

Array Functions for Nested Output

Athena also supports lambda-based array functions. These are helpful when you want to keep the nested shape instead of flattening everything.

For example, keep only the red items:

sql
1SELECT
2  order_id,
3  filter(items, x -> x.meta.color = 'red') AS red_items
4FROM orders_raw;

Or project only the SKUs from the array:

sql
1SELECT
2  order_id,
3  transform(items, x -> x.sku) AS skus
4FROM orders_raw;

These functions are often easier to read when the output should remain nested.

Null Handling Matters

Nested semi-structured data often contains missing fields. If meta or meta.color is null for some elements, a simple filter can silently skip those rows.

sql
1SELECT o.order_id
2FROM orders_raw o
3WHERE EXISTS (
4  SELECT 1
5  FROM UNNEST(o.items) AS t(i)
6  WHERE i.meta IS NOT NULL
7    AND i.meta.color = 'blue'
8);

That extra null check is often worth adding when the raw data is not perfectly clean.

Performance Considerations

Nested queries in Athena can become expensive if the source data is large and still stored as raw JSON. Once the schema is stable, converting the data to Parquet or ORC usually improves both cost and performance.

It also helps to select only the nested fields you actually need. Avoid broad SELECT * queries on wide nested documents if the filter depends on only one attribute.

When to Choose Which Pattern

Use UNNEST when:

  • you need one row per matching array element,
  • you want to aggregate by inner-item attributes,
  • the analysis is item-centric.

Use EXISTS when:

  • you only need to know whether any nested item matches,
  • you want to return one row per parent record,
  • you want to avoid duplicate parent rows.

Use filter or transform when:

  • you want to preserve nested output,
  • you want compact transformations without full row explosion.

Common Pitfalls

A common mistake is trying to reference an array of structs like a scalar value. Arrays need either UNNEST or an array function before you can query inside them naturally.

Another issue is using UNNEST when the real question is about the parent row. That can multiply rows and force unnecessary deduplication.

Teams also forget about null nested fields. In semi-structured datasets, missing attributes are normal, not exceptional.

Summary

  • Arrays of structs in Athena are usually queried with UNNEST or array functions.
  • Use UNNEST when you want one row per nested element.
  • Use EXISTS when you only need the parent row if any nested element matches.
  • Use filter and transform when you want to keep nested output.
  • Stable schemas and columnar storage make nested Athena queries much easier to maintain and cheaper to run.

Course illustration
Course illustration

All Rights Reserved.