AWS Athena Querying by an attributes of a struct with an array

AWS Athena

SQL

Structs

Arrays

Data Querying

AWS Athena Querying by an attributes of a struct with an array

Master System Design with Codemia

Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.

Start Practicing Learn More

Introduction

Athena can query nested data directly, but arrays of structs require a different style than flat SQL tables. If the value you want to filter on lives inside a struct that is itself inside an array, the two main tools are UNNEST and Athena's array functions.

Understanding the Shape of the Data

Suppose each row stores an order with an items array. Each item is a struct, and each item contains nested metadata.

json

1{
2  "order_id": "O-1001",
3  "items": [
4    {"sku": "A1", "qty": 2, "meta": {"color": "red", "size": "M"}},
5    {"sku": "B9", "qty": 1, "meta": {"color": "blue", "size": "L"}}
6  ]
7}

A matching Athena table might look like this:

sql

1CREATE EXTERNAL TABLE orders_raw (
2  order_id string,
3  items array<struct<
4    sku:string,
5    qty:int,
6    meta:struct<color:string,size:string>
7  >>
8)
9ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
10LOCATION 's3://example-bucket/orders/';

Once the schema is correct, querying nested attributes becomes much easier.

Use `UNNEST` for Element-Level Filtering

If you want to query each item inside the array as if it were a row, flatten the array with UNNEST.

sql

1SELECT o.order_id, i.sku, i.meta.color
2FROM orders_raw o
3CROSS JOIN UNNEST(o.items) AS t(i)
4WHERE i.meta.color = 'red';

This is the most direct approach when you need one result row per matching nested item. It is conceptually similar to exploding the array first and then filtering the exploded rows.

Use `EXISTS` When You Only Need the Parent Row

Sometimes you do not want every matching item. You only want the parent order if any nested item satisfies a condition. In that case, EXISTS is cleaner than returning exploded rows and deduplicating later.

sql

1SELECT o.order_id
2FROM orders_raw o
3WHERE EXISTS (
4  SELECT 1
5  FROM UNNEST(o.items) AS t(i)
6  WHERE i.meta.size = 'L'
7);

This pattern is useful when the query is logically about orders, not about individual items.

Array Functions for Nested Output

Athena also supports lambda-based array functions. These are helpful when you want to keep the nested shape instead of flattening everything.

For example, keep only the red items:

sql

1SELECT
2  order_id,
3  filter(items, x -> x.meta.color = 'red') AS red_items
4FROM orders_raw;

Or project only the SKUs from the array:

sql

1SELECT
2  order_id,
3  transform(items, x -> x.sku) AS skus
4FROM orders_raw;

These functions are often easier to read when the output should remain nested.

Null Handling Matters

Nested semi-structured data often contains missing fields. If meta or meta.color is null for some elements, a simple filter can silently skip those rows.

sql

1SELECT o.order_id
2FROM orders_raw o
3WHERE EXISTS (
4  SELECT 1
5  FROM UNNEST(o.items) AS t(i)
6  WHERE i.meta IS NOT NULL
7    AND i.meta.color = 'blue'
8);

That extra null check is often worth adding when the raw data is not perfectly clean.

Performance Considerations

Nested queries in Athena can become expensive if the source data is large and still stored as raw JSON. Once the schema is stable, converting the data to Parquet or ORC usually improves both cost and performance.

It also helps to select only the nested fields you actually need. Avoid broad SELECT * queries on wide nested documents if the filter depends on only one attribute.

When to Choose Which Pattern

Use UNNEST when:

you need one row per matching array element,
you want to aggregate by inner-item attributes,
the analysis is item-centric.

Use EXISTS when:

you only need to know whether any nested item matches,
you want to return one row per parent record,
you want to avoid duplicate parent rows.

Use filter or transform when:

you want to preserve nested output,
you want compact transformations without full row explosion.

Common Pitfalls

A common mistake is trying to reference an array of structs like a scalar value. Arrays need either UNNEST or an array function before you can query inside them naturally.

Another issue is using UNNEST when the real question is about the parent row. That can multiply rows and force unnecessary deduplication.

Teams also forget about null nested fields. In semi-structured datasets, missing attributes are normal, not exceptional.

Summary

Arrays of structs in Athena are usually queried with UNNEST or array functions.
Use UNNEST when you want one row per nested element.
Use EXISTS when you only need the parent row if any nested element matches.
Use filter and transform when you want to keep nested output.
Stable schemas and columnar storage make nested Athena queries much easier to maintain and cheaper to run.

AWS Athena Querying by an attributes of a struct with an array

Master System Design with Codemia

Introduction

Understanding the Shape of the Data

Use UNNEST for Element-Level Filtering

Use EXISTS When You Only Need the Parent Row

Array Functions for Nested Output

Null Handling Matters

Performance Considerations

When to Choose Which Pattern

Common Pitfalls

Summary

Use `UNNEST` for Element-Level Filtering

Use `EXISTS` When You Only Need the Parent Row