
Boto3 grabbing only selected objects from the S3 resource

Introduction

When working with large S3 buckets, fetching every object and filtering locally is slow and expensive. In Boto3, the efficient approach is to narrow results server-side as much as possible, then apply lightweight client-side filtering for the remaining conditions. This keeps the code fast and easier to maintain.

Use Prefix Filtering First

The S3 API supports prefix filtering natively through its list operations. Always use it first when object keys share meaningful path prefixes.

python
import boto3

s3 = boto3.client('s3')
bucket = 'my-data-bucket'
prefix = 'reports/2026/03/'

# A single call returns only the first page (at most 1,000 keys under the prefix).
response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1000)
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])

Because S3 returns only keys that begin with the prefix, fewer list requests are needed and less metadata is transferred, which speeds up iteration.
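When keys are laid out by date, the prefix can be derived rather than hard-coded. The following is a minimal sketch that assumes a `base/YYYY/MM/` key layout like the one in the example above; the helper name `month_prefix` is hypothetical and not part of Boto3.

```python
from datetime import date


def month_prefix(base: str, day: date) -> str:
    """Return a key prefix such as 'reports/2026/03/' for the month containing `day`.

    Assumes keys are laid out as '<base>/<YYYY>/<MM>/...'.
    """
    return f"{base.strip('/')}/{day.year:04d}/{day.month:02d}/"


prefix = month_prefix("reports", date(2026, 3, 15))
print(prefix)  # reports/2026/03/
```

The resulting string can be passed as the `Prefix` argument of `list_objects_v2`.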

Handle Large Listings with Paginators

list_objects_v2 returns at most one thousand objects per request. For real buckets, always use paginators.

python
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

pages = paginator.paginate(Bucket='my-data-bucket', Prefix='logs/app1/')

for page in pages:
    for obj in page.get('Contents', []):
        if obj['Key'].endswith('.json'):
            print(obj['Key'])

This pattern is reliable and memory-friendly for large inventories.
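To see what the paginator handles for you, here is a rough sketch of the equivalent manual loop. It relies on the `IsTruncated`, `NextContinuationToken`, and `ContinuationToken` fields of `list_objects_v2`; the function name `iter_objects` is hypothetical, and `client` is assumed to be a Boto3 S3 client (or anything with a compatible `list_objects_v2` method).

```python
from typing import Any, Dict, Iterator


def iter_objects(client: Any, bucket: str, prefix: str = "") -> Iterator[Dict[str, Any]]:
    """Yield object metadata dicts page by page, following continuation tokens."""
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        response = client.list_objects_v2(**kwargs)
        # 'Contents' is absent when a page has no matching keys.
        yield from response.get("Contents", [])
        if not response.get("IsTruncated"):
            break
        # Ask for the next page, starting where this one ended.
        kwargs["ContinuationToken"] = response["NextContinuationToken"]
```

With a real client this would be called as `iter_objects(boto3.client('s3'), 'my-data-bucket', 'logs/app1/')`. In practice the built-in paginator is usually preferable; the loop is shown only to make the mechanism visible.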

Resource API and Client API Tradeoff

The Boto3 resource API can feel more object-oriented, while the client API gives explicit control and usually maps more closely to the AWS documentation.

Resource example:

python
import boto3

s3r = boto3.resource('s3')
bucket = s3r.Bucket('my-data-bucket')

# The collection fetches further pages lazily as the loop advances.
for obj in bucket.objects.filter(Prefix='images/raw/'):
    if obj.key.endswith('.png'):
        print(obj.key)

For simple iteration, the resource API is concise. For advanced options, the client API is often clearer.

Select by Time, Size, or Pattern

After prefix narrowing, apply additional filters in Python.

python
from datetime import datetime, timezone
import boto3

cutoff = datetime(2026, 3, 1, tzinfo=timezone.utc)
s3 = boto3.client('s3')

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-data-bucket', Prefix='events/'):
    for obj in page.get('Contents', []):
        if obj['LastModified'] >= cutoff and obj['Size'] > 0 and obj['Key'].endswith('.parquet'):
            print('selected:', obj['Key'])

This avoids unnecessary downloads when selection can be made from metadata.
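Pattern selection can also use glob syntax, as long as the matching happens client-side, because the list operations accept only a literal prefix. One possible sketch: take the literal part of the pattern as the `Prefix`, then match the full pattern in Python with the standard-library `fnmatch` module. The helper names `literal_prefix` and `filter_by_glob` are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, Iterable, Iterator

_WILDCARDS = "*?["


def literal_prefix(pattern: str) -> str:
    """Return the part of `pattern` before its first wildcard character."""
    for index, char in enumerate(pattern):
        if char in _WILDCARDS:
            return pattern[:index]
    return pattern


def filter_by_glob(objects: Iterable[Dict[str, Any]], pattern: str) -> Iterator[Dict[str, Any]]:
    """Yield listed objects whose key matches the glob, compared case-sensitively.

    Note: in fnmatch, '*' also matches '/', so nested keys can match too.
    """
    for obj in objects:
        if fnmatchcase(obj["Key"], pattern):
            yield obj


pattern = "logs/app1/*.json"
print(literal_prefix(pattern))  # logs/app1/
```

Here `literal_prefix(pattern)` would be sent to S3 as the `Prefix`, and `filter_by_glob` applied to each page's `Contents`.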

Download Only Selected Keys

Once keys are selected, download only those files.

python
import boto3
from pathlib import Path

s3 = boto3.client('s3')
selected_keys = ['reports/2026/03/summary.csv', 'reports/2026/03/errors.csv']
out_dir = Path('downloads')
out_dir.mkdir(exist_ok=True)

for key in selected_keys:
    # Only the file name is kept, so keys sharing a base name overwrite each other.
    target = out_dir / Path(key).name
    s3.download_file('my-data-bucket', key, str(target))
    print('downloaded', target)

Keep the selection stage separate from the transfer stage for cleaner logging and retry behavior.
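If selected keys from different prefixes can share a file name, one option is to mirror the key structure locally instead of flattening it. The sketch below assumes that layout is acceptable for your job; `local_path_for` and `download_all` are hypothetical helpers, and `client` is assumed to be a Boto3 S3 client, whose `download_file(Bucket, Key, Filename)` method is the same one used above.

```python
from pathlib import Path, PurePosixPath
from typing import Any, Iterable, List


def local_path_for(key: str, out_dir: Path) -> Path:
    """Map an S3 key to a path under `out_dir`, keeping the key's prefix structure."""
    parts = PurePosixPath(key).parts
    # Reject empty keys and keys that could escape out_dir.
    if not parts or ".." in parts or parts[0] == "/":
        raise ValueError(f"unsafe or empty key: {key!r}")
    return out_dir.joinpath(*parts)


def download_all(client: Any, bucket: str, keys: Iterable[str], out_dir: Path) -> List[Path]:
    """Download each key to its mirrored local path and return the paths written."""
    downloaded = []
    for key in keys:
        target = local_path_for(key, out_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        client.download_file(bucket, key, str(target))
        downloaded.append(target)
    return downloaded
```

With this mapping, `reports/2026/02/summary.csv` and `reports/2026/03/summary.csv` land in separate local directories rather than overwriting each other.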

Robust Selection Pipelines with Logging and Retries

In production jobs, object selection should be observable. Log why each object is included or excluded, then persist selected keys for reproducibility.

python
import boto3
from botocore.config import Config

cfg = Config(retries={'max_attempts': 8, 'mode': 'standard'})
s3 = boto3.client('s3', config=cfg)

selected = []
for page in s3.get_paginator('list_objects_v2').paginate(Bucket='my-data-bucket', Prefix='reports/'):
    for obj in page.get('Contents', []):
        key = obj['Key']
        if key.endswith('.csv') and obj['Size'] > 100:
            selected.append(key)

print('selected count:', len(selected))
for key in selected[:10]:
    print('sample:', key)

Persisting this list to a manifest file allows deterministic reruns and easier debugging when downstream processing fails.
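A manifest can be as simple as a JSON file. The following is a minimal sketch, not a standard format: the field names `bucket` and `keys` and the helper names `write_manifest` and `read_manifest` are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import List, Tuple


def write_manifest(path: Path, bucket: str, keys: List[str]) -> None:
    """Write the selected keys to a JSON manifest, sorted for stable diffs."""
    manifest = {"bucket": bucket, "keys": sorted(keys)}
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")


def read_manifest(path: Path) -> Tuple[str, List[str]]:
    """Load a manifest written by write_manifest and return (bucket, keys)."""
    manifest = json.loads(path.read_text(encoding="utf-8"))
    return manifest["bucket"], manifest["keys"]
```

A rerun can then call `read_manifest` and skip the listing stage entirely, processing exactly the keys chosen earlier.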

For high-volume pipelines, consider S3 Inventory for daily object manifests. Inventory can be more efficient than repeated full-prefix listing in large buckets.

For very high request volume, add request metrics and backoff telemetry so throttling patterns are visible in logs.
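As one lightweight form of such telemetry, Boto3 responses normally carry a `ResponseMetadata` dict with a `RetryAttempts` count. The sketch below wraps any iterable of pages and logs the pages that needed retries; the logger name `s3_selection` and the function name `log_retries` are hypothetical, and the code falls back to zero if the field is missing.

```python
import logging
from typing import Any, Dict, Iterable, Iterator

logger = logging.getLogger("s3_selection")


def log_retries(pages: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Pass pages through unchanged, logging a warning for each page that was retried."""
    for number, page in enumerate(pages, start=1):
        attempts = page.get("ResponseMetadata", {}).get("RetryAttempts", 0)
        if attempts:
            logger.warning("list page %d: %d retry attempt(s)", number, attempts)
        yield page
```

It can be placed around a paginator, for example `for page in log_retries(paginator.paginate(...)):`, without changing the rest of the loop.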

Common Pitfalls

A common pitfall is listing an entire bucket's contents and filtering in memory. This scales poorly and can trigger API throttling.

Another issue is forgetting pagination, which silently misses objects beyond the first page.

Developers also often assume that the S3 list APIs accept wildcard syntax. S3 supports prefix filtering, not arbitrary glob matching.

Finally, avoid broad IAM permissions. Restrict list and get permissions to required prefixes when possible.
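One way to express this scoping is an IAM policy that allows `s3:ListBucket` only for a given prefix (through the `s3:prefix` condition key) and `s3:GetObject` only on objects under it. The sketch below builds such a policy document as a Python dict; the function name `scoped_read_policy` and the `Sid` values are hypothetical, and the standard `aws` partition is assumed in the ARNs.

```python
import json
from typing import Any, Dict


def scoped_read_policy(bucket: str, prefix: str) -> Dict[str, Any]:
    """Build a read-only policy document limited to keys under `prefix`."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, so it is scoped by condition.
                "Sid": "ListOnlyUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Reading is an object-level action, so it is scoped by resource ARN.
                "Sid": "GetOnlyUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


print(json.dumps(scoped_read_policy("my-data-bucket", "reports/"), indent=2))
```

Review the generated document against your own account's requirements before attaching it to a role.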

Summary

  • Use prefix filtering first to reduce S3 listing scope.
  • Use paginators for complete and scalable iteration.
  • Apply additional metadata filters client-side only after narrowing.
  • Separate selection and download phases for cleaner workflows.
  • Keep IAM permissions scoped to required bucket paths.
