Load S3 Data into AWS SageMaker Notebook

Introduction

In a SageMaker notebook, the normal way to load data from Amazon S3 is to use the notebook's execution role for permission and then read objects with boto3, pandas, or the SageMaker SDK. The two things that usually matter most are IAM access to the bucket and choosing whether you want to download the file locally or stream it directly into a DataFrame.

Check Permissions First

SageMaker notebook instances and notebook environments rely on an execution role. That role must be allowed to access the bucket and objects you want to read, usually with permissions such as:

s3:ListBucket on the bucket itself
s3:GetObject on the objects inside it

If the role cannot read the bucket, your notebook code will fail no matter which Python library you use.
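As a rough illustration of what such a grant could look like, the sketch below builds a read-only policy document in Python. The helper name `build_read_policy`, the bucket name, and the `datasets/` prefix are assumptions made for the example, not something the execution role requires; the actual policy attached to your role may be broader or organized differently.

```python
import json


def build_read_policy(bucket, prefix=""):
    """Return an IAM policy document granting read-only access to one bucket.

    Hypothetical helper for illustration; attach the result to the
    execution role through whatever process your account uses.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                # GetObject is an object-level action, so it targets object ARNs.
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy = build_read_policy("my-data-bucket", prefix="datasets/")
print(json.dumps(policy, indent=2))
```

The two statements use different resources on purpose: listing applies to the bucket, reading applies to the objects under it.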

Read an Object With `boto3`

For direct control, boto3 is the most explicit option:

python

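import boto3

# The client picks up credentials from the notebook's execution role.
s3 = boto3.client("s3")
bucket = "my-data-bucket"
key = "datasets/customers.csv"

# Fetch the object and decode the whole payload as UTF-8 text.
response = s3.get_object(Bucket=bucket, Key=key)
body = response["Body"].read().decode("utf-8")

# Preview the first 200 characters.
print(body[:200])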

This is useful when you want the raw object contents or need to inspect metadata as well as the payload.
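To make the metadata point concrete, here is a minimal sketch that returns the payload together with a few fields from the same `get_object` response. The function name `read_text_object` and the keys of the returned dictionary are invented for this example; the function takes the client as an argument so it can be tried with any object that exposes `get_object`.

```python
def read_text_object(s3_client, bucket, key, encoding="utf-8"):
    """Return (text, info) for one S3 object.

    Hypothetical helper: `s3_client` is expected to behave like
    boto3.client("s3"). Reads the full object into memory.
    """
    response = s3_client.get_object(Bucket=bucket, Key=key)
    text = response["Body"].read().decode(encoding)
    info = {
        "size_bytes": response.get("ContentLength"),
        "content_type": response.get("ContentType"),
        "last_modified": response.get("LastModified"),
        # User-defined metadata set at upload time, if any.
        "user_metadata": response.get("Metadata", {}),
    }
    return text, info
```

In a notebook this could be called as `read_text_object(boto3.client("s3"), "my-data-bucket", "datasets/customers.csv")`.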

Load a CSV Into pandas

For tabular data, pandas is usually more convenient:

python

import io
import boto3
import pandas as pd

s3 = boto3.client("s3")
response = s3.get_object(Bucket="my-data-bucket", Key="datasets/customers.csv")

# Read the full object into memory and let pandas parse the bytes as CSV.
df = pd.read_csv(io.BytesIO(response["Body"].read()))
print(df.head())

This keeps the code simple and works well for files that fit comfortably in notebook memory.

Read Directly From an S3 URI

In many environments you can also pass an s3:// path straight to pandas. pandas hands such URLs to fsspec, so this only works when the s3fs package is installed in the notebook kernel:

python

import pandas as pd

# Requires the s3fs package; credentials still come from the execution role.
df = pd.read_csv("s3://my-data-bucket/datasets/customers.csv")
print(df.head())

This style is concise, but it still depends on the same underlying permissions and environment support. If it fails, dropping back to explicit boto3 code often makes debugging easier.
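When falling back to boto3, the URI has to be split into the bucket and key that `get_object` expects. A small sketch of that step follows; the helper name `split_s3_uri` is made up for this example.

```python
def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key').

    Hypothetical helper. Plain string handling is used so that keys
    containing characters such as '?' or '#' are left untouched.
    """
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in: {uri!r}")
    return bucket, key


bucket, key = split_s3_uri("s3://my-data-bucket/datasets/customers.csv")
print(bucket, key)
```

The resulting pair can be passed as `Bucket=bucket, Key=key` to the boto3 calls used elsewhere in this article.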

Use the SageMaker Session When Helpful

The SageMaker SDK can help you work with S3 paths and default buckets:

python

import sagemaker

# The session resolves the Region and the account's default SageMaker bucket.
session = sagemaker.Session()
default_bucket = session.default_bucket()

print(default_bucket)

That is especially useful when your notebook feeds training or processing jobs and you want the data to sit in the same Region, account, and bucket that those jobs use by default.
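One way to build on this is to derive dataset locations from the default bucket rather than hard-coding a bucket name. The sketch below assumes only that the session object has a `default_bucket()` method; the helper name `default_bucket_uri` and the `datasets/customers.csv` layout are illustrative.

```python
def default_bucket_uri(session, *parts):
    """Join path parts under the session's default bucket into an s3:// URI.

    Hypothetical helper: `session` is expected to behave like
    sagemaker.Session(), i.e. to provide default_bucket().
    """
    bucket = session.default_bucket()
    # Strip stray slashes so the parts join into a clean key.
    key = "/".join(part.strip("/") for part in parts if part)
    return f"s3://{bucket}/{key}"
```

With a real session this might be called as `default_bucket_uri(sagemaker.Session(), "datasets", "customers.csv")`, and the result passed to `pd.read_csv` or to a job's input configuration.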

Download Locally First for Large or Reused Files

If you will read the same file many times in one notebook session, downloading it once can be practical:

python

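import boto3

s3 = boto3.client("s3")

# Arguments are bucket, key, and local destination path.
# /home/ec2-user/SageMaker is the notebook instance's working directory.
s3.download_file(
    "my-data-bucket",
    "datasets/customers.csv",
    "/home/ec2-user/SageMaker/customers.csv"
)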

Then load it normally:

python

import pandas as pd

df = pd.read_csv("/home/ec2-user/SageMaker/customers.csv")

This reduces repeated S3 calls during iterative notebook work.
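If the download cell is re-run often, a small guard can skip the transfer when the file is already on disk. This is a sketch under simple assumptions: the helper name `ensure_local_copy` is invented, the local file name is taken from the last segment of the key, and no attempt is made to detect that the object in S3 has changed since the first download.

```python
import os


def ensure_local_copy(s3_client, bucket, key, local_dir):
    """Download bucket/key into local_dir unless a copy already exists.

    Hypothetical helper: `s3_client` is expected to behave like
    boto3.client("s3"). Returns the local file path.
    """
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, os.path.basename(key))
    if not os.path.exists(local_path):
        # Only hit S3 when there is no local copy yet.
        s3_client.download_file(bucket, key, local_path)
    return local_path
```

Delete the local file when you need a fresh copy; the guard only checks for existence.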

Region and Networking Details

Keep the S3 bucket and the SageMaker environment in the same AWS Region where you can: reads are faster and you avoid cross-Region data transfer charges. If your notebook runs inside a VPC without general internet access, make sure the environment still has a route to S3, such as an S3 VPC endpoint where appropriate.

Those issues often look like code bugs at first, but they are really infrastructure configuration problems.
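A quick way to rule out a Region mismatch is to compare the bucket's Region with the notebook's. The sketch below uses the S3 `get_bucket_location` call, which the role must be allowed to make (`s3:GetBucketLocation`, an extra permission beyond the two listed earlier). The helper names are made up for this example.

```python
def bucket_region(s3_client, bucket):
    """Return the Region name a bucket lives in.

    Hypothetical helper: `s3_client` is expected to behave like
    boto3.client("s3").
    """
    response = s3_client.get_bucket_location(Bucket=bucket)
    # S3 reports buckets in us-east-1 with an empty LocationConstraint.
    return response.get("LocationConstraint") or "us-east-1"


def same_region(s3_client, bucket, notebook_region):
    """True when the bucket and the notebook are in the same Region."""
    return bucket_region(s3_client, bucket) == notebook_region
```

In a notebook, `boto3.session.Session().region_name` is one way to obtain the value for `notebook_region`.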

Common Pitfalls

The biggest mistake is assuming the notebook automatically has access to every S3 bucket in the account. Access is controlled by the execution role and bucket policy, not by the fact that the notebook is running in AWS.
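When a read fails, the error code that boto3 reports usually narrows down which of these layers is responsible. The sketch below maps a few common codes to hints. The helper name `explain_s3_error` and the hint wording are assumptions for illustration, and the list of codes is not exhaustive.

```python
# Hints keyed by the error code found in a botocore ClientError response.
# head_object reports bare HTTP status codes ("403", "404") instead of names.
_HINTS = {
    "AccessDenied": "Check that the execution role and bucket policy allow s3:GetObject.",
    "403": "Check that the execution role and bucket policy allow s3:GetObject.",
    "NoSuchKey": "The key does not exist; check the path and prefix.",
    "404": "The object does not exist; check the bucket name and key.",
    "NoSuchBucket": "The bucket name is wrong or the bucket was deleted.",
}


def explain_s3_error(error_response):
    """Return a short hint for a ClientError response dictionary.

    Hypothetical helper: pass `exc.response` from a caught
    botocore.exceptions.ClientError.
    """
    code = error_response.get("Error", {}).get("Code", "")
    return _HINTS.get(code, f"Unrecognized error code: {code!r}")
```

A typical use is inside `except botocore.exceptions.ClientError as exc:` with `print(explain_s3_error(exc.response))`. Note that a missing key can surface as an access-denied error when the role lacks s3:ListBucket, so the two cases are worth checking together.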

Another issue is loading a very large object fully into memory in pandas when a streaming or chunked approach would be safer. Developers also sometimes forget that a bucket in a different Region from the notebook means slower reads and cross-Region data transfer charges.
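As one possible streaming approach, the sketch below walks a CSV object row by row using the `iter_lines()` method of the response body and the standard `csv` module, so only one line is decoded at a time. The function name `iter_csv_rows` is made up, and the sketch assumes a header row and no line breaks inside quoted fields. With pandas, passing `chunksize` to `pd.read_csv` is the comparable option.

```python
import csv


def iter_csv_rows(body, encoding="utf-8"):
    """Yield one dict per CSV row from a streaming S3 response body.

    Hypothetical helper: `body` is expected to behave like
    response["Body"] from get_object, i.e. to provide iter_lines().
    Assumes the first line is a header and fields contain no newlines.
    """
    lines = (raw.decode(encoding) for raw in body.iter_lines())
    yield from csv.DictReader(lines)
```

For example, `sum(1 for _ in iter_csv_rows(response["Body"]))` counts rows without holding the whole file in memory.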

Summary

SageMaker notebooks usually access S3 through the notebook execution role.
Use boto3 for explicit object access and pandas for convenient table loading.
Direct s3:// reads are concise but still depend on the same IAM permissions.
Download locally when repeated notebook reads make that more practical.
If access fails, check IAM, bucket policy, region, and network path before blaming the code.
