Load S3 Data into AWS SageMaker Notebook
Introduction
In a SageMaker notebook, the normal way to load data from Amazon S3 is to use the notebook's execution role for permission and then read objects with boto3, pandas, or the SageMaker SDK. The two things that usually matter most are IAM access to the bucket and choosing whether you want to download the file locally or stream it directly into a DataFrame.
Check Permissions First
SageMaker notebook instances and notebook environments rely on an execution role. That role must be allowed to access the bucket and objects you want to read, usually with permissions such as:
- s3:ListBucket
- s3:GetObject
If the role cannot read the bucket, your notebook code will fail no matter which Python library you use.
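As a rough illustration of the two permissions above, the following sketch builds a read-only policy document as a Python dictionary. The helper name build_read_only_policy and the bucket name are hypothetical, and a real role may need more (for example KMS permissions for encrypted objects), so treat this as a starting point rather than a complete policy.

```python
import json


def build_read_only_policy(bucket_name):
    """Return an IAM policy document that allows listing and reading one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:ListBucket is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # s3:GetObject is granted on the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


# Example with a placeholder bucket name:
# print(json.dumps(build_read_only_policy("my-example-bucket"), indent=2))
```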
Read an Object With boto3
For direct control, boto3 is the most explicit option:
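A minimal sketch of that approach, assuming the execution role can read the object. The function name read_s3_object and the bucket and key in the example are placeholders; get_object and the response fields used here are standard boto3.

```python
def read_s3_object(bucket, key, s3_client=None):
    """Return (payload_bytes, metadata) for a single S3 object."""
    if s3_client is None:
        # Imported here so a stub client can be passed in for local testing.
        import boto3

        # In a SageMaker notebook the client picks up the execution role.
        s3_client = boto3.client("s3")

    response = s3_client.get_object(Bucket=bucket, Key=key)
    payload = response["Body"].read()  # Body is a stream; read() returns bytes
    metadata = {
        "content_type": response.get("ContentType"),
        "content_length": response.get("ContentLength"),
        "last_modified": response.get("LastModified"),
    }
    return payload, metadata


# Example with placeholder names:
# payload, meta = read_s3_object("my-example-bucket", "data/train.csv")
# print(meta["content_length"], payload[:100])
```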
This is useful when you want the raw object contents or need to inspect metadata as well as the payload.
Load a CSV Into pandas
For tabular data, pandas is usually more convenient:
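One way to do this, sketched under the assumption that the object is a CSV small enough to hold in memory: fetch the bytes with boto3 and hand them to pandas through an in-memory buffer. The name load_csv_from_s3 and the example bucket and key are hypothetical.

```python
import io

import pandas as pd


def load_csv_from_s3(bucket, key, s3_client=None, **read_csv_kwargs):
    """Fetch a CSV object from S3 and parse it into a DataFrame."""
    if s3_client is None:
        # Imported here so a stub client can be passed in for local testing.
        import boto3

        s3_client = boto3.client("s3")

    response = s3_client.get_object(Bucket=bucket, Key=key)
    # Read the whole object into memory, then let pandas parse the buffer.
    buffer = io.BytesIO(response["Body"].read())
    return pd.read_csv(buffer, **read_csv_kwargs)


# Example with placeholder names:
# df = load_csv_from_s3("my-example-bucket", "data/train.csv")
# df.head()
```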
This keeps the code simple and works well for files that fit comfortably in notebook memory.
Read Directly From an S3 URI
In many environments you can also read directly from an s3:// path, often through pandas plus the appropriate filesystem support:
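The sketch below shows the idea. It assumes the s3fs package is installed, since pandas relies on fsspec and s3fs to resolve s3:// paths. The helper names are hypothetical; split_s3_uri is included only because the bucket and key are what you need if you later fall back to boto3.

```python
import pandas as pd


def split_s3_uri(uri):
    """Split 's3://bucket/path/to/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name in: {uri!r}")
    return bucket, key


def read_csv_from_uri(uri, **read_csv_kwargs):
    """Read a CSV from a path or URI that pandas can open."""
    # For s3:// URIs pandas delegates to fsspec/s3fs, which must be installed.
    return pd.read_csv(uri, **read_csv_kwargs)


# Example with a placeholder URI:
# df = read_csv_from_uri("s3://my-example-bucket/data/train.csv")
# bucket, key = split_s3_uri("s3://my-example-bucket/data/train.csv")
```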
This style is concise, but it still depends on the same underlying permissions and environment support. If it fails, dropping back to explicit boto3 code often makes debugging easier.
Use the SageMaker Session When Helpful
The SageMaker SDK can help you work with S3 paths and default buckets:
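A small sketch, assuming the classic sagemaker Python SDK in which sagemaker.Session, default_bucket, and upload_data are available. The helper default_bucket_uri and the prefix are hypothetical names used for illustration.

```python
def default_bucket_uri(prefix, session=None):
    """Build an s3:// URI under the SageMaker session's default bucket."""
    if session is None:
        # Imported here so the helper can also be exercised with a stub session.
        import sagemaker

        session = sagemaker.Session()

    # default_bucket() returns the bucket name SageMaker uses for this
    # account and Region (it may create the bucket if it does not exist yet).
    bucket = session.default_bucket()
    return f"s3://{bucket}/{prefix.strip('/')}"


# Example with placeholder names:
# import sagemaker
# session = sagemaker.Session()
# train_uri = default_bucket_uri("demo/train", session)
# # upload_data returns the S3 URI of the uploaded file.
# uploaded_uri = session.upload_data("train.csv", key_prefix="demo/train")
```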
That is especially useful when your notebook participates in training or processing jobs and you want to keep data in the same region and account flow SageMaker already expects.
Download Locally First for Large or Reused Files
If you will read the same file many times in one notebook session, downloading it once can be practical:
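A sketch of a download-once helper, assuming a writable local directory on the notebook. The name download_once and the paths are hypothetical; download_file is the standard boto3 client call. The existence check is deliberately simple and will not notice that the object changed in S3.

```python
import os


def download_once(bucket, key, local_path, s3_client=None):
    """Download an S3 object to local_path unless the file already exists."""
    if os.path.exists(local_path):
        return local_path  # reuse the copy from an earlier cell run

    if s3_client is None:
        # Imported here so a stub client can be passed in for local testing.
        import boto3

        s3_client = boto3.client("s3")

    parent = os.path.dirname(local_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    s3_client.download_file(Bucket=bucket, Key=key, Filename=local_path)
    return local_path


# Example with placeholder names:
# path = download_once("my-example-bucket", "data/train.csv", "data/train.csv")
```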
Then load it normally:
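For example, assuming the file was saved as a CSV at a local path of your choosing (data/train.csv below is a placeholder), a thin wrapper around pandas is enough:

```python
import pandas as pd


def load_local_csv(local_path, **read_csv_kwargs):
    """Read a CSV that has already been downloaded to the notebook's disk."""
    return pd.read_csv(local_path, **read_csv_kwargs)


# Example with a placeholder path:
# df = load_local_csv("data/train.csv")
# preview = load_local_csv("data/train.csv", nrows=1000)  # quick look only
```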
This reduces repeated S3 calls during iterative notebook work.
Region and Networking Details
Keeping the S3 bucket and the SageMaker environment in the same AWS Region reduces latency and simplifies operations. If your notebook runs inside a VPC without general internet access, make sure it still has a route to S3, such as an S3 VPC endpoint where appropriate.
Those issues often look like code bugs at first, but they are really infrastructure configuration problems.
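To rule out a Region mismatch quickly, a check along these lines can help. It is a sketch that assumes the role is also allowed to call s3:GetBucketLocation; the function names and the bucket name are hypothetical.

```python
def bucket_region(bucket, s3_client=None):
    """Return the Region name that hosts the given bucket."""
    if s3_client is None:
        # Imported here so a stub client can be passed in for local testing.
        import boto3

        s3_client = boto3.client("s3")

    location = s3_client.get_bucket_location(Bucket=bucket)["LocationConstraint"]
    # S3 reports None for us-east-1 and the legacy value "EU" for eu-west-1.
    if location is None:
        return "us-east-1"
    if location == "EU":
        return "eu-west-1"
    return location


def same_region(bucket, notebook_region, s3_client=None):
    """True when the bucket lives in the Region the notebook is using."""
    return bucket_region(bucket, s3_client) == notebook_region


# Example with a placeholder bucket name:
# import boto3
# notebook_region = boto3.session.Session().region_name
# print(same_region("my-example-bucket", notebook_region))
```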
Common Pitfalls
The biggest mistake is assuming the notebook automatically has access to every S3 bucket in the account. Access is controlled by the execution role and bucket policy, not by the fact that the notebook is running in AWS.
Another issue is loading a very large object fully into memory with pandas when a streaming or chunked approach would be safer. Developers also sometimes forget that the bucket and the notebook should usually be in the same Region; cross-Region reads add latency and can incur data transfer charges.
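As an illustration of the chunked approach, the sketch below aggregates a CSV piece by piece using the chunksize option of pandas read_csv. The function name summarize_in_chunks and the column name in the example are hypothetical; the source can be a local path, a file-like object, or an s3:// URI if s3fs is installed.

```python
import pandas as pd


def summarize_in_chunks(source, column, chunksize=100_000):
    """Return (row_count, column_sum) without loading the whole CSV at once."""
    row_count = 0
    total = 0.0
    # With chunksize set, read_csv yields DataFrames of at most chunksize rows.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        row_count += len(chunk)
        total += float(chunk[column].sum())
    return row_count, total


# Example with placeholder names:
# rows, amount = summarize_in_chunks("data/large.csv", "amount")
```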
Summary
- SageMaker notebooks usually access S3 through the notebook execution role.
- Use boto3 for explicit object access and pandas for convenient table loading.
- Direct s3:// reads are concise but still depend on the same IAM permissions.
- Download locally when repeated notebook reads make that more practical.
- If access fails, check IAM, bucket policy, region, and network path before blaming the code.

