
Properly Configuring Kafka Connect S3 Sink TimeBasedPartitioner


Introduction

The S3 Sink connector can write Kafka records into S3 paths organized by time, which is useful for analytics, retention, and downstream query engines. A correct TimeBasedPartitioner setup depends on a few properties working together: the partitioner class, partition duration, path format, timezone, and timestamp extraction strategy.

The Core Properties You Need

The key configuration starts with partitioner.class and the time-based properties it unlocks.

json
{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "tasks.max": "1",
    "topics": "orders",
    "s3.bucket.name": "analytics-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1000",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "partition.duration.ms": "3600000",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
    "timezone": "UTC",
    "locale": "en-US",
    "timestamp.extractor": "Record"
  }
}

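This configuration creates hourly directories based on the Kafka record timestamp. A partition.duration.ms of 3600000 (60 × 60 × 1000 milliseconds) means each partition covers one hour, which matches the finest unit in path.format (HH).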

Understand How the Path Is Built

path.format controls the directory structure under the topic path. For the example above, records land in paths like:

text
topics/orders/year=2026/month=03/day=11/hour=14/

The format string is not free-form text. The connector interprets it as a date-time pattern, so pattern letters such as YYYY, MM, dd, and HH are replaced with timestamp components, and literal directory labels must be wrapped in single quotes as shown in the example. If the format does not match your intended query layout, the mismatch surfaces later in Athena, Spark, or any downstream job that expects consistent partition paths.
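The path construction can be approximated in a few lines of Python. This is a simplified model, not the connector's code: it assumes UTC, floors the timestamp to the start of its partition.duration.ms window, and hard-codes the year/month/day/hour layout from the example instead of parsing path.format. The name partition_path is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def partition_path(topic: str, timestamp_ms: int, duration_ms: int) -> str:
    """Approximate the S3 prefix for one record (simplified model, UTC only)."""
    # Floor the timestamp to the start of its partition window.
    window_start_ms = (timestamp_ms // duration_ms) * duration_ms
    start = EPOCH + timedelta(milliseconds=window_start_ms)
    # Hard-coded equivalent of 'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH.
    return f"topics/{topic}/" + start.strftime("year=%Y/month=%m/day=%d/hour=%H/")


# A record stamped 2026-03-11 14:25 UTC.
ts = int(datetime(2026, 3, 11, 14, 25, tzinfo=timezone.utc).timestamp() * 1000)
print(partition_path("orders", ts, 3_600_000))   # hourly windows
print(partition_path("orders", ts, 86_400_000))  # daily windows
```

With hourly windows the record lands under hour=14. With daily windows the same layout always yields hour=00, which is why the finest unit in path.format should not be smaller than the partition duration.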

Choose the Right Timestamp Source

One of the most important settings is timestamp.extractor. Common values are:

  • Wallclock
  • Record
  • RecordField

Wallclock uses the Connect worker's system time at the moment the record is processed. That is simple, but it means late-arriving records are partitioned by ingestion time rather than event time.

Record uses the Kafka record timestamp. That is often a better default when producer timestamps are meaningful.

RecordField uses a field inside the record value. When you choose that mode, you also need timestamp.field.

json
{
  "timestamp.extractor": "RecordField",
  "timestamp.field": "event_time"
}

This is useful when the payload contains the true business event time and Kafka's record timestamp is not what you want for partitioning.
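To make the difference between the three modes concrete, here is a small Python model of the choice. It is an illustration only, not the connector's implementation: the record is represented as a plain dictionary, the function name extract_timestamp_ms is hypothetical, and the event_time field is assumed to hold epoch milliseconds.

```python
import time
from typing import Any, Mapping, Optional


def extract_timestamp_ms(mode: str, record: Mapping[str, Any],
                         field: Optional[str] = None) -> int:
    """Return the epoch-millisecond timestamp used for partitioning (toy model)."""
    if mode == "Wallclock":
        # Processing time: whatever the clock says right now.
        return int(time.time() * 1000)
    if mode == "Record":
        # The timestamp carried by the Kafka record itself.
        return int(record["timestamp"])
    if mode == "RecordField":
        # A field inside the record value, named by timestamp.field.
        if field is None:
            raise ValueError("RecordField requires a field name (timestamp.field)")
        return int(record["value"][field])
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical record: produced 10 minutes after the business event happened.
record = {
    "timestamp": 1_700_000_600_000,
    "value": {"event_time": 1_700_000_000_000},
}

print(extract_timestamp_ms("Record", record))
print(extract_timestamp_ms("RecordField", record, field="event_time"))
```

The two calls return different instants for the same record, so the same record can end up in different time folders depending on the mode.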

Match Partition Duration to Query Patterns

Smaller partitions create more directories and more files. Larger partitions reduce directory count but can make queries scan more data than necessary.

Typical choices:

  • hourly partitions for high-volume event streams
  • daily partitions for lower-volume or reporting-oriented data

Example daily setup:

json
{
  "partition.duration.ms": "86400000",
  "path.format": "'year'=YYYY/'month'=MM/'day'=dd"
}

Do not pick a duration just because it "looks organized." Pick it based on how downstream systems will read the data and how many objects the connector will produce.
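A back-of-envelope estimate can help compare durations before deploying. The Python sketch below rests on strong simplifying assumptions that are not taken from the connector: records are spread evenly over the day and over Kafka partitions, every time window and Kafka partition combination ends up as at least one object, and files are otherwise cut only by flush.size. The function name and the sample numbers are hypothetical.

```python
import math

MS_PER_DAY = 86_400_000


def estimate_objects_per_day(records_per_day: int, kafka_partitions: int,
                             duration_ms: int, flush_size: int) -> int:
    """Rough count of S3 objects per day under the assumptions stated above."""
    windows_per_day = max(1, MS_PER_DAY // duration_ms)
    # Records falling into one (time window, Kafka partition) slot.
    records_per_slot = records_per_day / (windows_per_day * kafka_partitions)
    # Each slot yields at least one object, more if flush.size is exceeded.
    files_per_slot = max(1, math.ceil(records_per_slot / flush_size))
    return windows_per_day * kafka_partitions * files_per_slot


# Low-volume topic: 10,000 records/day, 4 Kafka partitions, flush.size 1000.
print(estimate_objects_per_day(10_000, 4, 3_600_000, 1000))   # hourly
print(estimate_objects_per_day(10_000, 4, 86_400_000, 1000))  # daily
```

Under these assumptions the hourly layout produces 96 small objects per day while the daily layout produces 12, which illustrates why low-volume topics often fit daily partitions better.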

Timezone and Late Data Matter

timezone controls how timestamps are translated into folder boundaries. If analytics teams expect UTC but the connector partitions in a local timezone, partition paths become confusing fast.

Late-arriving data is another important operational detail. The connector does not magically rewrite old partitions just because a record belongs to an earlier event time window. Your timestamp choice determines where late events go, and that affects downstream expectations.
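The effect of the timezone setting is easiest to see with one instant rendered two ways. The Python sketch below uses fixed UTC offsets to stay self-contained; that is a simplification, since the connector's timezone property takes zone names such as UTC or America/New_York and follows daylight saving rules. The helper name day_hour_folder is made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def day_hour_folder(instant: datetime, utc_offset_hours: int) -> str:
    """Render the folder for an instant in a fixed-offset zone (no DST handling)."""
    local = instant.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return local.strftime("year=%Y/month=%m/day=%d/hour=%H")


# One record timestamp, shortly after midnight UTC.
instant = datetime(2026, 3, 11, 2, 30, tzinfo=timezone.utc)

print(day_hour_folder(instant, 0))   # partitioned in UTC
print(day_hour_folder(instant, -5))  # partitioned at a fixed UTC-05:00 offset
```

The same record falls under day=11 in UTC and under day=10 at UTC-05:00, so a daily report built on the wrong assumption would attribute it to the wrong day.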

A Practical Validation Workflow

Before calling the configuration done:

  1. produce a few sample messages with known timestamps
  2. verify the connector writes to the expected S3 path
  3. confirm the timestamps used are the ones you intended
  4. inspect file volume and partition spread over a realistic interval

That catches misconfigurations earlier than waiting for the first broken Athena query.
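Step 2 of the workflow can be partly automated once the object keys have been listed with whatever S3 tooling is already in use. The Python sketch below only does the comparison: it takes the prefixes you expect from your known timestamps and reports those with no object underneath. The function name is hypothetical, and the sample keys are invented to resemble the topic, Kafka partition, and start offset naming the connector typically uses for its files.

```python
from typing import Iterable, List


def prefixes_without_objects(expected_prefixes: Iterable[str],
                             object_keys: Iterable[str]) -> List[str]:
    """Return the expected prefixes that contain no listed object key."""
    keys = list(object_keys)
    return [prefix for prefix in expected_prefixes
            if not any(key.startswith(prefix) for key in keys)]


# Prefixes derived from the sample messages' known timestamps.
expected = [
    "topics/orders/year=2026/month=03/day=11/hour=14/",
    "topics/orders/year=2026/month=03/day=11/hour=15/",
]

# Invented listing result; in practice this comes from your S3 listing tool.
listed_keys = [
    "topics/orders/year=2026/month=03/day=11/hour=14/orders+0+0000000000.json",
]

print(prefixes_without_objects(expected, listed_keys))
```

Here the hour=15 prefix is reported as missing, which would prompt a check of the timestamp source, the timezone, or whether the file has simply not been flushed yet.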

Common Pitfalls

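A common mistake is using the wrong property name for timestamp extraction. For the S3 Sink connector, the relevant property is timestamp.extractor, set to Wallclock, Record, or RecordField. A misspelled or invented property name does not select an extractor, so partitioning silently follows the default instead of the source you intended.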

Another issue is choosing Wallclock when the business requirement is event-time partitioning. That makes late data land in the wrong time folders for analytical purposes.

Developers also often choose partition durations that are too fine-grained, which creates too many small files in S3.

Finally, always set timezone intentionally. Leaving it mismatched with downstream expectations causes subtle reporting errors that are hard to spot from the connector side alone.

Summary

  • TimeBasedPartitioner needs partitioner.class, partition.duration.ms, path.format, and related time settings to work together.
  • timestamp.extractor determines whether partitioning follows wall-clock time, Kafka record time, or a field in the payload.
  • path.format should be designed for downstream storage and query patterns, not just aesthetics.
  • timezone must match how your organization interprets time-based partitions.
  • Test with known timestamps before trusting the connector in production.
