How to know the size of the string in bytes?

Introduction

A string's length in characters and its size in bytes are different things. In ASCII, each character is 1 byte. In UTF-8, a character takes 1 to 4 bytes depending on its code point. In UTF-16, it takes 2 or 4 bytes. The byte size therefore depends on the encoding, and each language measures it differently. Getting this right matters for database column sizing, network protocol buffers, file I/O, and API payload limits.
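To make the encoding dependence concrete, here is a minimal Python sketch. The helper names `encoded_size` and `fits_in_bytes` and the 10-byte limit are hypothetical, chosen only to illustrate how a byte budget (such as a column or payload limit) differs from a character count.

```python
def encoded_size(text: str, encoding: str = "utf-8") -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


def fits_in_bytes(text: str, limit: int, encoding: str = "utf-8") -> bool:
    """Hypothetical helper: True if the encoded text fits within `limit` bytes."""
    return encoded_size(text, encoding) <= limit


# "naïve café", written with escapes so the code points are unambiguous
sample = "na\u00efve caf\u00e9"

print(len(sample))                        # 10 characters
print(encoded_size(sample, "utf-8"))      # 12 bytes: the two accented letters take 2 bytes each
print(encoded_size(sample, "utf-16-le"))  # 20 bytes: 2 bytes per character, no BOM
print(fits_in_bytes(sample, 10))          # False: 12 bytes exceed a 10-byte limit
```

A check like this is best run on the encoded form that will actually be stored or sent, since the same text can pass a character limit and still fail a byte limit.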

Python

python
import sys

text = "Hello, World!"

# Length in characters
print(len(text))  # 13

# Size in bytes (UTF-8)
print(len(text.encode('utf-8')))   # 13 (ASCII chars = 1 byte each)

# With multi-byte characters
text = "Hello, 世界!"
print(len(text))                    # 10 characters
print(len(text.encode('utf-8')))   # 14 bytes (Chinese chars = 3 bytes each in UTF-8)
print(len(text.encode('utf-16')))  # 22 bytes (includes 2-byte BOM)

# sys.getsizeof includes Python object overhead
print(sys.getsizeof(text))  # larger than the encoded size; exact value varies by Python version and platform

Use len(s.encode('utf-8')) for the actual byte count of the string content.

JavaScript

javascript
// String length (UTF-16 code units)
const text = "Hello, 世界!";
console.log(text.length);  // 10

// Byte size in UTF-8
const encoder = new TextEncoder();
const bytes = encoder.encode(text);
console.log(bytes.length);  // 14

// Using Blob (browser)
const blob = new Blob([text]);
console.log(blob.size);  // 14

// Node.js
console.log(Buffer.byteLength(text, 'utf-8'));  // 14

// Emoji handling
const emoji = "Hello 👋";
console.log(emoji.length);                       // 8 (surrogate pair = 2 code units)
console.log(new TextEncoder().encode(emoji).length);  // 10 bytes in UTF-8

JavaScript's .length returns UTF-16 code units, not characters or bytes.
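The three different counts (code points, UTF-16 code units, UTF-8 bytes) can be compared side by side. The following is a sketch in Python rather than JavaScript; `utf16_code_units` is a hypothetical helper that reproduces what `.length` would report by encoding to UTF-16 without a BOM and dividing by two.

```python
def utf16_code_units(text: str) -> int:
    """Number of UTF-16 code units, i.e. the value JavaScript's .length reports."""
    # "utf-16-le" omits the BOM, and every code unit is exactly 2 bytes.
    return len(text.encode("utf-16-le")) // 2


greeting = "Hello \U0001F44B"  # "Hello " followed by the waving-hand emoji

print(len(greeting))                  # 7 code points
print(utf16_code_units(greeting))     # 8 code units (the emoji is a surrogate pair)
print(len(greeting.encode("utf-8")))  # 10 bytes in UTF-8
```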

Java

java
import java.nio.charset.StandardCharsets;

String text = "Hello, 世界!";

// Length in UTF-16 code units (equals the character count when no surrogate pairs are present)
System.out.println(text.length());  // 10

// Byte size in UTF-8 (StandardCharsets avoids the checked UnsupportedEncodingException)
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
System.out.println(utf8Bytes.length);  // 14

// Byte size in other encodings
byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16);
System.out.println(utf16Bytes.length);  // 22 (includes 2-byte BOM)

byte[] asciiBytes = text.getBytes(StandardCharsets.US_ASCII);
System.out.println(asciiBytes.length);  // 10 (non-ASCII replaced with '?')

C#

csharp
string text = "Hello, 世界!";

// Length in UTF-16 code units (equals the character count when no surrogate pairs are present)
Console.WriteLine(text.Length);  // 10

// Byte size in UTF-8
int utf8Size = System.Text.Encoding.UTF8.GetByteCount(text);
Console.WriteLine(utf8Size);  // 14

// Byte size in UTF-16 (C# internal representation, no BOM)
int utf16Size = System.Text.Encoding.Unicode.GetByteCount(text);
Console.WriteLine(utf16Size);  // 20

// Get the actual bytes
byte[] bytes = System.Text.Encoding.UTF8.GetBytes(text);
Console.WriteLine(bytes.Length);  // 14

Go

go
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    text := "Hello, 世界!"

    // Byte length (Go source and string literals are UTF-8)
    fmt.Println(len(text))  // 14

    // Character (rune) count
    fmt.Println(utf8.RuneCountInString(text))  // 10

    // len() on a Go string already gives bytes, not characters
    ascii := "Hello"
    fmt.Println(len(ascii))  // 5 bytes = 5 characters (all ASCII)
}

Go strings are read-only byte sequences, conventionally encoded in UTF-8. len(s) returns bytes, not characters.

C / C++

c
#include <stdio.h>
#include <string.h>

int main(void) {
    // ASCII string
    const char *text = "Hello, World!";
    printf("Bytes: %zu\n", strlen(text));  // 13 (excluding null terminator)
    printf("With null: %zu\n", strlen(text) + 1);  // 14

    // UTF-8 string (assumes the source file is saved as UTF-8)
    const char *utf8 = "Hello, 世界!";
    printf("Bytes: %zu\n", strlen(utf8));  // 14
    // strlen counts bytes, not characters in UTF-8
    return 0;
}

cpp
#include <iostream>
#include <string>

int main() {
    // Assumes the source file is saved as UTF-8
    std::string text = "Hello, 世界!";
    std::cout << text.size() << std::endl;   // 14 bytes
    std::cout << text.length() << std::endl; // 14 bytes (same as size())

    // For character count, use a UTF-8 library or count manually
}

Ruby

ruby
text = "Hello, 世界!"

puts text.length          # 10 (characters)
puts text.bytesize        # 14 (bytes in UTF-8)
puts text.encode('UTF-16').bytesize  # 22 (bytes in UTF-16, includes 2-byte BOM)

# Encoding info
puts text.encoding        # UTF-8

PHP

php
<?php
$text = "Hello, 世界!";

echo strlen($text), PHP_EOL;              // 14 (bytes)
echo mb_strlen($text, 'UTF-8'), PHP_EOL;  // 10 (characters, requires mbstring extension)

// strlen() in PHP returns bytes, not characters
// mb_strlen() counts actual characters

Encoding Size Reference

Character                   UTF-8      UTF-16     UTF-32
ASCII (A, 1, !)             1 byte     2 bytes    4 bytes
Accented Latin (é, ñ, ü)    2 bytes    2 bytes    4 bytes
CJK (中, 世, 界)             3 bytes    2 bytes    4 bytes
Emoji (👋, 🎉)               4 bytes    4 bytes    4 bytes
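The widths in the reference table can be reproduced by encoding one representative character from each row. This is a small Python sketch; the `samples` dictionary and the `encoded_widths` helper are illustrative names, and the little-endian codec variants are used so that no BOM is counted.

```python
# One representative character per category, written as escapes
samples = {
    "ASCII": "A",
    "Accented Latin": "\u00e9",  # é
    "CJK": "\u4e16",             # 世
    "Emoji": "\U0001F44B",       # waving hand
}


def encoded_widths(ch: str) -> tuple:
    """Bytes needed for `ch` in UTF-8, UTF-16 and UTF-32 (BOM excluded)."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))


for label, ch in samples.items():
    utf8, utf16, utf32 = encoded_widths(ch)
    print(f"{label:15} UTF-8: {utf8}  UTF-16: {utf16}  UTF-32: {utf32}")
```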

Database Considerations

sql
-- MySQL: VARCHAR(255) means 255 characters, not bytes
-- The maximum byte size depends on the column's character set:
-- utf8mb3: up to 3 bytes per character
-- utf8mb4: up to 4 bytes per character (supports emoji)

CREATE TABLE users (
    name VARCHAR(100) CHARACTER SET utf8mb4
    -- Can store 100 characters, using up to 400 bytes
);

-- Check actual byte length in MySQL
SELECT LENGTH(name) AS bytes, CHAR_LENGTH(name) AS chars FROM users;
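The worst-case arithmetic behind these comments can be sketched in Python. The `varchar_max_bytes` helper and the `MAX_BYTES_PER_CHAR` table are hypothetical names used for illustration. The sketch also assumes MySQL's documented VARCHAR length prefix: 1 byte when a value can never exceed 255 bytes, 2 bytes otherwise. Treat the result as an estimate of the per-value upper bound, not of actual on-disk usage.

```python
# Maximum bytes per character for a few MySQL character sets
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8mb3": 3, "utf8mb4": 4}


def varchar_max_bytes(n_chars: int, charset: str = "utf8mb4") -> int:
    """Worst-case bytes for one VARCHAR(n_chars) value, including the length prefix."""
    data_bytes = n_chars * MAX_BYTES_PER_CHAR[charset]
    # Assumed rule: 1-byte prefix if the value can never exceed 255 bytes, else 2 bytes
    prefix = 1 if data_bytes <= 255 else 2
    return data_bytes + prefix


print(varchar_max_bytes(100, "utf8mb4"))  # 402 = 400 data bytes + 2-byte prefix
print(varchar_max_bytes(255, "utf8mb4"))  # 1022 = 1020 data bytes + 2-byte prefix
print(varchar_max_bytes(255, "latin1"))   # 256 = 255 data bytes + 1-byte prefix
```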

Common Pitfalls

  • Confusing string length with byte size: Python's len() and Ruby's .length return characters. Go's len() and C's strlen() return bytes. Know what your language's default string length function measures.
  • Assuming 1 character = 1 byte: True only for ASCII. A single emoji like 👋 takes 4 bytes in UTF-8. Chinese characters take 3 bytes. Always use the encoding-specific byte count function.
  • JavaScript .length is not characters or bytes: It returns UTF-16 code units. Emoji and some other characters use 2 code units (surrogate pairs), so for such text .length overcounts characters and undercounts UTF-8 bytes.
  • Database VARCHAR limits: MySQL's VARCHAR(n) is in characters, but the row has a byte limit. A VARCHAR(255) column with utf8mb4 can use up to 1020 bytes per value.
  • Null terminators in C: strlen() returns the byte count excluding the null terminator. The actual memory used is strlen(s) + 1.
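The null-terminator pitfall can be observed without compiling any C. This is a minimal sketch using Python's standard `ctypes` module, whose `create_string_buffer` allocates a C-style buffer one byte longer than the initial bytes so that it ends in a NUL.

```python
import ctypes

# A C-style buffer: 13 data bytes followed by a NUL terminator
buf = ctypes.create_string_buffer(b"Hello, World!")

print(len(buf.value))      # 13: bytes before the terminator, like strlen()
print(ctypes.sizeof(buf))  # 14: memory actually occupied, including the NUL
```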

Summary

  • String length (characters) and byte size are different; always specify which you need
  • Use encoding-specific functions: Python len(s.encode('utf-8')), JS new TextEncoder().encode(s).length, Java s.getBytes("UTF-8").length
  • UTF-8 uses 1-4 bytes per character; UTF-16 uses 2 or 4 bytes; ASCII uses exactly 1 byte
  • Go's len(s) returns bytes; Python's len(s) and Ruby's s.length return characters
  • Always check your database's character set when sizing VARCHAR columns
