Introduction
A string's length in characters and its size in bytes are different things. In ASCII, each character is 1 byte. In UTF-8, a character takes 1 to 4 bytes depending on its code point. In UTF-16, it takes 2 or 4 bytes. The byte size therefore depends on the encoding, and each language exposes its own way to measure it. Getting this right matters for database column sizing, network protocol buffers, file I/O, and API payload limits.
Python
import sys

text = "Hello, World!"

# Length in characters (code points)
print(len(text))  # 13

# Size in bytes (UTF-8)
print(len(text.encode('utf-8')))  # 13 (ASCII chars = 1 byte each)

# With multi-byte characters
text = "Hello, 世界!"
print(len(text))  # 10 characters
print(len(text.encode('utf-8')))  # 14 bytes (8 ASCII bytes + 2 Chinese chars at 3 bytes each)
print(len(text.encode('utf-16')))  # 22 bytes (includes 2-byte BOM)

# sys.getsizeof includes Python object overhead
print(sys.getsizeof(text))  # varies by Python version; includes the object header, not just string data
Use len(s.encode('utf-8')) for the actual byte count of the string content.
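One place the byte count matters in practice is trimming a string to a byte budget, such as an API payload limit. The following is a minimal sketch under stated assumptions, not a definitive implementation: the helper name truncate_utf8 and its max_bytes parameter are hypothetical, and it cuts on code point boundaries only, so it may still split a combined sequence such as a flag emoji.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes.

    Hypothetical helper for illustration; not part of any library.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slice at the byte limit. If the cut lands inside a multi-byte
    # character, errors="ignore" drops the incomplete trailing bytes.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(truncate_utf8("Hello, 世界!", 9))   # Hello,  (世 would need bytes 8-10)
print(truncate_utf8("Hello, 世界!", 10))  # Hello, 世
```

Slicing the str directly (text[:9]) would limit characters, not bytes, and could exceed the budget for non-ASCII text.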
JavaScript
// String length (UTF-16 code units)
const text = "Hello, 世界!";
console.log(text.length); // 10

// Byte size in UTF-8
const encoder = new TextEncoder();
const bytes = encoder.encode(text);
console.log(bytes.length); // 14

// Using Blob (browser)
const blob = new Blob([text]);
console.log(blob.size); // 14

// Node.js
console.log(Buffer.byteLength(text, 'utf-8')); // 14

// Emoji handling
const emoji = "Hello 😀";
console.log(emoji.length); // 8 (surrogate pair = 2 code units)
console.log(new TextEncoder().encode(emoji).length); // 10 bytes in UTF-8
JavaScript's .length returns UTF-16 code units, not characters or bytes.
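To see the three measures side by side, the same emoji string can be inspected from Python. This is a small sketch for comparison only; the function name utf16_code_units is an assumption introduced here, and it reproduces what JavaScript's .length would report by counting 2-byte units in a BOM-free UTF-16 encoding.

```python
def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units, i.e. the value JavaScript's .length reports.

    Hypothetical helper for illustration.
    """
    # "utf-16-le" writes no BOM, and every code unit is exactly 2 bytes.
    return len(text.encode("utf-16-le")) // 2


emoji = "Hello \U0001F600"  # "Hello " followed by one emoji outside the BMP

print(len(emoji))                  # 7  code points
print(utf16_code_units(emoji))     # 8  UTF-16 code units (surrogate pair counts twice)
print(len(emoji.encode("utf-8")))  # 10 bytes in UTF-8
```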
Java
import java.nio.charset.StandardCharsets;

String text = "Hello, 世界!";

// Length in UTF-16 code units (equals the character count here: no surrogate pairs)
System.out.println(text.length()); // 10

// Byte size in UTF-8
byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
System.out.println(utf8Bytes.length); // 14

// Byte size in other encodings
byte[] utf16Bytes = text.getBytes(StandardCharsets.UTF_16);
System.out.println(utf16Bytes.length); // 22 (includes 2-byte BOM)

byte[] asciiBytes = text.getBytes(StandardCharsets.US_ASCII);
System.out.println(asciiBytes.length); // 10 (non-ASCII replaced with '?')
C#
string text = "Hello, 世界!";

// Length in UTF-16 code units (equals the character count here: no surrogate pairs)
Console.WriteLine(text.Length); // 10

// Byte size in UTF-8
int utf8Size = System.Text.Encoding.UTF8.GetByteCount(text);
Console.WriteLine(utf8Size); // 14

// Byte size in UTF-16 (C# internal representation, no BOM)
int utf16Size = System.Text.Encoding.Unicode.GetByteCount(text);
Console.WriteLine(utf16Size); // 20

// Get the actual bytes
byte[] bytes = System.Text.Encoding.UTF8.GetBytes(text);
Console.WriteLine(bytes.Length); // 14
Go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	text := "Hello, 世界!"

	// Byte length (Go source and string literals are UTF-8)
	fmt.Println(len(text)) // 14

	// Character (rune) count
	fmt.Println(utf8.RuneCountInString(text)) // 10

	// len() on a Go string already gives bytes, not characters
	ascii := "Hello"
	fmt.Println(len(ascii)) // 5 bytes = 5 characters (all ASCII)
}
Go strings are read-only byte sequences that conventionally hold UTF-8 text. len(s) returns bytes, not characters.
C / C++
#include <stdio.h>
#include <string.h>

int main(void) {
    // ASCII string
    const char *text = "Hello, World!";
    printf("Bytes: %zu\n", strlen(text)); // 13 (excluding null terminator)
    printf("With null: %zu\n", strlen(text) + 1); // 14

    // UTF-8 string (assumes the source file is saved as UTF-8)
    const char *utf8 = "Hello, 世界!";
    printf("Bytes: %zu\n", strlen(utf8)); // 14
    // strlen counts bytes, not characters in UTF-8
    return 0;
}

In C++, std::string::size() and length() also return bytes:

#include <iostream>
#include <string>

int main() {
    std::string text = "Hello, 世界!";
    std::cout << text.size() << std::endl; // 14 bytes
    std::cout << text.length() << std::endl; // 14 bytes (same as size())

    // For character count, use a UTF-8 library or count manually
}
Ruby
text = "Hello, 世界!"

puts text.length # 10 (characters)
puts text.bytesize # 14 (bytes in UTF-8)
puts text.encode('UTF-16').bytesize # 22 (bytes in UTF-16, includes 2-byte BOM)

# Encoding info
puts text.encoding # UTF-8
PHP
$text = "Hello, 世界!";

echo strlen($text), PHP_EOL; // 14 (bytes)
echo mb_strlen($text), PHP_EOL; // 10 (characters, requires mbstring extension)

// strlen() in PHP returns bytes, not characters
// mb_strlen() counts actual characters
Encoding Size Reference
| Character | UTF-8 | UTF-16 | UTF-32 |
| --- | --- | --- | --- |
| ASCII (A, 1, !) | 1 byte | 2 bytes | 4 bytes |
| Accented Latin (é, ñ, ü) | 2 bytes | 2 bytes | 4 bytes |
| CJK (中, 世, 界) | 3 bytes | 2 bytes | 4 bytes |
| Emoji (😀, 🎉) | 4 bytes | 4 bytes | 4 bytes |
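The per-character sizes in the table can be checked directly. The sketch below is illustrative only; encoded_sizes is a hypothetical helper, and it uses the little-endian codec variants so that no byte-order mark inflates the counts.

```python
def encoded_sizes(char: str) -> dict:
    """Return the encoded size in bytes of char in UTF-8, UTF-16 and UTF-32.

    Hypothetical helper for illustration.
    """
    # The "-le" codecs encode without a BOM, so only the character is counted.
    return {
        "utf-8": len(char.encode("utf-8")),
        "utf-16": len(char.encode("utf-16-le")),
        "utf-32": len(char.encode("utf-32-le")),
    }


samples = [
    ("ASCII", "A"),
    ("Accented Latin", "\u00e9"),   # é
    ("CJK", "\u4e16"),              # 世
    ("Emoji", "\U0001F600"),        # 😀
]

for label, char in samples:
    print(label, encoded_sizes(char))
```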
Database Considerations
-- MySQL: VARCHAR(255) means 255 characters, not bytes
-- The maximum bytes per value depends on the character set:
-- utf8mb3: up to 3 bytes per character
-- utf8mb4: up to 4 bytes per character (supports emoji)

CREATE TABLE users (
  name VARCHAR(100) CHARACTER SET utf8mb4
  -- Can store 100 characters, using up to 400 bytes
);

-- Check actual byte length in MySQL
SELECT LENGTH(name) AS bytes, CHAR_LENGTH(name) AS chars FROM users;
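The same character-versus-byte distinction can be checked on the application side before a value reaches the database. This is a rough sketch that assumes a utf8mb4 column; varchar_usage and its parameters are hypothetical names, and the function only mirrors what CHAR_LENGTH() and LENGTH() would report, it does not talk to MySQL.

```python
def varchar_usage(value: str, max_chars: int) -> dict:
    """Estimate how value would occupy a VARCHAR(max_chars) utf8mb4 column.

    Hypothetical helper for illustration; assumes the column is utf8mb4.
    """
    char_count = len(value)                  # comparable to CHAR_LENGTH(name)
    byte_count = len(value.encode("utf-8"))  # comparable to LENGTH(name)
    return {
        "chars": char_count,
        "bytes": byte_count,
        "fits": char_count <= max_chars,     # VARCHAR(n) limits characters
        "worst_case_bytes": max_chars * 4,   # utf8mb4: up to 4 bytes per character
    }


print(varchar_usage("Hello, 世界!", 100))
```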
Common Pitfalls
Confusing string length with byte size: Python's len() and Ruby's String#length return characters (code points). Go's len() and C's strlen() return bytes. Know what your language's default string length function measures.
Assuming 1 character = 1 byte: True only for ASCII. A single emoji like 😀 takes 4 bytes in UTF-8. Most Chinese characters take 3 bytes. Always use the encoding-specific byte count function.
JavaScript .length is not characters or bytes: It returns UTF-16 code units. Emoji and other characters outside the Basic Multilingual Plane use 2 code units (a surrogate pair), so .length can overcount characters, and it undercounts UTF-8 bytes for any non-ASCII text.
Database VARCHAR limits: MySQL's VARCHAR(n) is in characters, but each row is also limited to 65,535 bytes. A VARCHAR(255) column with utf8mb4 can use up to 1020 bytes per value, plus a 1-2 byte length prefix.
Null terminators in C: strlen() returns the byte count excluding the null terminator. The actual memory used is strlen(s) + 1.
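The null-terminator pitfall can be observed from Python with the standard ctypes module, which allocates C-style buffers. This is only a sketch to make the off-by-one visible: ctypes.create_string_buffer, when given a bytes object, makes the buffer one byte longer than the data so that it ends in a NUL.

```python
import ctypes

text = "Hello, 世界!"
encoded = text.encode("utf-8")

# create_string_buffer(bytes) allocates len(bytes) + 1 bytes; the extra
# byte holds the terminating NUL, as a C string literal would.
buf = ctypes.create_string_buffer(encoded)

print(len(encoded))        # 14 (what strlen() would return)
print(ctypes.sizeof(buf))  # 15 (memory actually needed: strlen + 1)
print(buf.raw[-1:])        # b'\x00'
```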
Summary
String length (characters) and byte size are different; always specify which you need
Use encoding-specific functions: Python len(s.encode('utf-8')), JS new TextEncoder().encode(s).length, Java s.getBytes("UTF-8").length
UTF-8 uses 1-4 bytes per character; UTF-16 uses 2 or 4 bytes; ASCII uses exactly 1 byte
Go's len(s) returns bytes; Python's len(s) and Ruby's s.length return characters
Always check your database's character set when sizing VARCHAR columns