What's the difference between UTF-8 and UTF-8 with BOM?


Introduction

UTF-8 and UTF-8 with BOM are the same character encoding. The only difference is that UTF-8 with BOM prepends three invisible bytes (EF BB BF) at the very start of the file. This Byte Order Mark acts as a signature telling editors "this file is UTF-8," but since UTF-8 has no byte-order ambiguity, the BOM is unnecessary and frequently causes bugs in Unix tools, web servers, and programming languages.

What Is UTF-8?

UTF-8 is a variable-width encoding that can represent every Unicode code point using one to four bytes. It is backward-compatible with ASCII: any valid ASCII file is also valid UTF-8.

Code point range	Bytes used	Example character
U+0000 to U+007F	1 byte	`A` (0x41)
U+0080 to U+07FF	2 bytes	`é` (0xC3 0xA9)
U+0800 to U+FFFF	3 bytes	`€` (0xE2 0x82 0xAC)
U+10000 to U+10FFFF	4 bytes	`😀` (0xF0 0x9F 0x98 0x80)
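
The byte counts in the table can be checked directly. The following is a minimal sketch that encodes one sample character per range; the helper name `utf8_hex` is an illustrative choice, not part of any library.

```python
# One sample character from each code point range, written as escapes
# so the snippet does not depend on the encoding of this source file.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, e-acute, euro sign, emoji

def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated uppercase hex."""
    return ch.encode("utf-8").hex(" ").upper()

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {utf8_hex(ch)}")
```

Each line of output shows the code point, the number of bytes, and the bytes themselves, which should match the rows above.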

UTF-8 is the dominant encoding on the web. Over 98% of all websites use it.

What Is the BOM?

BOM stands for Byte Order Mark. It is the Unicode code point U+FEFF. For encodings like UTF-16 and UTF-32 where byte order matters, the BOM serves a real purpose: it tells the decoder whether the file is big-endian or little-endian.

Encoding	BOM bytes	Purpose
UTF-8	`EF BB BF`	Signature only (no byte-order issue)
UTF-16 BE	`FE FF`	Indicates big-endian byte order
UTF-16 LE	`FF FE`	Indicates little-endian byte order
UTF-32 BE	`00 00 FE FF`	Indicates big-endian byte order
UTF-32 LE	`FF FE 00 00`	Indicates little-endian byte order

UTF-8 does not have a byte-order issue because its code unit is a single byte: there is no multi-byte unit whose bytes could be stored in a different order. The BOM in UTF-8 is purely a marker that says "this file is UTF-8." Most modern tools already detect UTF-8 without it.
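
Python's standard `codecs` module exposes these byte sequences as constants. The sketch below prints them and checks that each one is simply U+FEFF encoded in the corresponding encoding; the dictionary name `BOMS` is only for illustration.

```python
import codecs

# BOM byte sequences as defined by the standard library.
BOMS = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16 BE": codecs.BOM_UTF16_BE,
    "UTF-16 LE": codecs.BOM_UTF16_LE,
    "UTF-32 BE": codecs.BOM_UTF32_BE,
    "UTF-32 LE": codecs.BOM_UTF32_LE,
}

for name, bom in BOMS.items():
    print(f"{name:10} {bom.hex(' ').upper()}")

# The same code point, U+FEFF, produces each of these sequences.
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8
assert "\ufeff".encode("utf-16-be") == codecs.BOM_UTF16_BE
assert "\ufeff".encode("utf-16-le") == codecs.BOM_UTF16_LE
```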

Side-by-Side Comparison

Let's see how the string Hello is stored in both formats.

UTF-8 (no BOM):

48 65 6C 6C 6F

UTF-8 with BOM:

EF BB BF 48 65 6C 6C 6F

The content is identical. The only difference is the three leading bytes.

Verifying with code

Python:

python

# Write UTF-8 with BOM
with open("with_bom.txt", "w", encoding="utf-8-sig") as f:
    f.write("Hello")

# Write UTF-8 without BOM
with open("no_bom.txt", "w", encoding="utf-8") as f:
    f.write("Hello")

# Check the raw bytes
with open("with_bom.txt", "rb") as f:
    print(f.read())  # b'\xef\xbb\xbfHello'

with open("no_bom.txt", "rb") as f:
    print(f.read())  # b'Hello'

Command line (Linux/macOS):

bash

# Check for BOM
hexdump -C file.txt | head -1
# BOM present: 00000000  ef bb bf 48 65 6c 6c 6f
# No BOM:      00000000  48 65 6c 6c 6f

# Remove BOM with GNU sed (Linux). BSD/macOS sed needs -i '' and
# may not support \x escapes.
sed -i '1s/^\xEF\xBB\xBF//' file.txt

PowerShell (Windows):

powershell

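# Read and re-save without BOM
# Resolve to a full path: .NET methods do not follow PowerShell's current location
$path = (Resolve-Path "file.txt").Path
$content = Get-Content -Path $path -Raw
[System.IO.File]::WriteAllText($path, $content, [System.Text.UTF8Encoding]::new($false))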

When the BOM Causes Problems

The BOM is three invisible bytes. Any tool that does not expect them will treat them as data, leading to subtle and frustrating bugs.

PHP and web output

php

<?php
// If this file is saved as UTF-8 with BOM, the three BOM bytes
// are sent before any PHP code executes.
// This breaks header() calls:
header("Location: /dashboard");
// Warning: Cannot modify header information - headers already sent

The BOM bytes constitute output, so PHP thinks headers have already been sent.

Bash scripts

bash

#!/bin/bash
# If saved with BOM, the first line is really:
# <BOM>#!/bin/bash
# The kernel no longer sees "#!" as the first two bytes, so the shebang
# is not recognized. A typical symptom when launched from a shell:
# ./script.sh: line 1: #!/bin/bash: No such file or directory

JSON parsing

python

import json

# A JSON file with a BOM fails to parse when read as plain utf-8:
with open("data.json", "r", encoding="utf-8") as f:
    data = json.load(f)
# json.decoder.JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig)

# Fix: use utf-8-sig to strip BOM automatically
with open("data.json", "r", encoding="utf-8-sig") as f:
    data = json.load(f)  # Works

CSV files and Excel

Ironically, Excel on Windows sometimes requires a BOM to correctly display UTF-8 CSV files. Without it, Excel may default to a local encoding (like Windows-1252) and corrupt accented characters. This is one of the few legitimate reasons to include a BOM.

python

import csv

# Generate CSV that Excel opens correctly
with open("report.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "City"])
    writer.writerow(["René", "Montréal"])

String comparison and concatenation

python

with open("bom_file.txt", "r", encoding="utf-8") as f:
    content = f.read()

# The BOM character is invisible when printed, but repr() reveals it
print(repr(content))  # '\ufeffHello'
print(content == "Hello")  # False!
print(len(content))  # 6 (not 5)
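
The concatenation case is worth seeing on its own. The sketch below works on in-memory bytes rather than real files, and `strip_bom` is a hypothetical helper introduced only for this example. It shows that joining two BOM-prefixed byte strings leaves a stray U+FEFF in the middle, even when the result is decoded with `utf-8-sig`, which only removes a BOM at the very start.

```python
import codecs

# The bytes of a file containing "Hello", saved as UTF-8 with BOM.
part = codecs.BOM_UTF8 + b"Hello"

# Naive byte-level concatenation, similar to what `cat a.txt b.txt` does.
joined = (part + part).decode("utf-8-sig")
print(repr(joined))  # 'Hello\ufeffHello'

def strip_bom(text: str) -> str:
    """Drop a single leading U+FEFF, if present (hypothetical helper)."""
    return text[1:] if text.startswith("\ufeff") else text

# Decoding and cleaning each part separately avoids the stray character.
clean = "".join(strip_bom(p.decode("utf-8")) for p in (part, part))
print(clean == "HelloHello")  # True
```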

Full Comparison Table

Aspect	UTF-8	UTF-8 with BOM
Leading bytes	None	`EF BB BF`
File size overhead	0 bytes	3 bytes
Web (HTML, CSS, JS)	Recommended	Discouraged by W3C
Unix/Linux tools	Fully compatible	May cause bugs
PHP	Works	Breaks `header()` and sessions
Python `json.load`	Works	Fails without `utf-8-sig`
Bash/shell scripts	Works	Breaks shebang line
Excel CSV import	May misdetect encoding	Displays correctly
Windows Notepad (old)	May default to ANSI	Detects as UTF-8
Visual Studio	Works	Works (strips BOM on read)
Git diffs	Clean	BOM shows as diff noise

When to Use Each

Use UTF-8 (no BOM) for:

Web files: HTML, CSS, JavaScript, JSON, XML
Source code in any language
Unix/Linux configuration files
API responses
Anything stored in version control

Use UTF-8 with BOM for:

CSV files that must open correctly in Excel on Windows
Files consumed by legacy Windows applications that rely on the BOM for encoding detection
Interoperability with older Microsoft tools (pre-2019 Notepad, older Visual Studio versions)

How to Convert Between Formats

In VS Code

Click the encoding label in the bottom status bar (e.g., "UTF-8").
Select Save with Encoding.
Choose UTF-8 or UTF-8 with BOM.

In Notepad++

Go to Encoding in the menu bar.
Select Convert to UTF-8 (removes BOM) or Convert to UTF-8-BOM.

Programmatically

Python (remove BOM):

python

with open("input.txt", "r", encoding="utf-8-sig") as f:
    content = f.read()

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(content)

Node.js (remove BOM):

javascript

const fs = require("fs");

let content = fs.readFileSync("input.txt", "utf-8");
if (content.charCodeAt(0) === 0xFEFF) {
  content = content.slice(1);
}
fs.writeFileSync("output.txt", content, "utf-8");
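
Python (add BOM):

The examples above only cover removal. For the opposite direction, here is a minimal sketch that works on raw bytes so it never adds a second BOM. The function name `add_bom` is hypothetical, and the sketch assumes the file is already valid UTF-8 and small enough to read into memory.

```python
import codecs
from pathlib import Path

def add_bom(path: str) -> bool:
    """Prepend the UTF-8 BOM to the file unless it already has one.

    Returns True if the file was changed, False if a BOM was already there.
    Hypothetical helper; assumes the content is already valid UTF-8.
    """
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return False
    p.write_bytes(codecs.BOM_UTF8 + data)
    return True
```

Calling `add_bom("report.csv")` twice changes the file only the first time, which avoids the double-BOM problem.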

Common Pitfalls

Invisible bugs. The BOM is invisible in most editors. If a file "looks right" but behaves wrong, check for a BOM with a hex editor or hexdump.
Git BOM noise. A BOM can show up as a diff change when collaborators use different editors. Add an .editorconfig with charset = utf-8 to standardize.
Windows Notepad legacy. Older versions of Windows Notepad saved UTF-8 with BOM by default. Since Windows 10 version 1903, Notepad defaults to UTF-8 without BOM.
Double BOM. Concatenating two BOM files produces a BOM in the middle of the output, which appears as the invisible character U+FEFF in your data.
Encoding detection heuristics. Some tools use the BOM to decide encoding. If you remove the BOM from a file that a legacy tool depends on, the tool may fall back to ASCII or a local codepage.
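
Since the BOM is invisible, it can help to scan a project for it instead of inspecting files one by one. The following is a rough sketch under stated assumptions: the function names and the default `*.txt` pattern are illustrative, and only the first three bytes of each file are read.

```python
import codecs
from pathlib import Path

def has_utf8_bom(path: Path) -> bool:
    """Return True if the file starts with the bytes EF BB BF."""
    with path.open("rb") as f:
        return f.read(3) == codecs.BOM_UTF8

def find_bom_files(root: str, pattern: str = "*.txt") -> list[Path]:
    """List files under root matching pattern that start with a UTF-8 BOM."""
    return sorted(
        p for p in Path(root).rglob(pattern)
        if p.is_file() and has_utf8_bom(p)
    )
```

For example, `find_bom_files(".", "*.php")` would list the PHP files under the current directory that begin with a BOM.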

Summary

UTF-8 and UTF-8 with BOM are the same encoding. The BOM is just three extra bytes (EF BB BF) at the start of the file.
UTF-8 has no byte-order ambiguity, so the BOM serves no technical purpose beyond being a file signature.
The BOM causes real bugs in PHP, Bash, JSON parsers, and Unix tools. Avoid it unless you have a specific reason.
The main legitimate use case for BOM is CSV files opened in Excel on Windows.
Use utf-8-sig in Python to read files that may or may not have a BOM.
Set your editor to save as UTF-8 without BOM by default. Standardize with .editorconfig.
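
As a quick check of the `utf-8-sig` advice, the sketch below writes the same text with and without a BOM into a temporary directory and reads both back through one function. The name `read_utf8` is hypothetical.

```python
import os
import tempfile

def read_utf8(path: str) -> str:
    """Read a UTF-8 text file, dropping a leading BOM if there is one."""
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    for name, enc in [("with_bom.txt", "utf-8-sig"), ("no_bom.txt", "utf-8")]:
        path = os.path.join(tmp, name)
        with open(path, "w", encoding=enc) as f:
            f.write("Hello")
        print(name, repr(read_utf8(path)))  # both lines show 'Hello'
```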