Algorithm for re-wrapping hard-wrapped text?

Introduction

Hard-wrapped text contains line breaks that were inserted for display width rather than for structure. Re-wrapping it means recovering the original paragraph flow, then formatting it again at a new width without damaging lists, code blocks, or other intentional line breaks.

Treat Structure Before Width

The main mistake is trying to wrap the whole file as one block. A good algorithm first decides which line breaks are structural and which are accidental. In practice, most text can be handled with four rules:

  1. blank lines separate paragraphs
  2. indented blocks are usually code or preformatted text
  3. bullet lines should stay as separate items
  4. ordinary prose lines inside one paragraph should be joined with spaces

That means the problem is really a segmentation problem first and a wrapping problem second.

A Practical Re-Wrapping Algorithm

The following Python example uses those rules and then relies on textwrap.fill for the final wrapping pass.

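```python
import re
import textwrap

# Matches "-", "*", "+" or "1." style list markers at the start of a line.
BULLET_RE = re.compile(r"^\s*(?:[-*+]|[0-9]+\.)\s+")


def is_preformatted(block: str) -> bool:
    # Treat the block as code if any line is indented by four spaces or a tab.
    lines = block.splitlines()
    return any(line.startswith(("    ", "\t")) for line in lines)


def is_bullet_block(block: str) -> bool:
    # A list block starts with a marker; later lines may be wrapped
    # continuations of the previous item, so they need not match.
    lines = [line for line in block.splitlines() if line.strip()]
    return bool(lines) and BULLET_RE.match(lines[0]) is not None


def normalize_prose(block: str) -> str:
    parts = []
    for raw_line in block.splitlines():
        line = raw_line.strip()
        if not line:
            continue

        # Heuristic: a letter plus trailing hyphen, continued by a lowercase
        # letter, is treated as a word split across lines. This also merges
        # real compounds such as "well-" / "known", so tune it per corpus.
        if parts and re.search(r"[A-Za-z]-$", parts[-1]) and line[0].islower():
            parts[-1] = parts[-1][:-1] + line
        else:
            parts.append(line)

    return " ".join(parts)


def rewrap_block(block: str, width: int) -> str:
    if is_preformatted(block):
        return block

    if is_bullet_block(block):
        # Group each marker line with its wrapped continuation lines.
        items = []
        for line in block.splitlines():
            if not line.strip():
                continue
            if BULLET_RE.match(line):
                items.append([line])
            else:
                items[-1].append(line)

        wrapped_items = []
        for item in items:
            marker = BULLET_RE.match(item[0].lstrip()).group(0)
            wrapped_items.append(
                textwrap.fill(
                    normalize_prose("\n".join(item)),
                    width=width,
                    # Align continuation lines with the text after the marker.
                    subsequent_indent=" " * len(marker),
                )
            )
        return "\n".join(wrapped_items)

    prose = normalize_prose(block)
    return textwrap.fill(prose, width=width)


def rewrap_text(text: str, width: int = 72) -> str:
    # Strip only surrounding newlines so a leading code block keeps its indent.
    blocks = re.split(r"\n\s*\n", text.strip("\n"))
    return "\n\n".join(rewrap_block(block, width) for block in blocks)
```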

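The algorithm works because it separates paragraphs first, preserves blocks that should not be touched, and only joins lines when they look like ordinary prose. It recognizes code by indentation only: a fenced code block that is not indented is treated as prose, and one that contains blank lines is split into several blocks.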

Why Joining Lines Is the Hard Part

The tricky step is deciding when two adjacent lines belong to the same sentence. Most hard-wrapped text from email, Markdown prose, or old documentation was wrapped mechanically at a fixed column width, so joining with a single space is correct. But there are exceptions:

  • headings
  • lists
  • tables
  • quoted email replies
  • code snippets

That is why heuristic detection is more reliable than a pure character-count rule. A line ending near column 72 is only a hint, not proof that it should be joined.
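As a rough illustration of using line length as a hint rather than a rule, the sketch below asks whether the first word of the next line would have fit on the current line. The function name `looks_mechanically_wrapped` and the assumption that the original wrap width is known are both hypothetical; treat the result as one signal to combine with structural checks.

```python
def looks_mechanically_wrapped(line: str, next_line: str, width: int = 72) -> bool:
    """Return True if the break after `line` looks forced by the width limit.

    Assumes the text was originally wrapped at `width` columns (an assumption;
    real sources rarely state their wrap width).
    """
    words = next_line.split()
    if not line.strip() or not words:
        return False
    # A width-driven wrapper only breaks when the next word would not fit.
    return len(line.rstrip()) + 1 + len(words[0]) > width
```

A short line followed by a short word returns False, which suggests the break was intentional (for example, the end of a heading or a list item).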

You can add more heuristics depending on the source format. For example, email quote lines beginning with > should usually be treated as their own structural unit, while Markdown headings beginning with # should be preserved exactly.
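A minimal sketch of such format-specific checks might classify each line before any joining happens. The category names and the `classify_line` helper are illustrative assumptions, not part of any library, and the heading pattern covers only `#`-style Markdown headings.

```python
import re

# "#" to "######" followed by whitespace, as in Markdown ATX headings.
HEADING_RE = re.compile(r"^#{1,6}\s")
# Optional indentation followed by ">", as in quoted email replies.
QUOTE_RE = re.compile(r"^\s*>")


def classify_line(line: str) -> str:
    """Label a line as 'blank', 'heading', 'quote', or 'prose'."""
    if not line.strip():
        return "blank"
    if HEADING_RE.match(line):
        return "heading"
    if QUOTE_RE.match(line):
        return "quote"
    return "prose"
```

Lines labeled "heading" can be emitted unchanged, and consecutive "quote" lines can be grouped and rewrapped separately from the surrounding prose.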

Choosing the New Width

After line joining, the actual re-wrapping step is easy. Python's textwrap module already handles width, indentation, and word boundaries well. The important decision is whether your target format expects plain text, Markdown, terminal output, or generated documentation.

For plain text, a width around 72 or 80 is still common. For rendered Markdown, you may choose a wider width or skip hard wrapping entirely. The recovery step is still valuable because it removes misleading old line breaks and gives you one clean paragraph representation.
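One way to keep the width decision separate from the recovery step is to make the width optional. In this sketch, the hypothetical helper `format_paragraph` returns the joined paragraph on a single line when `width` is None, and otherwise wraps it with textwrap.fill.

```python
import textwrap
from typing import Optional


def format_paragraph(paragraph: str, width: Optional[int] = 72) -> str:
    """Join a hard-wrapped paragraph, then wrap it at `width` if one is given."""
    # Collapse every run of whitespace, including old line breaks, to one space.
    joined = " ".join(paragraph.split())
    if width is None:
        # No hard wrapping, e.g. for Markdown that will be rendered.
        return joined
    return textwrap.fill(joined, width=width)
```

Calling it with `width=None` gives the clean single-paragraph representation; passing 72 or 80 gives conventional plain-text output.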

Common Pitfalls

One common failure is merging lines inside code blocks. If the source contains indentation-sensitive code, joining those lines destroys semantics immediately. Preserve indented or fenced sections before touching prose.
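The following sketch shows one way to set fenced sections aside before any prose handling. It assumes fences are lines starting with three backticks and that fences are not nested; the `split_blocks` name and the "code"/"text" labels are illustrative choices.

```python
from typing import List, Tuple


def split_blocks(text: str) -> List[Tuple[str, str]]:
    """Split text into ("code", ...) and ("text", ...) blocks.

    Fenced sections are kept whole, including any blank lines inside them.
    """
    blocks: List[Tuple[str, str]] = []
    current: List[str] = []
    in_fence = False

    def flush(kind: str) -> None:
        if current:
            blocks.append((kind, "\n".join(current)))
            current.clear()

    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            if in_fence:
                current.append(line)   # closing fence ends the code block
                flush("code")
            else:
                flush("text")          # finish any paragraph before the fence
                current.append(line)
            in_fence = not in_fence
        elif in_fence:
            current.append(line)       # keep code lines untouched
        elif not line.strip():
            flush("text")              # blank line ends a paragraph
        else:
            current.append(line)

    # An unclosed fence is still treated as code rather than rewrapped.
    flush("code" if in_fence else "text")
    return blocks
```

Only the "text" blocks would then be passed to the prose and list handling; "code" blocks are written back exactly as they were read.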

Another problem is flattening bullet lists into a paragraph. A block of short wrapped list items can look like prose unless you explicitly check for list markers. Handle list structure before calling textwrap.fill.

Hyphenated line endings are also easy to mishandle. Sometimes a trailing hyphen means a word was broken across lines and should be removed during joining. Other times the hyphen is part of the real text. If your source includes OCR or PDF output, expect this case to need custom cleanup.
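A possible approach, sketched below, is to remove the hyphen only when the joined fragments form a word found in a vocabulary. The `join_across_hyphen` helper and the `known_words` set are assumptions for illustration; a real corpus would need a proper word list and probably extra rules for capitalization and punctuation.

```python
def join_across_hyphen(line: str, next_line: str, known_words: set) -> str:
    """Join two lines, dropping a trailing hyphen only for known split words."""
    line = line.rstrip()
    next_line = next_line.lstrip()
    if not line.endswith("-") or not next_line:
        # No hyphen at the break: join with a single space.
        return f"{line} {next_line}".strip()

    first_half = line[:-1].split(" ")[-1]
    second_half = next_line.split(" ")[0]
    candidate = (first_half + second_half).strip(".,;:!?").lower()

    if candidate in known_words:
        # The word was split across lines; remove the hyphen.
        return line[:-1] + next_line
    # Otherwise assume the hyphen is real, as in "well-known".
    return line + next_line
```

With `known_words = {"algorithm"}`, the pair "a wrapping algo-" and "rithm works" is joined without the hyphen, while "a well-" and "known case" keeps it.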

Finally, do not assume there is one perfect heuristic for all corpora. Re-wrapping exports from email clients, scanned documents, and Markdown repositories may require slightly different structure rules.

Summary

  • Re-wrapping hard-wrapped text is mainly about identifying structural line breaks.
  • Split into paragraph blocks before attempting any width-based formatting.
  • Preserve code, bullets, and other preformatted content as separate cases.
  • Join ordinary prose lines first, then reflow them with a normal wrapping function.
  • Tune heuristics to the source format instead of relying on line length alone.
