I need a fast 96-bit on 64-bit specific division algorithm for a fixed-point math library
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
Introduction
For a fixed-point library, a 96-bit numerator divided by a 64-bit denominator is specific enough that a generic big-integer routine is often unnecessary overhead. The fastest practical answer depends on your platform: if the compiler gives you a cheap 128-bit intermediate type, use that first; if it does not, move to a specialized normalized multiword divide or reciprocal-based routine.
Define the Exact Problem First
A common layout is:
- numerator as one 32-bit high limb plus one 64-bit low limb
- denominator as one 64-bit unsigned integer
- quotient expected to fit in 32 bits or 64 bits depending on the fixed-point format
For example, in C you might represent the 96-bit numerator as:
That layout makes it easy to build the full value when a wider temporary type is available.
The Simplest Fast Path on Modern 64-Bit Compilers
If you have __uint128_t, use it. This is usually the cleanest and fastest baseline on GCC and Clang for 64-bit targets.
This code is easy to audit, exact, and often better than writing an elaborate hand-rolled routine too early.
Why This Is Often Good Enough
A lot of fixed-point libraries over-engineer division before measuring. If your target compiler lowers 128-bit arithmetic efficiently, the code above wins on:
- correctness
- simplicity
- maintainability
- optimization freedom for the compiler
That matters because division bugs in numeric libraries are expensive and subtle.
When You Need a Specialized Multiword Algorithm
If __uint128_t is unavailable or too slow on the target, the next step is a specialized multiword divide. The standard approach is a normalized long-division variant derived from Knuth's Algorithm D:
- normalize the 64-bit divisor so its top bit is set
- shift the 96-bit numerator by the same amount
- estimate the quotient from the top limbs
- correct the estimate if it is one or two steps too high
- subtract and de-normalize the remainder
Because the shape is fixed at 96 divided by 64, you can write a much smaller routine than a fully general big-number divider. That is the real optimization opportunity: specialize to the exact limb count.
Reciprocal Multiplication Is for Repeated Divisors
If the same denominator is reused many times, reciprocal multiplication can beat repeated division. In that strategy, you precompute an approximate reciprocal of the 64-bit divisor and turn division into multiplication plus correction.
That is a good fit when:
- you divide many numerators by the same denominator
- the denominator changes infrequently
- the target CPU has strong multiply throughput
If the denominator changes every call, the precomputation cost may erase the benefit.
Fixed-Point Design Implications
In fixed-point code, it is worth asking whether you actually need a raw 96-by-64 divide or whether the format can be rearranged to reduce the problem. For example, some libraries can:
- normalize inputs earlier
- limit the quotient range
- use reciprocal caching
- change scaling so the high limb carries less dynamic range
Those design decisions can matter more than micro-optimizing a single division routine.
Common Pitfalls
- Writing a complex manual divider before benchmarking the
__uint128_tpath on the actual target. - Ignoring division-by-zero and quotient-overflow conditions in the low-level routine.
- Using reciprocal multiplication even when the divisor changes every call.
- Forgetting normalization in a hand-written multiword divide, which leads to bad quotient estimates.
- Optimizing for one compiler and assuming the same codegen quality everywhere.
Summary
- On modern 64-bit compilers, start with a
__uint128_timplementation and measure it. - If that is not available or not fast enough, use a specialized normalized 96-by-64 division routine.
- Reciprocal methods are most useful when the same divisor is reused many times.
- Keep the exact limb layout explicit so the implementation stays auditable.
- In fixed-point libraries, overall representation choices can matter as much as the divide algorithm itself.

