Why/how does the Longest Proper Prefix/Suffix algorithm work?
Master System Design with Codemia
Enhance your system design skills with over 120 practice problems, detailed solutions, and hands-on exercises.
In the domain of computer science and string processing, the concept of the "Longest Proper Prefix/Suffix" (LPS) is most commonly associated with the Knuth-Morris-Pratt (KMP) pattern matching algorithm. Understanding why and how this algorithm works requires delving into the nuances of string patterns and efficient searching techniques.
Understanding Pattern Matching
When you're searching for a particular substring within a string, an efficient method is to reduce unnecessary comparisons. The naive approach would compare the substring to each segment of the string one by one, which is not optimal.
For example, searching for the pattern "ABCD" in the text "ABC ABCDAB ABCDABCDABDE" makes several redundant comparisons. The KMP algorithm, on the other hand, smartly skips over portions of characters that have already been matched, using precomputed information stored in an array corresponding to the pattern string.
The Role of LPS in KMP
The Longest Proper Prefix which is also a Suffix (LPS) array is crucial in achieving the efficient running of the KMP algorithm. It helps in determining how far to skip in the pattern when a mismatch occurs.
Definition of LPS
- Proper Prefix: A prefix of a string that is not equal to the string itself.
- Suffix: A substring that contains the ending part of the string.
- LPS array: For a pattern of length `n`, the LPS array is of the same length and at index `i` holds the length of the longest prefix which is also a suffix for the substring from the start of the pattern to `i`.
Technical Explanation and Example
Consider the pattern "ABABAC". Constructing the LPS array involves determining the longest border (a proper prefix which is also a suffix) at each position in the string.
- Pattern: A B A B A C
- LPS: 0 0 1 2 3 0
- At position 0 ('A'), the longest proper prefix which is also a suffix is "" (empty string), so LPS[0] = 0.
- At position 1 ('B'), the longest proper prefix/suffix is "" again.
- At position 2 ('A'), "A" itself is the longest, making LPS[2] = 1.
- At position 3 ('B'), "AB" is the longest.
- At position 4 ('A'), "ABA" is the longest.
- At position 5 ('C'), no proper prefix is a suffix, so LPS[5] = 0.
Constructing the LPS Table
Here's a procedural view of constructing an LPS table:
- Initialize: `length = 0`, `LPS[0] = 0`, and iterate from `i = 1` to `n-1`.
- If `pattern[i] == pattern[length]`, increment `length` and set `LPS[i] = length`.
- If not, set `LPS[i]` to the previous `LPS[length-1]` until a match or zero is found.
A concise pythonic representation is:
- Shift the search position in linear time complexity, avoiding quadratic checks.
- Ideal for scenarios like DNA sequencing, where large patterns search within even larger bodies of text.

