Knuth–Morris–Pratt (KMP)
When people first learn substring search, they usually write something like this:
- Try matching the pattern at index
0 - On mismatch, shift by one
- Start over
- Repeat
That works but in the worst case it’s O(n × m).
KMP is different.
It achieves O(n + m) by ensuring the text index never moves backward.
Let’s unpack how.
The Core Insight
When a mismatch happens, naive search throws away information.
KMP doesn’t.
Instead of restarting from scratch, it asks:
“Given what I already know matched, what is the longest prefix of the pattern that is also a suffix of what I just matched?”
That information is precomputed in something called the:
LPS Array
(Longest Prefix that is also a Suffix)
Also called:
- prefix function
- failure function
- partial match table
What Is LPS?
For every index i in the pattern:
lps[i] = length of the longest proper prefix of pattern[0..i]
that is also a suffix of pattern[0..i]
- “Proper” means not the whole substring.
- It tells us how far we can jump back in the pattern.
Visual Intuition
Let’s build the LPS table for:
Pattern: "ababaca"
Index: 0 1 2 3 4 5 6
Chars: a b a b a c a
We build it left to right.
Step-by-step LPS Construction
i = 0
Substring: "a"
No proper prefix/suffix match.
lps[0] = 0
i = 1
Substring: "ab"
Prefixes: a
Suffixes: b
No match.
lps[1] = 0
i = 2
Substring: "aba"
Prefixes: a, ab
Suffixes: a, ba
Match: "a"
lps[2] = 1
i = 3
Substring: "abab"
Prefixes: a, ab, aba
Suffixes: b, ab, bab
Match: "ab"
lps[3] = 2
i = 4
Substring: "ababa"
Match: "aba"
lps[4] = 3
i = 5
Substring: "ababac"
No prefix-suffix match.
We fall back using previous LPS values.
lps[5] = 0
i = 6
Substring: "ababaca"
Match: "a"
lps[6] = 1
Final LPS:
Index: 0 1 2 3 4 5 6
LPS: 0 0 1 2 3 0 1
How the Fallback Works (Mermaid Diagram)
Here’s the mental model of fallback transitions:
flowchart LR
Start[Match k characters]
Mismatch[Character mismatch]
Jump["Jump to lps[k-1]"]
Continue[Continue comparison]
Reset[If lps=0, advance text index]
Start --> Mismatch
Mismatch --> Jump
Jump --> Continue
Continue -->|Mismatch again| Jump
Continue -->|Match| Start
Jump -->|lps = 0| Reset
Key idea:
We never re-check characters in the text. We only shift inside the pattern.
Full Search Example
Text:
ababcabcabababd
Pattern:
ababd
When mismatch happens after matching "abab":
- Instead of restarting from scratch,
- We use
lps[3] = 2 - Resume matching from pattern index
2
That’s the entire optimisation.
No backtracking in the text. Ever.
Complexity Analysis
Let:
n = length of textm = length of pattern
LPS construction
Each character is processed once, with occasional fallback. Total: O(m)
Search
Each character in text is visited once.
Pattern index may move backward, but never more than total m times overall.
Total: O(n)
Final Complexity
Time: O(n + m)
Space: O(m)
Compare to naive worst-case:
O(n × m)
Classic pathological example for naive:
Text: aaaaaaaaaaaaaab
Pattern: aaaaab
Naive keeps restarting. KMP doesn’t.
Implementation
<?php
declare(strict_types=1);
function buildLps(string $pattern): array
{
$m = strlen($pattern);
$lps = array_fill(0, $m, 0);
$len = 0; // length of the previous longest prefix suffix
$i = 1;
while ($i < $m) {
if ($pattern[$i] === $pattern[$len]) {
$len++;
$lps[$i] = $len;
$i++;
} else {
if ($len !== 0) {
$len = $lps[$len - 1];
} else {
$lps[$i] = 0;
$i++;
}
}
}
return $lps;
}
/**
* @return int[] 0-based match start indices
*/
function kmpSearch(string $text, string $pattern): array
{
$n = strlen($text);
$m = strlen($pattern);
if ($m === 0) {
// Convention: empty pattern matches at every position (including end)
$all = [];
for ($i = 0; $i <= $n; $i++) $all[] = $i;
return $all;
}
if ($n === 0 || $m > $n) return [];
$lps = buildLps($pattern);
$matches = [];
$i = 0; // index for text
$j = 0; // index for pattern
while ($i < $n) {
if ($text[$i] === $pattern[$j]) {
$i++;
$j++;
if ($j === $m) {
$matches[] = $i - $j;
$j = $lps[$j - 1];
}
} else {
if ($j !== 0) {
$j = $lps[$j - 1];
} else {
$i++;
}
}
}
return $matches;
}
// Example:
// print_r(kmpSearch("ababcabcabababd", "ababd")); // [10]
from typing import List
def build_lps(pattern: str) -> List[int]:
m = len(pattern)
lps = [0] * m
length = 0
i = 1
while i < m:
if pattern[i] == pattern[length]:
length += 1
lps[i] = length
i += 1
else:
if length != 0:
length = lps[length - 1]
else:
lps[i] = 0
i += 1
return lps
def kmp_search(text: str, pattern: str) -> List[int]:
n, m = len(text), len(pattern)
if m == 0:
return list(range(n + 1))
if n == 0 or m > n:
return []
lps = build_lps(pattern)
matches: List[int] = []
i = 0 # text index
j = 0 # pattern index
while i < n:
if text[i] == pattern[j]:
i += 1
j += 1
if j == m:
matches.append(i - j)
j = lps[j - 1]
else:
if j != 0:
j = lps[j - 1]
else:
i += 1
return matches
# Example:
# print(kmp_search("ababcabcabababd", "ababd")) # [10]
using System;
using System.Collections.Generic;
public static class Kmp
{
public static int[] BuildLps(string pattern)
{
int m = pattern.Length;
var lps = new int[m];
int len = 0;
int i = 1;
while (i < m)
{
if (pattern[i] == pattern[len])
{
len++;
lps[i] = len;
i++;
}
else
{
if (len != 0)
{
len = lps[len - 1];
}
else
{
lps[i] = 0;
i++;
}
}
}
return lps;
}
// Returns 0-based match start indices
public static List<int> Search(string text, string pattern)
{
int n = text.Length;
int m = pattern.Length;
if (m == 0)
{
var all = new List<int>(n + 1);
for (int k = 0; k <= n; k++) all.Add(k);
return all;
}
if (n == 0 || m > n) return new List<int>();
int[] lps = BuildLps(pattern);
var matches = new List<int>();
int i = 0; // text
int j = 0; // pattern
while (i < n)
{
if (text[i] == pattern[j])
{
i++;
j++;
if (j == m)
{
matches.Add(i - j);
j = lps[j - 1];
}
}
else
{
if (j != 0)
{
j = lps[j - 1];
}
else
{
i++;
}
}
}
return matches;
}
}
// Example:
// var hits = Kmp.Search("ababcabcabababd", "ababd"); // [10]
import Foundation
func buildLps(_ pattern: [Character]) -> [Int] {
let m = pattern.count
if m == 0 { return [] }
var lps = Array(repeating: 0, count: m)
var length = 0
var i = 1
while i < m {
if pattern[i] == pattern[length] {
length += 1
lps[i] = length
i += 1
} else {
if length != 0 {
length = lps[length - 1]
} else {
lps[i] = 0
i += 1
}
}
}
return lps
}
// Returns 0-based match start indices
func kmpSearch(text: String, pattern: String) -> [Int] {
let t = Array(text)
let p = Array(pattern)
let n = t.count
let m = p.count
if m == 0 {
return Array(0...n) // empty pattern matches at all positions including end
}
if n == 0 || m > n {
return []
}
let lps = buildLps(p)
var matches: [Int] = []
var i = 0 // text index
var j = 0 // pattern index
while i < n {
if t[i] == p[j] {
i += 1
j += 1
if j == m {
matches.append(i - j)
j = lps[j - 1]
}
} else {
if j != 0 {
j = lps[j - 1]
} else {
i += 1
}
}
}
return matches
}
// Example:
// print(kmpSearch(text: "ababcabcabababd", pattern: "ababd")) // [10]
Common Pitfalls
1. Forgetting “proper” prefix
You cannot use the entire substring as both prefix and suffix.
2. Off-by-one errors
The fallback is:
j = lps[j - 1]
Not:
j = lps[j]
That one mistake breaks everything.
3. Empty Pattern Handling
Conventionally:
search("", "") → [0]
search("abc", "") → [0,1,2,3]
Decide your convention and document it.
4. Unicode Issues
Languages differ:
- PHP and C# index bytes unless using Unicode-aware APIs.
- Swift
Stringindexing is not O(1). - Python handles Unicode cleanly.
If doing heavy search in Swift, convert to [UInt8] or [Character].
5. Misunderstanding What LPS Represents
It does NOT mean:
“How many characters matched previously.”
It means:
“If we fail here, what is the largest prefix that we already know matches?”
That’s a subtle but critical difference.
Why KMP Matters
KMP is more than a string search algorithm.
It introduces a pattern you'll see in advanced algorithms:
Precompute structure from the input
So the main algorithm becomes linear.
This same idea appears in:
- Z-algorithm
- Aho–Corasick
- Rabin–Karp
- Suffix arrays
- Even dynamic programming patterns
Final Takeaway
KMP eliminates redundant work by remembering what the pattern already proved.
It’s deterministic. It’s linear. It’s elegant.
And once you truly understand LPS, it feels almost obvious.