String Matching with Finite Automata
Dr. Kiran K
Assistant Professor
Department of CSE
UVCE
Bengaluru, India.
Finite Automata
A Finite Automaton M is a 5-tuple (Q, q0, A, ∑, δ), where
• Q : Finite set of States
• q0 ∈ Q : Start State
• A ⊆ Q : Distinguished set of Accepting States
• ∑ : Finite Input Alphabet
• δ : Transition Function. A function from Q x ∑ into Q.
M also induces a function that is not part of the 5-tuple:
• Φ : Final-State Function from ∑* to Q such that Φ (w) is the
state M ends up in after scanning the string w.
M accepts a string w iff Φ (w) ∈ A.
Φ (ε) = q0
Φ (wa) = δ (Φ (w), a) for w ∈ ∑*, a ∈ ∑
Finite Automata…
• The finite automaton begins in state q0 and reads the characters of its input string
one at a time.
• If the automaton is in state q and reads input character a, it moves (“makes a
transition”) from state q to state δ (q, a).
• Whenever the current state q is a member of A, the machine M has accepted the
string read so far.
• An input that is not accepted is rejected.
Finite Automata…
Eg.: Finite Automaton that accepts strings ending with an odd number of a's
Q = {0, 1}
q0 = 0
A = {1}
∑ = {a, b}
Transition Function, δ
         Input
State   a   b
0       1   0
1       0   0
State Transition Diagram: State 1 is the Accepting State; Directed Edges are the Transitions.
abaaa : Enters states <0, 1, 0, 1, 0, 1>; Accepts.
abbaa : Enters states <0, 1, 0, 0, 1, 0>; Rejects.
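The two-state example can be replayed mechanically. The following is a minimal Python sketch, assuming the transition table is stored as a dictionary keyed by (state, character); the names run_dfa and DELTA are illustrative and do not come from the slides.

```python
def run_dfa(delta, start, accepting, text):
    """Return (visited_states, accepted) for a table-driven finite automaton."""
    state = start
    visited = [state]
    for ch in text:
        state = delta[(state, ch)]  # exactly one transition per input character
        visited.append(state)
    return visited, state in accepting


# Transition table of the example: state 1 is the only accepting state.
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 0}

print(run_dfa(DELTA, 0, {1}, "abaaa"))  # ([0, 1, 0, 1, 0, 1], True)
print(run_dfa(DELTA, 0, {1}, "abbaa"))  # ([0, 1, 0, 0, 1, 0], False)
```

The list of visited states is exactly the sequence written in angle brackets in the example.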
String Matching Automata
• A String Matching Automaton is constructed for a given pattern P in the
Preprocessing step.
• Suffix Function (σ):
o An auxiliary function that specifies the string-matching automaton
corresponding to a given pattern P [1 . . m].
o It maps ∑* to {0, 1, . . ., m} such that σ (x) is the length of the longest prefix
of P that is also a suffix of x: σ (x) = max {k: Pk ⊐ x},
where Pk = P [1 . . k] and w ⊐ x means "w is a suffix of x".
o P0 = ε is a suffix of every string, so σ (x) is always defined.
o For a pattern P of length m, σ (x) = m iff P ⊐ x
o x ⊐ y → σ (x) ≤ σ (y)
Eg.: P = ab
σ (ε) = 0
σ (ccaca) = 1
σ (ccab) = 2
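The suffix function can be evaluated directly from its definition. This is a brute-force Python sketch (the function name sigma is chosen here for illustration); it tries candidate prefix lengths from longest to shortest and is meant only to check small examples, not to be efficient.

```python
def sigma(pattern, x):
    """Length of the longest prefix of `pattern` that is also a suffix of `x`."""
    for k in range(min(len(pattern), len(x)), -1, -1):
        # k = 0 always succeeds because the empty prefix is a suffix of every string
        if x.endswith(pattern[:k]):
            return k


print(sigma("ab", ""))       # 0
print(sigma("ab", "ccaca"))  # 1
print(sigma("ab", "ccab"))   # 2
```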
String Matching Automata…
The String-Matching Automaton that corresponds to a given pattern P [1 . . m] is
defined as follows:
• The state set Q is {0, 1, . . ., m}.
State 0 - Start state, q0;
State m - Accepting state.
• The transition function δ is defined, for any state q and character x, as:
δ (q, x) = σ (Pq x)
Note: The transition function δ keeps track of the longest prefix of the pattern P
that has matched the text string T so far.
String Matching Automata…
Why δ (q, x) = σ (Pq x)?
• If the substring of T ending at T [i] matches some prefix Pj of P, then Pj must be a
suffix of Ti, where Ti = T [1 . . i].
• If q = Φ (Ti), the automaton is in state q after reading Ti.
• In state q, Pq ⊐ Ti and q = σ (Ti).
• Thus, δ is defined to give the length of the longest prefix of P that matches a
suffix of the text read so far.
Note: Φ (Ti) and σ (Ti) both equal q, thus the automaton maintains the invariant:
Φ (Ti) = σ (Ti)
String Matching Automata…
• If the automaton is in state q and reads the next character T [i + 1] = x, it must enter
the state σ (Ti x), the length of the longest prefix of P that is a suffix of Ti x.
• Pq is the longest prefix of P that is a suffix of Ti → σ (Pq x) is the length of the longest prefix
of P that is a suffix of Ti x.
• There are two cases:
Case 1: x = P [q + 1]
Character x continues to match the pattern.
δ (q, x) = q + 1.
Case 2: x ≠ P [q + 1]
Character x does not continue to match the pattern.
Find a smaller prefix of P that is also a suffix of Ti x.
Example
P = ababaca
δ (q, x) = σ (Pq x)
q = 0
P0 = ε ; x ∈ {a, b, c}
δ (0, a) = σ (a) = 1 (a is the longest prefix of the pattern ababaca that is a suffix of a)
δ (0, b) = σ (b) = 0 (b is not a prefix of the pattern ababaca)
δ (0, c) = σ (c) = 0 (c is not a prefix of the pattern ababaca)
Example…
P = ababaca
δ (q, x) = σ (Pq x)
q = 1
P1 = a ; x ∈ {a, b, c}
δ (1, a) = σ (aa) = 1 (a is the longest prefix of the pattern ababaca that is a suffix of aa)
δ (1, b) = σ (ab) = 2 (ab is the longest prefix of the pattern ababaca that is a suffix of ab)
δ (1, c) = σ (ac) = 0 (No non-empty suffix of ac is a prefix of the pattern ababaca)
Example…
P = ababaca
δ (q, x) = σ (Pq x)
q = 2
P2 = ab ; x ∈ {a, b, c}
δ (2, a) = σ (aba) = 3 (aba is the longest prefix of the pattern ababaca that is a suffix of aba)
δ (2, b) = σ (abb) = 0 (No non-empty suffix of abb is a prefix of the pattern ababaca)
δ (2, c) = σ (abc) = 0 (No non-empty suffix of abc is a prefix of the pattern ababaca)
Example…
P = ababaca
δ (q, x) = σ (Pq x)
q = 3
P3 = aba ; x ∈ {a, b, c}
δ (3, a) = σ (abaa) = 1 (a is the longest prefix of the pattern ababaca that is a suffix of abaa)
δ (3, b) = σ (abab) = 4 (abab is the longest prefix of the pattern ababaca that is a suffix of abab)
δ (3, c) = σ (abac) = 0 (No non-empty suffix of abac is a prefix of the pattern ababaca)
Example…
P = ababaca
δ (q, x) = σ (Pq x)
q = 4
P4 = abab ; x ∈ {a, b, c}
δ (4, a) = σ (ababa) = 5 (ababa is the longest prefix of the pattern ababaca that is a suffix of ababa)
δ (4, b) = σ (ababb) = 0 (No non-empty suffix of ababb is a prefix of the pattern ababaca)
δ (4, c) = σ (ababc) = 0 (No non-empty suffix of ababc is a prefix of the pattern ababaca)
Example…
P = ababaca
δ (q, x) = σ (Pq x)
q = 5
P5 = ababa ; x ∈ {a, b, c}
δ (5, a) = σ (ababaa) = 1 (a is the longest prefix of the pattern ababaca that is a suffix of ababaa)
δ (5, b) = σ (ababab) = 4 (abab is the longest prefix of the pattern ababaca that is a suffix of ababab)
δ (5, c) = σ (ababac) = 6 (ababac is the longest prefix of the pattern ababaca that is a suffix of ababac)
Example…
P = ababaca
δ (q, x) = σ (Pq x)
q = 6
P6 = ababac ; x ∈ {a, b, c}
δ (6, a) = σ (ababaca) = 7 (ababaca is the whole pattern, so it is the longest prefix that is a suffix of ababaca)
δ (6, b) = σ (ababacb) = 0 (No non-empty suffix of ababacb is a prefix of the pattern ababaca)
δ (6, c) = σ (ababacc) = 0 (No non-empty suffix of ababacc is a prefix of the pattern ababaca)
Example…
P = ababaca
δ (q, x) = σ (Pq x)
q = 7
P7 = ababaca ; x ∈ {a, b, c}
δ (7, a) = σ (ababacaa) = 1 (a is the longest prefix of the pattern ababaca that is a suffix of ababacaa)
δ (7, b) = σ (ababacab) = 2 (ab is the longest prefix of the pattern ababaca that is a suffix of ababacab)
δ (7, c) = σ (ababacac) = 0 (No non-empty suffix of ababacac is a prefix of the pattern ababaca)
Example…
Transition Function, δ
         Input
State   a   b   c
0       1   0   0
1       1   2   0
2       3   0   0
3       1   4   0
4       5   0   0
5       1   4   6
6       7   0   0
7       1   2   0
[Figure: State-Transition Diagram for the String-Matching Automaton that Accepts all Strings ending in the String ababaca.]
Transition Function Algorithm
COMPUTE-TRANSITION-FUNCTION (P, ∑)
m = P.length
For (q = 0 to m)
    For (Each character x ∈ ∑)
        k = min (m + 1, q + 2)
        Repeat
            k = k – 1
        Until (Pk ⊐ Pq x)
        δ (q, x) = k
Return δ
Running Time: O (m³ |∑|)
Outer For loop: m + 1 iterations (q = 0 to m).
Inner For loop: |∑| iterations for each value of q, i.e., (m + 1) x |∑| in total.
Repeat loop: at most m + 1 iterations for each pair (q, x),
with each test Pk ⊐ Pq x requiring
up to m character comparisons, i.e., O (m²) work per pair.
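The procedure above translates almost line for line into Python. The sketch below is one possible rendering, assuming 0-based Python strings (so Pk is pattern[:k]) and a table stored as a list of dictionaries; the storage layout is an assumption made here, not something the slides prescribe.

```python
def compute_transition_function(pattern, alphabet):
    """Build delta[q][x] = sigma(P_q x) with the cubic-time repeat/until loop."""
    m = len(pattern)
    delta = [{} for _ in range(m + 1)]  # one row per state 0..m
    for q in range(m + 1):
        for x in alphabet:
            k = min(m + 1, q + 2)
            while True:  # emulates Repeat ... Until
                k -= 1
                # Test "P_k is a suffix of P_q x"; always true once k reaches 0
                if (pattern[:q] + x).endswith(pattern[:k]):
                    break
            delta[q][x] = k
    return delta


for q, row in enumerate(compute_transition_function("ababaca", "abc")):
    print(q, row["a"], row["b"], row["c"])
```

Running it on P = ababaca with ∑ = {a, b, c} prints one row per state in the same layout as the transition table.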
Example
P = ababaca; m = 7; k = min (m + 1, q + 2); Test: Pk ⊐ Pq x; δ (q, x) = k
q = 0; k = min (7 + 1, 0 + 2) = min (8, 2) = 2
x = a
P0 a = a
k = 1; P1 = a; a ⊐ a holds
δ (0, a) = 1
x = b
P0 b = b
k = 1; P1 = a; a ⊐ b fails
k = 0; P0 = ε; ε ⊐ b holds
δ (0, b) = 0
x = c
P0 c = c
k = 1; P1 = a; a ⊐ c fails
k = 0; P0 = ε; ε ⊐ c holds
δ (0, c) = 0
q = 1; k = min (7 + 1, 1 + 2) = min (8, 3) = 3
x = a
P1 a = aa
k = 2; P2 = ab; ab ⊐ aa fails
k = 1; P1 = a; a ⊐ aa holds
δ (1, a) = 1
x = b
P1 b = ab
k = 2; P2 = ab; ab ⊐ ab holds
δ (1, b) = 2
x = c
P1 c = ac
k = 2; P2 = ab; ab ⊐ ac fails
k = 1; P1 = a; a ⊐ ac fails
k = 0; P0 = ε; ε ⊐ ac holds
δ (1, c) = 0
q = 2; k = min (7 + 1, 2 + 2) = min (8, 4) = 4
x = a
P2 a = aba
k = 3; P3 = aba; aba ⊐ aba holds
δ (2, a) = 3
x = b
P2 b = abb
k = 3; P3 = aba; aba ⊐ abb fails
k = 2; P2 = ab; ab ⊐ abb fails
k = 1; P1 = a; a ⊐ abb fails
k = 0; P0 = ε; ε ⊐ abb holds
δ (2, b) = 0
x = c
P2 c = abc
k = 3; P3 = aba; aba ⊐ abc fails
k = 2; P2 = ab; ab ⊐ abc fails
k = 1; P1 = a; a ⊐ abc fails
k = 0; P0 = ε; ε ⊐ abc holds
δ (2, c) = 0
q = 3; k = min (7 + 1, 3 + 2) = min (8, 5) = 5
x = a
P3 a = abaa
k = 4; P4 = abab; abab ⊐ abaa fails
k = 3; P3 = aba; aba ⊐ abaa fails
k = 2; P2 = ab; ab ⊐ abaa fails
k = 1; P1 = a; a ⊐ abaa holds
δ (3, a) = 1
x = b
P3 b = abab
k = 4; P4 = abab; abab ⊐ abab holds
δ (3, b) = 4
x = c
P3 c = abac
k = 4; P4 = abab; abab ⊐ abac fails
k = 3; P3 = aba; aba ⊐ abac fails
k = 2; P2 = ab; ab ⊐ abac fails
k = 1; P1 = a; a ⊐ abac fails
k = 0; P0 = ε; ε ⊐ abac holds
δ (3, c) = 0
q = 4; k = min (7 + 1, 4 + 2) = min (8, 6) = 6
x = a
P4 a = ababa
k = 5; P5 = ababa; ababa ⊐ ababa holds
δ (4, a) = 5
x = b
P4 b = ababb
k = 5; P5 = ababa; ababa ⊐ ababb fails
k = 4; P4 = abab; abab ⊐ ababb fails
k = 3; P3 = aba; aba ⊐ ababb fails
k = 2; P2 = ab; ab ⊐ ababb fails
k = 1; P1 = a; a ⊐ ababb fails
k = 0; P0 = ε; ε ⊐ ababb holds
δ (4, b) = 0
x = c
P4 c = ababc
k = 5; P5 = ababa; ababa ⊐ ababc fails
k = 4; P4 = abab; abab ⊐ ababc fails
k = 3; P3 = aba; aba ⊐ ababc fails
k = 2; P2 = ab; ab ⊐ ababc fails
k = 1; P1 = a; a ⊐ ababc fails
k = 0; P0 = ε; ε ⊐ ababc holds
δ (4, c) = 0
q = 5; k = min (7 + 1, 5 + 2) = min (8, 7) = 7
x = a
P5 a = ababaa
k = 6; P6 = ababac; ababac ⊐ ababaa fails
k = 5; P5 = ababa; ababa ⊐ ababaa fails
k = 4; P4 = abab; abab ⊐ ababaa fails
k = 3; P3 = aba; aba ⊐ ababaa fails
k = 2; P2 = ab; ab ⊐ ababaa fails
k = 1; P1 = a; a ⊐ ababaa holds
δ (5, a) = 1
x = b
P5 b = ababab
k = 6; P6 = ababac; ababac ⊐ ababab fails
k = 5; P5 = ababa; ababa ⊐ ababab fails
k = 4; P4 = abab; abab ⊐ ababab holds
δ (5, b) = 4
x = c
P5 c = ababac
k = 6; P6 = ababac; ababac ⊐ ababac holds
δ (5, c) = 6
q = 6; k = min (7 + 1, 6 + 2) = min (8, 8) = 8
x = a
P6 a = ababaca
k = 7; P7 = ababaca; ababaca ⊐ ababaca holds
δ (6, a) = 7
x = b
P6 b = ababacb
k = 7; P7 = ababaca; ababaca ⊐ ababacb fails
k = 6; P6 = ababac; ababac ⊐ ababacb fails
k = 5; P5 = ababa; ababa ⊐ ababacb fails
k = 4; P4 = abab; abab ⊐ ababacb fails
k = 3; P3 = aba; aba ⊐ ababacb fails
k = 2; P2 = ab; ab ⊐ ababacb fails
k = 1; P1 = a; a ⊐ ababacb fails
k = 0; P0 = ε; ε ⊐ ababacb holds
δ (6, b) = 0
x = c
P6 c = ababacc
k = 7; P7 = ababaca; ababaca ⊐ ababacc fails
k = 6; P6 = ababac; ababac ⊐ ababacc fails
k = 5; P5 = ababa; ababa ⊐ ababacc fails
k = 4; P4 = abab; abab ⊐ ababacc fails
k = 3; P3 = aba; aba ⊐ ababacc fails
k = 2; P2 = ab; ab ⊐ ababacc fails
k = 1; P1 = a; a ⊐ ababacc fails
k = 0; P0 = ε; ε ⊐ ababacc holds
δ (6, c) = 0
q = 7; k = min (7 + 1, 7 + 2) = min (8, 9) = 8
x = a
P7 a = ababacaa
k = 7; P7 = ababaca; ababaca ⊐ ababacaa fails
k = 6; P6 = ababac; ababac ⊐ ababacaa fails
k = 5; P5 = ababa; ababa ⊐ ababacaa fails
k = 4; P4 = abab; abab ⊐ ababacaa fails
k = 3; P3 = aba; aba ⊐ ababacaa fails
k = 2; P2 = ab; ab ⊐ ababacaa fails
k = 1; P1 = a; a ⊐ ababacaa holds
δ (7, a) = 1
x = b
P7 b = ababacab
k = 7; P7 = ababaca; ababaca ⊐ ababacab fails
k = 6; P6 = ababac; ababac ⊐ ababacab fails
k = 5; P5 = ababa; ababa ⊐ ababacab fails
k = 4; P4 = abab; abab ⊐ ababacab fails
k = 3; P3 = aba; aba ⊐ ababacab fails
k = 2; P2 = ab; ab ⊐ ababacab holds
δ (7, b) = 2
x = c
P7 c = ababacac
k = 7; P7 = ababaca; ababaca ⊐ ababacac fails
k = 6; P6 = ababac; ababac ⊐ ababacac fails
k = 5; P5 = ababa; ababa ⊐ ababacac fails
k = 4; P4 = abab; abab ⊐ ababacac fails
k = 3; P3 = aba; aba ⊐ ababacac fails
k = 2; P2 = ab; ab ⊐ ababacac fails
k = 1; P1 = a; a ⊐ ababacac fails
k = 0; P0 = ε; ε ⊐ ababacac holds
δ (7, c) = 0
Resulting Transition Function, δ
         Input
State   a   b   c
0       1   0   0
1       1   2   0
2       3   0   0
3       1   4   0
4       5   0   0
5       1   4   6
6       7   0   0
7       1   2   0
Suffix-Function Inequality
For any string y and character x, we have σ (yx) ≤ σ (y) + 1
Let r = σ (yx) (L1.1)
Case 1: Let r = 0 (L1.2)
(L1.1) and (L1.2) → σ (yx) = r = 0 (L1.3)
wkt σ (y) ≥ 0 (σ cannot be negative) (L1.4)
(L1.4) → 0 ≤ σ (y) + 1 (L1.5)
(L1.3) and (L1.5) → σ (yx) ≤ σ (y) + 1
[Figure: Pr is a suffix of yx; dropping the last character x from both shows that Pr – 1 is a suffix of y.]
Case 2: Let r > 0 (L1.6)
wkt σ (y) = max {k: Pk ⊐ y} (Definition of the suffix function) (L1.7)
(L1.1) and (L1.7) → Pr ⊐ yx (L1.8)
(L1.6) and (L1.8) → Pr – 1 ⊐ y (drop the last character from both Pr and yx) (L1.9)
(L1.7) and (L1.9) → r – 1 ≤ σ (y) (L1.10)
(L1.10) → r ≤ σ (y) + 1 (L1.11)
(L1.1) and (L1.11) → σ (yx) ≤ σ (y) + 1
Suffix-Function Recursion Lemma
For any string y and character x, if q = σ (y), then σ (yx) = σ (Pq x)
Wkt, σ (y) = max {k: Pk ⊐ y} (Definition of the suffix function) (L2.1)
Also, Pq x ⊐ yx (q = σ (y) → Pq ⊐ y; append x to both) (L2.2)
Let, r = σ (yx) (L2.3)
(L2.3) → Pr ⊐ yx (L2.4)
Wkt, σ (yx) ≤ σ (y) + 1 (Suffix-Function Inequality Lemma) (L2.5)
Also, q = σ (y) (Given) (L2.6)
(L2.3), (L2.5) and (L2.6) → r ≤ q + 1 (L2.7)
[Figure: Pr and Pq x are both suffixes of yx, and Pr is no longer than Pq x.]
Wkt, | Pr | = r and | Pq x | = q + 1 (L2.8)
(L2.7) and (L2.8) → | Pr | ≤ | Pq x | (L2.9)
(L2.2), (L2.4) and (L2.9) → Pr ⊐ Pq x (Overlapping-Suffix Lemma) (L2.10)
Wkt, u ⊐ v → σ (u) ≤ σ (v), and σ (Pr) = r (L2.11)
(L2.10) and (L2.11) → r ≤ σ (Pq x) (L2.12)
(L2.3) and (L2.12) → σ (yx) ≤ σ (Pq x) (L2.13)
(L2.2) and (L2.11) → σ (Pq x) ≤ σ (yx) (L2.14)
(L2.13) and (L2.14) → σ (yx) = σ (Pq x)
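Both suffix-function lemmas can be sanity-checked by brute force on short strings. The Python sketch below is only an empirical check under assumed names (sigma, check_suffix_lemmas); an exhaustive test over strings up to a fixed length is not a substitute for the proofs.

```python
from itertools import product


def sigma(pattern, x):
    """Length of the longest prefix of `pattern` that is a suffix of `x`."""
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):
            return k


def check_suffix_lemmas(pattern, alphabet, max_len):
    """Check the inequality and the recursion lemma for every y with |y| <= max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            y = "".join(chars)
            q = sigma(pattern, y)
            for x in alphabet:
                assert sigma(pattern, y + x) <= q + 1                         # inequality
                assert sigma(pattern, y + x) == sigma(pattern, pattern[:q] + x)  # recursion
    return True


print(check_suffix_lemmas("ababaca", "abc", 6))
```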
Matching Algorithm
FINITE-AUTOMATON-MATCHER (T, δ, m)
n = T.length
q = 0
For (i = 1 to n)
    q = δ (q, T [i])
    If (q == m)
        Print “Pattern occurs with shift” i – m
Running Time: Θ (n)
T = abababacaba; P = ababaca; Update rule: q = δ (q, T [i])
n = T.length = 11; m = 7
q = 0
i = 1, q = δ (0, T [1]) = δ (0, a) = 1
q ≠ m → i = 2, q = δ (1, T [2]) = δ (1, b) = 2
q ≠ m → i = 3, q = δ (2, T [3]) = δ (2, a) = 3
q ≠ m → i = 4, q = δ (3, T [4]) = δ (3, b) = 4
q ≠ m → i = 5, q = δ (4, T [5]) = δ (4, a) = 5
q ≠ m → i = 6, q = δ (5, T [6]) = δ (5, b) = 4
q ≠ m → i = 7, q = δ (4, T [7]) = δ (4, a) = 5
q ≠ m → i = 8, q = δ (5, T [8]) = δ (5, c) = 6
q ≠ m → i = 9, q = δ (6, T [9]) = δ (6, a) = 7
q = m → Pattern occurs with shift i – m = 9 – 7 = 2
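The matching loop can be run end to end in Python. In this sketch the helper build_delta is a hypothetical stand-in that fills the table straight from δ (q, x) = σ (Pq x); it assumes every text character belongs to the alphabet passed in, otherwise the dictionary lookup raises KeyError.

```python
def build_delta(pattern, alphabet):
    """delta[q][x] = length of the longest prefix of pattern that is a suffix of P_q x."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for x in alphabet:
            s = pattern[:q] + x
            k = min(m, q + 1)
            while not s.endswith(pattern[:k]):  # stops at k = 0 at the latest
                k -= 1
            row[x] = k
        delta.append(row)
    return delta


def finite_automaton_matcher(text, delta, m):
    """Return every shift s such that the pattern occupies T[s+1 .. s+m] (1-based)."""
    shifts = []
    q = 0
    for i, ch in enumerate(text, start=1):  # i is the 1-based text position
        q = delta[q][ch]
        if q == m:
            shifts.append(i - m)
    return shifts


delta = build_delta("ababaca", "abc")
print(finite_automaton_matcher("abababacaba", delta, 7))  # [2]
```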
Theorem
If Φ is the final-state function of a string-matching automaton for a given pattern P and
T [1 . . n] is an input text for the automaton, then Φ (Ti) = σ (Ti) for i = 0, . . . , n.
Proof: By induction on i.
Basis: For i = 0
T0 = ε
→ Φ (T0) = 0 = σ (T0)
Assumption: Let Φ (Ti) = σ (Ti)
Induction:
Let q = Φ (Ti) and
x = T [i + 1]
Φ (Ti + 1) = Φ (Ti x)
= δ (Φ (Ti), x) (Definition of Φ)
= δ (q, x) (Φ (Ti) = q)
= σ (Pq x) (δ (q, x) = σ (Pq x))
= σ (Ti x) (Suffix-function recursion lemma, with q = σ (Ti) by the assumption)
= σ (Ti + 1)
When the machine enters state q in the algorithm by executing the statement
q = δ (q, T [i]), q is the largest value such that Pq ⊐ Ti.
→ q = m iff the machine has just scanned an occurrence of the pattern P.
Thus the FINITE-AUTOMATON-MATCHER operates correctly.
Knuth-Morris-Pratt Algorithm
Introduction
• This algorithm avoids computing the transition function δ.
• It uses an auxiliary function Π, called the Prefix Function, which is precomputed
from the pattern of length m, and stored in an array Π [1 . . m].
• For any state q = 0, 1, . . . , m and any character x ∈ ∑, the value Π [q] contains
the information needed to compute δ (q, x) but does not depend on x.
• Array Π contains only m entries, whereas δ has Θ (m |∑|) entries. Hence,
computing Π saves a factor of |∑| in the preprocessing time compared to
computing δ.
• The algorithm has a preprocessing time of Θ (m) and a matching time of Θ (n).
Prefix Function
The prefix function Π for a pattern encapsulates knowledge about how the pattern
matches against shifts of itself. This information helps to avoid testing useless
shifts in the naive pattern-matching algorithm and to avoid precomputing the full
transition function δ for a string-matching automaton.
Eg.: T = bacbababaabcbab; P = ababaca
• Consider the shift s = 4 of a Naïve String Matcher, which aligns P [1] with T [5].
• q = 5 characters have matched successfully (P [1 . . 5] = T [5 . . 9] = ababa), but the 6th
pattern character fails to match (P [6] = c, T [10] = a).
Prefix Function…
• The information that q characters have matched successfully determines the
corresponding text characters.
• Knowing these q text characters allows us to determine immediately that certain
shifts are invalid. Eg.: s + 1 is invalid as it aligns the first character of the pattern,
‘a’, with a text character that is known to be ‘b’.
• The shift s ʹ = s + 2 aligns the first 3 characters of the pattern with 3 text characters that
are known to match.
• s ʹ can be computed by answering the following question:
o If pattern characters P [1 . . q] match text characters T [s + 1 . . s + q], what is the
least shift s ʹ > s such that for some k < q, P [1 . . k] = T [s ʹ + 1 . . s ʹ + k],
where s ʹ + k = s + q?
o i.e., if Pq ⊐ Ts + q is known, find the longest proper prefix Pk of Pq that is also a suffix of
Ts + q.
o The new shift s ʹ is found by adding the difference (q – k) to s:
s ʹ = s + (q – k)
Note:
• In the best case k = 0, so that s ʹ = s + q and all the shifts s + 1, s + 2, . . . , s + q – 1 are
ruled out.
• At the new shift s ʹ, the first k characters of P need not be compared with the
corresponding characters of T because s ʹ is chosen so that
P [1 . . k] = T [s ʹ + 1 . . s ʹ + k].
Prefix Function…
s ʹ can be computed by comparing the pattern with itself as follows:
• Since T [s ʹ + 1 . . s ʹ + k] is part of the known (already matched) portion of the text,
it is a suffix of the string Pq.
• Hence, P [1 . . k] = T [s ʹ + 1 . . s ʹ + k] → Find the greatest k < q such that
Pk ⊐ Pq.
Formal Definition:
Given a pattern P [1 . . m], the prefix function for the pattern P is the function
Π : {1, 2, . . . , m} → {0, 1, . . . , m – 1} such that Π [q] = max {k: k < q and Pk ⊐ Pq}.
Note:
It is convenient to store, for each value of q, the number k of matching characters at
the new shift s ʹ, rather than the shift amount s ʹ – s.
Example
Pattern: ababaca; Π [q] = max {k: k < q and Pk ⊐ Pq}
i = 1; q = 1; P1 = a; k < 1; Pk = ε
Π [1] = 0 (ε is the only prefix of size < 1)
i = 2; q = 2; P2 = ab; k < 2; Pk = ε
Π [2] = 0 (No prefix of size < 2, (a), is a suffix of ab except ε)
i = 3; q = 3; P3 = aba; k < 3; Pk = a
Π [3] = 1 (a is the longest prefix with size < 3 that is a suffix of aba)
i = 4; q = 4; P4 = abab; k < 4; Pk = ab
Π [4] = 2 (ab is the longest prefix with size < 4 that is a suffix of abab)
i = 5; q = 5; P5 = ababa; k < 5; Pk = aba
Π [5] = 3 (aba is the longest prefix with size < 5 that is a suffix of ababa)
i = 6; q = 6; P6 = ababac; k < 6; Pk = ε
Π [6] = 0 (No prefix of size < 6, (ababa, abab, aba, ab, a), is a suffix of ababac except ε)
i = 7; q = 7; P7 = ababaca; k < 7; Pk = a
Π [7] = 1 (a is the longest prefix with size < 7 that is a suffix of ababaca)
i   Π [i]
1   0
2   0
3   1
4   2
5   3
6   0
7   1
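The values in this example follow directly from the formal definition Π [q] = max {k: k < q and Pk is a suffix of Pq}. A naive Python sketch of that definition (the name prefix_function_naive is illustrative) reproduces them; it is quadratic or worse and is intended only as a cross-check.

```python
def prefix_function_naive(pattern):
    """Return [pi(1), ..., pi(m)] computed straight from the definition."""
    m = len(pattern)
    result = []
    for q in range(1, m + 1):
        pq = pattern[:q]
        # k ranges over 0 .. q-1; k = 0 always qualifies, so max() is never empty
        result.append(max(k for k in range(q) if pq.endswith(pattern[:k])))
    return result


print(prefix_function_naive("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```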
Prefix Function Algorithm
COMPUTE-PREFIX-FUNCTION (P)
m = P.length
Let Π [1 . . m] be a new array
Π [1] = 0
k = 0
For (q = 2 to m)
    While (k > 0 and P [k + 1] ≠ P [q])
        k = Π [k]
    If (P [k + 1] == P [q])
        k = k + 1
    Π [q] = k
Return Π
Running Time: Θ (m)
1. k starts at 0 and the For loop increases k at most once
per iteration (in the If statement), hence the total
increase in k is at most m – 1.
2. k < q throughout, ensuring that Π [q] < q for every q; hence
Π [k] < k and each iteration of the While
loop strictly decreases k.
3. k never becomes negative.
Putting these facts together, the total decrease in k from the While
loop is bounded from above by the total
increase in k over all iterations of the For
loop, which is m – 1. Hence the While loop iterates at most m – 1 times in total.
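The linear-time procedure can be transcribed into Python as follows. This sketch assumes a padded list so that pi[q] keeps the 1-based indexing of the pseudocode; because Python strings are 0-based, P [k + 1] becomes pattern[k] and P [q] becomes pattern[q - 1].

```python
def compute_prefix_function(pattern):
    """Return pi with pi[q] for q = 1..m; pi[0] is unused padding."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        # Fall back through shorter borders while the next character disagrees
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca")[1:])  # [0, 0, 1, 2, 3, 0, 1]
```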
Example
P = ababaca
Π [1] = 0; k = 0;
q = 2; k = 0;
While loop: not executed (k = 0)
If condition: Fails (a ≠ b; (P [1] ≠ P [2]))
Π [2] = 0
q = 3; k = 0;
While loop: not executed (k = 0)
If condition: (a == a; (P [1] == P [3])); → k = 0 + 1 = 1
Π [3] = 1
q = 4; k = 1;
While loop: not executed (b == b; (P [2] == P [4]))
If condition: (b == b; (P [2] == P [4])); → k = 1 + 1 = 2
Π [4] = 2
q = 5; k = 2;
While loop: not executed (a == a; (P [3] == P [5]))
If condition: (a == a; (P [3] == P [5])); → k = 2 + 1 = 3
Π [5] = 3
q = 6; k = 3;
While loop: b ≠ c; (P [4] ≠ P [6]); → k = Π [3] = 1
b ≠ c; (P [2] ≠ P [6]); → k = Π [1] = 0
If condition: Fails (a ≠ c; (P [1] ≠ P [6]))
Π [6] = 0
q = 7; k = 0;
While loop: not executed (k = 0)
If condition: (a == a; (P [1] == P [7])); → k = 0 + 1 = 1
Π [7] = 1
Result: Π [1 . . 7] = <0, 0, 1, 2, 3, 0, 1>
Matching Algorithm
KMP-MATCHER (T, P)
n = T.length
m = P.length
Π = COMPUTE-PREFIX-FUNCTION (P)
q = 0
For (i = 1 to n)
    While (q > 0 and P [q + 1] ≠ T [i])
        q = Π [q]
    If (P [q + 1] == T [i])
        q = q + 1
    If (q == m)
        Print “Pattern occurs with shift” i – m
        q = Π [q]
Running Time: Θ (n) for matching, after Θ (m) preprocessing
1. The For loop increases the value of q at most once
per iteration and hence the total
increase in q is at most n.
2. Π [q] < q, which means that each iteration
of the While loop strictly decreases q.
3. q never becomes negative.
Putting these facts together, the
total decrease in q from the While loop is
bounded from above by the total increase in q
over all iterations of the For loop, which is n.
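A compact Python sketch of the matcher follows. It reports each occurrence as the shift i – m, the convention used by FINITE-AUTOMATON-MATCHER, so a shift s means the pattern occupies T [s + 1 . . s + m]. Returning an empty list for an empty pattern is an assumption made here, since the pseudocode does not cover that case.

```python
def compute_prefix_function(pattern):
    """pi[q] for q = 1..m, with pi[0] as unused padding (1-based like the pseudocode)."""
    m = len(pattern)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and pattern[k] != pattern[q - 1]:
            k = pi[k]
        if pattern[k] == pattern[q - 1]:
            k += 1
        pi[q] = k
    return pi


def kmp_matcher(text, pattern):
    """Return all shifts s with pattern == text[s : s + m]."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []  # assumption: an empty pattern is treated as "no matches"
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0  # number of pattern characters matched so far
    for i in range(1, n + 1):
        while q > 0 and pattern[q] != text[i - 1]:
            q = pi[q]  # next character does not match: fall back
        if pattern[q] == text[i - 1]:
            q += 1
        if q == m:
            shifts.append(i - m)
            q = pi[q]  # keep looking for further (possibly overlapping) matches
    return shifts


print(kmp_matcher("bacbababaababacababa", "ababaca"))  # [9]
```

Resetting q to pi[q] after a match keeps the index pattern[q] in range on the next iteration.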
Example
T: bacbababaababacababa; n = 20
P: ababaca; m = 7
Π [1 . . 7] = <0, 0, 1, 2, 3, 0, 1>
i = 1; q = 0
While loop: not executed (q = 0)
If condition: Fails (a ≠ b; (P [1] ≠ T [1]))
q ≠ m
i = 2; q = 0
While loop: not executed (q = 0)
If condition: (a == a; (P [1] == T [2])); q = 0 + 1 = 1
q ≠ m
i = 3; q = 1
While loop: (b ≠ c; (P [2] ≠ T [3])); q = Π [1] = 0
If condition: Fails (a ≠ c; (P [1] ≠ T [3]))
q ≠ m
i = 4; q = 0
While loop: not executed (q = 0)
If condition: Fails (a ≠ b; (P [1] ≠ T [4]))
q ≠ m
i = 5; q = 0
While loop: not executed (q = 0)
If condition: (a == a; (P [1] == T [5])); q = 0 + 1 = 1
q ≠ m
i = 6; q = 1
While loop: not executed (b == b; (P [2] == T [6]))
If condition: (b == b; (P [2] == T [6])); q = 1 + 1 = 2
q ≠ m
i = 7; q = 2
While loop: not executed (a == a; (P [3] == T [7]))
If condition: (a == a; (P [3] == T [7])); q = 2 + 1 = 3
q ≠ m
i = 8; q = 3
While loop: not executed (b == b; (P [4] == T [8]))
If condition: (b == b; (P [4] == T [8])); q = 3 + 1 = 4
q ≠ m
i = 9; q = 4
While loop: not executed (a == a; (P [5] == T [9]))
If condition: (a == a; (P [5] == T [9])); q = 4 + 1 = 5
q ≠ m
i = 10; q = 5
While loop: (c ≠ a; (P [6] ≠ T [10])); q = Π [5] = 3
(b ≠ a; (P [4] ≠ T [10])); q = Π [3] = 1
(b ≠ a; (P [2] ≠ T [10])); q = Π [1] = 0
If condition: (a == a; (P [1] == T [10])); q = 0 + 1 = 1
q ≠ m
i = 11; q = 1
While loop: not executed (b == b; (P [2] == T [11]))
If condition: (b == b; (P [2] == T [11])); q = 1 + 1 = 2
q ≠ m
i = 12; q = 2
While loop: not executed (a == a; (P [3] == T [12]))
If condition: (a == a; (P [3] == T [12])); q = 2 + 1 = 3
q ≠ m
i = 13; q = 3
While loop: not executed (b == b; (P [4] == T [13]))
If condition: (b == b; (P [4] == T [13])); q = 3 + 1 = 4
q ≠ m
i = 14; q = 4
While loop: not executed (a == a; (P [5] == T [14]))
If condition: (a == a; (P [5] == T [14])); q = 4 + 1 = 5
q ≠ m
i = 15; q = 5
While loop: not executed (c == c; (P [6] == T [15]))
If condition: (c == c; (P [6] == T [15])); q = 5 + 1 = 6
q ≠ m
i = 16; q = 6
While loop: not executed (a == a; (P [7] == T [16]))
If condition: (a == a; (P [7] == T [16])); q = 6 + 1 = 7
q == m
Pattern occurs with shift i – m = 16 – 7 = 9 (P matches T [10 . . 16])
q = Π [7] = 1
For i = 17 . . 20 the scan continues from q = 1 and q never reaches m again, so this is the only occurrence.
Prefix-Function Iteration Lemma
Let P be a pattern of length m with prefix function Π. Then for q = 1, 2, . . . , m, we have
Π* [q] = {k: k < q and Pk ⊐ Pq}
Proof:
Step 1: Prove Π* [q] ⊆ {k: k < q and Pk ⊐ Pq}, i.e., i ∈ Π* [q] → i < q and Pi ⊐ Pq – By Induction
i ∈ Π* [q] (Given) (L1.1)
(L1.1) → i = Π^(u) [q] for some u > 0 (L1.2)
Basis: u = 1 (L1.3)
(L1.2) and (L1.3) → i = Π [q] (L1.4)
wkt, Π [q] < q, so i < q (L1.5)
Π [q] = max {k: k < q and Pk ⊐ Pq} (L1.6)
(L1.4), (L1.5) and (L1.6) → PΠ [q] ⊐ Pq (L1.7)
Induction: (L1.6) applied to i → Π [i] < i (L1.8)
(L1.6) and (L1.8) → PΠ [i] ⊐ Pi (L1.9)
wkt, < and ⊐ are transitive (L1.10)
(L1.8), (L1.9) and (L1.10) → if i < q and Pi ⊐ Pq, then Π [i] < q and PΠ [i] ⊐ Pq, so the claim holds for all i ∈ Π* [q] (L1.11)
(L1.7) and (L1.11) → Π* [q] ⊆ {k: k < q and Pk ⊐ Pq} (L1.12)
Step 2: Prove {k: k < q and Pk ⊐ Pq} ⊆ Π* [q] – By Contradiction
Assume {k: k < q and Pk ⊐ Pq} – Π* [q] is non-empty (L1.13)
Let j be the largest integer in the set in (L1.13) (L1.14)
(L1.6) → Π [q] is the largest value in {k: k < q and Pk ⊐ Pq} (L1.15)
wkt, Π [q] ∈ Π* [q] (Definition of Π* [q]) (L1.16)
(L1.14), (L1.15) and (L1.16) → j < Π [q] (L1.17)
(L1.13) → j ∉ Π* [q] (L1.18)
From (L1.16) and (L1.17), let jʹ denote the smallest integer in Π* [q] that is greater than j (L1.19)
(L1.13) and (L1.14) → Pj ⊐ Pq (L1.20)
(L1.12) and (L1.19) → Pjʹ ⊐ Pq (L1.21)
(L1.19), (L1.20) and (L1.21) → Pj ⊐ Pjʹ (Overlapping-Suffix Lemma) (L1.22)
(L1.14), (L1.19) and (L1.22) → j is the largest value < jʹ such that Pj ⊐ Pjʹ (a larger such value would lie in the set of (L1.13), contradicting (L1.14)) (L1.23)
(L1.23) → Π [jʹ] = j (L1.24)
(L1.19) and (L1.24) → j ∈ Π* [q] (L1.25)
(L1.25) contradicts (L1.18) (L1.26)
(L1.26) → {k: k < q and Pk ⊐ Pq} ⊆ Π* [q] (L1.27)
(L1.12) and (L1.27) → Π* [q] = {k: k < q and Pk ⊐ Pq}
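The statement of the lemma can be checked on the running example. In the Python sketch below, PI hard-codes the prefix function of ababaca from the example table (with a padding entry at index 0), and the helper names pi_star and border_lengths are illustrative. The check covers one pattern only and is not a proof.

```python
PATTERN = "ababaca"
PI = [0, 0, 0, 1, 2, 3, 0, 1]  # PI[q] for q = 1..7; PI[0] is padding


def pi_star(pi, q):
    """Iterate pi starting from q and collect the values until 0 is reached."""
    values = []
    while q > 0:
        q = pi[q]
        values.append(q)
    return values


def border_lengths(pattern, q):
    """The set {k : k < q and P_k is a suffix of P_q}, straight from the definition."""
    pq = pattern[:q]
    return {k for k in range(q) if pq.endswith(pattern[:k])}


for q in range(1, len(PATTERN) + 1):
    assert set(pi_star(PI, q)) == border_lengths(PATTERN, q)

print(pi_star(PI, 5))  # [3, 1, 0]
```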
Lemma
Let P be a pattern of length m and Π be the prefix function for P. For q = 1, 2, . . . , m, if
Π [q] > 0 then Π [q] – 1 ∈ Π* [q – 1]
Proof:
Let r = Π [q] > 0, so that r < q and Pr ⊐ Pq (L2.1)
(L2.1) → r – 1 < q – 1 and Pr – 1 ⊐ Pq – 1 (drop the last character from Pr and Pq, which is possible since r > 0) (L2.2)
(L2.2) → r – 1 ∈ Π* [q – 1] (Prefix-Function Iteration Lemma) (L2.3)
(L2.1) and (L2.3) → Π [q] – 1 ∈ Π* [q – 1]
Corollary
Let P be a pattern of length m and let Π be the prefix function for P. For q = 2, 3, . . . , m,
Π [q] = 0 if E q – 1 = Ø
Π [q] = 1 + max {k ∈ E q – 1} if E q – 1 ≠ Ø
Proof:
Case 1: E q – 1 = Ø (C1.1)
(C1.1) → there is no k ∈ Π* [q – 1] (including k = 0) for which Pk
can be extended to Pk + 1 and get a proper suffix of Pq (C1.2)
(C1.2) → Π [q] = 0 (C1.3)
Case 2: E q – 1 ≠ Ø (C1.4)
(C1.4) → for each k ∈ E q – 1,
k + 1 < q and Pk + 1 ⊐ Pq (C1.5)
Π [q] = max {k: k < q and Pk ⊐ Pq} (C1.6)
(C1.5) and (C1.6) → Π [q] ≥ 1 + max {k ∈ E q – 1} (C1.7)
(C1.7) → Π [q] > 0 (C1.8)
Let r = Π [q] – 1 (C1.9)
(C1.9) → r + 1 = Π [q] (C1.10)
(C1.6) and (C1.10) → Pr + 1 ⊐ Pq (C1.11)
(C1.8) and (C1.10) → r + 1 > 0 (C1.12)
(C1.11) and (C1.12) → P [r + 1] = P [q] (C1.13)
wkt, if Π [q] > 0 then Π [q] – 1 ∈ Π* [q – 1] (Lemma) (C1.14)
(C1.8), (C1.9) and (C1.14) → r ∈ Π* [q – 1] (C1.15)
(C1.13) and (C1.15) → r ∈ E q – 1 (C1.16)
(C1.16) → r ≤ max {k ∈ E q – 1} (C1.17)
Adding 1 on both sides of (C1.17), we get, r + 1 ≤ 1 + max {k ∈ E q – 1} (C1.18)
(C1.10) and (C1.18) → Π [q] ≤ 1 + max {k ∈ E q – 1} (C1.19)
(C1.7) and (C1.19) → Π [q] = 1 + max {k ∈ E q – 1} (C1.20)
(C1.3) and (C1.20) → Π [q] = 0 if E q – 1 = Ø, and Π [q] = 1 + max {k ∈ E q – 1} if E q – 1 ≠ Ø
Correctness of Compute Prefix Function
The following observations refer to the procedure COMPUTE-PREFIX-FUNCTION (P):
1. k = Π [q – 1] at the start of each iteration of the For
loop.
• Π [1] = 0 and k = 0 when the loop is first entered.
• In successive iterations it holds because each iteration ends with Π [q] = k.
2. The While loop and the If condition adjust k so that
it becomes the correct value of Π [q].
3. The While loop searches through all values k ∈ Π* [q – 1] in decreasing order
until it finds a value of k for which P [k + 1] = P [q].
• At that point k is the largest value in the set E q – 1.
• → Π [q] = k + 1. (Corollary)
4. If the While loop cannot find a k ∈ Π* [q – 1] such
that P [k + 1] = P [q], then k = 0 when the loop ends.
5. In that case, if P [1] = P [q] then both k and Π [q] have to be set
to 1; otherwise k is left unchanged and Π [q] is set to 0. The If statement
and the assignment Π [q] = k handle both cases correctly.
6. Thus the function computes Π correctly.
Similarity between Prefix function and Transition Function
Upon reading a character x = T [i] in a state q, both procedures move to the new state δ (q, x).
String Matching Automaton:
• If (x = P [q + 1]), x continues to match the pattern and δ (q, x) = q + 1.
• If (x ≠ P [q + 1]), x does not continue to match the pattern and 0 ≤ δ (q, x) ≤ q.
KMP Matcher:
• If (x = P [q + 1]), x continues to match the pattern and it reaches state q + 1
without referring to Π.
o Since (T [i] = P [q + 1]), the while loop test fails but the if condition is true and
thus it increments q to q + 1.
• If (x ≠ P [q + 1]), x does not continue to match the pattern and the new state is
either q or to the left of q along the spine of the automaton.
o The while loop iterates through the states in Π* [q], stopping either when it
arrives in a state, say q ʹ, such that x matches P [q ʹ + 1], or when q ʹ has gone all the
way down to 0.
o If (x = P [q ʹ + 1]) then the new state is set to q ʹ + 1, which is equal to δ (q, x);
otherwise q ʹ = 0 and the new state is 0 = δ (q, x).
Similarity between Prefix function and Transition Function…
Example: Pattern ababaca
State q = 5
1. Next Character – c
• While loop: P [6] = c == c; the test fails and the loop is not executed.
• If condition: c == c; true; q moves to state 6 = δ (5, c).
2. Next Character – b
• While loop: (1) c ≠ b → q = Π [5] = 3;
(2) P [4] = b == b; the loop ends.
• If condition: b == b; true; q moves to state 4 = δ (5, b).
3. Next Character – a
• While loop: (1) c ≠ a → q = Π [5] = 3 (1st State in Π* [5]);
(2) b ≠ a → q = Π [3] = 1 (2nd State in Π* [5]);
(3) b ≠ a → q = Π [1] = 0 (3rd State in Π* [5]); the loop ends since q = 0.
• If condition: P [1] = a == a; true; q moves to state 1 = δ (5, a).
Transition Function δ and Prefix Function Π
     Input
q   a   b   c   Π [q]
0   1   0   0   –
1   1   2   0   0
2   3   0   0   0
3   1   4   0   1
4   5   0   0   2
5   1   4   6   3
6   7   0   0   0
7   1   2   0   1
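The correspondence between Π and δ can be made concrete by simulating one KMP step and comparing the result with the table. In this Python sketch the function name kmp_step is hypothetical, PI hard-codes the Π column of the table (index 0 is padding), and state m is handled by first applying q = Π [q], as KMP-MATCHER does after reporting a match.

```python
PATTERN = "ababaca"
PI = [0, 0, 0, 1, 2, 3, 0, 1]  # PI[q] for q = 1..7; PI[0] is padding


def kmp_step(pattern, pi, q, x):
    """State reached from q on character x, using only the prefix function."""
    if q == len(pattern):  # after a full match the matcher first resets q = pi[q]
        q = pi[q]
    while q > 0 and pattern[q] != x:
        q = pi[q]  # walk down the states in pi*[q]
    if pattern[q] == x:
        q += 1
    return q


table = [[kmp_step(PATTERN, PI, q, x) for x in "abc"] for q in range(len(PATTERN) + 1)]
for q, row in enumerate(table):
    print(q, *row)
```

Each printed row lists the states reached on a, b and c, which can be compared with the δ columns of the table.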
Correctness of KMP-Matcher
KMP-MATCHER (T, P)
n = T.length
m = P.length
Π = COMPUTE-PREFIX-FUNCTION (P)
q = 0
For (i = 1 to n)
While (q > 0 and P [q + 1] ≠ T [i])
q = Π [q]
If (P [q + 1] == T [i])
q = q + 1
If (q == m)
Print “Pattern occurs with shift” i – m
q = Π [q]
I: Show that q = σ (Ti) with regard to the for loop of
KMP-Matcher by Induction:
Basis:
Initially, both the procedures set q to 0 which is σ (T0).
Assumption:
qʹ = σ (Ti – 1), where qʹ is the state at the start of the for
loop.
Induction:
When the character Ti is considered, the longest prefix
of P that is a suffix of Ti is:
• Pqʹ + 1 (If P [qʹ + 1] = T [i])
• Some prefix of Pqʹ
FINITE-AUTOMATON-MATCHER (T, δ, m)
n = T.length
q = 0
For (i = 1 to n)
q = δ (q, T [i])
If (q == m)
Print “Pattern occurs with shift” i – m
Correctness of KMP-Matcher…
KMP-MATCHER (T, P)
n = T.length
m = P.length
Π = COMPUTE-PREFIX-FUNCTION (P)
q = 0
For (i = 1 to n)
While (q > 0 and P [q + 1] ≠ T [i])
q = Π [q]
If (P [q + 1] == T [i])
q = q + 1
If (q == m)
Print “Pattern occurs with shift” i – m+1
q = Π [q]
There are 3 cases:
Case 1: σ (Ti) = 0
• P0 = ε is the only prefix of P that is a suffix of Ti.
• The while loop iterates through the values in Π* [qʹ].
• Although Pq Ti for every q Є Π* [qʹ], the loop
never finds a q such that P [qʹ + 1] = T [i].
• The loop terminates when q reaches 0.
• The if condition fails since P [qʹ + 1] ≠ T [i].
• Hence, q = 0 = σ (Ti).
Case 2: σ (Ti) = qʹ + 1
• P [qʹ + 1] = T [i].
• While loop fails.
• If condition is true, and hence q gets incremented to
qʹ + 1 = σ (Ti).
FINITE-AUTOMATON-MATCHER (T, δ, m)
n = T.length
q = 0
For (i = 1 to n)
q = δ (q, T [i])
If (q == m)
Print “Pattern occurs with shift” i – m
Correctness of KMP-Matcher…
KMP-MATCHER (T, P)
n = T.length
m = P.length
Π = COMPUTE-PREFIX-FUNCTION (P)
q = 0
For (i = 1 to n)
While (q > 0 and P [q + 1] ≠ T [i])
q = Π [q]
If (P [q + 1] == T [i])
q = q + 1
If (q == m)
Print “Pattern occurs with shift” i – m+1
q = Π [q]
Case 3: 0 < σ (Ti) ≤ qʹ (3.1)
• The while loop iterates at least once, checking in
decreasing order each value q Є Π* [qʹ] until it stops
at some q < qʹ.
• → Pq is the longest prefix of Pqʹ for which
P [qʹ + 1] = T [i].
• When the while loop terminates q + 1 = σ (Pqʹ T [i]).
(Transition of state q to the next state which is to the
left of q along the spine). (3.2)
• (3.1) → qʹ = σ (Ti – 1) (3.3)
• (3.3) → σ (Ti – 1 T [i]) = σ (Pqʹ T [i]) (3.4)
• (3.2) and (3.3) → q + 1 = σ (Ti – 1 T [i])
= σ (Ti) (3.5)
• (3.5) → q = σ (T [i]) – 1 when while loop terminates.
• The if condition increments q so tat q + 1 = σ (Ti).
FINITE-AUTOMATON-MATCHER (T, δ, m)
n = T.length
q = 0
For (i = 1 to n)
q = δ (q, T [i])
If (q == m)
Print “Pattern occurs with shift” i – m
Correctness of KMP-Matcher…
After an occurrence of the pattern is found, q is reset to Π [q];
otherwise the next iteration would test P [m + 1], which lies
beyond the end of the pattern.
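The effect of the reset can be seen in a small Python sketch, assuming 0-indexed strings and a non-empty pattern. The flag reset_after_match is hypothetical and exists only for illustration: with the reset, overlapping occurrences are reported; without it, the next iteration indexes past the end of the pattern, which in Python raises IndexError.

```python
def prefix_function(pattern):
    """pi[q] = max{k < q : P_k is a suffix of P_q}, straight from the definition."""
    m = len(pattern)
    return [0] + [
        max(k for k in range(q) if pattern[:q].endswith(pattern[:k]))
        for q in range(1, m + 1)
    ]


def kmp_shifts(text, pattern, reset_after_match=True):
    """KMP matching loop; reset_after_match=False omits the final q = pi[q]."""
    m = len(pattern)
    pi = prefix_function(pattern)
    shifts, q = [], 0
    for i, ch in enumerate(text, start=1):
        # pattern[q] is P[q + 1]; with q == m this index is out of range
        while q > 0 and pattern[q] != ch:
            q = pi[q]
        if pattern[q] == ch:
            q += 1
        if q == m:
            shifts.append(i - m)
            if reset_after_match:
                q = pi[q]
    return shifts


print(kmp_shifts("ababa", "aba"))  # [0, 2]
```

In "ababa" the two occurrences of "aba" overlap in one character; after the first match q falls back to Π [3] = 1, so the shared "a" is reused for the second match.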
II. Show that σ (T_i) = Φ (T_i)
Basis:
For i = 0, T_0 = ε
→ Φ (T_0) = 0 = σ (T_0)
Assumption:
Let Φ (T_i) = σ (T_i)
Induction:
Let q = Φ (T_i) and
x = T [i + 1]
Theorem…
Φ (T_{i+1}) = Φ (T_i x) (T_{i+1} = T_i x)
= δ (Φ (T_i), x) (Definition of Φ)
= δ (q, x) (Φ (T_i) = q)
= σ (P_q x) (δ (q, x) = σ (P_q x))
= σ (T_i x) (Suffix-function recursion lemma, since q = σ (T_i))
= σ (T_{i+1})
Whenever the machine enters state q after reading T [i], q is the largest value such that P_q ⊐ T_i.
→ q = m iff the machine has just scanned an occurrence of the pattern P.
By Part I, KMP-MATCHER maintains the same value q = σ (T_i). Thus the KMP-MATCHER operates correctly.
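The identity Φ (T_i) = σ (T_i) can be illustrated with a short Python sketch. It assumes 0-indexed strings and the alphabet {a, b, c}, builds δ directly from the definition δ (q, x) = σ (P_q x), and then runs the automaton over a text. The function names are illustrative, and the brute-force construction is chosen for clarity, not speed.

```python
def sigma(pattern, x):
    """Suffix function: length of the longest prefix of pattern that is a suffix of x."""
    for k in range(min(len(pattern), len(x)), -1, -1):
        if x.endswith(pattern[:k]):  # k = 0 always succeeds (empty prefix)
            return k


def compute_transition_function(pattern, alphabet):
    """delta[q][x] = sigma(P_q x) for every state q and character x."""
    return [
        {x: sigma(pattern, pattern[:q] + x) for x in alphabet}
        for q in range(len(pattern) + 1)
    ]


def final_states(text, delta):
    """Return [phi(T_0), phi(T_1), ..., phi(T_n)]."""
    states = [0]
    for ch in text:
        states.append(delta[states[-1]][ch])
    return states


pattern, text = "ababaca", "abababacaba"
delta = compute_transition_function(pattern, "abc")
phi = final_states(text, delta)

# phi(T_i) equals sigma(T_i) for every prefix T_i of the text
print(phi == [sigma(pattern, text[:i]) for i in range(len(text) + 1)])  # True

# state m is reached exactly where an occurrence of the pattern ends
print([i - len(pattern) for i, q in enumerate(phi) if q == len(pattern)])  # [2]
```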
Appendix
ε : Empty String
Ø : Empty Language
∑ : Finite Input Alphabet
∑* : Language of all finite-length strings over ∑
Eg.: If ∑ = {0, 1}, then ∑* = {ε, 0, 1, 00, 01, 10, 11, 000, . . .} is the set
of all binary strings.
x ⊐ y : x is a suffix of y
Overlapping-Suffix Lemma:
Suppose that x, y, and z are strings such that x ⊐ z and y ⊐ z.
If |x| ≤ |y|, then x ⊐ y. If |x| ≥ |y|, then y ⊐ x. If |x| = |y|, then x = y.
Appendix…
Π* [q]:
Π* [q] = {Π [q], Π^(2) [q], Π^(3) [q], . . ., Π^(t) [q]}, where Π^(i) [q] is defined by
functional iteration, so that Π^(0) [q] = q and Π^(i) [q] = Π [Π^(i–1) [q]] for i ≥ 1, and where
the sequence in Π* [q] stops upon reaching Π^(t) [q] = 0.
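As a small illustration, the iteration can be written in Python as below. The sketch assumes 0-indexed strings and an array pi with pi[0] as unused padding; the names prefix_function and pi_star are hypothetical. The final check compares Π* [q] with the set {k: k < q and P_k ⊐ P_q} from the Prefix-Function Iteration Lemma for the pattern ababaca.

```python
def prefix_function(pattern):
    """pi[q] = max{k < q : P_k is a suffix of P_q}, straight from the definition."""
    m = len(pattern)
    return [0] + [
        max(k for k in range(q) if pattern[:q].endswith(pattern[:k]))
        for q in range(1, m + 1)
    ]


def pi_star(pi, q):
    """Return [pi[q], pi[pi[q]], ..., 0] by iterating pi until it reaches 0."""
    values = []
    while q > 0:
        q = pi[q]
        values.append(q)
    return values


pattern = "ababaca"
pi = prefix_function(pattern)
print(pi_star(pi, 5))  # [3, 1, 0]

# Prefix-Function Iteration Lemma: pi*[q] lists exactly the k < q with P_k a suffix of P_q
for q in range(1, len(pattern) + 1):
    by_definition = {k for k in range(q) if pattern[:q].endswith(pattern[:k])}
    assert set(pi_star(pi, q)) == by_definition
```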
E_{q–1}:
For q = 2, 3, . . . , m, define the subset E_{q–1} ⊆ Π* [q – 1] by
E_{q–1} = {k ∈ Π* [q – 1]: P [k + 1] = P [q]}
= {k: k < q – 1 and P_k ⊐ P_{q–1} and P [k + 1] = P [q]} (Prefix Function Iteration Lemma)
= {k: k < q – 1 and P_{k+1} ⊐ P_q}
E_{q–1} consists of those values of k ∈ Π* [q – 1] for which P_k can be extended to P_{k+1}
to obtain a proper suffix of P_q.
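The set E_{q–1} leads directly to a way of computing Π: by the Corollary, Π [q] = 0 if E_{q–1} is empty and 1 + max E_{q–1} otherwise. The Python sketch below applies this rule literally, assuming 0-indexed strings (so pattern[k] is P [k + 1]). The function name is hypothetical, and the sketch favours clarity over the Θ (m) bound of COMPUTE-PREFIX-FUNCTION, since it walks all of Π* [q – 1] instead of stopping at the first hit.

```python
def prefix_function_via_corollary(pattern):
    """Build pi using pi[q] = 0 if E_{q-1} is empty, else 1 + max(E_{q-1})."""
    m = len(pattern)
    pi = [0] * (m + 1)  # pi[0] is padding; pi[1] = 0
    for q in range(2, m + 1):
        # Walk pi*[q - 1] = {pi[q - 1], pi[pi[q - 1]], ..., 0}
        candidates = []
        k = q - 1
        while k > 0:
            k = pi[k]
            candidates.append(k)
        # E_{q-1}: the candidates k whose next character P[k + 1] equals P[q]
        e = [k for k in candidates if pattern[k] == pattern[q - 1]]
        pi[q] = 1 + max(e) if e else 0
    return pi


print(prefix_function_via_corollary("ababaca"))  # [0, 0, 0, 1, 2, 3, 0, 1]
```

For q = 6 in ababaca, Π* [5] = {3, 1, 0} and none of P [4], P [2], P [1] equals P [6] = c, so E_5 is empty and Π [6] = 0.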
References:
• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein,
Introduction to Algorithms, Third Edition, The MIT Press, Cambridge,
Massachusetts; London, England.

String Matching with Finite Automata and Knuth Morris Pratt Algorithm

  • 1.
    String Matching withFinite Automata Dr. Kiran K Assistant Professor Department of CSE UVCE Bengaluru, India.
  • 2.
    Finite Automata A FiniteAutomata is a 5-tuple (Q, q0, A, ∑, δ), where • Q : Finite set of States • q0 є Q : Start State • A C Q : Distinguished set of Accepting States • ∑ : Finite Input Alphabet • δ : Transition Function. A function from Q x ∑ into Q, • Φ : Final-State Function from ∑* to Q such that Φ (w) is the state M ends up in after scanning the string w. M accepts a string w iff Φ (w) є A. Φ (ε) = q0 Φ (ε) = δ (Φ (w), a) for w є ∑*, a є ∑
  • 3.
    Finite Automata… • Thefinite automaton begins in state q0 and reads the characters of its input string one at a time. • If the automaton is in state q and reads input character a, it moves (“makes a transition”) from state q to state δ (q, a). • Whenever the current state q is a member of A, the machine M has accepted the string read so far. • An input that is not accepted is rejected.
  • 4.
    Finite Automata… Q ={0, 1} q0 = 0 ∑ = {a, b} State Transition Diagram State 1: Accepting State Directed Edges: Transitions abaaa : Enters states <0, 1, 0, 1, 0, 1>; Accepts. abbaa : Enters states <0, 1, 0, 0, 1, 0>; Rejects. Eg.: Finite Automata that accepts strings ending with odd number of a’s Transition Function, δ Input State a b 0 1 0 1 0 0
  • 5.
    String Matching Automata •A String Matching Automaton is constructed for a given pattern P in the Preprocessing step. • Suffix Function (σ):  An auxiliary function that specifies the string-matching automaton corresponding to a given pattern P [1 . . m].  It maps ∑* to {0, 1, . . ., m} such that σ (x) is the length of the longest prefix of P that is also a suffix of x: σ (x) = max {k: Pk x}.  P0 = ε is a suffix of every string.  For a pattern P of length m, σ (x) = m iff P x  x y → σ (x) ≤ σ (y) Eg.: P = ab σ (ε) = 0 σ (ccaca) = 1 σ (ccab) = 2
  • 6.
    String Matching Automata… TheString-Matching Automaton that corresponds to a given pattern P [1 . . m] is defined as follows: • The state set Q is {0, 1, . . ., m}. State 0 - Start state, q0; State m - Accepting state. • The transition state for any state q and character x is defined as: δ (q, x) = σ (Pq x) Note: The transition function δ keeps track of the longest prefix of the pattern P that has matched the text string T so far.
  • 7.
    String Matching Automata… δ(q, x) = σ (Pq x) ? • If the substring ending at T[i] matches some prefix Pj of P, then Pj must be a suffix of Ti . • If (q = Φ (Ti), the automaton is in state q after reading Ti • In state q, (Pq Ti) and (q = σ (Ti)) • Thus, δ is defined to give the length of the longest prefix of P that matches a suffix of Ti Note: Φ (Ti) and σ (Ti) both equal q, thus the automaton maintains the invariant: Φ (Ti) = σ (Ti)
  • 8.
    String Matching Automata… •If the automaton is in state q and reads the next character T [i + 1] = x, it enters the state σ (Tix) if the longest prefix of P is a suffix of Tix. • Pq is the longest prefix of P that is a suffix of Ti → σ (Pqx) is the longest prefix of P that is a suffix of Tix. • There are two cases: Case 1: x = P [q + 1] Character x continues to match the pattern. δ (q, a) = q + 1. Case 2: x ≠ P [q + 1] Character x does not continue to match the pattern. Find a smaller prefix of P that is also a suffix of Ti.
  • 9.
    Example P = ababaca δ(q, x) = σ (Pq x) q = 0 P0 = ε ; x = {a, b, c} δ (0, a) = σ (a) = 1 (a is the longest prefix of the pattern ababaca) δ (0, b) = σ (b) = 0 (b is the not a prefix of the pattern ababaca) δ (0, c) = σ (c) = 0 (c is the not a prefix of the pattern ababaca) 0 0 1 a
  • 10.
    Example… P = ababaca δ(q, x) = σ (Pq x) q = 1 P1 = a ; x = {a, b, c} δ (1, a) = σ (aa) = 1 (a is the longest prefix of the pattern ababaca) δ (1, b) = σ (ab) = 2 (ab is the longest prefix of the pattern ababaca) δ (1, c) = σ (ac) = 0 (No suffix of ac is a prefix of the pattern ababaca) 20 1 a b a
  • 11.
    Example… P = ababaca δ(q, x) = σ (Pq x) q = 2 P2 = ab ; x = {a, b, c} δ (2, a) = σ (aba) = 3 (aba is the longest prefix of the pattern ababaca) δ (2, b) = σ (abb) = 0 (No suffix of abb is a prefix of the pattern ababaca) δ (2, c) = σ (abc) = 0 (No suffix of abc is a prefix of the pattern ababaca) 3 a 20 1 a b a
  • 12.
    Example… P = ababaca δ(q, x) = σ (Pq x) q = 3 P3 = aba ; x = {a, b, c} δ (3, a) = σ (abaa) = 1 (a is the longest prefix of the pattern ababaca) δ (3, b) = σ (abab) = 4 (abab is the longest prefix of the pattern ababaca) δ (3, c) = σ (abac) = 0 (No suffix of abac is a prefix of the pattern ababaca) 3 a 20 1 a b a 4 b a
  • 13.
    Example… P = ababaca δ(q, x) = σ (Pq x) q = 4 P4 = abab ; x = {a, b, c} δ (4, a) = σ (ababa) = 5 (ababa is the longest prefix of the pattern ababaca) δ (4, b) = σ (ababb) = 0 (No suffix of ababb is a prefix of the pattern ababaca) δ (4, c) = σ (ababc) = 0 (No suffix of ababc is a prefix of the pattern ababaca) 3 a 20 1 a b a 4 b a 5 a
  • 14.
    Example… P = ababaca δ(q, x) = σ (Pq x) q = 5 P5 = ababa ; x = {a, b, c} δ (5, a) = σ (ababaa) = 1 (a is the longest prefix of the pattern ababaca) δ (5, b) = σ (ababab) = 4 (abab is the longest prefix of the pattern ababaca) δ (5, c) = σ (ababac) = 6 (ababac is the longest prefix of the pattern ababaca) a 3 a 20 1 a b a 4 b a 5 6 a b c
  • 15.
    Example… P = ababaca δ(q, x) = σ (Pq x) q = 6 P6 = ababac ; x = {a, b, c} δ (6, a) = σ (ababaca) = 7 (ababaca is the longest prefix of the pattern ababaca) δ (6, b) = σ (ababacb) = 0 (No suffix of ababacb is a prefix of the pattern ababaca) δ (6, c) = σ (ababacc) = 0 (No suffix of ababacc is a prefix of the pattern ababaca) a 73 a 20 1 a b a 4 b a 5 6 a b c a
  • 16.
    Example… P = ababaca δ(q, x) = σ (Pq x) q = 7 P7 = ababaca ; x = {a, b, c} δ (7, a) = σ (ababacaa) = 1 (a is the longest prefix of the pattern ababaca) δ (7, b) = σ (ababacab) = 2 (ab is the longest prefix of the pattern ababaca) δ (7, c) = σ (ababacac) = 0 (No suffix of ababacac is a prefix of the pattern ababaca) b a 73 a 20 1 a b a 4 b a 5 6 a b c a a
  • 17.
    Example… Transition Function, δ Input Statea b c 0 1 0 0 1 1 2 0 2 3 0 0 3 1 4 0 4 5 0 0 5 1 4 6 6 7 0 0 7 1 2 0 b a 73 a 20 1 a b a 4 b a 5 6 a b c a a State-Transition Diagram for the String-Matching Automaton that Accepts all Strings ending in the String ababaca.
  • 18.
    COMPUTE-TRANSITION-FUNCTION (P, ∑) m= P.length For (q = 0 to m) For (Each character x ϵ ∑) k = min (m + 1, q + 2) Repeat k = k – 1 Until (Pk Pq x) δ (q, x) = k Return δ Transition Function Algorithm Running Time: O (m3 | ∑ |) Outer For Loop: m times. Inner For Loop: m x | ∑ | times ∑ times for each value of m. Repeat Loop: (m x | ∑ |) x m2 times Repeat loop maximum m + 1 times, with the test Pk Pq x requiring upto m character comparisons.
  • 19.
    Example P = ababaca;m = 7; k = min (m + 1, q + 2) Pk Pq x δ (q, x) = k q = 0; k = min (7 + 1, 0 + 2) = min (8, 2) = 2 x = a p0 a = a k = 1; p1 = a; a a δ (0, a) = 1 x = b p0 b = b k = 1; p1 = a; a b k = 0; p0 = ε; ε b δ (0, b) = 0 x = c p0 c = c k = 1; p1= a; a c k = 0; p0= ε; ε b δ (0, c) = 0 q = 1; k = min (7 + 1, 1 + 2) = min (8, 3) = 3 x = a p1 a = aa k = 2; p2 = ab; ab aa k = 1; p1 = a; a aa δ (1, a) = 1 x = b p1 b = ab k = 2; p2 = ab; ab ab δ (1, b) = 2 x = c p1 c = ac k = 2; p2 = ab; ab ac k = 1; p1 = a; a ac k = 0; p0 = ε; ε ac δ (1, c) = 0
  • 20.
    Example… P = ababaca;m = 7; k = min (m + 1, q + 2) Pk Pq x δ (q, x) = k q = 2; k = min (7 + 1, 2 + 2) = min (8, 2) = 4 x = a p2 a = aba k = 3; p3 = aba; aba aba δ (2, a) = 3 x = b p2 b = abb k = 3; p3 = aba; aba abb k = 2; p2 = ab; ab abb k = 1; p1 = a; a abb k = 0; p0 = ε; ε abb δ (2, b) = 0 x = c p2 c = abc k = 3; p3 = aba; aba abc k = 2; p2 = ab; ab abc k = 1; p1 = a; a abc k = 0; p0 = ε; ε abc δ (2, c) = 0
  • 21.
    Example… P = ababaca;m = 7; k = min (m + 1, q + 2) Pk Pq x δ (q, x) = k q = 3; k = min (7 + 1, 3 + 2) = min (8, 5) = 5 x = a p3 a = abaa k = 4; p4 = abab; abab abaa k = 3; p3 = aba; aba abaa k = 2; p2 = ab; ab abaa k = 1; p1 = a; a abaa δ (3, a) = 1 x = b p3 b = abab k = 4; p4 = abab; abab abab δ (3, b) = 4 x = c p3 c = abac k = 4; p4 = abab; abab abac k = 3; p3 = aba; aba abac k = 2; p2 = ab; ab abac k = 1; p1 = a; a abac k = 0; p0 = ε; ε abac δ (3, c) = 0
  • 22.
    Example… P = ababaca;m = 7; k = min (m + 1, q + 2) Pk Pq x δ (q, x) = k q = 4; k = min (7 + 1, 4 + 2) = min (8, 6) = 6 x = a p4 a = ababa k=5;p5=ababa;ababa ababa δ (4, a) = 5 x = b p4 b = ababb k=5;p5=ababa;ababa ababb k=4;p4=abab;abab ababb k=3;p3=aba;aba ababb k=2;p2=ab;ab ababb k=1;p1=a;a ababb k=0;p0=ε;ε ababb δ (4, b) = 0 x = c p4 c = ababc k=5;p5=ababa;ababa ababc k=4;p4=abab;abab ababc k=3;p3=aba;aba ababc k=2;p2=ab;ab ababc k=1;p1=a;a ababc k=0;p0=ε;ε ababc δ (4, c) = 0
  • 23.
    Example… P = ababaca;m = 7; k = min (m + 1, q + 2) Pk Pq x δ (q, x) = k q = 5; k = min (7 + 1, 5 + 2) = min (8, 7) = 7 x = a p5 a = ababaa k = 6; p6 = ababac; ababac ababaa k = 5; p5 = ababa; ababa ababaa k = 4; p4 = abab; abab ababaa k = 3; p3 = aba; aba ababaa k = 2; p2 = ab; ab ababaa k = 1; p1 = a; a ababaa δ (5, a) = 1 x = b p5 b = ababab k = 6; p6 = ababac; ababac ababab k = 5; p5 = ababa; ababa ababab k = 4; p4 = abab; abab ababab δ (5, b) = 4 x = c p5 c = ababac k = 6; p6 = ababac; ababac ababac δ (5, c) = 6
  • 24.
    Example… P = ababaca;m = 7; k = min (m + 1, q + 2) Pk Pq x δ (q, x) = k q = 6; k = min (7 + 1, 6 + 2) = min (8, 8) = 8 x = a p6 a = ababaca k = 7; p7 = ababaca; ababaca ababaa δ (6, a) = 7 x = b p6 b = ababacb k = 7; p7 = ababaca; ababaca ababacb k = 6; p6 = ababac; ababac ababacb k = 5; p5 = ababa; ababa ababacb k = 4; p4 = abab; abab ababacb k = 3; p3 = aba; aba ababacb k = 2; p2 = ab; ab ababacb k = 1; p1 = a; a ababacb k = 0; p0 = ε; ε ababacb δ (6, b) = 0 x = c p6 c = ababacc k = 7; p7 = ababaca; ababaca ababacc k = 6; p6 = ababac; ababac ababacc k = 5; p5 = ababa; ababa ababacc k = 4; p4 = abab; abab ababacc k = 3; p3 = aba; aba ababacc k = 2; p2 = ab; ab ababacc k = 1; p1 = a; a ababacc k = 0; p0 = ε; ε ababacc δ (6, c) = 0
  • 25.
    Example… P = ababaca;m = 7; k = min (m + 1, q + 2) Pk Pq x δ (q, x) = k q = 7; k = min (7 + 1, 7 + 2) = min (9, 8) = 8 x = a p7 a = ababacaa k = 7; p7 = ababaca; ababaca ababacaa k = 6; p6 = ababac; ababac ababacaa k = 5; p5 = ababa; ababa ababacaa k = 4; p4 = abab; abab ababacaa k = 3; p3 = aba; aba ababacaa k = 2; p2 = ab; ab ababacaa k = 1; p1 = a; a ababacaa δ (7, a) = 1 x = b p7 b = ababacab k = 7; p7 = ababaca; ababaca ababacab k = 6; p6 = ababac; ababac ababacab k = 5; p5 = ababa; ababa ababacab k = 4; p4 = abab; abab ababacab k = 3; p3 = aba; aba ababacab k = 2; p2 = ab; ab ababacab δ (7, b) = 2 x = c p7 c = ababacac k = 7; p7 = ababaca; ababaca ababacac k = 6; p6 = ababac; ababac ababacac k = 5; p5 = ababa; ababa ababacac k = 4; p4 = abab; abab ababacac k = 3; p3 = aba; aba ababacac k = 2; p2 = ab; ab ababacac k = 1; p1 = a; a ababacac k = 0; p0 = ε; ε ababacac δ (7, c) = 0
  • 26.
    Example… Transition Function, δ Input Statea b c 0 1 0 0 1 1 2 0 2 3 0 0 3 1 4 0 4 5 0 0 5 1 4 6 6 7 0 0 7 1 2 0
  • 27.
    For any stringy and character x, we have σ (yx) ≤ σ (y) + 1 With reference to the figure, Let r = σ (yx) (L1.1) Case1: Let, r = 0 (L1.2) (L1.1) and (L1.2) → σ (yx) = r = 0 (L1.3) wkt σ (y) ≥ 0 (σ cannot be negative) (L1.4) (L1.3) and (L1.4) → 0 ≤ σ (y) + 1 (L1.5) (L1.3) and (L1.4) → σ (yx) ≤ σ (y) + 1 Suffix-Function Inequality Pr – 1 x Pr y
  • 28.
    Case2: Let, r> 0 (L1.6) wkt σ (y) = max {k: Pk y} (Suffix Definition) (L1.7) (L1.1), (L1.6) and (L1.7) → Pr yx (L1.8) (L1.8) → Pr – 1 y (L1.9) (L1.7) and (L1.9) → r – 1 ≤ σ (y) (L1.10) (L1.10) → r ≤ σ (y) + 1 (L1.11) (L1.1) and (L1.11) → σ (yx) ≤ σ (y) + 1 Suffix-Function Inequality…
  • 29.
    For any stringy and character x, if q = σ (y), then σ (yx) = σ (Pqx) Wkt, σ (y) = max {k: Pk y}(Suffix Definition) (L2.1) Also, Pq x yx (From the Figure) (L2.2) Let, r = σ (yx) (L2.3) (L2.3) → Pr yx (L2.4) Wkt, σ (yx) ≤ σ (y) + 1 (Suffix-Function inequality Lemma) (L2.5) Also, q = σ (y) (Given) (L2.6) (L2.3), (L2.5) and (L2.6) → r ≤ q + 1 (L2.7) Suffix-Function Recursion Lemma x Pq x Pr y
  • 30.
    Wkt, | Pr| = r and |Pq x| = q + 1 (L2.8) (L2.7) and (L2.8) → | Pr | ≤ |Pq x| (L2.9) (L2.2), (L2.4) and (L2.9) → Pr Pq x (Overlapping-Suffix Lemma) (L2.10) Wkt, x y → σ (x) ≤ σ (y) (L2.11) (L2.10) and (L2.11) → r ≤ σ (Pq x) (L2.12) (L2.3) and (L2.12) → σ (yx) ≤ σ (Pq x) (L2.13) (L2.2) and (L2.11) → σ (Pq x) ≤ σ (yx) (L2.14) (L2.13) and (L2.14) → σ (yx) = σ (Pq x) Suffix-Function Recursion Lemma…
  • 31.
    FINITE-AUTOMATON-MATCHER (T, δ,m) n = T.length q = 0 For (i = 1 to n) q = δ (q, T [i]) If (q == m) Print “Pattern occurs with shift” i – m Running Time: Ө (n) Matching Algorithm
  • 32.
    T = abababacaba;P = ababaca q = δ (q, T [i]) n = T.length = 11; m = 7 q = 0 i = 1, q = δ (0, T [1]) = δ (0, a) = 1 q ≠ m → i = 2, q = δ (1, T [2]) = δ (1, b) = 2 q ≠ m → i = 3, q = δ (2, T [3]) = δ (2, a) = 3 q ≠ m → i = 4, q = δ (3, T [4]) = δ (3, b) = 4 q ≠ m → i = 5, q = δ (4, T [5]) = δ (4, a) = 5 q ≠ m → i = 6, q = δ (5, T [6]) = δ (5, b) = 4 q ≠ m → i = 7, q = δ (4, T [7]) = δ (4, a) = 5 q ≠ m → i = 8, q = δ (5, T [8]) = δ (5, c) = 6 q ≠ m → i = 9, q = δ (6, T [9]) = δ (6, a) = 7 q = m → Pattern occurs with shift i – m = 9 – 7 = 2 Example Transition Function, δ Input State a b c 0 1 0 0 1 1 2 0 2 3 0 0 3 1 4 0 4 5 0 0 5 1 4 6 6 7 0 0 7 1 2 0
  • 33.
    Theorem If Φ isthe final-state function of a string-matching automaton for a given pattern P and T [1 . . n] is an input text for the automaton, then Φ (Ti) = σ (Ti) for i = 0, . . . , n. Proof: Basis: For i = 0 T0 = ε → Φ (T0) = 0 = σ (T0) Assumption: Let, Φ (Ti) = σ (Ti) Induction: Let q = Φ (Ti) and x = T [i + 1]
  • 34.
    Theorem… Φ (Ti +1) = Φ (Ti x) = δ (Φ (Ti), x) = δ (q, x) (Φ (Ti) = q) = σ (Pq x) (δ (q, x) = σ (Pq x)) = σ (Ti x) (Suffix-function recursion lemma) = σ (Ti + 1) When the machine enters state q in the algorithm by executing the statement q = δ (q, T [i]), q is the largest value such that Pq T [i]. → q = m iff machine has just scanned an occurrence of the pattern P. Thus the FINITEAUTOMATON-MATCHER operates correctly
  • 35.
  • 36.
    Introduction • This algorithmavoids computing the transition function δ. • It uses an auxiliary function ᴨ, called the Prefix Function, which is precomputed from the pattern of length m, and stored in an array ᴨ [1 . . m]. • For any state q = 0, 1, . . . , m and any character x є ∑ the value, ᴨ [q] contains the information needed to compute δ (q, x) but does not depend on x. • Array ᴨ contains only m entries, whereas δ has Ө (m |∑|) entries. Hence, computing ᴨ saves a factor of |∑| in the preprocessing time compared to computing δ. • The algorithm has a preprocessing time of Ө (m) and a matching time of Ө (n).
  • 37.
    Prefix Function The prefixfunction ᴨ for a pattern encapsulates knowledge about how the pattern matches against shifts of itself. This information helps to avoid testing useless shifts in the naive pattern-matching algorithm and to avoid precomputing the full transition function δ for a string-matching automaton. Eg.: T = bacbababaabcbab P = ababaca • Consider the shift s as shown in the figure using a Naïve string Matcher. • q = 5 characters have matched successfully, but the 6th character fails to match.
  • 38.
    Prefix Function… • Theinformation that q characters have matched successfully determines the corresponding text characters. • Knowing these q text characters allows to determine immediately that certain shifts are invalid. Eg.: s + 1 is invalid as it aligns the first character of the pattern ‘a’ with character ‘b’ of text. • The shift s ʹ = s + 2 aligns the first 3 characters of the pattern with the text that match. • s ʹ can be computed as follows:  If pattern characters P [1 . . q] match text characters T [s + 1 . . s + q], the least shift s ʹ > s such that for some k < q , P [1 . . k] = T [s ʹ + 1 . . s ʹ + k] where s ʹ + k = s + q. i.e.,
  • 39.
    Prefix Function…  IfPq Ts + q is known, find the longest prefix Pk of Pq that is also a suffix of Ts + q  The new shift s ʹ is found by adding the difference (q – k) to s s = s ʹ + (q – k ) Note: • If k = 0 then s = s ʹ + q and all the shifts s + 1, s + 2, , , s + q – 1 are ruled out and it is the best shift. • At the new shift s ʹ, the first k characters of P need not be compared with the corresponding characters of T because s ʹ is computed after ensuring P [1 . . k] = T [s ʹ + 1 . . s ʹ + k].
  • 40.
    Prefix Function… s ʹcan be computed as follows by comparing the pattern with itself as follows: • Since T [s ʹ + 1 . . s ʹ + k] is part of the known portion of the text it is a suffix of the string Pq. • Hence, P [1 . . k] = T [s ʹ + 1 . . s ʹ + k] → Find the greatest k < q such that Pk Pq. Formal Definition: Given a pattern P [1 . . m], the prefix function for the pattern P is the function ᴨ: [1 . . m] → {0, 1, . . . , m – 1} such that ᴨ [q] = max {k: k < q and Pk Pq}. Note: It convenient to store, for each value of q, the number k of matching characters at the new shift s ʹ.
  • 41.
    Example Pattern: ababaca; PkPq k < q i = 1; q = 1; P1 = a; k < 1; Pk = ε ᴨ[1] = 0 (ε is the longest prefix of a) i = 2; q = 2; P2 = ab; k < 2; Pk = ε ᴨ[2] = 0 (No prefix of size < 2, (a), is a suffix of ab except ε) i = 3; q = 3; P3 = aba; k < 3; Pk = a ᴨ[3] = 1 (a is the longest prefix with size < 3, that is a suffix of aba) i = 4; q = 4; P4 = abab; k < 4; Pk = ab ᴨ[4] = 2 (ab is the longest prefix with size < 4, that is a suffix of abab) i = 5; q = 5; Pq = ababa; k < 5; Pk = aba ᴨ[5] = 3 (aba is the longest prefix with size < 4, that is a suffix of ababa)
  • 42.
    Example… Pattern: ababaca; PkPq i = 6; q = 6; Pq = ababac; k < 6; Pk = ε ᴨ[6] = 0 (No prefix of size < 6, (ababa, abab, aba, ab, a), is a suffix of ababac except ε) i = 7; q = 7; Pq = ababaca; k < 7; Pk = a ᴨ[7] = 1 (a is the longest prefix with size < 7, that is a suffix of ababaca) i ᴨ [i] 1 0 2 0 3 1 4 2 5 3 6 0 7 1
  • 43.
    Prefix Function Algorithm COMPUTE-PREFIX-FUNCTION(P) m= P.length Let Π [1 . . m] be a new array Π [1] = 0 k = 0 For (q = 2 to m) While (k > 0 and P [k + 1] ≠ P [q]) k = Π [k] If (P [k + 1] == P [q]) k = k + 1 Π [q] = k return Π Running Time: Θ(m) 1. For loop increases the value of k once for each iteration and hence the total increase in k is at most m – 1. 2. k < q ensuring that Π [q] < q which means that each iteration of the while loop decrements k. 3. k never becomes negative. These facts put together, it is observed that total decrease in k from the while loop is bounded from above by the total increase in k over all iterations of the for loop, which is m – 1.
  • 44.
    Example For (q =2 to m) While (k > 0 and P [k + 1] ≠ P [q]) k = Π [k] If (P [k + 1] == P [q]) k = k + 1 Π [q] = k P = ababaca Π [1] = 0; k = 0; q = 2; k = 0; While loop: not executed (k = 0); If condition: Fails (a ≠ b; P [1] ≠ P [2])); Π [2] = 0; q = 3; k = 0; While loop: not executed (k = 0) If condition: (a == a; (P [1] == P [3])); → k = 0 + 1 = 1; Π [3] = 1; q = 4; k = 1; While loop: not executed (b = b; (P [2] == P [4])) If condition: (b == b; (P [2] == P [4])); → k = 1 + 1 = 2; Π [4] = 2
  • 45.
    Example… q = 5;k = 2; While loop: not executed (a = a; (P [3] == P [5])) If condition: (a == a; (P [3] == P [5])); → k = 2 + 1 = 3; Π [5] = 3 q = 6; k = 3; While loop: b ≠ c; (P [4] ≠ P [6]); → k = Π [3] = 1 b ≠ c; (P [2] ≠ P [6]); → k = Π [1] = 0 If condition: Fails (a ≠ c; (P [1] ≠ P [6])) Π [6] = 0 q = 7; k = 0; While loop: not executed (k = 0) If condition: (a == a; (P [1]= = P [7])); k = 0 + 1 = 1; Π [7] = 1 i ᴨ [i] 1 0 2 0 3 1 4 2 5 3 6 0 7 1 For (q = 2 to m) While (k > 0 and P [k + 1] ≠ P [q]) k = Π [k] If (P [k + 1] == P [q]) k = k + 1 Π [q] = k
  • 46.
    Matching Algorithm KMP-MATCHER (T,P) n = T.length m = P.length Π = COMPUTE-PREFIX-FUNCTION (P) q = 0 For (i = 1 to n) While (q > 0 and P [q + 1] ≠ T [i]) q = Π [q] If (P [q + 1] == T [i]) q = q + 1 If (q == m) Print “Pattern occurs with shift” i – m + 1 q = Π [q] Running Time: Θ(n) 1. For loop increases the value of q once for each iteration and hence the total increase in q is at most n. 2. Π [q] < q which means that each iteration of the while loop decrements q. 3. q never becomes negative. These facts put together, it is observed that total decrease in q from the while loop is bounded from above by the total increase in q over all iterations of the for loop, which is n.
  • 47.
    Example For (i =1 to n) While (q > 0 and P [q + 1] ≠ T [i]) q = Π [q] If (P [q + 1] == T [i]) q = q + 1 If (q == m) Print “Pattern occurs with shift” i – m +1 q = Π [q] T: bacbababaababacababa; n = 20 P: ababaca; m = 7 i = 1; q = 0 While loop: not executed (q = 0) If condition: fails (a ≠ b; (P [1] ≠ T [1])) q ≠ m i = 2; q = 0 While loop: not executed (q = 0) If condition: (a == a; (P [1] == T [2])); q = 0 + 1 = 1 q ≠ m i = 3; q = 1 While loop: (b ≠ c; (P [2] ≠ T [3])); q = Π [1] = 0 If condition: Fails (b ≠ c; (P [2] ≠ T [3])); q ≠ m i = 4; q = 0 While loop: not executed (q = 0) If condition: Fails (a ≠ b; (P [1] ≠ T [4])); i ᴨ [i] 1 0 2 0 3 1 4 2 5 3 6 0 7 1
  • 48.
    Example… T: bacbababaababacababa; n= 20 P: ababaca; m = 7 q ≠ m i = 5; q = 0 While loop: not executed (q = 0) If condition: (a == a; (P [1] == T [5])); q = 0 + 1 = 1 q ≠ m i = 6; q = 1 While loop: not executed (b == b; (P [2] == T [6])) If condition: (b == b; (P [2] == T [6])); q = 1 + 1 = 2 q ≠ m i = 7; q = 2 While loop: not executed (a == a; (P [3] == T [7])) If condition: (a == a; (P [3] == T [7])); q = 2 + 1 = 3 i ᴨ [i] 1 0 2 0 3 1 4 2 5 3 6 0 7 1 For (i = 1 to n) While (q > 0 and P [q + 1] ≠ T [i]) q = Π [q] If (P [q + 1] == T [i]) q = q + 1 If (q == m) Print “Pattern occurs with shift” i – m +1 q = Π [q]
  • 49.
    Example… T: bacbababaababacababa; n= 20 P: ababaca; m = 7 q ≠ m i = 8; q = 3 While loop: not executed (b == b; (P [4] == T [8])) If condition: (b == b; (P [4] == T [8])); q = 3 + 1 = 4 q ≠ m i = 9; q = 4 While loop: not executed (a == a; (P [5] == T [9])) If condition: (a == a; (P [5] == T [9])); q = 4 + 1 = 5 q ≠ m i = 10; q = 5 While loop: (c ≠ a; (P [6] ≠ T [10])); q = Π [5] = 3 (b ≠ a; (P [4] ≠ T [10])); q = Π [3] = 1 (b ≠ a; (P [2] ≠ T [10])); q = Π [1] = 0 If condition:(a == a;(P [1] == T [10]));q = 0 + 1 = 1 i ᴨ [i] 1 0 2 0 3 1 4 2 5 3 6 0 7 1 For (i = 1 to n) While (q > 0 and P [q + 1] ≠ T [i]) q = Π [q] If (P [q + 1] == T [i]) q = q + 1 If (q == m) Print “Pattern occurs with shift” i – m +1 q = Π [q]
  • 50.
    Example… T: bacbababaababacababa; n= 20 P: ababaca; m = 7 q ≠ m i = 11; q = 1 While loop: not executed (b == b; (P [2] == T [11])); If condition:(b == b;(P [2] == T [11]));q = 1 + 1 = 2 q ≠ m i = 12; q = 2 While loop:not executed (a == a; (P [3] == T [12])); If condition:(a == a;(P [3] == T [12]));q = 2 + 1 = 3 q ≠ m i = 13; q = 3 While loop:not executed (b == b; (P [4] == T [13])); If condition:(b == b;(P [4] == T [13]));q = 3 + 1 = 4 i ᴨ [i] 1 0 2 0 3 1 4 2 5 3 6 0 7 1 For (i = 1 to n) While (q > 0 and P [q + 1] ≠ T [i]) q = Π [q] If (P [q + 1] == T [i]) q = q + 1 If (q == m) Print “Pattern occurs with shift” i – m +1 q = Π [q]
  • 51.
    Example… T: bacbababaababacababa; n= 20 P: ababaca; m = 7 q ≠ m i = 14; q = 4 While loop:not executed (a == a; (P [5] == T [14])); If condition:(a == a;(P [5] == T [14]));q = 4 + 1 = 5 q ≠ m i = 15; q = 5 While loop:not executed (c == c; (P [6] == T [15])); If condition:(c == c;(P [6] == T [15]));q = 5 + 1 = 6 q ≠ m i = 16; q = 6 While loop:not executed (a == a; (P [7] == T [16])); If condition:(a == a;(P [7] == T [16]));q = 6 + 1 = 7 q == m Pattern occurs with shift i – m + 1 = 16 – 7 + 1 = 10 i ᴨ [i] 1 0 2 0 3 1 4 2 5 3 6 0 7 1 For (i = 1 to n) While (q > 0 and P [q + 1] ≠ T [i]) q = Π [q] If (P [q + 1] == T [i]) q = q + 1 If (q == m) Print “Pattern occurs with shift” i – m +1 q = Π [q]Running Time: Θ(m)
  • 52.
    Prefix-Function Iteration Lemma LetP be a pattern of length m with prefix function Π. Then for q = 1, 2, . . . , m, we have Π* [q] ={k: k < q and Pk Pq} Proof: Step 1: Prove Π* [q] {k: k < q and Pk Pq} ≡ i Є Π * [q] → Pi Pq – By Induction i Є Π* [q] (Given) (L1.1) (L1.1) → i = Π(u) [q] for some u > 0 (L1.2) Basis: u = 1 (L1.3) (L1.2) and (L1.3) → i = Π [q] (L1.4) wkt, i < q (L1.5) ᴨ [q] = max {k: k < q and Pk Pq} (L1.6) (L1.4) (L1.5) and (L1.6) → PΠ [q] Pq (L1.7) Induction: (L1.4) and (L1.5) → Π [i] < i (L1.8) (L1.6) and (L1.8) → PΠ [i] Pi (L1.9) wkt, < and are transitive (L1.10) (L1.8), (L1.9) and (L1.10) → PΠ [i] Pi for all i (L1.11) (L1.7) and (L1.11) → Π* [q] {k: k < q and Pk Pq} (L1.12)
  • 53.
    Prefix-Function Iteration Lemma… Step2: Prove {k: k < q and Pk Pq} Π* [q] – By Contradiction Assume {k: k < q and Pk Pq} – Π* [q] is non empty (L1.13) Let j be the largest integer in the set in (L1.13) (L1.14) (L1.6) → Π [q] is the largest value in {k: k < q and Pk Pq} (L1.15) wkt, Π [q] Є Π* [q] (Definition of Π* [q]) (L1.16) (L1.14) and (L1.15) → j < Π [q] (L1.17) (L1.13) → j Є Π [q] (L1.18) From (L1.13) and (L1.17), Let jʹ denote the smallest integer in Π* [q] > j (L1.19) (L1.6) and (L1.14) → Pj Pq (L1.20) (L1.15) and (L1.19) → Pj ʹ Pq (L1.21) (L1.19), (L1.20) and (L1.21) → Pj Pj ʹ (L1.22) (L1.20), (L1.21) and (L1.22) → j is the largest value < j ʹ (L1.23) (L1.23) → Π [ j ʹ ] = j (L1.24) (L1.1), (L1.2) and (L1.24) → j Є Π *[q] (L1.25) (L1.25) contradicts the assumption (L1.26) (L1.26) → {k: k < q and Pk Pq} Π* [q] (L1.27) (L1.12) and (L1.27) → Π* [q] = {k: k < q and Pk Pq}
  • 54.
    Lemma Let P bea pattern of length m and Π be the prefix function for P. For q = 1, 2, . . . , m, if Π [q] > 0 then Π [q] – 1 Є Π* [q – 1] Proof: Let r = Π [q] > 0, so that r < q and Pr Pq (L2.1) (L2.1) → r – 1 < q – 1 and Pr – 1 Pq – 1 (L2.2) (L2.2) → r – 1 Є Π* [q – 1] (Prefix Function Iteration Lemma) (L2.3) (L2.1) and (L2.3) → Π [q] – 1 Є Π* [q – 1]
  • 55.
    Corollary Let P bea pattern of length m and let Π be the prefix function for P. For q = 2, 3, ... , m, 0 if E q – 1 = Ø 1 + max {k Є E q – 1} if E q – 1 ≠ Ø Proof: E q – 1 = Ø (C1.1) (C1.1) → there is no k Є E q – 1 (including k = 0) for which Pk can be extended to Pk + 1 and get a proper suffix of Pq (C1.2) (C1.2) → Π [q] = 0 (C1.3) E q – 1 ≠ Ø (C1.4) (C1.4) → there exists k Є E q – 1 such that for each k, k + 1 < q and Pk + 1 Pq (C1.5) ᴨ [q] = max {k: k < q and Pk Pq} (C1.6) (C1.5) and (C1.6) → ᴨ [q] ≥ 1 + max {k Є E q – 1} (C1.7) (C1.7) → Π [q] > 0 (C1.8) Let r = Π [q] – 1 (C1.9) Π [q] =
  • 56.
    Corollary… (C1.9) → r+ 1 = Π [q] (C1.10) (C1.6) and (C1.10) → Pr + 1 Pq (C1.11) (C1.8) and (C1.10) → r + 1 > 0 (C1.12) (C1.11) and (C1.12) → P [r + 1] = P [q] (C1.13) wkt, if Π [q] > 0 then Π [q] – 1 Є Π* [q – 1] (Lemma) (C1.14) (C1.8), (C1.9) and (C1.14) → r Є Π* [q – 1] (C1.15) (C1.15) → r Є E q – 1 (C1.16) (C1.16) → r ≤ max {k Є E q – 1} (C1.17) Adding 1 on both sides of (C1.17), we get, r + 1 ≤ 1 + max {k Є E q – 1} (C1.18) (C1.10) and (C1.18) → Π [q] ≤ 1 + max {k Є E q – 1} (C1.19) (C1.7) and (C1.19) → Π [q] = 1 + max {k Є E q – 1} (C1.20) 0 if E q – 1 = Ø 1 + max {k Є E q – 1} if E q – 1 ≠ Ø (C1.3) and (C1.20) → Π [q] =
  • 57.
    Correctness of ComputePrefix Function COMPUTE-PREFIX-FUNCTION(P) m = P.length Let Π [1 . . m] be a new array Π [1] = 0 k = 0 For (q = 2 to m) While (k > 0 and P [k + 1] ≠ P [q]) k = Π [k] If (P [k + 1] == P [q]) k = k + 1 Π [q] = k return Π 1. k = Π [q – 1] at the start of each iteration of the For loop. • Π [1] = 0 and k = 0 when the loop is first entered. • In successive iterations Π [q] = k. 2. The while loop and the if condition adjust k so that it becomes the correct value of Π [q]. 3. While loop searches through all values k Є Π*[q–1] until it finds a value of k for which P [k + 1] = P [q] • k is the largest value in the set E q – 1. • → Π [q] = k + 1. (Corollary) 4. If the While loop cannot find a k Є Π* [q – 1] such that P [k + 1] = P [q], then k = 0. 5. If P [1] = P [q] then both k and Π [q] has to be set to 1, otherwise only Π [q] has to be set to 0. This is set correctly by the if statement. 6. Thus the function computes Π correctly.
  • 58.
    Similarity between Prefixfunction and Transition Function 1. Upon reading a character x = T [i] in a state q, it moves to a new state δ (q, x). String Matching Automaton: • If (x = P [q + 1]), x continues to match the pattern and δ (q, x) = q + 1. • If (x ≠ P [q + 1]), x does not continue to match the pattern and 0 ≤ δ (q, x) ≤ q. KMP Matcher: • If (x = P [q + 1]), x continues to match the pattern and it reaches state q + 1 without referring to Π. o Since (T [i] = P [q + 1])), the while loop fails but the if condition is true and thus it increments q to q + 1. • If (x ≠ P [q + 1]), x does not continue to match the pattern and the new state is either q or to the left of q along the spine of the automaton. o The while loop iterates through the states in Π* [q], stopping either when it arrives in a state, say j ʹ, such that x matches P [q ʹ + 1]) or q ʹ has gone all the way down to 0. o If (x = P [q ʹ + 1]) then the new state is set to q ʹ + 1 which is equal to δ (q, x).
  • 59.
    Similarity between Prefixfunction and Transition Function… Example: Pattern ababaca State q = 5 1. Next Character – c • While loop: c = c; Fails; • If condition: c = c; true; q moves to state 6 = δ (5, c). 2. Next Character – b • While loop: (1) c ≠ b → q = Π(5) = 3; (2) b = b; Fails; • If condition: b = b; true; q moves to state 4 = δ (5, b). 3. Next Character – a • While loop: (1) c ≠ a → q = Π(5) = 3 (1st State in Π(5)); (2) b ≠ a → q = Π(3) = 1 (2nd State in Π(5)); (3) b ≠ a → q = Π(1) = 0 (3rd State in Π(5)); • If condition: a = a; true; q moves to state 1 = δ (5, a). Transition Function, δ ᴨ [i] Input q a b c 0 1 0 0 1 1 2 0 0 2 3 0 0 0 3 1 4 0 1 4 5 0 0 2 5 1 4 6 3 6 7 0 0 0 7 1 2 0 1
  • 60.
    Correctness of KMP-Matcher KMP-MATCHER(T, P) n = T.length m = P.length Π = COMPUTE-PREFIX-FUNCTION (P) q = 0 For (i = 1 to n) While (q > 0 and P [q + 1] ≠ T [i]) q = Π [q] If (P [q + 1] == T [i]) q = q + 1 If (q == m) Print “Pattern occurs with shift” i – m q = Π [q] I: Show that q = σ (Ti) with regard to the for loop of KMP-Matcher by Induction: Basis: Initially, both the procedures set q to 0 which is σ (T0). Assumption: qʹ = σ (Ti – 1), where qʹ is the state at the start of the for loop. Induction: When the character Ti is considered, the longest prefix of P that is a suffix of Ti is: • Pqʹ + 1 (If P [qʹ + 1] = T [i]) • Some prefix of Pqʹ FINITE-AUTOMATON-MATCHER (T, δ, m) n = T.length q = 0 For (i = 1 to n) q = δ (q, T [i]) If (q == m) Print “Pattern occurs with shift” i – m
  • 61.
    Correctness of KMP-Matcher…
    There are 3 cases (x ⊐ y means that x is a suffix of y):
    Case 1: σ (T_i) = 0
    • P_0 = ε is the only prefix of P that is a suffix of T_i.
    • The while loop iterates through the values in Π* [qʹ].
    • Although P_q ⊐ T_{i – 1} for every q ∈ Π* [qʹ], the loop never finds a q such that P [q + 1] = T [i].
    • The loop terminates when q reaches 0.
    • The if condition fails since P [1] ≠ T [i].
    • Hence, q = 0 = σ (T_i).
    Case 2: σ (T_i) = qʹ + 1
    • P [qʹ + 1] = T [i].
    • The while loop test fails immediately, so the loop body does not execute.
    • The if condition is true, and hence q is incremented to qʹ + 1 = σ (T_i).
  • 62.
    Correctness of KMP-Matcher…
    Case 3: 0 < σ (T_i) ≤ qʹ (3.1)
    • The while loop iterates at least once, checking in decreasing order each value q ∈ Π* [qʹ] until it stops at some q < qʹ.
    • → P_q is the longest prefix of P_{qʹ} for which P [q + 1] = T [i].
    • When the while loop terminates, q + 1 = σ (P_{qʹ} T [i]) (3.2)
      (the transition moves to a state to the left of qʹ along the spine).
    • By the induction hypothesis, qʹ = σ (T_{i – 1}) (3.3)
    • (3.3) and the suffix-function recursion lemma → σ (T_{i – 1} T [i]) = σ (P_{qʹ} T [i]) (3.4)
    • (3.2) and (3.4) → q + 1 = σ (T_{i – 1} T [i]) = σ (T_i) (3.5)
    • (3.5) → q = σ (T_i) – 1 when the while loop terminates.
    • The if condition then increments q, so that q = σ (T_i).
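The invariant q = σ (T_i) established by the three cases can be observed on a concrete input. The Python sketch below is an illustration under assumptions, not a proof: kmp_states is a hypothetical helper that records q right after the if condition (before the reset q = Π [q] that follows a match), sigma evaluates the suffix function by brute force, and the pattern is assumed non-empty.

```python
def sigma(P, x):
    """Length of the longest prefix of P that is a suffix of x."""
    for k in range(min(len(P), len(x)), -1, -1):
        if x.endswith(P[:k]):
            return k


def compute_prefix_function(P):
    """Return pi with pi[q] for q = 1..m (pi[0] is unused)."""
    m = len(P)
    pi = [0] * (m + 1)
    k = 0
    for q in range(2, m + 1):
        while k > 0 and P[k] != P[q - 1]:
            k = pi[k]
        if P[k] == P[q - 1]:
            k += 1
        pi[q] = k
    return pi


def kmp_states(T, P):
    """Return the value of q after each text character, before any match reset."""
    pi = compute_prefix_function(P)
    m = len(P)
    states = []
    q = 0
    for ch in T:
        while q > 0 and P[q] != ch:
            q = pi[q]
        if P[q] == ch:
            q += 1
        states.append(q)   # the value claimed to equal sigma(T_i)
        if q == m:
            q = pi[q]
    return states


if __name__ == "__main__":
    T, P = "abababacaba", "ababaca"
    observed = kmp_states(T, P)
    expected = [sigma(P, T[:i]) for i in range(1, len(T) + 1)]
    assert observed == expected
    print(observed)
```

Recording q before the reset matters: once q = m the matcher replaces it with Π [m], so the equality with σ (T_i) holds only at the point where the match is tested.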
  • 63.
    Correctness of KMP-Matcher…
    After an occurrence of the pattern is found (q = m), q is reset to Π [q]; otherwise the next iteration of the for loop would try to compare against P [m + 1], which does not exist.
    II. Show that σ (T_i) = Φ (T_i)
    Basis: For i = 0, T_0 = ε → Φ (T_0) = 0 = σ (T_0)
    Assumption: Let Φ (T_i) = σ (T_i)
    Induction: Let q = Φ (T_i) and x = T [i + 1]
  • 64.
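    Theorem…
    Φ (T_{i + 1}) = Φ (T_i x)          (T_{i + 1} = T_i x)
                  = δ (Φ (T_i), x)     (definition of Φ)
                  = δ (q, x)           (Φ (T_i) = q)
                  = σ (P_q x)          (definition of δ: δ (q, x) = σ (P_q x))
                  = σ (T_i x)          (suffix-function recursion lemma, since q = σ (T_i))
                  = σ (T_{i + 1})
    Whenever the machine enters state q after reading T [i], q is the largest value such that P_q ⊐ T_i.
    → q = m iff the machine has just scanned an occurrence of the pattern P.
    Since Part I shows that KMP-MATCHER maintains the same value q = σ (T_i), KMP-MATCHER operates correctly.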
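The chain of equalities can be spot-checked numerically. The Python sketch below is a hedged illustration: sigma, delta and phi are names chosen here, δ is taken directly from its definition, and the check covers only one text and pattern. It tests both Φ (T_i) = σ (T_i) and the step that relies on the suffix-function recursion lemma, σ (T_i x) = σ (P_q x) with q = σ (T_i).

```python
def sigma(P, x):
    """Length of the longest prefix of P that is a suffix of x."""
    for k in range(min(len(P), len(x)), -1, -1):
        if x.endswith(P[:k]):
            return k


def delta(P, q, a):
    """Transition function of the string-matching automaton for P."""
    return sigma(P, P[:q] + a)


def phi(P, w):
    """Final-state function: the state reached after scanning w from state 0."""
    q = 0
    for a in w:
        q = delta(P, q, a)
    return q


if __name__ == "__main__":
    T, P = "abababacaba", "ababaca"
    for i in range(len(T) + 1):
        prefix = T[:i]
        # Part II of the proof: phi(T_i) = sigma(T_i).
        assert phi(P, prefix) == sigma(P, prefix)
        # Suffix-function recursion lemma for every character of the alphabet.
        q = sigma(P, prefix)
        for x in "abc":
            assert sigma(P, prefix + x) == sigma(P, P[:q] + x)
    print("phi(T_i) == sigma(T_i) for every prefix of T")
```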
  • 65.
    Appendix
    ε : Empty String
    Ø : Empty Language
    ∑ : Finite Input Alphabet
    ∑* : Language of all strings over ∑
    Eg.: If ∑ = {0, 1}, then ∑* = {ε, 0, 1, 00, 01, 10, 11, 000, . . .} is the set of all binary strings.
    x ⊐ y : x is a suffix of y
    Overlapping-Suffix Lemma: Suppose that x, y, and z are strings such that x ⊐ z and y ⊐ z.
    • If |x| ≤ |y|, then x ⊐ y.
    • If |x| ≥ |y|, then y ⊐ x.
    • If |x| = |y|, then x = y.
  • 66.
    Appendix…
    Π* [q]:
    Π* [q] = {Π [q], Π^(2) [q], Π^(3) [q], . . ., Π^(t) [q]}, where Π^(i) [q] is defined in terms of functional iteration, so that Π^(0) [q] = q and Π^(i) [q] = Π [Π^(i – 1) [q]] for i ≥ 1, and where the sequence in Π* [q] stops upon reaching Π^(t) [q] = 0.
    E_{q – 1}:
    For q = 2, 3, . . ., m, define the subset E_{q – 1} ⊆ Π* [q – 1] by
    E_{q – 1} = {k ∈ Π* [q – 1] : P [k + 1] = P [q]}
              = {k : k < q – 1 and P_k ⊐ P_{q – 1} and P [k + 1] = P [q]}   (Prefix Function Iteration Lemma)
              = {k : k < q – 1 and P_{k + 1} ⊐ P_q}
    E_{q – 1} consists of those values k ∈ Π* [q – 1] for which P_k can be extended to P_{k + 1} to give a proper suffix of P_q.
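Both sets are easy to compute for the running example P = ababaca. The Python sketch below is illustrative: pi_star and e_set are hypothetical helper names, the Π values are those of the table for ababaca, and indexing is 0-based, so the slide's P [k + 1] = P [q] becomes P[k] == P[q - 1]. The final check uses the standard textbook relation Π [q] = 0 if E_{q – 1} is empty and 1 + max E_{q – 1} otherwise, which is not stated on this slide.

```python
P = "ababaca"
# Prefix function of ababaca; PI[0] is unused.
PI = [0, 0, 0, 1, 2, 3, 0, 1]


def pi_star(pi, q):
    """Iterate pi from q until 0 is reached: [pi[q], pi[pi[q]], ..., 0]."""
    chain = []
    while q > 0:
        q = pi[q]
        chain.append(q)
    return chain


def e_set(P, pi, q):
    """E_{q-1}: the k in pi*[q-1] with P[k+1] = P[q] in the slide's 1-based indexing."""
    return [k for k in pi_star(pi, q - 1) if P[k] == P[q - 1]]


if __name__ == "__main__":
    print(pi_star(PI, 5))
    for q in range(2, len(P) + 1):
        E = e_set(P, PI, q)
        # Assumed relation (standard result, not on the slide).
        assert PI[q] == (1 + max(E) if E else 0)
        print(q, E)
```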
  • 67.
    References:
    • Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, Clifford Stein, Introduction to Algorithms, Third Edition, The MIT Press, Cambridge, Massachusetts; London, England.