jueves, 9 de agosto de 2012

Borders of a string

I'm going to start by defining a border of a string.

Definition:Given a string s of length n, we'll call a border of s to every prefix s', of length n' < n, that is also a suffix of s.

Example: abc is a border of abcdabc, but aa is not a border of aa.

Lemma: A border of a border, is a border.

To see this we take s' as the border of s and s'' as the border of s', and the result is followed by the observation that if s'' is a border of s' then it is a prefix and suffix of s', and given the fact that s' is a prefix and a suffix of s, we get that s'' is a prefix and a suffix of s.#

We are interested in obtaining the maximal length border for a string. To calculate this value we are going to use an algorithm that calculates the border of every suffix of the string.

First let's see what is the expected output of the algorithm:

for a string abracadabra the returned array should be:
       S =   a  b  r  a  c  a  d  a  b  r  a
border[] = { 0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4};

We can check that the border of abracadabra is the word abra.

Taking a close look at the border array we can start having some ideas of how to calculate the border of a string. Let's see if it is possible to calculate the border of the string s[0..i+1] given the border of the string s[0..i] in an efficient manner.

Suppose that given i >= 0 we have the border of all the prefixes of length less or equal than i of s. Let's call j = border[i], to the length of the border of s[0..i]. And now:
  • If s[i+1] == s[j+1], then we can "grow" the border s[0..j] of s[0..i] to s[0..j+1] for s[0..i+1].
  • If s[i+1] != s[j+1], then the border we had calculated for s[0..i] can't be pumped for s[0..i+1], but maybe there is a border of the border s[0..j] that can be pumped. So we should try with all the borders of s[i], and we do it in an ordered fashion, first with the border of s[0..j] then with the border of the border of s[0..j] and so on, until either there is no border or  we find a border that satisfies the first condition.
Algorithm:
int border[MAXN];
void calculate_border(const char ∗s) {
  int i = 1, j = −1, n = strlen ( s ) ;
  border[0] = −1;
  while(i < n) {
    while(j >= 0 && s[i] != s[j+1]) j = border[j];
    if (s[i] == s[j+1]) j++;
    border[i++] = j;
  }
}
Complexity: O(length(s))

The algorithm does not produce the array we expected, this array is shifted by -1 in every cell. It is like that for implementation reasons.
To see why its complexity is O(n), we can start  by noticing that we increment i on every iteration of the outer while loop, and we increment j at most as much as i. And on the second while loop we decrement j by at least 1 every time (by definition of border).

Uses:

Period of a string: the border of a string is related to the period of a string, to learn about it you can try to solve the problem PERIOD.

String matching: everybody knows that KMP is the de facto algorithm for string matching, but we can modify slightly this algorithm to find the matches of a string s in a string S. We can do so by sticking them together with a character that we know doesn't belong to s nor S.
We then run the border algorithm over the new word and count a match of s whenever we get j == length(s).

This new algorithm has the same space-time complexity as KMP, with the difference that the worst case is O(length(s)) instead of logφ(length(s)), where φ = (1+sqrt(5))/2.

You can try this with NHAY, but be aware that the size of S in this problem is not bounded ;).

4 comentarios:

nicolasw dijo...

Minor remarks:
- lenth
- We are ** interested in obtaining the maximal length ...
- let's
- "... with the difference that the worst case is O(n) instead of (1+sqrt(5))/2.". Isn't it the case there is some missing n in the second expression?

Major remarks
- Awesome idea.
- Awesome blog name.

Gaston dijo...

Thank you very much :)

Unknown dijo...

A shorter version:

int border[MAXN], p;
void calculate_border(const char ∗s) {
border[0] = −1, p = -1;
int n = strlen(border);
for(int i = 1; i <= n; i++){
while(p >= 0 && s[p] != s[i - 1]) p = border[p];
border[i] = ++p;
}
}

Diego dijo...

Felicitaciones por el blog! algunos comentarios sobre el post:
No es trivial que el algoritmo es lineal (tu analisis me parece que esta mal). Tambien hay un error sobre lo de KMP, su complejidad temporal es claramente O(m+n). Y al usar esto para string matching (asi a lo bruto) el espacio extra es O(m+n) en vez del O(m) de KMP. Pero fijate que tu algoritmo y KMP son MUY similares, diria casi que son el mismo.