viernes, 10 de agosto de 2012

Segment Tree

Disclaimer

Today we are going to talk about a structure that I'll call segment tree. It might not be the politically correct name for it, but anyway.

Introduction

The segment tree data structure described in this post is useful when the solution to our problem comes from a reduction with a given operator within a range of elements in an array. Updating the original array is property that we'll ask for later, and the time complexity for the query/update actions will be O(log(n)).

Let's start by giving some examples for what is it good for. Suppose we have an array of numbers arr of length n. Given [a, b) where 0 <= a < b <= n, a range of elements:
  • Getting the maximum/minimum element of the range. (the max/min operator)
  • Getting the sum/product/xor of the elements in the range. (the +/*/^ operator). *
  • Getting the biggest continuous subsequence in that range. (custom operator that combines segments)
This few examples are very representative of the different uses of segment tree in programming competitions.

* OK, we can do this ones easily in O(1), but if we are going to update the values in the array, we are going to need a better data structure.

First approach

Now let's think of how can we accomplish such time complexity without update. We are going to write a simpler than segment tree data structure that only answers query operations. To achieve this we are going to store the result of applying the given operation in all segments starting at any cell of the array and of length 2^k for k = 0..log2(n) and we'll call the array seg_tree(even though it is not a tree).

So to calculate the operation on the segment of length 2^(k+1) starting at position i we are going to combine the segment of length 2^k starting at i and the segment of length 2^k starting at (i+2^k).

int seg_tree[MAXN][LOGN];

void build(int n) {
  for (int i = 0; i < n;i++) seg_tree[i][0] = arr[i];
  for (int j = 1; j <= n;j<<=1)
  for (int i = 0; i < n;i++)
    seg_tree[i][j] = seg_tree[i][j-1];
    if (i+j < n) seg_tree[i][j] += seg_tree[i+j][j-1];
  }
}

int query(int a, int b) {
  int idx = a, len = b-a-1, r = 0, rmbit;
  while(len) {
    r += seg_tree[idx][__builtin_ctz(rmbit = len&-len)];
    idx += rmbit;
    len -= rmbit;
  }
  return r;
}

One of the bit tricks we are using is __builtin_ctz which gives us the amount of trailing zeroes in the binary representation of the number given to it as argument. It is a builtin function provided by the GNU C++ compiler, but if it is not available we can emulate it with a while loop.

The other trick is doing len&-len, iff len != 0 this expression returns the rightmost bit set in len.

So basically what we do when answering a query is: we position ourselves in the position a, and then we accumulate the result up to b using the binary representation of the distance between them. Every bit set to 1 in a position k of the binary representation of that distance is advanced using the segment of length 2^k starting at the current position.

Complexity

build: O(n . log(n))

query: O(log(n))

Now we have the problem of the update. Let's try to get that right.

Second approach

We can start by noticing that there is a lot of redundant information. We can create a different 'segment structure' that only creates enough segments of length 2^k to cover the entire array, not n segments of length 2^k for every k. Here is when the binary tree structure comes in.

We are going to store the information of those segments using a binary tree.

The root will contain the information of the segment of length 2^k that covers the array completely (so 2^k > n). Then the left child will contain the information of the segment of length 2^(k-1) that covers the first half of the array, and the right child will contain the information of the other segment of length 2^(k-1) corresponding to the other half, and so on for their children's until we are left with segments of length 2^0. And those will be our leafs.

We can see that the resulting tree is a complete binary tree, except for the leaf level, and we know those trees are very easy to store in practice using an array of length equal to the number of nodes. The number of nodes is going to be two times the length of the original array. To see this let's count the number of nodes. The leaf level will have n nodes, the level above this one n/2, and the one above it n/4, until we hit just one node.

n + n/2 + n/4 + n/8 + .. = n . (1 + 1/2 + 1/4 + 1/8 + ...) = n . 2

Now let's go with the implementation.

int seg_tree[MAXN*2+1];

int combine(int a, int b) {
  // const complexity function to combine two solutions(+,-,^,min,max).
  ...
}

void build(int node, int left, int right) {
  if (left+1 == right) {
    // leaf case.
    seg_tree[node] = arr[left];
    return;
  }
  int mid = left+right>>1;
  build(node * 2, left, mid);
  build(node * 2 + 1, mid, right);
  // we are a non-leaf node, so we should do some processing.
  seg_tree[node] = combine(seg_tree[node * 2], seg_tree[node * 2 + 1]);
}

int query(int node, int left, int right, int a, int b) {
  // we are completely outside the range.
  if (b <= left or a >= right) return -1; // neutral of the 'operator'
  // we are completely inside the range. 
  if (left <= a and b <= right) return seg_tree[node];
  int mid = left+right>>1;
  int answ_left = query(node * 2, left, mid, a, b);
  int answ_right = query(node * 2 + 1, mid, right, a, b);
  return combine(answ_left, answ_right);
}

int update(int node, int left, int right, int idx, int val) {
  if (left+1 == right) {
    // leaf node, implies left == idx
    return seg_tree[node] = arr[left] = val;
  }
  int mid = left+right>>1;
  if (idx > mid) update(node * 2, left, mid, idx, val);
  else           update(node * 2 + 1, mid, right, idx, val);
  return seg_tree[node] = combine(seg_tree[node * 2], seg_tree[node * 2 + 1]);
}

The query segments are interpreted as a closed/open interval [a,b). We need to call all functions with the arguments node number 1, left 0, and right n.

And that's pretty much it.

Complexity:

Assuming combine operation runs in constant time.

build: O(n)

The argument is the one we used before, the number of nodes is linear.

query: O(log(n))

Suppose that we are going to be queried for a segment that starts where a segment of length 2^k starts in our binary tree, but ends 2^k-1 positions ahead, that means that we are going to have to use all segments 2^(k-1) , 2^(k-2)2^(k-3)... until we reach 2^0. That is clearly a sequence of length at most log(n).

Now let's forget about that restriction (starts where a 2^k length segment does and all other segments used to calculate have length less than 2^k) and find the biggest segment that splits our range in two (it is also a split in the binary tree), and do the exact same calculation as we did before but for the first half and for the second half now. Now we see that the maximum possible number of segments that we can touch is two times log(n).

update: O(log(n))

This one is easier, we only update the segments that intersect our update point, and there are exactly log(n) such segments.


Lazy Propagation

The idea behind lazy propagation is to do just that. In the update event, if we are updating a contiguous sequence of cells, instead of updating all the cells in the given range, we propagate the change while the segment being analysed by the update call is not contained completely by the updated range. When that happens we just mark a flag saying 'hey continue to update this range latter'. We'll have a problem to practice this.

That delays the update action until a query comes in to get the actual value. Just like in lazy evaluation.


Problems

The problems are by no means easy, but if you manage to solve them you are pretty much done with learning about segment tree.

Frequent values: this is a classic example of segment tree. The easy one.


Can you answer these queries I: not so classic, but fun. The hard one.


Ahoy, Pirates!: try to solve this problem, it is hard and it also needs lazy propagation for it to be within the time limit, but it is a very good one.

Other problems

Other problems we are able to solve using this structure are:

Given a dynamic set of integers, meaning that you can add/remove elements, getting the k-th element in log(n). Note that the sorting approach is not good enough  and that using k-th element of STL gives you linear time, while it will diminish your ability to remove elements. Exercise with  K-th Number.

Sweep-line 'geometric' algorithms, like we have in Election Posters.

Pretty sure I'm missing some techniques or tricks, so if you know about one, let me know so I can add it.

jueves, 9 de agosto de 2012

Borders of a string

I'm going to start by defining a border of a string.

Definition:Given a string s of length n, we'll call a border of s to every prefix s', of length n' < n, that is also a suffix of s.

Example: abc is a border of abcdabc, but aa is not a border of aa.

Lemma: A border of a border, is a border.

To see this we take s' as the border of s and s'' as the border of s', and the result is followed by the observation that if s'' is a border of s' then it is a prefix and suffix of s', and given the fact that s' is a prefix and a suffix of s, we get that s'' is a prefix and a suffix of s.#

We are interested in obtaining the maximal length border for a string. To calculate this value we are going to use an algorithm that calculates the border of every suffix of the string.

First let's see what is the expected output of the algorithm:

for a string abracadabra the returned array should be:
       S =   a  b  r  a  c  a  d  a  b  r  a
border[] = { 0, 0, 0, 1, 0, 1, 0, 1, 2, 3, 4};

We can check that the border of abracadabra is the word abra.

Taking a close look at the border array we can start having some ideas of how to calculate the border of a string. Let's see if it is possible to calculate the border of the string s[0..i+1] given the border of the string s[0..i] in an efficient manner.

Suppose that given i >= 0 we have the border of all the prefixes of length less or equal than i of s. Let's call j = border[i], to the length of the border of s[0..i]. And now:
  • If s[i+1] == s[j+1], then we can "grow" the border s[0..j] of s[0..i] to s[0..j+1] for s[0..i+1].
  • If s[i+1] != s[j+1], then the border we had calculated for s[0..i] can't be pumped for s[0..i+1], but maybe there is a border of the border s[0..j] that can be pumped. So we should try with all the borders of s[i], and we do it in an ordered fashion, first with the border of s[0..j] then with the border of the border of s[0..j] and so on, until either there is no border or  we find a border that satisfies the first condition.
Algorithm:
int border[MAXN];
void calculate_border(const char ∗s) {
  int i = 1, j = −1, n = strlen ( s ) ;
  border[0] = −1;
  while(i < n) {
    while(j >= 0 && s[i] != s[j+1]) j = border[j];
    if (s[i] == s[j+1]) j++;
    border[i++] = j;
  }
}
Complexity: O(length(s))

The algorithm does not produce the array we expected, this array is shifted by -1 in every cell. It is like that for implementation reasons.
To see why its complexity is O(n), we can start  by noticing that we increment i on every iteration of the outer while loop, and we increment j at most as much as i. And on the second while loop we decrement j by at least 1 every time (by definition of border).

Uses:

Period of a string: the border of a string is related to the period of a string, to learn about it you can try to solve the problem PERIOD.

String matching: everybody knows that KMP is the de facto algorithm for string matching, but we can modify slightly this algorithm to find the matches of a string s in a string S. We can do so by sticking them together with a character that we know doesn't belong to s nor S.
We then run the border algorithm over the new word and count a match of s whenever we get j == length(s).

This new algorithm has the same space-time complexity as KMP, with the difference that the worst case is O(length(s)) instead of logφ(length(s)), where φ = (1+sqrt(5))/2.

You can try this with NHAY, but be aware that the size of S in this problem is not bounded ;).

Welcome

I'm writing this blog with the purpose of passing my experience and knowledge to others and to improve my understanding of the different techniques that are going to be explained. I don't have a defined goal or roadmap, so I'll just write what comes to my mind when I feel like writing.

If you would like to see some algorithm explained just let me know and I'll do my best.