Thursday, November 17, 2011

Learning Scala : Boyer–Moore search 1

Authored by Win Myo Htet

I have been tipped off about Boyer-Moore for this Learning Scala series. Here is the wiki. Basically, it tries to do the matching of the search string at the rear, thus if it does not match, it can hop the length of the search string. So, the longer the search string, the more efficient it is.

As usual, the wiki will be our common reference. It is said that BM maintain two tables. The second table check if the character is in the search string or not. If the character is not in the search string at all, then, you can easily jump to the length of the search string. It also stores each character of the search string with the value which is the distance from the end of the string of the character.

It is the first table that is quite complex to construct. I would recommend you  to  read again and again until you understand the construction of this table before you get to the code itself. This table will determine if the program will do proper thorough search without bugs or not. Let's try to dive into the first 4 rows of the table since the rest of the value is the same as the 4th row.

(The current character inspected is in bold and next nearest possible match is underlined )
1st row. If the character is not N, it'd better be A or M or P. Otherwise, we will hop the length of the search string, which is 8, the length of the search string, according to the second table. So what is the minimum jump we can make, now that the character is not N. It has to be 1 jump to cover the possibility of the character being A. We assign the value 1 here.

2nd row. Now that we have the last character matching N and get to the character before it, we have the case of string "AN". So if the character is not A, as usual, it'd better be M or N or P. Then, we have strings of "MN", "NN" and "PN". None of them are part of the actual search string "ANPANMAN". The value for not matching the needed character lead us to assign 8, the length of the search string.

3rd row. "MAN" means "AAN", "NAN" and "PAN". Of these, "PAN" is the sub string of our "ANPANMAN". The wiki said that its jump value is 3. How do we get 3 ?  We know that the matched P (which is M) from our current search position is 5. And P position in the search string is 2. We will subtract these values and we get 3.

4th row. "NMAN" does not give us valid sub string.  So, we will backtrack one more to look for the valid sub string, "NMAN", "MAN", which is a substring of "ANPANMAN". Our position of M in our search String is 5. The position of the first sub string of MAN in search string also is 5. If we do the difference, 5-5=0 and we are not making any jump. We don't want that! Let's backtrack one more time. "NMAN", "AN" is a sub string of "ANPANMAN".  Our A position in sub string NMAN in our search is 6 and the position of the first sub string AN in ANPANMAN is 0. 6-0=6.

If you have thus far followed me, good! Let's go over this table  one more time!

1st row redux. This is the first case and there are little info to do any sort of meaningful processing. We simply can assume that the value will always be 1.

2nd row redux. This row is an odd one in the sequence of 1, 8, 3, 6, ... The jump value 8 is meaningful for the search string ANPANMAN, where AN is not a sub string of the search string,ANPANMAN . How about if these two characters string is a substring of the search string. For example, search String is ANPANMPN and our case is AAAAAAANPANMPN, where PN is AN. So, it is similar to 4th row above now. So, in this case, we will have have 6-0=6.

Now that we have a bit of general pattern, we won't go over the rest of the row. The first row will always be 1. From 2nd and onward, we can do processing in a way that if the string of length 2 is not substring of the search string, the value will be the length of the search string, else we will always look for the difference of the position of the character in a current string to the position of the first valid sub string in the search string. If the difference is 0, we will backtrack the character and do the processing again. Now that we have analyzed what we can, let's try to put them into code next.

Authored by Win Myo Htet

No comments:

Post a Comment