Authored by Win Myo Htet
I have been tipped off about Boyer-Moore for this Learning Scala series. Here is the wiki. Basically, it tries to do the matching of the search string at the rear, thus if it does not match, it can hop the length of the search string. So, the longer the search string, the more efficient it is.
As usual, the wiki will be our common reference. It is said that BM maintain two tables. The second table check if the character is in the search string or not. If the character is not in the search string at all, then, you can easily jump to the length of the search string. It also stores each character of the search string with the value which is the distance from the end of the string of the character.
It is the first table that is quite complex to construct. I would recommend you to read again and again until you understand the construction of this table before you get to the code itself. This table will determine if the program will do proper thorough search without bugs or not. Let's try to dive into the first 4 rows of the table since the rest of the value is the same as the 4th row.
(The current character inspected is in bold and next nearest possible match is underlined )
Case : AANPANMAN
1st row. If the character is not N, it'd better be A or M or P. Otherwise, we will hop the length of the search string, which is 8, the length of the search string, according to the second table. So what is the minimum jump we can make, now that the character is not N. It has to be 1 jump to cover the possibility of the character being A. We assign the value 1 here.
Case : ANPANMPNANPANMAN
2nd row. Now that we have the last character matching N and get to the character before it, we have the case of string "
Case : AAAANPANMAN
3rd row. "
Case : "AAAAAMANPANMAN"
4th row. "
If you have thus far followed me, good! Let's go over this table one more time!
1st row redux. This is the first case and there are little info to do any sort of meaningful processing. We simply can assume that the value will always be 1.
2nd row redux. This row is an odd one in the sequence of 1, 8, 3, 6, ... The jump value 8 is meaningful for the search string ANPANMAN, where
Now that we have a bit of general pattern, we won't go over the rest of the row. The first row will always be 1. From 2nd and onward, we can do processing in a way that if the string of length 2 is not substring of the search string, the value will be the length of the search string, else we will always look for the difference of the position of the character in a current string to the position of the first valid sub string in the search string. If the difference is 0, we will backtrack the character and do the processing again. Now that we have analyzed what we can, let's try to put them into code next.
Authored by Win Myo Htet