Friday, November 18, 2011

Learning Scala : Boyer–Moore search 2

Authored by Win Myo Htet



In the previous post, I have explained the second table before the first table. The reason is that the second table is easy to understand. We will do the same and code the second table first.

def createTable2(str : String) : Map[Char, Int] = {
  val strOps = new StringOps(str).reverse.tail

    val result = strOps.foldLeft(Map[Char,Int](),1)  {
      case ((mMap,pos),ch) if ! mMap.isDefinedAt(ch) 
        => (mMap ++ Map(ch -> pos), pos+1)
      case ((mMap,pos),ch)  => (mMap,pos +1)
    }
    return result._1
We convert our search string, str, to StringOps type, reverse it and drop the first charcter while keeping the rest. If it is of StringOps format, we can do manipulation like List. It is being reversed so that we can operate foldLeft (foldLeft is preferred over foldRight whenever it is possible). We drop the first character, which was the last character before because the last character is not considered in the second table. After that, our usual foldLeft manipulates the string to have a tuple of Map[Char,Int] and Int. The last Int is to track the position of the iteration process. We add each character into the Map if it is not there. Anything else, we just increment the pos to track the iteration position. We return the Map from the tuple.

The second table is done and we will work on the first table. It is pretty much the same pattern: StringOps, reverse, foldLeft, tuple, Map, etc ... The only differences will be calling calculate function and using the filter on the List. filter filters out character based on the condition. Here is filter's signature "def filter (p: (Char) ⇒ Boolean): String"

def createTable1(str : String) : Map[Int, Int] = {
  val strOps = new StringOps(str).reverse
  val strList = strOps.foldLeft(List[Char]()) {
   case (l, ch) if !l.contains(ch) => ch :: l
   case (l, ch) => l
  }
  val result = strOps.foldLeft(Map[Int, Int](), 0, "") {
   case ((mMap, pos, subStr), ch) if pos != 0 => (mMap ++ Map(pos -> calculate(pos, subStr, `str`,
    `strList`.filter {
     case x if x == ch => false
     case _ => true
    })), pos + 1, ch + subStr)
   case ((mMap, pos, subStr), ch) => (Map(0 -> 1), pos + 1, ch + subStr)
  }
  result._1
 }

There also is a (`backtick quote`), which allows us to refer to the variables outside of the scope. So, the general structure of the code is similar to that of the second table. Here we have Map[Int,Int] instead of Map[Char,Int]. I have decided to use Int, the length of sub string for the Map key. strList with filter will give us a new List with valid character. For example, in valid characters List (A,M,N,P), M will generate List (A,N,P). The calculate function will use strList to create a new sub string and get their position in the search string by calling subSetJump function.

def calculate(pos : Int, subStr : String,
  str : String, strList : List[Char]) : Int = {
  
  val jList = strList.foldLeft(List[Int]()) {
   (j, ch) => subSetJump(ch + `subStr`, `str`) :: j
  }

  if (jList.max < 0) {
   if (subStr.length == 1) return str.length
   return calculate(pos - 1, subStr.slice(0, subStr.length),
    str, List[Char]())
  }
  if (str.length - 1 - pos - jList.max == 0)
   return calculate(pos - 1, subStr.slice(1, subStr.length),
    str, List[Char]())
  str.length - 1 - pos - jList.max
 }

def subSetJump(subStr : String, str : String) : Int = {
  val list = (for (i <- 0 until str.length)
   yield str.slice(i, i + subStr.length)).toList
  list.indexOf(subStr)
 }

subSetJump function will return the index or -1 if there is no match. We check the condition on end-of-string, mis-match and zero-jump; and recursively call the calculate function again with the updated pos and subStr. When we call the calculate function again, we are doing the case for backtrack. That means there won't be any variation of sub strings. Thus, I have added the code for that above jList. It can be a bit of redundant and messey but I am a bit more into getting the working code fast ;) .

calculate function with more code
def calculate(pos : Int, subStr : String,
  str : String, strList : List[Char]) : Int = {
  if (strList.isEmpty) {
   val jListMax = subSetJump(subStr, str)
   if (jListMax < 0)
    return calculate(pos - 1, subStr.slice(1, subStr.length),
     str, List[Char]())
   if (str.length - 1 - pos - jListMax == 0)
    return calculate(pos - 1, subStr.slice(1, subStr.length),
     str, List[Char]())
   return str.length - 1 - pos - jListMax
  }

  val jList = strList.foldLeft(List[Int]()) {
   (j, ch) => subSetJump(ch + `subStr`, `str`) :: j
  }

  if (jList.max < 0) {
   if (subStr.length == 1) return str.length
   return calculate(pos - 1, subStr.slice(0, subStr.length),
    str, List[Char]())
  }
  if (str.length - 1 - pos - jList.max == 0)
   return calculate(pos - 1, subStr.slice(1, subStr.length),
    str, List[Char]())
  str.length - 1 - pos - jList.max
 }

Now, we are done with the first table too. Next we will use these two tables to do the search.

Authored by Win Myo Htet

No comments:

Post a Comment