Boyer-Moore string search algorithm in ruby

Update: I cleaned the code up a bit and added more comments.

I was looking for a fast algorithms to search for matching strings. Boyer-Moore seems to be a good choice. It has the peculiar property that it gets faster once you're looking for longer strings. Wikipedia: Boyer–Moore string search algorithm. This site also has a good explanation with graphical examples: Boyer-Moore algorithm

I translated the wikipedia example implementation to Ruby to gain more insight, and I'd like to share.

#############################################################
# Boyer-Moore for Ruby                                      #
# Author: Arne Brasseur (myfirstname@firstnamelastname.net) #
#                                                           #
# Licence: public domain                                    #
#                                                           #
#############################################################

# Some helpers

class String  

  # Take first n characters
  def first(n=1)  
    self.unpack('U*').first(n).pack('U*')
  end  

  # Take last n characters
  def last(n=1)  
    self.unpack('U*').last(n).pack('U*')
  end  
end

module Kernel

  # Shorthand
  def max(i1,i2); i1 > i2 ? i1 : i2; end
end


# This implementation is based on the C implementation found
# on Wikipedia. The purpose is mostly educational, since 
# String#index has the same functionality. If you want String#index
# to use this algorithm, include this module in String.
#
# It is assumed your strings are UTF-8, $KCODE is ignored.

module BoyerMoore    

  # return the position of needle in haystack, or nil if not found
  def self.search(haystack, needle)

    # Work on Arrays containing Unicode codepoint indices
    needle, haystack = needle.unpack('U*'), haystack.unpack('U*')
    
    # Maps a position in needle to a number of bytes to shift 
    # should the preceding byte differ  
    skip = []  
    
    # Maps characters in needle to their last index in needle
    occ = Hash.new {-1}
    
    return unless needle.length > 0;  
    
    #Preprocess #1: init occ[]  
    needle[0..-2].each_with_index{|c,i| occ[c]=i}  
    
    #Preprocess #2: init skip[]  
    needle.length.times do |i|  
      value=0  
      while (value < needle.length && !needlematch(needle, i, value)) do  
        value+=1  
      end  
      skip[needle.length-i-1] = value  
    end  
    
    #Search  
    hpos=0  
    while (hpos <= haystack.length - needle.length) do  
      npos = needle.length-1  
      
      # traverse the needle in reverse, if all bytes match we have a winner  
      while (needle[npos] == haystack[npos+hpos]) do
        return hpos if npos==0  
        npos -= 1;  
      end  

      # otherwise shift, either based on skip[] or based on occ[]  
      hpos += max(skip[npos], npos - occ[haystack[npos+hpos]]);
    end  
  end

  # Alternative index method for String
  def index(needle)
    BoyerMoore.search(self, needle)
  end
  
  private
    def self.needlematch(needle, length, offset)  
      #cut off offset bytes from needle  
      needle_begin = needle.first(needle.length-offset)  
      
      #if both needle and needle_begin contain at least length+1 bytes 
      if (needle_begin.length > length)  
        needle[-length-1] != needle_begin[-length-1]  &&  
          needle.last(length) == needle_begin.last(length)  
      else  
        needle.last(needle_begin.length) == needle_begin  
      end  
    end

end

#class String
#  include BoyerMoore
#end

# example

needle='abcab'
['12abcabc', 'abcgghhhaabcabccccc', '123456789abc123abc', 'aabbcc'].each do |hay|
    puts "#{BoyerMoore.search(hay, needle)} -- #{hay.index(needle)}"
end

It's already more idiomatic Ruby than the original, but I believe it can be made even more so. I tried to make a readable implementation so it's easy to grasp what the algorithm is doing. Hence the prefix/suffix methods of String.

Comments

Can you give an example on

Can you give an example on how to use it?

Thanks a lot,
Luan

I figured it out include

I figured it out

include BoyerMoore

str = "asdfasjljfads"
pattern = "sdkfaks"

if BoyerMoore.search(str, pattern)
printf("true")
else
print("false")
end

Thank you so much for the code!
Luan

I figured it out include

I figured it out

include BoyerMoore

str = "asdfasjljfads"
pattern = "sdkfaks"

if BoyerMoore.search(str, pattern)
printf("true")
else
print("false")
end

Thank you so much for the code!
Luan

Very nice! Are you interested

Very nice! Are you interested in releasing this as a open source software? I can help you to release it. Thanks.

I hereby put the above code

I hereby put the above code under the 3-clause BSD licence. Copy it, alter it, use it for any purpose you see fit or unfit.

Feed it to your cat if you like, just don't blame me if she chokes.

Looking at it again there's

Looking at it again there's some ugly stuff in there, like the definition of max inside another method. But still, use it however you see fit.

Please let me know if you actually use this on a project.

Post new comment

The content of this field is kept private and will not be shown publicly.
CAPTCHA
This question is for testing whether you are a human visitor and to prevent automated spam submissions.
Syndicate content