Boyer-Moore string search algorithm in ruby
Update: I cleaned the code up a bit and added more comments.
I was looking for a fast algorithms to search for matching strings. Boyer-Moore seems to be a good choice. It has the peculiar property that it gets faster once you're looking for longer strings. Wikipedia: Boyer–Moore string search algorithm. This site also has a good explanation with graphical examples: Boyer-Moore algorithm
I translated the wikipedia example implementation to Ruby to gain more insight, and I'd like to share.
############################################################# # Boyer-Moore for Ruby # # Author: Arne Brasseur (myfirstname@firstnamelastname.net) # # # # Licence: public domain # # # ############################################################# # Some helpers class String # Take first n characters def first(n=1) self.unpack('U*').first(n).pack('U*') end # Take last n characters def last(n=1) self.unpack('U*').last(n).pack('U*') end end module Kernel # Shorthand def max(i1,i2); i1 > i2 ? i1 : i2; end end # This implementation is based on the C implementation found # on Wikipedia. The purpose is mostly educational, since # String#index has the same functionality. If you want String#index # to use this algorithm, include this module in String. # # It is assumed your strings are UTF-8, $KCODE is ignored. module BoyerMoore # return the position of needle in haystack, or nil if not found def self.search(haystack, needle) # Work on Arrays containing Unicode codepoint indices needle, haystack = needle.unpack('U*'), haystack.unpack('U*') # Maps a position in needle to a number of bytes to shift # should the preceding byte differ skip = [] # Maps characters in needle to their last index in needle occ = Hash.new {-1} return unless needle.length > 0; #Preprocess #1: init occ[] needle[0..-2].each_with_index{|c,i| occ[c]=i} #Preprocess #2: init skip[] needle.length.times do |i| value=0 while (value < needle.length && !needlematch(needle, i, value)) do value+=1 end skip[needle.length-i-1] = value end #Search hpos=0 while (hpos <= haystack.length - needle.length) do npos = needle.length-1 # traverse the needle in reverse, if all bytes match we have a winner while (needle[npos] == haystack[npos+hpos]) do return hpos if npos==0 npos -= 1; end # otherwise shift, either based on skip[] or based on occ[] hpos += max(skip[npos], npos - occ[haystack[npos+hpos]]); end end # Alternative index method for String def index(needle) BoyerMoore.search(self, needle) end private def self.needlematch(needle, length, offset) #cut off offset bytes from needle needle_begin = needle.first(needle.length-offset) #if both needle and needle_begin contain at least length+1 bytes if (needle_begin.length > length) needle[-length-1] != needle_begin[-length-1] && needle.last(length) == needle_begin.last(length) else needle.last(needle_begin.length) == needle_begin end end end #class String # include BoyerMoore #end # example needle='abcab' ['12abcabc', 'abcgghhhaabcabccccc', '123456789abc123abc', 'aabbcc'].each do |hay| puts "#{BoyerMoore.search(hay, needle)} -- #{hay.index(needle)}" end
It's already more idiomatic Ruby than the original, but I believe it can be made even more so. I tried to make a readable implementation so it's easy to grasp what the algorithm is doing. Hence the prefix/suffix methods of String.










Comments
Can you give an example on
Can you give an example on how to use it?
Thanks a lot,
Luan
I figured it out include
I figured it out
include BoyerMoore
str = "asdfasjljfads"
pattern = "sdkfaks"
if BoyerMoore.search(str, pattern)
printf("true")
else
print("false")
end
Thank you so much for the code!
Luan
I figured it out include
I figured it out
include BoyerMoore
str = "asdfasjljfads"
pattern = "sdkfaks"
if BoyerMoore.search(str, pattern)
printf("true")
else
print("false")
end
Thank you so much for the code!
Luan
Very nice! Are you interested
Very nice! Are you interested in releasing this as a open source software? I can help you to release it. Thanks.
I hereby put the above code
I hereby put the above code under the 3-clause BSD licence. Copy it, alter it, use it for any purpose you see fit or unfit.
Feed it to your cat if you like, just don't blame me if she chokes.
Looking at it again there's
Looking at it again there's some ugly stuff in there, like the definition of max inside another method. But still, use it however you see fit.
Please let me know if you actually use this on a project.
Post new comment