Luigi Auriemma

aluigi.org (ARCHIVE-ONLY FORUM!)
It is currently 19 Jul 2012 13:32

All times are UTC [ DST ]





Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 5 posts ] 
Author Message
 Post subject: signsrch search algorithm
PostPosted: 12 Oct 2007 23:03 

Joined: 10 Sep 2007 05:40
Posts: 6
I downported David Musser's "Fast Generic Search Algorithm" to C and adapted it to signsrch. The algorithm is a frankenstein of Knuth-Morris-Pratt and Boyer-Moore, combining the best attributes of both with some extra tricks thrown in making it extremely fast (especially for large datasets and search strings). The C code is a little spaghetti-ish since it's mostly a line for line conversion; it's somewhat easy to screw it up if you try to refactor and miss a small detail, but this version has matched the brute-force results on everything I've tested.

To use it with signsrch, just #include "hal_search.h" and replace the search_file() call in signsrch.c with search_hashed(). Function arguments and behavior are the same as the default search. I get a ~14-16x speedup with the current signature db (1752 sigs/1.84mb) and search_hashed(), allowing even my crappy 1.1ghz tbird to scan Photoshop 7 (16mb) in around a minute.

You can toy around with max_hash_range, suffix_size, and the hash function to squeeze out some extra performance. Musser has 4 different variations in his source code (DNA_search.h), but they're tuned for DNA sequences. I've managed to take Photoshop from ~65 seconds to ~55 with the current settings.


Musser's Original Paper:
http://www.cs.rpi.edu/~musser/gp/gensearch1.pdf
http://www.cs.rpi.edu/~musser/gp/gensearch/


Attachments:
File comment: hash accelerated linear search
hal_search.zip [1.4 KiB]
Downloaded 190 times
Top
 Profile  
 
 
 Post subject:
PostPosted: 14 Oct 2007 12:52 

Joined: 13 Aug 2007 21:44
Posts: 4068
Location: http://aluigi.org
Wow Andrew it's a great work!
The speed improvement is simply... awesome.
I will release the new version today or tomorrow since I want to add also some new important signatures to the database.
I have added your name and the link to your website in hal_search.c, hope it's ok for you.
Thanx


Top
 Profile  
 
 Post subject:
PostPosted: 15 Oct 2007 19:48 

Joined: 10 Sep 2007 05:40
Posts: 6
That sounds good! I've got a few more additions as well.

The AND handling you had wasn't quite right since it would fail if the first instance found was incomplete. I re-did it to be less inaccurate, but I had to assume that the pattern is in sequential order. 20 dwords that could be found in any order in an X byte window would take forever to search for!

So now the AND handling will find the first instance in the pattern and then search for the remaining items sequentially in the max_and_distance window. If any of the remaining items fail, it will restart the search from ( first item + AND width ), etc, etc. I changed max_and_distance from a constant to 16x the pattern length to save some time on shorter patterns. It could probably be pushed to 32x if it's too small, or be a constant distance from the last found item instead of a shrinking window.

As long as the initial value in the pattern is fairly unique, this will be great. The only current signature that suffers a major hit is RIPEMD because it starts with 0x00000000, so the search fails through every 0x00000000 in the target. I suspect you can probably just chop the 0 out and it should be fine.

I also included the Torque engine huffman freq table.

BTW, you left out "DOIT(DOUBLE, dbl, 64)" in signcfg.h so the double tables weren't being added.


Attachments:
File comment: new and search + torque huffman table
misc.zip [1.69 KiB]
Downloaded 155 times
Top
 Profile  
 
 Post subject:
PostPosted: 16 Oct 2007 05:37 

Joined: 21 Aug 2007 17:12
Posts: 28
Andrew is one of the best ASM people in Starsiege: TRIBES.

Love Hudbot and Lasthope. :D


Top
 Profile  
 
 Post subject:
PostPosted: 16 Oct 2007 10:23 

Joined: 13 Aug 2007 21:44
Posts: 4068
Location: http://aluigi.org
Still excellent job Andrew.

I have decided to remove the DOUBLEs definitely, since using FLOAT already does the double convertion too.

While about the initial zeroes, for the moment I prefer to take them because having the entire "block" is useful for possible future updates and for helping the users to locate the exact starting of the block during disassembling.

Signsrch 0.1.3 is out!


Top
 Profile  
 
Display posts from previous:  Sort by  
Forum locked This topic is locked, you cannot edit posts or make further replies.  [ 5 posts ] 

All times are UTC [ DST ]


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot post attachments in this forum

Search for:
cron