 
PerlMonks  

speeding up a file-based text search

by perrin (Chancellor)
on May 06, 2003 at 17:41 UTC ( #255961=perlquestion )
perrin has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to speed up a script which finds all lines in a large (20MB) file that contain a certain string. Because partial matches are allowed, I can't use an inverted word list approach for this. I've sped things up by using index() instead of a regex, but it still takes too long.

So far, the only idea I've had is to use read() to pull in 4K chunks, qualify them with index() and only parse them line-by-line if the chunk qualifies.

Does anyone have any other ideas? FYI, this script will be running under mod_perl.
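For reference, a minimal sketch of the baseline being described: a line-by-line scan that qualifies lines with index() rather than a regex. The file name and search string are hypothetical placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Baseline: scan a file line by line, keeping lines that contain
# $needle anywhere (partial matches allowed), using index() instead
# of a regex. $file and $needle are hypothetical.
sub scan_lines {
    my ($file, $needle) = @_;
    open my $fh, '<', $file or die "open $file: $!";
    my @hits;
    while (my $line = <$fh>) {
        push @hits, $line if index($line, $needle) >= 0;
    }
    close $fh;
    return @hits;
}
```

This is the slow path the question is trying to beat; the chunked variants below trade it for fewer, larger reads.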

Re: speeding up a file-based text search
by dragonchild (Archbishop) on May 06, 2003 at 17:45 UTC
    Put the file into a database; MySQL would be perfect for this. If you can't, rewrite the search in C.

    Of course, if you want to avoid all those problems, why not post your solution (appropriately scrubbed) and see if we can't find other reasons it's running slow. You know the usual suspects, but I know I miss random stuff after staring at code too long ...

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means: worry only about what you need to implement.

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

Re: speeding up a file-based text search
by BrowserUk (Pope) on May 06, 2003 at 18:02 UTC

    You might benefit from using the sliding buffer technique (see: Re: split and sysread()).


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      That's about what I was thinking of doing with read. Thanks for the sample code!
Re: speeding up a file-based text search
by dws (Chancellor) on May 06, 2003 at 18:14 UTC
    When dealing with flat files this large, a big part of the game is managing the disk head motion. If the disk seeks elsewhere, getting it back to the right place in your file (so that you can keep reading) is relatively expensive. Given that you're on a box that's sharing the disk with lots of processes, the read head isn't something that you can control, but you can influence it.

    One trick is to read in big chunks (but not so big that you risk having to page, which just causes more disk activity). You're on the right track by thinking 4K chunks, but I'd try something larger, like 8K, 16K, or more. (Use sysread() rather than read(), since the latter will actually use a sequence of OS reads if you try to read a large chunk. You can see this for yourself using ktrace or equivalent.) By using large reads, you reduce the chance that some other process will sneak in and move the disk head.

    Matching in huge files demonstrates a trick for scanning for a pattern through a sliding window in a large file, using sysread() to pull in 8K chunks. You might be able to adapt it to your problem.

      I tried this, but unfortunately I only got about a 10% performance gain. I do need to know which line matched, so I used the chunk scan to qualify chunks and then checked the qualifying chunks line by line. This lazy approach was quick to try, but I would probably get better performance if I wrote code that extracts the line from a chunk based on the position of a match. I'll try that next.
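A minimal sketch of that line-extraction step: given a chunk and the offset of a match inside it, rindex()/index() locate the newlines on either side, so there is no need to rescan the chunk line by line. The sample data is hypothetical.

```perl
use strict;
use warnings;

# Recover the full line enclosing a match at offset $pos in $chunk.
sub line_at {
    my ($chunk, $pos) = @_;
    # Nearest newline at or before the match; start just after it.
    my $start = rindex($chunk, "\n", $pos);
    $start = $start < 0 ? 0 : $start + 1;
    # Nearest newline after the match; end just before it.
    my $end = index($chunk, "\n", $pos);
    $end = length $chunk if $end < 0;
    return substr($chunk, $start, $end - $start);
}
```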
Re: speeding up a file-based text search
by Thelonius (Curate) on May 06, 2003 at 18:19 UTC
    You should see whether you can use some kind of more sophisticated indexing system. I don't remember if Glimpse speeds up within-file searches, but if it does you could use it with "agrep". (Google(TM) it.)

    I haven't worked with the module Search::InvertedIndex, but you could still use it, or a similar approach. You need to keep a list of all the indexed words so that you can do a fast serial scan over it (I don't know if Search::InvertedIndex will allow this) and see which of these your pattern matches. Then you look those up in the InvertedIndex to get the list of actual matches. You should probably do a merge/sort of all the matches before you retrieve them from the actual data file.
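A hand-rolled sketch of that word-list scan idea (this is not Search::InvertedIndex's actual API, only the general approach described above): map each word to the record numbers containing it, scan the small key list for partial matches, and merge the posting lists. All data here is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical inverted index: word => list of record numbers.
my %index = (
    cardiology => [1, 4],
    cardiogram => [2],
    neurology  => [3, 4],
);

# Serially scan the word list for partial matches with index(),
# then merge the matching posting lists into a sorted record set.
sub records_matching {
    my ($index, $needle) = @_;
    my %records;
    for my $word (keys %$index) {
        next if index($word, $needle) < 0;
        $records{$_} = 1 for @{ $index->{$word} };
    }
    return sort { $a <=> $b } keys %records;
}
```

The serial scan stays cheap because the distinct-word list is tiny compared to the 20MB file; only qualifying records are then fetched from the data file.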

Re: speeding up a file-based text search
by Limbic~Region (Chancellor) on May 06, 2003 at 21:09 UTC
    perrin,
    I hear through the grapevine that you are a fellow Mainiac (from Maine, for those who don't know). I found tcgrep helpful when tackling a similar problem. I wasn't interested in which line held the data, only in whether the file contained the information. Regardless, if you tear tcgrep apart, you should be able to find something to help speed things up.

    Cheers - L~R

      You heard correctly, I am from Maine. Thanks for the pointer.
Re: speeding up a file-based text search
by Aristotle (Chancellor) on May 07, 2003 at 01:06 UTC

    How large would an inverted word list be? It might work better to try to match against a join(' ', @wordlist) (computed only once, obviously). Unless your data is highly random, for a document of 20MB that string probably won't exceed a few dozen kbytes (if it's even that large). Upon finding a match, pos, index, rindex, substr would serve to extract the full word the match landed on, which you can then look up in your inverted word list.
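A minimal sketch of that approach, assuming a made-up word list: search the joined string with index() and recover the full word a match landed on via rindex()/index() on the surrounding spaces.

```perl
use strict;
use warnings;

# Hypothetical word list; in practice, the distinct words of the file.
my @words  = qw(cardiology cardiogram neurology);
my $joined = join ' ', @words;    # computed only once

# Find every distinct word in $joined that contains $needle.
sub matching_words {
    my ($joined, $needle) = @_;
    my (%seen, @found);
    my $pos = 0;
    while (($pos = index($joined, $needle, $pos)) >= 0) {
        # Walk back and forward to the spaces bounding the word.
        my $start = rindex($joined, ' ', $pos);
        $start = $start < 0 ? 0 : $start + 1;
        my $end = index($joined, ' ', $pos);
        $end = length $joined if $end < 0;
        my $word = substr($joined, $start, $end - $start);
        push @found, $word unless $seen{$word}++;
        $pos = $end;    # resume after this word
    }
    return @found;
}
```

Each word found this way would then be looked up in the inverted word list to get the matching records.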

    Regex::PreSuf would be of use to increase the pattern efficiency if it's still an issue.

    Makeshifts last the longest.

      I might be able to do something with a word list. This is actually a medical glossary, so the number of distinct terms may be higher than in common documents, but probably not too bad. The situation is complicated by the fact that I need to support phrase searching as well as boolean and/or searching. I could use a word list just to qualify records for further checking. I already have a dbm file for fast random access to the records once I know which ones I want. I'll try that and see how much of a difference it makes.

        Could you show us a few example queries that you wish to support, specifically the format in which the queries are defined?


Re: speeding up a file-based text search
by perrin (Chancellor) on May 08, 2003 at 19:25 UTC
    Just to follow up on this for posterity, I adapted the "sliding window" technique using sysread to read in 16K chunks, and then crawl through them with index() looking for lines that I should check more thoroughly. After some tuning, this provided about a 300% speed improvement. Good enough, and not that much effort required.
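For posterity, a minimal sketch of that final approach, assuming a hypothetical file and search string: sysread() pulls 16K chunks, the trailing partial line is carried over to the next chunk, a single index() call rejects non-matching chunks, and only qualifying chunks are split into lines.

```perl
use strict;
use warnings;

# Sliding-window scan: sysread 16K chunks, qualify each chunk with
# one index() call, split only qualifying chunks into lines.
# $file and $needle are hypothetical.
sub search_file {
    my ($file, $needle) = @_;
    open my $fh, '<', $file or die "open $file: $!";
    my (@hits, $carry);
    $carry = '';
    while (sysread($fh, my $buf, 16 * 1024)) {
        $buf = $carry . $buf;
        # Hold back the trailing partial line for the next chunk.
        my $nl = rindex($buf, "\n");
        if ($nl < 0) { $carry = $buf; next }
        $carry = substr($buf, $nl + 1);
        my $chunk = substr($buf, 0, $nl + 1);
        next if index($chunk, $needle) < 0;    # fast chunk rejection
        push @hits, grep { index($_, $needle) >= 0 } split /\n/, $chunk;
    }
    # A final partial line with no trailing newline may still match.
    push @hits, $carry if length $carry && index($carry, $needle) >= 0;
    close $fh;
    return @hits;    # matching lines, without trailing newlines
}
```

The chunk-level index() test is what buys the speedup: most 16K chunks contain no match at all and are skipped without ever being split.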
