PerlMonks  

Re: What DB style to use with search engine

by mpeg4codec (Pilgrim)
on Nov 10, 2009 at 21:58 UTC (#806346)


in reply to What DB style to use with search engine

I don't have any experience in this field either, but I recommend trying a few approaches to see how they fare. Measurement is key here.

Start with 'one huge file to rule them all'. First benchmark how long it takes to run while (<>) { } over the whole file, then see how much running a regex on each line slows that down.
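Something along these lines would give you both numbers. It is only a rough sketch: the sub names and the default search term are made up for illustration, and it uses Time::HiRes for sub-second timing.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Time a plain line-by-line read of $path.
# Returns (elapsed seconds, number of lines).
sub time_plain_read {
    my ($path) = @_;
    my $start = time();
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $lines = 0;
    while (my $line = <$fh>) {
        $lines++;
    }
    close $fh;
    return (time() - $start, $lines);
}

# Same loop, but test every line against a regex.
# Returns (elapsed seconds, number of matching lines).
sub time_regex_read {
    my ($path, $re) = @_;
    my $start = time();
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $hits = 0;
    while (my $line = <$fh>) {
        $hits++ if $line =~ $re;
    }
    close $fh;
    return (time() - $start, $hits);
}

# Usage: perl bench.pl bigfile.txt [term]
if (@ARGV) {
    my ($path, $term) = @ARGV;
    $term = 'search' unless defined $term;   # hypothetical default term
    my $re = qr/\Q$term\E/i;
    my ($t_plain, $lines) = time_plain_read($path);
    my ($t_regex, $hits)  = time_regex_read($path, $re);
    printf "plain read: %.3fs (%d lines)\n", $t_plain, $lines;
    printf "with regex: %.3fs (%d matching lines)\n", $t_regex, $hits;
}
```

Run it a few times before trusting the numbers: the second pass over the file will probably be served from the OS file cache, so a single run tends to flatter the regex version.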

Most SQL databases are pretty good at efficiently storing gobs of data, even if you're only accessing it sequentially. Try something similar to the above approach but just use a SQL table to back it.
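A table-backed version of the same sequential scan might look roughly like this. It is a sketch under a few assumptions: DBI with DBD::SQLite installed (any DBI driver should work similarly), and a hypothetical pages table with path and body columns that I invented for the example.

```perl
use strict;
use warnings;
use DBI;

# Open (or create) the database and make sure the table exists.
# The table and column names are hypothetical.
sub open_db {
    my ($file) = @_;
    my $dbh = DBI->connect("dbi:SQLite:dbname=$file", '', '',
        { RaiseError => 1, AutoCommit => 1 });
    $dbh->do('CREATE TABLE IF NOT EXISTS pages (path TEXT PRIMARY KEY, body TEXT)');
    return $dbh;
}

# Store the (already tag-stripped) text of one page.
sub store_page {
    my ($dbh, $path, $body) = @_;
    $dbh->do('INSERT OR REPLACE INTO pages (path, body) VALUES (?, ?)',
        undef, $path, $body);
    return;
}

# Sequential scan: fetch every row and apply the regex in Perl,
# mirroring the while (<>) { } loop over the flat file.
sub search_pages {
    my ($dbh, $re) = @_;
    my $sth = $dbh->prepare('SELECT path, body FROM pages ORDER BY path');
    $sth->execute;
    my @hits;
    while (my ($path, $body) = $sth->fetchrow_array) {
        push @hits, $path if $body =~ $re;
    }
    return @hits;
}
```

Time search_pages the same way as the flat-file loop so the two numbers are directly comparable.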

Finally, are you sure it's the open/close overhead that would kill the naive approach? I'm with you on this, but the point is that neither of us can tell without measuring. You should have a pretty good baseline of how long a while (<>) { } takes on the raw data from the first approach, so compare to that.


Re^2: What DB style to use with search engine
by halfcountplus (Hermit) on Nov 10, 2009 at 22:53 UTC
    Finally, are you sure it's the open/close overhead that would kill the naive approach? I'm with you on this, but the point is that neither of us can tell without measuring.

    More or less. The reason I said "positive" is because a sizable portion of the site is a photo archive under one directory, which contains a whole slew of pages that present thumbnails -- they have little or no text in them. Of course, there is a little more involved than just "opening and closing" them; they must also be parsed to eliminate the HTML tags, which is an inescapable part of the deal. But if I exclude that directory -- which contains a negligible portion of the data -- the search is very noticeably faster.

    I also know from other directory tree stuff that even WITHOUT this parsing, a few hundred or thousand files spread across 10+ gigs is a LOT just to stat the files. Try "du / >tmp.txt" on your hard drive. It will take several minutes at least, and that is a C program just collecting file sizes. A site-search engine is not much good if it takes more than 5-10 seconds, methinks.
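To put a number on the stat cost alone, with no opening or parsing, a rough sketch using the core File::Find module could look like this. The sub name is made up, and the timings will depend heavily on whether the directory metadata is already cached.

```perl
use strict;
use warnings;
use File::Find;
use Time::HiRes qw(time);

# Walk $root, stat every plain file, and add up the sizes.
# Returns (elapsed seconds, file count, total bytes).
sub time_tree_walk {
    my ($root) = @_;
    my ($files, $bytes) = (0, 0);
    my $start = time();
    find(sub {
        return unless -f $_;        # this is where the stat happens
        $files++;
        $bytes += (-s _) || 0;      # reuse the stat buffer from -f
    }, $root);
    return (time() - $start, $files, $bytes);
}

# Usage: perl walk.pl /path/to/site
if (@ARGV) {
    my ($secs, $files, $bytes) = time_tree_walk($ARGV[0]);
    printf "%d files, %d bytes, %.3fs\n", $files, $bytes, $secs;
}
```

Comparing this figure against the full search time shows how much of the delay is the tree walk itself rather than the parsing.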
