http://www.perlmonks.org?node_id=806342

halfcountplus has asked for the wisdom of the Perl Monks concerning the following question:

I am working on a site search for someone that allows Perl regexp searches. The site is not that big (10 GB), but it is still too much to actually parse all the pages for every search, so I have to extract all the content and put it in some kind of database.

My first thought was to use SQL, which I presume will work fine. I am positive that the "expense" of the operation is not the volume of data searched -- it's having to walk the directory tree, opening and closing hundreds of files. One giant SQL table will probably be much faster.

HOWEVER, there would not be much further purpose to using an SQL db, since I want to use the Perl regexp engine on the data, not SQL-style queries (i.e., I would just be going straight through the table pulling all the data anyway).

So, I could just put ALL the text into one flat file, with tag lines to indicate paths. This should be just as quick as -- or quicker than? -- using SQL (I think). Another option would be to use Storable or YAML.

Surely some of the monks around here have written search engines like this. Any opinions? My intuition tells me that the flat file will actually be fastest, but I really don't know.

Re: What DB style to use with search engine
by BrowserUk (Patriarch) on Nov 10, 2009 at 22:45 UTC

    There is a good reason why search engines don't allow full regex searches--they are simply too slow. You'd be better off allowing only keyword searches.

    But, if you decide to try the big file route, a couple of things that will help are:

    • When concatenating your files, remove all the newlines so that each file becomes a single line prefixed by the path information (see the sketch after this list).

      This makes searching for phrases that span lines much simpler and much, much faster.

      If you need to obtain position information, re-search the actual files once you've located them from the master file.

    • If you have multiple cores available, split the master into as many parts as you have cores and search them in parallel.

      If you can distribute these parts across different spindles, you'll gain the maximum advantage of the parallelism. Otherwise, you may not see much benefit due to head thrash.

      Nonetheless, by making each file a single long line in the master, the ratio of CPU to IO will be greatly improved.

    • If your OS/filesystem allows you to override the default IO buffer size, increasing it to encompass the average (or maximum) line length may yield benefits.
    • Make your master file(s) contiguous if your filesystem allows that.
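
    A rough sketch of the "big file" build plus a forked parallel search, assuming a hypothetical document root and chunk files split out beforehand with split(1); the paths, the HTML filter and the test pattern are placeholders, not the OP's layout:

        #!/usr/bin/perl
        # Sketch only: build one "master" file with one line per page
        # (path <tab> whitespace-flattened content), then grep pre-split
        # chunks in parallel, one child process per chunk.
        use strict;
        use warnings;
        use File::Find;

        my $DOC_ROOT = '/var/www/site';    # hypothetical document root
        my $MASTER   = 'master.txt';

        # Build: each file becomes a single long line in the master.
        open my $out, '>', $MASTER or die "open $MASTER: $!";
        find( sub {
            return unless -f && /\.html?$/;
            open my $in, '<', $_ or return;
            my $text = do { local $/; <$in> };
            return unless defined $text;
            $text =~ s/\s+/ /g;            # newlines and runs of whitespace -> one space
            print {$out} "$File::Find::name\t$text\n";
        }, $DOC_ROOT );
        close $out;

        # Parallel search: fork one worker per chunk, e.g. chunks made with
        #   split -n l/4 master.txt master.part
        my $pattern = qr/some phrase/i;    # the user's regex
        my @chunks  = glob 'master.part*';
        for my $chunk (@chunks) {
            defined( my $pid = fork() ) or die "fork: $!";
            next if $pid;                  # parent: keep launching workers
            open my $fh, '<', $chunk or exit 1;
            while (<$fh>) {
                my ($path, $text) = split /\t/, $_, 2;
                next unless defined $text;
                print "$path\n" if $text =~ $pattern;
            }
            exit 0;
        }
        wait() for @chunks;                # reap the workers

    Splitting on line boundaries matters: with one page per line, a chunk boundary can never fall in the middle of a document.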

    If you can limit a first pass to keywords only, then building an inverted index would be the fastest way of trimming the dataset. You could then allow a full regex search to be run on the subset of documents.
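
    A bare-bones illustration of that two-stage approach, assuming the one-line-per-page master format sketched above; the word splitting and the in-memory hash are simplifications rather than a real index format:

        #!/usr/bin/perl
        # Sketch: build a word => {path => 1} inverted index from the master,
        # shortlist documents by keyword, then run the full Perl regex only
        # over the shortlisted files. File and pattern names are placeholders.
        use strict;
        use warnings;

        my %index;
        open my $master, '<', 'master.txt' or die "open master.txt: $!";
        while (<$master>) {
            my ($path, $text) = split /\t/, $_, 2;
            next unless defined $text;
            $index{ lc $_ }{$path} = 1 for $text =~ /(\w{3,})/g;   # words of 3+ chars
        }
        close $master;

        # First pass: the keyword lookup trims the dataset to a few candidates.
        my $keyword    = 'wisdom';
        my @candidates = keys %{ $index{ lc $keyword } || {} };

        # Second pass: the full regex runs only against the candidate files.
        my $regex = qr/\bwisdoms?\b/i;
        for my $path (@candidates) {
            open my $fh, '<', $path or next;
            my $content = do { local $/; <$fh> };
            print "$path\n" if defined $content && $content =~ $regex;
        }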


      When concatenating your files, remove all the newlines so that each file becomes a single line prefixed by the path information. This makes searching for phrases that span lines much simpler and much, much faster.

      Excellent point, thanks for that. As for optimizing on the server, I do not have root access, so that may be a hassle, but I will keep it in mind once everything is finalized.

      I am sure the regexp approach is feasible on this scale -- as it is now, most of the work is using regexps to parse the tags out, which must be done regardless, and it still performs usably fast. No matter how I store the data, it should be way, way, way quicker with the tags preprocessed out.
      There is a good reason why search engines don't allow full regex searches--they are simply too slow.

      Yep. The way we typically implement a regex query against an inverted index is...

      1. Scan over the whole term dictionary looking for terms that match the regex.
      2. Iterate over the posting lists (enumeration of doc ids that match) for all matching terms.

      If too many terms match, that could end up being slower than a full table scan. Depending on the implementation and index size, you could also end up running out of memory (e.g. if the posting lists are all iterated concurrently). Furthermore, that algo limits the scope of regex matches to individual terms.

      Getting good performance out of indexed data is all about planning what queries you need in advance. Regexes are so flexible that they're hard to plan for.
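
      In Perl terms, those two steps look roughly like this (a toy, hard-coded index; a real engine keeps the term dictionary and posting lists on disk):

          #!/usr/bin/perl
          # Toy illustration of the two steps: scan every term in the
          # dictionary against the regex, then merge the posting lists
          # (doc-id arrays) of whatever matched.
          use strict;
          use warnings;

          my %index = (             # term => posting list of doc ids
              map     => [ 1, 4, 7 ],
              mapping => [ 2, 4 ],
              grep    => [ 3, 7 ],
          );

          my $user_regex = qr/^map/;

          # Step 1: scan the whole term dictionary for matching terms.
          my @terms = grep { /$user_regex/ } keys %index;

          # Step 2: iterate the posting lists of every matching term.
          my %seen;
          my @doc_ids = grep { !$seen{$_}++ } map { @{ $index{$_} } } @terms;

          print "matching docs: @doc_ids\n";

      Step 1 always touches every term in the dictionary; step 2 is where the "too many terms match" blow-up described above happens.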

        that algo limits the scope of regex matches to individual terms.

        Allowing for post-wildcarded keywords (stem*) is relatively easy, especially if you limit the prefix to a minimum of (say) 3 chars or more.

        Where I had to provide for pre & post wildcards, I built a trigram index (and insisted upon a minimum of 3 contiguous chars). That worked well, but required a substantially sized index.
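
        A cut-down sketch of that approach: index every 3-character window of every document, answer a pre/post-wildcard query by intersecting the posting lists of the query's trigrams, then verify candidates with the real pattern. The documents here are hard-coded placeholders.

            #!/usr/bin/perl
            # Sketch: trigram index. A query like *tern* has trigrams
            # "ter" and "ern"; a document must contain both to be a candidate.
            use strict;
            use warnings;

            my %docs = (
                1 => 'internal pattern matching',
                2 => 'perl monks wisdom',
                3 => 'eastern standard time',
            );

            # Build: every 3-char window of every document goes into the index.
            my %tri;    # trigram => { doc_id => 1 }
            while ( my ($id, $text) = each %docs ) {
                my $t = lc $text;
                $tri{ substr $t, $_, 3 }{$id} = 1 for 0 .. length($t) - 3;
            }

            # Query *tern*: the literal core must be at least 3 contiguous chars.
            my $literal = 'tern';
            my @grams   = map { substr $literal, $_, 3 } 0 .. length($literal) - 3;

            # Intersect posting lists: keep docs that contain every trigram.
            my %count;
            $count{$_}++ for map { keys %{ $tri{$_} || {} } } @grams;
            my @candidates = grep { $count{$_} == @grams } keys %count;

            # Final verification against the actual pattern.
            my @hits = grep { $docs{$_} =~ /\Q$literal\E/i } @candidates;
            print "docs matching *$literal*: @hits\n";

        The "substantial index" cost is visible even in the toy: every character position of every document contributes an entry.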

        The one search engine I am familiar with that does allow full regex including term spanning is Google's codesearch. I'd love to see insights into how they do that. It's quite interesting to see the difference in the time it takes to do

        • for.*map lang:perl -- 0.6 seconds.
        • fo.*ap lang:perl -- 4.6 seconds.
        • lang:perl \$[abcd].+[pqrs] -- also 4.6 seconds.

        Really quite remarkable even given the massive parallelism. They must do some pretty clever parsing of the regexes in order to avoid searching the entire dictionary. I guess the code corpus is considerably smaller than the main set which will help a lot.

        That said: it makes you wonder why the OP doesn't just use Google's search engine.


Re: What DB style to use with search engine
by erix (Prior) on Nov 10, 2009 at 23:25 UTC

    PostgreSQL does natively support POSIX-style regular expressions, which can do advanced stuff like captures, positive and negative lookahead (alas, no lookbehind), etc.

    See also the Pg docs on pattern matching (be sure to read past the simple LIKE expressions!).

    PostgreSQL does support regular expression indexing, which is powerful, but it has some limitations too: see the 'text_pattern_ops' operator class and, more generally, chapter 11 of the Pg manual.

    Pg also has full text indexing, which can index on words and their derivations (but no regexen).
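
    For completeness, a hedged sketch of driving those POSIX regexes from Perl via DBI; the database name and the pages(path, body) table are made up for illustration:

        #!/usr/bin/perl
        # Sketch: let PostgreSQL do the matching server-side with ~* (the
        # case-insensitive POSIX regex operator). DSN and table are hypothetical.
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:Pg:dbname=site', 'username', 'password',
                                { RaiseError => 1, AutoCommit => 1 } );

        my $sth = $dbh->prepare('SELECT path FROM pages WHERE body ~* ?');
        $sth->execute('colou?r\y');        # \y is Pg's word-boundary escape
        while ( my ($path) = $sth->fetchrow_array ) {
            print "$path\n";
        }
        $dbh->disconnect;

    Note that the flavour is Pg's own ARE syntax, not Perl's, so a few escapes differ (\y for a word boundary where Perl uses \b, for instance).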

Re: What DB style to use with search engine
by keszler (Priest) on Nov 10, 2009 at 22:58 UTC
    Another option would be to add lib_mysqludf_preg to a MySQL server. That library provides:
    "...access to the PCRE (perl compatible-regular-expressions) library for pattern matching. The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5."
Re: What DB style to use with search engine
by mpeg4codec (Pilgrim) on Nov 10, 2009 at 21:58 UTC
    I don't have any experience in this field either, but I recommend trying a few approaches to see how they fare. Measurement is key here.

    Start with 'one huge file to rule them all'. First benchmark how long it takes to while (<>) { } the whole file, then see how much running a regex on each line slows that down.
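
    Something along these lines (Time::HiRes timings; master.txt and the test pattern are placeholders). Because the second pass benefits from the OS file cache, run the plain pass twice, or swap the order, before trusting the numbers:

        #!/usr/bin/perl
        # Sketch: time a bare read-through of the big file, then the same
        # read-through with a regex applied to every line.
        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);

        my $file  = 'master.txt';          # the big concatenated file
        my $regex = qr/some\s+phrase/i;    # placeholder pattern

        sub time_pass {
            my ($body) = @_;
            open my $fh, '<', $file or die "open $file: $!";
            my $t0 = [gettimeofday];
            $body->($fh);
            my $elapsed = tv_interval($t0);
            close $fh;
            return $elapsed;
        }

        my $read_only = time_pass( sub { my $fh = shift; while (<$fh>) { } } );
        my $with_re   = time_pass( sub {
            my $fh   = shift;
            my $hits = 0;
            while (<$fh>) { $hits++ if /$regex/ }
        } );

        printf "plain read: %.2fs   read + regex: %.2fs\n", $read_only, $with_re;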

    Most SQL databases are pretty good at efficiently storing gobs of data, even if you're only accessing it sequentially. Try something similar to the above approach but just use a SQL table to back it.
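
    As a rough SQL-backed equivalent (SQLite via DBD::SQLite, chosen only as an example; the pages table is again hypothetical), pulling every row sequentially and letting Perl's own regex engine do the matching:

        #!/usr/bin/perl
        # Sketch: sequential scan of a pages(path, body) table, applying the
        # Perl regex client-side -- i.e. "go straight through the table".
        use strict;
        use warnings;
        use DBI;

        my $dbh = DBI->connect( 'dbi:SQLite:dbname=site.db', '', '',
                                { RaiseError => 1 } );
        my $regex = qr/some\s+phrase/i;

        my $sth = $dbh->prepare('SELECT path, body FROM pages');
        $sth->execute;
        while ( my ($path, $body) = $sth->fetchrow_array ) {
            print "$path\n" if $body =~ $regex;
        }
        $dbh->disconnect;

    Time it against the flat-file baseline from the first approach and compare.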

    Finally, are you sure it's the open/close overhead that would kill the naive approach? I'm with you on this, but the point is that neither of us can tell without measuring. You should have a pretty good baseline of how long a while (<>) { } takes on the raw data from the first approach, so compare to that.

      Finally, are you sure it's the open/close overhead that would kill the naive approach? I'm with you on this, but the point is that neither of us can tell without measuring.

      More or less. The reason I said "positive" is that a sizable portion of the site is a photo archive under one directory, which contains a whole slew of pages that present thumbnails -- they have little or no text in them. Of course, there is a little more involved than just "opening and closing" them; they must also be parsed to eliminate the HTML tags, which is an inescapable part of the deal. But if I exclude that directory -- which contains a negligible portion of the data -- the search is very, very noticeably faster.

      I also know from other directory-tree work that even WITHOUT this parsing, a few hundred or thousand files spread across 10+ gigs are a LOT just to stat. Try "du / >tmp.txt" on your hard drive: it will take several minutes at least, and that is a C program just collecting file sizes. A site-search engine is not much good if it takes more than 5-10 seconds, methinks.
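
      A quick way to measure that traversal-plus-stat cost by itself from Perl (a rough analogue of the du run; the root path is a placeholder):

          #!/usr/bin/perl
          # Sketch: walk the tree and stat every file, doing nothing else,
          # to see what the directory traversal alone costs.
          use strict;
          use warnings;
          use File::Find;
          use Time::HiRes qw(gettimeofday tv_interval);

          my $root = '/path/to/site';    # placeholder
          my ($files, $bytes) = (0, 0);

          my $t0 = [gettimeofday];
          find( sub { return unless -f; $files++; $bytes += -s _ }, $root );
          printf "%d files, %.1f MB, walked in %.2fs\n",
              $files, $bytes / 1e6, tv_interval($t0);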
Re: What DB style to use with search engine
by jfroebe (Parson) on Nov 10, 2009 at 23:28 UTC

    Why reinvent the wheel if you don't have to? mnoGoSearch is free on Linux/*nix machines (GPL license). At the very least, take a look at how they do it for pointers. :)
