There is a good reason why search engines don't allow full regex searches--they are simply too slow.

Yep. The way we typically implement a regex query against an inverted index is...

1. Scan over the whole term dictionary looking for terms that match the regex.
2. Iterate over the posting lists (enumeration of doc ids that match) for all matching terms.

If too many terms match, that could end up being slower than a full table scan. Depending on implementation and index size, you could also end up running out of memory (e.g., if the posting lists are all iterated concurrently). Furthermore, that algorithm limits the scope of regex matches to individual terms, so a pattern can't match across term boundaries.
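
Here's a rough sketch of those two steps in Python, just to make the costs visible. Everything in it is made up for illustration: the toy `docs`, the whitespace tokenizer, and the names `build_index` and `regex_query` aren't from any real search library, and a real engine would keep the term dictionary and posting lists on disk in a much more compact form.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each term to a sorted list of doc ids (its posting list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Hypothetical tokenizer: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def regex_query(index, pattern):
    """Return (matching terms, matching doc ids) for a regex applied per term."""
    rx = re.compile(pattern)
    # Step 1: scan the whole term dictionary. Cost grows with vocabulary
    # size, no matter how few terms end up matching.
    terms = [t for t in index if rx.fullmatch(t)]
    # Step 2: walk the posting list of every matching term and union the ids.
    # With many matching terms, this is where time and memory go.
    hits = set()
    for t in terms:
        hits.update(index[t])
    return sorted(terms), sorted(hits)


# Toy corpus, invented for this example.
docs = {
    1: "search engines index terms",
    2: "regex searches are slow",
    3: "full table scan",
}
index = build_index(docs)
terms, hits = regex_query(index, r"search.*")
# A pattern that spans two terms finds nothing, because matching is per term.
_, phrase_hits = regex_query(index, r"table scan")
```

Note that `regex_query` touches every key in `index` before it reads a single posting list, and that `r"table scan"` comes back empty even though doc 3 contains that text, since no single term contains a space.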

Getting good performance out of indexed data is all about planning what queries you need in advance. Regexes are so flexible that they're hard to plan for.

In reply to Re^2: What DB style to use with search engine by creamygoodness
in thread What DB style to use with search engine by halfcountplus
