|Pathologically Eclectic Rubbish Lister|
Re^9: Writing a database lookup toolby erix (Vicar)
|on Jan 05, 2013 at 08:31 UTC||Need Help??|
(It sounds to me you need a word indexer like Lucene or SOLR -- I don't know much about those. I'll only talk about RDBMS, and then mainly PostgreSQL; I hope that's still interesting.)
Finding the records in which a 10-character string occurs in a given field would qualify as a "simple" query, I would think.
A query may look simple, but still be slow because it has to do a lot of work to retrieve the correct result. The usual performance complication is a query that cannot use an index, or that for another reason has to read the whole of the table data (a.k.a. a sequential scan or seqscan), and/or that cannot keep the whole of the data (or indexdata) in memory.
Searching for an arbitrary substring of a field precludes for most (or all) databases the use of an index. (Word-search is of course already easier, and done by Lucene et similar).
If such a query is a main or important use in your application, you can calculate with disk speed and database size the expected response time, and consider whether it's acceptable.
Or simply try it out. I happen to have a 10M row table handy in Pg (postgres), called azjunk7, which is roughly similar to what you would have. It is somewhat smaller (10M rows) and narrower (1 column, each filled with 80 random characters) than what you mentioned.
Here are some timings for postgres for searches for 'erix', which happens to occur 1008 times. It takes 5 seconds. (Grep takes 1 second, also shown)
(Interestingly (but admittedly not much use for you), there is an improvement in the make for postgresql 9.3 (unreleased) that enables indexing on regular expressions. On my version of that system this same query takes just 15 ms ( milliseconds ) :-)
I for one will be interested to hear what you eventually chose, and how it performs.
(All the above on PostgreSQL 9.3devel, on a simple 4 GB desktop, AMD FX8120, with a single SATA/7200 disk)