http://www.perlmonks.org?node_id=1011765


in reply to Re^8: Writing a database lookup tool
in thread Writing a database lookup tool

(It sounds to me like you need a word indexer such as Lucene or Solr -- I don't know much about those. I'll only talk about RDBMSes, and mainly about PostgreSQL; I hope that's still interesting.)

Finding the records in which a 10-character string occurs in a given field would qualify as a "simple" query, I would think.

A query may look simple but still be slow, because it has to do a lot of work to retrieve the correct result. The usual performance complication is a query that cannot use an index, or that for some other reason has to read the entire table (a sequential scan, or seqscan), and/or that cannot keep all of the data (or index data) in memory.

Searching for an arbitrary substring of a field precludes the use of an index in most (or all) databases. (Word search is of course an easier problem, and is what Lucene and similar tools handle.)
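
For example, in PostgreSQL a plain btree index helps equality and prefix searches, but an arbitrary-substring predicate forces a seqscan (the index name and the literal values here are made up for illustration):

CREATE INDEX azjunk7_txt_btree ON azjunk7 (txt);

EXPLAIN SELECT * FROM azjunk7 WHERE txt = 'some exact value';      -- can use the index
EXPLAIN SELECT * FROM azjunk7 WHERE txt LIKE 'erix%';              -- prefix: indexable (C locale or text_pattern_ops)
EXPLAIN SELECT * FROM azjunk7 WHERE position('erix' in txt) > 0;   -- substring anywhere: seqscan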

If such a query is a main or frequent operation in your application, you can estimate the expected response time from disk speed and database size, and decide whether that's acceptable.
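
As a back-of-the-envelope example using the numbers below: the test table is 1116 MB, and a single 7200 rpm SATA disk typically reads around 100-150 MB/s sequentially, so a cold sequential scan would take roughly 1116 MB / (100-150 MB/s) ≈ 7-11 seconds. The ~5 seconds measured below suggests much of the data was already in the OS cache.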

Or simply try it out. I happen to have a 10M-row table handy in Pg (PostgreSQL), called azjunk7, which is roughly comparable to what you would have: somewhat smaller (10M rows) and narrower (a single column, each row filled with 80 random characters) than what you mentioned.
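
For what it's worth, a comparable table can be generated along these lines (a sketch; I'm guessing at how azjunk7 was actually built, and this takes a while to run):

CREATE TABLE azjunk7 (txt text);

-- 10M rows of 80 random lowercase letters each:
INSERT INTO azjunk7 (txt)
SELECT string_agg(chr(97 + (random() * 25)::int), '')
FROM generate_series(1, 10000000) AS row_no,
     generate_series(1, 80)       AS char_no
GROUP BY row_no;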

Here are some PostgreSQL timings for a search for 'erix', which happens to occur 1008 times. The query takes about 5 seconds. (grep on a raw dump of the same data takes about 1 second; that run is shown too.)

testdb=# \dt+ azjunk7
                      List of relations
 Schema |  Name   | Type  |  Owner   |  Size   | Description
--------+---------+-------+----------+---------+-------------
 public | azjunk7 | table | aardvark | 1116 MB |
(1 row)

testdb=# select count(*) from azjunk7;
  count
----------
 10000000
(1 row)

testdb=# explain analyze select * from azjunk7 where position('erix' in txt) > 0;
                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------
 Seq Scan on azjunk7  (cost=0.00..292858.04 rows=3333334 width=81) (actual time=9.515..5075.596 rows=1008 loops=1)
   Filter: ("position"(txt, 'erix'::text) > 0)
   Rows Removed by Filter: 9998992
 Total runtime: 5075.836 ms
(4 rows)

Time: 5076.568 ms    -- OK. 5 seconds.

testdb=# copy (select * from azjunk7) to '/tmp/rawdump';
COPY 10000000
Time: 6538.458 ms

testdb=# \q

#
# bash shell:
#
$ time grep -c 'erix' /tmp/rawdump
1008

real    0m0.875s
user    0m0.683s
sys     0m0.161s

(Interestingly, though admittedly not of much use to you yet, there is an improvement in the works for PostgreSQL 9.3 (unreleased) that enables index support for regular-expression searches. On my build of that version, the same query takes just 15 ms (milliseconds) :-)

                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on azjunk7  (cost=43.75..3794.38 rows=1000 width=81) (actual time=12.137..14.765 rows=1008 loops=1)
   Recheck Cond: (txt ~ 'erix'::text)
   Rows Removed by Index Recheck: 83
   ->  Bitmap Index Scan on azjunk7_trgm_re_idx  (cost=0.00..43.50 rows=1000 width=0) (actual time=12.016..12.016 rows=1091 loops=1)
         Index Cond: (txt ~ 'erix'::text)
 Total runtime: 14.918 ms
(6 rows)
)
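
The plan above only shows the index name; presumably it is a pg_trgm trigram index created along these lines (the gin_trgm_ops opclass is my assumption):

CREATE EXTENSION pg_trgm;
CREATE INDEX azjunk7_trgm_re_idx ON azjunk7 USING gin (txt gin_trgm_ops);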

I for one will be interested to hear what you eventually choose, and how it performs.

(All the above on PostgreSQL 9.3devel, on a simple 4 GB desktop, AMD FX8120, with a single SATA/7200 disk)

Re^10: Writing a database lookup tool
by elef (Friar) on Jan 18, 2013 at 11:45 UTC
    Thanks, that is interesting.
    Your query times give me pause. Ideally, I'd be shooting for about 1 sec on most searches. If 5+ second query times are the norm, that kind of kills it, because this thing would be used for quite frequent lookups. Obviously, I don't want to spend weeks building this thing only to find out that it's too slow to be practical once I load it up with all my data, so I'll have to test any tool I consider with all the data before I go ahead.
    Now, many searches could be done as full-word searches. Would those be radically faster than general substring searches if I used SQL via DBI or something similar, or would I need something optimised for text search (e.g. Lucene/Solr) to get that performance benefit? I posted a kind-of-related question on FTS in reply to NetWallah's post further down.
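
    (For reference, a minimal sketch of indexed full-word search using PostgreSQL's built-in full-text search; the column and index names are carried over from the examples above, not from real data:)

    CREATE INDEX azjunk7_fts_idx
        ON azjunk7 USING gin (to_tsvector('simple', txt));

    -- Matches whole words only, and can use the GIN index:
    SELECT count(*) FROM azjunk7
    WHERE to_tsvector('simple', txt) @@ to_tsquery('simple', 'erix');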

    At this point the whole thing is just an idea I'm considering. I guess I'll play with some of the tools suggested here, and maybe look into completely different approaches like trying to use LibreOffice Base.