Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

Comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

(It sounds to me you need a word indexer like Lucene or SOLR -- I don't know much about those. I'll only talk about RDBMS, and then mainly PostgreSQL; I hope that's still interesting.)

Finding the records in which a 10-character string occurs in a given field would qualify as a "simple" query, I would think.

A query may look simple, but still be slow because it has to do a lot of work to retrieve the correct result. The usual performance complication is a query that cannot use an index, or that for another reason has to read the whole of the table data (a.k.a. a sequential scan or seqscan), and/or that cannot keep the whole of the data (or indexdata) in memory.

Searching for an arbitrary substring of a field precludes for most (or all) databases the use of an index. (Word-search is of course already easier, and done by Lucene et similar).

If such a query is a main or important use in your application, you can calculate with disk speed and database size the expected response time, and consider whether it's acceptable.

Or simply try it out. I happen to have a 10M row table handy in Pg (postgres), called azjunk7, which is roughly similar to what you would have. It is somewhat smaller (10M rows) and narrower (1 column, each filled with 80 random characters) than what you mentioned.

Here are some timings for postgres for searches for 'erix', which happens to occur 1008 times. It takes 5 seconds. (Grep takes 1 second, also shown)

testdb=# \dt+ azjunk7 List of relations Schema | Name | Type | Owner | Size | Description --------+---------+-------+----------+---------+------------- public | azjunk7 | table | aardvark | 1116 MB | (1 row) testdb=# select count(*) from azjunk7; count ---------- 10000000 (1 row) testdb=# explain analyze select * from azjunk7 where position('erix' i +n txt) > 0; QUERY PLAN ---------------------------------------------------------------------- +--------------------------------------------- Seq Scan on azjunk7 (cost=0.00..292858.04 rows=3333334 width=81) (ac +tual time=9.515..5075.596 rows=1008 loops=1) Filter: ("position"(txt, 'erix'::text) > 0) Rows Removed by Filter: 9998992 Total runtime: 5075.836 ms (4 rows) Time: 5076.568 ms -- OK. 5 seconds. testdb=# copy (select * from azjunk7) to '/tmp/rawdump'; COPY 10000000 Time: 6538.458 ms testdb=# \q # # bash shell: # $ time grep -c 'erix' /tmp/rawdump 1008 real 0m0.875s user 0m0.683s sys 0m0.161s

(Interestingly (but admittedly not much use for you), there is an improvement in the make for postgresql 9.3 (unreleased) that enables indexing on regular expressions. On my version of that system this same query takes just 15 ms ( milliseconds ) :-)

QUERY PLA +N ---------------------------------------------------------------------- +--------------------------------------------------------------- Bitmap Heap Scan on azjunk7 (cost=43.75..3794.38 rows=1000 width=81) + (actual time=12.137..14.765 rows=1008 loops=1) Recheck Cond: (txt ~ 'erix'::text) Rows Removed by Index Recheck: 83 -> Bitmap Index Scan on azjunk7_trgm_re_idx (cost=0.00..43.50 row +s=1000 width=0) (actual time=12.016..12.016 rows=1091 loops=1) Index Cond: (txt ~ 'erix'::text) Total runtime: 14.918 ms (6 rows)

I for one will be interested to hear what you eventually chose, and how it performs.

(All the above on PostgreSQL 9.3devel, on a simple 4 GB desktop, AMD FX8120, with a single SATA/7200 disk)

In reply to Re^9: Writing a database lookup tool by erix
in thread Writing a database lookup tool by elef

Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":

  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?

    What's my password?
    Create A New User
    and all is quiet...

    How do I use this? | Other CB clients
    Other Users?
    Others surveying the Monastery: (6)
    As of 2018-03-24 23:53 GMT
    Find Nodes?
      Voting Booth?
      When I think of a mole I think of:

      Results (299 votes). Check out past polls.