Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Re^6: Writing a database lookup tool

by elef (Friar)
on Jan 04, 2013 at 16:01 UTC ( #1011664=note: print w/ replies, xml ) Need Help??


in reply to Re^5: Writing a database lookup tool
in thread Writing a database lookup tool

Again, a reasonable and helpful person might answer:
"- I used database module X in this and that manner for a somewhat similar project with a database about the same size and I saw lookup times between 0.1 sec and 0.3 sec on XYZ hardware out of the box. From what you describe, your lookups should be in the same ballpark."
Or: "- If 20MB including the DB engine and the TK gui is reasonable for you, then yes. PAR::Packer should have no problem packing the DB module XXX, I've done it before. It can't pack DB module YYY, though, so don't use that if you need to pack the whole thing into an exe with PAR::Packer."

BTW "run-of-the-mill laptop" is a meaningful performance spec. It can be reasonably assumed to mean something along the lines of a mid-tier corei3 or i5 with the integrated video card, 4GB of DDR3 RAM and a 5400 RPM 2.5" platter drive.


Comment on Re^6: Writing a database lookup tool
Re^7: Writing a database lookup tool
by marto (Chancellor) on Jan 04, 2013 at 16:09 UTC

    "BTW "run-of-the-mill laptop" is a meaningful performance spec. It can be reasonably assumed to mean something along the lines of a mid-tier corei3 or i5 with the integrated video card, 4GB of DDR3 RAM and a 5400 RPM 2.5" platter drive."

    Under what circumstance is it safe to assume that the term "run-of-the-mill" in this context equates to a multi core 64bit CPU? You seem determined to validate the use of assumptions based on arbitrary phrases.

    "Again, a reasonable and helpful person might answer: "- I used database module X in this and that manner for a somewhat similar project with a database about the same size and I saw lookup times between 0.1 sec and 0.3 sec on XYZ hardware out of the box. From what you describe, your lookups should be in the same ballpark." Or: "- If 20MB including the DB engine and the TK gui is reasonable for you, then yes. PAR::Packer should have no problem packing the DB module XXX, I've done it before. It can't pack DB module YYY, though, so don't use that if you need to pack the whole thing into an exe with PAR::Packer."

    You keep providing pointless responses like this, yet not actually addressing any of the pertinant points raised in response to your posts. You have experience of pp, in the time it's taken you to post this you could have discovered the generated exe size for a basic script reading in Tk and several different database drivers (for comparison since you have yet to choose a database platform). In short, you could answer this yourself in a matter of minutes if you were actually concerned with the answer, rather than semantical arguments.

Re^7: Writing a database lookup tool
by choroba (Abbot) on Jan 04, 2013 at 16:16 UTC
    We @ work use PostgreSQL and Oracle via DBI for a bit different data: lots of rather short strings with many relations among them (around 1 GB in XML). Simple queries take less than 1s, complicated ones might take more than 10 minutes. Go figure.
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      Well, in the post you're replying to, I did give a fairly detailed description of a sample query I had in mind:
      "let's assume a there are 15 million records with a 100 characters in each (in the field that we're searching). I look up a 10-character string. There are 1000 hits. How much time would it take for those 1000 hits to be found if the database design and implementation is not particulary well optimized? 0.01 second? 1 second? 5 seconds?"

      Finding the records in which a 10-character string occurs in a given field would qualify as a "simple" query, I would think.
        The change in data size might influence the performance significantly. We once tried to enlarge the data to GBs and the application became unusable.
        لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        (It sounds to me you need a word indexer like Lucene or SOLR -- I don't know much about those. I'll only talk about RDBMS, and then mainly PostgreSQL; I hope that's still interesting.)

        Finding the records in which a 10-character string occurs in a given field would qualify as a "simple" query, I would think.

        A query may look simple, but still be slow because it has to do a lot of work to retrieve the correct result. The usual performance complication is a query that cannot use an index, or that for another reason has to read the whole of the table data (a.k.a. a sequential scan or seqscan), and/or that cannot keep the whole of the data (or indexdata) in memory.

        Searching for an arbitrary substring of a field precludes for most (or all) databases the use of an index. (Word-search is of course already easier, and done by Lucene et similar).

        If such a query is a main or important use in your application, you can calculate with disk speed and database size the expected response time, and consider whether it's acceptable.

        Or simply try it out. I happen to have a 10M row table handy in Pg (postgres), called azjunk7, which is roughly similar to what you would have. It is somewhat smaller (10M rows) and narrower (1 column, each filled with 80 random characters) than what you mentioned.

        Here are some timings for postgres for searches for 'erix', which happens to occur 1008 times. It takes 5 seconds. (Grep takes 1 second, also shown)

        testdb=# \dt+ azjunk7 List of relations Schema | Name | Type | Owner | Size | Description --------+---------+-------+----------+---------+------------- public | azjunk7 | table | aardvark | 1116 MB | (1 row) testdb=# select count(*) from azjunk7; count ---------- 10000000 (1 row) testdb=# explain analyze select * from azjunk7 where position('erix' i +n txt) > 0; QUERY PLAN ---------------------------------------------------------------------- +--------------------------------------------- Seq Scan on azjunk7 (cost=0.00..292858.04 rows=3333334 width=81) (ac +tual time=9.515..5075.596 rows=1008 loops=1) Filter: ("position"(txt, 'erix'::text) > 0) Rows Removed by Filter: 9998992 Total runtime: 5075.836 ms (4 rows) Time: 5076.568 ms -- OK. 5 seconds. testdb=# copy (select * from azjunk7) to '/tmp/rawdump'; COPY 10000000 Time: 6538.458 ms testdb=# \q # # bash shell: # $ time grep -c 'erix' /tmp/rawdump 1008 real 0m0.875s user 0m0.683s sys 0m0.161s

        (Interestingly (but admittedly not much use for you), there is an improvement in the make for postgresql 9.3 (unreleased) that enables indexing on regular expressions. On my version of that system this same query takes just 15 ms ( milliseconds ) :-)

        QUERY PLA +N ---------------------------------------------------------------------- +--------------------------------------------------------------- Bitmap Heap Scan on azjunk7 (cost=43.75..3794.38 rows=1000 width=81) + (actual time=12.137..14.765 rows=1008 loops=1) Recheck Cond: (txt ~ 'erix'::text) Rows Removed by Index Recheck: 83 -> Bitmap Index Scan on azjunk7_trgm_re_idx (cost=0.00..43.50 row +s=1000 width=0) (actual time=12.016..12.016 rows=1091 loops=1) Index Cond: (txt ~ 'erix'::text) Total runtime: 14.918 ms (6 rows)
        )

        I for one will be interested to hear what you eventually chose, and how it performs.

        (All the above on PostgreSQL 9.3devel, on a simple 4 GB desktop, AMD FX8120, with a single SATA/7200 disk)

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1011664]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (7)
As of 2014-07-13 07:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    When choosing user names for websites, I prefer to use:








    Results (248 votes), past polls