PerlMonks  

Re^2: Writing a database lookup tool

by elef (Friar)
on Jan 04, 2013 at 14:30 UTC


in reply to Re: Writing a database lookup tool
in thread Writing a database lookup tool

The data would be offline, i.e. on the user's computer.
I realize that speed depends on the specs and the implementation, but it should be possible to give a ballpark estimate of some sort. I.e. let's assume there are 15 million records with 100 characters in each (in the field that we're searching). I look up a 10-character string. There are 1000 hits. How much time would it take for those 1000 hits to be found if the database design and implementation is not particularly well optimized? 0.01 second? 1 second? 5 seconds?

Regarding file size, sure, it depends, but again, I'm looking for a ballpark. If the source data is 8GB of UTF-8 text, what are we looking at? More than the 8GB, or less (due to some internal compression the DB format might use)? Could one throw away the original text files after importing?

Re: Solr, it has a lot of the features I would want (optimized for text search, regex and sounds-like filters, hit highlighting), but it looks like it's designed to run on a server, not offline.


Comment on Re^2: Writing a database lookup tool
Re^3: Writing a database lookup tool
by tobyink (Abbot) on Jan 04, 2013 at 14:42 UTC

    Forget your quaint "offline" concept.

    The laptop is the server (and the client).

    perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
Re^3: Writing a database lookup tool
by marto (Bishop) on Jan 04, 2013 at 14:48 UTC

    Please clearly mark updates to your posts. I suggest you actually take the time to try this out for yourself: the learning experience will be worthwhile, and you'll soon realise how vague your questions actually are, and how essentially meaningless it would be to give you a result of a query running on X million records within my tuned environment for a database platform you'll never use.

    Your concept of server and offline is flawed. Your laptop would be the server in that it would host the database, webserver, Solr instance or whatever.

    Update: consider someone asking a piano maker "I have no experience in building a piano, no knowledge of wood and little knowledge of metal work. Roughly how long will it take me to build a piano by hand?"

    Update: fixed typo.

      I used the "offline" concept to clarify for you that the data would not live "on a remote server over a slow network".
      My questions are not vague or meaningless. At this point, the whole project is just taking shape, hence it obviously cannot have a full specification. Which is why I'm looking for guidance on what direction to take or whether the whole idea is feasible or not. This is clearly stated in the first post. I'm obviously not looking for exact figures on anything.
      Sticking with your example, a reasonable and helpful perlmonks user might answer: "If you want to build a piano with no experience in building a piano, no knowledge of wood and little knowledge of metal work, then be prepared that this project will take several years to complete - if you ever manage to complete it. You would need to learn a lot and there are no 'piano building for dummies' guides to help you along. The idea is best abandoned." Alternatively, a helpful perlmonks user might answer: "You would need to learn how to use Solr but there's a decent tutorial and documentation at XXX. It can run offline and lookup times in the 0.1 - 1 sec range should be easily attainable. No need to learn other languages, you can put it together in perl exclusively. It would take me ~5 hours to put a basic working app together... I guess it shouldn't take more than a week even if you're completely new to databases."

        "I used the "offline" concept to clarify for you that the data would not live "on a remote server over a slow network"."

        However:

        "Re: Solr, it has a lot of the features I would want (optimized for text search, regex and sounds-like filters, hit highlighting), but it looks like it's designed to run on a server, not offline."

        So you're assuming that Solr can't run on a laptop for some reason? Again, your offline concept is wrong, in the way you use it and what you seem to think it means.

        "My questions are not vague or meaningless."

        Your questions are very vague, e.g. (emphasis added by me)

        • "What sort of performance can I expect from whatever database engine I would end up using?" - No database platform specified.
        • "How much time would it take to import 8GB of text into a database format and how much space would it take up?"
        • "Most importantly, how much time would a lookup take on a run-of-the-mill laptop?" - No specification at all.
        • "Could the whole app be packaged up into a reasonably-sized .exe file with PAR::Packer?" - Subjective
        • "How much time would it take for those 1000 hits to be found if the database design and implementation is not particularly well optimized?" - Which database, how poorly optimized?

        At no point did I say they were meaningless**. IMHO a "reasonable" person would suggest you actually spend some time trying some of this out using different databases on the system you intend to run it on. I suggested this here. How long it would take you to develop such a system depends on you, how much you understand about the issues involved, and how much time you're prepared to spend. Given that you've looked at Solr and think it can't run on your laptop, investigations aren't going well so far. I wouldn't like to speculate how long it'll take you to develop a working system.

        Update: ** Ah, perhaps you interpreted me saying "..essentially meaningless it would be to give you a result of a query running on X million records within my tuned environment for a database platform you'll never use." as somehow being a slight against you or your questions. If so, please re-read and understand that it would be meaningless for me to provide an arbitrary metric.
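
For anyone wanting to try the "measure it yourself" route before choosing a database, here is a minimal sketch in core Perl that times a plain linear substring scan over synthetic records. It is only a baseline under stated assumptions: the record count, the needle and the planting interval are made-up test values (scaled down from the 15 million records in the question), everything is held in memory so disk I/O is not measured, and it says nothing about what an indexed engine such as Solr or a tuned database would do.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical test parameters: raise $n_records towards the real data
# size to see how a naive scan behaves on your own machine.
my $n_records   = 100_000;       # number of synthetic records
my $record_len  = 100;           # characters per record
my $needle      = 'perlmonks!';  # 10-character search string
my $plant_every = 100;           # plant the needle in every 100th record

srand(42);    # fixed seed so repeated runs use the same data

# Build synthetic records of random lowercase letters. The needle ends
# in '!', so it can only occur where it is planted deliberately.
my @alphabet = ('a' .. 'z');
my @records;
for my $i (0 .. $n_records - 1) {
    my $rec = join '',
        map { $alphabet[ int rand @alphabet ] } 1 .. $record_len;
    # Overwrite a same-length slice with the needle in selected records.
    substr($rec, 20, length($needle)) = $needle
        if $i % $plant_every == 0;
    push @records, $rec;
}

# Naive lookup: test every record for the substring with index().
my $start   = time;
my @hits    = grep { index($_, $needle) >= 0 } @records;
my $elapsed = time - $start;

printf "%d records scanned, %d hits, %.3f s\n",
    scalar @records, scalar @hits, $elapsed;
```

The elapsed time covers only the scan, not the data generation. Run it a few times with different values of `$n_records` to see how the scan time grows on the target laptop; whatever figure comes out is specific to that machine and to this unindexed approach.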
