Re: Writing a database lookup tool

by marto (Bishop)
on Jan 04, 2013 at 14:16 UTC

in reply to Writing a database lookup tool

The answer to a lot of your questions is: it depends. How fast something runs on a machine of unknown spec depends on many factors, including how you've written the app, where the database will live (a remote server over a slow network?), the state of the machine, and so on.

"What sort of performance can I expect from whatever database engine I would end up using?"

Many modern databases are capable of some fancy text searching 'out of the box'. The performance depends to a great extent on your database of choice, how you configure it, how you index the data and how you query the data.
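To make the point about indexing concrete, here is a toy sketch in core Perl. It is not how any particular database engine implements its text index; the sub names (`build_word_index`, `scan_lookup`, `index_lookup`) and the three sample records are made up for illustration. It only contrasts testing every record against a single hash lookup in a prebuilt word index.

```perl
use strict;
use warnings;

# Map each lowercased word to the list of record numbers containing it.
sub build_word_index {
    my ($records) = @_;
    my %index;
    for my $id (0 .. $#$records) {
        my %seen;
        for my $word (grep { length } split /\W+/, lc $records->[$id]) {
            push @{ $index{$word} }, $id unless $seen{$word}++;
        }
    }
    return \%index;
}

# Unindexed lookup: test every record, so cost grows with record count.
sub scan_lookup {
    my ($records, $word) = @_;
    return [ grep { $records->[$_] =~ /\b\Q$word\E\b/i } 0 .. $#$records ];
}

# Indexed lookup: one hash access, however many records there are.
sub index_lookup {
    my ($index, $word) = @_;
    return $index->{ lc $word } || [];
}

# Hypothetical sample data, standing in for the real text.
my @records = (
    'The quick brown fox',
    'A database lookup tool',
    'Lookup tables are quick',
);
my $index = build_word_index(\@records);

print "scan:  @{ scan_lookup(\@records, 'lookup') }\n";
print "index: @{ index_lookup($index, 'lookup') }\n";
```

Both lookups report record numbers 1 and 2 for this sample. The index costs time and memory to build up front, which is the same trade a database makes when you ask it to index a column.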

"How much time would it take to import 8GB of text into a database format and how much space would it take up?"

Importing the data into a database shouldn't take too long, and it's a one-time thing. For size, it depends on the database and its data compression.

"Could the whole app be packaged up into a reasonably-sized .exe file with PAR::Packer?"

Depending on your definition of reasonable, yes.

If your goal is to package an application to allow users to remotely query a database, consider the alternatives, for example a web-based search tool running on the same server as the database. Consider also that other open source products already exist for text searching, for example Solr (note, it's not Perl) and the Perl module Solr.

Update: fixed typo.

Re^2: Writing a database lookup tool
by elef (Friar) on Jan 04, 2013 at 14:30 UTC
    The data would be offline, i.e. on the user's computer.
    I realize that speed depends on the specs and the implementation, but it should be possible to give a ballpark estimate of some sort. I.e. let's assume there are 15 million records with 100 characters in each (in the field that we're searching). I look up a 10-character string. There are 1000 hits. How much time would it take for those 1000 hits to be found if the database design and implementation is not particularly well optimized? 0.01 second? 1 second? 5 seconds?

    Regarding file size, sure, it depends, but again, I'm looking for a ballpark. If the source data is 8GB of UTF-8 text, what are we looking at? More than the 8GB, or less (due to some internal compression the DB format might use)? Could one throw away the original text files after importing?

    Re: Solr, it has a lot of the features I would want (optimized for text search, regex and sounds-like filters, hit highlighting), but it looks like it's designed to run on a server, not offline.

      Forget your quaint "offline" concept.

      The laptop is the server (and the client).

      perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

      Please clearly mark updates to your posts. I suggest you actually take the time to try this out for yourself; the learning experience will be worthwhile and you'll soon realise how vague your questions actually are, and how essentially meaningless it would be to give you a result of a query running on X million records within my tuned environment for a database platform you'll never use.
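      A minimal sketch of such an experiment, using only core Perl. Everything here is an assumption made for illustration: the sub names, the synthetic random records, the needle string and the record count. It times a plain, unindexed, in-memory scan with index(), not a database query, so treat any number it prints as a rough baseline on your own machine rather than a prediction for a real engine.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Build $count synthetic records of $len characters each.
# Every $every-th record has $needle planted in it, so the hit count is known.
sub make_records {
    my ($count, $len, $needle, $every) = @_;
    my @alphabet = ('a' .. 'z', ' ');
    my @records;
    for my $i (1 .. $count) {
        my $text = join '', map { $alphabet[rand @alphabet] } 1 .. $len;
        substr($text, 10, length($needle)) = $needle if $i % $every == 0;
        push @records, $text;
    }
    return \@records;
}

# Count the records containing $needle with a linear scan.
sub count_hits {
    my ($records, $needle) = @_;
    return scalar grep { index($_, $needle) >= 0 } @$records;
}

# Uppercase needle cannot occur by chance in the lowercase random text.
my $needle  = 'XQZ_LOOKUP';    # 10-character search string
my $records = make_records(20_000, 100, $needle, 100);

my $start   = time();
my $hits    = count_hits($records, $needle);
my $elapsed = time() - $start;

printf "%d hits in %d records, %.4f s\n", $hits, scalar @$records, $elapsed;
```

      Raising the record count towards the 15 million in the question needs a lot of memory for the array alone, and the result still says nothing about disk access, indexing or a particular database platform, which is rather the point.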

      Your concept of server and offline is flawed. Your laptop would be the server in that it would host the database, webserver, Solr instance or whatever.

      Update: consider someone asking a piano maker "I have no experience in building a piano, no knowledge of wood and little knowledge of metal work. Roughly how long will it take me to build a piano by hand?"

      Update: fixed typo.

        I used the "offline" concept to clarify for you that the data would not live "on a remote server over a slow network".
        My questions are not vague or meaningless. At this point, the whole project is just taking shape, hence it obviously cannot have a full specification. Which is why I'm looking for guidance on what direction to take or whether the whole idea is feasible or not. This is clearly stated in the first post. I'm obviously not looking for exact figures on anything.
        Sticking with your example, a reasonable and helpful perlmonks user might answer: "If you want to build a piano with no experience in building a piano, no knowledge of wood and little knowledge of metal work, then be prepared that this project will take several years to complete - if you ever manage to complete it. You would need to learn a lot and there are no 'piano building for dummies' guides to help you along. The idea is best abandoned." Alternatively, a helpful perlmonks user might answer: "You would need to learn how to use Solr but there's a decent tutorial and documentation at XXX. It can run offline and lookup times in the 0.1 - 1 sec range should be easily attainable. No need to learn other languages, you can put it together in perl exclusively. It would take me ~5 hours to put a basic working app together... I guess it shouldn't take more than a week even if you're completely new to databases."
