Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

comment on

( [id://3333] : superdoc . print w/replies, xml ) Need Help??
I'm building a text search app using DBD::SQLite and FTS4. I'm completely new to databases, but I got the basic things working well. Now there are a couple of things I need advice on.

(1) I would like to compress the databases. A lot of people will be using dbs that contain 10 million records or more (upwards of 10GB or even 50GB uncompressed). This is all UTF-8 text data, some of it very repetitive so it compresses well. How do I go about this? The DBD::SQLite cpan page mentions compression briefly, and there is a bit on this at http://www.sqlite.org/fts3.html#section_6_1 but I can find no actual working sample code that I could use. The sqlite.org page seems to assume that I will write the compress and uncompress functions myself, which is way above my pay grade. Is there a ready-made solution somewhere that I am missing?

(2) Currently, my app can handle one database file at a time. It does a search and returns the results ordered by length. I would like to add support for multiple database files, allowing the user to set a ranking and displaying hits in that order (first, all hits from 'tier 1' dbs ordered by length, then all hits from 'tier 2' dbs ordered by length etc.). Is this feasible? Would I need to ATTACH each of the dbs to the same connection? How can I run the same query on all the dbs? Create a foreach my $db (@databases) loop and iterate through them, executing the query in each one in turn?

(3) I would like to add match highlighting but the DBD::SQLite page says "The current FTS implementation in SQLite is far from complete with respect to utf8 handling : in particular, variable-length characters are not treated correctly by the builtin functions offsets() and snippet()." Has that been fixed since this was written or should I forget about offsets() and try to write my own code that tries to analyze each string returned from the database and find the search terms in it so I can highlight them?

I can post the my current (working) code if anyone's interested.

In reply to DBD::SQLite questions by elef

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.