Re^2: How do I read from a compressed SQLite FTS4 database with DBD::SQLite?

by elef (Friar)
on Nov 29, 2013 at 15:36 UTC ( [id://1064977] )


in reply to Re: How do I read from a compressed SQLite FTS4 database with DBD::SQLite?
in thread How do I read from a compressed SQLite FTS4 database with DBD::SQLite?

Thank you, that fixes it. It runs correctly with ASCII text now.
I've run into another problem, though: a bunch of SQL logic errors on INSERT, which seem to be caused by IO::Compress::Gzip not handling non-ASCII text, at least not in the way I've been using it.
So, is there some flag I can set to make it handle UTF-8 text? Or should I use some other compression method? All the compression-related Perl modules I found seemed to be designed to work on files; IO::Compress::Gzip was the only one I could find that offers a simple way to compress strings (a stripped-down sketch of the call I mean is below).
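use IO::Compress::Gzip qw(gzip $GzipError);

# One-shot, in-memory compression: input and output are scalar refs,
# no temporary files involved.
my $in = "some text to compress";
my $out;
gzip \$in => \$out or die "gzip failed: $GzipError";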
Also, my test db created with compression is much larger and slower than the same test db without compression, which is not exactly what I was hoping for...

Replies are listed 'Best First'.
Re^3: How do I read from a compressed SQLite FTS4 database with DBD::SQLite?
by Corion (Patriarch) on Nov 29, 2013 at 15:39 UTC

    I would assume that you should explicitly encode your text to UTF-8 on compressing and explicitly decode from UTF-8 on decompressing. IO::Compress::Gzip likely only works on octets and expects octets. I wonder why it doesn't scream bloody murder...

    Also, maybe you need to explicitly set sqlite_unicode if you are reading/storing UTF-8 data in SQLite.
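    Something like this is what I mean (a sketch; the filename is made up):

    use DBI;

    # sqlite_unicode => 1 makes DBD::SQLite decode TEXT columns from
    # UTF-8 octets into Perl characters on fetch, and encode on store.
    my $dbh = DBI->connect(
        "dbi:SQLite:dbname=test.db", "", "",
        { RaiseError => 1, sqlite_unicode => 1 },
    );

    As far as I remember this only applies to TEXT columns; BLOBs are passed through untouched, which is probably what you want for the compressed data anyway.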

      Thanks. I did have sqlite_unicode => 1, and the db worked with non-ASCII text if I didn't try to compress it.

      This seems to fix the problem:
      use Encode qw(encode decode);
      use IO::Compress::Gzip qw(gzip $GzipError);
      use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

      # Encode characters to UTF-8 octets before compressing...
      sub compressor {
          my $in = shift;
          $in = encode('utf8', $in);
          my $out;
          gzip \$in => \$out or die "gzip failed: $GzipError";
          return $out;
      }

      # ...and decode the octets back to characters after decompressing.
      sub uncompressor {
          my $in = shift;
          my $out;
          gunzip \$in => \$out or die "gunzip failed: $GunzipError";
          return decode('utf8', $out);
      }


      I tested it with some real-life sample data and the compression isn't doing too well: the source data is a 9.4MB text file that compresses down to a 2.4MB zip file. When I import it without compression, I get a 19.7MB db file. With compression, the db file is 17.0MB. That's a little smaller than the uncompressed db, but not enough to make it worth it. I was hoping for something in the 10MB range (~50% compression). I imagine the reason is that each string is compressed separately, so repetition across strings can't be exploited during compression. Is this a lost battle? If not, I'd be grateful for suggestions on a better approach.
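      A quick way to see the fixed per-string cost (untested sketch):

      use IO::Compress::Gzip qw(gzip $GzipError);

      # Every gzip stream carries a fixed header and trailer (~18 bytes),
      # so sentence-length segments barely shrink, or even grow.
      my $segment = "a typical sentence-length segment of source text";
      my $out;
      gzip \$segment => \$out or die "gzip failed: $GzipError";
      printf "in: %d bytes, out: %d bytes\n", length($segment), length($out);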
        Greetings, elef

        Have you tried any of the other forms of compression IO::Compress offers? My personal experience when creating archives seems to indicate that the xz algorithm provides better results more often than not, and I notice IO::Compress also offers IO::Compress::Xz. Of course, every algorithm behaves differently depending on the type of input data, but I thought it worth mentioning. A sketch of the drop-in usage is below.
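        Something like this should be all it takes to try it (a sketch; it needs the IO-Compress-Lzma distribution installed, and I haven't benchmarked it on your data):

        use IO::Compress::Xz qw(xz $XzError);
        use IO::Uncompress::UnXz qw(unxz $UnXzError);

        # Same in-memory scalar-ref interface as gzip/gunzip, so it
        # should drop straight into your compressor/uncompressor subs.
        my $in = "some UTF-8-encoded octets";
        my ($compressed, $restored);
        xz   \$in         => \$compressed or die "xz failed: $XzError";
        unxz \$compressed => \$restored   or die "unxz failed: $UnXzError";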

        Best Wishes.

        --Chris

        #!/usr/bin/perl -Tw
        use Perl::Always or die;
        my $perl_version = (5.12.5);
        print $perl_version;
