I'm finally happy with the code I wrote to browse Wikipedia offline.
The trickiest part was keeping the database small. I group articles into blocks of 256; each block is frozen with Storable and then compressed with Bzip2. The resulting database is only about 15% larger than the original .xml.bz2.
I use XML::Parser to parse Wikipedia's database dump.
Here is the hardest part: converting the XML dump (see http://download.wikimedia.org) into a usable database:
#!/usr/bin/perl -w
use v5.14;
use strict;
use warnings;
die "please provide a database name\n" unless $ARGV[0];
my $rootname = $ARGV[0] =~ s/\.xml\.bz2$//r =~ s,.*/,,r;
use Encode;
use XML::Parser;
use IO::Uncompress::Bunzip2;
use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);
use Digest::MD5 qw(md5);
use Storable qw(freeze thaw);
open my $db, '>', "$rootname.db" or die "cannot open $rootname.db: $!";
binmode $db;    # the database holds raw compressed blocks
END { close $db if $db }
open my $t, '>', "$rootname.titles" or die "cannot open $rootname.titles: $!";
END { close $t if $t }
my ($title, @block, $char);
my %debug;
use DB_File;
tie my %index, 'DB_File', "$rootname.index"
    or die "cannot tie $rootname.index: $!";
END { untie %index }
$SIG{INT} = sub { die "caught INT signal" };
END { printf "%d entries made\n", scalar keys %index }
sub store {
my $freeze = freeze shift;
bzip2 \$freeze => \my $z or die "bzip2 failed: $Bzip2Error";
my $start = tell $db;
print $db pack('L', length $z), $z;
printf "block %d -> %d, compression ratio is %.2f%%\n",
    $start, tell($db), 100 * length($z) / length($freeze);
}
my $parser = XML::Parser->new(Handlers => {
    Char  => sub { shift; $char .= shift },
    Start => sub { undef $char },
    End   => sub {
        my (undef, $element) = @_;
        if ($element eq 'title') {
            $title = encode 'utf8', $char;
            say $t $title;
        }
        elsif ($element eq 'text') {
            push @block, $char;
            # short titles are stored verbatim, long ones hashed with MD5
            $index{length($title) > 16 ? md5 $title : $title} =
                pack 'LC', tell($db), scalar(@block) - 1;
            if (@block == 256) {
                store \@block;
                undef @block;
            }
        }
    },
});
$parser->parse( IO::Uncompress::Bunzip2->new($ARGV[0]) );
END { store \@block if @block }
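To make the record format clearer, here is a small round-trip sketch, not part of the original post: it writes one block the same way the script does (freeze, compress, then a length-prefixed record) and reads it back. The two-article block and the temp file are just stand-ins for illustration.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);
use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);
use File::Temp qw(tempfile);

# A stand-in block of articles (the real script collects 256 at a time).
my @block = ('First article text', 'Second article text');

# Write: freeze the block, compress it, store it as a length-prefixed record.
my $frozen = freeze \@block;
bzip2 \$frozen => \my $z or die "bzip2 failed: $Bzip2Error";

my ($fh, $file) = tempfile();
binmode $fh;
my $start = tell $fh;                 # this offset is what the index records
print $fh pack('L', length $z), $z;
close $fh;

# Read back: seek to the offset, read the 4-byte length header, then the payload.
open my $db, '<', $file or die "cannot open $file: $!";
binmode $db;
seek $db, $start, 0;
read $db, my $head, 4;
read $db, my $payload, unpack('L', $head);
bunzip2 \$payload => \my $thawed or die "bunzip2 failed: $Bunzip2Error";
my $articles = thaw $thawed;          # arrayref: one entry per article
print $articles->[1], "\n";           # prints "Second article text"
close $db;
```

A real lookup would get `$start` and the in-block position from the DB_File index (via `unpack 'LC'`) instead of remembering them.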
I think it works pretty well, even if the rendering from the Text::Mediawiki module is a bit ugly for some pages; I still need to handle references, for instance. Still, it does the job, and it's much faster than browsing online.
I posted everything (including the CGI script) on my Wikipedia user page, since it may also interest Wikipedia users:
http://fr.wikipedia.org/wiki/Utilisateur:Grondilu/Offline_Wikipedia_Perl
EDIT: I also set up a GitHub repo:
https://github.com/grondilu/offline-wikipedia-perl