PerlMonks  

Offline wikipedia using Perl

by grondilu (Pilgrim)
on Mar 08, 2012 at 13:49 UTC (#958466=CUFP)

I'm finally happy with the code I wrote to browse Wikipedia offline.

The trickiest part was keeping the database small, so I store the articles in blocks of 256. Each block is frozen using Storable and then compressed with Bzip2. As a result, the database is only about 15% larger than the original xml.bz2 dump.
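To make the block layout concrete, here is a minimal sketch of one block being encoded and decoded in memory. It assumes the record format used by the script in this post (a 4-byte `pack 'L'` length prefix followed by the bzip2-compressed Storable image); the `encode_block` and `decode_block` names and the sample articles are made up for illustration.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);
use IO::Compress::Bzip2     qw(bzip2   $Bzip2Error);
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

# Serialize an array of article texts, compress it, and prefix the
# compressed length as a 32-bit unsigned integer (pack 'L').
sub encode_block {
    my ($articles) = @_;
    my $frozen = freeze($articles);
    bzip2(\$frozen => \my $z) or die "bzip2 failed: $Bzip2Error";
    return pack('L', length $z) . $z;
}

# Reverse of encode_block: read the length prefix, decompress, thaw.
sub decode_block {
    my ($record) = @_;
    my $len = unpack 'L', $record;
    my $z   = substr $record, 4, $len;
    bunzip2(\$z => \my $frozen) or die "bunzip2 failed: $Bunzip2Error";
    return thaw($frozen);
}

# Hypothetical sample data: 256 short, repetitive "articles".
my @block   = map { "Article $_: " . ('lorem ipsum ' x 50) } 1 .. 256;
my $record  = encode_block(\@block);
my $decoded = decode_block($record);

printf "%d articles, %d bytes stored\n", scalar @$decoded, length $record;
```

Grouping many articles into one block gives bzip2 more redundancy to work with than compressing each article alone, at the cost of decompressing a whole block to read a single article.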

I also use XML::Parser to parse Wikipedia's database dump.

Here is the most difficult part: converting the XML dump (see http://download.wikimedia.org) into a usable database:

#!/usr/bin/perl -w
use v5.14;
use strict;
use warnings;

die "please provide a database name" unless $ARGV[0];
my $rootname = $ARGV[0] =~ s/\.xml\.bz2\E//r =~ s,.*/,,r;

use Encode;
use XML::Parser;
use IO::Uncompress::Bunzip2;
use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);
use Digest::MD5 qw(md5);
use Storable qw(freeze thaw);

open my $db, "> $rootname.db";
END { close $db }
open my $t, "> $rootname.titles";
END { close $t }

my ($title, @block, $char);
my %debug;
use DB_File;
tie my %index, 'DB_File', "$rootname.index";
END { untie %index }

$SIG{INT} = sub { die "caught INT signal" };
END { printf "%d entries made\n", scalar keys %index }

sub store {
    my $freeze = freeze shift;
    bzip2 \($freeze, my $z);
    my $start = tell $db;
    print $db pack('L', length $z), $z;
    printf "block %d -> %d, compressed ratio is %2.2f%%\n",
        $start, tell($db),
        100*length($z)/length($freeze),
        ;
}

my $parser = new XML::Parser Handlers => {
    Char  => sub { shift; $char .= shift },
    Start => sub { undef $char },
    End   => sub {
        shift;
        given( $_[0] ) {
            when( 'title' ) {
                $title = encode 'utf8', $char;
                say $t $title
            }
            when( 'text' ) {
                push @block, $char;
                $index{length($title) > 16 ? md5 $title : $title}
                    = pack 'LC', tell($db), scalar(@block) - 1;
                if (@block == 256) { store \@block; undef @block; }
            }
        }
    },
};
$parser->parse( new IO::Uncompress::Bunzip2 $ARGV[0] );
END { store \@block if @block }
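The script above only covers the write side. As a rough sketch of how an article could be read back, the example below builds a tiny database with the same record layout and then looks a title up. The `index_key` and `fetch_article` helpers, the sample articles, and the plain hash standing in for the DB_File-tied `%index` are all assumptions made for this illustration; this is not the CGI script mentioned in this thread.

```perl
use strict;
use warnings;
use File::Temp  qw(tempfile);
use Digest::MD5 qw(md5);
use Storable    qw(freeze thaw);
use IO::Compress::Bzip2     qw(bzip2   $Bzip2Error);
use IO::Uncompress::Bunzip2 qw(bunzip2 $Bunzip2Error);

# Same key rule as the script: titles longer than 16 bytes are replaced
# by their 16-byte MD5 digest, so no key exceeds 16 bytes.
sub index_key { my ($title) = @_; length($title) > 16 ? md5($title) : $title }

# Hypothetical miniature database: two blocks of two articles each.
# A plain hash stands in for the DB_File-tied %index.
my @blocks = (
    [ [ 'Perl'  => 'Perl is a programming language.' ],
      [ 'Bzip2' => 'bzip2 is a compression program.' ] ],
    [ [ 'A rather long article title' => 'Text of the long-titled article.' ],
      [ 'Wiki'  => 'A wiki is a collaborative website.' ] ],
);
my ($db, $path) = tempfile(UNLINK => 1);
binmode $db;
my %index;
for my $block (@blocks) {
    my $offset = tell $db;          # where this block starts in the file
    my $slot   = 0;                 # position of the article in the block
    $index{ index_key($_->[0]) } = pack 'LC', $offset, $slot++ for @$block;
    my $frozen = freeze [ map { $_->[1] } @$block ];
    bzip2(\$frozen => \my $z) or die "bzip2 failed: $Bzip2Error";
    print {$db} pack('L', length $z), $z or die "write failed: $!";
}
close $db or die "close failed: $!";

# Look up a title: unpack (offset, slot), seek to the block, read the
# 4-byte length prefix, decompress and thaw the block, return one article.
sub fetch_article {
    my ($path, $index, $title) = @_;
    my $entry = $index->{ index_key($title) } // return;
    my ($offset, $slot) = unpack 'LC', $entry;
    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    seek $fh, $offset, 0 or die "seek failed: $!";
    read($fh, my $prefix, 4) == 4 or die "short read on length prefix";
    my $len = unpack 'L', $prefix;
    read($fh, my $z, $len) == $len or die "short read on block";
    bunzip2(\$z => \my $frozen) or die "bunzip2 failed: $Bunzip2Error";
    return thaw($frozen)->[$slot];
}

print fetch_article($path, \%index, 'A rather long article title'), "\n";
```

One thing worth noting about the layout: `pack 'L'` is a 32-bit value, so the stored offsets can only address the first 4 GiB of the database file.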

I think it works pretty well, even if the rendering by the Text::Mediawiki module is a bit ugly for some pages; I still need to take care of the references, for instance. Still, it does the job, and it's much faster than online browsing.

I posted everything (including the CGI script) on my Wikipedia user page, as it also concerns Wikipedia users:

http://fr.wikipedia.org/wiki/Utilisateur:Grondilu/Offline_Wikipedia_Perl

EDIT. I also set up a github repo: https://github.com/grondilu/offline-wikipedia-perl

Re: Offline wikipedia using Perl
by wazoox (Prior) on Mar 09, 2012 at 18:09 UTC

    This looks nice, but I don't really get how it is meant to be used. I suppose I should check your Wikipedia page for the missing parts :)

    Just a couple of proposed enhancements:

    • as you're using "warnings", there is no point calling "perl -w"
    • you don't check for errors when opening files and writing. This is worse than a crime, it's a mistake :)

      Once the database has been built, it is meant to be used with a CGI script and a local web server. The CGI script is indeed on the Wikipedia page, but it is kind of ugly, so I didn't post it here as I am not very proud of it :) A CGI script is easy to write anyway. Note that it requires Text::Mediawiki in order to turn wiki markup into HTML.

      As for checking errors during file openings and writings, I'll try to correct this.
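For what it's worth, here is a minimal sketch of what error-checked opens and writes could look like. The temporary directory and the `example` root name are placeholders for this illustration; the script in the post derives `$rootname` from the dump file name instead.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical output location, used only for this example.
my $dir      = tempdir(CLEANUP => 1);
my $rootname = "$dir/example";

# Three-argument open keeps the mode separate from the file name,
# and "or die" reports the reason ($!) when the open fails.
open my $db, '>:raw', "$rootname.db"
    or die "cannot open $rootname.db for writing: $!";
open my $t, '>', "$rootname.titles"
    or die "cannot open $rootname.titles for writing: $!";

# print returns false on failure; buffered write errors (a full disk,
# for instance) may only surface at close, so check that as well.
print {$t}  "Some title\n" or die "cannot write to $rootname.titles: $!";
print {$db} pack('L', 0)   or die "cannot write to $rootname.db: $!";

close $t  or die "cannot close $rootname.titles: $!";
close $db or die "cannot close $rootname.db: $!";
```

The core autodie pragma is another option if adding `or die` to every call feels too noisy.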

      as you're using "warnings", there is no point calling "perl -w"
      There is a difference, from perldoc warnings:
      The warnings pragma is a replacement for the command line flag -w, but the pragma is limited to the enclosing block, while the flag is global. See perllexwarn for more information.
      -w does everything warnings does, not the other way around. That being said, it is unlikely the OP wants to enable warnings for use'd modules (XML::Parser, etc.).
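A small sketch of that difference, assuming nothing beyond core Perl: `$^W` is the variable that the -w flag sets, and the `legacy` and `modern` subs are made-up names. The file deliberately has no file-wide `use warnings`, so `legacy` behaves like code from a module that never asked for warnings.

```perl
use strict;   # note: no file-wide "use warnings" here, on purpose

my @caught;
$SIG{__WARN__} = sub { push @caught, $_[0] };

# Compiled with no warnings pragma in scope, like an older module:
# only the global -w flag (the $^W variable) controls it.
sub legacy { my $x; return "value: " . $x }

# Compiled under the lexical pragma: warns whether or not -w is set.
sub modern { use warnings; my $x; return "value: " . $x }

{ local $^W = 0; legacy() }   # flag off, no pragma: silent
my $without_flag = @caught;

{ local $^W = 1; legacy() }   # flag on: warns, as it would under perl -w
my $with_flag = @caught - $without_flag;

{ local $^W = 0; modern() }   # flag off, but the pragma applies lexically
my $lexical = @caught - $without_flag - $with_flag;

print "legacy without -w: $without_flag, legacy with -w: $with_flag, ",
      "modern: $lexical\n";
```

The counts show how many "uninitialized value" warnings each call produced: the pragma only reaches code inside its own lexical scope, while the flag also reaches code that never enabled warnings itself.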
Re: Offline wikipedia using Perl
by spx2 (Chaplain) on Mar 13, 2012 at 13:25 UTC
    This project looks very interesting. Put it up on github.com; maybe some people will want to fork it and add stuff to it.
