It seems to me that the sticking point is going to be deciding what qualifies as "pure text." Getting the content out of the XML is fairly trivial: load the file into an XML parsing module, walk the resulting structure recursively, and grab the values of the "content" keys. (I only downloaded about 0.2% of the file as a sample, but the structure appears consistent throughout.) The simple bit of code below does that, counts the "words" in a hash, and prints the sorted results. However, since it splits the text on whitespace, the resulting words contain a lot of punctuation, including wiki formatting. So you'll have to parse that out, and also deal with other issues: Unicode and HTML-encoded characters, embedded HTML tags, "wide characters," and more.
#!/usr/bin/env perl
use Modern::Perl;
use XML::Simple;

my $xml = XML::Simple->new();
my $in  = $xml->XMLin('wiki.xml');

my %dict;
walk($in);

# Print each word and its count, sorted alphabetically by word.
for (sort { $a cmp $b } keys %dict) {
    say "$dict{$_} $_";
}

# Recurse through the parsed structure, harvesting every 'content' value.
sub walk {
    my $h = shift;
    for my $k (keys %$h) {
        if ($k eq 'content') {
            add_to_dict($h->{$k});
        }
        elsif (ref($h->{$k}) eq 'HASH') {
            walk($h->{$k});
        }
        elsif (ref($h->{$k}) eq 'ARRAY') {
            # XML::Simple turns repeated elements into array refs,
            # so descend into each hash element as well.
            walk($_) for grep { ref($_) eq 'HASH' } @{ $h->{$k} };
        }
    }
}

# Split a chunk of text on whitespace and tally each token.
sub add_to_dict {
    my $text = shift;
    for my $w (split /\s+/, $text) {
        $dict{$w}++;
    }
}
Aaron B.
Available for small or large Perl jobs; see my home node.