comment on

I stumbled across a site called “FLOSS Manuals” recently, and thought that it would be a great place to create some new Plucker documents for our users, and distribute them. I’ve create hundreds of other Plucker documents for users in years past, so this was a natural progression of that.

When I quickly Googled around, I found someone was already doing exactly that, albeit in a one-off shell script.

I decided to take his work and build upon it, making it self-healing, and created what I call the “Plucker FLOSS Manuals Autobuilder v1.0″ :)

Optimizations, comments and suggestions welcome and appreciated...

#!/usr/bin/perl -w

use strict;
use warnings;
use diagnostics;
use LWP::UserAgent;
use HTML::SimpleLinkExtor;

my $flossurl    = 'http://en.flossmanuals.net';
my $ua          = 'Plucker FLOSS Manuals Autobuilder v1.0 [desrod@gnu-
+designs.com]';
my $top_extor   = HTML::SimpleLinkExtor->new();

# fetch the top-level page and extract the child pages
$top_extor->parse_url($flossurl, $ua);
my @links       = grep(m:^/:, $top_extor->a);
pop @links;     # get rid of '/news' item from @links; fragile

# Get the print-only page of each child page
get_printpages($flossurl . $_) for @links;

######################################################################
+####### 
#
# Get the pages themselves, and return their content to the caller
#
######################################################################
+#######
sub get_content {
        my $url = shift;

        my $ua          = 'Mozilla/5.0 (en-US; rv:1.4b) Gecko/20030514
+';
        my $browser     = LWP::UserAgent->new();
        $browser->agent($ua);
        my $response    = $browser->get($url);
        my $decoded     = $response->decoded_content;
 
        # This was necessary, because of a bug in ::SimpleLinkExtor,
        # otherwise this code would be 10 lines shorter. Sigh.
        if ($response->is_success) {
                return $decoded;
        }
}


######################################################################
+####### 
#
# Fetch the print links from the child pages snarfed from the top-leve
+l page
#
######################################################################
+#######
sub get_printpages {
        my $page = shift;

        my $sub_extor   = HTML::SimpleLinkExtor->new();
        $sub_extor->parse(get_content($page));

        # Single out only the /print links on each sub-page
        my @printlinks  = grep(m:^/.*/print$:, $sub_extor->a);

        my $url         = $flossurl . $printlinks[0];
        (my $title      = $printlinks[0]) =~ s,\/(\w+)\/print,$1,;

        # Build it with Plucker
        print "Building $title from $url\n";
        plucker_build($url, $title);
}


######################################################################
+####### 
#
# Build the content with Plucker, using a "safe" system() call in list
+-mode
#
######################################################################
+#######
sub plucker_build {
        my ($url, $title) = @_;

        my $workpath            = "/tmp/";
        my $pl_url              = $url;
        my $pl_bpp              = "8";   
        my $pl_compression      = "zlib";
        my $pl_title            = $title;
        my $pl_copyprevention   = "0";
        my $pl_no_url_info      = "0";
        my $pdb                 = $title;

        my $systemcmd   = "/usr/bin/plucker-build";

        my @systemargs  = (
                        '-p', $workpath, 
                        '-P', $workpath,
                        '-H', $pl_url,
                        $pl_bpp ? "--bpp=$pl_bpp" : (),
                        ($pl_compression ? "--${pl_compression}-compre
+ssion" : ''),
                        '-N', $pl_title,
                        $pl_copyprevention ? $pl_copyprevention : (),
                        $pl_no_url_info ? $pl_no_url_info : (),
                        '-V1',
                        "--staybelow=$flossurl/floss/pub/$title/",
                        '--stayonhost',
                           '-f', "$pdb");

        system($systemcmd, @systemargs);
}
[download]

In reply to Converting FLOSS manuals to Plucker format by hacker

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


No such thing as a small change
	PerlMonks