
WebFetch::PerlMonks

by zdog (Priest)
on Jun 02, 2001 at 08:13 UTC
Category: PerlMonks Related Scripts
Author/Contact Info Zenon Zabinski | zdog7@hotmail.com
Description: This module grabs the most recent PerlMonks.org posts using XML::Parser and generates an HTML file containing a list of links to those posts.

By default, the file is written to perlmonks.html. If that file already exists, a backup will be created at Operlmonks.html before the file is overwritten.

You need to have the WebFetch and XML::Parser modules installed to run this.

Special thanks to :
- OeufMayo for creating the XML::Parser tutorial that helped me create this.
- mirod for helping me with the XML::Parser problem.

Bugfixes:
- xml_char () now appends each chunk it receives, so the entire string is read in, as mirod suggested
- the tests that return () early were moved from xml_start () to xml_end () to keep from reading too many strings into one field

Suggestions please ...

#
# WebFetch::PerlMonks.pm - get recent posts on PerlMonks.org
#
# Copyright (c) 2001 Zenon Zabinski (zdog7@hotmail.com).
# All rights reserved. This program is free software;
# you can redistribute it and/or modify it under the
# same terms as Perl itself.
#
# Based on the source code of the module 
# WebFetch::DebianNews and WebFetch::Slashdot.
#

package WebFetch::PerlMonks;

use strict;
use vars qw ($VERSION @ISA @EXPORT @Options $parser @bad_nodes @posts $post);

use Exporter;
use XML::Parser;
use WebFetch;

@ISA = qw (Exporter WebFetch);
@EXPORT = qw (fetch_main);

# configuration parameters
$WebFetch::PerlMonks::filename = "perlmonks.html";
$WebFetch::PerlMonks::num_links = 30;
$WebFetch::PerlMonks::url = "http://www.perlmonks.org/index.pl?node=newest+nodes+xml+generator";

# no user-serviceable parts beyond this point

# XML stuff
$parser = XML::Parser->new (
    Handlers => {
        Start => \&xml_start,
        End   => \&xml_end,
        Char  => \&xml_char
    },
);

@bad_nodes = ('note', 'user', 'categorized answer');

sub fetch_main { WebFetch::run (); }

sub fetch
{
    my ( $self ) = @_;

    # set parameters for WebFetch routines
    $self->{url} = $WebFetch::PerlMonks::url;
    $self->{num_links} = $WebFetch::PerlMonks::num_links;
    $self->{table_sections} = $WebFetch::PerlMonks::table_sections;

    # process the links
    my $content = $self->get;
    $parser->parse ($$content);

    my @temp_posts = sort { $$b[1] <=> $$a[1] } @posts;
    undef @posts;

    for (my $i = 0; $i < $self->{num_links} && @temp_posts; $i++)
    {
        # turn YYYYMMDDhhmmss into "hh:mm:ss DD-MM-YYYY"
        $temp_posts[0][1] =~ s/(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/$4:$5:$6 $3-$2-$1/;
        $temp_posts[0][2] = "http://www.perlmonks.org/?node_id=". $temp_posts[0][2];
        push @posts, shift (@temp_posts);
    }
    
    $self->html_gen ( $WebFetch::PerlMonks::filename, 
        sub { return "<a href=\"".$_[2]."\">".$_[0]."</a> (".$_[1].")"; },
        \@posts );

    # export content if --export was specified
    if ( defined $self->{export}) {
        $self->wf_export( $self->{export},
            [ "title", "date", "url" ],
            \@posts,
            "Exported from WebFetch::PerlMonks\n"
                ."\"title\" is article title\n"
                ."\"date\" is the date stamp\n"
                ."\"url\" is article URL" );
    }
}

sub xml_start
{
    my ($p, $el, %atts) = @_;
    $atts{'title'} = '';
    $post = \%atts;
}

sub xml_end
{
    my ($p, $el) = @_;
    return unless $el eq 'NODE';
    # %atts is local to xml_start (), so test the attributes saved in $post
    return if grep { $_ eq $post->{'nodetype'} } @bad_nodes;
    push @posts, [$post->{'title'}, $post->{'createtime'}, $post->{'node_id'}];
}

sub xml_char
{
    my ($p, $title) = @_;
    $post->{'title'} .= $title;
}

1;

__END__

# POD docs follow

=head1 NAME

WebFetch::PerlMonks - generate a file of recent PerlMonks.org posts

=head1 SYNOPSIS

In perl scripts:

    use WebFetch::PerlMonks; &fetch_main

From the command line:

    perl -w -MWebFetch::PerlMonks -e "&fetch_main" -- --dir directory

=head1 DESCRIPTION

This module grabs the most recent PerlMonks.org posts using
XML::Parser and generates an HTML file containing a list of 
links to those posts.

By default, the file is written to perlmonks.html. If that file
already exists, a backup will be created at Operlmonks.html
before the file is overwritten.

=head1 AUTHOR

WebFetch was written by Ian Kluft
for the Silicon Valley Linux User Group (SVLUG).

The WebFetch::PerlMonks module was written by Zenon Zabinski.
Send patches or maintenance requests for this module to
C<zdog7@hotmail.com>.

=head1 SEE ALSO

WebFetch

=cut
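
The post-processing done in fetch () (newest first, cut to num_links, reformat the timestamp, build the URL) can be tried on its own. The following is only a sketch: the two sample rows are made up, and @latest and the loop variable are illustrative names that do not appear in the module. It assumes createtime is always a 14-digit YYYYMMDDhhmmss string, which is what the module's regex expects.

```perl
use strict;
use warnings;

# Made-up sample rows in the module's [title, createtime, node_id] layout.
my @posts = (
    [ 'Older post', '20010601235959', 85001 ],
    [ 'Newer post', '20010602081300', 85153 ],
);

my $num_links = 1;    # the module's default is 30

# Newest first: fixed-width digit strings sort chronologically
# when compared as numbers.
my @sorted = sort { $b->[1] <=> $a->[1] } @posts;

# Keep at most $num_links entries (splice copes with shorter lists).
my @latest = splice @sorted, 0, $num_links;

for my $entry (@latest) {
    # YYYYMMDDhhmmss becomes "hh:mm:ss DD-MM-YYYY"
    $entry->[1] =~ s/(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/$4:$5:$6 $3-$2-$1/;
    # bare node id becomes a full URL
    $entry->[2] = "http://www.perlmonks.org/?node_id=" . $entry->[2];
}

printf "%s (%s) %s\n", @$_ for @latest;
```

With the sample data this prints one line for 'Newer post', dated 08:13:00 02-06-2001.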
Replies are listed 'Best First'.
Re: WebFetch::PerlMonks
by mirod (Canon) on Jun 02, 2001 at 09:36 UTC

    I am afraid you have the usual XML::Parser problem: the Char handler does not guarantee that it will return the entire content of an element at once. It can be called several times for a single string, depending on entities being present and on the overall length of the document.

    So if one of the titles includes an entity, as in "I Love B&D Perl", the Char handler will be called 3 times: once with 'I Love B', once with '&' and once with 'D Perl', and you will only get the last part in $post->{title}.

    The solution in this case is simply to replace $post->{'title'} = $title; with $post->{'title'} .= $title;, but look for a more generic solution in the review.
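
The difference between assigning and appending in the Char handler can be seen without a parser at all. The sketch below simulates the three callbacks by hand, so it runs without XML::Parser installed. The names on_start, on_char and on_end are hypothetical stand-ins for real handlers, and the reset-in-Start, append-in-Char, read-in-End pattern is one common arrangement, not necessarily the one the review recommends.

```perl
use strict;
use warnings;

# The chunks a Char handler might receive for the title "I Love B&D Perl".
my @chunks = ('I Love B', '&', 'D Perl');

# Assigning: each call overwrites the previous chunk.
my $overwritten = '';
$overwritten = $_ for @chunks;

# Appending, with a buffer that is reset per element and read at the end.
my ($buffer, @titles);

sub on_start { $buffer = '' }                          # new element: empty buffer
sub on_char  { my ($chunk) = @_; $buffer .= $chunk }   # append, never assign
sub on_end   { push @titles, $buffer }                 # text is complete only here

on_start ();
on_char ($_) for @chunks;
on_end ();

print "assigned: $overwritten\n";    # D Perl
print "appended: $titles[0]\n";      # I Love B&D Perl
```

Reading the buffer only in the End handler is what makes the pattern independent of how many times the Char handler fires.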
