hesco has asked for the wisdom of the Perl Monks concerning the following question:

Based on advice from Michel Rodriguez, author of XML::Twig, I am trying to use ->findnodes() to extract the content of each li tag from an HTML file. This is what I've got so far:

sub parse_mm_archive_cycle {
    my $self = shift;
    my $base_url = shift;
    my $cycle = shift;
    $cycle =~ s/:$//;
    my $url = "$base_url/$cycle/date.html";
    print STDERR $url, "\n";
    $self->{'agent'}->get( $url );
    my $html = $self->{'agent'}->content();
    $self->{'twig'}->parseurl($url,$self->{'agent'});
    print STDERR 'Our twig objects is: ' . Dumper($self->{'twig'});
    # $self->{'twig'}->parseurl($url);
    # my $root = $self->{'twig'}->root;
    for my $story ($self->{'twig'}->findnodes("li")){
        print STDERR 'Next story: ' . Dumper($story);
    }
    return; # $html;
}
But this keeps throwing an error on the ->parseurl line, reading:

syntax error at line 1, column 48, byte 48 at /usr/lib/perl5/XML/Parse line 187 at lib/CF/ line 113
Can anyone who has used this before please offer some advice?

if( $lal && $lol ) { $life++; }
if( $insurance->rationing() ) { $people->die(); }

Replies are listed 'Best First'.
Re: XML::Twig won't parse my url
by ambrus (Abbot) on Jan 06, 2011 at 10:11 UTC

    You need to tell XML::Twig explicitly that you want to parse HTML, not XML. There probably should be a parseurl_html method (there's parse_html and parsefile_html), but it's missing, so you need one of the following workarounds:

    $twig->safe_parseurl_html($url) or die;
    use LWP::Simple; $twig->parse_html(get($url) or die "error downloading HTML");

    Update: Fixed parse to parse_html, sorry. Also, as you already get the html above, you only need $twig->parse_html($html). Though I'd recommend $twig->safe_parseurl_html($url, $agent) or die instead because that way it sure gets the encoding right.
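
To make the parse_html suggestion concrete, here is a minimal sketch that works on an inline HTML string instead of a live URL. The helper name li_texts and the sample markup are hypothetical, not from the original post. It assumes HTML::TreeBuilder is installed, since XML::Twig relies on it to convert HTML to XML, and it uses the path '//li' on the assumption that the li elements sit somewhere below the root rather than directly under it.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: return the text of every <li> element in an HTML string.
sub li_texts {
    my ($html) = @_;
    my $twig = XML::Twig->new;

    # parse_html converts the HTML to XML first (via HTML::TreeBuilder),
    # so markup that is not well-formed XML does not abort the parse.
    $twig->parse_html($html);

    # '//li' matches <li> elements at any depth in the document.
    return map { $_->text } $twig->findnodes('//li');
}

# Sample input with unclosed <li> tags, which a strict XML parser would reject.
my $html = '<html><body><ul><li>First story<li>Second story</ul></body></html>';
print "Next story: $_\n" for li_texts($html);
```

In the original sub, the equivalent change would be to pass the already downloaded $html to parse_html in place of the parseurl call.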

Re: XML::Twig won't parse my url (RTFMS)
by tye (Sage) on Jan 06, 2011 at 00:59 UTC

    Look at line 187 of XML/ and figure out what broke about it. Compare to the same version from CPAN, XML-Parser. Fix the breakage, perhaps by (re)installing some version of XML::Parser.

    - tye        

Re: XML::Twig won't parse my url
by ikegami (Patriarch) on Jan 06, 2011 at 02:27 UTC
    Your input file isn't valid XML. The parser found an error at line 1, column 48 of your input file.
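
One way to confirm this diagnosis is to feed the downloaded text to the strict XML parser and look at the error it reports. The sketch below does that with safe_parse, which wraps the parse in an eval and leaves the message in $@ on failure. The helper name xml_parse_error and the sample strings are hypothetical.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: return the parser's error message if $text is not
# well-formed XML, or undef if it parses cleanly.
sub xml_parse_error {
    my ($text) = @_;
    my $twig = XML::Twig->new;

    # safe_parse returns a false value on failure instead of dying.
    my $ok = $twig->safe_parse($text);
    return $ok ? undef : ( $@ || 'unknown parse error' );
}

# Typical HTML: the unclosed <br> makes this invalid as XML.
my $err = xml_parse_error('<html><body><p>one<br>two</p></body></html>');
print defined $err ? "Not well-formed XML: $err\n" : "Parsed as XML\n";
```

The line and column in the message point into the input document, not into the Perl source, which matches the "line 1, column 48" in the question.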