Re: Parsing HTML/XML with Regular Expressions (HTML::Parser)

by fishy (Pilgrim)
on Oct 17, 2017 at 22:22 UTC (#1201540)

in reply to Parsing HTML/XML with Regular Expressions

Hi Monks,
someone had to try with HTML::Parser... Here I am:
use warnings;
use strict;
use HTML::Parser;

my $parser = HTML::Parser->new(
    api_version        => 3,
    start_h            => [\&start_handler, "self, tagname, attr"],
    strict_names       => 1,
    empty_element_tags => 1,
);

my $file = "1201438.html";
open(my $fh, "<", $file) or die "Can't open < $file: $!";
my $contents = do { local $/; <$fh> };
close $fh;

$parser->parse($contents);

for (keys %{$parser->{_numbers}}) {
    print "$_=", join("", @{$parser->{_numbers}->{$_}}), ", ";
}
print "\n";

sub start_handler {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq 'div';
    $self->handler(start => \&number_start_handler, "self,tagname,attr");
}

# <div class="data" id="Zero" />
sub number_start_handler {
    my ($self, $tag, $attr) = @_;
    if (   exists $attr->{class} && $attr->{class} eq 'data'
        && exists $attr->{id}
        && $attr->{id} =~ /(Zero|One|Two|Three|Four|Five|Six|Seven)/ ) {
        $self->{_now} = $1;
        $self->{_numbers}->{$1} = [];
        $self->handler(text => \&number_text_handler, "self,text");
    }
    elsif ($tag eq 'b') {
        $self->handler(text => \&number_text_handler, "self,text");
    }
    elsif ($tag eq 'div' && ! exists $attr->{class} ) {
        $self->handler(text => \&number_text_handler, "self,text");
    }
    else {
        $self->handler(text => undef);
    }
}

sub number_text_handler {
    my ($self, $text) = @_;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    push @{$self->{_numbers}->{$self->{_now}}}, $text;
}

The output is not perfect:
One=Monday, Six=Saturday, Three=Wednesday, Five=Friday, Two=Tuesday, Seven=Sunda&#121;&nbsp;, Four=Thursday,

Could someone give me a hint as to why I miss 'Zero' and don't get 'Sunday' right?


Re^2: Parsing HTML/XML with Regular Expressions (HTML::Parser)
by haukex (Abbot) on Oct 18, 2017 at 21:47 UTC

    Thanks for your contribution! A few comments:

    • The output is unordered because you're using a hash; I'd suggest an array instead.
    • The way your code is checking the id attribute limits the script to only the one example file, which could of course change.
    • As far as I can tell, the reason you're missing Zero is that when you encounter the first <div>, your start_handler just installs a new handler, which doesn't get called for that first <div>. I'd recommend not swapping handlers around, but instead using a single handler per event and keeping state inside the handler, kind of like tangent does here with $in_wanted_div. The difference is that I would keep the state in the parser object, or at least in a more tightly scoped variable, instead of in a "global" variable.
    • You're not getting the right Sunday because you're using the text argument type, instead of dtext for "decoded text".
      Thank you, haukex, for your comments and for your interesting OP.
      Yes, tangent's code boosted my knowledge.
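
A minimal sketch of how haukex's suggestions might fit together: one handler per event, state kept in the parser object, an array to preserve document order, and dtext instead of text so entities are decoded. The inline sample HTML, the handler names (on_start, on_end, on_text) and the _depth/_records keys are assumptions made for illustration; the thread's 1201438.html is not reproduced here, so this has not been checked against the real file.

```perl
use warnings;
use strict;
use HTML::Parser;

# Hypothetical sample input standing in for 1201438.html.
my $html = <<'END_HTML';
<div class="data" id="Zero" />
<div class="data" id="One">Mon<b>day</b></div>
<div class="data" id="Seven">Sunda&#121;&nbsp;</div>
END_HTML

my $parser = HTML::Parser->new(
    api_version        => 3,
    start_h            => [ \&on_start, "self,tagname,attr" ],
    end_h              => [ \&on_end,   "self,tagname" ],
    text_h             => [ \&on_text,  "self,dtext" ],   # dtext = decoded text
    empty_element_tags => 1,    # treat <div ... /> as a start plus an end event
);

# All state lives in the parser object, as the original post already did
# with _numbers and _now. _depth counts how deep we are inside a wanted div.
sub on_start {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq 'div';
    if ($self->{_depth}) {      # nested div inside a wanted div
        $self->{_depth}++;
        return;
    }
    if (($attr->{class} // '') eq 'data' && defined $attr->{id}) {
        $self->{_depth} = 1;
        # An array of [id, text] pairs keeps the document order.
        push @{ $self->{_records} }, [ $attr->{id}, '' ];
    }
}

sub on_end {
    my ($self, $tag) = @_;
    return unless $tag eq 'div' && $self->{_depth};
    $self->{_depth}--;
}

sub on_text {
    my ($self, $text) = @_;
    return unless $self->{_depth};
    # Text may arrive in several chunks, so append rather than assign.
    $self->{_records}[-1][1] .= $text;
}

$parser->parse($html);
$parser->eof;

my @pairs;
for my $rec (@{ $parser->{_records} || [] }) {
    my ($id, $text) = @$rec;
    # Trim whitespace, including the no-break space that &nbsp; decodes to.
    $text =~ s/^[\s\xA0]+|[\s\xA0]+$//g;
    push @pairs, "$id=$text";
}
my $line = join(", ", @pairs);
print "$line\n";
```

Because the wanted ids are no longer hard-coded in a regex, any div with class "data" and an id is collected, which addresses the point about the script being tied to one example file.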

