
Parsing HTML/XML with Regular Expressions

by haukex (Archbishop)
on Oct 16, 2017 at 11:48 UTC ( #1201438=perlmeditation )

A followup: Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks

Your employer/interviewer/professor/teacher has given you a task with the following specification:

Given an XHTML file, find all the <div> tags with the class attribute "data" [1] and extract their id attribute as well as their text content, or an empty string if they have no content. The text content is to be stripped of all non-word characters (\W) and tags; text from nested tags is to be included in the output. There may be other divs, other tags, and other attributes present anywhere, but divs with the class data are guaranteed to have an id attribute and not be nested inside each other. The output of your script is to be a single comma-separated list of the form id=text, id=text, .... You are to write your code first, and then you will be given a test file, guaranteed to be valid and standards-conforming, for which the expected output of your program is "Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday" [2].

Updates - Clarifications:
[1] The class attribute should be exactly the string data (that is, ignoring the special treatment given to CSS classes). Examples below updated accordingly.
[2] Your solution should be generic enough to support any arbitrary strings for the id and text content, and be easily modifiable to change the expected class attribute.
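
The output format described in the specification can be sketched independently of any parser. The following is only a minimal illustration, assuming the id and text of each div have already been extracted by some parser; class_is_data and format_pairs are hypothetical helper names, not part of the task or of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: exact string comparison on the class attribute,
# as required by the first clarification (no CSS-style token matching).
sub class_is_data {
    my ($class) = @_;
    return defined $class && $class eq 'data';
}

# Hypothetical helper: takes a list of [id, text] pairs, strips all
# non-word characters from each text, and joins them as "id=text, ...".
sub format_pairs {
    my @pairs = @_;
    return join ', ', map {
        my ($id, $text) = @$_;
        $text = '' unless defined $text;   # empty div: empty string
        $text =~ s/\W//g;
        "$id=$text";
    } @pairs;
}

print format_pairs( [ Zero => undef ], [ One => " Mon-day\n" ] ), "\n";
```

Keeping the formatting separate from the extraction makes it easy to swap the parser underneath, which is what the replies to this node end up doing.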

Ok, you think, I know Perl is a powerful text processing language and regexes are great! And you write your code and it works well for the test cases you came up with. ... But did you think of everything? Here's the test file you end up getting:

I encourage everyone to try and write a parser using your favorite module, be it:

Honorable mentions: Grimy for a regex solution and RonW for a regex-based parser :-)

I'll kick things off with Mojo::DOM (compacted somewhat, with potential for a lot more golfing or verboseness):

use warnings;
use strict;
use Mojo::DOM;

my $dom = Mojo::DOM->new( do {
    open my $fh, '<', 'example.xhtml' or die $!;
    local $/;
    <$fh>
} );
my $found = $dom->find('div[class="data"]')->map(sub {
    ( my $text = $_->all_text ) =~ s/\W//g;
    { id=>$_->attr('id'), text=>$text }
})->to_array;
my $out = join ', ', map { $_->{id}.'='.$_->{text} } @$found;
print $out,"\n";
$out eq "Zero=, One=Monday, Two=Tuesday, Three=Wednesday, "
    ."Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday"
    ? print "Good!\n" : die "BAD!\n";

Updates after posting: Minor updates to wording for clarification. Added more test cases to example file. 2017-10-17: Replaced &nbsp; as discussed in the replies. Switched from XHTML 1.0 Transitional to XHTML 1.0 Strict. Added Schema declaration. Added output check to Mojo::DOM example. 2017-10-20: A few minor updates to text.

Update 2017-10-18: Thank you very much to everyone who has replied and posted their solutions so far, keep em coming! :-)

Replies are listed 'Best First'.
Re: Parsing HTML/XML with Regular Expressions (XML::LibXML)
by Your Mother (Archbishop) on Oct 16, 2017 at 13:20 UTC

    Overly idiomatic but this was for fun, not production :P–

    use XML::LibXML;

    my $doc = XML::LibXML->load_html( location => "example.html",
                                      { recover => 1 } );
    my @ids2text = map { [ $_->value, $_->getOwnerElement->textContent ] }
        $doc->findnodes('//@id');
    $_->[1] =~ s/\W+//g for @ids2text;
    print join ", ", map sprintf("%s=%s", @$_), @ids2text;

    While this happens to be XHTML

    Sidenote on that. I am sure you know the sample is not XHTML but I thought I'd call it out for the sake of readers.

    Update: I missed the "transitional" part of the XHTML declaration. It is indeed, shockingly, valid transitional XHTML. Goes to show how on point haukex is on this matter.

    Update 2: updated node title per LanX. Pulled strict/warnings to shorten post. Plus link to module: XML::LibXML

      <update nr="4"> For the sake of completeness, here's a working script with the changes mentioned below:

      use warnings;
      use strict;
      use XML::LibXML;

      my $doc = XML::LibXML->load_xml( location => 'example.xhtml',
                                       no_network=>1, recover=>1 );
      my $xpc = XML::LibXML::XPathContext->new($doc);
      $xpc->registerNs('html', '');
      my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
          $xpc->findnodes(q{//html:div[@class='data']});
      $_->[1] =~ s/\W+//g for @ids2text;
      print join ", ", map sprintf("%s=%s", @$_), @ids2text;


      Thanks very much for the reply! Your post inspired some more test cases for my file, and I'm sorry to say I broke your code :-( But here's the fix:

      my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
          $doc->findnodes(q{//div[@class='data']});

      Update: And yes, it does seem that load_html doesn't like XHTML - load_xml seems to work a bit better, although fetching the DTD from the net is pretty slow at the moment; adding the options {no_network=>1,recover=>1} disables the network check. However, with load_xml one also has to start using XML::LibXML::XPathContext:

      my $xpc = XML::LibXML::XPathContext->new($doc);
      $xpc->registerNs('html', '');
      my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
          $xpc->findnodes(q{//html:div[@class='data']});

      Update 2: Even with network, XML::LibXML is still complaining about &nbsp; ("Entity 'nbsp' not defined"), I'm not entirely sure why yet, as it seems to be defined in the DTD... Update 3: The W3C Validator doesn't complain...

        Hello again haukex,

        the thread is interesting and I did my best last night to provide an XML::Twig solution, but due to my limited understanding of XML in general I report here some things I do not understand about the file you presented as input.

        First, I cheated: I got the sample XML file before writing the program, because with XML I always take a try-and-check approach.

        Second, in my wide ignorance, I really don't know how XHTML, DTD, DOM and transitional can affect the approach to parsing the XML. My sin.

        Third: if XML::Twig (the only module I use for these tasks) complains about the document, I use the W3C validator to check the content before banging my head against it, a task I really don't like.

        So, your sample is a valid one. I put it after the __DATA__ token and I got the following error:

        no element found at line 2, column 0, byte 39 at D:/ulisse/perl5.26.64bit/perl/vendor/lib/XML/ line 187. at line 20.

        After half an hour of searching the web I ended up reading about XPath bugs dated 2009, but found no clue at all.

        Any attempt to brutally cut the XML, removing lines and tags, ended with the very same error at the same line (??).

        So I tested Your Mother's solution with your modification, and I got many errors but also the correct solution:

        sample.html:11: HTML parser error : Element script embeds close tag
        console.log(' <div class="data" id="Hello">World</div> ');
        ^
        sample.html:49: HTML parser error : htmlParseStartTag: invalid element name
        <![CDATA[
        ^
        sample.html:50: HTML parser error : Unexpected end tag : div
        <div class="data" id="Bye">Bye</div>
        ^
        Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday

        So I assumed the XML really did have some problems; my other attempts to fix it using the detailed reports emitted by XML::LibXML had no more luck than the previous ones.

        As a last resort I put the XML sample into a separate file and, ta-da, everything ran smoothly (not considering the &nbsp; issue) with XML::Twig as presented in my XML::Twig reply.

        Any suggestions? Which is the best module to report formal errors in the XML structure? Are the errors reported above real ones, or are they due to limits of the parsing module?

        If the thread continues, it can become the Rosetta of Perl XML parsing. Good one!


        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Parsing HTML/XML with Regular Expressions (XML::Twig)
by Discipulus (Abbot) on Oct 16, 2017 at 22:00 UTC
    Hello haukex

    I normally use XML::Twig on the sad occasions when I need to deal with XML. With small XML data I use __DATA__ and $twig->parse(<DATA>), but with your sample I got "no element found at line 2, column 0, byte 39 at.." even though the W3C validator accepts the file as correct. Using a real file I had no errors. I dunno why, and I rarely inspect XML with my eyes; doctor said it is no good ;=)

    I have not managed to strip out nbsp from Sunday, but now it's too late to deal with entities and the biiig XML::Twig manpage. See you Sundaynbsp at the Pubnbsp ;=)

    use strict;
    use warnings;
    use XML::Twig;

    my @days;
    my $twig= XML::Twig->new(
        twig_handlers=>{
            'div[@class="data"]'=>sub{
                (my $txt = $_[1]->text)=~s/\W//g;
                push @days, $_[1]->att('id')."=$txt";
            }
        }
    );
    $twig->parsefile ('example.html');
    print join ', ', @days;

    # output
    # Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sundaynbsp

    PS i bet tybalt89 will come out with some working regex solution! ;=)



      Thanks very much for the contribution! Regarding the DATA and &nbsp; issues, see my reply here - although I assume you meant $twig->parse(*DATA) instead of $twig->parse(<DATA>)? With the updated example in the root node, your code works!

      And yes, I assumed someone might take up the challenge of actually using a regex - but of course then I'd have to try to break it with more test cases ;-)

        You presumed right about the DATA filehandle.

        The docs list parse $string or \*OPEN_FILEHANDLE among twig's methods.

        So you are right: I had to pass a handle, not an iterator (?) like <DATA>.

        I dunno when I picked up this bad habit, but if you look at this and this other one and this other too, and probably many others of mine, $twig->parse(<DATA>) works!!

        So $twig->parse(<DATA>) does not work with your example, but I can confirm that passing the filehandle as $twig->parse(\*DATA) or even $twig->parse(*DATA) works as expected.

        Could it be that the wrong form works (at least sometimes) because of XML::Twig's ability to parse streams of XML?
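
One plausible explanation, sketched here without XML::Twig: in Perl, a read such as <DATA> used as a function argument is evaluated in list context, so it returns every line as a separate argument, and a method that expects the whole document as its first argument sees only the first line. The sub first_arg_only below is hypothetical and merely stands in for such a method; what XML::Twig does with the remaining arguments is not modelled here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a method that takes the whole document as
# its first argument; it reports that argument and how many extras follow.
sub first_arg_only {
    my ($doc, @rest) = @_;
    return ($doc, scalar @rest);
}

my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n<root>\n<a/>\n</root>\n};

# An in-memory filehandle plays the role of DATA here.
open my $fh, '<', \$xml or die $!;
my ($doc, $extra) = first_arg_only(<$fh>);   # list context: one argument per line
close $fh;

printf "document seen: %d bytes, extra arguments: %d\n", length($doc), $extra;
```

In this sketch the "document" is only the 39-byte XML declaration line, which would be consistent with the "line 2, column 0, byte 39" in the error message if the sample starts with such a declaration. It would also fit the observation that the wrong form sometimes works, for instance when the whole document sits on a single line, though that is a guess about the older examples.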


      some working regex solution
      That's certainly possible. It was possible to produce a regex that parses all of Perl, why not one for HTML?


      You can lead your users to water, but alas, you cannot drown them.
        It was possible to produce a regex that parses all of Perl, why not one for HTML?

        There is a regex to parse XML (so, therefore, XHTML): XML Shallow Parsing

        That regex produces a list of strings that will need further processing. Shallow parsing is mostly useful for XML-to-XML filtering. Technically, this challenge could be considered filtering, just not to XML. Will need to keep track of <div> nesting to find the end of the contained text.

        # Not tested and assumes proper nesting of <div> elements (and valid XML syntax)
        # (Warning: Messy hack. Read at your own risk.)
        my $nest = 0;
        my $out = '';
        my @elements = $xml =~ /$XML_SPE/g; # see +n/REX.html#AppA
        for (@elements) {
            if (/^<div/) {
                $nest++ if ($nest > 0); # only increment if inside an interesting <div>
                next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
                next unless (/id\h*=\h*['"](\w+)['"]/);
                $out .= ", $1=";
                $nest = 1 if ($nest == 0); # if this is the outer most interesting <div>
                next;
            }
            $nest--, next if (/^<\/div/);
            next if (/^[<]/); # skip other mark-up
            $out .= $_ if ($nest > 0);
        }
        $out =~ s/^, //;
        say "$out\n";

        Update: Changed title to indicate (regex)

        I am not sure whether such a regex would fit even into the 18 Exabyte-limit of most modern file systems …
Re: Parsing HTML/XML with Regular Expressions
by choroba (Archbishop) on Oct 17, 2017 at 14:49 UTC
    Using XML::XSH2, I had to fix the script after downloading the XML: I hadn't had the namespace there, and I tried normalize-space instead of substitution, which didn't work correctly.

    open 1201438.xml ;
    register-namespace xh ;
    my $first = 1 ;
    for //xh:div[@class='data'] {
        if not($first) echo :n ', ' ;
        $first = 0;
        echo :s :n @id '=' xsh:subst(., '\W', '', 'g') ;
    }
    echo ;

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Parsing HTML/XML with Regular Expressions
by Grimy (Pilgrim) on Oct 17, 2017 at 16:32 UTC
    Obligatory Zalgo-summoning solution:
    #!/usr/bin/perl -p0
    s/<!([^<>]|<(?1)*>)*>//gs;
    s/<(?!div\b[^>]*\bclass\s*=\s*(['"])data\1)([^<>]|<(?2)*>)*>//gs;
    s/.*?<(?:[^'"]|(['"]).*?\1)*?\bid\s*=\s*(['"])(.*?)\2.*?>([^<]*)/$3=$4, /gs;
    s/&#(\w+);/chr $1/ge;
    s/[^\w=, ]|, $|(.)\1\1//g;

    To make it harder on regexes, I suggest:
    • throwing unbalanced [<>'"] inside CDATA sections / comments / attributes (especially class='data"')
    • using names that can be confused with the interesting ones: <divx, aclass="data",
    • using XML namespaces liberally
    • using external entities

      Impressive, thank you! As previously threatened, and as per your comments, some notes on trying to break the regex solution ;-)

      using names that can be confused with the interesting ones: <divx, aclass="data", ...

      Good point, but without some trickery those would no longer validate properly as XHTML either.

      using XML namespaces liberally

      Indeed, I tested this and it does cause trouble: Unsurprisingly the regex and HTML parsers can't handle it, but a little more surprising is that Mojo::DOM ignores namespaces and therefore fails with the following, and that also XML::Twig has trouble with namespaces, or at least I haven't found the right options yet. Only the XML::LibXML and XML::XSH2 solutions handle this correctly:

      <html xmlns:foo="" xmlns:bar="" ...
      <foo:div class="data" id="Zero" />
      <bar:div class="data" id="Hi">there</bar:div>

      (Update: Hmm, even the W3C Validator is having trouble with the namespaces...)

      using external entities

      As noted here, even some XML parsers seem to have trouble loading all the external entities. But even entities declared within the document should make life difficult for regexes:

      <!ENTITY atad "data">
      ...
      <div class="&atad;" id="Zero" />

      Only the XML::LibXML and XML::Twig solutions handle that correctly, everything else (including XML::XSH2) fails.

      Looks like XML::LibXML <update> and XML::XSH2 </update> are the only ones left standing in this torture test so far! :-)

      And one more thing: currently entities with hex values like &#xA0; aren't supported by the regex (although that's not too difficult to fix).
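
One way such a fix might look, as a sketch that is independent of Grimy's code: a single substitution that accepts both decimal and hexadecimal numeric character references. The name decode_numeric_refs is hypothetical, and named entities such as &nbsp; are deliberately left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: decode &#NNN; (decimal) and &#xHHH; (hex)
# numeric character references; named entities are not touched.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(?:x([0-9A-Fa-f]+)|([0-9]+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

my $decoded = decode_numeric_refs('Sunda&#121;&#xA0;');
(my $clean = $decoded) =~ s/\W//g;   # the no-break space is not a word character
print "$clean\n";
```

Decoding before stripping \W matters here: stripping first would remove the & and ; and leave the digits behind as word characters.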

      Updated since the issue with XML::XSH2 was worked out further down in this thread.

        XML::XSH2 is just a wrapper around XML::LibXML. I'd be surprised if it didn't work the same. And indeed, the following doesn't print the id of the div that uses the &atad; class:
        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };

        use XML::LibXML;

        my $dom = 'XML::LibXML'->load_xml(location => '1.xml', load_ext_dtd => 0);
        my $xpc = 'XML::LibXML::XPathContext'->new;
        $xpc->registerNs(xh => '');
        for my $div ($xpc->findnodes('//xh:div[@class="data"]', $dom)) {
            print $div->{id}, "\n"
        }

        Interestingly, at the same time the following shows the classes of all the divs as data:

        for my $div ($xpc->findnodes('//xh:div', $dom)) {
            print join ' ', @{ $div }{qw{ id class }}, "\n"
        }

        Bugreport anyone?

Re: Parsing HTML/XML with Regular Expressions (HTML::Parser)
by tangent (Vicar) on Oct 18, 2017 at 02:18 UTC
    My favourite module for parsing HTML is HTML::TreeBuilder::XPath, but it misses out on the first div (id=Zero). It uses HTML::Parser internally but I could not find a way to pass the necessary attribute empty_element_tags=>1 from HTML::TreeBuilder to HTML::Parser.

    So here is a fairly verbose version using just HTML::Parser:

    use HTML::Parser;

    my $file = 'example.html';
    my ($in_div,$in_wanted_div) = (0,0);
    my @result;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [\&start, "tagname, attr"],
        text_h => [\&text, "dtext"],
        end_h => [\&end, "tagname"],
        empty_element_tags => 1,
    );
    $parser->parse_file($file);

    print join(', ',@result);

    sub start {
        my ($tag, $attr) = @_;
        return unless ($tag eq 'div');
        if (exists $attr->{'class'} and $attr->{'class'} eq 'data') {
            $in_div = 1;
            $in_wanted_div = 1;
            push(@result, "$attr->{'id'}=");
        }
        else {
            $in_div++;
        }
    }
    sub text {
        my ($text) = @_;
        return unless $in_wanted_div;
        $text =~ s/\W//g;
        $result[-1] .= $text;
    }
    sub end {
        my ($tag) = @_;
        return unless ($tag eq 'div');
        $in_div--;
        $in_wanted_div = 0 if not $in_div;
    }

    Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday

      Many thanks.
Re: Parsing HTML/XML with Regular Expressions (HTML::TreeBuilder::XPath)
by tangent (Vicar) on Oct 18, 2017 at 02:34 UTC
    In my previous comment I mentioned that I could not find a way to pass the attribute empty_element_tags from HTML::TreeBuilder to HTML::Parser. Looking at the source code for HTML::TreeBuilder I found this:
    our @ISA = qw(HTML::Element HTML::Parser); # This looks schizoid, I know...
    So I've learnt something there! I can call empty_element_tags(1) and now it works.
    use HTML::TreeBuilder::XPath;

    my $file = 'example.html';
    my @result;

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->empty_element_tags(1); # calls this on HTML::Parser
    $tree->parse_file($file);
    $tree->eof;

    my @divs = $tree->findnodes('//div[@class="data"]');
    for my $div (@divs) {
        my $text = $div->as_text || '';
        $text =~ s/\W//g;
        push(@result, $div->attr('id') . "=$text");
    }
    print join(', ',@result);

    Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday
Re: Parsing HTML/XML with Regular Expressions (HTML::Parser)
by fishy (Friar) on Oct 17, 2017 at 22:22 UTC
    Hi Monks,
    someone had to try with HTML::Parser... Here I am:
    use warnings;
    use strict;
    use HTML::Parser;

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [\&start_handler, "self, tagname, attr"],
        strict_names => 1,
        empty_element_tags => 1,
    );

    my $file = "1201438.html";
    open(my $fh, "<", $file) or die "Can't open < $file: $!";
    my $contents = do { local $/; <$fh> };
    close $fh;

    $parser->parse($contents);

    for (keys %{$parser->{_numbers}}) {
        print "$_=", join("", @{$parser->{_numbers}->{$_}}), ", ";
    }
    print "\n";

    sub start_handler {
        my ($self, $tag, $attr) = @_;
        return unless $tag eq 'div';
        $self->handler(start => \&number_start_handler, "self,tagname,attr");
    }

    # <div class="data" id="Zero" />
    sub number_start_handler {
        my ($self, $tag, $attr) = @_;
        if ( exists $attr->{class} && $attr->{class} eq 'data' &&
             exists $attr->{id} &&
             $attr->{id} =~ /(Zero|One|Two|Three|Four|Five|Six|Seven)/ ) {
            $self->{_now} = $1;
            $self->{_numbers}->{$1} = [];
            $self->handler(text => \&number_text_handler, "self,text");
        }
        elsif ($tag eq 'b') {
            $self->handler(text => \&number_text_handler, "self,text");
        }
        elsif ($tag eq 'div' && ! exists $attr->{class} ) {
            $self->handler(text => \&number_text_handler, "self,text");
        }
        else {
            $self->handler(text => undef);
        }
    }

    sub number_text_handler {
        my ($self, $text) = @_;
        $text =~ s/^\s+//;
        $text =~ s/\s+$//;
        push @{$self->{_numbers}->{$self->{_now}}}, $text;
    }

    Not perfect output:
    One=Monday, Six=Saturday, Three=Wednesday, Five=Friday, Two=Tuesday, Seven=Sunda&#121;&nbsp;, Four=Thursday,

    Could someone give me a hint as to why I miss 'Zero' and don't get 'Sunday' right?


      Thanks for your contribution! A few comments:

      • The output is unordered since you're using a hash, I'd suggest an array instead.
      • The way your code is checking the id attribute limits the script to only the one example file, which could of course change.
      • As far as I can tell, the reason you're missing Zero is because when you encounter the first <div>, your start_handler is just installing a new handler, which at that point doesn't get called. I'd recommend not changing around the handlers, but instead just using a single handler per event, and keeping state inside the handler, kind of like tangent does here with $in_wanted_div, except that I would recommend keeping the state in the parser object or at least a more tightly scoped variable instead of in a "global" variable.
      • You're not getting the right Sunday because you're using the text argument type, instead of dtext for "decoded text".
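
The first and third suggestions can be sketched without HTML::Parser by feeding a single handler a list of simulated events. Everything in this sketch is illustrative: the event tuples, the name make_collector, and the argument layout are assumptions, not HTML::Parser's API. The state lives in a closure rather than in file-level variables, and the results go into an array so that document order is preserved.

```perl
use strict;
use warnings;

# Hypothetical collector: one handler for all events, with its state
# (nesting depth and results) kept in lexical variables of the closure.
sub make_collector {
    my ($wanted_class) = @_;
    my $depth = 0;   # 0 = outside a wanted div, >0 = div nesting level inside it
    my @results;     # list of [id, text] pairs, in document order

    my $handler = sub {
        my ($event, $tag, $arg) = @_;
        if ($event eq 'start' && $tag eq 'div') {
            if ($depth) {
                $depth++;                       # nested div inside a wanted div
            }
            elsif (($arg->{class} // '') eq $wanted_class) {
                $depth = 1;
                push @results, [ $arg->{id}, '' ];
            }
        }
        elsif ($event eq 'end' && $tag eq 'div') {
            $depth-- if $depth;
        }
        elsif ($event eq 'text' && $depth) {
            (my $text = $arg) =~ s/\W//g;
            $results[-1][1] .= $text;
        }
    };
    return ($handler, \@results);
}

# Simulated events: [ event, tag, attributes-or-text ]
my ($handler, $results) = make_collector('data');
$handler->(@$_) for (
    [ start => 'div', { class => 'data', id => 'Zero' } ], [ end => 'div' ],
    [ start => 'div', { class => 'other' } ], [ text => undef, 'skip' ], [ end => 'div' ],
    [ start => 'div', { class => 'data', id => 'One' } ],
    [ text => undef, 'Mon' ], [ start => 'b' ], [ text => undef, 'day ' ], [ end => 'b' ],
    [ end => 'div' ],
);
print join(', ', map { "$_->[0]=$_->[1]" } @$results), "\n";
```

With a real parser the same closure could be registered for the start, end, and text events, using decoded text for the text events as the last bullet suggests.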
        Thank you, haukex for your comments and for your interesting OP.
        Yes, tangent's code boosted my knowledge.

Re: Parsing HTML/XML with Regular Expressions (Mojo::DOM)
by LanX (Sage) on Oct 16, 2017 at 14:46 UTC
    Thumbs up! :)

    Would be nice if each contributor tagged his title with the name of the used module.

    I'll start - as an unorthodox example - with the tag for the root post.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!
