Parsing HTML/XML with Regular Expressions

A followup: Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks

Your employer/interviewer/professor/teacher has given you a task with the following specification:

Given an XHTML file, find all the <div> tags with the class attribute "data"¹ and extract their id attribute as well as their text content, or an empty string if they have no content. The text content is to be stripped of all non-word characters (\W) and tags, text from nested tags is to be included in the output. There may be other divs, other tags, and other attributes present anywhere, but divs with the class data are guaranteed to have an id attribute and not be nested inside each other. The output of your script is to be a single comma-separated list of the form id=text, id=text, .... You are to write your code first, and then you will be given a test file, guaranteed to be valid and standards-conforming, for which the expected output of your program is "Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday"².

Updates - Clarifications:
¹ The class attribute should be exactly the string data (that is, ignoring the special treatment given to CSS classes). Examples below updated accordingly.
² Your solution should be generic enough to support any arbitrary strings for the id and text content, and be easily modifiable to change the expected class attribute.
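
To make the first clarification concrete, here is a minimal sketch of the difference between the exact comparison the task asks for and the CSS-style token comparison it rules out. The helper names `class_is_exactly` and `class_has_token` are made up for this illustration and are not part of any module.

```perl
use strict;
use warnings;

# Exact comparison, as clarification 1 asks for:
# the whole attribute value must be the wanted string.
sub class_is_exactly {
    my ($attr_value, $wanted) = @_;
    return defined $attr_value && $attr_value eq $wanted;
}

# CSS-style comparison, shown only for contrast:
# any whitespace-separated token may match.
sub class_has_token {
    my ($attr_value, $wanted) = @_;
    return 0 unless defined $attr_value;
    return scalar grep { $_ eq $wanted } split ' ', $attr_value;
}

for my $value ('data', 'data otherclass', 'otherclass') {
    printf "%-19s exact=%d token=%d\n", "'$value'",
        class_is_exactly($value, 'data') ? 1 : 0,
        class_has_token($value, 'data')  ? 1 : 0;
}
```

Under the task's rules, a div with the class "data otherclass" is therefore not to be reported, even though a CSS selector such as `div.data` would match it.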

Ok, you think, I know Perl is a powerful text processing language and regexes are great! And you write your code and it works well for the test cases you came up with. ... But did you think of everything? Here's the test file you end up getting:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[
<!ATTLIST html
    xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation CDATA #IMPLIED  > ]>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/1999/xhtml
    http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
<head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title>Hello, World</title>
    <script type="text/javascript">
//<![CDATA[
console.log(' <div class="data" id="Hello">World</div> ');
//]]>
    </script>
</head>
<body>
<div class="data" id="Zero" />
<div class="data" id="One">Monday</div><div class="data" id="Two">Tuesday</div>
<div id="Three" class='data'>Wednes<div id="day">day</div></div>
<div class="data" id='Four'><b>Thursday</b></div>
<div
class="data" id="Five">
Friday
</div>
<div
class
=
"data"
id
=
"Six"
>
<div
>
Satur
</div
>
day
</div
>
<div title=" class='data' id='Foo'>Bar"
id="Seven" class="data">&#xA0;Sunda&#121;</div>
<div class="data otherclass" id="aaa">bbb</div>
<div class="otherclass" id="ccc">ddd</div>
<p class="data">eee</p>
<p id="fff">ggg</p>
<!--
<div class="data" id="Quz">Baz</div>
-->
<p><![CDATA[
<div class="data" id="Bye">Bye</div>
]]></p>
</body>
</html>

Fun, right? While the above happens to be XHTML, the same problems (and more) of course apply to XML, and HTML also quickly gets much worse - just to name one example, did you know about SGML shorthand markup? The following is perfectly valid HTML 4.01 (and even browsers may have trouble with it): <p<a href="/">foo</> (source)

If you're now thinking to yourself, "I've been working with this XML for a few years now and have never seen its format change", then imagine this scenario. The third party providing the XML hasn't touched their code, but decides to upgrade their OS libraries, including the one that writes the XML. Suddenly, you can no longer rely on the order of attributes, whitespace, etc., and your regex starts failing, and your bosses are on your back because they are losing money because their feeds are interrupted. Now take that thought a step further, and imagine you start working for a company where the person who wrote that regex is long gone, and you get to maintain it...
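
A minimal sketch of that scenario, assuming a hypothetical regex that was written against the only layout ever seen in the feed (class first, then id, double quotes, single spaces). The pattern and the `extract` helper are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical regex, tuned to the one layout the feed has always used.
my $naive = qr{<div class="data" id="(\w+)">([^<]*)</div>};

# Return "id=text" pairs for every match of the naive pattern.
sub extract {
    my ($markup) = @_;
    my @pairs;
    while ( $markup =~ /$naive/g ) {
        push @pairs, "$1=$2";
    }
    return join ', ', @pairs;
}

# The same element, before and after the supplier's library upgrade:
# attribute order, quoting and whitespace differ, the meaning does not.
my $before = q{<div class="data" id="One">Monday</div>};
my $after  = q{<div id='One'  class='data'>Monday</div>};

print "before: '", extract($before), "'\n";
print "after:  '", extract($after),  "'\n";
```

Both inputs describe the same element to an XML parser, but the regex only finds something in the first one and silently returns nothing for the second.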

So I hope it is clear: Please, don't try to parse arbitrary XML/HTML with regexes! (...just for fun) Do yourself and the future maintainers of your code a favor and just use one of the many modules that are available.

I encourage everyone to try and write a parser using your favorite module, be it:

HTML::Parser - Thank you, fishy and tangent!
HTML::TreeBuilder / HTML::TreeBuilder::XPath - Thank you, tangent!
Mojo::DOM - see below
XML::LibXML - Thank you, Your Mother!
XML::Twig - Thank you, Discipulus!
XML::XSH2 - Thank you, choroba!
... your favorite here

Honorable mentions: Grimy for a regex solution and RonW for a regex-based parser :-)

I'll kick things off with Mojo::DOM (compacted somewhat, with potential for a lot more golfing or verboseness):

use warnings;
use strict;

use Mojo::DOM;
my $dom = Mojo::DOM->new(
    do { open my $fh, '<', 'example.xhtml' or die $!;
    local $/; <$fh> } );
my $found = $dom->find('div[class="data"]')->map(sub {
        ( my $text = $_->all_text ) =~ s/\W//g;
        { id=>$_->attr('id'), text=>$text }
    })->to_array;
my $out = join ', ', map { $_->{id}.'='.$_->{text} } @$found;
print $out,"\n";

$out eq "Zero=, One=Monday, Two=Tuesday, Three=Wednesday, "
    ."Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday"
    ? print "Good!\n" : die "BAD!\n";

Updates after posting: Minor updates to wording for clarification. Added more test cases to example file. 2017-10-17: Replaced `&nbsp;` as discussed in the replies. Switched from XHTML 1.0 Transitional to XHTML 1.0 Strict. Added Schema declaration. Added output check to Mojo::DOM example. 2017-10-20: A few minor updates to text.

Update 2017-10-18: Thank you very much to everyone who has replied and posted their solutions so far, keep em coming! :-)


Replies are listed 'Best First'.
Re: Parsing HTML/XML with Regular Expressions (XML::LibXML) by Your Mother (Archbishop) on Oct 16, 2017 at 13:20 UTC
Overly idiomatic but this was for fun, not production :P

use XML::LibXML;
my $doc = XML::LibXML->load_html( location => "example.html", { recover => 1 } );
my @ids2text = map { [ $_->value, $_->getOwnerElement->textContent ] }
    $doc->findnodes('//@id');
$_->[1] =~ s/\W+//g for @ids2text;
print join ", ", map sprintf("%s=%s", @$_), @ids2text;

"While this happens to be XHTML"

~~Sidenote on that. I am sure you know the sample is not XHTML but I thought I'd call it out for the sake of readers.~~

Update: I missed the "transitional" part of the XHTML declaration. It is indeed, shockingly, valid transitional XHTML. Goes to show how on point haukex is on this matter.

Update 2: updated node title per LanX. Pulled strict/warnings to shorten post. Plus link to module: XML::LibXML
Re^2: Parsing HTML/XML with Regular Expressions (XML::LibXML; updated!) by haukex (Archbishop) on Oct 16, 2017 at 15:12 UTC
`<update nr="4">`
For the sake of completeness, here's a working script with the changes mentioned below:

use warnings;
use strict;
use XML::LibXML;
my $doc = XML::LibXML->load_xml( location => 'example.xhtml',
    no_network=>1, recover=>1 );
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
    $xpc->findnodes(q{//html:div[@class='data']});
$_->[1] =~ s/\W+//g for @ids2text;
print join ", ", map sprintf("%s=%s", @$_), @ids2text;

`</update>`

Thanks very much for the reply! Your post inspired some more test cases for my file, and I'm sorry to say I broke your code `:-(` But here's the fix:

my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
    $doc->findnodes(q{//div[@class='data']});

Update: And yes, it does seem that `load_html` doesn't like XHTML - `load_xml` seems to work a bit better, although fetching the DTD from the net is pretty slow at the moment; adding the options `{no_network=>1,recover=>1}` disables the network check. However, with `load_xml` one also has to start using XML::LibXML::XPathContext:

my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
    $xpc->findnodes(q{//html:div[@class='data']});

Update 2: Even with network, XML::LibXML is still complaining about `&nbsp;` ("`Entity 'nbsp' not defined`"), I'm not entirely sure why yet, as it seems to be defined in the DTD...

Update 3: The W3C Validator doesn't complain...
Re^3: Parsing HTML/XML with Regular Expressions (validation of the content) by Discipulus (Canon) on Oct 17, 2017 at 07:38 UTC
Hello again haukex,

the thread is interesting and I did my best last night to provide an XML::Twig solution, but due to my limited understanding of XML in general I report here some things I do not understand about the file you presented as input.

First, I cheated because I got the sample XML file before writing the program, because with XML I always go for a try-and-check path.

Second, in my wide ignorance, I really don't know how XHTML, DTD, DOM and transitional can affect the approach to the XML to parse. My sin.

Third: if XML::Twig (the only module I use for these tasks) complains about the document, I'll use the W3C validator to check the content before crashing my head against it, a task I really don't like.

So, your sample is a valid one. I put it after the `__DATA__` token and I got the following error:

no element found at line 2, column 0, byte 39 at D:/ulisse/perl5.26.64bit/perl/vendor/lib/XML/Parser.pm line 187.
 at dontregexXML03.pl line 20.

After half an hour searching the web I ended up reading about xpath bugs dated 2009, but found no clue at all. Any attempt to brutally cut the XML, removing lines and tags, ended with the very same error, at the same line (??).

So I tested Your Mother's solution with your own modification and I get many errors but also the correct solution:

sample.html:11: HTML parser error : Element script embeds close tag
console.log(' <div class="data" id="Hello">World</div> ');
^
sample.html:49: HTML parser error : htmlParseStartTag: invalid element name
<![CDATA[
^
sample.html:50: HTML parser error : Unexpected end tag : div
<div class="data" id="Bye">Bye</div>
^
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday

So I assumed the XML really had some problems: my other attempts to `fix` it using the detailed reports emitted by XML::LibXML had no more luck than the previous ones.

As a last resort I put the XML sample into a separate file and: TADA' all runs smoothly (not considering the `&nbsp` issue) with XML::Twig as presented above.

Any suggestion? Which is the best module to report formal errors in the XML structure? Are the errors reported above real ones, or are they due to limits of the parsing module?

If the thread continues it can become the Rosetta of Perl XML parsing. Goood one!

L*

There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re^4: Parsing HTML/XML with Regular Expressions (validation of the content) by haukex (Archbishop) on Oct 17, 2017 at 11:24 UTC
Re: Parsing HTML/XML with Regular Expressions (XML::Twig) by Discipulus (Canon) on Oct 16, 2017 at 22:00 UTC
Hello haukex

I normally use XML::Twig on the sad occasions I need to deal with XML. With small XML data I use `__DATA__` and `$twig->parse(<DATA>)` but with your sample I got `no element found at line 2, column 0, byte 39 at..` even if the W3C validator parses the file as correct. Using a real file I had no errors. I dunno why and I rarely inspect XML with my eyes; doctor said is no good ;=)

I have not managed to strip out `nbsp` from Sunday, but now it's too late to deal with entities and the biiig XML::Twig manpage. See you Sundaynbsp at the Pubnbsp ;=)

use strict;
use warnings;
use XML::Twig;

my @days;
my $twig= XML::Twig->new(
    twig_handlers=>{
        'div[@class="data"]'=>sub{
            (my $txt = $_[1]->text)=~s/\W//g;
            push @days, $_[1]->att('id')."=$txt";
        }
    }
);
$twig->parsefile ('example.html');
print join ', ', @days;

# output
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sundaynbsp

PS I bet tybalt89 will come out with some working regex solution! ;=)

L*

There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re^2: Parsing HTML/XML with Regular Expressions (XML::Twig) by haukex (Archbishop) on Oct 17, 2017 at 11:25 UTC
Thanks very much for the contribution! Regarding the `DATA` and `&nbsp;` issues, see my reply here - although I assume you meant `$twig->parse(*DATA)` instead of `$twig->parse(<DATA>)`? With the updated example in the root node, your code works!

And yes, I assumed someone might take up the challenge of actually using a regex - but of course then I'd have to try to break it with more test cases ;-)
Re^3: Parsing HTML/XML with Regular Expressions (XML::Twig) by Discipulus (Canon) on Oct 17, 2017 at 19:45 UTC
You presumed right about the `DATA` filehandle. The xmltwig.org site and the docs specify `parse $string or \*OPEN_FILEHANDLE` among twig's methods. So you are right: I had to pass a handle, not an iterator (?) like `<DATA>`.

I dunno when I took up this bad habit, but if you look at this and this other one and this other too and probably many others of mine, `$twig->parse(<DATA>)` works!!

So `$twig->parse(<DATA>)` does not work with your example, but I can confirm that passing the filehandle as `$twig->parse(\*DATA)` or even `$twig->parse(*DATA)` works as expected.

Can it be that the wrong form works (at least sometimes) because of XML::Twig's ability to parse streams of XML?

L*

There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
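
A likely explanation, sketched in plain Perl without XML::Twig: a readline such as `<DATA>` inside an argument list is evaluated in list context, so every line becomes a separate argument and only the first line ends up as "the document". The `first_argument` sub is a made-up stand-in for a `parse` method that treats its first argument as the document; that XML::Twig behaves exactly this way is an assumption here, not something checked against its source.

```perl
use strict;
use warnings;

# Stand-in for a parse() method that takes the document as its first
# argument and treats the rest as options (assumption for illustration).
sub first_argument {
    my ($document, @rest) = @_;
    return $document;
}

my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n<root>\n<a/>\n</root>\n};

# In-memory filehandle standing in for DATA.
open my $fh, '<', \$xml or die $!;
my $from_readline = first_argument(<$fh>);   # list context: one argument per line
close $fh;

open $fh, '<', \$xml or die $!;
my $from_slurp = first_argument( do { local $/; <$fh> } );   # one single string
close $fh;

printf "readline in list context passed %d bytes\n", length $from_readline;
printf "slurped string passed %d bytes\n", length $from_slurp;
```

The first call hands over only the 39-byte XML declaration line, which fits the "no element found at line 2, column 0, byte 39" message. It would also explain why the habit appears to work whenever the whole document sits on a single line.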
Re^4: Parsing HTML/XML with Regular Expressions (XML::Twig) by haukex (Archbishop) on Oct 18, 2017 at 19:21 UTC
Re^2: Parsing HTML/XML with Regular Expressions (XML::Twig) by holli (Abbot) on Oct 17, 2017 at 09:58 UTC
"some working regex solution"

That's certainly possible. It was possible to produce a regex that parses all of Perl, why not one for HTML?

holli

You can lead your users to water, but alas, you cannot drown them.
Re^3: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 19, 2017 at 00:05 UTC
"It was possible to produce a regex that parses all of Perl, why not one for HTML?"

There is a regex to parse XML (so, therefore, XHTML): XML Shallow Parsing

That regex produces a list of strings that will need further processing. Shallow parsing is mostly useful for XML-to-XML filtering. Technically, this challenge could be considered filtering, just not to XML. Will need to keep track of `<div>` nesting to find the end of the contained text.

# Not tested and assumes proper nesting of <div> elements (and valid XML syntax)
# (Warning: Messy hack. Read at your own risk.)
my $nest = 0;
my $out = '';
my @elements = $xml =~ /$XML_SPE/g; # see http://www.cs.sfu.ca/~cameron/REX.html#AppA
for (@elements) {
    if (/^<div/) {
        $nest++ if ($nest > 0); # only increment if inside an interesting <div>
        next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
        next unless (/id\h*=\h*['"](\w+)['"]/);
        $out .= ", $1=";
        $nest = 1 if ($nest == 0); # if this is the outermost interesting <div>
        next;
    }
    $nest--, next if (/^<\/div/);
    next if (/^[<]/); # skip other mark-up
    $out .= $_ if ($nest > 0);
}
$out =~ s/^, //;
say "$out\n";

Update: Changed title to indicate (regex)
Re^4: Parsing HTML/XML with Regular Expressions (XML::Twig) by haukex (Archbishop) on Oct 19, 2017 at 16:27 UTC
Re^5: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 19, 2017 at 23:50 UTC
Re^5: Parsing HTML/XML with Regular Expressions (regex) by RonW (Parson) on Oct 19, 2017 at 22:13 UTC
Re^3: Parsing HTML/XML with Regular Expressions (XML::Twig) by soonix (Canon) on Oct 17, 2017 at 11:45 UTC
I am not sure whether such a regex would fit even into the 18 Exabyte-limit of most modern file systems … :-)
Re^4: Parsing HTML/XML with Regular Expressions (XML::Twig) by holli (Abbot) on Oct 17, 2017 at 17:27 UTC
Re^5: Parsing HTML/XML with Regular Expressions (XML::Twig) by soonix (Canon) on Oct 18, 2017 at 06:22 UTC
Re: Parsing HTML/XML with Regular Expressions by choroba (Cardinal) on Oct 17, 2017 at 14:49 UTC
Using XML::XSH2, I had to fix the script after downloading the XML: I hadn't had the namespace there, and I tried normalize-space instead of substitution, which didn't work correctly.

open 1201438.xml ;
register-namespace xh http://www.w3.org/1999/xhtml ;
my $first = 1 ;
for //xh:div[@class='data'] {
    if not($first) echo :n ', ' ;
    $first = 0;
    echo :s :n @id '=' xsh:subst(., '\W', '', 'g') ;
}
echo ;

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Parsing HTML/XML with Regular Expressions by Grimy (Pilgrim) on Oct 17, 2017 at 16:32 UTC
Obligatory Zalgo-summoning solution:

#!/usr/bin/perl -p0
s/<!([^<>]|<(?1)>)>//gs;
s/<(?!div\b[^>]\bclass\s=\s(['"])data\1)([^<>]|<(?2)>)>//gs;
s/.?<(?:[^'"]|(['"]).?\1)?\bid\s=\s(['"])(.?)\2.?>([^<]*)/$3=$4, /gs;
s/&#(\w+);/chr $1/ge;
s/[^\w=, ]|, $|(.)\1\1//g;

To make it harder on regexes, I suggest:

throwing unbalanced `[<>'"]` inside CDATA sections / comments / attributes (especially `class='data"'`)
using names that can be confused with the interesting ones: `<divx`, `aclass="data"`, ...
using XML namespaces liberally
using external entities
Re^2: Parsing HTML/XML with Regular Expressions by haukex (Archbishop) on Oct 18, 2017 at 21:11 UTC
Impressive, thank you! As previously threatened, and as per your comments, some notes on trying to break the regex solution ;-)

"using names that can be confused with the interesting ones: `<divx`, `aclass="data"`, ..."

Good point, but without some trickery those would no longer validate properly as XHTML either.

"using XML namespaces liberally"

Indeed, I tested this and it does cause trouble: Unsurprisingly the regex and HTML parsers can't handle it, but a little more surprising is that Mojo::DOM ignores namespaces and therefore fails with the following, and that also XML::Twig has trouble with namespaces, or at least I haven't found the right options yet. Only the XML::LibXML and XML::XSH2 solutions handle this correctly:

<html xmlns:foo="http://www.w3.org/1999/xhtml" xmlns:bar="http://www.perlmonks.com"
...
<foo:div class="data" id="Zero" />
<bar:div class="data" id="Hi">there</bar:div>

(Update: Hmm, even the W3C Validator is having trouble with the namespaces...)

"using external entities"

As noted here, even some XML parsers seem to have trouble loading all the external entities. But even entities declared within the document should make life difficult for regexes:

<!ENTITY atad "data">
...
<div class="&atad;" id="Zero" />

Only the XML::LibXML and XML::Twig solutions handle that correctly, everything else ~~(including XML::XSH2)~~ fails. Looks like XML::LibXML `<update>` and XML::XSH2 `</update>` are the only ones left standing in this torture test so far! :-)

And one more thing: currently entities with hex values like `&#xA0;` aren't supported by the regex (although that's not too difficult to fix).

Updated since the issue with XML::XSH2 was worked out further down in this thread.
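
The namespace case can be sketched with XML::LibXML as follows. The calls are the same ones used in the XML::LibXML replies in this thread; the cut-down document, the prefix `xh` and the sub name `xhtml_data_divs` are assumptions made up for this illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Cut-down sample: two <div> elements that differ only in their namespace.
my $xml = <<'END_XML';
<html xmlns:foo="http://www.w3.org/1999/xhtml"
      xmlns:bar="http://www.perlmonks.com">
<foo:div class="data" id="Zero" />
<bar:div class="data" id="Hi">there</bar:div>
</html>
END_XML

sub xhtml_data_divs {
    my ($source) = @_;
    my $doc = XML::LibXML->load_xml( string => $source );
    my $xpc = XML::LibXML::XPathContext->new($doc);
    # The prefix "xh" only exists in this XPath context; what has to
    # match the document is the namespace URI, not the prefix.
    $xpc->registerNs( xh => 'http://www.w3.org/1999/xhtml' );
    my @pairs;
    for my $div ( $xpc->findnodes(q{//xh:div[@class='data']}) ) {
        ( my $text = $div->textContent ) =~ s/\W//g;
        push @pairs, $div->getAttribute('id') . "=$text";
    }
    return join ', ', @pairs;
}

print xhtml_data_divs($xml), "\n";
```

Only `foo:div` is bound to the XHTML namespace, so only its id should be reported; `bar:div` has the same local name and class but belongs to a different namespace.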
Re^3: Parsing HTML/XML with Regular Expressions by choroba (Cardinal) on Oct 18, 2017 at 21:52 UTC
XML::XSH2 is just a wrapper around XML::LibXML. I'd be surprised if it didn't work the same. And indeed, the following doesn't print the id of the div that uses the `&atad;` class:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use XML::LibXML;

my $dom = 'XML::LibXML'->load_xml(location => '1.xml', load_ext_dtd => 0);
my $xpc = 'XML::LibXML::XPathContext'->new;
$xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml');
for my $div ($xpc->findnodes('//xh:div[@class="data"]', $dom)) {
    print $div->{id}, "\n"
}

Interestingly, at the same time the following shows the classes of all the divs as `data`:

for my $div ($xpc->findnodes('//xh:div', $dom)) {
    print join ' ', @{ $div }{qw{ id class }}, "\n"
}

Bugreport anyone?

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re^4: Parsing HTML/XML with Regular Expressions by haukex (Archbishop) on Oct 18, 2017 at 22:05 UTC
Re^5: Parsing HTML/XML with Regular Expressions by choroba (Cardinal) on Oct 18, 2017 at 22:13 UTC
Re: Parsing HTML/XML with Regular Expressions (HTML::Parser) by tangent (Parson) on Oct 18, 2017 at 02:18 UTC
My favourite module for parsing HTML is HTML::TreeBuilder::XPath, but it misses out on the first div (id=Zero). It uses HTML::Parser internally but I could not find a way to pass the necessary attribute `empty_element_tags=>1` from HTML::TreeBuilder to HTML::Parser. So here is a fairly verbose version using just HTML::Parser:

use HTML::Parser;

my $file = 'example.html';
my ($in_div,$in_wanted_div) = (0,0);
my @result;

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [\&start, "tagname, attr"],
    text_h  => [\&text, "dtext"],
    end_h   => [\&end, "tagname"],
    empty_element_tags => 1,
);
$parser->parse_file($file);
print join(', ',@result);

sub start {
    my ($tag, $attr) = @_;
    return unless ($tag eq 'div');
    if (exists $attr->{'class'} and $attr->{'class'} eq 'data') {
        $in_div = 1;
        $in_wanted_div = 1;
        push(@result, "$attr->{'id'}=");
    }
    else {
        $in_div++;
    }
}
sub text {
    my ($text) = @_;
    return unless $in_wanted_div;
    $text =~ s/\W//g;
    $result[-1] .= $text;
}
sub end {
    my ($tag) = @_;
    return unless ($tag eq 'div');
    $in_div--;
    $in_wanted_div = 0 if not $in_div;
}

Output:

Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday
Re^2: Parsing HTML/XML with Regular Expressions (HTML::Parser) by fishy (Friar) on Oct 18, 2017 at 07:06 UTC
Wow! Many thanks.
Re: Parsing HTML/XML with Regular Expressions (HTML::TreeBuilder::XPath) by tangent (Parson) on Oct 18, 2017 at 02:34 UTC
In my previous comment I mentioned that I could not find a way to pass the attribute `empty_element_tags` from HTML::TreeBuilder to HTML::Parser. Looking at the source code for HTML::TreeBuilder I found this:

our @ISA = qw(HTML::Element HTML::Parser);
# This looks schizoid, I know...

So I've learnt something there! I can call `empty_element_tags(1)` and now it works.

use HTML::TreeBuilder::XPath;

my $file = 'example.html';
my @result;

my $tree = HTML::TreeBuilder::XPath->new;
$tree->empty_element_tags(1); # calls this on HTML::Parser
$tree->parse_file($file);
$tree->eof;

my @divs = $tree->findnodes('//div[@class="data"]');
for my $div (@divs) {
    my $text = $div->as_text || '';
    $text =~ s/\W//g;
    push(@result, $div->attr('id') . "=$text");
}
print join(', ',@result);

Output:

Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday
Re^2: Parsing HTML/XML with Regular Expressions (HTML::TreeBuilder::XPath) by fishy (Friar) on Oct 18, 2017 at 07:08 UTC
Great! Thanks.
Re: Parsing HTML/XML with Regular Expressions (HTML::Parser) by fishy (Friar) on Oct 17, 2017 at 22:22 UTC
Hi Monks,

someone had to try with HTML::Parser... Here I am:

use warnings;
use strict;

use HTML::Parser;

my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [\&start_handler, "self, tagname, attr"],
    strict_names => 1,
    empty_element_tags => 1,
);

my $file = "1201438.html";
open(my $fh, "<", $file) or die "Can't open < $file: $!";
my $contents = do { local $/; <$fh> };
close $fh;

$parser->parse($contents);

for (keys %{$parser->{_numbers}}) {
    print "$_=", join("", @{$parser->{_numbers}->{$_}}), ", ";
}
print "\n";

sub start_handler {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq 'div';
    $self->handler(start => \&number_start_handler, "self,tagname,attr");
}

# <div class="data" id="Zero" />
sub number_start_handler {
    my ($self, $tag, $attr) = @_;
    if ( exists $attr->{class} && $attr->{class} eq 'data' &&
         exists $attr->{id} && $attr->{id} =~ /(Zero|One|Two|Three|Four|Five|Six|Seven)/ ) {
        $self->{_now} = $1;
        $self->{_numbers}->{$1} = [];
        $self->handler(text => \&number_text_handler, "self,text");
    }
    elsif ($tag eq 'b') {
        $self->handler(text => \&number_text_handler, "self,text");
    }
    elsif ($tag eq 'div' && ! exists $attr->{class} ) {
        $self->handler(text => \&number_text_handler, "self,text");
    }
    else {
        $self->handler(text => undef);
    }
}

sub number_text_handler {
    my ($self, $text) = @_;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    push @{$self->{_numbers}->{$self->{_now}}}, $text;
}

Not a perfect output:

One=Monday, Six=Saturday, Three=Wednesday, Five=Friday, Two=Tuesday, Seven=Sunday , Four=Thursday,

Could someone give me a hint why I miss 'Zero' and don't get 'Sunday' right? Thanks!
Re^2: Parsing HTML/XML with Regular Expressions (HTML::Parser) by haukex (Archbishop) on Oct 18, 2017 at 21:47 UTC
Thanks for your contribution! A few comments:

The output is unordered since you're using a hash, I'd suggest an array instead.

The way your code is checking the `id` attribute limits the script to only the one example file, which could of course change.

As far as I can tell, the reason you're missing `Zero` is because when you encounter the first `<div>`, your `start_handler` is just installing a new handler, which at that point doesn't get called. I'd recommend not changing around the handlers, but instead just using a single handler per event, and keeping state inside the handler, kind of like tangent does here with `$in_wanted_div`, except that I would recommend keeping the state in the parser object or at least a more tightly scoped variable instead of in a "global" variable.

You're not getting the right `Sunday` because you're using the `text` argument type, instead of `dtext` for "decoded text".
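
Those recommendations could be combined roughly as follows. This is only a sketch, not a tested replacement for either HTML::Parser reply: it uses one fixed handler per event, collects results in an array, requests `dtext`, and keeps its state in the parser object under the made-up keys `_depth` and `_found`. The sub names and the small sample are likewise invented for the illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

sub extract_data_divs {
    my ($markup) = @_;
    my $parser = HTML::Parser->new(
        api_version        => 3,
        empty_element_tags => 1,   # <div ... /> then also fires an end event
        start_h => [ \&on_start, 'self, tagname, attr' ],
        text_h  => [ \&on_text,  'self, dtext' ],   # dtext: entities decoded
        end_h   => [ \&on_end,   'self, tagname' ],
    );
    # State lives in the parser object (hypothetical private keys).
    $parser->{_depth} = 0;    # <div> nesting depth inside a wanted div
    $parser->{_found} = [];   # [ id, text ] pairs in document order
    $parser->parse($markup);
    $parser->eof;
    return join ', ', map { join '=', @$_ } @{ $parser->{_found} };
}

sub on_start {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq 'div';
    if ( $self->{_depth} ) {
        $self->{_depth}++;    # nested div inside a wanted one
    }
    elsif ( ( $attr->{class} // '' ) eq 'data' ) {
        $self->{_depth} = 1;
        push @{ $self->{_found} }, [ $attr->{id}, '' ];
    }
}

sub on_text {
    my ($self, $text) = @_;
    return unless $self->{_depth};
    $text =~ s/\W//g;
    $self->{_found}[-1][1] .= $text;
}

sub on_end {
    my ($self, $tag) = @_;
    $self->{_depth}-- if $tag eq 'div' && $self->{_depth};
}

my $sample = join "\n",
    q{<div class="data" id="Zero" />},
    q{<div id="Three" class='data'>Wednes<div id="day">day</div></div>},
    q{<div class="data" id="Seven">&#xA0;Sunda&#121;</div>},
    q{<div class="other" id="x">ignored</div>};
print extract_data_divs($sample), "\n";
```

Because the id is taken from the attribute rather than matched against a fixed list, the same code would keep working if the test file used different ids.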
Re^3: Parsing HTML/XML with Regular Expressions (HTML::Parser) by fishy (Friar) on Oct 19, 2017 at 08:01 UTC
Thank you, haukex, for your comments and for your interesting OP. Yes, tangent's code boosted my knowledge. Cheers
Re: Parsing HTML/XML with Regular Expressions (Mojo::DOM) by LanX (Saint) on Oct 16, 2017 at 14:46 UTC
Thumbs up! :)

It would be nice if each contributor tagged their title with the name of the module used. I'll start - as an unorthodox example - with the tag for the root post.

Cheers Rolf

(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!
Re^2: Parsing HTML/XML with Regular Expressions (Mojo::DOM) by haukex (Archbishop) on Oct 16, 2017 at 15:32 UTC
Thanks! I'd like to keep the title of the root node clean so that links to it don't get cluttered up ("Parsing HTML/XML with Regular Expressions"), but I like the idea for any replies!

