A followup: Why a regex *really* isn't good enough for HTML and XML, even for "simple" tasks
Your employer/interviewer/professor/teacher has given you a task with the following specification:
Given an XHTML file, find all the <div> tags with the class attribute "data" [1] and extract their id attribute as well as their text content, or an empty string if they have no content. The text content is to be stripped of all non-word characters (\W) and tags, text from nested tags is to be included in the output. There may be other divs, other tags, and other attributes present anywhere, but divs with the class data are guaranteed to have an id attribute and not be nested inside each other. The output of your script is to be a single comma-separated list of the form id=text, id=text, .... You are to write your code first, and then you will be given a test file, guaranteed to be valid and standards-conforming, for which the expected output of your program is "Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday" [2].
Updates - Clarifications:
[1] The class attribute should be exactly the string data (that is, ignoring the special treatment given to CSS classes). Examples below updated accordingly.
[2] Your solution should be generic enough to support any arbitrary strings for the id and text content, and be easily modifiable to change the expected class attribute.
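The first clarification can be illustrated with a small sketch in plain Perl. The helper names is_exact_class and has_css_class are made up for this illustration and are not part of any module; the sketch only contrasts the exact string comparison the task asks for with the token-list treatment CSS gives to the class attribute.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.

# What the task asks for: the attribute value is exactly the given string.
sub is_exact_class {
    my ($attr_value, $wanted) = @_;
    return defined $attr_value && $attr_value eq $wanted;
}

# What CSS does instead: the value is a whitespace-separated list of tokens.
sub has_css_class {
    my ($attr_value, $wanted) = @_;
    return 0 unless defined $attr_value;
    return scalar grep { $_ eq $wanted } split ' ', $attr_value;
}

for my $value ('data', 'data otherclass', 'otherclass') {
    printf "%-19s exact=%d css=%d\n", "'$value'",
        is_exact_class($value, 'data') ? 1 : 0,
        has_css_class($value, 'data')  ? 1 : 0;
}
```

Under the task's rule, a div with class="data otherclass" is not selected, even though a CSS selector for the class data would match it.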
Ok, you think, I know Perl is a powerful text processing language and regexes are great! And you write your code and it works well for the test cases you came up with. ... But did you think of everything? Here's the test file you end up getting:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"[
<!ATTLIST html
xmlns:xsi CDATA #FIXED "http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation CDATA #IMPLIED > ]>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/1999/xhtml
http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title>Hello, World</title>
<script type="text/javascript">
//<![CDATA[
console.log(' <div class="data" id="Hello">World</div> ');
//]]>
</script>
</head>
<body>
<div class="data" id="Zero" />
<div class="data" id="One">Monday</div><div class="data" id="Two">Tuesday</div>
<div id="Three" class='data'>Wednes<div id="day">day</div></div>
<div class="data" id='Four'><b>Thursday</b></div>
<div
class="data" id="Five">
Friday
</div>
<div
class
=
"data"
id
=
"Six"
>
<div
>
Satur
</div
>
day
</div
>
<div title=" class='data' id='Foo'>Bar"
id="Seven" class="data"> Sunday</div>
<div class="data otherclass" id="aaa">bbb</div>
<div class="otherclass" id="ccc">ddd</div>
<p class="data">eee</p>
<p id="fff">ggg</p>
<!--
<div class="data" id="Quz">Baz</div>
-->
<p><![CDATA[
<div class="data" id="Bye">Bye</div>
]]></p>
</body>
</html>
Fun, right? While the above happens to be XHTML, the same problems (and more) of course apply to XML, and HTML also quickly gets much worse - just to name one example, did you know about SGML shorthand markup? The following is perfectly valid HTML 4.01 (and even browsers may have trouble with it): <p<a href="/">foo</> (source)
If you're now thinking to yourself, "I've been working with this XML for a few years now and have never seen its format change", then imagine this scenario. The third party providing the XML hasn't touched their code, but decides to upgrade their OS libraries, including the one that writes the XML. Suddenly, you can no longer rely on the order of attributes, whitespace, etc., and your regex starts failing, and your bosses are on your back because they are losing money because their feeds are interrupted. Now take that thought a step further, and imagine you start working for a company where the person who wrote that regex is long gone, and you get to maintain it...
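The scenario above can be made concrete with a short sketch. The pattern below is a deliberately naive regex invented for this illustration (it is not taken from any reply in this thread); it hard-codes attribute order, quoting style and whitespace, which are exactly the things a different XML writer is free to change.

```perl
use strict;
use warnings;

# A deliberately naive pattern of the kind the scenario above describes.
# It hard-codes attribute order, double quotes and single spaces.
my $naive = qr{<div class="data" id="(\w+)">([^<]*)</div>};

# Three serializations of the same element, all equivalent to an XML parser.
my %variants = (
    original  => '<div class="data" id="One">Monday</div>',
    reordered => '<div id="One" class="data">Monday</div>',
    requoted  => qq{<div class='data'\n     id='One'>Monday</div>},
);

for my $name (sort keys %variants) {
    my $matched = $variants{$name} =~ $naive ? "matched $1=$2" : 'no match';
    print "$name: $matched\n";
}
```

Only the first variant matches; the other two are the same element as far as any conforming parser is concerned, but the regex silently returns nothing for them.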
So I hope it is clear: Please, don't try to parse arbitrary XML/HTML with regexes! (...just for fun) Do yourself and the future maintainers of your code a favor and just use one of the many modules that are available.
I encourage everyone to try writing a parser using their favorite module.
Honorable mentions: Grimy for a regex solution and RonW for a regex-based parser :-)
I'll kick things off with Mojo::DOM (compacted somewhat, with potential for a lot more golfing or verboseness):
use warnings;
use strict;
use Mojo::DOM;
my $dom = Mojo::DOM->new(
do { open my $fh, '<', 'example.xhtml' or die $!;
local $/; <$fh> } );
my $found = $dom->find('div[class="data"]')->map(sub {
( my $text = $_->all_text ) =~ s/\W//g;
{ id=>$_->attr('id'), text=>$text }
})->to_array;
my $out = join ', ', map { $_->{id}.'='.$_->{text} } @$found;
print $out,"\n";
$out eq "Zero=, One=Monday, Two=Tuesday, Three=Wednesday, "
."Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday"
? print "Good!\n" : die "BAD!\n";
Updates after posting: Minor updates to wording for clarification. Added more test cases to example file. 2017-10-17: Replaced the &nbsp; entity, as discussed in the replies. Switched from XHTML 1.0 Transitional to XHTML 1.0 Strict. Added Schema declaration. Added output check to Mojo::DOM example. 2017-10-20: A few minor updates to text.
Update 2017-10-18: Thank you very much to everyone who has replied and posted their solutions so far, keep 'em coming! :-)
Re: Parsing HTML/XML with Regular Expressions (XML::LibXML)
by Your Mother (Archbishop) on Oct 16, 2017 at 13:20 UTC
Overly idiomatic but this was for fun, not production :P
use XML::LibXML;
my $doc = XML::LibXML->load_html( location => "example.html",
{ recover => 1 } );
my @ids2text = map { [ $_->value, $_->getOwnerElement->textContent ] }
$doc->findnodes('//@id');
$_->[1] =~ s/\W+//g for @ids2text;
print join ", ", map sprintf("%s=%s", @$_), @ids2text;
While this happens to be XHTML
Sidenote on that. I am sure you know the sample is not XHTML but I thought I'd call it out for the sake of readers.
Update: I missed the "transitional" part of the XHTML declaration. It is indeed, shockingly, valid transitional XHTML. Goes to show how on point haukex is on this matter.
Update 2: updated node title per LanX. Pulled strict/warnings to shorten post. Plus link to module: XML::LibXML
use warnings;
use strict;
use XML::LibXML;
my $doc = XML::LibXML->load_xml( location => 'example.xhtml',
no_network=>1, recover=>1 );
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
$xpc->findnodes(q{//html:div[@class='data']});
$_->[1] =~ s/\W+//g for @ids2text;
print join ", ", map sprintf("%s=%s", @$_), @ids2text;
</update>
Thanks very much for the reply! Your post inspired some more test cases for my file, and I'm sorry to say I broke your code :-( But here's the fix:
my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
$doc->findnodes(q{//div[@class='data']});
Update: And yes, it does seem that load_html doesn't like XHTML - load_xml seems to work a bit better, although fetching the DTD from the net is pretty slow at the moment; adding the options {no_network=>1,recover=>1} disables the network check. However, with load_xml one also has to start using XML::LibXML::XPathContext:
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
$xpc->findnodes(q{//html:div[@class='data']});
Update 2: Even with network access, XML::LibXML still complains ("Entity 'nbsp' not defined"); I'm not entirely sure why yet, as it seems to be defined in the DTD... Update 3: The W3C Validator doesn't complain...
Hello again haukex,
the thread is interesting and I did my best last night to provide an XML::Twig solution, but due to my limited understanding of XML in general I report here some things I do not understand about the file you presented as input.
First, I cheated: I got the sample XML file before writing the program, because with XML I always take a try-and-check path.
Second, in my wide ignorance, I really don't know how XHTML, DTD, DOM and transitional can affect the approach to the XML to parse. My sin.
Third: if XML::Twig (the only module I use for these tasks) complains about the document, I use the W3C validator to check the content before banging my head against it, a task I really don't like.
So, your sample is a valid one. I put it after the __DATA__ token and I got the following error:
no element found at line 2, column 0, byte 39 at D:/ulisse/perl5.26.64bit/perl/vendor/lib/XML/Parser.pm line 187.
at dontregexXML03.pl line 20.
After half an hour searching the web I ended up reading about XPath bugs dated 2009, but found no clue at all.
Any attempt to brutally cut the XML, removing lines and tags ended with the very same error, at the same line (??).
So I tested the YourMother's solution with your own modification and I get many errors but also the correct solution:
sample.html:11: HTML parser error : Element script embeds close tag
console.log(' <div class="data" id="Hello">World</div> ');
^
sample.html:49: HTML parser error : htmlParseStartTag: invalid element name
<![CDATA[
^
sample.html:50: HTML parser error : Unexpected end tag : div
<div class="data" id="Bye">Bye</div>
^
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday
So I assumed the XML really did have some problems: my other attempts to fix it using the detailed reports emitted by XML::LibXML had no more luck than the previous ones.
As a last resort I put the XML sample into a separate file and: TADA! Everything runs smoothly (not considering the &nbsp; issue) with XML::Twig as presented above.
Any suggestions? Which is the best module to report formal errors in the XML structure? Are the errors reported above real ones, or are they due to limits of the parsing module?
If the thread continues, it could become the Rosetta Stone of Perl XML parsing. Good one!
L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Parsing HTML/XML with Regular Expressions (XML::Twig)
by Discipulus (Canon) on Oct 16, 2017 at 22:00 UTC
Hello haukex
I normally use XML::Twig on the sad occasions I need to deal with XML. With small XML data I use __DATA__ and $twig->parse(<DATA>), but with your sample I got no element found at line 2, column 0, byte 39 at.. even though the W3C validator accepts the file as correct. Using a real file I had no errors. I dunno why and I rarely inspect XML with my eyes; doctor said it is no good ;=)
I have not managed to strip out nbsp from Sunday, but now it's too late to deal with entities and the biiig XML::Twig manpage. See you Sundaynbsp at the Pubnbsp ;=)
use strict;
use warnings;
use XML::Twig;
my @days;
my $twig= XML::Twig->new(
twig_handlers=>{
'div[@class="data"]'=>sub{
(my $txt = $_[1]->text)=~s/\W//g;
push @days, $_[1]->att('id')."=$txt";
}
}
);
$twig->parsefile ('example.html');
print join ', ', @days;
# output
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sundaynbsp
PS: I bet tybalt89 will come out with some working regex solution! ;=)
Thanks very much for the contribution! Regarding the DATA and &nbsp; issues, see my reply here - although I assume you meant $twig->parse(*DATA) instead of $twig->parse(<DATA>)? With the updated example in the root node, your code works!
And yes, I assumed someone might take up the challenge of actually using a regex - but of course then I'd have to try to break it with more test cases ;-)
You presumed right about the DATA filehandle.
xmltwig.org and the docs list parse $string or \*OPEN_FILEHANDLE among twig's methods.
So you are right: I had to pass a handle, not an iterator (?) like <DATA>.
I dunno when I picked up this bad habit, but if you look at this and this other one and this other too, and probably many others of mine, $twig->parse(<DATA>) works!!
So $twig->parse(<DATA>) does not work with your example, but I can confirm that passing the filehandle as $twig->parse(\*DATA) or even $twig->parse(*DATA) works as expected.
Could it be that the wrong form works (at least sometimes) because of XML::Twig's ability to parse streams of XML?
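One possible explanation, offered as an assumption rather than something confirmed in this thread: in list context <DATA> returns one element per line, so a method that takes the document as its first argument only ever sees the first line. The sketch below uses plain Perl and a stand-in sub, fake_parse, which is hypothetical and only mimics that calling convention; it is not XML::Twig's actual code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parse method that takes the document as its
# first argument and treats anything after it as further arguments.
sub fake_parse {
    my ($source, @rest) = @_;
    return (length $source, scalar @rest);
}

my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n<root>\n  <a/>\n</root>\n};

# Read from an in-memory filehandle so the example needs no __DATA__ section.
open my $fh, '<', \$xml or die "Cannot open in-memory file: $!";

# List context: <$fh> is flattened into one argument per line.
my ($bytes, $extra) = fake_parse(<$fh>);
print "list context:   source is $bytes bytes, plus $extra more arguments\n";

# Rewinding and slurping the handle yields the whole document instead.
seek $fh, 0, 0;
my $whole = do { local $/; <$fh> };
print "whole document: ", length($whole), " bytes\n";
```

In the first call the source is only the 39-byte XML declaration line, which would be consistent with the "no element found at line 2, column 0, byte 39" error reported earlier. If that reading is right, the <DATA> form would appear to work whenever the whole document happens to fit on its first line.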
It was possible to produce a regex that parses all of Perl, why not one for HTML?
There is a regex to parse XML (so, therefore, XHTML): XML Shallow Parsing
That regex produces a list of strings that will need further processing. Shallow parsing is mostly useful for XML-to-XML filtering. Technically, this challenge could be considered filtering, just not to XML. Will need to keep track of <div> nesting to find the end of the contained text.
# Not tested; assumes properly nested <div> elements and valid XML syntax.
# (Warning: Messy hack. Read at your own risk.)
# $xml holds the document; for $XML_SPE see
# http://www.cs.sfu.ca/~cameron/REX.html#AppA
my $nest = 0;
my $out = '';
my @elements = $xml =~ /$XML_SPE/g;
for (@elements)
{
    if (/^<div\b/)
    {
        my $empty = m{/>\z};          # empty-element tag such as <div ... />
        if ($nest > 0)                # already inside an interesting <div>
        {
            $nest++ unless $empty;
            next;
        }
        next unless (/\bclass\s*=\s*['"]data['"]/); # \s also allows newlines
        next unless (/\bid\s*=\s*['"](\w+)['"]/);
        $out .= ", $1=";
        $nest = 1 unless $empty;      # entering the outermost interesting <div>
        next;
    }
    if (/^<\/div\b/)
    {
        $nest-- if ($nest > 0);       # ignore end tags of uninteresting <div>s
        next;
    }
    next if (/^</);                   # skip other mark-up
    if ($nest > 0)
    {
        (my $text = $_) =~ s/\W//g;   # strip non-word characters, as per the spec
        $out .= $text;
    }
}
$out =~ s/^, //;
print "$out\n";
Update: Changed title to indicate (regex)
I am not sure whether such a regex would fit even into the 18 Exabyte-limit of most modern file systems …
:-)
Re: Parsing HTML/XML with Regular Expressions
by choroba (Cardinal) on Oct 17, 2017 at 14:49 UTC
Using XML::XSH2, I had to fix the script after downloading the XML: I hadn't included the namespace, and I had tried normalize-space instead of substitution, which didn't work correctly.
open 1201438.xml ;
register-namespace xh http://www.w3.org/1999/xhtml ;
my $first = 1 ;
for //xh:div[@class='data'] {
if not($first) echo :n ', ' ;
$first = 0;
echo :s :n @id '=' xsh:subst(., '\W', '', 'g') ;
}
echo ;
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Parsing HTML/XML with Regular Expressions
by Grimy (Pilgrim) on Oct 17, 2017 at 16:32 UTC
Obligatory Zalgo-summoning solution:
#!/usr/bin/perl -p0
s/<!([^<>]|<(?1)*>)*>//gs;
s/<(?!div\b[^>]*\bclass\s*=\s*(['"])data\1)([^<>]|<(?2)*>)*>//gs;
s/.*?<(?:[^'"]|(['"]).*?\1)*?\bid\s*=\s*(['"])(.*?)\2.*?>([^<]*)/$3=$4, /gs;
s/&#(\w+);/chr $1/ge;
s/[^\w=, ]|, $|(.)\1\1//g;
To make it harder on regexes, I suggest:
- throwing unbalanced [<>'"] inside CDATA sections / comments / attributes (especially class='data"')
- using names that can be confused with the interesting ones: <divx, aclass="data", …
- using XML namespaces liberally
- using external entities
Impressive, thank you! As previously threatened, and as per your comments, some notes on trying to break the regex solution ;-)
using names that can be confused with the interesting ones: <divx, aclass="data", ...
Good point, but without some trickery those would no longer validate properly as XHTML either.
using XML namespaces liberally
Indeed, I tested this and it does cause trouble: Unsurprisingly the regex and HTML parsers can't handle it, but a little more surprising is that Mojo::DOM ignores namespaces and therefore fails with the following, and that also XML::Twig has trouble with namespaces, or at least I haven't found the right options yet. Only the XML::LibXML and XML::XSH2 solutions handle this correctly:
<html xmlns:foo="http://www.w3.org/1999/xhtml"
xmlns:bar="http://www.perlmonks.com"
...
<foo:div class="data" id="Zero" />
<bar:div class="data" id="Hi">there</bar:div>
(Update: Hmm, even the W3C Validator is having trouble with the namespaces...)
using external entities
As noted here, even some XML parsers seem to have trouble loading all the external entities. But even entities declared within the document should make life difficult for regexes:
<!ENTITY atad "data">
...
<div class="&atad;" id="Zero" />
Only the XML::LibXML and XML::Twig solutions handle that correctly, everything else (including XML::XSH2) fails.
Looks like XML::LibXML <update> and XML::XSH2 </update> are the only ones left standing in this torture test so far! :-)
And one more thing: currently entities with hex values like &#xA0; aren't supported by the regex (although that's not too difficult to fix).
Updated since the issue with XML::XSH2 was worked out further down in this thread.
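The fix for hexadecimal character references mentioned above might look like the following sketch. It is not taken from the regex solution in this thread; decode_numeric_refs is a hypothetical helper name, and the substitution simply extends the decimal-only idea to the &#x...; form while leaving named entities alone.

```perl
use strict;
use warnings;

# Hypothetical helper: decode decimal (&#100;) and hexadecimal (&#x53;)
# numeric character references. Named entities such as &amp; are left as-is.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s{&#(?:x([0-9A-Fa-f]+)|([0-9]+));}
              {chr(defined $1 ? hex($1) : $2)}ge;
    return $text;
}

my $decoded = decode_numeric_refs('&#x53;un&#100;ay &amp; more');
print "$decoded\n";    # prints "Sunday &amp; more"
```

Handling the references in a regex still covers only one of the ways a character can be written in XML, which is part of why a real parser is the safer choice.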
XML::XSH2 is just a wrapper around XML::LibXML. I'd be surprised if it didn't work the same. And indeed, the following doesn't print the id of the div that uses the &atad; class:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use XML::LibXML;
my $dom = 'XML::LibXML'->load_xml(location => '1.xml', load_ext_dtd => 0);
my $xpc = 'XML::LibXML::XPathContext'->new;
$xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml');
for my $div ($xpc->findnodes('//xh:div[@class="data"]', $dom)) {
print $div->{id}, "\n"
}
Interestingly, at the same time the following shows the classes of all the divs as data:
for my $div ($xpc->findnodes('//xh:div', $dom)) {
print join ' ', @{ $div }{qw{ id class }}, "\n"
}
Bugreport anyone?
Re: Parsing HTML/XML with Regular Expressions (HTML::Parser)
by tangent (Parson) on Oct 18, 2017 at 02:18 UTC
use HTML::Parser;
my $file = 'example.html';
my ($in_div,$in_wanted_div) = (0,0);
my @result;
my $parser = HTML::Parser->new(
api_version => 3,
start_h => [\&start, "tagname, attr"],
text_h => [\&text, "dtext"],
end_h => [\&end, "tagname"],
empty_element_tags => 1,
);
$parser->parse_file($file);
print join(', ',@result);
sub start {
my ($tag, $attr) = @_;
return unless ($tag eq 'div');
if (exists $attr->{'class'} and $attr->{'class'} eq 'data') {
$in_div = 1;
$in_wanted_div = 1;
push(@result, "$attr->{'id'}=");
}
else {
$in_div++;
}
}
sub text {
my ($text) = @_;
return unless $in_wanted_div;
$text =~ s/\W//g;
$result[-1] .= $text;
}
sub end {
my ($tag) = @_;
return unless ($tag eq 'div');
$in_div--;
$in_wanted_div = 0 if not $in_div;
}
Output:
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday
Re: Parsing HTML/XML with Regular Expressions (HTML::TreeBuilder::XPath)
by tangent (Parson) on Oct 18, 2017 at 02:34 UTC
In my previous comment I mentioned that I could not find a way to pass the attribute empty_element_tags from HTML::TreeBuilder to HTML::Parser. Looking at the source code for HTML::TreeBuilder I found this:
our @ISA = qw(HTML::Element HTML::Parser);
# This looks schizoid, I know...
So I've learnt something there! I can call empty_element_tags(1) and now it works.
use HTML::TreeBuilder::XPath;
my $file = 'example.html';
my @result;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->empty_element_tags(1); # calls this on HTML::Parser
$tree->parse_file($file);
$tree->eof;
my @divs = $tree->findnodes('//div[@class="data"]');
for my $div (@divs) {
my $text = $div->as_text || '';
$text =~ s/\W//g;
push(@result, $div->attr('id') . "=$text");
}
print join(', ',@result);
Output:
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday
Re: Parsing HTML/XML with Regular Expressions (HTML::Parser)
by fishy (Friar) on Oct 17, 2017 at 22:22 UTC
Hi Monks,
someone had to try with HTML::Parser... Here I am:
use warnings;
use strict;
use HTML::Parser;
my $parser = HTML::Parser->new(
api_version => 3,
start_h => [\&start_handler, "self, tagname, attr"],
strict_names => 1,
empty_element_tags => 1,
);
my $file = "1201438.html";
open(my $fh, "<", $file) or die "Can't open < $file: $!";
my $contents = do { local $/; <$fh> };
close $fh;
$parser->parse($contents);
for (keys %{$parser->{_numbers}}) {
print "$_=", join("", @{$parser->{_numbers}->{$_}}), ", ";
}
print "\n";
sub start_handler {
my ($self, $tag, $attr) = @_;
return unless $tag eq 'div';
$self->handler(start => \&number_start_handler, "self,tagname,attr");
}
# <div class="data" id="Zero" />
sub number_start_handler {
my ($self, $tag, $attr) = @_;
if ( exists $attr->{class}
&& $attr->{class} eq 'data'
&& exists $attr->{id}
&& $attr->{id} =~ /(Zero|One|Two|Three|Four|Five|Six|Seven)/ ) {
$self->{_now} = $1;
$self->{_numbers}->{$1} = [];
$self->handler(text => \&number_text_handler, "self,text");
} elsif ($tag eq 'b') {
$self->handler(text => \&number_text_handler, "self,text");
} elsif ($tag eq 'div'
&& ! exists $attr->{class} ) {
$self->handler(text => \&number_text_handler, "self,text");
} else {
$self->handler(text => undef);
}
}
sub number_text_handler {
my ($self, $text) = @_;
$text =~ s/^\s+//;
$text =~ s/\s+$//;
push @{$self->{_numbers}->{$self->{_now}}}, $text;
}
Not perfect output:
One=Monday, Six=Saturday, Three=Wednesday, Five=Friday, Two=Tuesday, Seven=Sunday , Four=Thursday,
If someone could give me some hint why I miss 'Zero' and don't get right 'Sunday'?
Thanks!
Thank you, haukex for your comments and for your interesting OP.
Yes, tangent's code boosted my knowledge.
Cheers
Re: Parsing HTML/XML with Regular Expressions (Mojo::DOM)
by LanX (Saint) on Oct 16, 2017 at 14:46 UTC
Thumbs up! :)
Would be nice if each contributor tagged his title with the name of the module used.
I'll start - as an unorthodox example - with the tag for the root post.