Re: Parsing HTML/XML with Regular Expressions (XML::LibXML)
by Your Mother (Archbishop) on Oct 16, 2017 at 13:20 UTC
Overly idiomatic, but this was for fun, not production :P
use XML::LibXML;
my $doc = XML::LibXML->load_html( location => "example.html",
                                  { recover => 1 } );
my @ids2text = map { [ $_->value, $_->getOwnerElement->textContent ] }
               $doc->findnodes('//@id');
$_->[1] =~ s/\W+//g for @ids2text;
print join ", ", map sprintf("%s=%s", @$_), @ids2text;
While this happens to be XHTML
Sidenote on that. I am sure you know the sample is not XHTML but I thought I'd call it out for the sake of readers.
Update: I missed the "transitional" part of the XHTML declaration. It is indeed, shockingly, valid transitional XHTML. Goes to show how on point haukex is on this matter.
Update 2: updated node title per LanX. Pulled strict/warnings to shorten post. Plus link to module: XML::LibXML
use warnings;
use strict;
use XML::LibXML;
my $doc = XML::LibXML->load_xml( location => 'example.xhtml',
                                 no_network=>1, recover=>1 );
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
               $xpc->findnodes(q{//html:div[@class='data']});
$_->[1] =~ s/\W+//g for @ids2text;
print join ", ", map sprintf("%s=%s", @$_), @ids2text;
</update>
Thanks very much for the reply! Your post inspired some more test cases for my file, and I'm sorry to say I broke your code :-( But here's the fix:
my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
               $doc->findnodes(q{//div[@class='data']});
Update: And yes, it does seem that load_html doesn't like XHTML - load_xml seems to work a bit better, although fetching the DTD from the net is pretty slow at the moment; adding the options {no_network=>1,recover=>1} disables the network check. However, with load_xml one also has to start using XML::LibXML::XPathContext:
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
               $xpc->findnodes(q{//html:div[@class='data']});
Update 2: Even with network access, XML::LibXML still complains ("Entity 'nbsp' not defined"). I'm not entirely sure why yet, as it seems to be defined in the DTD... Update 3: The W3C Validator doesn't complain...
Hello again haukex,
the thread is interesting and I did my best last night to provide an XML::Twig solution, but due to my limited understanding of XML in general, I report here some things I do not understand about the file you presented as input.
First, I cheated: I got the sample XML file before writing the program, because with XML I always take a try-and-check approach.
Second, in my wide ignorance, I really don't know how XHTML, DTD, DOM and transitional can affect the approach to parsing the XML. My sin.
Third: if XML::Twig (the only module I use for these tasks) complains about the document, I use the W3C validator to check the content before banging my head against it, a task I really don't like.
So, your sample is a valid one. I put it after the __DATA__ token and I got the following error:
no element found at line 2, column 0, byte 39 at D:/ulisse/perl5.26.64bit/perl/vendor/lib/XML/Parser.pm line 187.
at dontregexXML03.pl line 20.
After half an hour of searching the web I ended up reading about XPath bugs dated 2009, but found no clue at all.
Any attempt to brutally cut the XML, removing lines and tags, ended with the very same error, at the same line (??).
So I tested Your Mother's solution with your own modification and I got many errors but also the correct result:
sample.html:11: HTML parser error : Element script embeds close tag
console.log(' <div class="data" id="Hello">World</div> ');
^
sample.html:49: HTML parser error : htmlParseStartTag: invalid element name
<![CDATA[
^
sample.html:50: HTML parser error : Unexpected end tag : div
<div class="data" id="Bye">Bye</div>
^
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday
So I assumed the XML really had some problems: my other attempts to fix it using the detailed reports emitted by XML::LibXML had no more luck than the previous ones.
As a last resort I put the XML sample into a separate file and: ta-da, everything ran smoothly (not considering the &nbsp; issue) with XML::Twig as presented above.
Any suggestions? Which is the best module to report formal errors in the XML structure? Are the errors reported above real ones, or are they due to limits of the parsing module?
If the thread continues, it could become the Rosetta of Perl XML parsing. Good one!
L*
There are no rules, there are no thumbs..
Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Parsing HTML/XML with Regular Expressions (XML::Twig)
by Discipulus (Canon) on Oct 16, 2017 at 22:00 UTC
Hello haukex
I normally use XML::Twig on the sad occasions I need to deal with XML. With small XML data I use __DATA__ and $twig->parse(<DATA>), but with your sample I got "no element found at line 2, column 0, byte 39 at..." even though the W3C validator accepts the file as correct. Using a real file I had no errors. I dunno why, and I rarely inspect XML with my eyes; doctor said it's no good ;=)
I have not managed to strip out nbsp from Sunday, but now it's too late to deal with entities and the biiig XML::Twig manpage. See you Sundaynbsp at the Pubnbsp ;=)
use strict;
use warnings;
use XML::Twig;
my @days;
my $twig= XML::Twig->new(
    twig_handlers=>{
        'div[@class="data"]'=>sub{
            (my $txt = $_[1]->text)=~s/\W//g;
            push @days, $_[1]->att('id')."=$txt";
        }
    }
);
$twig->parsefile ('example.html');
print join ', ', @days;
# output
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sundaynbsp
PS: I bet tybalt89 will come out with some working regex solution! ;=)
Thanks very much for the contribution! Regarding the DATA and issues, see my reply here - although I assume you meant $twig->parse(*DATA) instead of $twig->parse(<DATA>)? With the updated example in the root node, your code works!
And yes, I assumed someone might take up the challenge of actually using a regex - but of course then I'd have to try to break it with more test cases ;-)
You presumed ~right about the DATA filehandle.
The xmltwig.org site and the docs specify parse with either $string or \*OPEN_FILEHANDLE among twig's methods.
So you are right: I had to pass a handle, not an iterator (?) like <DATA>.
I dunno when I picked up this bad habit, but if you look at this and this other one and this other too, and probably many others of mine, $twig->parse(<DATA>) works!!
So $twig->parse(<DATA>) does not work with your example, but I can confirm that passing the filehandle as $twig->parse(\*DATA) or even $twig->parse(*DATA) works as expected.
Could it be that the wrong form works (at least sometimes) because of XML::Twig's ability to parse streams of XML?
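A possible explanation, sketched in core Perl only: inside a call's argument list, <DATA> is evaluated in list context, so it expands to one argument per line before the call happens. The sketch assumes (this is not checked against the XML::Twig source here) that parse treats only its first argument as the document. first_arg_only is a hypothetical stand-in for such a method, and the in-memory filehandle stands in for DATA.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parser method that reads only its first argument.
sub first_arg_only {
    my ($doc, @rest) = @_;
    return { doc => $doc, ignored => scalar @rest };
}

my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n<root>\n  <a/>\n</root>\n};

# In-memory filehandle standing in for DATA.
open my $fh, '<', \$xml or die "Cannot open in-memory handle: $!";

# List context: readline returns every line, one argument per line.
my $got = first_arg_only(<$fh>);
close $fh;

printf "document seen: %d bytes, arguments ignored: %d\n",
    length $got->{doc}, $got->{ignored};
# prints "document seen: 39 bytes, arguments ignored: 3"
```

Under that assumption, a document that fits on a single line arrives whole as the first argument, which would explain why the short form sometimes appears to work. If the first line is only the XML declaration, the parser would run out of input right after it, which is at least consistent with an error at line 2, column 0. Passing \*DATA hands over the handle itself, so the parser can read all of it.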
It was possible to produce a regex that parses all of Perl, why not one for HTML?
There is a regex to parse XML (so, therefore, XHTML): XML Shallow Parsing
That regex produces a list of strings that need further processing. Shallow parsing is mostly useful for XML-to-XML filtering. Technically, this challenge could be considered filtering, just not to XML. The code needs to keep track of <div> nesting to find the end of the contained text.
# Not tested and assumes proper nesting of <div> elements (and valid XML syntax)
# (Warning: Messy hack. Read at your own risk.)
my $nest = 0;
my $out = '';
my @elements = $xml =~ /$XML_SPE/g; # see http://www.cs.sfu.ca/~cameron/REX.html#AppA
for (@elements)
{
    if (/^<div\b/)
    {
        my $empty = m{/>\z}; # a self-closing <div ... /> has no matching </div>
        $nest++ if ($nest > 0 and not $empty); # only increment if inside an interesting <div>
        next unless (/class\h*=\h*['"]data['"]/); # \h is horizontal white space
        next unless (/id\h*=\h*['"](\w+)['"]/);
        $out .= ", $1=";
        $nest = 1 if ($nest == 0 and not $empty); # if this is the outermost interesting <div>
        next;
    }
    if (/^<\/div\b/)
    {
        $nest-- if ($nest > 0); # ignore </div> outside an interesting <div>
        next;
    }
    next if (/^[<]/); # skip other mark-up
    $out .= $_ if ($nest > 0);
}
$out =~ s/^, //;
print "$out\n";
Update: Changed title to indicate (regex)
I am not sure whether such a regex would fit even into the 18-exabyte limit of most modern file systems …
:-)
Re: Parsing HTML/XML with Regular Expressions
by choroba (Cardinal) on Oct 17, 2017 at 14:49 UTC
Using XML::XSH2. I had to fix the script after downloading the XML: I hadn't had the namespace there, and I tried normalize-space instead of substitution, which didn't work correctly.
open 1201438.xml ;
register-namespace xh http://www.w3.org/1999/xhtml ;
my $first = 1 ;
for //xh:div[@class='data'] {
    if not($first) echo :n ', ' ;
    $first = 0;
    echo :s :n @id '=' xsh:subst(., '\W', '', 'g') ;
}
echo ;
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Parsing HTML/XML with Regular Expressions
by Grimy (Pilgrim) on Oct 17, 2017 at 16:32 UTC
Obligatory Zalgo-summoning solution:
#!/usr/bin/perl -p0
s/<!([^<>]|<(?1)*>)*>//gs;
s/<(?!div\b[^>]*\bclass\s*=\s*(['"])data\1)([^<>]|<(?2)*>)*>//gs;
s/.*?<(?:[^'"]|(['"]).*?\1)*?\bid\s*=\s*(['"])(.*?)\2.*?>([^<]*)/$3=$4, /gs;
s/&#(\w+);/chr $1/ge;
s/[^\w=, ]|, $|(.)\1\1//g;
To make it harder on regexes, I suggest:
- throwing unbalanced [<>'"] inside CDATA sections / comments / attributes (especially class='data"')
- using names that can be confused with the interesting ones: <divx, aclass="data", …
- using XML namespaces liberally
- using external entities
Impressive, thank you! As previously threatened, and as per your comments, some notes on trying to break the regex solution ;-)
using names that can be confused with the interesting ones: <divx, aclass="data", ...
Good point, but without some trickery those would no longer validate properly as XHTML either.
using XML namespaces liberally
Indeed, I tested this and it does cause trouble. Unsurprisingly, the regex and HTML parsers can't handle it. A little more surprising is that Mojo::DOM ignores namespaces and therefore fails with the following, and that XML::Twig also has trouble with namespaces, or at least I haven't found the right options yet. Only the XML::LibXML and XML::XSH2 solutions handle this correctly:
<html xmlns:foo="http://www.w3.org/1999/xhtml"
xmlns:bar="http://www.perlmonks.com"
...
<foo:div class="data" id="Zero" />
<bar:div class="data" id="Hi">there</bar:div>
(Update: Hmm, even the W3C Validator is having trouble with the namespaces...)
using external entities
As noted here, even some XML parsers seem to have trouble loading all the external entities. But even entities declared within the document should make life difficult for regexes:
<!ENTITY atad "data">
...
<div class="&atad;" id="Zero" />
Only the XML::LibXML and XML::Twig solutions handle that correctly, everything else (including XML::XSH2) fails.
Looks like XML::LibXML <update> and XML::XSH2 </update> are the only ones left standing in this torture test so far! :-)
And one more thing: currently entities with hex values like &#xA0; aren't supported by the regex (although that's not too difficult to fix).
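One way such a fix could look, as a core-Perl sketch rather than a patch to the solution above: decode hexadecimal and decimal character references in a single substitution pass, so that already decoded text is never scanned a second time. The helper name decode_numeric_entities is made up for this illustration, and named entities such as nbsp are deliberately left alone.

```perl
use strict;
use warnings;

# Decode &#xHH; (hexadecimal) and &#DDD; (decimal) character references.
# One pass only, so the result of a replacement is not decoded again.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/chr(defined $1 ? hex $1 : $2)/ge;
    return $text;
}

my $decoded = decode_numeric_entities('Sun&#x64;ay&#33;');
print "$decoded\n";   # prints "Sunday!"
```

In the one-liner this would replace the s/&#(\w+);/chr $1/ge step, which treats any word characters after &# as a decimal number.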
Updated since the issue with XML::XSH2 was worked out further down in this thread.
XML::XSH2 is just a wrapper around XML::LibXML. I'd be surprised if it didn't work the same. And indeed, the following doesn't print the id of the div that uses the &atad; class:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use XML::LibXML;
my $dom = 'XML::LibXML'->load_xml(location => '1.xml', load_ext_dtd => 0);
my $xpc = 'XML::LibXML::XPathContext'->new;
$xpc->registerNs(xh => 'http://www.w3.org/1999/xhtml');
for my $div ($xpc->findnodes('//xh:div[@class="data"]', $dom)) {
    print $div->{id}, "\n"
}
Interestingly, at the same time the following shows the classes of all the divs as data:
for my $div ($xpc->findnodes('//xh:div', $dom)) {
    print join ' ', @{ $div }{qw{ id class }}, "\n"
}
Bug report, anyone?
Re: Parsing HTML/XML with Regular Expressions (HTML::Parser)
by tangent (Parson) on Oct 18, 2017 at 02:18 UTC
use HTML::Parser;
my $file = 'example.html';
my ($in_div,$in_wanted_div) = (0,0);
my @result;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [\&start, "tagname, attr"],
    text_h => [\&text, "dtext"],
    end_h => [\&end, "tagname"],
    empty_element_tags => 1,
);
$parser->parse_file($file);
print join(', ',@result);

sub start {
    my ($tag, $attr) = @_;
    return unless ($tag eq 'div');
    if (exists $attr->{'class'} and $attr->{'class'} eq 'data') {
        $in_div = 1;
        $in_wanted_div = 1;
        push(@result, "$attr->{'id'}=");
    }
    else {
        $in_div++;
    }
}

sub text {
    my ($text) = @_;
    return unless $in_wanted_div;
    $text =~ s/\W//g;
    $result[-1] .= $text;
}

sub end {
    my ($tag) = @_;
    return unless ($tag eq 'div');
    $in_div--;
    $in_wanted_div = 0 if not $in_div;
}
Output:
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday
Re: Parsing HTML/XML with Regular Expressions (HTML::TreeBuilder::XPath)
by tangent (Parson) on Oct 18, 2017 at 02:34 UTC
In my previous comment I mentioned that I could not find a way to pass the attribute empty_element_tags from HTML::TreeBuilder to HTML::Parser. Looking at the source code for HTML::TreeBuilder I found this:
our @ISA = qw(HTML::Element HTML::Parser);
# This looks schizoid, I know...
So I've learnt something there! I can call empty_element_tags(1) and now it works.
use HTML::TreeBuilder::XPath;
my $file = 'example.html';
my @result;
my $tree = HTML::TreeBuilder::XPath->new;
$tree->empty_element_tags(1); # calls this on HTML::Parser
$tree->parse_file($file);
$tree->eof;
my @divs = $tree->findnodes('//div[@class="data"]');
for my $div (@divs) {
    my $text = $div->as_text || '';
    $text =~ s/\W//g;
    push(@result, $div->attr('id') . "=$text");
}
print join(', ',@result);
Output:
Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday
Re: Parsing HTML/XML with Regular Expressions (HTML::Parser)
by fishy (Friar) on Oct 17, 2017 at 22:22 UTC
Hi Monks,
someone had to try with HTML::Parser... Here I am:
use warnings;
use strict;
use HTML::Parser;
my $parser = HTML::Parser->new(
    api_version => 3,
    start_h => [\&start_handler, "self, tagname, attr"],
    strict_names => 1,
    empty_element_tags => 1,
);
my $file = "1201438.html";
open(my $fh, "<", $file) or die "Can't open < $file: $!";
my $contents = do { local $/; <$fh> };
close $fh;
$parser->parse($contents);
for (keys %{$parser->{_numbers}}) {
    print "$_=", join("", @{$parser->{_numbers}->{$_}}), ", ";
}
print "\n";

sub start_handler {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq 'div';
    $self->handler(start => \&number_start_handler, "self,tagname,attr");
}

# <div class="data" id="Zero" />
sub number_start_handler {
    my ($self, $tag, $attr) = @_;
    if ( exists $attr->{class}
        && $attr->{class} eq 'data'
        && exists $attr->{id}
        && $attr->{id} =~ /(Zero|One|Two|Three|Four|Five|Six|Seven)/ ) {
        $self->{_now} = $1;
        $self->{_numbers}->{$1} = [];
        $self->handler(text => \&number_text_handler, "self,text");
    } elsif ($tag eq 'b') {
        $self->handler(text => \&number_text_handler, "self,text");
    } elsif ($tag eq 'div'
        && ! exists $attr->{class} ) {
        $self->handler(text => \&number_text_handler, "self,text");
    } else {
        $self->handler(text => undef);
    }
}

sub number_text_handler {
    my ($self, $text) = @_;
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    push @{$self->{_numbers}->{$self->{_now}}}, $text;
}
Not perfect output:
One=Monday, Six=Saturday, Three=Wednesday, Five=Friday, Two=Tuesday, Seven=Sunday , Four=Thursday,
Could someone give me a hint as to why I miss 'Zero' and don't get 'Sunday' right?
Thanks!
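On the ordering part of the question: keys on a plain Perl hash returns the keys in no particular order, so the days come out scrambled regardless of how the parsing went. A minimal core-Perl sketch follows, with made-up events and a hypothetical @order array that records each id the first time it is seen. It only addresses the ordering, not the missing 'Zero' or the stray space after 'Sunday'.

```perl
use strict;
use warnings;

my (%numbers, @order);

# Made-up (id, text) events standing in for what the handlers collect.
my @events = ( [ Zero => '' ], [ One => 'Mon' ], [ One => 'day' ], [ Two => 'Tuesday' ] );

for my $event (@events) {
    my ($id, $text) = @$event;
    push @order, $id unless exists $numbers{$id};   # remember first-seen order
    push @{ $numbers{$id} }, $text;
}

# Walk the ids in the order they were first seen, not in hash order.
my $line = join ', ', map { "$_=" . join('', @{ $numbers{$_} }) } @order;
print "$line\n";   # prints "Zero=, One=Monday, Two=Tuesday"
```

Building the line with join also avoids the trailing ", " that the print loop leaves at the end.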
Thank you, haukex for your comments and for your interesting OP.
Yes, tangent's code boosted my knowledge.
Cheers
Re: Parsing HTML/XML with Regular Expressions (Mojo::DOM)
by LanX (Saint) on Oct 16, 2017 at 14:46 UTC
Thumbs up! :)
Would be nice if each contributor tagged their title with the name of the module used.
I'll start - as an unorthodox example - with the tag for the root post.