Re: Parsing HTML/XML with Regular Expressions (XML::LibXML)

by Your Mother (Archbishop)
on Oct 16, 2017 at 13:20 UTC


in reply to Parsing HTML/XML with Regular Expressions

Overly idiomatic, but this was for fun, not production :P

    use XML::LibXML;

    my $doc = XML::LibXML->load_html(
        location => "example.html", { recover => 1 } );
    my @ids2text = map { [ $_->value, $_->getOwnerElement->textContent ] }
        $doc->findnodes('//@id');
    $_->[1] =~ s/\W+//g for @ids2text;
    print join ", ", map sprintf("%s=%s", @$_), @ids2text;

"While this happens to be XHTML"

Sidenote on that. I am sure you know the sample is not XHTML but I thought I'd call it out for the sake of readers.

Update: I missed the "transitional" part of the XHTML declaration. It is indeed, shockingly, valid transitional XHTML. Goes to show how on point haukex is on this matter.

Update 2: updated node title per LanX. Pulled strict/warnings to shorten post. Plus link to module: XML::LibXML

Re^2: Parsing HTML/XML with Regular Expressions (XML::LibXML; updated!)
by haukex (Archbishop) on Oct 16, 2017 at 15:12 UTC

    <update nr="4"> For the sake of completeness, here's a working script with the changes mentioned below:

        use warnings;
        use strict;
        use XML::LibXML;

        my $doc = XML::LibXML->load_xml(
            location => 'example.xhtml', no_network=>1, recover=>1 );
        my $xpc = XML::LibXML::XPathContext->new($doc);
        $xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
        my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
            $xpc->findnodes(q{//html:div[@class='data']});
        $_->[1] =~ s/\W+//g for @ids2text;
        print join ", ", map sprintf("%s=%s", @$_), @ids2text;

    </update>

    Thanks very much for the reply! Your post inspired some more test cases for my file, and I'm sorry to say I broke your code :-( But here's the fix:

        my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
            $doc->findnodes(q{//div[@class='data']});

    Update: And yes, it does seem that load_html doesn't like XHTML - load_xml seems to work a bit better, although fetching the DTD from the net is pretty slow at the moment; adding the options {no_network=>1,recover=>1} disables the network check. However, with load_xml one also has to start using XML::LibXML::XPathContext:

        my $xpc = XML::LibXML::XPathContext->new($doc);
        $xpc->registerNs('html', 'http://www.w3.org/1999/xhtml');
        my @ids2text = map { [ $_->getAttribute('id'), $_->textContent ] }
            $xpc->findnodes(q{//html:div[@class='data']});
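
For readers wondering why the XPathContext becomes necessary: a minimal sketch, assuming a tiny inline XHTML snippet rather than the thread's example file. Once the document is parsed as XML, the div elements live in the XHTML namespace, so an unprefixed //div should match nothing, while a prefix registered for that namespace does match. The prefix name "html" is arbitrary.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical one-div sample; the default namespace is the XHTML one.
my $xhtml = q{<html xmlns="http://www.w3.org/1999/xhtml">
  <body><div class="data" id="One">Monday</div></body>
</html>};

my $doc = XML::LibXML->load_xml( string => $xhtml );

# Unprefixed names in XPath 1.0 only match elements in no namespace.
my @plain = $doc->findnodes(q{//div[@class='data']});

# Bind a prefix to the XHTML namespace and use it in the expression.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( html => 'http://www.w3.org/1999/xhtml' );
my @prefixed = $xpc->findnodes(q{//html:div[@class='data']});

printf "plain=%d prefixed=%d\n", scalar @plain, scalar @prefixed;
```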

    Update 2: Even with network, XML::LibXML is still complaining about &nbsp; ("Entity 'nbsp' not defined"), I'm not entirely sure why yet, as it seems to be defined in the DTD... Update 3: The W3C Validator doesn't complain...

      Hello again haukex,

      the thread is interesting and I did my best last night to provide an XML::Twig solution, but due to my limited understanding of XML in general, I report here some things I do not understand about the file you presented as input.

      First, I cheated: I got the sample XML file before writing the program, because with XML I always take a try-and-check path.

      Second, in my wide ignorance, I really don't know how XHTML, DTD, DOM and transitional can affect the approach to parsing the XML. My sin.

      Third: if XML::Twig (the only module I use for these tasks) complains about the document, I use the W3C validator to check the content before banging my head against it, a task I really don't like.

      So, your sample is a valid one. I put it after the __DATA__ token and I got the following error:

          no element found at line 2, column 0, byte 39 at D:/ulisse/perl5.26.64bit/perl/vendor/lib/XML/Parser.pm line 187.
          at dontregexXML03.pl line 20.

      After half an hour searching the web I ended up reading about XPath bugs dated 2009, but found no clue at all.

      Every attempt to brutally cut the XML, removing lines and tags, ended with the very same error, at the same line (??).

      So I tested Your Mother's solution with your own modification, and I got many errors but also the correct result:

          sample.html:11: HTML parser error : Element script embeds close tag
          console.log(' <div class="data" id="Hello">World</div> ');
          ^
          sample.html:49: HTML parser error : htmlParseStartTag: invalid element name
          <![CDATA[
          ^
          sample.html:50: HTML parser error : Unexpected end tag : div
          <div class="data" id="Bye">Bye</div>
          ^
          Zero=, One=Monday, Two=Tuesday, Three=Wednesday, Four=Thursday, Five=Friday, Six=Saturday, Seven=Sunday

      So I assumed the XML really did have some problems, but my other attempts to fix it using the detailed reports emitted by XML::LibXML had no more luck than the previous ones.

      As a last resort I put the XML sample into a separate file and, ta-da, everything ran smoothly (not considering the &nbsp; issue) with XML::Twig as presented above.

      Any suggestions? Which is the best module to report formal errors in the XML structure? Are the errors reported above real ones, or are they due to limits of the parsing module?

      If the thread continues, it could become the Rosetta Stone of Perl XML parsing. Good one!

      L*

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

        Thanks for looking into that! So as for the &nbsp;, my understanding so far is this: of course an HTML parser will know what it is, but a generic XML parser will by default not know that entity - for that, it has to load the DTDs, but apparently not all XML parsers do that. So, to separate the two problems (the parsing of the XML in the root node vs. figuring out the right options to get the XML parser to recognize the HTML entities), I've updated the example XHTML in the root node to replace the &nbsp; (and a few other updates - unfortunately causing load_html to throw more errors, but load_xml to work better).
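
To illustrate the &nbsp; point in isolation: a minimal sketch, assuming two tiny inline documents that are not from the thread. A generic XML parser with no DTD to consult should reject the bare entity reference, while the same content parses once the entity is declared, here in an internal DTD subset rather than by loading the XHTML DTDs.

```perl
use strict;
use warnings;
use XML::LibXML;

# No DTD at all: &nbsp; is an undeclared entity, so parsing should die.
my $bare       = '<p>a&nbsp;b</p>';
my $bare_ok    = eval { XML::LibXML->load_xml( string => $bare ); 1 } ? 1 : 0;
my $bare_error = $bare_ok ? '' : "$@";

# Same content, but the entity is declared in an internal DTD subset.
my $declared = '<!DOCTYPE p [ <!ENTITY nbsp "&#160;"> ]><p>a&nbsp;b</p>';
my $doc      = XML::LibXML->load_xml( string => $declared );
my $text     = $doc->documentElement->textContent;

print "bare document parsed: ", ( $bare_ok ? "yes" : "no" ), "\n";
print "parser said: $bare_error\n" unless $bare_ok;
print "declared document text length: ", length($text), "\n";
```

This only shows that the parser needs a declaration from somewhere; it does not explain why the declaration in the real XHTML DTD was not picked up.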

        "Which is the best module to report formal errors in the XML structure?"

        I typically use xmllint, which is also based on libxml2 just like XML::LibXML, so really either of those two tools should do XML validation pretty well (as I said above I'm not sure yet what's going on with the DTDs). For example, to validate the example from the root node against the XHTML schema, the following command works; it's also possible to speed it up by downloading the schema files locally and using the options --nonet --path /path/to/schemas/ --schema /path/to/schemas/xhtml1-strict.xsd (the "I/O error : Attempt to load network entity" messages can usually be ignored).

            $ xmllint --noout --schema \
                'http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd' example.xhtml
            example.xhtml validates

        <update> Or, you can use the --valid option for DTD validation. </update>

        For any (X)HTML, I'd consider the W3C Validator the gold standard. I've also often just used the above xmllint command.

        As for your problem with parsing the XML file from the DATA section, I'd have to look into that a bit when I find some more time. Perhaps the parser is doing something with the filehandle that is not compatible with DATA. Also, ikegami made an excellent point a while back: XML files should be treated like binary files, and it's better to let the XML parser handle the decoding (although my example file is currently pure 7-bit ASCII).
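
A rough sketch of the "treat XML as binary" advice, assuming a throwaway temp file and a hypothetical Latin-1 sample (this is not a diagnosis of the DATA problem): the handle is opened in raw mode with no encoding layer, and the parser decodes the bytes itself based on the XML declaration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use XML::LibXML;

# Write a small ISO-8859-1 document as raw bytes (0xE9 is e-acute).
my ( $out, $path ) = tempfile( SUFFIX => '.xml', UNLINK => 1 );
binmode $out;
print {$out} qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<p>caf\xE9</p>\n};
close $out or die "close failed: $!";

# Read it back in raw mode; no :encoding layer on the Perl side.
open my $in, '<:raw', $path or die "open failed: $!";
my $doc = XML::LibXML->load_xml( IO => $in );
close $in;

# The parser has turned the bytes into characters.
my $text = $doc->documentElement->textContent;
printf "%d characters, last is U+%04X\n",
    length($text), ord( substr( $text, -1 ) );
```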
