Re^4: Parsing HTML/XML with Regular Expressions (validation of the content)

by haukex (Archbishop)
on Oct 17, 2017 at 11:24 UTC


in reply to Re^3: Parsing HTML/XML with Regular Expressions (validation of the content)
in thread Parsing HTML/XML with Regular Expressions

Thanks for looking into that! As for the &nbsp;, my understanding so far is this: an HTML parser will of course know what it is, but a generic XML parser will not know that entity by default. For that, it has to load the DTDs, and apparently not all XML parsers do that. So, to separate the two problems (parsing the XML in the root node vs. figuring out the right options to get the XML parser to recognize the HTML entities), I've updated the example XHTML in the root node to replace the &nbsp; (along with a few other updates, which unfortunately cause load_html to throw more errors, but load_xml to work better).
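To make the entity issue concrete, here is a minimal sketch using xmllint. The scratch directory, the file names and the wellformed helper are made up for illustration and are not taken from the root node. It checks three tiny documents: one using &nbsp; with no DTD, one using the numeric reference &#160; instead, and one that declares nbsp itself in an internal DTD subset.

```shell
#!/bin/sh
# Scratch directory and file names are hypothetical, for this demo only.
tmpdir=$(mktemp -d)

# 1. Named HTML entity, with no DTD that declares it.
printf '%s\n' '<p>a&nbsp;b</p>' > "$tmpdir/named.xml"

# 2. Numeric character reference for the same character (U+00A0).
printf '%s\n' '<p>a&#160;b</p>' > "$tmpdir/numeric.xml"

# 3. Named entity again, but declared in an internal DTD subset.
printf '%s\n' \
    '<!DOCTYPE p [ <!ENTITY nbsp "&#160;"> ]>' \
    '<p>a&nbsp;b</p>' > "$tmpdir/declared.xml"

# Helper (hypothetical name): report whether xmllint accepts the file.
# --nonet keeps the parser from fetching anything over the network.
wellformed() {
    if xmllint --noout --nonet "$1" >/dev/null 2>&1; then
        echo "well-formed"
    else
        echo "rejected"
    fi
}

named=$(wellformed "$tmpdir/named.xml")
numeric=$(wellformed "$tmpdir/numeric.xml")
declared=$(wellformed "$tmpdir/declared.xml")

echo "named:    $named"
echo "numeric:  $numeric"
echo "declared: $declared"

rm -rf "$tmpdir"
```

Only the first file should be rejected, since an undeclared entity reference is a well-formedness error when the document has no DTD. The other two avoid the problem without needing any external DTD.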

Which is the best module to report formal errors in the XML structure?

I typically use xmllint, which is based on libxml2 just like XML::LibXML, so either of those two tools should do XML validation pretty well (as I said above, I'm not sure yet what's going on with the DTDs). For example, to validate the example from the root node against the XHTML schema, the following command works. It's also possible to speed it up by downloading the schema files locally and using the options --nonet --path /path/to/schemas/ --schema /path/to/schemas/xhtml1-strict.xsd (the "I/O error : Attempt to load network entity" messages can usually be ignored).

$ xmllint --noout --schema \
    'http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd' example.xhtml
example.xhtml validates

<update> Or, you can use the --valid option for DTD validation. </update>
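For scripting, the exit status of xmllint is usually more convenient than its messages. The following is only a sketch on toy files: the note.xsd schema, the good.xml and bad.xml documents and the check helper are all invented here so the example runs offline, and they stand in for the XHTML schema and example.xhtml. It exercises both --schema (XSD validation) and --valid (DTD validation).

```shell
#!/bin/sh
# Toy files for illustration only; none of these names come from the post.
tmpdir=$(mktemp -d)

# A minimal XSD: <note> must contain exactly one <to> child.
cat > "$tmpdir/note.xsd" <<'EOF'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
EOF

# A document that carries its own DTD in the internal subset.
cat > "$tmpdir/good.xml" <<'EOF'
<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (to)>
  <!ELEMENT to (#PCDATA)>
]>
<note><to>monks</to></note>
EOF

# Same DTD, but the content violates both the DTD and the XSD.
sed 's/<to>monks<\/to>/<from>monks<\/from>/' \
    "$tmpdir/good.xml" > "$tmpdir/bad.xml"

# Helper (hypothetical name): run xmllint with the given options and
# map its exit status to a word. --nonet forbids network access.
check() {
    if xmllint --noout --nonet "$@" >/dev/null 2>&1; then
        echo "valid"
    else
        echo "invalid"
    fi
}

xsd_good=$(check --schema "$tmpdir/note.xsd" "$tmpdir/good.xml")
xsd_bad=$(check --schema "$tmpdir/note.xsd" "$tmpdir/bad.xml")
dtd_good=$(check --valid "$tmpdir/good.xml")
dtd_bad=$(check --valid "$tmpdir/bad.xml")

echo "XSD: good.xml is $xsd_good, bad.xml is $xsd_bad"
echo "DTD: good.xml is $dtd_good, bad.xml is $dtd_bad"

rm -rf "$tmpdir"
```

The helper discards xmllint's diagnostics to keep the output short. When debugging a real document, drop the redirection so the line numbers and error messages stay visible.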

For any (X)HTML, I'd consider the W3C Validator the gold standard. I've also often just used the above xmllint command.

As for your problem with parsing the XML file from the DATA section, I'd have to look into that a bit when I find some more time. Perhaps the parser is doing something with the filehandle that is not compatible with DATA. Also, ikegami made an excellent point a while back: XML files should be treated like binary files, and it's better to let the XML parser handle the decoding (although my example file is currently pure 7-bit ASCII).
