http://www.perlmonks.org?node_id=719339

dragonchild has asked for the wisdom of the Perl Monks concerning the following question:

I'm using HTML::Parser and am getting "HTML parser error : Tag foo invalid" errors. I am parsing non-HTML that's formatted as HTML. I can't find the place where it checks against a list of known elements. Help?

My criteria for good software:
  1. Does it work?
  2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

Re: HTML::Parser and "Invalid foo tag"
by JavaFan (Canon) on Oct 24, 2008 at 13:56 UTC
    I don't think HTML::Parser validates against a DTD, or even against a list of allowed tags. (Since it's event-based and can parse a document in chunks, it couldn't validate anyway; it would need the entire document for that.)

    What's more, I can't find anything in HTML::Parser (or in 'strings Parser.so') that even remotely matches the error you're getting. Which suggests to me that the error isn't generated by HTML::Parser.
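The point about event-based, chunked parsing can be illustrated with a minimal sketch. It assumes HTML::Parser is installed; `collect_events` is a hypothetical helper written for this illustration, not something from the original post or from the module.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not from the thread): feed the parser a list of
# chunks and return the events it reported, in order.
sub collect_events {
    my @chunks = @_;
    my @events;

    my $p = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,    # report each run of text as a single event
        start_h => [ sub { push @events, "start:$_[0]" }, 'tagname' ],
        end_h   => [ sub { push @events, "end:$_[0]" },   'tagname' ],
        text_h  => [ sub { push @events, "text:$_[0]" },  'dtext' ],
    );

    $p->parse($_) for @chunks;    # a tag may be split across chunks
    $p->eof;                      # flush anything still buffered

    return @events;
}

# "foo" is not an HTML element, and both of its tags are split mid-name.
print "$_\n" for collect_events( '<fo', 'o>hi</f', 'oo>' );
```

The parser only ever sees the chunk it is handed plus whatever it has buffered, so it reports the unknown `foo` element as ordinary start, text, and end events rather than checking it against any list.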

      I say he needs to prove it :)
      #!/usr/bin/perl --
      use strict;
      use warnings;
      use HTML::Parser;

      my $p = HTML::Parser->new(
          api_version => 3,
          default_h   => [ sub { print join ' | ', grep defined, @_, "\n" }, "event,tag,text," ],
          # strict_names => 1,
          xml_mode => 1,
      );
      $p->parse(
          '<boo><foo><shoo><Moo><COW></BOO> <html><body>hi <br> <a href="1"> hello </a> <boo><foo><shoo><Moo><COW></BOO> </body></html>'
      );
      __END__
      start_document | |
      start | boo | <boo> |
      start | foo | <foo> |
      start | shoo | <shoo> |
      start | Moo | <Moo> |
      start | COW | <COW> |
      end | /BOO | </BOO> |
      text | |
      start | html | <html> |
      start | body | <body> |
      text | hi |
      start | br | <br> |
      text | |
      start | a | <a href="1"> |
      text | hello |
      end | /a | </a> |
      text | |
      start | boo | <boo> |
      start | foo | <foo> |
      start | shoo | <shoo> |
      start | Moo | <Moo> |
      start | COW | <COW> |
      end | /BOO | </BOO> |
      text | |
      end | /body | </body> |
      end | /html | </html> |
      Google suggests it's xmllint that's complaining.