PerlMonks  

HTML::HTML5::Parser weirdness

by djh (Novice)
on Feb 23, 2020 at 16:06 UTC (#11113347)

djh has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to use HTML::HTML5::Parser to parse some HTML pages I stored in files. My program seems to work just fine except with one file and I'm baffled as to what's happening. My program sits in a loop processing files from a list. I've added debugging so it prints the name of the file and then a dump of the document as parsed and then it goes on to process the document except in this one case. So the relevant bit of code is:

    for my $filename (@files) {
        print "$BASE_DIR$filename\n";
        #next if $filename eq '2020-02-17-00:10:01.html';
        my $doc = $parser->parse_file($BASE_DIR . $filename);
        print "doc=", Dumper($doc), $doc->toString;

and the output for the problematic file is:

    {my home directory}/met-office-datahub/met-office-forecasts/2020-02-17-00:10:01.html
    doc=$VAR1 = bless( do{\(my $o = '93912432739248')}, 'XML::LibXML::Document' );
    <?xml version="1.0" encoding="windows-1252"?>
    <html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>
    Can't call method "toString" on an undefined value ...

Now I've checked the contents of that file and it actually starts (just like all the others):

    <!DOCTYPE html>
    <html lang="en">
    <head>
    <meta charset="utf-8"/>
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">

I can't figure out where those strange alleged file contents are coming from, or why this affects just that one file. I'm particularly puzzled by the weird <head/> and <body/> tags. I've searched for those strings in my home directory and in /usr/lib/perl5, and done a web search, but haven't found anything.

So I'd be very grateful if anybody has any ideas on techniques to figure out what the problem is, or happens to recognize it :)

Replies are listed 'Best First'.
Re: HTML::HTML5::Parser weirdness
by Corion (Pope) on Feb 23, 2020 at 16:57 UTC

    You don't show us how you populate $filename.

    Maybe the file does not exist, or $filename has whitespace at the end. When I try a mock program without an existing file, I get a very similar output:

        #!perl
        use strict;
        use warnings;
        use XML::LibXML;
        use HTML::HTML5::Parser;

        my $p = HTML::HTML5::Parser->new();
        my $doc = $p->parse_file('file:///this/doesnotexist.html',
            { ignore_http_response_code => 1, },
        );
        print $doc->toString;

    Output:

        Use of uninitialized value $c_type in pattern match (m//) at /home/corion/perl5/lib/perl5/HTML/HTML5/Parser.pm line 59.
        <?xml version="1.0" encoding="windows-1252"?>
        <html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>

    If you can show us a short, self-contained example (SSCCE), that would remove a lot of guesswork and may help us find the root of the problem more quickly.
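
A minimal sketch of the check suggested above, using only core Perl. The helper name `describe_file` is hypothetical and not from the thread. It reports whether each path given on the command line is missing, not a plain file, unreadable, or empty, and prints the name in brackets so that stray whitespace becomes visible. It is a quick way to test the assumption that the parser is really being handed a readable, non-empty file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): classify a path before it is
# handed to the parser, so a missing, unreadable, or empty file is reported
# explicitly instead of surfacing later as an empty parsed document.
sub describe_file {
    my ($path) = @_;
    return "missing"    unless -e $path;
    return "not a file" unless -f $path;
    return "unreadable" unless -r $path;
    return "empty"      unless -s $path;
    return "ok";
}

# Brackets around the name make leading or trailing whitespace visible.
for my $path (@ARGV) {
    printf "[%s] %s (%d bytes)\n",
        $path, describe_file($path), (-s $path) || 0;
}
```

Run it with the same strings the loop would pass to `parse_file`, for example the concatenation of `$BASE_DIR` and each file name.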

      Thanks for reading my post and taking the trouble to reply.

      I did show you exactly how I populated $filename of course:

      for my $filename (@files)

      but I expect you meant how I populated @files, which was using readdir

          opendir my $dir, $BASE_DIR or die "Cannot open $BASE_DIR directory: $!";
          my @all_files = readdir $dir;
          closedir $dir;
          my @files = sort grep { $_ =~ /\-00:/ } @all_files;

      I didn't think that was important, since I specifically said I'd checked that the file existed and I even posted some of its contents. And the fact that the loop works for all the other files in that directory is a strong hint that there are no whitespace problems or the like.

      But the fact that you got a similar error with a non-existent file suggests to me that the problem is in the module, which was why I looked for <head/> in /usr/lib/perl5. So I'll go and look further to see if I can isolate where that string is coming from.

      While I appreciate the benefits of SSCCE, I think the effort I would need to construct one in this case outweighs the benefits. But I may do so if I'm still stuck after a while.

      PS Why does perlmonks format code at column 70? I could vaguely understand column 72 if I was a punched-card FORTRAN programmer but I use an 80-column terminal and try to stay within that!

        > While I appreciate the benefits of SSCCE, I think the effort I would need to construct one in this case outweighs the benefits.

        Because very often the process of creating a SSCCE will find the problem for you. In the process of creating a SSCCE you are forced to examine your assumptions. Often you will find they aren't correct. You also make it easy for us to reproduce your problem - why should we do all the work? See "I know what I mean. Why don't you?".

        Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
        > Why does perlmonks format code at column 70?

        You can configure that in the Display Settings.

        map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
Re: HTML::HTML5::Parser weirdness
by tobyink (Canon) on Feb 23, 2020 at 20:08 UTC

    As per Corion's post, make sure you're able to read the file okay — file exists, has correct permissions, etc. Try reading the file using normal Perl open, readline, etc, then sending the string to parse_string on the parser object, rather than using parse_file.
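
The suggestion above could be sketched roughly as follows. The helper `slurp_bytes` is a hypothetical name, and reading the file as raw bytes is an assumption (it leaves any character-set detection to the parser). The call to `parse_string` is only attempted if HTML::HTML5::Parser is installed. If the byte count printed is what you expect but the parsed document is still empty, the problem is more likely in the parsing step than in how the file is located or opened.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as raw bytes, dying with the
# operating system's error message if it cannot be opened.
sub slurp_bytes {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open '$path': $!";
    local $/;                       # slurp mode: read to end of file
    my $content = <$fh>;
    close $fh;
    return defined $content ? $content : '';
}

my $path = shift @ARGV;
if (defined $path) {
    my $html = slurp_bytes($path);
    printf "read %d bytes from [%s]\n", length $html, $path;

    # Only attempt the parse if the module is available on this system.
    if (eval { require HTML::HTML5::Parser; 1 }) {
        my $doc = HTML::HTML5::Parser->new->parse_string($html);
        print $doc->toString;
    }
}
```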

    If that still doesn't work, try emailing the file to the author of that module, tobyink @ cpan.org. He can be very helpful sometimes. :)

      Hi Toby. Very pleased to see you here. The file is just like twenty or thirty others, scraped by a cron job and I've checked permissions and content several times. I'll try using parse_string etc if I put together an SSCCE.

      I think Corion's post indicates that the problem isn't with the particular file, although it does seem that particular file is triggering the problem. But the identical result he got with a non-existent file is a strong suggestion that the problem lies elsewhere. In particular finding out where those funky <head/> and <body/> strings come from is my main focus at present.

        The document is being parsed as having no contents in the head and no contents in the body. Head and body elements are still parsed though, because in the HTML5 model, all HTML documents have a head and a body. You're using XML::LibXML to output the document, and XML::LibXML will typically output an empty HTML element like <blah />. So that's why you're seeing those in the output. I wouldn't expect that they're in the input.

        The problem is that it's not seeing anything at all in the head and body in the input. Probably because of a parsing error too extreme to recover from. But I'd need to see the file to be sure.

        > In particular finding out where those funky <head/> and <body/> strings come from is my main focus at present.

        I would suggest that is a waste of time. It’s almost certainly indicative of the lack of head and body content. Those are merely how tags/elements/nodes without content are rendered. They are exactly equivalent to this style <head></head>. The problem lies elsewhere.
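
The point made in the two replies above can be illustrated without any HTML input at all. The sketch below builds an empty html/head/body skeleton by hand with XML::LibXML and serializes it; `empty_skeleton` is a hypothetical helper name, and the code is skipped if XML::LibXML is not installed. The childless elements should come out in the short `<head/>` form, which suggests those strings are produced by the serializer rather than stored anywhere on disk. The `hasChildNodes` check at the end is one way to test a parsed document for the "nothing in head or body" condition directly.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build <html><head/><body/></html> by hand, with no
# parser involved, to show how childless elements are serialized.
sub empty_skeleton {
    require XML::LibXML;
    my $doc  = XML::LibXML::Document->new('1.0', 'UTF-8');
    my $html = $doc->createElement('html');
    $doc->setDocumentElement($html);
    $html->appendChild($doc->createElement($_)) for qw(head body);
    return $doc;
}

if (eval { require XML::LibXML; 1 }) {
    my $doc = empty_skeleton();
    print $doc->toString;

    # The same test can be applied to a document returned by the parser.
    for my $name (qw(head body)) {
        my ($node) = $doc->getElementsByTagName($name);
        printf "%s has children: %s\n",
            $name, ($node->hasChildNodes ? "yes" : "no");
    }
}
```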

      tobyink wrote:

      "If that still doesn't work, try emailing the file to the author of that module, tobyink @ cpan.org. He can be very helpful sometimes. :)"

      Hi Toby,

      I sent an email to you enclosing the files on 2020-02-27 and I sent another as a reminder this morning. But I haven't received an acknowledgment or anything else in reply. Did you receive them?

      Thanks, Dave

        As I said previously, I had written to tobyink @ cpan.org as he invited but received no reply. So I posted here to enquire but have heard nothing from him since. I can see that he's been visiting this site and posting in other threads since, so I'm not sure what to infer. I still have the problem of his module refusing to process a particular HTML file. Does anybody else have any suggestions as to how to proceed?
