meta tag extraction with TokeParser

by Anonymous Monk
on Mar 20, 2006 at 20:15 UTC ( [id://538034]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

When I searched for how to do this I found one node here (code shown below), but it doesn't run cleanly. It produces the right output, but it also emits some 30 lines of warnings about an uninitialized value.

    my %meta;
    my $htm2 = HTML::TokeParser->new( \$src );
    while (my $token = $htm2->get_token) {
        next if $token->[1] ne 'meta' && $token->[0] ne 'S';
        $meta{$token->[2]{name}} = $token->[2]{content};
    }

Ideally, I want to collect all meta tags and store each of them into a hash with the meta name as the key. I'm already using TokeParser for scraping, so please don't suggest I also use TokeParser::Simple. I read through the docs and can't seem to find any information on what I am looking for.

Also, if a modified version of the code above works, can you explain line for line what it's doing? I'm having trouble piecing things together.

My last question is this. Can I extract different parts of an HTML document with TokeParser in one run? Or must I run them all separately?

I can extract the title tag just fine, but only when I make a new reference to TokeParser. It seems like a waste of resources to call the module AGAIN when the html dump is still in memory, right? Or does the data change after each time you loop over tokens?

Replies are listed 'Best First'.
What's wrong with HTML::TokeParser::Simple?
by Ovid (Cardinal) on Mar 20, 2006 at 23:34 UTC

    I'm already using TokeParser for scraping, so please don't suggest I also use TokeParser::Simple.

    I don't understand this. Perhaps you weren't aware of this, but a deliberate design decision of HTML::TokeParser::Simple was to be a drop in replacement of HTML::TokeParser. Using my module means changing this:

    use HTML::TokeParser;
    my $parser = HTML::TokeParser->new( \$src );

    To this:

    use HTML::TokeParser::Simple;
    my $parser = HTML::TokeParser::Simple->new( \$src );

    After that change, all of your code still works. I did that deliberately so that folks could take advantage of the features of my module without having to rewrite their code. Your snippet changes from:

    my %meta;
    my $htm2 = HTML::TokeParser->new( \$src );
    while (my $token = $htm2->get_token) {
        next if $token->[1] ne 'meta' && $token->[0] ne 'S';
        $meta{$token->[2]{name}} = $token->[2]{content};
    }

    To:

    my %meta;
    my $htm2 = HTML::TokeParser::Simple->new( \$src );
    while (my $token = $htm2->get_token) {
        next unless $token->is_start_tag('meta');
        $meta{$token->get_attr('name')} = $token->get_attr('content');
    }

    I think most would agree that this is far more readable (and therefore far more maintainable). So if you're still not convinced, that's OK, but please don't try to suggest to others that HTML::TokeParser::Simple is not a good alternative to HTML::TokeParser. It's far easier to understand and use. You gain a lot and lose nothing.

    Cheers,
    Ovid

    New address of my CGI Course.
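
For readers who want to try the drop-in replacement described above, here is a minimal, self-contained sketch. The sample HTML and the extra `defined` guard are assumptions added for illustration and are not part of the reply: a meta tag without a name attribute (such as an http-equiv one) would otherwise be stored under an undefined key and trigger the same kind of warning the question describes.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Sample input, made up for this sketch.
my $src = <<'HTML';
<head>
<meta name="author" content="A. Monk">
<meta http-equiv="refresh" content="30">
</head>
HTML

my %meta;
my $parser = HTML::TokeParser::Simple->new( \$src );

while ( my $token = $parser->get_token ) {
    # Only <meta ...> start tags are of interest.
    next unless $token->is_start_tag('meta');

    # get_attr returns undef when the attribute is absent,
    # so skip meta tags that carry no name (assumed guard).
    my $name = $token->get_attr('name');
    next unless defined $name;

    $meta{$name} = $token->get_attr('content');
}

print "$_ = $meta{$_}\n" for sort keys %meta;
```

With this input only the author entry should end up in %meta; the http-equiv tag is skipped by the guard.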

Re: meta tag extraction with TokeParser
by Thelonius (Priest) on Mar 21, 2006 at 00:18 UTC
    I think instead of this:
    next if $token->[1] ne 'meta' && $token->[0] ne 'S';
    you meant:
    next if $token->[0] ne 'S' || $token->[1] ne 'meta';
    Note the order of evaluation and the || instead of &&.
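
To see why the && version lets unwanted tokens through, the two tests can be compared on hand-built token arrays. This is only a sketch: the sample tokens and the helper names `skip_and` and `skip_or` are made up for illustration and merely mimic the [type, tag, attributes] layout that get_token returns.

```perl
use strict;
use warnings;

# Hand-built stand-ins for get_token results: [type, tag or text, attributes].
my @tokens = (
    [ 'S', 'meta',  { name => 'author', content => 'me' } ],
    [ 'S', 'title', {} ],
    [ 'E', 'title' ],
    [ 'T', 'some text' ],
);

# Test from the question: skips only when BOTH parts are true.
sub skip_and { my $t = shift; return $t->[1] ne 'meta' && $t->[0] ne 'S' }

# Test suggested in this reply: skips when EITHER part is true.
sub skip_or  { my $t = shift; return $t->[0] ne 'S' || $t->[1] ne 'meta' }

# Collect "type:tag" for every token each test lets through.
my @kept_and = map { "$_->[0]:$_->[1]" } grep { !skip_and($_) } @tokens;
my @kept_or  = map { "$_->[0]:$_->[1]" } grep { !skip_or($_)  } @tokens;

print "&& keeps: @kept_and\n";
print "|| keeps: @kept_or\n";
```

The && test lets every start tag through, including ones such as the title tag that have no name attribute, and that is where the uninitialized-value warnings come from. The || test keeps only the meta start tag.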

    But it might be even clearer if you say:

    if ($token->[0] eq 'S' && $token->[1] eq 'meta' && $token->[2]{name}) {
        $meta{$token->[2]{name}} = $token->[2]{content};
    }
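
Since the question also asks for a line-by-line explanation, here is a commented, runnable sketch of the corrected loop. The sample HTML is an assumption made up for this example, and the single condition is split into separate `next` statements purely so that each step can carry its own comment.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample document, made up for this sketch.
my $src = <<'HTML';
<html><head>
<title>Example page</title>
<meta name="description" content="A test page">
<meta name="keywords" content="perl, html">
<meta http-equiv="refresh" content="30">
</head><body><p>Hello</p></body></html>
HTML

my %meta;

# Build a parser that reads from the string held in $src.
my $parser = HTML::TokeParser->new( \$src )
    or die "Cannot create parser";

# get_token returns one array ref per token, or undef at end of input.
while ( my $token = $parser->get_token ) {

    # Element 0 is the token type; 'S' marks a start tag.
    next if $token->[0] ne 'S';

    # For start tags, element 1 is the lowercased tag name.
    next if $token->[1] ne 'meta';

    # Element 2 is a hash ref of the tag's attributes.
    my $attr = $token->[2];

    # Skip meta tags with no name attribute (e.g. http-equiv).
    next unless defined $attr->{name};

    # Store the content under the meta name.
    $meta{ $attr->{name} } = $attr->{content};
}

print "$_ = $meta{$_}\n" for sort keys %meta;
```

With this input the hash should hold the description and keywords entries, and the loop should run without uninitialized-value warnings.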
Re: meta tag extraction with TokeParser
by saintmike (Vicar) on Mar 20, 2006 at 20:59 UTC
    next if $token->[1] ne 'meta' && $token->[0] ne 'S';
    $meta{$token->[2]{name}} = $token->[2]{content};
    If you're seeing warnings about an 'uninitialized value', you should check if $token->[x] exists and is defined before comparing it to a different value or following the alleged reference.
      I tried adding the following, but nothing changed:
      if ($token->[2] ne '') { $meta{$token->[2]{name}} = $token->[2]{content}; }
        Again, you're not checking if it's defined. Use this instead:
        if( defined $token->[2] ) { ... }
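
The last part of the question, whether several things can be extracted in one run, is not taken up in the replies. The sketch below collects the title and the named meta tags in a single pass over one parser object. It assumes sample HTML made up for this example. HTML::TokeParser reads its input as a stream, so a loop that has run to the end leaves no tokens for a second loop on the same object; doing all the checks inside one loop avoids creating a second parser.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Sample document, made up for this sketch.
my $src = <<'HTML';
<html><head>
<title> One pass </title>
<meta name="description" content="Title and meta together">
</head><body></body></html>
HTML

my ( $title, %meta );
my $parser = HTML::TokeParser->new( \$src );

while ( my $token = $parser->get_token ) {
    next if $token->[0] ne 'S';    # only start tags matter here

    if ( $token->[1] eq 'title' ) {
        # Read the text up to the closing </title> tag, whitespace trimmed.
        $title = $parser->get_trimmed_text('/title');
    }
    elsif ( $token->[1] eq 'meta' && defined $token->[2]{name} ) {
        # Store named meta tags under their name.
        $meta{ $token->[2]{name} } = $token->[2]{content};
    }
}

print "title: $title\n";
print "$_ = $meta{$_}\n" for sort keys %meta;
```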

Node Type: perlquestion [id://538034]
Approved by ChemBoy