http://www.perlmonks.org?node_id=223323

sulfericacid has asked for the wisdom of the Perl Monks concerning the following question:

I have read the documentation on CPAN and in my Perl Cookbook and have been very discouraged when it comes to parsing HTML data. I tried LWP::Simple and TokeParser, and none of the tutorials I came across helped in the least. Does anyone know of any documentation that explains how to parse virtually any HTML and shows examples? Someone must have one out there!

Thanks
"Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us" sulfericacid

Title edit by tye

Replies are listed 'Best First'.
Re: I need more docs
by pfaut (Priest) on Dec 31, 2002 at 15:15 UTC

    This example uses HTML::TokeParser to locate all of the META tags in a document.

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;

    # get the file name to parse from the command line
    my $html = shift or die 'Specify a file name';

    # access the file with the parser
    my $p = HTML::TokeParser->new($html) or die $!;

    # loop through all tokens
    while (my $r = $p->get_token) {
        # $r->[0] tells us if it's a Start tag, End tag, Text, Comment
        # for start tags, $r->[1] tells us the type of tag
        # only process META start tags
        next unless $r->[0] eq 'S' && $r->[1] eq 'meta';
        print "Found <meta> tag\n";

        # $r->[2] is a hash ref containing attributes and values
        while (my ($k,$v) = each %{$r->[2]}) {
            print "\t$k = $v\n";
        }
    }

    Here's the result from running it against an html file on my system.

    $ perl meta.pl ~/public_html/cvsbook.html
    Found <meta> tag
            content = text/html
            http-equiv = Content-Type
    Found <meta> tag
            content = Open Source Development With CVS
            name = description
    Found <meta> tag
            content = makeinfo 4.0
            name = generator
    --- print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';
Re: I need more docs
by Beatnik (Parson) on Dec 31, 2002 at 15:35 UTC
    No offence, but judging by your previous problem and the way you fixed it, I can only suggest you take smaller steps. Don't eat anything larger than your own head. Read up on the basics a bit more. I'm sure there are plenty of examples of how to use several of the parsers. There are modules on CPAN that depend on them. As mentioned above, there are several good nodes on it too.

    Of course, the earlier problem could have been caused by alcohol consumption; it still is New Year's Eve (or New Year's Day in some countries).

    If this is taken the wrong way, I'm sorry. And yes, I do expect this node to be downvoted ;-)

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.

      ++ Beatnik for tactfully getting across what needs to be said.

      I would also suggest getting the book Perl and LWP by O'Reilly. It goes through the various approaches available in a step-by-step manner.


      Just a tongue-tied, twisted, earth-bound misfit. -- Pink Floyd

Re: I need more docs
by pfaut (Priest) on Dec 31, 2002 at 14:49 UTC
      Yes, I have tried both of those actually. Again, I tried to read the documentation on CPAN, but I found it too confusing and couldn't find any other sources on it :( I know there are a lot of modules that will fit my needs, but there's nothing to help me understand them.

      Thanks.

      "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"

      sulfericacid
Re: Seeking HTML parsing examples
by pg (Canon) on Dec 31, 2002 at 23:05 UTC

    I am not sure whether you know about the Data::Dumper package. If you don't know, you really should learn it, and if you know, you should use it in this case.

    I attached a piece of code to show you how to use Data::Dumper.

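    use strict;
    use warnings;
    use HTML::TokeParser;
    use LWP::Simple;
    use Data::Dumper;

    my $url = $ARGV[0] or die "Usage: $0 URL\n";

    # dump the values head() returns (content type, length, ...)
    print Dumper(head($url));

    # fetch the page; get() returns undef on failure
    my $html = get($url);
    die "failed to get $url\n" unless defined $html;

    # dump every token to see the structure the parser returns
    my $parser = HTML::TokeParser->new(\$html);
    while (my $token = $parser->get_token()) {
        print Dumper($token);
    }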
.. and we need more information.
by talexb (Chancellor) on Dec 31, 2002 at 14:45 UTC

    You need to give us an example of what you want to do, and show how the CPAN modules that you've tried fail to meet those needs. We're not mind-readers.

    --t. alex
    Life is short: get busy!
      The modules on CPAN do meet my needs, I just can't quite understand them with the limited documentation I have. I don't really care what module it is, as long as it can be used to parse html data, I can work from there.

      I want to learn this in general, because I have run into a few scripts I couldn't write without understanding it. For starters, I'd like to see an easy way to parse all meta tags from a given URL and store each one in its own variable.

      Any help would be greatly appreciated.

      "Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us"
      sulfericacid

        If all you want to do is parse the meta tags, perhaps HTML::HeadParser will do the trick for you.
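
        A minimal sketch of that idea, assuming the HTML is already in a string (the sample document and the variable names are made up for illustration; in practice the text could come from LWP::Simple's get). HTML::HeadParser exposes each <meta name="..."> value as an X-Meta-<Name> header, and each <meta http-equiv="..."> value under the header name itself:

        ```perl
        use strict;
        use warnings;
        use HTML::HeadParser;

        # Sample document, invented for this sketch.
        my $html = '<html><head><title>Sample page</title>'
                 . '<meta name="description" content="A short demo page">'
                 . '<meta name="generator" content="by hand">'
                 . '<meta http-equiv="Content-Type" content="text/html">'
                 . '</head><body><p>Hello</p></body></html>';

        # Parse the head section; the parser stops once it reaches the body.
        my $p = HTML::HeadParser->new;
        $p->parse($html);

        # Each meta tag ends up in its own variable.
        my $description  = $p->header('X-Meta-Description');
        my $generator    = $p->header('X-Meta-Generator');
        my $content_type = $p->header('Content-Type');
        my $title        = $p->header('Title');

        print "title:        $title\n";
        print "description:  $description\n";
        print "generator:    $generator\n";
        print "content type: $content_type\n";
        ```

        header() returns undef for a meta tag that is not present, so it is worth checking with defined before using a value from a page you do not control.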

Re: Seeking HTML parsing examples
by aspen (Sexton) on Jan 01, 2003 at 00:20 UTC
    sulfericacid, one of the following might be of help: I can not vouch for the accuracy (or usefulness) of the above, but maybe one will be helpful.