Seeking HTML parsing examples

sulfericacid has asked for the wisdom of the Perl Monks concerning the following question:

I have read documents on CPAN and in my Perl Cookbook and have been very discouraged when it comes to parsing HTML data. I tried LWP::Simple and TokeParser, and none of the tutorials I came across helped in the least. Does anyone know of any documentation that shows how to parse virtually every kind of code, with examples? Someone must have one out there!

Thanks
"Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us" sulfericacid

Title edit by tye


Replies are listed 'Best First'.
Re: I need more docs by pfaut (Priest) on Dec 31, 2002 at 15:15 UTC
This example uses HTML::TokeParser to locate all of the META tags in a document.

    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;

    # get the file name to parse from the command line
    my $html = shift or die 'Specify a file name';

    # access the file with the parser
    my $p = HTML::TokeParser->new($html) or die $!;

    # loop through all tokens
    while (my $r = $p->get_token) {
        # $r->[0] tells us if it's a Start tag, End tag, Text, Comment
        # for start tags, $r->[1] tells us the type of tag
        # only process META start tags
        next unless $r->[0] eq 'S' && $r->[1] eq 'meta';
        print "Found <meta> tag\n";
        # $r->[2] is a hash ref containing attributes and values
        while (my ($k,$v) = each %{$r->[2]}) {
            print "\t$k = $v\n";
        }
    }

Here's the result from running it against an html file on my system.

    $ perl meta.pl ~/public_html/cvsbook.html
    Found <meta> tag
            content = text/html
            http-equiv = Content-Type
    Found <meta> tag
            content = Open Source Development With CVS
            name = description
    Found <meta> tag
            content = makeinfo 4.0
            name = generator

`--- print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';`
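The loop above prints each attribute as it is found. To keep the values around for later use, the same token fields can be copied into a hash. What follows is only a minimal sketch: it assumes the HTML is already in a string, and the `collect_meta` helper and the `$sample` document are made up for illustration rather than taken from pfaut's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not from the thread): return a hash ref that maps
# each <meta> tag's name (or http-equiv) attribute to its content attribute.
sub collect_meta {
    my ($html) = @_;

    # A reference to a scalar tells the parser to read from the string
    my $p = HTML::TokeParser->new(\$html) or die "Cannot set up parser";

    my %meta;
    # get_tag('meta') skips ahead to the next <meta> start tag and
    # returns [$tag, $attr, $attrseq, $text]
    while (my $tag = $p->get_tag('meta')) {
        my $attr = $tag->[1];
        my $key  = $attr->{name} // $attr->{'http-equiv'};
        next unless defined $key;
        $meta{lc $key} = $attr->{content};
    }
    return \%meta;
}

# Made-up sample document
my $sample = <<'HTML';
<html><head>
<meta http-equiv="Content-Type" content="text/html">
<meta name="description" content="Open Source Development With CVS">
<meta name="generator" content="makeinfo 4.0">
</head><body></body></html>
HTML

my $meta = collect_meta($sample);
print "$_ = $meta->{$_}\n" for sort keys %$meta;
```

Each value is then available by key, for example `$meta->{generator}`. Keys are lowercased here so that lookups do not depend on how the page author capitalised the attribute value; that is a design choice of this sketch, not something the module requires.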
Re: I need more docs by Beatnik (Parson) on Dec 31, 2002 at 15:35 UTC
No offence, but judging by your previous problem and the way you fixed it, I can only suggest you take smaller steps. Don't eat anything larger than your own head. Read up on the basics a bit more. I'm sure there are plenty of examples of how to use several of the parsers. There are modules on CPAN that depend on them. As mentioned above, there are several good nodes on it too. Of course, the earlier problem could be caused by alcohol consumption; it still is New Year's Eve (or New Year's Day in some countries). If this is taken the wrong way, I'm sorry. And yes, I do expect this node to be downvoted ;-)

Greetz
Beatnik
... Quidquid perl dictum sit, altum viditur.
Re: Re: I need more docs by data64 (Chaplain) on Dec 31, 2002 at 19:07 UTC
++ Beatnik for tactfully getting across what needs to be said. I would also suggest getting the book Perl and LWP by O'Reilly. It goes through the various approaches available in a step-by-step manner.

Just a tongue-tied, twisted, earth-bound misfit. -- Pink Floyd
Re: I need more docs by pfaut (Priest) on Dec 31, 2002 at 14:49 UTC
Have you looked at HTML::TokeParser or HTML::TokeParser::Simple? I used the former in cbhistory to reformat HTML posted in chatterbox messages.

`--- print map { my ($m)=1<<hex($_)&11?' ':''; $m.=substr('AHJPacehklnorstu',hex($_),1) } split //,'2fde0abe76c36c914586c';`
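For comparison, HTML::TokeParser::Simple wraps each token in an object, so named methods replace the array indexes. The sketch below assumes version 3 or later of the module (where the accessors are called `is_start_tag` and `get_attr`); the `named_meta` helper and the sample string are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not from the thread): return a list of
# [name, content] pairs, one per <meta> tag that has a name attribute.
sub named_meta {
    my ($html) = @_;

    # As with HTML::TokeParser, a scalar reference means "parse this string"
    my $p = HTML::TokeParser::Simple->new(\$html);

    my @pairs;
    while (my $token = $p->get_token) {
        # Skip everything except <meta ...> start tags
        next unless $token->is_start_tag('meta');
        my $name = $token->get_attr('name');
        next unless defined $name;
        push @pairs, [ $name, $token->get_attr('content') ];
    }
    return @pairs;
}

# Made-up sample input
my $sample = '<head><meta name="generator" content="makeinfo 4.0">'
           . '<meta http-equiv="Content-Type" content="text/html"></head>';

print "$_->[0] = $_->[1]\n" for named_meta($sample);
```

The second tag in the sample has no `name` attribute, so this particular helper skips it.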
Re: Re: I need more docs by sulfericacid (Deacon) on Dec 31, 2002 at 14:56 UTC
Yes, I have tried both of those actually. Again, I tried to read the documentation on CPAN, but I found it too confusing and couldn't find any other sources on it :( I know there are a lot of modules that will fit my needs, but there's nothing to aid me in understanding them. Thanks.

"Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us" sulfericacid
Re: Seeking HTML parsing examples by pg (Canon) on Dec 31, 2002 at 23:05 UTC
I am not sure whether you know about the Data::Dumper package. If you don't, you really should learn it, and if you do, you should use it in this case. I have attached a piece of code to show you how to use Data::Dumper.

    use HTML::TokeParser;
    use LWP::Simple;
    use Data::Dumper;
    use strict;

    # URL to fetch, taken from the command line
    my $url = $ARGV[0];

    # head() returns the content type, length, dates and server
    print Dumper(head($url));

    # stop here if the page cannot be fetched
    my $html = get($url) || die "failed to get $url\n";

    # dump every token so its structure is visible
    my $parser = new HTML::TokeParser(\$html);
    while (my $token = $parser->get_token()) {
        print Dumper($token);
    }
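The same idea can be tried without a network connection by parsing a string. This is a rough sketch, assuming a tiny made-up fragment as input; the `dump_tokens` helper is hypothetical, and the Data::Dumper settings are just one way to get compact output.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;
use Data::Dumper;

# Hypothetical helper (not from the thread): return one compact
# Dumper string per token found in an HTML string.
sub dump_tokens {
    my ($html) = @_;

    # Terse drops the "$VAR1 =" prefix, Indent = 0 keeps each dump on
    # one line, and Sortkeys makes the hash key order repeatable
    local $Data::Dumper::Terse    = 1;
    local $Data::Dumper::Indent   = 0;
    local $Data::Dumper::Sortkeys = 1;

    my $parser = HTML::TokeParser->new(\$html) or die "Cannot set up parser";
    my @dumps;
    while (my $token = $parser->get_token) {
        push @dumps, Dumper($token);
    }
    return @dumps;
}

# Made-up fragment: a start tag, a text run, an end tag and a comment
print "$_\n" for dump_tokens('<p class="x">Hi</p><!-- note -->');
```

Each printed line is one token. The first element is the type code (`S`, `T`, `E` or `C` for this input), which is the same `$r->[0]` field that pfaut's loop tests.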
.. and we need more information. by talexb (Chancellor) on Dec 31, 2002 at 14:45 UTC
You need to give us an example of what you want to do, and show how the CPAN modules that you've tried fail to meet those needs. We're not mind-readers.

--t. alex
Life is short: get busy!
Re: .. and we need more information. by sulfericacid (Deacon) on Dec 31, 2002 at 14:53 UTC
The modules on CPAN do meet my needs; I just can't quite understand them with the limited documentation I have. I don't really care what module it is. As long as it can be used to parse HTML data, I can work from there. I want this in general because I have run into a few scripts I couldn't write because I didn't understand this. For starters, I'd like to see an easy method to parse all meta tags from a given URL and have each one stored in its own variable. Any help would be greatly appreciated.

"Age is nothing more than an inaccurate number bestowed upon us at birth as just another means for others to judge and classify us" sulfericacid
Re: Re: .. and we need more information. by Mr. Muskrat (Canon) on Dec 31, 2002 at 15:12 UTC
If all you want to do is parse the meta tags, perhaps HTML::HeadParser will do the trick for you.
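A minimal sketch of that suggestion follows. HTML::HeadParser exposes what it finds in the head section as header fields: a `<meta name="...">` tag becomes an `X-Meta-...` header and a `<meta http-equiv="...">` tag becomes a header of that name. The `head_info` helper, the hash keys and the sample document (including its title) are assumptions made for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::HeadParser;

# Hypothetical helper (not from the thread): pull a few head fields
# out of an HTML string and return them as a hash ref.
sub head_info {
    my ($html) = @_;

    my $p = HTML::HeadParser->new;
    $p->parse($html);

    # scalar() matters here: in list context a missing header would
    # return an empty list and shift the hash keys out of place
    return {
        title        => scalar $p->header('Title'),
        content_type => scalar $p->header('Content-Type'),
        description  => scalar $p->header('X-Meta-Description'),
        generator    => scalar $p->header('X-Meta-Generator'),
    };
}

# Made-up sample document
my $sample = <<'HTML';
<html>
<head>
<title>CVS Book</title>
<meta http-equiv="Content-Type" content="text/html">
<meta name="description" content="Open Source Development With CVS">
<meta name="generator" content="makeinfo 4.0">
</head>
<body><p>Hello</p></body>
</html>
HTML

my $info = head_info($sample);
printf "%-12s %s\n", $_, $info->{$_} // '(none)' for sort keys %$info;
```

To work from a URL instead, the string could come from `get($url)` in LWP::Simple, as in pg's reply, with a check that the fetch succeeded before parsing.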
Re: Seeking HTML parsing examples by aspen (Sexton) on Jan 01, 2003 at 00:20 UTC
sulfericacid, one of the following might be of help:

From The Perl Journal, Issue 17 Spring 2000, Parsing HTML with HTML::PARSER

And from Issue 19 Fall 2000, Scanning HTML

Lastly, a (not-so-great) example of using HTML::Parser is found here (search down for HTML::Parser)

I can not vouch for the accuracy (or usefulness) of the above, but maybe one will be helpful.
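As a rough illustration of the event-driven style that HTML::Parser uses (this is not taken from any of the articles mentioned above), the sketch below registers a handler that is called once per start tag and counts tag names. The `count_tags` helper and the sample fragment are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not from the thread): count how many times each
# start tag occurs in an HTML string.
sub count_tags {
    my ($html) = @_;
    my %count;

    my $p = HTML::Parser->new(
        api_version => 3,
        # The second element is the argspec: it names what the parser
        # passes to the handler, here only the lowercased tag name
        start_h => [ sub { my ($tagname) = @_; $count{$tagname}++ }, 'tagname' ],
    );

    $p->parse($html);
    $p->eof;    # signal end of input so anything buffered is processed

    return \%count;
}

# Made-up sample fragment
my $count = count_tags('<ul><li>one</li><li>two</li></ul>');
print "$_: $count->{$_}\n" for sort keys %$count;
```

Adding `attr` to the argspec (`'tagname, attr'`) would also hand the handler a hash ref of the tag's attributes, which is the information HTML::TokeParser returns in `$r->[2]`.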
