PerlMonks  

Trouble Parsing HTML

by Rhodium (Scribe)
on Jan 28, 2005 at 21:46 UTC (#426130)

Rhodium has asked for the wisdom of the Perl Monks concerning the following question:

Hi all

I am having a very difficult time grasping how to extract the "selected" option from a select inside a table cell. Here is what the HTML looks like:

    <TD style="HEIGHT: 29px">
      <select name="cmbPurpose" id="cmbPurpose" tabindex="3">
        <option value="CD">Cell Development</option>
        <option value="MS">Miscellaneous R&amp;D</option>
        <option value="NP">New Package</option>
        <option value="NR">Non R&amp;D</option>
        <option value="PC">New process</option>
        <option selected="selected" value="PD">New product</option>
        <option value="SP">Sustaining Product</option>
        <option value="SW">Software and Platform</option>
        <option value="TD">Technology Development</option>
      </select>
    </TD>

What I am looking for is something that will print the two things I care about: the name of the select and the text of the selected option. For this HTML I would expect the following:

"cmbPurpose"
"New product"

I can get the first part, but I CAN'T get the second part. I am open to any module that will let me parse the values. Here is what I have so far:

    use HTML::TokeParser;

    my $p = HTML::TokeParser->new( \$webpage->content );
    while (my $token = $p->get_tag("select")) {
        my $select = $token->[1]{name};
        print "$select\n";
    }

I don't understand why I am so braindead on this simple idea.

Rhodium

The seeker of perl wisdom.

Replies are listed 'Best First'.
Re: Trouble Parsing HTML
by Util (Priest) on Jan 28, 2005 at 22:27 UTC
    An inner $p->get_tag() loop is needed.
    use strict;
    use warnings;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new( *DATA );

    while (my $token = $p->get_tag('select')) {
        my $name        = $token->[1]{name};
        my $option_text = '';

        # Scan forward to the selected <option>, stopping at </select>.
        while (my $token2 = $p->get_tag('option', '/select')) {
            last if $token2->[0] eq '/select';
            if ( $token2->[0] eq 'option' and $token2->[1]{selected} ) {
                $option_text = $p->get_trimmed_text();
                last;
            }
        }
        print "\$name = '$name', \$option_text = '$option_text'\n";
    }

    __DATA__
    <TD style="HEIGHT: 29px">
      <select name="cmbPurpose" id="cmbPurpose" tabindex="3">
        <option value="CD">Cell Development</option>
        <option value="MS">Miscellaneous R&amp;D</option>
        <option value="NP">New Package</option>
        <option value="NR">Non R&amp;D</option>
        <option value="PC">New process</option>
        <option selected="selected" value="PD">New product</option>
        <option value="SP">Sustaining Product</option>
        <option value="SW">Software and Platform</option>
        <option value="TD">Technology Development</option>
      </select>
    </TD>

      I think one of the other posters said it best: TMTOWTDI.
      However, this was the solution I was looking for. I really appreciate the help, and thanks a ton.

      Thanks again to all posters.


      Rhodium

      The seeker of perl wisdom.

Re: Trouble Parsing HTML
by saintmike (Vicar) on Jan 28, 2005 at 22:11 UTC
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $parser = HTML::TreeBuilder->new();
    my $tree   = $parser->parse( join '', <DATA> );
    $parser->eof();    # tell TreeBuilder the input is complete

    # First <select> in the document
    my $selectname = $tree->look_down( "_tag", "select" );
    print $selectname->attr('name'), "\n";

    # First <option> whose selected attribute is "selected"
    my $selected = $tree->look_down( "_tag", "option", "selected", "selected" );
    print $selected->as_trimmed_text(), "\n";

    $tree->delete();

    __DATA__
    <TD style="HEIGHT: 29px">
      <select name="cmbPurpose" id="cmbPurpose" tabindex="3">
        <option value="CD">Cell Development</option>
        <option value="MS">Miscellaneous R&amp;D</option>
        <option value="NP">New Package</option>
        <option value="NR">Non R&amp;D</option>
        <option value="PC">New process</option>
        <option selected="selected" value="PD">New product</option>
        <option value="SP">Sustaining Product</option>
        <option value="SW">Software and Platform</option>
        <option value="TD">Technology Development</option>
      </select>
    </TD>

Re: Trouble Parsing HTML
by Aristotle (Chancellor) on Jan 28, 2005 at 23:20 UTC

    Your question has already been answered well, so as an aside, I want to make you aware of HTML::TokeParser::Simple. It can make TokeParser code much less crufty to read; have a look at the examples in the docs.
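
    As a rough sketch of what that could look like for the select in the question: the helper name selected_option is made up for illustration, and the method names (is_start_tag, is_end_tag, get_attr) assume a 3.x release of HTML::TokeParser::Simple, so check them against the docs for your installed version.

```perl
use strict;
use warnings;
use HTML::TokeParser::Simple;

# Hypothetical helper (not from the thread): returns the name of the first
# <select> and the trimmed text of its selected <option>.
sub selected_option {
    my ($html) = @_;
    my $p = HTML::TokeParser::Simple->new( \$html );
    my ( $name, $text );

    while ( my $token = $p->get_token ) {
        if ( $token->is_start_tag('select') ) {
            $name = $token->get_attr('name');
        }
        elsif ( $token->is_start_tag('option')
            and defined $token->get_attr('selected') )
        {
            # get_trimmed_text is inherited from HTML::TokeParser
            $text = $p->get_trimmed_text('/option');
            last;
        }
        elsif ( $token->is_end_tag('select') ) {
            last;    # no selected option in this select
        }
    }
    return ( $name, $text );
}

my $html = <<'HTML';
<select name="cmbPurpose" id="cmbPurpose" tabindex="3">
<option value="PC">New process</option>
<option selected="selected" value="PD">New product</option>
</select>
HTML

my ( $name, $text ) = selected_option($html);
print "$name\n$text\n";
```

    The named methods replace the $token->[0] and $token->[1]{...} array indexing used in the plain TokeParser answers, which is where most of the readability gain comes from.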

    Makeshifts last the longest.

Re: Trouble Parsing HTML
by nerfherder (Monk) on Jan 28, 2005 at 22:47 UTC
    TMTOWTDI, that's for sure! :-)
    #!/usr/bin/perl -w
    use strict;
    use HTML::TokeParser;

    my $p = HTML::TokeParser->new( *DATA );

    # Print the name of the first <select>, if there is one.
    if (my $token = $p->get_tag("select")) {
        my $select = $token->[1]{name};
        print "$select\n";
    }

    # Then walk the options and print the text of the selected one.
    while (my $token2 = $p->get_tag("option")) {
        if ($token2->[1]{selected}) {
            # Read up to the closing </option> rather than the next <option>.
            my $option = $p->get_trimmed_text("/option");
            print "$option\n";
        }
    }

    __DATA__
    <select name="cmbPurpose" id="cmbPurpose" tabindex="3">
    <option value="CD">Cell Development</option>
    <option value="MS">Miscellaneous R&amp;D</option>
    <option value="NP">New Package</option>
    <option value="NR">Non R&amp;D</option>
    <option value="PC">New process</option>
    <option selected="selected" value="PD">New product</option>
    <option value="SP">Sustaining Product</option>
    <option value="SW">Software and Platform</option>
    <option value="TD">Technology Development</option>

Re: Trouble Parsing HTML
by Popcorn Dave (Abbot) on Jan 29, 2005 at 05:37 UTC
    Take a look at this node. I wrote a quick and dirty program using HTML::TokeParser to dump the output so I could see what it was getting. It may help you to clarify how to identify what you're looking for.
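
    The linked program is not reproduced here, but a minimal token dump along the same lines might look like the sketch below. The helper name dump_tokens and the output format are invented for illustration; the token layout (type in element 0, then tag and attribute hash for start tags) is the one documented for HTML::TokeParser's get_token.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper (not the program from the linked node): returns one
# descriptive line per start tag, end tag, and non-blank text token.
sub dump_tokens {
    my ($html) = @_;
    my $p = HTML::TokeParser->new( \$html );
    my @lines;

    while ( my $token = $p->get_token ) {
        my ( $type, @rest ) = @$token;
        if ( $type eq 'S' ) {
            # Start tag: $rest[0] is the tag name, $rest[1] the attribute hash
            my ( $tag, $attr ) = @rest;
            push @lines, join ' ', "S <$tag>",
                map { "$_=$attr->{$_}" } sort keys %$attr;
        }
        elsif ( $type eq 'E' ) {
            push @lines, "E </$rest[0]>";
        }
        elsif ( $type eq 'T' and $rest[0] =~ /\S/ ) {
            push @lines, "T '$rest[0]'";
        }
    }
    return @lines;
}

print "$_\n" for dump_tokens(
    '<option selected="selected" value="PD">New product</option>' );
```

    Seeing the selected attribute show up on the start-tag line makes it clearer which token to test before reading the option text.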

    HTH!

    Useless trivia: In the 2004 Las Vegas phone book there are approximately 28 pages of ads for massage, but almost 200 for lawyers.
Re: Trouble Parsing HTML
by bgreenlee (Friar) on Jan 28, 2005 at 22:07 UTC
    Here's a down-and-dirty way to do it:
    $webpage->content =~ m/<select\s+name\s*=\s*"(.*?)".*?<option\s+selected.*?>(.*?)</si;
    print "$1\n$2\n";

    -b
