PerlMonks  

Hello Perl Monks, I have to parse an HTML page to find the speakers listed on it and count the total number of sessions or tutorials each speaker has. This is a homework assignment, so I'm not looking for someone to 'do' it for me, just to point me in the right direction.

The following is the rough format of the page. I've removed the actual links and data but kept the structure of the markup I need to parse. Below that is the code I have written so far.

<h2>Speakers</h2>
<p>Overview of what to expect</p>
<p>Our speaker list is growing. Please check back regularly to see who we have lined up for you.</p>
<span style="font-weight: bold; font-size: 1.0em;"><a href ="http://linktosite">A Speaker</a></span>
<br />
Speaker's background
<ul>
<li><b>Tutorial: <a href="link to info about the tutorial">Description</a></b></li>
</ul>
<p style="clear:both;">
<a href ="http://link to more info about speaker">Click here</a> for more info.
</p>
<span style="font-weight: bold; font-size: 1.0em;"><a href ="http://linktosite about speaker">Another Speaker</a></span>
<br />
Information about this speaker.
<P>
He is the author of xxxx
<a href="http://link to book/"><i>Book name</i></a>, contributes
<a href="http://link to article">articles</a> to more info.
<ul>
<li><b>Session: <a href="http://link to session">Session description</a></b></li>
</ul>
<p style="clear:both;">
<a href ="http://link to more info about speaker">Click here</a> for more info.
</p>

My code so far. I've managed to pull the speakers and place them into a hash, but I have been unable to work out a way to capture the Session or Tutorial elements for each speaker. The page contains other elements with the same format, and I don't want to catch those. Some of the code is 'dodgy' as a result of different attempts to get the proper links.

Note: This is homework so I'm looking for guidance on where I am going wrong or suggestions on where I should be looking.

#!/usr/local/bin/perl
use strict;
use warnings;
use lib "$ENV{HOME}/mylib/lib/perl5";
use HTML::TableParser;
use WWW::Mechanize;
use HTML::TreeBuilder;
use LWP::Simple;

# Define debugging variable - set to positive integer to enable
my $DEBUG_FLAG = 1;

# Define variable that will contain the URL we will parse
my $URL = 'Path to URL speakers.html';

# Define our tree using HTML::TreeBuilder and parse the document
my $tree = HTML::TreeBuilder->new;
$tree->parse( get( $URL ) );
$tree->eof;    # tell the parser the input is complete

# Define our hash that will contain speaker names and their count
my %speakers;

# Look for the elements (speakers) we are searching for based on the anchor "a" tag
my @elements = $tree->look_down( _tag => "a", \&find_speakers );

# Populate our speaker hash and initialize the value to 0
for my $element ( @elements ) {
    $speakers{$element->as_text} = 0;
}

# Print list of speakers if debug mode is enabled
if ( $DEBUG_FLAG ) {
    foreach (sort keys %speakers) {
        print "$_\n";
    }
}

# Loop through each speaker - the goal here is to eventually count all
# Session and Tutorial links for each speaker
foreach (keys %speakers) {
    #check_sessions($_);
    # my $element = $tree->look_down( _tag => "a",
    #     sub { shift->as_text eq $_ } );
    # print $element->as_text() . "\n";
    # my @rightlist = $element->right();
    # print "@rightlist\n";
    # my $count = 0;
    # while ($element->look_down( _tag => "li", \&count_sessions ) )
    # {
    #     $count++;
    # }
    # print "$_ = $count\n";
    #$element->dump();
}

sub check_sessions {
    #print "@_\n";
    my $speaker = shift;
    my $element = $tree->look_down( _tag => "li" );
    my $parent  = $element->look_up( _tag => "a",
        sub { shift->as_text eq $speaker } );
    if (defined $parent) {
        # Match the literal labels; [Session:] would be a character class
        if ( $element->as_text() =~ /^\s*(?:Session|Tutorial):/ ) {
            print $element->as_text() . "\n";
            return 1;
        }
        else {
            return 0;
        }
    }
    else {
        return 0;
    }
}

# find_speakers subroutine finds the 'speakers' within the HTML being parsed
# based on the source being an anchor tag, its parent tag not being a list
# item, and it being within a span tag
sub find_speakers {
    my $element = shift;
    my ($parent_tag) = $element->lineage_tag_names;
    # Our parent tag should NOT be a list item and the element should be
    # inside a 'span' tag
    $parent_tag ne 'li' && $element->look_up( _tag => 'span' );
}

# count_sessions subroutine - this was one attempt at trying to get at the
# Session and Tutorial links
sub count_sessions {
    my $element = shift;
    print "Got to count_sessions\n";
    my ($parent_tag) = $element->lineage_tag_names;
    # The item text starts with the label, so match a prefix rather than
    # comparing the whole string
    $parent_tag eq 'ul'
        && $element->as_text =~ /^\s*(?:Session|Tutorial):/;
}

# in_list subroutine - not presently used
sub in_list {
    my $element = shift;
    my ($parent_tag) = $element->lineage_tag_names;
    # Our parent tag should be a list item and the element should be
    # inside a 'span' tag
    $parent_tag eq 'li' && $element->look_up( _tag => 'span' );
}

# find_top_speakers - placeholder code for subroutine that will find our top 3 speakers
sub find_top_speakers {
}

In reply to Parse HTML page for links and count by author by Anonymous Monk
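
One possible direction, offered as a minimal sketch rather than a finished answer: in the markup shown, the Session and Tutorial lists are not ancestors of the speaker link, so look_up from an li will not reach the speaker. They come after the speaker's span at the same level, so walking the span's right-hand siblings until the next span may be closer to what is needed. The sample HTML, the example.com links, and the %count hash below are hypothetical stand-ins, and the sketch assumes every speaker span is a direct sibling of the lists that belong to it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample modelled on the page format in the post.
my $html = <<'HTML';
<h2>Speakers</h2>
<span style="font-weight: bold;"><a href="http://example.com/a">A Speaker</a></span>
<br />
Speaker's background
<ul>
<li><b>Tutorial: <a href="http://example.com/t1">Description</a></b></li>
</ul>
<p style="clear:both;"><a href="http://example.com/more-a">Click here</a> for more info.</p>
<span style="font-weight: bold;"><a href="http://example.com/b">Another Speaker</a></span>
<br />
Information about this speaker.
<ul>
<li><b>Session: <a href="http://example.com/s1">Session description</a></b></li>
<li><b>Tutorial: <a href="http://example.com/t2">Another description</a></b></li>
</ul>
<p style="clear:both;"><a href="http://example.com/more-b">Click here</a> for more info.</p>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

my %count;    # speaker name => number of Session/Tutorial items
for my $span ( $tree->look_down( _tag => 'span' ) ) {
    my $name = $span->as_text;
    $count{$name} = 0;

    # In list context, right() returns every sibling after this span.
    for my $sibling ( $span->right ) {
        next unless ref $sibling;           # plain text nodes are strings
        last if $sibling->tag eq 'span';    # the next speaker starts here

        # look_down also searches inside the sibling, so a list nested
        # under an unclosed paragraph is still found.
        for my $li ( $sibling->look_down( _tag => 'li' ) ) {
            $count{$name}++
                if $li->as_text =~ /^\s*(?:Session|Tutorial):/;
        }
    }
}

# Busiest speakers first, ties broken alphabetically.
for my $name ( sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count ) {
    print "$name = $count{$name}\n";
}

$tree->delete;    # release the tree's memory
```

Stopping at the next span is what keeps one speaker's lists from being credited to another. If the real page has other bold spans that are not speakers, the span filter would need the same kind of check that find_speakers already applies.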
