PerlMonks  

Parse HTML page for links and count by author

by Anonymous Monk
on Jun 07, 2014 at 10:42 UTC ( #1089133=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks, I have to parse an HTML page to find the speakers listed on it and count the total number of sessions or tutorials each speaker has. This is a homework assignment, so I'm not looking for someone to 'do' it for me, just to point me in the right direction.

The following is the rough format of the page. I've replaced the actual links and data with placeholders but kept the structure of the markup I need to parse. Below that is the code I have written so far.

<h2>Speakers</h2>
<p>Overview of what to expect</p>
<p>Our speaker list is growing. Please check back regularly to see who we have lined up for you.</p>
<span style="font-weight: bold; font-size: 1.0em;"><a href ="http://linktosite">A Speaker</a></span>
<br />
Speaker's background
<ul>
<li><b>Tutorial: <a href="link to info about the tutorial">Description</a></b></li>
</ul>
<p style="clear:both;">
<a href ="http://link to more info about speaker">Click here</a> for more info.
</p>
<span style="font-weight: bold; font-size: 1.0em;"><a href ="http://linktosite about speaker">Another Speaker</a></span>
<br />
Information about this speaker.
<P>
He is the author of xxxx
<a href="http://link to book/"><i>Book name</i></a>, contributes
<a href="http://link to article">articles</a> to more info.
<ul>
<li><b>Session: <a href="http://link to session">Session description</a></b></li>
</ul>
<p style="clear:both;">
<a href ="http://link to more info about speaker">Click here</a> for more info.
</p>

My code so far. I've managed to pull the speakers and place them into a hash, but I have been unable to work out how to capture the Session or Tutorial elements. There are other elements of the same format, but I don't want to catch those. Some of the code is 'dodgy' as a result of different attempts to get the proper links.

Note: This is homework so I'm looking for guidance on where I am going wrong or suggestions on where I should be looking.

#!/usr/local/bin/perl
use strict;
use warnings;
use lib "$ENV{HOME}/mylib/lib/perl5";
use HTML::TableParser;
use WWW::Mechanize;
use HTML::TreeBuilder;
use LWP::Simple;

# Define debugging variable - set to positive integer to enable
my $DEBUG_FLAG = 1;

# Define variable that will contain the URL we will parse
my $URL = 'Path to URL speakers.html';

# Define our tree using HTML::TreeBuilder and parse the document
my $tree = HTML::TreeBuilder->new;
$tree->parse( get( $URL ) );

# Define our hash that will contain speaker names and their count
my %speakers;

# Look for the elements (speakers) we are searching for based on the anchor "a" tag
my @elements = $tree->look_down( _tag => "a", \&find_speakers );

# Populate our speaker hash and initialize the value to 0
for my $element ( @elements ) {
    $speakers{$element->as_text} = 0;
}

# Print list of speakers if debug mode is enabled
if ( defined $DEBUG_FLAG ) {
    foreach (sort keys %speakers) {
        print "$_\n";
    }
}

# Loop through each speaker - the goal here is eventually count all Session and Tutorial
# links for each speaker
foreach (keys %speakers) {
    #check_sessions($_);
    # my $element = $tree->look_down( _tag => "a",
    #     sub { shift->as_text eq $_ } );
    # print $element->as_text() . "\n";
    # my @rightlist = $element->right();
    # print "@rightlist\n";
    # my $count = 0;
    # while ($element->look_down( _tag => "li", \&count_sessions ) )
    # {
    #     $count++;
    # }
    # print "$_ = $count\n";
    #$element->dump();
}

sub check_sessions {
    #print "@_\n";
    my $speaker = shift;
    my $element = $tree->look_down( _tag => "li" );
    my $parent = $element->look_up( _tag => "a",
        sub { shift->as_text eq $speaker } );
    if (defined $parent) {
        if ( $element->as_text() =~ /[Session:]|[Tutorial:]/ ) {
            print $element->as_text() . "\n";
            return 1;
        } else {
            return 0;
        }
    } else {
        return 0;
    }
}

# find_speakers subroutine finds the 'speakers' within the HTML being parsed
# based on the source being an anchor tag, its parent tag not being a line and
# it is within a span tag
sub find_speakers {
    my $element = shift;
    my ($parent_tag) = $element->lineage_tag_names;
    # Our parent tag should NOT be a line and the element should be a 'span' tag
    $parent_tag ne 'li' && $element->look_up( _tag => 'span' );
}

# count_sessions subroutine - this was one attempt at trying to get at the Session and tutorial links
sub count_sessions {
    my $element = shift;
    print "Got to count_sessions\n";
    my ($parent_tag) = $element->lineage_tag_names;
    $parent_tag eq 'ul' && ( $element->as_text eq "Session" || $element->as_text eq "Tutorial" );
}

# in_list subroutine - not presently used
sub in_list {
    my $element = shift;
    my ($parent_tag) = $element->lineage_tag_names;
    # Our parent tag should be a line and the element should be a 'span' tag
    $parent_tag eq 'li' && $element->look_up( _tag => 'span' );
}

# find_top_speakers - placeholder code for subroutine that will find our top 3 speakers
sub find_top_speakers {
}

Re: Parse HTML page for links and count by author
by RichardK (Priest) on Jun 07, 2014 at 10:57 UTC

    There's lots of good advice on how to work out why your code isn't doing what you want in the Basic debugging checklist.

    Why not use the Perl debugger (see perldebug) to step through your code and see what it actually does?
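
As an illustration of the kind of isolated check that this advice points toward, the pattern used in check_sessions can be tried on its own, away from the tree-walking code. This is a minimal sketch, not something from the reply itself: the helper names and the three sample strings are made up for the example. It compares the bracketed pattern from the question with a grouped alternation.

```perl
use strict;
use warnings;

# The pattern from check_sessions. Each [...] is a character class, so it
# matches any ONE character from the set, not the whole word.
sub matches_char_class {
    my ($text) = @_;
    return $text =~ /[Session:]|[Tutorial:]/ ? 1 : 0;
}

# Alternation inside a group matches a whole word followed by a colon.
sub matches_alternation {
    my ($text) = @_;
    return $text =~ /(?:Session|Tutorial):/ ? 1 : 0;
}

# Hypothetical sample strings resembling text on the page.
for my $text ( 'Session: Intro', 'Tutorial: Basics', 'Click here for more info.' ) {
    printf "%-28s char class=%d alternation=%d\n",
        $text, matches_char_class($text), matches_alternation($text);
}
```

The character-class version reports a match for all three strings, including the unrelated "Click here" text, while the alternation only matches the first two.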

Re: Parse HTML page for links and count by author
by poj (Curate) on Jun 07, 2014 at 16:57 UTC
    You can use the same basic routine you have for speakers to extract the other data, just filter accordingly with some logic.
    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $html = do{local $/;<DATA>};
    my $tree = HTML::TreeBuilder->new;
    $tree->parse( $html );
    my @nodes = $tree->look_down( _tag => "a", \&a_tag );

    sub a_tag {
        my ($element) = @_;
        my $parent = $element->parent;
        my $text = $element->as_text;
        if ($parent->tag eq 'span'){
            print "Speaker = $text\n";
            # set current author
        } elsif ($parent->tag eq 'b'
            && $parent->as_text =~/(Session|Tutorial)/){
            print "$1 = $text\n";
            # add record to current author
        }
    }
    poj
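
The two placeholder comments in a_tag ("set current author" and "add record to current author") could be filled in roughly as follows. This is only a sketch under stated assumptions: to stay free of non-core modules it does not parse HTML, and instead takes a hypothetical list of [ parent tag, parent text, link text ] triples, one per anchor, in the order a_tag would see them. The sub name count_by_speaker and the triples are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical input: one [ parent_tag, parent_text, link_text ] triple per
# <a> element, in document order.
sub count_by_speaker {
    my @links = @_;
    my ( %count, $current );
    for my $link (@links) {
        my ( $parent_tag, $parent_text, $text ) = @$link;
        if ( $parent_tag eq 'span' ) {
            $current = $text;          # set current speaker
            $count{$current} //= 0;    # keep an earlier total if the name repeats
        }
        elsif ( $parent_tag eq 'b'
            && defined $current
            && $parent_text =~ /^(?:Session|Tutorial):/ )
        {
            $count{$current}++;        # add record to current speaker
        }
        # Anything else (book links, "Click here" links) is ignored.
    }
    return %count;
}

# Triples modelled on the sample markup in the question.
my %count = count_by_speaker(
    [ 'span', 'A Speaker',                    'A Speaker' ],
    [ 'b',    'Tutorial: Description',        'Description' ],
    [ 'p',    'Click here for more info.',    'Click here' ],
    [ 'span', 'Another Speaker',              'Another Speaker' ],
    [ 'p',    'He is the author of xxxx',     'Book name' ],
    [ 'b',    'Session: Session description', 'Session description' ],
);
print "$_ = $count{$_}\n" for sort keys %count;
```

Using `//=` rather than a plain assignment means a speaker whose link shows up twice keeps the total accumulated so far, which is a design choice of this sketch rather than a requirement stated in the thread.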

      Many thanks, poj. That got me across the line. The final version of my program appears below.

      #!/usr/local/bin/perl
      use strict;
      use warnings;
      use lib "$ENV{HOME}/mylib/lib/perl5";
      use HTML::TreeBuilder;
      use LWP::Simple;

      # Program Name: top_speakers.pl
      # Author: XXXXX
      # Purpose: Parses the page http://perlcourse.ecorp.net/conf-mirror/conferences.oreillynet.com/speakers.html
      #          and finds the speakers who had the most sessions and/or tutorials
      #          Original code only found sessions or tutorials, adjusted code to find Sessions, Tutorials, BOF's & Panels
      #          to match expected output per project specification

      # Define debugging variable - set to positive integer to enable
      my $DEBUG_FLAG = 0;

      # Define variable that will contain the URL we will parse
      my $URL = 'http://perlcourse.ecorp.net/conf-mirror/conferences.oreillynet.com/speakers.html';

      # Define our tree using HTML::Treebuilder and parse the document
      my $tree = HTML::TreeBuilder->new;
      $tree->parse( get( $URL ) );

      # Define our hash that will contain speaker names and their count
      my %speakers;

      # Define current speaker variable - used in find_speakers subroutine
      my $current_speaker;

      my @nodes = $tree->look_down( _tag => "a", \&find_speakers );

      # If in debug mode, Print list of speaker and their total of Sessions or Tutorials
      if ( $DEBUG_FLAG ) {
          foreach (sort keys %speakers) {
              print "$_ = ($speakers{$_})\n";
          }
      }

      # Set a counter to limit our results, call our sorting routine to
      # sort in descending order (highest to lowest) and print results
      # Exit loop once we have 3 speakers displayed.
      # Technically if there are speakers with the same amount of speaking
      # engagements they should be weighted equally (equal third etc) but
      # this was not in the project requirements
      my $counter = 0;
      foreach my $key (sort hashValueDescending (keys(%speakers))) {
          print "$key\t($speakers{$key})\n";
          $counter++;
          last if $counter == 3;
      }

      # Delete tree object to free up the memory (Best practice)
      $tree->delete;

      # find_speakers subroutine - finds speakers, adds their name to the %speakers hash
      # then looks for Sessions, Tutorials, BOFs or Panels that the speaker is presenting
      # and adds those to the total for each speaker
      sub find_speakers {
          my ($element) = @_;
          my $parent = $element->parent;
          my $text = $element->as_text;
          # Check if tag is a 'span' as this was consistent for delineating the speakers
          # throughout the document
          if ($parent->tag eq 'span'){
              print "Speaker = $text\n" if $DEBUG_FLAG;
              # add current speaker to the hash and initialize to zero
              # Note: We would need an alternative method if a speaker link appeared more than once
              $speakers{$text} = 0;
              # set current speaker
              $current_speaker = $text;
          }
          # Check if the parent tag is a bold element and if the text matches one
          # of our criteria - Session, Tutorial, BOF or Panel
          elsif ($parent->tag eq 'b'
              && $parent->as_text =~/(Session|Tutorial|BOF|Panel)/){
              print "$1 = $text\n" if $DEBUG_FLAG;
              # add record to current speaker - set counter to current speaker contents and increment by 1
              # then assign to the $speaker hash
              my $count = $speakers{$current_speaker} + 1;
              $speakers{$current_speaker} = $count;
          }
      }

      # hashValueDescending subroutine - sorts the hash in descending numerical order
      # from highest down to lowest
      sub hashValueDescending {
          $speakers{$b} <=> $speakers{$a};
      }
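
The comment in the program about speakers with equal totals ("equal third etc") describes something the assignment did not ask for. If it were wanted, one possible approach is to take the count of the n-th speaker as a cutoff and keep everyone at or above it. The sketch below assumes that reading of "equal third"; the sub name top_with_ties and the totals are hypothetical and only there to make the example runnable.

```perl
use strict;
use warnings;

# Return the names of the top $n speakers, plus anyone tied with the last of them.
sub top_with_ties {
    my ( $n, %count ) = @_;
    return if $n < 1;

    # Highest count first; ties are ordered by name so the output is repeatable.
    my @names = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    return @names if @names <= $n;

    my $cutoff = $count{ $names[ $n - 1 ] };    # count held by the n-th speaker
    return grep { $count{$_} >= $cutoff } @names;
}

# Made-up totals for illustration only.
my %speakers = ( Ann => 4, Bob => 3, Cy => 2, Di => 2, Ed => 1 );
print "$_\t($speakers{$_})\n" for top_with_ties( 3, %speakers );
```

With these sample totals the list has four entries, because Cy and Di share third place. Breaking ties by name in the sort is an arbitrary choice made here to keep the output stable between runs.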
