PerlMonks  

Extracting information from the SETI@Home PM group

by Rhose (Priest)
on Dec 11, 2001 at 20:51 UTC (#130970)

Category: HTML Utility
Author/Contact Info Rhose
Description: I had never used the HTML::TableExtract module, so I created this script as a learning experience. As usual, if anyone has suggestions on things I could do better, I would love to read them.
#!/usr/bin/perl -w
use strict;
#--
#-- Script:    SETIStat.pl
#-- Purpose:   Displays the information for the SETI@Home PM group
#--
#-- Author:    Robert(Bob) Smith
#-- Date:      December 11, 2001
#--
#-- Wish List: Add error handling if the requested HTML page is not retrieved
#--
#-- Rev Hist:  00.00.a 2001-12-11 rws  Initial version
#--
#-- Notes:     This script was created as a learning example for HTML::TableExtract
#--

#-- Use modules
use HTML::TableExtract;
use HTTP::Request::Common;
use LWP::UserAgent;


#-- Define constants
use constant VERSION                   => '00.00.a';

use constant FIELD_DELIM               => ',';
use constant SETI_URL                  => 'http://setiathome.ssl.berkeley.edu/stats/team/team_86606.html';

use constant ERR                       =>
  {
    'ok'                               => 0,
  };


#-- Define variables
my $gExtractedTable;                   #  Table extracted from the HTML
my $gHTMLPage;                         #  Retrieved HTML page
my $gName;                             #  Member's name
my $gRank;                             #  Member's ranking
my $gRow;                              #  Pointer to rows in extracted tables
my $gTable;                            #  Pointer to extracted tables
my $gUserAgent;                        #  LWP::UserAgent


#-- Retrieve the HTML page
$gUserAgent = LWP::UserAgent->new;
$gHTMLPage = $gUserAgent->request(GET SETI_URL);


#-- Extract the table
#--
#-- Note: TableExtract will handle tables nested within tables (outermost table is depth==0)
#--       as well as multiple tables within the same HTML document (first table is count==0).
#--       Since the information I wish to extract is not nested, depth will be 0, and since
#--       it is the second table on the page (the first table is the group description,
#--       web site, number of members, etc.), the count will be 1.
#--
$gExtractedTable = HTML::TableExtract->new(depth => 0, count => 1);
$gExtractedTable->parse($gHTMLPage->content);


#-- Display information
foreach $gTable ($gExtractedTable->table_states)
{
  foreach $gRow ($gTable->rows)
  {

    #-- Print data row
    if ($$gRow[0]=~/^(\d+)\)\s*(.*?)\s*$/s)
    {
      #-- Capture the rank and the trimmed name directly instead of relying on $` and $'
      ($gRank, $gName)=($1,$2);
      $$gRow[2]=~s/^\s+//;

      print $gRank, FIELD_DELIM,
            $gName, FIELD_DELIM,
            $$gRow[1], FIELD_DELIM,
            $$gRow[2], FIELD_DELIM,
            $$gRow[3], "\n";
    }

    #-- Print header row
    else
    {
      $$gRow[1]=~tr/\n/ /;
      $$gRow[3]=~tr/\n/ /;

      print 'Rank', FIELD_DELIM,
            $$gRow[0], FIELD_DELIM,
            $$gRow[1], FIELD_DELIM,
            $$gRow[2], FIELD_DELIM,
            $$gRow[3], "\n";
    }
  }
}


#-- Exit
exit(ERR->{ok});


#-- End of script
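
The depth/count note in the script can be tried without touching the network. What follows is only a minimal sketch: the two-table page, the names and the numbers are invented for illustration and are not taken from the SETI@Home site. It also assumes a reasonably recent HTML::TableExtract, where (as far as I know) tables() is the current name for what the script above calls table_states().

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up page (an assumption, not the real SETI@Home markup):
# two top-level tables, so the second one is depth 0, count 1.
my $html = <<'HTML';
<html><body>
<table><tr><td>Group description</td></tr></table>
<table>
  <tr><th>Name</th><th>Results</th></tr>
  <tr><td>1) alice</td><td>42</td></tr>
  <tr><td>2) bob</td><td>7</td></tr>
</table>
</body></html>
HTML

# Same constraints as the script: not nested, second table on the page.
my $te = HTML::TableExtract->new(depth => 0, count => 1);
$te->parse($html);
$te->eof;

# Collect each row as one delimited line; empty cells become ''.
my @lines;
for my $table ($te->tables) {
    for my $row ($table->rows) {
        push @lines, join ',', map { defined $_ ? $_ : '' } @$row;
    }
}

print "$_\n" for @lines;
```

Changing count to 0 should select the one-cell description table instead, which is a quick way to confirm which table a given depth/count pair picks out.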

Re: Extracting information from the SETI@Home PM group
by djw (Vicar) on Dec 17, 2001 at 20:20 UTC
    This is pretty cool.

    I have never tried HTML::TableExtract, but now that I know about it, I'm going to give it a try. I have done something similar for my user stats here: perldev.org, but I use LWP::Simple to grab the page and HTML::TreeBuilder to strip the HTML out (then regexes to grab the data I want).

    Something else to note is that the perlmonks SETI page has a few table entries that are blank. In the HTML source, you can see they use the &nbsp; entity for a single space in each case (user #17 and user #80). When I run your code on my Win2k box, I get an 'a' with an accent mark (or whatever it's called) above it. On my Linux box I get a blank entry, as you would expect. I mention this just as a heads-up in case you didn't notice...
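
    One possible workaround, sketched under an assumption: that the parser decodes &nbsp; into the single character 0xA0, which different consoles then render differently. The helper name clean_cell and the sample row are hypothetical and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn non-breaking spaces (0xA0) into plain
# spaces, then trim leading and trailing whitespace.
sub clean_cell {
    my ($cell) = @_;
    return '' unless defined $cell;   # treat missing cells as empty
    $cell =~ tr/\xA0/ /;              # assumed decoding of &nbsp;
    $cell =~ s/^\s+|\s+$//g;
    return $cell;
}

# Invented sample row: a name cell and a last cell that hold only &nbsp;.
my @row     = ("17) \xA0", "12", "\xA0");
my @cleaned = map { clean_cell($_) } @row;

print join(',', @cleaned), "\n";
```

    Running each cell through something like this before printing should give the same blank entry on both platforms.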

    Cool use of the module ++.

    djw
