
Process a HTML file to get information from it.

by Griffler (Novice)
on Dec 11, 2006 at 16:47 UTC (#589082=perlquestion)
Griffler has asked for the wisdom of the Perl Monks concerning the following question:

I have looked at the HTML::Parser examples and gotten a lot of help from them, but my problem goes a little deeper than they cover. I am very new to Perl and love it, but I need some help. I have an HTML file that I must parse to get information from. It looks something like this:
<a name="a"></a>
<h2>A</h2>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td>
<table id="a" border="1" bordercolor="#333366" cellpadding="5" cellspacing="0" width="100%">
<tr>
<td width="33%" class="clsTableBody" valign="top" id="firstCol"><a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">Abbott, Evelyn</a><br/><span>110136892</span><br/><a href="pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf" target="_blank">Agnew, Thomas</a><br/><span>110377660</span><br/></td>
<td width="34%" class="clsTableBodyClear" valign="top" id="secondCol"><a href="pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf" target="_blank">Allison, David</a><br/><span>108116112</span><br/><a href="pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf" target="_blank">Allison, Gary Owen</a><br/><span>116815754</span><br/></td>
<td width="33%" class="clsTableBody" valign="top" id="thirdCol"><a href="pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf" target="_blank">Arsenault, Michael</a><br/><span>108318866</span><br/><a href="pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf" target="_blank">Arsenault, Normand A.</a><br/><span>113069066</span><br/></td>
</tr>
</table>
</td>
</tr>
</table>
(I tried to post the actual HTML code, but it was being rendered.) I need to get the href values from this, and right after each href value is a 9-digit number belonging to that record, which I also need. If anyone can point me in a direction to get these two values, I would be very grateful. Thanks.

Re: Process a HTML file to get information from it.
by JediWizard (Deacon) on Dec 11, 2006 at 17:10 UTC

    Try using <code> tags around your html to prevent it from rendering.

    Otherwise, I think you need something like this:

    m/<a[^>]*                  # an anchor tag
      href=                    # the Href in the anchor
      (["'])((?:(?!\1).)*)\1   # the value in the href
      [^>]*>                   # anything to the end of the anchor
      [^<>]*                   # the content in the anchor tag
      <\/a>                    # the end of the anchor
      (?:\s|<[^>]*>)+          # any whitespace or html tags
      (\d{9})                  # the 9 digit number
     /isxm;

    my $href   = $2;
    my $number = $3;

    Update:

    my $re = qr/<a[^>]*                  # an anchor tag
                href=                    # the Href in the anchor
                (["'])((?:(?!\1).)*)\1   # the value in the href
                [^>]*>                   # anything to the end of the anchor
                [^<>]*                   # the content in the anchor tag
                <\/a>                    # the end of the anchor
                (?:\s|<[^>]*>)+          # any whitespace or html tags
                (\d{9})                  # the 9 digit number
               /isxm;

    my $string = do { local $/; <DATA> };

    while ($string =~ m/$re/g) {
        my $href   = $2;
        my $number = $3;
        print "$number - $href\n";
    }

    __DATA__
    <a name="a"></a>
    <h2>A</h2>
    <table border="0" cellpadding="0" cellspacing="0" width="100%">
    <tr>
    <td>
    <table id="a" border="1" bordercolor="#333366" cellpadding="5" cellspacing="0" width="100%">
    <tr>
    <td width="33%" class="clsTableBody" valign="top" id="firstCol"><a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">Abbott, Evelyn</a><br/><span>110136892</span><br/><a href="pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf" target="_blank">Agnew, Thomas</a><br/><span>110377660</span><br/></td>
    <td width="34%" class="clsTableBodyClear" valign="top" id="secondCol"><a href="pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf" target="_blank">Allison, David</a><br/><span>108116112</span><br/><a href="pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf" target="_blank">Allison, Gary Owen</a><br/><span>116815754</span><br/></td>
    <td width="33%" class="clsTableBody" valign="top" id="thirdCol"><a href="pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf" target="_blank">Arsenault, Michael</a><br/><span>108318866</span><br/><a href="pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf" target="_blank">Arsenault, Normand A.</a><br/><span>113069066</span><br/></td>
    </tr>
    </table>
    </td>
    </tr>
    </table>

    Output:

    110136892 - pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf
    110377660 - pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf
    108116112 - pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf
    116815754 - pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf
    108318866 - pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf
    113069066 - pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf

    They say that time changes things, but you actually have to change them yourself.

    —Andy Warhol

      Thanks for the hints!!!
Re: Process a HTML file to get information from it.
by andyford (Curate) on Dec 11, 2006 at 17:13 UTC

    Depending on the surrounding HTML and how static your source is, you might be able to get by without the Parser.

    Perhaps you could just use a regular expression in quick-but-dirty fashion like this:

    /pdf.+?>(.+?)<.+span>(\d{9})<\/span>/;
    Then your data might be in $1 and $2. What have you tried?

    non-Perl: Andy Ford
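A minimal sketch of the quick-and-dirty regex route, applied repeatedly with the /g modifier so that every record in the file is collected. The helper name extract_pairs is hypothetical and not from the thread, and the pattern assumes the markup always has the shape shown in the question: an anchor with a double-quoted href, optional <br/> tags, then a <span> holding the number. Unlike the one-liner above, it captures the href itself rather than the link text.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns a list of
# [href, number] pairs found in the given HTML string.
sub extract_pairs {
    my ($html) = @_;
    my @pairs;
    while ($html =~ m{
            <a\s[^>]*?href="([^"]+)"[^>]*>   # anchor carrying an href, which is captured
            [^<]*</a>                        # link text, then the closing tag
            \s*(?:<br\s*/?>\s*)*             # optional line breaks in between
            <span>\s*(\d{9})\s*</span>       # the nine digit number, also captured
        }gix) {
        push @pairs, [ $1, $2 ];
    }
    return @pairs;
}

# Two records copied from the sample in the question.
my $sample = join '',
    '<a name="a"></a><h2>A</h2>',
    '<a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">Abbott, Evelyn</a><br/><span>110136892</span><br/>',
    '<a href="pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf" target="_blank">Agnew, Thomas</a><br/><span>110377660</span><br/>';

print "$_->[1] - $_->[0]\n" for extract_pairs($sample);
```

The bookmark anchor (<a name="a">) is skipped because it has no href. If the real file uses single-quoted attributes or puts other tags between the link and the number, the pattern would need adjusting, which is the usual argument for a real parser.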

      I was using the code sample from the HTML::Parser module. It parsed out all the hrefs, but I could not figure out how to get the 9-digit number after each one. Here is the code I was using:
      use HTML::Parser;

      my $p = HTML::Parser->new(api_version => 3,
          start_h     => [\&a_start_handler, "self,tagname,attr"],
          report_tags => [qw(a img)],
      );
      $p->parse_file(shift || die) || die $!;

      sub a_start_handler {
          my($self, $tag, $attr) = @_;
          return unless $tag eq "a";
          return unless exists $attr->{href};
          print "A $attr->{href}\n";

          $self->handler(text  => [], '@{dtext}' );
          $self->handler(start => \&img_handler);
          $self->handler(end   => \&a_end_handler, "self,tagname");
      }

      sub img_handler {
          my($self, $tag, $attr) = @_;
          return unless $tag eq "img";
          push(@{$self->handler("text")}, $attr->{alt} || "[IMG]");
      }

      sub a_end_handler {
          my($self, $tag) = @_;
          my $text = join("", @{$self->handler("text")});
          $text =~ s/^\s+//;
          $text =~ s/\s+$//;
          $text =~ s/\s+/ /g;
          print "T $text\n";

          $self->handler("text", undef);
          $self->handler("start", \&a_start_handler);
          $self->handler("end", undef);
      }
      The file has a ton of other stuff in it, but what I posted is the main guts.
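One way to get the number with plain HTML::Parser is to keep a little state between events: remember the last href seen, note when the parser is inside a <span>, and pair the two when a 9-digit text run arrives. The following is only a sketch under those assumptions. The helper name extract_records and the state variables are hypothetical, and it relies on the number always sitting in the first <span> after its link, as in the posted sample.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not from the thread): returns [href, number] pairs.
sub extract_records {
    my ($html) = @_;
    my (@records, $href, $in_span);

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $attr) = @_;
            if ($tag eq 'a') {
                # Remember the most recent href; bookmark anchors have none.
                $href = $attr->{href} if defined $attr->{href};
            }
            elsif ($tag eq 'span') {
                $in_span = 1;
            }
        }, 'tagname,attr' ],
        text_h => [ sub {
            my ($text) = @_;
            return unless $in_span && defined $href;
            if ($text =~ /^\s*(\d{9})\s*$/) {
                push @records, [ $href, $1 ];
                undef $href;    # assume one number per link
            }
        }, 'dtext' ],
        end_h => [ sub {
            my ($tag) = @_;
            $in_span = 0 if $tag eq 'span';
        }, 'tagname' ],
    );

    $p->unbroken_text(1);    # deliver each text run as a single chunk
    $p->parse($html);
    $p->eof;
    return @records;
}
```

Calling extract_records on the contents of the file and printing "$_->[0] $_->[1]" for each returned pair would list every href next to its number. For a file on disk, parse_file could be used in place of parse and eof.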
Re: Process a HTML file to get information from it.
by wfsp (Abbot) on Dec 11, 2006 at 18:00 UTC
    here's my go
    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $html = q{
    <a name="a"></a>
    <h2>A</h2>
    <table width="100%" cellpadding="0" cellspacing="0" border="0">
      <tr>
        <td>
          <table width="100%" cellpadding="5" cellspacing="0" border="1">
            <tr>
              <td width="33%" valign="top" class="clsTableBody">
                <a href="pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf" target="_blank">
                  Abbott, Evelyn
                </a><br />
                <span>110136892</span><br />
                <a href="pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf" target="_blank">
                  Agnew, Thomas
                </a><br />
                <span>110377660</span><br />
              </td>
              <td width="34%" valign="top" class="clsTableBodyClear">
                <a href="pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf" target="_blank">
                  Allison, David
                </a><br />
                <span>108116112</span><br />
                <a href="pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf" target="_blank">
                  Allison, Gary Owen
                </a><br />
                <span>116815754</span><br />
              </td>
              <td width="33%" valign="top" class="clsTableBody">
                <a href="pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf" target="_blank">
                  Arsenault, Michael
                </a><br />
                <span>108318866</span><br />
                <a href="pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf" target="_blank">
                  Arsenault, Normand A.
                </a><br />
                <span>113069066</span><br />
              </td>
            </tr>
          </table>
        </td>
      </tr>
    </table>
    };

    my $p = HTML::TokeParser::Simple->new(\$html);

    # parse until second table
    my $table_count = 2;
    while (my $t = $p->get_tag('table')){
      last unless --$table_count;
    }

    my (%href, $this_href, $number);
    while (my $t = $p->get_token){
      if ($t->is_start_tag('a')){
        $this_href = $t->get_attr('href');
        next;
      }
      if ($t->is_start_tag('span')){
        $number = $p->get_trimmed_text('/span');
        $href{$this_href} = $number;
        next;
      }
      last if $t->is_end_tag('table');
    }

    for my $key (keys %href){
      print "$key -> $href{$key}\n";
    }
    output:
    ---------- Capture Output ----------
    > "C:\Perl\bin\perl.exe" _new.pl
    pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
    pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
    pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
    pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
    pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
    pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
    > Terminated with exit code 0.
      This is great, but how would I modify this to parse through a file that has that same table structure 25 more times (basically one table for each letter of the alphabet)?
        Assuming each letter is in an H2 tag (and that these are the only H2 tags) and that each structure is identical, this should do the trick. We collect the data into a HoH (%href), keyed first by letter and then by href.

        Hope this helps.

        my $p = HTML::TokeParser::Simple->new(\$html);

        my (%href, $this_href, $number, $letter);
        while (my $t = $p->get_token){
          if ($t->is_start_tag('h2')){
            $letter = $p->get_trimmed_text('/h2');
            next;
          }
          if ($t->is_start_tag('a')){
            # skip bookmarks
            next if $t->get_attr('name');
            $this_href = $t->get_attr('href');
            next;
          }
          if ($t->is_start_tag('span')){
            $number = $p->get_trimmed_text('/span');
            $href{$letter}{$this_href} = $number;
            next;
          }
        }
        output
        ---------- Capture Output ----------
        > "C:\Perl\bin\perl.exe" _new.pl
        A
        pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
        pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
        pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
        pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
        pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
        pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
        B
        pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
        pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
        pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
        pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
        pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
        pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
        C
        pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf -> 110377660
        pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf -> 110136892
        pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf -> 108318866
        pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf -> 108116112
        pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf -> 113069066
        pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf -> 116815754
        > Terminated with exit code 0.
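The snippet above fills the hash of hashes but does not show the loop that prints it. A possible sketch follows. The helper name format_hoh is hypothetical, and sorting the keys is a choice made here so the listing is stable (the output in the post is in hash order).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): turns a hash of hashes,
# keyed by letter and then by href, into printable lines.
sub format_hoh {
    my ($data) = @_;
    my @lines;
    for my $letter (sort keys %$data) {
        push @lines, $letter;
        for my $link (sort keys %{ $data->{$letter} }) {
            push @lines, "$link -> $data->{$letter}{$link}";
        }
    }
    return @lines;
}

# Two entries copied from the thread's sample data.
my %href = (
    A => {
        'pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf' => '110136892',
        'pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf' => '110377660',
    },
);

print "$_\n" for format_hoh(\%href);
```

Note that using the href as a hash key means a link that appears twice under one letter would keep only its last number. An array of pairs per letter would avoid that if duplicates are possible.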
Re: Process a HTML file to get information from it.
by Popcorn Dave (Abbot) on Dec 11, 2006 at 18:32 UTC
    You might have a look at HTML::TokeParser, as you should be able to pull out the information as tokens. I wrote a small quick-and-dirty program to dump HTML to tokens using HTML::TokeParser, which you can find in the node "Re: HTML::TokeParser help - parsing headlines".

    HTH!

    Revolution. Today, 3 O'Clock. Meet behind the monkey bars.
Re: Process a HTML file to get information from it.
by GrandFather (Cardinal) on Dec 11, 2006 at 18:49 UTC

    HTML::TreeBuilder is a pretty good tool for this sort of work, especially if the format of the HTML is consistent for the data you need to extract. Consider:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $root = HTML::TreeBuilder->new ();
    $root->parse_file (*DATA);

    for ($root->look_down ('_tag', 'a')) {
        my $href = $_->attr ('href');
        next if ! $href;

        my $sib = $_->right ()->right ();
        my $number = $sib->as_text ();
        print "$href: $number\n";
    }

    __DATA__

    Prints:

    pdf\c76b834e-36e1-497b-b13e-eba2348dc044.pdf: 110136892
    pdf\8a956f66-1c60-48fc-905c-b49d617aa6c5.pdf: 110377660
    pdf\37d3e78b-1adb-458b-9e89-0df780909f08.pdf: 108116112
    pdf\6c0a5bb4-143d-4305-957b-796c8193d07a.pdf: 116815754
    pdf\ae8d51e0-005b-44be-84cb-3c9b57335755.pdf: 108318866
    pdf\e646f948-f78d-4463-a01d-0261aebf70dc.pdf: 113069066

    DWIM is Perl's answer to Gödel
      I tried your code and got the following error:

      Can't call method "right" without a package or object reference at C:\Change\2539\testit2.pl line 21.

      I modified the code to look like this:
      use strict;
      use warnings;
      use HTML::TreeBuilder;

      my $root = HTML::TreeBuilder->new ();
      $root->parse_file ("c:\\change\\2539\\index.html");

      for ($root->look_down ('_tag', 'a')) {
          my $href = $_->attr ('href');
          next if ! $href;

          my $sib = $_->right ()->right ();
          my $number = $sib->as_text ();
          print "$href: $number\n";
      }
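A plausible cause of that error, offered as a guess since the full file is not shown: somewhere in the real page an anchor with an href is followed by a text node rather than a <br>. HTML::Element stores text nodes as plain strings, so calling right() on one fails with exactly this kind of message. The sketch below avoids chaining right()->right() and instead walks the following siblings, skipping strings, until it meets a <span> or the next anchor. The helper name href_number_pairs is hypothetical.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper (not from the thread): returns [href, number] pairs.
sub href_number_pairs {
    my ($html) = @_;
    my $root = HTML::TreeBuilder->new_from_content($html);
    my @pairs;

    for my $anchor ($root->look_down(_tag => 'a')) {
        my $href = $anchor->attr('href') or next;

        # In list context right() returns every following sibling.
        my $span;
        for my $sib ($anchor->right) {
            next unless ref $sib;        # plain strings are text nodes
            last if $sib->tag eq 'a';    # reached the next record
            if ($sib->tag eq 'span') { $span = $sib; last }
        }
        next unless $span;

        my $number = $span->as_text;
        push @pairs, [ $href, $number ] if $number =~ /^\d{9}$/;
    }

    $root->delete;    # release the tree explicitly
    return @pairs;
}
```

Anchors that have no <span> before the next link are skipped silently, so links elsewhere in the page (navigation, footers) should not abort the run.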
Re: Process a HTML file to get information from it.
by Griffler (Novice) on Dec 11, 2006 at 19:52 UTC
    Thanks to all who posted. This was a great help!
