Extract info from HTML

by George_Sherston (Vicar)
on Nov 11, 2001 at 21:42 UTC

George_Sherston has asked for the wisdom of the Perl Monks concerning the following question:

I recently made a little script to come up with an index for the increasingly unwieldy Name Space thread. This led me into an area I've not explored before (one of many, I hasten to add), that of extracting info from HTML. I wanted something that would go through the page and pull out a node number and name for the first post by each monk who had contributed to the thread. What I came up with was this:
#!/usr/bin/perl -w
use strict;
use CGI qw(:standard :cgi-lib);
use LWP::Simple;

my $url = "http://perlmonks.org/index.pl?node_id=110166";
my $html = get $url or die "can't get url $!";
my %names;

# find the names and node ids:
while ($html =~ s/=(\d*?)&lastnode_id=110166">[^<]*<\/A><BR> by <A HREF="\/index\.pl\?node_id=\d*&lastnode_id=110166">(.*?)<\/A> on \w{3} \d{2}, \d{4} at \d{2}:\d{2}//s) {
    $names{$2} = $1 unless $names{$2};
}

# print out a page of links to nodes:
for (sort { lc($a) cmp lc($b) } keys %names) {
    print "<A HREF=\"/index.pl?node_id=$names{$_}&lastnode_id=110166\">$_</A> | ";
}
... which does the job, BUT the regex is big and fat and ugly, and I just wondered whether there was a more elegant, less impenetrable way to do it (i.e. I wondered how *many* such ways there were). I looked at HTML::Parser, but (and perhaps my inspection was too cursory) it didn't seem as though it would help me much in pulling out bits of tags, as I need to here. Also, I felt that the while loop was a bit clumsy... but I couldn't see a quicker way to capture two matches into a hash. I'd be very interested in any suggestions on how to do this better.
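A minimal sketch of one alternative to the explicit while loop (not part of the original post; the pattern below is a simplified stand-in for the real one): run the match in list context with /g and pour the (node_id, name) pairs straight into a hash. The reverse makes the first post per monk win when duplicate names collide.

# Sketch only: simplified pattern, same idea as the loop above.
my %names = reverse(
    $html =~ m{node_id=(\d+)&lastnode_id=110166">[^<]*</A><BR>\s+by\s+
               <A[^>]*>([^<]+)</A>}gx
);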

George Sherston

Edit: chipmunk 2001-11-11

Replies are listed 'Best First'.
Re: Extract info from HTML
by Chmrr (Vicar) on Nov 11, 2001 at 22:33 UTC

    Parsing HTML with regexen is always hard. I'd suggest that you use HTML::LinkExtor to grab all of the links off of the page, then sort through them looking for the ones you actually want.
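    A minimal sketch of that approach (my illustration, not Chmrr's code; the node_id test at the end is only a guess at what "the ones you actually want" looks like):

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::LinkExtor;

    my $html = get("http://perlmonks.org/index.pl?node_id=110166")
        or die "can't get url $!";

    my $extor = HTML::LinkExtor->new;   # no callback, so links are collected
    $extor->parse($html);
    $extor->eof;

    for my $link ($extor->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' and defined $attr{href};
        print "$attr{href}\n" if $attr{href} =~ /node_id=\d+/;   # crude filter
    }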

    The other possibility would be to use HTML::TreeBuilder, but that might turn out to be more complicated. However, I've found that TreeBuilder is a much easier way of thinking about HTML than Parser. Most of the useful documentation for HTML::TreeBuilder is found under HTML::Element, BTW.
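    For flavour, a small sketch of the TreeBuilder/Element style of working (my illustration, not Chmrr's code; it only lists the node links, nothing more):

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TreeBuilder;

    my $html = get("http://perlmonks.org/index.pl?node_id=110166")
        or die "can't get url $!";
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down comes from HTML::Element; here it pulls out every <a> element
    for my $a ($tree->look_down(_tag => 'a')) {
        my $href = $a->attr('href') or next;
        printf "%s => %s\n", $1, $a->as_text if $href =~ /node_id=(\d+)/;
    }
    $tree->delete;   # free the parse tree when done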

    perl -pe '"I lo*`+$^X$\"$]!$/"=~m%(.*)%s;$_=$1;y^+*`^ ve^#$&V"+@( NO CARRIER'

      I'd suggest that you use HTML::LinkExtor to grab all of the links off of the page, then sort through them looking for the ones you actually want.
      In some cases, that might work, but for this particular one, he has no way of determining which links were those of "authors" who replied to Name Space.

      As for HTML::TreeBuilder, demerphq puts on a nice show of how you'd do it, but for me it's too much work (and a lil' bit of a mind-bender).

       
      ___crazyinsomniac_______________________________________
      Disclaimer: Don't blame. It came from inside the void

      perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

Re: Extract info from HTML
by demerphq (Chancellor) on Nov 12, 2001 at 08:37 UTC
    Well, after a long night between you and Chmrr I learned something for sure.

    I used HTML::TreeBuilder as suggested, and from the docs and some tinkering produced the following code. It's not at all more elegant than yours, nor arguably better, but whatever. I have no doubt that if I trawled the catacombs I'd find better, or that when blakem or a number of other people show up they'll blow this all apart, but here goes anyway :-).

    use warnings;
    use strict;
    use CGI qw(:standard :cgi-lib);
    use LWP::Simple;
    use HTML::TreeBuilder;

    sub find_node {
        my $node    = shift;   # well it might not be a font after all...
        my $hashref = shift;
        my $depth   = shift;
        my @content = $node->content_list;
        return @content if (!ref($node) || uc($node->tag) ne "FONT");
        my @tmp = $node->content_list;

        # Build a fingerprint of the node. Numeric as a minor optimization
        # -1 is text, # is the number of children the node has, so <br> has none
        # (usually)
        my $depthprint = join(":", map {ref $_ ? scalar $_->content_list : -1} @tmp);
        # This could be neater...
        my $fingerprint = join("<>", map {ref $_ ? $_->tag : $_} @tmp);

        my ($node_id, $title, $date_node, $monk_node);
        if ($depthprint  =~ /^(1:0:)?-1:1:-1$/ &&                 # the fingerprint to match
            $fingerprint =~ /^(a<>br<>)? by <>a<> on [^<>]+$/i) { # the node to match
            # ok, this is almost definitely a node header
            if ($1) {
                #print $1;
                $node_id = CGI->new( $tmp[0]->attr('href') =~ /\?(.*)/ )->param( 'node_id' );
                $title   = ( $tmp[0]->content_list )[0];
                ($monk_node, $date_node) = @tmp[3,4];
            } else {
                # so no a<>br<> at the start, means this is
                # probably the start of the nodes. go up and see if its
                # a td, if it is then its first child should be an h3
                # if its not, at any point bail, if it is, then the contents
                # of the h3 is the thread title
                #print "Start?";
                my $parent = $node->parent;
                return @content if $parent->tag ne "td";
                my @pcont = $parent->content_list;
                my $hdr = $pcont[0];
                return @content if !$hdr || !ref($hdr) || $hdr->tag ne "h3";
                $node_id = ($hdr->content_list)[0];
                $title   = $node_id;
                ($monk_node, $date_node) = @tmp[1,2];
                #print "Start!".$depth;
            }
        } else {
            return @content;
        }
        my $home_id    = CGI->new( $monk_node->attr('href') =~ /\?(.*)/ )->param( 'node_id' );
        my ($date)     = ( $date_node =~ /on (.*)/ );
        my ($monkname) = $monk_node->content_list;
        # Build the hash. this could be more elegant
        # print "($depth) $date $title $monkname";
        $hashref->{$monkname}->{$node_id} = {date => $date, title => $title};
        $hashref->{$monkname}->{Home} = $home_id;
        return;
    }

    sub recurse {
        my ($node, $hash, $depth) = @_;
        # depth first search, real simple, everything is loaded in the hash:
        #
        # monkname->home->id
        #         |
        #         ->id->date->value    # note id sorts by date doesnt it?
        #         |
        #         ->title->value
        ref($_) && recurse($_, $hash, $depth + 1)
            foreach find_node($node, $hash, $depth);
    };

    sub get_names_in_thread {
        my $id = shift;
        my $html;
        print "<p><STRONG>The posters from thread [id://$id]</STRONG><br>";
        if ($id) {
            my $url = "http://perlmonks.org/index.pl?node_id=$id";
            $html = get( $url ) or die "can't get url $!";
        } else {
            # for debugging
            warn "Using DATA";
            local $/;
            $html = <DATA>;
        }
        my $tree = HTML::TreeBuilder->new();
        $tree->parse($html);
        my $hash = {};
        recurse($tree, $hash, 0);

        # everything here on is just formatting
        my @sorted = sort {$a->[1] cmp $b->[1]}
                     map  {my $key = lc($_);
                           # For { Nule }.. Keeps the weirdos on one branch :-)
                           $key =~ s/[^[:alpha:]]/{/g;
                           ["[id://$hash->{$_}->{Home}| $_ ]", $key]} keys %$hash;
        my %ltrs;
        foreach (@sorted) {
            my $ltr = substr($_->[1], 0, 1);
            $ltrs{$ltr} = [] if !exists $ltrs{$ltr};
            push @{$ltrs{$ltr}}, $_->[0];
        }
        my $ret = "<ul>\n";
        foreach (sort keys %ltrs) {
            $ret .= "\t<li>\n";
            $ret .= "\t\t" . join(" | ", @{$ltrs{$_}}) . "\n";
            $ret .= "\t</li>\n";
        }
        return $ret . "</ul>";
    }

    print get_names_in_thread(110166); # 123859 is big too
    __DATA__
    which outputs:

    The posters from thread Name Space

    That was a lot of fun, George_Sherston, I learned a lot. Thanks. (And BTW, I know I could have used more CGI tricks, but it's been a long night and I couldn't be bothered. Also, some kind of recursion could be used to follow each reply looking for more replies, but that's for another night :-).

    Yves / DeMerphq
    --
    Have you registered your Name Space?
    UPDATE: Fixed spelling of Chmrr

(crazyinsomniac) Re: Extract info from HTML
by crazyinsomniac (Prior) on Nov 12, 2001 at 12:46 UTC
    Weeel, I started writing code on this about two hours ago, but didn't get far until I actually started thinking about the problem some 15 minutes ago (when I put down the peanuts and exhausted my votes for the day), and this is what I came up with:

    First, I needed to pick a module to use, and HTML::TokeParser sat really well with me. The initial problem for me was to figure out which bit of HTML is the one I want, so I did what I always do when diagnosing such a problem: I dumped the entire document token by token, in this case with:

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;
    use HTML::TokeParser;

    my $url = "http://perlmonks.org/index.pl?node_id=110166";
    my $rawHTML = get($url);                  # attempt to d/l the page to mem
    die "LWP::Simple messed up $!" unless ($rawHTML);

    my $tp;
    $tp = HTML::TokeParser->new(\$rawHTML) or die "WTF $tp gone bad: $!";

    # And now -- a generic HTML::TokeParser loop
    while (my $token = $tp->get_token) {
        my $ttype = shift @{ $token };
        print "TYPE : $ttype\n####\n";
        printf( join( '', map { "$_:%s\n####\n" } 1..@{$token} ), @{$token} );
        print "####################################################\n\n";
    }

    __END__
    Which produces something like:

    TYPE : D
    ####
    1:<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" >
    ####
    ####################################################

    TYPE : T
    ####
    1:
    ####
    2:
    ####
    ####################################################

    TYPE : C
    ####
    1:<!--took this out for IE6ites "http://www.w3.org/TR/REC-html40/loose.dtd"-->
    ####
    ####################################################

    TYPE : T
    ####
    1:
    ####
    2:
    ####
    ####################################################

    TYPE : S
    ####
    1:html
    ####
    2:HASH(0x1afeee0)
    ####
    3:ARRAY(0x1afeef8)
    ####
    4:<HTML>
    ####
    ####################################################

    TYPE : T
    ####
    1:
    ####
    2:
    ####
    ####################################################
    Then, after "visualizing" what criteria I can use to pick out the stuff I need (noted in __END__), I crafted me while loop like so:
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;
    use HTML::TokeParser;

    my $url = "http://perlmonks.org/index.pl?node_id=110166";
    my $rawHTML = get($url);                  # attempt to d/l the page to mem
    die "LWP::Simple messed up $!" unless ($rawHTML);

    my $tp;
    $tp = HTML::TokeParser->new(\$rawHTML) or die "WTF $tp gone bad: $!";

    # And now -- a generic HTML::TokeParser loop
    while (my $token = $tp->get_token) {
        my $ttype = shift @{ $token };

        if($ttype eq "S" and $token->[0] eq "br") {
            my ( @t ) = (
                undef,            #$tp->get_token, #S 0
                $tp->get_token,   #T 1
                $tp->get_token,   #S 2
                $tp->get_token,   #T 3
                $tp->get_token,   #E 4
                $tp->get_token,   #T 5
            );

            if(
    #           ($t[0][0] eq "S" and $t[0][1] eq "br") and
                ($t[1][0] eq "T" and $t[1][1] =~ /by/) and
                ($t[2][0] eq "S" and $t[2][1] eq "a") and
                ($t[3][0] eq "T" ) and
                ($t[4][0] eq "E" and $t[4][1] eq "a") and
                ($t[5][0] eq "T" and $t[5][1] =~ /on \w{3} \d{2}, \d{4} at /)
              ) {
                print $t[2][4], $t[3][1], $t[4][2], " | ";
            }
        }
    } # endof while (my $token = $p->get_token)

    undef $rawHTML;   # no more raw html
    undef $tp;        # destroy the HTML::TokeParser object (don't need it no more)

    __END__
    ######### WITH ADDED NEWLINES FOR READABILITY AT ><
    <TR BGCOLOR=eeeeee><TD colspan=2>
    <UL>
    <font size=2>
    <A HREF="/index.pl?node_id=110247&lastnode_id=110166">
    Re: Re: Name Space
    </A>
    <BR> by
    <A HREF="/index.pl?node_id=85506&lastnode_id=110166">
    Hofmator
    </A>
     on Sep 05, 2001 at 02:27
    </UL>
    </font></TD></tr>

    ########## BROKEN DOWN BY TOKEN

    TYPE : S
    ####
    1:br
    ####
    2:HASH(0x1af8128)
    ####
    3:ARRAY(0x1afeeec)
    ####
    4:<BR>
    ####
    ####################################################

    TYPE : T
    ####
    1: by
    ####
    2:
    ####
    ####################################################

    TYPE : S
    ####
    1:a
    ####
    2:HASH(0x1ab4384)
    ####
    3:ARRAY(0x1ab6324)
    ####
    4:<A HREF="/index.pl?node_id=85506&lastnode_id=110166">
    ####
    ####################################################

    TYPE : T
    ####
    1:Hofmator
    ####
    2:
    ####
    ####################################################

    TYPE : E
    ####
    1:a
    ####
    2:</A>
    ####
    ####################################################

    TYPE : T
    ####
    1: on Sep 05, 2001 at 02:27
    ####
    2:
    ####
    ####################################################
    Which produced the following list:
    japhy | Hofmator | tilly | davorg | scain | ichimunki | runrig | demerphq | merphq | shotgunefx | Masem | cLive ;-) | synapse0 | lo_tech | agent00013 | MrNobo1024 | Corion | demerphq | lo_tech | George_Sherston | Hofmator | Zaxo | idnopheq | dragonchild | herveus | wine | TheoPetersen | toadi | dga | mexnix | ybiC | {NULE} | theorbtwo | George_Sherston | Jouke | George_Sherston | tye | gregor42 | Guildenstern | sifukurt | CubicSpline | scain | zakzebrowski | jackdied | suaveant | poqui | mikeB | davis | s173451000 | blakem | George_Sherston | PotPieMan | mr_mischief | Zecho | earthboundmisfit | kwoff | Arguile | chaoticset | BrentDax | Aighearach | basicdez | brianarn | George_Sherston | BooK | riffraff | seanbo | Maestro_007 | stefan k | dthacker | Hero Zzyzzx | beretboy | Veachian64 | giulienk | blakem | George_Sherston |
    The lesson here is: thank god vroom has a consistent format, making it possible for me to decide what I want relatively easily (and thank god for HTML::TokeParser including the raw HTML, so I don't have to do much recreating, just repiecing ;D).

    Is it elegant? I don't care, it makes sense to me (in practice and in theory).

    update: oh yeah, it's not sorted, because I don't actually "collect" the URLs (users/userids) I want; as you can see, I just print them out.
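    For what it's worth, a sketch of that missing "collect" step (my addition, not crazyinsomniac's code; it is a patch against the loop above, not a standalone script):

    my %links;                            # declare before the while loop above

    # ... then inside the if() block, in place of the print:
    #     $t[2][4] is the raw <A HREF=...> start tag, $t[3][1] is the monk's name
    $links{ $t[3][1] } ||= $t[2][4];      # first post per monk wins

    # ... and after the while loop, print the collected links sorted by name:
    print join " | ",
          map  { "$links{$_}$_</A>" }
          sort { lc $a cmp lc $b } keys %links;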

    This may help (a token can look like):

    ["S", $tag, $attr, $attrseq, $text] ["E", $tag, $text] ["T", $text, $is_data] ["C", $text] ["D", $text] ["PI", $token0, $text]
    update: oh, point taken, that's just a simple oversight on my part; all I'd have to do is grab a couple more tokens... later ;D

     
    ___crazyinsomniac_______________________________________
    Disclaimer: Don't blame. It came from inside the void

    perl -e "$q=$_;map({chr unpack qq;H*;,$_}split(q;;,q*H*));print;$q/$q;"

      Huh, I always thought TokeParser was something to do with getting stoned in an orderly recursive fashion. :) That was interesting. Q: can you make it kick out nodes below a depth of 1? Otherwise it's an index of all nodes, rather than an index of nodes where people give their etymology. Q2: can it pick up the node_id of the named monk's node in this thread, rather than the named monk's home node?

      George Sherston
        Well, even though this wasn't addressed to me:

        Mine will extract all of the above information; just change the following lines:

        print "($depth)$monkname posted '$monkname' on $date\n"; $hashref->{$monkname}->{$node_id}={ date=>$date, title=>$title, depth=>$depth };
        Then you can extract whatever you want.
        $VAR1 = {
          'demerphq' => {
            '110238' => { 'depth' => '13',
                          'title' => 'Corions Name Space',
                          'date'  => 'Sep 05, 2001 at 01:04' },
            'Home'    => '108447',
            '110195' => { 'depth' => '12',
                          'title' => 'Re: Name Space',
                          'date'  => 'Sep 04, 2001 at 15:46' }
          },
          'George_Sherston' => {
            'Home'    => '103111',
            '124767' => { 'depth' => '13',
                          'title' => 'Re: Re: Name Space',
                          'date'  => 'Nov 11, 2001 at 22:33' },
            'Name Space' => { 'depth' => '9',
                              'title' => 'Name Space',
                              'date'  => 'Sep 04, 2001 at 13:33' },
            '121046' => { 'depth' => '14',
                          'title' => 'Re: Re: Re: Name Space',
                          'date'  => 'Oct 24, 2001 at 01:21' },
            '117665' => { 'depth' => '13',
                          'title' => 'Re: TheOrbTwo\'s Name Space',
                          'date'  => 'Oct 09, 2001 at 00:05' },
            '117303' => { 'depth' => '13',
                          'title' => 'Re: Re: Name Space',
                          'date'  => 'Oct 07, 2001 at 03:57' },
            '110244' => { 'depth' => '13',
                          'title' => 'Re: Re: Name Space',
                          'date'  => 'Sep 05, 2001 at 01:58' },
            '122854' => { 'depth' => '13',
                          'title' => 'Re: Re: Name Space',
                          'date'  => 'Nov 02, 2001 at 08:07' }
          }
        };
        Note that the depths are as follows: 9 is the root node, 12 a reply, 13 a reply to a reply, and so on.
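        For instance, a small sketch of walking that structure (my addition, not demerphq's code):

        for my $monk (sort keys %$hash) {
            for my $id (keys %{ $hash->{$monk} }) {
                next if $id eq 'Home';                  # that slot holds the home node id
                my $post = $hash->{$monk}{$id};
                print "$monk posted '$post->{title}' at depth $post->{depth} on $post->{date}\n";
            }
        }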
        But a thought: you don't want the posts from just a fixed depth in the parse tree. That would, for instance, eliminate you from the list (you don't have a reply to yourself), as well as anyone who explained their name in a reply to another person's explanation; merphq would be an example, and I believe there are more.

        Actually, one of the more interesting issues with this thread was accurately picking up all names from all levels; there is an annoying habit of <UL> tags messing up the pattern, and of the main post being marked up differently.

        Anyway, I'll revisit this a bit later. :-)

        Yves / DeMerphq
        --
        Have you registered your Name Space?

Re: Extract info from HTML
by blakem (Monsignor) on Nov 12, 2001 at 16:09 UTC
    Since I saw demerphq's subtle challenge, I thought I'd try a slightly different take on the problem. The objective is to gather information about the direct children of a node (GS said he'd prefer to skip the grandchildren). I immediately thought about the XML feeds but soon realized that they only provide parent_node information... Walking up the nodeid tree is easy, walking down it is not.

    So, I settled on a hybrid solution. First I grab out all the nodeids mentioned on the original page (including one that actually refers to an everything2.com node). I then use the xml feeds to check which of these nodeids are direct children of 'Name Space'.

    The XML tickers apparently only work when you're logged in, so you'll have to set $user and $pass. I just dump the data out at the end, so you'll need to tweak that as well if you want HTML.

    #!/usr/bin/perl -T
    use warnings;
    no warnings 'uninitialized';
    use lib '/web/httpd_perl/perllibs';
    use strict;
    use LWP::Simple;
    use XML::Simple;

    my $namespaceid  = 110166;
    my $namespaceurl = "http://perlmonks.org/index.pl?node_id=$namespaceid";
    my $xmlurl       = "http://perlmonks.org/index.pl?node_id=37150";
    my $user = 'blakem';
    my $pass = 'yeahright';

    ## Get a unique list of all node_ids on the page that are bigger than $namespaceid
    my %seen;
    my @possiblenodes = grep {!$seen{$_}++ && $_ > $namespaceid}
                        get($namespaceurl) =~ /\bnode_id=(\d+)/gs;

    ## Ask the perlmonks XML engine about the nodes
    my $queryurl = $xmlurl . '&nodes=' . join(',',@possiblenodes)
                 . "&user=$user&passwd=$pass&op=login";
    my $nodeinfo = XMLin(get $queryurl);

    ## Keep those that have 'Name Space' as their parent
    %seen = ();
    my (@children,@authorids);
    for my $node (@{$nodeinfo->{NODE}}) {
        if ($node->{parent_node} == $namespaceid) {
            push (@children,[$node->{node_id},$node->{author_user}]);
            push (@authorids,$node->{author_user}) unless $seen{$node->{author_user}}++;
        }
    }

    ## Get usernames by asking the XML engine about homenodeids
    my %authors;
    my $authorurl = $xmlurl . '&nodes=' . join(',',@authorids)
                  . "&user=$user&passwd=$pass&op=login";
    my $authorinfo = XMLin(get $authorurl);
    for my $node (@{$authorinfo->{NODE}}) {
        $authors{$node->{node_id}} = $node->{content};
    }

    ## Map the author names back onto our list of children
    @children = map {[$_->[0],$_->[1],$authors{$_->[1]}]} @children;

    ## Generate some output based on the data in @children
    printf "%20s %8s %8s\n", 'Name', 'HomeNode', 'Nodeid';
    printf "%20s %8s %8s\n", '-'x10, '-'x8, '-'x8;
    for (sort {lc($a->[2]) cmp lc($b->[2])} @children) {
        my ($nodeid,$homenodeid,$name) = @$_;
        printf "%20s %8d %8d\n", $name, $homenodeid, $nodeid;
    }

    =OUTPUT
    Name  HomeNode  Nodeid
    ----------  --------  --------
    agent00013  88170  110210
    Aighearach  8329  122320
    Arguile  70968  121735
    basicdez  52645  122419
    beretboy  70819  124672
    blakem  83485  124766
    BooK  21732  123175
    BrentDax  59600  121974
    brianarn  64771  122822
    buckaduck  63135  110774
    cadfael  32111  110343
    chaoticset  117316  121945
    Chmrr  34691  124788
    Corion  5348  110229
    CubicSpline  113570  120007
    davis  80839  120534
    demerphq  108447  110195
    dga  101472  110325
    dragonchild  85580  110274
    dthacker  4945  124473
    earthboundmisfit  85698  121656
    giulienk  114167  124686
    gregor42  63250  119121
    Guildenstern  25301  120004
    Hero Zzyzzx  59531  124631
    herveus  73441  110281
    Hofmator  85506  110246
    ichimunki  45391  110190
    idnopheq  63535  110270
    jackdied  111952  120038
    japhy  1936  110171
    Jouke  27919  117301
    kwoff  115491  121723
    lo_tech  108270  110209
    Maestro_007  73311  123882
    Masem  53423  110197
    mexnix  63979  110336
    mikeB  82957  120127
    mr_mischief  62512  121651
    MrNobo1024  56379  110212
    poqui  105362  120119
    PotPieMan  23091  120952
    riffraff  60508  123693
    runrig  31503  110194
    s173451000  120825  120950
    seanbo  72233  123827
    shotgunefx  75719  110196
    sifukurt  1683  120006
    stefan k  5094  124362
    suaveant  56739  120111
    synapse0  62956  110202
    TheoPetersen  22772  110307
    theorbtwo  25047  117292
    tilly  26179  110186
    toadi  5087  110313
    Veachian64  116016  124682
    wine  77442  110292
    ybiC  14909  117217
    zakzebrowski  38800  120012
    Zaxo  82147  110250
    Zecho  105646  121653
    {NULE}  113194  117228

    -Blake
