Scraping with Treebuilder

by lv211 (Beadle)
on Jul 15, 2006 at 20:26 UTC ( [id://561481]=perlquestion )

lv211 has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I picked up Spidering Hacks and I've had a problem with hack #19.

I keep getting the errors:

"my" variable @perlbooks masks earlier declaration in same scope at treebuilder.pl line 50.

Bareword "parent" not allowed while "strict subs" in use at treebuilder.pl line 15.

syntax error at treebuilder.pl line 34, near ")

Here is the script:

    #!/usr/bin/perl
    use strict;
    use LWP::Simple;
    use HTML::TreeBuilder;

    my $url = 'http:www.oreilly.com/catalog/prindex.html';
    my $page = get( $url ) or die $!;
    my $p = HTML::TreeBuilder->new_from_content( $page );

    my @links = $p->look_down(
        _tag => 'a',
        href => qr{^ \Qhttp://www.oreilly.com/catalog/\E \w+$}x
    );

    my @rows = map { $_->parent-parent } @links;

    my @books;
    for my $row (@rows) {
        my %book;
        my @cells = $row->look_down( _tag => 'td' );
        $book{title} = $cells[0]->as_trimmed_text;
        $book{isbn} = $cells[1]->as_trimmed_text;
        $book{price} = $cells[2]->as_trimmed_text;
        $book{price} =~ s/^\$//;
        $book{url} = get_url( $cells[0] );
        $book{safari} = get_url( $cells[3] );
        $book{examples} = get_url( $cells[4] );
        push @books, \%book;
    }

    sub get_url {
        my $node = shift;
        my @hrefs = $node->look_down( _tag => 'a' )
        return unless @hrefs;
        my $url = $hrefs[0]->attr('href');
        $url =~ s/\s+$//;
        return $url;
    }

    $p = $p->delete;

    {
        my $count = 1;
        my @perlbooks = sort { $a->{price} <=> $b->{price} }
                        grep { $_->{title} =~ /perl/i } @books;
        print $count++, "\t", $_->{price}, "\t", $_->{title} for @perlbooks;
    }

    {
        my @perlbooks = grep { $_->{title} =~ /perl/i } @books;
        my @javabooks = grep { $_->{title} =~ /java/i } @books;
        my $diff = @javabooks - @perlbooks;
        print "There are " .@perlbooks." Perl books and ".@javabooks.
              " Java books. $diff more java than Perl."
    }

Replies are listed 'Best First'.
Re: Scraping with Treebuilder
by liverpole (Monsignor) on Jul 15, 2006 at 20:40 UTC
    Hi lv211,

    Yes, it's because you have a couple of errors in your code:

        my @rows = map { $_->parent-parent } @links;

    should be parent->parent.  And ...

        my @hrefs = $node->look_down( _tag => 'a' )

    is missing a semi-colon ';' at the end.

    That should make your code compile cleanly, anyway.  Good luck!
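
For anyone who wants to try the two fixes without fetching the live page, here is a minimal sketch that applies them to a small inline HTML table. The sample HTML (title, ISBN, price, and the `/catalog/example` link) is made up for illustration and is not the real catalog markup; the rest follows the script in the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the catalog page: one table row per book.
# All values here are invented for the example.
my $html = <<'HTML';
<table>
  <tr>
    <td><a href="http://www.oreilly.com/catalog/example">Example Perl Book</a></td>
    <td>0-000-00000-0</td>
    <td>$10.00</td>
  </tr>
</table>
HTML

my $p = HTML::TreeBuilder->new_from_content($html);

my @links = $p->look_down(
    _tag => 'a',
    href => qr{^ \Qhttp://www.oreilly.com/catalog/\E \w+ $}x,
);

# Fix 1: chain the calls with "->". Written as "$_->parent-parent",
# Perl parses a subtraction whose right-hand side is the bareword "parent".
my @rows = map { $_->parent->parent } @links;

# Fix 2: the assignment to @hrefs now ends with a semicolon.
sub get_url {
    my $node  = shift;
    my @hrefs = $node->look_down( _tag => 'a' );
    return unless @hrefs;
    my $url = $hrefs[0]->attr('href');
    $url =~ s/\s+$//;
    return $url;
}

my @books;
for my $row (@rows) {
    my %book;
    my @cells = $row->look_down( _tag => 'td' );
    $book{title} = $cells[0]->as_trimmed_text;
    $book{url}   = get_url( $cells[0] );
    push @books, \%book;
}

# Free the parse tree once the data has been copied out of it.
$p->delete;

print "$_->{title}\t$_->{url}\n" for @books;
```

Each link's parent is its `td` cell and that cell's parent is the `tr` row, which is why two `parent` calls are needed to get from a link to its row.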


    s''(q.S:$/9=(T1';s;(..)(..);$..=substr+crypt($1,$2),2,3;eg;print$..$/
      Thanks liverpole.
        My pleasure :)

        By the way, I did notice, while trying to run your program, that I was getting an error with the URL you have:

        my $url = 'http:www.oreilly.com/catalog/prindex.html';

        It doesn't work for me in a browser, either (it's a missing or unavailable page).

        The reason I mention it is that the error message from Perl was somewhat cryptic ... so if you get an error retrieving that page, you may want to try another page which you're positive is accessible.
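
One likely reason the message is cryptic: `get` from LWP::Simple just returns undef on failure, and `$!` is not where LWP reports HTTP problems, so `die $!` prints whatever the last system error happened to be. Note also that the URL as written has no `//` after `http:`. Below is a rough sketch of a more talkative fetch using LWP::UserAgent and URI. The helper names `url_problem` and `fetch_page` and the 15-second timeout are assumptions made for this example, not part of the original hack.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

# Hypothetical helper: returns a description of what is wrong with the URL,
# or undef (in scalar context) if it has both a scheme and a host.
sub url_problem {
    my ($url) = @_;
    my $uri = URI->new($url);
    return "no scheme in '$url'" unless defined $uri->scheme;
    return "no host in '$url' (is the '//' after the scheme missing?)"
        unless $uri->can('host') && defined $uri->host && length $uri->host;
    return;
}

# Hypothetical helper: fetch a page, dying with the HTTP status line
# instead of $! when the request fails.
sub fetch_page {
    my ($url) = @_;
    if ( my $problem = url_problem($url) ) {
        die "Bad URL: $problem\n";
    }
    my $ua  = LWP::UserAgent->new( timeout => 15 );
    my $res = $ua->get($url);
    die "Could not fetch $url: ", $res->status_line, "\n"
        unless $res->is_success;
    return $res->decoded_content;
}

# The URL from the script lacks "//", so it is rejected before any request.
my $page = eval { fetch_page('http:www.oreilly.com/catalog/prindex.html') };
print "Error: $@" unless defined $page;
```

With a well-formed URL, a failed request would instead report the server's status line (for example a 404), which is usually enough to tell a bad address from a network problem.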


Re: Scraping with Treebuilder
by HuckinFappy (Pilgrim) on Jul 17, 2006 at 03:50 UTC
    The first error liverpole identified is easy to find if you turn on warnings (which you are not using in your code):
    [10] perl -Mwarnings /tmp/testit.pl
    "my" variable @perlbooks masks earlier declaration in same scope at /tmp/testit.pl line 50.
    Bareword "parent" not allowed while "strict subs" in use at /tmp/testit.pl line 15.
    syntax error at /tmp/testit.pl line 34, near ") return"
    Global symbol "@hrefs" requires explicit package name at /tmp/testit.pl line 34.
    Execution of /tmp/testit.pl aborted due to compilation errors.
    The error that stops the script from compiling at all, though, is the missing semicolon. After 13 years of writing Perl, I still find that missing semicolons, parens, and braces are sometimes the hardest errors to find. I use perltidy now to help. For example, running your code through perltidy, I end up with:
    sub get_url {
        my $node = shift;
        my @hrefs = $node->look_down(_tag => 'a') return unless @hrefs;
        my $url = $hrefs[0]->attr('href');
        $url =~ s/\s+$//;
        return $url;
    }
    Well, that long line jumped right out at me as being seriously wrong, and it was easy to figure out the fix then.
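
Part of what makes a missing semicolon hard to spot is that Perl reports the problem where the next statement begins, not where the semicolon should have been. The small sketch below shows this by compiling a deliberately broken two-line snippet with a string eval and printing the resulting message. The snippet and variable names are made up for illustration. Running `perl -c` on the script performs the same kind of syntax check without executing it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two statements with the semicolon missing after the first one.
my $broken = <<'CODE';
my @hrefs = (1, 2, 3)
return unless @hrefs;
CODE

# String eval compiles the snippet wrapped in an anonymous sub;
# a compile failure returns undef and leaves the message in $@.
my $ok  = eval "sub { $broken }";
my $err = $@;

print defined $ok ? "compiled\n" : "compile error: $err";
```

The message names the line containing `return`, which is the same pattern as the "line 34" report in the output above.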

    HTH,
    ~Jeff

Node Type: perlquestion [id://561481]
Approved by davidrw