PerlMonks  

Hash table checker doesnt work

by hodashirzad (Novice)
on Apr 10, 2007 at 07:56 UTC (#609089=perlquestion)

hodashirzad has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

The following script is for my crawler: it goes to a page and collects the list of products on that page and on the following pages. It is a recursive function, but there are two problems with it.
1. I am using HTML::TokeParser with a while loop that gets all the links. It has to check two things, the class of the link or the href, but when I put both checks together the program only checks the class and skips the href:
next unless defined($token->[1]{href});
next unless defined($token->[1]{class});
next unless $token->[1]{href} =~ /\?page=/ || $token->[1]{class} =~ /top10_link/;
my $urls = $token->[1]{href};
2. It loops forever, even though I put all the accepted links in a hash table and check the hash table before adding the next link and recursing. Somehow it just goes ahead and adds the existing link to the hash anyway:
if(!$parent{$urls}){
    my $count = keys %parent;
    $parent{$urls} = $i;
    my $count2 = keys %parent;
    print "$i: count1 is: $count and count2 is $count2\n$urls\n";
    if ($urls =~ /page=/){
        print "!!!!!!!!!!!!recursing!!!!!!!!!!!!!\n";
        #print "\n\"$i $title\"\n $parent{$urls}\n";
        &passing($urls);
    }
}
and here is the whole of the script:
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;
use LWP::Simple;
use URI::URL;

my %parent;

sub passing{
    my $url = shift;
    my $data = get($url) or die $!;
    #the magical parser.
    my $p = HTML::TokeParser->new(\$data);
    my $i=0;
    while (my $token = $p->get_tag("a")) {
        next unless defined($token->[1]{href});
        next unless defined($token->[1]{class});
        next unless $token->[1]{href} =~ /\?page=/ || $token->[1]{class} =~ /top10_link/;
        my $urls = $token->[1]{href};
        $urls =~ s/&PHPSESSID=.*//g;
        $urls = &canonical($urls, "http://www.ash-distribution.co.uk/index.php");
        my $title = $p->get_trimmed_text;
        if (!$parent{$urls}){
            $parent{$urls} = $i;
            if ($urls =~ /page=/){
                print "!!!!!!!!!!!!recursing!!!!!!!!!!!!!\n";
                #print "\n\"$i $title\"\n $parent{$urls}\n";
                &passing($urls);
            }
        }
        $i++;
    }
}

sub canonical{
    if (not $_[0] =~ m%^http://%){
        $_[0] = url($_[0])->abs($_[1]);
    }
    return $_[0];
}

&passing("http://www.ash-distribution.co.uk/index.php?c=%20Ink&sc=Epson%20Replacement%20Cartridges");
Any help is appreciated. Thanks in advance.

Update: I figured out what makes the hash check fail. If I start $i at 1 instead of 0 it works. I don't know why, but the check doesn't like it when the hash value is 0, whereas any other value works fine.
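
A likely explanation for the update above: `!$parent{$urls}` tests whether the stored value is true, not whether the key is present, and Perl treats 0 as false. The first link, stored with $i = 0, therefore looks unseen every time it comes round again. Below is a minimal sketch of checking with `exists` instead. The URLs are made up and `seen_before` is a hypothetical helper, not part of the original script.

```perl
use strict;
use warnings;

my %parent;

# Hypothetical helper: records a URL under the given index and reports
# whether it was already known. exists() asks "is the key there?", so a
# stored value of 0 does not make the URL look new again.
sub seen_before {
    my ($url, $index) = @_;
    return 1 if exists $parent{$url};
    $parent{$url} = $index;
    return 0;
}

# Made-up URLs for illustration; the third repeats the first.
my @urls = (
    'http://example.com/?page=1',
    'http://example.com/?page=2',
    'http://example.com/?page=1',
);

my $i = 0;
for my $url (@urls) {
    if (seen_before($url, $i)) {
        print "already seen: $url\n";
    }
    else {
        print "new: $url (index $i)\n";
    }
    $i++;
}
```

With this check the starting value of $i no longer matters, so it can stay at 0.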

Replies are listed 'Best First'.
Re: Hash table checker doesnt work
by ferreira (Chaplain) on Apr 10, 2007 at 10:41 UTC

    If you're having trouble getting the recursion right, peek inside the structures that control it, like %parent in this case. You could use something like

    use Data::Dumper;
    ...
    print Dumper \%parent;
    But I don't see any code that adds entries to this hash. How is the recursion being controlled if, every time a URL passes through the code, it meets (apparently) the same conditions again? I think that self-referencing pages should be quite common.

    As an aside, you should usually avoid calling subroutines as &passing(). Prefer plain passing().
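
One reason for the advice above: the & form behaves differently from a plain call in ways that are easy to miss. Called with parentheses it bypasses prototype checking, and called without parentheses it hands the caller's current @_ to the callee. A small sketch of the second effect, using hypothetical subs named `count_args` and `caller_demo` that are not part of the script:

```perl
use strict;
use warnings;

# Hypothetical sub: reports how many arguments it received.
sub count_args { return scalar @_ }

sub caller_demo {
    my $with_amp = &count_args;    # no parentheses: reuses caller_demo's @_
    my $plain    = count_args();   # plain call: fresh, empty argument list
    return ($with_amp, $plain);
}

my ($with_amp, $plain) = caller_demo('a', 'b', 'c');
print "&count_args saw $with_amp args, count_args() saw $plain\n";
```

In the script from the question, &passing($urls) passes an explicit argument list, so it works as intended. The plain form simply avoids the surprise if the parentheses are ever dropped.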

      Sorry, that was a typo, but I managed to get it working now just by starting $i from 1 instead of 0. I don't know why, when I start $i from 0, it ignores the following check:

      if(!$parent{$urls})

Re: Hash table checker doesnt work
by ferreira (Chaplain) on Apr 10, 2007 at 10:03 UTC

    Update: That was a mistake. I don't know what I was looking at when I saw this error, which isn't there.

    The first thing I spotted in your code was:

    next unless $token->[1]{href} =~ /\?page=/ || $token->[1]{class} =~ /top10_link/;
    which has some precedence problems. You really meant:
    next unless ($token->[1]{href} =~ /\?page=/) || ($token->[1]{class} =~ /top10_link/);
    or
    next unless ($token->[1]{href} =~ /\?page=/) or ($token->[1]{class} =~ /top10_link/);
      Thanks, the code you suggested works, but it gives me "uninitialized" warnings for the links that have no class, so I decided to use if statements instead of next unless:
      next unless defined($token->[1]{href}) || defined($token->[1]{class});
      if (defined($token->[1]{class}) && $token->[1]{class} =~ /top10_link/){
          # do some code
      }
      if($token->[1]{href} =~ /\?page=/){
          # do some code
      }
      Thanks again
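
The filtering discussed in this sub-thread can also be collected into one small predicate. In the original script, `next unless defined($token->[1]{class})` skips every link without a class, which would explain why pagination links that carry only an href never reach the href test. The sketch below assumes an href is always required (the script takes $urls from it) and treats the class as optional. `wanted_link` is a hypothetical helper and the sample attributes are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the attribute hash of a start tag
# ($token->[1] in the thread's code) and says whether to follow the link.
sub wanted_link {
    my ($attr) = @_;
    my $href  = $attr->{href};
    my $class = $attr->{class};

    return 0 unless defined $href;    # nothing to follow without an href
    return 1 if $href =~ /\?page=/;   # pagination link
    return 1 if defined $class && $class =~ /top10_link/;   # product link
    return 0;
}

# Made-up attribute hashes standing in for parsed <a> tags.
my @samples = (
    { href => 'index.php?page=2' },
    { href => 'item.php?id=7', class => 'top10_link' },
    { href => 'about.php' },
    { class => 'top10_link' },
);

for my $attr (@samples) {
    my $label = defined $attr->{href} ? $attr->{href} : '(no href)';
    printf "%-20s %s\n", $label, (wanted_link($attr) ? 'follow' : 'skip');
}
```

Inside the while loop this would reduce the three guards to a single `next unless wanted_link($token->[1]);`, with no "uninitialized" warnings because each regex runs only on a defined value.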

Node Type: perlquestion [id://609089]
Approved by GrandFather