http://www.perlmonks.org?node_id=175402

smitz has asked for the wisdom of the Perl Monks concerning the following question:

I'm currently working on a little project which is supposed to crawl the local intranet and save some meta tags in a DB.
Simple enough, I thought I'd expand my module knowledge and try (the excellent!) WWW::Robot.

However, after a successful start, the program simply stops. No die, no 'out of memory', it just stops. This usually occurs after
about 40 pages, though I have been known to reach ~80. Code below, not that long.

#!perl
# SMiTZ's Big Bad Support Crawler
# crawls support.intel.com, reading title and meta description tags,
# storing these in a lovely database.
#
# v0.1
use strict;
use warnings;
use WWW::Robot;
use DBI;
use Data::Dumper;

my $docRoot       = 'http://support.intel.com';
my $databaseName  = 'spider1';
my $databaseTable = 'index';

# Connect to DB (attributes go in the 4th argument to connect; as the
# 2nd argument they would be taken as the username and RaiseError
# would never actually be set)
my $dbh = DBI->connect("DBI:ODBC:$databaseName", '', '', { RaiseError => 1 })
    or die "Couldn't connect to database: $DBI::errstr; stopped";

my $robot = new WWW::Robot(
    'NAME'        => 'SMiTZ\'s Bot',
    'VERSION'     => 0.1,
    'EMAIL'       => '***@intel.com',
    'VERBOSE'     => 0,    # NAUGHTY BOY!!! Change ASAP
    'DELAY'       => 0,
    'IGNORE_TEXT' => 0,
);

$robot->proxy('http', 'http://***.***.intel.com:911');    # damn firewall

$robot->addHook('follow-url-test',    \&follow_url_test);
$robot->addHook('invoke-on-contents', \&invoke_on_contents);
$robot->run($docRoot);

$dbh->disconnect();

#------------------------------------------------------------------------------
# Hook subs
sub follow_url_test {
    my ($robot, $hook, $url) = @_;
    return 0 unless $url->scheme eq 'http';
    return 0 if $url =~ /\.(gif|jpg|png|xbm|au|wav|mpg|doc|xml|ppt)$/;
    return $url =~ /^$docRoot/;
}

sub invoke_on_contents {
    my ($robot, $hook, $url, $response, $structure) = @_;
    return unless $response->content_type eq 'text/html';

    my $desc = $response->header("X-meta-keywords");
    $desc = 'none, none, none, none, none' unless $desc;
    my $title = $response->header("title");
    my @desc  = split /,/, $desc;

    $dbh->do(q{
        INSERT INTO index (url, title, description1, description2,
                           description3, description4, description5)
        VALUES (?, ?, ?, ?, ?, ?, ?)
    }, undef, $url, $title, $desc[0], $desc[1], $desc[2], $desc[3], $desc[4])
        or die $dbh->errstr;
    print "*";
}
Any ideas what the cause could be? I'd be happy with an RTFM response, as long as someone could indicate which M.

Thanks muchly in advance,
SMiTZ

Re: WWW::Robot hangs
by marcos (Scribe) on Jun 19, 2002 at 06:51 UTC
    Did you check your database log files to see if anything strange is happening? You do an INSERT INTO index ... in sub invoke_on_contents {...}, but you never do an explicit commit to the database. I guess this is because you rely on DBI's AutoCommit being on; AFAIK AutoCommit should indeed default to on. Anyway, the DBI man page says "Explicitly defining the required AutoCommit behavior is strongly recommended and may become mandatory in a later version.". My guess is that AutoCommit may not be working properly for you: after a certain number of INSERTs that are not automatically committed, you may end up with a full database rollback segment, and that may cause DBI::ODBC to crash (OK, I don't know why you don't get back any error).
    I have had problems with AutoCommit before, so I usually set AutoCommit off explicitly and then perform an explicit commit when I want one.
    I hope this helps.

    marcos
      I'll be honest: most of the DBI attributes were added for debugging purposes, and I'm not sure I know what some of them do, particularly AutoCommit.
      I'm using this on a Win2000 box with MS Access; how do I check the server log?
      Don't tell me, it's not possible with Access... :-(
      Further, how do I do a hard commit? And should I be doing this after each INSERT, or just occasionally?

      Thanks,
      SMiTZ
        Unfortunately I can't give you any help with Access, and I don't know whether Access supports commit and rollback... sorry. Don't you have any other DB to test your script with? Anyway, for the other questions:
        AutoCommit is a database handle attribute. From the DBI man page: "If true, then database changes cannot be rolled-back (undone). If false, then database changes automatically occur within a "transaction", which must either be committed or rolled back using the commit or rollback methods." You can read further details in the DBI man page. You can set AutoCommit in the connect statement in the same way as you set RaiseError; note that the attribute hashref has to be the fourth argument to connect, for example:
        my $dbh = DBI->connect("DBI:ODBC:$databaseName", '', '', { AutoCommit => 0, RaiseError => 1 }) or die "cannot connect: $DBI::errstr";

        You can perform an explicit commit on the database simply by saying:
        $dbh->commit;    # or call $dbh->rollback to undo changes
        This makes sense only when you have turned AutoCommit off.
        You can commit after every INSERT to see if that fixes things, but committing is time-consuming, so you may decide to commit every 10 or 50 INSERTs, or whatever is suitable for your application.
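        For illustration, a minimal sketch of that batching (untested; it assumes AutoCommit => 0 was set in the connect call, and $insert_count is a hypothetical new lexical, not from the original script):

        # Declared once, near $dbh, at the top of the script:
        my $insert_count = 0;

        # Then at the end of invoke_on_contents, right after the $dbh->do:
        $dbh->commit if ++$insert_count % 50 == 0;    # flush every 50 INSERTs

        # And once after $robot->run($docRoot), to flush the final partial batch:
        $dbh->commit;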

        marcos
        Forget MS Access and get a real DB: MySQL installation is a breeze, and usage is quite friendly with a frontend.

        I tried writing NT services using Access/ODBC (even ADO), but they all died or locked up (without further notice) after 1-2 days. With MySQL and DBI I have months of uptime and not a problem.
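        If you switch, the only change on the Perl side is the connect call; a sketch, assuming a local MySQL server holding the 'spider1' database ('spider'/'secret' are placeholder credentials, not from the thread):

        use DBI;

        my $dbh = DBI->connect(
            'DBI:mysql:database=spider1;host=localhost',
            'spider', 'secret',
            { RaiseError => 1, AutoCommit => 0 },
        ) or die "cannot connect: $DBI::errstr";

        # NB: INDEX is a reserved word in MySQL, so the table from the
        # original script would need renaming or backtick-quoting.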

        cheers,
        Aldo

        __END__
        $_=q,just perl,,s, , another ,,s,$, hacker,,print;
Re: WWW::Robot hangs
by caedes (Pilgrim) on Jun 18, 2002 at 22:03 UTC
    I had a very similar problem with this package. It seemed that the hooked subs were not being activated correctly. This was on a Win32 system, so at the time I was inclined to blame the Perl port and move on to using a different package.

    What platform have you tested this code on?

Re: WWW::Robot hangs
by caedes (Pilgrim) on Jun 19, 2002 at 12:20 UTC
    Since you are using the exact same platform I was having trouble on, I don't think the problem lies in DBI. I was just writing the results of the crawl to flat files and was still getting problems where the hooked subs wouldn't get called. I'd suggest you strip the WWW::Robot use down to the bare minimum and see if you can get that to work; just get rid of all the database code.
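    A bare-minimum sketch along those lines, using only the WWW::Robot calls already in the question (the robot name and verbosity are placeholders):

    #!perl
    use strict;
    use warnings;
    use WWW::Robot;

    my $docRoot = 'http://support.intel.com';

    my $robot = new WWW::Robot(
        'NAME'    => 'BareBot',    # placeholder name
        'VERSION' => 0.1,
        'EMAIL'   => '***@intel.com',
        'VERBOSE' => 1,            # leave verbose on to see where it stalls
    );

    $robot->addHook('follow-url-test', sub {
        my ($robot, $hook, $url) = @_;
        return 0 unless $url->scheme eq 'http';
        return $url =~ /^$docRoot/;
    });

    $robot->addHook('invoke-on-contents', sub {
        my ($robot, $hook, $url, $response, $structure) = @_;
        print "fetched: $url\n";    # if these stop appearing, the robot itself is hanging
    });

    $robot->run($docRoot);

    If the fetch lines still stop after ~40 pages with no database in the picture, the problem is in WWW::Robot (or the Win32 port), not DBI.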