http://www.perlmonks.org?node_id=759194

hsmyers has asked for the wisdom of the Perl Monks concerning the following question:

While running a script on production data, I noticed that it seemed to be locked up. Since it had sat there at the command line for several hours beyond the usual run time, I pulled the plug with a ^C (this is on XP, BTW), only to see:

Can't coerce UNKNOWN to string in aelem at C:\Documents and Settings\hsmyers\Desktop\backlink\backlink.pl line 68, <$fh_IN> line 52878470.


instead of the usual message about Terminating on signal SIGINT(2). I hit ^C again and made it back out to the prompt, only to see the Microsoft message box saying that the application had failed and asking whether I wanted to tell Microsoft about it. My experience to date has suggested that perl.exe does not crash easily. Clearly, however, something here says otherwise. I've attached the code and would appreciate any information anyone might be able to toss my way. The input data files are typically in the gigabyte range; in the case of the crash, the file was 2,322,832,377 bytes.
#!/usr/bin/perl
# backlink.pl -- script to rewrite .ic files by adding backlink text.
use strict;
use warnings;
use English;
use Prosaix;
use Data::Dumper::Simple;

my %hrefs;
my $url;
my $re_url   = qr/^\/url: (.*)/;
my $re_end   = qr/^\/endtext/;
my $re_wc    = qr/^\/wc: (.*)/;
my $notfound = 0;
my $total    = 0;
my $success  = '_____found_____';
my $count    = 1;

my $base_file = $ARGV[0] or die "Missing input file name\n";
(my $output_file = $base_file) =~ s/\.txt/\+bl.txt/;

open( my $fh_HREF, '<', $base_file . '.links' )
  or die "Couldn't open '$base_file.links': $OS_ERROR\n";
open( my $fh_IN, '<', $base_file )
  or die "Couldn't open '$base_file': $OS_ERROR\n";
open( my $fh_OUT, '>', $output_file )
  or die "Couldn't open 'base_file.backlnked': $OS_ERROR\n";
open( my $fh_ERR, '>', $base_file . '.unlinked' )
  or die "Couldn't open 'base_file.unlinked': $OS_ERROR\n";
binmode $fh_OUT;
start();

while (<$fh_HREF>) {
    chomp;
    my ( $key, $value ) = split(/\|/);
    unless ( defined( $hrefs{$key} ) ) {
        $hrefs{$key} = [];
    }
    push( @{ $hrefs{$key} }, $value );
}

while (<$fh_IN>) {
    print $fh_OUT $_;
    if (/$re_url/) {
        $url = $1;
    }
    if (/$re_wc/) {
        print "wc $count $1\n";
        $count++;
    }
    elsif (/$re_end/) {
        print $fh_OUT "/backlinks\n";
        if ( defined( $hrefs{$url} ) ) {
            my $s = $hrefs{$url};
            if (defined($s->[0])) {
                my @text = collapse(@$s);
                print $fh_OUT join( ".\n", @text ), ".\n";
                print join( ".\n", @text ), ".\n";
            }
            push( @{ $hrefs{$url} }, $success );
        }
    }
}

while ( my ( $url, $text ) = each(%hrefs) ) {
    if ( !defined( $text->[-1] ) ) {
        print $fh_ERR "$url|_____NO_TEXT_____\n";
    }
    elsif ( $text->[-1] ne $success ) {
        $notfound++;
        if (defined($text->[0])) {
            my @text = collapse(@$text);
            print $fh_ERR "$url|", join( ",", @text ), "\n";
        }
        else {
            print $fh_ERR "$url|\n";
        }
    }
    $total++;
}

close($fh_HREF);
close($fh_IN);
close($fh_OUT);
close($fh_ERR);

print "$notfound URLs (of $total) pointing to about.com pages not\n";
print "in this IC ($base_file) written to $base_file.unlinked\n";
finish();

sub collapse {
    my @s = @_;
    my @t;
    for (@s) {
        next unless defined($_);
        push(@t,$_);
    }
    return @t;
}
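
For reference, one common way to keep a ^C from landing in the middle of a long read loop is to have the signal handler only set a flag and let the loop check it between records. The following is a minimal sketch of that pattern, not a diagnosis of the crash described above. The names $interrupted and copy_until_interrupted are hypothetical (they are not part of backlink.pl), and the in-memory filehandles stand in for the real multi-gigabyte input.

```perl
use strict;
use warnings;

# Hypothetical flag: the handler only records that ^C was pressed.
my $interrupted = 0;
$SIG{INT} = sub { $interrupted = 1 };

# Hypothetical helper: copy lines from $in to $out, stopping after the
# current line once the flag is set. Returns the number of lines copied.
sub copy_until_interrupted {
    my ( $in, $out ) = @_;
    my $lines = 0;
    while ( my $line = <$in> ) {
        print {$out} $line;
        $lines++;
        last if $interrupted;
    }
    return $lines;
}

# Demo with in-memory filehandles standing in for the real files.
my $source = "/url: x\n/wc: 3\n/endtext\n";
my $copied = '';
open( my $demo_in, '<', \$source )
  or die "Couldn't open in-memory input: $!\n";
open( my $demo_out, '>', \$copied )
  or die "Couldn't open in-memory output: $!\n";
my $n = copy_until_interrupted( $demo_in, $demo_out );
close($demo_in);
close($demo_out);
print "copied $n lines\n";
```

With this shape the loop always finishes the record it is working on before stopping, so the output file is never left with a half-written line. Whether that would have avoided the perl.exe failure on XP is an open question.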

--hsm

"Never try to teach a pig to sing...it wastes your time and it annoys the pig."