
Re^3: filter a file using an exclusion list

by Cristoforo (Deacon)
on Feb 13, 2011 at 22:16 UTC (#887900)


in reply to Re^2: filter a file using an exclusion list
in thread filter a file using an exclusion list

There are some errors. The one that blows up is:

my %stopwords = map { $_ => } <@excludes>;

If you had printed %stopwords with Data::Dumper, you would have seen:

$VAR1 = { '9999853' => '999986' };

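And this is not what you want! In the map, you are failing to assign a value to each key, so map returns a flat list of your items and the hash assignment pairs them off as key, value, key, value. You could state that as: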

my %stopwords = map { $_ => 1} @excludes;

Here I am assigning a value of 1, which conveniently tests true when you look up a stopword. I also removed the angle brackets around @excludes in your code; those make it a glob, which is not what you want.
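To make the difference concrete, here is a minimal sketch. The two-item @excludes list and the name %broken are only illustrative assumptions, and the first map uses a plain { $_ } block to stand in for a map that supplies no value.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative data only: two ids taken from the sample exclusion list.
my @excludes = ( 9999853, 999986 );

# One element per item: the flat list is consumed pairwise as key, value,
# so the second id becomes the VALUE of the first instead of its own key.
my %broken = map { $_ } @excludes;

# Two elements per item: every id becomes a key with the true value 1.
my %stopwords = map { $_ => 1 } @excludes;

printf "broken has %d key(s), stopwords has %d key(s)\n",
    scalar( keys %broken ), scalar( keys %stopwords );
```

With the second form, every id in the list can be looked up with a simple truth test such as $stopwords{$id}.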

Altogether, it could be solved like this:

#!/usr/bin/perl
use strict;
use warnings;
use 5.012;
use Data::Dumper;

my $large =<<EOF;
9999853 5615 4 148656321
999986 5615 14 94873609
9999883 5615 4 860669
9999929 5615 4 73689618
9999931 5615 4 31286083
9999944 5615 4 148596445
999995 5615 10 78405504
9999963 5615 4 84291761
9999966 5615 4 5978256
9999979 5615 4 135953341
EOF

my $excludes =<<EOF;
9999853
999986
EOF

open my $fh1, "<", \$excludes or die $!;
my %stopwords = map {chomp; $_ => 1} <$fh1>;
close $fh1 or die $!;

open my $fh2, "<", \$large or die $!;
while( <$fh2> ){
    my ($test) = /^(\d+)/;
    # if $test is in the hash
    # then $stopwords{ $test } == 1 or true
    print unless $stopwords{ $test };
}
close $fh2 or die $!;

#print Dumper \%stopwords;

Replies are listed 'Best First'.
Re^4: filter a file using an exclusion list
by coldy (Scribe) on Feb 13, 2011 at 23:29 UTC
    The above code works well on the sample data. With a 70,000-item exclusion list and 40 GB of data to filter, however, the script does not work: it filters nothing and outputs every line of the large file, even though grep finds some of the excluded items in the large file when I search for them. (I would use grep -v -f excludes.txt large.txt, but that does not work on the full data either.) Is there a maximum size for a Perl hash? Is there any other reason it would not work on the full data?
Re^4: filter a file using an exclusion list
by coldy (Scribe) on Feb 13, 2011 at 23:38 UTC
    ahh - dos2unix !! I think it will work now. Many thanks!
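
    For anyone who hits the same symptom: the dos2unix fix suggests (this is an inference, not something stated in the thread) that the exclusion file had DOS line endings. On Unix, chomp removes only the trailing "\n", so each key keeps a "\r" and never matches the digits captured from the large file. A minimal sketch of a loader that tolerates both endings follows; the helper name load_stopwords and the sample string are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build the lookup hash from an open filehandle,
# accepting either Unix (LF) or DOS (CRLF) line endings.
sub load_stopwords {
    my ($fh) = @_;
    my %stopwords;
    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;    # strip "\n" or "\r\n", unlike a bare chomp
        $stopwords{$line} = 1;
    }
    return \%stopwords;
}

# Sample data with DOS line endings, read through an in-memory filehandle.
my $excludes = "9999853\r\n999986\r\n";
open my $fh, "<", \$excludes or die $!;
my $stopwords = load_stopwords($fh);
close $fh or die $!;

print "9999853 is excluded\n" if $stopwords->{9999853};
```

    Running dos2unix on the input files, as coldy did, achieves the same result without changing the script.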
