Re: how to split huge file reading into multiple threads

Replies are listed 'Best First'.
Re^2: how to split huge file reading into multiple threads by sagarika (Novice) on Aug 30, 2011 at 09:05 UTC
Here is what my code does. I have a file with 20+ million lines/records (say a.txt) and another file with around 600 lines/records (say b.txt). The lines in b.txt belong to categories, so one category can correspond to more than one line/record.

What my code does:
1. Create a hash out of b.txt (key = category; value = some mandatory part of the records).
2. Read every record from a.txt and check whether it matches any of the mandatory parts; if yes, create a file for that category and dump the entire line/record into it.

So, in the worst case (when the match is found on the last record), every one of the 20+ million records gets compared against roughly 600 records, and that is where all the processing/looping time goes. Please help. How can I expedite the process?
Re^3: how to split huge file reading into multiple threads by AR (Friar) on Aug 30, 2011 at 12:30 UTC
Please show some code. We can help you best if you show a stripped down, but working, version of your code with sample data. Maybe your problem is that you're opening files over and over again when you should be keeping them open. I can't tell from your description of the code.
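A minimal sketch of the keep-them-open idea, assuming the output is one append-mode file per category inside a single directory. The names `append_line`, `close_all` and `%fh_for` are hypothetical and do not come from the poster's code:

```perl
use strict;
use warnings;

my %fh_for;    # category name => already-open append filehandle (hypothetical cache)

# Append one record to the file for its category, opening that file
# only the first time the category is seen.
sub append_line {
    my ($dir, $category, $line) = @_;
    my $fh = $fh_for{$category};
    unless ($fh) {
        open $fh, '>>', "$dir/$category"
            or die "Cannot open $dir/$category: $!";
        $fh_for{$category} = $fh;
    }
    print {$fh} $line;
    return;
}

# Close every cached handle once the whole input has been processed.
sub close_all {
    for my $category (keys %fh_for) {
        close $fh_for{$category}
            or die "Cannot close file for $category: $!";
    }
    %fh_for = ();
    return;
}
```

With a few hundred categories this probably stays under the per-process open-file limit, but that limit is system-dependent, so it is worth checking before relying on the cache.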
Re^4: how to split huge file reading into multiple threads by sagarika (Novice) on Sep 02, 2011 at 06:36 UTC
Alright. Here is a snippet of the code. The script mainly consumes time in the for loop where the hash-of-arrays keys are referred to.

    @Patterns=("xxx", "SSS", "s:S");

    sub getCsvHash
    {
        %master=(); # unset the HASH.
        foreach my $wlp (@Patterns)
        {
            my $key=$wlp;
            $key =~ s/[\s+|:]/_/g;
            open FILE, "csv_file.csv" or die $!;
            while(<FILE>)
            {
                my $line=$_;
                {
                    my @csv = split(",", $line);
                    if ($csv[1] =~ /"$wlp/)
                    {
                        push (@{$master{$key}}, $line); # push as value of a hash
                    }
                }
            } #while(<FILE>) ends here
            close FILE;
        } #foreach $wlp ends here
    } #Function getWhiteListCsvArrays ends here.

    sub Processfiles
    {
        open DFH, "$log_file";
        while(<DFH>)
        {
            my $line=$_;
            if ($line =~/(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)/)
            {
                my $rc=$3;
                my $ct=$10;
                my $cl=$6;
                my $retval=applyList($line,$rc,$cl,$ct)
            }
        }
    }

    sub applyList
    {
        foreach my $row (@{$master{$key}})
        {
            my $param_op1_flag="nc"; #Set the flag to indicate that the optional parameter 1 is nc(not-checked).
            my $param_op2_flag="nc"; #Set the flag to indicate that the optional parameter 2 is nc(not-checked).
            my @row_csv = split(",", $row);
            $row_csv[3] =~s/"(.*?)"/$1/g; #Get the mandatory part 1. This is the domain-name.
            $row_csv[4] =~s/"(.*?)"/$1/g; #Get the mandatory part 2. This is part after domain-name.
            my $param_man = $row_csv[3] . $row_csv[4]; #combine the mandatory parts.
            my $param_op1=$row_csv[5]; #Get the optional parameter 1.
            my $param_op2=$row_csv[6]; #Get the optional parameter 2.
            $param_op1=~ s/\n//g; #Remove the new-lines if any.
            $param_op2=~ s/\n//g; #Remove the new-lines if any.
            if(length($param_op1)) #check if optional parameter 1 has something to check or not.
            {
                $param_op1 =~s/"(.*?)"/$1/g; #Remove the double inverted commas.
                $param_op1 =~s/\?/\\\?/g; #Escape the special characters like: ?.
                if($url =~/$param_op1/) #check if optional parameter 1 is present in URL or not.
                {
                    $param_op1_flag="cf"; #Set the optional parameter 1 flag to cf (checked-found).
                }
                else
                {
                    $param_op1_flag="cnf"; #Set the optional parameter 1 flag to cnf (checked-not-found).
                }
            }
            if(length($param_op2) > 1 )
            {
                $param_op2 =~s/"(.*?)"/$1/g; #Remove the double inverted commas.
                $param_op2 =~s/\?/\\\?/g; #Escape the special characters like: ?.
                if($url =~ /$param_op2/) #check if optional parameter 2 is present in URL or not.
                {
                    $param_op2_flag="cf"; #Set the optional parameter 2 flag to cf (checked-found).
                }
                else
                {
                    $param_op2_flag="cnf"; #Set the optional parameter 2 flag to cnf (checked-not-found).
                }
            }
            if($url=~/$param_man/ && ($param_op1_flag eq "cf" || $param_op1_flag eq "nc") && ($param_op2_flag eq "cf" || $param_op2_flag eq "nc"))
            {
                if (($cl < 5000 || $rc == 206) && $key =~/^AS_D/)
                {
                    open OPF, ">>out/$key_cont";
                    print OPF $line;
                    close OPF;
                    applyBlackList($line,$key_cont,$ct);
                    $retval_def=1;
                    return $retval_def;
                }
                open OPF, ">>out/$key";
                print OPF $line;
                close OPF;
                if($key !~/^AD/)
                {
                    applyBlackList($line,$key,$ct);
                }
                $retval_def=1;
                return $retval_def
            }
        }
    }
    return $retval_def;
    }
    }
    }

Please suggest.
Re^5: how to split huge file reading into multiple threads by roboticus (Chancellor) on Sep 02, 2011 at 12:17 UTC
Re^6: how to split huge file reading into multiple threads by sagarika (Novice) on Sep 07, 2011 at 06:03 UTC
Re^3: how to split huge file reading into multiple threads by GrandFather (Saint) on Aug 30, 2011 at 10:00 UTC
As Corion suggests, it is hard to offer much by way of constructive advice without something concrete to play with. However, it may be that you can leverage regular expressions in some fashion to speed up the matching phase of the process. I can't provide much more focused advice without some information about the nature of the matching.

True laziness is hard work
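One way the regular-expression suggestion might look in practice is to compile every rule once, before the 20+ million line loop, instead of rebuilding patterns from strings for every record. This is only a sketch under assumptions: each rule is taken to be a mandatory part plus zero or more optional parts, all treated as literal substrings of the URL (hence `\Q...\E`), and `compile_rules` and `match_category` are hypothetical names that do not appear in the posted code:

```perl
use strict;
use warnings;

# Turn raw rules into precompiled matchers, once, before the big loop.
# Input: hashref of category => [ [ $mandatory, @optional ], ... ].
# Every part is treated as a literal substring (an assumption), so it
# is quoted with \Q...\E before being compiled with qr//.
sub compile_rules {
    my ($raw) = @_;
    my @compiled;
    for my $category (sort keys %$raw) {
        for my $rule (@{ $raw->{$category} }) {
            my ($mandatory, @optional) = @$rule;
            push @compiled, {
                category  => $category,
                mandatory => qr/\Q$mandatory\E/,
                # Empty optional parts mean "nothing to check" and are dropped.
                optional  => [ map { qr/\Q$_\E/ } grep { length } @optional ],
            };
        }
    }
    return \@compiled;
}

# Return the category of the first rule whose mandatory part and all
# non-empty optional parts occur in the URL, or undef if none match.
sub match_category {
    my ($compiled, $url) = @_;
  RULE:
    for my $rule (@$compiled) {
        next RULE unless $url =~ $rule->{mandatory};
        for my $re (@{ $rule->{optional} }) {
            next RULE unless $url =~ $re;
        }
        return $rule->{category};
    }
    return;
}
```

The per-record work then shrinks to a handful of matches against already-compiled patterns. Whether that is enough of a speed-up depends on the real data, which the thread does not show.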
Re^3: how to split huge file reading into multiple threads by Corion (Patriarch) on Aug 30, 2011 at 09:43 UTC
This is not code we can download and run for ourselves. Please reduce your problem to a program of about 20 lines, and also post about 20 lines of representative data.


	PerlMonks