PerlMonks  

Re: how to split huge file reading into multiple threads

by AR (Friar)
on Aug 23, 2011 at 12:58 UTC ( [id://921886]=note )


in reply to how to split huge file reading into multiple threads

Can you show us a stripped-down copy of your single-threaded working code? I have a couple of single-threaded scripts that regularly parse gigabytes of text files in a few minutes. Maybe there's something in your current script that could be fixed.

If your script is being delayed by disk reads and writes, then multi-threading will not help you.
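One rough way to check whether disk I/O is the limiting factor is to time a pass that only reads the file and does no matching, then compare that with the run time of the full script. The sketch below assumes nothing about the original script; the helper name `time_bare_read` is made up for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: read every line of $path and do no other work.
# Returns the number of lines read and the elapsed wall-clock seconds.
sub time_bare_read {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $start = time;
    my $lines = 0;
    $lines++ while <$fh>;
    close $fh;
    return ($lines, time - $start);
}

# Usage: perl bare_read.pl a.txt
if (@ARGV) {
    my ($lines, $secs) = time_bare_read($ARGV[0]);
    printf "%d lines in %.2f s\n", $lines, $secs;
}
```

If the bare read takes nearly as long as the full script, the job is I/O-bound and threads are unlikely to help; if it is much faster, the time is going into the per-line processing. A second run may be served from the operating system's file cache, so compare like with like.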


Replies are listed 'Best First'.
Re^2: how to split huge file reading into multiple threads
by sagarika (Novice) on Aug 30, 2011 at 09:05 UTC

    Here is what my code does:

    I have a file with more than 20 million lines/records (say a.txt).

    I have another file with around 600 lines/records (say b.txt). Each of these lines belongs to a category, so one category can cover more than one line/record.

    What my code does is:

    1. Create a hash out of b.txt (key = category; value = some mandatory part of the records).

    2. Read every record from a.txt and check whether it matches the mandatory part of any of those records; if it does, create a file for that category and write the entire line/record into that file.

    So every record (of the 20+ million) is compared with roughly 600 records in the worst case, that is, when the match is the last record checked.

    And that is where the whole processing/looping happens.

    Please help. How can I expedite the process?

      Please show some code. We can help you best if you show a stripped down, but working, version of your code with sample data.

      Maybe your problem is that you're opening files over and over again when you should be keeping them open. I can't tell from your description of the code.
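      A minimal sketch of what "keeping them open" could look like: cache one append-mode handle per output file in a hash and close them all once at the end. The names `handle_for` and `close_all_handles` are made up for illustration, and this assumes the number of distinct output files stays below the process's open-file limit.

```perl
use strict;
use warnings;

my %out_fh;    # output path => open filehandle

# Hypothetical helper: return an append-mode handle for $category,
# opening the file only the first time that category is seen.
sub handle_for {
    my ($dir, $category) = @_;
    my $path = "$dir/$category";
    unless ($out_fh{$path}) {
        open my $fh, '>>', $path or die "Cannot open $path: $!";
        $out_fh{$path} = $fh;
    }
    return $out_fh{$path};
}

# Close everything once, after the main loop has finished.
sub close_all_handles {
    for my $path (keys %out_fh) {
        close $out_fh{$path} or warn "Cannot close $path: $!";
    }
    %out_fh = ();
}
```

      Inside the main loop, a write would then look like `print { handle_for('out', $key) } $line;`, which avoids an open/close pair for every matching record.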

        All right, here is a snippet of the code. The script spends most of its time in the for loop that walks the entries of the hash of arrays.

        @Patterns=("xxx", "SSS", "s:S");

        sub getCsvHash {
            %master=(); # unset the HASH.
            foreach my $wlp (@Patterns) {
                my $key=$wlp;
                $key =~ s/[\s+|:]/_/g;
                open FILE, "csv_file.csv" or die $!;
                while(<FILE>) {
                    my $line=$_;
                    {
                        my @csv = split(",", $line);
                        if ($csv[1] =~ /"$wlp/) {
                            push (@{$master{$key}}, $line); # push as value of a hash
                        }
                    }
                } #while(<FILE>) ends here
                close FILE;
            } #foreach $wlp ends here
        } #Function getWhiteListCsvArrays ends here.

        sub Processfiles {
            open DFH, "$log_file";
            while(<DFH>) {
                my $line=$_;
                if ($line =~/(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)/) {
                    my $rc=$3;
                    my $ct=$10;
                    my $cl=$6;
                    my $retval=applyList($line,$rc,$cl,$ct)
                }
            }
        }

        sub applyList {
            foreach my $row (@{$master{$key}}) {
                my $param_op1_flag="nc"; #Set the flag to indicate that the optional parameter 1 is nc(not-checked).
                my $param_op2_flag="nc"; #Set the flag to indicate that the optional parameter 2 is nc(not-checked).
                my @row_csv = split(",", $row);
                $row_csv[3] =~s/"(.*?)"/$1/g; #Get the mandatory part 1. This is the domain-name.
                $row_csv[4] =~s/"(.*?)"/$1/g; #Get the mandatory part 2. This is part after domain-name.
                my $param_man = $row_csv[3] . $row_csv[4]; #combine the mandatory parts.
                my $param_op1=$row_csv[5]; #Get the optional parameter 1.
                my $param_op2=$row_csv[6]; #Get the optional parameter 2.
                $param_op1=~ s/\n//g; #Remove the new-lines if any.
                $param_op2=~ s/\n//g; #Remove the new-lines if any.
                if(length($param_op1)) #check if optional parameter 1 has something to check or not.
                {
                    $param_op1 =~s/"(.*?)"/$1/g; #Remove the double inverted commas.
                    $param_op1 =~s/\?/\\\?/g; #Escape the special characters like: ?.
                    if($url =~/$param_op1/) #check if optional parameter 1 is present in URL or not.
                    {
                        $param_op1_flag="cf"; #Set the optional parameter 1 flag to cf (checked-found).
                    }
                    else
                    {
                        $param_op1_flag="cnf"; #Set the optional parameter 1 flag to cnf (checked-not-found).
                    }
                }
                if(length($param_op2) > 1 )
                {
                    $param_op2 =~s/"(.*?)"/$1/g; #Remove the double inverted commas.
                    $param_op2 =~s/\?/\\\?/g; #Escape the special characters like: ?.
                    if($url =~ /$param_op2/) #check if optional parameter 2 is present in URL or not.
                    {
                        $param_op2_flag="cf"; #Set the optional parameter 2 flag to cf (checked-found).
                    }
                    else
                    {
                        $param_op2_flag="cnf"; #Set the optional parameter 2 flag to cnf (checked-not-found).
                    }
                }
                if($url=~/$param_man/ && ($param_op1_flag eq "cf" || $param_op1_flag eq "nc") && ($param_op2_flag eq "cf" || $param_op2_flag eq "nc"))
                {
                    if (($cl < 5000 || $rc == 206) && $key =~/^AS_D/)
                    {
                        open OPF, ">>out/$key_cont";
                        print OPF $line;
                        close OPF;
                        applyBlackList($line,$key_cont,$ct);
                        $retval_def=1;
                        return $retval_def;
                    }
                    open OPF, ">>out/$key";
                    print OPF $line;
                    close OPF;
                    if($key !~/^AD/)
                    {
                        applyBlackList($line,$key,$ct);
                    }
                    $retval_def=1;
                    return $retval_def
                }
            }
            }
            return $retval_def;
        }
        }
        }
        Please suggest.

      As Corion suggests, it is hard to offer much constructive advice without something concrete to play with. However, you may be able to leverage regular expressions in some fashion to speed up the matching phase of the process. I can't provide much more focused advice without some information about the nature of the matching.
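      As one possible illustration of the regex idea, and only a sketch since the nature of the matching is not known: the rule rows could be turned into precompiled `qr//` objects once, before the large file is read, so that no splitting, quote stripping or pattern compilation happens per record. The names `compile_rules` and `url_matches` and the row layout are assumptions, and the values are treated as literal text via `\Q...\E`, which may differ from how the real rules are meant to be interpreted.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one category's rule rows into precompiled
# regexes, done once before the big file is read. Each row is assumed to be
# an array ref [mandatory, optional1, optional2]; empty optionals are skipped.
sub compile_rules {
    my (@rows) = @_;
    my @rules;
    for my $row (@rows) {
        my ($man, @opt) = @$row;
        push @rules, {
            mandatory => qr/\Q$man\E/,
            optional  => [ map { qr/\Q$_\E/ }
                           grep { defined($_) && length($_) } @opt ],
        };
    }
    return \@rules;
}

# True if $url matches the mandatory part and every optional part of any rule.
sub url_matches {
    my ($url, $rules) = @_;
    RULE: for my $rule (@$rules) {
        next unless $url =~ $rule->{mandatory};
        for my $opt (@{ $rule->{optional} }) {
            next RULE unless $url =~ $opt;
        }
        return 1;
    }
    return 0;
}
```

      The per-record work then reduces to a few regex matches against already-compiled patterns; whether that is enough depends on where the time actually goes.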

      True laziness is hard work

      This is not code we can download and run for ourselves. Please reduce your problem to a program of about 20 lines, and also post about 20 lines of representative data.
