Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Optimization tips

by sroux (Sexton)
on Jul 21, 2022 at 08:42 UTC ( [id://11145626]=perlquestion: print w/replies, xml ) Need Help??

sroux has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I am working on a script for data migration purpose, it is about mapping accounts and others fields through 2 mapping tables then logging unmapped items for audit/tracing purposes. It takes ~1h30 for ~1 million rows. I am looking for some advice in order to optimize the script, quick wins or blatant code optimization advice, I am not looking for complete code rewriting (even if that would be very cool).

Thank you for any input and advice!

my $begin_time = time(); #Specify files to map #my @files = ("dataset1.dat", "dataset2.dat"); foreach $files (@ARGV) { #Load file into array (start) my $begin_time = time(); print "Processing file: ".$files."\n"; open my $handle, '<', $files; chomp( my @lines = <$handle> ); #Exclude @lines2 = grep( !/(Final|ConvP&L|ConvBsheet)/, @lines ); @lines = (); @lines = @lines2; close $handle; #Load file into array (end) my $end_time = time(); printf( "Load file into array: "."%.2f\n", $end_time - $begin_time + ); #Mapping process (start) my $begin_time = time(); # Mapping 1 open(MAPPINGFILE1, "mapping3.txt") or die print "Error: not found +in specified path\n"; # Mapping 2 open(MAPPINGFILE2, "mapping4.txt") or die print "Error: not found +in specified path\n"; #$count++ while <MAPPINGFILE1>; foreach $line (@lines) { #Mapping des charges directes seek MAPPINGFILE1, 0, 0; while (<MAPPINGFILE1>) { chomp(); # Skip blank lines and comments next if /^(\s*(#.*)?)?$/; # Split source columns @source = split /\t/, $line; # Split mapping columns @mapping = split /\t/, $_; # Account is matching with source if ($line =~ m/$mapping[0]/) { #Account substitution if($mapping[2] eq "") { $line =~ s/$mapping[0]/"Compte cible non défini !" +/; } else { $line =~ s/$mapping[0]/$mapping[2]/; } #Mapping = target Unit, Alloc_ + Unit source if($mapping[3] ne "Unit source") { #Unit substitution if($mapping[3] eq "") { $line =~ s/$source[2]/"Unit cible non définie +!"\tALLOC_$source[2]/; } else { $line =~ s/$source[2]/$mapping[3]\tALLOC_$sour +ce[2]/; } push @lines2, $line; last; } else { #Mapping : source Unit, Alloc_ + source Unit $line =~ s/$source[2]/$source[2]\tALLOC_$source[2] +/; push @lines2, $line; last; } } elsif (eof(MAPPINGFILE1)) { #Mapping des charges indirectes seek MAPPINGFILE2, 0, 0; while (<MAPPINGFILE2>) { chomp(); # Skip blank lines and comments next if /^(\s*(#.*)?)?$/; # Split mapping columns (tab) @mapping = split /\t/, $_; # Mapping is matching if ($line =~ m/$mapping[0]/) { $line =~ s/$mapping[0]/$mapping[4]/; @source = split /\t/, $line; $line =~ s/$source[2]/$mapping[2]\tALLOC_$sour +ce[2]/; push @lines2, $line; last; } elsif (eof(MAPPINGFILE2)) { push @rejects, "Lignes non mappées (Account): +"."\t".$line; } } } } } close MAPPINGFILE1; close MAPPINGFILE2; #Suppress double quote for (@lines2) { s/"//g } for (@rejects) { s/"//g } #Replace empty by Missing for (@lines2) { s/\t(?=\t)/\t#MI/g } #Generic mapping for (@lines2) { s/A2022Local/ACTUAL;FY22;Working_Central;Input;Loc +al_YTD/g } for (@lines2) { s/A2022AjConsoLocal/ACTUAL;FY22;Working_Central;Ad +j_Conso;Local_YTD/g } for (@lines2) { s/A2022InEur/ACTUAL;FY22;Working_Central;Input;Eur +_ACT2022_Rate_YTD/g } for (@lines2) { s/A2022AjConso/ACTUAL;FY22;Working_Central;Adj_Con +so;Eur_ACT2022_Rate_YTD/g } for (@lines2) { s/A2022TxB22/ACTUAL;FY22;Working_Central;Input;Eur +_BUD2022_Rate_YTD/g } for (@lines2) { s/A2022TxB22AjConso/ACTUAL;FY22;Working_Central;Ad +j_Conso;Eur_BUD2022_Rate_YTD/g } for (@lines2) { s/A2021Local/ACTUAL;FY21;Working_Central;Input;Loc +al_YTD/g } for (@lines2) { s/A2021AjConsoLocal/ACTUAL;FY21;Working_Central;Ad +j_Conso;Local_YTD/g } for (@lines2) { s/A2021InEur/ACTUAL;FY21;Working_Central;Input;Eur +_ACT2021_Rate_YTD/g } for (@lines2) { s/A2021AjConso/ACTUAL;FY21;Working_Central;Adj_Con +so;Eur_ACT2021_Rate_YTD/g } for (@lines2) { s/A2021TxB22/ACTUAL;FY21;Working_Central;Input;Eur +_BUD2021_Rate_YTD/g } for (@lines2) { s/A2021TxB22AjConso/ACTUAL;FY21;Working_Central;Ad +j_Conso;Eur_BUD2021_Rate_YTD/g } for (@lines2) { s/A2020Local/ACTUAL;FY20;Working_Central;Input;Loc +al_YTD/g } for (@lines2) { s/A2020AjConsoLocal/ACTUAL;FY20;Working_Central;Ad +j_Conso;Local_YTD/g } for (@lines2) { s/A2020InEur/ACTUAL;FY20;Working_Central;Input;Eur +_ACT2020_Rate_YTD/g } for (@lines2) { s/A2020AjConso/ACTUAL;FY20;Working_Central;Adj_Con +so;Eur_ACT2020_Rate_YTD/g } for (@lines2) { s/A2020TxB22/ACTUAL;FY20;Working_Central;Input;Eur +_BUD2020_Rate_YTD/g } for (@lines2) { s/A2020TxB22AjConso/ACTUAL;FY20;Working_Central;Ad +j_Conso;Eur_BUD2020_Rate_YTD/g } for (@lines2) { s/B2022Local/BUDGET;FY22;Working_Central;Input;Loc +al_YTD/g } for (@lines2) { s/B2022AjConsoLoc/BUDGET;FY22;Working_Central;Adj_ +Conso;Local_YTD/g } for (@lines2) { s/B2022AjTBLocal/BUDGET;FY22;Working_Central;Adj_C +onso;Local_YTD/g } for (@lines2) { s/B2022InEur/BUDGET;FY22;Working_Central;Input;Eur +_BUD2022_Rate_YTD/g } for (@lines2) { s/B2022AjConso/BUDGET;FY22;Working_Central;Adj_Con +so;Eur_BUD2022_Rate_YTD/g } for (@lines2) { s/B2022AjTB/BUDGET;FY22;Working_Central;Adj_Conso; +Eur_BUD2022_Rate_YTD/g } #Add header unshift @lines2, "Scenario;Year;Audit;Version;Vision;Entity;Unit;U +nit_Alloc;Account;Jan;Feb;Mar;Apr;May;Jun;Jul;Aug;Sep;Oct;Nov;Dec"; #Set common delimiter for (@lines2) { s/(\t|;)/\t/g } for (@rejects) { s/(\t|;)/\t/g } #Mapping process (end) my $end_time = time(); printf( "Mapping process: "."%.2f\n", $end_time - $begin_time ); #Output (start) my $begin_time = time(); open my $handle2, ">", $files."_output.txt"; print $handle2 join( "\n", @lines2 ); close $handle2; open my $handle3, ">", $files."_output.err"; print $handle3 join( "\n", @rejects ); close $handle3; #Output (end) my $end_time = time(); printf( "Output: "."%.2f\n", $end_time - $begin_time ); }

Replies are listed 'Best First'.
Re: Optimization tips
by hv (Prior) on Jul 21, 2022 at 10:41 UTC

    On top of what hippo says, you are reading and parsing the mapping files anew for every line you process, that's an awful lot of wasted work. Similarly, you are splitting the line into @source repeatedly for each mapping record.

    As a first step: separate out the reading and parsing of the mapping files into data structures, do that once, then walk through the data structures in the loop over @lines. That might look something like this:

    # Mapping des charges directes my @mapping1 = map { my @mapping = split /\t/, $_; # account, mapped account, mapped source [ $mapping[0], $mapping[2], $mapping[3] ]; } <MAPPINGFILE1>; # Mapping des charges indirectes my @mapping2 = map { my @mapping = split /\t/, $_; # account, mapped account, mapped source [ $mapping[0], $mapping[4], $mapping[2] ]; } <MAPPINGFILE2>; LINE: for my $line (@lines) { my $source = (split /\t/, $line)[2]; # Mapping des charges directes for my $mapping1 (@mapping1) { my($account, $mapped_account, $mapped_source) = @$mapping1; # Account is matching with source if ($line =~ /$account/) { # Account substitution if ($mapping eq "") { $line =~ s/$account/"Compte cible non défini !"/; } else { $line =~ s/$account/$mapped_account/; } # Mapping = target Unit, Alloc_ + Unit source if ($mapped_source eq 'Unit source') { # Mapping : source Unit, Alloc_ + source Unit $line =~ s/$source/$source\tALLOC_$source/; } elsif ($mapped_source eq "") { $line =~ s/$source/"Unit cible non définie !"\tALLOC_$source/; } else { # Unit substitution $line =~ s/$source/$mapped_source\tALLOC_$source/; } push @lines2, $line; next LINE; } } # Mapping des charges indirectes for my $mapping2 (@mapping2) { my($account, $mapped_account, $mapped_source) = @$mapping2; if ($line =~ /$account/) { $line =~ s/$account/$mapped_account/; $line =~ s/$source/$mapped_source\tALLOC_$source/; push @lines2, $line; next LINE; } } push @rejects, "Lignes non mappées (Account): \t$line"; }

    However I suspect that even bigger savings are possible: for example, if the account name appears in a specific column in $line, you could probably turn the whole thing into a single hash lookup.

      I see that you fixed one "unreachable code" problem...
      Op should be aware of this in his code:
      To be inside of the while (<MAPPINGFILE2>) { loop means that you are not at the eof yet. When this while loop finishes, MAPPINGFILE2 will be eof, but not before.

      UPDATE: I decided that this is not right. The read of the very last line will consume all characters and hence reach eof. So you can reach eof before a read of <MAPPINGFILE2> would return an undef (the normal way to detect eof).

      while (<MAPPINGFILE2>) { chomp(); # Skip blank lines and comments next if /^(\s*(#.*)?)?$/; # Split mapping columns (tab) @mapping = split /\t/, $_; # Mapping is matching if ($line =~ m/$mapping[0]/) { $line =~ s/$mapping[0]/$mapping[4]/; @source = split /\t/, $line; $line =~ s/$source[2]/$mapping[2]\tALLOC_$sour +ce[2]/; push @lines2, $line; last; } elsif (eof(MAPPINGFILE2)) { #correction: ###### +Can happen push @rejects, "Lignes non mappées (Account): +"."\t".$line; } }
Re: Optimization tips
by hippo (Bishop) on Jul 21, 2022 at 09:08 UTC

    Tip 1: profile. There's no real point in optimising code which already contributes negligibly to the runtime. I like the venerable Devel::NYTProf but other profilers are available.

    I haven't analysed your code in any detail but this just jumps right out at me:

    for (@lines2) { s/A2022Local/ACTUAL;FY22;Working_Central;Input;Loc +al_YTD/g } for (@lines2) { s/A2022AjConsoLocal/ACTUAL;FY22;Working_Central;Ad +j_Conso;Local_YTD/g } for (@lines2) { s/A2022InEur/ACTUAL;FY22;Working_Central;Input;Eur +_ACT2022_Rate_YTD/g } for (@lines2) { s/A2022AjConso/ACTUAL;FY22;Working_Central;Adj_Con +so;Eur_ACT2022_Rate_YTD/g } for (@lines2) { s/A2022TxB22/ACTUAL;FY22;Working_Central;Input;Eur +_BUD2022_Rate_YTD/g } for (@lines2) { s/A2022TxB22AjConso/ACTUAL;FY22;Working_Central;Ad +j_Conso;Eur_BUD2022_Rate_YTD/g } for (@lines2) { s/A2021Local/ACTUAL;FY21;Working_Central;Input;Loc +al_YTD/g } for (@lines2) { s/A2021AjConsoLocal/ACTUAL;FY21;Working_Central;Ad +j_Conso;Local_YTD/g } for (@lines2) { s/A2021InEur/ACTUAL;FY21;Working_Central;Input;Eur +_ACT2021_Rate_YTD/g } for (@lines2) { s/A2021AjConso/ACTUAL;FY21;Working_Central;Adj_Con +so;Eur_ACT2021_Rate_YTD/g } for (@lines2) { s/A2021TxB22/ACTUAL;FY21;Working_Central;Input;Eur +_BUD2021_Rate_YTD/g } for (@lines2) { s/A2021TxB22AjConso/ACTUAL;FY21;Working_Central;Ad +j_Conso;Eur_BUD2021_Rate_YTD/g } for (@lines2) { s/A2020Local/ACTUAL;FY20;Working_Central;Input;Loc +al_YTD/g } for (@lines2) { s/A2020AjConsoLocal/ACTUAL;FY20;Working_Central;Ad +j_Conso;Local_YTD/g } for (@lines2) { s/A2020InEur/ACTUAL;FY20;Working_Central;Input;Eur +_ACT2020_Rate_YTD/g } for (@lines2) { s/A2020AjConso/ACTUAL;FY20;Working_Central;Adj_Con +so;Eur_ACT2020_Rate_YTD/g } for (@lines2) { s/A2020TxB22/ACTUAL;FY20;Working_Central;Input;Eur +_BUD2020_Rate_YTD/g } for (@lines2) { s/A2020TxB22AjConso/ACTUAL;FY20;Working_Central;Ad +j_Conso;Eur_BUD2020_Rate_YTD/g } for (@lines2) { s/B2022Local/BUDGET;FY22;Working_Central;Input;Loc +al_YTD/g } for (@lines2) { s/B2022AjConsoLoc/BUDGET;FY22;Working_Central;Adj_ +Conso;Local_YTD/g } for (@lines2) { s/B2022AjTBLocal/BUDGET;FY22;Working_Central;Adj_C +onso;Local_YTD/g } for (@lines2) { s/B2022InEur/BUDGET;FY22;Working_Central;Input;Eur +_BUD2022_Rate_YTD/g } for (@lines2) { s/B2022AjConso/BUDGET;FY22;Working_Central;Adj_Con +so;Eur_BUD2022_Rate_YTD/g } for (@lines2) { s/B2022AjTB/BUDGET;FY22;Working_Central;Adj_Conso; +Eur_BUD2022_Rate_YTD/g }

    You are looping over the same array 24 times to run 24 very similar search/replace operations. It should be much more efficient (not to mention maintainable) to combine these into one loop and ideally one search/replace operation. But if your profiling indicates that the slow part is elsewhere, concentrate on that first.


    🦛

Re: Optimization tips
by Fletch (Bishop) on Jul 21, 2022 at 18:22 UTC

    Good remarks so far but I don't think anyone has yet remarked that you're slurping your input file into an array then iterating over it. For a large multi-meg file that's going to cause a good bit of overhead in and of itself (to say nothing of it inflating your process' size which may affect performance if it then winds up causing extra paging by the OS). Unless you need the full file for context (which it doesn't appear to my (admittedly perfunctory) skimming over the code) there's no reason not to read that input file line-by-line.

    Presumably your mapping files are going to be the smaller inputs so (as was mentioned) I'd suggest restructuring things to read those into data structures once, then work over the meat of the main input line-by-line. If you can alter things to work more from a hash lookup instead of the multiple substitutions even if the mappings are "large" you can use something like GDBM_File or DB_File to keep those out of memory and lookup from disk instead.

    Edit: s/it the/it then/ ; me no tipe gud tewday. Also if you're really looking to speed things up you might could use MCE::Loop to split the reading of the large file across multiple consumers. But fix the structural problems first then it'll be easier because you'll have a cleaner line-by-line processing loop to shove into the MCE bits.

    The cake is a lie.
    The cake is a lie.
    The cake is a lie.

Re: Optimization tips
by jwkrahn (Abbot) on Jul 21, 2022 at 19:38 UTC

    I haven't read through all the code yet, but small tip: use tr/// instead of s/// for single character substitutions.

    #Set common delimiter for (@lines2) { s/(\t|;)/\t/g } for (@rejects) { s/(\t|;)/\t/g }
    #Set common delimiter tr/\t;/\t/ for @lines2, @rejects;
      More than that, I don't see the reason to replace a tab with a tab. That just wastes time.
Re: Optimization tips
by perlfan (Vicar) on Jul 22, 2022 at 11:31 UTC
    foreach $files (@ARGV) { ... foreach $line (@lines) { #Mapping des charges directes seek MAPPINGFILE1, 0, 0; while (<MAPPINGFILE1>) { ...
    Mentioned, but here to be clear. Nothing you do is going to help if you're nesting loops. Also, checkout this talk on Supercharging Perl from YAPC this year.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11145626]
Approved by Arunbear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2024-04-18 18:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found