sroux has asked for the wisdom of the Perl Monks concerning the following question:
Dear monks,
I am working on a script for data migration purposes. It maps accounts and other fields through two mapping tables, then logs unmapped items for audit/tracing purposes.
It takes ~1h30 for ~1 million rows. I am looking for advice on optimizing the script: quick wins or obvious code optimizations. I am not looking for a complete rewrite (even if that would be very cool).
Thank you for any input and advice!
my $begin_time = time();
#Specify files to map
#my @files = ("dataset1.dat", "dataset2.dat");
foreach $files (@ARGV) {
#Load file into array (start)
my $begin_time = time();
print "Processing file: ".$files."\n";
open my $handle, '<', $files;
chomp( my @lines = <$handle> );
#Exclude
@lines2 = grep( !/(Final|ConvP&L|ConvBsheet)/, @lines );
@lines = ();
@lines = @lines2;
close $handle;
#Load file into array (end)
my $end_time = time();
printf( "Load file into array: "."%.2f\n", $end_time - $begin_time
+ );
#Mapping process (start)
my $begin_time = time();
# Mapping 1
open(MAPPINGFILE1, "mapping3.txt") or die print "Error: not found in specified path\n";
# Mapping 2
open(MAPPINGFILE2, "mapping4.txt") or die print "Error: not found in specified path\n";
#$count++ while <MAPPINGFILE1>;
foreach $line (@lines) {
#Mapping des charges directes
seek MAPPINGFILE1, 0, 0;
while (<MAPPINGFILE1>) {
chomp();
# Skip blank lines and comments
next if /^(\s*(#.*)?)?$/;
# Split source columns
@source = split /\t/, $line;
# Split mapping columns
@mapping = split /\t/, $_;
# Account is matching with source
if ($line =~ m/$mapping[0]/) {
#Account substitution
if($mapping[2] eq "") {
$line =~ s/$mapping[0]/"Compte cible non défini !"
+/;
} else {
$line =~ s/$mapping[0]/$mapping[2]/;
}
#Mapping = target Unit, Alloc_ + Unit source
if($mapping[3] ne "Unit source") {
#Unit substitution
if($mapping[3] eq "") {
$line =~ s/$source[2]/"Unit cible non définie
+!"\tALLOC_$source[2]/;
} else {
$line =~ s/$source[2]/$mapping[3]\tALLOC_$source[2]/;
}
push @lines2, $line;
last;
} else {
#Mapping : source Unit, Alloc_ + source Unit
$line =~ s/$source[2]/$source[2]\tALLOC_$source[2]/;
push @lines2, $line;
last;
}
}
elsif (eof(MAPPINGFILE1)) {
#Mapping des charges indirectes
seek MAPPINGFILE2, 0, 0;
while (<MAPPINGFILE2>) {
chomp();
# Skip blank lines and comments
next if /^(\s*(#.*)?)?$/;
# Split mapping columns (tab)
@mapping = split /\t/, $_;
# Mapping is matching
if ($line =~ m/$mapping[0]/) {
$line =~ s/$mapping[0]/$mapping[4]/;
@source = split /\t/, $line;
$line =~ s/$source[2]/$mapping[2]\tALLOC_$source[2]/;
push @lines2, $line;
last;
} elsif (eof(MAPPINGFILE2)) {
push @rejects, "Lignes non mappées (Account):
+"."\t".$line;
}
}
}
}
}
close MAPPINGFILE1;
close MAPPINGFILE2;
#Suppress double quote
for (@lines2) { s/"//g }
for (@rejects) { s/"//g }
#Replace empty by Missing
for (@lines2) { s/\t(?=\t)/\t#MI/g }
#Generic mapping
for (@lines2) { s/A2022Local/ACTUAL;FY22;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/A2022AjConsoLocal/ACTUAL;FY22;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/A2022InEur/ACTUAL;FY22;Working_Central;Input;Eur_ACT2022_Rate_YTD/g }
for (@lines2) { s/A2022AjConso/ACTUAL;FY22;Working_Central;Adj_Conso;Eur_ACT2022_Rate_YTD/g }
for (@lines2) { s/A2022TxB22/ACTUAL;FY22;Working_Central;Input;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/A2022TxB22AjConso/ACTUAL;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/A2021Local/ACTUAL;FY21;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/A2021AjConsoLocal/ACTUAL;FY21;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/A2021InEur/ACTUAL;FY21;Working_Central;Input;Eur_ACT2021_Rate_YTD/g }
for (@lines2) { s/A2021AjConso/ACTUAL;FY21;Working_Central;Adj_Conso;Eur_ACT2021_Rate_YTD/g }
for (@lines2) { s/A2021TxB22/ACTUAL;FY21;Working_Central;Input;Eur_BUD2021_Rate_YTD/g }
for (@lines2) { s/A2021TxB22AjConso/ACTUAL;FY21;Working_Central;Adj_Conso;Eur_BUD2021_Rate_YTD/g }
for (@lines2) { s/A2020Local/ACTUAL;FY20;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/A2020AjConsoLocal/ACTUAL;FY20;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/A2020InEur/ACTUAL;FY20;Working_Central;Input;Eur_ACT2020_Rate_YTD/g }
for (@lines2) { s/A2020AjConso/ACTUAL;FY20;Working_Central;Adj_Conso;Eur_ACT2020_Rate_YTD/g }
for (@lines2) { s/A2020TxB22/ACTUAL;FY20;Working_Central;Input;Eur_BUD2020_Rate_YTD/g }
for (@lines2) { s/A2020TxB22AjConso/ACTUAL;FY20;Working_Central;Adj_Conso;Eur_BUD2020_Rate_YTD/g }
for (@lines2) { s/B2022Local/BUDGET;FY22;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/B2022AjConsoLoc/BUDGET;FY22;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/B2022AjTBLocal/BUDGET;FY22;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/B2022InEur/BUDGET;FY22;Working_Central;Input;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/B2022AjConso/BUDGET;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/B2022AjTB/BUDGET;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD/g }
#Add header
unshift @lines2, "Scenario;Year;Audit;Version;Vision;Entity;Unit;U
+nit_Alloc;Account;Jan;Feb;Mar;Apr;May;Jun;Jul;Aug;Sep;Oct;Nov;Dec";
#Set common delimiter
for (@lines2) { s/(\t|;)/\t/g }
for (@rejects) { s/(\t|;)/\t/g }
#Mapping process (end)
my $end_time = time();
printf( "Mapping process: "."%.2f\n", $end_time - $begin_time );
#Output (start)
my $begin_time = time();
open my $handle2, ">", $files."_output.txt";
print $handle2 join( "\n", @lines2 );
close $handle2;
open my $handle3, ">", $files."_output.err";
print $handle3 join( "\n", @rejects );
close $handle3;
#Output (end)
my $end_time = time();
printf( "Output: "."%.2f\n", $end_time - $begin_time );
}
Re: Optimization tips
by hv (Prior) on Jul 21, 2022 at 10:41 UTC
On top of what hippo says, you are reading and parsing the mapping files anew for every line you process, that's an awful lot of wasted work. Similarly, you are splitting the line into @source repeatedly for each mapping record.
As a first step: separate out the reading and parsing of the mapping files into data structures, do that once, then walk through the data structures in the loop over @lines. That might look something like this:
# Mapping des charges directes
my @mapping1 = map {
    chomp;
    my @mapping = split /\t/, $_;
    # account, mapped account, mapped source
    [ $mapping[0], $mapping[2], $mapping[3] ];
} grep { !/^(\s*(#.*)?)?$/ } <MAPPINGFILE1>;

# Mapping des charges indirectes
my @mapping2 = map {
    chomp;
    my @mapping = split /\t/, $_;
    # account, mapped account, mapped source
    [ $mapping[0], $mapping[4], $mapping[2] ];
} grep { !/^(\s*(#.*)?)?$/ } <MAPPINGFILE2>;

LINE: for my $line (@lines) {
    my $source = (split /\t/, $line)[2];

    # Mapping des charges directes
    for my $mapping1 (@mapping1) {
        my ($account, $mapped_account, $mapped_source) = @$mapping1;
        # Account is matching with source
        if ($line =~ /$account/) {
            # Account substitution
            if ($mapped_account eq "") {
                $line =~ s/$account/"Compte cible non défini !"/;
            } else {
                $line =~ s/$account/$mapped_account/;
            }
            # Mapping = target Unit, Alloc_ + Unit source
            if ($mapped_source eq 'Unit source') {
                # Mapping : source Unit, Alloc_ + source Unit
                $line =~ s/$source/$source\tALLOC_$source/;
            } elsif ($mapped_source eq "") {
                $line =~ s/$source/"Unit cible non définie !"\tALLOC_$source/;
            } else {
                # Unit substitution
                $line =~ s/$source/$mapped_source\tALLOC_$source/;
            }
            push @lines2, $line;
            next LINE;
        }
    }

    # Mapping des charges indirectes
    for my $mapping2 (@mapping2) {
        my ($account, $mapped_account, $mapped_source) = @$mapping2;
        if ($line =~ /$account/) {
            $line =~ s/$account/$mapped_account/;
            $line =~ s/$source/$mapped_source\tALLOC_$source/;
            push @lines2, $line;
            next LINE;
        }
    }

    push @rejects, "Lignes non mappées (Account): \t$line";
}
However I suspect that even bigger savings are possible: for example, if the account name appears in a specific column in $line, you could probably turn the whole thing into a single hash lookup.
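For what it's worth, a rough sketch of that hash-lookup idea. It assumes the account sits in a fixed tab-separated column of each data line (the column index 8 below is a guess, as are the file and variable names), so the per-line scan over every mapping row collapses to one exact-match lookup:
use strict;
use warnings;

my (@lines, @lines2, @rejects);    # @lines holds the filtered data lines, as above

# Build the lookup once: account => [mapped account, mapped unit]
my %map1;
open my $mfh, '<', 'mapping3.txt' or die "mapping3.txt: $!";
while (<$mfh>) {
    chomp;
    next if /^(\s*(#.*)?)?$/;                  # skip blanks and comments
    my @m = split /\t/;
    $map1{ $m[0] } = [ $m[2], $m[3] ];
}
close $mfh;

for my $line (@lines) {
    my @f = split /\t/, $line;
    if ( my $hit = $map1{ $f[8] } ) {          # one O(1) lookup per line
        $f[8] = ( $hit->[0] // '' ) ne '' ? $hit->[0] : 'Compte cible non défini !';
        push @lines2, join "\t", @f;
    }
    else {
        push @rejects, "Lignes non mappées (Account):\t$line";
    }
}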
while (<MAPPINGFILE2>) {
chomp();
# Skip blank lines and comments
next if /^(\s*(#.*)?)?$/;
# Split mapping columns (tab)
@mapping = split /\t/, $_;
# Mapping is matching
if ($line =~ m/$mapping[0]/) {
$line =~ s/$mapping[0]/$mapping[4]/;
@source = split /\t/, $line;
$line =~ s/$source[2]/$mapping[2]\tALLOC_$source[2]/;
push @lines2, $line;
last;
} elsif (eof(MAPPINGFILE2)) { #correction: ###### Can happen
push @rejects, "Lignes non mappées (Account): "."\t".$line;
}
}
Re: Optimization tips
by hippo (Bishop) on Jul 21, 2022 at 09:08 UTC
Tip 1: profile. There's no real point in optimising code which already contributes negligibly to the runtime. I like the venerable Devel::NYTProf but other profilers are available.
I haven't analysed your code in any detail but this just jumps right out at me:
for (@lines2) { s/A2022Local/ACTUAL;FY22;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/A2022AjConsoLocal/ACTUAL;FY22;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/A2022InEur/ACTUAL;FY22;Working_Central;Input;Eur_ACT2022_Rate_YTD/g }
for (@lines2) { s/A2022AjConso/ACTUAL;FY22;Working_Central;Adj_Conso;Eur_ACT2022_Rate_YTD/g }
for (@lines2) { s/A2022TxB22/ACTUAL;FY22;Working_Central;Input;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/A2022TxB22AjConso/ACTUAL;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/A2021Local/ACTUAL;FY21;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/A2021AjConsoLocal/ACTUAL;FY21;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/A2021InEur/ACTUAL;FY21;Working_Central;Input;Eur_ACT2021_Rate_YTD/g }
for (@lines2) { s/A2021AjConso/ACTUAL;FY21;Working_Central;Adj_Conso;Eur_ACT2021_Rate_YTD/g }
for (@lines2) { s/A2021TxB22/ACTUAL;FY21;Working_Central;Input;Eur_BUD2021_Rate_YTD/g }
for (@lines2) { s/A2021TxB22AjConso/ACTUAL;FY21;Working_Central;Adj_Conso;Eur_BUD2021_Rate_YTD/g }
for (@lines2) { s/A2020Local/ACTUAL;FY20;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/A2020AjConsoLocal/ACTUAL;FY20;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/A2020InEur/ACTUAL;FY20;Working_Central;Input;Eur_ACT2020_Rate_YTD/g }
for (@lines2) { s/A2020AjConso/ACTUAL;FY20;Working_Central;Adj_Conso;Eur_ACT2020_Rate_YTD/g }
for (@lines2) { s/A2020TxB22/ACTUAL;FY20;Working_Central;Input;Eur_BUD2020_Rate_YTD/g }
for (@lines2) { s/A2020TxB22AjConso/ACTUAL;FY20;Working_Central;Adj_Conso;Eur_BUD2020_Rate_YTD/g }
for (@lines2) { s/B2022Local/BUDGET;FY22;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/B2022AjConsoLoc/BUDGET;FY22;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/B2022AjTBLocal/BUDGET;FY22;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/B2022InEur/BUDGET;FY22;Working_Central;Input;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/B2022AjConso/BUDGET;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/B2022AjTB/BUDGET;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD/g }
You are looping over the same array 24 times to run 24 very similar search/replace operations. It should be much more efficient (not to mention maintainable) to combine these into one loop and ideally one search/replace operation. But if your profiling indicates that the slow part is elsewhere, concentrate on that first.
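For illustration, a sketch of what a combined version might look like: one hash holding the same pattern/replacement pairs (only a few shown below; the rest follow the same shape) and a single substitution pass over @lines2. Sorting the keys longest-first also keeps a short key such as A2022TxB22 from matching inside A2022TxB22AjConso, which the original ordering does not guard against.
# One pass over @lines2 instead of 24; %generic restates the same
# pattern => replacement pairs as the substitutions above.
my %generic = (
    A2022Local        => 'ACTUAL;FY22;Working_Central;Input;Local_YTD',
    A2022AjConsoLocal => 'ACTUAL;FY22;Working_Central;Adj_Conso;Local_YTD',
    A2022TxB22        => 'ACTUAL;FY22;Working_Central;Input;Eur_BUD2022_Rate_YTD',
    A2022TxB22AjConso => 'ACTUAL;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD',
    # ... and the remaining pairs ...
);

# Longest keys first so a short key never matches inside a longer one.
my $alt = join '|',
          map { quotemeta }
          sort { length($b) <=> length($a) } keys %generic;
my $generic_re = qr/($alt)/;

s/$generic_re/$generic{$1}/g for @lines2;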
Re: Optimization tips
by Fletch (Bishop) on Jul 21, 2022 at 18:22 UTC
Good remarks so far but I don't think anyone has yet remarked that you're slurping your input file into an array then iterating over it. For a large multi-meg file that's going to cause a good bit of overhead in and of itself (to say nothing of it inflating your process' size which may affect performance if it then winds up causing extra paging by the OS). Unless you need the full file for context (which it doesn't appear to my (admittedly perfunctory) skimming over the code) there's no reason not to read that input file line-by-line.
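For instance, the processing loop could take roughly this shape (a sketch only; map_line() is a hypothetical stand-in for whatever per-line mapping logic ends up inside the loop):
# Stream the data file instead of slurping it: one line in, one line out.
my $file = $ARGV[0];
open my $in,  '<', $file                 or die "Cannot read $file: $!";
open my $out, '>', "${file}_output.txt"  or die "Cannot write ${file}_output.txt: $!";
while ( my $line = <$in> ) {
    chomp $line;
    next if $line =~ /(Final|ConvP&L|ConvBsheet)/;   # same exclusion filter as the original
    print {$out} map_line($line), "\n";              # hypothetical per-line mapping routine
}
close $in;
close $out;

sub map_line { my ($line) = @_; return $line }       # stand-in; the real mapping goes here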
Presumably your mapping files are going to be the smaller inputs so (as was mentioned) I'd suggest restructuring things to read those into data structures once, then work over the meat of the main input line-by-line. If you can alter things to work more from a hash lookup instead of the multiple substitutions even if the mappings are "large" you can use something like GDBM_File or DB_File to keep those out of memory and lookup from disk instead.
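A minimal sketch of that tied-hash idea, assuming DB_File is available (the database file name and the example row are made up):
use strict;
use warnings;
use DB_File;
use Fcntl;

# Tie a hash to a Berkeley DB file: lookups read from disk, not RAM.
tie my %map1, 'DB_File', 'mapping3.db', O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Cannot tie mapping3.db: $!";

$map1{'60110000'} = "61100000\tUnit source";   # made-up example row
print $map1{'60110000'}, "\n";                 # used exactly like an ordinary hash

untie %map1;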
Edit: s/it the/it then/ ; me no tipe gud tewday. Also if you're really looking to speed things up you might could use MCE::Loop to split the reading of the large file across multiple consumers. But fix the structural problems first then it'll be easier because you'll have a cleaner line-by-line processing loop to shove into the MCE bits.
The cake is a lie.
Re: Optimization tips
by jwkrahn (Monsignor) on Jul 21, 2022 at 19:38 UTC
I haven't read through all the code yet, but small tip: use tr/// instead of s/// for single character substitutions.
#Set common delimiter
for (@lines2) { s/(\t|;)/\t/g }
for (@rejects) { s/(\t|;)/\t/g }
#Set common delimiter
tr/\t;/\t/ for @lines2, @rejects;
More than that, I don't see the reason to replace a tab with a tab. That just wastes time.
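In other words, translating only the semicolons is enough and does the same job:
# Only the semicolons need translating; existing tabs can stay as they are.
tr/;/\t/ for @lines2, @rejects;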
Re: Optimization tips
by perlfan (Vicar) on Jul 22, 2022 at 11:31 UTC
foreach $files (@ARGV) {
...
foreach $line (@lines) {
#Mapping des charges directes
seek MAPPINGFILE1, 0, 0;
while (<MAPPINGFILE1>) {
...
Mentioned, but here to be clear. Nothing you do is going to help if you're nesting loops. Also, check out this talk on Supercharging Perl from YAPC this year.