sroux has asked for the wisdom of the Perl Monks concerning the following question:
Dear monks,
I am working on a script for data migration purposes. It maps accounts and other fields through two mapping tables, then logs unmapped items for audit/tracing purposes.
It takes ~1h30 for ~1 million rows. I am looking for advice on optimizing the script: quick wins or obvious code optimizations. I am not looking for a complete rewrite (even if that would be very cool).
Thank you for any input and advice!
my $begin_time = time();
#Specify files to map
#my @files = ("dataset1.dat", "dataset2.dat");
foreach $files (@ARGV) {
#Load file into array (start)
my $begin_time = time();
print "Processing file: ".$files."\n";
open my $handle, '<', $files;
chomp( my @lines = <$handle> );
#Exclude
@lines2 = grep( !/(Final|ConvP&L|ConvBsheet)/, @lines );
@lines = ();
@lines = @lines2;
close $handle;
#Load file into array (end)
my $end_time = time();
printf( "Load file into array: "."%.2f\n", $end_time - $begin_time
+ );
#Mapping process (start)
my $begin_time = time();
# Mapping 1
open(MAPPINGFILE1, "mapping3.txt") or die print "Error: not found in specified path\n";
# Mapping 2
open(MAPPINGFILE2, "mapping4.txt") or die print "Error: not found in specified path\n";
#$count++ while <MAPPINGFILE1>;
foreach $line (@lines) {
#Mapping des charges directes
seek MAPPINGFILE1, 0, 0;
while (<MAPPINGFILE1>) {
chomp();
# Skip blank lines and comments
next if /^(\s*(#.*)?)?$/;
# Split source columns
@source = split /\t/, $line;
# Split mapping columns
@mapping = split /\t/, $_;
# Account is matching with source
if ($line =~ m/$mapping[0]/) {
#Account substitution
if($mapping[2] eq "") {
$line =~ s/$mapping[0]/"Compte cible non défini !"
+/;
} else {
$line =~ s/$mapping[0]/$mapping[2]/;
}
#Mapping = target Unit, Alloc_ + Unit source
if($mapping[3] ne "Unit source") {
#Unit substitution
if($mapping[3] eq "") {
$line =~ s/$source[2]/"Unit cible non définie
+!"\tALLOC_$source[2]/;
} else {
$line =~ s/$source[2]/$mapping[3]\tALLOC_$source[2]/;
}
push @lines2, $line;
last;
} else {
#Mapping : source Unit, Alloc_ + source Unit
$line =~ s/$source[2]/$source[2]\tALLOC_$source[2]/;
push @lines2, $line;
last;
}
}
elsif (eof(MAPPINGFILE1)) {
#Mapping des charges indirectes
seek MAPPINGFILE2, 0, 0;
while (<MAPPINGFILE2>) {
chomp();
# Skip blank lines and comments
next if /^(\s*(#.*)?)?$/;
# Split mapping columns (tab)
@mapping = split /\t/, $_;
# Mapping is matching
if ($line =~ m/$mapping[0]/) {
$line =~ s/$mapping[0]/$mapping[4]/;
@source = split /\t/, $line;
$line =~ s/$source[2]/$mapping[2]\tALLOC_$source[2]/;
push @lines2, $line;
last;
} elsif (eof(MAPPINGFILE2)) {
push @rejects, "Lignes non mappées (Account):
+"."\t".$line;
}
}
}
}
}
close MAPPINGFILE1;
close MAPPINGFILE2;
#Suppress double quote
for (@lines2) { s/"//g }
for (@rejects) { s/"//g }
#Replace empty by Missing
for (@lines2) { s/\t(?=\t)/\t#MI/g }
#Generic mapping
for (@lines2) { s/A2022Local/ACTUAL;FY22;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/A2022AjConsoLocal/ACTUAL;FY22;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/A2022InEur/ACTUAL;FY22;Working_Central;Input;Eur_ACT2022_Rate_YTD/g }
for (@lines2) { s/A2022AjConso/ACTUAL;FY22;Working_Central;Adj_Conso;Eur_ACT2022_Rate_YTD/g }
for (@lines2) { s/A2022TxB22/ACTUAL;FY22;Working_Central;Input;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/A2022TxB22AjConso/ACTUAL;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/A2021Local/ACTUAL;FY21;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/A2021AjConsoLocal/ACTUAL;FY21;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/A2021InEur/ACTUAL;FY21;Working_Central;Input;Eur_ACT2021_Rate_YTD/g }
for (@lines2) { s/A2021AjConso/ACTUAL;FY21;Working_Central;Adj_Conso;Eur_ACT2021_Rate_YTD/g }
for (@lines2) { s/A2021TxB22/ACTUAL;FY21;Working_Central;Input;Eur_BUD2021_Rate_YTD/g }
for (@lines2) { s/A2021TxB22AjConso/ACTUAL;FY21;Working_Central;Adj_Conso;Eur_BUD2021_Rate_YTD/g }
for (@lines2) { s/A2020Local/ACTUAL;FY20;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/A2020AjConsoLocal/ACTUAL;FY20;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/A2020InEur/ACTUAL;FY20;Working_Central;Input;Eur_ACT2020_Rate_YTD/g }
for (@lines2) { s/A2020AjConso/ACTUAL;FY20;Working_Central;Adj_Conso;Eur_ACT2020_Rate_YTD/g }
for (@lines2) { s/A2020TxB22/ACTUAL;FY20;Working_Central;Input;Eur_BUD2020_Rate_YTD/g }
for (@lines2) { s/A2020TxB22AjConso/ACTUAL;FY20;Working_Central;Adj_Conso;Eur_BUD2020_Rate_YTD/g }
for (@lines2) { s/B2022Local/BUDGET;FY22;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/B2022AjConsoLoc/BUDGET;FY22;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/B2022AjTBLocal/BUDGET;FY22;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/B2022InEur/BUDGET;FY22;Working_Central;Input;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/B2022AjConso/BUDGET;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/B2022AjTB/BUDGET;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD/g }
#Add header
unshift @lines2, "Scenario;Year;Audit;Version;Vision;Entity;Unit;U
+nit_Alloc;Account;Jan;Feb;Mar;Apr;May;Jun;Jul;Aug;Sep;Oct;Nov;Dec";
#Set common delimiter
for (@lines2) { s/(\t|;)/\t/g }
for (@rejects) { s/(\t|;)/\t/g }
#Mapping process (end)
my $end_time = time();
printf( "Mapping process: "."%.2f\n", $end_time - $begin_time );
#Output (start)
my $begin_time = time();
open my $handle2, ">", $files."_output.txt";
print $handle2 join( "\n", @lines2 );
close $handle2;
open my $handle3, ">", $files."_output.err";
print $handle3 join( "\n", @rejects );
close $handle3;
#Output (end)
my $end_time = time();
printf( "Output: "."%.2f\n", $end_time - $begin_time );
}
Re: Optimization tips
by hv (Prior) on Jul 21, 2022 at 10:41 UTC
On top of what hippo says, you are reading and parsing the mapping files anew for every line you process, that's an awful lot of wasted work. Similarly, you are splitting the line into @source repeatedly for each mapping record.
As a first step: separate out the reading and parsing of the mapping files into data structures, do that once, then walk through the data structures in the loop over @lines. That might look something like this:
# Mapping des charges directes
my @mapping1 = map {
    chomp;
    my @mapping = split /\t/, $_;
    # account, mapped account, mapped source
    [ $mapping[0], $mapping[2], $mapping[3] ];
} grep { !/^(\s*(#.*)?)?$/ } <MAPPINGFILE1>;

# Mapping des charges indirectes
my @mapping2 = map {
    chomp;
    my @mapping = split /\t/, $_;
    # account, mapped account, mapped source
    [ $mapping[0], $mapping[4], $mapping[2] ];
} grep { !/^(\s*(#.*)?)?$/ } <MAPPINGFILE2>;

LINE: for my $line (@lines) {
    my $source = (split /\t/, $line)[2];

    # Mapping des charges directes
    for my $mapping1 (@mapping1) {
        my ($account, $mapped_account, $mapped_source) = @$mapping1;
        # Account is matching with source
        if ($line =~ /$account/) {
            # Account substitution
            if ($mapped_account eq "") {
                $line =~ s/$account/"Compte cible non défini !"/;
            } else {
                $line =~ s/$account/$mapped_account/;
            }
            # Mapping = target Unit, Alloc_ + Unit source
            if ($mapped_source eq 'Unit source') {
                # Mapping : source Unit, Alloc_ + source Unit
                $line =~ s/$source/$source\tALLOC_$source/;
            } elsif ($mapped_source eq "") {
                $line =~ s/$source/"Unit cible non définie !"\tALLOC_$source/;
            } else {
                # Unit substitution
                $line =~ s/$source/$mapped_source\tALLOC_$source/;
            }
            push @lines2, $line;
            next LINE;
        }
    }

    # Mapping des charges indirectes
    for my $mapping2 (@mapping2) {
        my ($account, $mapped_account, $mapped_source) = @$mapping2;
        if ($line =~ /$account/) {
            $line =~ s/$account/$mapped_account/;
            $line =~ s/$source/$mapped_source\tALLOC_$source/;
            push @lines2, $line;
            next LINE;
        }
    }

    push @rejects, "Lignes non mappées (Account): \t$line";
}
However I suspect that even bigger savings are possible: for example, if the account name appears in a specific column in $line, you could probably turn the whole thing into a single hash lookup.
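For what it's worth, a rough sketch of that hash-lookup idea. It assumes the account sits in a fixed tab-separated column of each data line (the column index 8 below is a guess, as are the file and variable names), so the per-line scan over every mapping row collapses to one exact-match lookup:
use strict;
use warnings;

my (@lines, @lines2, @rejects);    # @lines holds the filtered data lines, as above

# Build the lookup once: account => [mapped account, mapped unit]
my %map1;
open my $mfh, '<', 'mapping3.txt' or die "mapping3.txt: $!";
while (<$mfh>) {
    chomp;
    next if /^(\s*(#.*)?)?$/;                  # skip blanks and comments
    my @m = split /\t/;
    $map1{ $m[0] } = [ $m[2], $m[3] ];
}
close $mfh;

for my $line (@lines) {
    my @f = split /\t/, $line;
    if ( my $hit = $map1{ $f[8] } ) {          # one O(1) lookup per line
        $f[8] = ( $hit->[0] // '' ) ne '' ? $hit->[0] : 'Compte cible non défini !';
        push @lines2, join "\t", @f;
    }
    else {
        push @rejects, "Lignes non mappées (Account):\t$line";
    }
}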
while (<MAPPINGFILE2>) {
chomp();
# Skip blank lines and comments
next if /^(\s*(#.*)?)?$/;
# Split mapping columns (tab)
@mapping = split /\t/, $_;
# Mapping is matching
if ($line =~ m/$mapping[0]/) {
$line =~ s/$mapping[0]/$mapping[4]/;
@source = split /\t/, $line;
$line =~ s/$source[2]/$mapping[2]\tALLOC_$source[2]/;
push @lines2, $line;
last;
} elsif (eof(MAPPINGFILE2)) { #correction: ###### Can happen
push @rejects, "Lignes non mappées (Account): "."\t".$line;
}
}
Re: Optimization tips
by hippo (Bishop) on Jul 21, 2022 at 09:08 UTC
Tip 1: profile. There's no real point in optimising code which already contributes negligibly to the runtime. I like the venerable Devel::NYTProf but other profilers are available.
I haven't analysed your code in any detail but this just jumps right out at me:
for (@lines2) { s/A2022Local/ACTUAL;FY22;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/A2022AjConsoLocal/ACTUAL;FY22;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/A2022InEur/ACTUAL;FY22;Working_Central;Input;Eur_ACT2022_Rate_YTD/g }
for (@lines2) { s/A2022AjConso/ACTUAL;FY22;Working_Central;Adj_Conso;Eur_ACT2022_Rate_YTD/g }
for (@lines2) { s/A2022TxB22/ACTUAL;FY22;Working_Central;Input;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/A2022TxB22AjConso/ACTUAL;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/A2021Local/ACTUAL;FY21;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/A2021AjConsoLocal/ACTUAL;FY21;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/A2021InEur/ACTUAL;FY21;Working_Central;Input;Eur_ACT2021_Rate_YTD/g }
for (@lines2) { s/A2021AjConso/ACTUAL;FY21;Working_Central;Adj_Conso;Eur_ACT2021_Rate_YTD/g }
for (@lines2) { s/A2021TxB22/ACTUAL;FY21;Working_Central;Input;Eur_BUD2021_Rate_YTD/g }
for (@lines2) { s/A2021TxB22AjConso/ACTUAL;FY21;Working_Central;Adj_Conso;Eur_BUD2021_Rate_YTD/g }
for (@lines2) { s/A2020Local/ACTUAL;FY20;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/A2020AjConsoLocal/ACTUAL;FY20;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/A2020InEur/ACTUAL;FY20;Working_Central;Input;Eur_ACT2020_Rate_YTD/g }
for (@lines2) { s/A2020AjConso/ACTUAL;FY20;Working_Central;Adj_Conso;Eur_ACT2020_Rate_YTD/g }
for (@lines2) { s/A2020TxB22/ACTUAL;FY20;Working_Central;Input;Eur_BUD2020_Rate_YTD/g }
for (@lines2) { s/A2020TxB22AjConso/ACTUAL;FY20;Working_Central;Adj_Conso;Eur_BUD2020_Rate_YTD/g }
for (@lines2) { s/B2022Local/BUDGET;FY22;Working_Central;Input;Local_YTD/g }
for (@lines2) { s/B2022AjConsoLoc/BUDGET;FY22;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/B2022AjTBLocal/BUDGET;FY22;Working_Central;Adj_Conso;Local_YTD/g }
for (@lines2) { s/B2022InEur/BUDGET;FY22;Working_Central;Input;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/B2022AjConso/BUDGET;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD/g }
for (@lines2) { s/B2022AjTB/BUDGET;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD/g }
You are looping over the same array 24 times to run 24 very similar search/replace operations. It should be much more efficient (not to mention maintainable) to combine these into one loop and ideally one search/replace operation. But if your profiling indicates that the slow part is elsewhere, concentrate on that first.
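For illustration, a sketch of what a combined version might look like: one hash holding the same pattern/replacement pairs (only a few shown below; the rest follow the same shape) and a single substitution pass over @lines2. Sorting the keys longest-first also keeps a short key such as A2022TxB22 from matching inside A2022TxB22AjConso, which the original ordering does not guard against.
# One pass over @lines2 instead of 24; %generic restates the same
# pattern => replacement pairs as the substitutions above.
my %generic = (
    A2022Local        => 'ACTUAL;FY22;Working_Central;Input;Local_YTD',
    A2022AjConsoLocal => 'ACTUAL;FY22;Working_Central;Adj_Conso;Local_YTD',
    A2022TxB22        => 'ACTUAL;FY22;Working_Central;Input;Eur_BUD2022_Rate_YTD',
    A2022TxB22AjConso => 'ACTUAL;FY22;Working_Central;Adj_Conso;Eur_BUD2022_Rate_YTD',
    # ... and the remaining pairs ...
);

# Longest keys first so a short key never matches inside a longer one.
my $alt = join '|',
          map { quotemeta }
          sort { length($b) <=> length($a) } keys %generic;
my $generic_re = qr/($alt)/;

s/$generic_re/$generic{$1}/g for @lines2;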
Re: Optimization tips
by Fletch (Bishop) on Jul 21, 2022 at 18:22 UTC
Good remarks so far but I don't think anyone has yet remarked that you're slurping your input file into an array then iterating over it. For a large multi-meg file that's going to cause a good bit of overhead in and of itself (to say nothing of it inflating your process' size which may affect performance if it then winds up causing extra paging by the OS). Unless you need the full file for context (which it doesn't appear to my (admittedly perfunctory) skimming over the code) there's no reason not to read that input file line-by-line.
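For instance, the processing loop could take roughly this shape (a sketch only; map_line() is a hypothetical stand-in for whatever per-line mapping logic ends up inside the loop):
# Stream the data file instead of slurping it: one line in, one line out.
my $file = $ARGV[0];
open my $in,  '<', $file                 or die "Cannot read $file: $!";
open my $out, '>', "${file}_output.txt"  or die "Cannot write ${file}_output.txt: $!";
while ( my $line = <$in> ) {
    chomp $line;
    next if $line =~ /(Final|ConvP&L|ConvBsheet)/;   # same exclusion filter as the original
    print {$out} map_line($line), "\n";              # hypothetical per-line mapping routine
}
close $in;
close $out;

sub map_line { my ($line) = @_; return $line }       # stand-in; the real mapping goes here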
Presumably your mapping files are going to be the smaller inputs so (as was mentioned) I'd suggest restructuring things to read those into data structures once, then work over the meat of the main input line-by-line. If you can alter things to work more from a hash lookup instead of the multiple substitutions even if the mappings are "large" you can use something like GDBM_File or DB_File to keep those out of memory and lookup from disk instead.
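A minimal sketch of that tied-hash idea, assuming DB_File is available (the database file name and the example row are made up):
use strict;
use warnings;
use DB_File;
use Fcntl;

# Tie a hash to a Berkeley DB file: lookups read from disk, not RAM.
tie my %map1, 'DB_File', 'mapping3.db', O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Cannot tie mapping3.db: $!";

$map1{'60110000'} = "61100000\tUnit source";   # made-up example row
print $map1{'60110000'}, "\n";                 # used exactly like an ordinary hash

untie %map1;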
Edit: s/it the/it then/ ; me no tipe gud tewday. Also if you're really looking to speed things up you might could use MCE::Loop to split the reading of the large file across multiple consumers. But fix the structural problems first then it'll be easier because you'll have a cleaner line-by-line processing loop to shove into the MCE bits.
The cake is a lie.
Re: Optimization tips
by jwkrahn (Monsignor) on Jul 21, 2022 at 19:38 UTC
I haven't read through all the code yet, but small tip: use tr/// instead of s/// for single character substitutions.
#Set common delimiter
for (@lines2) { s/(\t|;)/\t/g }
for (@rejects) { s/(\t|;)/\t/g }
#Set common delimiter
tr/\t;/\t/ for @lines2, @rejects;
More than that, I don't see the reason to replace a tab with a tab. That just wastes time.
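In other words, translating only the semicolons is enough and does the same job:
# Only the semicolons need translating; existing tabs can stay as they are.
tr/;/\t/ for @lines2, @rejects;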
Re: Optimization tips
by perlfan (Vicar) on Jul 22, 2022 at 11:31 UTC
foreach $files (@ARGV) {
...
foreach $line (@lines) {
#Mapping des charges directes
seek MAPPINGFILE1, 0, 0;
while (<MAPPINGFILE1>) {
...
Mentioned, but here to be clear. Nothing you do is going to help if you're nesting loops. Also, check out this talk on Supercharging Perl from YAPC this year.