Write large array to file, very slow

by junebob (Initiate)
on Aug 20, 2018 at 14:04 UTC

junebob has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have written a bit of Perl which is performing very slowly, so I'm hoping to get some advice here.

The script takes in any number of files, where every file has the format that each line starts with a 10-hexit hex count, followed by anything. The count on each line is always greater than the count on the previous line. The task is to merge all the input files into one file, in order. The input files can be quite large, 3GB or so. After a bit of googling I decided to read all the input files into arrays, put the result in a new array, and finally write out the new array to a file. Mainly because I have access to machines with lots of RAM, so I thought if it's all chucked into memory it'll be faster, and then I just dump the end result into a file.

It hasn't really worked out as I expected. The script gets to the point where the final array is complete and starts writing out to the file after about an hour or so. However, just the writing to the file is taking many hours!

Any suggestions as to how to improve my script? Thanks!

#!/bin/env perl
use strict;
use warnings;
use List::Util qw(min max);
use Math::BigInt;

my @filenames = @ARGV;

# Define empty hash. This will be a hash of all the filenames. Within the hash
# each filename points to an array containing the entire contents of the file,
# and an array of timestamps.
my %all_files = ();

# >32 hex to dec function
sub hex2dec {
    my $hex = shift;
    return Math::BigInt->from_hex("0x$hex");
}

# For each file on the command line, create a new hash entry indexed by the
# filename. Each entry is an array containing the contents of the file.
foreach my $filename (@filenames) {
    open(my $handle, "<", "$filename") or die "Failed to open file $filename: $!\n";
    while (<$handle>) {
        chomp;
        my $fullline = $_;
        if ($fullline =~ m/(\w+).*/) {
            # Store contents of line
            my $timestamp = $1;
            push @{$all_files{$filename}}, $fullline;
            push @{$all_files{"${filename}.timestamp"}}, $timestamp;
        } else {
            print "Unexpected line format: $fullline in $filename\n";
            exit;
        }
    }
    close $handle;
    $all_files{"${filename}.neof"} = 1;
}

my $neofs      = 1;
my @minarray   = ();
my $min        = 0;
my $storeline  = "";
my @mergedlogs = ();
my $matchmin   = 0;
my $line       = 0;

while ($neofs == 1) {
    print "$line\n";
    $line++;
    $neofs = 0;
    # First find the lowest count
    foreach my $filename (@filenames) {
        print "@{$all_files{\"${filename}.timestamp\"}}[0]\n";
        my $tmpdec = hex2dec(@{$all_files{"${filename}.timestamp"}}[0]);
        print "$tmpdec\n";
        push @minarray, hex2dec(@{$all_files{"${filename}.timestamp"}}[0]);
    }
    $min      = min @minarray;
    @minarray = ();
    # For each file matching the lowest count, shift out the current line
    foreach my $filename (@filenames) {
        print "$filename $min";
        $matchmin = 0;
        if (hex2dec(@{$all_files{"${filename}.timestamp"}}[0]) == $min
            && $all_files{"${filename}.neof"} == 1) {
            $matchmin  = 1;
            $storeline = shift @{$all_files{$filename}};
            shift @{$all_files{"${filename}.timestamp"}};
            # Check if array is empty (i.e. file completed)
            if (!@{$all_files{$filename}}) {
                # If so, set not end of file to 0
                $all_files{"${filename}.neof"} = 0;
                # Force count value to max so that it loses all future min battles
                push @{$all_files{"${filename}.timestamp"}}, "10000000000";
            }
            # Push the line to the merged file.
            push @mergedlogs, "$storeline $filename";
        }
        $neofs = $neofs || $all_files{"${filename}.neof"};
    }
}

unlink "mergedlogs.txt";
foreach (@mergedlogs) {
    open FH, ">>mergedlogs.txt" or die "can't open mergedlogs.txt: $!";
    print FH "$_\n";
    close FH;
}

Replies are listed 'Best First'.
Re: Write large array to file, very slow
by Corion (Patriarch) on Aug 20, 2018 at 14:15 UTC

    You should first find out which part is slow.

    You are opening and closing the output file for every line, instead of opening it once and then writing all your data to it. This is usually very slow.

    If the reading of all input files is slow, then you can't get any faster.

    If the searching/merge sort is slow, there are some things you can do to speed it up. For example, you are repeatedly calling hex2dec, and maybe Memoize'ing that function speeds up things. But while we are optimizing, are you sure that you need to convert the timestamps to numbers before you can compare them? "0xAB00" gt "0x1234" is true. So maybe you can skip the conversion of the timestamps entirely.
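
    A rough sketch of both ideas (hedged: hex2dec() is the sub from the question, the @timestamps data below is made up, and the string comparison only works because the keys are fixed-width 10-hexit strings with consistent case):

    use strict;
    use warnings;
    use Memoize;
    use Math::BigInt;
    use List::Util qw(minstr);

    # Same conversion sub as in the original script; memoize() caches the
    # result for each distinct timestamp, so repeated conversions are free.
    sub hex2dec {
        my $hex = shift;
        return Math::BigInt->from_hex("0x$hex");
    }
    memoize('hex2dec');

    # Fixed-width keys order the same way as strings as they do as numbers,
    # so the conversion can be dropped entirely:
    my @timestamps = qw(00000000ff 00000a0000 0000000001);   # made-up keys
    my $min = minstr @timestamps;                            # "0000000001"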

    Also, what you will be implementing is the output phase of any merge sort, so look at example implementations of those for reference. I would for example remove all files from @filenames and all arrayrefs from %all_files that are already empty instead of adding a placeholder entry that needs to be re-checked on every loop.

    Also, instead of accumulating the output and then writing it, it might be faster if you write the output as you have it at hand, as that could give the operating system some time to write the data to disk before you hand it new data.
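
    Putting those pieces together, a hedged sketch of that output phase (not your script, just an illustration): it keeps one buffered line per input file, picks the smallest fixed-width key by string comparison, and writes each merged line straight to mergedlogs.txt, tagging it with the source file name as your code does.

    use strict;
    use warnings;

    my @filenames = @ARGV;
    my (%fh, %current);   # one open handle and one buffered line per file

    for my $f (@filenames) {
        open $fh{$f}, '<', $f or die "Can't open $f: $!";
        my $first = readline $fh{$f};
        if (defined $first) { $current{$f} = $first }
        else                { close $fh{$f} }          # empty input file
    }

    open my $out, '>', 'mergedlogs.txt'
        or die "Can't open mergedlogs.txt: $!";

    while (%current) {
        # The file whose buffered line carries the smallest timestamp;
        # the 10-hexit keys are fixed width, so 'cmp' is sufficient.
        my ($best) = sort {
            substr($current{$a}, 0, 10) cmp substr($current{$b}, 0, 10)
        } keys %current;

        chomp(my $line = $current{$best});
        print {$out} "$line $best\n";

        # Refill that file's buffer, or retire the file at EOF.
        my $next = readline $fh{$best};
        if (defined $next) {
            $current{$best} = $next;
        }
        else {
            delete $current{$best};
            close $fh{$best};
        }
    }
    close $out;

    Ties between files come out in whatever order sort leaves them rather than in @ARGV order, but every line still lands in timestamp order, and only one line per file is ever held in memory.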

      Thanks for all the replies and suggestions! After I posted, it did occur to me to write directly to the file instead of using the @mergedlogs intermediate variable, and that gave a huge speed improvement. I ran both versions, and the original is still going after 16 hrs, whereas the improved version finished in about 30 minutes. I'll investigate the other suggestions too.
Re: Write large array to file, very slow
by hippo (Bishop) on Aug 20, 2018 at 14:17 UTC
    foreach (@mergedlogs) {
        open FH, ">>mergedlogs.txt" or die "can't open mergedlogs.txt: $!";
        print FH "$_\n";
        close FH
    }

    You are opening and closing the file on every single record. Don't do that. Instead:

>mergedlogs.txt"">
    open FH, ">>mergedlogs.txt" or die "can't open mergedlogs.txt: $!";
    $| = 0;   # just in case
    foreach (@mergedlogs) {
        print FH "$_\n";
    }
    close FH;

    There are other ways this could be improved, but this should get you a large gain for little effort.

      The ">>" mode was used precisely because the file is constantly being reopened. But with your ++proposition, the preceding unlink can be removed, to let ">" overwrite the file instead.

      Besides, the 3-arg version of open with a lexical filehandle can be used for many reasons (elegance, safety, ...), but at the very least for consistency with the way the input files are opened.

      my $output_file = "mergedlogs.txt";
      open my $output_fh, ">", $output_file or die "Can't open $output_file: $!";
      {
          local $| = 0;
          local $\ = "\n";   # Automatically append \n
          foreach (@mergedlogs) {
              print $output_fh $_;   # "$_\n" copies $_ into a new string before appending \n
          }
      }
      close $output_fh;
      Although Corion's proposition to write the result straight to the file, without the @mergedlogs intermediate variable, is probably a good idea as well.

        Just ran a quick bench and your suggestion to avoid the copy (++) works very well - it's about twice as fast as my code above. As another test I also tried local $, = "\n"; print $output_fh @mergedlogs; but that's no faster to within statistical noise. I'll run a longer bench later just to see if it's at all significant.

Re: Write large array to file, very slow
by QM (Parson) on Aug 20, 2018 at 14:51 UTC
    You should also be able to do a merge sort, holding only the most recently read line of each file in memory.

    A quick search found File::MergeSort, which seems to be what you want. It can sort lines based on a key you specify:

    Merge keys are extracted from the input lines using a user defined subroutine. Comparisons on the keys are done lexicographically.

    While it sorts these keys lexicographically, there is room to munge the keyspace to do as you need. If you want your own sort routine, you can roll your own easily enough.
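
    For instance, a minimal usage sketch based on the module's documented synopsis (the key-extraction sub assumes the fixed 10-hexit prefix described in the question, and the output file name is just borrowed from the original script):

    use strict;
    use warnings;
    use File::MergeSort;

    my $sorter = File::MergeSort->new(
        [ @ARGV ],                         # input files, each already in key order
        sub { substr( $_[0], 0, 10 ) },    # extract the 10-hexit merge key
    );

    $sorter->dump('mergedlogs.txt');       # write the merged result to disk

    dump() writes the merged lines to the named file; the module also documents a next_line() method if you want to post-process each line yourself.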

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Write large array to file, very slow
by anonymized user 468275 (Curate) on Aug 20, 2018 at 15:01 UTC
    Your excessive use of RAM seems to be only because you need to process all the files before writing the first line of output. I'd be more inclined to create an empty file and open it in read-write mode (+< not >>), write the old 10 hexits of the first file just as a placeholder, process each file, skipping subsequent headers, and when finished sysseek back to the beginning, overwrite with the new 10 hexits, and close up. But you have to limit the functions on the read-write filehandle to sysseek and syswrite, e.g. (write a load of As, rewind and overwrite 10 Bs at the beginning):
    use strict;
    use warnings;
    use Fcntl 'SEEK_SET';

    system "touch myfile";
    open my $fh, '+<', 'myfile';
    syswrite $fh, 'A' x 100000, 100000;
    sysseek $fh, 0, SEEK_SET;
    syswrite $fh, 'B' x 10, 10, 0;
    close $fh;

    (updated)

    One world, one people
