Write large array to file, very slow

by junebob (Initiate)
on Aug 20, 2018 at 14:04 UTC

junebob has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have written a bit of Perl which is performing very slowly, so I'm hoping to get some advice here.

The script takes in any number of files, where every file has the format that each line starts with a 10-hexit hex count, followed by anything. The count on each line is always greater than the count on the previous line. The task is to merge all the input files into one file, in order. The input files can be quite large, 3GB or so. After a bit of googling I decided to read all the input files into arrays, put the result in a new array, and finally write out the new array to a file. Mainly because I have access to machines with lots of RAM, so I thought if it's all chucked into memory it'll be faster, and then I just dump the end result into a file.

It hasn't really worked out as I expected. The script gets to the point where the final array is complete and starts writing out to the file after about an hour or so. However, just the writing to the file is taking many hours!

Any suggestions as to how to improve my script? Thanks!

#!/bin/env perl
use strict;
use warnings;
use List::Util qw(min max);
use Math::BigInt;

my @filenames = @ARGV;

# Define empty hash. This will be a hash of all the filenames. Within the hash
# each filename points to an array containing the entire contents of the file,
# and an array of timestamps.
my %all_files = ();

# >32 hex to dec function
sub hex2dec {
    my $hex = shift;
    return Math::BigInt->from_hex("0x$hex");
}

# For each file on the command line, create a new hash entry indexed by the
# filename. Each entry is an array containing the contents of the file.
foreach my $filename (@filenames) {
    open(my $handle, "<", "$filename") or die "Failed to open file $filename: $!\n";
    while (<$handle>) {
        chomp;
        my $fullline = $_;
        if ($fullline =~ m/(\w+).*/) {
            # Store contents of line
            my $timestamp = $1;
            push @{$all_files{$filename}}, $fullline;
            push @{$all_files{"${filename}.timestamp"}}, $timestamp;
        } else {
            print "Unexpected line format: $fullline in $filename\n";
            exit;
        }
    }
    close $handle;
    $all_files{"${filename}.neof"} = 1;
}

my $neofs      = 1;
my @minarray   = ();
my $min        = 0;
my $storeline  = "";
my @mergedlogs = ();
my $matchmin   = 0;
my $line       = 0;

while ($neofs == 1) {
    print "$line\n";
    $line++;
    $neofs = 0;
    # First find the lowest count
    foreach my $filename (@filenames) {
        print "@{$all_files{\"${filename}.timestamp\"}}[0]\n";
        my $tmpdec = hex2dec(@{$all_files{"${filename}.timestamp"}}[0]);
        print "$tmpdec\n";
        push @minarray, hex2dec(@{$all_files{"${filename}.timestamp"}}[0]);
    }
    $min      = min @minarray;
    @minarray = ();
    # For each file matching the lowest count, shift out the current line
    foreach my $filename (@filenames) {
        print "$filename $min";
        $matchmin = 0;
        if (hex2dec(@{$all_files{"${filename}.timestamp"}}[0]) == $min
            && $all_files{"${filename}.neof"} == 1) {
            $matchmin  = 1;
            $storeline = shift @{$all_files{$filename}};
            shift @{$all_files{"${filename}.timestamp"}};
            # Check if array is empty (i.e. file completed)
            if (!@{$all_files{$filename}}) {
                # If so, set not end of file to 0
                $all_files{"${filename}.neof"} = 0;
                # Force count value to max so that it loses all future min battles
                push @{$all_files{"${filename}.timestamp"}}, "10000000000";
            }
            # Push the line to the merged file.
            push @mergedlogs, "$storeline $filename";
        }
        $neofs = $neofs || $all_files{"${filename}.neof"};
    }
}

unlink "mergedlogs.txt";
foreach (@mergedlogs) {
    open FH, ">>mergedlogs.txt" or die "can't open mergedlogs.txt: $!";
    print FH "$_\n";
    close FH;
}

Replies are listed 'Best First'.
Re: Write large array to file, very slow
by Corion (Patriarch) on Aug 20, 2018 at 14:15 UTC

    You should first find out which part is slow.

    You are opening and closing the output file for every line, instead of opening it once and then writing all your data to it. This is usually very slow.

    If the reading of all input files is slow, then you can't get any faster.

    If the searching/merge sort is slow, there are some things you can do to speed it up. For example, you are repeatedly calling hex2dec, and maybe Memoize'ing that function speeds up things. But while we are optimizing, are you sure that you need to convert the timestamps to numbers before you can compare them? "0xAB00" gt "0x1234" is true. So maybe you can skip the conversion of the timestamps entirely.
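
    A rough sketch of both ideas (hedged: hex2dec() is the sub from the question, the @timestamps data below is made up, and the string comparison only works because the keys are fixed-width 10-hexit strings with consistent case):

    use strict;
    use warnings;
    use Memoize;
    use Math::BigInt;
    use List::Util qw(minstr);

    # Same conversion sub as in the original script; memoize() caches the
    # result for each distinct timestamp, so repeated conversions are free.
    sub hex2dec {
        my $hex = shift;
        return Math::BigInt->from_hex("0x$hex");
    }
    memoize('hex2dec');

    # Fixed-width keys order the same way as strings as they do as numbers,
    # so the conversion can be dropped entirely:
    my @timestamps = qw(00000000ff 00000a0000 0000000001);   # made-up keys
    my $min = minstr @timestamps;                            # "0000000001"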

    Also, what you will be implementing is the output phase of any merge sort, so look at example implementations of those for reference. I would for example remove all files from @filenames and all arrayrefs from %all_files that are already empty instead of adding a placeholder entry that needs to be re-checked on every loop.

    Also, instead of accumulating the output and then writing it, it might be faster if you write the output as you have it at hand, as that could give the operating system some time to write the data to disk before you hand it new data.
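
    Putting those pieces together, a hedged sketch of that output phase (not your script, just an illustration): it keeps one buffered line per input file, picks the smallest fixed-width key by string comparison, and writes each merged line straight to mergedlogs.txt, tagging it with the source file name as your code does.

    use strict;
    use warnings;

    my @filenames = @ARGV;
    my (%fh, %current);   # one open handle and one buffered line per file

    for my $f (@filenames) {
        open $fh{$f}, '<', $f or die "Can't open $f: $!";
        my $first = readline $fh{$f};
        if (defined $first) { $current{$f} = $first }
        else                { close $fh{$f} }          # empty input file
    }

    open my $out, '>', 'mergedlogs.txt'
        or die "Can't open mergedlogs.txt: $!";

    while (%current) {
        # The file whose buffered line carries the smallest timestamp;
        # the 10-hexit keys are fixed width, so 'cmp' is sufficient.
        my ($best) = sort {
            substr($current{$a}, 0, 10) cmp substr($current{$b}, 0, 10)
        } keys %current;

        chomp(my $line = $current{$best});
        print {$out} "$line $best\n";

        # Refill that file's buffer, or retire the file at EOF.
        my $next = readline $fh{$best};
        if (defined $next) {
            $current{$best} = $next;
        }
        else {
            delete $current{$best};
            close $fh{$best};
        }
    }
    close $out;

    Ties between files come out in whatever order sort leaves them rather than in @ARGV order, but every line still lands in timestamp order, and only one line per file is ever held in memory.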

      Thanks for all the replies and suggestions! After I posted, it did occur to me to write directly to the file instead of using the @mergedlogs intermediate variable, and that gave a huge speed improvement. I ran both versions, and the original is still going after 16 hrs, whereas the improved version finished in about 30 minutes. I'll investigate the other suggestions too.
Re: Write large array to file, very slow
by hippo (Bishop) on Aug 20, 2018 at 14:17 UTC
    foreach (@mergedlogs) {
        open FH, ">>mergedlogs.txt" or die "can't open mergedlogs.txt: $!";
        print FH "$_\n";
        close FH
    }

    You are opening and closing the file on every single record. Don't do that. Instead:

>mergedlogs.txt"">
    open FH, ">>mergedlogs.txt" or die "can't open mergedlogs.txt: $!";
    $| = 0;   # just in case
    foreach (@mergedlogs) {
        print FH "$_\n";
    }
    close FH;

    There are other ways this could be improved, but this should get you a large gain for little effort.

      The ">>" mode was used precisely because the file is constantly being reopened. But with your ++proposition, the preceding unlink can be removed, to let ">" overwrite the file instead.

      Besides, the 3-arg version of open with a lexical filehandle can be used for many reasons (elegance, safety, ...), but at the very least for consistency with the way the input files are opened.

      my $output_file = "mergedlogs.txt";
      open my $output_fh, ">", $output_file or die "Can't open $output_file: $!";
      {
          local $| = 0;
          local $\ = "\n";   # Automatically append \n
          foreach (@mergedlogs) {
              print $output_fh $_;   # "$_\n" copies $_ into a new string before appending \n
          }
      }
      close $output_fh;
      Although Corion's proposition to write the result straight to the file, without the @mergedlogs intermediate variable, is probably a good idea as well.

        Just ran a quick bench and your suggestion to avoid the copy (++) works very well - it's about twice as fast as my code above. As another test I also tried local $, = "\n"; print $output_fh @mergedlogs; but that's no faster to within statistical noise. I'll run a longer bench later just to see if it's at all significant.

Re: Write large array to file, very slow
by QM (Parson) on Aug 20, 2018 at 14:51 UTC
    You should also be able to do a merge sort, holding only the most recently read line of each file in memory.

    A quick search found File::MergeSort, which seems to be what you want. It can sort lines based on a key you specify:

    Merge keys are extracted from the input lines using a user defined subroutine. Comparisons on the keys are done lexicographically.

    While it sorts these keys lexicographically, there is room to munge the keyspace to do as you need. If you want your own sort routine, you can roll your own easily enough.
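
    For instance, a minimal usage sketch based on the module's documented synopsis (the key-extraction sub assumes the fixed 10-hexit prefix described in the question, and the output file name is just borrowed from the original script):

    use strict;
    use warnings;
    use File::MergeSort;

    my $sorter = File::MergeSort->new(
        [ @ARGV ],                         # input files, each already in key order
        sub { substr( $_[0], 0, 10 ) },    # extract the 10-hexit merge key
    );

    $sorter->dump('mergedlogs.txt');       # write the merged result to disk

    dump() writes the merged lines to the named file; the module also documents a next_line() method if you want to post-process each line yourself.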

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Write large array to file, very slow
by anonymized user 468275 (Curate) on Aug 20, 2018 at 15:01 UTC
    Your excessive use of RAM seems to be only because you need to process all the files before writing the first line of output. I'd be more inclined to create an empty file and open it in read-write mode (+< not >>), write the old 10 hexits of the first file just as a placeholder, process each file, skipping subsequent headers, and when finished sysseek back to the beginning, overwrite with the new 10 hexits, and close up. But you have to limit the functions on the read-write filehandle to sysseek and syswrite, e.g. (write a load of As, rewind and overwrite 10 Bs at the beginning):
    use strict;
    use warnings;
    use Fcntl 'SEEK_SET';

    system "touch myfile";
    open my $fh, '+<', 'myfile';
    syswrite $fh, 'A' x 100000, 100000;
    sysseek $fh, 0, SEEK_SET;
    syswrite $fh, 'B' x 10, 10, 0;
    close $fh;

    (updated)

    One world, one people
