
?? Blazes under Linux; crawls under XP & Win2K...

by WordWeaver (Acolyte)
on Jan 09, 2006 at 08:40 UTC ( #521873=perlquestion )
WordWeaver has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to figure out why a script that I've written runs in half a minute under Linux, but takes a half-hour or more under Win32.

The script loops through a text file containing ~800K variable-length, pipe-delimited records. The first byte of each line identifies record type.

The script copies each line of the source file to one of several destination files, depending on that line's record type ID.

In a Linux environment, the script processes the text file in ~30 seconds. I've tried the identical script under XP Home and Win2K. In both environments, running ActivePerl, the same script takes 30 to 40 minutes. The XP install is on a dual-boot machine with the Linux environment. I tried running the script under Cygwin on the dual-boot machine, but it didn't run any faster.

The obvious solution, for me, is to run the script only under Linux. But it could be useful to colleagues who only have Win32.

I'd be grateful for any tips on writing something that runs faster under Win32.

Code snippet:

#!/usr/bin/perl -w
use strict;
$|++;

open REPORTS, '/home/user/data.txt';
my $record = <REPORTS>;
while ($record = <REPORTS>) {
    if (substr($record, 0, 1) eq '1') {
        open(fileOUT, ">>form_1_record.txt");
        print fileOUT $record;
        close fileOUT;
    }
    ...
}

The script contains similar if statements for each record type.

Edit: g0n - replaced pre tags with code tags

Replies are listed 'Best First'.
Re: ?? Blazes under Linux; crawls under XP & Win2K...
by Corion (Pope) on Jan 09, 2006 at 08:48 UTC

    Don't open and close the output files for every line. Open them at the start of the run, and close them at the end of your program (optional). Also, error checking is always a good idea:

    my $reportfile = '/home/user/data.txt';
    open my $reports, "<", $reportfile
        or die "Could not read '$reportfile': $!";

    my @knowntypes = qw(1 2 3 4 5);
    my %filehandle;
    for my $type (@knowntypes) {
        open $filehandle{$type}, ">>", "form_${type}_record.txt"
            or die "Couldn't create 'form_${type}_record.txt': $!";
    }

    # while, not for: read one line at a time instead of slurping the file
    while (<$reports>) {
        /^(\d+)\|/ or die "Malformed input line: $_";
        my $type = $1;
        if (not exists $filehandle{$type}) {
            die "Unknown record type $type in line $_";
        }
        # a filehandle stored in a hash element must be wrapped in a block
        print { $filehandle{$type} } $_;
    }

    If you hit the file descriptor limit of your OS (255 on Win32, some other number on Linux), there is a module for that, FileCache. The Perl Cookbook has a recipe detailing its use (7.17).

    Update: ChemBoy spotted a typo. "form_${type}_record.txt" instead of "form_$type_record.txt". That's why we use strict;.

      You could extend not opening/closing so that the file is only opened if not already open. That way you don't have to know types before the run.
      ...
      while (<more data>) {
          if (file not already open) {
              open file or handle error
          }
          write data to relevant output;
      }
      close outputs;
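
The lazy-open idea in the pseudocode above could be written in Perl roughly as follows. This is a minimal sketch, not the poster's code: the helper name `split_by_type`, the `%handle_for` hash, and the output-directory argument are assumptions made for illustration. Only the `form_N_record.txt` naming and the "first byte is the record type" rule come from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: copy each line of $source into
# "$out_dir/form_<type>_record.txt", opening each output file
# only the first time its record type is seen.
sub split_by_type {
    my ($source, $out_dir) = @_;
    my %handle_for;    # record type => open filehandle

    open my $in, '<', $source or die "Could not read '$source': $!";
    while (my $line = <$in>) {
        my $type = substr $line, 0, 1;    # first byte identifies the record type
        die "Malformed input line $.: $line" unless $type =~ /^\w$/;

        unless ($handle_for{$type}) {
            my $path = "$out_dir/form_${type}_record.txt";
            open $handle_for{$type}, '>>', $path
                or die "Could not append to '$path': $!";
        }
        print { $handle_for{$type} } $line;
    }
    close $in;

    # Close every output exactly once, at the end of the run.
    my @types = sort keys %handle_for;
    for my $type (@types) {
        close $handle_for{$type}
            or die "Could not close output for type $type: $!";
    }
    return @types;
}
```

Calling `split_by_type('/home/user/data.txt', '.')` would then write the per-type files into the current directory and return the list of record types it encountered.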

      print can also be an expensive operation. Instead of printing inside the loop, append each line to a variable and print the whole string to the file once the loop has finished.

      my $output = '';
      while (my $line = <report>) {
          $output .= $line;
      }
      print FILE $output;

      Opening and closing the output files only once per run solves the problem!

      There's much for me to learn from all the code examples offered in this thread. Even in my halting, pidgin Perl code, though, opening and closing output files only once per run made a world of difference. I'm now getting run times under Win32 comparable to what I was getting under Linux.

      Not directly a Perl question, but I'm curious why Linux handled my badly written code so much more effectively than did XP.

Re: ?? Blazes under Linux; crawls under XP & Win2K...
by ikegami (Pope) on Jan 09, 2006 at 09:23 UTC
    Other performance tips:
    • Disable the anti-virus while the script is running.

    • Save the reports to local drives (C:, D:, etc), not network/mapped drives. You can always copy them to network drives at the end.

    We used to compile our (C++) programs on the network drive to get a permanent (backed-up) copy of the intermediary (object, etc) files, but we found it was just too slow.

Re: ?? Blazes under Linux; crawls under XP & Win2K...
by sh1tn (Priest) on Jan 09, 2006 at 11:29 UTC
    In addition: your code was tested under WinXP (600 MHz CPU) with a 1,919,147-byte file (80,000 lines). The script took 111 seconds. Obviously it is not a Perl code problem.

      I ran some similar tests with the code snippet, varying the number of lines and the line lengths.

      In the worst case (3MB, 1M lines "x\n") it took well over 10 minutes to run through on the local hard disk.

      As far as I could tell, the runtime of the code was determined (roughly linearly) by the number of lines in the data file. (I'd attribute this to the devastating effect of the continuous open/close operations.)

      Also, working on a network drive (as opposed to the local HD) increased the time needed by an order of magnitude: 9923 secs over LAN vs. 723 secs local for my 1M-line data file, on a WinXP machine slightly slower than sh1tn's.
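
For anyone who wants to repeat this kind of measurement, a timing harness might look like the sketch below. It is an illustration under stated assumptions, not the code used for the numbers above: the function names `write_reopening`, `write_once`, and `timed` and the sample record text are invented here. It compares reopening the output file for every line with opening it once.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Append $count short lines, reopening the file for every line
# (the pattern used in the original question).
sub write_reopening {
    my ($path, $count) = @_;
    for my $i (1 .. $count) {
        open my $out, '>>', $path or die "Could not append to '$path': $!";
        print {$out} "1|record $i\n";
        close $out or die "Could not close '$path': $!";
    }
    return;
}

# Append the same lines through a single handle that is opened once.
sub write_once {
    my ($path, $count) = @_;
    open my $out, '>>', $path or die "Could not append to '$path': $!";
    for my $i (1 .. $count) {
        print {$out} "1|record $i\n";
    }
    close $out or die "Could not close '$path': $!";
    return;
}

# Return the elapsed wall-clock seconds for a code reference.
sub timed {
    my ($code) = @_;
    my $start = time;
    $code->();
    return time - $start;
}
```

A run such as `timed(sub { write_reopening('reopen.txt', 80_000) })` followed by `timed(sub { write_once('once.txt', 80_000) })` gives two figures to compare. The absolute values depend heavily on the machine, the drive (local or network), and any on-access virus scanner.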

Re: ?? Blazes under Linux; crawls under XP & Win2K...
by wfsp (Abbot) on Jan 09, 2006 at 08:49 UTC

    Perhaps opening and closing all the files outside the loop may improve performance. Also, not using the flush ($|++) may help too.

    update: corrected

      Disabling the flush in the above program won't make much of a difference. It's just a matter of the write(2) happening after the print (with buffering disabled), or before the close (with buffering enabled). But since the only print happens before the close, it's not going to make much of a difference. In either case, there will be the same number of write(2)s.
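
The point about when the write happens can be observed with a small experiment. The sketch below is an illustration only: the helper name `sizes_around_close` and the sample record are assumptions. It writes one short line and checks the on-disk file size right after the print and again after the close, with and without autoflush on that handle. Note that `$|` applies only to the currently selected output handle (STDOUT by default), so the sketch uses the `autoflush` method from IO::Handle to target the file handle itself.

```perl
use strict;
use warnings;
use IO::Handle;    # provides the autoflush method on lexical filehandles

# Write one short line to $path and report the on-disk size just after
# the print and again after the close. A true $autoflush turns off
# buffering for this handle only.
sub sizes_around_close {
    my ($path, $autoflush) = @_;
    open my $out, '>', $path or die "Could not write '$path': $!";
    $out->autoflush(1) if $autoflush;
    print {$out} "1|example record\n";
    my $after_print = (-s $path) || 0;    # 0 while the line is still buffered
    close $out or die "Could not close '$path': $!";
    my $after_close = (-s $path) || 0;
    return ($after_print, $after_close);
}
```

With buffering left on, the first size should still be zero and the data appears at close. With autoflush, both sizes should match. Either way the line is written once per open/close cycle, which is the point made above.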
      Perl --((8:>*
Re: ?? Blazes under Linux; crawls under XP & Win2K...
by elwarren (Curate) on Jan 09, 2006 at 21:18 UTC
    I bet it's your antivirus program. It scans the file every time a file open is intercepted. I ran into this problem running a database on a Windows-based server, and dramatically increased server performance by disabling it.
Re: ?? Blazes under Linux; crawls under XP & Win2K...
by markwx (Acolyte) on Jan 10, 2006 at 01:55 UTC
    In addition to advice regarding only opening the output files once, an extra speed increase could be gained by using unpack instead of substr:
    use strict;
    use IO::File;
    $|++;

    my $sourcefile = 'data.txt';
    my @possibletargets = qw(1 2 3 4 5);
    my %tgthandles = ();

    foreach my $target (@possibletargets) {
        my $fname = 'form_' . $target . '_record.txt';
        $tgthandles{$target} = IO::File->new(">>$fname");
    }

    open(REPORTS, $sourcefile);
    while (<REPORTS>) {
        my $line = $_;
        my $id = unpack("A1A*", $line);
        my $fh = $tgthandles{$id};
        print $fh $line;
    }
    close(REPORTS);

    foreach my $key (keys(%tgthandles)) {
        $tgthandles{$key}->close;
    }
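
Whether unpack really beats substr for pulling out one leading byte is best measured rather than assumed. The sketch below uses the core Benchmark module to compare the two. The function names `type_by_substr`, `type_by_unpack`, and `compare_extractors` are made up for this illustration, and no result is claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Record type via substr: the first character of the line.
sub type_by_substr {
    return substr $_[0], 0, 1;
}

# Record type via unpack: 'A1' takes one character
# (note that 'A' strips a trailing space or NUL, so a space yields '').
sub type_by_unpack {
    return scalar unpack 'A1', $_[0];
}

# Run each extractor for at least $cpu_seconds of CPU time
# and print a comparison table to STDOUT.
sub compare_extractors {
    my ($line, $cpu_seconds) = @_;
    cmpthese(-$cpu_seconds, {
        substr => sub { my $id = type_by_substr($line) },
        unpack => sub { my $id = type_by_unpack($line) },
    });
    return;
}
```

For example, `compare_extractors("1|some|pipe|delimited|record\n", 2)` prints a rate table for the two approaches. Given the timings reported elsewhere in this thread, any difference is likely to be small next to the cost of reopening files.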
Re: ?? Blazes under Linux; crawls under XP & Win2K...
by radiantmatrix (Parson) on Jan 09, 2006 at 22:05 UTC

    Windows and UNIX have different style line endings. If you run this under Win32 but use a file with UNIX-style endings, the whole file will be slurped instead of processed line by line. Try this before opening your file:

    use Getopt::Long;
    my $line_ending = "\012";    # unix style default
    GetOptions( 'windows' => sub { $line_ending = "\015\012" } );
    local $/ = $line_ending;

    This will cause your script to assume UNIX line endings, but if you pass the parameter --windows to the command line, it will then assume the file has Windows-style line endings.

    You could autodetect this, too, but it involves opening the file and reading until you find one of those pairs, then resetting the filehandle. I know I've seen code for that on PerlMonks...
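
A rough sketch of that autodetection follows. It is not the PerlMonks code alluded to above: the helper name `detect_line_ending` and the 4096-byte sample size are assumptions chosen for illustration. It inspects the start of an already-open handle in binmode, looks for a CR LF pair, and rewinds the handle.

```perl
use strict;
use warnings;

# Hypothetical helper: sample the start of an open filehandle, guess
# the line ending, then rewind so reading can begin again from the top.
# Falls back to "\012" when no CR LF pair appears in the sample.
sub detect_line_ending {
    my ($fh) = @_;
    binmode $fh;    # look at raw bytes, with no CRLF translation
    my $got = read $fh, my $chunk, 4096;
    die "Could not read sample: $!" unless defined $got;
    seek $fh, 0, 0 or die "Could not rewind: $!";
    return $chunk =~ /\015\012/ ? "\015\012" : "\012";
}
```

It could be used as `local $/ = detect_line_ending($reports);` right after opening the input. The handle stays in binmode afterwards, so the detected separator matches what the read loop actually sees. A first line longer than the sample would defeat the guess.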

    A collection of thoughts and links from the minds of geeks
    The Code that can be seen is not the true Code
    "In any sufficiently large group of people, most are idiots" - Kaa's Law
      That's only true if the input file is open in binmode. If you don't call binmode, they have the same line endings.

Node Type: perlquestion [id://521873]
Approved by Corion
Front-paged by jbrugger