PerlMonks  

Fast reading and processing from a text file - Perl vs. FORTRAN

by ozgurp (Beadle)
on May 24, 2003 at 04:09 UTC ( #260529=perlquestion )
ozgurp has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I am an aeronautical engineer (structural) and I use a finite element analysis software called PATRAN/NASTRAN. After every analysis this software generates result files (text) with the ".f06" extension. These files can be very large and numerous, depending on the model that is being analysed.

When this software is used to analyse composite elements, it creates results for FAILURE INDICES, and prints results for each element and each ply.

I was using a Perl program that opened these files (about 280 MB each), processed them, and extracted the information I wanted in about an hour. In total the files were about 1.6 GB. Then someone found a FORTRAN program that does the same thing in about ten minutes.

I just can not understand what it is doing so differently that it does the job much faster than my perl program does.

My code is something like this:
    open(FH, $file) or die "Cannot open $file: $!";
    while (<FH>) {
        # extract info, print
    }
    close FH;
and here is an example section from one of the .f06 files:
11150505 STRAIN 1 1.0088 -12
    0.0369
1    MARCH 10, 2003 MSC.NASTRAN 5/15/00 PAGE 8077
2011-ULTIMATE,_UP_BM,_-55,_MINP
0    SUBCOM 2011
F A I L U R E I N D I C E S F O R L A Y E R E D C O M P O S I T E E L E M E N T S ( Q U A D 4 )
ELEMENT FAILURE PLY FP=FAILURE INDEX FOR PLY FB=FAILURE INDEX FOR BONDING FAILURE INDEX FOR ELEMENT FLAG
ID THEORY ID (DIRECT STRESSES/STRAINS) (INTER-LAMINAR STRESSES) MAX OF FP,FB FOR ALL PLIES
2 0.0000 -2
    0.0630
3 0.9486 -12
    0.0914
4 0.0000 -2
    0.1108
5 0.0172 -1
    0.1108
6 0.0000 -2
    0.0914
7 0.5449 -12
    0.0630
8 0.0000 -2
    0.0369
9 0.4847 -12
    1.0088 ***
11150506 STRAIN 1 0.4050 -12
    0.0339
2 0.0000 -2
    0.0577
3 0.4086 -12
    0.0831
4 0.4104 -12
    0.1043
5 0.0000 -12
    0.1179
6 0.4140 -12
    0.1306
7 0.0000 -12
    0.1374
8 0.0000 -12
    0.1408
9 0.4195 -12
    0.1408
10 0.0000 -12
    0.1374
11 0.0000 -12
    0.1306
12 0.4249 -12
    0.1179
13 0.0000 -12
    0.1043
14 0.4285 -12
    0.0831
15 0.4303 -12
    0.0577
16 0.0000 -12
    0.0339
17 0.4340 -12
    0.4340
1    MARCH 10, 2003 MSC.NASTRAN 5/15/00 PAGE 8078
2011-ULTIMATE,_UP_BM,_-55,_MINP
0    SUBCOM 2011
F A I L U R E I N D I C E S F O R L A Y E R E D C O M P O S I T E E L E M E N T S ( Q U A D 4 )
ELEMENT FAILURE PLY FP=FAILURE INDEX FOR PLY FB=FAILURE INDEX FOR BONDING FAILURE INDEX FOR ELEMENT FLAG
ID THEORY ID (DIRECT STRESSES/STRAINS) (INTER-LAMINAR STRESSES) MAX OF FP,FB FOR ALL PLIES
11150507 STRAIN 1 0.2949 -12
    0.0455
2 0.0001 -12
    0.0792
3 0.3250 -12
    0.1133
4 0.3400 -12
    0.1418
5 0.0000 -12
    0.1610
6 0.3701 -12
    0.1781
7 0.0000 -12
    0.1877
8 0.0000 -12
    0.1925
9 0.4152 -12
    0.1925
10 0.0000 -2
    0.1877
11 0.0000 -2
    0.1781
12 0.4603 -12
    0.1610
13 0.0000 -12
    0.1418
14 0.4903 -12
    0.1133
15 0.5054 -12
    0.0792
16 0.0000 -12
    0.0455
17 0.5354 -12
    0.5354
11150508 STRAIN 1 0.2814 -2
    0.0564
2 0.0001 -12
    0.0981
3 0.2392 -2
    0.1404
4 0.2181 -2
    0.1757
5 0.0000 -12
    0.1995
6 0.1759 -2
    0.2207
7 0.0000 -12
    0.2326
8 0.0000 -12
    0.2386
1    MARCH 10, 2003 MSC.NASTRAN 5/15/00 PAGE 8079
2011-ULTIMATE,_UP_BM,_-55,_MINP

20030524 Edit by Corion: Added READMORE tag

Re: Fast reading and processing from a text file - Perl vs. FORTRAN
by pzbagel (Chaplain) on May 24, 2003 at 04:24 UTC

    You'll need to post details of your "extract info" routine. This is probably what is slowing down your code. I've worked with extremely large datasets before (nearly 2 GB of data), and my programs have typically run pretty quickly (under 10 minutes); however, I was performing very simple transforms on the data. Are you storing things in variables? Using regexes? We'll need the details to help you.

Re: Fast reading and processing from a text file - Perl vs. FORTRAN
by blokhead (Monsignor) on May 24, 2003 at 04:25 UTC
    Well, "extract info" (in your pseudocode) is pretty vague. There are a lot of ways to grab fixed-width data from a file. The ones that come to mind ...
    • regexes (slowest)
    • split (usually ok)
    • unpack (usually fastest)
    I'd recommend switching the parsing to unpack, you will probably see a big speedup (especially if you're currently grabbing the data with regexes). Use the "A<num>" pack template to get a <num>-character field from the data.

    Actually, you should benchmark split vs. unpack against your data but I'm almost positive unpack will be faster.
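
    A minimal sketch of such a benchmark, using the core Benchmark module. The record below is a made-up three-field, 10-columns-per-field line, not the real f06 layout, so the timings only mean something once the template and sample line are replaced with real data:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical fixed-width record: three fields, 10 columns each.
my $line = sprintf("%-10s%-10s%-10s", "11150505", "STRAIN", "0.0369");

# Whitespace split: simple, but scans the whole line for separators.
sub by_split  { my @f = split ' ', $line;            return \@f }

# Fixed-width unpack: "A10" takes 10 characters and strips trailing spaces.
sub by_unpack { my @f = unpack 'A10 A10 A10', $line; return \@f }

# Both approaches must agree before timing them means anything.
my ($s, $u) = (by_split(), by_unpack());
print "split:  @$s\nunpack: @$u\n";

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese(-1, { split => \&by_split, unpack => \&by_unpack });
```

    The relative numbers depend on field count and line length, so it is worth rerunning this with a line copied from an actual result file.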

    blokhead

Re: Fast reading and processing from a text file - Perl vs. FORTRAN
by DigitalKitty (Parson) on May 24, 2003 at 04:29 UTC
    Hi ozgurp.

    If possible, could you post your perl code?
    I'm fairly confident we ( collectively ) can decrease the amount of time needed to process the data with perl. An hour is a long time when compared to 10 minutes.

    Thanks,
    -Katie.
      Hi Katie, here is the code that extracts failure indices (Please note I am only a beginner perl programmer):
      use strict;
      use warnings;

      my @FileArray = ("c:/ultimate1_it2.f06");

      open(QUAD4FIINFILE, ">QUAD4FI.txt") or die "Unable to open QUAD4FI.txt file\n";
      open(QUAD4CEINFILE, ">QUAD4CE.txt") or die "Unable to open QUAD4CE.txt file\n";
      open(FAIL_FLAG_QUAD4, ">FAIL_FLAG_QUAD4.txt") or die "Unable to open FAIL_FLAG_QUAD4.txt file\n";

      &Initial_Sort();

      close QUAD4FIINFILE;
      close QUAD4CEINFILE;
      close FAIL_FLAG_QUAD4;

      sub Initial_Sort {
          #------------------------------------------------------
          my $in = 0;
          my $loadname;
          my $loadname1;
          my $loadname2;
          my @subcaseno = ();
          my $last = "";
          my $var;

          # The following variables are for extraction of failure indices for layered composite elements
          my $QUAD4CE_Element_ID = 0;
          my $QUAD4CE_Failure_Theory = "";
          my $QUAD4CE_Flag_for_elem_id_line = 0;
          my $QUAD4CE_Flag_for_long_line = 0;
          my $QUAD4CE_Ply_Id = 0;
          my $QUAD4CE_Failure_Index_1 = "";
          my $QUAD4CE_Failure_Index_2 = "";
          my $QUAD4CE_Element_Id_Line = "";
          # -------------------------------
          my $Size_Of_FileArray = @FileArray;
          #--------------------------------
          my @QUAD4CE_load_array = ();
          my $QUAD4CE_load_array_counter = 0;
          #--------------------------------

          for (my $i =0; $i<= $#FileArray; $i++) {
              open(FILE, $FileArray[$i]) or die "Unable to open input file\n";
              while (<FILE>) { #This is the main while loop that goes through f06 Files.
                  $in = $_;
                  if (($in =~ /^1/) && ($in =~ /MSC/) && ($in =~ /NASTRAN/) && ($in =~ /PAGE/)) {
                      $loadname = <FILE>;
                      if($loadname ne /^\s+$/){
                          chomp ($loadname);
                          $loadname1 = $loadname;
                      }
                      next;
                  }
                  if ( ($in =~ m/^0\s+(.+?)\s+SUBCASE/) || ($in =~ m/^0\s+(.+?)\s+SUBCOM/) || ($in =~ m/^0\s+(.+?)\s+SYM/) || ($in =~ m/^0\s+(.+?)\s+SYMCOM/) || ($in =~ m/^0\s+(.+?)\s+REPCASE/) ) {
                      if ($1 eq " "){
                          $loadname2 = $loadname1;
                      }else{$loadname2 = $1;}
                      @subcaseno = split(' ', $in);
                      $var = @subcaseno;
                      next;
                  }
                  if( ($in =~ /F A I L U R E I N D I C E S F O R L A Y E R E D C O M P O S I T E E L E M E N T S/) && ($in =~ /( Q U A D 4 )/) ) {
                      do {
                          $in = (<FILE>);
                          chomp($in);
                          if ( ($in =~ /\d\.\d\d/) || ($in =~ /0\.0/) || ($in =~ /\.0/) || ($in =~ /\d+/) ) {
                              my @array = split(" ", $in);
                              my $size = @array;
                              if($size == 5){
                                  $QUAD4CE_Element_ID = $array[0];
                                  $QUAD4CE_Failure_Theory = $array[1];
                                  $QUAD4CE_Ply_Id = $array[2];
                                  $QUAD4CE_Failure_Index_1 = $array[3];
                                  $QUAD4CE_Failure_Index_2 = $array[4];
                                  $QUAD4CE_Flag_for_elem_id_line = 1;
                              }elsif($size == 3){
                                  $QUAD4CE_Ply_Id = $array[0];
                                  $QUAD4CE_Failure_Index_1 = $array[1];
                                  $QUAD4CE_Failure_Index_2 = $array[2];
                                  $QUAD4CE_Flag_for_long_line = 1;
                              }elsif( ( ($size == 1) && ($QUAD4CE_Flag_for_elem_id_line == 1) ) || ( ($size == 2) && ($QUAD4CE_Flag_for_elem_id_line == 1) ) ){
                                  print QUAD4CEINFILE ("$QUAD4CE_Element_ID $QUAD4CE_Failure_Theory $QUAD4CE_Ply_Id $QUAD4CE_Failure_Index_1 $QUAD4CE_Failure_Index_2 $array[0] $subcaseno[$var-1] $loadname2\n");
                                  $QUAD4CE_Element_Id_Line = $QUAD4CE_Element_ID;
                                  $QUAD4CE_Flag_for_elem_id_line = 0;
                              }elsif( (($size == 1) && ($QUAD4CE_Flag_for_long_line == 1)) || (($size == 2) && ($QUAD4CE_Flag_for_long_line == 1)) ){
                                  print QUAD4CEINFILE ("$QUAD4CE_Element_ID $QUAD4CE_Failure_Theory $QUAD4CE_Ply_Id $QUAD4CE_Failure_Index_1 $QUAD4CE_Failure_Index_2 $array[0]\n");
                                  $QUAD4CE_Flag_for_long_line = 0;
                                  print QUAD4FIINFILE ("$QUAD4CE_Element_Id_Line $array[0] $subcaseno[$var-1] $loadname2\n");
                                  if( ($size == 2) && ( (defined $array[1] && $array[1] =~ m/\*{3}/) ) ) {
                                      print FAIL_FLAG_QUAD4 ("$QUAD4CE_Element_ID $QUAD4CE_Ply_Id $QUAD4CE_Failure_Index_1 $QUAD4CE_Failure_Index_2 $array[0] $array[1] $subcaseno[$var-1] $loadname2\n");
                                  }
                              }
                          }
                      }until (($in =~ /^1/) && ($in =~ /MSC/) && ($in =~ /NASTRAN/) && ($in =~ /PAGE/));
                      if($subcaseno[$var-1] =~ /^\d+$/){
                          $QUAD4CE_load_array[$QUAD4CE_load_array_counter] = $subcaseno[$var-1];
                          $QUAD4CE_load_array_counter++;
                      }
                  } # End of if
              } # End Of while - end of main while loop that goes through each f06 file
              close FILE;
          } # End of for (my $i =0; $i<= $#FileArray; $i++) {
      }
        ozgurp,
        Unfortunately I am not a perl guru myself. I can only provide you with some hints. Typically, a better algorithm is what will make your code run faster. Sometimes you can trade memory for time by caching (see Memoize by Dominus). When you want to evaluate how a tweak has impacted performance - look into Benchmark. The thing to remember here is to go through many iterations to remove "flukes", vary your data as code behaves differently based off input, and try to test on a system at rest so it won't be influenced by other running programs. There is also Devel::DProf.

        Let me point out a few things in your code that may or may not help you.

      • my @FileArray = ("c:/ultimate1_it2.f06"); - I am assuming this is this way because you might have numerous file names in this array? If not, there is no need to make it an array.
      • &Initial_Sort(); - This is normally considered bad form. Use either the & or the (), not both; the tendency is to lean towards ().
      • my $Size_Of_FileArray = @FileArray; - This is probably not needed and is likely to break. If you use @FileArray in a scalar context, it will provide you with what you are after. The problem with this is if you alter @FileArray, you have to remember to update $Size_Of_FileArray.
      • for (my $i =0; $i<= $#FileArray; $i++) { - This is usually done as for (0 .. $#FileArray) or, if you don't like dealing with $_ (nested loops are also a good reason), you can use for my $index (0 .. $#FileArray).
      • The regex engine is expensive. It looks like at the beginning of parsing you are trying to throw away some lines you aren't interested in. The problem is this check has to be performed on every single line of the file. It would be better to create a flag variable. Test to see if the flag is set, if not check for the lines you want to avoid, and then set the flag. This way, only a variable is checked in memory.
      •   if ( ($in =~ m/^0\s+(.+?)\s+SUBCASE/) || ($in =~ m/^0\s+(.+?)\s+SUBCOM/) || ($in =~ m/^0\s+(.+?)\s+SYM/) || ($in =~ m/^0\s+(.+?)\s+SYMCOM/) || ($in =~ m/^0\s+(.+?)\s+REPCASE/) ) { - you could probably reduce the invocations of the regex engine - \s+SUB(CASE|COM) \s+SYM(COM)?
      • You may also want to consider index if you do not care where something appears in a line, but just want to know if it is present. I would recommend benchmarking this as the data you are checking usually dictates which will be faster.
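
        As a rough sketch of the index suggestion in the last bullet, the page-header test could look something like this. The helper name is_page_header and the exact spacing of the sample lines are assumptions for illustration; the search strings are taken from the sample output in the question:

```perl
use strict;
use warnings;

# Page-header test using substr() and index() instead of four regex matches.
# index() returns -1 when the substring is absent.
sub is_page_header {
    my ($line) = @_;
    return substr($line, 0, 1) eq '1'
        && index($line, 'MSC.NASTRAN') >= 0
        && index($line, 'PAGE') >= 0;
}

# Sample lines with assumed spacing (the real column positions are unknown).
my @sample = (
    "1    MARCH 10, 2003  MSC.NASTRAN  5/15/00   PAGE  8077",
    "   2   0.0000   -2",
    "0      SUBCOM 2011",
);

for my $line (@sample) {
    printf "%-4s %s\n", (is_page_header($line) ? 'yes' : 'no'), $line;
}
```

        Whether this is actually faster than the regexes on real data is something only a benchmark can answer.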

        Now, I am sure other monks would be able to look at the data that you provided and write a very fast and elegant script to do what you are asking.

        Cheers - L~R

        I haven't analyzed your code real closely, but this part sticks out as something that might be optimized:
        if ( ($in =~ m/^0\s+(.+?)\s+SUBCASE/) || ($in =~ m/^0\s+(.+?)\s+SUBCOM/) || ($in =~ m/^0\s+(.+?)\s+SYM/) || ($in =~ m/^0\s+(.+?)\s+SYMCOM/) || ($in =~ m/^0\s+(.+?)\s+REPCASE/) ) {
        Regexes tend to do a lot better on fixed strings, and especially on strings which are anchored to the beginning. So what I might try is:
        # Give up right away if we don't find '0' at the beginning
        if ( ($in =~ /^0/) && ( $in =~ /SUBCASE/ or $in =~ /SUBCOM/ or ...)) {
        Or you might try to combine your key strings:
        if ( ( substr($in, 0, 1) eq '0' ) and ( $in =~ /\b(?:SUB|REP)(?:CASE|COM)/ or ... ) ) {
        For one thing looking for SYM and then SYMCOM is redundant and a waste of time, unless you want a '\b' after the strings.

        You might try the study function before doing the above regexes, it may or may not help. Try using the Benchmark module to see what is best on your data.

        Update: And looking again, it's probably the next section that needs the most help...

Re: Fast reading and processing from a text file - Perl vs. FORTRAN
by TomDLux (Vicar) on May 25, 2003 at 01:44 UTC

    I can find a number of improvements, and switching to unpack() instead of split() to process a few GB of data might have a significant effect, but I don't find anything disastrous. You've avoided the mistake of using for $in ( <FILE> ), which reads in the whole 280 MB file and does a lot of swapping.


    Your @FileArray consists of one filename. I'll assume it normally contains a lot of names.


    QUAD4FIINFILE and QUAD4CEINFILE are not input files ... how about naming them ...OUTFILE? Or better yet, discard half the characters and call them Q4FI and Q4CE.


    Instead of your C/Java/FORTRAN-style for loop, how about:

    for ( @FileArray ) { open(FILE, $_) or die $!;

    The for() takes each element of the array, in turn, and assigns it to $_, which is then passed to open as the name of the file. (A one-argument open FILE would not look at $_; it takes the file name from the package variable $FILE.) If the open has a problem, die with the system error message.


    Instead of:

    while ( <FILE> ) { $in = $_

    Since you want the value in a named variable, and you'll be doing lots with it, why not put it into the named variable right away? Oh, and chomp $in, once, right away:

    while ( $in = <FILE> ) { chomp $in

    My guess is that ($in =~ /^1/) && ($in =~ /MSC/) &&  ($in =~ /NASTRAN/) && ($in =~ /PAGE/) detects the page header. Since only one of those regexes is anchored, the others have to search the string, anyway. The new regex is still anchored. I use non-greedy searches to minimize back-tracking.

    if ( $line =~ /^1.*?MSC.NASTRAN.*?PAGE/ ) { my $name = <FILE>; $loadname = $name if ($name !~ /^\s*$/ ); next; }

    Notice that you need a negated regex, '!~', not a negated string comparison, 'ne'.
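
    A small demonstration of why the original test never rejects a blank line. The variable names here are made up for the example:

```perl
use strict;
use warnings;

my $blank = "   \n";

# With 'ne', the pattern is matched against $_ (not $blank), and $blank is
# then string-compared with the match result (1 or the empty string).
# A whitespace-only line differs from both, so the test is always true.
local $_ = "some other line\n";
my $with_ne = ($blank ne /^\s+$/) ? 1 : 0;

# '!~' binds the pattern to $blank itself, which is what was intended.
my $with_not_match = ($blank !~ /^\s+$/) ? 1 : 0;

print "ne says non-blank: $with_ne\n";         # prints 1 (wrong)
print "!~ says non-blank: $with_not_match\n";  # prints 0 (right)
```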

    This isn't Pascal, you can declare local variables anywhere you want, not just at the top of a routine. Read the line into a local variable, and copy to an outer scope variable only if it has something you want to keep


    if ( $line =~ m/^0\s+(\S*?)\s+(?:(SUBCASE)|(SUBCOM)|(SYM\b)|(SYMCOM)|(REPCASE))/ ) { $loadname = $1 if $1; $type = $+; next; }

    Instead of five regexes, of which as many as four may fail, let's have one. The beginnings of the lines we match are all the same, but they end with one of the five words. SYM needs an end-of-word detector '\b' to distinguish it from SYMCOM. If we don't have a match on the non-space block, $1 will be empty or undefined (false either way), so $loadname retains its old value. Instead of splitting the string and keeping track of the array size to assess whether it's a SUBCASE, SUBCOM, etc., $+ provides the last defined bracketed expression. That's exactly what we want ... only one of those right-hand alternatives will be true. Save it in an externally defined variable for later use.
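
    Here is a small self-contained sketch of that idea. The sub name parse_subcase_line and the sample lines (including the load name MYLOAD) are invented for the example, and the keyword alternatives are wrapped in a non-capturing group so that the leading "^0" applies to all of them:

```perl
use strict;
use warnings;

# Returns (load name, keyword) for a subcase header line, or an empty list.
sub parse_subcase_line {
    my ($line) = @_;
    return unless $line =~
        m/^0\s+(\S*?)\s+(?:(SUBCASE)|(SUBCOM)|(SYM\b)|(SYMCOM)|(REPCASE))/;
    # $+ is the highest-numbered capture group that took part in the match,
    # i.e. whichever keyword alternative succeeded.
    return ($1, $+);
}

for my $line ("0      SUBCOM 2011", "0  MYLOAD   SYMCOM 7", "   2   0.0000   -2") {
    my ($name, $type) = parse_subcase_line($line);
    if (defined $type) {
        print "type=$type name='$name'\n";
    } else {
        print "no match: $line\n";
    }
}
```

    Note that \S*? cannot span a load name containing spaces, so this only suits single-word names.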


    The F A I L U R E and Q U A D strings are on the same line, and have no text which is not hard-coded. Why not have a variable defined at the top of the routine or in a configuration section with the full value of the string, and do a string comparison. :

    my $failure_indices_line = "F A I L U R E ....... ( Q U A D )"; ... if ( $in eq $failure_indices_line ) {

    You detect if the line contains an integer or real number. How about defining the goal as a block consisting only of digits and decimal points?

    if ( $in =~ /[\d.]+/ ) {
    Enlil points out that the '+' is superfluous: you have a match as soon as you detect a digit or a dot.

    It's a real killer figuring out the meaning of the code where you extract data from the numbers. First, you split on whitespace, which means that numbers in fields 1, 2, 3 produce a three element array, as would numbers in fields 7, 8 and 9. Second, the sample lines are too long, so they wrap around.

    Instead of using split, use unpack(). It's a little more inconvenient to set up, but since your files are rigidly column-oriented, it's a perfect match, and will run faster. Use '@N' to 'jump to column N' and "Ad" to collect 'd' space-terminated ASCII characters. Spaces that follow the number are dropped, leading spaces are kept. The first field will have leading spaces, so you might want to force conversion to a number, using $var += 0. Field three has a fixed left column, and grows to the right, so unpack will automatically handle that. (But what happens with positive numbers? Do they move into the column used by the negative sign?)
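
    Something along these lines, as a sketch only. The column offsets below are invented (the sample in the question has lost its original spacing), so the template would have to be adjusted after measuring a real f06 line:

```perl
use strict;
use warnings;

# Hypothetical layout (0-based offsets), NOT the real f06 columns:
#   cols  0-9   element id (right-justified)
#   cols 12-19  failure theory
#   cols 22-25  ply id
#   cols 28-35  failure index
my $line = sprintf("%10s  %-8s  %4d  %8.4f", 11150505, "STRAIN", 1, 1.0088);

# '@N' jumps to offset N; 'An' takes n characters and strips trailing spaces.
my ($elem, $theory, $ply, $fp) = unpack '@0 A10 @12 A8 @22 A4 @28 A8', $line;

# Leading spaces survive unpack; adding 0 forces a numeric value.
$_ += 0 for $elem, $ply, $fp;

print "elem=$elem theory=$theory ply=$ply fp=$fp\n";
```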

    You do this do{} loop until coming to another line with '1' in the first column. That anchored '1' is sufficient; you don't need the other three tests. But since the line contains a number, it will match the inner if, and so will be processed by the if ($size == 5) { tests. Luckily, it contains 9 or more fields. But in the print statements, what used to be $loadname2 is now $loadname, and $subcaseno[$var-1] is now $type.


    You then check to see if the last field of $subcaseno is numeric ... but it's going to be SUBCASE, SUBCOM, etc.

    Also, you've reached that header line or whatever it is, but you don't do anything. You leave the inner loop and go back to the top of the while() where a new line is read in, and probably discarded, since it doesn't have any context.

    Tom
Re: Fast reading and processing from a text file - Perl vs. FORTRAN
by BrowserUk (Pope) on May 25, 2003 at 03:37 UTC

    Having profiled your code line by line two different ways, it seems that almost all of the time is spent performing IO, either reading from or writing to files. The only executable line that shows any significant CPU or elapsed time usage is the line

    92:         my @array = split(" ", $in);

    Which I don't see any easy way to optimise.

    Of course, the IO characteristics of my system may be significantly different from the system you are running on, though I notice you are on a Win32 system too, so it shouldn't be so different.

    As for why the FORTRAN program is so much faster, assuming that it is doing the same processing and IO and running on the same hardware (which is an open question), then my best guess is that the FORTRAN IO uses larger buffers and asynchronous reads and writes. I know of at least one PC-based FORTRAN compiler that does this.

    It is possible to use threads and other techniques to achieve this using perl without radically altering the structure of your existing program, but it isn't trivial to do. Whether you would come close to getting the kind of speed up you would need is very speculative.
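
    The larger-buffer half of that guess can at least be tried without threads. The following is only a sketch: the sub name each_line_chunked is made up, the tiny chunk size and in-memory file are there so the example runs on its own, and whether this beats Perl's normal buffered line reads on a real f06 file is untested:

```perl
use strict;
use warnings;

# Read a handle in large blocks and hand each complete line (without its
# newline) to a callback. A partial line at the end of a block is carried
# over and completed by the next block.
sub each_line_chunked {
    my ($fh, $chunk_size, $callback) = @_;
    my $tail = '';
    while (read($fh, my $buf, $chunk_size)) {
        $buf = $tail . $buf;
        my $pos = 0;
        while ((my $nl = index($buf, "\n", $pos)) >= 0) {
            $callback->(substr($buf, $pos, $nl - $pos));
            $pos = $nl + 1;
        }
        $tail = substr($buf, $pos);
    }
    $callback->($tail) if length $tail;   # last line without a newline
}

# Demo on an in-memory file; a real run might use a chunk size of 1 MB or more.
my $data = "line one\nline two\n\nline four";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";
my @lines;
each_line_chunked($fh, 5, sub { push @lines, $_[0] });
close $fh;

print scalar(@lines), " lines read\n";
```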

    This is the output from using Devel::SmallProf on your code. I've massaged the output to just show the significant lines in decreasing order of cpu usage.

    cpu/iter  iters  wall s    cpu s     line
    0.4281        1  0.428139  0.000000   19: close FAIL_FLAG_QUAD4;
    0.0034      296  0.999800  0.110000  121: print FAIL_FLAG_QUAD4 ("$Element_ID
    0.0011      888  0.999800  0.191000  112: $Flag_for_elem_id_line = 0;
    0.0003     3257  0.999800  0.623000   62: if (($in =~ /^1/) && ($in =~ /MSC/) &&
    0.0003    30192  8.998197  6.425000   92: my @array = split(" ", $in);
    0.0003    13912  3.999199  3.344000  115: print Q4CE_IN ("$Element_ID
    0.0003    13912  3.999199  2.245000  116: $Flag_for_long_line = 0;
    0.0003    13912  3.999199  3.103000  118: print Q4FI_IN ("$Element_Id_Line
    0.0003    13912  3.999199  2.973000  120: if( ($size == 2) && ( (defined
    0.0002    31672  5.998798  6.290000   88: $in = (<FILE>);
    0.0002    31672  4.998999  5.005000   89: chomp($in);
    0.0002    30192  5.998798  4.405000   93: my $size = @array;
    0.0002    30192  6.998598  5.549000   95: if($size == 5){
    0.0002    13912  2.999399  2.341000  103: $Ply_Id = $array[0];
    0.0001    13912  1.999599  2.293000  104: $Failure_Index_1 = $array[1];
    0.0001    13912  0.999800  2.935000  106: $Flag_for_long_line = 1;
    0.0000        1  0.000000  0.000000    5: my @FileArray = ("junk.txt");
    0.0000        1  0.000000  0.000000    7: open(Q4FI_IN, ">QUAD4FI.txt") or die "Unable
    0.0000        1  0.000000  0.010000    9: open(Q4CE_IN, ">QUAD4CE.txt") or die "Unable

    I noticed a couple of errors in your code, but these have been noted above.

    One question: are you guys paid by the keystroke? :^)

    Those have to be the longest variable names I've come across in a while.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      Thanks to everyone who responded.

      Ozgur
