Re: Fast reading and processing from a text file - Perl vs. FORTRAN

by TomDLux (Vicar)
on May 25, 2003 at 01:44 UTC


in reply to Fast reading and processing from a text file - Perl vs. FORTRAN

I can find a number of improvements, and switching to unpack() instead of split() to process a few GB of data might have a significant effect, but I don't find anything disastrous. You've avoided the mistake of using for $in ( <FILE> ), which reads in the whole 280 MB file at once and causes a lot of swapping.


Your @FileArray consists of one filename. I'll assume it normally contains a lot of names.


QUAD4FINFILE and QUAD4CEINFILE are not input files ... how about naming them ...OUTFILE. Or better yet, discard half the characters and call them Q4FI and Q4CE?


Instead of your C/Java/FORTRAN-style for loop, how about:

for ( @FileArray ) { open FILE, '<', $_ or die "$_: $!";

The for() takes each element of the array in turn and assigns it to $_. open() does not default to $_ (a one-argument open FILE takes the name from the package variable $FILE), so pass $_ explicitly. If the open has a problem, die with the file name and the system error message.


Instead of:

while ( <FILE> ) { $in = $_

Since you want the value in a named variable, and you'll be doing lots with it, why not put it into the named variable right away? Oh, and chomp $in, once, right away:

while ( $in = <FILE> ) { chomp $in
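
The loop above can be tried in isolation with a minimal sketch like the following. The sample data, the in-memory filehandle, the lexical $fh (instead of the bareword FILE) and the helper name read_lines are all assumptions made for illustration; with a real @FileArray you would pass each file name to the helper instead.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for one input file.
my $sample = "first line\nsecond line\nthird line\n";

# Count the lines and remember the last one, reading one line at a time.
sub read_lines {
    my ($source) = @_;    # a file name, or a reference to a string
    open my $fh, '<', $source or die "Cannot open input: $!";
    my ( $count, $last ) = ( 0, undef );
    while ( my $in = <$fh> ) {    # scalar context: one line per iteration
        chomp $in;                # strip the newline once, right away
        $count++;
        $last = $in;
    }
    close $fh;
    return ( $count, $last );
}

my ( $count, $last ) = read_lines( \$sample );
print "$count lines, last one: '$last'\n";
```

Because the read happens in scalar context, only one line is held in memory at a time, which is the point of preferring while over for here.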

My guess is that ($in =~ /^1/) && ($in =~ /MSC/) && ($in =~ /NASTRAN/) && ($in =~ /PAGE/) detects the page header. Since only one of those regexes is anchored, the others have to search the string anyway. A single regex can replace all four, assuming the words always appear in this order. The new regex is still anchored, and I use non-greedy searches to minimize back-tracking.

if ( $in =~ /^1.*?MSC.NASTRAN.*?PAGE/ ) { my $name = <FILE>; $loadname = $name if ($name !~ /^\s*$/ ); next; }

Notice that you need a negated regex, '!~', not a negated string comparison, 'ne'.

This isn't Pascal: you can declare local variables anywhere you want, not just at the top of a routine. Read the line into a local variable, and copy it to an outer-scope variable only if it has something you want to keep.
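
Here is a rough sketch of that header handling as a self-contained script. The sample lines are made up (real NASTRAN output will differ in layout), and the helper name last_loadname and the in-memory filehandle are assumptions for illustration only.

```perl
use strict;
use warnings;

# Made-up lines; real output will differ in layout.
my $sample = join "\n",
    '1   SAMPLE JOB        MSC.NASTRAN   5/25/03   PAGE     7',
    '     LOAD CASE A',
    '0   some other line',
    '1   SAMPLE JOB        MSC.NASTRAN   5/25/03   PAGE     8',
    '   ',
    '';

# Return the most recent non-blank line that followed a page header.
sub last_loadname {
    my ($text) = @_;
    my $loadname = '';
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    while ( my $in = <$fh> ) {
        chomp $in;
        if ( $in =~ /^1.*?MSC.NASTRAN.*?PAGE/ ) {
            my $name = <$fh>;             # the line after the header
            last unless defined $name;    # the header was the last line
            chomp $name;
            $loadname = $name if $name !~ /^\s*$/;    # ignore blank names
            next;
        }
    }
    close $fh;
    return $loadname;
}

print "load name: '", last_loadname($sample), "'\n";
```

The second header in the sample is followed by a blank line, so the name picked up after the first header is kept. $name lives only inside the if block; only $loadname survives from one header to the next.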


if ( $in =~ m/^0\s+(\S*?)\s+(?:(SUBCASE)|(SUBCOM)|(SYM\b)|(SYMCOM)|(REP +CASE))/ ) { $loadname = $1 if $1; $type = $+; next; }

Instead of five regexes, of which as many as four may fail, let's have one. The lines we match all begin the same way, but end with one of five words. The alternatives are wrapped in a non-capturing group, (?: ... ), because otherwise '|' would split the whole pattern and only SUBCASE would be tied to the anchored prefix. SYM needs an end-of-word detector, '\b', to distinguish it from SYMCOM. If there is no non-space block before the keyword, $1 will be empty, so $loadname retains its old value. Instead of splitting the string and keeping track of the array size to assess whether it's a SUBCASE, SUBCOM, etc., $+ provides the last defined bracketed expression. That's exactly what we want: only one of those right-hand alternatives will match. Save it in an externally defined variable for later use.
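
To see how $1 and $+ behave with a grouped alternation of this kind, here is a small sketch. The helper name parse_case_line and the sample lines are invented for illustration; the real files may be laid out differently.

```perl
use strict;
use warnings;

# Return (load name or '', keyword) for a matching line, or an empty list.
sub parse_case_line {
    my ($in) = @_;
    if ( $in =~ m/^0\s+(\S*?)\s+(?:(SUBCASE)|(SUBCOM)|(SYM\b)|(SYMCOM)|(REP +CASE))/ ) {
        # $+ is the highest-numbered group that matched: the keyword.
        return ( $1, $+ );
    }
    return;
}

# Made-up sample lines.
for my $in (
    '0     LOAD1     SUBCASE 3',
    '0        SYMCOM 2',
    '0     LOAD2     SYM 4',
    'this line mentions SUBCOM in passing',
  )
{
    my @found = parse_case_line($in);
    if (@found) { print "name='$found[0]' type='$found[1]'\n" }
    else        { print "no match: $in\n" }
}
```

The last sample line is not a case line, and it is rejected only because the alternatives are grouped; without the group, any line containing SUBCOM would match.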


The F A I L U R E and Q U A D strings are on the same line, and that line has no text which is not hard-coded. Why not define a variable with the full value of the string, at the top of the routine or in a configuration section, and do a string comparison?

my $failure_indices_line = "F A I L U R E ....... ( Q U A D )"; ... if ( $in eq $failure_indices_line ) {

You detect if the line contains an integer or real number. How about defining the goal as a block consisting only of digits and decimal points?

if ( $in =~ /[\d.]+/ ) {
Enlil points out that the '+' is superfluous: you have a match as soon as you detect a digit or a dot.
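
A quick sketch of that test without the '+', run against a few made-up lines. The helper name looks_like_data is hypothetical.

```perl
use strict;
use warnings;

# True if the line contains at least one digit or decimal point.
sub looks_like_data {
    my ($in) = @_;
    return $in =~ /[\d.]/ ? 1 : 0;
}

# Made-up lines, for illustration only.
for my $in ( '   101   -1.25E+01', 'NO NUMBERS HERE', 'MSC.NASTRAN' ) {
    printf "%-20s => %d\n", $in, looks_like_data($in);
}
```

Note that the third line passes as well, because of the dot. The test is only safe if title and header lines have already been handled before it is reached.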

It's a real killer figuring out the meaning of the code where you extract data from the numbers. First, you split on whitespace, which means that numbers in fields 1, 2 and 3 produce a three-element array, as would numbers in fields 7, 8 and 9. Second, the lines in the sample files are so long that they wrap around.

Instead of using split, use unpack(). It's a little more inconvenient to set up, but since your files are rigidly column-oriented, it's a perfect match, and should run faster. Use '@N' to jump to column N (counting from 0) and 'Ad' to collect a field of d ASCII characters. Spaces that follow the number are dropped; leading spaces are kept. The first field will have leading spaces, so you might want to force conversion to a number, using $var += 0. Field three has a fixed left column and grows to the right, so unpack will automatically handle that. (But what happens with positive numbers? Do they move into the column used by the negative sign?)
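
As a sketch of the unpack() approach: the column offsets and widths below are invented, since the real layout has to be read off the actual files, and the helper name parse_fields is hypothetical. The sample line is built with sprintf so that the columns are exact.

```perl
use strict;
use warnings;

# Assumed layout: an 8-character integer at offset 0, then two
# 12-character reals starting at offsets 10 and 24.
my $TEMPLATE = '@0 A8 @10 A12 @24 A12';

# Pull the three fields out of a fixed-column line and make them numeric.
sub parse_fields {
    my ($line) = @_;
    my ( $id, $x, $y ) = unpack $TEMPLATE, $line;
    $_ += 0 for $id, $x, $y;    # force numeric; leading spaces are harmless
    return ( $id, $x, $y );
}

# Made-up data line with the columns described above.
my $line = sprintf '%8d  %-12s  %-12s', 101, '-1.2500E+01', '3.4000E+00';
my ( $id, $x, $y ) = parse_fields($line);
print "id=$id x=$x y=$y\n";
```

If each field width includes the column reserved for the sign, positive and negative values are picked up the same way, whichever column the first digit lands in.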

You do this do{} loop until coming to another line with '1' in the first column. That anchored '1' is sufficient; you don't need the other three tests. But since the line contains a number, it will match the inner if, and so will be processed by the if (size == 5) { tests. Luckily, it contains 9 or more fields. But in the print statements, what used to be $loadname2 is now $loadname, $subcaseno$var-1 is now $type.


You then check to see if the last field of $subcaseno is numeric ... but it's going to be SUBCASE, SUBCOM, etc.

Also, you've reached that header line or whatever it is, but you don't do anything with it. You leave the inner loop and go back to the top of the while(), where a new line is read in, and probably discarded, since it doesn't have any context.

Tom
