File handles - there must be a better way

by Anonymous Monk
on May 13, 2013 at 16:11 UTC ( #1033314=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I often find the need to open a fairly large number of input files simultaneously, to run line-by-line comparisons on them. I cannot read in all of the complete files before starting my analysis: because of the large size of the files, that would be a very slow and RAM-intensive approach.

The way I'm doing this now is horribly clumsy. I have an array of input file names, and then I have an array of arbitrary file handles, with at least as many elements as the former array, e.g.:

my @FileHandles = ('A', 'B', 'C'...'X');

I have a loop that opens all input files. Then to start reading them:
$file1 = $InHandles[0];
while (<$file1>){
    ...
    for ($f = 1; $f < @InHandles; $f++){
        $file = $InHandles[$f];
        $_ = (<$file>);
        ...
where "..." includes the retrieval of information from the current line of each file.

Today I found a need to read in over 100 files, and this has motivated me to find a more sensible way to deal with large numbers of concurrently used file handles. Is there a simple way to generate unique handles automatically for each element in my array of input files, without typing out a silly array like the @FileHandles shown above?

I've also found that at least when using double letter file handles from an array, I am forced to turn off 'use strict', which I otherwise keep on to catch typos.

Thanks for any help you can offer.

Re: File handles - there must be a better way (hash)
by LanX (Canon) on May 13, 2013 at 16:42 UTC
    > Is there a simple way to generate unique handles automatically for each element in my array of input files, without typing out a silly array like the @FileHandles shown above?

    I'd rather use a hash to hold the %handle per path.

    DB<111> $path='/tmp/path'
     => "/tmp/path"
    DB<112> open $handle{$path},">",$path;
     => 1
    DB<113> $handle{$path}->print("This is File $path")
     => 1
    DB<114> close $handle{$path}
     => 1
    DB<115> `cat $path`
     => "This is File /tmp/path"

    This works for me, but please note that I had to use the normal (and saner) method call syntax for print.

    But I have to doubt that you really need to open 100 files simultaneously; maybe you should consider a better algorithm which works sequentially?

    HTH! =)

    Cheers Rolf

    ( addicted to the Perl Programming Language)

    edit

    Another limitation is that you can't use the < ... > syntax for readline on a hash element: Perl only treats the angle brackets as a read when they hold a bareword handle or a simple scalar variable, so <$handle{$path}> is parsed as a glob instead.

    DB<127> open $handle{$path},"<",$path;
     => 1
    DB<128> ref $handle{$path}
     => "GLOB"
    DB<129> print while (<$handle{$path}>)
    GLOB(0x8f19688)
    DB<130> print while (readline($handle{$path}))
    This is File /tmp/path
    DB<131> seek $handle{$path},0,0
     => 1
    DB<132> readline($handle{$path})
     => "This is File /tmp/path"

    update

    anyway looping over the hash fixes all syntactic "anomalies" again

    DB<151> while ( my ($path,$handle) = each %handle ) { print <$handle>; }
    This is File /tmp/path

    update

    Just to be sure, no trouble with strict!

    DB<159> ;{ use strict;
               my $path="/tmp/path";
               my %handle;
               open $handle{$path},"<",$path;
               while ( my ($path,$handle) = each %handle ) {
                   print <$handle>;
               }
             }
    This is File /tmp/path
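
Pulling the debugger snippets above together, here is a minimal sketch of the hash-of-handles approach under strict. The scratch directory, file names and contents are made up for the demo; only the %handle idea comes from the reply. It shows the two workarounds for a handle stored in a hash element: a block around the handle for print, and readline() (or a copy into a plain scalar) for reading.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical demo data: two small files in a scratch directory.
my $dir     = tempdir( CLEANUP => 1 );
my %content = ( 'a.txt' => "alpha\nbeta\n", 'b.txt' => "gamma\n" );

my %handle;    # path => lexical file handle
for my $name ( sort keys %content ) {
    my $path = "$dir/$name";
    open $handle{$path}, '>', $path or die "Cannot write '$path': $!";

    # A block around the handle expression lets plain print work.
    print { $handle{$path} } $content{$name};
    close $handle{$path} or die "Cannot close '$path': $!";

    open $handle{$path}, '<', $path or die "Cannot read '$path': $!";
}

my @seen;
for my $path ( sort keys %handle ) {

    # <$handle{$path}> would be parsed as a glob, so call readline() ...
    my $line = readline( $handle{$path} );
    chomp $line;
    push @seen, $line;

    # ... or copy the handle into a simple scalar and use <$fh> as usual.
    my $fh = $handle{$path};
    while ( my $rest = <$fh> ) {
        chomp $rest;
        push @seen, "rest:$rest";
    }
    close $fh;
}
print join( ',', @seen ), "\n";
```

Keying the hash by path keeps the file name next to its handle, which is handy for error messages; if only the order matters, an array does the same job.
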
Re: File handles - there must be a better way
by MidLifeXis (Prior) on May 13, 2013 at 17:06 UTC

    You could eliminate the Cish for loop if you don't need the index into @InHandles.

    for my $file ( @InHandles ) { ... }

    Update: Quite right, Tux. Missed the initial grab of the first file handle.

    --MidLifeXis

      Only if he replaces the first line with

      my $file1 = shift @InHandles;

      Enjoy, Have FUN! H.Merijn

      Or, keeping the index 0 assignment (array  @InHandles is not changed – might need it later?):

      my $file1 = $InHandles[0];
      ...
      for my $fh (@InHandles[ 1 .. $#InHandles ]) {
          ...
      }
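
For illustration, the slice-based loop above might be fleshed out as follows. This is only a sketch: the in-memory "files" and the @rows array are stand-ins invented so the example runs anywhere, and stopping when a shorter file runs out is an assumption about what the original "..." does.

```perl
use strict;
use warnings;

# In-memory "files" stand in for real input (assumed demo data).
my @data = ( "a1\na2\na3\n", "b1\nb2\nb3\n", "c1\nc2\nc3\n" );

my @InHandles;
for my $i ( 0 .. $#data ) {
    open my $fh, '<', \$data[$i] or die "Cannot open in-memory file $i: $!";
    push @InHandles, $fh;
}

# The first file drives the loop; the rest are read in lock-step.
my $file1 = $InHandles[0];
my @rows;
while ( my $first = <$file1> ) {
    chomp $first;
    my @fields = ($first);
    for my $fh ( @InHandles[ 1 .. $#InHandles ] ) {
        my $line = <$fh>;
        last unless defined $line;    # a shorter file ran out
        chomp $line;
        push @fields, $line;
    }
    push @rows, join( ' ', @fields );
}
print "$_\n" for @rows;
```

Reading into a named lexical instead of assigning to $_ avoids clobbering the line from the first file inside the inner loop.
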
Re: File handles - there must be a better way
by pokki (Scribe) on May 13, 2013 at 18:51 UTC

    Sounds like you're stuck on old-style global file handles (open FOO, ...). This forces you to maintain an array of file handle names to avoid collisions, since you can only have one file handle of a given name. If you're already using lexical file handles, disregard the rest of this post!

    Use lexical file handles instead (in general, this is preferred; there is zero advantage to using the other style, and even if the only thing you get from lexical file handles is lexical scope, that's a net win):

    my %filehandle_of;
    foreach my $filename (@filenames) {
        # here $filehandle is a reference to a new anonymous filehandle in
        # each iteration, you can push it into an array if you don't need
        # the mapping instead
        if (open my $filehandle, '<', $filename) {
            $filehandle_of{$filename} = $filehandle;
        }
        else {
            warn "could not open '$filename' for reading: $!\n";
            next;
        }
    }

    See also perldoc -f open.
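
The array variant mentioned in the code comment could look roughly like this. open_all is a hypothetical helper name, and the five scratch files exist only to make the sketch runnable; the handles come back in the same order as the file names, minus any that failed to open.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# open_all is a hypothetical helper: one read handle per file name, in
# order, skipping (with a warning) any file that cannot be opened.
sub open_all {
    my @filenames = @_;
    my @handles;
    for my $filename (@filenames) {
        if ( open my $fh, '<', $filename ) {
            push @handles, $fh;
        }
        else {
            warn "could not open '$filename' for reading: $!\n";
        }
    }
    return @handles;
}

# Demo: create five small files, then hold them all open at once.
my $dir = tempdir( CLEANUP => 1 );
my @filenames;
for my $n ( 1 .. 5 ) {
    my $path = "$dir/input$n.txt";
    open my $out, '>', $path or die "Cannot write '$path': $!";
    print {$out} "f$n\n";
    close $out or die "Cannot close '$path': $!";
    push @filenames, $path;
}

my @handles = open_all(@filenames);
my @lines   = map { scalar readline($_) } @handles;    # first line of each
chomp @lines;
print "@lines\n";
close $_ for @handles;
```

No handle names have to be invented by hand, and nothing here needs 'use strict' turned off.
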

Re: File handles - there must be a better way
by stephen (Priest) on May 13, 2013 at 19:13 UTC

    Leaving hundreds of filehandles open is probably a bad idea. I'm assuming that you're leaving them open in order to read them line-by-line. However, there are more scalable ways of doing that.

    You can get the current position of the file read buffer with:

    my $file_pos = tell($fh);

    And you can go to that file position with:

    seek($fh, $file_pos, 0);

    If you keep track of your position in each file, you can open one file at a time and still read through hundreds of files line-by-line. For example, here's a code snippet that reads through and prints out a cross-section of a bunch of different files, but still only opens one file at a time:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    our @Files = @ARGV;

    MAIN: {
        # A table storing each active filename and its
        # current position
        my %file_table = ();

        # Line number for each file we're reading through
        # (for printout purposes only)
        my $line_num = 0;

        # Set up our file table to point everything to 0
        foreach my $file (@Files) {
            $file_table{$file} = 0;
        }

        # Keep printing each line so long as at least one file
        # has stuff to print
        while ( scalar keys %file_table ) {
            # Keep track of line numbers
            $line_num++;

            # Open each file, seek to last read position,
            # read a line, then note the next position
            foreach my $file ( sort keys %file_table ) {
                open( my $fh, '<', $file ) or die "Oops! $!";
                seek( $fh, $file_table{$file}, 0 );
                my $line = <$fh>;
                print "$file\t$line_num\t$line\n";
                if ( eof $fh ) {
                    delete $file_table{$file};
                }
                else {
                    $file_table{$file} = tell($fh);
                }
                close($fh);
            }
        }
        print "All done\n";
    }

    stephen

      Great ideas here, Stephen. And the same general line of reasoning certainly could be modified in many ways. For example, one could pre-read and then buffer a few lines from each file, replenishing each buffer on an as-needed basis as the program proceeds. This would give fairly efficient access to “the next few lines in each file” without too much burden, and it would scale. You could introduce the concept of “bookmarking” your present position in any given file, then “reading ahead” in search of what you are looking for, knowing that you can “fall back” to the bookmarked point. And so on. All of which wizardry can be generally concealed from most of the rest of the programming.
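
The "bookmark, read ahead, fall back" idea can be sketched with tell and seek. find_ahead is a hypothetical helper name, and the in-memory file is demo data only; this is one possible shape for the idea, not the only one.

```perl
use strict;
use warnings;

# find_ahead is a hypothetical helper: it scans forward from the current
# position for the first line matching $pattern, then falls back to the
# bookmarked position so the caller's place in the file is unchanged.
sub find_ahead {
    my ( $fh, $pattern ) = @_;
    my $bookmark = tell $fh;
    my $found;
    while ( my $line = <$fh> ) {
        if ( $line =~ $pattern ) {
            $found = $line;
            last;
        }
    }
    seek $fh, $bookmark, 0 or die "Cannot seek back to bookmark: $!";
    return $found;
}

# Demo on an in-memory file (assumed data).
my $text = "one\ntwo\nthree\nfour\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my $current = <$fh>;                       # consume the first line
my $ahead   = find_ahead( $fh, qr/^th/ );  # look ahead without moving on
my $next    = <$fh>;                       # continues right after $current

print "current=$current", "ahead=$ahead", "next=$next";
```

Because the helper restores the position itself, the calling code can keep reading line by line without knowing a look-ahead ever happened.
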

      There are definite limits on the number of file-handles that an operating system can be expected to allow any application to have open at one time, and those limits are often rather small ... in theory and/or in practice. I tend to design on the assumption of “maybe, a few dozen.”
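
On a POSIX-like system, one way to see the per-process limit rather than guess at it is to ask sysconf. A small sketch, assuming the platform defines _SC_OPEN_MAX (it may not be available everywhere, for example on Windows):

```perl
use strict;
use warnings;
use POSIX qw(sysconf _SC_OPEN_MAX);

# sysconf returns undef if the value cannot be determined.
my $limit = sysconf(_SC_OPEN_MAX);

if ( defined $limit ) {
    print "This process may have up to $limit open file descriptors\n";
}
else {
    print "Could not determine the open-file limit on this system\n";
}
```

Keep in mind that STDIN, STDOUT, STDERR and any handles opened by loaded modules count against the same limit.
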
