PerlMonks  

Efficient file handling (was Re^3: trouble parsing log file...)

by jarich (Curate)
on Nov 23, 2006 at 14:59 UTC ( #585732=note )


in reply to Re^2: trouble parsing log file...
in thread trouble parsing log file...

I thought I'd reply rather than --ing your post just because I disagreed.

I cannot think of any meaning of the phrase "more efficient" which would render your statement correct.

All the reading I've ever done on the matter says that parsing a file line by line is extremely efficient. What happens is as follows. The operating system reads a chunk of the file into memory; this is then broken up on newlines (or whatever the value of $/ is); then we iterate over each line until we run out and the process repeats. We can parse a file line by line as follows:

while ( <FILE> )

If we choose to stop reading the file at any point (perhaps we've found what we want) and call last, we read only as much of the file as we need. This makes it efficient time-wise, and because we hold only one chunk of the file in memory at a time, it's efficient memory-wise too.
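A minimal sketch of the early exit described above. The helper name find_first_match and the in-memory filehandle are assumptions made for illustration; a real log file would be opened by name instead.

```perl
use strict;
use warnings;

# Return the first line matching $pattern, plus the number of lines read.
# find_first_match is a hypothetical helper name, not from the thread.
sub find_first_match {
    my ( $fh, $pattern ) = @_;
    my $lines_read = 0;
    my $found;
    while ( my $line = <$fh> ) {
        $lines_read++;
        if ( $line =~ $pattern ) {
            chomp( $found = $line );
            last;    # stop here; later lines are never examined
        }
    }
    return ( $found, $lines_read );
}

# An in-memory filehandle stands in for a real log file.
my $log = join '', map { "line $_\n" } 1 .. 1000;
open my $fh, '<', \$log or die "Cannot open in-memory file: $!";
my ( $match, $count ) = find_first_match( $fh, qr/^line 3$/ );
close $fh;

print "Found '$match' after reading $count of 1000 lines\n";
```

The loop hands back control as soon as the pattern matches, so the remaining 997 lines are never split off or inspected.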

By contrast, my reading says that "dumping the file to an array" and then parsing it line by line is very inefficient. This is the case whether we do it like this:

my @logarray = <FILE>;
foreach my $element (@logarray)

or like this:

foreach my $element (<FILE>)

This is because the file system still gives Perl the file on a chunk by chunk basis, and Perl still splits it up on $/, but Perl has to do this for the whole file even if we're only going to look at the first 10 lines. Worse, Perl now has to store the entire file in memory, rather than just a chunk. So this is the least efficient way to handle a file in Perl.
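The difference can be observed with a small sketch, assuming an in-memory filehandle in place of a real file. Both loops stop after the first line; eof() then shows whether anything was left unread.

```perl
use strict;
use warnings;

my $data = join '', map { "entry $_\n" } 1 .. 5;

# Line by line: stop after the first line; the rest stays unread.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
while ( my $line = <$fh> ) {
    last;
}
my $unread_after_while = eof($fh) ? 0 : 1;
close $fh;

# foreach evaluates <$fh> in list context, so every line is read
# and held in a temporary list before the loop body runs even once.
open $fh, '<', \$data or die "Cannot open in-memory file: $!";
foreach my $line (<$fh>) {
    last;
}
my $unread_after_foreach = eof($fh) ? 0 : 1;
close $fh;

print "while:   ", ( $unread_after_while   ? "data left unread" : "file exhausted" ), "\n";
print "foreach: ", ( $unread_after_foreach ? "data left unread" : "file exhausted" ), "\n";
```

With the while loop the handle still has data waiting; with foreach the whole file has already been consumed even though only one line was looked at.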

It is however very useful when we need random access to the whole file; for example when sorting it, or pulling out random quotes.
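As a rough illustration of those two cases, here is a sketch that slurps a short list and then sorts it and picks a random entry. The sample data is made up, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

my $quotes = "cherry\napple\nbanana\n";
open my $fh, '<', \$quotes or die "Cannot open in-memory file: $!";
chomp( my @lines = <$fh> );    # slurp: every line is now in memory
close $fh;

# Both operations need access to all lines at once.
my @sorted = sort @lines;
my $random = $lines[ rand @lines ];    # rand @lines yields 0 <= n < number of lines

print "Sorted: @sorted\n";
print "Random pick: $random\n";
```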

I'd love to hear why, if you think I'm mistaken in my understanding in this matter.


Re: Efficient file handling (was Re^3: trouble parsing log file...) (stranger than fiction)
by tye (Cardinal) on Nov 23, 2006 at 15:50 UTC

    Perl did some old tricks that reached a little bit too far inside the <stdio.h> macros to be completely portable but that allowed Perl line-at-a-time I/O to be about twice as fast as C line-at-a-time I/O... on sufficiently "standard enough" systems. That was back in the days of AT&T Unix, before Linux. Last time I checked (long enough ago that I hope things have improved but not long enough ago that I've heard that they have), Perl still did line-at-a-time I/O unnecessarily inefficiently when compiled on a system that isn't "standard enough" (which is nearly every system these days).

    This meant that Perl line-at-a-time I/O was 4 times slower than it really should be on Linux (for example). This actually made re-implementing line-at-a-time I/O in Perl code faster than using Perl's own line-at-a-time I/O implemented in C code (about twice as fast, which means that when Perl gets fixed, it would be about twice as slow, which would be expected).

    Yes, it makes little sense for Perl code to be faster than Perl's own C code. Unfortunately, that was certainly the case not too long ago.
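For readers wondering what "re-implementing line-at-a-time I/O in Perl code" might look like, here is one possible sketch: read fixed-size blocks with read() and split off lines by hand. This is not tye's code. The name make_line_reader and the 64 KB default are assumptions, only "\n" is handled (not arbitrary $/ values), and no claim is made about how it compares in speed on any particular Perl build.

```perl
use strict;
use warnings;

# Return an iterator that yields one line per call, reading the handle
# in fixed-size blocks. make_line_reader and the 64 KB default are
# illustrative choices, not taken from the thread.
sub make_line_reader {
    my ( $fh, $block_size ) = @_;
    $block_size ||= 64 * 1024;
    my $buffer = '';
    my $at_eof = 0;
    return sub {
        while (1) {
            # Hand back a complete line if the buffer holds one.
            my $pos = index( $buffer, "\n" );
            if ( $pos >= 0 ) {
                return substr( $buffer, 0, $pos + 1, '' );
            }
            if ($at_eof) {
                # Final line without a trailing newline, or nothing left.
                return undef unless length($buffer);
                return substr( $buffer, 0, length($buffer), '' );
            }
            # Append the next block after any partial line already buffered.
            my $got = read( $fh, $buffer, $block_size, length($buffer) );
            die "Read failed: $!" unless defined $got;
            $at_eof = 1 if $got == 0;
        }
    };
}

my $text = "alpha\nbeta\ngamma";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $next_line = make_line_reader( $fh, 4 );    # tiny block size to exercise buffering
my @lines;
while ( defined( my $line = $next_line->() ) ) {
    push @lines, $line;
}
close $fh;

print scalar(@lines), " lines read\n";
```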

    The command perl -V:d_stdstdio will tell you whether Perl thinks your platform is "standard enough".
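The same setting can also be read from inside a script through the core Config module, whose %Config hash holds the build-time values that perl -V:name prints. A small sketch (the variable names are just for illustration):

```perl
use strict;
use warnings;
use Config;

# %Config exposes the build-time configuration of the running perl.
my $stdstdio = $Config{d_stdstdio};
my $answer   = ( defined $stdstdio && $stdstdio eq 'define' ) ? 'yes' : 'no';

print "d_stdstdio: ", ( defined $stdstdio ? $stdstdio : '(not set)' ), "\n";
print "Perl considers stdio 'standard enough': $answer\n";
```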

    But, yes, the speed difference between line-at-a-time I/O and "slurping" is usually small enough not to matter (even considering Perl's quirk here). The memory consumption difference can be hugely significant, of course.

    - tye        

Re: Efficient file handling (was Re^3: trouble parsing log file...)
by perl_geoff (Acolyte) on Nov 24, 2006 at 18:01 UTC
    Hi, I tried to do this and couldn't get it to work correctly, can you show me what I'm doing wrong?
    use strict;
    use warnings;

    my $logfile="log.txt";
    my $error="DOWN";
    my $warn="PROBLEM";
    my $redbutton="\<img src\=\'default_files/perlredblink\.gif'>";
    my $greenbutton="\<img src\=\'default_files/perlgreenblink\.gif'>";
    my $yellowbutton="\<img src\=\'default_files/perlyellowblink\.gif'>";

    open LOG, $logfile or die "Cannot open $logfile for read :$!";
    my $button = $greenbutton;
    my @logfile=<LOG>;    # throw logfile into an array
    while (<LOG>) {
        if ($_ =~ /$error/i) {
            $button = $redbutton;
            print "<!--Content-type: text/html-->\n\n";
            print "$button";
            last;
        }
        elsif ($_ =~ /$warn/i) {
            $button = $yellowbutton;
            print "<!--Content-type: text/html-->\n\n";
            print "$button";
            last;
        }
        else {
            print "<!--Content-type: text/html-->\n\n";
            print "$button";
            last;
        }
    }
    close LOG;

      First and foremost, your line:

      my @logfile=<LOG>;    # throw logfile into an array

      does exactly what the comment says: it reads the whole logfile into an array (which you subsequently never use). That means that the test in the following line:

      while ( <LOG> ) {

      can never be true: the filehandle has already reached the end of the file by the time this line is reached. Just get rid of the first of these two lines.
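A short sketch of that behaviour, using an in-memory filehandle as a stand-in for the log file. The seek call is shown only as an alternative for cases where a second pass is really wanted; it is not part of the advice above.

```perl
use strict;
use warnings;

my $log = "first\nsecond\nthird\n";
open my $fh, '<', \$log or die "Cannot open in-memory file: $!";

my @all         = <$fh>;    # list context: reads through to end of file
my $after_slurp = <$fh>;    # scalar context: nothing left, so undef

seek $fh, 0, 0 or die "Cannot seek: $!";    # rewind to the start
my $after_seek = <$fh>;     # reads the first line again
close $fh;

print "Lines slurped: ", scalar(@all), "\n";
print "Read after slurp: ", ( defined $after_slurp ? $after_slurp : "undef\n" );
print "Read after seek:  $after_seek";
```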

      However, once this problem is fixed, it becomes clear that your program logic is flawed. The program will only ever read just one line from the logfile, because all the branches of the if/elsif/else structure end with last. So if the first line of the logfile contains, say, just "foo", it will print a green button and stop executing, even if the second (or third...) line is "SERVER DOWN".

      Lastly, as your regexes stand, $error will match, e.g., "Downing Street", which is probably not what you want.
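One common way to tighten such a pattern is to anchor it with \b word boundaries. The sketch below compares the two forms; the helper names matches_anywhere and matches_word are hypothetical, and the \Q...\E quoting is an extra precaution not taken from the thread.

```perl
use strict;
use warnings;

# True if $word occurs anywhere in $text, even inside a longer word.
sub matches_anywhere {
    my ( $text, $word ) = @_;
    return $text =~ /\Q$word\E/i ? 1 : 0;
}

# True only if $word occurs as a whole word (\b marks a word boundary).
sub matches_word {
    my ( $text, $word ) = @_;
    return $text =~ /\b\Q$word\E\b/i ? 1 : 0;
}

my $error = 'DOWN';
for my $text ( 'SERVER DOWN', 'Downing Street', 'Watership Down' ) {
    printf "%-16s anywhere: %d  whole word: %d\n",
        $text, matches_anywhere( $text, $error ), matches_word( $text, $error );
}
```

Because of the /i flag, a standalone "Down" as in "Watership Down" still counts as a whole-word match; drop /i if only upper-case DOWN should trigger.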

      I suggest that you use something like the following (simplified for the purposes of this posting to read from __DATA__ rather than from your log file, and to output a simple string):

      use strict;
      use warnings;

      my $error        = 'DOWN';
      my $warn         = 'PROBLEM';
      my $redbutton    = 'RED BUTTON';
      my $greenbutton  = 'GREEN BUTTON';
      my $yellowbutton = 'YELLOW BUTTON';

      my $button = $greenbutton;
      while ( <DATA> ) {
          if ( /\b$error\b/i ) {
              $button = $redbutton;
              last;
          }
          elsif ( /\b$warn\b/i ) {
              $button = $yellowbutton;
          }
      }
      print $button;

      __DATA__
      foo
      tony.blair@downingstreet.gov.uk
      Watership Down
      bar
        OK, in light of this discussion, I have a similar problem.
        I need help. I have a script with which I am trying to do the following:
        file1 = a list of unique ID numbers
        file2 = a list of unique HTML snippets, one per line, each containing one of the unique ID numbers (about 16 KB each, separated by carriage returns)
        Both are standard text files.
        ===========
        #!/usr/bin/perl
        #read each line in test1.txt into data_file array
        $data_file="test1.txt";
        open(DATA, $data_file) || die("Could not open file!");
        @data_file=<DATA>;

        #read each line in code.txt into a names_file array
        $names_file="code.txt"
        open(NAMES, $names_file) || die("Could not open file!");
        @names_data=<NAMES>;

        #create loop that reads each ID in code.txt (NAMES array), searches for each in array elements for #test1.txt (DATA array), redirects a new (NAMES).html for each element
        foreach ( $NAMES )
        {
        chomp($NAMES);
        ($NAMES=$DATA<0> > +("$NAMES<0>.html"));
        }

        close NAMES;
        close DATA;

        I am new to Perl. I wrote this by following examples of similar scripts, but it is absolutely riddled with errors.
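One way the stated goal could be approached, in keeping with the rest of this thread: keep the small ID list in memory and read the large HTML file one line at a time. This is only a sketch under several assumptions: the name split_by_id and the file names ids.txt and pages.txt are hypothetical, each HTML line is assumed to contain at most one ID, and IDs are matched as plain substrings (so an ID that is a prefix of another would need stricter matching).

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Write each line of $code_file that contains a known ID to "<ID>.html"
# in $out_dir. Returns the number of files written.
# split_by_id and its arguments are hypothetical names for this sketch.
sub split_by_id {
    my ( $id_file, $code_file, $out_dir ) = @_;

    # The ID list is small, so holding it in memory is reasonable.
    open my $ids_fh, '<', $id_file or die "Cannot open $id_file: $!";
    chomp( my @ids = <$ids_fh> );
    close $ids_fh;
    @ids = grep { length } @ids;    # skip blank lines

    # The HTML lines are large, so read them one at a time.
    my $written = 0;
    open my $code_fh, '<', $code_file or die "Cannot open $code_file: $!";
    while ( my $line = <$code_fh> ) {
        for my $id (@ids) {
            next unless index( $line, $id ) >= 0;    # plain substring match
            my $out = File::Spec->catfile( $out_dir, "$id.html" );
            open my $out_fh, '>', $out or die "Cannot write $out: $!";
            print {$out_fh} $line;
            close $out_fh or die "Cannot close $out: $!";
            $written++;
            last;    # assumption: one ID per HTML line
        }
    }
    close $code_fh;
    return $written;
}

# Demonstration with made-up data in a temporary directory.
my $dir       = tempdir( CLEANUP => 1 );
my $id_file   = File::Spec->catfile( $dir, 'ids.txt' );
my $code_file = File::Spec->catfile( $dir, 'pages.txt' );

open my $fh, '>', $id_file or die "Cannot write $id_file: $!";
print {$fh} "A100\nB200\n";
close $fh;

open $fh, '>', $code_file or die "Cannot write $code_file: $!";
print {$fh} qq{<p id="B200">two</p>\n<p id="A100">one</p>\n};
close $fh;

my $count = split_by_id( $id_file, $code_file, $dir );
print "Wrote $count files\n";
```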
