Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

More efficient munging if infile very large

by ybiC (Prior)
on Jul 26, 2001 at 15:14 UTC ( #99943=perlquestion: print w/ replies, xml ) Need Help??
ybiC has asked for the wisdom of the Perl Monks concerning the following question:

The code at bottom works fine for converting all alpha characters in a text file to upper (or lower) case.   But having the two "$munged = " lines inside the "for(@in)" loop looks like ineficient if handling large files.

I tried these two tweaks to move that logic outside the for(@in) loop.   The first doesn't work because (correct me if I'm wrong) $_ is empty when $munged is declared.   The second fails with this compilation error: syntax error at uclc.pl line (push @munged, $munged();), near "$munged(" and I'm not clear just what's happening there.

# tweak one my @munged; my $munged = lc() if ($uclc eq 'lc'); $munged = uc() if ($uclc eq 'uc'); for(@in) { push @munged, $munged; } # tweak two my @munged; my $munged = lc if ($uclc eq 'lc'); $munged = uc if ($uclc eq 'uc'); for(@in) { push @munged, $munged(); }
So my questions for the wise bretheren and sisteren are:
  1. should I be concerned about potential ineficiency of this code for large input files?
  2. if so, how might it be improved?
    cheers,
    Don
    striving toward Perl Adept
    (it's pronounced "why-bick")
#!/usr/bin/perl -w use strict; my $uclc = shift or Usage(); my $infile = shift or Usage(); my $outfile = shift or Usage(); Usage() unless ($uclc eq 'lc' or 'uc'); open (IN, "< $infile") or die "Error opening $infile for read: $!"; my @in = <IN>; close IN or die "Error closing $infile after read: $!"; my @munged; for(@in) { my $munged = lc() if ($uclc eq 'lc'); $munged = uc() if ($uclc eq 'uc'); push @munged, $munged; } open (OUT, "> $outfile") or die "Error opening $outfile for write: $!" +; print OUT for(@munged); close OUT or die "Error closing $outfile after write: $ +!"; ###################################################################### +#### sub Usage { die "\n Usage: uclc.pl (lc|uc) infile outfile\n"; } ###################################################################### +####

Comment on More efficient munging if infile very large
Select or Download Code
Re: More efficient munging if infile very large
by Masem (Monsignor) on Jul 26, 2001 at 15:32 UTC
    Why not:
    @in = map { $uclc eq 'lc' ? lc : uc } @in;
    And avoid the extra variables althogether, and get $_ set for you all at once? Since you've already got $uclc checked for the right state, you don't need the second wasteful if.

    Or, even better (1 if, instead of N...)

    @in = $uclc eq 'lc' ? map { lc } @in : map { uc } @in;

    -----------------------------------------------------
    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain

Re: More efficient munging if infile very large
by MZSanford (Curate) on Jul 26, 2001 at 15:34 UTC
    I am unclear about the $munged() bit of code in your description. &$munged() would work, were $munged a code reference. As for the effeciency of your approach, i would run it all through Benchmark, for any other opinions would be but opinions. It maybe quicker to do something like
    if ($uclc eq 'lc') { ... for loop with lc } elsif ($uclc eq 'uc') { ... for loop with uc } else { die "invalid arg"; }
    it may be slightly fatser, but check benchmark
    Thus spake the Master Programmer:
    "When you have learned to snatch the error code from the trap frame, it will be time for you to leave."
    -- The Tao of Programming
Re: More efficient munging if infile very large
by synapse0 (Pilgrim) on Jul 26, 2001 at 15:40 UTC
    Well, I think you're way overdoing it here.. if you're worried about efficiency and resources, you probably shouldn't be reading in the array at all.. there's no need. And, since you've already determined that $uclc is going to be either one of 'uc' or 'lc', you don't need the extra check in there..
    Here's how I woulda done it..
    #!/usr/bin/perl -w use strict; my $uclc = shift or Usage(); my $infile = shift or Usage(); my $outfile = shift or Usage(); Usage() unless ($uclc eq 'lc' || $uclc eq 'uc'); open (IN, "< $infile") or die "Error opening $infile for read: $!"; open (OUT, "> $outfile") or die "Error opening $outfile for write: $!" +; foreach (<IN>) { # just print directly on reading, easier on the mem.. # quicker too, and we know that $uclc is going to # be either lc or uc at this point. print OUT $uclc eq 'lc' ? lc() : uc(); } close IN or die "Error closing $infile after read: $!"; close OUT or die "Error closing $outfile after write: $ +!"; ###################################################################### +#### sub Usage { die "\n Usage: uclc.pl (lc|uc) infile outfile\n"; } ###################################################################### +####
    -Syn0

    Update: didn't look too closely at opening code before, but fixed same bug after Hofmator mentioned it. Also, his elimination of the string equality check in the loop was a good call. ++

      I would implement the algorithm in nearly the same way, but I think I would implement it as a filter-like command which you can easily call in pipe sequences. So read-in with the magic <> (this allows for STDIN or infiles) and print to STDOUT (which can be easily redirected into outfile from the command line). This has the additional advantage of shortening the program a bit:

      #!/usr/bin/perl -w use strict; my $uclc = shift or Usage(); # fixed small bug Usage() unless ($uclc eq 'lc' or $uclc eq 'uc'); # to speed up the loop (no string compare on every line) my $lc = $uclc eq 'lc'; while (<>) { print $lc ? lc : uc; } sub Usage { die << EO_USE; Usage: uclc.pl (lc|uc) [infile] reads from STDIN if no infile given EO_USE }

      Update: Oops, thanks tilly, fixed: while instead of foreach.

      -- Hofmator

        Change the foreach to a while.

        The difference is that foreach puts the file read into list context so the entire file has to be held in memory. The while prints as it reads and so will work much better on large files.

        Otherwise I like your changes.

      Thanks, synapse0 (and others who replied also) for the suggestions.   8^)

      I'm not familiar with the eq 'lc' ? lc() : uc(); syntax from print OUT $uclc eq 'lc' ? lc() : uc();.   Would you be so kind as to explain how that works?

      Updated: 2001-07-29 07:40 CDT

      Here's what I ended up using.   Thanks again, and ++ to all who offered suggestions in nodes and CB!
          cheers,
          Don
          striving toward Perl Adept
          (it's pronounced "why-bick")

      #!/usr/bin/perl -w use strict; my $uclc = shift or Usage(); Usage() unless ($uclc eq 'uc' or 'lc'); while(<>) { print $uclc eq 'uc' ? uc() : lc(); } sub Usage { print "\n Usage: uclc.pl (uc|lc) < infile > outfile\n\n"; exit; } =head1 NAME uclc.pl =head1 SYNOPSIS uclc.pl (uc|lc) < infile > outfile" Convert alpha characters in text file to uppercase (or lowercase). =head1 UPDATED 2001-07-26 16:30 CDT Simplify code for (in|out)put of STD(IN|OUT). Remove unnecessary "or Usage()" from first two input lines. Replace if/else with ternary "?:". my $munged; if ($uclc eq 'uc') { $munged = uc(); } else {$munged = lc(); } print OUT $munged; While instead of for so file read line-by-line not slurp entire fil +e. Post efficiency SoPW to PerlMonks 2001-07-25 Initial working code. =head1 TODOS None that I know of. =head1 TESTED ActivePerl 5.61 on Win2kPro =head1 AUTHOR ybiC =head1 CREDITS Thanks to synapse0, OeufMayo, crazyinsomniac, Masem, MZSanford, virtualsue, Hoffmater, tilly, clemburg, and ichimunki for suggestions. And to davorg for "Data Munging with Perl". Oh yeah, and to some guy named vroom. =cut
        see perlman:perlop and look for the section on Conditional Operator. It's basically an in-place if-then type of construct that evaluates the EXPR $uclc eq 'lc' and returns lc() if the EXPR is true or uc() if the EXPR is false.

        from perlop:

        Conditional Operator Ternary "?:" is the conditional operator, just as in C. It works mu +ch like an if-then-else. If the argument before the ? is true, the arg +ument before the : is returned, otherwise the argument after the : is returned. For example: printf "I have %d dog%s.\n", $n, ($n == 1) ? '' : "s";
        for the gory details read on there ...

        -- Hofmator

Re: More efficient munging if infile very large
by OeufMayo (Curate) on Jul 26, 2001 at 15:54 UTC

    Okay, here's my attempt at this one, it should be slightly more efficient than using arrays, but you really should benchmark that...:

    #!/usr/bin/perl -w use strict; unless ($ARGV[0] eq '-lc' || $ARGV[0] eq '-uc' ){ die "usage: lcuc [-uc | -lc] file\n"; } my $op = substr(shift @ARGV, 1, 2); while(<>){ no strict 'refs'; print &{$op}($_) } sub lc{ lc shift } sub uc{ uc shift }

    TMTOWTDI Update (as suggested by tilly)

    I'll admit that this one is cleaner than the above code. :)

    #!/usr/bin/perl -w use strict; my %ops = ( 'lc' => sub {lc shift }, 'uc' => sub {uc shift }, ); my $case = shift @ARGV; unless (exists $ops{$case}){ die "usage: lcuc [", join(' | ', keys %ops), "] file\n"; } while(<>){ print $ops{$case}->($_); }
    <kbd>--
    my $OeufMayo = new PerlMonger::Paris({http => 'paris.mongueurs.net'});</kbd>
Re: More efficient munging if infile very large
by clemburg (Curate) on Jul 26, 2001 at 16:27 UTC

    I think your efficiency concern is about the wrong thing. Look at the results of this (call like your program but with one more arg that indicates iterations to perform):

    #!/usr/bin/perl -w use strict; use Benchmark; my $uclc = shift or Usage(); my $infile = shift or Usage(); my $outfile = shift or Usage(); my $iterations = shift || 10; Usage() unless ($uclc eq 'lc' or 'uc'); timethese($iterations, { original => sub { setup(); original(); teardown(); }, Masem_1 => sub { setup(); Masem_1(); teardown(); }, Masem_2 => sub { setup(); Masem_2(); teardown(); }, clemburg => sub { setup(); clemburg(); teardown(); }, }); sub original { my @in = <IN>; my @munged; for(@in) { my $munged = lc() if ($uclc eq 'lc'); $munged = uc() if ($uclc eq 'uc'); push @munged, $munged; } print OUT for(@munged); } sub Masem_1 { my @in = <IN>; @in = map { $uclc eq 'lc' ? lc : uc } @in; print OUT for(@in); } sub Masem_2 { my @in = <IN>; @in = $uclc eq 'lc' ? map { lc } @in : map { uc } @in; print OUT for(@in); } sub clemburg { while (<IN>) { my $munged = lc() if ($uclc eq 'lc'); $munged = uc() if ($uclc eq 'uc'); print OUT $munged; } } sub setup { open (IN, "< $infile") or die "Error opening $infile for read: $!"; open (OUT, "> $outfile") or die "Error opening $outfile for write: $!"; } sub teardown { close IN or die "Error closing $infile after write: $!"; close OUT or die "Error closing $outfile after write: $!"; } ###################################################### sub Usage { die "\n Usage: uclc.pl (lc|uc) infile outfile\n"; } ######################################################

    The results for 10 iterations on my machine are like this, using a 1MB text file with mixed case and some markup characters:

    > perl benchmark.pl lc terms.por terms.try 10 Benchmark: timing 10 iterations of Masem_1, Masem_2, clemburg, origina +l... Masem_1: 14 wallclock secs (11.01 usr + 1.34 sys = 12.36 CPU) Masem_2: 16 wallclock secs ( 9.88 usr + 0.84 sys = 10.72 CPU) clemburg: 10 wallclock secs ( 5.11 usr + 0.75 sys = 5.86 CPU) original: 32 wallclock secs (10.52 usr + 0.98 sys = 11.50 CPU)

    Obviously, the approach you and Masem use is very resource-intensive, reading the whole file into memory. That could kill your process or cause worse things if you work with really large files.

    Christian Lemburg
    Brainbench MVP for Perl
    http://www.brainbench.com

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://99943]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (12)
As of 2014-10-22 07:17 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    For retirement, I am banking on:










    Results (114 votes), past polls