
Assigning data to an array is slow..

by smferris (Beadle)
on Mar 01, 2001 at 02:52 UTC ( [id://61453] : perlquestion )

smferris has asked for the wisdom of the Perl Monks concerning the following question:

Me again. (Can you tell I've only just really found this site? I knew it existed but never really browsed it.)

I'm parsing a fixed width flat file for use in loading to different destinations. Possibly back to a file, possibly into a database.

I figured unpack would be faster and cleaner, and it is, as long as you don't assign the output of unpack to an array. E.g.:

open(FH,"large.file") or die $!;
while ($row = <FH>) {
    unpack("a9 a40 a15 a15 a15 a2 a9 a9 a9", $row);
}

Runs in about 20 seconds on 2.1 million rows. Modify the code as such (with the array assignment added):

open(FH,"large.file") or die $!;
while ($row = <FH>) {
    @data = unpack("a9 a40 a15 a15 a15 a2 a9 a9 a9", $row);
}

Now the code runs in over a minute. I have to "transform" the different elements in @data. Is there any way to make this faster? I'm assuming the memory structure for @data is being reallocated on every iteration. Is that true?

As always, all help is greatly appreciated.

Shawn M Ferris
Oracle DBA

Replies are listed 'Best First'.
Re: Assigning data to an array is slow..
by chromatic (Archbishop) on Mar 01, 2001 at 03:06 UTC
    I don't understand the question.

    The first snippet doesn't do anything useful. It unpacks things, then throws away the results. You might as well not open the file at all. A program like that will run in approximately zero seconds. Much faster!

    As for the second snippet, yes, there's memory allocation on each iteration. That's because unpack creates new values, and you assign them to @data, for each row. You can't get around that if you want to do something with the data. And, depending on your data structure, unpack is probably the fastest way to get at it.

    Assuming your code handles exactly 2.1 million rows and the code runs in 70 seconds, that's 30,000 iterations per second. That's pretty fast.

    This falls in the category of "things you can't optimize away without breaking the program" -- you're not using regular expressions to get at the data, which would slow you down, and you're not using split, which is probably slower than unpack in this case.

    You're probably as fast as you can get without removing anything useful.

      I understand that not assigning the data back to an array isn't useful. But it is still parsing the row, correct? My point was that to parse 2.1 million rows is fast. But storing it slows it considerably.

      I think what's taking the time is the deletion and re-creation of the memory structure on each iteration of the loop. Unnecessary, to my mind, as the successive iterations (in this case) are always going to be of identical size.

      Given the above, I was hoping for..

      a) That the memory used by the unpack itself could be reused, rather than having to copy it to a Perl structure.
      b) That I could predefine the size of @data and not have it destroyed on each iteration.

      Of course.. maybe I'm not a seasoned programmer and this entire thread is just a waste of everyone's time, in which case I apologize. 8)

      I just think that if unpack has to put the results into its own array (it has to; otherwise how does it know what to send back?), then assigning them to a Perl data type shouldn't take at least six times as long. Of course, I really don't know what Perl actually does behind the scenes to store data in memory.

      Shawn M Ferris
      Oracle DBA

        The key to understanding Perl as it works is 'context'. Every operation takes place in some sort of context -- that means, its results will be evaluated as a single item or as a list of items. (There is also 'void' context and boolean context, which are more or less official but not germane to the discussion.)

        For example, evaluating a list in scalar context produces the number of elements in the list. In list context, it produces the elements of a list:

        my @list = (1, 2, 3); # list context, @list -> (1, 2, 3) my $num = @list; # scalar context, $num -> 3 my @second_list = @list; # list context, @second_list -> (1, 2, 3)
        Perl the interpreter is smart enough not to do more work than it has to (in most cases), so it usually determines the context of an operation before performing the operation to weasel out of extra work or to produce the right results for the context. You can do the same if you use wantarray().
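        A minimal sketch of that context detection (my own example, not from the thread; the sub name is invented):

```perl
use strict;
use warnings;

# wantarray() inside a sub reports the caller's context:
# true in list context, false-but-defined in scalar context,
# undef in void context.
sub describe_context {
    return wantarray ? 'list' : defined wantarray ? 'scalar' : 'void';
}

my @r = describe_context();    # list context
my $s = describe_context();    # scalar context
print "@r $s\n";               # prints "list scalar"
```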

        This is important because unpack performs differently in scalar and in list context. Its perldoc page says that in scalar context, it returns just the first value. In list context, it returns all values.
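        You can see the difference directly (a minimal sketch with a made-up fixed-width string):

```perl
use strict;
use warnings;

my $row = "AAABBBCCC";

# List context: unpack returns every field in the template.
my @fields = unpack("a3 a3 a3", $row);   # ("AAA", "BBB", "CCC")

# Scalar context: unpack returns only the first field.
my $first = unpack("a3 a3 a3", $row);    # "AAA"

print scalar(@fields), " $first\n";      # prints "3 AAA"
```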

        In your first code snippet, it's evaluated in scalar context (more properly void, but we'll keep this simple). Perl can tell that you don't care about the return values, so it only has to unpack the first bit of data. It ignores the rest. (Since it's in void context, it may *completely* ignore the *entire* string, but I haven't looked at the source.)

        This means the first snippet isn't doing as much work as the second, even in the unpack statement itself. Put aside the array assignment for the moment -- besides that, the two snippets aren't doing an equal amount of work!

        To find out how much work the unpack would do in list context, put it in list context:

        while ($row = <FH>) {
            () = unpack("a9 a40 a15 a15 a15 a2 a9 a9 a9", $row);
        }
        This will be a more meaningful benchmark.
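        A self-contained version of that comparison (my sketch: it swaps the file for in-memory rows so it runs anywhere; the row width of 123 just matches the template):

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# 10,000 fixed-width rows in memory; 123 chars matches the template
# widths (9 + 40 + 15*3 + 2 + 9*3 = 123).
my @rows = ("X" x 123) x 10_000;
my @data;

# Negative count: run each sub for at least that many CPU seconds.
timethese(-1, {
    'list assign'  => sub {
        @data = unpack("a9 a40 a15 a15 a15 a2 a9 a9 a9", $_) for @rows;
    },
    'thrown away'  => sub {
        () = unpack("a9 a40 a15 a15 a15 a2 a9 a9 a9", $_) for @rows;
    },
});
```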

        Besides all that, Perl handles memory internally via a reference-like mechanism. None of this tedious copying-the-contents-of-one-location-to-another jive you get in C. So the overhead is creating an array structure and populating it with the things unpack returns anyway. It's a whole lot smarter about these things than C.

        In short, don't worry about memory management in Perl for now.

Re (tilly) 1: Assigning data to an array is slow..
by tilly (Archbishop) on Mar 01, 2001 at 03:23 UTC
    The reason for the slow-down is that the first time you are calling unpack in scalar (well really void) context so it is only extracting one field, while the second time you are extracting all of the fields. So Perl is doing a lot more work.

    Now, I have seen a couple of signs that unpack might not be as fast as it could be. But I would need to look at it closely to figure out why. (Or even whether that is true.)

    In any case I wouldn't worry about it. This isn't running interactively, is it? If not then wait until you are done and see if it is fast enough...

(boo) Re: Assigning data to an array is slow..
by boo_radley (Parson) on Mar 01, 2001 at 03:12 UTC
    Runs in about 20 seconds on 2.1 million rows
    Now the code runs in over a minute
    You're annoyed at this speed on 2.1 million rows? Really? Are you sure?
    As for serious advice: maybe you could store your data in a database and access it through DBI? Then you could run your transforms through an insert or update statement.
Re: Assigning data to an array is slow..
by Albannach (Monsignor) on Mar 01, 2001 at 03:26 UTC
    I'm not sure I'd call the run times you're getting "slow" either, but you might want to read the thread Memory efficiency with arrays, in which the suggestion to pre-extend the array is mentioned, though it won't make a huge difference. As I recall, this did help quite a bit back in the Perl 4 days, but that was long ago.
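    For reference, pre-extending just means assigning to $#array so Perl allocates the slots up front (a minimal sketch, my own example):

```perl
use strict;
use warnings;

my @data;
$#data = 8;                  # pre-extend: @data now has 9 slots (0..8), all undef
print scalar(@data), "\n";   # prints "9"

# A later list assignment empties and refills the array, but Perl
# keeps the underlying storage around anyway, which is why this
# buys little in Perl 5.
@data = unpack("a3 a3 a3", "foobarbaz");
print scalar(@data), "\n";   # prints "3"
```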

    I'd like to be able to assign to an luser

Re: Assigning data to an array is slow..
by jeroenes (Priest) on Mar 01, 2001 at 12:30 UTC
    You should also make sure that the data actually fits in memory, or at least in physical memory. perl -e '@a=(1)x2E6;sleep();' takes 48 MB on perl5.6, and I recall it was nearly twice as much in perl5.0. And that's without any real data.

    So the actual allocation of the data takes time as well.

    "We are not alone"(FZ)

Re: Assigning data to an array is slow..
by rbi (Monk) on Mar 01, 2001 at 16:45 UTC
    Some days ago I looked into the time taken by unpack compared to other ways of extracting fields from records.
    I think that using substr can speed things up.
    After your posting I took the occasion to learn how to use Benchmark, and tested the code below on a 400,000-record file:
    #!/usr/bin/perl -w
    use Benchmark;

    my $count = 1;    # one pass over the whole file per method
    timethese( $count, {
        'Method One'   => '&One',
        'Method Two'   => '&Two',
        'Method Three' => '&Three',
    } );

    sub One {
        open(FILE, $ARGV[0]);
        while ($row = <FILE>) {
            @data = unpack('a4a2a2a2a2', $row);
        }
        close(FILE);
    }

    sub Two {
        open(FILE, $ARGV[0]);
        while ($row = <FILE>) {
            ($data[0],$data[1],$data[2],$data[3],$data[4])
                = unpack('a4a2a2a2a2', $row);
        }
        close(FILE);
    }

    sub Three {
        open(FILE, $ARGV[0]);
        while ($row = <FILE>) {
            $data[0] = substr($row,0,4);
            $data[1] = substr($row,4,2);
            $data[2] = substr($row,6,2);
            $data[3] = substr($row,8,2);
            $data[4] = substr($row,10,2);
        }
        close(FILE);
    }
    and I got this.
    Method One:   43 wallclock secs (40.72 usr + 1.23 sys = 41.95 CPU) @ 0.02/s (n=1)
                  (warning: too few iterations for a reliable count)
    Method Two:   43 wallclock secs (41.50 usr + 1.42 sys = 42.92 CPU) @ 0.02/s (n=1)
                  (warning: too few iterations for a reliable count)
    Method Three: 36 wallclock secs (33.76 usr + 1.42 sys = 35.18 CPU) @ 0.03/s (n=1)
                  (warning: too few iterations for a reliable count)
    So I think the sub Three approach (substr) proves to be faster than sub One (unpack into an array) or sub Two (unpack into array elements).
    Hope this may help.


      Doesn't getting the warning "too few iterations for a reliable count" in the output bother you at all?


      "Perl makes the fun jobs fun
      and the boring jobs bearable" - me

        Hi davorg,
        sure, that warning isn't very nice... :) However, I also saw a similar difference (about 15%) by running the routines separately and checking with ps (process status).
        As I said, this was my first time using Benchmark (there's always a first time...) for this kind of test. I don't believe the warning is a problem of file size, though. For my own learning, I'd appreciate it if someone could change the code into something that can be benchmarked reliably (assuming it isn't a question of input file size).