Efficient array element deletion

by kennethk (Abbot)
on Dec 04, 2008 at 22:52 UTC ( id://728128 )

kennethk has asked for the wisdom of the Perl Monks concerning the following question:

After reading Shift, Pop, Unshift and Push with Impunity!, a question occurred to me (purely academic). If I have a long array and my goal is to perform some test on each element and remove those elements that fail, what are the best ways to do it from CPU and memory standpoints? So one choice would be

@array = grep(!/^\#/, @array)

where presumably the grep operation has been heavily optimized for CPU time. However, this should create a temporary result array, which in turn could double my memory footprint. On the other extreme of the spectrum, I could say

for ( reverse 0 .. $#array) { splice (@array,$_,1) if ($array[$_] =~ /^\#/) }

but unless just about every entry is excised, that imposes a large performance penalty toward the end of the operation. So the above-noted node inspired the following solution:

for (0 .. $#array) { push @array, $value if ($value = shift @array) !~ /^\#/ }

My question is: would this ultimately have worse memory penalties than grep? It seems it must ultimately allocate enough memory at one end to accommodate the entire set, and then the interpreter/system cannot recover this memory since the variable is still allocated. Also, is there an even more clever way to do this I'm missing?
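For what it's worth, the three candidates can be raced directly with the core Benchmark module. This is only a sketch (array size, keep-ratio, and iteration count are arbitrary choices of mine), and it first checks that all three approaches agree before any timing is taken seriously:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Arbitrary test data: ~30% of entries start with '#' and should be dropped.
my @src = map { rand() < 0.3 ? "#$_" : "e$_" } 1 .. 2000;

sub by_grep {
    my @a = @_;
    @a = grep !/^\#/, @a;
    return "@a";
}

sub by_splice {
    my @a = @_;
    for ( reverse 0 .. $#a ) {
        splice @a, $_, 1 if $a[$_] =~ /^\#/;
    }
    return "@a";
}

sub by_shift_push {
    my @a = @_;
    for ( 1 .. scalar @a ) {    # range is fixed before @a starts shrinking
        my $v = shift @a;
        push @a, $v if $v !~ /^\#/;
    }
    return "@a";
}

# Sanity check: timings mean nothing if the results differ.
die "implementations disagree"
    unless by_grep(@src) eq by_splice(@src)
       and by_grep(@src) eq by_shift_push(@src);

cmpthese( 50, {
    grep       => sub { by_grep(@src) },
    splice     => sub { by_splice(@src) },
    shift_push => sub { by_shift_push(@src) },
} );
```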

Update: Following the notes by johngg and jwkrahn, I've corrected an error in my splice loop and learned something new about negative indices. Just because you tested code doesn't mean it did what you thought...

Replies are listed 'Best First'.
Re: Efficient array element deletion
by ikegami (Patriarch) on Dec 04, 2008 at 23:10 UTC

    Every time push is forced to allocate more memory, it needs to copy the entire array. This can be avoided by preallocating enough memory.

    my $count = @array;
    $#array = $count * 2 - 1;   # grow: force one big allocation up front
    $#array = $count - 1;       # shrink back; the capacity stays reserved,
                                # but no real undef elements are left behind
    for ( 1 .. $count ) {
        push @array, $value if ( $value = shift @array ) !~ /^\#/;
    }

    In terms of scalability,

    • The grep solution you provided uses O(N) time and O(N) memory.
    • The splice solution you provided uses O(N²) time and O(1) memory.
    • The shift-push solution you provided uses O(N²) time and O(N) memory.
    • The shift-push solution I provided uses O(N) time and O(N) memory.
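    As a concrete check of the preallocated variant, here is a self-contained sketch (sizes are arbitrary). Note that it grows $#array and then shrinks it back: leaving the array grown would insert real undef elements ahead of the pushed values, while growing and immediately shrinking reserves the memory without changing the contents.

```perl
use strict;
use warnings;

my @array = map { rand() < 0.5 ? "#$_" : "e$_" } 1 .. 1000;
my @want  = grep !/^\#/, @array;    # reference result

my $count = @array;
$#array = $count * 2 - 1;   # grow: one big allocation up front
$#array = $count - 1;       # shrink back; perl keeps the capacity reserved

for ( 1 .. $count ) {
    my $v = shift @array;
    push @array, $v if $v !~ /^\#/;
}

print "@array" eq "@want" ? "match\n" : "MISMATCH\n";   # prints "match"
```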

      From Shift, Pop, Unshift and Push with Impunity!:

      One consequence of perl's list implementation is that queues implemented using perl lists end up "creeping forward" through the preallocated array space leading to reallocations even though the queue itself may never contain many elements. In comparison, a stack implemented with a perl list will only require reallocations as the list grows larger. However, perl is smartly coded because the use of lists as queues was anticipated. Consequently, these queue-type reallocations have a negligible impact on performance. In benchmarked tests, queue access of a list (using repeated push/shift operations) is nearly as fast as stack access to a list (using repeated push/pop operations).

      I read this to mean that while a naive implementation would have yielded O(N²), perl is smart enough that the exponent drops (closer) to O(N). Is this incorrect?

      Also, it seems like O(N²) on splice is a worst case, where best case (either all or no deletions) would be O(N), leading me to think it'd be closer to O(N log N) in practice.

      The crux of my question though was supposed to be about the constant in front of the memory term, particularly as all scale equivalently in memory.

        Also, it seems like O(N²) on splice is a worst case, where best case (either all or no deletions) would be O(N), leading me to think it'd be closer to O(N log N) in practice.

        I tried all N=16 inputs:

        0 elements were shifted 1 times
        16 elements were shifted 16 times
        31 elements were shifted 120 times
        45 elements were shifted 560 times
        58 elements were shifted 1820 times
        70 elements were shifted 4368 times
        81 elements were shifted 8008 times
        91 elements were shifted 11440 times
        100 elements were shifted 12870 times
        108 elements were shifted 11440 times
        115 elements were shifted 8008 times
        121 elements were shifted 4368 times
        126 elements were shifted 1820 times
        130 elements were shifted 560 times
        133 elements were shifted 120 times
        135 elements were shifted 16 times
        136 elements were shifted 1 times
        98 elements were shifted on average

        The average result is 98, which is about twice O(N log N). So,
        Average case
        = O({loop body cost}*N + {element shift cost}*N log N)
        = O(N + N log N)
        = O(N log N)

        The thing is, the worst case is also in the same order, so
        Worst case
        = O(N log N)

        I accept your better average case, and I propose a better worst case than we both thought.

        I read this to mean that while naive implementation would have yielded O(N²), perl is smart enough that the exponent drops (closer) to O(N). Is this incorrect?

        A naïve implementation of push would take O(N) for every element pushed. Currently, it takes O(1) for most pushes, and O(N) on occasion.

        @a = qw( a b c );

        +---+---+---+---+
        | a | b | c | / |    / = allocated, but unused.
        +---+---+---+---+

        push @a, 'd';

        +---+---+---+---+
        | a | b | c | d |
        +---+---+---+---+

        push @a, 'e';

        +---+---+---+---+---+---+---+---+---+---+---+---+
        | a | b | c | d | e | / | / | / | / | / | / | / |
        +---+---+---+---+---+---+---+---+---+---+---+---+

        It only preallocates so much. As soon as the preallocated memory is used up, a new memory block is allocated and the whole array must be copied. The shift-push solution is therefore O(N * N*{chance of reallocation needed}), which probably resembles worst/average case O(N log N).

        So that makes the scalability as follows:

        • The grep solution you provided uses O(N) time and O(N) memory.
        • The splice solution you provided uses O(N log N) time and O(1) memory.
        • The shift-push solution you provided uses O(N log N) time and O(N) memory.
        • The shift-push solution I provided uses O(N) time and O(N) memory.

        The crux of my question though was supposed to be about the constant in front of the memory term, particularly as all scale equivalently in memory.

        I thought you were more interested in speed, sorry.

        • splice is done in-place. (Assuming you get rid of the reverse!!)
        • grep probably uses N SV* extra memory. It could possibly be done in place.
        • My shift-push uses N SV* extra memory.
        • Your shift-push uses between N and 5*N SV* (peak), and between N and 3*N SV* (final) extra memory.

        Pushing slightly more than doubles the allocated memory when a reallocation is forced. If N' is the number of elements kept, the 3*N figure is 2*(N+N') when N'=N, minus the initial memory of N. The peak occurs while the pointers are being copied from the old memory block to the new one.

Re: Efficient array element deletion
by johngg (Canon) on Dec 04, 2008 at 23:13 UTC
    for (-$#array .. 0) { ...

    I don't think that's going to do quite what you intended.

    J:\>perl -le "@arr = ( 1 .. 10 ); print $arr[ $_ ] for - $#arr .. 0;"
    2
    3
    4
    5
    6
    7
    8
    9
    10
    1

    J:\>

    Perhaps this instead.

    J:\>perl -le "@arr = ( 1 .. 10 ); print $arr[ $_ ] for reverse 0 .. $#arr;"
    10
    9
    8
    7
    6
    5
    4
    3
    2
    1

    J:\>

    Cheers,

    JohnGG

Re: Efficient array element deletion
by jwkrahn (Abbot) on Dec 04, 2008 at 23:17 UTC
    On the other extreme of the spectrum, I could say
    for (-$#array .. 0) { splice (@array,$_,1) if ($array[$_] =~ /^\#/) }

    That doesn't do what you seem to think it does.   It starts with the second element of the array, iterates up to the last element of the array and finally ends with the first element of the array.   If you splice elements out of the array then the elements remaining will be moved and won't be removed by subsequent splices.
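    The skipping effect is easy to reproduce in a few lines (my sketch, using the thread's /^\#/ test): with two matching elements in a row, the second slides into the just-vacated slot and is never retested:

```perl
use strict;
use warnings;

my @array = qw( #a #b c );
for my $i ( 0 .. 2 ) {              # forward over the ORIGINAL indices
    next if $i > $#array;           # the array may have shrunk under us
    splice @array, $i, 1 if $array[$i] =~ /^\#/;
}
print "@array\n";   # prints "#b c" -- '#b' escaped the purge
```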

    What you need to do is start at the last element of the array and iterate towards the first element:

    for ( reverse 0 .. $#array ) {
        splice @array, $_, 1 if $array[ $_ ] =~ /^\#/;
    }

      for ( reverse 0 .. $#array ) flattens the list.
      for ( -@array .. -1 ) is better.
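      A quick check of the negative-index variant (my sketch): after a splice at a negative index, every not-yet-tested element keeps its offset from the end of the array, so a single forward pass suffices:

```perl
use strict;
use warnings;

my @array = qw( a #b #c d e #f );
for ( -@array .. -1 ) {             # range is fixed before the array shrinks
    splice @array, $_, 1 if $array[$_] =~ /^\#/;
}
print "@array\n";   # prints "a d e"
```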

        for ( reverse 0 .. $#array ) flattens the list. for ( -@array .. -1 ) is better.

        Except I don't think that will work either.

        J:\> perl -le "@arr = ( 1 .. 10 ); print $arr[ $_ ] for -@arr .. -1;"
        1
        2
        3
        4
        5
        6
        7
        8
        9
        10

        J:\>

        From the documentation (Range Operators, my emphasis): In list context, it returns a list of values counting (up by ones) from the left value to the right value, so I don't think it can be persuaded to decrement. So doing

        J:\> perl -le "@arr = ( 1 .. 10 ); print $arr[ $_ ] for -1 .. -@arr;"

        J:\>

        results in nothing useful.

        Cheers,

        JohnGG

Re: Efficient array element deletion
by ikegami (Patriarch) on Dec 04, 2008 at 22:55 UTC
    If your array can't contain undefined values to begin with, the simplest approach would be to undefine the values rather than deleting them. Then, just ignore the undefined values later on.
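    That could look like the following sketch (the loop variable in a foreach is an alias, so undef-ing it blanks the slot in the array itself; the /^\#/ test is the one from the question):

```perl
use strict;
use warnings;

my @array = qw( a #b c #d e );

# Mark failures as undef instead of removing them.
for my $elem (@array) {
    undef $elem if defined $elem && $elem =~ /^\#/;
}

# Later passes simply skip the holes.
my @seen;
for my $elem (@array) {
    push @seen, $elem if defined $elem;
}
print "@seen\n";   # prints "a c e"
```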
Re: Efficient array element deletion
by fert (Acolyte) on Dec 04, 2008 at 23:12 UTC
    If you aren't afraid of a few counters you could do something like this:
    my $replace = 0;
    for ( my $x = 0; $x < @array; $x++ ) {
        if ( $array[$x] !~ /^\#/ ) {    # i.e. your condition passes
            $array[$replace] = $array[$x];
            $replace++;
        }
    }
    This will effectively 'shift' everything over to the front of your array, avoiding double memory issues, and all you have to do is one final pass to clean up the invalid entries at the end (pop @array until the length == $replace, or simply assign $#array = $replace - 1).
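    End to end, with the thread's /^\#/ test filled in and the final cleanup done by truncating the array in one step rather than a pop loop (my sketch):

```perl
use strict;
use warnings;

my @array = qw( a #b c #d e #f g );
my $replace = 0;
for my $x ( 0 .. $#array ) {
    if ( $array[$x] !~ /^\#/ ) {        # the element passes the test
        $array[$replace] = $array[$x];
        $replace++;
    }
}
$#array = $replace - 1;                 # chop the stale tail off in one go
print "@array\n";   # prints "a c e g"
```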

      And if you aren't afraid of Perl you can:

      my $replace = 0;
      for my $x ( 0 .. $#array ) {
          next if $array[$x] =~ /^\#/;    # i.e. skip unless your condition passes
          $array[$replace++] = $array[$x];
      }

      or maybe even:

      $array[$_] !~ /^\#/ and $array[$replace++] = $array[$_] for 0 .. $#array;   # "your condition" filled in with the thread's test

      Perl's payment curve coincides with its learning curve.
Re: Efficient array element deletion
by Sinister (Friar) on Dec 05, 2008 at 07:50 UTC
    If I have a long array and my goal is to perform some test on each element and remove those elements that fail, what are the best ways to do it from CPU and memory standpoints?

    I think preventing those entries from ever making it to the array is far more efficient than pushing them on and then later grep-ing them out.

    Any form of array shrinkage is costly (as has been proved throughout this whole thread).
