Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Testing if an array contains a value and then deleting it in most efficient way

by karpatov (Beadle)
on Feb 18, 2008 at 14:21 UTC ( [id://668557]=perlquestion: print w/replies, xml ) Need Help??

karpatov has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, please could you advice me on most efficient way of testing an array for a value which when found is deleted from it? The array should contain over 10 000 values and a value is going to be tested almost million times during the parsing a database.
thank you karpatov
  • Comment on Testing if an array contains a value and then deleting it in most efficient way

Replies are listed 'Best First'.
Re: Testing if an array contains a value and then deleting it in most efficient way
by Punitha (Priest) on Feb 18, 2008 at 14:38 UTC
Re: Testing if an array contains a value and then deleting it in most efficient way
by Skeeve (Parson) on Feb 18, 2008 at 14:40 UTC
    If you say "contains" the answer is most often "hash". But you should explain a bit more. Maybe some code you've already done. It's a bit difficult to give an answer when you don't supply enough information.

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      OK.
      I have
      1. a list of chemical compound IDs (10.000 IDs)
      2. a xml output of a database with the records for compounds identified by the ID (almost million records) splitted into .xml files by 25.000.

      What I want to get certain value buried in the xml tree for an individual compound when the compound`s ID is in my list.
      During the weekend I was trying to solve a similar problem, when I was parsing other type xml of output, I wanted buried value from records that were containing some value conforming regex. When regex pattern was found the record was parsed by XML:twig (building only the twigs with value of interest and children_text). This strategy (regex and only then xml:pareser), adviced me by Monks pc88mxer and graff, worked great and in fact the IDs I have now are the result. The only difference now is that the records to be xml-parsed should be selected by extracting relevant value for <compoundID>value</compoundID> by regex and then checking if this value is among 10.000 IDs. So I am looking for most efficient way how to look-up the array of IDs (and deleting already found to speed-up the process). Is there some other strategy?
      Tx for reply.

        I think Skeeve is correct. Your best way forward is a hash of the ids.

        my %set_contains = map { $_ => 1 } @id_set; if ( $set_contains{$some_id} ) { # $some_id is in the set } # removes $some_id from the set: delete $set_contains{$some_id};
Re: Testing if an array contains a value and then deleting it in most efficient way
by ikegami (Patriarch) on Feb 18, 2008 at 15:59 UTC

    Deleting from the middle of an array is inherently slow, since all the subsequent elements need to be shifted. Locating elements is slow too.

    Is the data sorted by the search criteria?
    If so, the item can be located using a binary search which is much faster than a linear search.

    Can you use a hash indexed by the search criteria.
    Locating and deleting the element would become really cheap.

    If you're stuck with an array, you might want to consider undefing the element instead of deleting it. Just skip the undefs when it's time to save the structure.

      In code (while using array) ...

      use List::MoreUtils qw/ firstidx /; for my $val ( generate() ) { my $i; # Deal w/ "uninitialized variable" warning as appropriate. while ( -1 < ( $i = firstidx { $_ == $val } @array ) ) { delete $array[ $i ]; } } # Later... # exists() here would go better with earlier delete(), but # then would need to generate list of indices. my @save = grep { defined $_ } @array;

        ow! You transformed a O(N) problem into an O(N2) solution.
        firstidx starts at the start of the array every loop pass.

        Not only did you half the speed, you doubled the memory requirements.
        "firstidx" copies the entire array before processing it.

      In fact the values to be looked-up will come numerically sorted (probably if the record IDs in db are really ordered by their numeric ID), so sorting the values in the array would mean that only first element is deleted.
      So it seems that I can only ask whether the value to be looked up equals just the first item of the array! How only could miss it? Tx for pointing the possibility of ordering.
      Undefs are ignored during the search, so their presence doesnt slowdown the procedure?
        Fortunately for you, deleting the first (or last) element of an array (shift(@a)) is very fast (O(1)) in Perl.

        A side note ... The records may be sorted but will not be returned as such when not asked, if database server is Sybase 15.0.x.

Re: Testing if an array contains a value and then deleting it in most efficient way
by dsheroh (Monsignor) on Feb 18, 2008 at 17:44 UTC
    I'll put in another vote for "use a hash instead of an array" unless there's some reason that the order of the values needs to be preserved.

    If the order that you'll need the values is predictable, you may also be able to have the database pre-sort them for you so that you can just shift values out of the array. One sort will almost always be faster than 10,000 searches. (Especially since those searches will have to be done in extremely-ineffecient ways unless the data is sorted first...)

Re: Testing if an array contains a value and then deleting it in most efficient way
by jwkrahn (Abbot) on Feb 19, 2008 at 04:52 UTC
    1. Use a hash.

    2. @array = grep $_ != $value, @array;

    3. Or a fairly efficient method:

      for my $i ( 0 .. $#array ) {
          if ( $array[ $i ] == $value ) {
              @array[ $i, -1 ] = @array[ -1, $i ];
              $#array--;
              last;
              }
          }


    1. Update: after my naive attempt at efficiency and some benchmarking this works better:

      my $i
      for ( @array ) {
          if ( $_ == $value ) {
              $i = $_;
              last;
              }
          }
      splice @array, $i, 1;

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://668557]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (3)
As of 2024-04-25 07:02 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found