
best data structure

by Anonymous Monk
on Aug 28, 2002 at 13:04 UTC ( #193438=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a program that needs to process a list as a FIFO, so I have written it as...

    foreach my $item (@list) {
       ... code
    }

However, the code inside the loop appends new information to the end of the list, and I want to avoid duplicate entries, so I'm having to loop over the whole list to check whether each new item is already there. The program is, obviously, getting slower the longer the list gets.

So I should use a hash to do the lookup, but a hash cannot safely be extended while it is being iterated. A loop like

    while (my ($item, $dummy) = each(%hash)) {
       ... code
    }

only processes the entries that were in the hash when the iteration started; keys added while each() is active may be skipped entirely.

It looks like I need to use both a list so that I can process all the items and a hash to speed up the lookup. Is there another way?

Replies are listed 'Best First'.
Re: best data structure
by rob_au (Abbot) on Aug 28, 2002 at 13:27 UTC
    Why not combine the two approaches? The first solution which comes to my mind calls for a single iteration through the elements of @list with a lookup hash, %hash, that is used to avoid duplicate processing.

    For example ...

    my %hash;
    while (defined(my $item = shift @list)) {
        next if $hash{$item}++;
        # ... process $item ...
    }

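    A fuller runnable sketch of this idea, applied to the original question (items processed may themselves enqueue new items; the item names here are invented for illustration):

```perl
use strict;
use warnings;

# Work queue: processing an item may enqueue further items.
# %seen guarantees no item is ever queued twice.
my @queue = qw(a b c);
my %seen  = map { $_ => 1 } @queue;
my @processed;

while (defined(my $item = shift @queue)) {
    push @processed, $item;

    # Hypothetical rule: processing "a" discovers items "b" and "d".
    my @found = $item eq 'a' ? qw(b d) : ();

    for my $new (@found) {
        next if $seen{$new}++;    # O(1) duplicate check
        push @queue, $new;        # safe: we iterate by shifting
    }
}

print "@processed\n";    # a b c d
```

    Shifting items off the array, rather than foreach-ing over it, is what makes it safe to push onto the same array mid-loop.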

Re: best data structure
by digiryde (Pilgrim) on Aug 28, 2002 at 17:19 UTC

    It seems a few questions need to be asked first:

    • How many items can be in the list (realistically and maximum)?
    • How big can the "items" be (in chars)?
    • Is the set of "items" finite and/or completely known now?
    • Can you use a tree data structure to locate duplicates?

    I tackled this problem years ago. We had a FIFO that other applications could write to (if they had write permission), which was read by a process-flow-control program. We kept running into users who would submit the same job many times, but each job could only run once per request, so we wound up keeping a hash with the job name as the key. We would see up to 1000 jobs submitted within a few minutes at times, and performance was fine.

    Using the job name as the hash key means Perl prevents duplicates for you; alternatively, you can test for the existence of the key and handle duplicates any way you want. We actually implemented the queue so that a job could be set to do one of several things on duplicate requests:

    • It could set a rerun flag (if one instance of the job is running and another request comes in, the first job finishes and then runs again as soon as it is done.)
    • It could set an ignore flag (if one job is running and another request comes in, the request is ignored until the job is finished running.)
    • It could set a count flag (if a job is running and another request for it comes in, a counter is incremented for that job. When the job finishes, the counter is decremented and the job is run again).

    Also, by filtering the job this way, we could determine whether or not a dup job was the same process with same data or same process with any data. Using this model, we were able to create a very powerful tool that handled workflow (both program and people) using a simple process definition language.

    The general idea of it is as follows:

    • look at internal queue
      • Is queue empty? - Block on FIFO until something comes in and add it to the queue.
      • Is queue not empty? - Poll FIFO and add anything in it to the back of the queue.
    • Process first queue item.
      • Is this a job control command? Do it and go to top. This gave us a way of setting priority of jobs, cancelling jobs, and more...
      • Is this a process request? Check the hash for the process name. If it exists, then you have a dup request. If you have a dup request, check the rules for the process to see how that process handles duplicate requests. We then used a hash from the process request to store the data for the request.
        • Check for concurrency (manage dups)
        • if process can start (rules verification), start sub-process. Normally done with a fork.
        • if process can not start
          • are we counting requests? bump counter. remove from queue
          • are we rerunning once? set rerun flag. remove from queue
          • are we ignoring? remove from queue.
    • ...
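    A minimal sketch of that control loop (the rule names and jobs are invented; a real system would block on the FIFO rather than drain a plain array, and would clear %running when a forked job exits):

```perl
use strict;
use warnings;

# Hypothetical per-process rules for handling duplicate requests.
my %dup_rule = (build => 'count', backup => 'ignore', report => 'rerun');

my %running;    # process name => true while an instance is running
my %state;      # per-process rerun flags and pending counters
my @queue = qw(build build backup report backup);

while (defined(my $name = shift @queue)) {
    if ($running{$name}) {
        # Duplicate request: consult this process's rule.
        my $rule = $dup_rule{$name} || 'ignore';
        if    ($rule eq 'count') { $state{$name}{pending}++ }
        elsif ($rule eq 'rerun') { $state{$name}{rerun} = 1 }
        next;    # 'ignore' (and the others) drop the request itself
    }
    $running{$name} = 1;
    # ... fork the job here; the reaper would clear $running{$name}
    # and re-queue according to $state{$name} when it exits ...
}
```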

    Though not necessarily the best solution, using a hash makes it much easier to manage (and allows tie) and opens up many possibilities for growth.

    msg me for more on this if you are interested.


Re: best data structure
by RollyGuy (Chaplain) on Aug 28, 2002 at 13:33 UTC
    I'm not sure if it's exactly what you were looking for, but I used the built-in Perl function grep to examine my list for entries that are already in there. Here's the snippet of code I used to test the grep function:

        # Create an initial list
        @list = (9,2,3,4,5,6);

        # Data we wish to insert
        @insert_data = 4..9;

        # For every datum that we wish to insert, check for its
        # existence in the list and append it if it is not found
        foreach $item (@insert_data) {
            next if grep /$item/, @list;
            push @list, $item;
        }

        # Print out our final list
        foreach $item (@list) {
            print $item, ",";
        }

    This results in the output: 9,2,3,4,5,6,7,8.

    Hope this helps -- Rolly

    Update: See comments below for reasons why this isn't an efficient solution.

      grep will work for this, but it's not a good solution for an existence check, especially not via a regexp: it always iterates over the whole list, even if the element is found at the first position, and a pattern match like /$item/ also reports a false hit when $item is a substring of a list element (1 would match 11, for example).

      So the first thought would be a hash again, with code like this:
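      The code itself appears to have been lost from the archived post; a sketch of the hash-based check it presumably described (variable names assumed, mirroring the grep example above):

```perl
use strict;
use warnings;

my @list        = (9, 2, 3, 4, 5, 6);
my @insert_data = 4 .. 9;

# Seed the lookup hash from the existing list once...
my %seen = map { $_ => 1 } @list;

# ...then every membership test is a single O(1) hash lookup.
for my $item (@insert_data) {
    next if $seen{$item}++;
    push @list, $item;
}

print join(',', @list), "\n";    # 9,2,3,4,5,6,7,8
```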

      While this too is an option to solve the problem at hand, you will find that over time, and with larger sets of data, this method proves substantially slower. A grep or loop over all elements of the set is, in the worst case, O(n) per lookup; a hash lookup, in contrast, is O(1).

      As such the hash-based approach, particularly with larger data sets, is easily justified, trading memory space for execution speed.
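      The O(n) vs. O(1) difference is easy to demonstrate with the core Benchmark module (the list size and looked-up value here are arbitrary):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @list   = (1 .. 10_000);
my %hash   = map { $_ => 1 } @list;
my $target = 9_999;    # near the end of the list: grep's worst case

# Run each sub for about one CPU second and compare rates.
cmpthese(-1, {
    grep_list   => sub { my $found = grep { $_ == $target } @list },
    hash_lookup => sub { my $found = exists $hash{$target} },
});
```

      On a typical run the hash lookup comes out orders of magnitude faster, since grep must scan nearly all 10,000 elements on every call.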


Re: best data structure
by thelenm (Vicar) on Aug 28, 2002 at 15:34 UTC
    You may want to look into Tie::IxHash. That module implements Perl hashes that preserve the order in which the hash elements were added. You can use the normal hash functions, such as exists, and you can also use "arrayish" functions, such as Push and Shift.

    -- Mike


Node Type: perlquestion [id://193438]
Approved by moxliukas