Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

“A meeting at the Liquor-Vodka Factory”, or… same ARRAY questions again?!!

by Mabooka-Mabooka (Sexton)
on Sep 02, 2005 at 20:14 UTC ( #488776=perlquestion: print w/ replies, xml ) Need Help??
Mabooka-Mabooka has asked for the wisdom of the Perl Monks concerning the following question:

Dear Patient Monks,

There’s a novel “A meeting at the Liquor-Vodka Factory” (“собрание на ликеро-водочном заводе”) by Russian author M. Zhvanetsky. Workers at a liquor factory are having a meeting. It is structured like this:

The chairman pours himself from a bottle and says: “Today, we are going to listen to our Director of the Transportation Department”.

The Director of the Transportation Department comes forward, pours himself from a bottle and makes some speech.

The chairman pours himself again and says: “Well… Now, let us listen to our Director of the Transportation Department. Please tell us how are things in our transportation”.

The Director of the Transportation Department comes forward again, pours himself, and talks some more.

The chairman: “All right, then. Now, how about our Transportation Department? Where’ s our Director of Transportation? Anybody seen him?”...

And so it goes and goes, until the Chairman says: “Well, the things seem to go right. Except it’s a re-eeeeal pity that today we couldn’t make it to hear our Director of the Transportation Department.”.


... That’s approximately what I feel myself when searching for answers to questions like:

  • a. What is a size of an array in Perl?
  • b. How to find whether a certain element is present in an array?
  • c. What is the index of an element in an array?...

I hear: “Boo!!!”,,, “NOT Again!”, “RTFM!!”, “RTFAQ!!!”…. But please stay with me for another minute, -- I’ll try to prove that the problem still exists.

To make myself clearer, let me ask you three more questions:

  • d. How many people come here seeking for the answers to the abovementioned questions?
  • e. How many people come here for these more than once in their lifetime?
  • f. How many leave without the answer they were hoping to get, -- or, worse, with a wrong answer?

I am ready to bet that the ##s are high. And here’s why. Let’s take a look at a) in FAQs here: "How do I find the size of an array?":

It becomes obvious that there are at least 7 answers:-). No one proposes the one and only one best solution, the confusion on whether to add a 1 to the index of the last element persists, and nobody discusses performance implications of casting to a scalar vs. using the last index. But that’s probably OK: an answer CAN be found, it only takes some time….

As for questions b) and c), it’s unfortunately not a case. I mean: neither solution works for me.
(These are two similar questions, so let’s look just at the first one - "How to find out if X is an element in an array?"):

... I actually love asking this question during an interview (regardless of the programming language), and see whether the person jumps into writing some sort of a loop or asks: how big is an array? Is it sorted? Is it dynamic?... You know, one could spend hours talking about this stuff.

What is proposed here in FAQs is:

* Two algorithms: either - traversing an array or - duplicating its content in a hash (“associative array” that is? +) for a fast lookup; * Using ‘grep’ or ‘eq’; * Assuming that arrays look like qw /I am an array, yeah…/.

Which means: Arrays are data structures that are always

SMALL; UNSORTED (== no assumptions about distribution of elements); CONTAIN ONLY SHORT STRINGS
, and there’re no performance or memory problems whatsoever while dealing with them.

Hmm... guess what: that’s not exactly the set of conditions that ~some~ real-world applications have to deal with.

I assert that:

* Traversing an array should never be used for searching; * Using a hash is a pretty smart solution, except: - It has memory implication (hence not always can be used); - It slows down considerably for huge arrays (or, at least you h +ave to build your own hash structure with some pretty hashing functio +n); - In any case, questions arise: is it smart to build it every tim +e one needs a lookup? What if the contents of an array change all the + time?.., etc. * There are other algorithms in CS.

Besides, there’re integers, floats, doubles, and data structures appearing as elements of arrays sometime…

And on top of all, -- nobody has proposed a subroutine, every recipe's style is not a "production code". (That’s why the code that looks elegant at the 1st glance:

$X = "x"; @array = qw / x y z 1 2 3 /; return exists {map { $_ => 1 } @array}->{$X};

is not really useful: one cannot paste it to a sub to make it work.

So…… Thanks for bearing. I hope I've made a poin. What's missing is: Well-thought-through and unambiguous answers to these three questions at the very top of the "Array" FAQ list.
Or at least could anybody please post (for starters):

- a sub - that takes two arguments : an array of ints and an int; - assumes that the array is sorted; - uses binary search :-); - returns an index of the element or undef.



THANKS FOR READING!

Janitored by Arunbear - replaced pre tags with code tags, to prevent distortion of site layout

Comment on “A meeting at the Liquor-Vodka Factory”, or… same ARRAY questions again?!!
Select or Download Code
Re: “A meeting at the Liquor-Vodka Factory”, or… same ARRAY questions again?!!
by ikegami (Pope) on Sep 02, 2005 at 20:50 UTC

    * a. What is a size of an array in Perl?

    In terms of bytes?
    I don't know.

    In terms of elements?
    scalar(@array) (or simply @array in a scalar context).

    --> What is the index of the first element of an array in Perl?

    $[, which is usually assumed to be 0. (Updated)

    --> What is the index of the last element of an array in Perl?

    $#array (which is the same as @array - 1).

    * b. How to find whether a certain element is present in an array?

    I presume you're asking for the best way to do so, but the choice of which is best depends on the definition of element, on whether the array is sorted, on the size of the array, on whether a single lookup is done or not, and on how much effort you want to spend.

    * c. What is the index of an element in an array?...

    You must locate the element yourself. See (b).

    * d. How many people come here seeking for the answers to the abovementioned questions?
    * e. How many people come here for these more than once in their lifetime?
    * f. How many leave without the answer they were hoping to get, -- or, worse, with a wrong answer?

    Programming is more than just knowing the language. That's why I spent 4 years in university instead of just reading the docs.

    Or at least could anybody please post [a binary search] (for starters)

    There's a module on CPAN (Search::Binary) and there's one on my scratchpad, copied here for your convenience:

    sub _unsigned_to_signed { return unpack('i', pack('I', $_[0])); } sub binsearch(&$\@) { my ($compare, $value, $array) = @_; # # If the $value is not found in @array, # the 1s complement of where the $value # should be inserted is returned. So, # $value is found if the return value >= 0. # # When compare is called, $a is an alias for $value, # and $b is an alias for the element of the array # against which $a which be compared. # # Example: # # Add $value to sorted @array, if it's not already there. # $idx = binsearch { $a <=> $b } $value, @array; # splice(@array, ~$idx, 0, $value) if ($idx < 0); # my $i = 0; my $j = $#$array; return $j if ($j == -1); my $ap = do { no strict 'refs'; \*{caller().'::a'} }; my $bp = do { no strict 'refs'; \*{caller().'::b'} }; local *$ap; local *$bp; *$ap = \$value; for (;;) { my $k = int(($i+$j)/2); *$bp = \($array->[$k]); my $cmp = &$compare(); return $k if ($cmp == 0); if ($cmp < 0) { $j = $k-1; return _unsigned_to_signed(~$k) if ($i > $j); } else { $i = $k+1; return _unsigned_to_signed(~$i) if ($i > $j); } } }

    It's not quite to your spec without this wrapper:

    sub my_binsearch($\@) { unshift(@_, sub { $a <=> $b }); my $idx = &binsearch; undef $idx if $idx < 0; return $idx; }

    Update: binsearch now uses the caller's package's $a and $b.

    Update: Fixed bugs in binsearch.

    Test proggy:

    my @unsorted = qw( ... strings ... ); my @sorted; foreach (@unsorted) { my $word = lc($_); my $idx = binsearch { $a cmp $b } $word, @sorted; splice(@sorted, ~$idx, 0, $word) if ($idx < 0); } print(join("\n", @sorted), "\n");

      --> What is the index of the first element of an array in Perl?

      0

      Actually,
      $[ The index of the first element in an array, and of the fir +st character in a substring. Default is 0, but you could theoretically set it to 1 to make Perl behave more like aw +k (or Fortran) when subscripting and when evaluating the index() + and substr() functions. (Mnemonic: [ begins subscripts.) As of release 5 of Perl, assignment to $[ is treated as a compiler directive, and cannot influence the behavior of a +ny other file. (That's why you can only assign compile-time constants to it.) Its use is highly discouraged. Note that, unlike other compile-time directives (such as strict), assignment to $[ can be seen from outer lexical s +copes in the same file. However, you can use local() on it to st +rictly bound its value to a lexical block.
        Shoot, you're right. I consider $[ deprecated, so I never think of it.
Re: “A meeting at the Liquor-Vodka Factory”, or… same ARRAY questions again?!!
by InfiniteSilence (Curate) on Sep 02, 2005 at 21:24 UTC
    This was an interesting post but comes off a bit frustrated and uses examples that are not "on point."

    For example, you said: "is not really useful: one cannot paste it to a sub to make it work."

    Perhaps I am confused but Perlmonks (or any other free, online resource by volunteers) is unlikely to do your job for you.

    Second, and more importantly, have you ever heard of the phrase, "garbage in, garbage out?" If you ask a specific question -- one that is qualified with performance requirements like, "hey, the array in question will have 1 million elements" -- you are far more likely to get a detailed answer. Be specific.

    Celebrate Intellectual Diversity

Re: “A meeting at the Liquor-Vodka Factory”, or… same ARRAY questions again?!!
by graff (Chancellor) on Sep 02, 2005 at 21:59 UTC
    Thank you for an entertaining statement of the problem. I suppose some people might view multiple answers to a question as causing some sort of difficulty, but it is in the spirit of Perl to have multiple ways to perform a task. If you're not comfortable with that, well, you'll get used to it as you spend more time with the language.

    Please understand that your closing question is actually different from, in terms of being much more detailed than, the FAQ that you cited. In a perfect world, the more general, less specific nature of the FAQ answers would give you a starting point, and from there you can try tweaks and variations that would isolate and adapt a general approach to your specific need. You do need to do some of the work yourself.

    BTW, please please remember that you can use HTML formatting tags (like <UL>, <LI>, etc) in your post. Using <pre> is generally unnecessary and not helpful.

    Regarding your last request, I'm reluctant to post a specific solution, because I'm not sure how much error trapping might be called for (e.g. checking for wrong number of args, wrong type of args, array not sorted, array contains undef or non-integer elements, array contains non-unique elements, etc), or what sort of behavior would be best when an error is caught (e.g. die, croak, carp, or just return a non-integer, given that an undef return is supposed to be a valid result). These things depend on how much is known, or not known, about the caller.

    And of course, as indicated an earlier reply, what is known (or not known) about the caller and the input data will make a difference in what sort of algorithm is best.

    I suppose if I change your spec, it might be easier for the sub to be written in a more adaptable way: return undef in case the inputs to the sub fail to meet any essential condition; return "-1" if the inputs are considered valid and the target does not occur in the array; otherwise, return an index pointing to the target value in the array.

    Since this reply is already too long, I'll take the shortcut of assuming that the inputs can be trusted to meet the stated conditions:

    sub find_int_in_array { my ( $arref, $targ ) = @_; # args are array ref, int value my $asize = scalar @$arref; my $lasta = $asize - 1; my $result = ( $targ < $$arref[0] or $targ > $$arref[$lasta] ) ? - +1 : ( $targ == $$arref[$lasta] ) ? $lasta : ( $targ == $$arref[0] ) ? 0 : undef; return $result if ( defined( $result )); my $nextidx = $asize / 2; my $nextinc = $nextidx / 2; while ( ! defined( $result )) { if ( $$arref[$nextidx] == $targ ) { $result = $nextidx; } else { my $newidx = ( $$arref[$nextidx] < $targ ) ? $nextidx + $nextinc : $nextidx - $nextinc; $result = -1 if ( $nextidx == $newidx ); $nextidx = $newidx; $nextinc /= 2; } } return $result; }
    If you want to add some error checking to that, just have it return undef when there is an error. Or better yet, use a module that already exists to do a binary search over an array (as cited in the first reply) and read its man page.
Re: “A meeting at the Liquor-Vodka Factory”, or… same ARRAY questions again?!!
by dave_the_m (Parson) on Sep 02, 2005 at 22:19 UTC
    * Using a hash is a pretty smart solution, except:
    - It has memory implication (hence not always can be used);
    - It slows down considerably for huge arrays (or, at least you have to build your own hash structure with some pretty hashing function);
    Assuming the hash can fit in memory, it has roughtly constant time for lookups; there is no need for a pretty hashing function as such a function is already used interally - that's why it's called a hash!

    Dave.

Re: “A meeting at the Liquor-Vodka Factory”, or… same ARRAY questions again?!!
by spiritway (Vicar) on Sep 03, 2005 at 02:35 UTC

    Old Russian saying: "This work is not easy, to drag a hippo from a swamp" (admittedly it loses something in translation). Your frustration is understandable. However, I think you overlook one of the important points about Perlmonks, why it is so useful.

    Perlmonks is so useful because you often get so many different answers to your question. Everyone has a slightly (or hugely) different take on your question, comes up with something a bit different. Yes, there are wrong answers. People often misread the question, misunderstand the subtleties of Perl, or otherwise err. That's part of the price we all pay for having a "free" source of information - it's not 100% tested and accurate. Still, neither is Windows, and you pay for that...

    Your question about the number of elements in an array *has* been answered, and abundantly so. There is a Perl idiom that returns the number of elements in an array.

    Your question about the index of an element of an array is not specifically a Perl question to begin with. It arises in every language that has arrays, and it is not a trivial question. You can sort your array and speed up the searching, but in general you'll have to go look for it, no matter what language you use. Hashes are quicker but come with a price, and you might not care to pay that price.

    The point is, there's no one "right" way to do most things in Perl, which is what makes it such a versatile and dynamic language. What you're complaining about - and I agree, it's very frustrating at times - is also one of the major strengths of Perl. There really is more than one way to do it, and you're likely to hear several of them each time you ask a question. In the final analysis, it's up to you to decide which answer is appropriate, if any - or to develop yet another way to do it, and maybe share that with us.

Re: “A meeting at the Liquor-Vodka Factory”, or… same ARRAY questions again?!!
by polettix (Vicar) on Sep 03, 2005 at 15:18 UTC
    As you already pointed out, you're not describing a problem, but rather a family of related problems. Each of which has a solution that best fits.
    Traversing an array should never be used for searching
    This really hits me, in particular the never. If your search needs are rare, for example, or you don't have particular needs in term of efficiency - why not? Keeping it simple seems the most winning solution to me, human cycles are quite more expensive than CPU ones. And List::Util contains tools that are there only to be put directly into production code (yes, it's part of the CORE MODULES).

    More or less, I'd use a hash when I know in advance that I'll do a lot of lookups. Building the hash from the array has its cost, both in term of space and time (yes, you have to traverse the whole array!), so I'd keep this solution for occasions where I really need it.

    Last, but not least, your request for code is funny. Why not posting an initial implementation? The description seems already near to Perl code, and it would have probably helped you better than a look for cut-and-paste.

    Flavio
    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Don't fool yourself.
Re: “A meeting at the Liquor-Vodka Factory”, or… same ARRAY questions again?!!
by tphyahoo (Vicar) on Sep 04, 2005 at 14:29 UTC
    If you're doing anything other than traversing the array, like in a for loop or something, USE A HASH.

    I believe that answer is correct 97% of the time some perl seeker asks this question, and I guess the other 3% just hasn't happened to me yet. So, in the spirit of short answers... yes... just use a hash.

Re: “A meeting at the Liquor-Vodka Factory”, or… same ARRAY questions again?!!
by Mabooka-Mabooka (Sexton) on Sep 05, 2005 at 20:26 UTC
    Gentlemen,

    1st of all - huge thank you to everybody. It’s as always a pleasure to be here and have your questions answered. I *love* this community of wise, intelligent and professional hackers:-), (-- although I am not too much into Perl:-)- (sorry))..

    A few general remarks:
    1. I am sorry if my initial post caused a bit of frustration for some of you. It wasn’t intended to be. My point was:
    I expect a higher quality of answers in FAQ than in general posts. And the more F is a Q, the higher quality the A (I assume) should be. That’s all.
    It’s just my opinion of course, but I think many people treat “FAQ” as the only source of truth...

    2. Thanks a lot for the two _binary search_ answers. Both seem to work, I’ll do some corner-case and performance testing and compare them to hash implementation as well, -- and probably post the results here (if nobody minds).
    2-a). I’ve made two changes in find_int_in_array :
    A)

    #my ( $arref, $targ ) = @_; # args are array ref, int value my ( $targ, $arref ) = @_; # args are: (int, array ref)
    this is just a “style” thing, easier to remember (int in arr, not arr in int), plus - easier to test together with other impl-s by the same driver;
    B)
    Changed
    my $nextidx = $asize / 2; my $nextinc = $nextidx / 2;

    to:
    my $nextidx = int($asize / 2); my $nextinc = int($nextidx / 2);
    (Otherwise, for an array of 5 elements, it’d return an index = 2.5 ;-...)).

    3. Re: using <code> v.s <pre>:
    I do know the rules and tried not to, but,,, on my screen, it’s either <code> is too small (cannot read), or with the font size increase, everything else’s too big and bold..... Maybe smb. can review this policy? (Minor thing of course).

    4. ($#array + 1) vs. scalar(@array) :
    You’re going to laugh, but I did a performance test..:). The results is: $#array is ~10-15% faster than the other one. Intuitively, that’s what one would expect (my wild guess is that $#array is probably an internal counter that is always kept up-to-date, and the scalar(@array) gets calculated).
    ....This is a pure academical question of course:-)... It’s hard to imagine an application that would suffer from using the 2nd approach.

    5. Using a hash vs. Binary search.
    Again, hash works just fine, except for three implications:

    • memory;
    • slows down for huge amounts of data (vs. bisearch which is always linear);
    • complex:-)

    Enough said about ## 1 & 2; let me explain the 3d one:
    ... Imagine an application that updates an array rarely and looks up often. Say you have 10,000,000 customers / books / whatever that get added / published only once a minute, but information is requested 1000 per second.
    So, re-building the hash for lookup 1000 times a second is clearly not an option. Right?
    A solution to that might be (I’ll think in some pseudo-language now):
    class Array_with_Hash: @the_array = () % the_hash = (); init(@a) { put (a =>> the_array); recreate_hash (the_array, the_hash); } pop(elem): { array.pop(el) update_hash(el); } ...

    Well, if you know for sure that push and pop is *all* that you do, that might be possible. However, creating a generic class like this might be really tricky.
    • array must be locked (no access to it directly, -- only by this class’s methods;
    • hash values probably have to be arrays_of_indexes...
    • implementing sort(), slice-and-dice etc. might take some time;
    • it’s quite easy to “forget” about something...
    • ...

    This only means that the hash approach only works if
    • cost(recreating the hash every time) == nothing
    • OR
    • cost(programming time) == nothing..:)
    • .
    In other words, -- thanks again for the binary search in Perl! (And it should be in FAQ, too...:-)...
      Changed
      my $nextidx = $asize / 2; my $nextinc = $nextidx / 2;
      to:
      my $nextidx = int($asize / 2); my $nextinc = int($nextidx / 2);
      (Otherwise, for an array of 5 elements, it’d return an index = 2.5 ;-...)).

      Actually, that change is unnecessary. Whenever a floating point value is used as an array index, Perl automatically uses just the integer part of the value, since that would be the only sensible thing to do.

      On another topic...
      Imagine an application that updates an array rarely and looks up often. Say you have 10,000,000 customers / books / whatever that get added / published only once a minute, but information is requested 1000 per second. So, re-building the hash for lookup 1000 times a second is clearly not an option. Right?

      Um, if the application is as you describe, why would you want to use an array at all? With that sort of ratio between updates and searches, it would be better just to maintain a hash instead of an array. Surely you do not want to "recreate the hash every time" in order to implement the search; create the hash once and maintain it, assuming it fits in memory -- and if not, consider a DBM_File approach (cf. AnyDBM_File), or just use a real database backend.

      If your issue is "using a (temporary) hash to store a copy of an array so that the array can be searched for specific values", yes, that's a bad approach for really big arrays and the kind of update/search ratio you're suggesting. But you're not explaining why the primary data storage needs to be an array.

      If the issue is really "coming up with a viable app to support lots of searches for specific values in a large set", the answer is more likely to be: start by using a hash as the primary storage, and use the values to be searched for as hash keys.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://488776]
Approved by graff
Front-paged by ghenry
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (3)
As of 2014-09-15 03:20 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite cookbook is:










    Results (145 votes), past polls