http://www.perlmonks.org?node_id=817226

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

Lazy web,
I am looking for help constructing a modified binary search.

I have a large array of strings, containing duplicates, sorted in ASCIIbetical order. Given two strings ($beg & $end), I want to find the number of elements in the array that fall between the two points. My idea was to first perform a binary search to find the index of the element nearest to $beg, then the one nearest to $end, and do some math.

Here are the two things that led me to ask you, dear lazy web.

Cheers - L~R

Replies are listed 'Best First'.
Re: Modified Binary Search
by Old_Gray_Bear (Bishop) on Jan 13, 2010 at 18:01 UTC
    I have written this kind of tool to scan Apache log files. The actual problem statement was "return the logs starting at XX:XX:XX and running until YY:YY:YY".

    I used a binary search to position myself at the last entry with a time-stamp < XX:XX:XX and then started a sequential read until I reached a timestamp > YY:YY:YY. I used seek() and tell() to get/set my position in the file; calculated the mid-point of my range and then looped through until I got with in range. The only really tricky part was remembering to do two reads at the mid-point to get to a full record with a time-stamp in it (tell() and seek() work on bytes not records).

    Pseudo-code:

    $start = 0, $stop = length of file (array)
    # Now check the entry at $mid for my start/stop condition
    $again = 1
    Loop-while ($again = 1)
        $again = 0
        $mid = int(($start + $stop)/2)
        if (array[$mid] < $beg)
            $start = $mid
            $again = 1
        if (array[$mid] > $end)
            $stop = $mid
            $again = 1
    End Loop
    When you fall out of the loop, $start will be the last entry < $beg and $stop will be the first entry > $end.
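    In runnable Perl, the same idea might look like this (a sketch, assuming an in-memory sorted array rather than seek()/tell() on a file; the sub names are mine, not OGB's):

```perl
use strict;
use warnings;

# First index whose element is >= $key (lower bound).
# Returns scalar(@$aref) if every element is smaller than $key.
sub lower_bound {
    my ($key, $aref) = @_;
    my ($lo, $hi) = (0, scalar @$aref);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($aref->[$mid] lt $key) { $lo = $mid + 1 }
        else                       { $hi = $mid }
    }
    return $lo;
}

# First index whose element is > $key (upper bound).
sub upper_bound {
    my ($key, $aref) = @_;
    my ($lo, $hi) = (0, scalar @$aref);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($aref->[$mid] le $key) { $lo = $mid + 1 }
        else                       { $hi = $mid }
    }
    return $lo;
}

my @array = qw(a b c p p q r w x z);
# Inclusive count of elements in 'p' .. 'r' (p p q r):
my $count = upper_bound('r', \@array) - lower_bound('p', \@array);
print "$count\n";    # 4
```

    Both loops handle duplicates correctly, because neither stops early on an exact match.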

    ----
    I Go Back to Sleep, Now.

    OGB

Re: Modified Binary Search
by bobf (Monsignor) on Jan 13, 2010 at 18:09 UTC

    If you're looking for feedback on your proposed algorithm (rather than code)...

    I would probably take a similar approach, assuming that the array is large enough that a binary search is expected to be more efficient than creating an index of the first n characters of each string.

    Keep in mind, though, that when you do the binary search for $beg you are getting information not only about the position of $beg, but also about $end. Each time you compare a string from the array to $beg you could also determine the relative location of $end to that string. Then, once you nail down the position of $beg you could potentially start searching for the exact position of $end using a much smaller search space. (The actual benefit achieved will depend on the positions of $beg and $end within the array.)
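    That reuse of information might be sketched as follows (hypothetical code, not bobf's; the sub name and return convention are made up): while binary-searching for $beg, each probe is also compared against $end, which narrows the window the later $end search must cover.

```perl
use strict;
use warnings;

# Binary search for the first element >= $beg that, as a side effect,
# records what each probe tells us about $end: probes below $end raise
# the lower limit of the later $end search; probes at or above it
# lower the upper limit.
sub search_beg_tracking_end {
    my ($beg, $end, $aref) = @_;
    my ($lo, $hi)         = (0, scalar @$aref);
    my ($end_lo, $end_hi) = (0, scalar @$aref);   # window for the $end search
    while ($lo < $hi) {
        my $mid  = int(($lo + $hi) / 2);
        my $elem = $aref->[$mid];
        if ($elem lt $end) {
            # first element >= $end lies to the right of $mid
            $end_lo = $mid + 1 if $mid + 1 > $end_lo;
        }
        else {
            # first element >= $end is at or to the left of $mid
            $end_hi = $mid + 1 if $mid + 1 < $end_hi;
        }
        if ($elem lt $beg) { $lo = $mid + 1 }
        else               { $hi = $mid }
    }
    # position of $beg, plus the narrowed half-open window for $end
    return ($lo, $end_lo, $end_hi);
}

my @array = qw(a b c p q r w x z);
my ($beg_idx, $end_lo, $end_hi) = search_beg_tracking_end('p', 'w', \@array);
# The $end search now only needs to cover indexes $end_lo .. $end_hi - 1.
```

    How much the window shrinks depends, as bobf notes, on where $beg and $end fall in the array.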

Re: Modified Binary Search
by jethro (Monsignor) on Jan 13, 2010 at 18:01 UTC
    I think you just have to drop the stop condition; i.e., in the classical case the algorithm stops when the search string is found. Here you have to keep searching until the upper and lower ends of the search interval are reduced to neighboring elements (i.e. no further halving step is possible). The upper element is the one you want in the case of $beg, the lower element in the case of $end.

    Small special case: if the search interval contracts to elements 0 and 1, $beg might start at 0. The other way round for $end.

Re: Modified Binary Search
by sfink (Deacon) on Jan 13, 2010 at 22:21 UTC

    Sure, you could write two custom binary searches. The first would search out the boundary between $beg and anything smaller than it, rather than just looking for $beg. It's a slight modification to standard binary search. I'm pretty sure I've written it before. And it's no harder to get right than standard binary search -- as in, trivially easy in principle, but taking an embarrassingly long time in practice.

    There's probably a clever way to do it purely by tweaking the comparison function passed to a binary search: when passed an index, treat it as referring to the border between the pair of adjacent elements. Only return zero (matching) if $array[$i] eq $probe && ($i == 0 || $array[$i-1] lt $probe).

    Then you'd either do the same thing but for the other type of boundary for $end, or parameterize your code to do either one. (Once again, easy in theory, pain in the butt in practice.)
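    A sketch of such a boundary-matching comparison (not sfink's actual code; the sub name is made up): the comparison only reports a match at the first occurrence of $probe, so an otherwise ordinary binary search loop lands exactly on the lower boundary of a run of duplicates.

```perl
use strict;
use warnings;

# Binary search whose "match" condition is true only at the *lower*
# boundary of a run of duplicates, i.e. the border between $probe and
# anything smaller than it.
sub find_lower_boundary {
    my ($probe, $aref) = @_;
    my ($lo, $hi) = (0, $#$aref);
    while ($lo <= $hi) {
        my $i = int(($lo + $hi) / 2);
        # Match only if this is the first occurrence of $probe.
        return $i
            if $aref->[$i] eq $probe && ($i == 0 || $aref->[$i-1] lt $probe);
        if ($aref->[$i] lt $probe) { $lo = $i + 1 }
        else                       { $hi = $i - 1 }  # also moves left past later duplicates
    }
    return -1;    # $probe not present
}

my @array = qw(a p p p z);
print find_lower_boundary('p', \@array), "\n";    # 1
```

    The mirror-image version for the upper boundary tests $aref->[$i+1] gt $probe instead.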

    But... this only buys you something if you expect a *lot* of duplicates of $beg or $end. Is that the case? If not, just do your standard binary searches followed by linear scans to find the actual boundaries. Much, much easier to code up, and probably faster for most data sets.

Re: Modified Binary Search
by ikegami (Pope) on Jan 13, 2010 at 22:35 UTC
    For starters,
    use strict;
    use warnings;

    sub _unsigned_to_signed {
        return unpack('i', pack('I', $_[0]));
    }

    sub binsearch(&$\@) {
        my ($compare, $value, $array) = @_;

        # If the $value is not found in @array,
        # the 1s complement of where the $value
        # should be inserted is returned. So,
        # $value is found if the return value >= 0.
        #
        # When compare is called, $a is an alias for $value,
        # and $b is an alias for the element of the array
        # against which $a will be compared.
        #
        # Example:
        #   # Add $value to sorted @array, if it's not already there.
        #   $idx = binsearch { $a <=> $b } $value, @array;
        #   splice(@array, ~$idx, 0, $value) if $idx < 0;

        my $i = 0;
        my $j = $#$array;
        return $j if $j == -1;

        my $ap = do { no strict 'refs'; \*{caller().'::a'} };
        my $bp = do { no strict 'refs'; \*{caller().'::b'} };
        my $kp = do { no strict 'refs'; \*{caller().'::k'} };
        local *$ap;
        local *$bp;
        local *$kp;
        *$ap = \$value;

        for (;;) {
            my $k = int(($j-$i)/2) + $i;
            *$kp = \$k;
            *$bp = \($array->[$k]);
            my $cmp = &$compare();
            return $k if $cmp == 0;

            if ($cmp < 0) {
                $j = $k-1;
                return _unsigned_to_signed(~$k) if $i > $j;
            } else {
                $i = $k+1;
                return _unsigned_to_signed(~$i) if $i > $j;
            }
        }
    }

    our $k;

    for (
        #      0 1 2 3 4 5 6 7 8
        [ [qw( a b c p q r x w z )], 'p', 'r', 3, 5, 3 ],
        [ [qw( a b c p q r x w z )], 'd', 'w', 3, 5, 3 ],
        [ [qw( q q q q q q q q q )], 'p', 'z', 0, 8, 9 ],
        [ [qw( q q q q q q q q q )], 'q', 'z', 0, 8, 9 ],
        [ [qw( q q q q q q q q q )], 'r', 'z', 9, 8, 0 ],
        [ [qw( q q q q q q q q q )], 'a', 'p', 0, -1, 0 ],
        [ [qw( q q q q q q q q q )], 'a', 'q', 0, 8, 9 ],
        [ [qw( q q q q q q q q q )], 'a', 'r', 0, 8, 9 ],
    ) {
        my ($array, $beg, $end,
            $expect_beg_idx, $expect_end_idx, $expect_count) = @$_;

        my $beg_idx = binsearch {
            $a cmp $b || ( $k > 0 && $a eq $array->[$k-1] ? -1 : 0 )
        } $beg, @$array;
        $beg_idx = ~$beg_idx if $beg_idx < 0;

        my $end_idx = binsearch {
            $a cmp $b || ( $k < $#$array && $a eq $array->[$k+1] ? +1 : 0 )
        } $end, @$array;
        $end_idx = ~$end_idx - 1 if $end_idx < 0;

        my $count = $end_idx - $beg_idx + 1;

        printf("%2d %-5s %2d %-5s %2d %-5s\n",
            $beg_idx, $beg_idx == $expect_beg_idx ? "(ok)" : "(bad)",
            $end_idx, $end_idx == $expect_end_idx ? "(ok)" : "(bad)",
            $count,   $count   == $expect_count   ? "(ok)" : "(bad)",
        );
    }
     3 (ok)   5 (ok)   3 (ok)
     3 (ok)   5 (ok)   3 (ok)
     0 (ok)   8 (ok)   9 (ok)
     0 (ok)   8 (ok)   9 (ok)
     9 (ok)   8 (ok)   0 (ok)
     0 (ok)  -1 (ok)   0 (ok)
     0 (ok)   8 (ok)   9 (ok)
     0 (ok)   8 (ok)   9 (ok)

    It can be optimised, of course. In particular, the negating can be removed, the callback can be inlined, and the initial cuts can be used for both searches.

Re: Modified Binary Search
by BrowserUk (Pope) on Jan 14, 2010 at 00:23 UTC

    Does the list come to you sorted, or do you sort it yourself?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Modified Binary Search
by BrowserUk (Pope) on Jan 14, 2010 at 09:22 UTC

    You are tackling the wrong problem.

    Wherever you get this data from, be it loaded sorted or generated internally, you should not be storing it as a flat array with duplicates.

    You should load a parallel pair of structures: an array and a hash. You load one copy of each unique element into the array (in load order, if the input/derivation is sorted), and increment the hash value keyed by that unique element for each duplicate.

    You can then use binary searches on the uniques array, to locate the bounds, and iterate between the bounds of the array summing the counts in the hash to arrive at the inclusive or exclusive total.

    This both speeds up the search, by removing the duplicates, and reduces memory requirements, by not storing them.
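    A minimal sketch of that load-and-count scheme (hypothetical code, assuming the input arrives sorted; the names are mine):

```perl
use strict;
use warnings;

# Load sorted input into a uniques-only array plus a count hash.
my @sorted = qw(a a b b c c c d e e e);
my (@uniq, %count);
for my $item (@sorted) {
    push @uniq, $item unless @uniq && $uniq[-1] eq $item;
    $count{$item}++;
}

# First index in @$aref whose element is >= $key.
sub lower_bound {
    my ($key, $aref) = @_;
    my ($lo, $hi) = (0, scalar @$aref);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($aref->[$mid] lt $key) { $lo = $mid + 1 }
        else                       { $hi = $mid }
    }
    return $lo;
}

# Inclusive count of original elements between $beg and $end:
# binary search on the (smaller) uniques array, then sum hash counts.
my ($beg, $end) = ('b', 'd');
my $i     = lower_bound($beg, \@uniq);
my $total = 0;
$total += $count{ $uniq[$i++] } while $i < @uniq && $uniq[$i] le $end;
print "$total\n";    # b b c c c d -> 6
```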


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      You are tackling the wrong problem.

      You are certainly entitled to your opinion. This is now a solved problem. I was being lazy yesterday and didn't give a full description of what I was doing. I was halfway through a response to your inquiry about sorted data when I accidentally closed the browser instead of a tab. I will explain the entire problem to you now.

      I have two tables in a database - Tbl_A and Tbl_B. Both tables have a column called event_date, which is a date type. Tbl_A contains between 1 and 5 million rows; rows are added or removed but never modified. Tbl_B contains around 600K rows, with adds, modifies and deletes. Whenever there is a change to Tbl_B, a very expensive stored procedure is applied to every row in Tbl_A, looking for rows similar to the row in Tbl_B that was added, modified or deleted. At the end of this expensive stored procedure is a comparison of the event_date: rows are discarded, regardless of how similar they are, if the event_date in Tbl_A is not within +/- 1 year of the event_date in Tbl_B.

      It has been accepted as a given that the scenario described above is absolute. This is primarily because the stored procedure is a COTS-provided compiled C library with only 2 APIs. No dynamic-view trickery or the like could be used. Since it was an absolute, no one had investigated what the savings would be if it could be changed. That's what I set out to do.

      I exported the event_date from Tbl_A to represent a snapshot in time. Unless formatted, the date comes out as 'MM/DD/YYYY' so during the export process, I formatted the date and sorted it - so to answer your previous question, it came to me sorted. I then exported the event_date from Tbl_B. Here, I didn't sort it but I did format it and selected event_date, event_date - 1 year, event_date + 1 year.

      The final activity was to iterate over each row in Tbl_B to find out how many rows in Tbl_A were within the acceptable date range and provide min, max, ave, std statistics. Armed with this information, I could make a recommendation to the powers that be to petition the vendor to provide a new API (at a cost I am sure) to allow a pre-filter before running the expensive stored procedure.

      This was a 1 time activity and I was being very lazy about it. If this were something that needed to be done regularly to gather statistics over a period of time, I would have put more thought into it. I have already moved on to other things.

      Cheers - L~R

        This was a 1 time activity...

        Seems a simple linear search would have been sufficient then.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      Reduce memory yes, speed-up depends.

      If he needs the number of different(!) elements, there would be a tremendous speed-up but the hash would be superfluous.

      If he needs the total number of elements, he would have to sum up the counts of all the different elements between $beg and $end instead of just subtracting the position of $beg from that of $end. Depending on the data set, this additional step could eat away any savings from performing a few steps more in the binary search.

      To be precise: to get a speed-up, 2*log2(average number of duplicates) must be greater than the average number of different elements in a search range.

        Depending on the data set this additional step could eat away any savings from performing a few steps more in the binary search.

        Sorry, but that would only be true if a binary search worked on data with duplicates. It doesn't.

        So you have to factor in the additional complexity of the basic search plus the steps required to locate the appropriate (upper or lower) boundary of the run of duplicates.

        I don't believe it is possible to code a search over sorted data with duplicates that comes even close to O(log N). Even in theory. And in practical implementations, it'd be far slower.

        Feel free to prove me wrong :)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Modified Binary Search
by salva (Canon) on Jan 13, 2010 at 21:37 UTC
    Build a new array containing the indexes of the elements where sequences of equal strings start. For instance:
    @a = qw(a a b b c c c d e e e e e f f);
    #  ix:  0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
    #       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
    #start: * - * - * - - * * - - - - * -
    @start = (0, 2, 4, 7, 8, 13);
    Then just perform the binary search over @start, using $a[$start[$ix]] as the search key.

    Once you find the index $ix corresponding to the searched element, the start and end offsets will be $start[$ix] and $start[$ix+1] - 1
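    A sketch of that structure (illustrative code; the sub name is made up):

```perl
use strict;
use warnings;

# @start holds the index in @a where each run of equal strings begins;
# binary searches run over @start instead of @a.
my @a     = qw(a a b b c c c d e e e e e f f);
my @start = (0, grep { $a[$_] ne $a[$_ - 1] } 1 .. $#a);
# @start is now (0, 2, 4, 7, 8, 13)

# First position in @start whose run key is >= $key.
sub run_lower_bound {
    my ($key, $a, $start) = @_;
    my ($lo, $hi) = (0, scalar @$start);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($a->[ $start->[$mid] ] lt $key) { $lo = $mid + 1 }
        else                                { $hi = $mid }
    }
    return $lo;
}

my $ix = run_lower_bound('e', \@a, \@start);
my ($first, $last) = ($start[$ix], ($start[$ix + 1] // scalar @a) - 1);
print "e spans indexes $first..$last\n";    # 8..12
```

    The `// scalar @a` fallback covers the last run, which has no successor in @start.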

      That makes an O(log N) algorithm into O(N). Where 10 checks were made before, you'd now make 1,000,000.
        That makes an O(log N) algorithm into O(N).

        Obviously it does not!

        As with any binary search algorithm, you have a previous setup step where you build the array - sorting something, loading it from disk, whatever. This operation is O(N) in the best case and usually O(N log N) because of the sorting.

        Once that array is built, you can perform binary searches over it at O(log N) cost. In order to amortize the cost of building the array, the number of binary searches has to be high (otherwise, there are better algorithms).

        The algorithm I have proposed increases the cost of the setup step but does not change its complexity, because it was already O(N) at best.

        Also, if the number of repeated strings is high, @start will be considerably smaller than @a, and so fewer operations will be performed per binary search. That means that if the number of binary searches to be carried out is high enough, my algorithm will actually be faster than any other attacking @a directly.

        ... and some benchmarking code, just to prove it is not worse:
        #!/usr/bin/perl

        use strict;
        use warnings;

        use File::Slurp qw(slurp);
        use Benchmark qw(cmpthese);

        sub binary_search {
            my ($str, $a) = @_;
            my $l = 0;
            my $h = @$a;
            while ($l < $h) {
                my $p = int(($l + $h) / 2);
                if ($a->[$p] lt $str) {
                    $l = $p + 1;
                }
                else {
                    $h = $p;
                }
            }
            $l
        }

        sub make_start {
            my $a = shift;
            my $last = $a->[0];
            my @start = (0);
            for my $ix (1..$#$a) {
                my $current = $a->[$ix];
                if ($current ne $last) {
                    push @start, $ix;
                    $last = $current;
                }
            }
            return \@start;
        }

        chomp (my @words = slurp '/usr/share/dict/words');
        @words = grep /^\w+$/, @words;

        for my $size (100, 1000, 100_000, 1_000_000) {
            for my $dups (3, 10, 100) {
                next unless $size > $dups;
                for my $reps (100, 100_000, 1_000_000) {
                    print "size: $size, dups: $dups, reps: $reps\n";

                    # generate data:
                    my @a = map $words[rand @words], 1..1+($size/$dups);
                    push @a, $a[rand @a] while @a < $size;
                    @a = sort @a;

                    cmpthese(-30, {
                        naive => sub {
                            my $ix;
                            $ix = binary_search($a[rand @a], \@a) for (1..$reps);
                        },
                        salva => sub {
                            my $ix;
                            my $start = make_start(\@a);
                            my @a_start = @a[@$start];
                            $ix = $start->[binary_search($a[rand @a], \@a_start)] for (1..$reps);
                        }
                    });
                    print "\n";
                }
            }
        }
        The parameters in the benchmarks are:
        • $size: the size of the array
        • $dups: average number of times any string is repeated in the array
        • $reps: number of binary searches to perform over one given array.

        Note also that this code only looks for the lowest index at which a given string is found. The case described by the OP, where he also needs the highest index, is handled trivially in my algorithm without increasing its computational cost, but would require an additional binary search with the naive algorithm.

        Here are the results I have gotten on my machine: