Re: Modified Binary Search

by BrowserUk (Pope)
on Jan 14, 2010 at 09:22 UTC ( #817362=note )


in reply to Modified Binary Search

You are tackling the wrong problem.

Wherever you get this data from, be it loaded sorted or generated internally, you should not be storing it as a flat array with duplicates.

You should load a parallel pair of structures: an array and a hash. Push one copy of each distinct element onto the array (in load order, if the input or derivation is sorted), and increment the hash value keyed by that element for every occurrence.

You can then use binary searches on the uniques array to locate the bounds, and iterate between those bounds summing the counts in the hash to arrive at the inclusive or exclusive total.

This both speeds up the search, by removing the duplicates, and reduces memory requirements, by not storing them.
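
A minimal sketch of that structure, assuming the values arrive pre-sorted on STDIN and compare correctly as strings (the sub and variable names are only illustrative):

#!/usr/bin/perl
use strict;
use warnings;

# Parallel pair: @uniq holds each distinct value once (still in sorted
# order); %count holds how many times each value occurred.
my ( @uniq, %count );
while ( my $val = <STDIN> ) {
    chomp $val;
    push @uniq, $val unless $count{$val}++;
}

# Classic binary search: first index in @$aref whose element is >= $target.
sub lower_bound {
    my ( $target, $aref ) = @_;
    my ( $lo, $hi ) = ( 0, scalar @$aref );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        if ( $aref->[$mid] lt $target ) { $lo = $mid + 1 }
        else                            { $hi = $mid }
    }
    return $lo;
}

# Inclusive total of occurrences in [$from, $to]: binary-search the lower
# bound in the uniques array, then walk forward summing the hash counts.
sub total_in_range {
    my ( $from, $to ) = @_;
    my $total = 0;
    for ( my $i = lower_bound( $from, \@uniq );
          $i < @uniq && $uniq[$i] le $to;
          $i++ )
    {
        $total += $count{ $uniq[$i] };
    }
    return $total;
}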


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Modified Binary Search
by Limbic~Region (Chancellor) on Jan 14, 2010 at 14:46 UTC
    BrowserUk,
    You are tackling the wrong problem.

    You are certainly entitled to your opinion. This is now a solved problem. I was being lazy yesterday and didn't give a full description of what I was doing. I was halfway through a response to your inquiry about sorted data when I accidentally closed the browser instead of a tab. I will explain the entire problem to you now.

    I have two tables in a database - Tbl_A and Tbl_B. Both tables have a column called event_date, which is a date type. Tbl_A contains between 1 and 5 million rows; rows are added or removed but never modified. Tbl_B contains around 600K rows, with adds, modifies and deletes. Whenever there is a change to Tbl_B, a very expensive stored procedure is applied to every row in Tbl_A, looking for rows similar to the row in Tbl_B that was added, modified or deleted. At the end of this expensive stored procedure is a comparison of the event_date: rows are discarded, regardless of how similar they are, if the event_date in Tbl_A is not within +/- 1 year of the event_date in Tbl_B.

    It has been accepted as a given that the scenario described above is absolute, primarily because the stored procedure is a COTS-provided compiled C library with only 2 APIs. No dynamic view trickery or the like could be used. Since it was treated as an absolute, no one had investigated what the savings would be if it could be changed. That's what I set out to do.

    I exported the event_date from Tbl_A to represent a snapshot in time. Unless formatted, the date comes out as 'MM/DD/YYYY', so during the export process I formatted the date and sorted it - so, to answer your previous question, it came to me sorted. I then exported the event_date from Tbl_B. Here I didn't sort it, but I did format it and selected event_date, event_date - 1 year, and event_date + 1 year.

    The final activity was to iterate over each row in Tbl_B to find out how many rows in Tbl_A were within the acceptable date range, and to produce min, max, average and standard-deviation statistics. Armed with this information, I could make a recommendation to the powers that be to petition the vendor to provide a new API (at a cost, I am sure) that allows a pre-filter before running the expensive stored procedure.
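
    That counting step can be done with two binary searches per Tbl_B row. A rough sketch, assuming the exported Tbl_A dates are sorted and reformatted so that plain string comparison orders them correctly (the sub names are made up):

    # How many dates in the sorted @$dates array fall inside [$lo, $hi]?
    # Answer: (index of first element > $hi) minus (index of first element >= $lo).
    sub count_in_window {
        my ( $lo, $hi, $dates ) = @_;
        return first_past( $dates, $hi, 'gt' ) - first_past( $dates, $lo, 'ge' );
    }

    # Binary search for the index of the first element that is > $target
    # (mode 'gt') or >= $target (mode 'ge'); returns scalar @$dates if none is.
    sub first_past {
        my ( $dates, $target, $mode ) = @_;
        my ( $l, $r ) = ( 0, scalar @$dates );
        while ( $l < $r ) {
            my $m = int( ( $l + $r ) / 2 );
            my $keep_right = $mode eq 'gt' ? $dates->[$m] le $target
                                           : $dates->[$m] lt $target;
            if ($keep_right) { $l = $m + 1 }
            else             { $r = $m }
        }
        return $l;
    }

    Feeding every Tbl_B window through count_in_window yields the ~600K per-row counts, which can then be summarised into the min/max/average/std figures in a single pass.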

    This was a 1 time activity and I was being very lazy about it. If this were something that needed to be done regularly to gather statistics over a period of time, I would have put more thought into it. I have already moved on to other things.

    Cheers - L~R

      This was a 1 time activity...

      Seems a simple linear search would have been sufficient then.


        BrowserUk,
        That's the way I had it coded originally, but when it still wasn't finished by the time I came back from lunch, I wasn't happy to keep waiting. Performing a linear search through 5 million records 600,000 times still takes longer than I wanted.

        Cheers - L~R

Re^2: Modified Binary Search
by jethro (Monsignor) on Jan 14, 2010 at 10:00 UTC
    Reduced memory, yes; the speed-up depends.

    If he needs the number of different(!) elements, there would be a tremendous speed-up, but the hash would be superfluous.

    If he needs the number of elements, he would have to sum up the counts of all distinct elements between $beg and $end instead of just subtracting $beg from $end. Depending on the data set, this additional step could eat up whatever the shorter binary search saves.

    To be precise: you only get a speed-up if 2*log2(average number of duplicates) > (average number of distinct elements spanned by a search).
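
    To put (made-up) numbers on that: with an average of 7 duplicates per value, 2*log2(7) is roughly 5.6, so the de-duplicated structure only wins when a typical search spans about five or fewer distinct values; a range covering hundreds of distinct values would spend more time summing counts than the shorter binary search saves.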

      Depending on the data set, this additional step could eat up whatever the shorter binary search saves.

      Sorry, but that would only be true if a binary search worked on data with duplicates. It doesn't.

      So you have to factor in the additional complexity of the basic search plus the steps required to locate the appropriate (upper or lower) boundary of the run of duplicates.

      I don't believe it is possible to code a search over sorted data with duplicates that comes even close to O(log N). Even in theory. And in practical implementations, it'd be far slower.

      Feel free to prove me wrong :)


        I don't believe it is possible to code a search over sorted data with duplicates that comes even close to O(log N). Even in theory.
        You're wrong. You can easily find the smallest index containing your target value by using a condition like:
        return $i if $A[$i] == $target && ($i == 0 || $A[$i-1] < $target);
        instead of the usual
        return $i if $A[$i] == $target;
        And you find the highest index by using:
        return $i if $A[$i] == $target && ($i == $#A || $A[$i+1] > $target);
        It doesn't increase the run-time complexity - the worst case of a binary search is when no match is found anyway.
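
        For concreteness, here is one way that lowest-index condition can sit inside an otherwise ordinary binary search - only a sketch, with a made-up sub name and a numerically sorted array reference assumed:

        sub find_lowest {
            my ( $target, $A ) = @_;
            my ( $lo, $hi ) = ( 0, $#$A );
            while ( $lo <= $hi ) {
                my $i = int( ( $lo + $hi ) / 2 );
                # the condition above: a match whose left neighbour is smaller (or absent)
                return $i if $A->[$i] == $target
                          && ( $i == 0 || $A->[$i - 1] < $target );
                if ( $A->[$i] < $target ) {
                    $lo = $i + 1;    # the lowest match, if any, lies to the right
                }
                else {
                    $hi = $i - 1;    # equal or bigger: the lowest match is at $i - 1 or earlier
                }
            }
            return -1;               # $target not present
        }

        The highest-index search is the mirror image: use the second condition and move $lo up ($lo = $i + 1) whenever $A->[$i] <= $target.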

        Ok, here is the proof. I programmed the algorithm as described in my first post:

        #!/usr/bin/perl
        use 5.10.0;
        use strict;
        use warnings;
        use Data::Dumper;

        my @data;
        for ('aa'..'lz') {
            push @data, ($_) x 7;
        }
        my $size= scalar(@data);
        say 'array size is ',$size,', log2 of ',$size,' is ',int(log($size)/log(2));

        for ('aa','ab','ce','dn','ea','fr','lb') {
            my ($beg,$iterations)= binary_search($_,@data);
            say "Found first '$_' at position $beg after $iterations iterations";
        }

        #----------------------
        sub binary_search {
            my ($search,@data)= @_;
            return(0) if (@data<2);
            my $beg=0;
            my $end= @data-1;
            my $iter=0;
            while ($beg<$end-1) {
                my $middle= int(($end+$beg)/2);
                if ($data[$middle] lt $search) {
                    $beg=$middle;
                }
                else {
                    $end=$middle;
                }
                $iter++;
            }
            #handle the special case if you are looking for $data[0]
            return($beg,$iter) if ($data[$beg] eq $search);
            return($end,$iter);
        }

        #output:
        array size is 2184, log2 of 2184 is 11
        Found first 'aa' at position 0 after 11 iterations
        Found first 'ab' at position 7 after 11 iterations
        Found first 'ce' at position 392 after 11 iterations
        Found first 'dn' at position 637 after 11 iterations
        Found first 'ea' at position 728 after 11 iterations
        Found first 'fr' at position 1029 after 11 iterations
        Found first 'lb' at position 2009 after 11 iterations

        OK, not a proof in formal language, but good enough for us, I hope. As you can see it is a binary search: it takes exactly the number of iterations predicted to find the items. It is also evident that it finds the first item (see the first two results; all the following results are divisible by 7).

        BrowserUk,
        Since this was a "one and done" script, I didn't really care too much about this being as efficient as possible. I ended up just performing 2 binary searches (to find each end point) and then reverted to a linear search to handle the duplicates.

        Regarding the statement: I don't believe it is possible to code a search over sorted data with duplicates that comes even close to O(log N). Even in theory. And in practical implementations, it'd be far slower.

        Why would the following logic be so much slower in implementation?

        • Given: A sorted list containing duplicates
        • Given: A target value
        • Given: A desired anchor (closest to which endpoint)
        • Find: The closest element to the desired anchor that is equal to or $desired_operator than the target value
        1. Perform a normal binary search to find the target item
          • If not found, check if $list[$mid] < $val to determine if $mid has to be adjusted by one to meet $anchor/$desired_operator - done
          • If found, proceed to step 2
        2. Determine if you are at the "right" (desired) endpoint of the run of duplicates by checking $list[$mid - 1] eq $val or $list[$mid + 1] eq $val
          • If yes, done
          • If no, proceed to step 3
        3. Check to see if this item is even a duplicate by checking $list[$mid - 1] eq $val or $list[$mid + 1] eq $val - whichever one was not checked in step 2
          • If not a duplicate - done
          • If a duplicate - proceed to step 4
        4. Use the following logic to find the boundary of the run of duplicates in the desired direction (see the sketch after this list). For the description, let's say I am trying to find the last matching element. $min = $mid from the previous search and $max = $#list.
          • If $list[$mid] eq $val, $min = $mid
          • If $list[$mid] ne $val, $max = $mid - 1
          • Stop when $max - $min < 2
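
        A minimal sketch of that step-4 boundary search (the sub name is made up; $mid is an index already known to hold $val, and the array is passed by reference):

        sub last_of_run {
            my ( $val, $mid, $list ) = @_;
            my ( $min, $max ) = ( $mid, $#$list );   # $list->[$min] is always eq $val
            while ( $max - $min >= 2 ) {
                my $probe = int( ( $min + $max ) / 2 );
                if ( $list->[$probe] eq $val ) {
                    $min = $probe;                   # still inside the run of duplicates
                }
                else {
                    $max = $probe - 1;               # past the run; pull the ceiling in
                }
            }
            # at most two candidates remain; take the rightmost one that still matches
            $min = $max if $list->[$max] eq $val;
            return $min;
        }

        The "first matching element" direction is the mirror image: start $max at the index from the previous search (a known match), start $min at 0, and shrink toward the left edge of the run.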

        Cheers - L~R
