http://www.perlmonks.org?node_id=817503


in reply to Re^3: Modified Binary Search
in thread Modified Binary Search

BrowserUk,
That's the way I had it coded originally but when it wasn't complete when I came back from lunch - I wasn't happy to continue waiting. Performing a linear search through 5 million records 600,000 times still takes longer than I wanted.

Cheers - L~R

Replies are listed 'Best First'.
Re^5: Modified Binary Search
by BrowserUk (Patriarch) on Jan 15, 2010 at 08:01 UTC

    Hm. You have a big list and a small list of dates, both sorted (or readily sortable).

    And for each date in the small list, you want to find the subset of the big list that are 1 year either side.

    And you search anew for each item in the small list?

    Here's a one-pass algorithm:

    my @bigSorted = ...; my @smallSorted = ...; my %ranges; my( $lo, $hi ) = ( 0, 0 ); for ( @smallSorted ) { my( $early, $late ) = calcBoundaryDates( $_ ); ++$lo while $bigList[ $lo ] lt $early; $hi //= $lo; ++$hi while $bigList[ $hi ] le $late; $ranges{ $_ } = $hi - $lo; }

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      That works fine if you're doing it as a one-time deal for all elements of Tbl_B. LR's problem description seems to say that the search needs to be done for the modified elements each time Tbl_B is modified. Since the above technique does linear search, it's not going to be optimal.

        LR's problem description seems to say ...

        Hm. That's not how I read it, nor do I see how you could possibly derive that interpretation from the OP?

        Besides Limbic~Region 'imself seemed to think it was okay and he is not exactly known for accepting incomplete solutions.

        Since the above technique does linear search, it's not going to be optimal.

        Perhaps what he saw that you missed, is that whilst it is a linear search, it is a linear search once over the 5 million. Ie. O(5E6).

        Whereas, if you use a binary search, you'd do 600,000 * 5e6 * log 5e6 = O(20,096,910,013,008)

        600,000 * log 5e6 = O(48,494,850)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        jdporter,
        You and BrowserUk are both correct. In the real world problem, anytime there was a change the process had to begin anew. In my "snapshot in time" use case, I only had to run it once. My goal was to demonstrate that changing the behavior of the vendor provided C compiled stored procedure was to our advantage. Everyone was convinced but after a year, I am still waiting to see the change.

        Cheers - L~R