Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re^2: Modified Binary Search

by Limbic~Region (Chancellor)
on Jan 14, 2010 at 14:46 UTC ( #817416=note: print w/replies, xml ) Need Help??


in reply to Re: Modified Binary Search
in thread Modified Binary Search

BrowserUk,
You are tackling the wrong problem.

You are certainly entitled to your opinion. This is now a solved problem. I was being lazy yesterday and didn't give a full description of what I was doing. I was half way through a response to your inquiry about sorted data when I accidently closed the browser instead of a tab. I will explain the entire problem to you now.

I have two tables in a database - Tbl_A and Tbl_B. Both tables have a column called event_date which is a date type. Tbl_A contains between 1 and 5 million rows and rows are added or removed but never modified. Tbl_B contains around 600K rows with adds, modifies and deletes. Whenever there is a change to Tbl_B, a very expensive stored procedure is applied to every row in Tbl_A looking for similar rows to the row in Tbl_B that was added, modified or deleted. At the end of this expensive stored procedure is a comparison of the event_date and rows are discarded regardless of how similar if the event_date in Tbl_A is not within +/- 1 year of the event_date in Tbl_B.

It has been accepted as a given that the scenario described above is absolute. This is primarily because the stored procedure is a COTS provided C compiled library with only 2 APIs. No dynamic view trickery or the like could be used. Since it was an absolute, no one had investigated to see what the savings would be if it could be changed. That's what I set out to do.

I exported the event_date from Tbl_A to represent a snapshot in time. Unless formatted, the date comes out as 'MM/DD/YYYY' so during the export process, I formatted the date and sorted it - so to answer your previous question, it came to me sorted. I then exported the event_date from Tbl_B. Here, I didn't sort it but I did format it and selected event_date, event_date - 1 year, event_date + 1 year.

The final activity was to iterate over each row in Tbl_B to find out how many rows in Tbl_A were within the acceptable date range and provide min, max, ave, std statistics. Armed with this information, I could make a recommendation to the powers that be to petition the vendor to provide a new API (at a cost I am sure) to allow a pre-filter before running the expensive stored procedure.

This was a 1 time activity and I was being very lazy about it. If this were something that needed to be done regularly to gather statistics over a period of time, I would have put more thought into it. I have already moved on to other things.

Cheers - L~R

Replies are listed 'Best First'.
Re^3: Modified Binary Search
by BrowserUk (Pope) on Jan 14, 2010 at 18:14 UTC
    This was a 1 time activity...

    Seems a simple linear search would have been sufficient then.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      That's the way I had it coded originally but when it wasn't complete when I came back from lunch - I wasn't happy to continue waiting. Performing a linear search through 5 million records 600,000 times still takes longer than I wanted.

      Cheers - L~R

        Hm. You have a big list and a small list of dates, both sorted (or readily sortable).

        And for each date in the small list, you want to find the subset of the big list that are 1 year either side.

        And you search anew for each item in the small list?

        Here's a one-pass algorithm:

        my @bigSorted = ...; my @smallSorted = ...; my %ranges; my( $lo, $hi ) = ( 0, 0 ); for ( @smallSorted ) { my( $early, $late ) = calcBoundaryDates( $_ ); ++$lo while $bigList[ $lo ] lt $early; $hi //= $lo; ++$hi while $bigList[ $hi ] le $late; $ranges{ $_ } = $hi - $lo; }

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://817416]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (4)
As of 2021-06-17 18:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    What does the "s" stand for in "perls"? (Whence perls)












    Results (84 votes). Check out past polls.

    Notices?