Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Efficient Data Comparison

by Anonymous Monk
on Dec 20, 2005 at 02:51 UTC ( [id://517948]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have five sql tables with about 5,000 rows of data each. Each individual table represents a variable and the rows hold points in time (days) and a specific data point for that particular point in time. I need to create a script that will compare the differences in data points for all specific points in time across the five variables. For example for the date 1/1/2005 I need to compare the data points from table 1, table 2, table 3, table 4, and table 5. I need to do this for all the individual points in time (about 5,000).

I can think of two ways of going about this:
1. Create a simple loop that would iterate through the days, pull the data from each sql table, compare it, and set the appropriate flag depending on the results. Downside of this is that I would be making 25,000 (5,000*5) individual queries to my database.

2. Input all of the data from each sql table into hash structures upfront. Then create a loop that iterates through the days and compares the data. (i.e. if $tableone{'1/1/2005'} > $tabletwo{'1/1/2005'} etc...) Advantage here is that I would only have to poll the sql table five times. Downside is that I would be storing a lot of information in the hash structure and I'm not sure what the max available space is.

I would appreciate any commentary on which of the two methodologies is better / more efficient or if there is a way of going about this that I have not thought of.

Thanks.

Replies are listed 'Best First'.
Re: Efficient Data Comparison
by sk (Curate) on Dec 20, 2005 at 03:17 UTC
    So you have 5 tables each with 2 columns. First column is date and the second one is a variable. Why can't you just join the tables and get one large table with 5 different variables and one date?.

    After you are done with that - you said you want to compare values across these 5 variables. Do you want to compare all rows or do you know how many rows you would want? If you know the dates upfront just pull those dates in a where clause (can be done in the first step also). Once you get the required rows you would retrieve the results row by row and write out your results using the script. (if it is all rows then just do it row by row for the entire table). In all cases there is only one SQL call

    I could be missing something important here, hopefully this points you in some good direction!

    cheers

    SK

Re: Efficient Data Comparison
by McDarren (Abbot) on Dec 20, 2005 at 03:22 UTC
    hmm, I'm quite curious to see what people have to say about this. My feeling is that the 2nd option would be much more efficient - as it's not really a huge amount of data - so available memory won't be an issue.

    One comment though - it appears that your tables aren't very well designed. If you have the exact same set of dates in each of the five tables, then you probably should consider getting rid of that redundancy and consolidating the lot into a single table. One field for the dates, and one field for each of your five "variables".

    Cheers,
    Darren :)

Re: Efficient Data Comparison
by CountOrlok (Friar) on Dec 20, 2005 at 03:44 UTC
    This isn't so much a Perl question, I'd say.

    Anyway, if you know the date ranges are exactly the same on all tables, but don't know the ranges, you only need to run two queries:

    1) select all dates from table 1.
    2) and as you loop through all the dates, select rows from all tables where date = the date in the row you're on in query 1.

    In other words, normalize the tables!

    -imran
Re: Efficient Data Comparison
by dokkeldepper (Friar) on Dec 20, 2005 at 10:07 UTC

    The best choice would be to let the database do it for you. Consequently, I agree with my predecessors, that this is merely a sql related problem.

    The join

    select <do sth useful here> from table a inner join table b on a.date<b.date inner join table c on b.date<c.date

    (and so on) will do it for you.

    Because your tables are small it should be quite fast. An iterative perl solution would be possibly less efficient, not to mention the required connection overhead. If you have to do it regularly create a view. In case your database has a join problem google for 'correlated subqueries'.

    I use perl in such circumstances for automatic creation of these annoying statments.

Re: Efficient Data Comparison
by TedPride (Priest) on Dec 20, 2005 at 09:47 UTC
    This is basically a choice between storing all the data in memory at once, or running thousands of iterations of a query. Even if you merge the tables into 1 table, you'll still have to sort the results by date, which means storing everything in memory at one point or another (unless items are submitted in date order). About the only thing you can choose is whether to use SQL for the sorting or Perl. SQL is more efficient in terms of memory.

    I'd personally use 1 table with fields for date, variable ID, and data point, instead of 5 different tables with fields for date and data point. I'm assuming the data points can all be put in the same field type, since you want to compare them. Then just query for everything sorted by date and process the results.

Re: Efficient Data Comparison
by TomDLux (Vicar) on Dec 20, 2005 at 15:01 UTC

    I hate to deflate your ego, but you're dealing with a tiny amount of data, assuming a modern machine ( less than 5 years old ) equipped with modern main memory ( 256 MB? 512 MB? 1 GB? ).

    25,000 records, at (pessimistically) 100 bytes per value, takes up 2.5 MB, or 1% or a small memory space.

    Some people read multi-megabyte log files into a hash for post-processing. You should have no problems at all.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      Thanks for all the responses. I actually need to compare data points not only across variables but across periods of time as well.

      For example: Divide the data point for variable x on 1/1/2005 by the data point for variable x on 1/30/2005 and compare that to the same metric for variable y. I need to iterate through this algorithm for each day in the database.

      I will therefore need to essentially constantly be polling the database for the datapoints associated with a particular day. I'm thinking that it is going to be much more efficient to load it all into hash structures at the outset and then just run a loop on those.

      Do I need to have a separate hash structure setup for each variable (i.e. %variableone, %variabletwo, etc.) in order to access the information? (code example: $datapoint = $variableone{'1/20/05'). Is there some way I could setup a matrix like data structure in perl that would mimic an sql structure (code example: $datapoint = $data{'1/20/05'}->variableone )

      Thanks.
        It isn't necessary to keep a separate hash variable. That's your choice. They can be in one hash ref, e.g. $href->{'01/01/2005'}->{"variable1"} = $val1 or $href->{'01/30/2005'}->{"variable2"} = $val2

        -imran

Re: Efficient Data Comparison
by SamCG (Hermit) on Dec 22, 2005 at 15:03 UTC
    Just a general note here, but the primary goal of databases isn't always purely efficiency in dealing with data (particularly if you include speed in the definition of efficiency). Databases deal with a lot of overhead (security, integrity, data locking, etc.) that you may (or may not) need.

    As far as hashes, I believe the docs indicate that they're limited only by PC memory, and that the structure of the hash will keep access time even in large hashes quick (I swear I recall the quote "a large hash is nothing to fear" from Learning Perl). I think your second option is probably going to be much quicker.
Re: Efficient Data Comparison
by injunjoel (Priest) on Dec 21, 2005 at 00:08 UTC
    Greetings,
    Two quick Module suggestions:
    1. Class::DBI
    2. Tie::DBI


    -InjunJoel
    "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://517948]
Approved by GrandFather
Front-paged by neversaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others examining the Monastery: (5)
As of 2024-04-19 04:41 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found