Efficient Data Comparison

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Efficient Data Comparison by sk (Curate) on Dec 20, 2005 at 03:17 UTC
So you have 5 tables each with 2 columns. First column is date and the second one is a variable. Why can't you just join the tables and get one large table with 5 different variables and one date?. After you are done with that - you said you want to compare values across these 5 variables. Do you want to compare all rows or do you know how many rows you would want? If you know the dates upfront just pull those dates in a where clause (can be done in the first step also). Once you get the required rows you would retrieve the results row by row and write out your results using the script. (if it is all rows then just do it row by row for the entire table). In all cases there is only one SQL call I could be missing something important here, hopefully this points you in some good direction! cheers SK	[reply]
Re: Efficient Data Comparison by McDarren (Abbot) on Dec 20, 2005 at 03:22 UTC
hmm, I'm quite curious to see what people have to say about this. My feeling is that the 2nd option would be much more efficient - as it's not really a huge amount of data - so available memory won't be an issue. One comment though - it appears that your tables aren't very well designed. If you have the exact same set of dates in each of the five tables, then you probably should consider getting rid of that redundancy and consolidating the lot into a single table. One field for the dates, and one field for each of your five "variables". Cheers, Darren :)	[reply]
Re: Efficient Data Comparison by CountOrlok (Friar) on Dec 20, 2005 at 03:44 UTC
This isn't so much a Perl question, I'd say. Anyway, if you know the date ranges are exactly the same on all tables, but don't know the ranges, you only need to run two queries: 1) select all dates from table 1. 2) and as you loop through all the dates, select rows from all tables where date = the date in the row you're on in query 1. In other words, normalize the tables! -imran	[reply]
Re: Efficient Data Comparison by dokkeldepper (Friar) on Dec 20, 2005 at 10:07 UTC
The best choice would be to let the database do it for you. Consequently, I agree with my predecessors, that this is merely a sql related problem. The join `select <do sth useful here> from table a inner join table b on a.date<b.date inner join table c on b.date<c.date` [download] (and so on) will do it for you. Because your tables are small it should be quite fast. An iterative perl solution would be possibly less efficient, not to mention the required connection overhead. If you have to do it regularly create a view. In case your database has a join problem google for 'correlated subqueries'. I use perl in such circumstances for automatic creation of these annoying statments.	[reply] [d/l]
Re: Efficient Data Comparison by TedPride (Priest) on Dec 20, 2005 at 09:47 UTC
This is basically a choice between storing all the data in memory at once, or running thousands of iterations of a query. Even if you merge the tables into 1 table, you'll still have to sort the results by date, which means storing everything in memory at one point or another (unless items are submitted in date order). About the only thing you can choose is whether to use SQL for the sorting or Perl. SQL is more efficient in terms of memory. I'd personally use 1 table with fields for date, variable ID, and data point, instead of 5 different tables with fields for date and data point. I'm assuming the data points can all be put in the same field type, since you want to compare them. Then just query for everything sorted by date and process the results.	[reply]
Re: Efficient Data Comparison by TomDLux (Vicar) on Dec 20, 2005 at 15:01 UTC
I hate to deflate your ego, but you're dealing with a tiny amount of data, assuming a modern machine ( less than 5 years old ) equipped with modern main memory ( 256 MB? 512 MB? 1 GB? ). 25,000 records, at (pessimistically) 100 bytes per value, takes up 2.5 MB, or 1% or a small memory space. Some people read multi-megabyte log files into a hash for post-processing. You should have no problems at all. -- `TTTATCGGTCGTTATATAGATGTTTGCA`	[reply]
Re^2: Efficient Data Comparison by Anonymous Monk on Dec 20, 2005 at 20:14 UTC
Thanks for all the responses. I actually need to compare data points not only across variables but across periods of time as well. For example: Divide the data point for variable x on 1/1/2005 by the data point for variable x on 1/30/2005 and compare that to the same metric for variable y. I need to iterate through this algorithm for each day in the database. I will therefore need to essentially constantly be polling the database for the datapoints associated with a particular day. I'm thinking that it is going to be much more efficient to load it all into hash structures at the outset and then just run a loop on those. Do I need to have a separate hash structure setup for each variable (i.e. %variableone, %variabletwo, etc.) in order to access the information? (code example: $datapoint = $variableone{'1/20/05'). Is there some way I could setup a matrix like data structure in perl that would mimic an sql structure (code example: $datapoint = $data{'1/20/05'}->variableone ) Thanks.	[reply]
Re^3: Efficient Data Comparison by CountOrlok (Friar) on Dec 20, 2005 at 23:49 UTC
It isn't necessary to keep a separate hash variable. That's your choice. They can be in one hash ref, e.g. $href->{'01/01/2005'}->{"variable1"} = $val1 or $href->{'01/30/2005'}->{"variable2"} = $val2 -imran	[reply]
Re: Efficient Data Comparison by SamCG (Hermit) on Dec 22, 2005 at 15:03 UTC
Just a general note here, but the primary goal of databases isn't always purely efficiency in dealing with data (particularly if you include speed in the definition of efficiency). Databases deal with a lot of overhead (security, integrity, data locking, etc.) that you may (or may not) need. As far as hashes, I believe the docs indicate that they're limited only by PC memory, and that the structure of the hash will keep access time even in large hashes quick (I swear I recall the quote "a large hash is nothing to fear" from Learning Perl). I think your second option is probably going to be much quicker.	[reply]
Re: Efficient Data Comparison by injunjoel (Priest) on Dec 21, 2005 at 00:08 UTC
Greetings, Two quick Module suggestions: Class::DBI Tie::DBI -InjunJoel "I do not feel obliged to believe that the same God who endowed us with sense, reason and intellect has intended us to forego their use." -Galileo	[reply]


Clear questions and runnable code get the best and fastest answer
	PerlMonks