Re^4: Catching Cheaters and Saving Memory
by BrowserUk (Patriarch) on Oct 13, 2006 at 19:29 UTC ( [id://578207] )
"If your 12 terabyte estimate comes from Re: Catching Cheaters and Saving Memory, ..."

No. I hadn't seen your post at that point. It came from the view itself, which contains 3 4-byte fields:

    uid INTEGER, voted_for INTEGER, count INTEGER

So a full cross-product would be ~1e6 * 1e6 * (3 * 4 bytes) = 12 terabytes.
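A quick back-of-envelope check of that figure (the ~1e6 counts for uid and voted_for are the thread's working assumption, not measured numbers):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Back-of-envelope check of the estimate above. The ~1e6 counts for
    # uid and voted_for are assumptions from the thread, not measurements.
    my $rows      = 1e6 * 1e6;    # candidate rows in the full cross-product
    my $row_bytes = 3 * 4;        # three 4-byte INTEGER fields per row
    my $total     = $rows * $row_bytes;

    printf "%.1f TB (%.1f TiB)\n", $total / 1e12, $total / 2**40;
    # prints: 12.0 TB (10.9 TiB)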
The where clause will limit the number of rows appearing in the final view, but along the way, as you point out, most of that 12 TB has to be generated.

You're right that, depending how well spec'd the server is for memory, it would be possible to perform much of the reduction in RAM; but on a typical Intel box with a memory limit of 2 or 3 GB, there would be a substantial cost in repeated effort doing it in 2 GB chunks, as each chunk will require another pass over the raw data. It will depend upon how the server is configured, but on smaller hardware I could well see the server attempting to trade disk space for extra passes.

Either way, it would require some very seriously specified (expensive) hardware to be able to run that create view in anything like a reasonable amount of time. Your rent-a-cluster idea would probably be the only option for many small and medium-sized businesses.
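To put a very rough number on that repeated-pass cost: if the set of distinct (uid, voted_for) pairs being counted won't fit in memory, each chunk of the key space means another full read of the raw vote data. Every figure below is a guess picked only to illustrate the shape of the problem, not a measurement:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX qw( ceil );

    # Illustrative only: all of these numbers are assumptions.
    my $distinct_pairs = 1e10;         # distinct (uid, voted_for) pairs actually seen
    my $bytes_per_pair = 50;           # rough in-memory cost of counting one pair
    my $usable_ram     = 2 * 2**30;    # ~2 GB of working memory
    my $pairs_per_pass = int( $usable_ram / $bytes_per_pair );
    my $passes         = ceil( $distinct_pairs / $pairs_per_pass );

    printf "~%d pairs per pass => ~%d full passes over the raw data\n",
        $pairs_per_pass, $passes;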
"That said, databases aren't magic, and this problem is one which would show how non-magical they are."

Sit down and brace yourself. We agree :)

Probably my biggest problem with SQL is that it allows people to write apparently simple and straightforward solutions to problems, without being made aware of the scale of the task they encapsulate. Of course, that's good for DB vendors, cos when the query takes forever to run, and you call them for advice, they get to (try to) upgrade you to a distributed solution, for which they charge even bigger bucks.

Years ago, before the dominance of the RDBMS, I trialed a new product from DEC called (from memory) DataEase. It was a TUI (terminal UI) build-a-query-by-example thing, backed by (I think) a hierarchical DB. One of several nice things about it was that when you'd built your query, it went through various steps prompting you to re-order it for efficiency. As a part of that process, it would display a guesstimate of how long the query would take to run. One memorable example of using this (that I think I've mentioned here before) was a query that joined over 7 tables. By simply reversing the order of the joins, the runtime fell from an estimated 69 hours to 5 or 6. The difference was that the new ordering meant the data on disk was processed mostly sequentially, rather than leaping all over the disk pack.

Modern optimisers probably do this kind of thing automatically, but in some ways that is a shame. Having the numbers shown to you, and seeing the effect that various apparently minor changes had upon them, was a salient learning experience.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.