Re^2: Catching Cheaters and Saving Memory

by BrowserUk (Patriarch)
on Oct 13, 2006 at 08:58 UTC ( [id://578088] )


in reply to Re: Catching Cheaters and Saving Memory
in thread Catching Cheaters and Saving Memory

Care to estimate how long that would take to run?

On commodity hardware, I'm guessing (several) days rather than hours.

And that vote_histogram view is going to consume 12 Terabytes minimum. Temporary or not, that's a big chunk of hardware to consume, assuming it is available.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^3: Catching Cheaters and Saving Memory
by tilly (Archbishop) on Oct 13, 2006 at 18:32 UTC
    If your 12 terabyte estimate comes from Re: Catching Cheaters and Saving Memory, then it is an overestimate. The view is likely to be somewhere north of 400 GB. 12 terabytes is an estimate of how much data you need to throw around to sort and group that dataset. But you won't need all 12 terabytes at once; you only need a terabyte or so of it.

    However, a database should do that sorting step somewhat faster than the kind of naive program that I'd expect to see someone write for a one-off. For one thing, the first several sort and group steps can be done in chunks, all in memory. It doesn't need to hit disk until the amount of data you need to throw around exceeds your memory limit. Databases are smart about switching to disk only when they need to. That may double the speed. (I would be unlikely to do that for a one-off because it is a lot of logic to write.)
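
    A minimal sketch of that chunk-and-sort-then-merge idea, in the spirit of the one-off program mentioned above (the file names, chunk size and plain string sort are placeholders; a real database does this internally and far more cleverly):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Sort a file that is too big for memory: sort fixed-size chunks
        # in RAM, spill each sorted chunk to disk, then merge the chunks.
        my $chunk_lines = 1_000_000;    # assumed memory budget, in lines
        my @chunk_files;

        open my $in, '<', 'rawdata.txt' or die $!;
        while (1) {
            my @buf;
            while (@buf < $chunk_lines and defined(my $line = <$in>)) {
                push @buf, $line;
            }
            last unless @buf;
            my $tmp = 'chunk' . scalar(@chunk_files) . '.txt';
            open my $out, '>', $tmp or die $!;
            print {$out} sort @buf;     # in-memory sort of this chunk only
            close $out;
            push @chunk_files, $tmp;
        }
        close $in;

        # Simple k-way merge of the sorted chunk files.
        my @fh   = map { open my $fh, '<', $_ or die $!; $fh } @chunk_files;
        my @head = map { scalar <$_> } @fh;
        open my $merged, '>', 'sorted.txt' or die $!;
        while (grep { defined } @head) {
            my ($i) = sort { $head[$a] cmp $head[$b] }
                      grep { defined $head[$_] } 0 .. $#head;
            print {$merged} $head[$i];
            $head[$i] = readline $fh[$i];
        }
        close $merged;

    The point is only that each chunk is sorted entirely in memory, and the disk is touched once to spill each chunk and once more to merge them.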

    That said, databases aren't magic, and this problem is one which would show how non-magical they are. First of all you're likely to run out of disk if you're using commodity hardware. (Oops.) And as you pointed out elsewhere, pre-filtering your dataset so you can focus on only the people you're going to care about is a far bigger win than anything a database knows how to do.

      If your 12 terabyte estimate comes from Re: Catching Cheaters and Saving Memory, ...

      No. I hadn't seen your post at that point. It came from:

      CREATE VIEW vote_histogram AS
          SELECT   t1.uid        AS uid,
                   t2.uid        AS voted_for,
                   sum(t1.voted) AS count
          FROM     rawdata AS t1, rawdata AS t2
          WHERE    t1.thread_id = t2.thread_id
            AND    t1.uid != t2.uid
          GROUP BY t1.uid, t2.uid
          ORDER BY t1.uid;

      The view contains three 4-byte fields: uid INTEGER, voted_for INTEGER, count INTEGER.

      So a full cross-product would be ~1e6 * 1e6 * (3 * 4 bytes) = 12 Terabytes.
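
      As a quick sanity check of that figure (just the arithmetic above, written out):

          # Worst-case size of the unfiltered cross-product behind the view.
          my $rows  = 1e6 * 1e6;   # every uid paired with every other uid
          my $bytes = 3 * 4;       # three 4-byte INTEGER columns per row
          printf "%.0f TB\n", $rows * $bytes / 1e12;   # prints "12 TB"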

      The where clause will limit the number of rows appearing in the final view, but along the way (as you point out) most of that 12 TB has to be generated. You're right: depending on how well spec'd the server is for memory, it would be possible to perform much of the reduction in RAM, but on a typical Intel box with a memory limit of 2 or 3 GB, there would be a substantial cost in repeated effort doing it in 2 GB chunks, as each chunk would require another pass over the raw data.

      It will depend upon how the server is configured, but on smaller hardware I could well see the server attempting to trade diskspace for extra passes.

      Either way, it would require some very seriously specified (expensive) hardware to run that CREATE VIEW in anything like a reasonable amount of time. Your rent-a-cluster idea would probably be the only option for many small and medium-sized businesses.

      That said, databases aren't magic, and this problem is one which would show how non-magical they are.

      Sit down and brace yourself. We agree :)

      Probably my biggest problem with SQL is that it allows people to write apparently simple and straightforward solutions to problems, without being made aware of the scale of the task they encapsulate. Of course, that's good for DB vendors, 'cos when the query takes forever to run and you call them for advice, they get to (try to) upgrade you to a distributed solution, for which they charge even bigger bucks.

      Years ago, before the dominance of the RDBMS, I trialed a new product from DEC called (from memory) DataEase. It was a TUI (terminal UI) build-a-query-by-example thing, backed by (I think) a hierarchical DB. One of several nice things about it was that when you'd built your query, it went through various steps prompting you to re-order it for efficiency. As a part of that process, it would display a guesstimate of how long the query would take to run.

      One memorable example of using this (that I think I've mentioned here before) was a query that joined over 7 tables. By simply reversing the order of the joins, the runtime fell from an estimated 69 hours to 5 or 6. The difference was that the new ordering meant the data on disk was processed mostly sequentially, rather than leaping all over the diskpack. Modern optimisers probably do this kind of thing automatically, but in some ways that is a shame. Having the numbers shown to you, and seeing the effect that various apparently minor changes had upon them, was a salient learning experience.
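
      Modern optimisers will still show you their working if you ask for it. A minimal DBI sketch of doing so against the rawdata table above (the DSN, credentials and MySQL-style EXPLAIN are assumptions for illustration):

          use strict;
          use warnings;
          use DBI;

          # Ask the optimiser for its plan before committing to the full query.
          # DSN and credentials are placeholders.
          my $dbh = DBI->connect( 'dbi:mysql:database=votes', 'user', 'pass',
                                  { RaiseError => 1 } );

          my $plan = $dbh->selectall_arrayref( q{
              EXPLAIN
              SELECT   t1.uid, t2.uid AS voted_for, sum(t1.voted) AS count
              FROM     rawdata AS t1, rawdata AS t2
              WHERE    t1.thread_id = t2.thread_id AND t1.uid != t2.uid
              GROUP BY t1.uid, t2.uid
          } );

          print join( "\t", map { defined $_ ? $_ : 'NULL' } @$_ ), "\n" for @$plan;

      It isn't the guided re-ordering DataEase offered, but the row estimates and join order it reports give you much the same feedback.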


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        Actually your memory estimate is way off. The where clause eliminates most of the view, and it is possible to generate the view using relatively little working space. (Enough to sort the original dataset by session, then enough working space for processing a single session.) In fact, assuming a half billion threads with 10 users each, you'd only need 540 GB or so: that's about a 95% saving!
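
        Spelling that arithmetic out (a sketch; the figures are just the assumptions stated above):

            # View size assuming half a billion threads, ten users per thread,
            # and every ordered (uid, voted_for) pair surviving into the view.
            my $threads   = 0.5e9;
            my $users     = 10;
            my $pairs     = $threads * $users * ($users - 1);   # 4.5e10 rows
            my $row_bytes = 3 * 4;                              # three 4-byte integers
            printf "%.0f GB\n", $pairs * $row_bytes / 1e9;      # prints "540 GB"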

        Unfortunately doing anything interesting with the view, such as sorting and grouping it, will require throwing a lot of memory around. (That was where my 12 terabyte back-of-the-envelope estimate came from.)

        About query plans, I agree that it is very instructive to see that kind of detail, and I wish more people were aware of it. But at the same time I like not having to think about it unless I really have to think about it.
