PerlMonks  

Re^3: Catching Cheaters and Saving Memory

by tilly (Archbishop)
on Oct 13, 2006 at 18:32 UTC [id://578201]


in reply to Re^2: Catching Cheaters and Saving Memory
in thread Catching Cheaters and Saving Memory

If your 12 terabyte estimate comes from Re: Catching Cheaters and Saving Memory, then it is an overestimate. The view itself is likely to be somewhere north of 400 GB. 12 terabytes is an estimate of how much data you need to throw around to sort and group that dataset, but you won't need all 12 terabytes at once; you only need a terabyte or so of it at any given time.

However, a database should do that sorting step somewhat faster than the kind of naive program that I'd expect to see someone write for a one-off. For one thing, the first several sort-and-group steps can be done in chunks, all in memory. It doesn't need to hit disk until the amount of data you need to throw around exceeds your memory limit. Databases are smart about switching to disk only when they need to. That may double the speed. (I would be unlikely to do that in a one-off because it is a lot of logic to write.)
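The chunk-then-merge strategy being described can be sketched in a few lines. This is an illustrative Python sketch, not any particular database's implementation; the chunk size stands in for the real memory limit:

```python
import heapq
import itertools
import tempfile

def _read_run(f):
    # Stream one sorted run back from its spill file.
    for line in f:
        yield line.rstrip("\n")

def external_sort(lines, chunk_size=100_000):
    """Sort an arbitrarily large iterable of text lines in bounded
    memory: sort fixed-size chunks in RAM, spill each sorted run to
    a temp file, then lazily heap-merge the runs."""
    runs = []
    it = iter(lines)
    while True:
        chunk = sorted(itertools.islice(it, chunk_size))
        if not chunk:
            break
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(line + "\n" for line in chunk)
        f.seek(0)
        runs.append(_read_run(f))
    return heapq.merge(*runs)
```

Until the input exceeds `chunk_size`, nothing is spilled beyond one run and everything effectively happens in memory; past that point, each extra chunk costs one sequential write and one sequential read, which is the trade-off being discussed.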

That said, databases aren't magic, and this problem is one which would show how non-magical they are. First of all, you're likely to run out of disk if you're using commodity hardware. (Oops.) And as you pointed out elsewhere, pre-filtering your dataset so you can focus on only the people you're going to care about is a far bigger win than anything a database knows how to do.


Re^4: Catching Cheaters and Saving Memory
by BrowserUk (Patriarch) on Oct 13, 2006 at 19:29 UTC
    If your 12 terabyte estimate comes from Re: Catching Cheaters and Saving Memory, ...

    No. I hadn't seen your post at that point. It came from

    CREATE VIEW vote_histogram AS
        SELECT t1.uid AS uid, t2.uid AS voted_for, SUM(t1.voted) AS count
        FROM rawdata AS t1, rawdata AS t2
        WHERE t1.thread_id = t2.thread_id AND t1.uid != t2.uid
        GROUP BY t1.uid, t2.uid
        ORDER BY t1.uid;

    The view contains three 4-byte fields: uid INTEGER, voted_for INTEGER, count INTEGER.

    So a full cross-product would be ~1e6 * 1e6 * (3 * 4 bytes) = 12 terabytes.
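As a quick sanity check on that arithmetic (plain Python, same figures as above):

```python
# ~1e6 users crossed with ~1e6 users, three 4-byte INTEGER columns per row.
rows = 1_000_000 * 1_000_000      # full uid x voted_for cross-product
bytes_per_row = 3 * 4             # uid, voted_for, count
total_bytes = rows * bytes_per_row
print(total_bytes)                # 12000000000000 bytes = 12 TB
```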

    The WHERE clause will limit the number of rows appearing in the final view, but along the way, as you point out, most of that 12 TB has to be generated. You're right that, depending on how well spec'd the server is for memory, it would be possible to perform much of the reduction in RAM, but on a typical Intel box with a memory limit of 2 or 3 GB, there would be a substantial cost in repeated effort doing it in 2 GB chunks, as each chunk will require another pass over the raw data.

    It will depend upon how the server is configured, but on smaller hardware I could well see the server attempting to trade disk space for extra passes.

    Either way, it would require some very seriously specified (expensive) hardware to run that CREATE VIEW in anything like a reasonable amount of time. Your rent-a-cluster idea would probably be the only option for many small and medium-sized businesses.

    That said, databases aren't magic, and this problem is one which would show how non-magical they are.

    Sit down and brace yourself. We agree :)

    Probably my biggest problem with SQL is that it allows people to write apparently simple and straightforward solutions to problems without being made aware of the scale of the task they encapsulate. Of course, that's good for DB vendors, cos when the query takes forever to run and you call them for advice, they get to (try to) upgrade you to a distributed solution, for which they charge even bigger bucks.

    Years ago, before the dominance of the RDBMS, I trialled a new product from DEC called (from memory) DataEase. It was a TUI (terminal UI) build-a-query-by-example thing, backed by (I think) a hierarchical DB. One of several nice things about it was that once you'd built your query, it went through various steps prompting you to re-order it for efficiency. As part of that process, it would display a guesstimate of how long the query would take to run.

    One memorable example of using this (which I think I've mentioned here before) was a query joining seven tables. By simply reversing the order of the joins, the runtime fell from an estimated 69 hours to 5 or 6. The difference was that the new ordering meant the data on disk was processed mostly sequentially, rather than leaping all over the disk pack. Modern optimisers probably do this kind of thing automatically, but in some ways that is a shame. Having the numbers shown to you, and seeing the effect that various apparently minor changes had upon them, was a salutary learning experience.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Actually, your memory estimate is way off. The WHERE clause eliminates most of the view, and it is possible to generate the view using relatively little working space. (Enough to sort the original dataset by session, then enough working space for processing a single session.) In fact, assuming half a billion threads with 10 users each, you'd only need 540 GB or so - that's about a 95% saving!
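Spelling that estimate out (the thread and participant counts are the assumptions stated above):

```python
threads = 500_000_000        # half a billion threads (assumed)
users_per_thread = 10        # assumed participants per thread
# Ordered (uid, voted_for) pairs per thread, excluding uid == voted_for:
pairs_per_thread = users_per_thread * (users_per_thread - 1)   # 90
bytes_per_row = 3 * 4        # uid, voted_for, count: 4 bytes each
total_bytes = threads * pairs_per_thread * bytes_per_row
print(total_bytes / 10**9)   # 540.0 (GB)
# Versus the 12 TB cross-product figure: 540e9 / 12e12 = 4.5%,
# i.e. roughly a 95% saving.
```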

      Unfortunately, doing anything interesting with the view, such as sorting and grouping it, will require throwing a lot of data around. (That was where my 12-terabyte back-of-the-envelope estimate came from.)

      About query plans: I agree that it is very instructive to see that kind of detail, and I wish more people were aware of it. But at the same time, I like not having to think about it unless I really have to.
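For what it's worth, the numbers are still available on demand in most modern engines. A small illustration using SQLite via Python's standard library, with a made-up stand-in for the rawdata table discussed above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rawdata (uid INTEGER, thread_id INTEGER, voted INTEGER)")
conn.execute("CREATE INDEX idx_thread ON rawdata (thread_id)")

# EXPLAIN QUERY PLAN reports the join order and index choices the
# optimiser settled on, without actually running the query.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT t1.uid, t2.uid, SUM(t1.voted)
    FROM rawdata AS t1, rawdata AS t2
    WHERE t1.thread_id = t2.thread_id AND t1.uid != t2.uid
    GROUP BY t1.uid, t2.uid
""").fetchall()
for row in plan:
    print(row[-1])   # textual detail, e.g. a scan of t1 and an index search on t2
```

Re-ordering tables or adding/dropping an index and re-running the EXPLAIN gives something like the DataEase feedback loop described above, just without the runtime guesstimate.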

        I'm kind of glad my proposed solution generated such responses. However, there are a few points I rather disagree with:

        1. How would using an SQL backend differ from using Perl for this particular solution (with regard to memory usage)? Given the amount of developer time spent on optimizing DB memory usage, I would guess it would use far less memory than an ad hoc solution written in Perl. Wouldn't you agree?

        2. I always thought that views are not actual tables but more like indices, in the sense that they are collections of pointers to the actual records. Thus, your estimates for vote_histogram and such are off, IMHO. (To tell the truth, I cannot back this statement at all, at least at the moment. If somebody has good insight into how views are actually constructed, I would really appreciate the pointers!)

        3. As already pointed out, the WHERE clause should eliminate most of the cases. We are looking for people with a large number of votes for each other. Assuming that most people are not 'bad', we are left with a relatively small number of suspects.

        4. As for the efficiency of the actual query evaluation, I think this is a once-in-a-while problem that does not require an everyday solution. I also doubt the quantity of actual data to begin with. How many sites can you name which have millions of unique users that actively participate in some sort of open forum and actually vote? Or, for that matter, how many active topics can such a site carry? Also, I'm sure that this data can be split into parts (like historic and current) and dealt with separately.
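One nit on point 3: a filter on the aggregated vote count belongs in a HAVING clause rather than a WHERE clause, since it applies after grouping. The filtering idea itself is easy to sketch outside SQL; here is a hypothetical Python version, simplified to counting shared threads rather than summing the voted flag (the data and threshold are made up):

```python
from collections import Counter
from itertools import permutations

def suspicious_pairs(threads, threshold):
    """threads: iterable of per-thread participant uid lists.
    Count how often each ordered (uid, other_uid) pair shares a
    thread, and keep only pairs at or above the threshold -- the
    job a HAVING clause would do in the SQL version."""
    hist = Counter()
    for participants in threads:
        hist.update(permutations(set(participants), 2))
    return {pair: n for pair, n in hist.items() if n >= threshold}

threads = [[1, 2, 3], [1, 2], [1, 2], [3, 4]]
print(sorted(suspicious_pairs(threads, threshold=3)))  # [(1, 2), (2, 1)]
```

The threshold is what keeps the output small when most people are not 'bad': the full histogram is still built, but only the handful of pairs that keep meeting each other survive the filter.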

        Nevertheless, the discussion was really informative. Thank you.

        BR
