PerlMonks  

Re: Catching Cheaters and Saving Memory

by caelifer (Scribe)
on Oct 12, 2006 at 20:40 UTC ( [id://577956] )


in reply to Catching Cheaters and Saving Memory

I would like to offer a different solution which may be completely off-topic. Since you are dealing with large datasets, IMHO, SQL is the perfect tool here. Moreover, any decent SQL backend is optimized to deal with memory issues as well as the efficiency of SQL queries. For illustration I will use PostgreSQL syntax, but this generally applies to others as well.

Suppose you created a database named test. I would start by creating one table as in:

% cat <<_EOC | psql test
CREATE TABLE rawdata (
    uid       INTEGER,
    thread_id INTEGER,
    voted     INTEGER DEFAULT 0
);
_EOC

Then I would populate it like this:

% perl -nle 'printf "INSERT INTO rawdata VALUES (%d, %d, %d);\n", split' < data.txt | psql test
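The one-liner just splits each whitespace-separated line (uid, thread_id, voted — the three columns from the table definition above) into an INSERT statement. The same transformation, sketched in Python with made-up sample rows for illustration:

```python
def line_to_insert(line):
    """Turn one 'uid thread_id voted' line into an INSERT statement."""
    uid, thread_id, voted = (int(field) for field in line.split())
    return f"INSERT INTO rawdata VALUES ({uid}, {thread_id}, {voted});"

# Hypothetical contents of data.txt, two rows.
sample = ["101 17 1", "202 17 0"]
statements = [line_to_insert(line) for line in sample]
```

For a dataset of any real size, the database's bulk-load path (PostgreSQL's COPY) would be preferable to millions of individual INSERTs, but the INSERT form keeps the example portable.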

Now, to the fun part:

% cat <<_EOSQL | psql test
-- Use transactions so all temporary views are destroyed after rollback.
BEGIN TRANSACTION;

-- Create view with voting history per account
CREATE VIEW vote_histogram AS
    SELECT t1.uid AS uid, t2.uid AS voted_for, sum(t1.voted) AS count
    FROM rawdata AS t1, rawdata AS t2
    WHERE t1.thread_id = t2.thread_id AND t1.uid != t2.uid
    GROUP BY t1.uid, t2.uid
    ORDER BY t1.uid;

-- Create view of suspected accounts whose votes for others exceed the set threshold
CREATE VIEW suspects AS
    SELECT uid, voted_for, count
    FROM vote_histogram
    WHERE count > $THRESHOLD;

-- Build cross-referenced view of accounts who vote for each other very often
SELECT s.uid AS suspect, h.uid AS voted_for, s.count AS suspect_votes, h.count AS returned_votes
FROM suspects AS s, vote_histogram AS h
WHERE s.voted_for = h.uid AND s.count <= h.count
ORDER BY s.uid, s.count DESC;

ROLLBACK;
_EOSQL

Voila! You get your suspects in a nicely formatted table. Of course, you can limit the number of rows displayed as well as adjust your THRESHOLD parameter. I wrapped the SQL statements in a transaction so all the temporary objects are destroyed automatically once processing is finished. However, if you have a lot of space available, you can let them live a bit longer so you can construct different queries on them.
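The whole pipeline can be exercised end to end with SQLite's in-memory mode — a sketch, not the poster's PostgreSQL setup: the sample rows and the threshold value of 1 are invented for illustration, and the count column is renamed cnt to sidestep the SQL aggregate-function name.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE rawdata (uid INTEGER, thread_id INTEGER, voted INTEGER DEFAULT 0);

-- Toy data: users 1 and 2 vote in the same two threads; user 3 never votes.
INSERT INTO rawdata VALUES (1, 10, 1), (2, 10, 1), (3, 10, 0),
                           (1, 11, 1), (2, 11, 1);

-- Voting history per pair of accounts that share a thread.
CREATE VIEW vote_histogram AS
    SELECT t1.uid AS uid, t2.uid AS voted_for, SUM(t1.voted) AS cnt
    FROM rawdata AS t1, rawdata AS t2
    WHERE t1.thread_id = t2.thread_id AND t1.uid != t2.uid
    GROUP BY t1.uid, t2.uid;

-- Accounts whose vote count toward others exceeds the threshold (here 1).
CREATE VIEW suspects AS
    SELECT uid, voted_for, cnt FROM vote_histogram WHERE cnt > 1;
""")

# Cross-reference suspects against the histogram, as in the final query above.
pairs = db.execute("""
    SELECT s.uid, h.uid, s.cnt, h.cnt
    FROM suspects AS s, vote_histogram AS h
    WHERE s.voted_for = h.uid AND s.cnt <= h.cnt
    ORDER BY s.uid
""").fetchall()
```

On this toy data the query flags users 1 and 2 as mutually returning each other's votes, while user 3 (who never votes) stays below the threshold.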

BR

Replies are listed 'Best First'.
Re^2: Catching Cheaters and Saving Memory
by BrowserUk (Patriarch) on Oct 13, 2006 at 08:58 UTC

    Care to estimate how long that would take to run?

    On commodity hardware, I'm guessing (several) days rather than hours.

And that vote_histogram view is going to consume 12 terabytes minimum. Temporary or not, that's a big chunk of hardware to consume, assuming it is available.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
If your 12 terabyte estimate comes from Re: Catching Cheaters and Saving Memory, then it is an overestimate. The view is likely to be somewhere north of 400 GB. 12 terabytes is an estimate of how much data you need to throw around to sort and group that dataset. But you won't need all 12 terabytes at once; you only need a terabyte or so of it at any one time.

However, a database should do that sorting step somewhat faster than the kind of naive program that I'd expect to see someone write for a one-off. For one thing, the first several sort and group steps can be done in chunks, all in memory. It doesn't need to hit disk until the amount of data you need to throw around exceeds your memory limit. Databases are smart about switching to disk only when they need to. That may double the speed. (I would be unlikely to do that for a one-off because it is a lot of logic to write.)
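The chunk-then-merge strategy described here — sort what fits in RAM, spill each sorted run to disk, then stream a k-way merge — can be sketched in a few lines (Python for brevity; the chunk size would be tuned to available memory in practice):

```python
import heapq
import os
import tempfile

def _spill(sorted_chunk):
    """Write one sorted run to a temp file, one integer per line."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{x}\n" for x in sorted_chunk)
    return path

def external_sort(items, chunk_size):
    """Sort an iterable too big for memory: sort fixed-size chunks in
    RAM, spill each to disk, then stream a k-way merge of the runs."""
    runs, chunk = [], []
    for item in items:
        chunk.append(item)
        if len(chunk) >= chunk_size:
            runs.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        runs.append(_spill(sorted(chunk)))
    files = [open(path) for path in runs]
    try:
        # heapq.merge keeps only one pending line per run in memory.
        yield from (int(line) for line in heapq.merge(*files, key=int))
    finally:
        for f in files:
            f.close()
        for path in runs:
            os.remove(path)
```

This is the "lot of logic to write" for a one-off: each run is a sequential read, and only the merge front (one record per run) lives in memory at once.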

      That said, databases aren't magic, and this problem is one which would show how non-magical they are. First of all you're likely to run out of disk if you're using commodity hardware. (Oops.) And as you pointed out elsewhere, pre-filtering your dataset so you can focus on only the people you're going to care about is a far bigger win than anything a database knows how to do.

        If your 12 terabyte estimate comes from Re: Catching Cheaters and Saving Memory, ...

        No. I hadn't seen your post at that point. It came from

CREATE VIEW vote_histogram AS
    SELECT t1.uid AS uid, t2.uid AS voted_for, sum(t1.voted) AS count
    FROM rawdata AS t1, rawdata AS t2
    WHERE t1.thread_id = t2.thread_id AND t1.uid != t2.uid
    GROUP BY t1.uid, t2.uid
    ORDER BY t1.uid;

The view contains three 4-byte fields: uid INTEGER, voted_for INTEGER, count INTEGER.

So a full cross-product would be ~ 1e6 * 1e6 * (3 * 4 bytes) = 12 terabytes.
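Spelling out that back-of-the-envelope arithmetic (a million distinct uids is the assumption here):

```python
rows = 10**6 * 10**6       # full uid x uid cross-product: 1e12 candidate rows
row_bytes = 3 * 4          # three 4-byte INTEGER fields per row
total_tb = rows * row_bytes / 10**12   # decimal terabytes
```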

The WHERE clause will limit the number of rows appearing in the final view, but along the way, (as you point out), most of that 12 TB has to be generated. You're right that, depending on how well spec'd the server is for memory, it would be possible to perform much of the reduction in RAM, but on a typical Intel box with a memory limit of 2 or 3 GB, there would be a substantial cost in repeated effort doing it in 2 GB chunks, as each chunk will require another pass over the raw data.

It will depend upon how the server is configured, but on smaller hardware I could well see the server attempting to trade disk space for extra passes.

Either way, it would require some very seriously specified (expensive) hardware to be able to run that CREATE VIEW in anything like a reasonable amount of time. Your rent-a-cluster idea would probably be the only option for many small and medium-sized businesses.

        That said, databases aren't magic, and this problem is one which would show how non-magical they are.

        Sit down and brace yourself. We agree :)

Probably my biggest problem with SQL is that it allows people to write apparently simple and straightforward solutions to problems, without being made aware of the scale of the task they encapsulate. Of course, that's good for DB vendors, cos when the query takes forever to run and you call them for advice, they get to (try to) upgrade you to a distributed solution, for which they charge even bigger bucks.

Years ago, before the dominance of the RDBMS, I trialed a new product from DEC called (from memory) DataEase. It was a TUI (terminal UI) build-a-query-by-example thing, backed by (I think) a hierarchical DB. One of several nice things about it was that when you'd built your query, it went through various steps prompting you to re-order it for efficiency. As a part of that process, it would display a guesstimate of how long the query would take to run.

One memorable example of using this (that I think I've mentioned here before) was a query that joined over 7 tables. By simply reversing the order of the joins, the runtime fell from an estimated 69 hours to 5 or 6. The difference was that the new ordering meant the data on disk was processed mostly sequentially, rather than leaping all over the disk pack. Modern optimisers probably do this kind of thing automatically, but in some ways that is a shame. Having the numbers shown to you and seeing the effect that various apparently minor changes had upon them was a salient learning experience.


