PerlMonks
Re: Catching Cheaters and Saving Memory
by mr_mischief (Monsignor) on Oct 18, 2006 at 17:34 UTC ([id://579153])
It seems to me you could cull the data significantly. I'm not sure whether my proposition fits your real situation, but from my mental model of what you're trying to do it makes sense to me.
People who are going to cheat are far more likely to do so with a few people they feel they can trust. They are also going to repeat the behavior, or it wouldn't really be a trend you could call cheating anyway. A system that plays to these facts may make more sense than comparing every user to every other user.

Envision a table for each user. In Alice's table, we have a row for each of the 50 most recent users Alice has voted up. You only need three columns besides the user IDs: the number of times Alice voted for a thread that benefits that user since time X, the number of times that user has voted for a thread that benefits Alice since time X, and the number of times those votes came from the same thread. X can be however long it takes Alice to affect 50 users. If your limit of users per thread is low enough, this will show some trends at 50 users.

An event in which everyone wants more ++ for everyone involved, but in which each person can only participate once, reminds me of a hand of poker. Some people raise or call, and others fold. Everyone who doesn't fold has the potential to benefit. Two people who consistently raise on the same hands at different tables are likely table talking. So I'm picturing seats at a table game, up to 6 users. If one of those users is Alice, then you're looking for 5 other people per group. If you're tracking up to 50 players Alice benefited, that's at least the last ten hands in which Alice raised.

Assuming 4 bytes for each user ID and 2 bytes for each of the three vote columns, you end up with 10 bytes per tracked user, or 500 bytes of data for all 50. Now assume 92 bytes of overhead for each table, 92 bytes of overhead for each row, and 25 bytes of overhead for the user ID on that row. That's (10 + 25 + 92) * 50 + 92 = 6,442 bytes per user. Times a million users, that's roughly 6.4 GB. If you only need the ten most recent users that overlap to find your trends, it's 20% of that.
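A rough sketch of that per-user table as a Perl hash-of-hashes might look like the following. The names (`record_vote`, `$MAX_TRACKED`, the `for`/`from`/`same` keys) are my own, not from the original post, and a real system would also have to track recency so the oldest of the 50 entries can be evicted:

```perl
use strict;
use warnings;

my $MAX_TRACKED = 50;   # most recent beneficiaries kept per voter
my %tally;  # $tally{$voter}{$other} = { for => n, from => n, same => n }

# Record that $voter upvoted a thread benefiting $other.
# $same_thread is true when the pair's votes landed in the same thread.
sub record_vote {
    my ($voter, $other, $same_thread) = @_;

    my $row = $tally{$voter}{$other} ||= { for => 0, from => 0, same => 0 };
    $row->{for}++;
    $row->{same}++ if $same_thread;

    # Mirror the reverse direction in $other's table.
    my $rev = $tally{$other}{$voter} ||= { for => 0, from => 0, same => 0 };
    $rev->{from}++;

    # Eviction of the least recent entry would go here once
    # keys %{ $tally{$voter} } exceeds $MAX_TRACKED.
}

record_vote('alice', 'bob', 1);
record_vote('alice', 'bob', 0);
record_vote('bob',   'alice', 1);

printf "alice->bob: for=%d from=%d same=%d\n",
    @{ $tally{alice}{bob} }{qw(for from same)};
```

Keeping the reverse direction mirrored in both tables doubles the writes, but it means a suspicious pair can be checked by reading only one user's table.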
You might recognize these overheads from earlier in the thread as the hash and string overheads in Perl 5. A good DBMS may use less overhead.

Every time Alice and another user, Bob, have mutually upvoted each other an abnormal number of times, that's a suspected-cheating mark against both Alice and Bob. A separate table tracking the number of times each user has been flagged can be kept with just the user ID and the count.

Further, deeper analysis could be done on the accounts of the top 100 or 1,000 or 10,000 suspects compared to each other. The combinatorics are a much smaller issue at those numbers than at a million, and your cheaters will tend to clump with other cheaters anyway, since cheating by definition involves more than one person.

We're still talking about gigabytes, but at least not terabytes or petabytes. I do hope this type of model meets your needs. It's not as exact as a million-by-million table, but it should use a lot less memory.

Christopher E. Stith
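The flagging step could be sketched as below. This is my own illustration, not code from the thread: `check_pair`, `$THRESHOLD`, and the toy data are invented, and the cutoff would need tuning against real vote volumes:

```perl
use strict;
use warnings;

my $THRESHOLD = 10;   # assumed cutoff for "abnormal" mutual upvotes
my %suspect_count;    # user id => number of suspicious pairings

# Flag both users when each has upvoted the other at least $THRESHOLD times.
sub check_pair {
    my ($u, $v, $tally) = @_;
    my $uv = $tally->{$u}{$v} or return;
    my $vu = $tally->{$v}{$u} or return;
    if ($uv->{for} >= $THRESHOLD and $vu->{for} >= $THRESHOLD) {
        $suspect_count{$u}++;
        $suspect_count{$v}++;
    }
}

# Toy data: alice and bob have upvoted each other 12 and 11 times;
# carol has voted for alice only once.
my %tally = (
    alice => { bob   => { for => 12, from => 11, same => 9 } },
    bob   => { alice => { for => 11, from => 12, same => 9 } },
    carol => { alice => { for => 1,  from => 0,  same => 0 } },
);

check_pair('alice', 'bob',   \%tally);
check_pair('carol', 'alice', \%tally);   # below threshold, no flag

# The most-flagged users would feed the deeper analysis pass.
my @top = sort { $suspect_count{$b} <=> $suspect_count{$a} }
          keys %suspect_count;
print "suspects: @top\n";
```

Sorting `%suspect_count` gives the top-100 or top-10,000 list for the deeper pairwise analysis, where the combinatorics are manageable.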
In Section: Seekers of Perl Wisdom