hgolden has asked for the wisdom of the Perl Monks concerning the following question:
Fellow Monks,

Thus far in my Perl career, I've never had to worry about memory. Equipped with a nice machine at the office, and problems rich in interest but low in memory demands, I've been writing Perl for legibility, and occasionally for speed. But now my memory-loose ways are catching up with me.

I can't talk about my problem domain, except to say that we're trying to catch cheaters. However, I can explain the specific problem with an analogy to a fictional version of PerlMonks. Imagine that PerlMonks has the following traits:

1. Everyone is an XP whore, and all votes are ++.
2. You can only vote on threads that you participated in, and you can only participate once in a given thread.
3. When you vote on a thread, everyone who posted there gets the ++.
4. Voting behavior is public.

Then we can figure out how often each monk votes ++, and we could identify pairs of monks who voted each other up with suspicious regularity. The data that I get looks like this:

UserID ThreadID Voted
2 3 1
5 3 0
17 3 1
8 3 1
9 4 1
5 4 0
23 4 0
2 4 1
3 4 1
43 4 1
8 5 0
...
In this made-up data, there were four participants in thread 3, and three of them voted ++. There were six participants in thread 4, and four of them voted ++. A few weeks ago, I got a dataset like this that had about 3.8 million entries, spread across ~800K threads. The catch is that you really want to walk back and forth along the data (at least within each thread), so if it were a tiny dataset, I would have slurped it. Since it was fairly large, I instead wrote a little script that goes through the data, slurps one thread at a time, and then increments a pair of 2-D hashes.
I rewrote the program to conform to this fictional problem domain, so a typo might have slipped in, but it's the concepts I'm concerned with.
#!/usr/bin/perl
use warnings;
use strict;

open(IN, "data.txt");
my (%seen_each_other_count, %voted_each_other_count, $old_thread_id, @current_user_ids, @current_votes);

my $newline = <IN>;
while ($newline) {
    chomp($newline);
    (my $user_id, my $thread_id, my $vote) = split(/\t/, $newline);
    $old_thread_id = $thread_id;
    push @current_user_ids, $user_id;
    push @current_votes, $vote;

    $newline = <IN>;
    chomp($newline);
    ($user_id, $thread_id, $vote) = split(/\t/, $newline);
    while ($thread_id == $old_thread_id) {
        push @current_user_ids, $user_id;
        push @current_votes, $vote;
        $newline = <IN>;
        chomp($newline);
        ($user_id, $thread_id, $vote) = split(/\t/, $newline);
    }

    for (my $i = 0; $i <= $#current_user_ids; $i++) {
        for (my $j = $i; $j <= $#current_user_ids; $j++) {
            # When $j==$i this just counts how many times you appear, and how many times you vote
            $seen_each_other_count{$current_user_ids[$i]}{$current_user_ids[$j]}++;
            if ($current_votes[$i] == 1 && $current_votes[$j] == 1) {
                $voted_each_other_count{$current_user_ids[$i]}{$current_user_ids[$j]}++;
            }
        }
    }

    $old_thread_id = $thread_id;
    @current_user_ids = ();
    @current_votes    = ();
    push @current_user_ids, $user_id;
    push @current_votes, $vote;
    $newline = <IN>;
    chomp($newline);
}
# Since the data is read down, the total number of times that
# one saw two is equal to the number of times that we counted one seeing two,
# plus the number of times we counted two seeing one
sub numerically { $a <=> $b };

open(OUT, ">PMTEST.txt");
print OUT "user1\tuser2\tsaw_count\tvote_count\n";
foreach my $first (sort numerically (keys %seen_each_other_count)) {
    my $ref = $seen_each_other_count{$first};
    foreach my $second (sort numerically (keys %$ref)) {
        my ($final_count, $final_votes);
        if ($first == $second) {
            $final_count = $seen_each_other_count{$first}{$second};
            $final_votes = $voted_each_other_count{$first}{$second};
        }
        else {
            $final_count = $seen_each_other_count{$first}{$second} + $seen_each_other_count{$second}{$first};
            $final_votes = $voted_each_other_count{$first}{$second} + $voted_each_other_count{$second}{$first};
        }
        $final_votes ||= 0;
        print OUT "$first\t$second\t$final_count\t$final_votes\n";
    }
}
That worked very well for this dataset, but my bosses so enjoyed the results that they're getting a new dataset which, instead of being 3.8 million lines, is about 5 billion lines (and 55 gigabytes). So, last time it was enough not to slurp the data, but now I'm worried that the hashes are going to blow up (have any of you had a million by million 2-D hash?).

I've been pondering how to deal with this, and the first thing that I've come up with is to pick a number of users (say 100), and first just count which people those 100 users see and vote for. Then I can output that to disk, and do the next 100 users. Doing this in practice would mean only incrementing a hash entry if the first index is in the set of users I'm on, and skipping a thread if there's no user of interest in it (a rough sketch follows below). However, I'm worried that this will take a month.

Here's the issue: I don't have the data yet, but my bosses want the results ASAP once we do. That means I have to be ready with a couple of scripts to do this job, so that if one is taking too much memory, I can switch to another, and if the next is going to take a month, I can switch to a third. When I read SoPW, I'm always impressed at the different and clever approaches that people take to solving problems. Do any of you have a different way of dealing with this that will be reasonably fast, and not take too much memory? I'm happy to work out the details of the code; I'm mostly looking for ideas.

Thanks so much,
Hays

Update: One of the problems with not yet having the dataset is that I don't have the exact numbers on how many users and threads there are. I think a fair (i.e. leaning towards worst case) assumption is that those numbers scale linearly from the smaller dataset (the number of participants in a given thread is capped in my real problem domain). That would give ~1 million users and ~1 billion threads {plus or minus a few orders of magnitude :( }.
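For concreteness, the per-batch filter I have in mind would look something like this (untested sketch; %wanted would hold the current batch of user IDs, and the file names are placeholders):

#!/usr/bin/perl
use strict;
use warnings;

my %wanted = map { $_ => 1 } ( 2, 5, 17 );    # current batch of user IDs

open my $in,  '<', 'data.txt'     or die "data.txt: $!";
open my $out, '>', 'batch_in.txt' or die "batch_in.txt: $!";

my $last_thread = -1;
my ( @lines, $keep );
while ( my $line = <$in> ) {
    my ( $user, $thread ) = split /\t/, $line;
    if ( $thread != $last_thread ) {
        print {$out} @lines if $keep;    # flush the previous thread
        @lines       = ();
        $keep        = 0;
        $last_thread = $thread;
    }
    push @lines, $line;
    $keep = 1 if $wanted{$user};
}
print {$out} @lines if $keep;            # don't forget the last thread
close $_ for $in, $out;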
Re: Catching Cheaters and Saving Memory
by Corion (Patriarch) on Oct 12, 2006 at 13:47 UTC
If I've understood your exposition right, I've done something similar: counting transactions between pairs of parties. My approach was to use a Berkeley DB with the keys being both parties and the value being a pair of count and amount. That was slower than the all-RAM solution but had the nice advantage that it worked. I also think that this should be faster than multiple passes through your large dataset, but I don't know.
If you're doing this in Perl, I'd go for DB_File and store strings in it.
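An untested sketch of what I mean (the file name and the "seen,votes" value layout are just placeholders):

use strict;
use warnings;
use Fcntl;
use DB_File;

# key: "user1;user2", value: "seen_count,vote_count" stored as a string
tie my %pair, 'DB_File', 'pairs.db', O_RDWR | O_CREAT, 0644, $DB_BTREE
    or die "Cannot tie pairs.db: $!";

sub bump_pair {
    my ( $u1, $u2, $both_voted ) = @_;
    my ( $seen, $votes ) = split /,/, ( $pair{"$u1;$u2"} || '0,0' );
    $seen++;
    $votes++ if $both_voted;
    $pair{"$u1;$u2"} = "$seen,$votes";
}

Using $DB_BTREE keeps the keys sorted on disk, which makes the final report a simple sequential scan.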
As the order of votes doesn't seem to matter in your data, you can linearly scale your processing time by splitting up your data across machines, at least if you can reduce the results you want to collect to something that's associative and commutative, like the count of items and the sum of items. Of course you will have to be careful when merging your collected data back together - you should code sanity checks that check that the count of records in each chunk is equal to the sum of reported counts, and that the sum over all chunks is equal to the total number of records.
Re: Catching Cheaters and Saving Memory
by Anonymous Monk on Oct 12, 2006 at 15:27 UTC
So, last time it was enough to not slurp the data, but now I'm worried that the hashes are going to blow up (have any of you had a million by million 2-D hash?).
And rightly so. The overhead of an empty, anon hash is 92 bytes on my system (this may vary by a few bytes depending on your build). If you have a million of them, that's 92 MB just for the hashes, with nothing stored in them yet. A string takes 25 bytes, not counting the content of the string. Keys take slightly less, but they are still overhead. And each byte of overhead multiplied by a million is an MB.
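If you want to check those figures for your own build, Devel::Size from CPAN will tell you:

use strict;
use warnings;
use Devel::Size qw(size total_size);

print size( {} ), " bytes for an empty anon hash\n";
print size( "some string" ), " bytes for a short string\n";
print total_size( { 123456 => { 654321 => 1 } } ),
    " bytes for a single entry of a 2-D hash\n";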
If you have 55 GB of raw data which you are storing as short strings, you'll be looking at hundreds of GB of memory usage.
You might want to use a real database instead.
Re: Catching Cheaters and Saving Memory
by GrandFather (Saint) on Oct 12, 2006 at 16:15 UTC
How many unique IDs are likely to be used? If the number is modest, say a few hundred thousand, you may be able to use an AoA rather than a HoH and use an ID-to-index hash to map user IDs to array indexes.
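A bare-bones sketch of that mapping (untested; idx() is just an illustrative name):

use strict;
use warnings;

my %index_of;            # user ID => small array index
my ( @seen, @voted );    # arrays of arrays instead of hashes of hashes

sub idx {
    my $id = shift;
    $index_of{$id} = scalar keys %index_of
        unless exists $index_of{$id};
    return $index_of{$id};
}

# then, in the tallying loop:
# $seen[  idx($u1) ][ idx($u2) ]++;
# $voted[ idx($u1) ][ idx($u2) ]++ if $both_voted;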
DWIM is Perl's answer to Gödel
Re: Catching Cheaters and Saving Memory
by tilly (Archbishop) on Oct 13, 2006 at 05:19 UTC
I think you need to think veeerrrry carefully about your data volumes and plan accordingly.
Suppose that you have 5 billion lines with 1 billion threads. If the threads are completely evenly dispersed, that's 20 billion observations of x being in the same thread as y. (Note, since x may or may not vote on y, order matters.) Assuming that there is a high incidence of unique meetings (x meets y in one thread only), you'd have order of magnitude 20 billion records in your output set. At, say, 20 bytes per output line, that's 400 GB of data you want to spit out in your final report. Your temporary working space is going to be bigger than that, so you should plan on needing a terabyte or so.
Is that a worst case scenario? Unfortunately no. Suppose that there are 500 million threads evenly spaced at 10 users per thread. Then there are 500 million * 10 * 9 = 45 billion observations. That would take over twice the space to represent.
OK, so once you estimate the data volume and make appropriate preparations to not run out of disk space, let's discuss performance. The idea of using a dbm (eg BerkeleyDB) is a good one. However you're going to be doing tens of billions of reads then writes into this data structure. Let's assume 20 billion read/write pairs, and an average of 1 disk seek per write, and an average disk seek time of 5 ms. You're doing 200 seeks per second, 12,000/minute, 720,000/hour, 17.28 million per day. At that rate you'll take 1157.4 days to finish!
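If you want to play with those assumptions, the arithmetic is simply:

use strict;
use warnings;

# back-of-the-envelope check of the seek-bound estimate above
my $pairs         = 20_000_000_000;    # read/write pairs
my $seek_seconds  = 0.005;             # 5 ms average seek
my $seeks_per_day = 86_400 / $seek_seconds;
printf "%.2e seeks/day, %.1f days total\n",
    $seeks_per_day, $pairs / $seeks_per_day;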
Houston, we may have a problem. That estimate may be off by an order of magnitude either way, but your boss won't like even the best case.
Now how can we speed this up? Well the performance problem here is due to latency, and you can parallelize your job. You can have multiple processes running in parallel on different parts of your file. You need some logic to coordinate locking. You can probably speed it up by an order of magnitude or so that way.
A better trick is that you can put multiple machines to work, each one handling a different range of users. (Note, you'll want to consider all other users observed and voted on by a specific range of users.) Because the cost of skipping over lines in the file is minimal, you could have 10 machines, each of which is handling the votes from 1/10 of the users, and you'll get done approximately 10 times as fast. In fact you could scale this up quite a bit more than that if you have a large compute farm.
If you don't have a compute farm, you could consider purchasing compute time from Amazon.
There are also alternate strategies that might be faster. One that requires a lot more of you is doing a custom merge process. The idea here is similar to a merge sort, except that as you're merging, you may also group and sum. One way to implement it is this:
- Process your input data and write out each session to a file, ordered by user responding, then other user seen in thread.
- Merge pairs of files, summing lines where appropriate.
- Repeat the merging process until you have one file sorted by user responding, then other user seen in the thread. (With 1 billion sessions this should take about 30 passes.)
(Technical note: I would not create a billion files. Filesystems don't like that. I'd create 2 files, writing sessions alternately to one or the other. Then I'd merge them into 2 files with pairs of sessions. Then merge those into 2 files with groups of 4 sessions. And so on until it was done.)

Now how fast is that? Well, let's take my original ballpark estimate of 400 GB of data. All of that data has to be read and written 30 times. That's 12 terabytes written, 12 terabytes read. At, say, 40 MB/s, that's 2.4 GB/min, 144 GB/hour, 3.456 TB/day, or about a week to run. (Note how much faster this is than the dbm approach!) Now you're hopefully maxing out your disk throughput so there are no gains from multiple processes, but you can still put multiple computers on it.
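The heart of that merge step might look something like this (untested sketch; it assumes each run file holds tab-separated "user\tother\tseen\tvoted" lines already sorted on the first two fields):

use strict;
use warnings;

# split a run-file line into its sort key and its two counts
sub parse_line {
    my ( $user, $other, $seen, $voted ) = split /\t/, shift;
    chomp $voted;
    return ( "$user\t$other", $seen, $voted );
}

# merge two sorted run files into one, summing the counts whenever the
# same (user, other) pair appears in both
sub merge_runs {
    my ( $fh_a, $fh_b, $fh_out ) = @_;
    my $line_a = <$fh_a>;
    my $line_b = <$fh_b>;
    while ( defined $line_a and defined $line_b ) {
        my ( $key_a, $seen_a, $vote_a ) = parse_line($line_a);
        my ( $key_b, $seen_b, $vote_b ) = parse_line($line_b);
        my $order = $key_a cmp $key_b;
        if    ( $order < 0 ) { print $fh_out $line_a; $line_a = <$fh_a>; }
        elsif ( $order > 0 ) { print $fh_out $line_b; $line_b = <$fh_b>; }
        else {
            print $fh_out join( "\t", $key_a,
                $seen_a + $seen_b, $vote_a + $vote_b ), "\n";
            $line_a = <$fh_a>;
            $line_b = <$fh_b>;
        }
    }
    # drain whichever input still has lines left
    for my $rest ( [ $line_a, $fh_a ], [ $line_b, $fh_b ] ) {
        my ( $line, $fh ) = @$rest;
        next unless defined $line;
        print $fh_out $line;
        while ( defined( my $more = <$fh> ) ) { print $fh_out $more; }
    }
}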
I think you can see that this approach should take a lot more programming, but the end result is also likely to be a lot faster.
Incidentally if you want data sorted, say from "most intersections" on down, sorting will be about as slow as this merging process was. I'd strongly recommend that before you do that final sort, you do a filter pass and only sort a fraction of your data.
A final suggestion that is very, very important. Whatever approach you take, make it give you partial progress reports so you know how fast it is going! It is one thing to write a process that will take 2 years to finish on one machine, which can easily be parallelized and run in a couple of days for a couple thousand dollars on Amazon's computer cluster. It is quite another to write a process that will take 2 years to finish on one machine, leave it running for a month, and only then get worried about when it will be done! The difference between those two is having a program that tells you how quickly it is going so you can estimate how long it will take fairly early in the process.
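Even something this small inside the main read loop will do (the 5 billion total and the reporting interval are just assumptions):

use strict;
use warnings;
use Time::HiRes qw(time);

my $expected = 5_000_000_000;    # assumed total number of input lines
my $start    = time;

open my $in, '<', 'data.txt' or die "data.txt: $!";
while ( <$in> ) {
    # ... the real per-line work goes here ...
    next if $. % 10_000_000;     # report every 10 million lines
    my $rate = $. / ( time - $start );
    printf STDERR "%d lines, %.0f lines/sec, roughly %.1f hours to go\n",
        $., $rate, ( $expected - $. ) / $rate / 3600;
}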
Update: Link corrected. Thanks to OverlordQ.
What do you think to a single perl script, ~20 hours and no additional storage requirements on commodity hardware?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Did you mean to link to here instead?
Re: Catching Cheaters and Saving Memory
by caelifer (Scribe) on Oct 12, 2006 at 20:40 UTC
I would like to offer a different solution which may be completely off-topic. Since you are dealing with large datasets, IMHO, SQL is the perfect tool here. Moreover, any decent SQL backend is optimized to deal with memory issues as well as the efficiency of SQL queries. For illustration I will use PostgreSQL syntax, but this generally applies to other databases as well.
Suppose you created a database named test. I would start by creating one table as in:
% cat <<_EOC | psql test
CREATE TABLE rawdata (
    uid       INTEGER,
    thread_id INTEGER,
    voted     INTEGER DEFAULT 0
);
_EOC
Then I would populate it like this:
% perl -nle 'printf "INSERT INTO rawdata VALUES (%d, %d, %d);\n", split' < data.txt | psql test
Now, to the fun part:
% cat <<_EOSQL | psql test
-- Use a transaction so all temporary views are destroyed after rollback.
BEGIN TRANSACTION;

-- Create a view with the voting history per account
CREATE VIEW vote_histogram AS
    SELECT t1.uid AS uid, t2.uid AS voted_for, sum(t1.voted) AS count
    FROM rawdata AS t1, rawdata AS t2
    WHERE t1.thread_id = t2.thread_id AND t1.uid != t2.uid
    GROUP BY t1.uid, t2.uid
    ORDER BY t1.uid;

-- Create a view of suspect accounts whose votes for others exceed a set threshold
CREATE VIEW suspects AS
    SELECT uid, voted_for, count
    FROM vote_histogram
    WHERE count > $THRESHOLD;

-- Build a cross-referenced view of accounts who vote for each other very often
SELECT s.uid AS suspect, h.uid AS voted_for, s.count AS suspect_votes, h.count AS returned_votes
FROM suspects AS s, vote_histogram AS h
WHERE s.voted_for = h.uid AND s.count <= h.count
ORDER BY s.uid, s.count DESC;

ROLLBACK;
_EOSQL
Voila! You've got your suspects in a nicely formatted table. Of course, you can limit the number of rows displayed as well as adjust your THRESHOLD parameter. I wrapped the SQL statements in a transaction so all the temporary objects are destroyed automatically once you finish processing. However, if you have a lot of space available, you can let them live a bit longer so you can run different queries against them.
BR
Care to estimate how long that would take to run?
On commodity hardware, I'm guessing (several) days rather than hours.
And that vote_histogram view is going to consume 12 terabytes minimum. Temporary or not, that's a big chunk of hardware to consume, assuming it is available.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
If your 12 terabyte estimate comes from Re: Catching Cheaters and Saving Memory, then it is an overestimate. The view is likely to be somewhere north of 400 GB. 12 terabytes is an estimate of how much data you need to throw around to sort and group that dataset. But you won't need all 12 terabytes at once, you only need a terabyte or so of it.
However a database should do that sorting step somewhat faster than the kind of naive program that I'd expect to see someone write for a one-off. For one thing the first several sort and group steps can be done in chunks all in memory. It doesn't need to hit disk until the amount of data you need to throw around exceeds your memory limit. Databases are smart about switching to disk only when they need to. That may double the speed. (I would be unlikely to do that for a one-off because it is a lot of logic to write.)
That said, databases aren't magic, and this problem is one which would show how non-magical they are. First of all you're likely to run out of disk if you're using commodity hardware. (Oops.) And as you pointed out elsewhere, pre-filtering your dataset so you can focus on only the people you're going to care about is a far bigger win than anything a database knows how to do.
Re: Catching Cheaters and Saving Memory
by BrowserUk (Patriarch) on Oct 12, 2006 at 16:31 UTC
Re: Catching Cheaters and Saving Memory
by xdg (Monsignor) on Oct 12, 2006 at 21:01 UTC
It seems like you're structuring your data around users rather than around the pair-wise observations. Having a hashref for each user is costly. As someone else suggested, join up the user IDs into a single key. And since you only care about when both are in the same thread or both vote, you can sort them before you join them into a key.
sub join_ids {
    return $_[0] < $_[1] ? "$_[0];$_[1]" : "$_[1];$_[0]";
}
Depending on the nature of your user IDs, you could possibly pack() them rather than doing string concatenation.
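For example, if the IDs fit in 32 bits (an assumption), a packed key is only 8 bytes:

# 8-byte key instead of a "1234;5678" string; assumes IDs fit in 32 bits
sub pack_ids {
    my ( $lo, $hi ) = sort { $a <=> $b } @_;
    return pack 'NN', $lo, $hi;
}

# and to recover the pair for the report:
# my ( $u1, $u2 ) = unpack 'NN', $key;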
Then, for every pair observation, you're tracking two counters: the number of times together and the number of votes for each other. Rather than keep two separate hashes (thus duplicating the storage of key pairs) or using an HoA, why not just pack() the two numbers into the value, then unpack and update them each time you have an observation:
# assume my %obs for holding data;
my $zerozero = pack( "LL", 0, 0 );
for ( my $i = 0; $i <= $#current_user_ids; $i++ ) {
    for ( my $j = $i; $j <= $#current_user_ids; $j++ ) {
        my $key = join_ids( $current_user_ids[$i], $current_user_ids[$j] );
        my ( $seen, $voted ) = unpack( "LL", $obs{ $key } || $zerozero );
        $seen++;
        if ( $current_votes[$i] == 1 && $current_votes[$j] == 1 ) {
            $voted++;
        }
        $obs{ $key } = pack( "LL", $seen, $voted );
    }
}
That gives you a single hash, with a fairly compact data representation (even more compact if you pack the IDs). It almost exactly matches your desired output, too.
If it's still too big to manage (if the number of user-pairs is high), I'd break it up by thread count -- e.g. 1 million threads per file. The output file would be userpairs with seen counts and vote counts and then you can just sum those. (It also parallelizes across machines nicely.)
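Summing the per-chunk output files afterwards is then straightforward (the glob and column order are assumptions, and this assumes the combined result fits in memory; if not, the same merge-and-sum trick works here too):

use strict;
use warnings;

my %total;    # "user1;user2" => [ seen, voted ]
for my $file ( glob 'chunk_*.out' ) {
    open my $fh, '<', $file or die "$file: $!";
    while (<$fh>) {
        chomp;
        my ( $pair, $seen, $voted ) = split /\t/;
        $total{$pair}[0] += $seen;
        $total{$pair}[1] += $voted;
    }
    close $fh;
}
print join( "\t", $_, @{ $total{$_} } ), "\n" for sort keys %total;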
-xdg
Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
Re: Catching Cheaters and Saving Memory
by BrowserUk (Patriarch) on Oct 13, 2006 at 09:49 UTC
Given those numbers, it's time to be a little smart about what you need to know.
Since you're looking for cheaters, your first task is to decide at what level a mutually beneficial voting pattern can be attributed to cheating rather than coincidence. For example, if two users have voted once each, in the only thread they have both posted to, you're probably not going to divine that as a pattern of malicious intent.
With a million users, a billion threads and 5 billion posts, each user will have an average of 5 posts. If you set a threshold for the minimum number of votes a user has to have cast before you will start vetting them for cheating--say 20 votes?--then make your first pass of the data accumulating votes against users. This will require a hash of 1 million numerical scalar values, around 50 MB.
You then delete any entries in the hash that have fewer than your minimum_votes_cast threshold. Set at 20, this is likely to discard 80 or 90% of the users. You can then make your second pass, accumulating a hash of pairs showing for whom each of the qualifying voters cast their votes. As suggested elsewhere, if you use the pairs of user IDs as the keys in a single hash, skipping any voters that do not still exist in your first-pass hash, then this will likely take less space than the original hash before the discard step.
It might look something like this (untested):
use constant { POSTED => 0, VOTED => 1, RATIO => 2 };

## Accumulate users only if they voted, and count their votes.
my %users;
open BIGFILE;
while( <BIGFILE> ) {
    my( $user, $thread, $voted ) = split;
    ++$users{ $user } if $voted;
}
close BIGFILE;

## Discard all users who have voted fewer times than a sensible threshold.
$users{ $_ } < MIN_VOTES_THRESHOLD and delete $users{ $_ } for keys %users;

## Re-scan the file, accumulating counts of posts and votes
## *for those userids remaining in %users only*
## Assumes file ordered by threadid.
my %pairs;
open BIGFILE;
my( $user, $thread, $voted ) = split ' ', <BIGFILE>;
my $lastThread = $thread;

MAINLOOP:
while( 1 ) {
    my @users;
    while( $thread == $lastThread ) {
        ## Accumulate users/votes in each thread
        push @users, "$user:$voted";
        ( $user, $thread, $voted ) = split ' ', <BIGFILE>;
        last MAINLOOP if eof( BIGFILE );
    };
    $lastThread = $thread;

    ## Permute them to generate pairs
    for my $pair ( Cnr 2, @users ) {
        my( $user1, $voted1 ) = split ':', $pair->[ 0 ];
        my( $user2, $voted2 ) = split ':', $pair->[ 1 ];

        ## Skip if either is not in the 'high voters' list
        next unless exists $users{ $user1 } and exists $users{ $user2 };

        ## Otherwise increment the coincident pair count (and vote if applicable).
        my $pairKey = pack 'LL', $user1, $user2;
        ++$pairs{ $pairKey }[ POSTED ];
        ++$pairs{ $pairKey }[ VOTED ] if $voted1;
    }
}

## Scan the pairs generating a ratio of votes to posts.
my( $totalRatio, $maxRatio ) = ( 0 ) x 2;
for ( keys %pairs ) {
    my $pair = $pairs{ $_ };
    $pair->[ RATIO ] = ( $pair->[ VOTED ]||0 ) / $pair->[ POSTED ];
    $totalRatio += $pair->[ RATIO ];
    $maxRatio = $pair->[ RATIO ] if $maxRatio < $pair->[ RATIO ];
}

## The average ratio of pairwise votes to posts might form the basis for discrimination
my $averageRatio = $totalRatio / ( keys %pairs || 1 );
printf "The voted/posted ratios averaged to %f; with a maximum of %f\n",
    $averageRatio, $maxRatio;

## Display those pairs with a vote/post ratio above the threshold.
$pairs{ $_ }[ RATIO ] > POST_VOTE_THRESHOLD
    and print "Pair @{[ unpack 'LL', $_ ]} had a vote/post ratio of @{ $pairs{ $_ } }[ POSTED, VOTED, RATIO ]"
        for keys %pairs;
On my hardware, the two passes would take around 20 hours. The memory consumed will depend upon the level at which you set MIN_VOTES_THRESHOLD, but should be under 300 MB if you set it at a reasonable level.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
With a million users, a billion threads and 5 billion posts, each user will have an average of 5 posts.
Doesn't that come out to an average of 5000 posts per user? Which would be far above a sane minimum for cheating.
That not only eliminates much of the filtering (although the distribution of posts per user might still be uneven enough to filter out some users), but it also brings the (non-filtered) number of hash entries up to anywhere from 5e6 (an even 5 posts per thread, and everyone is cheating) to 2.5e10 (an even 5 posts per thread, but no user pair is repeated) to 1e12 (some threads are sufficiently large, and everyone meets everyone else).
You could reduce the memory requirements a little by running it in two passes: one to count votes, and a second to count coincident posts only for user pairs that voted for each other enough times to be suspicious.
Re: Catching Cheaters and Saving Memory
by shmem (Chancellor) on Oct 12, 2006 at 19:39 UTC
Sounds like a classic for PDL, since you only deal with numbers and only a fraction of your matrix slots really need to be allocated.
--shmem
_($_=" "x(1<<5)."?\n".q·/)Oo. G°\ /
/\_¯/(q /
---------------------------- \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Catching Cheaters and Saving Memory
by Fletch (Bishop) on Oct 12, 2006 at 20:16 UTC
Seconding the "use a real DB" recommendations. Another thing to consider is that since you're basically storing a single bit (true/false), you could use vec and store the data packed that way (with suitable convenience wrappers to extract or set the value for a given thread). That'd get you 8 threads per byte, and with a lucky data set you might be able to use gzip to compress things for longer-term storage (i.e. if you've got long contiguous runs of trues or falses it'll shrink pretty well). I once did something similar where I had 10,000 bits stored in under half a K on disk.
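A rough illustration of the vec() idea (the hash and sub names are invented, and indexing bits by thread ID like this only makes sense if the IDs stay reasonably small):

use strict;
use warnings;

my %voted_bits;    # user id => packed bit string, one bit per thread

sub set_vote {
    my ( $user, $thread ) = @_;
    $voted_bits{$user} = '' unless defined $voted_bits{$user};
    vec( $voted_bits{$user}, $thread, 1 ) = 1;
}

sub did_vote {
    my ( $user, $thread ) = @_;
    return defined $voted_bits{$user}
        && vec( $voted_bits{$user}, $thread, 1 );
}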
Re: Catching Cheaters and Saving Memory
by mr_mischief (Monsignor) on Oct 18, 2006 at 17:34 UTC
It seems to me you could cull the data significantly. I'm not sure if my proposition works for your real situation, but from my mental model of what you're trying to do it makes sense to me.
People who are going to cheat are far more likely to do so with a few people they feel they can trust. They are also going to repeat the behavior, or it wouldn't really be a trend you could call cheating anyway. A system that plays to these facts may make more sense than comparing every user to every other user.
Envision a table for each user. In Alice's table, we have a row for each of the most recent 50 users Alice has voted up.
You only need three columns besides the user IDs. One is the number of times Alice voted for a thread that benefits that user since time X. The second is the number of times that user has voted for a thread which benefits Alice since time X. The third is the number of times those votes came from the same thread. X can be however long it takes Alice to affect 50 users. If your limit of users per thread is low enough, this will show some trends at 50 users.
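A rough sketch of that rolling per-user table (untested; all the names and the 50-row cap are illustrative):

use strict;
use warnings;

my %table;    # user => { other => [ voted_for_them, voted_for_me, same_thread ] }
my %order;    # user => [ counterparts, oldest first ]

sub note_benefit {
    my ( $user, $other, $column ) = @_;    # $column is 0, 1 or 2 as described above
    if ( !exists $table{$user}{$other} ) {
        push @{ $order{$user} }, $other;
        if ( @{ $order{$user} } > 50 ) {    # drop the oldest counterpart
            my $gone = shift @{ $order{$user} };
            delete $table{$user}{$gone};
        }
        $table{$user}{$other} = [ 0, 0, 0 ];
    }
    $table{$user}{$other}[$column]++;
}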
An event in which everyone wants more ++ for everyone involved but in which each person can only participate once reminds me of a hand of poker. Some people raise or call, and others fold. Everyone who doesn't fold has the potential to benefit. Two people who consistently raise on the same hands at different tables are likely table talking.
So I'm picturing seats at a table game, up to 6 users. If one of those users is Alice, then you're looking for 5 other people per group. If you're tracking up to 50 players Alice benefitted, then that's at least the last ten hands in which Alice raised.
Assuming 4 bytes for each user ID and 2 bytes for each vote column, you end up with 8 bytes per user * 50. That's 400 bytes.
Now, assume you have 92 bytes of overhead for each table, 92 bytes overhead for the vote count row, and 25 bytes of overhead for the userid on that row. That's (8 + 25 + 92) * 50 + 92 = 6342 bytes per user. Times a million, that's 6,342,000,000 bytes. If you only need the ten most recent users that overlap to find your trends, it's 20% of that. You might recognize these overheads from earlier in the thread as hash and string overheads in Perl5. A good DBMS may use less overhead.
Every time Alice and another user, Bob, have mutually upvoted each other an abnormal number of times, that's a suspected cheating checkmark for both Alice and Bob. A separate table marking the number of times each user is suspected of cheating can be kept with just their ID and the count.
Further, deeper analysis could be done on the accounts of the top 100 or 1000 or 10,000 cheating suspects compared to each other. The combinatorics are a much smaller issue at those numbers than at a million, and your cheaters will tend to clump with other cheaters anyway since cheating means more than one person is involved by definition.
We're still talking about gigabytes, but at least not terabytes or petabytes. I do hope this type of model meets your needs. It's not as exact as a million by million table, but it should use a lot less memory.