Re^5: Catching Cheaters and Saving Memory

by tilly (Archbishop)
on Oct 13, 2006 at 19:55 UTC


in reply to Re^4: Catching Cheaters and Saving Memory
in thread Catching Cheaters and Saving Memory

Actually your memory estimate is way off. The WHERE clause eliminates most of the view, and it is possible to generate the view using relatively little working space. (Enough to sort the original dataset by session, then enough working space for processing a single session.) In fact, assuming half a billion threads with 10 users each, you'd only need 540 GB or so - that's about a 95% saving!
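
For the curious, here is the arithmetic behind that figure as a short Perl sketch. The per-record layout (two 4-byte user ids plus a 4-byte count for each ordered voter/votee pair) is my assumption, but it reproduces the number:

    use strict;
    use warnings;

    # Assumed layout: two 4-byte user ids plus a 4-byte count per record.
    my $threads          = 500_000_000;    # half a billion threads
    my $users_per_thread = 10;
    my $pairs_per_thread = $users_per_thread * ($users_per_thread - 1); # 90 ordered pairs
    my $records          = $threads * $pairs_per_thread;                # 45 billion rows
    printf "%.0f GB\n", $records * 12 / 1e9;                            # prints "540 GB"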

Unfortunately doing anything interesting with the view, such as sorting and grouping it, will require throwing a lot of memory around. (That was where my 12 terabyte back-of-the-envelope estimate came from.)

About query plans, I agree that it is very instructive to see that kind of detail, and I wish more people were aware of it. But at the same time I like not having to think about it unless I really have to.


Re^6: Catching Cheaters and Saving Memory
by caelifer (Scribe) on Oct 15, 2006 at 05:50 UTC
    I'm kind of glad my proposed solution generated such responses. However, there are a few points I disagree with:

    1. How would using an SQL backend differ from using Perl for this particular solution (with regard to memory usage)? Given the amount of developer time spent on optimizing DB memory usage, I would guess it would be far less than that of an ad hoc solution made in Perl. Wouldn't you agree?

    2. I always thought that views are not actual tables and are more like indices, in the sense that they are collections of pointers to the actual records. Thus, your estimates for the vote_histogram and such are off, IMHO. (To tell the truth, I cannot back this statement up at all, at least at the moment. If somebody has good insight into how views are actually constructed, I would really appreciate the pointers!)

    3. As already pointed out, the WHERE clause should eliminate most of the cases. We are looking for people with a large number of votes for each other. Assuming that most people are not 'bad', we are left with a relatively small number of suspects.

    4. As for the efficiency of the actual query evaluation, I think this is the kind of once-in-a-while problem that does not require an everyday solution. I also doubt the quantity of actual data to begin with. How many sites can you name which have millions of unique users who actively participate in some sort of open forum and actually do vote? Or, for that matter, how many active topics can such a site carry? Also, I'm sure that this data can be split into parts (like historic and current) and dealt with separately.

    Nevertheless, the discussion was really informative. Thank you.

    BR
      Here are responses to your points.
      1. How would using an SQL backend differ from using Perl for this particular solution (with regard to memory usage)? Given the amount of developer time spent on optimizing DB memory usage, I would guess it would be far less than that of an ad hoc solution made in Perl. Wouldn't you agree?

        The SQL backend would indeed carry out a given execution plan with less memory than an ad hoc Perl solution. It would also use optimizations that an ad hoc solution would not. For instance, I mentioned a simple optimization that databases would use in Re^3: Catching Cheaters and Saving Memory.

        That said, databases are not magic. Give them a problem, and there are only so many approaches that they can take. Following a given approach, there is a necessary amount of work they need to do. This work is fairly predictable, and it can reasonably be estimated. (In fact I wind up having to do a lot of this kind of estimation in my job.) And in this case it is going to amount to a lot of work. Unless you have significant resources, you're going to have a problem.

        A one-off solution can beat a database. I should know; I've done it plenty of times. There are three ways in which this can happen. The first is when the database doesn't correctly find the strategy that it should use, but you can. The second is when you use a trick that the database doesn't know about. The third is when thinking about the problem from the nuts and bolts up leads you to a shortcut that could be done in the database, but which you wouldn't have thought of while thinking in database terms.

        In this case the right trick is to pre-filter the data to look only at people who were in a lot of threads. You're unlikely to think that up while thinking in SQL. You can express it, but it isn't very natural. You have to create a view from rawdata of the people who have been in enough threads, join that back to rawdata to get their vote data, then proceed down the rest of your path. People don't normally think up solutions that involve joining a table with a query off of that table.
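
        To make that shape concrete, here is a sketch in Perl/DBI. The table and column names (rawdata, voter, votee, thread), the SQLite backend, and both thresholds are my assumptions, not anyone's actual schema:

            use strict;
            use warnings;
            use DBI;

            my $dbh = DBI->connect( 'dbi:SQLite:dbname=votes.db', '', '',
                                    { RaiseError => 1 } );

            # Pre-filter: a view of the people who show up in many threads.
            $dbh->do( q{
                CREATE VIEW IF NOT EXISTS busy_users AS
                    SELECT voter AS user_id
                      FROM rawdata
                  GROUP BY voter
                    HAVING COUNT(DISTINCT thread) >= 20
            } );

            # Join the view back to rawdata, so the expensive voter/votee
            # histogram is built only for the pre-filtered users.
            my $suspects = $dbh->selectall_arrayref( q{
                SELECT r.voter, r.votee, COUNT(*) AS votes
                  FROM rawdata    AS r
                  JOIN busy_users AS b ON r.voter = b.user_id
              GROUP BY r.voter, r.votee
                HAVING COUNT(*) >= 20
            } );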

        Well, I'm being unfairly nice to the database. In an ideal world you should be able to say something like, "And we're only interested in people with at least 20 votes," and the database would be able to reorganize its query plan to put in a pre-filtering step that cuts data out early. In the real world no database engine that I know of will optimize that cleverly, nor is any likely to soon, for the simple reason that once you open the door to that kind of optimization, the optimizer winds up with so many plans to consider that it becomes a performance bottleneck itself.

      2. I always thought that views are not actual tables and are more like indices, in the sense that they are collections of pointers to the actual records. Thus, your estimates for the vote_histogram and such are off, IMHO. (To tell the truth, I cannot back this statement up at all, at least at the moment. If somebody has good insight into how views are actually constructed, I would really appreciate the pointers!)

        Well, some views, namely materialized views, really are actual tables. But usually views are not. However, views aren't anything like collections of pointers to the actual records. (There you're mixing views up with indexes, which are organized collections of pointers to the actual records.)

        A view is normally nothing but a chunk of SQL. You drop it into your query and (assuming a decent optimizer - Access need not apply) it becomes part of the query that gets optimized. When you go to query it, though, the database has to create the part of that view that it needs on the fly. (This is where the optimization part comes in. Whenever possible the database only creates the part of the view it actually will need, and doesn't waste time generating parts of the view that will be thrown away later.)
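
        A toy illustration of that, using an in-memory SQLite database (my own example, not the OP's schema):

            use strict;
            use warnings;
            use DBI;

            my $dbh = DBI->connect( 'dbi:SQLite:dbname=:memory:', '', '',
                                    { RaiseError => 1 } );
            $dbh->do('CREATE TABLE rawdata (voter INT, votee INT, thread INT)');

            # The view stores no rows; only this SQL is saved under a name.
            $dbh->do( q{
                CREATE VIEW vote_histogram AS
                    SELECT voter, votee, COUNT(*) AS votes
                      FROM rawdata
                  GROUP BY voter, votee
            } );

            # To the optimizer, querying the view is the same as splicing
            # its SQL in as a subquery:
            #   SELECT * FROM ( SELECT voter, votee, COUNT(*) AS votes
            #                     FROM rawdata GROUP BY voter, votee )
            #    WHERE votes >= 20
            my $rows = $dbh->selectall_arrayref(
                'SELECT * FROM vote_histogram WHERE votes >= 20' );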

        In this case, unfortunately, all of the view is needed, so the database has to create the whole thing on the fly. If the view is large, the database will need substantial working space to do this. You can estimate this space by estimating the size of a record, and the number of records that will need to exist. Sometimes you can be pleasantly surprised because it is not all needed at once, but in this case I think it is.

      3. As already pointed out, the WHERE clause should eliminate most of the cases. We are looking for people with a large number of votes for each other. Assuming that most people are not 'bad', we are left with a relatively small number of suspects.

        Sorry, but no.

        The way that you coded it, the condition "people with a large number of votes for each other" is something that you can only apply after you've calculated how many times each person voted for each other person. And that's going to be a large dataset.
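
        Spelled out (column names are my guesses), the condition lives in a HAVING clause, and HAVING is only evaluated after the GROUP BY aggregation has built the full histogram:

            use strict;
            use warnings;

            # HAVING runs after GROUP BY, so every (voter, votee) pair is
            # counted before any row can be thrown away.
            my $sql = q{
                SELECT voter, votee, COUNT(*) AS votes
                  FROM rawdata
              GROUP BY voter, votee     -- count every pair first...
                HAVING COUNT(*) >= 20   -- ...only then filter the counts
            };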

      4. As for the efficiency of the actual query evaluation, I think this is the kind of once-in-a-while problem that does not require an everyday solution. I also doubt the quantity of actual data to begin with. How many sites can you name which have millions of unique users who actively participate in some sort of open forum and actually do vote? Or, for that matter, how many active topics can such a site carry? Also, I'm sure that this data can be split into parts (like historic and current) and dealt with separately.

        I agree that this is not an everyday problem. Most programmers never need to worry about this kind of data consideration. However I see no reason to accuse the OP of misstating the requirements. Plenty of sites get that kind of volume. The problem description has a "web 2.0" flavour, which would explain why everyone is voting. Couldn't you see, say, http://del.icio.us/ coming up with a problem like this? (And they likely would solve it in Perl.) If you want a list of other possibilities, scan http://en.wikipedia.org/wiki/List_of_social_networking_websites for sites with at least a million users.

        And yes, I'd agree that if we understood the business problem better, it would probably be possible to segment the data and handle each segment separately. Unfortunately you'd need to do that segmenting before producing the data set described by the OP, so even though it is probably possible, it is not a realistic option for the OP.
