Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Database Clean up?

by defyance (Curate)
on Aug 24, 2002 at 01:56 UTC ( [id://192510]=monkdiscuss: print w/replies, xml ) Need Help??

I work for a large corporation, obviously, we have millions upon, millions of customers.

As a result, we have millions of millions of database entries.

One of the services our company offers is Voice Mail, this of course requires a database(s). I just happen to manage one of those databases. As a part of my duties, I run what we call, and "Uninitialized Report". This report consists of a truckload of customers that have never bothered to use this wonderful service, for various reasons. From this report, I make the decision to keep, or delete the entry.

I propose that something similar be done with the PM database.

I'm sure this has been mentioned before, please, tell me if it has.

Why not, write a script that query's the databases user table's, and if the creation date/last here date match, drop the user.

This would clear up some space, quite a bit, I would think.

Any ideas/opinions/thoughts??

Replies are listed 'Best First'.
Re: Database Clean up?
by djantzen (Priest) on Aug 24, 2002 at 03:11 UTC

    The search criteria would need to be refined further. We'd want to dump users that haven't posted anything since we don't want nodes floating around whose authors no longer exist -- assuming this could even happen without violating foreign key contraints. But I don't think it's uncommon that people create an account in order to ask a question, and then leave.

    We'd also have to choose a somewhat arbitrary cutoff date for delinquent users; if the creation date and last-here date match but occurred last week, dropping that user would be uncalled for. More likely we'd want to look at users that haven't visited in over a year or so.

    Looking at users that have never created a node, and whose creation/last-here date occurred over a year ago would (I *speculate*) result in a rather short list. We might have a few hundred users but that's a trivial amount to store in a database, unlike your situation.

    As to the safety issue of altering a production system, adding and deleting entries is just normal use for a database -- it's not like we're dropping tables or changing schemas, so I truly doubt that any havoc would result.

    Update: Cool link theorbtwo. That's more users than I'd expected, although compared to the number of writeups (156786, not counting Q&A presumeably), even say 5,000 or so nonparticipating users is unlikely to be a drain on the system.

      Actualy, it's not as trivial a number of users as you'd think. There are 3,833 users who have never logged in, not even once. There are an additional 9,668 who have logged in, but have no nodes. (These numbers are straight off of the frontpage of Crack for Statisticians.) Even just the never-logged-ins are 20% of all users. (BTW, the second number is "Logged In Once, No Write Ups", but I'd be careful with it -- number-of-logins isn't stored, only last-logged-in, and the number-of-writeups is often wrong, and doesn't count Q&A nodes, IIRC.)


      Confession: It does an Immortal Body good.

        Thanks for the numbers.

        Lets assume that each user takes up 1K, or at least that much when all the overhead of there record is included.
        use strict; my $never_logged_in = 3_833; my $logged_in_no_nodes = 9_668; my $aka_bad_users = $never_logged_in + $logged_in_no_nodes; my $disk_bytes = 1_024 * $aka_bad_users; print "Extra records to wade through = $aka_bad_users\n"; print "Wasted disk space = $disk_bytes\n";
        It appears to be around 13MB of space wasted by having "bad users" on file, I don't find that number large enough to introduce any headaches at this time, but I think it does deserve some thought on how to deal with these records in the future. Even if the removal was done for just the space considerations on the HD I don't think it would generate any great savings. There is also the option of using some form of compressed table format if we don't already, this would make all of the records more space efficent. One future plan might be to just add a column to the user table such as 'is_active' as a key so that searches could be done using it in the where critera where only "active" users are important to the results. A user could be considered inactive until the first login and then be switched active. Inactivity could be triggered by no login for over X months.

        I have only dug around slightly in the Everything engine and I think removing node association may be a risky if not deadly exercise.

        A properly indexed database table can side step the performance impact of numerous "extra" entries in a database.
Re: Database Clean up?
by FoxtrotUniform (Prior) on Aug 24, 2002 at 02:10 UTC

    Unless this is necessary (which is for the gods to decide, I suppose), I'd say "don't mess with the db". (I've learned a lot about messing with production databases in the past year or so, not all of it the easy way. :-)

    --
    F o x t r o t U n i f o r m
    Found a typo in this node? /msg me
    The hell with paco, vote for Erudil!

      I agree wholeheartedly there, I've felt the pain of messing with them once they are going and it hurts like heck ^_^. Also, wouldn't removing users who've posted once but never came back even create problems? I don't know anything about the system under the system at Perlmonks, but it seems to me that it is very much based off of ownership of nodes. Wouldn't these writeups of deleted accounts now be in limbo? Just a thought ~Adam
Re: Database Clean up?
by talexb (Chancellor) on Aug 24, 2002 at 16:52 UTC
    It's a good thought, but that's where it should remain. Unless you are running out of hard drive space or the system is horribly slow, just ignore the 'unused' table entries. You never know when one of those users might decide to call in for the first time .. and "Get lost! -- you don't exist!" wouldn't be a good first impression.

    In any case, I imagine that you could do any deletions using an SQL script -- delete from members where call_count = 0; -- so Perl doesn't even need to get involved.

    --t. alex
    but my friends call me T.

Re: Database Clean up?
by lachoy (Parson) on Aug 24, 2002 at 15:34 UTC

    Um, why? They're not hurting anyone, are they? This sounds like a solution is search of a problem :-)

    Chris
    M-x auto-bs-mode

      This was just an idea that poped up last night,a s I was bored.

      To answer your question simply, to save money, as I know some companies charge by the mb for overage on database usage, and bandwidth..

      Its obvious by the posts rep that this wasn't a good idea, I appologize for the waste of space.

      --~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--
      perl -e '$a="3567"; $b=hex($a); printf("%2X\n",$a);'

        This idea has already been discussed several times, but for different reasons. This is one of the threads.

        Main reason for deleting old users? With monastery growing fast and so many usernames that have been registered but never used, it's becoming more and more difficult to get a "nick" that you want to have. There've been some people frustrated being unable to register a username of their choice, because someone has done that a year or so before them, yet never used it.

Re: Database Clean up?
by JayBonci (Curate) on Aug 24, 2002 at 23:57 UTC
    In ecore, in the user table, lasttime is 0 if the user hasn't logged in (so that's one easy way to tell). However, as an ecore administrator, I'd like to admonish against dropping users (or nodetypes). Ecore 0.8 does not do well with deleting nodes of those types. Now, theoretically, users that have never logged in shouldn't have any record fragments to them, but certainly deleting users that have logged in is a bad idea.

    Ecore assumes when it gets a node (and oftentimes in the code), that the author_user attribute exists. Therefore, if you start deleting users, you're going to see a huge outcropping of Server Errors! until all the places where this assumption is false is special cased in. This includes messages, logs, etc.

    All things considered, the potential performance gain of deleting users is low, seeing as database key lookup is, I believe, O(log n). You're not going to pick up much by whacking those users, compared to the gigantic amounts of breakage you might incur.

    IMO, there are better places where you can pick up perf than carving out never-logged-in users. We've had this same discussion several times at Everything2.com

        --jb
Re: Database Clean up?
by screamingeagle (Curate) on Aug 24, 2002 at 18:10 UTC
    rather than deleting all those users, adding (and maintaining) a few well-thought out indexes on the tables should take care of any performance issues which might be a result of the large database size.
Re: Database Clean up?
by Abigail-II (Bishop) on Aug 26, 2002 at 09:38 UTC
    I've been involved in a MUD. It has registered users too. We've been around for 13, 14 years, and as a matter of policy, we don't delete accounts. We have high level accounts that haven't been used for thousands of days.

    And of course, the "name collision" is a known problem. If someone wants to use a name that was used by someone else, and that person hasn't logged on for a long time, a request for use of that name can be made. Such a request is usually granted by those with powers, but could be denied if the name is historical. If the request was granted, the old name will be filed (with the data associated with it) and made available for reuse.

    Abigail

Re: Database Clean up?
by Spenser (Friar) on Aug 25, 2002 at 05:38 UTC

    I'll give you one good, simple reason not to do this:  it would mean deleting tilly's 2986 write ups.  Those cannot be replaced.  Nor can the richness that every one else who has come and gone has provided this site, this community.

    -Spenser

      Umm, I don't see anybody suggesting deleting users that have writeups. In fact, most people aren't even suggesting deleting users that have ever logged in.


      Confession: It does an Immortal Body good.

      Thats an absolutely perfect reason not to do it.

      Thats why this was just an idea, to give good reasons to do, or not to do this. ;-)

      --~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--~~--
      perl -e '$a="3567"; $b=hex($a); printf("%2X\n",$a);'

(tye)Re: Database Clean up?
by tye (Sage) on Sep 02, 2002 at 19:59 UTC

    Er, we've already done this once. We only deleted users who had never logged in so they couldn't own any nodes so there would be no errors generated. It was done mostly to free up the names and slightly reduce clutter. It certainly wasn't done out of any concern for disk space, table efficiency, etc. We don't link nodes to owner via username, we do it on user ID number; so we could have renamed the users rather than delete them but there isn't much point in that.

    Unfortunately, we were wrong. "Faqmonk" has never logged in and yet owns the Perl FAQ nodes. Note that this does not generate any server errors (much less a flood), but rather just an ugly link in the "by" field of those nodes. See How do I do (anything)? for an example. Prior to some changes, it simply resulted in an "empty" link.

    And "faqmonk" still exists in the tomb and we just need to resurrect him. The resurrection code had some bugs in it so I'm just waiting for the person who was fixing that code to test it on faqmonk.

    We'll do this purge again occasionally. We'll probably also purge accounts that have logged in but never created nodes and haven't logged in for a really, really long time. There was some consideration to e-mailing a warning before deleting, and we'll consider it again the next time.

            - tye (but my friends call me "Tye")
Re: Database Clean up?
by Anonymous Monk on Aug 29, 2002 at 03:22 UTC

    Regardless of the merits of removing some accounts for some definition of 'some', it's seems to me that it was perhaps shortsighted to key on username, rather than say, user homenode id.

    If it had been done this way, allowing people to change the usernames (if they chose badly when they first came here, not thinking about haw long they might stay around), and removing un/little- used names from the system would have been a simple process. All write-ups having been keyed to the home node number, would simply stay in the system but would be attributed to user_nnnnnn.

    Still, that's beauty of 20/20 vision in hindsight.

    Difficult to know without seeing the schema, but (thoroughly tested on a backup DB of course) it doesn't seem that changing things to work this way would be an extraordinarily difficult thing to do.

Re: Database Clean up?
by richardX (Pilgrim) on Aug 29, 2002 at 23:03 UTC
    If you are using Oracle and your database is a VLDB (Very Large Database) then I would suggest using the PARTITION command when creating the tables. Partition on the key date and then you can drop a range of dates with very little impact on the database. Deleting a large number of rows in a large database can kill your server or cluster performance. See partition.htm for more info.

    Richard

    There are three types of people in this world, those that can count and those that cannot. Anon

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: monkdiscuss [id://192510]
Approved by rob_au
Front-paged by jarich
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (3)
As of 2024-04-20 02:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found