
Re: Fetch Mysql huge database

by rdfield (Priest)
on Sep 29, 2011 at 15:02 UTC (#928594)

in reply to Fetch Mysql huge database

getting rid of duplicates too:

select username
from table
where username != ''
  and username not like 'deleted%'
  and username not in (select username from other_table)
group by username
having count(*) > 1;

I've never used MySQL, but in other RDBMSs I would check that enough memory has been assigned to the buffer cache, sort area, hash sort area, etc. And by "enough", I mean "lots".

Joining two 34M-row tables on a decent enough server shouldn't take long. Nested iteration over two large tables does take a long time.

Having indexes on username on both tables would help a lot too.
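
A minimal sketch of the index advice, using Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere. The table name `users`, the index names, and the sample rows are made up for illustration; `other_table` and the filter conditions come from the query above. The `having count(*) > 1` clause is left out here, since it would keep only usernames that occur more than once. MySQL's planner and memory settings differ from SQLite's, so treat this as a shape to adapt rather than a benchmark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables standing in for the two large tables in the thread.
cur.execute("CREATE TABLE users (username TEXT)")
cur.execute("CREATE TABLE other_table (username TEXT)")

cur.executemany(
    "INSERT INTO users VALUES (?)",
    [("alice",), ("alice",), ("bob",), ("",), ("deleted_42",), ("carol",)],
)
cur.executemany("INSERT INTO other_table VALUES (?)", [("bob",)])

# An index on username in each table lets the engine look names up
# instead of rescanning one table for every row of the other.
cur.execute("CREATE INDEX idx_users_username ON users (username)")
cur.execute("CREATE INDEX idx_other_username ON other_table (username)")

# Usernames that are non-empty, not marked deleted, and absent from
# other_table. GROUP BY collapses repeated usernames to one row each.
cur.execute("""
    SELECT username
    FROM users
    WHERE username != ''
      AND username NOT LIKE 'deleted%'
      AND username NOT IN (SELECT username FROM other_table)
    GROUP BY username
    ORDER BY username
""")
missing = [row[0] for row in cur.fetchall()]
print(missing)
```

With the sample rows, "alice" appears twice in `users` but only once in the result, and "bob" is excluded because it exists in `other_table`.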


Re^2: Fetch Mysql huge database
by sundialsvc4 (Monsignor) on Sep 29, 2011 at 17:32 UTC

    Indexes would be quite mandatory, to the point that some DBMSes will temporarily create indexes in order to do such a join, or refuse to do the join.

    Even if you do wind up creating indexes only to immediately drop them, or if you set up an index that is “non-maintained” and it must be refreshed now and then, c’est la guerre.   Just a cost of doing business.

    (As of course you surely know... but just for the benefit of any onlookers who maybe don’t...) Left and right “outer” joins return every row of the table on the left or the right side of the join, even when that row has no match on the other side. In a left outer join, a NULL in the right-hand table’s columns therefore marks a non-matching row.
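
The outer-join-and-test-for-NULL idea can be sketched as follows, again with Python's sqlite3 module standing in for MySQL. The table name `users`, the aliases, and the three sample rows are assumptions for illustration only; the `LEFT OUTER JOIN ... WHERE ... IS NULL` form is standard SQL and should carry over, but check it against your own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables; names and rows are illustrative only.
cur.execute("CREATE TABLE users (username TEXT)")
cur.execute("CREATE TABLE other_table (username TEXT)")
cur.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",), ("carol",)])
cur.executemany("INSERT INTO other_table VALUES (?)", [("bob",)])

# LEFT OUTER JOIN keeps every row of users; where other_table has no
# match, its columns come back as NULL (None in Python).
cur.execute("""
    SELECT u.username, o.username
    FROM users AS u
    LEFT OUTER JOIN other_table AS o ON o.username = u.username
    ORDER BY u.username
""")
all_rows = cur.fetchall()

# Filtering on that NULL keeps only the rows with no match.
cur.execute("""
    SELECT u.username
    FROM users AS u
    LEFT OUTER JOIN other_table AS o ON o.username = u.username
    WHERE o.username IS NULL
    ORDER BY u.username
""")
unmatched = [row[0] for row in cur.fetchall()]

print(all_rows)
print(unmatched)
```

The first query shows the NULLs that the outer join produces; the second uses them to pick out usernames that exist only on the left side.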

    One characteristic of the original strategy is that it tries to use “memory” ... and when you have millions of anything, you absolutely cannot use “memory” to do it.   If you don’t literally run out of the stuff (after hours of painful grinding in which your disk-drive might sue for workmen’s comp), you might spend days waiting for a solution, as thrashing sends everything on your computer system to the nether regions.   Whereas the inner-join approach will be virtually instantaneous.   It won’t even take much memory, because all the SQL server’s doing is marching through two compatible index trees.

    Incidentally... “38 million rows” really isn’t that outrageously large.   Oh, but it’s large enough to put a bit of a test on your strategy and to call you out rather quickly if you didn’t pick a good one, though.   So, if you try something and you observe that the system is really getting bolluxed up, it’s time to stop and reconsider how you’re trying to solve this problem.   “There’s more than one way to do it™” and you can be sure that there must be a much-better way.   “Okay, time to sit down and eat a sandwich here... I’m obviously lost in the woods... that is to say, ‘momentarily displaced.’ ”

      Yes. On a large table, indexes are essential for any field (or group of fields) which you're going to use to filter or order your results. Create an index on the UserName field here, and it'll drastically speed things up.
