Re: Fetch Mysql huge database

getting rid of duplicates too:

select username
from table
where username != ''
and username not like 'deleted%'
and username not in (select username from other_table)
group by username
having count(*) > 1;
[download]

I've never used mysql, but in other RDBMS systems I would check that enough memory has been assigned to buffer cache, sort area, hash sort area, etc. And by "enough", I mean "lots".

Joining two 34M rows on a decent enough server shouldn't take long. Nested iterations over 2 large tables does take a long time.

Having indexes on username on both tables would help a lot too

rdfield

Comment on Re: Fetch Mysql huge database Download Code

Replies are listed 'Best First'.
Re^2: Fetch Mysql huge database by sundialsvc4 (Abbot) on Sep 29, 2011 at 17:32 UTC
Indexes would be quite mandatory, to the point that some DBMSes will temporarily create indexes in order to do such a join, or refuse to do the join. Even if you do wind up creating indexes only to immediately drop them, or if you set up an index that is “non-maintained” and it must be refreshed now and then, c’est la guerre. Just a cost of doing business. (As of course you surely know... but just for the benefit of any onlookers who maybe don’t...) Left and right so-called “outer” joins are those which cause all of the rows in the table on the left or the right side of the relationship to appear even if they have no matching row on the other side. Testing for `NULL` on the right side means that it is a non-matching row. One characteristic of the original strategy is that it tries to use “memory” ... and when you have millions of anything, you absolutely cannot use “memory” to do it. If you don’t literally run out of the stuff (after hours of painful grinding in which your disk-drive might sue for workmen’s comp), you might spend days waiting for a solution, as thrashing sends everything on your computer system to the nether regions. Whereas the inner-join approach will be virtually instantaneous. It won’t even take much memory, because all the SQL server’s doing is marching through two compatible index trees. Incidentally... “38 million rows” really isn’t that outrageously large. Oh, but it’s large enough to put a bit of a test on your strategy and to call you out rather quickly if you didn’t pick a good one, though. So, if you try something and you observe that the system is really getting bolluxed up, it’s time to `stop` and reconsider how you’re trying to solve this problem. “There’s more than one way to do it™” and you can be sure that there must be a much-better way. “Okay, time to sit down and eat a sandwich here... I’m obviously lost in the woods... that is to say, ‘momentarily displaced.’ ”
Re^3: Fetch Mysql huge database by aaron_baugher (Curate) on Sep 29, 2011 at 20:37 UTC
Yes. On a large table, indexes are essential for any field (or group of fields) which you're going to use to filter or order your results. Create an index on the UserName field here, and it'll drastically speed things up.	[reply]


"be consistent"
	PerlMonks