For bulk processing like this, I'd sidestep DBI and use command lines tools.
Create a file called: getnames.sql containing:
select UserName from table where UserName <> '' and UserName not like
+'deleted%'
And two perl scripts. 1) partNames.pl: #! perl -w
use strict;
my %fhs;
while( <> ) {
my $char = substr $_, 0, 1;
if( not exists $fhs{ $char } ) {
open $fhs{ $char }, '>', "names.$char" or die $!;
}
print { $fhs{ $char } } $_;
}
And 2) uniqNames. pl: #! perl -w
use strict;
for my $fname ( glob 'names.*' ) {
my %uniq;
open my $fh, '<', $fname or die $!;
while( <$fh> ) {
print unless exists $uniq{ $_ };
$uniq{ $_ } = 1;
}
}
Then run the following pair of commands: mysql --quick < getnames.sql | perl partNames.pl
perl uniqNames.pl > uniqedNames.txt
The first line dumps the names and pipes them to the perl script that then writes them to a set of files (called names.?, where ? is the first character of the name).
The second script reads those files one at a time and writes the unique names to stdout where they are directed into a file called uniqedNames.txt.
The whole (untested) process should take less time than sorting the 34 million names -- inside the DB or with your system sort utility. I'd hazard a guess that it would take less than 20 minutes.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
|