Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Re: loading csv files into MySQL databse

by hesco (Deacon)
on May 17, 2008 at 17:52 UTC ( [id://687111]=note: print w/replies, xml ) Need Help??


in reply to loading csv files into MySQL databse

Valentine Sinemani:

All of this advice above is good, using a mysql database engine. LOAD DATA will do mass imports. My issue with it is that (I believe that) single errors will cause an entire import to fail. I work with lots of voter files on a regular basis. And I use perl, not php, as does your friend.

You mention a couple of issues:

Timeouts: get the web browser out of the way between your import script and the database engine. Try to get direct shell level access to the database. It will save you days of frustration.

pdf-ocr-csv conversion: Consider whether pdftotext (a utility a part of the xpdf library in linux) might save you the trouble of dealing with the optical character recognition work. You can handle the pdf's as simple electronic files without the need to print them and then rescan them into electronic files. This may not work depending on the nature of the pdf's you were given. You will then need to parse the resulting data into your csv format. You might find that awk and sed are at least as useful for that purpose as perl.

Working with the imports: My _build_insert() function below escapes the single quotes as they occur. With the caveat that I am working in perl against a postgresql database engine, while you are working in php on a mysql db engine, here is a sample script I use for voter file imports:

#!/usr/bin/perl -w use strict; use warnings; use Data::Dumper; use DBI; my $file = '/home/hesco/lists/voters/ca/psr00693.txt'; my $db = 'voters'; my $host = '127.0.0.1'; my $user = 'vtr_user'; my $pw = 'secret'; my $dsn = "dbi:Pg:database=$db;host=$host"; my $dbh = DBI->connect($dsn, $user, $pw, { RaiseError => 0, PrintError => 0, AutoCommit => 0 }) o +r die $DBI::errstr; my $precinct_sql =<<EOQ; SELECT precinct_ndx FROM precinct WHERE precinct = ? AND precinct_part = ?; EOQ my @fields = (); my $counter = 0; my $field_names; my ($insert_reg,$sth,$prec_id); my ($insert_address,$phone,$insert_hy,$t); my ($m,$d,$y); my $restart = 0; open ('VOTERS','<',$file); while (<VOTERS>){ $counter++; @fields = split ",",$_; # for csv # @fields = split "\t",$_; # for tab delimited my($ndx); if($counter == 1) { foreach my $ndx (0 .. (0 + @fields)){ $fields[$ndx] =~ s/\ *$//; $fields[$ndx] =~ s/\s*$//; $fields[$ndx] =~ s/'/\\'/; } my $field_names_tmp = \@fields; # print STDERR "\$field_names_tmp is: ", Dumper($field_names_tmp); foreach my $i (0 .. (0 + @fields -1)){ $field_names->{'ndx'}->{$i} = $fields[$i]; $field_names->{'field_name'}->{$fields[$i]} = $i; } # print "\$field_names is: \n", Dumper(\$field_names); foreach my $i (0 .. (0 + @fields -1)){ print "$i -- $fields[$i]\n"; } next; } # registrants table my @reg_indexes = (0,1,25,2,3,4,5,5,26,27,28,29,30,31,34,35,36,39); my @hy_indexes = (1,58,59,60,61,62,63,64,65); my @addr_indexes = (1,6,7,8,9,10,11,12,13,14,15,16); # define the indexes for any other tables # into which you want to normalize this data $sth = $dbh->prepare($precinct_sql); $sth->execute($fields[32],$fields[33]); ($prec_id) = $sth->fetchrow_array(); if(!defined($prec_id)){ $prec_id = -1; } $insert_reg = "INSERT INTO registrant VALUES("; $insert_reg = _build_insert(\@fields,$insert_reg,@reg_indexes); $insert_reg =~ s/,$/,$prec_id);/; # print "\$insert_reg is: " . $insert_reg . "\n"; # print "registrant_id is: " . $fields[1] . ":"; $sth = $dbh->prepare($insert_reg); $sth->execute() or warn "\$sql is: $insert_reg, error is: " . $dbh-> +errstr; print "\$registrant_id is: " . $fields[1] . "\t" . $phone . "\n"; # +, $insert_hy,"\n"; $dbh->commit; } 1; sub _build_insert { my($fields, $insert, @indexes) = @_; foreach my $field (@indexes){ unless(defined($fields[$field])){ $insert .= "'',"; next; } $fields[$field] =~ s/\ *$//; $fields[$field] =~ s/\s*$//; $fields[$field] =~ s/'/\\'/; $insert .= "'$fields[$field]',"; } return $insert; }
This is the basics. Its simple and proceeds by brute force, one record at a time. Its willing to permit an occasional error fall by the way side, while it continues to process records. By adding a condition early on to check for a particular precinct or political party, I am able to import only the records I need. I used a variation on this script last week and imported 50,000 records in about an hour.

Thank you for being engaged in the process, against such incredibly violent opposition.

Yours for a just and sustainable future,
-- Hugh Esco
CampaignFoundations.com

if( $lal && $lol ) { $life++; }

Replies are listed 'Best First'.
Re^2: loading csv files into MySQL databse
by EvanCarroll (Chaplain) on May 17, 2008 at 18:05 UTC
    LOAD DATA will do mass imports. My issue with it is that (I believe that) single errors will cause an entire import to fail.
    Sometimes... If the error is in the data-output being inconsistent then maybe. If the error is because the row violates a pkid then again check out REPLACE / IGNORE clauses. PG does not yet have these options and they save a tremendous amount of time.

    Do not ever split on \t for a tab separated file. I've done this many times, that solution is stupid and crufty. Set the sep_char in Text::CSV. All of what your doing now, less the chopping of white space can be done by Text::CSV. It will work faster (if it can make use of Text::CSV_XS) AND it will handle the edge cases which you aren't. There is even DBD::CSV (uses csv_sep_char) which will use Text::CSV_XS on the underside, and you can simple select from one table and insert it into SQL!! Zomfg, split is suicide here.


    Evan Carroll
    I hack for the ladies.
    www.EvanCarroll.com

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://687111]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (7)
As of 2024-04-16 07:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found