PerlMonks  

loading csv files into MySQL database

by zimvoter (Novice)
on May 16, 2008 at 17:27 UTC
zimvoter has asked for the wisdom of the Perl Monks concerning the following question:

Hi

Please help. I am based in Zimbabwe and have volunteered to put the voters roll in a database for the opposition government-in-waiting. The ruling dictator Mugabe does not offer the database in electronic format. We are to have a second round of elections in the next 4 weeks, and many of our supporters have been killed and beaten by the Mugabe regime. Despite all this we are confident that the will of the people will prevail via these elections. We need a database so that displaced people and others can find out where and whether they are registered to vote. I have a young programmer who has been working on loading the data into a MySQL database. We have all the data in Excel or CSV files. We have 210 constituencies, with about 5.9 million voters registered in total.

We are looking for a solution that will quicken and ease the process of loading into the MySQL database. We need someone who knows PHP and MySQL to assist my associate, by the name of Green. Below is his description of the problem he is facing and an example of the error messages.

We are working with PHP and MySQL to develop a database front-end to manage data we collected from around the country, on a Windows operating system. We got the data files in PDF format, and they are being converted via an OCR process with pdf2XL so that we can extract the data to input into the database. We have created the CSVs for uploading to our MySQL database, and we have run into a problem of timeouts and of apostrophes/single quotes contained in the names and addresses. Is there a way of making MySQL ignore the single quotes/apostrophes when uploading the data? The data we have to upload runs into millions of records, and we have limited time to get the database up and running.

Example error message

Error
There seems to be an error in your SQL query. The MySQL server error output below, if there is any, may also help you in diagnosing the problem
ERROR: Unclosed quote @ 225010
STR: '
SQL: NAYLOR,MELANIE SUE,F,7/6/1971,7,6,1971,63- 759935B 00,11 WINDERMERE DR STRATHAVEN HARARE,640463,7, MT PLEASANT
KURENZVI,PERSEVERENCE,F,8/9/1971,8,9,1971,13- 099830V 13,112 REMBRANDT COURT HARARE,640463,7, MT PLEASANT
ROYSTON,KEVIN VICTOR,M,8/9/1956,8,9,1956,48- 007243K 00,11A LYHURST MONAVALE AVONDALE HARARE,640463,7, MT PLEASANT
BUNZURA,FORTUNATE,M,13/09/1979,13,9,1979,86- 011207R 86,12 ALIWAL AVE STRATHAVEN HARARE,640463,7, MT PLEASANT
CHIPATO,TSUNGAI,M,25/03/1955,25,3,1955,63- 690359A 12,12 PENNPLACE STRATHAVEN HARARE,640463,7, MT PLEASANT
JIVAN,NARENDRA RATANJEE,M,26/08/1955,26,8,1955,63- 230113H 00,12 SUAFFOLK ROAD STRATHAVEN HARARE,640463,7, MT PLEASANT
....

Attached is the sample csv file mtp4date.csv

On the issue of timeouts, we plan on raising the default timeout in the php.ini file from 300 seconds so that we can upload as much data as we want. What is the best solution regarding timeouts?

Regards

Valentine Sinemani
Valentine@yoafarica.com

Re: loading csv files into MySQL database
by dragonchild (Archbishop) on May 16, 2008 at 17:36 UTC
    Read http://dev.mysql.com/doc/refman/5.0/en/load-data.html for how to load data using MySQL's own data loading command. That will tell you how to properly quote, escape, and manage all your data issues.
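    As a hedged sketch of what that page describes (the file path and the `voters` table name are made up for illustration), the FIELDS ... ENCLOSED BY and ESCAPED BY clauses are what let MySQL cope with embedded quotes. Here the statement is only assembled and printed, since actually running it needs a live DBI connection:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical file and table names, for illustration only.
my $csv_file = '/path/to/mtp4date.csv';
my $table    = 'voters';

# OPTIONALLY ENCLOSED BY and ESCAPED BY tell MySQL how fields are
# wrapped and escaped, so apostrophes inside names need no special
# handling by the import script.
my $sql = sprintf(
      "LOAD DATA LOCAL INFILE '%s' REPLACE INTO TABLE %s "
    . "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '\\\\' "
    . "LINES TERMINATED BY '\\n'",
    $csv_file, $table);

print $sql, "\n";

# To actually run it (not tested here):
#   my $dbh = DBI->connect("DBI:mysql:database=voters", $user, $pw);
#   $dbh->do($sql);
```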

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
      I second this. LOAD DATA is really the best way to do this if it will work with your data format. It is more flexible than PG's COPY too (see the REPLACE and IGNORE clauses). If it doesn't work, take a look at Text::CSV, which is probably the best CSV library ever.


      Evan Carroll
      I hack for the ladies.
      www.EvanCarroll.com

        Get a mysql shell prompt and issue a command like:

        mysql> LOAD DATA INFILE '/path/to/mtp4date.csv' REPLACE INTO TABLE voters FIELDS TERMINATED BY ','
Re: loading csv files into MySQL database
by roboticus (Canon) on May 16, 2008 at 17:37 UTC
    zimvoter:

    A few notes...

    By the error message, you likely have a column with a single-quote in it. (Such as a last name of O'Brian.)

    You're likely in the wrong place, as this isn't a PHP board.

    You might consider using the quoting facility in your language/library of choice instead of building the insert statements on the fly. (Just a guess.)
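    That quoting suggestion can be sketched in Perl (the thread's language). Placeholders let the DBI driver do all quoting for you; if a statement must be built by hand, standard SQL doubles embedded single quotes. The `voters` table and column names here are illustrative, not from the original post:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-rolled escaping: standard SQL doubles embedded single quotes.
sub sql_quote {
    my ($value) = @_;
    (my $escaped = $value) =~ s/'/''/g;
    return "'$escaped'";
}

print sql_quote("O'Brian"), "\n";   # 'O''Brian'

# Preferred: let the driver quote via placeholders (needs a live
# $dbh, so shown commented out):
#   my $sth = $dbh->prepare(
#       "INSERT INTO voters (surname, firstname) VALUES (?, ?)");
#   $sth->execute("O'Brian", "MELANIE");
```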

    Since the data is also in Excel, you *could* write a quickie VB script to export the data into MySQL...

    Finally, there's an ODBC interface to Excel, so you might be able to write a quickie PHP program to open the spreadsheet and copy the values over to the MySQL DB.

    ...roboticus
Re: loading csv files into MySQL database
by padawan_linuxero (Scribe) on May 16, 2008 at 22:58 UTC
    Well, I did something like what you are talking about: I had a DBF table, turned it into CSV, and then loaded the data into MySQL. The code is something like this:
my $dbhX3 = DBI->connect('dbi:XBase(RaiseError=1):');
my $dbhC3 = DBI->connect('dbi:CSV(RaiseError=1):');
my $select3 = $dbhX3->prepare("SELECT * FROM r510");
print "\nRunning SELECT on r510\n";
$select3->execute();
$dbhC3->do("CREATE TABLE r510.csv AS IMPORT(?)", {}, $select3);
print "\nConnecting to the MySQL database\n";
my $mysql_dbh3 = DBI->connect("DBI:mysql:database=$datafilename;host=localhost",
    "root", "xyz123", { 'RaiseError' => 1 });
#---- Here I empty the table in MySQL ------
print "Emptying the table\n";
$mysql_dbh3->do("TRUNCATE TABLE r510");
#---- Here I load all the data -------
my $sql3 = "LOAD DATA LOCAL INFILE 'c:/proyecto/r510.csv'
    INTO TABLE r510
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'";
my $sth3 = $mysql_dbh3->prepare($sql3);
print "Running the table load\n";
$sth3->execute();
    and the error message is about having extra characters on the line, O'Neil or O"Neil, stuff like that. Check the CSVs.
Re: loading csv files into MySQL database
by hesco (Deacon) on May 17, 2008 at 17:52 UTC
    Valentine Sinemani:

    All of the advice above is good, using a MySQL database engine. LOAD DATA will do mass imports. My issue with it is that (I believe) a single error will cause an entire import to fail. I work with lots of voter files on a regular basis. And I use Perl, not the PHP your friend uses.

    You mention a couple of issues:

    Timeouts: get the web browser out of the way between your import script and the database engine. Try to get direct shell level access to the database. It will save you days of frustration.

    pdf-ocr-csv conversion: Consider whether pdftotext (a utility that is part of the xpdf package on Linux) might save you the trouble of dealing with the optical character recognition work. You can handle the PDFs as simple electronic files, with no need to print them and then rescan them into electronic files. This may not work, depending on the nature of the PDFs you were given. You will then need to parse the resulting data into your CSV format. You might find that awk and sed are at least as useful for that purpose as Perl.

    Working with the imports: My _build_insert() function below escapes the single quotes as they occur. With the caveat that I am working in perl against a postgresql database engine, while you are working in php on a mysql db engine, here is a sample script I use for voter file imports:

#!/usr/bin/perl -w
use strict;
use warnings;
use Data::Dumper;
use DBI;

my $file = '/home/hesco/lists/voters/ca/psr00693.txt';
my $db   = 'voters';
my $host = '127.0.0.1';
my $user = 'vtr_user';
my $pw   = 'secret';
my $dsn  = "dbi:Pg:database=$db;host=$host";
my $dbh  = DBI->connect($dsn, $user, $pw,
    { RaiseError => 0, PrintError => 0, AutoCommit => 0 })
  or die $DBI::errstr;

my $precinct_sql =<<EOQ;
SELECT precinct_ndx FROM precinct WHERE precinct = ? AND precinct_part = ?;
EOQ

my @fields = ();
my $counter = 0;
my $field_names;
my ($insert_reg, $sth, $prec_id);
my ($insert_address, $phone, $insert_hy, $t);
my ($m, $d, $y);
my $restart = 0;

open(VOTERS, '<', $file) or die "Cannot open $file: $!";
while (<VOTERS>) {
  $counter++;
  @fields = split ",", $_;    # for csv
  # @fields = split "\t", $_; # for tab delimited

  if ($counter == 1) {
    # Header row: trim trailing whitespace, escape quotes,
    # and record the column names.
    foreach my $ndx (0 .. $#fields) {
      $fields[$ndx] =~ s/\s*$//;
      $fields[$ndx] =~ s/'/\\'/g;
    }
    # print STDERR "\$field_names is: ", Dumper(\@fields);
    foreach my $i (0 .. $#fields) {
      $field_names->{'ndx'}->{$i} = $fields[$i];
      $field_names->{'field_name'}->{$fields[$i]} = $i;
    }
    # print "\$field_names is: \n", Dumper(\$field_names);
    foreach my $i (0 .. $#fields) {
      print "$i -- $fields[$i]\n";
    }
    next;
  }

  # registrants table
  my @reg_indexes  = (0,1,25,2,3,4,5,5,26,27,28,29,30,31,34,35,36,39);
  my @hy_indexes   = (1,58,59,60,61,62,63,64,65);
  my @addr_indexes = (1,6,7,8,9,10,11,12,13,14,15,16);
  # define the indexes for any other tables
  # into which you want to normalize this data

  $sth = $dbh->prepare($precinct_sql);
  $sth->execute($fields[32], $fields[33]);
  ($prec_id) = $sth->fetchrow_array();
  $prec_id = -1 unless defined $prec_id;

  $insert_reg = "INSERT INTO registrant VALUES(";
  $insert_reg = _build_insert(\@fields, $insert_reg, @reg_indexes);
  $insert_reg =~ s/,$/,$prec_id);/;
  # print "\$insert_reg is: " . $insert_reg . "\n";
  # print "registrant_id is: " . $fields[1] . ":";

  $sth = $dbh->prepare($insert_reg);
  $sth->execute()
    or warn "\$sql is: $insert_reg, error is: " . $dbh->errstr;
  print "\$registrant_id is: " . $fields[1] . "\t" . $phone . "\n"; # , $insert_hy, "\n";
  $dbh->commit;
}

1;

sub _build_insert {
  my ($fields, $insert, @indexes) = @_;
  foreach my $field (@indexes) {
    unless (defined($fields->[$field])) {
      $insert .= "'',";
      next;
    }
    $fields->[$field] =~ s/\s*$//;
    $fields->[$field] =~ s/'/\\'/g;   # escape embedded single quotes
    $insert .= "'$fields->[$field]',";
  }
  return $insert;
}
    This is the basics. It's simple and proceeds by brute force, one record at a time. It's willing to let an occasional error fall by the wayside while it continues to process records. By adding a condition early on to check for a particular precinct or political party, I am able to import only the records I need. I used a variation on this script last week and imported 50,000 records in about an hour.

    Thank you for being engaged in the process, against such incredibly violent opposition.

    Yours for a just and sustainable future,
    -- Hugh Esco
    CampaignFoundations.com

    if( $lal && $lol ) { $life++; }
      LOAD DATA will do mass imports. My issue with it is that (I believe that) single errors will cause an entire import to fail.
      Sometimes... If the error is the data output being inconsistent, then maybe. If the error is because the row violates a primary key, then again check out the REPLACE / IGNORE clauses. PG does not yet have these options, and they save a tremendous amount of time.

      Do not ever split on \t for a tab-separated file. I've done this many times; that solution is stupid and crufty. Set the sep_char in Text::CSV. All of what you're doing now, less the chopping of white space, can be done by Text::CSV. It will work faster (if it can make use of Text::CSV_XS) AND it will handle the edge cases which you aren't. There is even DBD::CSV (uses csv_sep_char), which will use Text::CSV_XS on the underside, and you can simply select from one table and insert it into SQL!! Zomfg, split is suicide here.
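      A minimal sketch of the sep_char suggestion (assuming Text::CSV is installed; the sample data is invented): unlike split, the module respects a quoted field that itself contains the separator.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

# A tab-separated line whose quoted address field contains a tab:
my $line = qq{NAYLOR\t"11 WINDERMERE DR\tSTRATHAVEN"\tHARARE};

my $csv = Text::CSV->new({ sep_char => "\t", binary => 1 })
    or die Text::CSV->error_diag;

$csv->parse($line) or die $csv->error_diag;
my @fields = $csv->fields;

# split "\t" would yield four pieces here; Text::CSV keeps the
# quoted field intact and yields three.
print scalar(@fields), "\n";
```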


      Evan Carroll
      I hack for the ladies.
      www.EvanCarroll.com
Re: loading csv files into MySQL database
by zimvoter (Novice) on May 17, 2008 at 18:35 UTC
    Good people, thank you very much for all your brilliant suggestions. I will pass this on to my young friend, who is still to experiment with the various suggestions that came in. Today we buried one of our activists who was murdered by the ZANU-PF thugs merely for not voting for Mugabe. Please bear with us if we ask some simple questions; the challenge is to upload 5.9 million voters onto this intended voters roll and then have it hosted on the internet. The oppressed, struggling people of Zimbabwe, and many of the extended families that have been displaced into various countries, all long for change and to see an end to this brutal, evil regime. Last night we downloaded the XL2MySQL converter and we are still trying to see how we can resolve the errors. Thank you, good people. God bless. valentine sinemani
      In the unlikely event that you continue to have trouble with this problem I would note that it may be possible to have trusted people outside your country work on it and send back the results in formats that should be relatively easy to load into MySQL. MySQL has backup facilities, including a program called mysqldump, that dump records from a table or database in text formats that should load fairly reliably. Some formats are included below. Be aware that different countries may have different date, time and currency formats which are controlled by MySQL locale variables.

      Of course I would expect that there would be concerns about the trustworthiness of those doing the imports and forwarding them in such a case. Other than the issue of trust, I would be happy to help as, I think, would others.

      My expertise in this kind of approach is somewhat limited and I welcome commentary on this suggestion.


      Some sample mysqldump formats:
      (As recommended by another poster, data has been cleansed to attempt to mask voter identities)


      INSERT INTO `voters` VALUES ('GXXX','MELANIE','XXX','F','1982-05-22',22,5,1982,'XX- XXXXXXX XX','202 XXXXXXXXXX XXXX COURT HARARE','640463','7','MT PLEASANT');
      INSERT INTO `voters` VALUES ('O\'GXXX','MELANIE','XXX','F','1982-05-22',22,5,1982,'XX- XXXXXXX XX','19 XXXXX STRATHAVEN HARARE','640463','7','MT PLEASANT');
      INSERT INTO `voters` VALUES ('WXXXXX','HOPE','','F','1985-07-23',23,7,1985,'XX- XXXXXXX XX','17 XXXXXXX DR STRATHAVEN HARARE','640463','7','MT PLEASANT');
      INSERT INTO `voters` VALUES ('MXXXXXXXXXX','KEVIN','XXXXXXXX','M','1981-11-18',18,11,1981,'XX- XXXXXXX XX','24A XXXXXXXXXX XXXX AVONDALE HARARE','640463','7','MT PLEASANT');
      INSERT INTO `voters` VALUES ('DXXXX','NELSON','','M','1986-03-30',30,3,1986,'XX- XXXXXXX XX','10A XXXXXXXXXXXXX ROAD STRATHAVEN HARARE','640463','7','MT PLEASANT');
      INSERT INTO `voters` VALUES ('LXXXXXXXX','TAARIQ','','M','1987-08-14',14,8,1987,'XX- XXXXXXX XX','24A XXXXXXXXXX XXXX AVONDALE HARARE','640463','7','MT PLEASANT');
      INSERT INTO `voters` VALUES ('RXXXXXX','SUVRO','XXXX','M','1988-07-22',22,7,1988,'XX- XXXXXXX XX','18 XXXXXXXX AVE STRATHAVEN HARARE','640463','7','MT PLEASANT');


      INSERT INTO `voters` VALUES ('GXXX','MELANIE','XXX','F','1982-05-22',22,5,1982,'XX- XXXXXXX XX','202 XXXXXXXXXX XXXX COURT HARARE','640463','7','MT PLEASANT'),('O\'GXXX','MELANIE','XXX','F','1982-05-22',22,5,1982,'XX- XXXXXXX XX','19 XXXXX STRATHAVEN HARARE','640463','7','MT PLEASANT'),('WXXXXX','HOPE','','F','1985-07-23',23,7,1985,'XX- XXXXXXX XX','17 XXXXXXX DR STRATHAVEN HARARE','640463','7','MT PLEASANT'),('MXXXXXXXXXX','KEVIN','XXXXXXXX','M','1981-11-18',18,11,1981,'XX- XXXXXXX XX','24A XXXXXXXXXX XXXX AVONDALE HARARE','640463','7','MT PLEASANT'),('DXXXX','NELSON','','M','1986-03-30',30,3,1986,'XX- XXXXXXX XX','10A XXXXXXXXXXXXX ROAD STRATHAVEN HARARE','640463','7','MT PLEASANT'),('LXXXXXXXX','TAARIQ','','M','1987-08-14',14,8,1987,'XX- XXXXXXX XX','24A XXXXXXXXXX XXXX AVONDALE HARARE','640463','7','MT PLEASANT'),('RXXXXXX','SUVRO','XXXX','M','1988-07-22',22,7,1988,'XX- XXXXXXX XX','18 XXXXXXXX AVE STRATHAVEN HARARE','640463','7','MT PLEASANT');
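      For reference, dumps like the samples above could be produced with a command along these lines (the host, credentials, and database/table names are placeholders, not from the thread); here the command line is only assembled and printed rather than executed:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder connection details. --extended-insert controls whether
# rows are dumped as one multi-row INSERT (like the last sample above)
# or one INSERT per row; --no-create-info skips the CREATE TABLE.
my @cmd = ('mysqldump', '--user=vtr_user', '--password',
           '--no-create-info', '--extended-insert',
           'voters_db', 'voters');

my $cmd_line = join ' ', @cmd;
print $cmd_line, "\n";

# To run it for real (not tested here):
#   system(@cmd) == 0 or die "mysqldump failed: $?";
```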

Re: loading csv files into MySQL database
by aufflick (Deacon) on May 19, 2008 at 06:40 UTC
    Valentine - I hope those sample records are cleansed (ie. not real names/addresses)?
Re: loading csv files into MySQL database
by aquarium (Curate) on May 19, 2008 at 23:15 UTC
    Although sometimes it is helpful to get questions in full context, in this particular case it may be exposing fellow perlmonks to a possible scam. The repeated use of an emotional tone is suspect, and is not required for questions of a technical nature. There are other forums for political support etc.
    fellow perlmonks....please do not correspond privately with Valentine, for your own safety. Valentine: please stick to forum guidelines.
    the hardest line to type correctly is: stty erase ^H

Node Type: perlquestion [id://686968]
Approved by igelkott
Front-paged by Arunbear