PerlMonks  

Re: Process Text File and Write to Database

by graff (Chancellor)
on Nov 21, 2009 at 18:05 UTC ( [id://808608]=note )


in reply to Process Text File and Write to Database

You may want to try grabbing the full HTML data for the page and using a parser module on it (HTML::Parser or HTML::TokeParser), in case the markup in the web page provides structural information that you can use (such as record boundaries and field labels).
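To give a rough idea of the parser approach, here is a minimal sketch using HTML::TokeParser. The markup is entirely made up (I don't know what the real page looks like), and the sub name extract_rows is just for illustration; it assumes each record is a table row and each field a table cell.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup; the real page's tags and layout are unknown.
my $html = <<'HTML';
<table>
<tr><td>Alpha Home</td><td>12 Oak St</td></tr>
<tr><td>Beta Home</td><td>9 Elm Rd</td></tr>
</table>
HTML

# Collect the text of every <td>, grouped by the <tr> that contains it.
sub extract_rows {
    my ($doc) = @_;
    my $parser = HTML::TokeParser->new( \$doc ) or die "Cannot parse HTML";
    my @rows;
    while ( my $tag = $parser->get_tag( 'tr', 'td' ) ) {
        if ( $tag->[0] eq 'tr' ) {
            push @rows, [];                 # a new row starts a new record
        }
        else {
            push @rows, [] unless @rows;    # guard against a cell before any row
            push @{ $rows[-1] }, $parser->get_trimmed_text('/td');
        }
    }
    return @rows;
}

my @rows = extract_rows($html);
print join( ' | ', @$_ ), "\n" for @rows;
```

If the real markup labels its fields (class attributes, header cells and so on), the attribute hash in $tag->[1] would be the place to look for them.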

On the other hand, if the blank lines that you are throwing away happen to represent boundaries between records, you should be using them as record separators rather than throwing them away. Look up the section in the perlvar documentation about $INPUT_RECORD_SEPARATOR ($/) -- if blank lines are used only at record boundaries, then setting $/ = ""; (empty string) puts perl in "paragraph mode": each iteration of while(<>){...} reads a complete, multi-line record, and a run of consecutive blank lines counts as a single separator.
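Here is a small, self-contained sketch of paragraph mode on its own. The sample data and the sub name read_records are invented for the demonstration, and it reads from an in-memory string rather than a real file:

```perl
use strict;
use warnings;

# Hypothetical sample data: two records separated by blank lines.
my $data = "Alpha Home\n12 Oak St\n555-0100\n\n\nBeta Home\n9 Elm Rd\n555-0199\n";

sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = "";    # paragraph mode: one or more blank lines end a record
    my @records;
    while ( my $record = <$fh> ) {
        # Split the paragraph into lines; trailing empty fields are dropped.
        push @records, [ split /[\r\n]+/, $record ];
    }
    close $fh;
    return @records;
}

my @records = read_records($data);
printf "%d records, first has %d lines\n",
    scalar @records, scalar @{ $records[0] };
```

Using local on $/ keeps the change confined to the sub, which matters once the script reads anything else line by line.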

Apart from that, you should be using placeholders in your insert statement -- prepare it once (before the loop) and execute it repeatedly (in the loop); this makes the "quote()"-ing of values unnecessary.

In case it's true that blank lines in the data represent record boundaries, here's an example of how it could work:

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( " ...whatever... " );

my @insert_fields = qw{ name address1 address2 phone overall inspections
                        staffing quality programs beds ownership };
my $insert_sql = 'insert into nursing homes ('.
                 join( ', ', @insert_fields ).
                 ') values ('.
                 join( ', ', ('?') x @insert_fields ). ')';
my $insert_sth = $dbh->prepare( $insert_sql );

$/ = "";  # set input_record_separator to empty string (paragraph mode)

# just put the input file name on the command line when running the script
# (or pipe the data to the script's STDIN)

while (<>)   # each iteration reads up to a blank line
{
    my @lines = grep !/ Councils?$|^Mapping|^Continuing/, split( /[\r\n]+/ );
    if ( @lines != @insert_fields ) {   # skip records that won't work
        print "Record # $. has wrong number of fields:\n$_\n";
        next;  # if you redirect STDOUT to a file, you can deal with these later
    }
    $insert_sth->execute( @lines );
}
$insert_sth->finish;
$dbh->disconnect;
(not tested, but it compiles, and the sql statement comes out right)
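If you want to check the generated SQL without touching a database, the string-building part can be run by itself. A minimal sketch, where build_insert_sql, the table name demo_table and the three-field list are stand-ins chosen for the demonstration:

```perl
use strict;
use warnings;

# Build an insert statement with one "?" placeholder per field.
sub build_insert_sql {
    my ( $table, @fields ) = @_;
    return "insert into $table ("
         . join( ', ', @fields )
         . ') values ('
         . join( ', ', ('?') x @fields )    # list repetition: one '?' per field
         . ')';
}

my $sql = build_insert_sql( 'demo_table', qw{ name phone beds } );
print "$sql\n";    # insert into demo_table (name, phone, beds) values (?, ?, ?)
```

Because the placeholder list is derived from the same array as the column list, the two cannot get out of step when fields are added or removed.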

If the copy/pasted text contains "extra" blank lines within records, the simple paragraph-mode approach above won't work. Try to find some other reliable indicator of record boundaries and use that instead, then remove the blank lines by just altering that grep statement a bit:

@lines = grep !/^\s*$| Councils?$|^Mapping|^Continuing/, split( /[\r\n]+/ );
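As a sketch of how that could fit together, suppose (purely as an assumption for illustration) that every record ends with a line containing "====". Then $/ can be set to that marker and the altered grep takes care of the stray blank lines. The sample data and the sub name read_clean_records are invented:

```perl
use strict;
use warnings;

# Hypothetical input: records end with a "====" line (an assumption for
# illustration only) and contain stray blank or whitespace-only lines.
my $data = "Alpha Home\n\n12 Oak St\n \nFamily Councils\n555-0100\n====\n"
         . "Beta Home\nMapping info\n\n9 Elm Rd\n555-0199\n====\n";

sub read_clean_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = "====\n";    # assumed record terminator, not from the real data
    my @records;
    while ( my $record = <$fh> ) {
        chomp $record;      # chomp strips the current value of $/
        my @lines = grep !/^\s*$| Councils?$|^Mapping|^Continuing/,
                    split( /[\r\n]+/, $record );
        push @records, \@lines;
    }
    close $fh;
    return @records;
}

my @clean = read_clean_records($data);
print join( ' | ', @$_ ), "\n" for @clean;
```

Whatever marker the real data offers would take the place of "====" here; the rest of the loop stays the same.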
