PerlMonks  

Process Text File and Write to Database

by spickles (Scribe)
on Nov 20, 2009 at 18:28 UTC

spickles has asked for the wisdom of the Perl Monks concerning the following question:

Monks -

I've written a script to read in a list of nursing homes that was copied and pasted off the web. I pull this file in and write an output file that removes empty lines. I then pull the output file back in, and now I want to step through it and write certain elements to a database. There are three lines that contain information that I want to skip, as shown by the matches. I've been doing a lot of troubleshooting and can't seem to figure out where I've gone wrong. When I run the code, I get no errors, but no data in the database either.

    #!c:/xampp/perl/bin/perl
    use strict;
    use warnings;
    use DBI;
    use dbConnect_nursing;

    sub printVariables(@_) {
        foreach my $variable (@_) {
            print $variable . "\n";
        }
        print "###############################################\n";
    }

    sub processLine {
        my @temp;
        chomp($_[0]);
        unshift(@temp, $_[0]);
        my $var = shift(@temp);
        return $var;
    }

    ################################### connect to the database ############################################
    # data source name
    my $dsn = "DBI:$dbConnect::db_platform:$dbConnect::db_database:$dbConnect::db_host:$dbConnect::db_port";
    # perl DBI connect
    my $connect = DBI->connect($dsn, $dbConnect::db_user, $dbConnect::db_pw, {'RaiseError' => 1});
    ################################### connect to the database ############################################

    my @file_array;
    my $in_file = "c:\\nursing_homes.txt";
    my $out_file = "c:\\nursing_homes_out.txt";

    if (-e $out_file) {
        unlink $out_file;
    }

    open INPUT,'<',$in_file or die "Can't open file " . $in_file . "\n$!\n"; #Open for read
    open OUTPUT,'>',$out_file or die "Can't open file " . $out_file . "\n$!\n"; #Open for write

    while (<INPUT>) {
        chomp ($_);
        next if $_ =~ /^\s*$/; # skip over blank lines
        print OUTPUT $_ . "\n";
    }
    close INPUT;
    close OUTPUT;

    open INPUT,'<',$out_file or die "Can't open file " . $out_file . "\n$!\n"; #Open for read
    while (<INPUT>) {
        my $name = $connect->quote($_);
        next;
        my $address1 = $connect->quote($_);
        next;
        my $address2 = $connect->quote($_);
        next;
        my $phone = $connect->quote($_);
        next;
        next if (($_ =~ /^.*Council.*$/) || ($_ =~ /^Continuing.*$/) || ($_ =~ /^Mapping.*$/));
        next if (($_ =~ /^.*Council.*$/) || ($_ =~ /^Continuing.*$/) || ($_ =~ /^Mapping.*$/));
        next if (($_ =~ /^.*Council.*$/) || ($_ =~ /^Continuing.*$/) || ($_ =~ /^Mapping.*$/));
        my $overall = $connect->quote($_);
        next;
        my $inspections = $connect->quote($_);
        next;
        my $staffing = $connect->quote($_);
        next;
        my $quality = $connect->quote($_);
        next;
        my $programs = $connect->quote($_);
        next;
        my $beds = $connect->quote($_);
        next;
        my $ownership = $connect->quote($_);
        next;
        my $query_string = "INSERT INTO nursing_homes (name, address1, address2, phone, overall, inspections, staffing, quality, programs, beds, ownership) VALUES ($name, $address1, $address2, $phone, $overall, $inspections, $staffing, $quality, $programs, $beds, $ownership)";
        #printVariables($name, $address1, $address2, $phone, $overall, $inspections, $staffing, $quality, $programs, $beds, $ownership, $query_string);
        my $query_handle = $connect->prepare("INSERT INTO nursing_homes (name, address1, address2, phone, overall, inspections, staffing, quality, programs, beds, ownership) VALUES ($name, $address1, $address2, $phone, $overall, $inspections, $staffing, $quality, $programs, $beds, $ownership)");
        $query_handle->execute();
    }
    close INPUT;
    $connect->disconnect();
    __END__

Sample data is below:

    AARON MANOR REHABILITATION & NURSING CENTER
    100 ST CAMILLUS WAY
    FAIRPORT, NY 14450
    (585) 377-4000
    Resident Council
    Mapping & Directions
    4 out of 5 stars
    4 out of 5 stars
    3 out of 5 stars
    4 out of 5 stars
    Medicare and Medicaid
    140
    For profit - Corporation
    ABSOLUT CTR FOR NURSING & REHAB ALLEGANY LLC
    2178 NORTH FIFTH STREET
    ALLEGANY, NY 14706
    (716) 373-2238
    Resident & Family Councils
    Mapping & Directions
    3 out of 5 stars
    4 out of 5 stars
    1 out of 5 stars
    4 out of 5 stars
    Medicare and Medicaid
    37
    For profit - Corporation
    ABSOLUT CTR FOR NURSING & REHAB AURORA PARK LLC
    292 MAIN STREET
    EAST AURORA, NY 14052
    (716) 652-1560
    Resident Council
    Mapping & Directions
    1 out of 5 stars
    1 out of 5 stars
    2 out of 5 stars
    4 out of 5 stars
    Medicare and Medicaid
    320
    For profit - Corporation
    ABSOLUT CTR FOR NURSING & REHAB DUNKIRK LLC
    447 449 LAKE SHORE DRIVE WEST
    DUNKIRK, NY 14048
    (716) 366-6710
    Resident Council
    Mapping & Directions
    1 out of 5 stars
    2 out of 5 stars
    1 out of 5 stars
    2 out of 5 stars
    Medicare and Medicaid
    40
    For profit - Corporation

Replies are listed 'Best First'.
Re: Process Text File and Write to Database
by johngg (Canon) on Nov 20, 2009 at 21:27 UTC

    Others have pointed out your misunderstanding of next. Here are some other pointers.

    • You could use join and the string multiplier (see Multiplicative Operators in perlop) to save a lot of typing in your printVariables subroutine.

      $ perl -e '
      > sub printVariables
      > {
      >     print join qq{\n}, @_, q{#} x 10, q{};
      > }
      >
      > $v1 = 123;
      > $v2 = 456;
      > printVariables( $v1, $v2 );
      >
      > @arr = qw{ pete john mike };
      > printVariables( @arr );'
      123
      456
      ##########
      pete
      john
      mike
      ##########
      $

    • You don't seem to call it but your processLine subroutine goes a very long way around the houses to achieve the same result as a

      chomp $line;

      in the body of your code would have done.

    • You don't need to unlink a pre-existing file if you are about to open it for writing.

    • You open "c:\\nursing_homes.txt" for reading and process it to remove blank lines writing the changes to "c:\\nursing_homes_out.txt" which you then re-open and read in your database insertion loop. Unless you need that processed file elsewhere, why bother? Just work on the original file in your main database insertion loop and include the next if $_ =~ /^\s*$/; line there.

    • Why do you initialise $query_string but then not use it in the my $query_handle = $connect->prepare( ... ); line, re-typing exactly the same SQL instead? Seems a bit wasteful of effort to me.

    • Rather than using concatenation

      ... die "Can't open file " . $out_file . "\n$!\n";

      just interpolate into the string as you've already done with the $! variable

      ... die "Can't open file $out_file\n$!\n";

    I hope these points are helpful.

    Cheers,

    JohnGG

    Update: Corrected cut'n'paste error where I'd copied an earlier piece of test code with a shorter subroutine name in the call, pvar rather than printVariables

Re: Process Text File and Write to Database
by toolic (Bishop) on Nov 20, 2009 at 18:37 UTC
    It looks like you never call $query_handle->execute(); because of the unconditional next; in your while loop:
    while (<INPUT>) {
        my $name = $connect->quote($_);
        next;
    That first $connect->quote() call is the only statement that ever runs in your loop. That doesn't seem right.
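
To make the effect of next visible without a database, here is a minimal sketch. The sub names statements_run and keep_relevant are made up for this illustration and are not part of the original script. The first sub mimics the unconditional next from the question; the second shows the conditional form, which is probably what was intended for the Council/Continuing/Mapping lines.

```perl
use strict;
use warnings;

# Hypothetical demo: records which statements actually ran in the loop.
sub statements_run {
    my @ran;
    for my $line (@_) {
        push @ran, "first:$line";
        next;                        # jumps straight to the next iteration
        push @ran, "second:$line";   # never reached
    }
    return @ran;
}

# Hypothetical demo: a conditional next only skips the matching lines.
sub keep_relevant {
    my @kept;
    for my $line (@_) {
        next if $line =~ /Council|^Continuing|^Mapping/;
        push @kept, $line;
    }
    return @kept;
}

print join( ", ", statements_run(qw(a b)) ), "\n";
print join( ", ", keep_relevant( 'Resident Council', '4 out of 5 stars' ) ), "\n";
```

The first print shows only the "first:" entries, which is the same reason the original loop never got past the $name assignment.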
Re: Process Text File and Write to Database
by keszler (Priest) on Nov 20, 2009 at 18:39 UTC
    Your second while loop is effectively:
    open INPUT,'<',$out_file or die "Can't open file " . $out_file . "\n$!\n"; #Open for read
    while (<INPUT>) {
        my $name = $connect->quote($_);
        next;
    }

    See next

      I call the execute at the bottom. It looks like I don't understand the use of 'next' in this context, so I'm open to other suggestions as to how to read that file line by line and add the records to the database. When I reach a line that contains the word 'profit' or 'Government', that is the end of a record. Within a record, I need to skip lines that match 'Councils', 'Continuing', and 'Mapping' as these are irrelevant lines of data. The problem is that not all records contain all three of those lines, so I can't just arbitrarily increase a counter. I was having a good deal of success reading the file to an array and then using a 'for' loop, but I stumble across the cases where I need to toss out those irrelevant lines.

        It looks like I don't understand the use of 'next' in this context,
        Yes. 'next' means "exit the block I'm in now, and skip anything (other than a continue block) afterwards." Every time you called 'next', your program skipped directly to the next iteration of the loop without going any further - in your original program, that's why only the
        my $name = $connect->quote($_); next;
        parts of the loop were being executed.
        This is quick, ugly, and prone to fail at the slightest change in the data, but maybe it'll get you started:
        open INPUT,'<',$out_file or die "Can't open file " . $out_file . "\n$!\n"; #Open for read
        while (<INPUT>) {
            my $name = $connect->quote($_);
            my $address1 = $connect->quote(<INPUT>);
            my $address2 = $connect->quote(<INPUT>);
            my $phone = $connect->quote(<INPUT>);
            # Read one line at a time (scalar context) until we reach a line
            # that is not a Council/Continuing/Mapping line. A for (<INPUT>)
            # loop would slurp the rest of the file and localise $_.
            my $line;
            while (defined($line = <INPUT>)) {
                last unless $line =~ /(Council|^Continuing|^Mapping)/;
            }
            my $overall = $connect->quote($line);
            my $inspections = $connect->quote(<INPUT>);
            my $staffing = $connect->quote(<INPUT>);
            my $quality = $connect->quote(<INPUT>);
            my $programs = $connect->quote(<INPUT>);
            my $beds = $connect->quote(<INPUT>);
            my $ownership = $connect->quote(<INPUT>);
            my $query_string = "INSERT INTO nursing_homes (name, address1, address2, phone, overall, inspections, staffing, quality, programs, beds, ownership) VALUES ($name, $address1, $address2, $phone, $overall, $inspections, $staffing, $quality, $programs, $beds, $ownership)";
            #printVariables($name, $address1, $address2, $phone, $overall, $inspections, $staffing, $quality, $programs, $beds, $ownership, $query_string);
            my $query_handle = $connect->prepare("INSERT INTO nursing_homes (name, address1, address2, phone, overall, inspections, staffing, quality, programs, beds, ownership) VALUES ($name, $address1, $address2, $phone, $overall, $inspections, $staffing, $quality, $programs, $beds, $ownership)");
            $query_handle->execute();
        }

        You define $query_string but don't use it, and you're not checking return values for the prepare and execute calls. You must.
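
One way the return-value checks could look is sketched below. The helper name run_checked is hypothetical, and it assumes $dbh is an already connected DBI database handle. It relies only on prepare and execute returning a false value on failure and on errstr holding the driver's message.

```perl
use strict;
use warnings;

# Hypothetical helper: prepare and execute one statement on an already
# connected DBI handle, dying with the driver's message on failure.
sub run_checked {
    my ( $dbh, $sql, @bind ) = @_;

    my $sth = $dbh->prepare($sql)
        or die "prepare failed: " . $dbh->errstr . "\n";

    # execute returns a true value (possibly "0E0") on success
    $sth->execute(@bind)
        or die "execute failed: " . $sth->errstr . "\n";

    return $sth;
}

# Possible call, assuming $connect and $query_string from the code above:
# run_checked( $connect, $query_string );
```

Since the original connect call passes RaiseError => 1, DBI should already die on a failed prepare or execute, so the explicit checks are mostly a safety net in case that attribute is ever dropped.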

Re: Process Text File and Write to Database
by graff (Chancellor) on Nov 21, 2009 at 18:05 UTC
    You may want to try grabbing the full HTML data for the page, and using a parser module on that (HTML::Parser or HTML::TokeParser), in case the markup in the web page provides some structural information that you can use (like record boundaries and field labels).

    On the other hand, if the blank lines that you are throwing away happen to represent boundaries between records, you should be using them as record separators, rather than throwing them away. Look up the section in the perlvar documentation about $INPUT_RECORD_SEPARATOR ($/) -- if blank lines are used only at record boundaries, then setting  $/=""; (empty string) causes perl to read a complete, multi-line record on each iteration of while(<>){...}.
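
To see paragraph mode in isolation, here is a small sketch that reads from an in-memory string instead of a file. The sub name read_paragraphs and the shortened sample text are made up for the illustration; whether the real data has blank lines only at record boundaries is still an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into blank-line-separated records and
# return a list of array refs, one array of lines per record.
sub read_paragraphs {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Can't open in-memory handle: $!";
    local $/ = "";                 # paragraph mode, restored when the sub returns
    my @records;
    while ( my $para = <$fh> ) {
        chomp $para;               # in paragraph mode this strips all trailing newlines
        push @records, [ split /\n/, $para ];
    }
    close $fh;
    return @records;
}

my $sample = "AARON MANOR\n100 ST CAMILLUS WAY\n\n\nABSOLUT CTR\n2178 NORTH FIFTH STREET\n";
my @records = read_paragraphs($sample);
printf "%d records; first has %d lines\n", scalar @records, scalar @{ $records[0] };
```

Using local keeps the change to $/ confined to the sub, so other reads in the program still work line by line.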

    Apart from that, you should be using placeholders in your insert statement -- prepare it once (before the loop) and execute it repeatedly (in the loop); this makes the "quote()"-ing of values unnecessary.

    In case it's true that blank lines in the data represent record boundaries, here's an example of how it could work:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( " ...whatever... " );
    my @insert_fields = qw{ name address1 address2 phone overall inspections
                            staffing quality programs beds ownership };
    my $insert_sql = 'insert into nursing_homes ('.
        join( ', ', @insert_fields ). ') values ('.
        join( ', ', ('?') x @insert_fields ). ')';
    my $insert_sth = $dbh->prepare( $insert_sql );

    $/ = "";  # set input_record_separator to empty string (paragraph mode)

    # just put the input file name on the command line when running the script
    # (or pipe the data to the script's STDIN)

    while (<>)   # each iteration reads up to a blank line
    {
        my @lines = grep !/ Councils?$|^Mapping|^Continuing/, split( /[\r\n]+/ );
        if ( @lines != @insert_fields ) {  # skip records that won't work
            print "Record # $. has wrong number of fields:\n$_\n";
            next;   # if you redirect STDOUT to a file, you can deal with these later
        }
        $insert_sth->execute( @lines );
    }
    $insert_sth->finish;
    $dbh->disconnect;
    (not tested, but it compiles, and the sql statement comes out right)

    If the copy/pasted text contains "extra" blank lines within records, the simple paragraph-mode approach above won't work. Try to find some other reliable indicator of record boundaries and use that instead, then remove the blank lines by just altering that grep statement a bit:

    @lines = grep !/^\s*$| Councils?$|^Mapping|^Continuing/, split( /[\r\n]+/ );
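
As one possible "other reliable indicator", the question mentions that a line containing 'profit' or 'Government' ends each record. A rough sketch of grouping on that marker follows. The sub name group_records is hypothetical, the lines are assumed to be chomped already, and it would break if a facility name or address ever contained either word.

```perl
use strict;
use warnings;

# Hypothetical helper: group chomped lines into records, closing a record
# at the ownership line (assumed to contain "profit" or "Government").
sub group_records {
    my @records;
    my @current;
    for my $line (@_) {
        # drop blank lines and the irrelevant Council/Mapping/Continuing lines
        next if $line =~ /^\s*$|Councils?$|^Mapping|^Continuing/;
        push @current, $line;
        if ( $line =~ /profit|Government/ ) {   # assumed end-of-record marker
            push @records, [@current];
            @current = ();
        }
    }
    return @records;
}

my @lines = (
    'AARON MANOR REHABILITATION & NURSING CENTER', '100 ST CAMILLUS WAY',
    'FAIRPORT, NY 14450', '(585) 377-4000', 'Resident Council', '',
    'Mapping & Directions', '4 out of 5 stars', '4 out of 5 stars',
    '3 out of 5 stars', '4 out of 5 stars', 'Medicare and Medicaid',
    '140', 'For profit - Corporation',
);
my @records = group_records(@lines);
printf "%d record(s), %d fields in the first\n", scalar @records, scalar @{ $records[0] };
```

Each array ref could then be checked for the expected eleven fields and passed to a prepared statement's execute as a list of bind values, as in the paragraph-mode example above.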
