Hi there,
I am new to Perl, so I hope this isn't something really simple! It feels like I've been banging my head against a brick wall for hours and I still haven't figured it out.
I have written a script that parses thousands of HTML files in a directory. Each file contains a table, and I import the table entries into a MySQL database. Functionally it works fine. However, there is a memory problem, and a big one: according to top, the script takes up a further percentage of my memory for roughly every 1000 MySQL queries it issues (I have 2 GB of RAM). I have used the debugger and tried inserting Dump statements, but I haven't found the cause yet. Can someone help? The code is below.
Thanks in advance :-)
Martin
#!/usr/bin/perl
use HTML::TableContentParser;
use HTML::Parse;
use HTML::FormatText;
use DBI;
use strict;
use warnings;

# Connect to database and create parser object
my $db = DBI->connect ("DBI:mysql:newsbms","newsbms", "newsbms",
    { RaiseError => 1, PrintError => 0});

# Loop twice
my $loopround = 1;
while ($loopround <= 2)
{
    # Choose the table name
    my $tablename = "modified";
    if ($loopround == 2)
    {
        $tablename = "deleted";
    }
    print "\nProcessing the '$tablename' entries...\n\n";

    # Create counters to show the number of files and queries processed
    my $counter = 0;
    my $query_counter = 0;

    # Open the directory
    my $dirname = "/home/martin/monitoring/newsBMS/$tablename/";
    opendir(DIR, $dirname) || die ("Could not open $dirname");

    # Loop through all files in the directory
    while (defined(my $filename = readdir(DIR)))
    {
        # Skip special "files": '.' and '..'
        next if $filename =~ /^\.\.?$/;
        $counter++;

        # Open and read the html file into a single string
        open(HTMLFILE, $dirname.$filename) || die ("Could not open $filename");
        binmode HTMLFILE;
        my $html = join("", <HTMLFILE>);
        close(HTMLFILE);

        # Parse the html tables
        my $tcp = HTML::TableContentParser->new;
        my $tables = $tcp->parse($html);

        # Remove the html tags from the cells
        for my $t (@$tables) {
            for my $r (@{ $t->{rows} }) {
                for my $c (@{ $r->{cells} })
                {
                    my $stripper = HTML::FormatText->new;
                    $c->{data} = $stripper->format(parse_html($c->{data}));
                    $c->{data} =~ s/'/-/g;
                    $c->{data} =~ s/[:\\:]/-/g;
                }
            }
        }

        # Issue the MySQL queries
        for my $t (@$tables)
        {
            for my $r (@{ $t->{rows} })
            {
                my $query = "INSERT INTO";
                if ($loopround == 1)
                {
                    $query = $query . " modified (id, name, title, duration,";
                    $query = $query . "library, modified, user, rev) VALUES (";
                }
                if ($loopround == 2)
                {
                    $query = $query . " deleted (name, title, duration,";
                    $query = $query . "deleted, library) VALUES (";
                }
                for my $c (@{ $r->{cells} })
                {
                    chop($c->{data}); # remove the \n
                    $query = $query . "'" . $c->{data} . "',";
                }
                chop($query); # Remove the last comma added
                $query = $query . ") ON DUPLICATE KEY UPDATE duplicates=duplicates+1";
                #print "Query = $query \n\n";
                my $execute = $db->prepare($query);
                $execute->execute();
                $query_counter++;
                if ($query_counter % 1000 == 0) {
                    print "Issued $query_counter MySQL queries.\n";
                }
            }
        }
    }

    # Close the directory
    closedir(DIR);
    print "\nDone the '$tablename' table.\nProcessed $counter files and issued $query_counter MySQL queries.\n";
    $loopround++;
}

# Disconnect from the database
$db->disconnect();
print "\nProgram Finished.\n";