
Hi there,

I am new to Perl, so I hope this isn't something really simple! I've been banging my head against a brick wall for hours and still haven't figured it out. I have written a script that parses thousands of HTML files in a directory. Each file contains a table, and I import the table entries into a MySQL database. Functionally it works fine. However, there is some sort of memory problem, and a big one! According to top, roughly every 1000 MySQL queries issued takes up a further percentage of my memory (I have 2GB RAM). I have used the debugger and tried inserting Dump statements, but haven't tracked it down yet. Can someone help? The code is below.

Thanks in advance :-)

Martin

#!/usr/bin/perl

use HTML::TableContentParser;
use HTML::Parse;
use HTML::FormatText;
use DBI;
use strict;
use warnings;

# Connect to database and create parser object
my $db = DBI->connect ("DBI:mysql:newsbms","newsbms", "newsbms",
                    { RaiseError => 1, PrintError => 0});
# Loop twice
my $loopround = 1;
while ($loopround <= 2)
{
    # Choose the table name
    my $tablename = "modified";
    if ($loopround == 2)
    {
        $tablename = "deleted";
    }

    print "\nProcessing the '$tablename' entries...\n\n";

    # Create counters to show the number of files and queries processed
    my $counter = 0;
    my $query_counter = 0;    

    # Open the directory
    my $dirname = "/home/martin/monitoring/newsBMS/$tablename/";
    opendir(DIR, $dirname) || die ("Could not open $dirname");

    # Loop through all files in the directory
    while (defined(my $filename = readdir(DIR)))
    {

        # Skip special "files": '.' and '..'
        next if $filename =~ /^\.\.?$/; 
        $counter++;

        # Open and read the html file into a single string
        open(HTMLFILE, $dirname.$filename) || die ("Could not open $filename");
        binmode HTMLFILE;
        my $html = join("", <HTMLFILE>);
        close(HTMLFILE);

        # Parse the html tables
        my $tcp = HTML::TableContentParser->new;
        my $tables = $tcp->parse($html);

        # Remove the html tags from the cells
        for my $t (@$tables) {
            for my $r (@{ $t->{rows} }) {
                for my $c (@{ $r->{cells} })
                {
                    my $stripper = HTML::FormatText->new;
                    $c->{data} = $stripper->format(parse_html($c->{data}));
                    $c->{data} =~ s/'/-/g;
                    $c->{data} =~ s/[:\\:]/-/g;
                }
            }
        }
    
        # Issue the MySQL queries
        for my $t (@$tables)
        {
            for my $r (@{ $t->{rows} })
            {
                my $query = "INSERT INTO";
                if ($loopround == 1)
                {
                    $query = $query . " modified (id, name, title, duration,";
                    $query = $query . "library, modified, user, rev) VALUES (";
                }
                if ($loopround == 2)
                {
                    $query = $query . " deleted (name, title, duration,";
                    $query = $query . "deleted, library) VALUES (";
                }

                for my $c (@{ $r->{cells} })
                {
                    chop($c->{data}); # remove the \n
                    $query = $query . "'" . $c->{data} . "',";
                }
                chop($query); # Remove the last comma added
                $query = $query . ") ON DUPLICATE KEY UPDATE duplicates=duplicates+1";
                #print "Query = $query \n\n";
                my $execute = $db->prepare($query);
                $execute->execute();
                $query_counter++;
                if ($query_counter % 1000 == 0) {
                    print "Issued $query_counter MySQL queries.\n";
                }
            }
        }
    }
    # Close the directory
    closedir(DIR);

    print "\nDone the '$tablename' table.\nProcessed $counter files and issued $query_counter MySQL queries.\n";
    $loopround++;
}

# Disconnect from the database
$db->disconnect();

print "\nProgram Finished.\n";

In reply to Massive Memory Leak by martin_ldn

	PerlMonks