Re: Massive Memory Leak

by graff (Chancellor)
on Dec 07, 2009 at 20:02 UTC

in reply to Massive Memory Leak

There are a few odd things about the OP code:
  • Using HTML::Parse appears to be deprecated; the perldoc man page starts with:
    Disclaimer: This module is provided only for backwards compatibility with earlier versions of this library. New code should not use this module, and should really use the HTML::Parser and HTML::TreeBuilder modules directly, instead.

  • This line in your first nested for loop seems to invoke a subroutine called "parse_html", which I would expect to turn up as "undefined":
    $c->{data} = $stripper->format(parse_html($c->{data}))

Apart from that, I wouldn't know whether the memory leak is due to the "unfinished" query statement objects, as suggested by others above, or whether it's due to stranded (non-garbage-collected) objects in the HTML parsing/formatting modules.

One way to tease that apart would be to divide the process into two distinct steps (two processes): step/process 1 parses the HTML data and outputs tab-delimited flat tables; step/process 2 inserts those tables into your database by any of several easy methods. If step 1 runs to completion on its own without the memory growth, you can conclude that it was the database interaction that caused the leak.
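
To make step 1 concrete, here is a minimal sketch of writing records out as tab-delimited lines. The tsv_line name and the sample @records are made up for illustration (I don't know the shape of your data); the escaping assumes MySQL's default load conventions (tab-separated fields, backslash escapes, \N for NULL).

```perl
use strict;
use warnings;

# Turn one record (an array ref of field values) into a tab-delimited line.
# Backslash, tab, newline and carriage return are backslash-escaped, and
# undef becomes \N, which MySQL's default load settings read back as NULL.
sub tsv_line {
    my ($fields) = @_;
    my @out = map {
        my $v = $_;                 # copy, so the caller's data is untouched
        if ( defined $v ) {
            $v =~ s/\\/\\\\/g;      # escape backslashes first
            $v =~ s/\t/\\t/g;
            $v =~ s/\n/\\n/g;
            $v =~ s/\r/\\r/g;
        }
        else {
            $v = '\\N';
        }
        $v;
    } @$fields;
    return join( "\t", @out ) . "\n";
}

# Hypothetical records standing in for whatever the HTML step produces.
my @records = ( [ 1, "plain text" ], [ 2, "has\ttab" ], [ 3, undef ] );
print tsv_line($_) for @records;
```

Redirect the output to a file and the parsing process never touches DBI at all, which is the point of the experiment.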

In any case, a better way to load large quantities of data into mysql tables is via the "mysqlimport" tool that comes with mysql -- it's incredibly fast compared to using Perl/DBI for inserts, and it's the best/easiest way to load a table from a tab-delimited flat-text file. (Rather, Perl/DBI is incredibly slow relative to mysqlimport.)

Another idea: since you are looping over two main directories, you might try just doing one directory per run (giving the desired path name on the command line). If you really want to do both in one run (after solving the memory leak issue), a better loop method would be:

for my $path ( 'modified', 'deleted' ) { ... }
instead of that clunky while-loop.
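
A rough sketch that combines both ideas: take the directories from the command line when any are given, and fall back to both otherwise. The paths_to_process name is made up for illustration, and the print is only a stand-in for your real per-directory work.

```perl
use strict;
use warnings;

# Directories to handle: whatever was passed in, or both of the
# directory names from the loop above when nothing was passed.
sub paths_to_process {
    my @args = @_;
    return @args ? @args : ( 'modified', 'deleted' );
}

for my $path ( paths_to_process(@ARGV) ) {
    # placeholder for the real per-directory processing
    print "would process: $path\n";
}
```

Running the script with a single directory name as its argument then gives you the one-directory-per-run behavior without touching the code.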
