Another "out of memory!" problem

by slugger415 (Monk)
on Jun 22, 2010 at 23:05 UTC ( [id://845980] )

slugger415 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, this is my first post. I'm getting an "out of memory" message. I've looked at some of the previous posts on this subject (2007, 2005 and 2001) but am not sure if I can resolve my problem.

My script "crawls" a large website and builds a list of all pages it can find, via a-href's, using HTML::Treebuilder and a few other modules. The key part is that it saves each URL to a %ListOfURLs hash, which it checks against so it doesn't hit the same page twice.

I'm finding when the hash gets to more than 27,000 entries, I get the out-of-memory error. Am I just hitting some kind of memory/hash size limit?

There are lots of other hashes and arrays created along the way, such as arrays of all hrefs on each page, e.g.:

my @aList = $tree->find_by_tag_name('a');

I've tried undef'ing those when they're no longer needed but it doesn't seem to make any difference.

I'm happy to provide some code here but it's a pretty busy script. Any suggestions about how I might otherwise build a list to be checked against that would use less memory would be appreciated.

BTW on one post I saw a suggestion to use 'tie', but the documentation for tie speaks thusly:

"This function binds a variable to a package class that will provide the implementation for the variable. VARIABLE is the name of the variable to be enchanted."

To me that might as well say "Tie a shoelace around a shoebox and wave a magic wand over it." :-) I don't understand a word of it.

Thanks for any help you can provide.

Scott

Replies are listed 'Best First'.
Re: Another "out of memory!" problem
by ikegami (Patriarch) on Jun 22, 2010 at 23:18 UTC

    I'm finding when the hash gets to more than 27,000 entries, I get the out-of-memory error.

    Hum, that's really not that big, especially if it just holds URLs. You should see the following surpass that number in no time.

    my $i;
    my %urls;
    for (;;) {
        ++$urls{ "http://www.perlmonks.org/?node_id=" . ++$i };
        print("$i\n") if $i % 1000 == 0;
    }

    Tie a shoelace around a shoebox and wave a magic wand over it.

    Actually, that's quite an apt description!

    tie is a form of magic (yes, that's really what it's called) that ties methods to a variable (the shoebox) so that those methods are called in response to the different actions taken on that variable. For example, $tied = $x; calls the tied variable's STORE method.

    It's easier to understand if you look at it from the other end. tie allows a module to use a Perl variable as its interface. Instead of $obj->store($x), it lets you write $tied = $x;.
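
    As a sketch of how that can help with a memory problem like this one (assuming DB_File is available; the seen_urls.db filename is made up), tying a hash to a disk file keeps the keys out of RAM while the code that reads and writes the hash stays exactly the same:

    use strict;
    use warnings;
    use DB_File;   # disk-backed hashes via Berkeley DB
    use Fcntl;     # O_CREAT and O_RDWR flags

    # %ListOfURLs now lives in seen_urls.db instead of in memory, but
    # exists() and assignment call the DB_File methods behind the scenes.
    my %ListOfURLs;
    tie %ListOfURLs, 'DB_File', 'seen_urls.db', O_CREAT | O_RDWR, 0666, $DB_HASH
        or die "Cannot tie seen_urls.db: $!";

    my $url = 'http://example.com/page.html';
    unless ( exists $ListOfURLs{$url} ) {
        $ListOfURLs{$url} = 1;    # STORE is what actually runs here
        # ... fetch and scan the page ...
    }

    untie %ListOfURLs;            # flush and close the file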

Re: Another "out of memory!" problem
by BrowserUk (Patriarch) on Jun 23, 2010 at 01:15 UTC

    If, instead of putting the URLs themselves in the hash, you use their MD5 digests (Digest::MD5, in binary form) as keys, you will save a substantial amount of space. One million binary MD5s stored as hash keys use about 36 MB:

    use Digest::MD5 qw[ md5 ];
    use Devel::Size qw[ total_size ];

    my %h;
    undef $h{ md5( $_ ) } for 1 .. 1e6;
    print total_size( \%h ), "\n";    # 35665714, i.e. about 36 MB

    Note also the use of undef $hash{ ... }. This autovivifies the key without allocating any storage for a value--thereby saving space. Whilst some will view this as an unconscionable "trick", for this type of lookup application that is pushing your memory limits, the savings are worth having.

    Using this method you should be able to index close to 60 million URLs on a typical 2GB machine without problems. And far more quickly than any tie or DB mechanism that requires disk accesses.
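
    A rough sketch of how the duplicate check could look with binary digests as keys (the %seen hash and the already_seen() sub are only illustrative names):

    use strict;
    use warnings;
    use Digest::MD5 qw[ md5 ];

    my %seen;    # keys: 16-byte binary MD5 digests; values: never allocated

    # Returns true if $url has been seen before; records it otherwise.
    sub already_seen {
        my ( $url ) = @_;
        my $key = md5( $url );      # 16 bytes, however long the URL is
        return 1 if exists $seen{ $key };
        undef $seen{ $key };        # autovivify the key without storing a value
        return 0;
    }

    # The second visit to a.html is skipped.
    for my $url ( 'http://example.com/a.html',
                  'http://example.com/b.html',
                  'http://example.com/a.html' ) {
        next if already_seen( $url );
        print "crawling $url\n";
    }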


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Another "out of memory!" problem
by jau (Hermit) on Jun 23, 2010 at 09:29 UTC

    My guess would be that you are leaking memory because you are not destroying your HTML::TreeBuilder objects properly. From the docs:

    4. and finally, when you're done with the tree, call $tree->delete() to erase the contents of the tree from memory. This kind of thing usually isn't necessary with most Perl objects, but it's necessary for TreeBuilder objects. See HTML::Element for a more verbose explanation of why this is the case.

    A simple undef() is not enough.
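
    For example, something along these lines frees each tree before the next page is parsed (the starting URL is just a placeholder):

    use strict;
    use warnings;
    use LWP::Simple qw( get );
    use HTML::TreeBuilder;

    my @urls_to_crawl = ('http://example.com/');    # placeholder

    for my $url (@urls_to_crawl) {
        my $html = get( $url ) or next;
        my $tree = HTML::TreeBuilder->new_from_content( $html );

        my @aList = $tree->find_by_tag_name('a');
        # ... record the hrefs, push new URLs onto @urls_to_crawl, etc. ...

        $tree->delete;    # releases the tree; a plain undef $tree leaves cycles behind
    }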

      Hi jau, yes! that was exactly the problem. Nice to see it was such a simple solution, and a real "duh" moment for me. Thanks so much. Scott
Re: Another "out of memory!" problem
by Plankton (Vicar) on Jun 22, 2010 at 23:13 UTC
    Have you thought of or tried storing your data in a DB like SQLite instead of a hash?
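
    A rough sketch of that approach with DBI and DBD::SQLite (the crawler.db file, the seen table, and the url_seen() sub are made-up names):

    use strict;
    use warnings;
    use DBI;

    # An on-disk table of URLs already visited, so the list never has to fit in RAM.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=crawler.db', '', '',
                            { RaiseError => 1, AutoCommit => 1 } );
    $dbh->do('CREATE TABLE IF NOT EXISTS seen ( url TEXT PRIMARY KEY )');

    # Returns true if $url was already recorded; records it otherwise.
    sub url_seen {
        my ( $url ) = @_;
        my ( $found ) = $dbh->selectrow_array(
            'SELECT 1 FROM seen WHERE url = ?', undef, $url );
        return 1 if $found;
        $dbh->do( 'INSERT INTO seen (url) VALUES (?)', undef, $url );
        return 0;
    }

    print url_seen('http://example.com/') ? "old\n" : "new\n" for 1 .. 2;
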
Re: Another "out of memory!" problem
by Marshall (Canon) on Jun 22, 2010 at 23:56 UTC
    First, a site with way more than 27,000 pages is likely to have some tools or a special API that you can use for searching their site. For example, here is one post about accessing PubMed: Re: CGI to query other websites. You may not wind up being very popular with the sysadmin if you really "beat the heck out of their site" within a short period of time.

    Some other strategies would be to use a Google search to get the number of pages narrowed down and then search further on those pages. It's not clear to me what you are doing and why you have to visit every single page on this large site. A more optimized strategy might be possible if you could present some more application info.

    A hash with 27,000 URL keys doesn't sound large enough by itself to run out of memory, so it sounds like there are multiple other large structures. A DB is one possible answer if you really do need to collect this massive amount of information for this site. The Perl DBI is very good and plays very well with MySQL or SQLite.

Re: Another "out of memory!" problem
by slugger415 (Monk) on Jun 23, 2010 at 06:24 UTC

    Hi all, thank you for your interesting replies. To answer a couple of questions, this is actually for an internal site (actually several large collections of information, Eclipse instances) belonging to my employer. We're trying to catalog every page in every Eclipse instance, in part so that we can see which pages have zero hits. The left nav has a TOC tree that the script can traverse, but not all pages are in the TOC, so the script opens each page and scans for any hrefs not in the TOC, follows them down recursively, and adds them to the list. It's interesting to hear that 27,000 is not really that big, so I'm suspecting something else is going on.

    In answer to the question why don't I use a database, I'm actually saving the URLs (and some other data from each page) to a CSV file, so other than the URL itself not much is getting written to memory. I'm not sure how a db would solve that -- I'd still need the hash to check for already-cataloged URLs, or else I'd have to query the db for each one.

    Anyway I'll look through the interesting suggestions and see if I can figure out something that works for me. Thanks.

    Scott

Re: Another "out of memory!" problem
by Cody Fendant (Hermit) on Jun 23, 2010 at 05:50 UTC
    Where are you running this? Some hosting situations specifically limit the amount of time or the amount of memory that a script can use.
