Perl Monk, Perl Meditation PerlMonks

### Memory issue with large array comparison

by bholcomb86 (Novice)
 on May 24, 2012 at 20:35 UTC ( #972317=perlquestion: print w/ replies, xml ) Need Help??
bholcomb86 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am having an issue when doing array comparisons on large arrays. Basically I one array which contains a large list (50,000+) of filepaths ex. C:/abc/abc1/GS12345 then I have another large array (10,000+) ID's like GS12345.

I am using some code I found on here to create a new array that contains pathnames in which the id after the last slash in the pathname doesn't match any of the id's in the array of id's.

The problem is this runs out of memory on large lists, it works fine on smaller lists like 2,000 pathnames vs. 1,000 ID's.

Here is my code:  my @list = grep { my $x =$_; not grep { $x =~ /$_\w+$/ } @safe_list } @pathnames; Comment on Memory issue with large array comparison Download Code Re: Memory issue with large array comparison by snape (Pilgrim) on May 24, 2012 at 21:17 UTC I would recommend you to use a scalar variable that contains the pathname till C:/abc/abc1/, use hash of arrays or hashes to have your IDs and folders like GS12345 and then use another scalar variable to concatenate the path+folder that you are interested in. I think that would not lead to memory issues. Also, searching would be bit faster. Re: Memory issue with large array comparison by pemungkah (Priest) on May 24, 2012 at 21:17 UTC When you say "doesn't match any of" I think of a hash rather than arrays; otherwise you're spending lots of runtime searching, when a hash would just hand it to you, O(1). How about building a hash out of the pathnames, with the keys being the last part of the name, and the value being the path? That way, to get the ones that aren't in the safelist, you'd just look over the safelist and delete those keys. Anything left is not safe. Would that work? It would certainly be faster. Re: Memory issue with large array comparison by Eliya (Vicar) on May 24, 2012 at 21:21 UTC The task doesn't require that you have all pathnames in memory at once. It can just as well be solved by processing one path at a time. So, for example, read them one by one from a file ... while (my$path = <PATHS>) {
# check if path component isn't in @safe_list ...
}
[download]

That should significantly cut down on memory usage.

Also, iterative solutions (plain old nested loops) are generally less memory hungry than functional implementations (grep, map and friends) of the same task. This is because the former avoid creating additional lists on the stack.

BTW, do you really need the regex /$_\w+$/, i.e. wouldn't a simple hash lookup (of the appropriate path fragment) work, too?

Re: Memory issue with large array comparison
by dave_the_m (Parson) on May 24, 2012 at 21:29 UTC
The following code, scaled up to 500,000 filenames and 100,000 IDs, takes 2 seconds to run and uses 90Mb approx.
use warnings;
use strict;

# create some sample data
my (@pathnames, @safe_list);
push @pathnames, sprintf "C:/abc/abc1/GS%06d", $_ for 1..500_000; push @safe_list, sprintf "GS%06d",$_ for 1..100_000;

my %safe_hash;
$safe_hash{$_} = 1 for @safe_list;

my @list;
for (@pathnames) {
# (adjust regex to match whatever the filename/ID format is)
/\/(\w+)$/ or die "bad pathname:$_";
push @list, $_ unless$safe_hash{$1}; } [download] Dave. I decided to try Dave's solution and turns out it works perfectly. I just wanted to solve my memory problem which this does, but as a side benefit it is WAY faster than my previous solution. Consider this problem solved! I appreciate all input given on this matter. Re: Memory issue with large array comparison by aaron_baugher (Deacon) on May 24, 2012 at 21:32 UTC This looks like a pretty standard "scan one list for items existing (or not existing) in another list" problem, which screams for a hash. Make a hash using the values from your second array (like GS12345) as the keys. Then loop through your array of filenames, use split or a filename module to break out the ID from the pathname and see if it exists as a key in the hash. With arrays that large, memory shouldn't be a problem doing it that way. I just created a hash with 10,000 7-character keys, and Devel::Size says it only used 1MB of memory. If your array of pathnames is running you out of memory, maybe you can process them one-by-one from wherever they're coming from, instead of loading them all into an array at once. Aaron B. Available for small or large Perl jobs; see my home node. Re: Memory issue with large array comparison by ww (Bishop) on May 24, 2012 at 22:51 UTC Seems to me (subject to the Monks' corrections) that the OP's precise words mean no hash is required: Just store (push) to a new array those items from the "large list" that have no match in the "large array". And that requires only a pretty basic regex ... unless the OP means (without specifying it) that the "large list" includes more than the single path in the example... and even that solves itself with the same simple regex... capturing everything after the last slash to test the match. OP makes no statement that the "new array" referred to in the 2nd para "I am using some code I found on here to create a new array that contains pathnames in which the id after the last slash in the pathname doesn't match any of the id's in the array of id's." that requires anything more than an array of non-matches. When there's a simple way and a hard way that produce the same results, go simple! My understanding is that he has two arrays: the first containing 50,000+ pathnames, and the second containing 10,000 strings which may match a filename in one of the pathnames. I don't see how you exclude those matches with a regex, unless you concat all 10,000+ strings together with pipe symbols, and I doubt a regex like that would be very efficient. Aaron B. Available for small or large Perl jobs; see my home node. You may be right. Certainly the regex you envision would be a tad clumsy. :) At the moment, it's looking to me as though I should have written the code to support my claim re regex solutions, before posting, rather than simply doing a mental sketch (which proved to be wrong in at least that particular instance). Thank you for making me "think, again!" However, on the theory that simple (or, in this case, 'concise') is desirable: #!/usr/bin/perl use 5.014; # 972317 my @arr1 = ('C:\abc\def\zz456','C:\abc\xyz\ab123','C:\abb\bba\ac234',' +C:\abc\xyz\ab321'); my @arr2 = qw/ab123 ac234 ad456 ab321/; # no match for C: +\abc\def\zz456 my$dir_entry;

my @arr3 = grep { $dir_entry =$_; not grep { $dir_entry =~ /\Q$_/i }
+@arr2 } @arr1;

say "Following is/are dir entry/entries (in \$arr1) without matching i +tems in \@arr2"; say each(@arr3); [download] Output ( index and array element; see each ): 0C:\abc\def\zz456 And can some knowledgeable Monk explain why the index of the zeroth element of the array is rendered as 0C? Update: A knowledgeable Monk (++) has indeed explained... and the answer has little to do with the assumption immediately above. Meh! Blech on me. I don't see how you exclude those matches with a regex, unless you concat all 10,000+ strings together with pipe symbols, and I doubt a regex like that would be very efficient. One solution: Regexp::Assemble It creates an optimized regex that checks all your strings at once. It is usually shorter and much faster than your "pipe"d regex. For example, 10,000 strings of each 5 characters were turned into a regex less than 25,000 characters long. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics Re: Memory issue with large array comparison by Cristoforo (Deacon) on May 25, 2012 at 01:10 UTC Just to point out, in your code my @list = grep { my$x = $_; not grep {$x =~ /$_\w+$/ } @safe_list } @pathnames;

the first grep from the left should probably be a map instead.

Chris

Update: Didn't read the code correctly.

Re: Memory issue with large array comparison
by sundialsvc4 (Abbot) on May 25, 2012 at 02:02 UTC

One thing that occurs to me also is ... “how often do you do this?”   How many other tasks do you do that are similar to this?   My thought is that, if this is not strictly a one-off task, a database (e.g. an SQLite file) might stand you in very good stead, vis-a-vis a Perl application that is strictly dedicated to the task.   It has obviously and plainly been demonstrated how Perl can knock the immediate requirement out of the park ... now, is this one-off, or are there enough other requirements out there that might be resolved by a clever JOIN?   Is it (or is it not) beneficial to write Perl programs that can, by loading a database, circumvent the need to write many more almost-the-same but-not-quite Perl programs?   It is an angle worth contemplating; I don’t suggest what is the answer.

Re: Memory issue with large array comparison
by CountZero (Bishop) on May 25, 2012 at 10:07 UTC
The "hash" solution is probably the fastest, but less flexible as you can only check on the exact match between IDs and the filename part of the filepath.

It will not work if you have to check if the ID is *somewhere* mentioned in the filepath.

Still you do not have to despair! Nor do you have to check each filepath against each ID with a separate regex (which would be very slow).

Through the magic of Regexp::Assemble it only takes a program of a few lines:

use Modern::Perl;
use Regexp::Assemble;
use autodie;

open my $PATTERNS, '<', './patterns.txt'; my$re_pattern = Regexp::Assemble->new->add(<$PATTERNS>)->re; close$PATTERNS;

open my $PATHS, '<', './paths.txt'; do {print unless /$re_pattern/} while <$PATHS>; close$PATHS;
[download]

CountZero

A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

My blog: Imperial Deltronics

Wow, that's an impressive module. Thanks for the tip on that. Out of curiosity, I benchmarked it against the hash solution, and the hash is faster of course, but only 13 times faster. As you say, there will be times when you can't use a standard hash lookup, so something like this is a good alternative. I had it print out the regex it built for 10,000 different keys (all 7-char random lowercase letters) and it was huge, yet it only took a little over a second to compare it to 50,000 different strings on my system. The Perl regex machine is amazing.

Aaron B.
Available for small or large Perl jobs; see my home node.

Create A New User
Node Status?
node history
Node Type: perlquestion [id://972317]
Approved by Corion
Front-paged by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (9)
As of 2015-05-04 22:11 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?