Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW

Re^4: Memory issue with large array comparison

by aaron_baugher (Curate)
on May 25, 2012 at 03:56 UTC ( #972359=note: print w/replies, xml ) Need Help??

in reply to Re^3: Memory issue with large array comparison
in thread Memory issue with large array comparison

This is what the original poster did, but with different variable names and a slightly different regex, and it ran out of memory. But isn't this O(N2)? It seems to me that it greps every item in one array against all the items in the other array, so it's really no different from this:

my @array3; OUTER: for my $a1 (@array1){ for my $a2 (@array2){ next OUTER if they_match_somehow(); } push @array3, $a1; # it didn't match anything in @array2 }

Both cases have two nested loops; it's just harder to see them in the grep-within-a-grep method.

Aaron B.
Available for small or large Perl jobs; see my home node.

Replies are listed 'Best First'.
Re^5: Memory issue with large array comparison
by ww (Bishop) on May 26, 2012 at 00:47 UTC
    + + aaron_baugher; I didn't even notice the similarity ...and shame on me for that, as it means it's no answer to OP's original dilemma.

    I was, I realize now (thanks to your watchfulness), obsessing on the multiple responses offering use of a hash as a solution. I still think those represent something close to cargo-culting a meme, (rather than actual code) -- but not an optimal solution, since, if I read the wisdom of the sages correctly (and if they're right, of course), using a hash would be at least as memory intensive and probably more so.

    That's also an issue with map and grep (cf Eliya's observations, above), but perhaps less so than using a hash (that's another test that I haven't undertaken, but which might lead to a publishable finding). And in the same node, Eliya makes a cogent point (echoed in slightly different context by dave_the_m's code: there are a variety of ways to attack OP's problem with reduced memory demand. Yet another might be a step-wise solution: first, separate the id portion of the first dataset to a file of it's own; then identify the ids in the second file that don't have identical (or identically normalized, if that's involved, too) values.

    But, again, ++ for casting a sharp eye on the prior responses.

      Thanks for the compliment, ww. You have a point, that sometimes when everyone comes out with the same suggestion, it reflects cultish thinking. But sometimes it means there really is one best way to do it. When the problem is "find strings from one list in another list," it's just pretty hard to beat a hash lookup for speed and simplicity, and this was a pretty typical case. A hash lookup is so superior to other methods that it makes sense to go to it automatically -- without thinking, even -- unless there's some reason it won't work. It's like using strict: you should always use it unless you know enough to know when not to use it.

      On the memory issue, I'm really not sure why the grep-in-a-grep solution ran the OP out of memory. Maybe it causes temporary lists to be built in memory? In any case, a hash isn't all that memory intensive. I created a 10,000 item array, and then turned it into a hash's keys. The hash took 150% as much memory as the array. So edge cases where you have enough memory to use an array but not a hash will be unusual. I agree that solving the problem in less memory could be an interesting challenge, but only worth tackling if a hash lookup fails first.

      Aaron B.
      Available for small or large Perl jobs; see my home node.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://972359]
[Corion]: ambrus: Yeah, I read that, but it's somewhat vague as in what I really have to implement, and where/how my "other" mainloop should/needs to call AnyEvent
[Corion]: (or maybe I just work better from existing code that I munge until it works and I understand it rather than a short abstract text like "implement everything that's needed" ;) )
[ambrus]: Corion: I think in this case you can get away with only a stub for idle, one that always dies when you create it, because AnyEvent::HTTP doesn't use it, not even indirectly through AnyEvent::Handle or AnyEvent::Socket or AnyEvent::DNS.
[Corion]: The "and I understand it" part is optional.
[Corion]: ambrus: Yes but I also need to implement the file / IO watcher, because Prima has that (in Prima::File), and I need to supply the appropriate thing to make push_write etc. work with Prima
[ambrus]: Corion: yes, you need to implement the io watcher, which should be simple because Prima::File is basically that, and the timer watcher form Prima::Timer
[Corion]: ... or so I think. As I said, I'm somewhat vague on how to make AnyEvent cooperate with a callback-driven IO event loop that gives me callbacks when data is available or can be written
[ambrus]: what push_write thing? I don't think you need that. that's implemented generically by AnyEvent::Handle
[Corion]: ambrus: Yeah, that's what I think as well. But you give me an idea, maybe I should start with implementing the timer, as that should be far simpler and with fewer edge-cases/nasty interaction than the file watcher
[ambrus]: You only provide the watcher part that tells when the handle is readable or writable, not the actual writing and reading.

How do I use this? | Other CB clients
Other Users?
Others having an uproarious good time at the Monastery: (9)
As of 2016-12-08 12:17 GMT
Find Nodes?
    Voting Booth?
    On a regular basis, I'm most likely to spy upon:

    Results (141 votes). Check out past polls.