sectokia has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have input which is 500M pairs of 32bit integers, which don't occur more than 256 times each.

I want to load them into memory, and also have an array of how many times each int is in a pair.

Now in C, to do this I only need 5GB of RAM: for the pairs, 500M * 4 bytes (32-bit int) * 2 (pairs) = 4GB; for the occurrences, 1G * 1 byte (8-bit int) = 1GB.

However, when I do the same thing in Perl, the RAM usage is more like 256 bytes per item:

```perl
my @d;
my @p;
for (my $i = 0; $i < 500_000_000; $i++) {
    my ($p1, $p2) = getPair();
    $d[$p1] += 1;
    $d[$p2] += 1;
    push @p, [ $p1, $p2 ];
}
```

I am seeing about 1GB RAM usage per 4M input pairs, so I would need 125G of RAM!

Is there any way to tell Perl that a scalar is to be an int of a certain size only?

The other idea I have is that I should not be using arrays, but rather one gigantic scalar, and then pack/unpack the values in and out of that scalar?

What would be the best way to approach this in perl?

Thanks!

Replies are listed 'Best First'.
Re: Memory efficient way to deal with really large arrays?
by AnomalousMonk (Bishop) on Dec 13, 2020 at 00:19 UTC
Is there any way you can tell perl a scalar is to be a int only of a certain size?

See PDL. I have no experience with this software other than knowing of its existence, but I think it might be able to do this trick.

The other idea I have is [to use] a gigantic scalar, and then pack/unpack the values in/out of the scalar ...

Take a look at BrowserUk's ultimate solution for the problem posed here. substr is used rather than pack/unpack.
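The substr idea can be sketched roughly like this (a minimal sketch with illustrative helper names, not BrowserUk's actual code): store each pair as a fixed-width 8-byte record inside one big scalar, and use substr for random access.

```perl
use strict;
use warnings;

# Store pairs of 32-bit unsigned ints as fixed-width 8-byte records
# inside one big scalar, addressed with substr.
my $RECSIZE = 8;     # 2 x 4 bytes per record
my $store   = '';

# Preallocating once avoids repeated reallocation for 500M records:
# $store = "\0" x (500_000_000 * $RECSIZE);   # ~4 GB

sub put_pair {       # hypothetical helper, not from the thread
    my ($i, $p1, $p2) = @_;
    substr($store, $i * $RECSIZE, $RECSIZE) = pack 'LL', $p1, $p2;
}

sub get_pair {
    my ($i) = @_;
    return unpack 'LL', substr($store, $i * $RECSIZE, $RECSIZE);
}

put_pair(0, 12345, 4_000_000_000);
my ($a, $b) = get_pair(0);
print "$a $b\n";     # 12345 4000000000
```

This keeps the per-pair overhead at exactly 8 bytes, at the cost of a function call per access.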

Give a man a fish:  <%-{-{-{-<

Indeed. PDL is probably the most powerful and efficient module both in terms of memory usage and speed, but it has its learning curve.

On the other hand, there are several modules on CPAN allowing one to access packed strings as regular arrays as for instance Packed::Array, Tie::Array::PackedC or my own Tie::Array::Packed.

Those modules are memory efficient but relatively slow due to the tie interface overhead. Well, actually, Tie::Array::Packed, which is written in fast XS, would probably be on par with using Perl builtins (i.e. substr, pack, unpack, etc.) to access a packed string. It probably depends on the specific operations being performed.

PDL is what I'd recommend, too. Note that it is a complete framework and leverages custom "array" and other data type containers that can be used in perl code. But, yay TIMTOWTDI!
Re: Memory efficient way to deal with really large arrays?
by eyepopslikeamosquito (Bishop) on Dec 13, 2020 at 05:18 UTC

I have input which is 500M pairs of 32bit integers, which don't occur more than 256 times each. I want to load them into memory, and also have an array of how many times each int is in a pair.

From Fastest way to lookup a point in a set (a question I asked three years ago):

I have a large set of points. Each point has an x and y coordinate. The range of values for x and y are -2,147,483,648 to 2,147,483,647 (i.e. 32-bit signed int). I am running a 64-bit version of Perl; this code is not required to run in a 32-bit environment. I need to insert the set of points into a Perl data structure ... then perform many lookups to tell if a given point is in the set or not. I want the lookup to be as fast as possible.
These two problem statements sound similar enough that examining replies to my original Fastest way to lookup a point in a set question might prove worthwhile.
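One trick that comes up for that kind of membership problem (a sketch of the general idea, not necessarily the thread's chosen answer): pack each (x, y) point into an 8-byte string and use it as a hash key, so lookups are a single hash probe.

```perl
use strict;
use warnings;

# Pack two signed 32-bit ints ('l2') into one 8-byte hash key
# for O(1) point-in-set membership tests.
my %in_set;

sub add_point { my ($x, $y) = @_; $in_set{ pack 'l2', $x, $y } = 1 }
sub has_point { my ($x, $y) = @_; exists $in_set{ pack 'l2', $x, $y } }

add_point(-2_147_483_648, 2_147_483_647);
print has_point(-2_147_483_648, 2_147_483_647) ? "yes\n" : "no\n";  # yes
print has_point(0, 0) ? "yes\n" : "no\n";                           # no
```

The packed 8-byte key is far cheaper than a nested hash-of-hashes, though each hash entry still carries Perl's per-key overhead.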

Word of warning though, as indicated at Re^2: More Betterer Game of Life, while applying a long series of micro-optimizations to the code improved my running time from 1635 secs to 450 secs (3.6 times faster), stepping away from the code to find a better algorithm reduced my running time from 450 secs to 17 secs (26 times faster).


Re: Memory efficient way to deal with really large arrays?
by GrandFather (Saint) on Dec 13, 2020 at 00:07 UTC
What would be the best way to approach this in perl(sic)?

Umm, do it in C (XS)? "Best" in this case depends on a whole lot of stuff you haven't told us. How performant does the code have to be? How much memory does your machine have? How often do you expect to run the code? Is this approach even appropriate?

At first blush, using XS is likely to be the best option, with packing/unpacking pairs and counts into a large string or two a reasonable second option. But a solution using a database, or just doing the whole thing in C++, may be better. Without any context it is impossible for us to tell.

Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
Re: Memory efficient way to deal with really large arrays?
by salva (Canon) on Dec 13, 2020 at 11:37 UTC
Instead of using an array of arrays for storing the pairs, you can use two arrays, one for the first elements and another for the second ones. That is going to reduce your memory consumption by an order of magnitude:
```perl
my @d;
my (@p1, @p2);
for (my $i = 0; $i < 500_000_000; $i++) {
    my ($p1, $p2) = getPair();
    $d[$p1]++;
    $d[$p2]++;
    push @p1, $p1;
    push @p2, $p2;
}
```
That's brilliant. It's simple and very efficient.

Another approach would have been to pack each pair into a 64-bit (8-byte) string stored in one array; that might be more efficient, but it wouldn't be as simple as this trick.
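That packed-pair alternative might look like this (my sketch of the idea, assuming a 64-bit perl):

```perl
use strict;
use warnings;

# Store each pair as one packed 8-byte string element in a flat array.
my @p;
push @p, pack 'LL', 123, 456;          # store a pair
my ($p1, $p2) = unpack 'LL', $p[0];    # retrieve it

# Alternatively, fold the pair into a single 64-bit integer scalar:
my $packed = ($p1 << 32) | $p2;
my ($hi, $lo) = ($packed >> 32, $packed & 0xFFFF_FFFF);
print "$hi $lo\n";                     # 123 456
```

Either way each element is one scalar instead of a two-element anonymous array, avoiding the per-array overhead.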

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

that might be more efficient

I am not sure. Some years ago Nicholas Clark introduced a change in the internal representation of SVs that made them very efficient for integers and doubles. Briefly, all internal types used to have a fixed-size head (SV_HEAD) and a variable-sized body. That change used a trick to embed the body inside the head for SVs representing numbers, making them more cache friendly and reducing memory usage by eliminating the body part.

So, having two arrays with numbers may actually use less memory than an array of small strings.

Re: Memory efficient way to deal with really large arrays?
by Tux (Canon) on Dec 13, 2020 at 09:57 UTC

If you do not need performance, use a database.

To make a relatively transparent transition: Tie::Array::DBD. The more complex your data, the slower it is, but you can work around a computer's memory limitations relatively easily (which was the only motivation for this module).

Here's an overview of speed differences on my machine. Higher numbers are better:

```
  Size op       perl     SQLite         Pg      mysql    MariaDB        CSV
------ -- ---------- ---------- ---------- ---------- ---------- ----------
    20 rd  6666666.7   227272.7     3916.2     7538.6     9267.8     4444.4
    20 wr  6666666.7   172413.8     2842.1     1759.3     1882.7    11560.7
   200 rd  8333333.3   414078.7    12738.9     8710.4    24330.9      901.9
   200 wr  9523809.5   310559.0    12190.7     7399.7    13898.5    13556.6
   600 rd 24000000.0   371977.7    14399.2    15106.9    24207.2      346.0
   600 wr 20689655.2   376884.4    14260.9    12633.4    23992.3    14080.5
  2000 rd 13986014.0   392310.7    45244.8    23147.1    23431.8      107.1
  2000 wr 51282051.3   405515.0    18964.5    16786.4    31738.5    14729.2
 20000 rd 25380710.7   374995.3    42310.4    22139.2    22889.8         -
 20000 wr 40899795.5   390930.4    33410.4    37265.1    32508.7         -
```

Enjoy, Have FUN! H.Merijn
Re: Memory efficient way to deal with really large arrays?
by LanX (Cardinal) on Dec 13, 2020 at 00:17 UTC
> push @p, [ \$p1, \$p2];

I guess that creating 500e6 sub-arrays is the most memory-consuming part.

I can't tell how efficient the usage of @d is, because it depends on how sparse the entries are used.

32 bits means 4.2e9 potential entries, but with at most 500e6/256 = 1,953,125 distinct values in the worst case, you'll end up with a roughly 2200-to-1 empty-entry ratio in @d. I bet a hash is more memory efficient than that.

But it really depends what you are trying to achieve with all that data.

##### update

> which don't occur more than 256 times each.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

Re: Memory efficient way to deal with really large arrays?
by LanX (Cardinal) on Dec 13, 2020 at 00:28 UTC
On a side note: We recently had a discussion about Judy arrays which claim to be far more memory efficient than hashes.

But without further details this is just a shot in the dark.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

My impression is that Judy arrays are about addressing speed rather than space concerns with an emphasis on cache coherency as an optimisation technique. Probably not really applicable in a Perl context.

Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
I don't have any first-hand experience, but as I said, they claim to be far more memory efficient.

Of course it also depends on the OP's usage; storing 500M Perl arrays each holding just two integer entries will still exhaust his memory.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

Re: Memory efficient way to deal with really large arrays?
by LanX (Cardinal) on Dec 13, 2020 at 15:09 UTC
> Now in C, to do this I only need 5GB of ram: For the pairs: 500M * 4 (32bit int) * 2 (pairs) = 4GB. For the occurances: 1G * 1 (8bit int) = 1GB

That's an illusion: if you use an array of bytes in C, you'll have 2**32 potential values. That's already 4.3GB for the occurrence counts in @d, not 1GB.

But if you think it's that easy in C, you'll easily reproduce its arrays with pack, substr and unpack in Perl.

Of course slower, but you haven't told us yet what your goal is.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

That makes me think that one can actually calculate the counters without using any extra memory at all. You only have to sort the array (or arrays) and then traverse it, counting consecutive equal elements.

... well, obviously, unless you need to keep the counters in memory for some further processing, but as you have said, the OP hasn't told us yet what his final goal is!
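The sort-then-count idea on a flat list of ints would go roughly like this (a toy-sized sketch; a real run would stream the counts out instead of keeping a hash):

```perl
use strict;
use warnings;

# After a numeric sort, equal values are adjacent, so one linear pass
# yields every occurrence count without a giant counter table.
my @ints   = (5, 3, 5, 9, 3, 5);
my @sorted = sort { $a <=> $b } @ints;

my %count;
my ($cur, $run) = ($sorted[0], 0);
for my $v (@sorted) {
    if ($v == $cur) { $run++ }
    else            { $count{$cur} = $run; ($cur, $run) = ($v, 1) }
}
$count{$cur} = $run;    # flush the final run

print "$count{5}\n";    # 3
```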

If you keep the pairs, i.e. 2-dimensional points, how would you want to sort them in one dimension?

> the OP hasn't told us yet what his final goal is!

Exactly!

For instance, if he wanted to process the data sequentially he could also sort them into HoHs or AoHs, since the OS swaps out unused data.

Like this only the currently processed second level structure would be in RAM and the performance overhead for swapping would be negligible.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
