http://www.perlmonks.org?node_id=1052786


in reply to Re^2: Debugging XS modules under Strawberry perl
in thread Debugging XS modules under Strawberry perl

See also http://www.gamedev.net/topic/482230-mingw-dll-used-in-mingw-and-vc-app-memory-alignment/.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^4: Debugging XS modules under Strawberry perl
by salva (Canon) on Sep 07, 2013 at 15:09 UTC
    Actually I was able to solve the issue declaring the int128 integers stored in perl-malloc'ed memory as follows:
    typedef int128_t  int128_t_a8  __attribute__ ((aligned(8)));
    typedef uint128_t uint128_t_a8 __attribute__ ((aligned(8)));
    Now gcc generates MOVDQU instructions to load them. MOVDQU is only about 40% slower than MOVDQA, so the performance degradation is probably unnoticeable for a perl program.
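
    A minimal sketch of how such a typedef might be used, assuming gcc (or clang) on a 64-bit target where __int128 is available. Here int128_t is defined locally as a stand-in for the module's own type, and load_a8, store_a8 and demo_roundtrip are hypothetical helper names, not anything from the module:

```c
#include <assert.h>
#include <stdlib.h>

/* Assumption: the module defines int128_t itself; this is a local stand-in. */
typedef __int128 int128_t;

/* Same declaration as in the post: a 128-bit integer that the compiler
   may only assume to be 8-byte aligned. */
typedef int128_t int128_t_a8 __attribute__ ((aligned(8)));

/* Hypothetical helpers: access a value through the reduced-alignment type. */
static int128_t load_a8(const void *p)
{
    return *(const int128_t_a8 *)p;
}

static void store_a8(void *p, int128_t v)
{
    *(int128_t_a8 *)p = v;
}

/* Store and reload a value at an address that is 8-byte aligned but not
   16-byte aligned. Returns 1 if the value survives the round trip. */
static int demo_roundtrip(void)
{
    unsigned char *raw = malloc(32);
    assert(raw != NULL);

    /* Round up to a 16-byte boundary, then step 8 bytes past it. */
    unsigned char *p = raw + ((16 - ((size_t)raw & 15)) & 15);
    p += 8;
    if (p + 16 > raw + 32) {      /* not enough room left, fall back */
        p -= 16;
        if (p < raw) { free(raw); return 0; }
    }

    int128_t v = ((int128_t)0x0123456789abcdefLL << 64) | 0xfedcba9876543210ULL;
    store_a8(p, v);
    int ok = (load_a8(p) == v);

    free(raw);
    return ok;
}
```

    Accessing the same address through a plain int128_t pointer would let the compiler assume 16-byte alignment, which is where the aligned-load crash comes from.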

      You have a solution that is probably good enough.

      This is little more than a recurring idea that I've yet to use, but it could be applied here.

      When writing a library to deal with relatively small 'value types' -- n-bit integers are a perfect example -- the time-costs and memory overheads of malloc()/free()(1) can be a substantial part of the aggregate costs of using the type. Indeed, C++11 attempts to deal with another part of this -- avoiding the process of allocating a new instance, copying the value of an old instance (transformed by the current operation) into the new instance and then discarding the old instance -- by the implementation of 'move semantics'. But that's a separate issue.

      ((1)Compiler malloc/free are bad enough, but Perl's is horrible. Leastwise on Windows.)

      A good solution to this is to create a pool of the value types and manage them yourself. As the allocations are all of a fixed size, the pool can be managed as a simple array of values, with a pointer to the first free slot and each unused slot pointing to the next free one.

      That gives 0 per-value memory overhead; 1 dereference and a fixup per allocation; and worst case a short pointer chain for free. It can also give good cache locality. And, guaranteed alignment! (An alternative implementation that uses a bitmap rather than a free-chain can be even more efficient.)
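
      The pool described above might be sketched roughly as follows. This is only an illustration under my own assumptions: the names (slot_t, pool_t, pool_init, pool_alloc, pool_free) and the fixed size of 256 slots are hypothetical, the free path simply pushes the slot onto the head of the chain, and C11 _Alignas is used to get the 16-byte alignment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SLOTS 256   /* hypothetical size: one 4k page of 16-byte values */

/* A slot holds either a live 128-bit value or, while unused,
   the link to the next free slot. No per-value header is needed. */
typedef union slot {
    union slot *next_free;
    _Alignas(16) unsigned char value[16];
} slot_t;

typedef struct pool {
    slot_t  slots[POOL_SLOTS];
    slot_t *free_head;          /* first unused slot, or NULL when full */
} pool_t;

/* Thread every slot onto the free chain. */
static void pool_init(pool_t *p)
{
    for (size_t i = 0; i + 1 < POOL_SLOTS; i++)
        p->slots[i].next_free = &p->slots[i + 1];
    p->slots[POOL_SLOTS - 1].next_free = NULL;
    p->free_head = &p->slots[0];
}

/* One dereference and one fixup; returns NULL when the pool is exhausted. */
static void *pool_alloc(pool_t *p)
{
    slot_t *s = p->free_head;
    if (s != NULL)
        p->free_head = s->next_free;
    return s;
}

/* Push the slot back onto the head of the free chain. */
static void pool_free(pool_t *p, void *v)
{
    slot_t *s = v;
    assert(s >= p->slots && s < p->slots + POOL_SLOTS);
    s->next_free = p->free_head;
    p->free_head = s;
}
```

      Because slot_t is declared 16-byte aligned and exactly 16 bytes in size, every pointer handed out by pool_alloc() is 16-byte aligned as long as the pool_t object itself is placed normally (static, automatic, or in suitably aligned memory).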

      On Windows this idea is made very simple by using VirtualAllocEx()(2) to reserve a chunk of virtual memory space, that lives completely outside the sight of the CRT (heap), that can be grown and shrunk in-place in system page sized chunks.

      So, you reserve (say) 1GB virtual memory(3), but only allocate the first page (4k). That's enough for an array of 256 128-bit integers, but if you need more, you can efficiently, dynamically extend that array to accommodate 67 million more, without any need to copy or move any of the existing values.

      ((2) I don't know/can't find the equivalent system function for *nix. I thought for a while valloc() might be it; but that seems to have been deprecated and replaced by a call that has all sorts of silly restrictions (must be compatible with the standard free() in order to be POSIX compliant.))

      ((3) You might need to be a little more conservative on 32-bit systems; but on 64-bit there is ~16 million times as much address space available as any of the current crop of processors can actually address, so there is no penalty for spreading ourselves around a bit.)
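
      The reserve-then-grow idea can be sketched as below. Treat it as an assumption-laden illustration rather than a tested design: it uses VirtualAlloc() (the same-process sibling of VirtualAllocEx()) on Windows, and on other systems it falls back to mmap() with PROT_NONE followed by mprotect(), which is my own suggestion for a *nix analogue and not something from the discussion above. The function names region_reserve and region_commit are hypothetical.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS under strict -std= modes */
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#endif

/* Reserve `bytes` of address space without backing it with memory.
   Returns NULL on failure. */
static void *region_reserve(size_t bytes)
{
#ifdef _WIN32
    return VirtualAlloc(NULL, bytes, MEM_RESERVE, PAGE_NOACCESS);
#else
    void *p = mmap(NULL, bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
#endif
}

/* Make the first `bytes` of a reserved region readable and writable.
   Calling it again with a larger size grows the usable part in place;
   existing values keep their addresses. Returns 0 on success. */
static int region_commit(void *base, size_t bytes)
{
    assert(base != NULL);
#ifdef _WIN32
    return VirtualAlloc(base, bytes, MEM_COMMIT, PAGE_READWRITE) ? 0 : -1;
#else
    return mprotect(base, bytes, PROT_READ | PROT_WRITE);
#endif
}
```

      Typical use would be to call region_reserve() once for the large range, region_commit() for the first page, and region_commit() again with a larger size whenever the array of values runs out of room.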


      typedef int128_t int128_t_a8 __attribute__ ((aligned(8)));

      Struck the same issue messing with the __float128 type (Math-Float128) ... at least, the solution you provided worked for me, too.

      What I find interesting is that there is a need to apply this measure only when using the 64-bit MinGW compiler. With the 32-bit MinGW compilers, there's no need.
      Mind you, Strawberry Perl's gcc-4.6.3 compilers (both 32-bit and 64-bit) create runtime crashes whenever quadmath.h's expq() function is called - I couldn't reproduce that problem anywhere else.

      Anyway, salva, I'm glad you found an answer to the problem - I hate to think how long it might have taken me to find this solution (let alone decipher what the problem was in the first place).

      Cheers,
      Rob