http://www.perlmonks.org?node_id=1052781


in reply to Re: Debugging XS modules under Strawberry perl
in thread Debugging XS modules under Strawberry perl

Thanks, that was enough to get near the place where the error was happening and from there trace the program with WinDbg.

The actual problem is that gcc generates MOVDQA (Move Aligned Double Quadword) instructions to load the int128 integers from memory into the XMM (SSE) registers, and those require 16-byte alignment for their memory operands. Perl's own malloc implementation only guarantees 8-byte alignment on Windows, so eventually the module crashes.

Re^3: Debugging XS modules under Strawberry perl
by bulk88 (Priest) on Sep 06, 2013 at 22:42 UTC
    Maybe this will help. It's only for VC; maybe MinGW implemented a GCC equivalent: http://msdn.microsoft.com/en-us/library/ms253978%28v=vs.80%29.aspx . You can create aligned Perl memory with sv_chop by initially allocating extra bytes that may or may not get used; the used portion of the memory block may then get moved forward, depending on the start address of the block returned by Newx/the Perl memory allocator. Here is some private code that shows an example of aligning with sv_chop. It allows kernel DMA writes directly into an SV, which have huge alignment requirements (512, etc). The code also allows alignment to non-powers of 2 for completeness.
        void
        ReadFileEx(self, hFile, opBuffer, uBytes, uqwOffset, uAlign=0)
            HV * self
            HANDLE hFile
            SV * opBuffer
            unsigned long uBytes
            unsigned __int64 uqwOffset
            unsigned long uAlign
        .....................................................
        PREINIT:
            char * opBufferPtr;
            BOOL ret;
            unsigned long alignRemainder;
        ................................................
        CODE:
        ................................................
            //idea is to avoid a sv_grow and a sv_chop if current SvPVX meets alignment and len requirements
            //SvPVX may already be aligned from a previous ReadFileEx
            if( SvTYPE(opBuffer) < SVt_PV) { goto grow;} // SvPVX can't be used unless sv type >= pv
            else if( ! (opBufferPtr = SvPVX(opBuffer))){ goto growPV;}
            else if(uAlign){
                alignRemainder = (DWORD_PTR)opBufferPtr % uAlign; //opBufferPtr and uAlign can't be 0 here
                //test uBytes+offset to align multiple for current PVX, uBytes+uAlign only needed for a re/malloc op
                //since the alignment of the returned alloc block is assumed to be 0 thru uAlign
                if(alignRemainder && ! (SvLEN(opBuffer) > uBytes + (uAlign-alignRemainder))){
                    goto growPV;
                }
                // PVX is already aligned, check alloc len
                else { goto testLen;}
            }
            testLen:
            if(SvLEN(opBuffer) <= uBytes) { goto growPV;}
            if(0){
                growPV: //only if PV, access vio, prevent needless copy in sv_grow
                SvCUR(opBuffer) = 0;
                grow:
                //always include full alignment here, we can't predict alignment of realloced ptr
                opBufferPtr = sv_grow(opBuffer, uBytes+1+uAlign);
            }
            //offset to next align boundary must be recalculated, opBufferPtr may have changed from 1st align calc
            if(uAlign && (alignRemainder = (DWORD_PTR)opBufferPtr % uAlign)) {
                char * newptr = opBufferPtr = opBufferPtr+(uAlign-alignRemainder);
                //any alignment, any odd number, any even number, any number, not just power of 2, rope to hang yourself
                //(alignRemainder && 1) is a cancel part, so ptr 26 on align 13 doesn't become 39 (26+13) needlessly
                assert(newptr + uBytes < SvPVX(opBuffer) + SvLEN(opBuffer));
                SvPOK_on(opBuffer);
                sv_chop(opBuffer,newptr); //there is an assumption that sv_chop will not realloc since PVX is long enough
                assert(opBufferPtr == SvPVX(opBuffer));
            }
            assert(uAlign?((DWORD_PTR)SvPVX(opBuffer) % uAlign == 0) :1);
        ......................................................
    Some code was removed: the sanity checks on whether the SV opBuffer is safe to use, and all mentions of the self object.
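
    For readers without a Perl build at hand, the over-allocate-and-advance idea can be sketched in plain C. This is only an illustration under assumptions: align_up and alloc_aligned are hypothetical helper names, malloc stands in for the Perl allocator, and no sv_grow or sv_chop is involved.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper (not part of Perl's API): return the first address
 * at or after p that is a multiple of align. Works for any align > 0,
 * not only powers of two. An already aligned pointer is returned as-is. */
static char *align_up(char *p, size_t align)
{
    size_t rem;
    assert(align > 0);
    rem = (size_t)((uintptr_t)p % align);
    return rem ? p + (align - rem) : p;
}

/* Hypothetical helper: allocate nbytes usable bytes whose start address
 * is a multiple of align. *base receives the pointer to pass to free(). */
static char *alloc_aligned(size_t nbytes, size_t align, char **base)
{
    /* Over-allocate by align - 1 bytes, the worst-case padding, because
     * the alignment of the returned block cannot be predicted. */
    *base = malloc(nbytes + align - 1);
    return *base ? align_up(*base, align) : NULL;
}
```

    In the XS code above, sv_grow plays the role of the over-allocation and sv_chop moves the start of the string buffer to the aligned address, so the SV itself keeps track of the original block.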

    Here is a second place where I use aligned memory with Perl: https://github.com/bulk88/perl5-win32-api/blob/master/Callback/Callback.xs#L650

    TLDR: Use Visual C or Intel C

    Although this isn't an answer to your problem: I hate Mingw64/Org/GCC, mostly for its debugging abilities. Windows uses pdb symbol files, and they work, forwards and back. VC 2008 with VS 2003 GUI. VC 2003 with VS 6 GUI. VC 2003 with VS 2008 GUI. ICC with VS 2008 GUI. It is a wonderful life. With the GCC family, symbols are in the binaries, so you have to hack Makefile.PL and MakeMaker and provide MY:: method overrides to keep the symbols in the binaries. The next problem is that the gdb binaries have poor to no binary compatibility forwards or back, or across w64 vs org vs cygwin. If any binary with GCC symbols wasn't made by the exact gcc binary that came in the same archive/tarball as the gdb binary you are using, expect the gdb process to freeze (Task Manager kill time), throw errors, fail to set breakpoints, fail to find the original source code, or sit in really long timeouts on the order of dozens of seconds with no CPU usage in the debugger or the debuggee. If you are using a win32 GUI for gdb, expect more freezes or timeouts than at the gdb prompt (I've tried 2 different ones). Even if the latest GCCs from the 3 GCC-on-Win32 families generate cross-compatible symbols now (and IDK if they do), that doesn't fix the 10 years of older Win32 GCCs out in the field and the symbol formats of their binaries. I think there is a principle in the GCC world that different GCC builds don't need to be binary compatible, since each OS is compiled with 1 C compiler, and all binaries on that OS were made with the same C compiler. Compiling your own GCC with different compile flags and replacing the system GCC, then asking why nothing works, is simply insane and "not supported". But that is what we have on Windows with 3 different GCC forks.

    Originally Strawberry used org. Then, seeing the political problems that org was causing and the w64 project being more progressive, Strawberry switched to w64. ActivePerl's GCC PPM package is still org 3.4.5 from 2005ish as of 2013. So as of today Win32 Perl still has 2 different GCCs in circulation, with great animosity between their developers. Cygwin GCC has always been Cygwin GCC. I don't think anyone is crazy enough to try to build XS modules using Cygwin GCC that will load into a native Perl process. Org was a fork of Cygwin GCC from 1998. I am not sure how often code is shared between Cygwin GCC, Org and W64. But I've had to create ifdefs for all 3 compilers due to header differences and different spec files compiled into each GCC fork.
      Use Visual C or Intel C

      AFAIK, those do not have support for 128bit integers.

      In any case, my aim was to get the module working in Strawberry Perl so that anybody could use it easily. Switching to a different compiler was not an option.

Re^3: Debugging XS modules under Strawberry perl
by BrowserUk (Patriarch) on Sep 07, 2013 at 00:23 UTC

    See also http://www.gamedev.net/topic/482230-mingw-dll-used-in-mingw-and-vc-app-memory-alignment/.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Actually, I was able to solve the issue by declaring the int128 integers stored in perl-malloc'ed memory as follows:
      typedef int128_t int128_t_a8 __attribute__ ((aligned(8)));
      typedef uint128_t uint128_t_a8 __attribute__ ((aligned(8)));
      Now gcc generates MOVDQU instructions to load them. MOVDQU is just 40% slower than MOVDQA, so the performance degradation is probably unnoticeable for a Perl program.
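
      A small sketch of how such a typedef might be exercised, assuming a 64-bit gcc (or clang) where __int128 is available. The int128_t typedef here is an assumption standing in for the module's own definition, and incr_a8 and demo_misaligned_store are hypothetical names used only for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumption: the module defines int128_t roughly like this on 64-bit gcc. */
typedef __int128 int128_t;
/* Same type, but the compiler may only assume 8-byte alignment for it. */
typedef int128_t int128_t_a8 __attribute__ ((aligned(8)));

/* Hypothetical helper: increment a 128-bit value that lives at an address
 * guaranteed to be 8-byte aligned, but not necessarily 16-byte aligned. */
static void incr_a8(int128_t_a8 *p)
{
    *p += 1;
}

/* Store, modify and reload a value at an address that is 8 modulo 16.
 * Returns 1 if the value survives the round trip, 0 otherwise. */
static int demo_misaligned_store(void)
{
    char *raw = malloc(48);
    uintptr_t a16;
    int128_t_a8 *p;
    int ok;

    if (!raw)
        return 0;
    /* Round up to a 16-byte boundary, then step 8 bytes past it. */
    a16 = ((uintptr_t)raw + 15) & ~(uintptr_t)15;
    p = (int128_t_a8 *)(a16 + 8);
    *p = ((int128_t)1 << 100) + 41;
    incr_a8(p);
    ok = (*p == ((int128_t)1 << 100) + 42);
    free(raw);
    return ok;
}
```

      Accessing the same address through a plain int128_t pointer would let the compiler assume 16-byte alignment again, so the _a8 type has to be used consistently for everything that lives in perl-malloc'ed memory.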

        You have a solution that is probably good enough.

        This is little more than a recurring idea that I've yet to use, but could be used for this.

        When writing a library to deal with relatively small 'value types' -- n-bit integers are a perfect example -- the time-costs and memory overheads of malloc()/free()(1) can be a substantial part of the aggregate costs of using the type. Indeed, C++11 attempts to deal with another part of this -- avoiding the process of allocating a new instance, copying the value of an old instance (transformed by the current operation) into the new instance and then discarding the old instance -- by the implementation of 'move semantics'. But that's a separate issue.

        ((1)Compiler malloc/free are bad enough, but Perl's is horrible. Leastwise on Windows.)

        A good solution to this is to create a pool of the value types and manage them yourself. As the allocations are all fixed size, the pool can be managed as a simple array of values with a pointer to the next free slot, and each unused slot pointing to the next free one.

        That gives 0 per-value memory overhead; 1 dereference and a fixup per allocation; and worst case a short pointer chain for free. It can also give good cache locality. And, guaranteed alignment! (An alternative implementation that uses a bitmap rather than a free-chain can be even more efficient.)
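
        A minimal sketch of such a free-chain pool in C, assuming gcc-style attributes. All names (slot_t, pool_t, pool_init, pool_alloc, pool_free, POOL_SLOTS) are made up for illustration, and the pool here is a fixed array rather than growable virtual memory.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS 256  /* 256 slots of 16 bytes: one 4 KiB page */

/* A slot holds either a live 128-bit value or, while it is unused,
 * the link to the next free slot. No per-value header is needed. */
typedef union slot {
    unsigned char value[16];
    union slot *next_free;
} slot_t;

typedef struct {
    /* Aligning the array aligns every 16-byte slot in it. */
    slot_t slots[POOL_SLOTS] __attribute__ ((aligned(16)));
    slot_t *free_head;
} pool_t;

/* Chain every slot into the free list, in address order. */
static void pool_init(pool_t *pool)
{
    size_t i;
    for (i = 0; i + 1 < POOL_SLOTS; i++)
        pool->slots[i].next_free = &pool->slots[i + 1];
    pool->slots[POOL_SLOTS - 1].next_free = NULL;
    pool->free_head = &pool->slots[0];
}

/* Pop the head of the free list; returns NULL when the pool is full. */
static void *pool_alloc(pool_t *pool)
{
    slot_t *s = pool->free_head;
    if (s)
        pool->free_head = s->next_free;
    return s;
}

/* Push a slot back onto the free list. */
static void pool_free(pool_t *pool, void *p)
{
    slot_t *s = p;
    assert(s != NULL);
    s->next_free = pool->free_head;
    pool->free_head = s;
}
```

        In this variant both allocation and release are a single pointer swap at the head of the chain. The pool itself must be placed in storage that honours its 16-byte alignment, for example a static or automatic variable.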

        On Windows this idea is made very simple by using VirtualAllocEx()(2) to reserve a chunk of virtual memory space, that lives completely outside the sight of the CRT (heap), that can be grown and shrunk in-place in system page sized chunks.

        So, you reserve (say) 1GB virtual memory(3), but only allocate the first page (4k). That's enough for an array of 256 128-bit integers, but if you need more, you can efficiently, dynamically extend that array to accommodate 67 million more, without any need to copy or move any of the existing values.

        ((2) I don't know/can't find the equivalent system function for *nix. I thought for a while valloc() might be it; but that seems to have been deprecated and replaced by a call that has all sorts of silly restrictions (must be compatible with the standard free() in order to be POSIX compliant.))

        ((3) You might need to be a little more conservative on 32-bit systems; but on 64-bit there is ~16 million times as much address space available as any of the current crop of processors can actually address, so there is no penalty for spreading ourselves around a bit.)


        typedef int128_t int128_t_a8 __attribute__ ((aligned(8)));

        Struck the same issue messing with the __float128 type (Math-Float128) ... at least, the solution you provided worked for me, too.

        What I find interesting is that there is a need to apply this measure only when using the 64-bit MinGW compiler. With the 32-bit MinGW compilers, there's no need.
        Mind you, Strawberry Perl's gcc-4.6.3 compilers (both 32-bit and 64-bit) create runtime crashes whenever quadmath.h's expq() function is called - I couldn't reproduce that problem anywhere else.

        Anyway, salva, I'm glad you found an answer to the problem - I hate to think how long it might have taken me to find this solution (let alone decipher what the problem was in the first place).

        Cheers,
        Rob