PerlMonks  

How to make your Perl 30% faster

by PetaMem (Priest)
on Nov 16, 2004 at 12:18 UTC ( #408091=perlmeditation )

Hello Monks,

I have been using the "fastest Perl interpreter ever" (at least in my experience) for quite some time now. It seems stable, so I'd like to share that knowledge with you.

Some time ago Nicholas gave an excellent talk on the topic, "When Perl is not fast enough". He mentions that compiling your own perl may be an option and reports speed gains of between 5% and 14%.

With the recent improvements in GCC and its autovectorization feature, I thought I could spend a Sunday finding out what it would bring me.

The Results:

I fetched the sources for both GCC 3.4 and Perl 5.8.5, built GCC 3.4 first, and then used it to compile Perl 5.8.5 with the options -O3 and -msse2. Use -msse if your CPU doesn't support SSE2. The "autovectorized" perl is consistently about 30% faster than the plain-vanilla perl that comes with a standard Linux distribution (compiled for the Pentium, I suppose). The lowest speedup I saw was 20%, for store/retrieval, and the highest about 40%, for some list manipulations.

I can tell you that 30% is significant and makes recompiling worthwhile. Moreover, GCC does not seem to autovectorize every case it could, so we can probably expect further improvements. I also suppose that real GCC experts could find more optimizations for the P4 architecture, but neither my time nor my knowledge allowed for more experiments.
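One way to reproduce such a build non-interactively is sketched below. The exact Configure invocation is not given in this post, so the -D switches, the CC/PREFIX/OPTIMIZE variable names and the helper function are assumptions for illustration; only the compiler path and the -O3 -msse2 flags come from the text. The function merely prints the commands so they can be reviewed before anything is run.

```shell
#!/bin/sh
# Sketch: print the commands for building perl with custom optimizer flags.
# Assumed to be run from the top of an unpacked perl-5.8.5 source tree.
CC="${CC:-/usr/local/bin/gcc}"        # the self-built GCC 3.4
PREFIX="${PREFIX:-/usr/local}"        # install location (assumption)
OPTIMIZE="${OPTIMIZE:--O3 -msse2}"    # use "-O3 -msse" on CPUs without SSE2

# Hypothetical helper: emits the build steps, one command per line.
perl_build_commands() {
    cat <<EOF
sh Configure -des -Dcc=$CC -Dprefix=$PREFIX -Doptimize='$OPTIMIZE'
make
make test
make install
EOF
}

perl_build_commands
```

Piping the output into sh -e would actually run the build. Keeping make test ahead of make install is the cheap check that the extra flags did not break the interpreter.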

Update:

More specifications about environment and compared interpreters:

As you may or may not see from the data below, the environment is (SuSE) Linux 8.2 and the CPU is a Pentium 4-M at 1.8 GHz. The benchmark was our application for natural language processing/understanding, which performs some heavy operations on N-ary trees (our own list-based implementation, not the one on CPAN). For example, the "normalization" of a Swedish lexicon (removing redundant data, sorting trees, etc.) takes 423 seconds with the standard perl and 288 seconds with the optimized one. This is a pretty hard benchmark, as it shuffles data around extensively. We also have results for gathering information about a lexicon where the speedup is a factor of ten (10!): about 150,000 lexicon entries are iterated, and the number of meanings per entry is evaluated and added to a total. For the Swedish lexicon this takes 20 seconds with the unoptimized perl and 2 (!!) seconds with the optimized one.
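To relate these timings to the "30%" in the title, it helps to separate "less run time" from "speedup factor". The small sketch below does the arithmetic for the two measurements quoted above; the function name speedup is made up for this example.

```shell
#!/bin/sh
# speedup OLD NEW: print the run time saved in percent and the speedup factor.
speedup() {
    awk -v old="$1" -v new="$2" 'BEGIN {
        printf "%.1f%% less time, %.2fx faster\n", (old - new) / old * 100, old / new
    }'
}

speedup 423 288   # lexicon normalization: prints "31.9% less time, 1.47x faster"
speedup 20 2      # counting meanings:     prints "90.0% less time, 10.00x faster"
```

So the normalization run needs roughly a third less time, which is where the 30% figure comes from, while the counting run is the factor-of-ten case.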

This is the "fast baby":
(none):/tmp # /usr/local/bin/perl -V
Summary of my perl5 (revision 5 version 8 subversion 5) configuration:
  Platform:
    osname=linux, osvers=2.4.26-mh2, archname=i686-linux-thread-multi
    uname='linux sol 2.4.26-mh2 #7 mon aug 23 11:30:25 cest 2004 i686 unknown unknown gnulinux '
    config_args=''
    hint=previous, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='/usr/local/bin/gcc', ccflags ='-fno-strict-aliasing
    -pipe -I/usr/local/include -D_LARGEFILE_SOURCE
    -D_FILE_OFFSET_BITS=64',
    optimize='-O3 -msse2',
    cppflags='-fno-strict-aliasing -pipe -I/usr/local/include -fno-strict-aliasing -pipe
-I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64 -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64'
    ccversion='', gccversion='3.4.2', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='/usr/local/bin/gcc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldb -ldl -lm -lcrypt -lutil -lc
    perllibs=-lnsl -ldl -lm -lcrypt -lutil -lc
    libc=, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-Wl,-E'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'


Characteristics of this binary (from libperl):
  Compile-time options: MULTIPLICITY USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
  Built under linux
  Compiled at Oct 26 2004 23:48:45
  @INC:
    /usr/local/lib/perl5/5.8.5/i686-linux
    /usr/local/lib/perl5/5.8.5
    /usr/local/lib/perl5/site_perl/5.8.5/i686-linux
    /usr/local/lib/perl5/site_perl/5.8.5
    /usr/local/lib/perl5/site_perl

This is the perl it was compared against:
Summary of my perl5 (revision 5.0 version 8 subversion 0) configuration:
  Platform:
    osname=linux, osvers=2.4.20-4gb-athlon, archname=i686-linux-thread-multi
    uname='linux builder 2.4.20-4gb-athlon #1 mon mar 17 17:56:47 utc 2003 i686 unknown unknown gnulinux '
    config_args='-ds -e -Dprefix=/opt/perl-5.8.0_t -Dman1dir=/opt/perl-5.8.0_t/man/man1
-Dman3dir=/opt/perl-5.8.0_t/man/man3 -Uinstallusrbinperl -Dusethreads -Di_db -Duseshrplib=true'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
  Compiler:
    cc='cc', ccflags ='-D_REENTRANT -D_GNU_SOURCE
    -fno-strict-aliasing -D_LARGEFILE_SOURCE
    -D_FILE_OFFSET_BITS=64',
    optimize='-O3',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -fno-strict-aliasing'
    ccversion='', gccversion='3.3 20030226 (prerelease) (SuSE Linux)', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
    alignbytes=4, prototype=define
  Linker and Libraries:
    ld='cc', ldflags =' -L/usr/local/lib'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -ldl -lm -lpthread -lc -lcrypt -lutil
    perllibs=-lnsl -ldl -lm -lpthread -lc -lcrypt -lutil
    libc=, so=so, useshrplib=true, libperl=libperl.so
    gnulibc_version='2.3.2'
  Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags='-rdynamic
-Wl,-rpath,/opt/perl-5.8.0_t/lib/5.8.0/i686-linux-thread-multi/CORE'
    cccdlflags='-fpic', lddlflags='-shared -L/usr/local/lib'


Characteristics of this binary (from libperl):
  Compile-time options: MULTIPLICITY USE_ITHREADS USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
  Built under linux
  Compiled at Jan 12 2004 10:13:27
  @INC:
    /opt/perl-5.8.0_t/lib/5.8.0/i686-linux-thread-multi
    /opt/perl-5.8.0_t/lib/5.8.0
    /opt/perl-5.8.0_t/lib/site_perl/5.8.0/i686-linux-thread-multi
    /opt/perl-5.8.0_t/lib/site_perl/5.8.0
    /opt/perl-5.8.0_t/lib/site_perl

As you can see, even the old perl was compiled with -O3, so one cannot say it was not optimized at all.

I'd like to reiterate that I, too, saw this as an experiment that would probably fail, because I was reluctant to sacrifice stability for speed. But I now use the optimized Perl on a regular basis, and it has proven to work with only one side effect: it's faster. :-)

Bye
 PetaMem
    All Perl:   MT, NLP, NLU

Re: How to make your Perl 30% faster
by bluto (Curate) on Nov 16, 2004 at 17:00 UTC
    Using this probably depends on how much risk you can live with. Personally, I've always hesitated turning on extra optimization during compiles. For example, on AIX there is a caveat in the man pages for the built-in compiler of "The -O3 specific optimizations have the potential to alter the semantics of a user's program", which doesn't give me any warm feelings.

    I'm not saying gcc is subject to this since I haven't used it in a while, but the fact that they don't enable these cool features by default seems to indicate they understand there is some risk involved.

      Many of the optimizations make it hard to debug the program. Most of us don't debug the perl binary, so that's a non-issue.

      There are cases where optimization can break code. This is one good reason to have a good test suite. Usually it happens around particularly hairy code (IIRC, Duff's Device tends to trip up optimizers).

      Also, higher optimization levels may start trading off time for space, which might make someone still running Perl on an old VAX angry.

      In the general case of a regular Perl programmer, running on a reasonably up-to-date machine, higher optimization is fine.

      "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

        Usually it happens around particularly hairy code.

        Kind of like the C code that perl itself is written in? :-)

        Also, higher optimization levels may start trading off time for space, which might make someone still running Perl on an old VAX angry.

        Who is interested in an old VAX? ;-)

        But now that you mention it:
        -rwxr-xr-x    2 root root   11588 2004-11-17 09:49 /opt/perl-5.8.0_t/bin/perl
        -rwxr-xr-x    2 root root 1213952 2004-11-17 09:49 /usr/local/bin/perl
        
        The first one looks really weird to me.

        Bye
         PetaMem
            All Perl:   MT, NLP, NLU

      The interpreter that the "autovectorized" one was compared against was also compiled with -O3, as you can see in the updated node.

      Everything runs stably, at least on the P4. I don't know about other architectures, but SSE2 is not of much interest there, I guess. Mac/PPC users could probably see similar results if GCC supports the AltiVec engine.

      Bye
       PetaMem
          All Perl:   MT, NLP, NLU

Re: How to make your Perl 30% faster
by samtregar (Abbot) on Nov 16, 2004 at 17:48 UTC
    Very cool. It would be interesting to see what effect this would have on the performance of large-scale Perl apps, like Krang. I've been meaning to try compiling Perl with Intel's C compiler. From what I've heard it's very good.

    -sam

Re: How to make your Perl 30% faster
by Anonymous Monk on Nov 16, 2004 at 18:12 UTC
    I see we have Gentoo users here.

Re: How to make your Perl 30% faster
by Luca Benini (Scribe) on Nov 16, 2004 at 19:41 UTC
    There are some mistakes in your explanation of the performance increase. (I work in computational optimization and I'm a newbie on gcc@...)

    tree-ssa will only be available from GCC 4.0 (the next major release). The flags -msse and -msse2 may or may not increase your performance, but they don't activate vectorization; they only ask the compiler to use SIMD instructions (see GCC's info pages). You can also use -mfpmath=sse,387 (again, see GCC's info pages).

    30% is a good result, but how did you obtain this number? (Old perl version, old flags, old compiler flags, etc.) Recompiling can nevertheless increase your performance (Slack + Gentoo rule ;))

    P.S. The Intel C compiler vectorizes well, but:
    1) No source code
    2) No good code for non-Intel CPUs
    3) No AMD64 support
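Since -msse and -msse2 only tell the compiler which SIMD instruction set it may use, the flag has to match what the CPU really supports. The following is a minimal sketch for picking one on Linux; the helper name pick_sse_flag is made up for this example, and reading the "flags" line of /proc/cpuinfo is an assumption that holds on x86 Linux kernels.

```shell
#!/bin/sh
# pick_sse_flag "FLAGS": given a space-separated CPU flag list, print the
# strongest of -msse2 / -msse that the CPU supports (nothing if neither).
pick_sse_flag() {
    case " $1 " in
        *" sse2 "*) echo "-msse2" ;;
        *" sse "*)  echo "-msse" ;;
    esac
}

# On Linux the flag list can be read from /proc/cpuinfo (first CPU only).
if [ -r /proc/cpuinfo ]; then
    cpu_flags=$(awk -F': ' '/^flags/ { print $2; exit }' /proc/cpuinfo)
    echo "suggested flag: $(pick_sse_flag "$cpu_flags")"
fi
```

A binary built with -msse2 will not run correctly on a CPU without SSE2, so this check matters most when the perl is copied to other machines.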

      Well, actually I only wanted to see the effect of using the SIMD instructions of the SSE engine. I probably misunderstood the GCC pages, but I thought that having the compiler use the SSE engine to vectorize sequential code automatically IS autovectorization.

      I'm actually considering recompiling Perl once again with more optimizations for my specific CPU.

      Bye
       PetaMem
          All Perl:   MT, NLP, NLU

Re: How to make your Perl 30% faster
by talexb (Canon) on Nov 16, 2004 at 22:07 UTC

    Fascinating research, but until the results are reproducible, this sounds like a pipe dream.

    Where are the benchmarks for these claims? What platform was used?

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

Re: How to make your Perl 30% faster
by Luca Benini (Scribe) on Nov 17, 2004 at 11:56 UTC
    Well, the faster one was compiled with gccversion=3.4.2 and the slower one with 3.3 20030226 (prerelease), on two different kernels (!= kernel headers), and they are two different versions of perl...

    The slower one was configured with -Dusethreads; a threaded build can (though not always) reduce the performance of CPU-bound applications on non-SMP machines.

    On the other hand, 30% is more or less the estimated gain from using GCC 3.4 instead of 3.3, so this can still be considered a useful test in that direction.

    If performance is your objective, you can search for a perl script called cpuflags and use its output as compiler flags, but be careful: some of the flags it suggests can break perl's internal semantics. If you try it, please report your success or failure.

    Have you tried perlcc?
      Hi,

      the two different kernels don't matter in this case, as they only indicate which kernel each perl was compiled on. Both interpreters actually run on the same kernel (2.4.26) now.

      Also, I don't see any issue with comparing perls built with different compilers. Basically, we tried to be as pragmatic as possible: given Perl X (where X is the parameter space), what can we do to make it faster?

      It was mentioned here that disabling threads helps a lot (which I can confirm), but also that some Linux distributions, for example, ship perl with threads enabled by default.

      I've recompiled Perl once again, this time also for the specific P4 architecture and with -mfpmath=sse. The result: the binary is about 200k smaller, and the execution times are as follows:

      Normalization of a Swedish Lexicon in our Language Processing Suite:
      
      own 5.8.0-threaded (old): 241 seconds 
      system 5.8.0+threads:     196 seconds
      "fast baby" from above:   178 seconds
      "new fast baby" (P4):     161 seconds
      
      You know, from a practical point of view it is not important how the speed gain was achieved; what matters is THAT it can be achieved, and that the way perl is compiled can matter.
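Timings like the ones in the table above could be collected with a small harness along these lines. This is only a sketch: the two interpreter paths are the ones that appear in this thread, while the script name normalize_lexicon.pl and the helper elapsed_seconds are made up for illustration. Whole-second resolution is enough for runs that take minutes.

```shell
#!/bin/sh
# elapsed_seconds CMD [ARGS...]: run a command and print the wall-clock
# seconds it took (whole seconds only, via date +%s).
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Interpreter paths as they appear in this thread; the script name is made up.
SCRIPT="${SCRIPT:-normalize_lexicon.pl}"
for perl in /opt/perl-5.8.0_t/bin/perl /usr/local/bin/perl; do
    # Skip interpreters (or a workload) that are not present on this machine.
    [ -x "$perl" ] && [ -f "$SCRIPT" ] || continue
    echo "$perl: $(elapsed_seconds "$perl" "$SCRIPT") seconds"
done
```

Running the same workload several times on an otherwise idle machine, and reporting all runs, makes the numbers easier for others to reproduce.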

      And one poster was right: the GCC we used does not do autovectorization yet. I'm very interested in what that will bring.

      perlcc is not an option, as it cannot compile even the simpler modules... Not even the -o option.

      Bye
       PetaMem
          All Perl:   MT, NLP, NLU

The magic of threads
by barbie (Deacon) on Nov 17, 2004 at 14:59 UTC
    There is one single characteristic that is significantly improving your performance: the lack of ithreads support. From your printouts:

    The "fast baby":

    Characteristics of this binary (from libperl): Compile-time options: MULTIPLICITY USE_LARGE_FILES PERL_IMPLICIT_CONTEXT

    and the installed version:

    Characteristics of this binary (from libperl): Compile-time options: MULTIPLICITY USE_ITHREADS USE_LARGE_FILES PERL_IMPLICIT_CONTEXT

    The gain from removing threads can vary between 10-40% in the tests we've done. However, you are not comparing like for like. The installed version is based on 5.8.0 and your fast version is based on 5.8.5.
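A quick way to see which kind of perl you are running is to ask it for the useithreads configuration value. The sketch below assumes a perl in PATH; is_threaded is a made-up helper name, while perl -V:useithreads is the standard way to query a single configuration value.

```shell
#!/bin/sh
# is_threaded "CONFIG-LINE": succeed if a line as printed by
# `perl -V:useithreads` says that ithreads are enabled.
is_threaded() {
    case "$1" in
        "useithreads='define';"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check whichever perl is first in PATH, if there is one.
if command -v perl >/dev/null 2>&1; then
    line=$(perl -V:useithreads)
    if is_threaded "$line"; then
        echo "threaded perl: $line"
    else
        echo "non-threaded perl: $line"
    fi
fi
```

For a like-for-like comparison, both interpreters should give the same answer here; the installed 5.8.0 in this thread was configured with -Dusethreads, the fast 5.8.5 was not.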

    Unfortunately RH, and probably several other distros, ship a threaded Perl, even though ithreads were not recommended for production environments when 5.8.0 was released. From what I understand, ithreads support is much more stable now.

    --
    Barbie | Birmingham Perl Mongers user group | http://birmingham.pm.org/

      Your point is a good one, but the compiler version and perl version can also account for the performance increase.
        They can indeed. However, in this case the biggest factor is the non-use of ithreads. In the tests we did here, we compared like for like, and they showed a significant improvement when we simply recompiled without ithreads support. If we were to add compiler flags I'm sure we could improve further but, as others have mentioned, adding further optimisations could have side effects.

        --
        Barbie | Birmingham Perl Mongers user group | http://birmingham.pm.org/
