PerlMonks  

There have been several posts recently by people looking to process large data files more efficiently than can be achieved using the default methods and standard paradigms perl provides.

There is a tendency hereabouts to say, "Don't optimise!". Throw bigger hardware at it. Use a database. Use a binary library. Code in C.

  • But if you're already using top-end hardware and it's still too slow, what then?
  • Databases are never quicker unless you can use some fairly simple criteria to make wholesale reductions in the volume of data that your application program needs to process.
  • If there is a binary library available to perform your task, with a perlish interface, that doesn't have a huge learning curve or require you to structure your entire program around its often Byzantine and/or non-intuitive data structures and control mechanisms, then you may be in luck.
  • If you have a C compiler, know how to code in C, and don't mind spending days (or weeks?) re-inventing complex data structures--or finding pre-existing ones and trying to make them work together--then C is an option. Oh, and be sure to take a crash (sic) course in using a symbolic debugger. Not that it will do you much good: almost all efficient C code makes extensive use of inlining through the macro preprocessor, and by the time you get into the symbolic debugger, what you see there bears little or no resemblance to the source code anyway.
  • Of course, you could learn C++...but life's short and a rusty knife across your own throat is more fun:)

Your other option is to look at making the best--i.e. most efficient--use of the facilities that Perl provides, in order to speed things up and/or use less memory.

As an example of the sorts of techniques perl provides for doing this, I'll use a recent post. Notwithstanding the fact that some people recognised the form of the data to be processed, and pointed out that there may well be an existing wheel tailor-made for that particular case, the question as asked has generic application beyond that specialisation. It is also simply stated, and makes for a good example to work from.

#! perl -slw
use strict;

my $ch;
open (FH, '<', $ARGV[ 0 ]) or die $!;

until ( eof(FH) ) {
    $ch = getc(FH);
}

close FH;

The task is to process every byte of a very large file in a reasonable amount of time. The example code supplied used fairly naive coding and, as a result, for the stated data volume (3GB) I project that it would take ~12 hrs 40 minutes to run on my hardware (2.66 GHz P4). It's possible that using a top-of-the-line PC this might be halved through sheer (single) processor power and otherwise faster hardware. You might reduce it to a third by multiprocessing, or bigger iron--maybe. But the cost in complexity of the former, and the simple cost of the latter, would be prohibitive in many situations. So what can we do to speed this up without resorting to that expense?

The first thing to notice is that since 5.6.2(?), perl has had the smarts built in to handle Unicode data in files. This is a good thing if you need to process Unicode data, but it exacts a penalty on those who do not. However, this penalty, along with some penalties imposed by the underlying C runtime for handling formatted ASCII data, can be quickly and easily side-stepped by specifying that you want to process the file as simple bytes. The mechanism is to specify ':raw' when opening the file.

#! perl -slw
use strict;

my $ch;
open (FH, '< :raw', $ARGV[ 0 ]) or die $!;

until ( eof(FH) ) {
    $ch = getc(FH);
}

close FH;

This simple change results in a reduction of the (projected) runtime from 12 1/2 hours to something under 6 (~5:45)! Not bad for the sake of typing 4 characters.

The next thing I looked at was getc(). Part of the stdio (emulation?), this has overhead built in--for example, to allow the eponymous ungetc().

What happens if we bypass this by using sysread(), to get one character at a time?

#! perl -slw
use strict;

my $ch;
open (FH, '< :raw', $ARGV[ 0 ]) or die $!;

while ( sysread( FH, $ch, 1 ) ) { }

close FH;

The result is a projected runtime of 1 hour 23 minutes!

Not bad! Down to 10% of the time with around 20 characters changed from the original program. That's a 900+% improvement, if you want to look at it that way.

Of course, the original problem required that it be possible to look at each pair of bytes separated by 500 bytes, and so far I haven't tackled this. It will obviously need a buffer of some kind, so rather than reading a single char at a time, I'll read 4096 bytes (a good choice on my OS/filesystem) into a scalar and then access the bytes. But how to access them? An often-used method of splitting a scalar into its bytes is split, so I'll try that.

#! perl -slw
use strict;

my( $buf, $ch );
open (FH, '< :raw', $ARGV[ 0 ]) or die $!;

while ( sysread( FH, $buf, 4096 ) ) {
    for $ch ( split '', $buf ) {
    }
}

close FH;

Ouch! That costs dear. As well as the cost of actually splitting the string, there is the overhead of building a 4096-element array, storing one byte per element. This is rapidly consumed, then the array is discarded and a new one built. There is also an intermediate list built by split. All this consumes substantial amounts of memory, and on a file this large it causes the garbage collector to run with increasing frequency as processing progresses. The upshot is that we're back to 4 1/2 hours of processing time, and I haven't yet added code to retain the last 500 chars for the required comparison. So, don't do that:)

Another popular method of separating a scalar into its chars is @chars = $scalar =~ m[(.)]g; but this suffers the same problems as split.
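For concreteness, here is a sketch of that regex method applied to a single buffer-full of data. One detail worth noting (my addition, not from the one-liner above): you need the /s modifier if '.' is to match newline bytes as well.

```perl
#! perl -slw
use strict;

# Sketch of the regex method, applied to one buffer-full of data.
# The /s modifier makes '.' match newline bytes too; without it,
# newlines silently vanish from @chars.
my $buf = "some\nbuffer\0of bytes";
my @chars = $buf =~ m[(.)]gs;

# @chars now holds one element per byte -- the same throwaway
# per-buffer list that makes split so expensive.
print scalar @chars;    # prints 20
```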

Let's try accessing the bytes using substr.

#! perl -slw
use strict;

my( $read, $buf, $ch );
open (FH, '< :raw', $ARGV[ 0 ]) or die $!;

while ( $read = sysread( FH, $buf, 4096 ) ) {
    for my $p ( 0 .. $read - 1 ) {
        $ch = substr $buf, $p, 1;
    }
}

close FH;

Time taken: 31 minutes 40 seconds. Nice. Close to a third of the time we took when processing the file byte-by-byte, despite the fact that the OS is undoubtedly doing some buffering for us under the covers. Making 4096 times fewer system calls has benefits, even if the OS is caching the same size buffer of data. We've also avoided the need to construct and garbage-collect three-quarters of a million 4096-element arrays by not splitting the string.

But we still haven't retained the 500-character buffer required. Several suggestions were made in the thread for how to do this, mostly involving pushing to and shifting from an array. We've already seen that splitting the string to an array imposes a substantial penalty. You will have to take my word for it that maintaining a buffer of 500 chars by pushing onto one end of an array and shifting off the other takes a loooooong time. I can't give figures: even using the 30MB file that I used as the basis for projecting most of the times in this post, I gave up waiting after more than 3 hours. The original code, which I project would process a 3GB file in 12 1/2 hours, took less than 8 minutes for 30MB.
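For reference, the push/shift buffering might look like this sketch. I've wrapped it in a subroutine so the window logic is clear; the subroutine name, the $sep parameter, and the callback are my own illustrative inventions, not code from the thread.

```perl
#! perl -slw
use strict;

# Hypothetical sketch of the push/shift approach: maintain a window
# of $sep + 1 chars, pairing the oldest char with the newest, then
# shifting the oldest off. The constant shifting (plus the split)
# is what makes this approach so slow.
sub each_pair_push_shift {
    my( $fh, $sep, $on_pair ) = @_;
    my( $buf, @window );

    while ( sysread( $fh, $buf, 4096 ) ) {
        for my $ch ( split '', $buf ) {
            push @window, $ch;
            if ( @window > $sep ) {
                $on_pair->( $window[ 0 ] . $window[ -1 ] );
                shift @window;
            }
        }
    }
}
```

With $sep set to 500 and a callback that processes each pair, this does the same job as the sliding buffer--only very much slower.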

Another suggestion mooted was to use a sliding buffer. This basically involves filling a buffer, processing as far as you can, copying the required residual bytes from the end of the buffer to the beginning, re-filling the remainder of the buffer, and processing again. Repeat until done. It's harder to describe than to code.

#! perl -slw
use strict;

my( $read, $buf, $pair );
my $offset = 0;

open (FH, '< :raw', $ARGV[ 0 ]) or die $!;

while ( $read = sysread( FH, $buf, 4096 - $offset, $offset ) ) {
    my $len = $offset + $read;
    for my $p ( 500 .. $len - 1 ) {
        $pair = substr( $buf, $p - 500, 1 ) . substr( $buf, $p, 1 );
    }
    $buf = substr( $buf, $len - 500, 500 );
    $offset = 500;
}

close FH;

That's the entire program. The variable $pair iteratively takes on the value of each pair of bytes, separated by 500 chars, throughout the entire input file. And the time taken to process 3GB?

A second or two under 31 minutes (*). That's less than 4% of the original runtime--or, if you prefer, a 2500+% improvement. And this is while retaining the 500-char buffer that the original program didn't have.

And note. The whole 3GB file has been processed through a single, 4096 byte buffer. The process consumed less than 2MB of memory total. There was no memory growth and the GC never had to run.

So, what price optimisation? You might get close to these results using the original program if you spent $50k on new hardware.

You certainly wouldn't achieve it by putting the data into a DB.

You might even improve the performance by writing the final program in C, but I doubt that a brute-force conversion of the original into C would be as quick. Even then, you are stuck with coding the rest of the program--whatever needs to be done with those pairs of chars--in C as well, devoid of all the nice things perl gives us.

It could be pointed out that the original program was overly naive, but my counter to that is that without knowledge of where the inherent costs arise in a program, ALL programs are naive. And one way to become familiar with the perpetual trade-offs between ease and efficiency is to experiment. Or to put it another way, to optimise.

Besides, optimising is fun!

* Actual time to process a 3GB file: 30:57 minutes. This is within 1% of the value projected from a run of the same code on the 30MB file used elsewhere. Several runs of each version of the code were timed using the 30MB file, and the projections were confirmed by a single run on a 300MB file for all but the first two versions, which would have taken longer than I was prepared to wait:)
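The projection itself is just linear scaling of the small-file timing by the size ratio, along these lines (the 18.57s figure is my illustrative assumption--roughly what ~31 minutes for 3GB implies for 30MB--not a number reported above):

```perl
#! perl -slw
use strict;

# Scale a timing measured on a small file by the size ratio to
# project the runtime for the full file.
my $small_secs = 18.57;        # hypothetical timing for 30MB
my $ratio      = 3072 / 30;    # 3GB / 30MB = 102.4
my $projected  = $small_secs * $ratio;

printf "projected: %d min %d sec\n", $projected / 60, $projected % 60;
```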


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail



In reply to Optimising processing for large data files. by BrowserUk
