Re^2: when to c, when to perl

by mpeg4codec (Pilgrim)
on Jul 25, 2008 at 19:04 UTC (#700204)

in reply to Re: when to c, when to perl
in thread when to c, when to perl

Although C wins out of the gate when it comes to memory usage, never forget to consider the flexibility that Perl provides. Sometimes you have so much data that even switching to C wouldn't solve your memory woes. Enter BerkeleyDB: by simply tying a hash or array to a file, you immediately drop memory usage to almost nothing while leaving your implementation otherwise completely intact.

Perl gives you the flexibility to try this, test if the performance is adequate, and have a completed, tested solution, all with the inclusion of a single line of code. Even in matters of memory, Perl can have a leg up on C.
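For anyone who hasn't tried it, here is a minimal sketch of what that single line can look like. It assumes the DB_File module (Perl's interface to Berkeley DB 1.x-style files) and the underlying library are installed; the file name and the toy word list are made up for illustration.

```perl
use strict;
use warnings;
use Fcntl;                       # provides O_RDWR and O_CREAT
use DB_File;                     # assumption: DB_File and libdb are available
use File::Temp qw(tempdir);

# Hypothetical scratch location for the on-disk hash.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/counts.db";

# The "single line": from here on, %count is backed by the file.
tie my %count, 'DB_File', $file, O_RDWR | O_CREAT, 0666, $DB_HASH
    or die "Cannot tie $file: $!";

# The rest of the program treats %count like an ordinary hash.
$count{$_}++ for qw(apple pear apple fig apple);

my $apples = $count{apple};      # number of times 'apple' was seen
my $keys   = scalar keys %count; # number of distinct words

untie %count;
```

One caveat worth knowing before relying on this: a DB_File tie stores flat strings only, so hashes of arrays or hashes of hashes need a serialization layer (MLDBM is one option) on top.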

Replies are listed 'Best First'.
Re^3: when to c, when to perl
by TGI (Parson) on Jul 25, 2008 at 19:53 UTC

    There absolutely are ways to reduce memory usage in Perl programs. And in most cases they are sufficient.

    However, in cases where the volume of data to be handled is large (e.g. processing DNA sequences) or the system is resource constrained (e.g. a small ARM Linux-based appliance), it makes sense to consider a C implementation.

    Let's consider an example where I have confronted this issue. I've been working on software for an ARM system that has only 32 MB of RAM and no virtual memory. One can write a simple socket-based server in C that has much smaller memory needs than the equivalent Perl server. The downside is that it is harder to make the C server do anything useful. I wrote my server application in Perl first, and it fits nicely on the system. However, I know that I can get rid of a fair amount of overhead if I rewrite the server in C. I won't do that unless it becomes absolutely necessary. Perl is much easier to debug, extend and maintain.

    In the case above, adding BerkeleyDB to the mix would just make things worse. In data-intensive applications, a dbfile is often the answer. Other times, simple things like using the C-style three-part for loop instead of for (0..$#bigArray) can make a big difference.
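To make the loop comparison concrete, here is a small sketch of the two forms side by side. The array contents are a stand-in. Note that perlop says current implementations do not build a temporary list when a range is used directly in a foreach loop, so the saving mostly applies to old perls; measure before rewriting anything.

```perl
use strict;
use warnings;

my @bigArray = map { $_ * 2 } 1 .. 1000;   # stand-in for a large array

# Range form: on old perls this could build the whole index list first.
my $sum_range = 0;
for my $i (0 .. $#bigArray) {
    $sum_range += $bigArray[$i];
}

# C-style (three-part) form: only a single index scalar is kept around.
my $sum_cstyle = 0;
for (my $i = 0; $i <= $#bigArray; $i++) {
    $sum_cstyle += $bigArray[$i];
}
```

Both loops compute the same sum; the only difference is how the indices are produced.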

    C has its own problems. Did I forget to free that chunk of memory? Oops--memory leak. Just because it is possible to write more efficient code in C does not mean that C code will always be more efficient. So, you've got to be selective with what you try to do in C.

    My basic position is that premature optimization is a bad thing. Writing code in C instead of Perl is an acceptable, if costly (in terms of labor), speed and memory optimization. So, before applying that optimization, make damn sure there aren't any less expensive tricks you can try that will yield acceptable results AND be sure that your C code will, in fact, be smaller/faster than the Perl it replaces.

    TGI says moo

      Could you please elaborate on why adding BerkeleyDB to the mix would be worse than a dbfile? Let me give a for instance. I have a bacterial genome of 5 million bases. I want to break this up into kmers of various sizes. I need to pay attention to both DNA strands, so I record the orientation in which I see the kmer. For each kmer I want to see if it has already been seen. If so, increment the number of times the kmer was seen, record where it was seen, and record the orientation of the kmer. Now go through the kmers and record which ones have low sequence complexity: lots of repeats or other characteristics that might make identifying overlapping kmers difficult, for instance. So now I have a number of different hashes, usually keyed to the kmer sequence, which I will use in the next part of my project. I will probably also sort all the kmers in my hash/database to speed up the search process in the next steps. I may even precompute a series of such kmer databases ahead of time for different sizes, simply to help with processing the data.
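The indexing step described above might be sketched roughly as follows, entirely in memory. The sub names (revcomp, index_kmers), the structure of the index, and the toy sequence are all assumptions made for illustration, not the poster's actual code.

```perl
use strict;
use warnings;

# Reverse complement of a DNA string (A<->T, C<->G).
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Index every k-mer on both strands: a count plus [position, strand] pairs.
# A k-mer equal to its own reverse complement is recorded twice per position.
sub index_kmers {
    my ($genome, $k) = @_;
    my %index;
    for (my $pos = 0; $pos <= length($genome) - $k; $pos++) {
        my $fwd = substr($genome, $pos, $k);
        my $rev = revcomp($fwd);
        for my $hit ([$fwd, '+'], [$rev, '-']) {
            my ($kmer, $strand) = @$hit;
            $index{$kmer}{count}++;
            push @{ $index{$kmer}{seen} }, [$pos, $strand];
        }
    }
    return \%index;
}

my $index = index_kmers('ACGTAC', 3);   # toy sequence, not real data
```

Because the values here are nested structures, this index could not be dropped straight into a tied dbfile; the counts and positions would have to be packed into strings or stored through a serializing layer first.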

      Now take a set of 140 million kmers from a next generation sequencing platform - the population covers both strands of the DNA. First question is how to quickly identify how many times each kmer from the reference genome was covered with a kmer from the next gen sequencing data. Are all the reference kmers represented or are some of them over or under represented?
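The coverage question boils down to a hash lookup per read kmer. A minimal sketch with made-up toy data follows; the variable names are hypothetical, and a real run would stream the reads from a file rather than hold them in an array.

```perl
use strict;
use warnings;

# Toy data: reference k-mers and a handful of read k-mers (both hypothetical).
my @reference_kmers = qw(ACG CGT GTA TAC);
my @read_kmers      = qw(ACG ACG GTA TTT ACG);

# Start every reference k-mer at zero so uncovered ones remain visible.
my %coverage = map { $_ => 0 } @reference_kmers;
my %novel;    # read k-mers that are absent from the reference

for my $kmer (@read_kmers) {
    if (exists $coverage{$kmer}) { $coverage{$kmer}++ }
    else                         { $novel{$kmer}++ }
}

# Reference k-mers that no read covered at all.
my @uncovered = sort grep { $coverage{$_} == 0 } keys %coverage;
```

The kmers that land in %novel are the ones left over for the next step, where differences from the reference are examined.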

      Now we look for differences in the remaining kmers - do these represent base changes, base deletions or base insertions as compared to the reference genome. Again, you're doing a lot of hashing, counting and inferring based on this data.

      Finally you get to create standardized files that will allow you to represent this information in a standard file format for display in a series of genome browsers.

      My original thought had been that BerkeleyDB would be more robust for this type of large scale data processing project. Can you provide more information on why a dbfile is more effective in this approach than BerkeleyDB?

      yet another biologist hacking perl....

        tilly is correct. For a case where the main issue is simply small RAM, adding another library will just use more memory and exacerbate the problem.

        Your situation is different. BerkeleyDB is certainly up to the task. I used the term dbfile as a generic term for databases like Berkeley and GDBM. I'm sorry if this sloppy usage was confusing.

        I don't work much with large scale data processing tasks like this, but it looks like you are on the right track with your plans.

        Using BerkeleyDB is a way to offload memory intensive and speed critical operations to a C library through XS, which is exactly what has been widely advocated in this thread. The best thing is that someone else has already written and carefully optimized this code. What could be better than that?

        TGI says moo

        You misread that. He wasn't saying that dbfiles are better than BerkeleyDB. It would make little sense to say that since BerkeleyDB is nothing more or less than a specific kind of dbfile.

        Instead he said that in the case he described, BerkeleyDB would be a bad choice. In other cases a dbfile would be a good choice. So it is all about why different cases call for different choices.

        For your problem a dbfile is a reasonable choice. But I'll note that, if you can, you really want to pre-sort your data and then store it in BTrees as much as possible. That will massively improve your locality of reference, which will reduce disk seeks. And I guarantee that with that problem you're being killed on disk seeks.
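A rough sketch of the pre-sort-then-BTree idea, assuming DB_File is installed. The file name and toy kmers are made up. With $DB_BTREE and the default comparison, keys are kept in lexical order, so sorted input tends to touch neighbouring pages rather than jumping around the file.

```perl
use strict;
use warnings;
use Fcntl;                       # provides O_RDWR and O_CREAT
use DB_File;                     # assumption: DB_File and libdb are available
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/kmers.btree";   # hypothetical file name

# Same tie as for a hash file, but with the BTree access method.
tie my %kmer_count, 'DB_File', $file, O_RDWR | O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie $file: $!";

my @kmers = qw(TAC ACG GTA CGT ACG);      # toy input
$kmer_count{$_}++ for sort @kmers;        # insert in sorted order

my @stored_order = keys %kmer_count;      # BTree returns keys in order
my $acg          = $kmer_count{ACG};

untie %kmer_count;
```

For 140 million kmers the sort itself would have to happen outside Perl's memory (an external sort on a flat file, for example); the sketch only shows the storage side.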

      Not to get too far off the topic of Perl and C, but this type of constrained environment is where Forth really shines.

      I don't intend to knock Perl or C, but there's room for more languages in our tool belts. Speaking of room, one place you often can't spare much space is on smaller platforms. That's where a precision screwdriver like Forth can be handier than the proverbial Swiss Army Chainsaw that is Perl. Besides, learning more languages is fun!

      That pretty much concludes the point I wanted to make, but I'll ramble a bit about Forth now for those unfamiliar with it...

      Forth has in some ways the same relationship to assembly as Perl has to C. It's some abstractions and a library management system over the top of the simpler language, which performs about as well for many tasks. Think of a postfix assembly language for a stack machine which has instructions for input and output, if/then statements, a strong macro management system for extensibility, and automated memory management. You're well on your way to Forth.
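For readers who think in Perl, the postfix stack-machine idea can be illustrated with a toy evaluator. This is not Forth and not how any real Forth system is implemented; it is only a sketch of the execution model, with a hypothetical %words dictionary borrowing a few Forth word names.

```perl
use strict;
use warnings;

# Dictionary of "words": each one operates on the shared data stack.
# No underflow checking; a real system would guard against an empty stack.
my %words = (
    '+'    => sub { my $s = shift; my $y = pop @$s; my $x = pop @$s; push @$s, $x + $y },
    '-'    => sub { my $s = shift; my $y = pop @$s; my $x = pop @$s; push @$s, $x - $y },
    '*'    => sub { my $s = shift; my $y = pop @$s; my $x = pop @$s; push @$s, $x * $y },
    'dup'  => sub { my $s = shift; push @$s, $s->[-1] },
    'swap' => sub { my $s = shift; @$s[-2, -1] = @$s[-1, -2] },
);

# Run whitespace-separated postfix source and return the final stack.
sub run {
    my ($source) = @_;
    my @stack;
    for my $token (split ' ', $source) {
        if (my $word = $words{lc $token}) { $word->(\@stack) }
        elsif ($token =~ /^-?\d+$/)        { push @stack, $token }
        else                               { die "Unknown word: $token\n" }
    }
    return @stack;
}

my @result = run('3 4 + dup *');   # (3 + 4) squared
```

Extending the language means adding entries to the dictionary, which is loosely the spirit of defining new words in Forth.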

      Some Forth systems are interpreted while others are compiled. Some are interpreted at the top level but compile their library entries. All tend to be very easy on program storage and working memory. Some are native systems for general-purpose platforms (PFE, bigForth, ForthCMP, SwiftForth), some are embedded in other code as extensions (FICL, Misty Beach Forth), and some are cross-compilers for everything from 8-bit microcontrollers to embedded servers. There are even Forth-native multi-core processors. There are several LGPL, GPL, and otherwise open licenses among the Forth systems, including the official GNU Forth, GForth. Some are commercial products, some are free for noncommercial use, and at least one (pForth) is public domain.

      As for expressive power, Forth has it, as this web server, these Sieves of Eratosthenes, and many other projects show. It's not always as expressive as Perl, but in some cases it's more expressive. It's often more expressive than C.

      The great thing about Forth is you can use it on pretty much every platform. The roughest thing about Forth may be that there are so many mutually almost compatible versions out there. So much for a committee-approved language standard being the way to ultimate portability, as Forth has one and Perl doesn't. However, there are fairly simple ways to get most source code for one Forth that doesn't include inline assembly to work on most other Forth systems.

      Some Forths have built-in TCP/IP and concurrency while others need libraries. Some have built-in access to C structs and can even call methods in C dynamic and static libraries natively, while others need libraries to do those things. There are libraries for Forth to do numerical analysis and scientific computing which are ported from Fortran. It can do desktop applications, audio, video, and more. It's most often used in embedded systems, from kitchen appliances to space programs.

      Perl is great and C is, too. If you're ever looking to learn a new language, though, Forth is interesting. It's quite simple once you're accustomed to postfix syntax. It's powerful for its simplicity. Programs written in it can be small, efficient in memory and CPU use, and somewhat portable. In many cases, a Forth program is smaller on disk and in memory than a comparable C program.

      I still have a hard time calling 32 megabytes of RAM "constrained" for an embedded system, although OTOH I realize that with a full OS it's actually more of a low-spec general-purpose server than truly "embedded". No matter why memory is a concern, though, you might want to check out Forth.
