http://www.perlmonks.org?node_id=474438

ralphch has asked for the wisdom of the Perl Monks concerning the following question:

Hey,

I'm running a photo gallery solution (ImageFolio) coded in Perl, running as plain CGI. Every time a request is received, the software reads all data (category descriptions, etc.) from text files (some over 600KB), as well as the photo IPTC information. So having many concurrent users brings the system to a SERIOUS crawl. I'm aware that the software (ImageFolio) is completely non-scalable in the way it has been developed, and I'm working on replacing the whole photo gallery solution with something new and efficient.

However, in the meantime, I'm considering a hardware upgrade for the server. Is there any hardware combination that could possibly speed up the current system (e.g. SCSI drives, more RAM), so that Perl works faster and processes files faster with many concurrent users?

Also, would changing the way I read files in Perl make much of a difference? Right now it uses a while loop to keep RAM usage to a minimum, rather than loading the whole file into an array. I've also considered adding a caching system to reduce the processing load of each request.

Thanks,
Ralph

Re: Speeding up large file processing
by gryphon (Abbot) on Jul 13, 2005 at 02:31 UTC

    Greetings ralphch,

    Wow. Well, not knowing the current state of your code-line, I'll just give general pointers. Take a look at FastCGI for starters. Ultimately, a mod_perl solution would be better, but it may be more difficult to migrate your code. Also, if I were you, I'd look into moving the data from large text files into a database. MySQL is my database of choice, but there are alternatives that are free, good, and fast.
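
    To give a feel for the FastCGI route: the expensive parsing can happen once per process instead of once per request. A minimal sketch, assuming a tab-delimited description file and a made-up load_descriptions() helper (the path and parameter names are only placeholders):

        use strict;
        use CGI::Fast;

        # Parse the big flat file once, when the FastCGI process starts up;
        # the hash then persists across all requests served by this process.
        my %descriptions = load_descriptions('/path/to/catdesc.txt');

        while (my $q = CGI::Fast->new) {          # one loop iteration per request
            my $cat = $q->param('direct') || '';
            print $q->header('text/html');
            print $descriptions{$cat} || 'No description available';
        }

        sub load_descriptions {
            my ($file) = @_;
            my %desc;
            open my $fh, '<', $file or die "Can't open $file: $!";
            while (my $line = <$fh>) {
                chomp $line;
                my ($name, $text) = split /\t/, $line, 2;
                $desc{$name} = $text if defined $text;
            }
            close $fh;
            return %desc;
        }

    Under plain CGI this buys nothing, since the process exits after every request; the win comes from the persistent process.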

    My general rule of thumb is not to spend time (or money) on different or additional hardware if there's a fairly straightforward software solution to improve efficiency.

    gryphon
    Whitepages.com Development Manager (DSMS)
    code('Perl') || die;

Re: Speeding up large file processing
by GrandFather (Saint) on Jul 13, 2005 at 02:27 UTC

    If you need to process the whole file then slurping it (reading the whole thing in one hit) is likely to be faster than reading a line at a time. On the other hand, if you can bail out early after reading a small portion of the file then that may save a heap of time.

    It may help to take a look at Memoize for some easy-to-implement caching.
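
    A rough illustration of both ideas together — slurping the file in one read, with Memoize caching the result per file name (the sub name is made up). The cache only lives as long as the process, so under plain CGI it helps within a single request; under FastCGI or mod_perl it persists:

        use strict;
        use Memoize;

        memoize('slurp_file');   # repeat calls with the same name return the cached copy

        sub slurp_file {
            my ($file) = @_;
            open my $fh, '<', $file or die "Can't open $file: $!";
            local $/;            # undef the input record separator: read the whole file in one hit
            my $contents = <$fh>;
            close $fh;
            return $contents;
        }

        my $data = slurp_file('/path/to/catdesc.txt');   # slow the first time, cached afterwards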

    Hardware upgrades may help a little, but improving the algorithm can help a lot.

    It may help to implement a database to cache some of the information alongside ImageFolio and circumvent some of the overhead that way.


    Perl is Huffman encoded by design.
Re: Speeding up large file processing
by davidrw (Prior) on Jul 13, 2005 at 02:31 UTC
    Do you have to read all the text files? Is this all read-only in terms of the text files (I'm thinking you could dump it all into an SQLite or MySQL database)? I don't know how the performance will differ, but maybe try DBD::CSV or DBD::AnyData to use DBI to access your data. That might at least help in transitioning to a better solution. Also, if you can use Class::DBI, there's Class::DBI::Cacheable so that you control caching of retrieve calls.
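
    Something along these lines might work against the tab-separated description file without changing its format at all (the file name and column names are guesses at the layout):

        use strict;
        use DBI;

        my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
            f_dir        => '/path/to/imagefolio/data',   # directory holding the flat files
            csv_sep_char => "\t",                         # the files are tab-delimited
            RaiseError   => 1,
        });

        # Map the existing text file onto a "table" with named columns
        # (depending on your DBD::CSV version you may also need skip_first_row,
        # since the data file has no header line).
        $dbh->{csv_tables}{catdesc} = {
            file      => 'catdesc.txt',
            col_names => [qw(catname catdescription)],
        };

        my $sth = $dbh->prepare(
            'SELECT catdescription FROM catdesc WHERE catname = ?'
        );
        $sth->execute('some/category');
        my ($description) = $sth->fetchrow_array;
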
Re: Speeding up large file processing
by sgifford (Prior) on Jul 13, 2005 at 05:39 UTC

    If you're devoted to the idea of buying hardware to solve the problem, I'd first look at adding enough RAM to put everything on a RAMdisk. That would speed the I/O up dramatically, though I/O may not be your problem.

    Consider profiling to see what the slow parts of your program are; that should help you figure out whether speeding up I/O will help very much.
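
    For example, the profiler that ships with Perl will show where the time actually goes, assuming the CGI script (the name below is a placeholder) can be run from the command line with suitable parameters:

        perl -d:DProf gallery.cgi     # runs the script under Devel::DProf and writes tmon.out
        dprofpp tmon.out              # summarizes which subroutines ate the most time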

Re: Speeding up large file processing
by ZlR (Chaplain) on Jul 13, 2005 at 09:00 UTC
    Hello ralphch,

    What's your system ?
    If Unix, run vmstat and iostat and see what's going on during heavy load (a 5-second sampling interval is a good start; monitor a long run of your system in both busy and idle states if you can).
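
    For example, on Linux (flag names differ a little between Unix flavours):

        vmstat 5        # every 5 seconds: watch swap activity and the 'wa' (I/O wait) CPU column
        iostat -x 5     # extended per-disk statistics: utilisation and service times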

    I'm guessing that since you read files, disk access may be the limiting factor (in which case: striping, RAID, SCSI, FDDI, a ramdisk, ...). But since it's always the same files, it may very well be that they are already cached in RAM after the first access, so disk performance could be dismissed as a problem. sar will give you some cache hit stats.

    It's important to monitor this data, because a hardware upgrade may very well be completely useless!

Re: Speeding up large file processing
by radiantmatrix (Parson) on Jul 13, 2005 at 13:32 UTC

    Historically, the solution to the type of problem you're describing would be to write a small server application that intelligently caches the files, and to turn your application into a client that sends its data queries to that small server.

    In the spirit of major code re-use, I suggest using something like MySQL or DBD::SQLite2 (a self-contained, lightweight, file-based database and driver).
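
    As a sketch of the SQLite route, a one-off import script could look roughly like this (the file, table and column names are invented):

        use strict;
        use DBI;

        # Build a self-contained database file next to the gallery data.
        my $dbh = DBI->connect('dbi:SQLite2:dbname=gallery.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });

        $dbh->do('CREATE TABLE category_descriptions (catname TEXT, description TEXT)');
        $dbh->do('CREATE INDEX cat_idx ON category_descriptions (catname)');

        open my $fh, '<', 'catdesc.txt' or die "Can't open catdesc.txt: $!";
        my $ins = $dbh->prepare('INSERT INTO category_descriptions VALUES (?, ?)');
        while (my $line = <$fh>) {
            chomp $line;
            my ($catname, $description) = split /\t/, $line, 2;
            $ins->execute($catname, $description) if defined $description;
        }
        close $fh;
        $dbh->commit;

    The CGI then does an indexed lookup per category instead of scanning a 600KB file on every hit.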

    <-radiant.matrix->
    Larry Wall is Yoda: there is no try{} (ok, except in Perl6; way to ruin a joke, Larry! ;P)
    The Code that can be seen is not the true Code
Re: Speeding up large file processing
by ralphch (Sexton) on Jul 13, 2005 at 17:34 UTC
    Hey, thanks for all of your replies and suggestions! I'll start looking into these different alternatives and keep you updated. Thanks again, Ralph
Re: Speeding up large file processing
by ralphch (Sexton) on Jul 14, 2005 at 19:10 UTC
    Hi, I was able to get it running a lot faster by hardcoding some information rather than having it read from the gigantic flat-file databases. I also recoded the way it loads and matches information in these files, loading each file into an array and looping over it as few times as possible. Thanks again! Ralph
Re: Speeding up large file processing
by Anonymous Monk on Jul 15, 2005 at 14:06 UTC

    Here's the new code that's making it run a lot faster now. The script reads a category descriptions file with 4000 lines and retrieves the description for each of the categories to be displayed on a page.

    open(FILE, "$catdesc");
    my @desc = <FILE>;
    close(FILE);
    chomp @desc;

    %category_descriptions = ();   # start with an empty hash (was mistakenly set to a hash reference)

    ## Create a hash with the category names to display.
    foreach $directory_name (@subdirectories) {
        # the lexical below shadows the loop variable; it holds just the name part
        my ($date_a, $directory_name) = split(/\t/, $directory_name);
        if ($directory_name ne '') {
            $category_descriptions{"$FORM{'direct'}/$directory_name"} = 1;
        }
    }

    ## Set the description for each category
    foreach $line (@desc) {
        my ($catname, $catdescription) = split(/\t/, $line);
        $catdescription =~ s/^\s+//;    # trim leading blanks...
        $catdescription =~ s/\s+$//;    # trim trailing blanks...
        next if (!$catdescription);     # skip line if no description
        if ($category_descriptions{$catname} == 1) {
            $category_descriptions{$catname} = "<br><$font>$catdescription</font><br>";
        }
    }

    I'd really appreciate knowing if there's an even faster way of doing this. Having it all migrated to MySQL would be great, but the system would need so many modifications that it would be just as easy to replace the whole thing.

    Thanks,
    Ralph

      How often does the information you are extracting from the directory structure change?

      Seems to me that instead of rebuilding the HTML representing the structure every time the CGI script is called, you should be maintaining a pre-built file that contains the HTML.

      When a request arrives to display the data, you just read that pre-formatted file and present it.

      When changes are made to the directory structure, you run a server process that re-creates the html file.
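
      A rough sketch of that arrangement (the file names and the build sub are only illustrative): the rebuild step writes to a temporary file and renames it into place, so a request never sees a half-written file, and the CGI just streams the result.

          # rebuild_index.pl -- run whenever the directory structure changes
          use strict;

          my $html = build_category_html();
          open my $out, '>', 'category_index.html.tmp' or die "write: $!";
          print {$out} $html;
          close $out;
          rename 'category_index.html.tmp', 'category_index.html'
              or die "rename: $!";              # atomic swap on the same filesystem

          sub build_category_html {
              # placeholder: in reality this would walk the directories and the
              # description file once and emit the finished category HTML
              return "<html><body>...</body></html>\n";
          }

          # In the CGI, presenting the data is then just:
          print "Content-type: text/html\n\n";
          open my $in, '<', 'category_index.html' or die "read: $!";
          print while <$in>;
          close $in;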


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
        Hi, thanks for your reply!
        Yeah, I thought of doing something like that to speed things up even further. However, the information changes randomly at different times since I'm not the one updating the content. For now, this solution is running at least fast enough for the current load. Regards, Ralph.