Sharing a 300MB variable.

by foxops (Monk)
on Dec 26, 2002 at 19:22 UTC (#222392=perlquestion)

foxops has asked for the wisdom of the Perl Monks concerning the following question:

Hey friends,

I was just wondering if it is possible to share a 300 MB file in RAM between multiple Perl processes (in Win2k) without using sockets? If that is not possible, how difficult would it be to implement this on top of the current setup of loading from a file? Or (warning, non-Perl question) can I hold the file cached in RAM through the OS, instead of Perl?

Thanks a million.

Replies are listed 'Best First'.
Re: Sharing a 300MB variable.
by Nitrox (Chaplain) on Dec 26, 2002 at 20:30 UTC
      Thanks, that will work beautifully.
Re: Sharing a 300MB variable.
by em (Scribe) on Dec 26, 2002 at 21:09 UTC
    You may want to consider why you want to share a 300MB file in memory. Personally, I'd consider a file based or db based sharing system for anything this large.

    You can easily run out of memory and degrade performance depending on how you are accessing data.

    How are you processing the file? You don't want to slurp the entire file into a variable and then process it.

    my @data; @data = <FILE>; # this reads in the entire file

    If you slurp the entire file, this means each process has a copy of the 300MB file (i.e. 5 processes x 300MB = 1500MB + Perl runtime memory x5)!
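    A minimal sketch of the alternative, reading the file one line at a time so that each process holds only the current line in memory. The names `count_matches`, `$path`, and `$pattern` are made up for illustration and do not come from the original post.

```perl
use strict;
use warnings;

# Count the lines matching a pattern while holding only one line
# in memory at a time. $path and $pattern are hypothetical inputs.
sub count_matches {
    my ($path, $pattern) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $count = 0;
    while (my $line = <$fh>) {    # scalar context: one line per iteration
        $count++ if $line =~ $pattern;
    }
    close $fh;
    return $count;
}
```

    Called as `count_matches('mother.log', qr/INVOICE/)` (both arguments hypothetical), memory use stays roughly constant no matter how large the file grows.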

    Please note that runtime memory requirements for large data structures can be many times the size of the data on disk. I had a program work with a 100MB hash that took over 2GB of memory and swap!

    I would suggest you look into DB_File and BerkeleyDB and use a tied variable to a DB file.

    Using a variable tied to a file isn't bad in terms of performance (especially compared to your system slowly thrashing itself to disk).
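    As a rough sketch of what a tied variable looks like with DB_File: the hash below lives in a file on disk, and only the records you touch are pulled into memory. The helper name `open_db` and its `$db_path` argument are hypothetical. Note that DB_File does no locking of its own, so several processes writing at once would need some external lock.

```perl
use strict;
use warnings;
use Fcntl;      # provides O_RDWR and O_CREAT
use DB_File;    # provides the DB_File tie class and $DB_HASH

# Open (or create) an on-disk hash and return a reference to it.
# $db_path is a hypothetical file name chosen by the caller.
sub open_db {
    my ($db_path) = @_;
    my %h;
    tie %h, 'DB_File', $db_path, O_RDWR | O_CREAT, 0644, $DB_HASH
        or die "Cannot tie $db_path: $!";
    return \%h;
}
```

    After `my $db = open_db('motherlog.db');` the reference behaves like an ordinary hash (`$db->{$key} = $record;`), and `untie %$db;` flushes and closes the file.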

      The Cache::Cache module can be used via a file cache as well as a memory one.

      -Nitrox

Re: Sharing a 300MB variable.
by Beatnik (Parson) on Dec 26, 2002 at 22:35 UTC
    Well, there's always IPC::Shareable altho I don't see any positive tests for Win32 on CPAN Testers.

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
Re: Sharing a 300MB variable.
by demerphq (Chancellor) on Dec 26, 2002 at 23:19 UTC
    can I hold the file cached in ram through the OS,

    The OS will automatically cache the file, but I don't know that that will be useful to you. I wonder if you have considered setting up a RAM drive. It sounds like this may be a suitable solution.

    HTH

    --- demerphq
    my friends call me, usually because I'm late....

Re: Sharing a 300MB variable.
by John M. Dlugosz (Monsignor) on Dec 27, 2002 at 02:47 UTC
    In Win32, you can create memory-mapped files. Any file, whether using read/write or memory-mapping, is cached by the OS. So unless you are using raw pointer code that really benefits from not having read/write calls but accessing things transparently, it probably won't matter.
Re: Sharing a 300MB variable.
by em (Scribe) on Dec 28, 2002 at 03:07 UTC
    You might want to start transitioning towards database type solutions as a 'hint' that the MotherLog should become the MotherDB.

    I'd suggest looking at the DBI module and one of the following:

    • DBD::CSV - treat a text file like a SQL database
    • DBD::ODBC - driver for ODBC connections (i.e. Access, MS SQL Server, etc)
    I've used DBD::ODBC to connect to MS SQL Server and it worked great.

    Having several programs access this monster log file is going to be painful. It won't get better -- trust me on this. Start agitating on getting this done right. You'll save a few sanity points if you can get the 'powers that be' to start changing their infrastructure.

Re: Sharing a 300MB variable.
by richardX (Pilgrim) on Dec 28, 2002 at 05:30 UTC
    If you want the fastest development time solution, then continue with the RAM disk solution.
    If you want a more robust and scalable solution, then I agree with the others that suggest that you start migrating to a DB centric solution.
    I used to work for a major vendor of web log processing software, and our solution was definitely a DB backend.

    Here is a short list of potential bottlenecks when using DBs to import web logs, just as a FYI.

    • Import speed
    • Import munging (filtering out invalid web log entries etc.)
    • DB table size growth
    • Managing DB access from a Data Warehouse (DW) if you keep your logs for an extended period

    The up sides to a DB centric solution for web log processing are:

    • Scalability
    • Accessibility (Many different methods to read the data)
    • Flexibility (Integrate web log DB with other DBs like sales, surveys, email campaigns)
    • Analysis and Data Mining (Lots of COTS {Commercial off the Shelf} software available)

    Richard

    There are three types of people in this world, those that can count and those that cannot. Anon

Re: Sharing a 300MB variable.
by osama (Scribe) on Dec 27, 2002 at 09:44 UTC
    I have a few questions for you...
    1. Why do you want to share it in memory... is it slow to access...
    2. Do you always need the file as one 300MB string? Can it be parsed into smaller usable parts??
    3. Why not have the data parsed into smaller parts and put into a database???? It only seems logical...
    4. What is in these 300MB, what do you want to do with it??
    Maybe you/i/we could find a different approach to your problem... It does not seem logical to have a shared 300MB Variable (are you rendering an image??)
      The mystery 300MB file is a trimmed, translated, and sorted EDI dataset. Although it would make much more sense to divide the file up by daily transactions, it is just a little too risky for me, as the file is appended to by several other scripts. The file (which I lovingly refer to as "The Mother Log") is not only written to by several other Perl scripts, but it is also accessed by several cgi scripts (not all Perl), and queried by Access (lol). I would prefer to load the file into memory once a day, so that when the info is asked for it is grabbed out of RAM, instead of thrashing about the HD.

      UPDATE: I've decided for now that I'll rotate the log more often in order to keep it below 150MB, and setup a RamDisk that will mirror the backup file. I think this setup is best for now for several reasons:

      1. The log isn't meant to replace existing datasets, but it is meant to remove duplicate transactions in order to keep the file size down (one day's worth of EDI data for my company is around 700MB, the MotherLog will be about 150MB - and will contain a month's worth of transactions).
      2. I don't want to rearrange the data too much because the log has in the past been used to extract charge backs from suppliers breaking their contracts. I'd rather the file remain extremely simple so that errors can be recognized immediately.
      3. I'm not getting paid enough to reprogram this thing again :)

      Thanks everyone!
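      The duplicate removal mentioned in point 1 could be sketched roughly as below. This assumes one transaction per line and that byte-identical lines count as duplicates; the real EDI record layout is not shown in this thread, and `dedupe_file` with its arguments is a hypothetical name.

```perl
use strict;
use warnings;

# Copy $in to $out, keeping only the first occurrence of each line.
# Assumption: one transaction per line, identical lines are duplicates.
# Returns the number of lines written.
sub dedupe_file {
    my ($in, $out) = @_;
    open my $src, '<', $in  or die "Cannot read $in: $!";
    open my $dst, '>', $out or die "Cannot write $out: $!";
    my %seen;
    my $kept = 0;
    while (my $line = <$src>) {
        next if $seen{$line}++;    # skip lines we have already written
        print {$dst} $line;
        $kept++;
    }
    close $src;
    close $dst or die "Cannot close $out: $!";
    return $kept;
}
```

      Bear in mind that `%seen` keeps every unique line in memory, so for a very large log it may be worth keying on a shorter transaction identifier instead, if the records have one.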

        Given the disparate processes needing access, and that you need both read and write access, a ramdisc is by far your easiest and sanest option, provided you have the memory to maintain 300MB in RAM.

        Ram discs have minimal overhead, and the only change required by the applications would be for the paths to change. Even that could be eliminated if you can use a symbolic link. The only downside is the risk of data loss in the event that the server crashes.

        There are several ways that you could approach mitigating the data loss, some simplistic, others quite involved, depending on the level of reliability you need.

        An interesting thought, but I emphasise it's nothing more than a thought, would be to create a partition/filesystem on disc the same size as the ram disc, and use mirroring to reflect the ramdisc onto the filesystem. Whether there is any mileage in the idea depends on which OS, mirroring software etc. and whether the latter can be configured to use a ram disc.

        Attempting to use any form of caching is likely to be a problem, as caching only really benefits you if the same sections of the data are being repeatedly accessed. Given all the different processes that would be vying for cache space in your scenario, you're likely to slow things down with cache thrash rather than speed them up.


        Examine what is said, not who speaks.

        I would still think that a DB would be prime for this. Access is not too shabby at getting around DB's, and if nothing else, an export process should be able to keep a copy in Access format.

        The problem you should be having is not so much disk read time (although it should be ridiculous) but the actual retrieval of data from a 300 MB variable. I cannot understand why it would be risky to split the transactions. If they are transactions, they should be sequential and without dependency. (Don't you at least have to split the variable within your script??) I would think that even splitting transactions by some other method, like by date, would have some benefit.

        If nothing else, sorting your current records and putting them in a DB would help. Then just update the other scripts to write to a DB, not a file. If you still have one or two scripts that need the flat file, do an export on transaction. Less if you need less. Even halving the disk reads would have to help. A 300 MB dataset is roughly equivalent to 600 copies of Gulliver's Travels. I personally panic when datasets get in the 30-40 MB range.

        While caching might work, it is not a long-term solution. Heck, just maintaining a 300 MB file in RAM without swapping in Win2k means you need more than 1 GB of RAM. If that dataset grows too large.... Also, if you are noticing a lot of HDD thrashing, it is most likely because of swapping. For a standard file read, the process is quick, and one-time. The problem is that as you load that 300 MB dataset into memory, many things need to be swapped out. Windows has a very aggressive swapping system and will swap well before memory is full. Also, you start to contend with growing and shrinking the swap if it is dynamic in size.

        If the machine running this has less than 1 GB of RAM, look at your swapping performance. Another test is to manually set Virtual Memory to twice RAM and wait to see if the machine carps about running out of memory. If so, you need more memory; more swap is a losing proposition.

        ~Hammy

Node Type: perlquestion [id://222392]
Approved by fglock
Front-paged by diotalevi