Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

large perl module

by minek (Novice)
on Mar 04, 2010 at 21:09 UTC ( #826814=perlquestion: print w/replies, xml ) Need Help??

minek has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have a huge module that's being used by a mod_perl script.
The module is huge because it contains giant hash table with all usa/can area codes, phone number prefixes and mobile carrier info.

I need fast lookups of this kind of data, and this is the fastest way.

The size of the source code .pm file is 2.5 MB, the Devel::Size::total_size() says the size of the hash in RAM is 11MB.

Do you think these facts will have any undesirable impact on performance of the system (beside RAM usage)?

Or I just worry too much...

I've never seen .pm files of this size...

Thanks, Dan

Replies are listed 'Best First'.
Re: large perl module
by CountZero (Bishop) on Mar 04, 2010 at 21:32 UTC
    There is no way of knowing other than by trying. A modern computer has easily at least 1 Gbyte of RAM so your data-structure which is a little over 1% of the RAM, shouldn't matter much, unless of course it takes away the very last free bytes and your OS starts swapping things in and out of the harddisk. At that moment you wish you had put al that data into a real database.

    Indeed a hash is the fastest way of looking up that data, but do you really need split millisecond speed?

    Also, keeping the data in the module itself is wasteful of memory: once the module has finished loading and compiling, you now have the whole of the module in memory PLUS the data in some variables. If you go for the data to be loaded from a data-file (say a YAML or JSON file) then you can save at least the amount of data you now store in your module.

    Update:I now note that your module runs in a "mod_perl" environment. You have to be very careful in designing your programs so that you avoid the big module being private to each child process, since these child processes are regularly reaped and respawned by the web-server (see the Apache MaxRequestsPerChild Directive) and thus will reload your big module again and again and again. Ideally your data-structure should remain shared over the server's lifetime. I guess that on a heavily loaded system having to regularly reload your data will cost you more time and resources than looking-up the data in a database does.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      ... and thus will reload your big module again and again

      I think the effects of this largely depend on how things are configured. Let's say MaxRequestsPerChild is set to 10000 and an average request takes 500 ms(*) to be delivered, then the total time of service of one process will be >= 5000 secs. Assuming the reload time of the module is in the order of 5 secs, this will still be a ratio of only 1 : 1000 (in other words, acceptable).

      ___

      (*)  I hadn't yet seen the OP's reply above when guessing that number (which would be more typical when serving end users over the web).

        Within Apache it might be of course different,
        but judging from running a test perl script from command line, the load time of the script/module is below 0.5s
        And the MaxRequestsPerChild is actually set to 10000.
        So it should be fine..

        Thanks, Dan
      Any hints on doing this (shared 'module')?
      Right now mod_perl script looks like this:
      use GiantModule(); use CGI; .... my $carrier = GiantModule::lookup($phone_number); .... exit 0; ###################### That's how the .pm looks like: module GiantModule; my %carrier = ( .... GIANT HASH .... ); sub lookup { my $phone=shift; ..... return $carrier{.....}; }
Re: large perl module
by almut (Canon) on Mar 04, 2010 at 21:26 UTC
    Do you think these facts will have any undesirable impact on performance of the system (beside RAM usage)?

    No, not if you have enough RAM  (but if the system starts paging out memory, there will be an impact on performance, of course...).

    P.S.: the copy-on-write semantics when trying to share memory between (mod_perl-)processes is generally a problem with large Perl data structures, because unless you're extremely careful and know what you're doing, Perl tends to cause write accesses to lots of the memory pages as a result of updating internal housekeeping data (flags, caching strings/numbers from implicit conversions(*), etc.), even if you only ever read the data. In other words, the pages often become unshared rather soon. In such cases, at least try to avoid implicit type conversions (like stringification of numbers).

    ___

    (*)
    use Devel::Peek; my $x = 42; Dump $x; print "$x\n"; # <-- conversion to string behind the scenes Dump $x; __END__ SV = IV(0x62de18) at 0x604fa0 REFCNT = 1 FLAGS = (PADBUSY,PADMY,IOK,pIOK) IV = 42 42 SV = PVIV(0x606130) at 0x604fa0 REFCNT = 1 FLAGS = (PADBUSY,PADMY,IOK,POK,pIOK,pPOK) IV = 42 PV = 0x624fb0 "42"\0 CUR = 2 LEN = 8

    The cached string (PV), the modified flags, etc. have caused write accesses at the memory location of the respective C struct that holds the scalar $x, even though it had seemingly only been read in "$x\n".  Similar issues arise when taking references of $x, which causes a change of REFCNT.

Re: large perl module
by DrHyde (Prior) on Mar 05, 2010 at 12:53 UTC
    Sounds like you're doing something remarkably similar to Number::Phone::UK::Data. That originally had a humungous hash in it. It got faster and smaller when I switched to embedding a DBM::Deep database in the module, as a __DATA__ segment. Check out what I've done in that module, and we should probably collaborate further too, as I also have Number::Phone::NANP::* modules in the same distribution ...

      It depends on the thing you are searching - that is how are you searching it. I've tried DBM::Deep for prefix search exactly because of the speed it's docs mention. However, when tested, even from a RAM disk (aka /dev/shm) it was at least two/tree orders of magnitude (not sure exactly) slower than loading up everything in HoHoH... http://en.wikipedia.org/wiki/Trie and searching by going "up the tree".

      Of course, I had the luxury or all requests going through one process that's doing prefix search and routing requests depending on it. Which might also be a solution for the original question.

      You could implement the mod_perl handler only to pass on the requests, well put in work queue, and wait for results in results queue. And have one process (or several, possibly on several different servers) that do the actual (is it prefix?) searches - reading requests from the in/work queue, and putting them back in the out_queue.

      The trie implementation was able to find something like 50K random prefixes per second in a loop - from the pool of 45 to 50K prefixes in the database (loaded on startup into large HoH trie). 50.000 per second should outperform any web server. And that's on x2 AMD with 1 GB or RAM ...

      You could use Memcache for both input queue, and output - implementation (in Perl!) can be seen here: http://3.rdrail.net/blog/memcached-based-message-queues/

      PS - I've been contracting for the past year, for an company that's in SMS gateway business. So I get to tweak code that searches prefixes a lot ;)


      Have you tried freelancing/outsourcing? Check out Scriptlance - I work there since 2003. For more info about Scriptlance and freelancing in general check out my home node.
      Hi Dave, I actually use couple of your modules in my project, and I'm doing exactly what you did for UK mobile data, but for USA/CAN. It's also an SMS messaging server.
      I couldn't find any reliable data for the carrier lookup for NANP numbers, and I ended buying the data from a commercial supplier.

      Instead of DBM I'm using my own object persistance module, which works basically the same as DBM::Deep but faster. I stored the data in one data file per area code.

      Each data file was on average 20-40kB. The lookups were doing fine and were fast, the only problem was with IMPORTING huge CSV files with e.g. 20k phone numbers.
      The whole import procedure had to be finished within 20s, including saving them in the DB, checking if they unique, etc..., and for 10k numbers, it means 10k lookups - reading on average 300MB of data.
      At this point it was taking on average 70s, so I moved the data into something like this:
      use constant _nanp_area => { 201 => { 4143 => 267, 206 => 357, .... }, 604 => { ... }, .... }; # { # { area_code1 => { prefix => carrier_id, ... }, # { area_code2 => { prefix => carrier_id, ... }, # }
      So it's basically a hash of hashes of 3 or 4 digit prefixes and corresponding carriers.
      I didn't do exact measurements, but it's fast enough, and works for us just fine. 10k phone numbers gets imported and stored in DB within 14s.
      The lookup function first tries to find the carrier based on area code and 4 digit prefix, and if undef, then with 3 digit prefix.
      At this point everything seems to work OK and fast, but I will keep my eye on it...
Re: large perl module
by jrsimmon (Hermit) on Mar 04, 2010 at 21:31 UTC

    While I suspect your statement "this is the fastest way" woulnd't hold up under heavy scrutiny, if you have enough system resources and it's working, what's the concern?

    You could always move the data to a data file (better design), but unless you put it in a database or some other "lookup on demand" solution, you'll still end up with the large hash and it won't really make a difference.

      Well, this is supposed to be 'lookup on demand' system providing lookup services to other servers.
      Since it's doing it all the time, there is no point of loading the data over and over again from external storage.
      The data was before in a data file (many of them actually), but that made the disk too busy.... It seems like these data files were not being cached on the OS level.
      For each 10k lookups, perl had to read approx 2gb of files. 10k lookups were taking (when executed in a loop) 35s. Now, with the .pm loaded, it takes 10s.
      The server has 8GB of ram, and it's using (according to top) only 3GB.
        providing lookup services to other servers
        A database server does exactly that and much more efficient than a "converted" web-server. Indexing, caching, ... are all highly optimised in a good database server.

        CountZero

        A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        Moving the data to a data file doesn't mean loading it every time a request comes in. As pointed out below, having all the data in your .pm at best means you have it duplicated (loaded into memory in the .pm, and in whatever variable holds it). Worse, any changes to the data require changes to your source code, which is a design no-no.
Re: large perl module
by BrowserUk (Pope) on Mar 06, 2010 at 16:44 UTC
    1. Have you considered using nested arrays rather than nested hashes?

      Arrays use far less memory that hashes as they don't need to store the keys. The downside is that they aren't sparse.

      But by (say) subtracting 100 from your 3-digit area codes, you eliminate a large chunk of unused space.

    2. Have you considered persisting the db to disk using Storable instead of cvs files.

      Why reparse the csv and reconstruct the db every time?

    I did a couple of quick tests.

    In the first I construct a HoHs with keys 100..999 and 100..4999 respectively and a 4 char string as the values.

    C:\test>826814 Total size in ram: 15132924 1e6 lookups took 1.619 size on disk (excluding blocking): 65340000 Time to construct hash from (ram) file: 14.578

    In the second I used the same ranges of numeric keys and the same data. This time I used a single array, indexed by area_code - 100. I then packed the "info" numbers as 16-bit integers, into strings. The strings are 'indexed' (via substr) using the prefix-code - 100 *2. The results are:

    C:\test>826814-b Total size in ram: 8871552 1e6 lookups took 0.813 size on disk (excluding blocking): 8822720 Time to construct hash from storable (ram) file: 0.039

    That's 1/2 half the memory usage; 1/8th the disk usage; 1/2 the lookup time; and a tiny fraction of the load time. Though that last figure is deceptive because there was no IO involved (in either test).

      Hi, this is very valuable post, I appreciate your input.
      Yes, I tried Storable first, hash solution is faster.
      I think it's worth to explore your solution with arrays, or some kind of trees in more detail.
      Few explanations:
      The hash structure us never rebuilt. It's static (use constant), the data never changes.
      The issue is: to import a list of unknown phone numbers, and based on the area code and 3 or 4 digit prefix, determine the name of carrier for this phone number.
      Carrier names (or rather IDs) are kept in a hash of hashes (two level) - that's the structure that is used for lookups, and is big in size. First 'index' level is area code, second one is 3 or 4 digit prefix.
      Hash is used for lookups only. The phone data is stored somewhere else (it happens to be object data store based on Storable).
      With hashes, the speed of the lookup is pretty good, I have no numbers but:
      reading 500k of CSV with 12k phone numbers, parsing it, doing carrier lookup for each single number, removing redundant numbers and storing the numbers with carrier info in the data store takes now 14s, which is sufficient.

        How many uniq carrier IDs are there?

Re: large perl module
by JSchmitz (Canon) on Mar 05, 2010 at 04:49 UTC
    Have you checked out this node? 159645 Hopefully it will be of some assistance - cheers

    Jeffery
Re: large perl module
by Xiong (Hermit) on Mar 06, 2010 at 08:26 UTC

    I'm unable to think of a good reason, including speed, not to use a database for this.

    I may be missing something; it wouldn't be the first time.

      On the contrary, I'm unable to to think why relational database server (or embedded solution) would be
      a) more reliable
      b) faster
      c) easier to maintain
      d) faster to develop
      e) cheaper
      f) use less resources (RAM)
      than hard-coded perl module. This data is static, and seldom changes (once a year?).
      The thing is that we do need network based (e.g. HTTP) lookups for other server to make, but even more important is some kind of library (a module that is), that will be used by other (perl) scripts (web and backend) for batch processing.
      E.g. a CSV file with 50k phone numbers to be imported over the Net, and processed. IT doesn't make sense to do SQL query for each single number. A library module is needed.
      Also, we simply do not use relational DB server for this project, and it's not worth to set up one, just for these lookups.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://826814]
Front-paged by almut
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (8)
As of 2020-07-03 13:45 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?