It depends on what you are searching, that is, how you are searching it. I've tried DBM::Deep for prefix search precisely because of the speed its docs mention. However, when tested, even from a RAM disk (i.e. /dev/shm), it was at least two or three orders of magnitude slower (not sure exactly) than loading everything into a HoHoH trie (http://en.wikipedia.org/wiki/Trie) and searching by walking the tree.
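The HoH-trie approach can be sketched in a few lines. This is a hypothetical minimal version (the names `add_prefix` and `longest_prefix` and the sample prefixes are mine, not from the original code): each digit descends one level in a hash of hashes, and a special key marks where a complete prefix ends.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %trie;

# Insert a prefix, one digit per level of the hash of hashes.
sub add_prefix {
    my ($prefix) = @_;
    my $node = \%trie;
    for my $digit (split //, $prefix) {
        $node = $node->{$digit} ||= {};
    }
    $node->{-end} = $prefix;    # mark a complete prefix at this node
}

# Walk the trie digit by digit, remembering the longest prefix seen.
sub longest_prefix {
    my ($number) = @_;
    my $node = \%trie;
    my $match;
    for my $digit (split //, $number) {
        $node = $node->{$digit} or last;
        $match = $node->{-end} if exists $node->{-end};
    }
    return $match;
}

add_prefix($_) for qw(38 385 38591 44);
print longest_prefix('385911234567'), "\n";   # 38591
print longest_prefix('441234'), "\n";         # 44
```

Since each lookup touches at most one hash per digit, the cost is bounded by the number length, not the number of prefixes, which is why it can sustain tens of thousands of lookups per second.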
Of course, I had the luxury of all requests going through one process, which does the prefix search and routes requests accordingly. That might also be a solution to the original question.
You could implement the mod_perl handler so it only passes on the requests, i.e. puts them in a work queue, and waits for results in a results queue. Then have one process (or several, possibly on several different servers) do the actual (prefix?) searches: reading requests from the in/work queue and putting the answers into the out queue.
The trie implementation was able to find something like 50K random prefixes per second in a loop, from a pool of 45-50K prefixes in the database (loaded on startup into a large HoH trie). 50,000 per second should outperform any web server. And that was on a 2x AMD box with 1 GB of RAM ...
You could use memcached for both the input queue and the output queue; an implementation (in Perl!) can be seen here: http://3.rdrail.net/blog/memcached-based-message-queues/
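The queue pattern itself is simple enough to sketch in-process. This is a toy stand-in I wrote for illustration, not the linked memcached implementation: plain arrays play the role of the shared in/out queues, `enqueue_request` is the "handler" side, and `run_worker` is the search process. All names and the sample carrier data are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my (@work_queue, @result_queue);

# The mod_perl-handler side: just enqueue the request and return.
sub enqueue_request {
    my ($id, $number) = @_;
    push @work_queue, { id => $id, number => $number };
}

# The worker side: drain the work queue, run the actual (prefix)
# search, and push answers onto the result queue.
sub run_worker {
    my ($lookup) = @_;    # code ref that does the real search
    while (my $job = shift @work_queue) {
        push @result_queue,
            { id => $job->{id}, carrier => $lookup->($job->{number}) };
    }
}

my %carrier_of = ( 38591 => 'Carrier A', 38598 => 'Carrier B' );  # toy data
enqueue_request(1, '38591123456');
enqueue_request(2, '38598999999');
run_worker(sub { my $n = shift; $carrier_of{ substr $n, 0, 5 } });
```

In a real deployment the two arrays would be replaced by a shared queue (memcached-backed, as in the link above, or any other broker), so the workers can live in separate processes or on separate servers.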
PS - I've been contracting for the past year for a company that's in the SMS gateway business, so I get to tweak code that searches prefixes a lot ;)
Hi Dave, I actually use a couple of your modules in my project,
and I'm doing exactly what you did for UK mobile data, but for USA/CAN. It's also an SMS messaging server.
I couldn't find any reliable data for carrier lookup on NANP numbers, and I ended up buying the data from a commercial supplier.
Instead of DBM I'm using my own object persistence module, which works basically the same as DBM::Deep but faster.
I stored the data in one data file per area code; each file was 20-40 kB on average.
The lookups were fast and worked fine.
The only problem was IMPORTING huge CSV files with e.g. 20k phone numbers. The whole import
procedure had to finish within 20s, including saving the numbers in the DB, checking that they are unique, etc.
For 10k numbers, that means 10k lookups, reading on average 300 MB of data.
At that point the import was taking 70s on average, so
I moved the data into something like this:
use constant _nanp_area => {
    201 => {
        4143 => 267,
        206  => 357,
        ....
    },
    604 => { ... },
    ....
};
# {
#     area_code1 => { prefix => carrier_id, ... },
#     area_code2 => { prefix => carrier_id, ... },
# }
So it's basically a hash of hashes: for each area code, the 3- or 4-digit prefixes mapped to their carrier IDs.
I didn't do exact measurements, but it's fast enough and works for us just fine: 10k phone numbers get imported and stored in the DB within 14s.
The lookup function first tries to find the carrier based on the area code and the 4-digit prefix, and if that returns undef, falls back to the 3-digit prefix.
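That fallback lookup might look something like the following sketch. The function name `carrier_id` and the sample prefixes/IDs are mine, invented for illustration; only the 4-digit-then-3-digit strategy is from the description above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy slice of the structure described above: area code => { prefix => carrier_id }.
my %nanp_area = (
    201 => { 4143 => 267, 206 => 357, 414 => 112 },
);

sub carrier_id {
    my ($number) = @_;                              # e.g. '2014143999'
    my $area = substr $number, 0, 3;
    my $node = $nanp_area{$area} or return undef;   # unknown area code
    my $id = $node->{ substr $number, 3, 4 };       # try the 4-digit prefix
    $id    = $node->{ substr $number, 3, 3 } unless defined $id;  # fall back to 3 digits
    return $id;
}

print carrier_id('2014143999'), "\n";   # 267 (4-digit match '4143')
print carrier_id('2014145999'), "\n";   # 112 (falls back to 3-digit '414')
```

Two hash lookups in the worst case, so import-time validation of 10k numbers stays cheap.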
At this point everything seems to work OK and fast, but I will keep an eye on it.