Calling all REGEX Gurus - nasty problem involving regular expressions combined with hash keys - I need ideas as to how to even approach the problemby ted.byers (Scribe)
|on Feb 08, 2013 at 21:11 UTC||Need Help??|
ted.byers has asked for the
wisdom of the Perl Monks concerning the following question:
I am at a loss as to how to even begin this problem.
In a transaction processing system, in most cases the server sends a response code and a response text (giving the meaning of the response code). Those are trivially easy to work with. In fact, that data is good enough that I could even establish a hash containing the response text value as a key and the response code as a value, and use that with the response text provided for a given transaction and accurately get the code.
Of course, when they send both, there is no need. However, I have to use one service that does not send the standardized code, and the standard response text is provided as part of a much longer 'response text', and at that, the complete 'standardized response' is not sent, only an abbreviated form of it. Therefore, what I have to do is compare the response text they return and use a regular expression to see which key in a hash, that maps response text to response code, best fits it. In other words, once I create the hash mapping response text to response code, I have to determine, as quickly as possible, which key contains the longest substring that is contained in the response text that this service has provided.
Alas, I do not control this particular server, and suggestions to improve it take a long time to be acted upon, if they choose to act at all. I could 'ask' that they send the proper response code as a separate field in their response, but I might not live long enough to see that happen.
Now, I know I could implement a brute force algorithm, using a statistical similarity measure drawn from information theory, but I am hoping there is a guru out there who knows how best to handle this sort of problem in an efficient manner. I have to ensure that whatever other processing the script in question has to do, that the response time is less than a second (the faster the better). I had more than half a dozen algorithms to trim whitespace from the start and end of an arbitrary string, that all worked flawlessly using one regex or another, and yet there was an order of magnitude difference in how quickly they did the job. Thus, it would be good to know several options that would all work correctly, and I'd then write some benchmarking code to facilitate comparing their speed.
Any suggestions as to fast algorithms to solve this problem would be greatly appreciated.