PerlMonks  

Calling all REGEX Gurus - nasty problem involving regular expressions combined with hash keys - I need ideas as to how to even approach the problem

by ted.byers (Scribe)
on Feb 08, 2013 at 21:11 UTC ( #1017887=perlquestion )
ted.byers has asked for the wisdom of the Perl Monks concerning the following question:

I am at a loss as to how to even begin this problem.

In a transaction processing system, in most cases the server sends a response code and a response text (giving the meaning of the response code). Those are trivially easy to work with. In fact, that data is good enough that I could even build a hash with the response text as key and the response code as value, and use it to accurately recover the code from the response text of any given transaction.

Of course, when they send both, there is no need for such a mapping. However, I have to use one service that does not send the standardized code; the standard response text is embedded in a much longer 'response text', and even then the complete 'standardized response' is not sent, only an abbreviated form of it. Therefore, I have to compare the response text they return against the keys of a hash that maps response text to response code, and use a regular expression to see which key best fits it. In other words, once I create that hash, I have to determine, as quickly as possible, which key contains the longest substring that is also contained in the response text this service has provided.
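To make the "longest shared substring" criterion concrete, here is a minimal brute-force sketch. The response texts and codes are hypothetical; for each key it tries substrings of the key from longest to shortest and keeps the key with the longest substring found in the raw response:

```perl
use strict;
use warnings;

# Hypothetical mapping of standard response text to response code.
my %text_to_code = (
    'Invalid Expiration Date'       => '101',
    'Declined - Insufficient Funds' => '201',
);

sub best_code {
    my ($raw) = @_;
    my ( $best_key, $best_len ) = ( undef, 0 );
    for my $key ( keys %text_to_code ) {
        my $klen = length $key;
        SUBSTR:
        for my $len ( reverse 1 .. $klen ) {
            last SUBSTR if $len <= $best_len;    # cannot beat the current best
            for my $start ( 0 .. $klen - $len ) {
                if ( index( $raw, substr( $key, $start, $len ) ) >= 0 ) {
                    ( $best_key, $best_len ) = ( $key, $len );
                    last SUBSTR;                 # longest substring for this key found
                }
            }
        }
    }
    return defined $best_key ? $text_to_code{$best_key} : undef;
}
```

This is cubic in the key length per key, which is almost certainly too slow for a sub-second budget over many transactions, but it pins down exactly what a faster solution must compute.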

Alas, I do not control this particular server, and suggestions to improve it take a long time to be acted upon, if they choose to act at all. I could 'ask' that they send the proper response code as a separate field in their response, but I might not live long enough to see that happen.

Now, I know I could implement a brute-force algorithm, using a statistical similarity measure drawn from information theory, but I am hoping there is a guru out there who knows how best to handle this sort of problem efficiently. I have to ensure that, whatever other processing the script in question has to do, the response time is less than a second (the faster the better). I once had more than half a dozen algorithms to trim whitespace from the start and end of an arbitrary string, all of which worked flawlessly using one regex or another, and yet there was an order of magnitude difference in how quickly they did the job. Thus, it would be good to know several options that would all work correctly; I'd then write some benchmarking code to compare their speed.
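For the benchmarking part, the core Benchmark module's cmpthese does the comparison directly. A small self-contained example, using the whitespace-trimming case mentioned above (the two implementations here are illustrative, not the original half dozen):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $s = '   some response text   ';

# Negative count = run each sub for at least that many CPU seconds.
cmpthese( -1, {
    two_substs => sub { ( my $t = $s ) =~ s/^\s+//; $t =~ s/\s+$//; $t },
    one_subst  => sub { ( my $t = $s ) =~ s/^\s+|\s+$//g;           $t },
} );
```

cmpthese prints a table of rates with pairwise percentage differences, which makes the order-of-magnitude gaps easy to spot.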

Any suggestions as to fast algorithms to solve this problem would be greatly appreciated.

Thanks

Ted

Re: Calling all REGEX Gurus - nasty problem involving regular expressions combined with hash keys - I need ideas as to how to even approach the problem
by Anonymous Monk on Feb 08, 2013 at 21:16 UTC
    I have to determine, as quickly as possible, which key contains the longest substring that is contained in the response text that this service has provided.

    Do you mean to say that they do not just truncate, but send random chunks from the middle of the error text? And are they inconsistent in which chunk of the error text they send for each code?

      I have not seen a consistent truncation pattern, save that it looks like some of the beginning and some of the end of the "standard" response may be truncated.

      And we're talking something on the order of 100 possible responses.

      Thanks

      Ted

Re: Calling all REGEX Gurus - nasty problem involving regular expressions combined with hash keys - I need ideas as to how to even approach the problem
by davido (Archbishop) on Feb 08, 2013 at 22:50 UTC

    I didn't see in your problem explanation a description of how many "standardized responses" there could be. Are we talking about thousands? Hundreds? Tens?

    It would also be useful to know whether the incomplete versions of the standardized responses are at least predictable, and unique. I understand that the entire response text might differ from transaction to transaction, but does a "100 - Bad Transaction" message always get abbreviated as "Bad T" before being embedded in the response text, and is the abbreviation unique so that no two standardized response codes could have the same abbreviation?

    Let's say you've got a total possible 100 standardized responses / codes. Start by building a cross-reference table that x-refs abbreviations with their full-sized versions:

    my %xref = (
        'abbrev1' => '100 - Non-Abbreviated 1',
        'abbrev2' => '200 - Non-Abbreviated 2',
        ...
    );

    Next build up a big regex full of alternations:

    my $alternations = join '|', keys %xref;
    my $regexp = qr/\b($alternations)\b/;

    Next, scan your response text and look up the crossref:

    my( $abbrev ) = $raw_response =~ m/$regexp/;
    my $std_response;
    if( exists $xref{$abbrev} ) {
        $std_response = $xref{$abbrev};
    }
    else {
        die "No valid response found in <<$raw_response>>";
    }
    print "$std_response\n";

    Perl's regular expression engine (as of 5.10, if I recall) performs "trie optimization" for alternation, which should be very fast. While hash keys cannot be Regexp objects, they could contain the text that you will use as components of a regexp pattern.

    It's possible that this approach won't work for you if the possible abbreviations aren't unique, or if one abbreviation could be truncated in such a way as to produce another valid abbreviation. It also won't work if you can't count on abbreviations being predictable. If those sorts of issues exist, you might have to explain to us how you as a human would look at the response text and visually/mentally detect a standardized response abbreviation. Then the problem would be to try to turn that process into a set of rules that could be implemented programmatically.
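    One refinement worth noting: Perl tries alternatives in the order they appear in the pattern, so if one abbreviation is a prefix of another (say a hypothetical 'Bad' and 'Bad T'), joining the keys in hash order can let the shorter one win. Sorting the alternations longest-first, and escaping any regex metacharacters with quotemeta, avoids both pitfalls. A sketch with made-up abbreviations:

```perl
use strict;
use warnings;

# Hypothetical abbreviations; note 'Bad' is a prefix of 'Bad T'.
my %xref = (
    'Bad T' => '100 - Bad Transaction',
    'Bad'   => '300 - Bad Something Else',
);

# Longest alternative first, metacharacters escaped, so 'Bad T' is tried
# before 'Bad' at every position.
my $alternations = join '|',
                   map  { quotemeta }
                   sort { length($b) <=> length($a) }
                   keys %xref;
my $regexp = qr/\b($alternations)\b/;
```

    The trie optimization still applies to a pattern built this way, since the alternatives are all literal strings.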


    Dave

      Thanks

      There are around 100 possible standard response texts.

      I have not seen a consistent pattern in the abbreviations, although I hope against hope that they'd at least abbreviate any given response text the same way consistently.

      Part of the challenge is to deliberately trigger each response code, so that we can see how each is abbreviated at least once. It is easy to generate an error related to a bad date, but some of the other errors are quite hard to trigger deliberately, especially when the server in question has to be treated as a black box (we know what the response codes are, but not the validation rules that produce them - and their documentation leaves everything to be desired).

      That said, I will make a test script based on the code you show, just in case the assumptions that code makes are satisfied by the data they send.

      Thanks

      Ted

Re: Calling all REGEX Gurus - nasty problem involving regular expressions combined with hash keys - I need ideas as to how to even approach the problem
by BrowserUk (Pope) on Feb 08, 2013 at 23:00 UTC

    Just spend a few minutes constructing requests that cause the remote server to respond with each of their response texts; then store this in your text-to-response-code hash. Job done.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Thanks for this

      Alas, it is a bit harder than that. There are about 100 standard responses, and while some are trivially easy to trigger, some are notoriously hard to trigger deliberately, in no small part because the validation rules they use are not documented (at least in the documentation they provide to us).

      Thanks

      Ted

        Then add the common ones and fix up the rare ones when and if they ever occur.

        (And look for an alternative supplier of this service that doesn't screw you around.)


Re: Calling all REGEX Gurus - nasty problem involving regular expressions combined with hash keys - I need ideas as to how to even approach the problem
by dd-b (Monk) on Feb 09, 2013 at 22:20 UTC

    So, if you don't have bagged samples of what they send, you can't validate for sure any hypothetical solution. They really can't send you a list of exactly what they send in each situation?

    If you do have examples, you can make a table from them.

    Therefore, you can put the ones you've got samples of (which will be the common ones) in a table, and make guesses on the key words that will be kept from the middle of the response codes for the ones you don't have samples of. You can make sure that your guesses don't give false positives on any of the common responses. And make sure you have logging that will show you anything you get that isn't in your list of expected returns, so you can figure out what's going on and update your tables. This isn't excellent, but might be tolerable?
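    The table-plus-logging approach above might be sketched like this, with hypothetical abbreviations standing in for the real samples:

```perl
use strict;
use warnings;

# Abbreviations actually observed in traffic (hypothetical values here),
# mapped to their standardized response codes.
my %known = (
    'Bad T'  => '100',
    'Insuff' => '201',
);

sub classify {
    my ($raw) = @_;
    # Longest sample first, so a short abbreviation cannot shadow a longer one.
    for my $abbrev ( sort { length($b) <=> length($a) } keys %known ) {
        return $known{$abbrev} if index( $raw, $abbrev ) >= 0;
    }
    # Anything unmatched gets logged so the table can be updated later.
    warn "unrecognized response: $raw\n";
    return undef;
}
```

    Plain index() is used rather than a regex, since the known samples are literal strings; either would work.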

Node Type: perlquestion [id://1017887]
Approved by davido