Re: Calling all REGEX Gurus - nasty problem involving regular expressions combined with hash keys - I need ideas as to how to even approach the problem
by davido (Archbishop) on Feb 08, 2013 at 22:50 UTC
I didn't see in your problem explanation a description of how many "standardized responses" there could be. Are we talking about thousands? Hundreds? Tens?
It would also be useful to know whether the incomplete versions of the standardized responses are at least predictable, and unique. I understand that the entire response text might differ from transaction to transaction, but does a "100 - Bad Transaction" message always get abbreviated as "Bad T" before being embedded in the response text, and is the abbreviation unique so that no two standardized response codes could have the same abbreviation?
Let's say you've got a total of 100 possible standardized responses / codes. Start by building a cross-reference table that maps each abbreviation to its full-sized version:
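A minimal sketch of such a table, assuming a plain hash keyed by abbreviation. Only the "Bad T" / "100 - Bad Transaction" pair comes from the discussion above; the other entries and the name `%xref` are made up for illustration.

```perl
use strict;
use warnings;

# Abbreviation => full standardized response.
# Only the 'Bad T' entry is taken from the thread; the rest are hypothetical.
my %xref = (
    'Bad T' => '100 - Bad Transaction',
    'Insuf' => '200 - Insufficient Funds',
    'Exp C' => '300 - Expired Card',
);

printf "%-6s => %s\n", $_, $xref{$_} for sort keys %xref;
```

In practice the table would probably be loaded from a file or database rather than hard-coded.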
Next build up a big regex full of alternations:
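One way this might look, assuming the abbreviations are literal strings that need escaping. The table contents and the names `%xref` and `$abbrev_re` are illustrative only. Sorting longest-first is an extra precaution, so that an abbreviation that happens to be a prefix of another does not win by accident.

```perl
use strict;
use warnings;

# Hypothetical table; only the 'Bad T' entry comes from the thread.
my %xref = (
    'Bad T' => '100 - Bad Transaction',
    'Insuf' => '200 - Insufficient Funds',
    'Exp C' => '300 - Expired Card',
);

# Escape each abbreviation so it is matched literally, put longer
# abbreviations first, and join everything into one alternation.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %xref;

# Capture whichever abbreviation matched so it can be used as a hash key.
my $abbrev_re = qr/($alternation)/;
```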
Next, scan your response text and look up the crossref:
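Putting the pieces together, a self-contained sketch might look like the following. The sample response text, the extra table entries, and the helper name `standardize` are assumptions for illustration, not anything from the original question.

```perl
use strict;
use warnings;

# Hypothetical table; only the 'Bad T' entry comes from the thread.
my %xref = (
    'Bad T' => '100 - Bad Transaction',
    'Insuf' => '200 - Insufficient Funds',
    'Exp C' => '300 - Expired Card',
);

# One alternation of all (escaped) abbreviations, longest first.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b }
    keys %xref;
my $abbrev_re = qr/($alternation)/;

# Return the full standardized response for the first abbreviation
# found in the text, or undef if none is present.
sub standardize {
    my ($text) = @_;
    return $text =~ $abbrev_re ? $xref{$1} : undef;
}

# Made-up response text containing the 'Bad T' abbreviation.
my $response = 'Txn 4471 rejected: Bad T (retry later)';
my $full     = standardize($response);
print defined $full ? "$full\n" : "no standardized response found\n";
```

This reports only the first abbreviation found. If a response can embed several, matching in list context with `/g` and mapping each capture through `%xref` would be one way to collect them all.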
Perl's regular expression engine (as of 5.10, if I recall) performs "trie optimization" on alternations of literal strings, so matching against even a long list of alternatives should be very fast. Hash keys cannot be Regexp objects (any key gets stringified), but they can hold the text that you will use as components of a regexp pattern.
It's possible that this approach won't work for you if the possible abbreviations aren't unique, or if one abbreviation could be truncated in such a way as to produce another valid abbreviation. It also won't work if you can't count on abbreviations being predictable. If those sorts of issues exist, you might have to explain to us how you, as a human, would look at the response text and visually/mentally detect a standardized response abbreviation. Then the problem would be to turn that process into a set of rules that could be implemented programmatically.