MUMPS Array Subscripts Parsing Via RegEx
by Clovis_Sangrail (Beadle)
on May 14, 2012 at 17:22 UTC
Clovis_Sangrail has asked for the
wisdom of the Perl Monks concerning the following question:
Greetings exalted Monks and anyone else reading this. I am a journeyman Perl Programmer, buffeted about the IT industry by the vagaries of the economy and the responsibilities attendant to having a life away from the keyboard. I am called upon to solve some problem lending itself to Perl at intervals approximately twice as long as it takes me to forget most of the Perl that I've learned.
My current employer makes extensive use of a piece of software I expected never to encounter in my life, MUMPS! It turns out that MUMPS (much older than Perl) is a database and language, much like SQL or Oracle, but it's hierarchical and not relational. I'm told. Whatever that means. I need to report on certain transactions found in Journals; in particular, I need to extract the Global Variables from the record types in which they are created, modified, or cleared.
Global Variable names start with an initial '^' character, followed by any combination of upper and/or lower case letters, digits, and the '%' character. They may or may not be arrays; in the case of arrays, the name is immediately followed by a parenthesized set of subscripts. The Global Variable may be the end of the record, or it may be set equal to a value. Though I regard them as one of the language's fearsome mysteries, I turned to Perl for its Regular Expression capabilities to give me the initial Global Variable label and the whole subscripted Global Variable.
I came up with:
    /^((\^[%A-Za-z\d]+)($|=|.*?\)))/
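As a minimal sketch of how this pattern might be applied, the script below runs it over three sample records. The records are hypothetical and are not taken from a real journal; only the pattern itself comes from the post.

```perl
use strict;
use warnings;

# The pattern from the post, unchanged.
my $re = qr/^((\^[%A-Za-z\d]+)($|=|.*?\)))/;

# Hypothetical sample records, invented for illustration only.
for my $record ('^COUNT', '^NAME=Smith', '^STUFF(1,2)=5') {
    next unless $record =~ $re;
    # $1 is the whole match, $2 is just the caret plus the label.
    print "whole: $1   label: $2\n";
}
```

Note that for a record like the second one, the `=` alternative makes the equals sign part of `$1`, so the label in `$2` is the safer value to report on.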
And lo and behold it seemed to work! I was extracting Global Variables! My first memory variable $1 was the whole Global, $2 was the initial label, and I was generating reports. But then I began to encounter some of the bizarre things that MUMPS permits as subscripts. The Global Variable:
    ^STUFF(1,"gobble)(degook", X)
broke my little Regular Expression! Or, well, strictly speaking it did not break it but it set the first memory variable equal to:
    ^STUFF(1,"gobble)
rather than what I would prefer. It seemed that I could not simply search for the first ')', I had to somehow skip over ')'s, maybe a bunch, embedded in quoted subscripts. And then along came the Global Variable:
    ^STUFF2(A,"""%BU""")
Nested double quotes! Oy Vey! Now, I know that I can express a set of double quotes as
    (["]+)
and match it later in the regex as \X (rather than $X) but what happens when I encounter a subscript enclosed in nested parentheses, or nested single quotes? This Regular Expression is getting out of hand, at least for me!
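One possible alternative to a backreference is to describe a whole quoted subscript as a single unit. The sketch below assumes that, inside a MUMPS string literal, an embedded double quote is written as two double quotes, which would explain the `"""%BU"""` form above. The name `$mstring` is hypothetical.

```perl
use strict;
use warnings;

# Assumption: a string literal is an opening quote, then any mix of
# non-quote characters and doubled quotes, then a closing quote.
my $mstring = qr/"(?:[^"]|"")*"/;

for my $text ('"""%BU"""', '"gobble)(degook"', '"unterminated') {
    my $verdict = $text =~ /\A$mstring\z/ ? 'is' : 'is not';
    print "$text $verdict a complete quoted string\n";
}
```

Because the two alternatives inside the group can never start on the same character, the pattern has only one way to consume any given input, which should keep it clear of the runaway backtracking that makes a regex appear to lock up.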
Now, when I encounter multiple members of the same Global Variable array I summarize the results in a single entry and write the Global Variable name with a 'generic' subscript, e.g.:
    ^STUFF(..)
So in most cases it does not matter that I extract the wrong set of subscripts, but when I have a single update to a Global Variable Array I would like to present the correct bunch of subscripts. It seems like my regex must become something like:
    /^((\^[%A-Za-z\d]+)($|=|\(bunch_of_ugly_subscripts\)))/
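A hedged sketch of what the `bunch_of_ugly_subscripts` part could look like follows. It assumes Perl 5.10 or later (for named captures and recursion), assumes embedded quotes are doubled inside quoted subscripts, and treats anything else between balanced parentheses as an ordinary character. The names `$global_re`, `parse_global`, `whole`, `label`, and `subs` are hypothetical. Unlike the pattern above, it makes the subscripts optional and leaves a trailing `=` out of the match.

```perl
use strict;
use warnings;
use 5.010;   # named captures and recursion need Perl 5.10 or later

my $global_re = qr/
    ^
    (?<whole>
        (?<label> \^ [%A-Za-z\d]+ )     # caret, then letters, digits, percent
        (?<subs>                        # optional parenthesized subscripts
            \(
            (?:
                [^"()]                  # any ordinary character
              | " (?: [^"] | "" )* "    # a quoted string, doubled quotes allowed
              | (?&subs)                # a nested parenthesized group
            )*
            \)
        )?
    )
/x;

# Returns (whole reference, label), or an empty list when nothing matches.
sub parse_global {
    my ($record) = @_;
    return unless $record =~ $global_re;
    return ($+{whole}, $+{label});
}

for my $record ('^STUFF(1,"gobble)(degook", X)', '^STUFF2(A,"""%BU""")') {
    my ($whole, $label) = parse_global($record);
    print "whole: $whole   label: $label\n";
}
```

Each alternative inside the subscript group starts on a different character, so the engine never has two ways to read the same text. That is a deliberate choice to avoid the kind of backtracking blow-up that can make Perl seem to hang.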
But this is becoming far more of a RegEx than I had hoped it would be. I created some things that caused Perl to lock up; my hope is that this issue has been encountered and solved already, and you folks could tell me how to do this. I'm also very interested in anything you have to say about how (if at all) you can document or comment a Regular Expression, so it at least looks like commented modem noise instead of just modem noise. Thanks for any help you can give.
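On the commenting question, one common approach is Perl's `/x` modifier, which ignores unescaped whitespace in the pattern and treats `#` to end of line as a comment. The sketch below is the first pattern from this post laid out that way; it should behave the same as the one-line version. The variable name `$commented` is hypothetical.

```perl
use strict;
use warnings;

my $commented = qr/
    ^
    (                           # capture 1: the whole global reference
        ( \^ [%A-Za-z\d]+ )     # capture 2: caret plus the label
        (                       # capture 3: whatever follows the label
            $                   #   the end of the record
          | =                   #   or an equals sign
          | .*? \)              #   or the shortest run up to a close paren
        )
    )
/x;

if ('^STUFF(1,2)=5' =~ $commented) {
    print "whole: $1   label: $2\n";
}
```

Two things to watch for in comments inside a pattern: the delimiter character (here `/`) still ends the regex, and `$name` or `@name` may still be interpolated, so it is safest to keep those characters out of the comment text.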