http://www.perlmonks.org?node_id=970472

Clovis_Sangrail has asked for the wisdom of the Perl Monks concerning the following question:

Greetings exalted Monks and anyone else reading this. I am a journeyman Perl Programmer, buffeted about the IT industry by the vagaries of the economy and the responsibilities attendant to having a life away from the keyboard. I am called upon to solve some problem lending itself to Perl at intervals approximately twice as long as it takes me to forget most of the Perl that I've learned.

My current employer makes extensive use of software I expected never to encounter in my life, MUMPS! It turns out that MUMPS (much older than Perl) is a database and language, much like SQL or Oracle, but it's hierarchical and not relational. I'm told. Whatever that means. I need to report on certain transactions found in Journals; in particular, I need to extract the Global Variables from the record types in which they are created, modified, or cleared.

Global Variable names start with an initial '^' character, followed by any combination of upper and/or lower case letters, digits, and the '%' character. They may or may not be arrays; in the case of arrays, the name is immediately followed by a parenthesized set of subscripts. The Global Variable may be the end of the record, or it may be set equal to a value. Though I regard them as one of the language's fearsome mysteries, I turned to Perl for its Regular Expression capabilities to give me the initial Global Variable label and the whole subscripted Global Variable.

I came up with:

 /^((\^[%A-Za-z\d]+)($|=|.*?\)))/

And lo and behold it seemed to work! I was extracting Global Variables! My first memory variable $1 was the whole Global, $2 was the initial label, and I was generating reports. But then I began to encounter some of the bizarre things that MUMPS permits as subscripts. The Global Variable:

 ^STUFF(1,"gobble)(degook", X)

broke my little Regular Expression! Or, well, strictly speaking it did not break it but it set the first memory variable equal to:

 ^STUFF(1,"gobble)

rather than what I would prefer. It seemed that I could not simply search for the first ')', I had to somehow skip over ')'s, maybe a bunch, embedded in quoted subscripts. And then along came the Global Variable:

 ^STUFF2(A,"""%BU""")

Nested double quotes! Oy Vey! Now, I know that I can express a set of double quotes as

 (["]+)

and match it later in the regex as \X (rather than $X) but what happens when I encounter a subscript enclosed in nested parentheses, or nested single quotes? This Regular Expression is getting out of hand, at least for me!

Now, when I encounter multiple members of the same Global Variable array I summarize the results in a single entry and write the Global Variable name with a 'generic' subscript, ex:

 ^STUFF(..)

So in most cases it does not matter that I extract the wrong set of subscripts, but when I have a single update to a Global Variable Array I would like to present the correct bunch of subscripts. It seems like my regex must become something like:

 /^((\^[%A-Za-z\d]+)($|=|\(bunch_of_ugly_subscripts\)))/

But this is becoming far more of a RegEx than I had hoped it would be. I created some things that caused Perl to lock up; my hope is that this issue has been encountered and solved already, and you folks could tell me how to do this. I'm also very interested in anything you have to say about how (if at all) you can document or comment a Regular Expression, so it at least looks like commented modem noise instead of just modem noise. Thanks for any help you can give.

Re: MUMPS Array Subscripts Parsing Via RegEx
by toolic (Bishop) on May 14, 2012 at 18:23 UTC
    I'm also very interested in anything you have to say about how (if at all) you can document or comment a Regular Expression, so it at least looks like commented modem noise instead of just modem noise.
    See perlre /x modifier.

    Also, courtesy of YAPE::Regex::Explain, here is your /^((\^[%A-Za-z\d]+)($|=|\(bunch_of_ugly_subscripts\)))/ with comments:

    (?x-ims:               # group, but do not capture (disregarding
                           # whitespace and comments) (case-sensitive)
                           # (with ^ and $ matching normally) (with . not
                           # matching \n):
      ^                    # the beginning of the string
      (                    # group and capture to \1:
        (                  # group and capture to \2:
          \^               # '^'
          [%A-Za-z\d]+     # any character of: '%', 'A' to 'Z', 'a'
                           # to 'z', digits (0-9) (1 or more times
                           # (matching the most amount possible))
        )                  # end of \2
        (                  # group and capture to \3:
          $                # before an optional \n, and the end of
                           # the string
         |                 # OR
          =                # '='
         |                 # OR
          \(               # '('
          bunch_of_ugly_su # 'bunch_of_ugly_subscripts'
          bscripts         #
          \)               # ')'
        )                  # end of \3
      )                    # end of \1
    )                      # end of grouping
    UPDATE: Does this CPAN search help?

      Wow, that module is pretty impressive! It seems like there is a Perl module for just about anything. On a smaller scale, I will try making use of the '/x' modifier and indentation in the future. It seems like there is some danger that, for people like me who do not use Perl often enough to keep from forgetting stuff between usages, the '/x' modifier could add to the confusion as well as clarify the inner workings of the Regex. I think I would sort of feel obligated to (briefly) explain how '/x' allows whitespace and comments in the comments that I add, probably not a bad thing.
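For what it's worth, here is a minimal sketch of how '/x' might be applied to the quoted-subscript problem from the original question. It rests on assumptions the thread does not confirm: that subscripts in the extract are literals, that string subscripts are delimited by double quotes, and that an embedded quote is written as a doubled quote. The names $GLOBAL_RE and extract_global are made up for illustration. One trap worth knowing: variables are interpolated even inside '/x' comments, so the comments below avoid writing '$' or '@'.

```perl
use strict;
use warnings;

# Assumption: string subscripts look like "..." and an embedded quote is
# doubled (""), so """%BU""" reads as three adjacent quoted chunks.
my $GLOBAL_RE = qr{
    ^
    (                       # capture 1: the whole global reference
      ( \^ [%A-Za-z\d]+ )   # capture 2: caret plus the name
      (?:                   # optional subscript list
        \(
        (?:
            " [^"]* "       #   one quoted chunk, parens inside are harmless
          | [^")]           #   or one char that is neither a quote nor ')'
        )*
        \)
      )?
    )
    (?= = | \z )            # then '=' or the end of the record
}x;

# Hypothetical helper: returns (whole, name), or an empty list on no match.
sub extract_global {
    my ($rec) = @_;
    return $rec =~ $GLOBAL_RE ? ($1, $2) : ();
}

for my $rec ('^STUFF(1,"gobble)(degook", X)', '^STUFF2(A,"""%BU""")="v"') {
    my ($whole, $name) = extract_global($rec);
    print "whole=$whole name=$name\n";
}
```

Each alternative inside the subscript loop consumes either a complete quoted chunk or exactly one other character, so there is only one way to carve up a given input. That is meant to avoid the kind of runaway backtracking that can make Perl appear to lock up. Records should be chomped first, since the lookahead uses \z.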

Re: MUMPS Array Subscripts Parsing Via RegEx
by scorpio17 (Canon) on May 14, 2012 at 20:41 UTC

      That has got to be the best name for a website that I've seen in a long, long, time! I like the idea of a daily WTF...

      Perhaps out of a general sense of respect for one's elders, I have a more charitable attitude towards Mumps than does the author of the article you reference. I do not dispute the various uglinesses he describes, but I believe that they were much less relevant when the language was created, back when the main consideration was shoehorning code into very tiny amounts of memory.

      The place where I'm at now extensively uses the GT.M implementation of Mumps (FIS, they own it), and they tell me it's rock-solid and blindingly fast; they have to wait for Oracle to catch up to it when interfacing with customers who use Oracle. I don't really understand it, but apparently they don't write big programs in the Mumps language; they write DB interfaces that they call with Java.

      OMFG

      HTH,

      planetscape
      I shuddered and emitted audible gasps as I read that...
Re: MUMPS Array Subscripts Parsing Via RegEx
by johngg (Canon) on May 14, 2012 at 22:04 UTC

    I wonder if the Text::Balanced module would be helpful.

    Cheers,

    JohnGG

      At a cursory inspection it looks like this module is something that I'd use instead of Regular expressions. (Unless you can call a Perl Function from within a Regex. Can you do that? (OMG according to http://www.perlmonks.org/?node_id=832028 you can do just that. My head is going to explode!)). I'm not ready to give up on Regex's yet. I can decompose

      \(bunch_of_ugly_subscripts\)

      into

      \((uglysub,)*uglysub\)

      and then maybe I'd use variable substitution in the regex and create a separate regex variable that's an alternation for the different kinds of ugly subscripts:

      $UG = 'notsougly|uglywith"|uglywith\'|...';

      This would be updatable as I encounter different outrageous subscripts. Maybe I'd use calls to the Text::Balanced module functions in here?
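The decomposition idea above can be sketched with qr// pieces that interpolate into one another. This is only an illustration under assumptions: $STRING, $PLAIN, $SUB, $SUBS and the subscripts function are hypothetical names, and the two subscript shapes (a double-quoted string with doubled embedded quotes, or a bare number or name) are guesses at what the journals contain.

```perl
use strict;
use warnings;

# Hypothetical building blocks; extend the alternation in $SUB as new
# subscript shapes turn up.
my $STRING = qr/(?:"[^"]*")+/;        # "..." where "" is an embedded quote
my $PLAIN  = qr/[^",()\s]+/;          # a bare number or name
my $SUB    = qr/(?:$STRING|$PLAIN)/;  # any one subscript
my $SUBS   = qr/\(\s*($SUB(?:\s*,\s*$SUB)*)\s*\)/;  # (sub, sub, ...)

# Returns the list of subscripts, or an empty list if the text is not a
# subscripted global in the assumed shape.
sub subscripts {
    my ($global) = @_;
    return unless $global =~ /^\^[%A-Za-z\d]+$SUBS/;
    my $list = $1;
    return $list =~ /$SUB/g;
}

print join(' | ', subscripts('^STUFF(1,"gobble)(degook", X)')), "\n";
```

Because the pieces are separate variables, each one can carry its own comment and be tested on its own, which may be easier to maintain than one long pattern.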

Re: MUMPS Array Subscripts Parsing Via RegEx
by afoken (Chancellor) on May 16, 2012 at 04:59 UTC
    My current employer makes extensive use of software I expected never to encounter in my life, MUMPS! It turns out that MUMPS (much older than Perl) is a database and language, much like SQL or Oracle, but it's hierarchical and not relational. I'm told. Whatever that means.

    I feel your pain, because MUMPS has become part of my current job, much more than I ever wanted it to be part of my job.

    MUMPS is a database in that all "globals" are stored on disk rather than just in memory. The globals are stored as trees (hierarchical), not as tables (relational). A global, like any MUMPS variable, can store a single value (like a perl scalar), or it can store key-value pairs (much like a perl hash, but with implicitly sorted keys), or it can store both at the same time. The values of the key-value pairs can again be single values or key-value pairs, deeply nested. But there is nothing like SQL to query these trees; you have to write MUMPS code. (Caché (see below) does offer an SQL interface to the trees, but it looks very strange.)

    Oh, by the way: Did you know that MUMPS started as an operating system running on bare metal of ancient computers? All current implementations still provide the grey-haired coder with this illusion.

    I need to report on certain transactions found in Journals, in particular I need to extract the Global Variables from the record types in which they are created, modified, or cleared.

    There are several very different implementations of MUMPS, despite being standardised by ANSI or some other authority. I know only the Micronetics implementation (MSM) from personal experience, the Caché implementation from a big distance, and the Perl implementation from a short "just forget it" experience.

    Parsing MUMPS code is easy for the common case, but there are some edges that make your life really hard. The indirection operator (@) is my favorite here, directly followed by the string-eval command XECUTE. As soon as you find one of them, you are essentially lost with a simple parser. You need to know the current values of the variables referenced in the code to continue. So, you can't simply parse MUMPS, you have to interpret it. It's the "only perl can parse Perl" of the punch card age. With the minor difference that each and every MUMPS implementation has its own set of incompatible extensions.

    From the Micronetics implementation, I know that there are several tools for handling MUMPS code. The INDEX program is able to generate a cross-reference for a single MUMPS program or a bunch of MUMPS programs, with variables, syntax warnings, and so on. See ftp://ftp.intersys.com/pub/msm/docs/msm44/utility.pdf for details. Of course, it's just a MUMPS parser, not a MUMPS interpreter, and it seems to be ported from a really ancient version. It can be confused by "modern" code that uses device mnemonics, but it's the best available tool for the job (simply because it's the only one).

    Generally, don't try to work with MUMPS code outside a MUMPS system. You will fail at writing a MUMPS interpreter. Try to solve your problem inside MUMPS, or export your data from MUMPS to text files and handle those exports in a modern language. It's quite easy to write even XML or JSON from MUMPS to files, but parsing those formats correctly from MUMPS is nearly impossible.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      As yet I'm not called on to parse any Mumps code. All I get from Mumps are binary journals, data much like transaction journals from other DataBase Systems. The GT.M implementation of Mumps comes with a 'mupip' program that (among other things) gives me a character-based dump of the Journals, basically a human-readable text file of newline-delimited records with '\' as the field separator. I wonder if other current implementations of Mumps include mupip (or journals at all, for that matter). I can use 'split' to make a list of each record, and I'm fortunate that the Global Variable is in the final field, because I've even found embedded '\' characters in some Global Variable subscripts, but I can ignore them via the 3rd (limit) parameter to split.
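The limit trick described above might look roughly like this. The field count of 3 and the sample record are invented for illustration; the real extract has its own layout, which is not given in this thread. The journal_fields name is likewise hypothetical.

```perl
use strict;
use warnings;

# Assumption: the field count is made up; use whatever the real record
# type has, with the global reference as the last field.
my $FIELDS = 3;

# Split a record on '\' but stop after $FIELDS pieces, so any '\'
# inside the final field (the global and its subscripts) is left alone.
sub journal_fields {
    my ($record) = @_;
    chomp $record;
    return split /\\/, $record, $FIELDS;
}

my @f = journal_fields('05\\SET\\^STUFF("a\\b")=1' . "\n");
print "global field: $f[-1]\n";
```

With a positive limit, split stops splitting once it has produced that many fields and returns the rest of the string untouched as the last one.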

        Partly related:

        MSM has a %GS (global save) command that writes globals to text files, in a pretty simple format unsurprisingly called MSM format. The first line contains date and time, and some constants; the second line is the comment entered while running %GS; the following lines alternate between the global name including all subscripts, and the value. To announce the end of a global, both lines are "*"; to announce end of file, both lines are "**". Simple, readable, parseable with nearly no effort. Unless one of the globals happens to contain control characters like CR or LF. Even MSM can't read back those files it wrote just seconds ago. It's a shame.
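Going only by the description above, a reader for that format could be sketched as follows. Everything here is an assumption layered on that description: the sample header line is invented, parse_msm_dump is a hypothetical name, and the sketch does nothing about the CR/LF problem, which cannot be fixed on the reading side.

```perl
use strict;
use warnings;

# Sketch of a reader for the format as described: header line, comment
# line, then name/value line pairs, "*" "*" closing each global and
# "**" "**" closing the file.
sub parse_msm_dump {
    my ($text) = @_;
    my @lines   = split /\n/, $text;
    my $header  = shift @lines;
    my $comment = shift @lines;
    my @pairs;
    while (@lines >= 2) {
        my ($name, $value) = splice @lines, 0, 2;
        last if $name eq '**' && $value eq '**';   # end of file
        next if $name eq '*'  && $value eq '*';    # end of one global
        push @pairs, [$name, $value];
    }
    return { header => $header, comment => $comment, pairs => \@pairs };
}

# The header text below is made up; the real constants are not known here.
my $dump = parse_msm_dump(
    "14-MAY-2012 18:23\nnightly save\n^A(1)\nfoo\n^A(2)\n\n*\n*\n^B\n7\n*\n*\n**\n**\n"
);
print scalar(@{ $dump->{pairs} }), " name/value pairs\n";
```

Reading strictly in pairs keeps an empty value line from shifting names and values out of step, which a line-at-a-time loop could easily get wrong.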

        The companion program %GR (global restore) reads the globals back into the system. And I remember from browsing the sources that there is a second file format named "ANSI format", but unfortunately, I don't remember the details, and I don't have access to the MSM systems at work from home.

        My idea is to search for tools that are written to exchange data with other MUMPS systems. One of the design goals of ANSI MUMPS was to be able to exchange programs and data across the various implementations, so there should be tools. And because MUMPS is so old, my bet is that most exchange formats are simple ASCII files with a line-oriented format and simple delimiters, because that's what all MUMPS systems (and those grey-haired MUMPS coders) are able to handle.

        And by the way: Don't expect much error checking or even error handling in old tools. All MUMPS code I've seen (not only our own legacy system, but also the code delivered by Micronetics) is very optimistic regarding the well-formedness and validity of its input. It seems that no MUMPS coder ever mistrusted foreign data or user input. Unexpected input usually leads to crashes or damaged or lost data; get used to it.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: MUMPS Array Subscripts Parsing Via RegEx (sock puppet alert)
by Anonymous Monk on May 14, 2012 at 18:08 UTC
    For someone who just "showed up," your language hints that you have been here for a long long time.

      I've been in the industry since 1978 (PDP 11/04 assembly under DOS/Batch (no, not MSDOS, rather the DEC predecessor to RT11 & RSX).) but I encounter Perl only sporadically, and I've never posted to this website before. What's a 'sock puppet'?

Re: MUMPS Array Subscripts Parsing Via RegEx
by Clovis_Sangrail (Beadle) on May 16, 2012 at 15:50 UTC

    Thanks for the many helpful and/or just plain fascinating and entertaining replies! (Not that I'm saying you should stop replying...) I may wind up just staying with the occasionally inaccurate subscript extraction that I have now, but should I need to improve it you've all provided several helpful directions to pursue.

      I just ran across this in a Google search for something unrelated. I realize it has been almost five years since the original post, but if you're trying to pull data from a mupip journal extract output of GT.M, it's likely easiest to do it in GT.M - likely just a few lines of code. (I manage GT.M.)