http://www.perlmonks.org?node_id=1031639

BillDowns has asked for the wisdom of the Perl Monks concerning the following question:

I have a script that has works just fine, but my issue is with a regular expression. The script reads in a parameters file where any number of regexes specify which HL7 transactions qualify for extraction from an archive file. Each complete transaction is contained in a string, then the regexes are applied, and if all are true, the transaction is extracted. Each transaction is compose of "segments" each with their own identifier. Each segment contains several fields, components, subcomponents, etc.

Sample transaction with patient info (the PID stuff) removed:

MSH|^~\&|||||201304221951||ADT^A02|1492|P|2.2|||AL|NE| EVN|A02|201304221951| PID|1|... PV1|1|O|6MED^623^1|O||ASP|doctor-id^doctor-last-name^doctor-first-name +^doctor-mi^^^^^^^^^XX|||ASP||||OS||||SDC||A|||||||||||||||||||MSMH||R +EG|||201304220918

I need any Patient Transfers (identified by the ADT^A02) where the previous location (field 6 in segment PV1) is missing.

The first condition is MSH\|(.*?\|){7}ADT\^A02 and it works fine.

The issue is finding empty PV1-6 fields. I think that condition is PV1\|1\|O\|(.*?\|){3}\| However, I am getting transactions where field 6 is valued.

So what am I getting wrong?

Replies are listed 'Best First'.
Re: Selecting HL7 Transactions
by LanX (Saint) on May 01, 2013 at 20:36 UTC
    better try [^|] instead of .

    PV1\|1\|O\|([^|]*?\|){3}\|

    do you really need to capture field 2-5 ?

    Honestly this code hurts my eyes and horrifies my senses for maintenance !

    Better consider something like putting your first regex in a while loop and analyzing the captured string within that loop.

    Friedl's book is a good read!

    Cheers Rolf

    ( addicted to the Perl Programming Language)

Re: Selecting HL7 Transactions
by kcott (Archbishop) on May 01, 2013 at 22:54 UTC

    G'day BillDowns,

    You have a number of issues here. I've included a fair amount of detail below but refer to perlre for the full story.

    • You're not actually showing a regex but just a fragment of one (I'll assume "/PV1\|1\|O\|(.*?\|){3}\|/"). I'm not trying to be pedantic but I can only respond to what you've written: for all I know, "PV1\|1\|O\|(.*?\|){3}\|" may be part of a larger regex. Also, I have no idea what modifiers, if any, you've used.
    • The "." in ".*?" matches any character including a pipe ("|") character which isn't what you want. (That's a slight oversimplication: it doesn't match a newline character unless you used the "s" modifier.) So, ".*?" would be better as "[^|]*" (zero or more characters that aren't pipe characters).
    • You don't anchor the regex so it could match anywhere in the string. To match at the beginning of the string you'll need to prepend "^" or "\A".
    • You've used capturing parentheses "( ... )" here. This won't break anything as it currently stands but could become an issue if you do want to capture fields later: "(?: ... )" (for clustering, not capturing) would be better.
    • Purely as a matter of style and personal taste, replacing the escaped pipe "\|" with the character class "[|]" may reduce what's been referred to as backslashitis and improve readability. Either is fine, it's up to you.

    Putting all that together, you end up with a few options. Minimal changes would give: "/^PV1\|1\|O\|(?:[^|]*\|){3}\|/".

    Having said all that, I'm wondering if splitting the lines on pipe characters might just be a whole lot easier in terms of general readability and future maintenance. Something along these lines:

    my @fields = split /[|]/ => $line; ... if ($fields[0] eq 'MSH' and $fields[8] eq 'ADT^A02') { ... } ... if ($fields[0] eq 'PV1' and $fields[6] eq '') { ... } ...

    -- Ken

      Thanks, but I guess I did not make it clear - this is a utility script that extracts transactions from an archive based on the regular expressions I give it at run time. That's all it does - extracts transactions to a file.

      /PV1\|1\|O\|(.*?\|){3}\|/ was one of several regexes evaluated by itself. If all are true, the transaction is extract to an output file.

      I know about anchors - PV1 segments are a ways into the transaction as I showed in the sample, so I could not use an anchor.

      The parentheses are used for the repeat factor. All my research on the internet indicates a multi-character pattern that needs to be repeated multiple times should be enclosed in parentheses. Is this not correct?

        "The parentheses are used for the repeat factor. All my research on the internet indicates a multi-character pattern that needs to be repeated multiple times should be enclosed in parentheses. Is this not correct?"

        Here's a test showing clustering and capturing. Both match as expected. Capturing also sets $1.

        $ perl -Mstrict -Mwarnings -E ' my $re1 = qr{PV1\|1\|O\|(?:[^|]*\|){3}\|}; my $re2 = qr{PV1\|1\|O\|([^|]*\|){3}\|}; my $x = q{PV1|1|O|F3|F4|F5|F6|F7}; my $y = q{PV1|1|O|F3|F4|F5||F7}; say "------- Clustering -------"; say "Match in \$x" if $x =~ /$re1/; say $1 if $1; say "Match in \$y" if $y =~ /$re1/; say $1 if $1; say "------- Capturing -------"; say "Match in \$x" if $x =~ /$re2/; say $1 if $1; say "Match in \$y" if $y =~ /$re2/; say $1 if $1; ' ------- Clustering ------- Match in $y ------- Capturing ------- Match in $y F5|

        -- Ken

Re: Selecting HL7 Transactions
by Kenosis (Priest) on May 01, 2013 at 22:20 UTC

    Why a regex, in this case, instead of splitting on "|" or (better yet) using Text::CSV to get the fields' values?

Re: Selecting HL7 Transactions
by GotToBTru (Prior) on May 01, 2013 at 22:11 UTC
    I would suggest using regexes only to pull individual segments out of the transaction string.
    $PV1segment = ($transaction =~ /PV1.+?~/;
    assuming ~ is the segment terminator. Then, use split to break the segments into arrays. It is much easier than dealing with regexes, and takes advantage of the delimited structure of the data.
Re: Selecting HL7 Transactions
by sundialsvc4 (Abbot) on May 02, 2013 at 12:29 UTC

    My first comment would be that you should check out Net::HL7, either to use it directly or to study its mysteries.   “Do Not Do A Thing Already Done.™”

    Second, the HL7 format is designed to be simple:   it is “pipe-delimited,” therefore a simple split() can be used to break up each string, and the parameters are positional.

    Now reaching down a level to the core problem itself ... this sort of thing is most-easily handled by finite-state machine (FSM) techniques.   Each HL7 transaction starts with a known group, such as MSH in this case, and ends either with a known terminator group or the start of the next group.   A state-machine might therefore start in a SKIP_FOR_MSH state, whereupon it empties a list, pushes the current record onto it, and switches, say, to SKIP_FOR_PV1 which is pushing strings until it encounters a PV1 or another MSH (the latter meaning that PV1 is not there).   Now, it sees if the group is missing.   If not, this message is uninteresting and we revert to SKIP_FOR_MSH state.   Otherwise we switch to COLLECT_BAD_TXN and I think you can take it from here.

    This logic is simplified by writing the main loop in such a way that you can process the same record twice, so that you can, if you want to, stash the MSH record, switch to SKIP_FOR_MSH state, and know that the just-stashed record will be the next record seen in that state.

    Lest your eyes glaze over, go to http://search.cpan.org and search for “FSM.”   Do Not Do A Thing Already Done.™

Re: Selecting HL7 Transactions
by BillDowns (Novice) on May 01, 2013 at 22:44 UTC

    Using [^|] instead of . worked, but I do not understand why. It was my impression that .*?\| would match anything to the next vertical bar, which is what [^|]*?\| does.

      > It was my impression that .*?\| would match anything to the next vertical bar, which is what ^*?\| does.

      exactly anything starting from the left including other delimiters, till you have the empty cell between two delimiters you are looking for! :)

      just test it: (with semicolons as delimiters to simplify the regex)

      DB<112> $x=';a;not;b;not;c;;' => ";a;not;b;not;c;;" DB<113> print $x =~ /;(.*?);;/ a;not;b;not;c DB<114> print $x =~ /;([^;]*?);;/ c

      clearer now?

      Cheers Rolf

      ( addicted to the Perl Programming Language)

        Actually, no. Everything I've read on non-greedy matching says .*?\| should match everything up to the next \|.