Re: Selecting HL7 Transactions
by LanX (Saint) on May 01, 2013 at 20:36 UTC
|
better try [^|] instead of .
PV1\|1\|O\|([^|]*?\|){3}\|
do you really need to capture field 2-5 ?
Honestly this code hurts my eyes and horrifies my senses for maintenance !
Better consider something like putting your first regex in a while loop and analyzing the captured string within that loop.
Friedl's book is a good read!
Cheers Rolf
( addicted to the Perl Programming Language)
| [reply] [Watch: Dir/Any] [d/l] [select] |
Re: Selecting HL7 Transactions
by kcott (Archbishop) on May 01, 2013 at 22:54 UTC
|
G'day BillDowns,
You have a number of issues here. I've included a fair amount of detail below but refer to perlre for the full story.
-
You're not actually showing a regex but just a fragment of one (I'll assume "/PV1\|1\|O\|(.*?\|){3}\|/"). I'm not trying to be pedantic but I can only respond to what you've written: for all I know, "PV1\|1\|O\|(.*?\|){3}\|" may be part of a larger regex. Also, I have no idea what modifiers, if any, you've used.
-
The "." in ".*?" matches any character including a pipe ("|") character which isn't what you want. (That's a slight oversimplication: it doesn't match a newline character unless you used the "s" modifier.) So, ".*?" would be better as "[^|]*" (zero or more characters that aren't pipe characters).
-
You don't anchor the regex so it could match anywhere in the string. To match at the beginning of the string you'll need to prepend "^" or "\A".
-
You've used capturing parentheses "( ... )" here. This won't break anything as it currently stands but could become an issue if you do want to capture fields later: "(?: ... )" (for clustering, not capturing) would be better.
-
Purely as a matter of style and personal taste, replacing the escaped pipe "\|" with the character class "[|]" may reduce what's been referred to as backslashitis and improve readability. Either is fine, it's up to you.
Putting all that together, you end up with a few options. Minimal changes would give: "/^PV1\|1\|O\|(?:[^|]*\|){3}\|/".
Having said all that, I'm wondering if splitting the lines on pipe characters might just be a whole lot easier in terms of general readability and future maintenance. Something along these lines:
my @fields = split /[|]/ => $line;
...
if ($fields[0] eq 'MSH' and $fields[8] eq 'ADT^A02') { ... }
...
if ($fields[0] eq 'PV1' and $fields[6] eq '') { ... }
...
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
Thanks, but I guess I did not make it clear - this is a utility script that extracts transactions from an archive based on the regular expressions I give it at run time. That's all it does - extracts transactions to a file.
/PV1\|1\|O\|(.*?\|){3}\|/ was one of several regexes evaluated by itself. If all are true, the transaction is extract to an output file.
I know about anchors - PV1 segments are a ways into the transaction as I showed in the sample, so I could not use an anchor.
The parentheses are used for the repeat factor. All my research on the internet indicates a multi-character pattern that needs to be repeated multiple times should be enclosed in parentheses. Is this not correct?
| [reply] [Watch: Dir/Any] [d/l] |
|
$ perl -Mstrict -Mwarnings -E '
my $re1 = qr{PV1\|1\|O\|(?:[^|]*\|){3}\|};
my $re2 = qr{PV1\|1\|O\|([^|]*\|){3}\|};
my $x = q{PV1|1|O|F3|F4|F5|F6|F7};
my $y = q{PV1|1|O|F3|F4|F5||F7};
say "------- Clustering -------";
say "Match in \$x" if $x =~ /$re1/;
say $1 if $1;
say "Match in \$y" if $y =~ /$re1/;
say $1 if $1;
say "------- Capturing -------";
say "Match in \$x" if $x =~ /$re2/;
say $1 if $1;
say "Match in \$y" if $y =~ /$re2/;
say $1 if $1;
'
------- Clustering -------
Match in $y
------- Capturing -------
Match in $y
F5|
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
|
|
|
|
Re: Selecting HL7 Transactions
by Kenosis (Priest) on May 01, 2013 at 22:20 UTC
|
Why a regex, in this case, instead of splitting on "|" or (better yet) using Text::CSV to get the fields' values?
| [reply] [Watch: Dir/Any] [d/l] |
Re: Selecting HL7 Transactions
by GotToBTru (Prior) on May 01, 2013 at 22:11 UTC
|
I would suggest using regexes only to pull individual segments out of the transaction string.
$PV1segment = ($transaction =~ /PV1.+?~/;
assuming ~ is the segment terminator. Then, use split to break the segments into arrays. It is much easier than dealing with regexes, and takes advantage of the delimited structure of the data.
| [reply] [Watch: Dir/Any] [d/l] |
Re: Selecting HL7 Transactions
by sundialsvc4 (Abbot) on May 02, 2013 at 12:29 UTC
|
My first comment would be that you should check out Net::HL7, either to use it directly or to study its mysteries. “Do Not Do A Thing Already Done.™”
Second, the HL7 format is designed to be simple: it is “pipe-delimited,” therefore a simple split() can be used to break up each string, and the parameters are positional.
Now reaching down a level to the core problem itself ... this sort of thing is most-easily handled by finite-state machine (FSM) techniques. Each HL7 transaction starts with a known group, such as MSH in this case, and ends either with a known terminator group or the start of the next group. A state-machine might therefore start in a SKIP_FOR_MSH state, whereupon it empties a list, pushes the current record onto it, and switches, say, to SKIP_FOR_PV1 which is pushing strings until it encounters a PV1 or another MSH (the latter meaning that PV1 is not there). Now, it sees if the group is missing. If not, this message is uninteresting and we revert to SKIP_FOR_MSH state. Otherwise we switch to COLLECT_BAD_TXN and I think you can take it from here.
This logic is simplified by writing the main loop in such a way that you can process the same record twice, so that you can, if you want to, stash the MSH record, switch to SKIP_FOR_MSH state, and know that the just-stashed record will be the next record seen in that state.
Lest your eyes glaze over, go to http://search.cpan.org and search for “FSM.” Do Not Do A Thing Already Done.™
| [reply] [Watch: Dir/Any] |
Re: Selecting HL7 Transactions
by BillDowns (Novice) on May 01, 2013 at 22:44 UTC
|
Using [^|] instead of . worked, but I do not understand why. It was my impression that .*?\| would match anything to the next vertical bar, which is what [^|]*?\| does.
| [reply] [Watch: Dir/Any] [d/l] [select] |
|
> It was my impression that .*?\| would match anything to the next vertical bar, which is what ^*?\| does.
exactly anything starting from the left including other delimiters, till you have the empty cell between two delimiters you are looking for! :)
just test it: (with semicolons as delimiters to simplify the regex)
DB<112> $x=';a;not;b;not;c;;'
=> ";a;not;b;not;c;;"
DB<113> print $x =~ /;(.*?);;/
a;not;b;not;c
DB<114> print $x =~ /;([^;]*?);;/
c
clearer now?
Cheers Rolf
( addicted to the Perl Programming Language)
| [reply] [Watch: Dir/Any] [d/l] |
|
Actually, no. Everything I've read on non-greedy matching says .*?\| should match everything up to the next \|.
| [reply] [Watch: Dir/Any] [d/l] [select] |
|