regex anchoring issue

penguin-attack has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have 3 kinds of input file that a user can supply:

8=FIX.4.2<SOH>

8=FIX.4.2^A (^A in text )

8=FIX.4.2^A (^A as a Control char hidden)

I want to split on the end character: either <SOH>, ^A (as text), or ^A (the control character).

My split works for \cA and <SOH>, but I can't get it working for the text-based ^A.

Logic: 1) read the file in; 2) run through each line, splitting on a regex match for either text ^A, Control-A, or <SOH>; 3) stick the pieces in an array; 4) remove whitespace from the array; 5) else throw an error.

The problem is that if a tag like 8=A^A is encountered, it strips the first A instead of the ^A.

I know it's staring me in the face, but I don't know why my regex ignores the text-based ^A.

Here's the code:

sub reader {
    my $stream = "$somefile";
    open( FILE, $stream ) or die "Cant open File:$!\n";
    while (<FILE>) {
        if ( $_ =~ m/[\^A\cA\<SOH\>]{1}/g ) {
            @split = split /[\^A\cA\<SOH\>]+/g, $_;

            foreach my $line (@split) {
                foreach my $line (@split) { ( $line =~ s/[\s\n]{1,}//g ); }
            }
        }
        else {
            my $url = "http:/xxxxx.com/error.html";
            my $t = 0;    # time until redirect activates
            print "<META HTTP-EQUIV=refresh CONTENT=\"$t;URL=$url\">\n";
            last;
        }
    }
}

Thanks for your time. Penguin.


Re: regex anchoring issue by kcott (Archbishop) on Feb 15, 2013 at 05:49 UTC
G'day penguin-attack,

Welcome to the monastery.

Firstly, your data description seems a little ambiguous: you say "end character" then describe `<SOH>` (5 chars), `^A` (2 chars) and `Ctrl-A` (1 char). If, by `<SOH>`, you mean the ASCII character, that is the same character as `Ctrl-A` (i.e. the character with the ASCII value of 1).

Your main problem in your regexp is the use of a character class (i.e. `[...]`): see Character Classes and other Special Escapes under perlre - Regular Expressions for details.

You also don't need the '`g`' modifier in either the match (`m/.../`) or the split function.

The following script does what I think you want (in terms of identifying the line endings). If not, please provide some sample data with expected output to remove the ambiguity I mentioned at the start.

    #!/usr/bin/env perl

    use 5.010;
    use strict;
    use warnings;

    my $soh_string     = 'soh_string<SOH>';
    my $caret_a_string = 'caret_a_string^A';
    my $ctrl_a_string  = 'ctrl_a_string' . chr(1);

    my $test_string = join('',
        $soh_string, $caret_a_string, $ctrl_a_string,
        $caret_a_string, $ctrl_a_string, $soh_string,
        $ctrl_a_string, $soh_string, $caret_a_string
    );

    my $string_re = qr{(?><SOH>|\^A|\cA)};

    say for split $string_re => $test_string;

Output:

    $ pm_soh_split.pl
    soh_string
    caret_a_string
    ctrl_a_string
    caret_a_string
    ctrl_a_string
    soh_string
    ctrl_a_string
    soh_string
    caret_a_string

-- Ken
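To make the character-class point concrete, here is a minimal sketch (not from the original posts) that runs the question's character class and an alternation over the same input. The sample line `8=A^A35=D^A` is made up for illustration, modelled on the `8=A^A` tag from the question.

```perl
use strict;
use warnings;

# Made-up sample line; "^A" here is the two-character text form.
my $line = '8=A^A35=D^A';

# Character class: matches any ONE of the characters ^ A Ctrl-A < S O H >,
# so the run "A^A" is treated as a single delimiter and the value "A" is lost.
my @by_class = split /[\^A\cA<SOH>]+/, $line;

# Alternation: matches one of the three whole delimiter strings.
my @by_alternation = split /<SOH>|\^A|\cA/, $line;

print join(',', @by_class), "\n";          # 8=,35=D
print join(',', @by_alternation), "\n";    # 8=A,35=D
```

The first result shows the symptom described in the question: the `A` that belongs to the tag value is swallowed along with the delimiter.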
Re^2: regex anchoring issue by BillKSmith (Monsignor) on Feb 15, 2013 at 14:01 UTC
Refer to charnames for a neat way to code the value of your $soh_string.

    use charnames qw(:full);
    $soh_string = "\N{SOH}";

Bill
Re^3: regex anchoring issue by kcott (Archbishop) on Feb 16, 2013 at 06:35 UTC
Thanks, Bill. I had considered that but decided not to use it due to the ambiguity I noted in my opening paragraph. Had penguin-attack wanted the single ASCII character `SOH`, instead of the string '`<SOH>`', that was covered by `Ctrl-A` (also noted).

[Side issue (struggling not to appear grossly pedantic): the `charnames` pragma has been distributed with Perl since at least v5.8.8; the perldoc link (charnames) would provide the most recent documentation.]

-- Ken
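The distinction between the single SOH character and the five-character string `<SOH>` can be checked directly. The following is a small sketch, assuming a reasonably recent Perl (5.16 or later); it uses the long name `START OF HEADING` instead of the abbreviation, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use charnames ':full';

# Four spellings of the same single character (ASCII value 1).
my $by_name = "\N{START OF HEADING}";
my $by_ctrl = "\cA";
my $by_chr  = chr(1);
my $by_hex  = "\x01";

my $all_equal = $by_name eq $by_ctrl
             && $by_ctrl eq $by_chr
             && $by_chr  eq $by_hex;
print "all four spellings are equal\n" if $all_equal;

# The literal text "<SOH>" is a different thing: five ordinary characters.
printf "length of SOH character: %d\n", length $by_name;    # 1
printf "length of '<SOH>' text:  %d\n", length '<SOH>';     # 5
```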
Re^2: regex anchoring issue by smls (Friar) on Feb 15, 2013 at 11:23 UTC
Why do you place the regex in a `(?>...)` non-backtracking group?
Re^3: regex anchoring issue by kcott (Archbishop) on Feb 16, 2013 at 05:58 UTC
Wrapping regexp alternations in `(?>...)` is something I do by default. While there may be rare cases where this might be problematical, I haven't encountered any: it's something that doesn't hurt and, indeed, often helps.

This usage is based on a "Perl Best Practices" guideline: Backtracking (page 269). It's summarised on page 271 as:

... rewrite any instance of:

    X | Y

as:

    (?> X | Y )

While I'm not a slave to all "Perl Best Practices" guidelines, this is one I have found to be useful.

Update: `s/have encountered/haven't encountered/`

-- Ken
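For readers wondering what such a rare case looks like, here is a minimal sketch (not from the thread) where the atomic group changes the result. It relies on one alternative being a prefix of another. That does not apply to the `<SOH>`, `^A`, and `Ctrl-A` delimiters, since none of them is a prefix of another.

```perl
use strict;
use warnings;

my $text = 'foobar';

# Ordinary group: after "foo" fails to reach the end of the string,
# the engine backtracks into the group and tries "foobar".
my $plain = $text =~ /^(?:foo|foobar)$/ ? 'match' : 'no match';

# Atomic group: once "foo" has matched, the group will not be re-entered,
# so "foobar" is never tried and the overall match fails.
my $atomic = $text =~ /^(?>foo|foobar)$/ ? 'match' : 'no match';

print "plain:  $plain\n";     # plain:  match
print "atomic: $atomic\n";    # atomic: no match
```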
Re: regex anchoring issue by smls (Friar) on Feb 15, 2013 at 12:01 UTC
If you need your regex to match one of several possible text fragments of which at least one is longer than 1 character, you have to use an alternation (`...|...|...`) instead of a character class (`[...]`).

Some additional comments on your code:

- The `{1}` quantifier in the first regex is redundant. Matching one occurrence is the default behavior if no quantifier is given.
- The `{1,}` quantifier in the third regex can be more succinctly written as `+`.
- The `/g` modifier is not needed in the first two regexes, as kcott already noted.
- Before the first regex, the `$_ =~` is redundant because Perl will match against that variable by default. Similarly, passing `$_` as the second parameter to split is redundant.
- I'm pretty sure you don't need to nest two `foreach my $line (@split) { ... }` loops... :)
- The outer parentheses in the line `( $line =~ s/[\s\n]{1,}//g )` are redundant.
- Inside the while loop, you create a new `@split` array for each iteration (i.e. for each line from the input file), but then you don't do anything with it. Did you just cut out the code that does something with the current line's `@split` array to keep your question shorter, or did you actually intend to add all split fragments from all lines into a single array that will be available after the end of the while loop? In the latter case you need to modify your code.
- You can probably restructure the code to avoid specifying the split regex twice. It would depend on your exact requirements. For example, if each split fragment has to be followed by one of the `"<SOH>"`/`"^A"`/`chr(1)` markers, i.e. no line of the file may end like `"8=FIX.4.2<SOH>8=FIX.4.2<SOH>8=FIX.4.2"` with a lone fragment at the end, you could use an `m/../g` regex like this to do the splitting without calling split, and then print the error based on whether any matches were found:

    my @split;
    foreach (m/(.+?)(?:<SOH>|\^A|\cA)/g) {
        s/\s+//g;
        push @split, $_;
    }
    if (!@split) {
        # print error here
        last;
    }
    # do stuff with @split here
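If the intent was the second option, collecting the fragments from all lines into one array, the loop could be restructured roughly as follows. This is a sketch under assumptions: the helper name `split_fix_lines` is hypothetical, it takes lines as arguments instead of reading a file, and it dies on a line without a delimiter where the original code printed a redirect.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the fragments of all lines in one list.
# Assumes every fragment must be followed by <SOH>, ^A (text) or Ctrl-A.
sub split_fix_lines {
    my @lines = @_;
    my $delim = qr/<SOH>|\^A|\cA/;    # the delimiter regex, written once
    my @fragments;

    for my $line (@lines) {
        # In list context, m//g returns every captured fragment.
        my @found = $line =~ /(.+?)$delim/g;
        die "No delimiter found in line: $line" unless @found;

        s/\s+//g for @found;          # strip whitespace from each fragment
        push @fragments, @found;
    }
    return @fragments;
}

my @fragments = split_fix_lines(
    "8=FIX.4.2<SOH>9=12<SOH>\n",
    "8=FIX.4.2^A35=D^A\n",
    "8=FIX.4.2\cA35=A\cA\n",
);
print "$_\n" for @fragments;
```

Keeping the delimiter in a single `qr//` object means the pattern is specified only once, which addresses the duplication noted in the list above.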
Re^2: regex anchoring issue by Anonymous Monk on Feb 15, 2013 at 22:30 UTC
Please avoid using colors casually (without specifying both foreground and background) on PerlMonks; they don't play well with themes. See "Tags You Should Not Use" in Markup in the Monastery:

`<font (something="something")>... </font>` tags are frowned upon. Don't use them except in extraordinary circumstances.

See also Customizing PerlMonks CSS, Help for Display Settings, and CSS Show and Tell: Colored Code.
