http://www.perlmonks.org?node_id=1018840

penguin-attack has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have 3 potential input files that a user can supply:

8=FIX.4.2<SOH>

8=FIX.4.2^A (^A in text )

8=FIX.4.2^A (^A as a Control char hidden)

I want to split on the end marker: either <SOH>, the text ^A, or the control character ^A.

My split works for \cA and <SOH>, but I can't get it working for the text-based ^A.

Logic: 1) read the file in; 2) run through each line, splitting on a regex match for either text ^A, Control-A or <SOH>; 3) stick the pieces in an array; 4) remove whitespace from the array; 5) else throw an error.

The problem is that if a tag like 8=A^A is encountered, it strips the first A instead of the ^A.

I know it's staring me in the face, but I don't know why my regex ignores the text-based ^A.

Here's the code:

    sub reader {
        my $stream = "$somefile";
        open( FILE, $stream ) or die "Cant open File:$!\n";
        while (<FILE>) {
            if ( $_ =~ m/[\^A\cA\<SOH\>]{1}/g ) {
                @split= split /[\^A\cA\<SOH\>]+/g, $_;
                foreach my $line (@split) {
                    foreach my $line (@split) {
                        ( $line =~ s/[\s\n]{1,}//g );
                    }
                }
            }
            else {
                my $url = "http:/xxxxx.com/error.html";
                my $t = 0; # time until redirect activates
                print "<META HTTP-EQUIV=refresh CONTENT=\"$t;URL=$url\">\n";
                last;
            }
        }
    }

Thanks for your time. Penguin.

Replies are listed 'Best First'.
Re: regex anchoring issue
by kcott (Archbishop) on Feb 15, 2013 at 05:49 UTC

    G'day penguin-attack,

    Welcome to the monastery.

    Firstly, your data description seems a little ambiguous: you say "end character" then describe <SOH> (5 chars), ^A (2 chars) and Ctrl-A (1 char). If by <SOH> you mean the ASCII control character, that is the same character as Ctrl-A (i.e. the character with the ASCII value of 1).

    The main problem in your regexp is the use of a character class (i.e. [...]): a class matches a single character from a set, not a sequence of characters. See Character Classes and other Special Escapes under perlre - Regular Expressions for details. You also don't need the 'g' modifier in either the match (m/.../) or the split function.
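
To make the character-class point concrete, here is a minimal sketch using the 8=A^A field from the question. The variable names are illustrative only and are not from the original post. It compares a class-based split with an alternation-based one.

```perl
use strict;
use warnings;

my $field = '8=A^A';    # literal caret followed by "A", as in the question

# A character class matches ONE character out of the set
#   ^  A  Ctrl-A  <  S  O  H  >
# so the "A" that belongs to the value is treated as a delimiter too.
my @by_class = split /[\^A\cA<SOH>]+/, $field;

# An alternation matches each delimiter as a whole sequence.
my @by_alternation = split /<SOH>|\^A|\cA/, $field;

print "class:       @by_class\n";          # class:       8=
print "alternation: @by_alternation\n";    # alternation: 8=A
```

With the class, the whole run "A^A" is consumed as delimiters, which is why the value's own "A" disappears.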

    The following script does what I think you want (in terms of identifying the line endings). If not, please provide some sample data with expected output to remove the ambiguity I mentioned at the start.

    #!/usr/bin/env perl

    use 5.010;
    use strict;
    use warnings;

    my $soh_string = 'soh_string<SOH>';
    my $caret_a_string = 'caret_a_string^A';
    my $ctrl_a_string = 'ctrl_a_string' . chr(1);

    my $test_string = join('',
        $soh_string, $caret_a_string, $ctrl_a_string,
        $caret_a_string, $ctrl_a_string, $soh_string,
        $ctrl_a_string, $soh_string, $caret_a_string
    );

    my $string_re = qr{(?><SOH>|\^A|\cA)};

    say for split $string_re => $test_string;

    Output:

    $ pm_soh_split.pl
    soh_string
    caret_a_string
    ctrl_a_string
    caret_a_string
    ctrl_a_string
    soh_string
    ctrl_a_string
    soh_string
    caret_a_string

    -- Ken

      Refer to charnames for a neat way to code the value of your $soh_string.

      use charnames qw(:full);
      $soh_string = "\N{SOH}";

      Bill

        Thanks, Bill. I had considered that but decided not to use it due to the ambiguity I noted in my opening paragraph. Had penguin-attack wanted the single ASCII character SOH, instead of the string '<SOH>', that was covered by Ctrl-A (also noted).

        [Side issue (struggling not to appear grossly pedantic): the charnames pragma has been distributed with Perl since at least v5.8.8 - the perldoc link (charnames) would provide the most recent documentation.]

        -- Ken
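
The distinction discussed in this sub-thread can be checked directly. The sketch below assumes a Perl recent enough to resolve the SOH name alias (5.16 or later, as far as I know); the variable names are made up for illustration. It shows that the named character, Ctrl-A and chr(1) are all the same single character, while the text marker '<SOH>' is a different, five-character string.

```perl
use strict;
use warnings;
use charnames qw(:full);

my $by_name = "\N{SOH}";    # named character, as suggested by Bill
my $by_ctrl = "\cA";        # control-character escape
my $by_chr  = chr(1);       # ASCII value 1
my $literal = '<SOH>';      # the text marker: five ordinary characters

printf "name eq ctrl:    %s\n", $by_name eq $by_ctrl ? 'yes' : 'no';
printf "name eq chr(1):  %s\n", $by_name eq $by_chr  ? 'yes' : 'no';
printf "name eq literal: %s\n", $by_name eq $literal ? 'yes' : 'no';
printf "lengths: %d vs %d\n", length $by_name, length $literal;
```

This is why the split pattern needs a separate alternative for the '<SOH>' text but only one for the control character.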

      Why do you place the regex in a (?>...) non-backtracking group?

        Wrapping regexp alternations in (?>...) is something I do by default. While there may be rare cases where this might be problematical, I haven't encountered any: it's something that doesn't hurt and, indeed, often helps.

        This usage is based on a "Perl Best Practices" guideline: Backtracking (page 269). It's summarised on page 271 as:

        ... rewrite any instance of:

        X | Y

        as:

        (?> X | Y )

        While I'm not a slave to all "Perl Best Practices" guidelines, this is one I have found to be useful.

        Update: s/have encountered/haven't encountered/

        -- Ken
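
As an illustration of the "rare cases" caveat, here is a small sketch (the pattern and input are invented for the example, not taken from the thread) where an atomic group changes the result because one alternative is a prefix of another.

```perl
use strict;
use warnings;

my $input = 'abc';

# Plain group: when "a" is not followed by "c", the engine backtracks
# into the group and tries the second alternative, "ab".
my $plain = $input =~ /^(?:a|ab)c$/ ? 1 : 0;

# Atomic group: once "a" has matched, the group discards its remaining
# alternatives, so "ab" is never tried and the overall match fails.
my $atomic = $input =~ /^(?>a|ab)c$/ ? 1 : 0;

print "plain:  $plain\n";     # plain:  1
print "atomic: $atomic\n";    # atomic: 0
```

In the split pattern above, none of <SOH>, ^A and Ctrl-A is a prefix of another, so the atomic group does not change what is matched there.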

Re: regex anchoring issue
by smls (Friar) on Feb 15, 2013 at 12:01 UTC

    If you need your regex to match one of several possible text fragments of which at least one is longer than 1 character, you have to use an alternation (...|...|...) instead of a character class ([...]).
     

    Some additional comments on your code:

    • The {1} quantifier in the first regex is redundant. Matching one occurrence is the default behavior if no quantifier is given.

    • The {1,} quantifier in the third regex can be more succinctly written as +.

    • The /g modifier is not needed in the first two regexes, as kcott already noted.

    • Before the first regex, the $_ =~ is redundant because Perl will match against that variable by default. Similarly, passing $_ as the second parameter to split is redundant.

    • I'm pretty sure you don't need to nest two foreach my $line (@split) { ... } loops... :)

    • The outer parentheses in the line ( $line =~ s/[\s\n]{1,}//g ) are redundant.

    • Inside the while loop, you create a new @split array for each iteration (i.e. for each line from the input file), but then you don't do anything with it. Did you just cut out the code that does something with the current line's @split array to keep your question shorter, or did you actually intend to add all split fragments from all lines into a single array that will be available after the end of the while loop? In the latter case you need to modify your code.

    • You can probably restructure the code to avoid specifying the split regex twice. It would depend on your exact requirements. For example if each split fragment has to be followed by one of the "<SOH>"/"^A"/chr(1) markers, i.e. no line of the file may end like "8=FIX.4.2<SOH>8=FIX.4.2<SOH>8=FIX.4.2" with a lone fragment at the end, you could use an m/../g regex like this to do the splitting without calling split, and then print the error based on whether any matches were found:

      my @split;
      foreach (m/(.+?)(?:<SOH>|\^A|\cA)/g) {
          s/\s+//g;
          push @split, $_;
      }
      if (!@split) {
          # print error here
          last;
      }
      # do stuff with @split here
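
Putting several of the comments above together, one possible shape for the reading loop is sketched below, assuming the goal is a single array holding the fragments from every line. The sample data, the in-memory filehandle and the names $data, $delimiter and @fields are all hypothetical stand-ins for the real file and variables.

```perl
use strict;
use warnings;

# Hypothetical sample input; the third record uses a real Ctrl-A character.
my $data = "8=FIX.4.2<SOH>9=12<SOH>\n"
         . "8=FIX.4.2^A35=D^A\n"
         . "8=FIX.4.2\cA49=ABC\cA\n";

# Alternation, so each marker is matched as a whole sequence.
my $delimiter = qr/<SOH>|\^A|\cA/;

# Read from an in-memory filehandle so the sketch is self-contained.
open my $fh, '<', \$data or die "Can't open in-memory file: $!\n";

my @fields;    # declared outside the loop so it survives every iteration
while ( my $line = <$fh> ) {
    chomp $line;
    my @parts = split $delimiter, $line;
    s/\s+//g for @parts;                     # strip whitespace from each fragment
    push @fields, grep { length } @parts;    # keep only non-empty fragments
}
close $fh;

print "$_\n" for @fields;
```

The error handling for lines without any marker is left out here; it would depend on whether such lines should abort the whole run or just be skipped.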