Re^2: Extract sequence of UC words?

by BrowserUk (Pope)
on Aug 18, 2008 at 14:10 UTC

in reply to Re: Extract sequence of UC words?
in thread Extract sequence of UC words?


This doesn't work because the space in the character class means it matches the first single space in the line and returns that. You need to ensure that the match starts with an UPPER alpha, and then continues with UPPER alpha or space:

print $data =~ m/(\b[A-Z][A-Z ]+\b)/;;
TEST SENTENCE
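A minimal sketch of the difference, in case it helps. The parent node's pattern is not quoted here, so the "loose" pattern below is only an assumption about its shape: a single character class that allows a space. The sub names and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# ASSUMPTION: a stand-in for the earlier attempt, which is not quoted in
# this node. The only character class allows a space, so a lone space
# between two lower-case words already satisfies it.
sub first_match_loose {
    my ($text) = @_;
    my ($m) = $text =~ m/(\b[A-Z ]+\b)/;
    return $m;
}

# The pattern from this node: the first character must be an upper-case
# letter, and only then may upper-case letters or spaces follow.
sub first_match_anchored {
    my ($text) = @_;
    my ($m) = $text =~ m/(\b[A-Z][A-Z ]+\b)/;
    return $m;
}

my $data = 'This is a TEST SENTENCE.';    # hypothetical sample input
printf "loose:    [%s]\n", first_match_loose($data);
printf "anchored: [%s]\n", first_match_anchored($data);
```

With this input the loose pattern should return the single space between "This" and "is", while the anchored one returns the upper-case run.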

Re^3: Extract sequence of UC words?
by monarch (Priest) on Aug 18, 2008 at 17:01 UTC

    Unfortunately this would also match "TEST SENTENCE " (note the trailing whitespace).
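A quick sketch of when the trailing space shows up, assuming two made-up inputs. `\b` only asserts a boundary between a word character and a non-word character, and the position between a space and a following letter is such a boundary, so the greedy `[A-Z ]+` can stop after the space. The sub name is hypothetical.

```perl
use strict;
use warnings;

# Returns the first match of the pattern under discussion, or undef.
sub first_upper_run {
    my ($text) = @_;
    my ($m) = $text =~ m/(\b[A-Z][A-Z ]+\b)/;
    return $m;
}

# Hypothetical inputs: the run is followed by a lower-case word in the
# first string and by punctuation in the second.
for my $text ( 'a TEST SENTENCE here', 'a TEST SENTENCE.' ) {
    printf "%-22s => [%s]\n", $text, first_upper_run($text);
}
```

Only the first input should show the space inside the brackets; when the run is followed directly by punctuation, the match ends on the last letter.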

    The following test illustrates another method:

    #!/usr/bin/perl -w

    my $data = <<'EOF';
    This is a sentence. THIS \
    IS A SENTENCE. This is \
    a SEQUENCE OF UPPER WORDS and \
    this is not.
    EOF

    while ( $data =~ m/(\b(?:[A-Z]+(?:\s+[A-Z]+)*)+\b)/g ) {
        print "Upper Sentence: \"$1\"\n";
    }


    Upper Sentence: "THIS IS A SENTENCE"
    Upper Sentence: "SEQUENCE OF UPPER WORDS"
        The issue I have with your examples, BrowserUk, is that you are mandating at least 2 upper case letters. My regexp permits a single capital letter.

        I think it is important to have the optional section, because the desired expression is "one or more upper case letters" optionally followed by any number of "spaces followed by upper case letters".
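To make the single-letter case concrete, here is a small sketch that runs both patterns from this thread over one made-up string containing a lone capital letter. The helper name and the sample input are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: collect every capture of $re found in $text.
sub upper_runs {
    my ( $re, $text ) = @_;
    my @found = $text =~ /$re/g;
    return @found;
}

# Both patterns are copied from this thread.
my $upper_then_more = qr/(\b[A-Z][A-Z ]+\b)/;
my $one_or_more     = qr/(\b(?:[A-Z]+(?:\s+[A-Z]+)*)+\b)/;

my $text = 'Grade: A.';    # hypothetical input with a lone capital letter

my @first  = upper_runs( $upper_then_more, $text );
my @second = upper_runs( $one_or_more,     $text );

print "first pattern:  ", scalar(@first),  " match(es)\n";
print "second pattern: ", scalar(@second), " match(es): @second\n";
```

The first pattern needs an upper-case letter plus at least one more letter or space, so the lone "A" followed by a full stop is skipped; the second pattern picks it up.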

      I may be wrong, but I'm guessing from the backslashes in your heredoc that you want $data to contain a single-line string. I don't think what you have written will achieve that. With single quotes, the string keeps the literal backslashes along with the newlines; with double quotes, the backslashes disappear but the newlines remain. Doing a global substitution is one way of getting a single line. Consider the following code:

      use strict;
      use warnings;

      my $rcSep = sub { return q{*} x 20 . qq{\n} };

      print $rcSep->();

      my $singleQuoted = <<'EOD';
      Line 1\
      Line 2\
      Line 3
      EOD

      print $singleQuoted, $rcSep->();

      my $doubleQuoted = <<"EOD";
      Line 1\
      Line 2\
      Line 3
      EOD

      print $doubleQuoted, $rcSep->();

      ( my $transformed = <<'EOD' ) =~ s{\n+(?!\z)}{ }g;
      Line 1
      Line 2
      Line 3
      EOD

      print $transformed, $rcSep->();

      and its output

      ********************
      Line 1\
      Line 2\
      Line 3
      ********************
      Line 1
      Line 2
      Line 3
      ********************
      Line 1 Line 2 Line 3
      ********************

      I hope this is of interest.



Re^3: Extract sequence of UC words?
by dHarry (Abbot) on Aug 18, 2008 at 14:26 UTC

Re^3: Extract sequence of UC words?
by gaal (Parson) on Aug 18, 2008 at 15:58 UTC
