PerlMonks  

Regex word options

by jeffrgsf (Novice)
on Oct 31, 2012 at 22:30 UTC (#1001753=perlquestion)
jeffrgsf has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to make one regex that will grab strings that start with either "Transcribed," OR "Arranged,". This works:

$work =~ / (\bTranscribed,\b? For [^,^\(^#^-]+)/ )

but why won't this? It doesn't grab either! :-(

$work =~ / (\bTranscribed,\b?\bArranged,\b? For [^,^\(^#^-]+)/ )


PS: I know the commas make no sense grammatically. I'm dealing with data with tons of weird errors in it.

Re: Regex word options
by state-o-dis-array (Hermit) on Oct 31, 2012 at 22:36 UTC
    There's some stuff there I haven't seen before, but perhaps this might at least get you moving forward:
    $work =~ / (\b(?:Transcribed|Arranged),\b? For [^,^\(^#^-]+)/ )
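To see why the second pattern in the question matches nothing, here is a minimal sketch. The sub names and sample strings are made up for illustration, and the patterns are simplified (the `\b` assertions are left out). Writing the two words one after the other asks for both of them in sequence, whereas an alternation group accepts either one.

```perl
use strict;
use warnings;

# Hypothetical helper: both words written back to back, as in the question.
# The engine must find "Transcribed," immediately followed by "Arranged,".
sub matches_sequence {
    my ($work) = @_;
    return $work =~ /Transcribed,Arranged, For/ ? 1 : 0;
}

# Hypothetical helper: an alternation group accepts either word.
sub matches_either {
    my ($work) = @_;
    return $work =~ /(?:Transcribed|Arranged), For/ ? 1 : 0;
}

for my $work ('Foo, Transcribed, For Orchestra', 'Bar Arranged, For Mixed Voices') {
    printf "%-35s sequence=%d either=%d\n",
        $work, matches_sequence($work), matches_either($work);
}
```

The sequence form only succeeds on text that literally contains "Transcribed,Arranged, For", which real titles are unlikely to have.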
Re: Regex word options
by ww (Bishop) on Nov 01, 2012 at 01:03 UTC

    Not exactly as you stated the problem, but an extensible approach:

    C:\>perl -E "my @work=(\"I Arranged a meet.\", \"Transcribe this\"); for $work(@work) { if ($work =~ /Transcribe||Arranged/) { say $work; } }
    I Arranged a meet.
    Transcribe this
    C:\>

    The \b does nothing useful as you state your problem; alternation is better done with an ||, ("or"). Solving the captures as you need them is left as an exercise.

      The || does not work inside a regular expression (well, it does, but it matches anything, because the empty alternative between the two bars matches the empty string). Add some more test strings. Use a single | inside a regex, or use two regexes connected by ||.
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        Alas, I erred.

        The honorable choroba is correct. Blame simple carelessness, intermittent and inadequate internet access in my second consecutive month on the road, and my plain error for the wrongful use of ||. As written above it should be a single vertical bar; as choroba notes, it could also be written as:

        #!/usr/bin/perl
        use 5.10.0;
        my @work=("I Arranged a meet.", "Transcribe this", "Shud Not MATCH");
        for $work(@work) {
            if (($work =~ /Transcribe/) || ($work =~ /Arranged/) ) {
                say $work;
            }
        }
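The point made in this sub-thread can be checked with a short sketch. The two sub names are hypothetical; the test strings are the ones used above. With a doubled bar the pattern has an empty middle alternative, and an empty alternative succeeds at any position, so every string matches.

```perl
use strict;
use warnings;

# Hypothetical helpers, written only to compare the two spellings.
# "||" inside a regex means: "Transcribe", or nothing at all, or "Arranged".
sub double_bar { my ($s) = @_; return $s =~ /Transcribe||Arranged/ ? 1 : 0 }
sub single_bar { my ($s) = @_; return $s =~ /Transcribe|Arranged/  ? 1 : 0 }

for my $s ('I Arranged a meet.', 'Transcribe this', 'Shud Not MATCH') {
    print "$s: double=", double_bar($s), " single=", single_bar($s), "\n";
}
```

Only the single-bar version rejects the third string, which is why a test set needs at least one line that should not match.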
Re: Regex word options
by abualiga (Scribe) on Nov 01, 2012 at 02:51 UTC

    agree with ww. The '||' operator may be better suited here than the word boundary '\b'. Also, what are you trying with the '?' non-greedy quantifier? Are you looking for lines starting with these words? Perhaps I'm missing something, but you would probably benefit more from providing some input data.

      why not try:

      if( $inputStr =~ /^(Arranged\,|Transcribe\,)/ ){
          ### do your stuff
      }
Try this
by space_monk (Chaplain) on Nov 01, 2012 at 10:40 UTC
    Input:
    This line Transcribed, For David Jones #
    This line Arranged, For Mike Johnson (great)
    Terrible weather today
    Program:
    #!/bin/perl
    my $file='regex.txt';
    my $work = do {
        local $/;
        open my $fh, "<", $file or die "could not open $file: $!";
        <$fh>;
    };
    while ($work =~ /(Transcribed|Arranged),\s+For\s+([\w\s]+)/g) {
        print "Match: $1 Person: $2\n";
    }
    Output:
    Match: Transcribed Person: David Jones
    Match: Arranged Person: Mike Johnson
    Comments:

    The \b (word boundary) matches have been removed as they don't really do anything. Note that it looks for (Transcribed|Arranged), as suggested by others. The code treats normal alphanumeric characters and spaces as the name, stopping at the first character that doesn't match, but this can easily be changed.

      Thank you all for your efforts. I see I should have been clearer about what I want. Here goes.

      The script I'm writing will process about 180,000 lines of a text file, each of which is the title of a work of music. The data is a mess: there's no consistency at all. My job is to put it into a consistent format.

      To take out the instrument type, I used this regex:
      if ( $work =~ / (For [^,^\(^#^-]+)/ ) ...
      ...to grab anything between the word "For" and a comma, open parenthesis, pound sign, or hyphen. No problem. But I saw that some works have the text "Transcribed, For" OR "Arranged, For". In those cases, I want to grab the word "Transcribed" OR "Arranged" as well.

      state-o-dis-array's solution would seem to work, but doesn't. It doesn't even catch a simple example like the first below:

      INPUT:
      Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Péchés De Vieillesse, Book 1), Qr Iv/30
      À La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)
      All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices


      DESIRED POST-REGEX OUTPUT:
      Tirana Alla Spagnola (Rossinizzatta), (Péchés De Vieillesse, Book 1), Qr Iv/30
      À La Chapelle Sixtine, S 360 (Lw G26)
      All Through The Night, Traditional Welsh Song


      (NOTE that I'm saving the instrument type in another variable, but that's not the problem.) To answer abualiga's question, I need the '?' non-greedy quantifier because not every line will have "Transcribed" or "Arranged" in it. I realize I could use a few different regexes connected with the OR operator '||', but I have so many different cases that it starts getting very long and tedious. It may come to that, though.

      To answer ww and space_monk, I think I need to have the '\b' word boundaries in there so I can use the '?'.
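On the point about needing `\b` in order to use `?`: the `?` quantifier can be attached to a group, so the optional word does not need a word boundary to hang on (and an optional zero-width assertion constrains nothing, since the match proceeds whether or not it holds). A minimal sketch, assuming a hypothetical sub named `for_clause` and a simplified pattern; it is not the solution given later in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (optional word, instrument text), or an empty
# list when the title has no "For ..." clause at all.
sub for_clause {
    my ($work) = @_;
    # (?: ... )? makes the whole "Transcribed, " / "Arranged, " prefix optional.
    # [^,(#-]+ runs up to the next comma, open parenthesis, pound sign, or hyphen.
    if ($work =~ /\b(?:(Transcribed|Arranged),\s+)?For\s+([^,(#-]+)/) {
        return ($1 // '', $2);    # $1 is undef when the prefix is absent
    }
    return;
}

for my $work (
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
) {
    my ($word, $instrument) = for_clause($work);
    print "word=[$word] instrument=[$instrument]\n";
}
```

The instrument text may keep a trailing space when it is followed by an open parenthesis, so a real script would probably trim it.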

        Here's an approach for the 'simple' input you give as an example. A larger chunk (but still a reasonable amount!) of more realistic input might yield a better solution. I see some other, similar postings from you – is there another thread on this with more data?

        Notes:

        • I use \x23 instead of a '#' character in the  [^-,^(\x23] character set below because of a peculiarity of my little command-line processor. You should just use '#' instead. (BTW: I'm not sure what all the '^' (caret) characters were doing in this set as originally posted, so I left one in there just for good luck!)
        • The input text I use does not have accented characters. I can't display these easily on my console and so cannot test them.
        • Because I do not use accented characters in my test input text, the regexes are untested with such characters. (Update: See Update 1 below.)
        • As to the input text: Note that word 'Arranged' in the third record (i.e., line) has no preceding comma: 'Welsh Song Arranged'. Is this an example of real input, or a posting tyop? In any event, the code as it stands handles this variation.
        • I concentrate on extracting what I take to be the critical fields from each record: the title of the piece and its source. You can stitch them together how you want, with commas, whitespace, whatever.
        • Sorry for any wrap-around in the code listing.

        >perl -wMstrict -le
        "my @input = (
           'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
           'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
           'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
           );
         ;;
         my $not_source = qr{ [^-,^(\x23] }xms;
         my $kruft = qr{ \s* ,? \s* }xms;
         my $ar_tr = qr{ Arranged | Transcribed }xms;
         my $at_for = qr{ $kruft $ar_tr? $kruft For $not_source+ $kruft }xms;
         my $rx_title = qr{ (?! $at_for) . }xms;
         ;;
         for (@input) {
           print qq{[[$_]]};
           my ($title, $source) = m{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms;
           print qq{:$title: :$source:};
           }
        "
        [[Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30]]
        :Tirana Alla Spagnola (Rossinizzatta): :(Peches De Vieillesse, Book 1), Qr Iv/30:
        [[A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)]]
        :A La Chapelle Sixtine: :S 360 (Lw G26):
        [[All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices]]
        :All Through The Night, Traditional Welsh Song: ::

        Updates:

        1. I have since tried this code as a regular source file with accented characters, and it seems to work.
        2. Actually,  m{ \A \s* (.*?) $at_for (.*?) \s* \z }xms works just as well (for the limited test set), and is probably a bit faster.
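On the stray '^' characters mentioned in the notes above: inside a bracketed character class only a leading '^' negates; any later '^' is an ordinary character added to the excluded set. A small sketch (both sub names are hypothetical) suggesting that the class from the question could be written with a single caret, unless a literal '^' really should end the capture:

```perl
use strict;
use warnings;

# Class as posted in the question: the later carets are literal '^' characters,
# so a '^' in the data stops the capture.
sub take_posted { my ($s) = @_; return $s =~ /^([^,^\(^#^-]+)/ ? $1 : '' }

# Same class with the literal carets dropped: stops only at , ( # or -
sub take_plain  { my ($s) = @_; return $s =~ /^([^,(#-]+)/ ? $1 : '' }

for my $s ('Orchestra, S 360', 'Mixed^Voices, x') {
    print "posted=[", take_posted($s), "] plain=[", take_plain($s), "]\n";
}
```

The two classes behave the same on any text that contains no '^', which is presumably true of most music titles.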

        Okay:

        #!/bin/perl
        my $file='regex.txt';
        open my $fh, "<", $file or die "could not open $file: $!";
        while (<$fh>) {
            chomp;
            if ( /(.*?)\s?(Transcribed|Arranged)?,\s+For\s+([^(,]+),?(.*)/) {
                print "$1$4\n";
                print STDERR "# Instrument: $3 T/A:$2\n";
            }
        }

        Using your input, ./program.pl 2>/dev/null output is:

        Tirana Alla Spagnola (Rossinizzatta)(Péchés De Vieillesse, Book 1), Qr Iv/30
        À La Chapelle Sixtine, S 360 (Lw G26)
        All Through The Night, Traditional Welsh Song
Re: Regex word options
by space_monk (Chaplain) on Nov 04, 2012 at 15:07 UTC

    Put this in music.pl and then go perl -n music.pl < data_file 2>/dev/null

    chomp;
    if ( /(.*?)\s?(Transcribed|Arranged)?,\s+For\s+([^(,]+),?(.*)/) {
        print "$1$4\n";
        print STDERR "# Instrument: $3 T/A:$2\n";
    }
    This produces the output requested given the sample input you provided.
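One detail worth noting about the pattern above: when a line contains neither "Transcribed" nor "Arranged", the optional group leaves $2 undefined, so interpolating it raises an "uninitialized value" warning if warnings are enabled. A hedged sketch of the same pattern wrapped in a hypothetical sub, `split_title`, that defaults the capture to an empty string (the hash keys are also made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pattern from the post above.
# Returns a hash reference on a match, or nothing when the line has no "For" clause.
sub split_title {
    my ($line) = @_;
    if ($line =~ /(.*?)\s?(Transcribed|Arranged)?,\s+For\s+([^(,]+),?(.*)/) {
        return {
            title      => "$1$4",     # text before and after the "For ..." clause
            instrument => $3,
            how        => $2 // '',   # empty string when the optional word is absent
        };
    }
    return;
}

my $r = split_title('A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)');
print "$r->{title} | $r->{instrument} | $r->{how}\n" if $r;
```

The defined-or operator `//` needs Perl 5.10 or later; on older versions `defined $2 ? $2 : ''` does the same job.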
