PerlMonks  

Regex word options

by jeffrgsf (Novice)
on Oct 31, 2012 at 22:30 UTC (#1001753=perlquestion)
jeffrgsf has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to make one regex that will grab strings that start with either "Transcribed," OR "Arranged,". This works:

$work =~ / (\bTranscribed,\b? For [^,^\(^#^-]+)/ )

but why won't this? It doesn't grab either! :-(

$work =~ / (\bTranscribed,\b?\bArranged,\b? For [^,^\(^#^-]+)/ )


PS: I know the commas make no sense grammatically. I'm dealing with data with tons of weird errors in it.

Re: Regex word options
by state-o-dis-array (Hermit) on Oct 31, 2012 at 22:36 UTC
    There's some stuff there I haven't seen before, but perhaps this might at least get you moving forward:
    $work =~ / (\b(?:Transcribed|Arranged),\b? For [^,^\(^#^-]+)/ )
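A minimal sketch of the grouping idea, for checking it against sample titles. The `grab` helper is hypothetical (not from the thread), the `\b?` pieces are left out, and the character class is simplified to `[^,(#-]`. Because the pattern still requires one of the two words, a title that has only "For ..." is deliberately not matched here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured text, or undef when nothing matches.
sub grab {
    my ($work) = @_;
    # (?:...) groups the two alternatives without creating an extra capture.
    return $work =~ / ((?:Transcribed|Arranged), For [^,(#-]+)/ ? $1 : undef;
}

for my $work (
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano',
) {
    my $hit = grab($work);
    print defined $hit ? "matched: [$hit]\n" : "no match\n";
}
```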
Re: Regex word options
by ww (Bishop) on Nov 01, 2012 at 01:03 UTC

    Not exactly as you stated the problem, but an extensible approach:

    C:\>perl -E "my @work=(\"I Arranged a meet.\", \"Transcribe this\"); for $work(@work) { if ($work =~ /Transcribe||Arranged/) { say $work; } }"
    I Arranged a meet.
    Transcribe this
    C:\>

    The \b does nothing useful as you state your problem; alternation is better done with an ||, ("or"). Solving the captures as you need them is left as an exercise.

      The || does not work inside a regular expression (well, it compiles, but the empty alternative between the two bars matches anything). Add some more test strings. Use a single | inside a regex, or use two regexes connected by ||.
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        Alas, I erred.

        The honorable choroba is correct; blame simple carelessness, intermittent and inadequate internet access in my second consecutive month on the road, and my plain error for the wrongful use of ||. As written above it should be a single vertical bar; as choroba notes, it could also be written as:

        #!/usr/bin/perl
        use 5.10.0;
        my @work=("I Arranged a meet.", "Transcribe this", "Shud Not MATCH");
        for $work(@work) {
            if (($work =~ /Transcribe/) || ($work =~ /Arranged/) ) {
                say $work;
            }
        }
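The difference between `|` and `||` inside a pattern can be checked directly. This is a small sketch using the same three test strings; the two sub names are made up for the illustration.

```perl
use strict;
use warnings;

# '||' inside a pattern is two alternation bars with an empty alternative
# between them, and the empty alternative matches at any position.
sub match_double_bar { my @in = @_; return grep { /Transcribe||Arranged/ } @in }

# A single '|' is plain alternation between the two words.
sub match_single_bar { my @in = @_; return grep { /Transcribe|Arranged/ } @in }

my @work = ('I Arranged a meet.', 'Transcribe this', 'Shud Not MATCH');

my @double = match_double_bar(@work);
my @single = match_single_bar(@work);

print scalar(@double), " of ", scalar(@work), " matched with ||\n";
print scalar(@single), " of ", scalar(@work), " matched with |\n";
```

With `||` every string passes, including "Shud Not MATCH"; with `|` only the two strings containing one of the words pass.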
Re: Regex word options
by abualiga (Scribe) on Nov 01, 2012 at 02:51 UTC

    Agree with ww. The '||' operator may be better suited here than the word boundary '\b'. Also, what are you trying to do with the '?' non-greedy quantifier? Are you looking for lines starting with these words? Perhaps I'm missing something, but you would probably benefit more from providing some input data.

      why not try:

      if( $inputStr =~ /^(Arranged\,|Transcribe\,)/ ){
          ### do your stuff
      }
Try this
by space_monk (Chaplain) on Nov 01, 2012 at 10:40 UTC
    Input:
    This line Transcribed, For David Jones #
    This line Arranged, For Mike Johnson (great)
    Terrible weather today
    Program:
    #!/bin/perl
    my $file='regex.txt';
    my $work = do {
        local $/;
        open my $fh, "<", $file or die "could not open $file: $!";
        <$fh>;
    };
    while ($work =~ /(Transcribed|Arranged),\s+For\s+([\w\s]+)/g) {
        print "Match: $1 Person: $2\n";
    }
    Output:
    Match: Transcribed Person: David Jones
    Match: Arranged Person: Mike Johnson
    Comments:

    The \b (word boundary) matches have been removed as they don't really do anything. Note that it looks for (Transcribed|Arranged), as suggested by others. The code treats a run of word characters and whitespace as the name, stopping at the first character that is neither, but this can easily be changed.

      Thank you all for your efforts. I see I should have been clearer about what I want. Here goes.

      The script I'm writing will process about 180,000 lines of a text file, each of which is the title of a work of music. The data is a mess: there's no consistency at all. My job is to put it into a consistent format.

      To take out the instrument type, I used this regex:
      if ( $work =~ / (For [^,^\(^#^-]+)/ ) ...
      ...to grab anything between the word "For" and a comma, open parenthesis, pound sign, or hyphen. No problem. But I saw that some works have the text "Transcribed, For" OR "Arranged, For". In those cases, I want to grab the word "Transcribed" OR "Arranged" as well.

      state-o-dis-array's solution would seem to work, but doesn't. It actually doesn't even catch a simple example like the first below:

      INPUT:
      Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Péchés De Vieillesse, Book 1), Qr Iv/30
      À La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)
      All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices


      DESIRED POST-REGEX OUTPUT:
      Tirana Alla Spagnola (Rossinizzatta), (Péchés De Vieillesse, Book 1), Qr Iv/30
      À La Chapelle Sixtine, S 360 (Lw G26)
      All Through The Night, Traditional Welsh Song


      (NOTE that I'm saving the instrument type in another variable, but that's not the problem.) To answer abualiga's question, I need the '?' non-greedy quantifier because not every line will have "Transcribed" or "Arranged" in it. I realize I could use a few different regexes connected with the OR operator '||', but I have so many different cases that it starts getting very long and tedious. It may come to that, though.

      To answer ww and space_monk, I think I need to have the '\b' word boundaries in there so I can use the '?'.
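A note on the '?' with a sketch: in Perl, `?` after an atom makes only that atom optional, so to make a whole word optional the word has to be wrapped in a group and the group quantified. The following is an illustration under that assumption, not code from the thread; `capture_for` is a hypothetical helper, the sample titles are written without accents, and the character class is simplified to `[^,(#-]`.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "For <instrument>", prefixed by
# "Transcribed, " or "Arranged, " when one of them is present.
sub capture_for {
    my ($work) = @_;
    # The outer (?:...)? makes the whole "Transcribed, " / "Arranged, "
    # prefix optional as a unit; no \b is needed for that.
    my ($hit) = $work =~ /((?:(?:Transcribed|Arranged), )?\bFor [^,(#-]+)/
        or return;
    $hit =~ s/\s+\z//;    # drop trailing whitespace before a '(' etc.
    return $hit;
}

for my $work (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
) {
    my $hit = capture_for($work);
    print defined $hit ? "[$hit]\n" : "no match\n";
}
```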

        Here's an approach for the 'simple' input you give as an example. A larger chunk (but still a reasonable amount!) of more realistic input might yield a better solution. I see some other, similar postings from you – is there another thread on this with more data?

        Notes:

        • I use \x23 instead of a '#' character in the [^-,^(\x23] character set below because of a peculiarity of my little command-line processor. You should just use '#' instead. (BTW: I'm not sure what all the '^' (caret) characters were doing in this set as originally posted, so I left one in there just for good luck!)
        • The input text I use does not have accented characters. I can't display these easily on my console and so cannot test them.
        • Because I do not use accented characters in my test input text, the regexes are untested with such characters. (Update: See Update 1 below.)
        • As to the input text: Note that word 'Arranged' in the third record (i.e., line) has no preceding comma: 'Welsh Song Arranged'. Is this an example of real input, or a posting tyop? In any event, the code as it stands handles this variation.
        • I concentrate on extracting what I take to be the critical fields from each record: the title of the piece and its source. You can stitch them together how you want, with commas, whitespace, whatever.
        • Sorry for any wrap-around in the code listing.

        >perl -wMstrict -le
        "my @input = (
           'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
           'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
           'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
           );
         ;;
         my $not_source = qr{ [^-,^(\x23] }xms;
         my $kruft = qr{ \s* ,? \s* }xms;
         my $ar_tr = qr{ Arranged | Transcribed }xms;
         my $at_for = qr{ $kruft $ar_tr? $kruft For $not_source+ $kruft }xms;
         my $rx_title = qr{ (?! $at_for) . }xms;
         ;;
         for (@input) {
           print qq{[[$_]]};
           my ($title, $source) =
              m{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms;
           print qq{:$title: :$source:};
           }
         "
        [[Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30]]
        :Tirana Alla Spagnola (Rossinizzatta): :(Peches De Vieillesse, Book 1), Qr Iv/30:
        [[A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)]]
        :A La Chapelle Sixtine: :S 360 (Lw G26):
        [[All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices]]
        :All Through The Night, Traditional Welsh Song: ::

        Updates:

        1. I have since tried this code as a regular source file with accented characters, and it seems to work.
        2. Actually,  m{ \A \s* (.*?) $at_for (.*?) \s* \z }xms works just as well (for the limited test set), is probably a bit faster.
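On the carets in the character set: inside a bracketed class only a leading `^` negates, and any later `^` is an ordinary member of the set. So the class as originally posted also excludes a literal caret. A quick sketch to check this; the two sub names are made up for the illustration.

```perl
use strict;
use warnings;

# Class as originally posted: the first ^ negates, the others are literal.
sub allowed_original { return $_[0] =~ /\A[^,^\(^#^-]\z/ ? 1 : 0 }

# Same exclusions (comma, open paren, pound sign, hyphen) without the
# literal carets.
sub allowed_cleaned  { return $_[0] =~ /\A[^,(#-]\z/ ? 1 : 0 }

for my $ch (',', '(', '#', '-', '^', 'a') {
    printf "%-4s original: %d  cleaned: %d\n",
        "'$ch'", allowed_original($ch), allowed_cleaned($ch);
}
```

The only character on which the two classes disagree is `^` itself.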

        Okay:

        #!/bin/perl
        my $file='regex.txt';
        open my $fh, "<", $file or die "could not open $file: $!";
        while (<$fh>) {
            chomp;
            if ( /(.*?)\s?(Transcribed|Arranged)?,\s+For\s+([^(,]+),?(.*)/) {
                print "$1$4\n";
                print STDERR "# Instrument: $3 T/A:$2\n";
            }
        }

        Using your input, ./program.pl 2>/dev/null output is:

        Tirana Alla Spagnola (Rossinizzatta)(Péchés De Vieillesse, Book 1), Qr Iv/30
        À La Chapelle Sixtine, S 360 (Lw G26)
        All Through The Night, Traditional Welsh Song
Re: Regex word options
by space_monk (Chaplain) on Nov 04, 2012 at 15:07 UTC

    Put this in music.pl and then go perl -n music.pl < data_file 2>/dev/null

    chomp;
    if ( /(.*?)\s?(Transcribed|Arranged)?,\s+For\s+([^(,]+),?(.*)/) {
        print "$1$4\n";
        print STDERR "# Instrument: $3 T/A:$2\n";
    }
    This produces the output requested given the sample input you provided.
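The desired output posted earlier in the thread keeps a ", " between the title and whatever follows the instrument, for example "(Rossinizzatta), (Péchés ...". A possible variation that rejoins the two parts that way is sketched below. It is only checked against the three sample titles (written here without accents); `split_work` is a hypothetical helper, and titles that themselves contain " For " would need more care.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (cleaned title, T/A marker or '', instrument),
# or an empty list when the line has no "For <instrument>" part.
sub split_work {
    my ($line) = @_;
    my ($title, $marker, $instrument, $rest) =
        $line =~ /\A(.*?),?\s*(?:\b(Transcribed|Arranged),)?\s+For\s+([^,(#-]+),?\s*(.*)\z/
        or return;
    $marker //= '';                 # the optional group may not have matched
    $instrument =~ s/\s+\z//;       # trailing space before a '(' etc.
    # Rejoin title and remainder with ", " only when something remains.
    my $cleaned = length($rest) ? "$title, $rest" : $title;
    return ($cleaned, $marker, $instrument);
}

for my $line (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
) {
    my ($cleaned, $marker, $instrument) = split_work($line) or next;
    print "$cleaned\n";
    print STDERR "# Instrument: $instrument T/A: $marker\n";
}
```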
