PerlMonks  

Try this

by space_monk (Chaplain)
on Nov 01, 2012 at 10:40 UTC (#1001796)


in reply to Regex word options

Input:

This line Transcribed, For David Jones #
This line Arranged, For Mike Johnson (great)
Terrible weather today
Program:
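#!/bin/perl
my $file='regex.txt';

# Slurp the whole file into one string ($/ undefined means no record separator).
my $work = do {
    local $/;
    open my $fh, "<", $file or die "could not open $file: $!";
    <$fh>;
};

# $1 is the keyword, $2 the run of word characters and whitespace after "For".
while ($work =~ /(Transcribed|Arranged),\s+For\s+([\w\s]+)/g) {
    print "Match: $1 Person: $2\n";
}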
Output:
Match: Transcribed Person: David Jones
Match: Arranged Person: Mike Johnson
Comments:

The \b (word boundary) assertions have been removed because they don't really do anything here. Note that the regex looks for (Transcribed|Arranged), as suggested by others. The code treats word characters and whitespace ([\w\s]) as the name and stops at the first character that is neither, but this can easily be changed.
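One way the name pattern might be changed is sketched below. It assumes the sample input is three separate lines and that a name is a run of words separated by single spaces; both are assumptions, not something the thread confirms. Because \s also matches newlines and trailing blanks, the sketch swaps [\w\s]+ for \w+(?: \w+)* so that a capture cannot run past the end of a name.

```perl
use strict;
use warnings;

# Sample input held in a string instead of regex.txt (assumed line layout).
my $work = "This line Transcribed, For David Jones #\n"
         . "This line Arranged, For Mike Johnson (great)\n"
         . "Terrible weather today\n";

my @found;
# \w+(?: \w+)* = one word, then any number of " word" groups;
# it cannot end on a space and cannot cross a newline.
while ($work =~ /(Transcribed|Arranged),\s+For\s+(\w+(?: \w+)*)/g) {
    push @found, "$1/$2";
    print "Match: $1 Person: $2\n";
}
```

Names with hyphens, apostrophes or accented letters would need a wider character set than \w.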


Re: Try this
by jeffrgsf (Novice) on Nov 01, 2012 at 17:59 UTC
    Thank you all for your efforts. I see I should have been clearer about what I want. Here goes.

    The script I'm writing will process about 180,000 lines of a text file, each of which is the title of a work of music. The data is a mess; there's no consistency at all. My job is to put it into a consistent format.

    To take out the instrument type, I used this regex:
    if ( $work =~ / (For [^,^\(^#^-]+)/ ) ...
    ...to grab anything between the word "For" and a comma, open parenthesis, pound sign, or hyphen. No problem. But I saw that some works have the text "Transcribed, For" or "Arranged, For". In those cases, I want to grab the word "Transcribed" or "Arranged" as well.
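A short sketch of how that character class behaves, using a made-up title (the sample string is hypothetical, not from the real data). Inside a bracketed class only the first ^ negates; the later carets are literal, so the class also refuses a caret character. A class without the extra carets and with the hyphen placed last should match the same text whenever no caret appears in the input.

```perl
use strict;
use warnings;

# Hypothetical sample title, not taken from the real data set.
my $work = 'Sonata No. 1, For Violin & Piano (Op. 5), Book 2';

# Class as posted: the extra carets are literal, so '^' is excluded too.
my ($with_carets) = $work =~ / (For [^,^\(^#^-]+)/;

# Simplified class: one leading caret negates; '-' last keeps it literal.
my ($simplified) = $work =~ / (For [^,(#-]+)/;

print "[$with_carets]\n";
print "[$simplified]\n";
```

Both captures keep the blank before the opening parenthesis, so the result may need trimming before it is stored.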

    state-o-dis-array's solution would seem to work, but doesn't. It doesn't even catch a simple example like the first one below:

    INPUT:
    Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Péchés De Vieillesse, Book 1), Qr Iv/30
    À La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)
    All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices


    DESIRED POST-REGEX OUTPUT:
    Tirana Alla Spagnola (Rossinizzatta), (Péchés De Vieillesse, Book 1), Qr Iv/30
    À La Chapelle Sixtine, S 360 (Lw G26)
    All Through The Night, Traditional Welsh Song


    (NOTE that I'm saving the instrument type in another variable, but that's not the problem.) To answer abualiga's question, I need the '?' non-greedy quantifier because not every line will have "Transcribed" or "Arranged" in it. I realize I could use a few different regexes connected with the OR operator '||', but I have so many different cases that it starts getting very long and tedious. It may come to that, though.

    To answer ww and space_monk, I think I need to have the '\b' word boundaries in there so I can use the '?'.
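A minimal sketch of an optional keyword without any \b, offered as one possible reading of the requirement. A '?' placed directly after a group means "zero or one of this group"; it acts as the non-greedy modifier only when it follows another quantifier, as in *? or +?. The pattern and the variable names are illustrative assumptions, and the titles are ASCII versions of the examples above.

```perl
use strict;
use warnings;

# The keyword (with any comma or blanks before it) is optional as a whole.
my $re = qr/(?:\s*,?\s*(Transcribed|Arranged))?,\s+For\s+([^,(#-]+)/;

my @lines = (
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Book 1)',
);

my @results;
for my $line (@lines) {
    if ($line =~ $re) {
        # $1 is undef when the optional group did not take part in the match.
        my $how = defined $1 ? $1 : 'none';
        (my $inst = $2) =~ s/\s+\z//;    # drop trailing blanks
        push @results, "$how|$inst";
        print "$how: $inst\n";
    }
}
```

The third title matches with the keyword group unused, which is the case the '?' is there for.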

      Here's an approach for the 'simple' input you give as an example. A larger chunk (but still a reasonable amount!) of more realistic input might yield a better solution. I see some other, similar postings from you – is there another thread on this with more data?

      Notes:

      • I use \x23 instead of a '#' character in the [^-,^(\x23] character set below because of a peculiarity of my little command-line processor. You should just use '#' instead. (BTW: I'm not sure what all the '^' (caret) characters were doing in this set as originally posted, so I left one in there just for good luck!)
      • The input text I use does not have accented characters. I can't display these easily on my console, so the regexes are untested with such characters. (Update: See Update 1 below.)
      • As to the input text: Note that the word 'Arranged' in the third record (i.e., line) has no preceding comma: 'Welsh Song Arranged'. Is this an example of real input, or a posting tyop? In any event, the code as it stands handles this variation.
      • I concentrate on extracting what I take to be the critical fields from each record: the title of the piece and its source. You can stitch them together how you want, with commas, whitespace, whatever.
      • Sorry for any wrap-around in the code listing.

      >perl -wMstrict -le
      "my @input = (
         'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
         'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
         'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
         );
       ;;
       my $not_source = qr{ [^-,^(\x23] }xms;
       my $kruft = qr{ \s* ,? \s* }xms;
       my $ar_tr = qr{ Arranged | Transcribed }xms;
       my $at_for = qr{ $kruft $ar_tr? $kruft For $not_source+ $kruft }xms;
       my $rx_title = qr{ (?! $at_for) . }xms;
       ;;
       for (@input) {
         print qq{[[$_]]};
         my ($title, $source) = m{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms;
         print qq{:$title: :$source:};
         }
      "
      [[Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30]]
      :Tirana Alla Spagnola (Rossinizzatta): :(Peches De Vieillesse, Book 1), Qr Iv/30:
      [[A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)]]
      :A La Chapelle Sixtine: :S 360 (Lw G26):
      [[All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices]]
      :All Through The Night, Traditional Welsh Song: ::

      Updates:

      1. I have since tried this code as a regular source file with accented characters, and it seems to work.
      2. Actually, m{ \A \s* (.*?) $at_for (.*?) \s* \z }xms works just as well (for the limited test set) and is probably a bit faster.

      Okay:

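      #!/bin/perl
      my $file='regex.txt';
      open my $fh, "<", $file or die "could not open $file: $!";
      while (<$fh>) {
          chomp;
          # $1 = text before the field, $2 = optional keyword (may be undefined),
          # $3 = instrument, $4 = whatever follows the instrument
          if ( /(.*?)\s?(Transcribed|Arranged)?,\s+For\s+([^(,]+),?(.*)/) {
              print "$1$4\n";                                 # cleaned title to STDOUT
              print STDERR "# Instrument: $3 T/A:$2\n";       # extracted fields to STDERR
          }
      }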

      Using your input, the output of ./program.pl 2>/dev/null is:

      Tirana Alla Spagnola (Rossinizzatta)(Péchés De Vieillesse, Book 1), Qr Iv/30
      À La Chapelle Sixtine, S 360 (Lw G26)
      All Through The Night, Traditional Welsh Song
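The first output line runs the title straight into the parenthesised part, while the desired output has ", " between them. A possible way to stitch the pieces back together is sketched below. The helper name strip_instrument is hypothetical and the sample titles are shortened ASCII versions of the posted ones. It reuses the same capture idea, trims stray commas and blanks from each piece, and joins the non-empty pieces with ", ".

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): remove the "For ..." field and
# rejoin what is left with ', '.
sub strip_instrument {
    my ($line) = @_;
    if ($line =~ /\A(.*?)\s?(?:Transcribed|Arranged)?,\s+For\s+[^(,]+,?(.*)\z/s) {
        my @parts = ($1, $2);
        s/\A[\s,]+|[\s,]+\z//g for @parts;    # trim stray blanks and commas
        return join ', ', grep { length } @parts;
    }
    return $line;                             # no "For" field: leave unchanged
}

print strip_instrument($_), "\n" for (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Book 1), Qr Iv/30',
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
);
```

Lines without a ", For" field pass through untouched, which seems the safer default for messy data.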
