Re^2: Try this

by AnomalousMonk (Monsignor)
on Nov 02, 2012 at 01:14 UTC (#1001896)


in reply to Re: Try this
in thread Regex word options

Here's an approach for the 'simple' input you give as an example. A larger chunk (but still a reasonable amount!) of more realistic input might yield a better solution. I see some other, similar postings from you – is there another thread on this with more data?

Notes:

  • I use \x23 instead of a '#' character in the [^-,^(\x23] character set below because of a peculiarity of my little command-line processor. You should just use '#' instead. (BTW: I'm not sure what all the '^' (caret) characters were doing in this set as originally posted, so I left one in there just for good luck! Only a leading '^' negates a character set; any later one is a literal caret.)
  • The input text I use does not have accented characters: I can't display these easily on my console, so the regexes are untested with such characters. (Update: See Update 1 below.)
  • As to the input text: Note that the word 'Arranged' in the third record (i.e., line) has no preceding comma: 'Welsh Song Arranged'. Is this an example of real input, or a posting tyop? In any event, the code as it stands handles this variation.
  • I concentrate on extracting what I take to be the critical fields from each record: the title of the piece and its source. You can stitch them together how you want, with commas, whitespace, whatever.
  • Sorry for any wrap-around in the code listing.
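
To illustrate the first note, here is a minimal sketch written for a script file, where '#' can be typed directly. The variable names and the list of test characters are mine, not from the original question. It checks that the \x23 and '#' spellings of the character set behave identically, and that the second '^' is just one more excluded character.

```perl
use strict;
use warnings;

# The set as posted, with \x23 standing in for '#'.
my $with_escape = qr{ [^-,^(\x23] }xms;
# The same set with a literal '#'. Inside a bracketed character set,
# '#' does not start a comment, even under /x.
my $with_hash   = qr{ [^-,^(#] }xms;

# Only the leading '^' negates; the second '^' is an ordinary member of
# the excluded set, alongside '-', ',', '(' and '#'.
for my $char ('-', ',', '^', '(', '#', 'a', ' ', ')') {
    my $escape_ok = $char =~ m{ \A $with_escape \z }xms ? 1 : 0;
    my $hash_ok   = $char =~ m{ \A $with_hash   \z }xms ? 1 : 0;
    printf "%-3s escape:%d hash:%d\n", "'$char'", $escape_ok, $hash_ok;
}
```

The first five characters should be rejected by both sets and the last three accepted by both.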

>perl -wMstrict -le
"my @input = (
   'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
   'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
   'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
   );
 ;;
 my $not_source = qr{ [^-,^(\x23] }xms;
 my $kruft  = qr{ \s* ,? \s* }xms;
 my $ar_tr  = qr{ Arranged | Transcribed }xms;
 my $at_for = qr{ $kruft $ar_tr? $kruft For $not_source+ $kruft }xms;
 my $rx_title = qr{ (?! $at_for) . }xms;
 ;;
 for (@input) {
   print qq{[[$_]]};
   my ($title, $source) = m{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms;
   print qq{:$title: :$source:};
   }
 "
[[Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30]]
:Tirana Alla Spagnola (Rossinizzatta): :(Peches De Vieillesse, Book 1), Qr Iv/30:
[[A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)]]
:A La Chapelle Sixtine: :S 360 (Lw G26):
[[All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices]]
:All Through The Night, Traditional Welsh Song: ::
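
To see why the title capture stops where it does, it can help to look at what $at_for matches on its own in each record. The following is a sketch for a script file (so '#' replaces \x23, as suggested in the notes). The first_at_for helper is a hypothetical name of my own, added only for illustration.

```perl
use strict;
use warnings;

my @input = (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
);

my $not_source = qr{ [^-,^(#] }xms;       # characters allowed in the 'For ...' phrase
my $kruft      = qr{ \s* ,? \s* }xms;     # optional comma with surrounding whitespace
my $ar_tr      = qr{ Arranged | Transcribed }xms;
my $at_for     = qr{ $kruft $ar_tr? $kruft For $not_source+ $kruft }xms;

# Hypothetical helper: return the first substring that $at_for matches,
# or undef if the record contains no such phrase.
sub first_at_for {
    my ($record) = @_;
    return $record =~ m{ ($at_for) }xms ? $1 : undef;
}

for my $record (@input) {
    my $separator = first_at_for($record);
    print defined $separator ? "[$separator]" : '(no match)', "\n";
}
```

Everything before the bracketed piece becomes the title and everything after it becomes the source, which is why the third record ends up with an empty source.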

Updates:

  1. I have since tried this code as a regular source file with accented characters, and it seems to work.
  2. Actually, m{ \A \s* (.*?) $at_for (.*?) \s* \z }xms works just as well (for the limited test set) and is probably a bit faster.
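
A quick way to check Update 2 is to run both patterns over the same records and compare the captures. This is only a sketch for a script file, using the three test records from the listing; the names $guarded and $lazy are mine. One difference to keep in mind: the lazy version would also accept an empty title, which does not matter for this test set.

```perl
use strict;
use warnings;

my @input = (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
);

my $not_source = qr{ [^-,^(#] }xms;
my $kruft      = qr{ \s* ,? \s* }xms;
my $ar_tr      = qr{ Arranged | Transcribed }xms;
my $at_for     = qr{ $kruft $ar_tr? $kruft For $not_source+ $kruft }xms;

# Pattern from the main listing: each title character is guarded by a
# negative lookahead for $at_for.
my $rx_title = qr{ (?! $at_for) . }xms;
my $guarded  = qr{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms;

# Simpler pattern from Update 2: a lazy title.
my $lazy     = qr{ \A \s* (.*?) $at_for (.*?) \s* \z }xms;

for my $record (@input) {
    my @from_guarded = $record =~ $guarded;   # (title, source), or () on failure
    my @from_lazy    = $record =~ $lazy;
    my $verdict = "@from_guarded" eq "@from_lazy" ? 'same' : 'different';
    print "$verdict: @{[ map { qq{:$_:} } @from_lazy ]}\n";
}
```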

