Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Re^2: Try this

by AnomalousMonk (Abbot)
on Nov 02, 2012 at 01:14 UTC ( #1001896=note: print w/ replies, xml ) Need Help??


in reply to Re: Try this
in thread Regex word options

Here's an approach for the 'simple' input you give as an example. A larger chunk (but still a reasonable amount!) of more realistic input might yield a better solution. I see some other, similar postings from you – is there another thread on this with more data?

Notes:

  • I use \x23 instead of a '#' character in the  [^-,^(\x23] character set below because of a peculiarity of my little command-line processor. You should just use '#' instead. (BTW: I'm not sure what all the '^' (carat) characters were doing in this set as originally posted, so I left one in there just for good luck!)
  • The input text I use does not have accented characters. I can't display these easily on my console and so cannot test them.
  • Because I do not use accented characters in my test input text, the regexes are untested with such characters. (Update: See Update 1 below.)
  • As to the input text: Note that word 'Arranged' in the third record (i.e., line) has no preceding comma: 'Welsh Song Arranged'. Is this an example of real input, or a posting tyop? In any event, the code as it stands handles this variation.
  • I concentrate on extracting what I take to be the critical fields from each record: the title of the piece and its source. You can stitch them together how you want, with commas, whitespace, whatever.
  • Sorry for any wrap-around in the code listing.

>perl -wMstrict -le "my @input = ( 'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches +De Vieillesse, Book 1), Qr Iv/30', 'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)' +, 'All Through The Night, Traditional Welsh Song Arranged, For Mixed +Voices', ); ;; my $not_source = qr{ [^-,^(\x23] }xms; my $kruft = qr{ \s* ,? \s* }xms; my $ar_tr = qr{ Arranged | Transcribed }xms; my $at_for = qr{ $kruft $ar_tr? $kruft For $not_source+ $kruft }xms; my $rx_title = qr{ (?! $at_for) . }xms; ;; for (@input) { print qq{[[$_]]}; my ($title, $source) = m{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms ; print qq{:$title: :$source:}; } " [[Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De + Vieillesse, Book 1), Qr Iv/30]] :Tirana Alla Spagnola (Rossinizzatta): :(Peches De Vieillesse, Book 1 +), Qr Iv/30: [[A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)]] :A La Chapelle Sixtine: :S 360 (Lw G26): [[All Through The Night, Traditional Welsh Song Arranged, For Mixed Vo +ices]] :All Through The Night, Traditional Welsh Song: ::

Updates:

  1. I have since tried this code as a regular source file with accented characters, and it seems to work.
  2. Actually,  m{ \A \s* (.*?) $at_for (.*?) \s* \z }xms works just as well (for the limited test set), is probably a bit faster.


Comment on Re^2: Try this
Select or Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1001896]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (7)
As of 2015-07-02 04:54 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (27 votes), past polls