Here's an approach for the 'simple' input you give as an example. A larger chunk (but still a reasonable amount!) of more realistic input might yield a better solution. I see some other, similar postings from you – is there another thread on this with more data?
Notes:
-
I use \x23 instead of a '#' character in the [^-,^(\x23] character set below because of a peculiarity of my little command-line processor. You should just use '#' instead. (BTW: I'm not sure what all the '^' (caret) characters were doing in this set as originally posted, so I left one in there just for good luck!)
-
The input text I use does not have accented characters. I can't display these easily on my console, so the regexes are untested with such characters.
(Update: See the first update below.)
-
As to the input text: Note that the word 'Arranged' in the third record (i.e., line) has no preceding comma: 'Welsh Song Arranged'. Is this an example of real input, or a posting tyop? In any event, the code as it stands handles this variation.
-
I concentrate on extracting what I take to be the critical fields from each record: the title of the piece and its source. You can stitch them together how you want, with commas, whitespace, whatever.
-
Sorry for any wrap-around in the code listing.
>perl -wMstrict -le
"my @input = (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
    );
 ;;
 my $not_source = qr{ [^-,^(\x23] }xms;
 my $kruft = qr{ \s* ,? \s* }xms;
 my $ar_tr = qr{ Arranged | Transcribed }xms;
 my $at_for = qr{
    $kruft $ar_tr? $kruft For $not_source+ $kruft
    }xms;
 my $rx_title = qr{ (?! $at_for) . }xms;
 ;;
 for (@input) {
    print qq{[[$_]]};
    my ($title, $source) =
       m{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms
       ;
    print qq{:$title: :$source:};
    }
"
[[Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30]]
:Tirana Alla Spagnola (Rossinizzatta): :(Peches De Vieillesse, Book 1), Qr Iv/30:
[[A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)]]
:A La Chapelle Sixtine: :S 360 (Lw G26):
[[All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices]]
:All Through The Night, Traditional Welsh Song: ::
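On the character set from the first note: here is a small sketch for a regular source file, where a literal '#' can be used in place of \x23. It only illustrates that the second '^' in the set is an ordinary member rather than a negation. The name $no_extra_caret is mine and is not part of the code above.

```perl
use strict;
use warnings;

# Same set as in the listing, with a literal '#' instead of \x23.
# Inside a bracketed class, '#' does not start a comment, even under /x.
my $not_source = qr{ [^-,^(#] }xms;

# Only the leading '^' negates the set. The second '^' is a plain member,
# so a literal caret is excluded along with '-', ',', '(' and '#'.
for my $char ('-', ',', '^', '(', '#') {
    my $excluded = $char =~ m{ \A $not_source \z }xms ? 'no' : 'yes';
    printf "%s excluded: %s\n", $char, $excluded;
}

# Hypothetical variant without the extra caret: a literal '^' is now allowed.
my $no_extra_caret = qr{ [^-,(#] }xms;
printf "caret allowed without the extra '^': %s\n",
    '^' =~ m{ \A $no_extra_caret \z }xms ? 'yes' : 'no';
```

So the stray caret is harmless unless a caret can legitimately appear in the text following 'For'.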
Updates:
- I have since tried this code as a regular source file with accented characters, and it seems to work.
- Actually, m{ \A \s* (.*?) $at_for (.*?) \s* \z }xms works just as well (for the limited test set) and is probably a bit faster.
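To compare the two match expressions side by side, here is a sketch written as a regular source file. The fourth record, with accented characters, is a made-up example and not from the original data. The names $rx_tempered and $rx_lazy are mine. It assumes the file is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                              # source contains a literal accented string
binmode STDOUT, ':encoding(UTF-8)';

my @input = (
    'Tirana Alla Spagnola (Rossinizzatta), For Soprano & Piano (Peches De Vieillesse, Book 1), Qr Iv/30',
    'A La Chapelle Sixtine, Transcribed, For Orchestra, S 360 (Lw G26)',
    'All Through The Night, Traditional Welsh Song Arranged, For Mixed Voices',
    'Péchés De Vieillesse, Arranged, For Piano, Qr Iv/30',   # hypothetical record
);

# Building blocks, as in the one-liner (literal '#' instead of \x23).
my $not_source = qr{ [^-,^(#] }xms;
my $kruft      = qr{ \s* ,? \s* }xms;
my $ar_tr      = qr{ Arranged | Transcribed }xms;
my $at_for     = qr{ $kruft $ar_tr? $kruft For $not_source+ $kruft }xms;
my $rx_title   = qr{ (?! $at_for) . }xms;

# Original form: the title is built one character at a time, each guarded by a lookahead.
my $rx_tempered = qr{ \A \s* ($rx_title+) $at_for (.*?) \s* \z }xms;
# Simpler form from the second update: a lazy capture up to the first $at_for match.
my $rx_lazy     = qr{ \A \s* (.*?)        $at_for (.*?) \s* \z }xms;

for my $record (@input) {
    my @tempered = $record =~ $rx_tempered;
    my @lazy     = $record =~ $rx_lazy;
    my $same     = "@tempered" eq "@lazy" ? 'same' : 'DIFFERENT';
    print ":$tempered[0]: :$tempered[1]: ($same)\n";
}
```

One difference to keep in mind: the lazy form allows an empty title, while ($rx_title+) requires at least one character. That does not matter for these records, but it could for input that begins with 'For'.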