Transliteration inside an XML file

by nikop (Initiate)
on Jun 19, 2014 at 10:14 UTC (#1090437)

nikop has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks!

I apologise if my question is complicated or unclear; I'm very new to this! I've been looking into this problem for a while, and I've now reached a point in my work where getting this to run correctly would save me a lot of time. I'm a linguist, and I very often encounter text files that are in the language I study but in the wrong transcription or orthography. I understand that Perl can help me convert them to another character set. After looking at models and hints from several transliteration scripts I found online, I ended up with this, and it works very well:

    #!/usr/bin/perl

    %trans = (
        "t'i" =>'&#1090;&#1080;',
        "t'a" =>'&#1090;&#1103;',
        "t'u" =>'&#1090;&#1102;',
        "t'e" =>'&#1090;&#1077;',
        #It continues like this for several hundred lines, this is just a snippet.
        #So it just goes through all character combinations and turns the text to cyrillic.
    );

    # Actual Translation Logic:
    @signs = sort {length($b) <=> length($a)} keys %trans;
    @signs = map quotemeta($_), @signs;
    $re = join '|', @signs, '.';

    # Read Input from Stdin - one line at a time
    while (<STDIN>) {
        $input = "$_";
        $input =~ s/($re)/exists($trans{$1}) ? $trans{$1} : $1/geo;
        print $input, "";
    }

It does its job well and converts text like "men uny niko" to "менӧ шуӧны нико".

However, I often have the old transcription inside an XML file. These files are made in a program called ELAN. They basically have a structure like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <ANNOTATION_DOCUMENT>
        <TIER LINGUISTIC_TYPE_REF="orthT" PARENT_REF="ref@S1" PARTICIPANT="S1" TIER_ID="orth@S1">
            <ANNOTATION>
                <REF_ANNOTATION ANNOTATION_ID="a1978" ANNOTATION_REF="a2">
                    <ANNOTATION_VALUE>men uny niko</ANNOTATION_VALUE>
                </REF_ANNOTATION>
            </ANNOTATION>
            <ANNOTATION>
                <REF_ANNOTATION ANNOTATION_ID="a1979" ANNOTATION_REF="a5">
                    <ANNOTATION_VALUE>at't' perl manastyrly!</ANNOTATION_VALUE>
                </REF_ANNOTATION>
            </ANNOTATION>
        </TIER>
    </ANNOTATION_DOCUMENT>

So I would like to run the transliteration script on the text "men uny niko" inside the structure:

    <ANNOTATION_VALUE>men uny niko</ANNOTATION_VALUE>

However, this would need to happen only in the nodes inside the structure:

    <TIER LINGUISTIC_TYPE_REF="orthT" PARENT_REF="ref@S1" PARTICIPANT="S1" TIER_ID="orth@S1">
    </TIER>

So the final result would be like:

    <?xml version="1.0" encoding="UTF-8"?>
    <ANNOTATION_DOCUMENT>
        <TIER LINGUISTIC_TYPE_REF="orthT" PARENT_REF="ref@S1" PARTICIPANT="S1" TIER_ID="orth@S1">
            <ANNOTATION>
                <REF_ANNOTATION ANNOTATION_ID="a1978" ANNOTATION_REF="a2">
                    <ANNOTATION_VALUE>&#1084;&#1077;&#1085;&#1255; &#1096;&#1091;&#1255;&#1085;&#1099; &#1085;&#1080;&#1082;&#1086;</ANNOTATION_VALUE>
                </REF_ANNOTATION>
            </ANNOTATION>
            <ANNOTATION>
                <REF_ANNOTATION ANNOTATION_ID="a1979" ANNOTATION_REF="a5">
                    <ANNOTATION_VALUE>&#1072;&#1090;&#1090;&#1100;&#1255; &#1087;&#1077;&#1088;&#1083; &#1084;&#1072;&#1085;&#1072;&#1089;&#1090;&#1099;&#1088;&#1083;&#1099;!</ANNOTATION_VALUE>
                </REF_ANNOTATION>
            </ANNOTATION>
        </TIER>
    </ANNOTATION_DOCUMENT>

The change would need to happen only here, as there are other tiers with different data that should remain as they are.

Also if you think I should specifically read something more about this I'm ready to do that. I honestly want to learn Perl. I didn't know if it is ok to post really long pieces of code, so I just took these small pieces that illustrate what I'm doing. I guess I would need to select the right XML node in XPath or something similar, but I have no idea where to put this into the perl script! I've been learning about Perl and XML during the last months, but I'm still taking very early steps.

Thank you for all the help!

Replies are listed 'Best First'.
Re: Transliteration inside an XML file
by choroba (Archbishop) on Jun 19, 2014 at 11:17 UTC
    Here's how you can process the XML with XML::LibXML:
        #!/usr/bin/perl
        use warnings;
        use strict;

        use XML::LibXML;

        my $xml = 'XML::LibXML'->load_xml( location => '1.xml' );

        for my $element ($xml->findnodes('/ANNOTATION_DOCUMENT/TIER'
                                         . '[@LINGUISTIC_TYPE_REF="orthT"]/ANNOTATION/'
                                         . 'REF_ANNOTATION/ANNOTATION_VALUE/text()')
                        ) {
            # Replace by your logic, or better, see what AnomalousMonk wrote.
            (my $translit = $element) =~ s/m/M/g;
            $element->setData($translit);
        }
        print $xml;
    لսႽ ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
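A possible way to combine the XPath selection above with the transliteration table is sketched below. It is only a sketch under some assumptions: the four-entry `%trans` is a stand-in for the full table, the inline XML is a made-up two-tier sample, and the Cyrillic is written as Perl character escapes rather than as numeric character references. The last point matters because, as far as I know, a string such as `&#1090;` stored with `setData` is treated as literal text, so the serializer would escape its ampersand instead of emitting the character.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical four-entry excerpt of the table. \x{442} is U+0442,
# the same character the original table writes as &#1090;.
my %trans = (
    "t'i" => "\x{442}\x{438}",
    "t'a" => "\x{442}\x{44F}",
    "t'u" => "\x{442}\x{44E}",
    "t'e" => "\x{442}\x{435}",
);

# Longest keys first, so longer sequences win over their prefixes.
my $alternation = join '|', map quotemeta,
    sort { length $b <=> length $a } keys %trans;
my $re = qr/($alternation)/;

sub translit {
    my ($text) = @_;
    $text =~ s/$re/$trans{$1}/g;
    return $text;
}

# Made-up sample: one tier to convert, one tier to leave alone.
my $doc = XML::LibXML->load_xml( string => <<'XML' );
<?xml version="1.0" encoding="UTF-8"?>
<ANNOTATION_DOCUMENT>
  <TIER LINGUISTIC_TYPE_REF="orthT" TIER_ID="orth@S1">
    <ANNOTATION><REF_ANNOTATION ANNOTATION_ID="a1" ANNOTATION_REF="a2">
      <ANNOTATION_VALUE>t'i t'a</ANNOTATION_VALUE>
    </REF_ANNOTATION></ANNOTATION>
  </TIER>
  <TIER LINGUISTIC_TYPE_REF="otherT" TIER_ID="other@S1">
    <ANNOTATION><REF_ANNOTATION ANNOTATION_ID="a3" ANNOTATION_REF="a2">
      <ANNOTATION_VALUE>t'i t'a</ANNOTATION_VALUE>
    </REF_ANNOTATION></ANNOTATION>
  </TIER>
</ANNOTATION_DOCUMENT>
XML

# Only text nodes under the orthT tier are selected and rewritten.
for my $node ( $doc->findnodes(
        '/ANNOTATION_DOCUMENT/TIER[@LINGUISTIC_TYPE_REF="orthT"]'
      . '/ANNOTATION/REF_ANNOTATION/ANNOTATION_VALUE/text()' ) )
{
    $node->setData( translit( $node->data ) );
}

# Serialize the modified document in its declared encoding (UTF-8 here).
my $out = $doc->toString;
print $out;
```

With a real file, `load_xml( location => '1.xml' )` as in the reply above would replace the inline string.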
Re: Transliteration inside an XML file
by AnomalousMonk (Bishop) on Jun 19, 2014 at 10:59 UTC

    I don't know enough about translation in general or XML extraction and rendering of Cyrillic text to comment on those aspects of your post, but certain details of your basic approach caught my eye.

    First: Do yourself a favor and always use warnings and strictures (see strict), and lexical variables whenever possible.

    Second, the form of the basic matching regex in the OPed code seems inefficient. The appendage of a  . (dot) metacharacter to the regex means that it will match and capture each and every character (except a newline). This is compensated in the substitution expression by replacing those characters for which there is no valid translation with the character just captured, a net change of zero.

    It would seem more efficient to capture and replace only those character sequences needing translation. This also means you need no  /e execution during replacement evaluation.

        use warnings;
        use strict;

        my %trans = (
            q{t'i} => '&#1090;&#1080;',
            q{t'a} => '&#1090;&#1103;',
            q{t'u} => '&#1090;&#1102;',
            q{t'e} => '&#1090;&#1077;',
        );

        my @signs = sort {length($b) <=> length($a)} keys %trans;
        @signs = map quotemeta($_), @signs;
        my $re = join '|', @signs, '.';
        print "original: '$re' \n";

        my ($cyril) =
            map qr{ $_ }xms,
            join ' | ',
            map quotemeta,
            sort { length($b) <=> length($a) }
            keys %trans
            ;
        print "suggested: $cyril \n";

        my $text = "t'i don't understand t'any cyrillic.";
        print "raw: [$text] \n";

        $text =~ s{ ($cyril) }{$trans{$1}}xmsgo;
        print "translation: [$text] \n";

    Output:

        c:\@Work\Perl\monks\nikop>perl xlate_cyrillic_1.pl
        original: 't\'e|t\'i|t\'u|t\'a|.'
        suggested: (?^msx: t\'e | t\'i | t\'u | t\'a )
        raw: [t'i don't understand t'any cyrillic.]
        translation: [&#1090;&#1080; don't understand &#1090;&#1103;ny cyrillic.]

    Further reading re: regexes: perlre, perlrequick, and especially perlretut. Also the Pattern Matching Regular Expressions and Parsing tutorials.

      It would seem more efficient to capture and replace only those character sequences needing translation.

      Well, no, actually - not in this case. The OP is transliterating from a "Romanized" (Latin-alphabet-based) "transcription" into Cyrillic. All characters in a given string will need to be replaced, because Cyrillic has its own dedicated "page" within the Unicode table. The incoming Latin characters (and diacritic marks) may come from the ASCII table or somewhere else, but when the transliteration is finished, every character will have been replaced.
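Whichever view one takes on efficiency, the two regex forms discussed in this sub-thread should produce the same text, because a character that is not a key in the table passes through unchanged either way. The sketch below compares them on the four-entry excerpt from the posts; the sub names `with_dot` and `keys_only` are made up for illustration.

```perl
use strict;
use warnings;

my %trans = (
    "t'i" => '&#1090;&#1080;',
    "t'a" => '&#1090;&#1103;',
    "t'u" => '&#1090;&#1102;',
    "t'e" => '&#1090;&#1077;',
);

# Longest keys first, each one escaped for use inside a regex.
my $keys = join '|', map quotemeta,
    sort { length $b <=> length $a } keys %trans;

# Form from the original post: table keys plus a catch-all dot,
# with /e deciding per match whether a replacement exists.
sub with_dot {
    my ($s) = @_;
    $s =~ s/($keys|.)/exists $trans{$1} ? $trans{$1} : $1/ge;
    return $s;
}

# Suggested form: match only the table keys, no /e needed.
sub keys_only {
    my ($s) = @_;
    $s =~ s/($keys)/$trans{$1}/g;
    return $s;
}

my $text = "t'i don't understand t'any cyrillic.";
print with_dot($text),  "\n";
print keys_only($text), "\n";
print with_dot($text) eq keys_only($text)
    ? "same result\n"
    : "different result\n";
```

With a complete table nearly every input character would be a key, so the difference between the two forms is mostly about how punctuation, spaces and other pass-through characters are handled.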

Re: Transliteration inside an XML file
by mirod (Canon) on Jun 19, 2014 at 14:06 UTC

    It looks like the code below would work with XML::Twig, provided you have the complete translation table. It's a bit hard to test when the data you provided does not include any of the transliterations in the piece of %trans that's included.

        #!/usr/bin/perl

        use strict;
        use warnings;

        use XML::Twig;

        my %trans = (
            "t'i" =>'&#1090;&#1080;',
            "t'a" =>'&#1090;&#1103;',
            "t'u" =>'&#1090;&#1102;',
            "t'e" =>'&#1090;&#1077;',
            #It continues like this for several hundred lines, this is just a snippet.
            #So it just goes through all character combinations and turns the text to cyrillic.
        );

        # Actual Translation Logic:
        my @signs = sort {length($b) <=> length($a)} keys %trans;
        @signs = map quotemeta($_), @signs;
        my $re = join '|', @signs, '.';

        XML::Twig->new(
            twig_roots => {
                q{TIER[@LINGUISTIC_TYPE_REF='orthT' and @PARENT_REF='ref@S1' and @PARTICIPANT='S1' and @TIER_ID='orth@S1']/ANNOTATION/REF_ANNOTATION/ANNOTATION_VALUE} => sub {
                    my $in= $_->text;
                    #warn "called text: ", $_->text, "\n";
                    $in=~ s/($re)/exists($trans{$1}) ? $trans{$1} : $1/geo;
                    $_->set_text( $in);
                    $_->print;
                },
            },
            twig_print_outside_roots => 1,
        )
        ->parsefile( 'in.xml');
Re: Transliteration inside an XML file
by flowdy (Scribe) on Jun 19, 2014 at 10:53 UTC

    To select nodes inside an XML document in order to edit them, CPAN module XML::Twig will be your friend I'm sure. :)

    Note also that when you come to use XPath and your paths unexpectedly fail to match, you probably need to declare a namespace and use its prefix throughout your path. I might be wrong, but this is just my experience working with XPath (though I used XML::LibXML).
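The namespace pitfall mentioned above can be sketched with XML::LibXML as follows. The document, the namespace URI and the `ex` prefix are all made up for illustration; the sample in the question shows no namespace declaration, so plain paths may well be enough there.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Made-up document with a default namespace; the URI is a placeholder.
my $doc = XML::LibXML->load_xml( string =>
    '<doc xmlns="http://example.org/ns"><item>one</item><item>two</item></doc>' );

# In XPath 1.0 an unprefixed name means "no namespace", so this path
# does not match the namespaced elements.
my @plain = $doc->findnodes('/doc/item');

# Bind a prefix of our own choosing to the namespace URI and use it
# on every step of the path.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs( ex => 'http://example.org/ns' );
my @prefixed = $xpc->findnodes('/ex:doc/ex:item');

printf "without prefix: %d, with prefix: %d\n",
    scalar @plain, scalar @prefixed;
```

The prefix used in the path does not have to match anything in the file; only the URI has to.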

Re: Transliteration inside an XML file
by grondilu (Friar) on Jun 19, 2014 at 10:48 UTC

    Well, if what you want is to do the transliteration only inside the TIER element, you can use flip/flop.

        if (/\<TIER/ .. /\/TIER\>/) {
            $input =~ $your_substitution_regex;
        }

    That's a quick and dirty solution but it's cheap so you may just be happy with it if you don't want to seriously parse the XML.

    See the perlop man page for more info about this operator.

      As described in OP, the tags to select seem to be qualified by certain attributes. Hence your regexes fall short here.
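If one still wanted the quick, line-based route, the start pattern of the flip-flop could be made to look at the attribute as well. The sketch below assumes that each tag sits on its own line and that each ANNOTATION_VALUE opens and closes on one line, which real files need not do; the sample lines and the `translit` helper are made up for illustration. A proper XML parser remains the more robust choice.

```perl
use strict;
use warnings;

my %trans = (
    "t'i" => '&#1090;&#1080;',
    "t'a" => '&#1090;&#1103;',
    "t'u" => '&#1090;&#1102;',
    "t'e" => '&#1090;&#1077;',
);
my $keys = join '|', map quotemeta,
    sort { length $b <=> length $a } keys %trans;

sub translit {
    my ($s) = @_;
    $s =~ s/($keys)/$trans{$1}/g;
    return $s;
}

# Made-up input, one tag per line (an assumption of this sketch).
my @lines = (
    '<TIER LINGUISTIC_TYPE_REF="orthT" TIER_ID="orth@S1">',
    "<ANNOTATION_VALUE>t'i t'a</ANNOTATION_VALUE>",
    '</TIER>',
    '<TIER LINGUISTIC_TYPE_REF="otherT" TIER_ID="other@S1">',
    "<ANNOTATION_VALUE>t'i t'a</ANNOTATION_VALUE>",
    '</TIER>',
);

my @out;
for my $line (@lines) {
    my $copy = $line;
    # The range turns on at a TIER start tag carrying the wanted attribute
    # and turns off at the next closing TIER tag.
    if ( $copy =~ /<TIER\b[^>]*\bLINGUISTIC_TYPE_REF="orthT"/
         .. $copy =~ m{</TIER>} )
    {
        # Touch only the text between the value tags, never the markup.
        $copy =~ s{(?<=<ANNOTATION_VALUE>)(.*?)(?=</ANNOTATION_VALUE>)}
                  { translit($1) }e;
    }
    push @out, $copy;
}
print "$_\n" for @out;
```

Because this writes raw markup rather than going through a serializer, the numeric character references from the original table can be used as they are.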

Re: Transliteration inside an XML file
by nikop (Initiate) on Jun 19, 2014 at 21:49 UTC

    Thank you so much, Monks! I understand that my example wasn't ideal; I should have selected parts that are testable in themselves. AnomalousMonk, I will modify my code with your suggestions. I was combining that with choroba's example, but then I saw mirod's post, and that model also worked well. I had no time to test things further today, but I will take time to try out the different approaches presented here. Thanks for the links too, they are very useful!
