http://www.perlmonks.org?node_id=1218247

corfuitl has asked for the wisdom of the Perl Monks concerning the following question:

Hi PerlMonks,

I have a TMX file that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4"><header creationtool="xx" creationtoolversion="1" segtype="sentence" o-tmf="undefined" adminlang="en" srclang="en" datatype="undefined"></header><body>
<tu changedate="20180321T113135Z" creationdate="20180321T113135Z" changeid="user" tuid="1">
<prop type="client"> </prop>
<prop type="project"> </prop>
<prop type="domain"> </prop>
<prop type="subject"> </prop>
<prop type="corrected">no</prop>
<prop type="aligned">no</prop>
<tuv xml:lang="en"><seg>Hello <b>world!</b></seg></tuv>
<tuv xml:lang="fr"><seg>Bonjour <b> monde</b></seg></tuv>
</tu>
<tu changedate="20180321T113135Z" creationdate="20180321T113135Z" changeid="user2" tuid="2">
<prop type="client"> </prop>
<prop type="project">yes</prop>
<prop type="corrected">no</prop>
<prop type="aligned">no</prop>
<tuv xml:lang="en"><seg>Hello <b>world!</b></seg></tuv>
<tuv xml:lang="fr"><seg>Bonjour <b> monde</b></seg></tuv>
</tu>
</body>
</tmx>

and I would like to export all the information on one line (tab-separated).

I have the following code to export the en and fr segments, but I have not managed to export all the other attributes.

use XML::LibXML;

my $dom = 'XML::LibXML'->load_xml(IO => *STDIN);

for my $child ( @{ $dom->find('/tmx/body/tu/tuv[@xml:lang=\'en\']/seg | /tmx/body/tu/tuv[@xml:lang=\'fr\']/seg | tmx/body/tu/prop | /tmx/body/tu/@creationdate') } ) {
    ( my $contents = join '', $child->childNodes ) =~ s,\n, <lb/> ,g;
    print $contents, $child->nodeName eq 'source' ? "\t" : "\n";
}

The ideal scenario would be to export whatever props there are in the nodes and align them in columns.

Could you please help me improve the code and sort it out?

Thanks

Replies are listed 'Best First'.
Re: Strip XML document
by choroba (Archbishop) on Jul 10, 2018 at 16:32 UTC
    It's not clear what output you expect.

    Something like this?

    for my $tu ($dom->findnodes('/tmx/body/tu')) {
        for my $child ($tu->findnodes('*')) {
            ( my $text = $child->textContent ) =~ s,\n, <lb/> ,g;
            print $text, "\t";
        }
        print "\n";
    }
    __END__
    no no Hello <lb/> world! Bonjour <lb/> monde
    yes no no Hello <lb/> world! Bonjour <lb/> monde

    Or do you want a table of all the prop types?

    use feature qw{ say };
    use List::Util qw{ uniq };

    # Every prop type that occurs anywhere, sorted, gives the column order.
    my @headers = sort +uniq(map $_->value, $dom->findnodes('/tmx/body/tu/prop/@type'));
    say join "\t", @headers;    # header row, as shown in the output below

    for my $tu ($dom->findnodes('/tmx/body/tu')) {
        my %props;
        for my $prop ($tu->findnodes('prop')) {
            $props{ $prop->findvalue('@type') } = $prop->textContent;
        }
        # Props missing from this tu become empty cells.
        print join("\t", map $_ // "", @props{@headers}), "\t";
        for my $child ($tu->findnodes('tuv')) {
            ( my $text = $child->textContent ) =~ s,\n, <lb/> ,g;
            print $text, "\t";
        }
        print "\n";
    }
    __END__
    aligned client corrected domain project subject
    no no Hello <lb/> world! Bonjour <lb/> monde
    no no yes Hello <lb/> world! Bonjour <lb/> monde

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Thank you so much! Both solutions work but I prefer the second one.

      Hi

      I am back again regarding this because I realised that some of my TUVs contain properties. So, this is an example:

      <tu changedate="20180321T113135Z" creationdate="20180321T113135Z" changeid="user2" tuid="2">
      <prop type="client"> </prop>
      <prop type="project">yes</prop>
      <prop type="corrected">no</prop>
      <prop type="aligned">no</prop>
      <tuv xml:lang="en"><seg>Hello <b>world!</b></seg></tuv>
      <tuv xml:lang="fr">
      <prop type="client"> </prop>
      <prop type="project">yes</prop>
      <prop type="corrected">no</prop>
      <prop type="aligned">no</prop>
      <seg>Bonjour <b> monde</b></seg></tuv>
      </tu>

      How can I get these properties and distinguish them from the others? For instance, the columns could be named TU:client and TUV:client, or something similar.

      Thanks

        Can you have a TUV property for more than one language? For example, TUV:client:en and TUV:client:fr.

        poj
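
One way the follow-up could be approached, as a minimal sketch rather than a definitive answer: collect the props that are direct children of each tu separately from those inside each tuv, and build the column names from where they were found. The column names (TU:type, TUV:type:lang, seg:lang), the helper name tmx_rows and the cut-down sample document are assumptions made for illustration; the per-language columns follow poj's suggestion and assume a TUV property may occur for more than one language.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns (\@headers, \@rows), one hash per <tu>.
sub tmx_rows {
    my ($dom) = @_;
    my (@rows, %seen);
    for my $tu ($dom->findnodes('/tmx/body/tu')) {
        my %row;
        # 'prop' relative to <tu> matches only its direct children,
        # so props nested inside a <tuv> are not picked up here.
        for my $prop ($tu->findnodes('prop')) {
            $row{ 'TU:' . $prop->findvalue('@type') } = $prop->textContent;
        }
        for my $tuv ($tu->findnodes('tuv')) {
            my $lang = $tuv->findvalue('@xml:lang');
            for my $prop ($tuv->findnodes('prop')) {
                $row{ 'TUV:' . $prop->findvalue('@type') . ":$lang" } = $prop->textContent;
            }
            # Take only the <seg> text, so TUV props do not leak into it.
            ( my $text = $tuv->findvalue('seg') ) =~ s,\n, <lb/> ,g;
            $row{"seg:$lang"} = $text;
        }
        $seen{$_} = 1 for keys %row;
        push @rows, \%row;
    }
    return ( [ sort keys %seen ], \@rows );
}

# Cut-down sample based on the tu posted above (illustrative only).
my $xml = <<'XML';
<tmx version="1.4"><body>
<tu tuid="2">
<prop type="project">yes</prop>
<prop type="aligned">no</prop>
<tuv xml:lang="en"><seg>Hello <b>world!</b></seg></tuv>
<tuv xml:lang="fr">
<prop type="project">yes</prop>
<prop type="corrected">no</prop>
<seg>Bonjour <b> monde</b></seg></tuv>
</tu>
</body></tmx>
XML

my ($headers, $rows) = tmx_rows( XML::LibXML->load_xml(string => $xml) );
print join("\t", @$headers), "\n";
for my $row (@$rows) {
    # Columns that a given tu does not have become empty cells.
    print join("\t", map { $_ // '' } @{$row}{@$headers}), "\n";
}
```

To run it against a real file, replace the string source with load_xml(IO => *STDIN) as in the original code. If TUV properties never differ by language, dropping the ":$lang" suffix would give the plain TU:client / TUV:client naming from the question, at the cost of later languages overwriting earlier ones.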