
Re: fix ODT files with line breaks looking poor

by Jenda (Abbot)
on Apr 16, 2019 at 23:32 UTC

in reply to fix ODT files with line breaks looking poor

I know nothing of ODT, and the example presented by haukex fails to open, so I can't test whether I broke something, but here's a possible solution using XML::Rules.

    use strict;
    use XML::Rules;

    my $filter = XML::Rules->new(
        style => 'filter',
        namespaces => {
            'urn:oasis:names:tc:opendocument:xmlns:text:1.0' => 'text',
            'urn:oasis:names:tc:opendocument:xmlns:office:1.0' => 'office'
        },
        rules => {
            _default => 'raw', # we do not care what's inside the tags,
                               # we just want to preserve everything
            'text:p' => sub { return $_[0] => $_[1] }, # this doesn't seem to do anything,
                # but it's necessary. The filter mode sends everything outside tags
                # with special rules directly to output
            'text:line-break' => sub {
                my ($tag, $attrs, $parents, $parentAttrs, $parser) = @_;
                my $idx = $#$parents; # find the <text:p> tag enclosing this one
                $idx-- while ($idx >=0 && $parents->[$idx] ne 'text:p');
                return $tag => $attrs if ($parents->[$idx] ne 'text:p');
                    # line break outside paragraph, leave alone
                my $level = $#$parents - $idx + 1;
                print { $parser->{FH} } $parser->parentsToXML( $level);
                    # output the <text:p> and everything inside we read so far
                print { $parser->{FH} } $parser->closeParentsToXML( $level);
                    # close the opened tags all the way to the <text:p>
                print { $parser->{FH} } "\n";
                foreach my $i ($idx .. $#$parents) { # remove the printed content
                    delete $parentAttrs->[$i]->{_content}; # leaves the attributes intact
                }
                return; # remove the <text:line-break/>
            }
        }
    );

    $filter->filter( \*DATA, \*STDOUT);

    __DATA__
    <?xml version="1.0"?>
    <office:document-content office:version="1.2"
      xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
      xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
    <office:body><office:text>
    <text:p text:style-name="P1">
    Fo<text:span text:style-name="T1">o<text:line-break/>
    B</text:span><text:span text:style-name="T3">a</text:span>
    <text:span text:style-name="T5">r<text:line-break/></text:span>
    </text:p>
    </office:text></office:body>
    </office:document-content>

The code will work correctly (provided I understood the requirements right) no matter how many tags are open within the <text:p>.

Enoch was right!
Enjoy the last years of Rome.

Re^2: fix ODT files with line breaks looking poor
by haukex (Chancellor) on Apr 17, 2019 at 19:48 UTC
    the example presented by haukex fails to open

    An ODT file is basically a ZIP file that contains a bunch of other files, one of them being content.xml. I extracted that file and edited it down to what I considered a minimal but representative example, which is what I showed. I cared more about the structure of the XML, and I also tested my code on this example: <r><p>a<x>b<y>c<x>d<s/>e</x>f</y>g</x>h</p></r> --> <r><p>a<x>b<y>c<x>d</x></y></x></p><p><x><y><x>e</x>f</y>g</x>h</p></r>
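For anyone who wants to try either solution on a real document, here is a minimal sketch of the extraction step described above. It assumes the core module IO::Uncompress::Unzip and its Name option for picking one member out of the archive; the helper name read_content_xml is hypothetical and does not come from this thread.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Hypothetical helper: return the text of content.xml from an ODT file.
# An ODT file is a ZIP archive, so we ask for the member by name.
sub read_content_xml {
    my ($odt_path) = @_;
    my $xml;
    unzip $odt_path => \$xml, Name => 'content.xml'
        or die "Cannot read content.xml from $odt_path: $UnzipError\n";
    return $xml;
}

# Example use (the path is a placeholder):
#   my $xml = read_content_xml('document.odt');
#   print $xml;
```

The returned string could then be handed to an XML filter. Writing the modified content.xml back into the archive is a separate step not covered here.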

      I see. :-) I just saved it with an .odt extension and tried to force Word to open it. It was two in the morning.

      The result is valid XML and looks right according to how I understand the task, so let's hope it helps someone. I think the code is kinda neat.

      Enoch was right!
      Enjoy the last years of Rome.
