Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Re: LiBXML: New markup while preserving earlier tags?

by choroba (Cardinal)
on Mar 29, 2018 at 15:14 UTC ( [id://1211984]=note: print w/replies, xml ) Need Help??


in reply to LiBXML: New markup while preserving earlier tags?

This is indeed ugly.

The "cleaner" way is to find in what text() elements and at what positions inside them you need to insert the tags, then replace each text() element with the text before the position, the tag, and the text after the position:

#!/usr/bin/perl use warnings; use strict; use feature qw{ say }; use XML::LibXML; use List::Util qw{ sum }; sub insert_tag { my ($text, $pos, $tag_name, $query) = @_; my $before = substr $text, 0, $pos; my $after = substr $text, $pos; my $parent = $text->parentNode; $parent->insertBefore('XML::LibXML::Text'->new($before), $text); $parent->insertAfter('XML::LibXML::Text'->new($after), $text); my $tag = 'XML::LibXML::Element'->new($tag_name); $parent->replaceChild($tag, $text); $tag->{query} = $query; } my $xml = "<foo>The quick br<bar>o</bar>wn <baz>f<bar>o</bar>x</baz>"; $xml .= " jumps over the lazy d<bar>o</bar>g.</foo>"; my $new_element = "canid"; my @queried = ("lazy dog", "quick brown fox",); my $dom = 'XML::LibXML'->load_xml(string => $xml); for my $query (@queried) { my @texts = $dom->findnodes('//text()'); my ($from, $to) = (0, 0); my $found; OUTER: while ($to <= $#texts) { my $subtext = join "", @texts[ $from .. $to ]; for my $length (1 .. length $query) { my $subquery = substr $query, 0, $length; $found = index $subtext, $subquery; ++$to, next OUTER if -1 != $found && $length < length $query && $length == length($subtext) - $found; $to = ++$from, next OUTER if -1 == $found; } my $subtext_length = sum(map length, @texts[ $from .. $to ]); my $last_pos = length($texts[$to]) - ($subtext_length - $found + - length $query); insert_tag($texts[$to], $last_pos, 'end', $query); insert_tag($texts[$from], $found, 'start', $query); last OUTER; } } print $dom;
($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

Replies are listed 'Best First'.
Re^2: LiBXML: New markup while preserving earlier tags?
by Samantabhadra (Acolyte) on Jun 23, 2018 at 18:15 UTC

    Thank you very much indeed, choroba, for this answer. It took me a while to get back to it and I learned a lot from analyzing your ingenious approach, although the core techniques your are using are still beyond my grasp.

    That is why I am not able to mend an unexpected outcome of your script: If more than one matches are found, only one is tagged, and in part only.

    If for example @queried = ("he"), the last occurence in the source string (-> tHE lazy dog) is not tagged, and the first occurence (-> tHE ... fox) is only furnished with the closing demarcation "<end query="he"/>".

    I keep on trying, but if the necessary correction of the script pops into your mind immediately, I would be grateful for a short hint, i.e. more indebted than I am already are.

      Congratulations, you've found a bug!

      The problem is that if the start and end tags both belong to the same text() node, insertion of the end tag creates a new text() nodes that replace the old one, so inserting the start tag into the old text() node doesn't insert it to the newly created one. Easily fixed by the following patch:

      @@ -12,18 +12,20 @@ my $after = substr $text, $pos; my $parent = $text->parentNode; - $parent->insertBefore('XML::LibXML::Text'->new($before), $text); + my $preceding + = $parent->insertBefore('XML::LibXML::Text'->new($before), $t +ext); $parent->insertAfter('XML::LibXML::Text'->new($after), $text); my $tag = 'XML::LibXML::Element'->new($tag_name); $parent->replaceChild($tag, $text); $tag->{query} = $query; + return $preceding } my $xml = "<foo>The quick br<bar>o</bar>wn <baz>f<bar>o</bar>x</baz>" +; $xml .= " jumps over the lazy d<bar>o</bar>g.</foo>"; my $new_element = "canid"; -my @queried = ("lazy dog", "quick brown fox",); +my @queried = ("lazy dog", "quick brown fox", "the"); my $dom = 'XML::LibXML'->load_xml(string => $xml); @@ -48,8 +50,10 @@ my $subtext_length = sum(map length, @texts[ $from .. $to ]); my $last_pos = length($texts[$to]) - ($subtext_length - $foun +d - length $query); - insert_tag($texts[$to], $last_pos, 'end', $query); - insert_tag($texts[$from], $found, 'start', $query); + my $preceding = insert_tag($texts[$to], $last_pos, 'end', $qu +ery); + + my $start_text = $from == $to ? $preceding : $texts[$from]; + insert_tag($start_text, $found, 'start', $query); last OUTER; }

      i.e. the insert_tag subroutine returns the newly created text() node preceding the tag, and it's used as the target text() when $from == $to, i.e. when both the elements belong to the same text().

      The code as written doesn't handle multiple occurrences of the query. Again, the fix is easy:

      @@ -25,7 +25,7 @@ $xml .= " jumps over the lazy d<bar>o</bar>g.</foo>"; my $new_element = "canid"; -my @queried = ("lazy dog", "quick brown fox", "the"); +my @queried = ("lazy dog", "quick brown fox", "he", "e"); my $dom = 'XML::LibXML'->load_xml(string => $xml); @@ -55,7 +55,10 @@ my $start_text = $from == $to ? $preceding : $texts[$from]; insert_tag($start_text, $found, 'start', $query); - last OUTER; + @texts = $dom->findnodes('//text()'); + $from += $from == $to ? 1 : 2; + + last OUTER if $from > @texts; } } print $dom;

      i.e. after the replacement, reload the text() nodes to search (this could probably be optimized*) to only replace the split one by the ones it's been split to), and start searching from the text where the end tag was inserted (when both the tags were inserted into the same text() node, the node was split into three text() nodes, if they belong to different text() nodes, each of them was split into two nodes, so there are four new nodes).

      *) Update: Here's the optimization:

      @@ -14,11 +14,12 @@ my $parent = $text->parentNode; my $preceding = $parent->insertBefore('XML::LibXML::Text'->new($before), $t +ext); - $parent->insertAfter('XML::LibXML::Text'->new($after), $text); + my $following + = $parent->insertAfter('XML::LibXML::Text'->new($after), $tex +t); my $tag = 'XML::LibXML::Element'->new($tag_name); $parent->replaceChild($tag, $text); $tag->{query} = $query; - return $preceding + return $preceding, $following } my $xml = "<foo>The quick br<bar>o</bar>wn <baz>f<bar>o</bar>x</baz>" +; @@ -50,12 +51,14 @@ my $subtext_length = sum(map length, @texts[ $from .. $to ]); my $last_pos = length($texts[$to]) - ($subtext_length - $foun +d - length $query); - my $preceding = insert_tag($texts[$to], $last_pos, 'end', $qu +ery); + my @new_texts = insert_tag($texts[$to], $last_pos, 'end', $qu +ery); + + splice @texts, $to, 1, @new_texts; + + my $start_text = $from == $to ? $new_texts[0] : $texts[$from] +; - my $start_text = $from == $to ? $preceding : $texts[$from]; insert_tag($start_text, $found, 'start', $query); - @texts = $dom->findnodes('//text()'); $from += $from == $to ? 1 : 2; last OUTER if $from > @texts;

      Note that it's not needed to splice the texts around the start tag, because that part has already been searched for the current query.

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
        Thank you so much again, choroba! I am incredibly grateful for the patch you have provided for your initial solution. This is my second question at perlmonks and I could not have dreamed how fruitful it would turn out to be.

        I benchmarked the two alternative methods you provided in the patch of the answer (= method one) and the updated optimization (= method two). As shown in the reports of these tests below, method two is indeed faster (test-1), and in some contexts significantly faster (test-2), than method one.
        But: method 2 sometimes tags fewer items than method 1 (test-3 to test-5). I think the reason for this is that method 2 only tags the first occurence of a queried item in a parents text node (test-6).

        I have organized the scripts utilized in this discussion in a repository at github. If you feel uncomfortable with that, or if this is considered bad practise here, please let me know, I will then remove the scripts.

        With the script "4_benchmark" at the repository the reports below should be reproduced. If I enter "4_benchmark test-1", etc. at my Ubuntu 16.04 terminal the following reports are given:


        test-1 (Application of the methods to the very small canid xml-string)

        Size of document to string: 122
        Number of queried items: 4
        ---
        Runtime method 1: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
        Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
        ---
        7 item(s) tagged with both methods.
        Unique item(s): he, e, quick brown fox, lazy dog
        ---
        Benchmark: running Method One, Method Two for at least 5 CPU seconds...
        Method One: 6 wallclock secs ( 5.37 usr + 0.00 sys = 5.37 CPU) @ 1375.98/s (n=7389)
        Method Two: 5 wallclock secs ( 5.40 usr + 0.00 sys = 5.40 CPU) @ 1673.33/s (n=9036)
        Rate Method One Method Two
        Method One 1376/s -- -18%
        Method Two 1673/s 22% --

        I report the individual single runtime of the methods, because the actual runtime is distorted in the comparision of the two methods with benchmark as, for now, I cannot but clone the dom n-times.
        Both methods tag the same number of items, method two is ca a fith faster here. The better rate of other iterations range between 19% and 23%.


        test-2 to test-5 (Application of the methods to the relatively small xml file "5_sample_file", rich with non-Asccii latin unicode characters)

        test-2 (method 2 is faster)


        file: 5_sample_file
        Size of document to string: 65715
        Number of queried items: 3
        ---
        Runtime method 1: 0 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU)
        Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
        ---
        46 item(s) tagged with both methods.
        Unique item(s): śrī, syād, āpta
        ---
        Benchmark: timing 350 iterations of Method One, Method Two...
        Method One: 10 wallclock secs (10.40 usr + 0.00 sys = 10.40 CPU) @ 33.65/s (n=350)
        Method Two: 4 wallclock secs ( 3.54 usr + 0.00 sys = 3.54 CPU) @ 98.87/s (n=350)
        Rate Method One Method Two
        Method One 33.7/s -- -66%
        Method Two 98.9/s 194% --

        Same number of items tagged satisfactory. Method Two nearly twice as fast.


        test-3 to test-5 (method 2 tags fewer results)

        test-3


        file: 5_sample_file
        Size of document to string: 65715
        Number of queried items: 1
        ---
        Runtime method 1: 1 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU)
        Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
        ---
        Method one tagged 62 item(s), method two 54!
        Method one tagged: patra
        Method two tagged: patra

        No need for benchmarking of what has different results. With Text::Diff I could verify that method 1 here always tags, firstly, the same occurences as method 2, and, secondly, additional occurences.
        Similarly with test-4 and test-5, which I carried out because I suspected unicode conflicts with system, dom or method 2.
        But no, the differences occur regardless of unicode characters in queried items and or document.

        In short test-4:
        Method one tagged 47 item(s), method two 45!
        Tagged: artha, vacana


        test-5:
        Method one tagged 26 item(s), method two 23!
        Tagged: ekāṃta, pratyakṣa


        The matter gets clearer, when the methods are applied to an extensively tagged xml-file:


        test-6 (Application of the methods to 6_sample_file2)

        file: 6_sample_file2
        Size of document to string: 4354
        Number of queried items: 1
        ---
        Runtime method 1: 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU)
        Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
        ---
        Method one tagged 96 item(s), method two 21!
        Method one tagged: a
        Method two tagged: a
        ---
        Show Differences: y/n?
        y

        @@ -10 +10 @@
        - <genre>dr<start query="a"/>a<end query="a"/>m<start query="a"/>a<end query="a"/></genre>
        + <genre>dr<start query="a"/>a<end query="a"/>ma</genre>
        ...
        @@ -95 +95 @@
        - <title>The M<start query="a"/>a<end query="a"/>rti<start query="a"/>a<end query="a"/>n</title>
        + <title>The M<start query="a"/>a<end query="a"/>rtian</title>
        ...
        + During <start query="a"/>a<end query="a"/> manned mission to Mars, Astronaut Mark Watney is
        + presumed dead after a fierce storm and left behind by his crew.
        + But Watney has survived and finds himself stranded and alone on
        + the hostile planet. With only meager supplies, he must draw upon
        + his ingenuity, wit and spirit to subsist and find a way to
        + signal to Earth that he is alive.


        Method 2 (+ above) only tags the first occurence of the queried item within the parents textnode.

        ***
        The implementation of method 1 is sufficient for what I am now exited to do. One last report:

        Size of document to string: 66252
        Number of queried items: 1
        ---
        Runtime method 1: 95 wallclock secs (95.37 usr + 0.00 sys = 95.37 CPU)
        Runtime method 2: 0 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU)
        ---
        Benchmark: timing 1 iterations of Method One, Method Two...
        Method One: 78 wallclock secs (77.91 usr + 0.03 sys = 77.94 CPU) @ 0.01/s (n=1)
        (warning: too few iterations for a reliable count)
        Method Two: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 33.33/s (n=1)
        (warning: too few iterations for a reliable count)
        s/iter Method One Method Two
        Method One 77.9 -- -100%
        Method Two 3.00e-02 259700% --
        ---
        Method one tagged 9135 item(s), method two 502!
        Tagged: a

        I am happy to wait 1.5 minutes for 9135 items in this document to be tagged. Still, I dream of more complex operations, if the consistency of method 1 and the performance of method 2 (like in test-2) could be achieved.

        The test can be reproduced with "4_benchmark a 1". The script accepts three arguments: (1) queried items separated by comma without space, (2) numeral for benchmark and (3) a filelocation.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1211984]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (6)
As of 2024-04-19 08:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found