Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re^3: LiBXML: New markup while preserving earlier tags?

by choroba (Cardinal)
on Jun 24, 2018 at 15:21 UTC ( [id://1217324]=note: print w/replies, xml ) Need Help??


in reply to Re^2: LiBXML: New markup while preserving earlier tags?
in thread LiBXML: New markup while preserving earlier tags?

Congratulations, you've found a bug!

The problem is that if the start and end tags both belong to the same text() node, insertion of the end tag creates a new text() nodes that replace the old one, so inserting the start tag into the old text() node doesn't insert it to the newly created one. Easily fixed by the following patch:

@@ -12,18 +12,20 @@ my $after = substr $text, $pos; my $parent = $text->parentNode; - $parent->insertBefore('XML::LibXML::Text'->new($before), $text); + my $preceding + = $parent->insertBefore('XML::LibXML::Text'->new($before), $t +ext); $parent->insertAfter('XML::LibXML::Text'->new($after), $text); my $tag = 'XML::LibXML::Element'->new($tag_name); $parent->replaceChild($tag, $text); $tag->{query} = $query; + return $preceding } my $xml = "<foo>The quick br<bar>o</bar>wn <baz>f<bar>o</bar>x</baz>" +; $xml .= " jumps over the lazy d<bar>o</bar>g.</foo>"; my $new_element = "canid"; -my @queried = ("lazy dog", "quick brown fox",); +my @queried = ("lazy dog", "quick brown fox", "the"); my $dom = 'XML::LibXML'->load_xml(string => $xml); @@ -48,8 +50,10 @@ my $subtext_length = sum(map length, @texts[ $from .. $to ]); my $last_pos = length($texts[$to]) - ($subtext_length - $foun +d - length $query); - insert_tag($texts[$to], $last_pos, 'end', $query); - insert_tag($texts[$from], $found, 'start', $query); + my $preceding = insert_tag($texts[$to], $last_pos, 'end', $qu +ery); + + my $start_text = $from == $to ? $preceding : $texts[$from]; + insert_tag($start_text, $found, 'start', $query); last OUTER; }

i.e. the insert_tag subroutine returns the newly created text() node preceding the tag, and it's used as the target text() when $from == $to, i.e. when both the elements belong to the same text().

The code as written doesn't handle multiple occurrences of the query. Again, the fix is easy:

@@ -25,7 +25,7 @@ $xml .= " jumps over the lazy d<bar>o</bar>g.</foo>"; my $new_element = "canid"; -my @queried = ("lazy dog", "quick brown fox", "the"); +my @queried = ("lazy dog", "quick brown fox", "he", "e"); my $dom = 'XML::LibXML'->load_xml(string => $xml); @@ -55,7 +55,10 @@ my $start_text = $from == $to ? $preceding : $texts[$from]; insert_tag($start_text, $found, 'start', $query); - last OUTER; + @texts = $dom->findnodes('//text()'); + $from += $from == $to ? 1 : 2; + + last OUTER if $from > @texts; } } print $dom;

i.e. after the replacement, reload the text() nodes to search (this could probably be optimized*) to only replace the split one by the ones it's been split to), and start searching from the text where the end tag was inserted (when both the tags were inserted into the same text() node, the node was split into three text() nodes, if they belong to different text() nodes, each of them was split into two nodes, so there are four new nodes).

*) Update: Here's the optimization:

@@ -14,11 +14,12 @@ my $parent = $text->parentNode; my $preceding = $parent->insertBefore('XML::LibXML::Text'->new($before), $t +ext); - $parent->insertAfter('XML::LibXML::Text'->new($after), $text); + my $following + = $parent->insertAfter('XML::LibXML::Text'->new($after), $tex +t); my $tag = 'XML::LibXML::Element'->new($tag_name); $parent->replaceChild($tag, $text); $tag->{query} = $query; - return $preceding + return $preceding, $following } my $xml = "<foo>The quick br<bar>o</bar>wn <baz>f<bar>o</bar>x</baz>" +; @@ -50,12 +51,14 @@ my $subtext_length = sum(map length, @texts[ $from .. $to ]); my $last_pos = length($texts[$to]) - ($subtext_length - $foun +d - length $query); - my $preceding = insert_tag($texts[$to], $last_pos, 'end', $qu +ery); + my @new_texts = insert_tag($texts[$to], $last_pos, 'end', $qu +ery); + + splice @texts, $to, 1, @new_texts; + + my $start_text = $from == $to ? $new_texts[0] : $texts[$from] +; - my $start_text = $from == $to ? $preceding : $texts[$from]; insert_tag($start_text, $found, 'start', $query); - @texts = $dom->findnodes('//text()'); $from += $from == $to ? 1 : 2; last OUTER if $from > @texts;

Note that it's not needed to splice the texts around the start tag, because that part has already been searched for the current query.

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

Replies are listed 'Best First'.
Re^4: LiBXML: New markup while preserving earlier tags?
by Samantabhadra (Acolyte) on Jun 26, 2018 at 17:56 UTC
    Thank you so much again, choroba! I am incredibly grateful for the patch you have provided for your initial solution. This is my second question at perlmonks and I could not have dreamed how fruitful it would turn out to be.

    I benchmarked the two alternative methods you provided in the patch of the answer (= method one) and the updated optimization (= method two). As shown in the reports of these tests below, method two is indeed faster (test-1), and in some contexts significantly faster (test-2), than method one.
    But: method 2 sometimes tags fewer items than method 1 (test-3 to test-5). I think the reason for this is that method 2 only tags the first occurence of a queried item in a parents text node (test-6).

    I have organized the scripts utilized in this discussion in a repository at github. If you feel uncomfortable with that, or if this is considered bad practise here, please let me know, I will then remove the scripts.

    With the script "4_benchmark" at the repository the reports below should be reproduced. If I enter "4_benchmark test-1", etc. at my Ubuntu 16.04 terminal the following reports are given:


    test-1 (Application of the methods to the very small canid xml-string)

    Size of document to string: 122
    Number of queried items: 4
    ---
    Runtime method 1: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
    Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
    ---
    7 item(s) tagged with both methods.
    Unique item(s): he, e, quick brown fox, lazy dog
    ---
    Benchmark: running Method One, Method Two for at least 5 CPU seconds...
    Method One: 6 wallclock secs ( 5.37 usr + 0.00 sys = 5.37 CPU) @ 1375.98/s (n=7389)
    Method Two: 5 wallclock secs ( 5.40 usr + 0.00 sys = 5.40 CPU) @ 1673.33/s (n=9036)
    Rate Method One Method Two
    Method One 1376/s -- -18%
    Method Two 1673/s 22% --

    I report the individual single runtime of the methods, because the actual runtime is distorted in the comparision of the two methods with benchmark as, for now, I cannot but clone the dom n-times.
    Both methods tag the same number of items, method two is ca a fith faster here. The better rate of other iterations range between 19% and 23%.


    test-2 to test-5 (Application of the methods to the relatively small xml file "5_sample_file", rich with non-Asccii latin unicode characters)

    test-2 (method 2 is faster)


    file: 5_sample_file
    Size of document to string: 65715
    Number of queried items: 3
    ---
    Runtime method 1: 0 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU)
    Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
    ---
    46 item(s) tagged with both methods.
    Unique item(s): śrī, syād, āpta
    ---
    Benchmark: timing 350 iterations of Method One, Method Two...
    Method One: 10 wallclock secs (10.40 usr + 0.00 sys = 10.40 CPU) @ 33.65/s (n=350)
    Method Two: 4 wallclock secs ( 3.54 usr + 0.00 sys = 3.54 CPU) @ 98.87/s (n=350)
    Rate Method One Method Two
    Method One 33.7/s -- -66%
    Method Two 98.9/s 194% --

    Same number of items tagged satisfactory. Method Two nearly twice as fast.


    test-3 to test-5 (method 2 tags fewer results)

    test-3


    file: 5_sample_file
    Size of document to string: 65715
    Number of queried items: 1
    ---
    Runtime method 1: 1 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU)
    Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
    ---
    Method one tagged 62 item(s), method two 54!
    Method one tagged: patra
    Method two tagged: patra

    No need for benchmarking of what has different results. With Text::Diff I could verify that method 1 here always tags, firstly, the same occurences as method 2, and, secondly, additional occurences.
    Similarly with test-4 and test-5, which I carried out because I suspected unicode conflicts with system, dom or method 2.
    But no, the differences occur regardless of unicode characters in queried items and or document.

    In short test-4:
    Method one tagged 47 item(s), method two 45!
    Tagged: artha, vacana


    test-5:
    Method one tagged 26 item(s), method two 23!
    Tagged: ekāṃta, pratyakṣa


    The matter gets clearer, when the methods are applied to an extensively tagged xml-file:


    test-6 (Application of the methods to 6_sample_file2)

    file: 6_sample_file2
    Size of document to string: 4354
    Number of queried items: 1
    ---
    Runtime method 1: 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU)
    Runtime method 2: 0 wallclock secs ( 0.00 usr + 0.00 sys = 0.00 CPU)
    ---
    Method one tagged 96 item(s), method two 21!
    Method one tagged: a
    Method two tagged: a
    ---
    Show Differences: y/n?
    y

    @@ -10 +10 @@
    - <genre>dr<start query="a"/>a<end query="a"/>m<start query="a"/>a<end query="a"/></genre>
    + <genre>dr<start query="a"/>a<end query="a"/>ma</genre>
    ...
    @@ -95 +95 @@
    - <title>The M<start query="a"/>a<end query="a"/>rti<start query="a"/>a<end query="a"/>n</title>
    + <title>The M<start query="a"/>a<end query="a"/>rtian</title>
    ...
    + During <start query="a"/>a<end query="a"/> manned mission to Mars, Astronaut Mark Watney is
    + presumed dead after a fierce storm and left behind by his crew.
    + But Watney has survived and finds himself stranded and alone on
    + the hostile planet. With only meager supplies, he must draw upon
    + his ingenuity, wit and spirit to subsist and find a way to
    + signal to Earth that he is alive.


    Method 2 (+ above) only tags the first occurence of the queried item within the parents textnode.

    ***
    The implementation of method 1 is sufficient for what I am now exited to do. One last report:

    Size of document to string: 66252
    Number of queried items: 1
    ---
    Runtime method 1: 95 wallclock secs (95.37 usr + 0.00 sys = 95.37 CPU)
    Runtime method 2: 0 wallclock secs ( 0.04 usr + 0.00 sys = 0.04 CPU)
    ---
    Benchmark: timing 1 iterations of Method One, Method Two...
    Method One: 78 wallclock secs (77.91 usr + 0.03 sys = 77.94 CPU) @ 0.01/s (n=1)
    (warning: too few iterations for a reliable count)
    Method Two: 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU) @ 33.33/s (n=1)
    (warning: too few iterations for a reliable count)
    s/iter Method One Method Two
    Method One 77.9 -- -100%
    Method Two 3.00e-02 259700% --
    ---
    Method one tagged 9135 item(s), method two 502!
    Tagged: a

    I am happy to wait 1.5 minutes for 9135 items in this document to be tagged. Still, I dream of more complex operations, if the consistency of method 1 and the performance of method 2 (like in test-2) could be achieved.

    The test can be reproduced with "4_benchmark a 1". The script accepts three arguments: (1) queried items separated by comma without space, (2) numeral for benchmark and (3) a filelocation.
      Oh sorry, I didn't have enough testing data. Thank you for providing them.

      Does the following line fix the optimized version?

      $from += $from == $to++ ? 1 : 2;
      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
        oh.my.holy.pope -- sorry bishop ;-) yes it does!

        test-6

        file: 6_sample_file2
        Size of document to string: 4354
        Number of queried items: 1
        ---
        96 item(s) tagged with both methods.
        Unique item(s): a
        ---
        Benchmark: timing 1000 iterations of Method One, Method Two...
        Method One: 19 wallclock secs (18.96 usr + 0.00 sys = 18.96 CPU) @ 52.74/s (n=1000)
        Method Two: 5 wallclock secs ( 5.01 usr + 0.00 sys = 5.01 CPU) @ 199.60/s (n=1000)
        Rate Method One Method Two
        Method One 52.7/s -- -74%
        Method Two 200/s 278% --

        All other tests also successfull. With iterations all timed over 5 wallclock seconds, better rates of method 2 are now: 287%, 300%, 435%. -- The longer it takes, the better the rates.
        Finally, the test where every letter a is marked in the document:

        a-test

        Size of document to string: 65715
        Number of queried items: 1
        ---
        9103 item(s) tagged with both methods.
        Unique item(s): a
        ---
        Runtime method 1: 77 wallclock secs (77.06 usr + 0.12 sys = 77.18 CPU)
        Runtime method 2: 0 wallclock secs ( 0.43 usr + 0.00 sys = 0.43 CPU)

        An optimized version indeed. Thank you.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1217324]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (4)
As of 2024-04-19 16:16 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found