Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

Removing duplicate subtrees from XML

by matth (Monk)
on Dec 03, 2002 at 01:41 UTC ( #217102=perlquestion: print w/replies, xml ) Need Help??

matth has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I have an input file looking like:
1<species xxx = "sp"> 1 <sequence xx = "" xxxxx = "xxxxxxx"> 1 <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " "> 1 <gene xx = "xxxxxxxxxxx" xxxxxx = "x"> 1 <gene_seq xxxxxxx = "" xxxxxx = "" xxxxxxx = "2" xxxxxxxxx = + "" xxxxx = "5999" xxxx = "6318" xxxxxxx = "" xxxxxxx = "" xxxxx +x = "F"> 1 </gene_seq> 1 </gene> 1 </genome_feature> 1 </sequence> 1</species> 2<species xxx = "sp"> 2 <sequence xx = "" xxxxx = "xxxxxxx"> 2 <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " "> 2 <gene xx = "xxxxxxxxxxx" xxxxxx = "x"> 2 <gene_seq xxxxxxx = "" xxxxxx = "" xxxxxxx = "2" xxxxxxxxx = + "" xxxxx = "5999" xxxx = "6318" xxxxxxx = "" xxxxxxx = "" xxxxx +x = "F"> 2 </gene_seq> 2 </gene> 2 </genome_feature> 2 </sequence> 2</species> etc......................................... (xxxxxxs substitute real words)
My program goes through this file (above) with a while statement. It seeks to remove duplicate nodes so that it does not go back to the root nodes of species, sequence etc. for each gene tag.But it does not work. I have a subroutine along the lines of: (just add a few more lines dealing with more XML nodes)
sub deal_with_xml_line_by_line($){ $final_out = "new_out_again.txt"; open (OUTPUT_SLIMED, "+>>$final_out"); my ($XML_line) = @_; $XML_class_node_X_old = $XML_class_node_X; $XML_class_first_node_old = $XML_class_first_node; if ($XML_process_line =~ /^(\d{1,10})([\%|\<].{1,1000}\>)/){ print "\nhereF\n"; print "\n$1\n"; #exit; $XML_class_node_X = "$1.$2"; if ($XML_class_node_X_old == $XML_class_node_X){ #do nothing } else{ print OUTPUT_SLIMED "$XML_class_node_X\n"; return $XML_class_node_X; } } if ($XML_process_line =~ /^(\d{1,10})(\s[\%|\<].{1,1000}\>)/){ print "\nhereF\n"; print "\n$1.$2\n"; #exit; $XML_class_first_node = $1.$2; # print ":$XML_class_fist_node\n"; if ($XML_class_first_node_old == $XML_class_first_node){ #do nothing } else{ print OUTPUT_SLIMED "$XML_class_first_node\n"; return $XML_class_first_node; } } }
The output produced from this is :
1 <species xxx = "sp"> 1 <sequence xx = "" xxxxx = "xxxxxxx"> 1 <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " "> 1 <gene xx = "xxxxxxxxxxx" xxxxxx = "x"> 1 <gene_seq xxxxxxx = "" xxxxxx = "" xxxxxxx = "2" xxxxxxxxx = + "" xxxxx = "5999" xxxx = "6318" xxxxxxx = "" xxxxxxx = "" xxxxx +x = "F"> 2 <species xxx = "sp"> 2 <sequence xx = "" xxxxx = "xxxxxxx"> 2 <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " "> 2 <gene xx = "xxxxxxxxxxx" xxxxxx = "x"> 2 <gene_seq xxxxxxx = "" xxxxxx = "" xxxxxxx = "2" xxxxxxxxx = + "" xxxxx = "5999" xxxx = "6318" xxxxxxx = "" xxxxxxx = "" xxxxx +x = "F">
This is not what I want. Given the time I expect that I could solve this problem. But I have to go to bed now. Any suggestions?

Replies are listed 'Best First'.
Re: Removing duplicate subtrees from XML
by dingus (Friar) on Dec 03, 2002 at 08:28 UTC
    Two suggestions:

    1) if the data is machine generated and you know that it will be identical when the gene sequence etc. is identical, that is to say you know that you will never see the tags presented in a a different order then you can usefully compare whole records at a time by setting the input record seperator appropriately:

    $/ = '</species>.$/; my $prev_rec=''; while (<INFILE>) { next if ($_ eq $prev_rec); $prev_rec = $_; # process $_ somehow as it is unique }
    Note that in addition to the drawback noted above, this code requires records that are identical to be adjacent.

    2) A better way is probably to XML::Twig (or XML::Simple perhaps) the file and compare the resulting data structures. This avoids the tags must be in identical order problem and, depending on your code, may also avoid the identical records must be adjacent problem too.

    A cunning way to compare two arbitrary data structures is to use Data::Dumper and string comparisons:

    use Data::Dumper; if (Dumper(\%struct1) eq Dumper(\%struct2)) { do something; }
    The disadvantage is that this second method will be quite a lot slower. Because I quite like processing enormous files quickly I have used methods similar to the first method successfully on records pulled from pubmed.

    Dingus


    Enter any 47-digit prime number to continue.
Re: Removing duplicate subtrees from XML
by elusion (Curate) on Dec 03, 2002 at 01:55 UTC
    I can see right now a BIG problem with this line:

    if ($XML_process_line =~ /^(\d{1,10})([\%|\<].{1,1000}\>)/){

    I don't think it does what you want. First of all, this: [\%|\<]. You use [] and |, I think you want one or the other. If you want to alternate between % and <, use [\%<].

    Second, and more important, this: .{1,1000}>. Perl's regexes are greedy, that means that if you do this: "<one></one>" =~ /<(.{1,1000})>/;print $1;, you're going to get one></one printed, because it matches as many characters as possible before stopping.

    Instead, you'd want to use [\%|\<][^>]{1,1000}>, which uses a negative character class.

    That being said, this is hard to do and even harder to do right, so you should use a module. I would suggest XML::Twig, but there plenty of others as well.

    elusion : http://matt.diephouse.com

    Update: I also noticed that you use two variables for your line. You assign to $XML_line, but use your regex on $XML_process_line. Remember to use -w and strict.

      reply to Update. Well spotted. I hand edited that variable in an attempt make the variable names more meaningfull, prior to pasting. Thanks for all the advice.
Re: Removing duplicate subtrees from XML
by dakkar (Hermit) on Dec 03, 2002 at 11:51 UTC

    I think it would be better if you used some of the XML modules out there. Parsing XML by hand is (nearly) always the wrong choice.

    For example, I wrote a simple XSLT solution. It's not Perl, and it's not pretty, but it seems to work.

    -- 
            dakkar - Mobilis in mobile
    
      XSLT look like a logical way of representing the data structure. However, it also looks like I would have to spend some time learning about this system. I may come back to it in the future. Is there first class perl module support for XSLT?

        XSLT is a language based on very different principles than Perl. It is very popular for XML transformation (although I could never quite figure out why ;--). See Recommend XSL module for a list of Perl modules that support it (usually by interfacing with external XSLT engines).

Re: Removing duplicate subtrees from XML
by Zaxo (Archbishop) on Dec 03, 2002 at 02:15 UTC

    If I understand correctly, you want to rewrite the xml file to consolodate like roots in a common root subtree. I made a stab at it, but your sample data is too ill-formed to test.

    Please post some correct data. Schematic or sketchy data is ok, but at least validate it.

    After Compline,
    Zaxo

Re: Removing duplicate subtrees from XML
by mirod (Canon) on Dec 03, 2002 at 12:41 UTC

    Your requirements are a bit sketchy and I can see 2 things you might want to do in this case (once you get proper XML of course ;--):

    • get the minimun document with properly nested container elements, each "low-level" element being in the proper hierarchy. In general, unless the low-level elements are already ordered (in which case see below), you will need to load the entire data set in memory, which might be a problem if you are dealing with the Human Genome,
    • just try to merge adjacent nodes, that is if a node can be merged with the previous one, then be it, otherwise don't try to merge it with nodes that are further apart. This can be performed by keeping just parts of the data in memory, so it can be performed on much bigger data sets. This should give the same result as the previous option if the nodes are sorted.

    Here is a piece of code using the usual suspect that does both (on a made up XML data set, but adapting it to your data should be really easy):

    #!/usr/bin/perl -w use strict; use XML::Twig; $/="\n\n"; # tag => attribute we are interested in # you could avoid having this global by putting it in the twig my %att= ( elt => 'elt_class', subelt => 'subelt_class', ); my $doc = <DATA>; # the original data set my $expected_sorted_doc = <DATA>; # sorted result my $expected_merged_doc = <DATA>; # merged result my $sorted_doc= sort_doc( $doc); if( compact( $sorted_doc) eq compact( $expected_sorted_doc)) { print "sorted doc generation OK\n"; } else { print "sorted doc generation NOK: \n", "expected:\n$expected_sorted_doc\n", "found:\n$sorted_doc\n"; } my $merged_doc= merge_doc( $doc); if( compact( $merged_doc) eq compact( $expected_merged_doc)) { print "merged doc generation OK\n"; } else { print "merged doc generation NOK: \n", "expected:\n$expected_merged_doc\n", "found:\n$merged_doc\n"; } # sort: for each relevant node (elt or subelt in this case) generate a + location # key and move the content if the location already exists sub sort_doc { my( $doc)= @_; my $location={}; # location key => existing subelt element with t +his location key my $t= XML::Twig->new( twig_handlers => { elt => sub { sort_nod +e( $location, @_); }, subelt => sub { sort_nod +e( $location, @_); }, }, pretty_print => 'indented', # makes debuggi +ng easier ); $t->parse( $doc); return $t->sprint; } sub sort_node { my( $location, $t, $node)= @_; # compute the location key, which must describe uniquely the node +category my $location_key= location( $node); # now see if we need to move the content if( my $new_parent= $location->{$location_key}) { # there is already an element with this location key # move all content's there foreach my $content ($node->children) { $content->move( last_child => $new_parent); } # no need to keep the empty shell $node->delete unless( $node->has_child); } else { # first time we see the location key, store the element in $lo +cation $location->{$location_key}= $node; } } # the location describes a node category, nodes with the same location + should have the same parent sub location { my( $node)= @_; # a compact way to just join the values of the proper attributes o +f the ancestors of the node my $location= join( '-', grep {$_} map { $_->att( $att{$_->tag}) | +| '' } (@{[$node->ancestors]}, $node)); #warn "location: $location\n"; return $location; } # merge doc sub merge_doc { my( $doc)= @_; my $t= XML::Twig->new( twig_handlers => { elt => \&merge_node, subelt => \&merge_node, }, pretty_print => 'indented', ); $t->parse( $doc); return $t->sprint; } sub merge_node { my( $t, $node)= @_; my $potential_merger= $node->prev_elt( $node->tag) or return; # re +turn if this is the first node of this type if( location( $node) eq location( $potential_merger)) { # bingo! we can merge the contents foreach my $content ($node->children) { $content->move( last_child => $potential_merger); } } else { # this branch is not used for this test as we are working in m +emory # but this is where you could free the memory by dumping the p +art of # the tree that will no longer need to be updated # $t->flush_up_to( $potential_merger); } $node->delete unless( $node->has_child); } sub compact { my( $doc)= @_; $doc=~ s{^\s+}{}; # trim at the begining $doc=~ s{\s+$}{}; # trim the end $doc=~ s{>\s*<}{><}g; # trim spaces between tags return $doc; } __DATA__ <doc> <elt elt_class="class1"> <subelt subelt_class="sclass1"><content id="content1"/></subelt> </elt> <elt elt_class="class1"> <subelt subelt_class="sclass1"><content id="content2"/></subelt> </elt> <elt elt_class="class1"> <subelt subelt_class="sclass2"><content id="content3"/></subelt> </elt> <elt elt_class="class2"> <subelt subelt_class="sclass3"><content id="content4"/></subelt> </elt> <elt elt_class="class1"> <subelt subelt_class="sclass1"><content id="content5"/></subelt> </elt> </doc> <doc> <elt elt_class="class1"> <subelt subelt_class="sclass1"> <content id="content1"/> <content id="content2"/> <content id="content5"/> </subelt> <subelt subelt_class="sclass2"> <content id="content3"/> </subelt> </elt> <elt elt_class="class2"> <subelt subelt_class="sclass3"> <content id="content4"/> </subelt> </elt> </doc> <doc> <elt elt_class="class1"> <subelt subelt_class="sclass1"> <content id="content1"/> <content id="content2"/> </subelt> <subelt subelt_class="sclass2"> <content id="content3"/> </subelt> </elt> <elt elt_class="class2"> <subelt subelt_class="sclass3"> <content id="content4"/> </subelt> </elt> <elt elt_class="class1"> <subelt subelt_class="sclass1"> <content id="content5"/> </subelt> </elt> </doc>
      Thanks for this code. It works very well with your data. Unfortunately it does not seem to work when attribute data is missing. Is there an easy way to overcome this?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://217102]
Approved by Courage
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others examining the Monastery: (3)
As of 2019-08-21 02:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?