matth has asked for the wisdom of the Perl Monks concerning the following question:
Hi all,
I have an input file looking like:

    1<species xxx = "sp">
    1 <sequence xx = "" xxxxx = "xxxxxxx">
    1 <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " ">
    1 <gene xx = "xxxxxxxxxxx" xxxxxx = "x">
    1 <gene_seq xxxxxxx = "" xxxxxx = "" xxxxxxx = "2" xxxxxxxxx = "" xxxxx = "5999" xxxx = "6318" xxxxxxx = "" xxxxxxx = "" xxxxxx = "F">
    1 </gene_seq>
    1 </gene>
    1 </genome_feature>
    1 </sequence>
    1</species>
    2<species xxx = "sp">
    2 <sequence xx = "" xxxxx = "xxxxxxx">
    2 <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " ">
    2 <gene xx = "xxxxxxxxxxx" xxxxxx = "x">
    2 <gene_seq xxxxxxx = "" xxxxxx = "" xxxxxxx = "2" xxxxxxxxx = "" xxxxx = "5999" xxxx = "6318" xxxxxxx = "" xxxxxxx = "" xxxxxx = "F">
    2 </gene_seq>
    2 </gene>
    2 </genome_feature>
    2 </sequence>
    2</species>
    etc.........................................

(the x's stand in for real words)

My program goes through this file (above) with a while loop. It is meant to remove duplicate nodes, so that the output does not go back to the root nodes (species, sequence etc.) for each gene tag. But it does not work. I have a subroutine along the lines of the following (just add a few more lines dealing with more XML nodes):

    sub deal_with_xml_line_by_line($){
        $final_out = "new_out_again.txt";
        open (OUTPUT_SLIMED, "+>>$final_out");
        my ($XML_line) = @_;
        $XML_class_node_X_old = $XML_class_node_X;
        $XML_class_first_node_old = $XML_class_first_node;
        if ($XML_process_line =~ /^(\d{1,10})([\%|\<].{1,1000}\>)/){
            print "\nhereF\n";
            print "\n$1\n";
            #exit;
            $XML_class_node_X = "$1.$2";
            if ($XML_class_node_X_old == $XML_class_node_X){
                #do nothing
            }
            else{
                print OUTPUT_SLIMED "$XML_class_node_X\n";
                return $XML_class_node_X;
            }
        }
        if ($XML_process_line =~ /^(\d{1,10})(\s[\%|\<].{1,1000}\>)/){
            print "\nhereF\n";
            print "\n$1.$2\n";
            #exit;
            $XML_class_first_node = $1.$2;
            # print ":$XML_class_fist_node\n";
            if ($XML_class_first_node_old == $XML_class_first_node){
                #do nothing
            }
            else{
                print OUTPUT_SLIMED "$XML_class_first_node\n";
                return $XML_class_first_node;
            }
        }
    }

The output produced from this is:

    1 <species xxx = "sp">
    1 <sequence xx = "" xxxxx = "xxxxxxx">
    1 <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " ">
    1 <gene xx = "xxxxxxxxxxx" xxxxxx = "x">
    1 <gene_seq xxxxxxx = "" xxxxxx = "" xxxxxxx = "2" xxxxxxxxx = "" xxxxx = "5999" xxxx = "6318" xxxxxxx = "" xxxxxxx = "" xxxxxx = "F">
    2 <species xxx = "sp">
    2 <sequence xx = "" xxxxx = "xxxxxxx">
    2 <genome_xxxxxx = "CDS" xxxxxx = "" xxxxxxx = "" xxxxxxxxx = " ">
    2 <gene xx = "xxxxxxxxxxx" xxxxxx = "x">
    2 <gene_seq xxxxxxx = "" xxxxxx = "" xxxxxxx = "2" xxxxxxxxx = "" xxxxx = "5999" xxxx = "6318" xxxxxxx = "" xxxxxxx = "" xxxxxx = "F">

This is not what I want. Given time, I expect I could solve this problem myself, but I have to go to bed now. Any suggestions?
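
For reference, here is a minimal sketch of how the deduplication described above might be approached. It rests on several assumptions not stated in the post: that the goal is to emit an opening tag only when it differs from the last opening tag seen at the same nesting depth, that depth can be tracked by counting opening and closing tags, and that closing tags are dropped from the output. The name `dedupe_opening_tags`, the array `@last_at_depth`, and the attribute names in the sample data are all hypothetical. Note that the tags are compared with `eq`, because `==` in Perl compares its operands numerically.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the opening tags that differ from the last
# opening tag seen at the same nesting depth. Closing tags are not kept;
# they only step the depth counter back out.
sub dedupe_opening_tags {
    my @lines = @_;
    my @last_at_depth;    # last opening tag seen at each depth
    my $depth = 0;
    my @kept;

    for my $line (@lines) {
        # Record number, optional whitespace, then a single tag.
        next unless $line =~ /^(\d+)\s*(<.+>)\s*$/;
        my $tag = $2;

        if ($tag =~ m{^</}) {    # closing tag
            $depth-- if $depth > 0;
            next;
        }

        my $seen = $last_at_depth[$depth];
        # String comparison: eq, not ==.
        unless (defined $seen && $seen eq $tag) {
            $last_at_depth[$depth] = $tag;
            $#last_at_depth = $depth;    # a new parent invalidates deeper levels
            push @kept, (' ' x $depth) . $tag;
        }
        $depth++;
    }
    return @kept;
}

# Made-up sample data in the same "record number + tag" layout.
my @input = (
    '1<species name = "sp">',
    '1 <gene id = "a">',
    '1 </gene>',
    '1</species>',
    '2<species name = "sp">',
    '2 <gene id = "b">',
    '2 </gene>',
    '2</species>',
);
print "$_\n" for dedupe_opening_tags(@input);
```

With this sample, the second `species` line is skipped because it repeats the first, while both `gene` lines are kept because their attributes differ. If the real data needs well-formed XML on output, an XML parser module would likely be a safer basis than line-by-line regexes.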