Re^5: multiple XML fields in one line

by poj (Monsignor)
on Aug 08, 2014 at 21:36 UTC

in reply to Re^4: multiple XML fields in one line
in thread multiple XML fields in one line

This test program works against your sample data. Try running it against the complete file.

#!perl
use strict;
use warnings;
use XML::Simple;
use Data::Dump 'pp';

my $blast = XMLin('BLAST1.XML');
my $hits = $blast->{BlastOutput_iterations}->{Iteration}->{Iteration_hits}->{Hit};
my $ret;
#push @ret, $_->{Hit_def} foreach (@{$hits});
foreach (@{$hits}) {
  push @{$ret},join '|',
    $_->{Hit_def},
    $_->{Hit_num},
    $_->{Hit_hsps}->{Hsp}->{Hsp_identity};
}
pp $ret;

Re^6: multiple XML fields in one line
on Aug 09, 2014 at 22:16 UTC

    Ah, it's killing me. I tried your test program with my original XML file. Same error as before: 'Not a HASH reference at line 12' (which is: push @{$ret},join '|',).

    However, I also tried it with the partial file that I sent you. Wow! It works perfectly! I ran the original program again with your modification on the partial XML file. Again, it works perfectly, and I get exactly the results I hoped for.

    So is it related to the input file? Maybe my XML file is somehow messed up. So for testing I generated a few more XML files with the appropriate software, but all of them caused this 'Not a HASH reference' error. I compared the complete XML files with the partial XML I sent you, went over and over them like a thousand times, but I couldn't find any difference, except for the number of 'Hit'-s of course, and consequently, the size. Oh, there was one other thing: In the complete XMLs the lines ended with a single newline character (\n), but in the partial XML the EOL was a carriage return and a newline (\r\n). So I replaced all the \n with \r\n, but I still got the error, so the EOL seems to be irrelevant. And with the partial XML the program still worked correctly even if I replaced every \r\n with \n.

    I also tried to shamelessly hack into your code with my limited Perl knowledge, trying different ways to reference, but it only got worse (as had been expected :))

    So all in all, I am totally clueless. I don't get why it should be a HASH reference in the first place; @{$ret} is an array, right? Not a hash. I also don't get how the input file influences the reference, especially since line 12 has nothing related to the input file: it only says that we will push values onto the end of the empty @{$ret} array (and join some of them). And finally, I don't get what the key difference is between the 'good' and 'bad' XML files. Why does only the partial file work? If the program runs properly for 2 hits, why doesn't it for 99 hits?

    Mysterious. So much for today, tomorrow I will start removing the hits from a complete XML file one by one, to see if there is a size limit somewhere, or if it has any effect at all...

    Thank you for your selfless help again!

      Look in the XML file for instances where you have multiple <Hit_hsps> tags within a <Hit> or multiple <Hsp> tags within a <Hit_hsps>.
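Why repeated tags matter: as far as I understand XML::Simple's defaults, an element that occurs once comes back as a hash reference, while one that occurs several times under the same parent comes back as an array reference. The chain ->{Hit_hsps}->{Hsp}->{Hsp_identity} then hits an array where it expects a hash. Perl usually reports the line where a statement starts, which would explain why the message points at the push line rather than at the Hsp_identity line a few lines further down. Below is a minimal sketch using hand-built data instead of a real parse (that is an assumption on my part), with a hypothetical helper as_list that copes with both shapes:

```perl
use strict;
use warnings;

# Hand-built stand-ins for the default XMLin output (an assumption, not real
# parser output): one <Hsp> child gives a hash ref, several give an array ref.
my $one_hsp = { Hit_num => 1, Hit_hsps => { Hsp => { Hsp_identity => '16' } } };
my $two_hsp = { Hit_num => 2, Hit_hsps => { Hsp => [ { Hsp_identity => '16a' },
                                                      { Hsp_identity => '16b' } ] } };

# The original access chain treats Hsp as a hash, so it dies on the array ref.
my $err = '';
eval { my $id = $two_hsp->{Hit_hsps}{Hsp}{Hsp_identity}; 1 } or $err = $@;
print "caught: $err";

# as_list is a hypothetical helper: return the node(s) as a flat list,
# whether it was given a single hash ref or an array ref of them.
sub as_list {
    my ($node) = @_;
    return () unless defined $node;
    return ref($node) eq 'ARRAY' ? @{$node} : ($node);
}

my @out;
for my $hit ($one_hsp, $two_hsp) {
    my @ids = map { $_->{Hsp_identity} }
              map { as_list($_->{Hsp}) }
              as_list($hit->{Hit_hsps});
    push @out, join '|', $hit->{Hit_num}, @ids;
}
print "$_\n" for @out;
```

The same helper can be applied at every level that might repeat (Hit, Hit_hsps, Hsp), so the loop no longer cares how many of each the file contains.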

      This test data replicates your error

      <?xml version="1.0"?>
      <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://ww">
      <BlastOutput>
        <BlastOutput_iterations>
          <Iteration>
            <Iteration_hits>
              <Hit>
                <Hit_num>1</Hit_num>
                <Hit_def>Uncultured Sulfuricurvum sp. RIFRC-1, complete genome</Hit_def>
                <Hit_hsps>
                  <Hsp>
                    <Hsp_identity>16</Hsp_identity>
                  </Hsp>
                </Hit_hsps>
              </Hit>
              <Hit>
                <Hit_num>2</Hit_num>
                <Hit_def>Neosartorya fischeri NRRL 181 conserved hypothetical protein (NFIA_106270) partial mRNA</Hit_def>
                <Hit_hsps>
                  <Hsp>
                    <Hsp_identity>16</Hsp_identity>
                  </Hsp>
                </Hit_hsps>
                <Hit_hsps>
                  <Hsp>
                    <Hsp_identity>16a</Hsp_identity>
                  </Hsp>
                  <Hsp>
                    <Hsp_identity>16b</Hsp_identity>
                  </Hsp>
                </Hit_hsps>
              </Hit>
            </Iteration_hits>
          </Iteration>
        </BlastOutput_iterations>
      </BlastOutput>

      Update: Try this

      poj
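For readers following along, one common way to handle repeated elements with XML::Simple is the ForceArray option, which makes the named elements always come back as array references so that every level can be looped over the same way. The sketch below is not necessarily the code referred to above; it is a minimal illustration under that assumption. The inline sample is shortened from the test data in this thread, and emitting one output line per <Hsp> is my own choice, not something the thread specifies.

```perl
#!perl
use strict;
use warnings;
use XML::Simple;

# Shortened inline sample modelled on the test data in this thread
my $xml = <<'XML';
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_def>first hit</Hit_def>
          <Hit_hsps><Hsp><Hsp_identity>16</Hsp_identity></Hsp></Hit_hsps>
        </Hit>
        <Hit>
          <Hit_num>2</Hit_num>
          <Hit_def>second hit</Hit_def>
          <Hit_hsps>
            <Hsp><Hsp_identity>16a</Hsp_identity></Hsp>
            <Hsp><Hsp_identity>16b</Hsp_identity></Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
XML

# ForceArray: these elements are always array refs, even when they occur once.
# KeyAttr => []: switch off XML::Simple's folding of arrays into hashes.
my $blast = XMLin($xml,
    ForceArray => [ 'Hit', 'Hit_hsps', 'Hsp' ],
    KeyAttr    => [],
);

my $hits = $blast->{BlastOutput_iterations}{Iteration}{Iteration_hits}{Hit};

my @ret;
for my $hit (@{$hits}) {
    for my $hsps (@{ $hit->{Hit_hsps} }) {
        for my $hsp (@{ $hsps->{Hsp} }) {
            # One output line per Hsp (an illustrative choice)
            push @ret, join '|', $hit->{Hit_def}, $hit->{Hit_num}, $hsp->{Hsp_identity};
        }
    }
}
print "$_\n" for @ret;
```

To run it against a real file, pass the file name to XMLin instead of $xml. A file with more than one <Iteration> would presumably need 'Iteration' added to the ForceArray list and one more loop.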

        Sorry I couldn't access the net yesterday.

        Aaand... YES! That works. Fanfare and fireworks! :)

        So as you also found out, the problem was caused by some hits that have multiple <Hsp>-s within one <Hit>. Funny how it was right in front of my eyes, yet I didn't realize it at first. Actually, the first two hits I sent as a sample file just happened to be exceptions.

        I've edited the original script to include your code, and now it processes every file without any problem.

        Excellent work, thank you very much for doing my job and saving me a lot of headache! It was also nice to learn about Perl. Thanks a lot again!

