PerlMonks  

Re^5: multiple XML fields in one line

by poj (Priest)
on Aug 08, 2014 at 21:36 UTC (#1096812)


in reply to Re^4: multiple XML fields in one line
in thread multiple XML fields in one line

This test program works against your sample data. Try running it against the complete file.

#!perl
use strict;
use warnings;
use XML::Simple;
use Data::Dump 'pp';

my $blast = XMLin('BLAST1.XML');
my $hits = $blast->{BlastOutput_iterations}->{Iteration}->{Iteration_hits}->{Hit};
my $ret;
#push @ret, $_->{Hit_def} foreach (@{$hits});
foreach (@{$hits}) {
  push @{$ret},join '|',
    $_->{Hit_def},
    $_->{Hit_num},
    $_->{Hit_hsps}->{Hsp}->{Hsp_identity};
}
pp $ret;
poj


Re^6: multiple XML fields in one line
by smice (Initiate) on Aug 09, 2014 at 22:16 UTC

    Ah, it's killing me. I tried your test program with my original XML file. Same error as before: 'Not a HASH reference at line 12' (which is: push @{$ret},join '|',).

    However, I also tried it with the partial file that I sent you. Wow! It works perfectly! I ran the original program again with your modification on the partial XML file. Again, it works perfectly; I get exactly the results I hoped for.

    So is it related to the input file? Maybe my XML file is somehow messed up. So for testing I generated a few more XML files with the appropriate software, but all of them caused this 'Not a HASH reference' error. I compared the complete XML files with the partial XML I sent you, went over and over them like a thousand times, but I couldn't find any difference, except for the number of 'Hit'-s of course, and consequently, the size. Oh, there was one other thing: In the complete XMLs the lines ended with a single newline character (\n), but in the partial XML the EOL was a carriage return and a newline (\r\n). So I replaced all the \n with \r\n, but I still got the error, so the EOL seems to be irrelevant. And with the partial XML the program still worked correctly even if I replaced every \r\n with \n.

    I also tried to shamelessly hack your code with my limited Perl knowledge, trying different ways to dereference, but it only got worse (as was to be expected :))

    So all in all, I am totally clueless. I don't get why it should be a HASH reference in the first place; @{$ret} is an array, right? Not a hash. Then I don't get how the input file influences the reference, especially since line 12 has nothing to do with the input file: it only says that we will push values onto the end of the empty @{$ret} array (and join some of them). And finally, I don't get what the key difference is between the 'good' and 'bad' XML files. Why does only the partial file work? If the program runs properly for 2 hits, why doesn't it for 99 hits?

    Mysterious. So much for today, tomorrow I will start removing the hits from a complete XML file one by one, to see if there is a size limit somewhere, or if it has any effect at all...

    Thank you for your selfless help again!

      Look in the XML file for instances where you have multiple <Hit_hsps> tags within a <Hit> or multiple <Hsp> tags within a <Hit_hsps>.

      This test data replicates your error

      <?xml version="1.0"?>
      <!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
      <BlastOutput>
        <BlastOutput_iterations>
          <Iteration>
            <Iteration_hits>
              <Hit>
                <Hit_num>1</Hit_num>
                <Hit_def>Uncultured Sulfuricurvum sp. RIFRC-1, complete genome</Hit_def>
                <Hit_hsps>
                  <Hsp>
                    <Hsp_identity>16</Hsp_identity>
                  </Hsp>
                </Hit_hsps>
              </Hit>
              <Hit>
                <Hit_num>2</Hit_num>
                <Hit_def>Neosartorya fischeri NRRL 181 conserved hypothetical protein (NFIA_106270) partial mRNA</Hit_def>
                <Hit_hsps>
                  <Hsp>
                    <Hsp_identity>16</Hsp_identity>
                  </Hsp>
                </Hit_hsps>
                <Hit_hsps>
                  <Hsp>
                    <Hsp_identity>16a</Hsp_identity>
                  </Hsp>
                  <Hsp>
                    <Hsp_identity>16b</Hsp_identity>
                  </Hsp>
                </Hit_hsps>
              </Hit>
            </Iteration_hits>
          </Iteration>
        </BlastOutput_iterations>
      </BlastOutput>
      Update : Try this poj
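
      The code behind the update link is not reproduced in this thread. As a rough sketch of one way to cope with the situation described above: with its default options, XML::Simple returns a hash reference for an element that occurs once and an array reference for an element that repeats, so $_->{Hit_hsps}->{Hsp} dies with 'Not a HASH reference' as soon as a <Hit> holds several <Hit_hsps>. The helper as_list and the sub collect_rows below are hypothetical names introduced only for illustration, and $hits is a hand-built stand-in for what XMLin() would return for the test data, so the example runs without the XML file.

```perl
#!perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): flatten a value that may be
# undef, a single hash ref, or an array ref of hash refs into a list.
sub as_list {
    my ($node) = @_;
    return ()       unless defined $node;
    return @{$node} if ref $node eq 'ARRAY';
    return ($node);
}

# Hand-built stand-in (an assumption) for the structure XMLin() produces
# from the test data: hit 1 has one <Hit_hsps> with one <Hsp>; hit 2 has
# two <Hit_hsps>, the second of which holds two <Hsp>.
my $hits = [
    {
        Hit_num  => 1,
        Hit_def  => 'Uncultured Sulfuricurvum sp. RIFRC-1, complete genome',
        Hit_hsps => { Hsp => { Hsp_identity => '16' } },
    },
    {
        Hit_num  => 2,
        Hit_def  => 'Neosartorya fischeri NRRL 181 conserved hypothetical protein (NFIA_106270) partial mRNA',
        Hit_hsps => [
            { Hsp => { Hsp_identity => '16' } },
            { Hsp => [ { Hsp_identity => '16a' }, { Hsp_identity => '16b' } ] },
        ],
    },
];

# Build one 'Hit_def|Hit_num|Hsp_identity' row per <Hsp>, whatever the
# nesting turned out to be (hash ref or array ref at each level).
sub collect_rows {
    my ($hits) = @_;
    my @rows;
    for my $hit ( as_list($hits) ) {
        for my $hsps ( as_list( $hit->{Hit_hsps} ) ) {
            for my $hsp ( as_list( $hsps->{Hsp} ) ) {
                push @rows, join '|',
                    $hit->{Hit_def}, $hit->{Hit_num}, $hsp->{Hsp_identity};
            }
        }
    }
    return @rows;
}

print "$_\n" for collect_rows($hits);
```

      An alternative that may be simpler is XML::Simple's ForceArray option, for example XMLin('BLAST1.XML', ForceArray => ['Hit', 'Hit_hsps', 'Hsp']), which should make those elements array references every time so that the loops never need to check ref.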

        Sorry I couldn't access the net yesterday.

        Aaand... YES! That works. Fanfare and fireworks! :)

        So as you also found out, the problem was caused by some hits that have multiple <Hsp>-s within one <Hit>. Funny how it was right in front of my eyes, yet I didn't realize it at first. Actually, the first two hits I sent as a sample file just happened to be exceptions.

        I've edited the original script to include your code, and now it processes every file without any problem.

        Excellent work, thank you very much for doing my job and saving me a lot of headache! It was also nice to learn about Perl. Thanks a lot again!
