Sorry, I know my description sucks. I'll try to answer your questions by providing some of the actual data and the output, which I made manually. There will never be set dimensions, as every dataset will change depending on the patients' tumors coming in. The idea is to have a script that can handle dimensions of any size.

The file I need to read in has this structure. These 19 lines are taken from the real file, which has 5,772,080 lines in the same format:

Gene Name  Patient ID  Patient Diagnosis            Amino Acid Mutation and Site  Protein Length
AAK1       19679       adenocarcinoma               L661I                         21265
AAK1       19679       adenocarcinoma               L664T                         21265
AAK1       19679       adenocarcinoma               L664T                         21265
AAK1       19679       adenocarcinoma               L664T                         21265
AAK1       19679       adenocarcinoma               L664T                         21265
AAK1       19679       adenocarcinoma               L664T                         21265
AAK1       19676       adenocarcinoma               L664T                         21265
AAK1       19677       adenocarcinoma               L64F                          21265
AAK1       19678       adenocarcinoma               L64R                          21265
FKT1       101063      ER-PR-sitive_carcinoma       p.L52R                        2773
FKT1       103872      ER-PR-sitive_carcinoma       p.E17K                        2773
FKT1       107590      ER-PR-sitive_carcinoma       p.E17K                        2773
FKT1       107600      ER-PR-sitive_carcinoma       p.E17K                        2773
FKT1       1135911     NS                           E17K                          2773
TET3       152         chronic_lymocytic_leukaemia  p.R401H                       10982
TET3       587220      adenocarcinoma               M935V                         10982
TET3       587220      adenocarcinoma               R1534Q                        10982
TET3       587256      adenocarcinoma               G1356R                        10982
TET3       587338      adenocarcinoma               G1356W                        10982

Now I need to count all positions that match in amino acid site (the number, but not the letters, of the 4th column) but are in different samples. Note: for patient ID 19679 and AA mutation L664T, site 664 only gets a count of 2, because all of those rows are in the same patient except one in patient 19676.
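
The counting rule above could be sketched roughly as follows. This is only a minimal sketch, not the script from this thread: `count_sites` is a hypothetical helper name, and it assumes the columns are whitespace-separated with no spaces inside a field (as in the sample, where diagnoses use underscores). Each (gene, site, patient) combination is counted once, so repeated rows for the same patient do not inflate the count.

```perl
use strict;
use warnings;

# Hypothetical helper: for each gene and amino acid site, count the number
# of distinct patients with a mutation at that site.
# Assumes whitespace-separated columns: gene, patient ID, diagnosis,
# mutation (e.g. L664T or p.E17K), protein length.
sub count_sites {
    my ($fh) = @_;
    my (%seen, %count, %length);
    while (my $line = <$fh>) {
        my ($gene, $patient, $diagnosis, $mutation, $len) = split ' ', $line;
        # Skip the header and any malformed line (length must be numeric).
        next unless defined $len && $len =~ /^\d+$/;
        # Keep only the numeric position, dropping the letters.
        my ($site) = $mutation =~ /(\d+)/ or next;
        $length{$gene} = $len;
        # Count each (gene, site, patient) combination only once.
        $count{$gene}{$site}++ unless $seen{$gene}{$site}{$patient}++;
    }
    return (\%count, \%length);
}

# Small demo on a few of the sample rows, read from an in-memory filehandle.
my $sample = <<'END';
Gene Name Patient ID Patient Diagnosis Amino Acid Mutation and Site Protein Length
AAK1 19679 adenocarcinoma L664T 21265
AAK1 19679 adenocarcinoma L664T 21265
AAK1 19676 adenocarcinoma L664T 21265
AAK1 19677 adenocarcinoma L64F 21265
AAK1 19678 adenocarcinoma L64R 21265
END
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
my ($count, $length) = count_sites($fh);
print "AAK1 site 664: $count->{AAK1}{664}\n";   # prints "AAK1 site 664: 2"
print "AAK1 site 64: $count->{AAK1}{64}\n";     # prints "AAK1 site 64: 2"
```

Only the sites that actually occur are stored, so memory use grows with the number of distinct (gene, site, patient) combinations rather than with gene length.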

The output needs to be in this format, where rows are genes and columns are sites 1 to Length (the fifth column above). Length is different for every gene. I've listed spans as no1..no2 just for the sake of space, but in the real file all the numbers in between have to be filled in with 0's:

AA site (1 to largest gene length):
Gene  1  2  3  4..16  17  18..51  52  64  65..400  401  402..660  661  664  935  1356  1534
AAK1  0  0  0  0      0   0       0   2   0        0    0         1    2    0    0     0
FKT1  0  0  0  0      4   0       1   0   0        0    0         0    0    0    0     0
TET3  0  0  0  0      0   0       0   0   0        1    0         0    0    1    2     1
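
A zero-filled row like the ones above could be produced one gene at a time, without holding the full matrix in memory. This is a rough sketch only: `matrix_row` is a hypothetical helper, the tab-separated output is an assumption, and the `//` operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: build one tab-separated output row for a gene, with
# a column for every site from 1 to $max_len and a 0 wherever no mutation
# was counted. $sites is a hash reference of site => count.
sub matrix_row {
    my ($gene, $sites, $max_len) = @_;
    return join "\t", $gene, map { $sites->{$_} // 0 } 1 .. $max_len;
}

# Demo with the FKT1 counts from the table, truncated to 20 columns
# to keep the printed row short.
my %fkt1 = (17 => 4, 52 => 1);
print matrix_row('FKT1', \%fkt1, 20), "\n";
```

Because only the sparse counts are stored and each row is expanded just before it is printed, the zeros never have to live in a hash.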

I'm simplifying, because I also need to calculate a second table, this time with amino acid position and mutation (thus the numbers and letters of column 4) matching in different patients. That's why my script is so elaborate; the $key3=$key4 is to remove the letters, etc. I know I've done a poor job scripting it. Any advice would be fantastic! Thanks so much for helping.
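
Both tables could share one parsing step that yields the two keys from column 4. The sketch below is only an illustration: `mutation_keys` is a hypothetical helper, and treating "p.E17K" and "E17K" as the same mutation is an assumption that should be checked against the real data.

```perl
use strict;
use warnings;

# Hypothetical helper: split a column-4 value into the two keys needed.
# Returns (site, full mutation), e.g. "p.E17K" -> (17, "E17K").
# Assumption: a leading "p." is only a notation prefix and can be dropped.
sub mutation_keys {
    my ($mutation) = @_;
    (my $full = $mutation) =~ s/^p\.//;   # strip the optional "p." prefix
    my ($site) = $full =~ /(\d+)/;        # numeric position only
    return ($site, $full);
}

my ($site, $full) = mutation_keys('p.E17K');
print "$site $full\n";   # prints "17 E17K"
```

The first key indexes the site-only table and the second indexes the site-plus-mutation table, so a separate pass to remove the letters is not needed.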

In reply to Re^2: Memory issue with cancer data (analogy) by ZWcarp
in thread Memory issue with cancer data (analogy) by ZWcarp
