PerlMonks  

Unable to extract matched string from input

by sudhasane (Initiate)
on Sep 25, 2013 at 09:36 UTC (#1055639)
sudhasane has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am unable to extract the matched string from a line when the line is very long and its fields are separated by ";". Below are two lines from my input file from which I could not extract the desired match, followed by my script. Any help is highly appreciated.

chr4 46221709 46252153 cytoBand=p12;name2=GABRA2;name=NM_000807;name2=GABRA2;name=NM_001114175;HGNC_GeneAnnotation=GABRA2,gamma-aminobutyric acid (GABA) A receptor, alpha 2,4p12;abParts_IG_T_CelReceptors=False;conrad_Cnv=False;mcCarroll_Cnv=False;dgv_Cnv=True,overlap=5.37%;genomicSuperDups=False;gadAll=IMMUNE;tfbsRegion=V$OCT_C.chr4.46233051.46233064;tfbsRegion=V$ER_Q6.chr4.46236213.46236232;tfbsRegion=V$GR_Q6.chr4.46236213.46236232;tfbsRegion=V$GATA1_02.chr4.46236259.46236273;tfbsRegion=V$EVI1_03.chr4.46236263.46236274;tfbsRegion=V$E2F_01.chr4.46236275.46236290;tfbsRegion=V$SRY_01.chr4.46236290.46236297;tfbsRegion=V$CDP_02.chr4.46236484.46236499;tfbsRegion=V$FOXJ2_02.chr4.46236490.46236504;tfbsRegion=V$IRF1_01.chr4.46237473.46237486;tfbsRegion=V$POU3F2_01.chr4.46240251.46240265;tfbsRegion=V$CART1_01.chr4.46240277.46240295;tfbsRegion=V$HFH3_01.chr4.46240446.46240459;tfbsRegion=V$HNF3B_01.chr4.46240446.46240461;tfbsRegion=V$NKX22_01.chr4.46242442.46242452;tfbsRegion=V$IRF7_01.chr4.46242532.46242550;tfbsRegion=V$ISRE_01.chr4.46242534.46242549;tfbsRegion=V$IRF1_01.chr4.46242535.46242548;tfbsRegion=V$GATA1_04.chr4.46242699.46242712;tfbsRegion=V$RORA1_01.chr4.46243515.46243528;tfbsRegion=V$RSRFC4_01.chr4.46243802.46243818;tfbsRegion=V$RSRFC4_01.chr4.46244279.46244295;tfbsRegion=V$CDP_02.chr4.46244448.46244463;tfbsRegion=V$FOXD3_01.chr4.46244533.46244545;tfbsRegion=V$CDC5_01.chr4.46250444.46250456;tfbsRegion=V$RORA1_01.chr4.46250574.46250587;tfbsRegion=V$TATA_C.chr4.46250704.46250714;tfbsRegion=V$PBX1_01.chr4.46250706.46250715;tfbsRegion=V$HOX13_01.chr4.46250735.46250765;tfbsRegion=V$MSX1_01.chr4.46250746.46250755;tfbsRegion=V$HNF1_C.chr4.46250836.46250853;tfbsRegion=V$POU6F1_01.chr4.46250948.46250959;tfbsRegion=V$MEF2_03.chr4.46250962.46250984;tfbsRegion=V$NKX61_01.chr4.46250989.46251002;tfbsRegion=V$CDC5_01.chr4.46251232.46251244;tfbsRegion=V$CMYB_01.chr4.46251282.46251300;tfbsRegion=V$OCT1_04.chr4.46251340.46251363;tfbsRegion=V$FOXO1_02.chr4.46251351.46251365;tfbsRegion=V$NKX3A_01.chr4.46251376.46251388;tfbsRegion=V$NKX22_01.chr4.46251378.46251388;tfbsRegion=V$NFAT_Q6.chr4.46251591.46251603;tfbsRegion=V$CDP_01.chr4.46251597.46251609;tfbsRegion=V$MEF2_01.chr4.46251634.46251650;tfbsRegion=V$HFH1_01.chr4.46251639.46251651;tfbsRegion=V$MEF2_03.chr4.46251647.46251669;tfbsRegion=V$RSRFC4_01.chr4.46251651.46251667;tfbsRegion=V$FOXJ2_02.chr4.46251656.46251670;tfbsRegion=V$S8_01.chr4.46251697.46251713;tfbsRegion=V$OCT1_07.chr4.46251704.46251716;tfbsRegion=V$TCF11MAFG_01.chr4.46251714.46251736;tfbsRegion=V$BACH1_01.chr4.46251719.46251734;tfbsRegion=V$NFE2_01.chr4.46251720.46251731;tfbsRegion=V$AP1_01.chr4.46251720.46251733;tfbsRegion=V$BACH2_01.chr4.46251721.46251732;tfbsRegion=V$EVI1_01.chr4.46251733.46251749;tfbsRegion=V$HNF1_01.chr4.46252094.46252109;tfbsRegion=V$TCF11MAFG_01.chr4.46252150.46252172
chr1 25583341 25646986 cytoBand=p36.11;name2=RHD;name=NM_001127691;name2=RHD;name=NM_016124;name=NM_001127691;exon=ex1/7,ex2/7,ex3/7,ex4/7,ex5/7,ex6/7;name=NM_016124;exon=ex1/10,ex2/10,ex3/10,ex4/10,ex5/10,ex6/10,ex7/10,ex8/10;HGNC_GeneAnnotation=RHD,Rh blood group, D antigen,RH,Rh30a, Rh4, RhPI, RhII, DIIIc, CD240D,1p36.11,AB012623,NM_016124;abParts_IG_T_CelReceptors=False;conrad_Cnv=True,overlap=97.04%;mcCarroll_Cnv=True,overlap=84.62%;dgv_Cnv=True,overlap=14.43%;genomicSuperDups=True,overlap=14.37%,otherChrom=chr1,otherStart=25655516,otherEnd=25664845;gadAll=OTHER;gadAll=NORMALVARIATION;gadAll=INFECTION;putativePromoterRegion=RHD,NM_001127691,+,;putativePromoterRegion=RHD,NM_016124,+,;tfbsRegion=V$E2F_02.chr1.25598933.25598941;tfbsRegion=V$HMX1_01.chr1.25598970.25598980;tfbsRegion=V$AP4_01.chr1.25598995.25599013;tfbsRegion=V$CHX10_01.chr1.25599041.25599055;tfbsRegion=V$LUN1_01.chr1.25599071.25599088;tfbsRegion=V$ARP1_01.chr1.25599075.25599091;tfbsRegion=V$SRF_Q6.chr1.25605129.25605143;tfbsRegion=V$MZF1_02.chr1.25605519.25605532;tfbsRegion=V$NGFIC_01.chr1.25605530.25605542;tfbsRegion=V$MEF2_03.chr1.25606025.25606047;tfbsRegion=V$GR_Q6.chr1.25606037.25606056;tfbsRegion=V$HSF1_01.chr1.25607490.25607500;tfbsRegion=V$HSF2_01.chr1.25607490.25607500;tfbsRegion=V$ISRE_01.chr1.25607548.25607563;tfbsRegion=V$IRF2_01.chr1.25610593.25610606;tfbsRegion=V$NKX22_01.chr1.25610824.25610834;tfbsRegion=V$AP4_01.chr1.25611125.25611143;tfbsRegion=V$PPARG_02.chr1.25611131.25611154;tfbsRegion=V$MYOGNF1_01.chr1.25611152.25611181;tfbsRegion=V$EVI1_01.chr1.25615448.25615464;tfbsRegion=V$GATA6_01.chr1.25616197.25616207;tfbsRegion=V$GATA_C.chr1.25616199.25616210;tfbsRegion=V$GATA1_01.chr1.25617215.25617225;tfbsRegion=V$MYCMAX_01.chr1.25627462.25627476;tfbsRegion=V$MEF2_02.chr1.25627475.25627497;tfbsRegion=V$SRF_Q6.chr1.25627477.25627491;tfbsRegion=V$MRF2_01.chr1.25628024.25628038;tfbsRegion=V$TGIF_01.chr1.25628121.25628132;tfbsRegion=V$MEIS1_01.chr1.25628121.25628133;tfbsRegion=V$HTF_01.chr1.25628128.25628152;tfbsRegion=V$BACH1_01.chr1.25628144.25628159;tfbsRegion=V$AP1_01.chr1.25628145.25628158;tfbsRegion=V$TAXCREB_01.chr1.25628146.25628161;tfbsRegion=V$CDPCR3_01.chr1.25628225.25628240;tfbsRegion=V$NF1_Q6.chr1.25629831.25629849;tfbsRegion=V$SREBP1_02.chr1.25629912.25629923;tfbsRegion=V$EVI1_01.chr1.25631284.25631300;tfbsRegion=V$GATA1_02.chr1.25631285.25631299;tfbsRegion=V$GATA1_01.chr1.25643538.25643548;tfbsRegion=V$TGIF_01.chr1.25643561.25643572;tfbsRegion=V$AREB6_01.chr1.25643564.25643577

#!/usr/bin/perl -w
open (AN_ANNOT, "<$ARGV[0]")or die "can not open vcf file\n";
open (OUTDGV,">$ARGV[0].mcCorrolcnv.txt") || die "can not open $ARGV[0].dgvcnv.txt ($!)\n";
@Annotation_file = <AN_ANNOT>;
close AN_ANNOT;
foreach $annotation_line(@Annotation_file) {
    if ( $annotation_line =~ /^chr/ ) {
        chomp $annotation_line;
        my( $chr, $start, $end, $description ) = split( /\s+/ , $annotation_line );
        if ( $description =~ /dgv_Cnv=True,overlap=\d+\.\d+\%\;/ ) {
            print OUTDGV "$chr\t$start\t$end\t$&\n";
        }
    }
}

Re: Unable to extract matched string from input
by BrowserUk (Pope) on Sep 25, 2013 at 09:43 UTC

    Why are you splitting on whitespace (split( /\s+/ )) when some of your fields contain embedded spaces?

    Also, please put your data lines in code tags.
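To see why an unlimited whitespace split loses the match, here is a minimal sketch. The record is a hypothetical, shortened stand-in for the real data lines; only the field layout is taken from the question.

```perl
use strict;
use warnings;

# Hypothetical shortened record; the real lines are much longer.
my $line = 'chr4 46221709 46252153 name2=GABRA2;HGNC_GeneAnnotation=GABRA2,gamma-aminobutyric acid (GABA) A receptor;dgv_Cnv=True,overlap=5.37%;gadAll=IMMUNE';

# Unlimited split: every run of whitespace starts a new field,
# including the spaces inside the gene annotation text.
my @fields = split /\s+/, $line;

# The fourth field therefore stops at the first embedded space.
my $description = $fields[3];

print scalar(@fields), " fields\n";
print "description: $description\n";
```

With this sample the description ends at "gamma-aminobutyric", so the dgv_Cnv entry further along the line is never seen by the second regex.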


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Unable to extract matched string from input
by drmrgd (Beadle) on Sep 25, 2013 at 10:19 UTC
    Since the descriptions contain whitespace too, splitting on whitespace will not work. Why not pull the fields out of the whole line with a regex from the start? Assuming your entries are always formatted as chr, start, end, and description, you could do something like:
    my ( $chr, $start, $end, $description ) = $annotation_line =~ /(chr\d+) (\d+) (\d+) (.*)/;
    if ( $description =~ /dgv_Cnv=True,overlap=\d+\.\d+%;/ ) {
        print "$chr\t$start\t$end\t$&\n";
    }
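A possible variation on the regex approach, offered as a sketch and not as the poster's code: it assumes the first three columns may be separated by tabs or several spaces, accepts any non-space chromosome label, and uses a capture group instead of $& (so the trailing ";" is left out of the result). The helper name parse_record is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (chr, start, end, dgv entry),
# or an empty list when the line does not qualify.
sub parse_record {
    my ($line) = @_;

    # \s+ separates the first three columns; (.*) keeps the rest of the
    # line, embedded spaces included.
    my ( $chr, $start, $end, $description ) =
        $line =~ /^(chr\S+)\s+(\d+)\s+(\d+)\s+(.*)/
        or return;

    # Capture the entry itself; the terminating ';' stays outside the group.
    my ($dgv) = $description =~ /(dgv_Cnv=True,overlap=\d+\.\d+%);/
        or return;

    return ( $chr, $start, $end, $dgv );
}

# Hypothetical shortened, tab-separated sample record.
my $sample = "chr1\t25583341\t25646986\tcytoBand=p36.11;HGNC_GeneAnnotation=RHD,Rh blood group, D antigen;dgv_Cnv=True,overlap=14.43%;genomicSuperDups=True";
if ( my @fields = parse_record($sample) ) {
    print join( "\t", @fields ), "\n";
}
```

Using a capture group avoids $&, which on older perls slows down every regex in the program once it appears anywhere.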
Re: Unable to extract matched string from input
by hdb (Parson) on Sep 25, 2013 at 10:27 UTC

    Try using the third argument to split, like this:

    my( $chr, $start, $end, $description ) = split( /\s+/ , $annotation_line, 4 );

    This way, $description will contain all of the remaining line and not only up to the next whitespace.
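Put together, the limited split could be used roughly as follows. This is a sketch under a few assumptions: the helper name dgv_row is hypothetical, output goes to STDOUT instead of the original output file, and the file is read line by line instead of being slurped into an array.

```perl
use strict;
use warnings;

# Hypothetical helper: formats one output row, or returns undef when the
# line is not a chr record or has no dgv_Cnv=True entry.
sub dgv_row {
    my ($line) = @_;
    chomp $line;
    return unless $line =~ /^chr/;

    # Limit of 4: the fourth field keeps everything after the third
    # column, embedded spaces included.
    my ( $chr, $start, $end, $description ) = split /\s+/, $line, 4;
    return unless defined $description;

    return unless $description =~ /(dgv_Cnv=True,overlap=\d+\.\d+%;)/;
    return "$chr\t$start\t$end\t$1";
}

# Process the file named on the command line, if one was given.
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "cannot open $ARGV[0]: $!\n";
    while ( my $line = <$in> ) {
        my $row = dgv_row($line);
        print "$row\n" if defined $row;
    }
    close $in;
}
```

The three-argument open with a lexical filehandle is a stylistic choice here; the limit argument to split is the part that addresses the question.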
