Unable to extract matched string from input

by sudhasane (Initiate)
on Sep 25, 2013 at 09:36 UTC (#1055639, perlquestion)
sudhasane has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am unable to extract the matched string from a line when the line is very long and its annotations are separated by ";". Below are two lines from the input file from which I could not extract the desired match, followed by my script. Any help is highly appreciated.

chr4 46221709 46252153 cytoBand=p12;name2=GABRA2;name=NM_000807;name2=GABRA2;name=NM_001114175;HGNC_GeneAnnotation=GABRA2,gamma-aminobutyric acid (GABA) A receptor, alpha 2,4p12;abParts_IG_T_CelReceptors=False;conrad_Cnv=False;mcCarroll_Cnv=False;dgv_Cnv=True,overlap=5.37%;genomicSuperDups=False;gadAll=IMMUNE;tfbsRegion=V$OCT_C.chr4.46233051.46233064;tfbsRegion=V$ER_Q6.chr4.46236213.46236232;tfbsRegion=V$GR_Q6.chr4.46236213.46236232;tfbsRegion=V$GATA1_02.chr4.46236259.46236273;tfbsRegion=V$EVI1_03.chr4.46236263.46236274;tfbsRegion=V$E2F_01.chr4.46236275.46236290;tfbsRegion=V$SRY_01.chr4.46236290.46236297;tfbsRegion=V$CDP_02.chr4.46236484.46236499;tfbsRegion=V$FOXJ2_02.chr4.46236490.46236504;tfbsRegion=V$IRF1_01.chr4.46237473.46237486;tfbsRegion=V$POU3F2_01.chr4.46240251.46240265;tfbsRegion=V$CART1_01.chr4.46240277.46240295;tfbsRegion=V$HFH3_01.chr4.46240446.46240459;tfbsRegion=V$HNF3B_01.chr4.46240446.46240461;tfbsRegion=V$NKX22_01.chr4.46242442.46242452;tfbsRegion=V$IRF7_01.chr4.46242532.46242550;tfbsRegion=V$ISRE_01.chr4.46242534.46242549;tfbsRegion=V$IRF1_01.chr4.46242535.46242548;tfbsRegion=V$GATA1_04.chr4.46242699.46242712;tfbsRegion=V$RORA1_01.chr4.46243515.46243528;tfbsRegion=V$RSRFC4_01.chr4.46243802.46243818;tfbsRegion=V$RSRFC4_01.chr4.46244279.46244295;tfbsRegion=V$CDP_02.chr4.46244448.46244463;tfbsRegion=V$FOXD3_01.chr4.46244533.46244545;tfbsRegion=V$CDC5_01.chr4.46250444.46250456;tfbsRegion=V$RORA1_01.chr4.46250574.46250587;tfbsRegion=V$TATA_C.chr4.46250704.46250714;tfbsRegion=V$PBX1_01.chr4.46250706.46250715;tfbsRegion=V$HOX13_01.chr4.46250735.46250765;tfbsRegion=V$MSX1_01.chr4.46250746.46250755;tfbsRegion=V$HNF1_C.chr4.46250836.46250853;tfbsRegion=V$POU6F1_01.chr4.46250948.46250959;tfbsRegion=V$MEF2_03.chr4.46250962.46250984;tfbsRegion=V$NKX61_01.chr4.46250989.46251002;tfbsRegion=V$CDC5_01.chr4.46251232.46251244;tfbsRegion=V$CMYB_01.chr4.46251282.46251300;tfbsRegion=V$OCT1_04.chr4.46251340.46251363;tfbsRegion=V$FOXO1_02.chr4.46251351.46251365;tfbsRegion=V$NKX3A_01.chr4.46251376.46251388;tfbsRegion=V$NKX22_01.chr4.46251378.46251388;tfbsRegion=V$NFAT_Q6.chr4.46251591.46251603;tfbsRegion=V$CDP_01.chr4.46251597.46251609;tfbsRegion=V$MEF2_01.chr4.46251634.46251650;tfbsRegion=V$HFH1_01.chr4.46251639.46251651;tfbsRegion=V$MEF2_03.chr4.46251647.46251669;tfbsRegion=V$RSRFC4_01.chr4.46251651.46251667;tfbsRegion=V$FOXJ2_02.chr4.46251656.46251670;tfbsRegion=V$S8_01.chr4.46251697.46251713;tfbsRegion=V$OCT1_07.chr4.46251704.46251716;tfbsRegion=V$TCF11MAFG_01.chr4.46251714.46251736;tfbsRegion=V$BACH1_01.chr4.46251719.46251734;tfbsRegion=V$NFE2_01.chr4.46251720.46251731;tfbsRegion=V$AP1_01.chr4.46251720.46251733;tfbsRegion=V$BACH2_01.chr4.46251721.46251732;tfbsRegion=V$EVI1_01.chr4.46251733.46251749;tfbsRegion=V$HNF1_01.chr4.46252094.46252109;tfbsRegion=V$TCF11MAFG_01.chr4.46252150.46252172
chr1 25583341 25646986 cytoBand=p36.11;name2=RHD;name=NM_001127691;name2=RHD;name=NM_016124;name=NM_001127691;exon=ex1/7,ex2/7,ex3/7,ex4/7,ex5/7,ex6/7;name=NM_016124;exon=ex1/10,ex2/10,ex3/10,ex4/10,ex5/10,ex6/10,ex7/10,ex8/10;HGNC_GeneAnnotation=RHD,Rh blood group, D antigen,RH,Rh30a, Rh4, RhPI, RhII, DIIIc, CD240D,1p36.11,AB012623,NM_016124;abParts_IG_T_CelReceptors=False;conrad_Cnv=True,overlap=97.04%;mcCarroll_Cnv=True,overlap=84.62%;dgv_Cnv=True,overlap=14.43%;genomicSuperDups=True,overlap=14.37%,otherChrom=chr1,otherStart=25655516,otherEnd=25664845;gadAll=OTHER;gadAll=NORMALVARIATION;gadAll=INFECTION;putativePromoterRegion=RHD,NM_001127691,+,;putativePromoterRegion=RHD,NM_016124,+,;tfbsRegion=V$E2F_02.chr1.25598933.25598941;tfbsRegion=V$HMX1_01.chr1.25598970.25598980;tfbsRegion=V$AP4_01.chr1.25598995.25599013;tfbsRegion=V$CHX10_01.chr1.25599041.25599055;tfbsRegion=V$LUN1_01.chr1.25599071.25599088;tfbsRegion=V$ARP1_01.chr1.25599075.25599091;tfbsRegion=V$SRF_Q6.chr1.25605129.25605143;tfbsRegion=V$MZF1_02.chr1.25605519.25605532;tfbsRegion=V$NGFIC_01.chr1.25605530.25605542;tfbsRegion=V$MEF2_03.chr1.25606025.25606047;tfbsRegion=V$GR_Q6.chr1.25606037.25606056;tfbsRegion=V$HSF1_01.chr1.25607490.25607500;tfbsRegion=V$HSF2_01.chr1.25607490.25607500;tfbsRegion=V$ISRE_01.chr1.25607548.25607563;tfbsRegion=V$IRF2_01.chr1.25610593.25610606;tfbsRegion=V$NKX22_01.chr1.25610824.25610834;tfbsRegion=V$AP4_01.chr1.25611125.25611143;tfbsRegion=V$PPARG_02.chr1.25611131.25611154;tfbsRegion=V$MYOGNF1_01.chr1.25611152.25611181;tfbsRegion=V$EVI1_01.chr1.25615448.25615464;tfbsRegion=V$GATA6_01.chr1.25616197.25616207;tfbsRegion=V$GATA_C.chr1.25616199.25616210;tfbsRegion=V$GATA1_01.chr1.25617215.25617225;tfbsRegion=V$MYCMAX_01.chr1.25627462.25627476;tfbsRegion=V$MEF2_02.chr1.25627475.25627497;tfbsRegion=V$SRF_Q6.chr1.25627477.25627491;tfbsRegion=V$MRF2_01.chr1.25628024.25628038;tfbsRegion=V$TGIF_01.chr1.25628121.25628132;tfbsRegion=V$MEIS1_01.chr1.25628121.25628133;tfbsRegion=V$HTF_01.chr1.25628128.25628152;tfbsRegion=V$BACH1_01.chr1.25628144.25628159;tfbsRegion=V$AP1_01.chr1.25628145.25628158;tfbsRegion=V$TAXCREB_01.chr1.25628146.25628161;tfbsRegion=V$CDPCR3_01.chr1.25628225.25628240;tfbsRegion=V$NF1_Q6.chr1.25629831.25629849;tfbsRegion=V$SREBP1_02.chr1.25629912.25629923;tfbsRegion=V$EVI1_01.chr1.25631284.25631300;tfbsRegion=V$GATA1_02.chr1.25631285.25631299;tfbsRegion=V$GATA1_01.chr1.25643538.25643548;tfbsRegion=V$TGIF_01.chr1.25643561.25643572;tfbsRegion=V$AREB6_01.chr1.25643564.25643577

#!/usr/bin/perl -w
open (AN_ANNOT, "<$ARGV[0]")or die "can not open vcf file\n";
open (OUTDGV,">$ARGV[0].mcCorrolcnv.txt") || die "can not open $ARGV[0].dgvcnv.txt ($!)\n";
@Annotation_file = <AN_ANNOT>;
close AN_ANNOT;
foreach $annotation_line(@Annotation_file)
{
    if ( $annotation_line =~ /^chr/ )
    {
        chomp $annotation_line;
        my( $chr, $start, $end, $description ) = split( /\s+/ , $annotation_line );
        if ( $description =~ /dgv_Cnv=True,overlap=\d+\.\d+\%\;/ )
        {
            print OUTDGV "$chr\t$start\t$end\t$&\n";
        }
    }
}

Re: Unable to extract matched string from input
by BrowserUk (Pope) on Sep 25, 2013 at 09:43 UTC

    Why are you splitting on whitespace (split( /\s+/ )) when some of your fields contain embedded spaces?
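
A minimal sketch of what that split does, using a shortened, made-up record in the same layout as the posted data (the sample line and variable names here are illustrative, not taken from the real file):

```perl
use strict;
use warnings;

# Shortened, made-up record: three coordinate fields, then a description
# that contains spaces inside one of its annotations.
my $line = 'chr4 46221709 46252153 name2=GABRA2;HGNC_GeneAnnotation=GABRA2,gamma-aminobutyric acid (GABA) A receptor;dgv_Cnv=True,overlap=5.37%;gadAll=IMMUNE';

# Without a limit, split cuts the line at every run of whitespace,
# so the fourth element ends at the first space inside the description.
my @fields      = split /\s+/, $line;
my $description = $fields[3];

print scalar(@fields), " fields\n";
print "description: $description\n";
print "dgv_Cnv present: ", ( $description =~ /dgv_Cnv=/ ? "yes" : "no" ), "\n";
```

The description is cut off before the dgv_Cnv annotation is reached, which is why the match in the posted script never succeeds on these lines.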

    Also, please put your data lines in code tags.


Re: Unable to extract matched string from input
by drmrgd (Beadle) on Sep 25, 2013 at 10:19 UTC
    Since the descriptions also contain whitespace, splitting on whitespace will not work. Why not capture the fields with a single regex from the start? Assuming that your entries are always formatted as chr, start, end, and description, you could do something like:

    my ( $chr, $start, $end, $description ) = $annotation_line =~ /(chr\d+) (\d+) (\d+) (.*)/;
    if ( $description =~ /dgv_Cnv=True,overlap=\d+\.\d+%;/ ) {
        print "$chr\t$start\t$end\t$&\n";
    }

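
A slightly more tolerant variant of the same idea, as a sketch only. It assumes the fields may be separated by tabs or several spaces and that names such as chrX can occur; neither is confirmed by the posted data. `parse_record` is a hypothetical helper name, and the sample line is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (chr, start, end, description),
# or an empty list when the line does not look like a record.
sub parse_record {
    my ($line) = @_;
    return $line =~ /^(chr\S+)\s+(\d+)\s+(\d+)\s+(.*)$/;
}

# Made-up, tab-separated sample line.
my $line = "chrX\t100\t200\tname2=FOO,some gene name;dgv_Cnv=True,overlap=14.43%;gadAll=OTHER";

if ( my ( $chr, $start, $end, $description ) = parse_record($line) ) {
    # Use a capture group ($1) instead of $& for the matched text.
    if ( $description =~ /(dgv_Cnv=True,overlap=\d+\.\d+%)/ ) {
        print "$chr\t$start\t$end\t$1\n";
    }
}
```

Capturing into `$1` sidesteps `$&`, which on older Perls can slow down every regex match in the program.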
Re: Unable to extract matched string from input
by hdb (Parson) on Sep 25, 2013 at 10:27 UTC

    Try using the third argument to split (the LIMIT), like this:

    my( $chr, $start, $end, $description ) = split( /\s+/ , $annotation_line, 4 );

    This way, $description will contain all of the remaining line and not only up to the next whitespace.
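
Once the whole description is available, one option is to split it again on ";" and pick out the annotation by its key, rather than matching a pattern against the entire string. The sketch below assumes the description is always a ";"-separated list of key=value annotations, as in the two posted lines; `dgv_hits` is a hypothetical helper name and the sample record is shortened.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the dgv_Cnv=True annotations of one record.
sub dgv_hits {
    my ($line) = @_;
    chomp $line;

    # LIMIT of 4: the fourth element keeps everything after the third
    # field, embedded spaces included.
    my ( $chr, $start, $end, $description ) = split /\s+/, $line, 4;
    return unless defined $description;

    # Split the description into its annotations and keep the matching ones.
    return grep { /^dgv_Cnv=True\b/ } split /;/, $description;
}

# Shortened version of the second posted record.
my $line = 'chr1 25583341 25646986 name2=RHD;HGNC_GeneAnnotation=RHD,Rh blood group, D antigen;mcCarroll_Cnv=True,overlap=84.62%;dgv_Cnv=True,overlap=14.43%;gadAll=OTHER';

print "$_\n" for dgv_hits($line);
```

For the sample record this prints `dgv_Cnv=True,overlap=14.43%`. Matching on the key alone also means an overlap written without a decimal point would still be picked up, which the original `\d+\.\d+` pattern would miss.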
