
Re: help with regex

by JavaFan (Canon)
on May 07, 2012 at 12:20 UTC ( #969250=note )

in reply to help with regex

While you're including sample input, you're not showing what the intended output is. So, I had to guess. Perhaps the following will work for you:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

while (<DATA>) {
    my %chunks = /(\S+)="([^"]+)"/g;
    my $header = delete $chunks{GI} || delete $chunks{protein_id} or next;
    print ">$header";
    print ' ', $_, '="', $chunks{$_}, '"' for keys %chunks;
    print "\n";
}

__DATA__
>1001585.MDS_0001 protein_id="YP_004377784.1" product="chromosomal replication initiation protein" GI="330500915" GeneID="10459818"
>1001585.MDS_0002 protein_id="YP_004377785.1" product="DNA polymerase III subunit beta" GI="330500916" GeneID="10454784"
>1001585.MDS_0003 protein_id="YP_004377786.1" product="recombination protein F" GI="330500917" GeneID="10454785"
>1001585.MDS_0004 protein_id="YP_004377787.1" product="DNA gyrase subunit B" GI="330500918" GeneID="10454786"
>1001585.MDS_0005 protein_id="YP_004377788.1" GI="330500919" GeneID="10454787"
>1001585.MDS_0006 protein_id="YP_004377789.1" GI="330500920" GeneID="10454788"
>1001585.MDS_0007 protein_id="YP_004377790.1" GI="330500921" GeneID="10454789"
>1001585.MDS_0008 protein_id="YP_004377791.1" GI="330500922" GeneID="10454790"
>1001585.MDS_0009 protein_id="YP_004377792.1" product="ABC transporter permease" GI="330500923" GeneID="10454791"
>1001585.MDS_0010 protein_id="YP_004377793.1" product="ABC transporter ATP-binding protein" GI="330500924" GeneID="10454792"
>245014.CK3_35030 protein_id="CBL42879.1" product="Predicted transcription factor, homolog of eukaryotic MBF1"
>245014.CK3_35040 protein_id="CBL42880.1" product="Bacterial protein of unknown function (DUF961)."
Which gives as output:
>330500915 protein_id="YP_004377784.1" product="chromosomal replication initiation protein" GeneID="10459818"
>330500916 protein_id="YP_004377785.1" product="DNA polymerase III subunit beta" GeneID="10454784"
>330500917 protein_id="YP_004377786.1" product="recombination protein F" GeneID="10454785"
>330500918 protein_id="YP_004377787.1" product="DNA gyrase subunit B" GeneID="10454786"
>330500919 protein_id="YP_004377788.1" GeneID="10454787"
>330500920 protein_id="YP_004377789.1" GeneID="10454788"
>330500921 protein_id="YP_004377790.1" GeneID="10454789"
>330500922 protein_id="YP_004377791.1" GeneID="10454790"
>330500923 protein_id="YP_004377792.1" product="ABC transporter permease" GeneID="10454791"
>330500924 protein_id="YP_004377793.1" product="ABC transporter ATP-binding protein" GeneID="10454792"
>CBL42879.1 product="Predicted transcription factor, homolog of eukaryotic MBF1"
>CBL42880.1 product="Bacterial protein of unknown function (DUF961)."
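It uses the GI (or, if there is no GI, the protein_id) as the new header and removes that field from the line. It doesn't keep the order of the remaining fields, because they are printed from a hash. It also assumes all values are enclosed in double quotes (they all are in the input), and that no GI or protein_id is 0, since a false value would fail the || test.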
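If the field order or a GI of 0 matters to you, here is a rough sketch of a variant that addresses both caveats. The sub name rewrite_header and the @samples list are made up for illustration and are not part of the script above. It keeps the matched pairs in a list so the input order survives, and it tests with defined instead of truth so a value of "0" is accepted:

```perl
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper, not part of the original script: rewrites one
# header line and keeps the remaining fields in their input order.
sub rewrite_header {
    my ($line) = @_;

    # In list context the match returns a flat key, value, key, value list.
    my @pairs  = $line =~ /(\S+)="([^"]+)"/g;
    my %chunks = @pairs;

    # Prefer GI, fall back to protein_id. Testing definedness rather
    # than truth means a value of "0" is still usable as a header.
    my ($key) = grep { defined $chunks{$_} } qw(GI protein_id);
    return unless defined $key;

    my $out = ">$chunks{$key}";
    while (my ($name, $value) = splice @pairs, 0, 2) {
        next if $name eq $key;          # this field became the header
        $out .= qq{ $name="$value"};
    }
    return $out;
}

# Two lines taken from the sample input above.
my @samples = (
    '>1001585.MDS_0004 protein_id="YP_004377787.1" product="DNA gyrase subunit B" GI="330500918" GeneID="10454786"',
    '>245014.CK3_35030 protein_id="CBL42879.1" product="Predicted transcription factor, homolog of eukaryotic MBF1"',
);

for my $line (@samples) {
    my $out = rewrite_header($line);
    next unless defined $out;           # skip lines with neither field
    say $out;
}
```

It still assumes every value is enclosed in double quotes and is non-empty, same as the original.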
