http://www.perlmonks.org?node_id=969250


in reply to help with regex

While you're including sample input, you're not showing what the intended output is. So, I had to guess. Perhaps the following will work for you:

#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

while (<DATA>) {
    my %chunks = /(\S+)="([^"]+)"/g;
    my $header = delete $chunks{GI} || delete $chunks{protein_id} or next;
    print ">$header";
    print ' ', $_, '="', $chunks{$_}, '"' for keys %chunks;
    print "\n";
}

__DATA__
>1001585.MDS_0001 protein_id="YP_004377784.1" product="chromosomal replication initiation protein" GI="330500915" GeneID="10459818"
>1001585.MDS_0002 protein_id="YP_004377785.1" product="DNA polymerase III subunit beta" GI="330500916" GeneID="10454784"
>1001585.MDS_0003 protein_id="YP_004377786.1" product="recombination protein F" GI="330500917" GeneID="10454785"
>1001585.MDS_0004 protein_id="YP_004377787.1" product="DNA gyrase subunit B" GI="330500918" GeneID="10454786"
>1001585.MDS_0005 protein_id="YP_004377788.1" GI="330500919" GeneID="10454787"
>1001585.MDS_0006 protein_id="YP_004377789.1" GI="330500920" GeneID="10454788"
>1001585.MDS_0007 protein_id="YP_004377790.1" GI="330500921" GeneID="10454789"
>1001585.MDS_0008 protein_id="YP_004377791.1" GI="330500922" GeneID="10454790"
>1001585.MDS_0009 protein_id="YP_004377792.1" product="ABC transporter permease" GI="330500923" GeneID="10454791"
>1001585.MDS_0010 protein_id="YP_004377793.1" product="ABC transporter ATP-binding protein" GI="330500924" GeneID="10454792"
>245014.CK3_35030 protein_id="CBL42879.1" product="Predicted transcription factor, homolog of eukaryotic MBF1"
>245014.CK3_35040 protein_id="CBL42880.1" product="Bacterial protein of unknown function (DUF961)."

Which gives as output:

>330500915 protein_id="YP_004377784.1" product="chromosomal replication initiation protein" GeneID="10459818"
>330500916 protein_id="YP_004377785.1" product="DNA polymerase III subunit beta" GeneID="10454784"
>330500917 protein_id="YP_004377786.1" product="recombination protein F" GeneID="10454785"
>330500918 protein_id="YP_004377787.1" product="DNA gyrase subunit B" GeneID="10454786"
>330500919 protein_id="YP_004377788.1" GeneID="10454787"
>330500920 protein_id="YP_004377789.1" GeneID="10454788"
>330500921 protein_id="YP_004377790.1" GeneID="10454789"
>330500922 protein_id="YP_004377791.1" GeneID="10454790"
>330500923 protein_id="YP_004377792.1" product="ABC transporter permease" GeneID="10454791"
>330500924 protein_id="YP_004377793.1" product="ABC transporter ATP-binding protein" GeneID="10454792"
>CBL42879.1 product="Predicted transcription factor, homolog of eukaryotic MBF1"
>CBL42880.1 product="Bacterial protein of unknown function (DUF961)."

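It moves the GI (or, if there is none, the protein_id) out of the field list and into the header, and it doesn't keep the order of the fields, since they pass through a hash. It also assumes that all values are enclosed in double quotes (they all are in the input), and that no GI or protein_id is 0, because the || test treats 0 as false.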
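
If the field order or a possible identifier of 0 matters, something along these lines might do. This is only a sketch under my own assumptions: the sub name reheader and the @sample array are made up for illustration, and it still expects every value to be double-quoted and non-empty. It keeps the captured pairs in a list so the original order survives, and it tests for the key with exists instead of relying on truthiness.

```perl
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper (not part of the script above): rewrite one record so
# that GI, or failing that protein_id, becomes the header.
sub reheader {
    my ($line) = @_;

    # In list context, /g returns the captures as a flat list
    # (key1, value1, key2, value2, ...) in input order.
    my @pairs  = $line =~ /(\S+)="([^"]+)"/g;
    my %chunks = @pairs;

    # Pick the first identifier key that is present; exists() does not
    # care whether the value is "0".
    my ($id_key) = grep { exists $chunks{$_} } qw(GI protein_id);
    return unless defined $id_key;

    my $out = ">$chunks{$id_key}";

    # Walk the pairs in their original order, skipping the one
    # that was promoted to the header.
    while ( my ( $key, $value ) = splice @pairs, 0, 2 ) {
        next if $key eq $id_key;
        $out .= qq{ $key="$value"};
    }
    return $out;
}

# Two records taken from the sample input above.
my @sample = (
    '>1001585.MDS_0001 protein_id="YP_004377784.1" product="chromosomal replication initiation protein" GI="330500915" GeneID="10459818"',
    '>245014.CK3_35030 protein_id="CBL42879.1" product="Predicted transcription factor, homolog of eukaryotic MBF1"',
);

for my $line (@sample) {
    my $out = reheader($line) // next;    # skip records without an identifier
    say $out;
}
```

The remaining fields come out in the order they appeared in the input, and a record with neither GI nor protein_id is skipped, as before.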