merge lines removing duplicates in a file

by Anonymous Monk
on Oct 01, 2016 at 09:19 UTC ( [id://1173064]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have a file in the following format, in which each column has a 0 or 1 attached with ':'. I want to combine all the lines into one. For example, if the file contains NC_009565:0 twice, it should be printed once, and if it contains both NC_009565:0 and NC_009565:1, it should print NC_009565:1.
1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:1 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:1 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:1 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 
NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 
NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 
NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_ +017524:0 NC_017522:0 NC_018143:0 NC_017026:0 NC_017523:0 + NC_016934:0 NC_018078:0 NC_021193:0 NC_016768:0 NC_021 +251:0 NC_021192:0 NC_012943:0 NC_002755:0 NC_020559:0 +NC_020089:0 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054 +:0 NC_009525:0
The expected output:
1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:1 NC_017524:1 NC_017522:1 NC_018143:0 NC_017026:0 NC_017523:1 NC_016934:1 NC_018078:1 NC_021193:0 NC_016768:0 NC_021251:0 NC_021192:1 NC_012943:1 NC_002755:1 NC_020559:0 NC_020089:1 NC_022350:0 NC_021194:0 NC_017528:0 NC_021054:0 NC_009525:0

Replies are listed 'Best First'.
Re: merge lines removing duplicates in a file
by choroba (Cardinal) on Oct 01, 2016 at 09:39 UTC
    What did you try? How did it fail?

    Also, why is NC_017523 printed with :1 when it always contains :0 in the sample input?

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: merge lines removing duplicates in a file
by AnomalousMonk (Archbishop) on Oct 01, 2016 at 14:34 UTC

    In general:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use List::MoreUtils qw(uniq);
     ;;
     my @data = qw(
       NC_009565:0 NC_017524:0 NC_017522:0 NC_018143:0
       NC_017026:0 NC_017523:0 NC_016934:1 NC_018078:0
       NC_017026:0 NC_017523:0 NC_016934:1 NC_018078:0
       NC_999999:0 NC_999999:1
       NC_021193:0 NC_016768:0 NC_021251:0 NC_021192:0
       NC_012943:0 NC_002755:0 NC_020559:0 NC_020089:0
       NC_999999:1 NC_999999:0
       );
     ;;
     my @uniq = uniq @data;
     ;;
     printf qq{%d in \@data \n}, scalar @data;
     printf qq{%d in \@uniq \n}, scalar @uniq;
     print qq{'$_'} for @uniq;
    "
    24 in @data
    18 in @uniq
    'NC_009565:0'
    'NC_017524:0'
    'NC_017522:0'
    'NC_018143:0'
    'NC_017026:0'
    'NC_017523:0'
    'NC_016934:1'
    'NC_018078:0'
    'NC_999999:0'
    'NC_999999:1'
    'NC_021193:0'
    'NC_016768:0'
    'NC_021251:0'
    'NC_021192:0'
    'NC_012943:0'
    'NC_002755:0'
    'NC_020559:0'
    'NC_020089:0'
    I leave to you the task of extracting the protein (?) and NC_ info from all records and associating the latter (after uniq-ification) with the former. See List::MoreUtils::uniq().
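
    One possible way to approach that remaining step is sketched here. It is not code from this thread and rests on several assumptions: each record sits on a single line, the name is whatever precedes the first ID:flag token, the '+' wrap markers have already been removed from the input, and "merging" means keeping the highest flag seen for each ID (so :1 wins over :0). Note that plain uniq() keeps both NC_999999:0 and NC_999999:1, so the sketch tracks a per-ID maximum instead. The helper name merge_records is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: merge records that share a name, keeping for each
# ID the highest flag seen. Assumes lines of the form
# "<name words> ID:flag ID:flag ..." with flags 0 or 1.
sub merge_records {
    my @lines = @_;
    my (%flag, %ids, @names);   # $flag{name}{id} = max flag; @names and %ids keep first-seen order

    for my $line (@lines) {
        my @tokens = split ' ', $line;

        # Leading tokens that do not look like ID:flag form the name.
        my @name;
        push @name, shift @tokens
            while @tokens && $tokens[0] !~ /^\w+:[01]$/;
        my $name = join ' ', @name;

        unless (exists $flag{$name}) {
            push @names, $name;
            $flag{$name} = {};
            $ids{$name}  = [];
        }

        for my $token (@tokens) {
            # Skip anything that is not a well-formed ID:flag token.
            my ($id, $value) = $token =~ /^(\w+):([01])$/ or next;
            push @{ $ids{$name} }, $id unless exists $flag{$name}{$id};
            $flag{$name}{$id} = $value
                if !defined $flag{$name}{$id} || $value > $flag{$name}{$id};
        }
    }

    # One output line per name, IDs in first-seen order.
    return map {
        my $name = $_;
        join ' ', $name, map { "$_:$flag{$name}{$_}" } @{ $ids{$name} };
    } @names;
}

# Small demonstration with two shortened records from the question.
my @merged = merge_records(
    '1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_017026:0 NC_016934:1',
    '1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_017026:1 NC_016934:0',
);
print "$_\n" for @merged;
# prints: 1,4-dihydroxy-2-naphthoate octaprenyltransferase NC_009565:0 NC_017026:1 NC_016934:1
```

    To run it on a real file, the demonstration call could be replaced by reading the lines from the file and passing them to merge_records(). Keeping first-seen order for names and IDs is a design choice here so the output columns line up with the input; a sorted order would work just as well if that is preferred.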


    Give a man a fish:  <%-{-{-{-<
