PerlMonks  

split function problem

by $new_guy (Acolyte)
on Feb 22, 2011 at 08:15 UTC ( #889535=perlquestion )
$new_guy has asked for the wisdom of the Perl Monks concerning the following question:

I have tried working on this for the past several days. I have a test_data.txt file that I would like to reorganize. The data in the file are arranged in rows, each row representing a distinct cluster/group. I would like to re-arrange the data in the file so that all entries with the same prefix are in one column. The cluster IDs help to identify the data entries in the rows.

The data currently looks like this:

ClusterX  a_123(something)  b_675(some_other_thing)  b_234(something new)  c_897(some different thing)
ClusterY  b_6998(some_other_thing, thats new)  c_877797(something different inside here)  c_111(some other different thing)
ClusterZ  a_1234(something interesting)  a_123467(something - else thats is - interesting)  3850-1-2_12243(a new one)  3850-1-2_1789(another new one)

The desired format is:

ClusterX  a_123(something)  b_675(some_other_thing)  c_897(some different thing)  -
ClusterX  -  b_234(something, new)  -  -
ClusterY  -  b_6998(some_other_thing, thats new)  c_877797(something: different inside here)
ClusterY  -  -  c_111(some other different thing)
ClusterZ  a_1234(something interesting)  -  -  -
ClusterZ  a_123467(something - else thats is - interesting)  -  -  3850-1-2_12243(a new one)
ClusterZ  -  -  -  3850-1-2_1789(another new one)
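One way to arrive at a layout like this is to bucket each cluster's entries by prefix and then emit one output row per "depth" within the buckets. The following is only a sketch under stated assumptions: the sub name layout_by_prefix and the sample lines are hypothetical, fields are assumed to be separated by two or more spaces, and columns are ordered by first appearance of each prefix. It fills every column from the top down, which matches the ClusterX and ClusterY rows shown above; it would pack ClusterZ into two rows rather than the three shown.

```perl
use strict;
use warnings;
use List::Util 'max';

# Hypothetical sample; fields are assumed to be separated by 2+ spaces.
my @lines = (
    'ClusterX  a_123(something)  b_675(some_other_thing)  b_234(something new)  c_897(some different thing)',
    'ClusterY  b_6998(some_other_thing, thats new)  c_877797(something different inside here)  c_111(some other different thing)',
);

# Returns (\@prefix_order, \@rows); each row is [cluster, cell, cell, ...].
sub layout_by_prefix {
    my @input = @_;
    my (@order, %seen, @clusters);

    for my $line (@input) {
        my ($cluster, @entries) = split /\s{2,}/, $line;
        my %bucket;
        for my $entry (@entries) {
            # Prefix = text before the first underscore outside the brackets.
            my ($prefix) = $entry =~ /^([^_(]+)_/ or next;   # skip malformed entries
            push @order, $prefix unless $seen{$prefix}++;    # first-appearance column order
            push @{ $bucket{$prefix} }, $entry;
        }
        push @clusters, [ $cluster, \%bucket ];
    }

    my @rows;
    for my $c (@clusters) {
        my ($cluster, $bucket) = @$c;
        # A cluster needs as many rows as its fullest prefix bucket.
        my $depth = max(map { scalar @$_ } values %$bucket) // 0;
        for my $i (0 .. $depth - 1) {
            push @rows, [ $cluster,
                map { ($bucket->{$_} // [])->[$i] // '-' } @order ];
        }
    }
    return (\@order, \@rows);
}

my ($order, $rows) = layout_by_prefix(@lines);
print join("\t", 'cluster', @$order), "\n";
print join("\t", @$_), "\n" for @$rows;
```

Tab-separated output is used here only to keep the sketch short; fixed-width padding with printf would work equally well once the longest cell is known.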

Please note that the prefix is everything before the underscore (i.e. _). This means the first underscore, not one inside the brackets, if there is one.
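The prefix rule can be sketched with a regular expression. This is a minimal illustration, not the poster's code: entry_prefix is a hypothetical helper, and it assumes every entry has the shape prefix_number(annotation). The character class [^_(] stops at the first underscore or opening bracket, so underscores inside the brackets are never considered. A greedy pattern such as (\S+)_ would instead run on to the last underscore of an annotation like some_other_thing.

```perl
use strict;
use warnings;

# Hypothetical helper: return the prefix of an entry, or undef if the
# entry has no underscore before its opening bracket.
sub entry_prefix {
    my ($entry) = @_;
    # [^_(]+ cannot cross an underscore or a bracket, so the match ends
    # at the first underscore outside the brackets.
    return $entry =~ /^([^_(]+)_/ ? $1 : undef;
}

for my $entry ('a_123(something)',
               'b_675(some_other_thing)',
               '3850-1-2_12243(a new one)') {
    print "$entry => ", entry_prefix($entry), "\n";
}
# a_123(something) => a
# b_675(some_other_thing) => b
# 3850-1-2_12243(a new one) => 3850-1-2
```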

The script I am using is below. I think the problem is at the split function (line 17). Is this right?

#!usr/bin/perl
use strict;
use warnings;
use IO::String;
use List::Util 'max';

my $FILENAME4 = "test_data.txt";
open(DATA, $FILENAME4);

#create arrays and hashes to store stuff
my (%data, %all, @keys);

while (<DATA>) {
    # avoid \n on last field
    chomp;
    #split the data into chunks
    my @chunks = split(/\s{2,}/, <DATA>); ## make sure you don't split inside annotated brackets
    #create keys for the chunks
    my $key = shift @chunks;
    #store the keys in an array unless they already exist
    push @keys, $key unless exists $data{$key};
    foreach my $chunk (@chunks) {
        #return references using hashes
        $data{$key}{$chunk}++;
        #add all chunks to the hash '%all'
        $all{$chunk} = 1;
    }
}

#remove new_clusters2.txt if it exists
my $remove2 = "new_clusters2.txt";
if (unlink($remove2) == 1) {
    print "Existing \"new_clusters2.txt\" file was removed\n";
}

#now make a file for the ouput
my $outputfile = "new_clusters2.txt";
if (! open(POS, ">>$outputfile") ) {
    print "Cannot open file \"$outputfile\" to write to!!\n\n";
    exit;
}

#sort the fields/columns keys and save them as an array
#my @fields = sort {$a <=> $b} keys %all; ##<--this sorting didn't work
my @fields = sort {lc($a) cmp lc($b)} keys %all;

#find the longest entry in an array
my @array2 = ();
foreach my $e (@fields){ ####
    my $d = $e;
    $d =~ m/(\S+)\_/;
    my $prefix = $1;
    # print "Prefix: ". $prefix."_"."\n\n"; ## prints the prefices
    push(@array2, $1);
    # print "* @array2 \n"; ## prints the prefices_
} ####
my $longest = max map {length} @array2;

#organise the data
foreach my $key (@keys) {
    while (keys %{$data{$key}}) {
        print POS $key, " ";
        foreach my $field (@fields) {
            if ($data{$key}{$field}){
                printf POS "%${longest}s ", $field;
                delete $data{$key}{$field} unless --$data{$key}{$field};
            }
            else {
                printf POS "%${longest}s ", "-";
            }
        }
        print POS "\n";
    }
}
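On the split call itself: the second argument of split is evaluated in scalar context, so writing <DATA> there reads a further line from the file instead of using the line that while (<DATA>) has already placed in $_. The sketch below contrasts the two forms on an in-memory file. The sample text and both sub names are hypothetical, and the fields are assumed to be separated by two or more spaces.

```perl
use strict;
use warnings;

# Hypothetical two-row sample; fields are assumed to be separated by 2+ spaces.
my $sample =
    "ClusterX  a_123(something)  b_675(some_other_thing)\n"
  . "ClusterY  b_6998(some_other_thing, thats new)  c_111(some other different thing)\n";

# Mirrors the script: split is handed a fresh read from the handle.
sub rows_as_in_script {
    my ($text) = @_;
    local $_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @rows;
    while (<$fh>) {
        chomp;                                   # chomps $_, which is then unused
        my @chunks = split /\s{2,}/, <$fh>;      # reads the NEXT line, not $_
        push @rows, \@chunks;
    }
    close $fh;
    return @rows;
}

# Splits the line that while() already stored in $_.
sub rows_from_current_line {
    my ($text) = @_;
    local $_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    my @rows;
    while (<$fh>) {
        chomp;
        my @chunks = split /\s{2,}/, $_;
        push @rows, \@chunks;
    }
    close $fh;
    return @rows;
}

my @skipping = rows_as_in_script($sample);
my @complete = rows_from_current_line($sample);

printf "as in script: %d row(s), first key: %s\n", scalar @skipping, $skipping[0][0];
printf "reading \$_:   %d row(s), first key: %s\n", scalar @complete, $complete[0][0];
# as in script: 1 row(s), first key: ClusterY
# reading $_:   2 row(s), first key: ClusterX
```

With the form used in the script, every other line is dropped and the last field of each surviving line keeps its newline, because chomp was applied to the discarded line.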

My test data (just 5 clusters):

Cluster5 SP_1003(conserved hypothetical protein) SP_1174(conserved + domain protein) SP_1175(conserved domain protein) spr_0907(Pne +umococcal histidine triad protein D precursor) spr_1060(Histidine Mo +tif-Containing protein) spr_1061(Pneumococcal histidine triad protei +n A precursor) SPD_0889(pneumococcal histidine triad protein D p +recursor) SPD_1037(histidine triad protein) SPD_1038(pneumococcal h +istidine triad protein A precursor) SP70585_1043(pneumococcal hi +stidine triad protein B) SP70585_1226(pneumococcal histidine triad p +rotein B) SP70585_1227(pneumococcal histidine triad protein B) +SPJ_0944(pneumococcal histidine triad protein B) SPJ_1093(pneumococc +al histidine triad protein B) SPP_1009(pneumococcal histidine tr +iad protein B) SPP_1217(pneumococcal histidine triad protein B) SPP +_1218(pneumococcal histidine triad protein B) SPT_1049(pneumococ +cal histidine triad protein B) SPT_1198(pneumococcal histidine triad + protein B) SPH_1104(pneumococcal histidine triad protein B) + SPG_0928(pneumococcal histidine triad protein D) SPG_1073(pneumoco +ccal histidine triad protein A/B (phtA/B)) SPCG_0977(hypothetica +l protein) SPCG_1122(hypothetical protein) HMPREF0837_11322(pne +umococcal histidine triad protein A/B (phtA/B)) HMPREF0837_11481(pne +umococcal histidine triad protein B) SPN23F_09290(pneumococcal h +istidine triad protein D (bvh-11-2)) SPN23F_10770(putative streptoco +ccal histidine triad protein PhpA) 3850-1-10_00031(unknown) 385 +0-1-10_01193(unknown) 3850-1-10_01345(unknown) 3850-1-11_01204( +unknown) 3850-1-11_01329(unknown) 3850-1-12_00144(unknown) 385 +0-1-12_01345(unknown) 3850-1-1_00282(unknown) 3850-1-1_01281(un +known) 3850-1-1_01443(unknown) 3850-1-2_00010(unknown) 3850-1- +2_01233(unknown) 3850-1-2_01374(unknown) 3850-1-3_01238(unknown +) 3850-1-3_01239(unknown) 3850-1-3_01382(unknown) 3850-1-4_002 +76(unknown) 3850-1-4_01322(unknown) 3850-1-4_01482(unknown) 38 +50-1-5_00019(unknown) 3850-1-5_00023(unknown) 
3850-1-5_01247(unknow +n) 3850-1-6_00040(unknown) 3850-1-6_01259(unknown) 3850-1- +7_00013(unknown) 3850-1-7_01232(unknown) 3850-1-7_01359(unknown) + 3850-1-8_00159(unknown) 3850-1-8_01109(unknown) 3850-1-8_01261(u +nknown) 3850-1-9_00252(unknown) 3850-1-9_01523(unknown) 38 +50-2-10_00214(unknown) 3850-2-10_01304(unknown) 3850-2-10_01461(unk +nown) 3850-2-11_01237(unknown) 3850-2-11_01238(unknown) 3850-2 +-11_01361(unknown) 3850-2-11_01362(unknown) 3850-2-12_00145(unk +nown) 3850-2-12_01279(unknown) 3850-2-12_01280(unknown) 3850-2-12_ +01438(unknown) 3850-2-1_01260(unknown) 3850-2-1_01261(unknown) + 3850-2-1_01369(unknown) 3850-2-2_01307(unknown) 3850-2-2_01443 +(unknown) 3850-2-3_01243(unknown) 3850-2-3_01385(unknown) 3850 +-2-3_01386(unknown) 3850-2-4_01492(unknown) 3850-2-4_01636(unkn +own) 3850-2-5_01509(unknown) 3850-2-5_01510(unknown) 3850- +2-6_00415(unknown) 3850-2-6_01357(unknown) 3850-2-6_01358(unknown) + 3850-2-7_01389(unknown) 3850-2-7_01544(unknown) 3850-2-7_01545 +(unknown) 3850-2-8_00293(unknown) 3850-2-8_01225(unknown) 3850 +-2-8_01226(unknown) 3850-2-8_01353(unknown) 3850-2-9_00078(unkn +own) 3850-2-9_01278(unknown) 3850-2-9_01438(unknown) 3850-2-9_0143 +9(unknown) 3850-3-10_01395(unknown) 3850-3-10_01397(unknown) 3 +850-3-10_01495(unknown) 3850-3-11_00190(unknown) 3850-3-11_0019 +1(unknown) 3850-3-11_01194(unknown) 3850-3-12_00207(unknown) 3 +850-3-12_01390(unknown) 3850-3-12_01391(unknown) 3850-3-1_00383 +(unknown) 3850-3-1_01304(unknown) 3850-3-2_01474(unknown) 3850 +-3-2_01635(unknown) 3850-3-3_00053(unknown) 3850-3-3_01170(unkn +own) 3850-3-3_01315(unknown) 3850-3-4_00436(unknown) 3850-3-4_ +01261(unknown) 3850-3-4_01262(unknown) 3850-3-5_00295(unknown) + 3850-3-5_01224(unknown) 3850-3-5_01365(unknown) 3850-3-5_01366(unk +nown) 3850-3-6_01476(unknown) 3850-3-6_02252(unknown) 3850 +-3-7_01192(unknown) 3850-3-7_01324(unknown) 3850-3-8_00224(unkn +own) 3850-3-8_01346(unknown) 3850-3-9_00049(unknown) 3850-3-9_ +01273(unknown) 
3850-3-9_01274(unknown) 3850-5-10_00315(unknown) + 3850-5-10_00420(unknown) 3850-5-10_01240(unknown) 3850-5-11_0 +0058(unknown) 3850-5-11_00096(unknown) 3850-5-12_01197(unknown) + 3850-5-12_01198(unknown) 3850-5-12_01314(unknown) 3850-5-12_01315 +(unknown) 3850-5-1_01339(unknown) 3850-5-1_03653(unknown) +3850-5-2_00097(unknown) 3850-5-2_01168(unknown) 3850-5-2_02100(unkn +own) 3850-5-3_00103(unknown) 3850-5-3_00104(unknown) 3850-5-3_ +01284(unknown) 3850-5-3_01285(unknown) 3850-5-4_01200(unknown) + 3850-5-4_01367(unknown) 3850-5-5_01254(unknown) 3850-5-5_01384 +(unknown) 3850-5-5_01385(unknown) 3850-5-6_01108(unknown) 3850 +-5-6_01244(unknown) 3850-5-7_00525(unknown) 3850-5-8_01313( +unknown) 3850-5-8_01458(unknown) 3850-5-9_00357(unknown) 3850- +5-9_01419(unknown) 3850-5-9_01420(unknown) 3850-6-10_01264(unkn +own) 3850-6-10_01402(unknown) 3850-6-11_01121(unknown) 3850-6- +11_01122(unknown) 3850-6-11_01259(unknown) 3850-6-12_00043(unkn +own) 3850-6-12_01214(unknown) 3850-6-12_01367(unknown) 3850-6- +1_00100(unknown) 3850-6-1_01094(unknown) 3850-6-1_01095(unknown) + 3850-6-2_01432(unknown) 3850-6-2_02107(unknown) 3850-6-3_002 +36(unknown) 3850-6-3_01195(unknown) 3850-6-4_01067(unknown) 38 +50-6-4_01201(unknown) 3850-6-5_00239(unknown) 3850-6-5_01350(un +known) 3850-6-5_02142(unknown) 3850-6-6_01062(unknown) 3850-6- +6_01065(unknown) 3850-6-6_01207(unknown) 3850-6-7_00139(unknown +) 3850-6-7_00140(unknown) 3850-6-7_01133(unknown) 3850-6-7_01263(u +nknown) 3850-6-8_00173(unknown) 3850-6-8_00338(unknown) 3850-6 +-8_01219(unknown) 3850-6-9_01211(unknown) 3850-6-9_01343(unknow +n) 3850-7-10_00113(unknown) 3850-7-10_01327(unknown) 3850-7-10 +_01328(unknown) 3850-7-11_01218(unknown) 3850-7-11_01330(unknow +n) 3850-7-12_01272(unknown) 3850-7-12_01398(unknown) 3850- +7-1_00111(unknown) 3850-7-1_00112(unknown) 3850-7-1_01287(unknown) + 3850-7-1_02022(unknown) 3850-7-2_01123(unknown) 3850-7-2_01233 +(unknown) 3850-7-3_00200(unknown) 3850-7-3_01371(unknown) 3850 
+-7-3_02157(unknown) 3850-7-4_00004(unknown) 3850-7-4_01158(unkn +own) 3850-7-4_01290(unknown) 3850-7-5_00155(unknown) 3850-7-5_ +01363(unknown) 3850-7-6_00054(unknown) 3850-7-6_01170(unknown) + 3850-7-7_01195(unknown) 3850-7-7_01196(unknown) 3850-7-7_01346 +(unknown) 3850-7-8_00055(unknown) 3850-7-8_00056(unknown) 3850 +-7-8_01214(unknown) 3850-7-8_01363(unknown) 3850-7-9_01185(unkn +own) 3850-7-9_01327(unknown) 3850-8-10_00010(unknown) 3850-8-1 +0_01203(unknown) 3850-8-11_01391(unknown) 3850-8-11_01392(unkno +wn) 3850-8-12_01230(unknown) 3850-8-12_01233(unknown) 3850-8-1 +2_01354(unknown) 3850-8-12_01355(unknown) 3850-8-1_00024(unknow +n) 3850-8-1_01183(unknown) 3850-8-1_01300(unknown) 3850-8-2_01 +149(unknown) 3850-8-2_01281(unknown) 3850-8-3_00102(unknown) 3 +850-8-3_01220(unknown) 3850-8-3_01221(unknown) 3850-8-4_00210(u +nknown) 3850-8-4_01322(unknown) 3850-8-4_01408(unknown) 3850-8 +-5_01362(unknown) 3850-8-6_01279(unknown) 3850-8-6_01280(unknow +n) 3850-8-6_01416(unknown) 3850-8-7_00036(unknown) 3850-8-7_01 +349(unknown) 3850-8-8_00056(unknown) 3850-8-8_01279(unknown) 3 +850-8-8_01423(unknown) 3850-8-9_00002(unknown) 3850-8-9_01374(u +nknown) 3850-8-9_01500(unknown) Cluster6 SP_0917(pilin gene inverting-related protein) spr_041 +2(Degenerate transposase) spr_0817(Degenerate transposase) spr_0818 +(Degenerate transposase) spr_1886(Degenerate transposase) SPD_1 +901(transposase, putative) SP70585_0526(transposase) SP70585_09 +53(pilin gene inverting-related protein) SP70585_2181(transposase) + SPJ_0440(transposase) SPJ_0856(pilin gene inverting-related prot +ein) SPJ_2096(transposase) SPP_0487(transposase) SPP_0923(cons +erved domain protein) SPP_0925(transposase) SPP_2129(transposase) + SPT_1283(transposase) SPT_1285(transposase) SPT_2085(transposas +e) SPH_2262(transposase) SPG_0419(transposase) SPG_0842(tr +ansposase) SPG_2013(transposase) SPCG_0451(degenerate transposa +se) SPCG_0894(pilin gene inverting-related protein,truncated) SPCG_ +2041(degenerate 
transposase) HMPREF0837_10072(transposase) HMPR +EF0837_10749(possible transposase) HMPREF0837_11569(transposase) HM +PREF0837_11570(possible pilin gene inverting-related protein) ps +eudoSPN23F_04290(putative transposase (pseudogene)) pseudoSPN23F_084 +00(transposase (pseudogene)) pseudoSPN23F_20990(degenerate transposa +se) 3850-1-10_00199(unknown) 3850-1-10_01100(unknown) 3850 +-1-11_00716(unknown) 3850-1-11_00717(unknown) 3850-1-12_00315(u +nknown) 3850-1-12_00915(unknown) 3850-1-12_01260(unknown) 3850 +-1-1_00296(unknown) 3850-1-1_01194(unknown) 3850-1-2_00163(unkn +own) 3850-1-2_01140(unknown) 3850-1-3_00703(unknown) 3850-1-3_ +01129(unknown) 3850-1-4_00161(unknown) 3850-1-4_00258(unknown) + 3850-1-5_00111(unknown) 3850-1-5_00885(unknown) 3850-1-5_02093 +(unknown) 3850-1-6_00726(unknown) 3850-1-6_00727(unknown) 3850 +-1-6_01168(unknown) 3850-1-6_02316(unknown) 3850-1-7_00235(unkn +own) 3850-1-7_01142(unknown) 3850-1-8_00113(unknown) 3850-1-8_ +00120(unknown) 3850-1-8_00156(unknown) 3850-1-9_00282(unknown) + 3850-1-9_00812(unknown) 3850-1-9_01869(unknown) 3850-1-9_02527(unk +nown) 3850-2-10_00018(unknown) 3850-2-10_00755(unknown) 3850-2 +-10_01212(unknown) 3850-2-10_02313(unknown) 3850-2-11_00268(unk +nown) 3850-2-11_01155(unknown) 3850-2-12_00316(unknown) 3850-2 +-12_00854(unknown) 3850-2-12_01182(unknown) 3850-2-12_02494(unknown +) 3850-2-1_00101(unknown) 3850-2-1_00201(unknown) 3850-2-2 +_00082(unknown) 3850-2-2_01209(unknown) 3850-2-2_02340(unknown) + 3850-2-3_00758(unknown) 3850-2-3_01157(unknown) 3850-2-3_02338(un +known) 3850-2-4_00959(unknown) 3850-2-4_01399(unknown) 3850-2- +4_01400(unknown) 3850-2-4_02354(unknown) 3850-2-5_01279(unknown +) 3850-2-5_02529(unknown) 3850-2-6_00903(unknown) 3850-2-6_012 +86(unknown) 3850-2-6_01798(unknown) 3850-2-6_02397(unknown) 38 +50-2-7_00055(unknown) 3850-2-8_00762(unknown) 3850-2-8_00885(un +known) 3850-2-8_02241(unknown) 3850-2-9_00268(unknown) 3850-2- +9_01187(unknown) 3850-3-10_00947(unknown) 
3850-3-10_01304(unkno +wn) 3850-3-10_02319(unknown) 3850-3-11_00684(unknown) 3850-3-1 +1_01099(unknown) 3850-3-11_02124(unknown) 3850-3-12_01294(unkno +wn) 3850-3-1_00060(unknown) 3850-3-1_02574(unknown) 3850-3 +-2_00353(unknown) 3850-3-2_00972(unknown) 3850-3-2_01372(unknown) + 3850-3-3_00614(unknown) 3850-3-3_00615(unknown) 3850-3-3_00616( +unknown) 3850-3-4_00248(unknown) 3850-3-4_02268(unknown) 3 +850-3-5_00976(unknown) 3850-3-5_01135(unknown) 3850-3-5_02101(unkno +wn) 3850-3-6_00821(unknown) 3850-3-6_00822(unknown) 3850-3-6_0 +1258(unknown) 3850-3-7_00620(unknown) 3850-3-7_00621(unknown) +3850-3-7_01092(unknown) 3850-3-8_00694(unknown) 3850-3-8_00695( +unknown) 3850-3-8_01122(unknown) 3850-3-9_00852(unknown) 3850- +3-9_01197(unknown) 3850-3-9_02318(unknown) 3850-5-10_00009(unkn +own) 3850-5-10_00890(unknown) 3850-5-10_00892(unknown) 3850-5-10_0 +1194(unknown) 3850-5-11_00563(unknown) 3850-5-12_00355(unkn +own) 3850-5-12_01133(unknown) 3850-5-1_03001(unknown) 3850-5-1 +_03431(unknown) 3850-5-1_03432(unknown) 3850-5-2_00706(unknown) + 3850-5-2_00707(unknown) 3850-5-2_01083(unknown) 3850-5-3_0081 +0(unknown) 3850-5-3_01191(unknown) 3850-5-3_01192(unknown) 3850-5- +3_01193(unknown) 3850-5-3_02309(unknown) 3850-5-4_00097(unknown +) 3850-5-4_01092(unknown) 3850-5-4_01093(unknown) 3850-5-5_007 +51(unknown) 3850-5-5_01159(unknown) 3850-5-5_01160(unknown) 3850-5 +-5_02160(unknown) 3850-5-6_00601(unknown) 3850-5-6_00602(unknow +n) 3850-5-6_01024(unknown) 3850-5-7_00858(unknown) 3850-5-7_02 +438(unknown) 3850-5-8_00810(unknown) 3850-5-8_01218(unknown) + 3850-5-9_00828(unknown) 3850-5-9_01197(unknown) 3850-5-9_02293(u +nknown) 3850-6-10_00744(unknown) 3850-6-10_01222(unknown) 3850 +-6-10_02379(unknown) 3850-6-11_00022(unknown) 3850-6-11_02122(u +nknown) 3850-6-12_00052(unknown) 3850-6-1_00603(unknown) 3 +850-6-1_01005(unknown) 3850-6-2_01202(unknown) 3850-6-2_02233(u +nknown) 3850-6-3_00760(unknown) 3850-6-3_01124(unknown) 38 +50-6-4_00988(unknown) 3850-6-4_00989(unknown) 
3850-6-5_00035(un +known) 3850-6-5_00097(unknown) 3850-6-5_01124(unknown) 3850-6- +6_00536(unknown) 3850-6-6_00973(unknown) 3850-6-6_01974(unknown) + 3850-6-7_00045(unknown) 3850-6-7_00102(unknown) 3850-6-9 +_01120(unknown) 3850-7-10_02104(unknown) 3850-7-11_00031(un +known) 3850-7-11_00210(unknown) 3850-7-11_00699(unknown) 3850-7-11 +_02226(unknown) 3850-7-12_00104(unknown) 3850-7-12_01180(unknow +n) 3850-7-12_02224(unknown) 3850-7-1_00080(unknown) 3850-7-1_0 +0124(unknown) 3850-7-1_01069(unknown) 3850-7-2_02078(unknown) + 3850-7-3_00010(unknown) 3850-7-3_00094(unknown) 3850-7-3_00223( +unknown) 3850-7-3_01165(unknown) 3850-7-4_00155(unknown) 3850- +7-4_01071(unknown) 3850-7-4_01072(unknown) 3850-7-5_00178(unkno +wn) 3850-7-6_00580(unknown) 3850-7-6_00936(unknown) 3850-7-6_0 +2005(unknown) 3850-7-7_00700(unknown) 3850-7-7_01107(unknown) +3850-7-7_01108(unknown) 3850-7-7_02203(unknown) 3850-7-8_00115( +unknown) 3850-7-8_00711(unknown) 3850-7-8_00712(unknown) 3850- +7-9_00026(unknown) 3850-7-9_00055(unknown) 3850-7-9_01092(unknown) + 3850-7-9_02243(unknown) 3850-8-10_00092(unknown) 3850-8-10_007 +49(unknown) 3850-8-10_01097(unknown) 3850-8-11_00782(unknown) +3850-8-11_02607(unknown) 3850-8-12_00179(unknown) 3850-8-12_011 +38(unknown) 3850-8-12_02196(unknown) 3850-8-2_00044(unknown +) 3850-8-2_00610(unknown) 3850-8-2_02175(unknown) 3850-8-3_003 +28(unknown) 3850-8-3_01129(unknown) 3850-8-4_00192(unknown) 38 +50-8-4_00817(unknown) 3850-8-4_01236(unknown) 3850-8-4_02333(unknow +n) 3850-8-5_00022(unknown) 3850-8-5_00233(unknown) 3850-8-5_01 +143(unknown) 3850-8-6_00707(unknown) 3850-8-6_01185(unknown) + 3850-8-7_00046(unknown) 3850-8-7_00190(unknown) 3850-8-7_01111(u +nknown) 3850-8-7_01112(unknown) 3850-8-8_01213(unknown) 38 +50-8-9_00861(unknown) 3850-8-9_01283(unknown) 3850-8-9_02357(unknow +n) Cluster7 spr_1379(ABC transporter, truncation) spr_1380(ABC tr +ansporter, truncation) spr_1381(ABC transporter, truncation) SP +D_1355(conserved hypothetical protein) 
SPP_1546(ABC tran +sporter) SPG_1451(ABC transporter, ATP-binding protein) + SPG_1452(hypothetical protein) SPG_1453(hypothetical protein) +SPCG_1511(hypothetical protein) SPCG_1512(hypothetical protein) SPC +G_1513(ABC-type multidrug transport system, ATPase and permease compo +nents) HMPREF0837_11760(ABC superfamily ATP binding cassette tra +nsporter, ABC protein) SPN23F_14900(ABC transporter ATP-binding +protein) 3850-1-10_01718(unknown) 3850-1-10_01719(unknown) 385 +0-1-10_01720(unknown) 3850-1-11_01656(unknown) 3850-1-11_01657( +unknown) 3850-1-12_01833(unknown) 3850-1-12_01834(unknown) 385 +0-1-12_01835(unknown) 3850-1-1_01810(unknown) 3850-1-1_01811(un +known) 3850-1-1_01812(unknown) 3850-1-2_01768(unknown) 3850-1- +2_01769(unknown) 3850-1-3_01715(unknown) 3850-1-3_01717(unknown +) 3850-1-4_01809(unknown) 3850-1-4_01810(unknown) 3850-1-4_018 +11(unknown) 3850-1-5_00343(unknown) 3850-1-5_00344(unknown) 38 +50-1-5_00345(unknown) 3850-1-6_01749(unknown) 3850-1-6_01750(un +known) 3850-1-7_01682(unknown) 3850-1-7_01683(unknown) 3850-1- +7_01684(unknown) 3850-1-8_01560(unknown) 3850-1-8_01561(unknown +) 3850-1-9_01882(unknown) 3850-2-10_01781(unknown) 3850-2- +10_01782(unknown) 3850-2-10_01783(unknown) 3850-2-11_01600(unkn +own) 3850-2-12_01789(unknown) 3850-2-12_01790(unknown) 385 +0-2-1_01699(unknown) 3850-2-1_01700(unknown) 3850-2-2_01809(unk +nown) 3850-2-2_01810(unknown) 3850-2-3_01740(unknown) 3850-2-3 +_01741(unknown) 3850-2-3_01742(unknown) 3850-2-4_01896(unknown) + 3850-2-4_01897(unknown) 3850-2-5_01849(unknown) 3850-2-5_0185 +2(unknown) 3850-2-6_01813(unknown) 3850-2-6_01814(unknown) + 3850-2-7_01857(unknown) 3850-2-7_01858(unknown) 3850-2-7_01859(unk +nown) 3850-2-8_01652(unknown) 3850-2-8_01653(unknown) 3850 +-2-9_01772(unknown) 3850-2-9_01773(unknown) 3850-2-9_01774(unknown) + 3850-3-10_00128(unknown) 3850-3-10_00129(unknown) 3850-3- +11_00149(unknown) 3850-3-11_00150(unknown) 3850-3-12_01881(unkn +own) 3850-3-12_01882(unknown) 3850-3-1_01861(unknown) 
3850-3-1 +_01862(unknown) 3850-3-2_00363(unknown) 3850-3-2_00364(unknown) + 3850-3-3_01637(unknown) 3850-3-3_01638(unknown) 3850-3-4_ +01712(unknown) 3850-3-4_01713(unknown) 3850-3-5_01574(unknown) + 3850-3-5_01575(unknown) 3850-3-5_01576(unknown) 3850-3-6_01809 +(unknown) 3850-3-7_01655(unknown) 3850-3-8_01718(unknown) +3850-3-8_01719(unknown) 3850-3-9_01773(unknown) 3850-3-9_01774( +unknown) 3850-3-9_01775(unknown) 3850-5-10_01753(unknown) 3850 +-5-10_01754(unknown) 3850-5-11_01542(unknown) 3850-5-11_01543(u +nknown) 3850-5-11_01544(unknown) 3850-5-12_01604(unknown) 3850 +-5-12_01605(unknown) 3850-5-1_03988(unknown) 3850-5-2_01670 +(unknown) 3850-5-2_01671(unknown) 3850-5-2_01672(unknown) 3850 +-5-3_01717(unknown) 3850-5-4_01717(unknown) 3850-5-4_01718(unkn +own) 3850-5-5_01658(unknown) 3850-5-5_01659(unknown) 3850- +5-6_01548(unknown) 3850-5-6_01549(unknown) 3850-5-8_01725(u +nknown) 3850-5-8_01726(unknown) 3850-5-9_01748(unknown) 38 +50-6-10_01742(unknown) 3850-6-11_01577(unknown) 3850-6-12_0 +1672(unknown) 3850-6-12_01673(unknown) 3850-6-12_01674(unknown) + 3850-6-1_01558(unknown) 3850-6-2_01734(unknown) 3850-6-2_0173 +5(unknown) 3850-6-2_01737(unknown) 3850-6-3_01640(unknown) 385 +0-6-3_01641(unknown) 3850-6-4_01491(unknown) 3850-6-4_01492(unk +nown) 3850-6-5_01713(unknown) 3850-6-5_01714(unknown) 3850 +-6-6_01495(unknown) 3850-6-7_01635(unknown) 3850-6-7_01636(unkn +own) 3850-6-8_01737(unknown) 3850-6-8_01738(unknown) 3850- +6-9_00152(unknown) 3850-6-9_00153(unknown) 3850-6-9_00154(unknown) + 3850-7-10_01589(unknown) 3850-7-10_01590(unknown) 3850-7-10_01 +591(unknown) 3850-7-11_01664(unknown) 3850-7-11_01665(unknown) + 3850-7-12_01702(unknown) 3850-7-1_01633(unknown) 3850-7-1_ +01634(unknown) 3850-7-1_01635(unknown) 3850-7-2_01514(unknown) + 3850-7-2_01515(unknown) 3850-7-2_01516(unknown) 3850-7-3_01717 +(unknown) 3850-7-3_01718(unknown) 3850-7-3_01719(unknown) 3850 +-7-4_01596(unknown) 3850-7-4_01597(unknown) 3850-7-4_01598(unknown) + 
3850-7-5_01722(unknown) 3850-7-5_01723(unknown) 3850-7-5_0172 +4(unknown) 3850-7-6_01438(unknown) 3850-7-6_01439(unknown) + 3850-7-7_01649(unknown) 3850-7-7_01650(unknown) 3850-7-7_01651(unk +nown) 3850-7-8_01671(unknown) 3850-7-8_01672(unknown) 3850-7-8 +_01673(unknown) 3850-7-9_01711(unknown) 3850-7-9_01712(unknown) + 3850-8-10_01625(unknown) 3850-8-10_01626(unknown) 3850-8- +11_01860(unknown) 3850-8-11_01861(unknown) 3850-8-11_01862(unknown) + 3850-8-12_01664(unknown) 3850-8-12_01665(unknown) 3850-8- +1_00104(unknown) 3850-8-1_00105(unknown) 3850-8-1_00106(unknown) + 3850-8-2_01624(unknown) 3850-8-2_01625(unknown) 3850-8-2_01626(u +nknown) 3850-8-3_01697(unknown) 3850-8-3_01698(unknown) 3850-8 +-3_01699(unknown) 3850-8-4_01740(unknown) 3850-8-4_01741(unknow +n) 3850-8-4_01742(unknown) 3850-8-5_00059(unknown) 3850-8- +6_01734(unknown) 3850-8-6_01735(unknown) 3850-8-7_01748(unknown +) 3850-8-7_01749(unknown) 3850-8-7_01750(unknown) 3850-8-8_001 +71(unknown) 3850-8-9_00075(unknown) 3850-8-9_00076(unknown) 38 +50-8-9_00077(unknown) Cluster8 spr_0324(Transposase, uncharacterized, truncation) sp +r_1295(Transposase, uncharacterized, truncation) spr_1296(Hypothetic +al protein) spr_2016(Transposase, uncharacterized, truncation) +SPD_1269(conserved hypothetical protein) SPD_1270(conserved hypothet +ical protein) SPD_2038(conserved hypothetical protein) SP70585_ +2338(transposase) SPJ_1227(transposase) SPJ_1339(transposase) +SPJ_1340(transposase) SPJ_2237(transposase) SPP_0403(transposas +e) SPP_1461(transposase) SPP_1462(transposase) SPP_2264(transposas +e) SPT_2229(transposase) SPG_0329(IS66-Spn1, transposas +e) SPG_1204(IS66-Spn1, transposase) SPG_2157(IS66-Spn1, transposase +) SPCG_1428(hypothetical protein) SPCG_1429(hypothetical protei +n) SPCG_2178(transposase) HMPREF0837_10225(transposase family p +rotein) pseudoSPN23F_22440(putative transposase family protein) + 3850-1-10_00567(unknown) 3850-1-10_01498(unknown) 3850-1-1 +1_01566(unknown) 3850-1-11_02293(unknown) 
3850-1-12_00101(unkno +wn) 3850-1-12_00809(unknown) 3850-1-1_00699(unknown) 3850- +1-2_01520(unknown) 3850-1-2_01521(unknown) 3850-1-3_01631(unkno +wn) 3850-1-3_02363(unknown) 3850-1-4_00717(unknown) 3850-1-4_0 +2455(unknown) 3850-1-5_01625(unknown) 3850-1-5_01626(unknown) + 3850-1-7_00615(unknown) 3850-1-7_02400(unknown) 3850-1- +8_00068(unknown) 3850-1-8_02320(unknown) 3850-1-9_00313(unknown +) 3850-1-9_00713(unknown) 3850-2-10_00663(unknown) 3850-2-10_0 +2471(unknown) 3850-2-11_01488(unknown) 3850-2-11_01507(unknown) + 3850-2-11_01509(unknown) 3850-2-11_02462(unknown) 3850-2-12_0 +0125(unknown) 3850-2-12_00329(unknown) 3850-2-1_00860(unknown) + 3850-2-1_00861(unknown) 3850-2-1_02489(unknown) 3850-2-2_00751 +(unknown) 3850-2-2_01574(unknown) 3850-2-3_00655(unknown) 3850 +-2-3_01652(unknown) 3850-2-4_00174(unknown) 3850-2-4_00175(unkn +own) 3850-2-4_00482(unknown) 3850-2-5_01652(unknown) 3850-2-5_ +02710(unknown) 3850-2-6_00139(unknown) 3850-2-6_01616(unknown) + 3850-2-6_02551(unknown) 3850-2-7_00776(unknown) 3850-2-7_02431 +(unknown) 3850-2-8_00297(unknown) 3850-2-8_00658(unknown) 3850 +-2-8_01464(unknown) 3850-2-8_02380(unknown) 3850-2-9_01488(unkn +own) 3850-2-9_01599(unknown) 3850-3-10_00837(unknown) 3850-3-1 +0_02477(unknown) 3850-3-12_01798(unknown) 3850-3-1_0022 +9(unknown) 3850-3-1_00235(unknown) 3850-3-1_00236(unknown) 3850-3- +1_00864(unknown) 3850-3-1_02710(unknown) 3850-3-2_00866(unknown +) 3850-3-2_01076(unknown) 3850-3-2_01077(unknown) 3850-3-2_01787(u +nknown) 3850-3-2_02432(unknown) 3850-3-3_02341(unknown) 38 +50-3-4_00999(unknown) 3850-3-4_01000(unknown) 3850-3-4_02419(unknow +n) 3850-3-5_01489(unknown) 3850-3-5_01490(unknown) 3850-3-5_02 +253(unknown) 3850-3-6_01150(unknown) 3850-3-7_00528(unknown +) 3850-3-7_01567(unknown) 3850-3-7_02408(unknown) 3850-3-8_000 +25(unknown) 3850-3-9_00291(unknown) 3850-3-9_00748(unknown) + 3850-5-10_00764(unknown) 3850-5-10_01066(unknown) 3850-5-11_0 +1356(unknown) 3850-5-11_01357(unknown) 3850-5-11_02224(unknown) 
+ 3850-5-12_00237(unknown) 3850-5-12_01546(unknown) 3850-5-12_02363 +(unknown) 3850-5-1_03905(unknown) 3850-5-2_00600(unknown) +3850-5-2_02391(unknown) 3850-5-3_01638(unknown) 3850-5-3_02411( +unknown) 3850-5-4_01634(unknown) 3850-5-4_02388(unknown) 3 +850-5-5_01573(unknown) 3850-5-6_00495(unknown) 3850-5-6_02263(u +nknown) 3850-5-8_00350(unknown) 3850-5-8_01500(unknown) 38 +50-5-8_01643(unknown) 3850-5-9_00737(unknown) 3850-6-11 +_00086(unknown) 3850-6-12_00616(unknown) 3850-6-12_02349(unknow +n) 3850-6-1_02274(unknown) 3850-6-2_00639(unknown) 3850-6- +2_02382(unknown) 3850-6-3_00664(unknown) 3850-6-3_00887(unknown +) 3850-6-3_01549(unknown) 3850-6-4_00464(unknown) 3850-6-4_022 +12(unknown) 3850-6-5_00592(unknown) 3850-6-6_00863(unknown) + 3850-6-6_02131(unknown) 3850-6-7_01559(unknown) 3850-6-7_0235 +4(unknown) 3850-6-8_01503(unknown) 3850-6-9_01281(unknown) + 3850-6-9_01479(unknown) 3850-7-10_00569(unknown) 3850-7-10_022 +45(unknown) 3850-7-10_02246(unknown) 3850-7-11_01581(unknown) +3850-7-11_01582(unknown) 3850-7-11_02329(unknown) 3850-7-11_02330(u +nknown) 3850-7-12_00636(unknown) 3850-7-12_02334(unknown) +3850-7-1_00547(unknown) 3850-7-1_02291(unknown) 3850-7-2_00037( +unknown) 3850-7-2_02238(unknown) 3850-7-3_00661(unknown) 3850- +7-3_01637(unknown) 3850-7-3_02443(unknown) 3850-7-4_00563(unkno +wn) 3850-7-5_00581(unknown) 3850-7-6_00490(unknown) 3850-7 +-6_02155(unknown) 3850-7-7_00601(unknown) 3850-7-7_02324(unknow +n) 3850-7-8_00606(unknown) 3850-7-8_02333(unknown) 3850-7- +9_01627(unknown) 3850-7-9_01629(unknown) 3850-8-10_00191(unknow +n) 3850-8-10_02344(unknown) 3850-8-11_00680(unknown) 3850- +8-12_00582(unknown) 3850-8-12_02353(unknown) 3850-8-1_00031(unk +nown) 3850-8-1_01512(unknown) 3850-8-2_01534(unknown) 3850-8-2 +_01535(unknown) 3850-8-2_02279(unknown) 3850-8-3_00697(unknown) + 3850-8-3_02340(unknown) 3850-8-4_01480(unknown) 3850-8-4_0249 +5(unknown) 3850-8-5_02322(unknown) 3850-8-6_00190(unknown) + 3850-8-6_00191(unknown) 
3850-8-6_00223(unknown) 3850-8-6_01655(unk +nown) 3850-8-7_00622(unknown) 3850-8-8_02285(unknown) +3850-8-9_00763(unknown) 3850-8-9_01767(unknown) 3850-8-9_01769(unkn +own) 3850-8-9_02517(unknown) Cluster9 SP_0733(hypothetical protein) SP_0810(hypothetical protei +n) SP_1302(conserved hypothetical protein) SP_1487(hypothetical pro +tein) spr_0645(Hypothetical protein) spr_0717(Transposase) spr +_1180(Degenerate transposase) spr_1342(Degenerate transposase) +SPD_0639(conserved hypothetical protein) SPD_0711(conserved hypothet +ical protein) SPD_1157(conserved hypothetical protein) SPD_1316(con +served hypothetical protein) SP70585_0780(transposase) SP70585_ +1368(transposase) SP70585_1525(transposase) SPJ_0673(transposas +e) SPJ_1218(transposase) SPJ_1383(transposase) SPJ_1384(transposas +e) SPP_0745(transposase) SPP_0819(transposase) SPP_1343(transp +osase) SPP_1505(transposase) SPT_0749(transposase) SPT_0792(tr +ansposase family protein) SPT_0924(hypothetical protein) SPH_08 +20(transposase) SPH_0910(transposase) SPH_1445(transposase) SPH_14 +50(transposase) SPG_0666(IS630-SpnII, transposase) SPG_1196(IS6 +30-SpnII, transposase) SPG_1411(IS630-SpnII, transposase) SPCG_ +0682(hypothetical protein) SPCG_1269(hypothetical protein) HMPR +EF0837_11017(transposase) HMPREF0837_11060(possible transposase) HM +PREF0837_11684(transposase) pseudoSPN23F_06580(putative transpos +ase (pseudogene)) pseudoSPN23F_11950(putative transposase (pseudogen +e)) pseudoSPN23F_14470(putative transposase (pseudogene)) 3850- +1-10_00217(unknown) 3850-1-10_00914(unknown) 3850-1-10_01673(unknow +n) 3850-1-11_00022(unknown) 3850-1-12_00241(unknown) 3850- +1-12_00324(unknown) 3850-1-12_01117(unknown) 3850-1-12_01118(unknow +n) 3850-1-1_01013(unknown) 3850-1-1_01769(unknown) 3850-1- +2_00959(unknown) 3850-1-3_01528(unknown) 3850-1-3_01675(unknown +) 3850-1-4_01063(unknown) 3850-1-5_00971(unknown) 3850-1-5 +_01532(unknown) 3850-1-5_01671(unknown) 3850-1-6_00976(unknown) + 3850-1-9_01078(unknown) 
3850-2-10_01026(unknown) +3850-2-10_01604(unknown) 3850-2-11_00165(unknown) 3850-2-11_009 +81(unknown) 3850-2-11_00982(unknown) 3850-2-12_00039(unknown) +3850-2-12_01064(unknown) 3850-2-12_01744(unknown) 3850-2-1_0002 +0(unknown) 3850-2-1_01658(unknown) 3850-2-1_01659(unknown) 385 +0-2-2_01007(unknown) 3850-2-2_01566(unknown) 3850-2-3_01023(unk +nown) 3850-2-3_01697(unknown) 3850-2-4_00604(unknown) 3850-2-4 +_01204(unknown) 3850-2-4_01658(unknown) 3850-2-5_01649(unknown) + 3850-2-6_01768(unknown) 3850-2-6_01769(unknown) 3850-2-7_ +01107(unknown) 3850-2-8_00145(unknown) 3850-2-8_00952(unknown) + 3850-2-9_00993(unknown) 3850-2-9_01595(unknown) 3850-2-9_01731 +(unknown) 3850-3-11_00893(unknown) 3850-3-11_00907(unknown) + 3850-3-11_01622(unknown) 3850-3-1_00524(unknown) 3850-3-1 +_01818(unknown) 3850-3-2_00029(unknown) 3850-3-2_01175(unknown) + 3850-3-2_01176(unknown) 3850-3-2_01784(unknown) 3850-3-3_0087 +5(unknown) 3850-3-4_00178(unknown) 3850-3-4_01046(unknown) + 3850-3-5_00081(unknown) 3850-3-5_01535(unknown) 3850-3-6_01071 +(unknown) 3850-3-9_01066(unknown) 3850-3-9_01728(unknow +n) 3850-3-9_01729(unknown) 3850-5-10_01070(unknown) 3850-5-10_ +01543(unknown) 3850-5-10_01710(unknown) 3850-5-11_01352(unknown +) 3850-5-12_00113(unknown) 3850-5-12_01454(unknown) 3850-5 +-1_03245(unknown) 3850-5-2_00016(unknown) 3850-5-2_00128(unknow +n) 3850-5-2_00932(unknown) 3850-5-2_01443(unknown) 3850-5- +4_00906(unknown) 3850-5-4_01534(unknown) 3850-5-5_01011(unknown +) 3850-5-6_00870(unknown) 3850-5-7_00592(unknown) 3850 +-5-8_01027(unknown) 3850-5-8_01028(unknown) 3850-5-8_01497(unknown) + 3850-5-9_01068(unknown) 3850-6-11_00147(unknown) 3850 +-6-11_00896(unknown) 3850-6-11_01401(unknown) 3850-6-12_00946(u +nknown) 3850-6-2_01007(unknown) 3850-6-2_01576(unknown) + 3850-6-3_00992(unknown) 3850-6-3_01466(unknown) 3850-6-4_0079 +6(unknown) 3850-6-5_00942(unknown) 3850-6-6_00049(unknown) + 3850-6-7_00852(unknown) 3850-6-8_00975(unknown) 3850-6-8_0 +0976(unknown) 
3850-6-9_01628(unknown) 3850-7-10_00907(unkno +wn) 3850-7-10_00908(unknown) 3850-7-11_00005(unknown) 3850-7-1 +1_00959(unknown) 3850-7-11_01458(unknown) 3850-7-11_01628(unknown) + 3850-7-12_00993(unknown) 3850-7-12_01660(unknown) 3850-7-1 +_00100(unknown) 3850-7-1_00890(unknown) 3850-7-1_01425(unknown) + 3850-7-2_00830(unknown) 3850-7-2_00831(unknown) 3850-7-3_0100 +4(unknown) 3850-7-3_01513(unknown) 3850-7-4_00896(unknown) 385 +0-7-4_01554(unknown) 3850-7-5_00930(unknown) 3850-7-6_00783 +(unknown) 3850-7-7_01619(unknown) 3850-7-8_00061(unknown) +3850-7-8_00979(unknown) 3850-7-9_00908(unknown) 3850-8-10_0 +0258(unknown) 3850-8-10_01481(unknown) 3850-8-11_01816(unknown) + 3850-8-11_01817(unknown) 3850-8-12_00939(unknown) 3850-8-12_0 +0940(unknown) 3850-8-12_01486(unknown) 3850-8-12_01623(unknown) + 3850-8-1_00954(unknown) 3850-8-1_00955(unknown) 3850-8-2_0086 +1(unknown) 3850-8-2_01414(unknown) 3850-8-2_01580(unknown) 385 +0-8-3_00013(unknown) 3850-8-3_01022(unknown) 3850-8-3_01484(unknown +) 3850-8-3_01660(unknown) 3850-8-4_00247(unknown) 3850 +-8-6_00129(unknown) 3850-8-6_00993(unknown) 3850-8-6_00994(unknown) + 3850-8-7_00941(unknown) 3850-8-8_00270(unknown) 3850-8-8_ +01498(unknown) 3850-8-9_01084(unknown) Cluster10 SP_0042(competence factor transporting ATP-binding/permea +se protein ComA) spr_0043(Transport ATP-binding protein ComA) s +pr_0468(Conserved hypothetical protein, truncation) SPD_0049(com +petence factor transporting ATP-binding/permease protein ComA) S +P70585_0109(transport/processing ATP-binding protein ComA) SPJ_0 +073(transport/processing ATP-binding protein ComA) SPP_0107(tran +sport/processing ATP-binding protein ComA) SPP_0552(transport/proces +sing ATP-binding protein ComA) SPT_0080(transport/processing ATP +-binding protein ComA) SPH_0148(transport/processing ATP-binding + protein ComA) SPH_0636(transport/processing ATP-binding protein Com +A) SPG_0048(transport/processing ATP-binding protein ComA) SPG_ +0480(BlpC ABC transporter, 
ATP-binding protein (blpA)) SPCG_0044 +(competence factor transporting ATP-binding/permease protein ComA) S +PCG_0501(competence factor transporting ATP-binding/permease protein +ComA) HMPREF0837_10331(bacteriocin-associated ABC superfamily AT +P binding cassette transporter) SPN23F_00590(bacteriocin transpo +rt/processing ATP-binding protein) SPN23F_04810(putative bacteriocin + transport/processing ATP-binding protein BlpA) 3850-1-10_00272( +unknown) 3850-1-10_00721(unknown) 3850-1-11_00336(unknown) 385 +0-1-11_00769(unknown) 3850-1-12_00494(unknown) 3850-1-12_00978( +unknown) 3850-1-1_00388(unknown) 3850-1-1_00409(unknown) 3 +850-1-2_00344(unknown) 3850-1-2_00769(unknown) 3850-1-3_00314(u +nknown) 3850-1-3_00762(unknown) 3850-1-4_00415(unknown) 3850-1 +-4_00869(unknown) 3850-1-5_00272(unknown) 3850-1-5_00578(unknow +n) 3850-1-6_00350(unknown) 3850-1-6_00773(unknown) 3850-1- +7_00312(unknown) 3850-1-7_00764(unknown) 3850-1-8_00256(unknown +) 3850-1-8_00700(unknown) 3850-1-9_00401(unknown) 3850-1-9_008 +76(unknown) 3850-2-10_00345(unknown) 3850-2-10_00827(unknown) + 3850-2-11_00355(unknown) 3850-2-11_00792(unknown) 3850-2-12 +_00440(unknown) 3850-2-1_00273(unknown) 3850-2-1_00737(unknown) + 3850-2-2_00436(unknown) 3850-2-2_00921(unknown) 3850-2-3_ +00363(unknown) 3850-2-3_00819(unknown) 3850-2-4_00558(unknown) + 3850-2-4_01023(unknown) 3850-2-5_00449(unknown) 3850-2-5_00928 +(unknown) 3850-2-6_00524(unknown) 3850-2-6_00965(unknown) +3850-2-7_00534(unknown) 3850-2-7_00935(unknown) 3850-2-8_00372( +unknown) 3850-2-8_00831(unknown) 3850-2-9_00327(unknown) 3850- +2-9_00787(unknown) 3850-3-10_00581(unknown) 3850-3-11_00322 +(unknown) 3850-3-11_00746(unknown) 3850-3-12_00533(unknown) 38 +50-3-12_00998(unknown) 3850-3-1_00557(unknown) 3850-3-1_00932(u +nknown) 3850-3-2_00129(unknown) 3850-3-2_00535(unknown) 38 +50-3-3_00244(unknown) 3850-3-3_00675(unknown) 3850-3-4_00485(un +known) 3850-3-4_00973(unknown) 3850-3-5_00393(unknown) 3850-3- +5_00574(unknown) 
3850-3-6_00394(unknown) 3850-3-6_00884(unknown +) 3850-3-7_00247(unknown) 3850-3-7_00684(unknown) 3850-3-8 +_00136(unknown) 3850-3-8_00321(unknown) 3850-3-9_00489(unknown) + 3850-3-9_00918(unknown) 3850-5-10_00158(unknown) 3850-5-10_00 +956(unknown) 3850-5-11_00180(unknown) 3850-5-11_00633(unknown) + 3850-5-12_00418(unknown) 3850-5-12_00866(unknown) 3850-5-1 +_02619(unknown) 3850-5-1_03062(unknown) 3850-5-2_00294(unknown) + 3850-5-2_00766(unknown) 3850-5-3_00423(unknown) 3850-5-4_ +00262(unknown) 3850-5-4_00735(unknown) 3850-5-5_00414(unknown) + 3850-5-5_00812(unknown) 3850-5-6_00203(unknown) 3850-5-6_00661 +(unknown) 3850-5-7_00560(unknown) 3850-5-7_01369(unknown) +3850-5-8_00427(unknown) 3850-5-8_00876(unknown) 3850-5-9_00425( +unknown) 3850-5-9_01274(unknown) 3850-6-10_00404(unknown) 3850 +-6-10_00845(unknown) 3850-6-11_00699(unknown) 3850-6-12_003 +28(unknown) 3850-6-12_00756(unknown) 3850-6-1_00202(unknown) 3 +850-6-1_00663(unknown) 3850-6-2_00347(unknown) 3850-6-2_00820(u +nknown) 3850-6-3_00197(unknown) 3850-6-3_00202(unknown) 38 +50-6-4_00081(unknown) 3850-6-4_00618(unknown) 3850-6-5_00298(un +known) 3850-6-5_00740(unknown) 3850-6-6_00172(unknown) 3850-6- +6_00595(unknown) 3850-6-7_00228(unknown) 3850-6-7_00668(unknown +) 3850-6-8_00397(unknown) 3850-6-8_00834(unknown) 3850-6-9 +_00321(unknown) 3850-6-9_00760(unknown) 3850-7-10_00284(unknown +) 3850-7-10_00716(unknown) 3850-7-11_00317(unknown) 3850-7-11_ +00760(unknown) 3850-7-12_00313(unknown) 3850-7-1_00264(unkn +own) 3850-7-1_00693(unknown) 3850-7-2_00269(unknown) 3850-7-2_ +00688(unknown) 3850-7-3_00359(unknown) 3850-7-3_00812(unknown) + 3850-7-4_00274(unknown) 3850-7-4_00708(unknown) 3850-7-5_0 +0283(unknown) 3850-7-5_00734(unknown) 3850-7-6_00211(unknown) +3850-7-6_00646(unknown) 3850-7-7_00305(unknown) 3850-7-7_00760( +unknown) 3850-7-8_00446(unknown) 3850-7-8_00773(unknown) 3 +850-7-9_00251(unknown) 3850-7-9_00697(unknown) 3850-8-10_00351( +unknown) 3850-8-10_00807(unknown) 3850-8-11_00346(unknown) 
385 +0-8-11_00847(unknown) 3850-8-12_00310(unknown) 3850-8-12_00732( +unknown) 3850-8-1_00391(unknown) 3850-8-1_01185(unknown) 3 +850-8-2_00332(unknown) 3850-8-2_00668(unknown) 3850-8-3_00407(u +nknown) 3850-8-3_00853(unknown) 3850-8-4_00435(unknown) 38 +50-8-5_00300(unknown) 3850-8-5_00742(unknown) 3850-8-6_00303(un +known) 3850-8-6_00768(unknown) 3850-8-7_00339(unknown) 3850-8- +7_00744(unknown) 3850-8-8_00400(unknown) 3850-8-8_00837(unknown +) 3850-8-9_00458(unknown) 3850-8-9_00923(unknown)

The actual data has about 8000 clusters, i.e. numbered 0-8000.

Thanks

$new_guy

Re: split function problem
by hackman (Acolyte) on Feb 22, 2011 at 09:47 UTC
    I'm sorry that my answer is not Perl-based, but if the only thing you want to do is sort the lines based on the first column, here is a simpler solution for you:

    cat data-file.txt |sed 's/^Cluster//'|sort -k1 -n|sed 's/^\([0-9]\+\)/Cluster\1/' > sorted-data.txt

    Here is the output, with your test data pasted in a few times:

    $ cat dd |sed 's/^Cluster//'|sort -k1 -n|sed 's/^\([0-9]\+\)/Cluster\1/'| cut -c 1-20
    Cluster5 SP_1003(con
    Cluster5 SP_1003(con
    Cluster5 SP_1003(con
    Cluster6 SP_0917(pil
    Cluster6 SP_0917(pil
    Cluster6 SP_0917(pil
    Cluster7 spr_1379(A
    Cluster7 spr_1379(A
    Cluster7 spr_1379(A
    Cluster8 spr_0324(T
    Cluster8 spr_0324(T
    Cluster8 spr_0324(T
    Cluster9 SP_0733(hyp
    Cluster9 SP_0733(hyp
    Cluster9 SP_0733(hyp
    Cluster10 SP_0042(co
    Cluster10 SP_0042(co
    Cluster10 SP_0042(co
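
For readers who would rather stay in Perl, the same numeric ordering by cluster number can be sketched as below. This is only an illustration of the idea in the pipeline above, not part of the original reply; the helper names `cluster_number` and `sort_by_cluster` are made up for this example, and the sample rows are the truncated lines from the output above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the number that follows a leading "Cluster" label, or -1 if the
# line does not start with one (hypothetical helper for this sketch).
sub cluster_number {
    my ($line) = @_;
    return $line =~ /^Cluster(\d+)/ ? $1 : -1;
}

# Sort lines numerically by their cluster number (call in list context).
sub sort_by_cluster {
    my @lines = @_;
    return sort { cluster_number($a) <=> cluster_number($b) } @lines;
}

# Truncated sample rows, deliberately out of order.
my @sample = (
    "Cluster10 SP_0042(co\n",
    "Cluster9 SP_0733(hyp\n",
    "Cluster5 SP_1003(con\n",
);

my @sorted = sort_by_cluster(@sample);
print @sorted;
```

A plain string sort would put Cluster10 before Cluster5, which is why the number is extracted and compared with `<=>`.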
    I hope I have helped.
    One Planet, One Internet...
    We Are All Connected...
Re: split function problem
by moritz (Cardinal) on Feb 22, 2011 at 10:08 UTC

    Please read How (Not) To Ask A Question.

    You write

    The script I am using is below, I think the problem is at the split function (line 17). Is this right?

    Yet you don't tell us what problem the script as a whole has.

    When you have a suspicion where the problem might be, just add some debugging output. For example, Data::Dumper can help you to analyze data structures.

    use Data::Dumper;
    print Dumper \@chunks;

    For writing the script you surely must have a mental model of what should be inside @chunks. Looking at the debugging output will tell you if the actual data in that variable matches your mental model. That way you can answer the question yourself if the line with the split is your problem.
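
As a rough sketch of that debugging step, the script below splits each row and dumps the resulting list. The two sample rows are hypothetical (fields are assumed to be separated by two or more spaces, which is what the `/\s{2,}/` pattern in the question expects), and `split_row` is a name invented for this example. Note that it splits the row that was just read; the script in the question passes `<DATA>` to `split` instead, which in scalar context reads the following line, and a dump like this would probably make that visible.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical sample: fields assumed to be separated by two or more spaces.
my $text = "ClusterX  a_123(something)  b_675(some_other_thing)\n"
         . "ClusterY  b_6998(some_other_thing, thats new)  c_111(some other different thing)\n";

# Split one row into fields on runs of two or more whitespace characters,
# so single spaces inside the brackets are left alone.
sub split_row {
    my ($row) = @_;
    chomp $row;
    return split /\s{2,}/, $row;
}

# Read the sample through an in-memory filehandle, one row at a time.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
while (my $row = <$fh>) {
    my @chunks = split_row($row);   # split the row just read, not the next one
    print Dumper(\@chunks);
}
close $fh;
```

If the dumped list does not contain the cluster ID followed by one element per entry, the split (or the input passed to it) is where to look.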

    It seems you have been working on this problem for three weeks now, and haven't made much progress. Did you just wait for someone to solve your problem? Or are you truly at a dead end?

    If the latter is true, you should take some programming courses or maybe read one or more good books on programming.

    We'll happily help you to answer programming questions, but only if you also put effort into answering these questions yourself.

      Thanks. I am new to Perl and have been trying to use it daily for 7 months now! I did know about Data::Dumper
        Sorry, I meant I didn't know about Data::Dumper. I will also look into what you have recommended.
Re: split function problem
by GrandFather (Cardinal) on Feb 22, 2011 at 10:24 UTC

    I guess you are looking for something like this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use 5.010;

    my %data;

    # Parse data
    while (<DATA>) {
        chomp;
        my ($key, @chunks) = /(\w+ (?:\([^)]*\))?) (?: \s+ |$)/gx;

        next if !@chunks;

        foreach my $chunk (@chunks) {
            my ($prefix, $num, $tail) = $chunk =~ /^([a-z0-9]+)_(\d+)(.*)/i;

            $data{$key}{$prefix}{$num} = $tail;
        }
    }

    # Create lists of values
    for my $key (keys %data) {
        for my $prefix (keys %{$data{$key}}) {
            my @items = map {"${prefix}_$_$data{$key}{$prefix}{$_}"}
                sort {$a <=> $b} keys %{$data{$key}{$prefix}};

            $data{$key}{$prefix} = \@items;
        }
    }

    # Generate output
    for my $key (sort keys %data) {
        my @prefixes = sort keys %{$data{$key}};
        my @lines;
        my @colMax;

        while (1) {
            my @items = ($key, map {shift @{$data{$key}{$_}}} @prefixes);

            last if 1 >= grep {defined} @items;
            $_ //= ' -' for @items;
            push @lines, \@items;
        }

        for my $line (@lines) {
            for my $colIndex (0 .. $#$line) {
                my $itemWidth = length $line->[$colIndex];

                $colMax[$colIndex] = $itemWidth
                    if !$colMax[$colIndex] || $itemWidth > $colMax[$colIndex];
            }
        }

        for my $line (@lines) {
            $line->[$_] = sprintf '%-*s', $colMax[$_], $line->[$_]
                for 0 .. $#$line;
        }

        print "@$_\n" for @lines;
    }

    __DATA__
    ClusterX a_123(something) b_675(some_other_thing) b_234(something new) c_897(some different thing)
    ClusterY b_6998(some_other_thing, thats new) c_877797(something different inside here) c_111(some other different thing)
    ClusterZ a_1234(something interesting) a_123467(something - else thats is - interesting) 3850-1-2_12243(a new one) 3850-1-2_1789(another new one)

    Prints:

    ClusterX a_123(something) b_234(something new)    c_897(some different thing)
    ClusterX  -               b_675(some_other_thing)  -
    ClusterY b_6998(some_other_thing, thats new) c_111(some other different thing)
    ClusterY  -                                  c_877797(something different inside here)
    ClusterZ 2_1789(another new one) a_1234(something interesting)
    ClusterZ 2_12243(a new one)      a_123467(something - else thats is - interesting)

    although the final "empty" ClusterZ line is not generated.
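
In the output above, the `3850-1-2_` entries come out as `2_...`, because `\w+` does not match hyphens. The question defines the prefix as everything before the underscore, so one possible adjustment is sketched below. It is not part of the original reply: the sub names `tokenize_row` and `parse_entry` are made up for this example, the prefix is assumed to contain only letters, digits and hyphens, and the annotations are assumed to have no nested parentheses (some entries in the full data set do, and those would need a more careful pattern).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# Break a row into the cluster ID and its entries, keeping each bracketed
# annotation attached to its entry. Assumes no nested parentheses.
# Call in list context.
sub tokenize_row {
    my ($row) = @_;
    return $row =~ /([\w-]+(?:\([^)]*\))?)(?:\s+|$)/g;
}

# Split one entry into prefix, number and annotation. The prefix is taken
# to be everything before the first underscore (letters, digits, hyphens).
# Returns an empty list if the entry does not have that shape.
sub parse_entry {
    my ($entry) = @_;
    my ($prefix, $num, $tail) = $entry =~ /^([A-Za-z0-9-]+)_(\d+)(\(.*\))?$/
        or return;
    return ($prefix, $num, $tail // '');
}

my $row = 'ClusterZ a_1234(something interesting) 3850-1-2_12243(a new one)';
my ($key, @entries) = tokenize_row($row);

for my $entry (@entries) {
    my ($prefix, $num, $tail) = parse_entry($entry);
    print "$key | $prefix | $num | $tail\n";
}
```

With a prefix extracted this way, the hyphenated entries would group under `3850-1-2` rather than under `2`.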

    True laziness is hard work
