http://www.perlmonks.org?node_id=885470

$new_guy has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

I have a question about re-organising entries. Currently ,I have files with entries that look like this:

1 SP_85(IS33, qqq), SP_155(IS33eee) spr_111(ISyyy33, qqq), spr_1 +71(IS33eee) 2 SP_83(S3 , jgjg), SP_32(IS33, jhdjdjd) spr_113(Stty3 , jgjg), +spr_1881(IS33, jhdjdjd) 3 SP_78(3jmdsjkdej), SP_66(IShbdhdhd33) spr_115(3jmhhggggdsjkdej +), spr_1551(IShbdhdjjjhd33), spr_88881(Iyt33ff), spr_145411(Iddd3ff) 4 SP_77(3jmdsjkdej), SP_1485(Idhd33ff) spr_116(3jmdhhhhhsjkdej), + spr_17781(Idhdhhtytyt33ff)

How do I re-organise them so that all entries with a similar name will be in the same column

Here is my desired output:

1 SP_85(IS33, qqq) spr_111(ISyyy33, qqq) 1 SP_155(IS33eee) spr_171(IS33eee) 2 SP_83(S3 , jgjg) spr_113(Stty3 , jgjg) 2 SP_32(IS33, jhdjdjd) spr_1881(IS33, jhdjdjd) 3 SP_66(IShbdhdhd33) spr_115(3jmhhggggdsjkdej) 3 SP_66(IShbdhdhd33) spr_1551(IShbdhdjjjhd33) 3 - spr_88881(Iyt33ff) 3 - spr_145411(Iddd3ff) 4 SP_77(3jmdsjkdej) spr_116(3jmdhhhhhsjkdej) 4 SP_1485(Idhd33ff) spr_17781(Idhdhhtytyt33ff)

Please bear in mind that I have hundreds of columns and thousands of rows

The code i have tried implementing so far:

my $columnFile = q{re-organized.txt}; open my $columnFH, q{<}, $columnFile or die qq{open: < $columnFile: $!\n}; my @results; while ( <$columnFH> ) { my @cols = split; foreach my $idx ( 0 .. $#tests ) { foreach my $subidx ( 0 .. $#{ $tests[ $idx ] } ) { my @posns = split m{,}, $tests[ $idx ]->[ $subidx ]; $results[ $idx ]->[ $subidx ] ++ if scalar @posns == grep { q{z} eq $cols[ $_ ] } @posns; } } } close $columnFH or die qq{close: < $columnFile: $!\n};

Thanks,

$new_guy

Replies are listed 'Best First'.
Re: Re-organising entries
by moritz (Cardinal) on Feb 01, 2011 at 12:09 UTC

    Read the file one line at a time. For each line, split it into separate records, rearrange the records, and print them out.

    Which part exactly are you having problems with? What have you tried so far?

      Hi I have updated my post and I show what i have tried so far. However, it prints each entry into its own column and I end up with only one very long row!
Re: Re-organising entries
by Limbic~Region (Chancellor) on Feb 01, 2011 at 17:02 UTC
    $new_guy,
    This appears to do what you want but your requirements are not very clear so I had to guess.
    #!/usr/bin/perl use strict; use warnings; my $file = $ARGV[0] or die "Usage: $0 <input>"; open(my $fh, '<', $file) or die "Unable to open '$file' for reading: $ +!"; while (<$fh>) { chomp; my ($num, @data) = split m|,?\s+(?=sp)|i; die "Need rules for handling odd elements" if @data % 2; for (my $idx = 0; $idx < @data - 3; $idx += 4) { print join("\t", $num, @data[$idx, $idx + 2]), "\n"; print join("\t", $num, @data[$idx + 1, $idx + 3]), "\n"; } if (@data % 4) { print join("\t", $num, "-", $data[-2]), "\n"; print join("\t", $num, "-", $data[-1]), "\n"; } }

    Cheers - L~R

Re: Re-organising entries
by moritz (Cardinal) on Feb 01, 2011 at 17:04 UTC
    Since some of your records contain spaces, a simple call to split with no regex doesn'T do any good. Also you need to special-case the number in the first column, since you want to repeat it.

    Here's my approach (reading from DATA instead of a file handle for convenience):

    use strict; use warnings; while (<DATA>) { chomp; my ($id, @records) = split /\t|(?<=\)),\s+/, $_; my (@left, @right); for my $r (@records) { if ($r =~ /^SP_/) { push @left, $r; } else { push @right, $r; } } while (@left || @right) { print $id, "\t", (shift(@left) || ' - '), ', ', (shift(@right) || ' - '), "\n"; } } __DATA__ 1 SP_85(IS33, qqq), SP_155(IS33eee) spr_111(ISyyy33, qqq), spr_1 +71(IS33eee) 2 SP_83(S3 , jgjg), SP_32(IS33, jhdjdjd) spr_113(Stty3 , jgjg), +spr_1881(IS33, jhdjdjd) 3 SP_78(3jmdsjkdej), SP_66(IShbdhdhd33) spr_115(3jmhhggggdsjkdej +), spr_1551(IShbdhdjjjhd33), spr_88881(Iyt33ff), spr_145411(Iddd3ff) 4 SP_77(3jmdsjkdej), SP_1485(Idhd33ff) spr_116(3jmdhhhhhsjkdej), + spr_17781(Idhdhhtytyt33ff)

      Hi Moritz and Limbic_Region,

      Am afraid your scripts don't work very well. They only generate 2 coulumns and the contents/entries in the columns don't have the same prefix. If you look at my scratchpad I have put the first seven rows (of which there are about 8000 rows in my .txt file). Note that each row starts with Cluster(\d+) i.e. the word "Cluster" followed by a number eg 1,2,3 etc.

      The code I have so far come up with is:

      #!usr/bin/perl -w use warnings; use strict; use List::Util 'max'; # Read in the file my $FILENAME3 = "clusters3.txt"; open(DATA, $FILENAME3); #create arrays and hashes to store stuff my (%data, %all, @keys); while (<DATA>) { # avoid \n on last field chomp; #split the data into chunks my @chunks = split; #create keys for the chunks my $key = shift @chunks; #store the keys in an array unless they already exist push @keys, $key unless exists $data{$key}; foreach my $chunk (@chunks) { #return references using hashes $data{$key}{$chunk}++; #add all chunks to the hash '%all' $all{$chunk} = 1; } #now make a file for the ouput my $outputfile = "new_cluster.txt"; if (! open(POS, ">>$outputfile") ) { print "Cannot open file \"$outputfile\" to write to!!\n\n" +; exit; } #sort the fields/columns keys and save them as an array #my @fields = sort {$a <=> $b} keys %all; my @fields = sort {lc($a) cmp lc($b)} keys %all; ##<--this sorting did +n't work #find the longest entry in an array #my $longest = max map {length} @fields; my $longest = max map {scalar grep $_=~ /\(\d+\)\_\(\d+\)\_\(\d+\)\_/, + @fields} @fields; #the line I think has a problem! #organise the data foreach my $key (@keys) { while (keys %{$data{$key}}) { print POS $key, " "; foreach my $field (@fields) { if ($data{$key}{$field}){ printf POS "%${longest}s ", $field; delete $data{$key}{$field} unless --$data{$key}{$field +}; } else { printf POS "%${longest}s ", "-"; } } print POS"\n"; }}}

      In the code cluster3.txt is my .txt file But it spits out rubbish

      Is it possible to have for each entry in each row arranged tidyly in columns

      Generally the prefixes are separated by an underscore for this beginning with letter, except 'spr, HMPREF, and pseudoSPN23F(which is also exactly similar or should be in the same column as SPN23F)'

      For this beginning with digits/numbers. The prefix is from the beginning to the last underscore e.g. 3850_1_7_ and 3850_1_8_ .

      Thanks

      $new_guy