PerlMonks  

File Manipulation - Need Advise!

by nashkab (Novice)
on Jan 03, 2008 at 17:38 UTC (#660265=perlquestion)
nashkab has asked for the wisdom of the Perl Monks concerning the following question:

I Have a file which looks like this:-
    COMPUTER DISTRIBUTION_ID STATUS
    30F-WKS `1781183799.xxxx1' IC---
    30F-WKS `1781183799.xxx11' IC---
    ADM34A3F9 `1781183799.41455' IC---

If two records exist for the same computer, I want to keep the final record and remove the first one. For example, computer 30F-WKS has two records. I want to remove the first record and keep the second.

I have used the following code:-

    open(FILE2,">file1.txt")|| warn "Could not open\n";
    open(FILE3,"file2.txt")|| warn "Could not open\n";
    my $Previous = "";
    my @data = <FILE3>;
    $index=0;
    foreach $_data (@data) {
        $index++;
        chomp ($_data);
        @Current = split(/\t/, $_data);
        @Previous = split(/\t/, $Previous);
        if (@Current[0] ne @Previous[0]) {
            if ($index == 1) {
                # do nothing.
            }
            else {
                print FILE2 $Previous;
            }
        }
        else {}
        $Previous = $_data;
    }
    close(FILE2);
    close(FILE3);

So the output file will look like this:-

    COMPUTER DISTRIBUTION_ID STATUS
    30F-WKS `1781183799.xxx11' IC---
    ADM34A3F9 `1781183799.41455' IC---

Re: File Manipulation - Need Advise!
by Old_Gray_Bear (Bishop) on Jan 03, 2008 at 17:46 UTC
    Whenever you want the unique members of a data-set, think about using a hash, keyed from the field you want to be unique. Once you have cycled through your input, print the keys from the hash and you're done.

    ----
    I Go Back to Sleep, Now.

    OGB

      Workout of Old Gray Bear's idea:
          my %data;
          my $header = <>;   # first line
          while(<>) {
              my($key) = split /\t/;
              $data{$key} = $_;
          }
          # output:
          print $header;
          foreach my $key (sort keys %data) {
              print $data{$key};
          }
      To use it as is, call the script with "file2.txt" as parameter on the command line, and redirect the script's STDOUT to "file1.txt".
      perl thescript.pl file2.txt >file1.txt
        file1.txt output is the following:-
            COMPUTER DISTRIBUTION_ID STATUS
            30F-WKS `1781183799.xxx11' IC---
            30F-WKS `1781183799.xxxx1' IC---
            ADM34A3F9 `1781183799.41455' IC---
        I want
            COMPUTER DISTRIBUTION_ID STATUS
            30F-WKS `1781183799.xxx11' IC---
            ADM34A3F9 `1781183799.41455' IC---
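
The duplicate lines in that output are consistent with the file being separated by spaces rather than tabs: when a line contains no tab, `split /\t/` returns the whole line as the key, so every line counts as unique. A minimal sketch of the same hash idea that splits on any whitespace instead, assuming the computer name is always the first field. The helper name `dedup_last` and the inlined sample lines are only for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep the last record seen for each key (the first
# whitespace-delimited field). The header line is passed through untouched.
sub dedup_last {
    my ($header, @lines) = @_;
    my %data;
    for my $line (@lines) {
        # split ' ' splits on any run of spaces or tabs and ignores
        # leading whitespace.
        my ($key) = split ' ', $line;
        next unless defined $key;    # skip blank lines
        $data{$key} = $line;         # later records overwrite earlier ones
    }
    return ($header, map { $data{$_} } sort keys %data);
}

# Sample input copied from the question (space-separated).
my @input = (
    "COMPUTER DISTRIBUTION_ID STATUS\n",
    "30F-WKS `1781183799.xxxx1' IC---\n",
    "30F-WKS `1781183799.xxx11' IC---\n",
    "ADM34A3F9 `1781183799.41455' IC---\n",
);
my @result = dedup_last(@input);
print @result;
```

With the sample lines above this prints the header followed by one record per computer, the `xxx11` record for 30F-WKS among them. Reading from `<>` instead of the inlined list would let it be called the same way as the script above.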
      > Whenever you want the unique members of a data-set, think about using a hash
      When you want the pairwise unique members of a serial set, think about a state variable.

      If you need unique across an entire set, no question that hashes are most useful. Problem, though, is that you have to then store all the keys.

      It is not uncommon to want to dedup when there are successive runs (think unix's 'uniq'). That's when this second class comes into play. Set a state variable, and read one line at a time. You may have to keep around the previous line or two to compute your state. You may have to do some work at the end to clean up stored lines.

      my $thisKey;
      my $lastLine = <>;
      my $lastKey = '';   # first line is header, so always print
      while (<>) {
          if (/(.*?)\t.*/) {
              $thisKey = $1
          } else {
              warn "bad data: $_ had no tab\n";
          }
          if ($thisKey ne $lastKey) {
              print $lastLine;
          }
          $lastLine = $_;
          $lastKey = $thisKey;
      }
      print $lastLine;
      This is a big win when you have millions and millions of entries to sift through.
Re: File Manipulation - Need Advise!
by jrsimmon (Hermit) on Jan 03, 2008 at 18:13 UTC
    You need a hash. Something like this should work:
        use strict;
        use warnings;

        open(SOURCE,"test.txt")|| warn "Could not open\n";
        my @data = <SOURCE>; #just fyi -- slurping is dangerous on very large files
        close(SOURCE);
        my %filtered_data;
        foreach my $line_of_data (@data){
            my @split_data_values = split(/\s/,$line_of_data);
            my $computer_name = shift(@split_data_values);
            $filtered_data{$computer_name} = "@split_data_values";
        }
        open(RESULT,">test.out")|| warn "Could not open\n";
        foreach my $computer (keys(%filtered_data)){
            print RESULT "$computer $filtered_data{$computer}\n";
        }
        close(RESULT);
Re: File Manipulation - Need Advise!
by Codon (Friar) on Jan 03, 2008 at 18:57 UTC
    You didn't mention this, but if order matters in some way, you might want two data structures: one to make the output unique (a hash) and one to maintain ordering (an array). I don't know if you have seen this with the previous examples, but the header line could get mixed into the file somewhere (randomly, thanks to the hashing algorithm) unless handled separately. Alternatively, I provide this quick example:
        #!/usr/bin/perl
        use strict;
        use warnings;

        my @order;
        my %data;
        while (<DATA>) {
            my ($key,$value) = split /\t/, $_, 2;
            push @order, $key;
            $data{$key} = $value;
        }
        for my $key (@order) {
            printf("%s\t%s", $key, delete $data{$key}) if ($data{$key});
        }
        __DATA__
        COMPUTER DISTRIBUTION_ID STATUS
        30F-WKS `1781183799.xxxx1' IC---
        30F-WKS `1781183799.xxx11' IC---
        ADM34A3F9 `1781183799.41455' IC---
    Ivan Heffner
    Sr. Software Engineer
    WhitePages.com, Inc.
Re: File Manipulation - Need Advise!
by blue_cowdawg (Prior) on Jan 03, 2008 at 18:58 UTC

    Dear Monk,
    At the risk of being repetitive, since you've already gotten advice on this subject, let me try to make things clearer for you.

    Consider the following code:

        #!/usr/bin/perl -w
        use strict;
        my %storage=();
        # untested, but should in theory work...
        my $junk=<DATA>; #get rid of header
        while(my $line=<DATA>){
            chomp($line); # Get rid of newline
            my ($host,$dist_id,$status)=split(/[\s\n\t]+/,$line); # split on any whitespace
            $storage{$host} = {
                host    => $host,
                dist_id => $dist_id,
                status  => $status
            }; # Put this into a hash keyed on the host name
        }
        # We never removed the new line character from $junk so...
        print $junk; # We reclaim this from the trash can
        foreach my $key(sort keys %storage){
            #
            # Print the remaining record matching the keys
            printf "%s\t%s\t%s\n",$storage{$key}->{host},$storage{$key}->{dist_id},$storage{$key}->{status};
        }
        exit(0);
        __END__
        COMPUTER DISTRIBUTION_ID STATUS
        30F-WKS `1781183799.xxxx1' IC---
        30F-WKS `1781183799.xxx11' IC---
        ADM34A3F9 `1781183799.41455' IC---

    This works because each record you read for a host overwrites any earlier record for that host in the hash %storage, so the last record in your data set is the one that gets output. Since you said in the CB that you don't know whether your fields are space or tab separated, I covered both cases by using the regex /[\s\n\t]+/ in the split call.
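
A side note on that pattern: in Perl, `\s` already matches tabs and newlines, so `/\s+/` should behave the same as `/[\s\n\t]+/` on this kind of data. A small sketch to check that, using two made-up sample strings (one space-separated, one tab-separated):

```perl
use strict;
use warnings;

# Illustrative records only: the same fields, separated by spaces
# in the first string and by tabs in the second.
my @samples = (
    "30F-WKS `1781183799.xxx11' IC---",
    "30F-WKS\t`1781183799.xxx11'\tIC---",
);

my @parsed;
for my $line (@samples) {
    my @long  = split /[\s\n\t]+/, $line;   # pattern used in the post above
    my @short = split /\s+/,       $line;   # \s alone covers \t and \n
    push @parsed, [ \@long, \@short ];
    print join('|', @short), "\n";
}
```

Both patterns yield the same three fields for either separator, so the shorter one is enough here.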

    Hope this helps


    Peter L. Berghold -- Unix Professional
    Peter -at- Berghold -dot- Net; AOL IM redcowdawg Yahoo IM: blue_cowdawg
Re: File Manipulation - Need Advise!
by ysth (Canon) on Jan 03, 2008 at 19:13 UTC
    You do not need a hash. In fact, your original code is very close to doing what you want. Just a few tweaks:
        #!/usr/bin/perl
        use strict;
        use warnings;

        open(FILE2,">file1.txt")|| warn "Could not open\n";
        open(FILE3,"file2.txt")|| warn "Could not open\n";
        my $Previous = "";
        my @data = <FILE3>;
        my $index=0;
        foreach my $_data (@data) {
            $index++;
            chomp ($_data);
            my @Current = split(/\s+/, $_data);
            if ($index == 1) {
                # do nothing.
            }
            else {
                my @Previous = split(/\s+/, $Previous);
                if ($Current[0] ne $Previous[0]) {
                    print FILE2 $Previous, "\n";
                }
            }
            $Previous = $_data;
        }
        if ($Previous) {
            print FILE2 $Previous, "\n";
        }
        close(FILE2);
        close(FILE3);
    I made the following changes:
    • Added strict and warnings. Corrected @foo[0] to $foo[0] (see the perldiag entry for the warning it gave before), and declared variables.
    • Moved the $index check so $Previous isn't used unless it's been set.
    • Changed to split on whitespace, not tabs (since the data you provided didn't have tabs).
    • Added newlines to what's written out.
    • Added a block after the loop to print the final line that had been saved.
      This solution (as the one from the original post) may not work properly unless all hostname entries were previously sorted. However, by using a hash you can deal with an unsorted list of hostnames.
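
To illustrate the difference, here is a small sketch that runs both approaches over the same unsorted input. The records and the helper names `dedup_adjacent` and `dedup_hash` are made up for this example, and the hash version keeps first-seen order, which is one possible choice rather than anything the posts above require:

```perl
use strict;
use warnings;

# Unsorted sample: host "A" appears in two separate runs (made-up data).
my @records = ("A 1", "B 2", "A 3");

# Adjacent-run approach: emit the previous record whenever the key changes.
# Duplicates are only caught when they sit next to each other.
sub dedup_adjacent {
    my @out;
    my ($prev_line, $prev_key);
    for my $line (@_) {
        my ($key) = split ' ', $line;
        push @out, $prev_line if defined $prev_key && $key ne $prev_key;
        ($prev_line, $prev_key) = ($line, $key);
    }
    push @out, $prev_line if defined $prev_line;   # flush the saved line
    return @out;
}

# Hash approach: the last record per key wins, wherever it appears.
# @order remembers the order in which keys were first seen.
sub dedup_hash {
    my (%last, @order);
    for my $line (@_) {
        my ($key) = split ' ', $line;
        push @order, $key unless exists $last{$key};
        $last{$key} = $line;
    }
    return map { $last{$_} } @order;
}

my @adjacent = dedup_adjacent(@records);
my @hashed   = dedup_hash(@records);
print "adjacent: @adjacent\n";
print "hash:     @hashed\n";
```

On this input the adjacent-run version keeps both "A" records because they are not neighbours, while the hash version keeps only the last one.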
Re: File Manipulation - Need Advise!
by dwm042 (Priest) on Jan 03, 2008 at 22:41 UTC
    This is an easy problem to solve (and you can sort multiple ways with the same code) if you will use a hash (I'll note the link shows yet another way to do this kind of operation).

        #!/usr/bin/perl
        use warnings;
        use strict;
        use Getopt::Long;
        use Pod::Usage;

        =head1 NAME

        unique.pl -- examines data and keeps the unique ones.

        =head1 SYNOPSIS

        unique.pl [options]

          Options:
            --help    Brief help message
            --man     Full documentation
            --first   Keep the first one found rather than the last.

        =head1 DESCRIPTION

        unique.pl -- examines data and keeps the unique ones. Program can
        keep the first or the last one found.

        =cut

        my $help  = 0;
        my $man   = 0;
        my $first = 0;

        GetOptions(
            'help|?' => \$help,
            man      => \$man,
            first    => \$first,
        ) or pod2usage(2);
        pod2usage( -exitval => 0, -verbose => 1 ) if $help;
        pod2usage( -exitval => 0, -verbose => 2, -noperldoc => 1 ) if $man;

        my %hash = ();
        while(<DATA>) {
            chomp;
            my ( $comp, $id, $status ) = split ( /\s+/, $_, 3 );
            next if ( $comp =~ m/COMPUTER/ );
            if ( $first ) {
                next if ( defined( $hash{$comp} ) );
            }
            $hash{$comp} = [ $id, $status ];
        }
        for ( sort keys %hash ) {
            printf "%s %s %s\n", $_, $hash{$_}->[0], $hash{$_}->[1];
        }
        __DATA__
        COMPUTER DISTRIBUTION_ID STATUS
        30F-WKS `1781183799.xxxx1' IC---
        30F-WKS `1781183799.xxx11' IC---
        ADM34A3F9 `1781183799.41455' IC---
    and the results are:

        C:\Code>perl unique.pl --help
        Usage:
            unique.pl [options]

              Options:
                --help    Brief help message
                --man     Full documentation
                --first   Keep the first one found rather than the last.

        C:\Code>perl unique.pl
        30F-WKS `1781183799.xxx11' IC---
        ADM34A3F9 `1781183799.41455' IC---

        C:\Code>perl unique.pl --first
        30F-WKS `1781183799.xxxx1' IC---
        ADM34A3F9 `1781183799.41455' IC---
