PerlMonks  

Perl - Remove duplicate based on substring and check on delimiters

by bopibopi (Initiate)
on May 18, 2016 at 20:40 UTC ( #1163377=perlquestion )
bopibopi has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have the following input file:

    1212123x534534534534xx4545454x232322xx
    0901001x876879878787xx0909918x212245xx
    1212123x534534534534xx4545454x232323xx
    1212133x534534534534xx4549454x232322xx
    4352342xx23232xxx345545x45454x23232xxx

The fields are delimited with x, and the column of interest is 0-7 (see below). I'm trying to write a script that reads each line, counts the x's, and compares the count against a set number; if the count is != the set number, I want the line written to fh1 (output.control). Then it'll check a specific substring on each line and print only the first line encountered for each value (remove duplicates but maintain order).

The code I have so far is:

    #!/usr/bin/perl
    use strict;
    # use warnings qw/ all FATAL /;

    my %seen;
    my $delimiter = 'x';
    my $delim_amnt_per_line = 5;

    open(my $fh1, ">>", "outputcontrol.txt");
    open(my $fh2, ">>", "outputoutput.txt");

    while ( <> ) {
        my $count = ($_ =~ y/x//);
        print "$count \n";
        # print $_;
        if ( $count != $delim_amnt_per_line ) {
            print fh1 $_;
        }
        my ($prefix) = substr $_, 0, 7;
        next if $seen{$prefix}++;
        print $fh2;
    }

My problem is that it doesn't print anything to either file, whereas if it was just a plain print and I redirected the script's output from the command line, it would output as normal. Can someone help me?

EDIT: I think I've located the problem. Neither of the produced files had write permission. They were just set to read; is there a way to change this from inside the code?

Replies are listed 'Best First'.
Re: Perl - Remove duplicate based on substring and check on delimiters
by haukex (Abbot) on May 18, 2016 at 21:18 UTC

    Hi bopibopi,

    Your code has a couple of issues: You don't check your open calls for errors (open(...) or die $!;), you seem to have a typo on the line print fh1 $_; (should be $fh1), and there's a closing brace missing (a copy/paste mistake I assume) (apparently fixed by ninja edit... It is uncool to update a node in a way that renders replies confusing or meaningless). Also, print $fh2; prints the filehandle to standard output, if you want to print the current line to $fh2 you have to be explicit: print $fh2 $_;
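
    Putting those fixes together, a minimal sketch might look like the following. The sub name filter_lines and its parameters are assumptions for illustration, not code from the thread; it takes already-opened handles so that the open(...) or die checks stay with the caller.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): reads lines from $in, writes
# every line whose 'x' count differs from $expected to $control_fh, and
# writes the first line seen for each 7-character prefix to $out_fh.
sub filter_lines {
    my ($in, $control_fh, $out_fh, $expected) = @_;
    my %seen;
    while (my $line = <$in>) {
        my $count = ($line =~ tr/x//);      # tr/// counts matches, changes nothing
        print {$control_fh} $line if $count != $expected;

        my $prefix = substr $line, 0, 7;    # fixed-width duplicate key
        next if $seen{$prefix}++;           # skip later lines with the same key
        print {$out_fh} $line;              # explicit handle AND explicit data
    }
    return;
}
```

    It could be called as filter_lines(\*STDIN, $fh1, $fh2, 5) once both output handles have been opened with or die. As in the original loop, a line with the wrong delimiter count is still passed on to the duplicate check.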

    They were just set on read, is there a way to change this from inside the code?

    I'd recommend you don't, because write-protection is supposed to be exactly that! Someone someday (including you) might set a file to read-only for a good reason and the script would clobber it anyways. I recommend you output a descriptive error message instead, e.g. die "I can't write to $filename\n" unless -w $filename; (see -X). But if you must ("just enough rope" and all that), there's chmod.
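
    As a rough sketch of that advice, the check and the open can be wrapped together. The sub name open_for_append is hypothetical, and the -e test is an added assumption so that a file which does not exist yet can still be created.

```perl
use strict;
use warnings;

# Hypothetical helper: open $filename for appending, but refuse with a
# descriptive message if the file already exists and is not writable.
sub open_for_append {
    my ($filename) = @_;
    die "I can't write to $filename\n"
        if -e $filename && !-w $filename;
    open(my $fh, '>>', $filename)
        or die "Cannot open $filename for appending: $!\n";
    return $fh;
}
```

    Note that -w reports on the effective user, so a privileged user may see a read-only file as writable; the or die on open still catches any real failure.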

    Hope this helps,
    -- Hauke D

    P.S. Just saw stevieb and linuxer were a little faster than me on several points :-)

Re: Perl - Remove duplicate based on substring and check on delimiters
by stevieb (Abbot) on May 18, 2016 at 21:07 UTC

    Always, *always* check to ensure your file actually opened properly:

        open my $fh1, ">>", "outputcontrol.txt" or die $!;
        open my $fh2, ">>", "outputoutput.txt" or die $!;

    I don't know if that's the issue, but it's definitely the first thing to try.

Re: Perl - Remove duplicate based on substring and check on delimiters
by linuxer (Curate) on May 18, 2016 at 21:08 UTC

    You can use chmod to change a file's permissions, provided you have sufficient permissions to chmod the file.

    You should check open's success, so you know directly if open was successful or not and you can behave accordingly.

        open(my $handle, '>>', $filename) or die "open($filename,w+) failed: $!";
    edit: fixed typo
Re: Perl - Remove duplicate based on substring and check on delimiters
by Marshall (Abbot) on May 18, 2016 at 23:39 UTC
    Another way without using substr (which is actually seldom used in Perl) is to use split, like a simple CSV file would be parsed, except with 'x' instead of ','.

        #!/usr/bin/perl
        use warnings;
        use strict;
        use Data::Dumper;

        while (my $line =<DATA>)
        {
            chomp $line;
            print "line = $line\n";
            my $tokens =(my $first, my @rest)= split 'x',$line,-1;
            print "num tokens is: $tokens\n";
            print Dumper $first, \@rest;
            print "\n";
        }
        =prints
        line = 1212123x534534534534xx4545454x232322xx
        num tokens is: 7
        $VAR1 = '1212123';
        $VAR2 = [ '534534534534', '', '4545454', '232322', '', '' ];

        line = 0901001x876879878787xx0909918x212245xx
        num tokens is: 7
        $VAR1 = '0901001';
        $VAR2 = [ '876879878787', '', '0909918', '212245', '', '' ];

        line = 1212123x534534534534xx4545454x232323xx
        num tokens is: 7
        $VAR1 = '1212123';
        $VAR2 = [ '534534534534', '', '4545454', '232323', '', '' ];

        line = 1212133x534534534534xx4549454x232322xx
        num tokens is: 7
        $VAR1 = '1212133';
        $VAR2 = [ '534534534534', '', '4549454', '232322', '', '' ];

        line = 4352342xx23232xxx345545x45454x23232xxx
        num tokens is: 11
        $VAR1 = '4352342';
        $VAR2 = [ '', '23232', '', '', '345545', '45454', '23232', '', '', '' ];
        =cut
        __DATA__
        1212123x534534534534xx4545454x232322xx
        0901001x876879878787xx0909918x212245xx
        1212123x534534534534xx4545454x232323xx
        1212133x534534534534xx4549454x232322xx
        4352342xx23232xxx345545x45454x23232xxx

      That gives an off-by-one  $tokens value (it's actually counting the stuff "around" the tokens (update: and it requires creation of an otherwise unused array to hold most of that stuff)), but that's easy to fix:

          c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
          "my $t = 'x';
           ;;
           for my $line (qw(
               1212123x534534534534xx4545454x232322xx
               0901001x876879878787xx0909918x212245xx
               1212123x534534534534xx4545454x232323xx
               1212133x534534534534xx4549454x232322xx
               4352342xx23232xxx345545x45454x23232xxx
               )) {
             my $tokens = my ($first, @rest) = split $t, $line, -1;
             $tokens -= 1;
             print qq{'$line': num '$t' tokens is: $tokens};
             dd ($first, \@rest);
             }
           "
          '1212123x534534534534xx4545454x232322xx': num 'x' tokens is: 6
          (1212123, [534534534534, "", 4545454, 232322, "", ""])
          '0901001x876879878787xx0909918x212245xx': num 'x' tokens is: 6
          ("0901001", [876879878787, "", "0909918", 212245, "", ""])
          '1212123x534534534534xx4545454x232323xx': num 'x' tokens is: 6
          (1212123, [534534534534, "", 4545454, 232323, "", ""])
          '1212133x534534534534xx4549454x232322xx': num 'x' tokens is: 6
          (1212133, [534534534534, "", 4549454, 232322, "", ""])
          '4352342xx23232xxx345545x45454x23232xxx': num 'x' tokens is: 10
          (
            4352342,
            ["", 23232, "", "", 345545, 45454, 23232, "", "", ""],
          )
      (But I don't really see anything wrong with using good old  tr/// for counting and poor old substr for fixed-field extraction.)

      Update: This gets rid of  @rest and the  $tokens -= 1; statement for all you one-liner addicts out there:
          my $tokens = (my ($first) = split $t, $line, -1) - 1;


      Give a man a fish:  <%-{-{-{-<

        I think we are splitting hairs here. I count $first as the first token, you don't. Or you figure that the final empty token shouldn't be counted? Either way not a significant problem in my mind.

        Yes, tr is the fastest and best way to do a simple count of the x's. And yes, substr is the fastest way to get a fixed length thing at the beginning. The reason that I demo'd split was to show: a)how to get a non-fixed length thing at the beginning, b)how to access some of these other length "between the x's" fields. I'm sure that they have some meaning.

        Update: I almost never use the -1 limit on split. I saw an opportunity to play with this and remind myself of how it worked. Once I had done that, I impulsively posted my "play". Wasn't meant to be "earth shattering" stuff, just an example of a not so common usage that is often forgotten.

      without using substr (which is actually seldom used in Perl)

      Surely, you jest?!?!

      Cheers,

      JohnGG

        Sorry for the controversy - not my intent. I should've said something different or omitted that entirely.

        I use Perl often to process all kinds of text reports. By far and away, the most common tools that I use are: a)split and b)match global combined with c) regex. In my typical application, speed doesn't matter, but flexibility does. It is very seldom that I encounter a fixed column report where substr would be appropriate.

        That doesn't mean that I don't use substr, just that in my personal experience, with the types of text reports that I process, it doesn't come up. Mileage varies! Processing a binary header, say like that found in a .WAV file, is a whole different critter; substr is definitely the right tool for that job. I am talking about text reports.

        Just yesterday, a file that I've been processing since 2011 changed its format. Oops. The same info is there, but it got moved around. The 2016 format is different and I have no control over that change. But this change was easy for me to adapt to and was something like this: (split ' ',$line)[1,7,3] to (split ' ',$line)[1,4,-2]. If I had used substr(), then this would have been a bigger deal. Changing something that has been working for 5 years comes up all the time. Such is the nature of using ad hoc methods to parse reports that you have no control over.
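
        To illustrate the kind of change described above, here is a small sketch of a list slice on split. The sub name pick_fields and the sample line are made up for the example; only the (split ' ', $line)[...] idiom comes from the post.

```perl
use strict;
use warnings;

# Hypothetical helper: return the whitespace-separated fields of $line
# at the given positions. Negative positions count from the end.
sub pick_fields {
    my ($line, @indexes) = @_;
    return (split ' ', $line)[@indexes];
}

my $line = 'alpha beta gamma delta epsilon zeta';   # made-up sample data
my @old = pick_fields($line, 1, 3);     # ('beta', 'delta')
my @new = pick_fields($line, 1, -2);    # ('beta', 'epsilon')
print "@old\n@new\n";
```

        When a report's layout shifts, only the index list has to change, which is the flexibility being argued for here.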
