zarath has asked for the wisdom of the Perl Monks concerning the following question:

Hello everybody!

Not yet very experienced with Perl, but getting there at a slow and steady pace.

Working on a new code that needs to do the following:

Read $input (.csv), extract the field that comes after the first comma and print each found value exactly once (no more, no less).

It is that last part that I'm having trouble with, I have tried different things, but I will post 2 pretty different versions of what I have tried and what I thought would work, but alas.

An extract of $input:

AgreementId;SalesId;PriceComponentId;ProductTermsId;FromDate;ToDate;Tr +ansType;FixedPrepaym;NoteType C000004923;VK11070778;Delta;;16/08/2017;15/09/2017;Prepayment;Yes;Addi +tional note C000004923;VK11070778;Rounding;;16/08/2017;15/09/2017;Prepayment;Yes;A +dditional note C000004924;VK11070778;Delta Gas;;16/08/2017;15/09/2017;Prepayment;Yes; +Additional note C000858948;VK11070783;Delta;;3/01/2017;2/02/2017;Prepayment;Yes;Additi +onal note C001028127;VK11070844;Delta;;1/07/2017;31/07/2017;Prepayment;Yes;Addit +ional note C000863388;VK11070869;Delta;;14/03/2016;13/04/2016;Prepayment;Yes;Addi +tional note C000863388;VK11070869;Rounding;;14/03/2016;13/04/2016;Prepayment;Yes;A +dditional note C000863389;VK11070869;Delta Gas;;14/03/2016;13/04/2016;Prepayment;Yes; +Additional note C001041275;VK11070873;Delta;;14/04/2017;13/05/2017;Prepayment;Yes;Addi +tional note C000457921;VK11070913;Delta;;11/12/2014;10/01/2015;Prepayment;Yes;Addi +tional note C000457922;VK11070913;Delta Gas;;11/12/2014;10/01/2015;Prepayment;Yes; +Additional note C000354278;VK11070920;Delta;;21/09/2015;20/10/2015;Prepayment;Yes;Addi +tional note C000354278;VK11070920;Rounding;;21/09/2015;20/10/2015;Prepayment;Yes;A +dditional note C001139698;VK11070923;Delta;;12/08/2017;11/09/2017;Prepayment;Yes;Addi +tional note C001139698;VK11070923;Rounding;;12/08/2017;11/09/2017;Prepayment;Yes;A +dditional note C001072986;VK11070933;Delta;;14/03/2017;15/05/2017;Prepayment;Yes;Addi +tional note C001072986;VK11070933;Rounding;;14/03/2017;15/05/2017;Prepayment;Yes;A +dditional note C000833421;VK11074400;Delta;;1/05/2017;31/05/2017;Prepayment;Yes;Addit +ional note C000833422;VK11074400;Delta Gas;;1/05/2017;31/05/2017;Prepayment;Yes;A +dditional note C000833422;VK11074400;Rounding;;1/05/2017;31/05/2017;Prepayment;Yes;Ad +ditional note C000147059;VK11074404;Delta;;20/06/2017;19/07/2017;Prepayment;Yes;Addi +tional note C000147062;VK11074404;Delta Gas;;20/06/2017;19/07/2017;Prepayment;Yes; +Additional note C001109215;VK11074415;Delta;;24/08/2017;23/09/2017;Prepayment;Yes;Addi +tional note C000313157;VK11074418;Delta;;15/11/2016;14/12/2016;Prepayment;Yes;Addi +tional note C000313157;VK11074418;Rounding;;15/11/2016;14/12/2016;Prepayment;Yes;A +dditional note C000313158;VK11074418;Delta Gas;;11/11/2016;10/12/2016;Prepayment;Yes; +Additional note C001099002;VK11074430;Delta;;1/08/2017;31/08/2017;Prepayment;Yes;Addit +ional note C001117234;VK11074441;Delta Gas;;15/06/2017;14/07/2017;Prepayment;Yes; +Additional note C001009800;VK11074443;Delta;;16/11/2016;15/12/2016;Prepayment;Yes;Addi +tional note C000679686;VK11074451;Delta;;20/06/2016;19/07/2016;Prepayment;Yes;Addi +tional note C000679687;VK11074451;Delta Gas;;20/06/2016;19/07/2016;Prepayment;Yes; +Additional note C001242987;VK11074454;Delta Gas;;15/06/2017;14/07/2017;Prepayment;Yes; +Additional note C001080282;VK11074470;Delta;;2/03/2017;1/04/2017;Prepayment;Yes;Additi +onal note C001080283;VK11074470;Delta Gas;;2/03/2017;1/04/2017;Prepayment;Yes;Ad +ditional note C001192414;VK11074473;Delta;;14/07/2017;13/08/2017;Prepayment;Yes;Addi +tional note C001192414;VK11074473;Rounding;;14/07/2017;13/08/2017;Prepayment;Yes;A +dditional note C001192415;VK11074473;Delta Gas;;14/07/2017;13/08/2017;Prepayment;Yes; +Additional note C001192415;VK11074473;Rounding;;14/07/2017;13/08/2017;Prepayment;Yes;A +dditional note C000268914;VK11074478;Delta;;9/10/2016;8/11/2016;Prepayment;Yes;Additi +onal note C000268914;VK11074478;Rounding;;9/10/2016;8/11/2016;Prepayment;Yes;Add +itional note

So I need to write every VKxxxxxxxx once to $output.

Most lines (especially the ones that might be the problem) in my codes have a comment that explains my thinking there.

First version of my code:

use strict; use warnings; use autodie; my $input = 'D:/Some/Specific/Path/To/Input.CSV'; my $output = 'D:/Some/Specific/Path/To/Output.CSV'; open IN,$input; binmode(IN); open OUT,'>'.$output; my $count = 0; while (my $line = <IN>) { chomp $line; # good practice? next if $. < 2; # do not need first line my @fields = split ";" , $line; # define input-lines my @vk = $fields[1]; # extract all VK-codes my %seen; # declare hash to 'memorise' which VK has already been p +ushed my @uniq; # declare array to store unique values for my $vk (@vk) { # need to 'check' all VK-codes in @vk push (@uniq, $vk) unless $seen{$vk}; # only push VK-codes that + are not yet pushed for my $uvk (@uniq) { # uvk = unique VK if ($count != 10) { # we want 10 VK-codes per line -- work +s as intended print OUT $uvk.';'; ++ $count; } else { print OUT "\n"; $count = 0; } } } } close IN; close OUT; exit 0;

After seeing it writes double values (a VK that appears 3 times in $input gets printed 3 times in $output) I stared at this code for a while and realised it might be because I'm printing to OUT inside the while-block and each $line is unique, so I tweaked the code and ended up with version 2:

use strict; use warnings; use autodie; my $input = 'D:/Some/Specific/Path/To/Input.CSV'; my $output = 'D:/Some/Specific/Path/To/Output.CSV'; open IN,$input; binmode(IN); open OUT,'>'.$output; my $count = 0; my @vk; # declare array to store all unique VK-codes my %seen; # declare hash to 'memorise' which VK has already been pushe +d while (my $line = <IN>) { chomp $line; # good practice? next if $. < 2; # do not need first line my @fields = split ";" , $line; # define input-lines push (@vk,$fields[1]) unless $seen{$fields[1]}; # extract all and +only unique VK-codes } close IN; # do not need this anymore for my $vk (@vk) { # we need to print each one if ($count != 10) { # we want 10 VK-codes per line -- works as int +ended print OUT $vk.';'; ++ $count; } else { print OUT "\n"; $count = 0; } } close OUT; exit 0;

Both versions of the code do the exact same thing, print double values, not what I need.

The output from both codes (from posted input).

VK11070778;VK11070778;VK11070778;VK11070783;VK11070844;VK11070869;VK11 +070869;VK11070869;VK11070873;VK11070913; VK11070920;VK11070920;VK11070923;VK11070923;VK11070933;VK11070933;VK11 +074400;VK11074400;VK11074400;VK11074404; VK11074415;VK11074418;VK11074418;VK11074418;VK11074430;VK11074441;VK11 +074443;VK11074451;VK11074451;VK11074454; VK11074470;VK11074473;VK11074473;VK11074473;VK11074473;VK11074478;VK11 +074478;

So printing inside the while-block is not the (only) mistake here. But I can't seem to find my other mistake(s).

Any tips on how to make one of both versions work will of course be appreciated.

Thank you in advance!

Replies are listed 'Best First'.
Re: Removing doubles and printing only unique values
by hippo (Canon) on Oct 31, 2017 at 11:13 UTC
Re: Removing doubles and printing only unique values
by johngg (Canon) on Oct 31, 2017 at 11:26 UTC

    Using split, map and grep inside a do block.

    johngg@shiraz:~/perl/Monks > perl -Mstrict -Mwarnings -E ' open my $inFH, q{<}, \ <<__EOD__ or die $!; C000004923;VK11070778;Delta;;16/08/2017;15/09/2017;Prepayment;Yes;Addi +tional note C000004923;VK11070778;Rounding;;16/08/2017;15/09/2017;Prepayment;Yes;A +dditional note C000004924;VK11070778;Delta Gas;;16/08/2017;15/09/2017;Prepayment;Yes; +Additional note C000858948;VK11070783;Delta;;3/01/2017;2/02/2017;Prepayment;Yes;Additi +onal note C001028127;VK11070844;Delta;;1/07/2017;31/07/2017;Prepayment;Yes;Addit +ional note C000863388;VK11070869;Delta;;14/03/2016;13/04/2016;Prepayment;Yes;Addi +tional note C000863388;VK11070869;Rounding;;14/03/2016;13/04/2016;Prepayment;Yes;A +dditional note C000863389;VK11070869;Delta Gas;;14/03/2016;13/04/2016;Prepayment;Yes; +Additional note C001041275;VK11070873;Delta;;14/04/2017;13/05/2017;Prepayment;Yes;Addi +tional note C000457921;VK11070913;Delta;;11/12/2014;10/01/2015;Prepayment;Yes;Addi +tional note C000457922;VK11070913;Delta Gas;;11/12/2014;10/01/2015;Prepayment;Yes; +Additional note C000354278;VK11070920;Delta;;21/09/2015;20/10/2015;Prepayment;Yes;Addi +tional note C000354278;VK11070920;Rounding;;21/09/2015;20/10/2015;Prepayment;Yes;A +dditional note C001139698;VK11070923;Delta;;12/08/2017;11/09/2017;Prepayment;Yes;Addi +tional note C001139698;VK11070923;Rounding;;12/08/2017;11/09/2017;Prepayment;Yes;A +dditional note C001072986;VK11070933;Delta;;14/03/2017;15/05/2017;Prepayment;Yes;Addi +tional note C001072986;VK11070933;Rounding;;14/03/2017;15/05/2017;Prepayment;Yes;A +dditional note C000833421;VK11074400;Delta;;1/05/2017;31/05/2017;Prepayment;Yes;Addit +ional note C000833422;VK11074400;Delta Gas;;1/05/2017;31/05/2017;Prepayment;Yes;A +dditional note C000833422;VK11074400;Rounding;;1/05/2017;31/05/2017;Prepayment;Yes;Ad +ditional note C000147059;VK11074404;Delta;;20/06/2017;19/07/2017;Prepayment;Yes;Addi +tional note C000147062;VK11074404;Delta Gas;;20/06/2017;19/07/2017;Prepayment;Yes; +Additional note C001109215;VK11074415;Delta;;24/08/2017;23/09/2017;Prepayment;Yes;Addi +tional note C000313157;VK11074418;Delta;;15/11/2016;14/12/2016;Prepayment;Yes;Addi +tional note C000313157;VK11074418;Rounding;;15/11/2016;14/12/2016;Prepayment;Yes;A +dditional note C000313158;VK11074418;Delta Gas;;11/11/2016;10/12/2016;Prepayment;Yes; +Additional note C001099002;VK11074430;Delta;;1/08/2017;31/08/2017;Prepayment;Yes;Addit +ional note C001117234;VK11074441;Delta Gas;;15/06/2017;14/07/2017;Prepayment;Yes; +Additional note C001009800;VK11074443;Delta;;16/11/2016;15/12/2016;Prepayment;Yes;Addi +tional note C000679686;VK11074451;Delta;;20/06/2016;19/07/2016;Prepayment;Yes;Addi +tional note C000679687;VK11074451;Delta Gas;;20/06/2016;19/07/2016;Prepayment;Yes; +Additional note C001242987;VK11074454;Delta Gas;;15/06/2017;14/07/2017;Prepayment;Yes; +Additional note C001080282;VK11074470;Delta;;2/03/2017;1/04/2017;Prepayment;Yes;Additi +onal note C001080283;VK11074470;Delta Gas;;2/03/2017;1/04/2017;Prepayment;Yes;Ad +ditional note C001192414;VK11074473;Delta;;14/07/2017;13/08/2017;Prepayment;Yes;Addi +tional note C001192414;VK11074473;Rounding;;14/07/2017;13/08/2017;Prepayment;Yes;A +dditional note C001192415;VK11074473;Delta Gas;;14/07/2017;13/08/2017;Prepayment;Yes; +Additional note C001192415;VK11074473;Rounding;;14/07/2017;13/08/2017;Prepayment;Yes;A +dditional note C000268914;VK11074478;Delta;;9/10/2016;8/11/2016;Prepayment;Yes;Additi +onal note C000268914;VK11074478;Rounding;;9/10/2016;8/11/2016;Prepayment;Yes;Add +itional note __EOD__ say for do { my %seen; grep { not $seen{ $_ } ++ } map { ( split m{;} )[ 1 ] } <$inFH>; };' VK11070778 VK11070783 VK11070844 VK11070869 VK11070873 VK11070913 VK11070920 VK11070923 VK11070933 VK11074400 VK11074404 VK11074415 VK11074418 VK11074430 VK11074441 VK11074443 VK11074451 VK11074454 VK11074470 VK11074473 VK11074478

    I hope this is useful.

    Cheers,

    JohnGG

      Sub-optimal.
      say for do { # <-- mixed paradigms -------- my %seen; # <-- reinventing the wheel | grep { not $seen{ $_ } ++ } # <--------------------------- + map { ( split m{;} )[ 1 ] } # too much logic, but that's nitpickin +g <$inFH>; };'
      Better? I think so:
      use List::Util qw(uniqstr); say join "\n", uniqstr map { $_->[0] } map { [ split m{;} ] } <$inFH>;
      You could of course roll your own implementation of uniq if you want to.


      holli

      You can lead your users to water, but alas, you cannot drown them.

        Coding styles are somewhat subjective so the question of better or worse is hard to answer. I will set out why I have coded the solution that way and, incidentally, point out that your code should be operating on the second field and not the first.

        • say for do { ... I use a do block so that the %seen hash is not left hanging around but goes out of scope at the end of the block. Not necessary here as it is a one-liner but a good habit to get into I think. I'm not sure what you mean by "mixed paradigms" as I'm a bit old school, paradigms hadn't been invented when I started programming. To me it is just a piece of code that DWIMs.

        • The use of my %seen; grep { not $seen{ $_ } ++ } ... is so simple that it hardly seems worth loading a module, especially as in this case the module uses pretty much the same wheel. I'm all for modules when a task is more complex but not when there is nothing to gain.

        • I find it a bit puzzling that you consider map  { ( split m{;} )[ 1 ] } to be too much logic yet suggest as an alternative the use of two maps, the first to split and pass on an anonymous array, the second to pull out an element of that array (which actually should be element [1] not [0]). To me that appears to add more complication.

        Thank you for making those suggestions (++), it was interesting to look again at my post in the light of your comments and question whether it was optimal or not. On balance I don't think I would change anything as I can justify to myself the reasons for coding it that way. I would be interested to know if others feel that my reasoning is flawed.

        Cheers,

        JohnGG

Re: Removing doubles and printing only unique values
by haukex (Canon) on Oct 31, 2017 at 11:22 UTC

    I strongly recommend you use Text::CSV for CSV input/output, your script will end up much more robust and probably a bit shorter (also install Text::CSV_XS for speed). I gave an example of CSV input and output here. Update: Copied over code example (with minor adaptations):

    use warnings; use strict; use Text::CSV; my $csv = Text::CSV->new({ binary=>1, auto_diag=>2, eol=>$/, sep_char => ";" }); open my $ifh, '<', 'Input.CSV' or die $!; open my $ofh, '>', 'Output.CSV' or die $!; while ( my $row = $csv->getline($ifh) ) { # your logic to modify data here $csv->print($ofh, $row); } $csv->eof or $csv->error_diag; close $ifh; close $ofh;

      Thank you for the suggestion.

      In this case I won't be using that, just because I now have a working script and I can completely trust the inputfile will be clean. Plus, this script needs to run on an external server, Text::CSV is not installed and I am not allowed to install whatever I want there, so have to make it work with the tools at hand. Oftenly there is no speedboat here, so we have to use the rowboat.

      But I will definitely install it on my laptop, it indeed looks like it could make for a much cleaner script than what I oftenly write :-)

Re: Removing doubles and printing only unique values
by Eily (Prior) on Oct 31, 2017 at 10:48 UTC

    Hi zarath. $seen{$fields[1]} is always going to be false unless you set it to true. One possible way to do what you want is:

    (undef, my $vk) = split ";" , $line; # define input-lines push (@vk,$vk) unless $seen{$vk}++;
    $seen{$vk}++ will first return the current value (ie 0 the first time), and then increment it by one, so the value will be true from that point. By the way, I also changed your split, you can affect the output of split directly to a list, with undef in the place of elements you want to ignore. And split will be optimized to stop splitting after the list is full (eg: here you only have two elements on the left, so it will only split twice).

      Hi,

      Thank you sooo much! This does exactly what I need it to do.

      At first I did not understand at all how it would 'know' which field it needs to extract, but I guess it is because my $vk is the second field in (undef, my $vk)? So if I would need the third field, this part of the code would be (undef, undef, my $vk)? Is that right?

        Yes! And you can of course fetch several values with: my (undef, $second, $third) = split .... my (undef, $var, $other) is the same as (undef, my $var, my $other)

        Note that haukex is right about Text::CSV though, as long as your input stays simple (ie: no field contains the ; character, or a newline) split will do the work, but if there's any chance that your data can be more complex, it's just either to use the module rather than try to handle all the special cases by hand.

Re: Removing doubles and printing only unique values
by zarath (Beadle) on Oct 31, 2017 at 13:23 UTC

    Thanks for the replies everyone!

    This is the code I eventually ended up with:

    use strict; use warnings; use autodie; my $input = 'D:/Some/Specific/Path/To/Input.CSV'; my $output = 'D:/Some/Specific/Path/To/Output.CSV'; open IN,$input; binmode(IN); open OUT,'>'.$output; my $count = 0; my @vk; my %seen; while (my $line = <IN>) { chomp $line; next if $. < 2; (undef, my $vk) = split ";" , $line; push (@vk,$vk) unless $seen{$vk}++; } close IN; for my $vk (@vk) { if ($count != 10) { print OUT $vk.';'; ++ $count; } else { print OUT "\n"; $count = 0; } } close OUT; exit 0;

    Minimal amount of changes to my last version of the script and keeps it looking clean.

    Have looked up quite a bit about filtering out double values but don't think I have come across that $seen{$vk}++ technique. Will definitely be able to use that more often, seems pretty easy to use and powerful when needing to do something like this.

      I am curious why chose this approach over the functional one by johngg above?

      As you can see, it it is significantly shorter, clearer and arguably easier to understand1, once you wrapped your head around the concept.

      1If you don't believe me, imagine explaining the iterative approach and the functional approach to a 10 year old.


      holli

      You can lead your users to water, but alas, you cannot drown them.

        perl -F';' -ane '$seen{$F[1]} //= print'

      my $count = 0; ... for my $vk (@vk) { if ($count != 10) { print OUT $vk.';'; ++ $count; } else { print OUT "\n"; $count = 0; } }

      I think there's a problem with the quoted code:

      c:\@Work\Perl\monks>perl -wMstrict -e "my $count = 0; my @vk = qw(zero one two THREE four five six SEVEN eight nine ten); ;; for my $vk (@vk) { if ($count != 3) { print $vk.';'; ++ $count; } else { print qq{\n}; $count = 0; } } " zero;one;two; four;five;six; eight;nine;ten;
      Where did  THREE and  SEVEN go? Another vote for a functional approach. (But the quoted code is easily fixed.)


      Give a man a fish:  <%-{-{-{-<

      ... that $seen{$vk}++ technique.

      This is the technique on which List::Util::uniq() and related functions are based. (uniq() originally appeared in List::MoreUtils, but was duplicated into the core List::Util at some point.)


      Give a man a fish:  <%-{-{-{-<

Re: Removing doubles and printing only unique values
by BillKSmith (Parson) on Oct 31, 2017 at 17:33 UTC
    Perl Command Switches (Refer:perlrun) make this task trivial.
    C:\Users\Bill\forums\monks>type zarath.pl #!perl -naF; our %seen; print $F[1], "\n" if !$seen{$F[1]}++ C:\Users\Bill\forums\monks>perl zarath.pl 1202409.txt SalesId VK11070778 VK11070783 VK11070844 VK11070869 VK11070873 VK11070913 VK11070920 VK11070923 VK11070933 VK11074400 VK11074404 VK11074415 VK11074418 VK11074430 VK11074441 VK11074443 VK11074451 VK11074454 VK11074470 VK11074473 VK11074478
    Bill