Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much

How could I simplify this redundant-column-removing code?

by rubystallion (Novice)
on Jun 17, 2015 at 14:54 UTC ( #1130826=perlquestion: print w/replies, xml ) Need Help??

rubystallion has asked for the wisdom of the Perl Monks concerning the following question:

I have a few long query strings and to see which parts of the query string matter I wanted to delete those parts which are constant for all query strings. This is equal to removing redundant columns from a '&'-separated table.

I wrote the following code, which seems to work, but I'm not that happy, because in my head it's very simple, like something which should be easily accomplished with a few chained unix commands or a few lines of perl or awk, but what I ended up is relatively long and I only got it to work on the 4th try.

Is there any way to make the code significantly simpler or make it easier for me to write something like this bug-free the first time?

#!/usr/bin/env perl -w use v5.020; my @F1; my @recs; while (my $line = <DATA>) { chomp $line; my @F = split '&', $line; if ($. == 1) { @F1 = @F; } die "NF mismatch" if +@F1 != +@F; push @recs, \@F; for (my $i = 0; $i < @F; $i++) { next unless defined $F1[$i]; if ($F1[$i] ne $F[$i]) { $F1[$i] = undef; } } } for my $rec (@recs) { my $i = 0; for my $field (@$rec) { unless (defined $F1[$i++]) { print "$field\t"; } } say ""; } __DATA__ a=1&b=1&c=1&d=2&e=&f=3 a=1&b=2&c=3&d=2&e=&f=4 a=1&b=2&c=5&d=1&e=&f=5

Replies are listed 'Best First'.
Re: How could I simplify this redundant-column-removing code?
by kennethk (Abbot) on Jun 17, 2015 at 16:09 UTC
    A couple suggestions:
    1. If you are just incrementing by 1, use Foreach Loops instead of C-style loops -- fewer moving parts:
      for my $i (0 .. @F-1) {
    2. Avoid punctuation variables if you can, unless this is your toy and you are super comfy with them. Rather than testing $., just test if @F1 is initialized:
      @F1 = @F if !@F1;
      Note that you shouldn't use logical compound assignment operators because they have scalar context.
    3. Contrasting the above, you should be using $_ in this case because $line is so highly localized.
      while (<DATA>) { chomp; my @F = split '&';
    4. +@F1 is a no-op. Numification requires two arguments, so you'd need to write 0+@F1, but you don't even need to do that because logical operators like != also apply scalar context to their arguments.
    5. Your algorithm gets simpler and allows using a hash if you track which terms to delete rather than which ones to keep. I'm assuming that you don't have repeated keys.
    So I might write that as:
    #!/usr/bin/env perl -w use v5.014; my %seen; my $count; my @recs; while (<DATA>) { chomp; my @F = split '&'; $count //= @F; die "NF mismatch" if @F != $count; $seen{$_}++ for @F; push @recs, \@F; } for my $rec (@recs) { say join "\t", grep $seen{$_} != @recs, # Doesn't show up in every record @$rec ; } __DATA__ a=1&b=1&c=1&d=2&e=&f=3 a=1&b=2&c=3&d=2&e=&f=4 a=1&b=2&c=5&d=1&e=&f=5

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: How could I simplify this redundant-column-removing code?
by Athanasius (Bishop) on Jun 17, 2015 at 15:47 UTC

    Hello rubystallion,

    Your approach looks about right to me. The only thing that concerns me is that in the second (non-nested) for loop you have to test whether each member of @F1 is defined for every element of every record. The following reduces the number of tests by deploying an array slice, which is calculated only once:

    use 5.020; # includes strictures use warnings; my (@F1, @recs); while (my $line = <DATA>) { chomp $line; my @F = split '&', $line; @F1 = @F if $. == 1; die "NF mismatch" unless @F1 == @F; push @recs, \@F; for my $i (0 .. $#F) { next unless defined $F1[$i]; $F1[$i] = undef unless $F1[$i] eq $F[$i]; } } my @keep; defined $F1[$_] || push @keep, $_ for (0 .. $#F1); print join("\t", @$_[ @keep ]), "\n" for @recs; __DATA__ a=1&b=1&c=1&d=2&e=&f=3 a=1&b=2&c=3&d=2&e=&f=4 a=1&b=2&c=5&d=1&e=&f=5


    1:46 >perl b=1 c=1 d=2 f=3 b=2 c=3 d=2 f=4 b=2 c=5 d=1 f=5 1:46 >
    Is there any way to ... make it easier for me to write something like this bug-free the first time?

    If only! ;-)

    Anyway, hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: How could I simplify this redundant-column-removing code?
by BrowserUk (Pope) on Jun 17, 2015 at 19:28 UTC

    My version:

    #! perl -sw use strict; my $pos = tell DATA; my %tally; ++$tally{ $_ } for map split( '&' ), <DATA>; my $lines = $.; seek DATA, $pos, 0; print join '&', grep{ $tally{ $_ } != $lines } split '&' while <DATA> __DATA__ a=1&b=1&c=1&d=2&e=&f=3 a=1&b=2&c=3&d=2&e=&f=4 a=1&b=2&c=5&d=1&e=&f=5


    C:\test>junk b=1&c=1&d=2&f=3 b=2&c=3&d=2&f=4 b=2&c=5&d=1&f=5

    For a real file, you wouldn't need the tell, just rewind the file with seek $fh, 0, 0; for the second pass.

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re: How could I simplify this redundant-column-removing code?
by kcott (Bishop) on Jun 17, 2015 at 18:55 UTC

    G'day rubystallion,

    Welcome to the Monastery.

    The approach I took was to read the spec, grab the DATA and write the code. I didn't spend a lot of time looking at your code initially; although, I have commented on it further down in my post. Here's what I came up with:

    #!/usr/bin/env perl -l use strict; use warnings; no warnings 'uninitialized'; my @key_value_pairs; # Capture key-value pairs from original query strings while (<DATA>) { chomp; push @key_value_pairs, { map { (split /=/)[0,1] } split /&/ }; } # Remove common key-value pairs KEY: for my $key (keys %{$key_value_pairs[0]}) { for my $i (1 .. $#key_value_pairs) { next KEY unless $key_value_pairs[0]{$key} eq $key_value_pairs[ +$i]{$key}; } delete $key_value_pairs[$_]{$key} for 0 .. $#key_value_pairs; } # Recreate query strings without common key-value pairs for my $kvp (@key_value_pairs) { print join '&', map { join '=', $_, $kvp->{$_} } sort keys %$kvp; } __DATA__ a=1&b=1&c=1&d=2&e=&f=3 a=1&b=2&c=3&d=2&e=&f=4 a=1&b=2&c=5&d=1&e=&f=5


    b=1&c=1&d=2&f=3 b=2&c=3&d=2&f=4 b=2&c=5&d=1&f=5

    From the comments embedded in the code, you can see three distinct steps: capture all the initial data; remove the common data; recreate the query strings with what's left.

    As you indicated (i.e. "in my head it's very simple") this was fairly straightforward:

    1. split on '&' and then on '='
    2. only delete if all equality tests are TRUE
    3. join with '=' and then with '&'
    "Is there any way to make the code significantly simpler or make it easier for me to write something like this bug-free the first time?"

    That's a little difficult to answer without knowing what you did on your first three attempts.

    • Perhaps a lack of an initial design?
    • Perhaps you had problems with poorly named variables? I certainly did! As soon as I saw your first runtime statement (my @F1;), I realised I was going to have to read more code to find out what F1 represented.
    • Did you somehow get caught up with "use v5.020;"? Although I tested my code under v5.22.0, I suspect it'll run on any Perl5 released this century.

    A couple of notes on command switches:

    • By using the -l command switch, my print works like say (no need to add "\n").
    • The -w command switch is a poor choice. The warnings pragma is better; the documentation explains why.

    And, of course, if anything else in my code needs further explanation, just ask.

    -- Ken

      Hi Ken, Thanks for the suggestions. You're right, poorly named variables (or poor commenting) might have been the reason for one bug I had. Also good to know the advantages of the warnings pragma. I still needed a few minutes to get my head around your solution, but if I get more practise with nested commands and nested data structures this will hopefully become fairly straightforward for me, too.
Re: How could I simplify this redundant-column-removing code?
by pme (Prior) on Jun 17, 2015 at 16:03 UTC
    Hi rubystallion,

    This solution define F1 as hash variable. Unfortunately not much simpler if simpler at all.

    #!/usr/bin/perl -w use strict; my %F1; my @recs; while (my $line = <DATA>) { chomp $line; my @F = split '&', $line; if ($. == 1) { $F1{$_} = 0 for (@F); } die "NF mismatch" if keys %F1 != @F; push @recs, \@F; foreach (@F) { $F1{$_}++ if exists $F1{$_}; } } for my $rec (@recs) { for my $field (@$rec) { print "$field\t" unless exists $F1{$field} and $F1{$field} == + $.; } print "\n"; } __DATA__ a=1&b=1&c=1&d=2&e=&f=3 a=1&b=2&c=3&d=2&e=&f=4 a=1&b=2&c=5&d=1&e=&f=5
    b=1 c=1 d=2 f=3 b=2 c=3 d=2 f=4 b=2 c=5 d=1 f=5
Re: How could I simplify this redundant-column-removing code?
by aaron_baugher (Curate) on Jun 17, 2015 at 18:07 UTC

    This is kind of an interesting problem. If I were being handed this task, my first two questions would be:

    1. Are the fields always in the same order? (in query strings, that's usually not guaranteed)
    2. Does each key appear in every line?

    For a real-world task, I'd assume both answers are 'no,' so I'd have to be prepared to handle missing keys (including keys not appearing in the first line), and keys out of order. Given all that, I'd store the keys and values of each line in an array of hashes (an array to maintain the order of the lines). I'd also have a hash for keeping track of all the keys, and another hash for tracking which keys have the same value throughout so they can be dropped from the output. (There may be a clever way to make one hash do both those things, but it didn't occur to me.) This is what I ended up with. It drops the key/value pairs that are always the same (a & e), regardless of order or whether a key is sometimes missing.

    One note: I tell the inner split to always produce 2 fields, because otherwise it'll produce undef where there's no value, which messes up the nifty map-to-hash.

    #!/usr/bin/env perl use 5.010; use strict; use warnings; use Data::Printer; my @l; # array of hashes, to hold the keys and values for each line my %a; # hash to keep track of all keys my %c; # hash to track whether a key changes value # if a key is still in the hash when the loop finishes, # that means it had the same value throughout and should # be ignored while(<DATA>){ chomp; my %h = map { split /=/,$_,2 } split '&'; # split string into keys +/values push @l, \%h; if( $. == 1 ){ # on the first line, load into %c %c = %h; } else { # on other lines, check values for my $k (keys %h){ delete $c{$k} unless exists $c{$k} and $c{$k} eq $h{$k}; # + remove if different $a{$k}=1; # keep track of key existing } } } # remove consistent keys from output delete $a{$_} for keys %c; for my $l (@l){ for my $k (sort keys %a){ printf "%-8s", $l->{$k} ? "$k=$l->{$k}" : ' '; } say ''; } # I've mixed up the data a bit to reflect my open requirements. # a & e should still be dropped from the output. __DATA__ a=1&b=1&f=3&c=1&d=2&e= a=1&b=2&c=3&e=&f=4 b=2&a=1&c=5&d=1&e=&f=5

    Aaron B.
    Available for small or large Perl jobs and *nix system administration; see my home node.

Re: How could I simplify this redundant-column-removing code?
by rubystallion (Novice) on Jun 18, 2015 at 03:47 UTC

    Wow, of all my "first posts" since I started using the internet in 2000, this must be the one with the largest number of helpful replies in a single day. Glad to have found such a lively community!

    All replies had some useful suggestions. Just for reference: In bash I would have written something like this, which of course has horrible performance, but is very close to pseudocode and therefore hard to get wrong. Some of the solutions posted above are just as readable, but faster and more extensible. Thanks a lot everyone, I couldn't figure that out myself.

    for i in {1..7};do if (( $(cut -d '&' -f $i query.txt|uniq|wc -l) != 1 ));then f="$f,$i";fi;done;cut -d '&' -f "${f#,}" query.txt

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1130826]
Approved by planetscape
Front-paged by kcott
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others lurking in the Monastery: (5)
As of 2020-07-02 09:43 GMT
Find Nodes?
    Voting Booth?

    No recent polls found