PerlMonks  

How can I make a string unique (quicker than my approach at least)

by Anonymous Monk
on Apr 01, 2024 at 21:15 UTC ( [id://11158635]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have several thousand lines in a format like the following:
    ID1<TAB>name1-name2-name3-name4....-nameX
    ID2<TAB>name1-name2-name3-name4....-nameX
    ID3<TAB>name1-name2-name3-name4....-nameX
    ...

What I want to achieve is to keep each of the IDs as they appear, and ONLY the unique names that appear in the second column. For instance, imagine:
    ID1<TAB>nick-john-helena
    ID2<TAB>george-andreas-lisa-anna-matthew-andreas-lisa
    ID3<TAB>olivia-niels-peter-lars-niels-lars-olivia-olivia
    ...

my output should be:
    ID1<TAB>nick-john-helena
    ID2<TAB>george-andreas-lisa-anna-matthew
    ID3<TAB>olivia-niels-peter-lars

What I am looking for is a quick solution to this. My approach would be to read each line, store the ID and the line of names, split the names using - as the delimiter, put all the names into a temp array on the fly, then make this array unique and print the unique elements. Is there perhaps a cleverer solution?
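For reference, the approach described above can be sketched roughly as follows. This is only a minimal sketch: it assumes the separator is a single real tab character, and the helper names dedupe_names and dedupe_line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: drop repeated names, keeping the first occurrence
# of each name in its original position.
sub dedupe_names {
    my ($names) = @_;
    my %seen;
    return join '-', grep { !$seen{$_}++ } split /-/, $names;
}

# Hypothetical helper: process one "ID<tab>names" line.
sub dedupe_line {
    my ($line) = @_;
    chomp $line;
    my ($id, $names) = split /\t/, $line, 2;
    return $line unless defined $names;    # pass through lines without a names column
    return "$id\t" . dedupe_names($names);
}

# Sample data from the question, assuming a literal tab as the separator.
my @lines = (
    "ID1\tnick-john-helena\n",
    "ID2\tgeorge-andreas-lisa-anna-matthew-andreas-lisa\n",
    "ID3\tolivia-niels-peter-lars-niels-lars-olivia-olivia\n",
);
print dedupe_line($_), "\n" for @lines;
```

Each line is handled independently, so nothing beyond the current line has to be kept in memory.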

Replies are listed 'Best First'.
Re: How can I make a string unique (quicker than my approach at least)
by syphilis (Archbishop) on Apr 02, 2024 at 00:20 UTC
    ... and then make this array unique and print the unique elements

    For this part of the operation I would think that List::Util::uniqstr() is what you want.
    use warnings;
    use strict;
    use List::Util qw(uniqstr);

    my $str = 'olivia-niels-peter-lars-niels-lars-olivia-olivia';
    my @s = split /\-/, $str;
    print "@s\n";
    my @s_new = uniqstr(@s);
    print "@s_new\n";

    __END__
    Outputs:
    olivia niels peter lars niels lars olivia olivia
    olivia niels peter lars
    Cheers,
    Rob
Re: How can I make a string unique (quicker than my approach at least)
by johngg (Canon) on Apr 02, 2024 at 10:29 UTC

    This solution preserves order, if that is an issue.

    johngg@aleatico:~/perl/Monks$ perl -Mstrict -Mwarnings -E '
    say q{};
    open my $inFH, q{<}, \ <<__EOD__ or die $!;
    ID1<TAB>nick-john-helena
    ID2<TAB>george-andreas-lisa-anna-matthew-andreas-lisa
    ID3<TAB>olivia-niels-peter-lars-niels-lars-olivia-olivia
    __EOD__

    while ( <$inFH> )
    {
        chomp;
        my( $pre, $post ) = split m{(?<=>)};
        say
            $pre,
            join q{-},
            do {
                my %seen;
                grep { ! $seen{ $_ } ++ } split m{\s?-\s?}, $post
            }
    }'

    ID1<TAB>nick-john-helena
    ID2<TAB>george-andreas-lisa-anna-matthew
    ID3<TAB>olivia-niels-peter-lars

    I hope this is of interest.

    Cheers,

    JohnGG

      my( $pre, $post ) = split m{(?<=>)};

      Interesting - you have assumed that the <TAB> in the OP's data is literally those 5 characters whereas in my reading they were using this to indicate a single tab character. Doesn't matter really, but it would make the split regex simpler if it were a single tab.
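To make the two readings concrete, here is a small sketch of both splits side by side. The sub names are hypothetical and exist only for the comparison.

```perl
use strict;
use warnings;

# Reading 1: "<TAB>" is five literal characters, so split just after the ">".
sub split_literal_tag {
    my ($line) = @_;
    return split /(?<=>)/, $line;
}

# Reading 2: the separator is one real tab character; the limit of 2 keeps
# everything after the first tab together.
sub split_real_tab {
    my ($line) = @_;
    return split /\t/, $line, 2;
}

my ($pre, $post) = split_literal_tag('ID1<TAB>nick-john-helena');
print "$pre | $post\n";     # ID1<TAB> | nick-john-helena

my ($id, $names) = split_real_tab("ID1\tnick-john-helena");
print "$id | $names\n";     # ID1 | nick-john-helena
```

In the first reading the tag stays attached to the ID; in the second the tab is consumed by the split and has to be printed back explicitly.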


      🦛

Re: How can I make a string unique (quicker than my approach at least)
by LanX (Saint) on Apr 02, 2024 at 01:14 UTC
    > put all names into an temp array on the fly

    I'd use a hash-slice (if order doesn't matter)

    Debugger demo:

      DB<3> p $a
    olivia-niels-peter-lars-niels-lars-olivia-olivia
      DB<4> @hash{ split '-',$a } = ()
      DB<5> x keys %hash
    0  'lars'
    1  'niels'
    2  'peter'
    3  'olivia'
      DB<6> p join '-', keys %hash
    lars-niels-peter-olivia
      DB<7>
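Outside the debugger, the same hash-slice idea might look like the sketch below. The sub name unique_unordered is an assumption for illustration, and the sort is there only so the printed order is repeatable from run to run.

```perl
use strict;
use warnings;

# Hypothetical helper: unique names via a hash slice.
# The order of the returned keys is NOT the input order.
sub unique_unordered {
    my ($names) = @_;
    my %hash;
    @hash{ split /-/, $names } = ();    # every name becomes a key; values stay undef
    return keys %hash;
}

my @unique = unique_unordered('olivia-niels-peter-lars-niels-lars-olivia-olivia');
print join('-', sort @unique), "\n";    # lars-niels-olivia-peter
```

The result is assigned to an array before sorting because `sort NAME(...)` would treat NAME as a comparison routine rather than as a call producing the list.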

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

Re: How can I make a string unique (quicker than my approach at least)
by Marshall (Canon) on Apr 02, 2024 at 15:38 UTC
    use strict;
    use warnings;
    use List::Util qw(uniq);

    while (<DATA>)
    {
        my ($id, @names) = split (/\n|\t|-/,$_);
        print "$id\t",join("-",uniq(@names)),"\n";  #update:parens needed for join args
    }

    =Prints
    ID1	nick-john-helena
    ID2	george-andreas-lisa-anna-matthew
    ID3	olivia-niels-peter-lars
    =cut

    __DATA__
    ID1	nick-john-helena
    ID2	george-andreas-lisa-anna-matthew-andreas-lisa
    ID3	olivia-niels-peter-lars-niels-lars-olivia-olivia
Re: How can I make a string unique (quicker than my approach at least)
by stevieb (Canon) on Apr 01, 2024 at 22:59 UTC

    Quick and dirty. If the order of the names matters, this won't work. If there are duplicate ID tags, this won't work.

    use warnings;
    use strict;

    my %seen;

    while (my $line = <DATA>) {
        chomp $line;
        my ($id, $data) = split /\s+/, $line;
        next if ! $id || ! $data;
        $seen{$id}{$_}++ for split /-/, $data;
    }

    for my $id (sort keys %seen) {
        printf(
            "%s\t%s\n",
            $id,
            join '-', keys %{ $seen{$id} }
        );
    }

    __DATA__
    ID1 nick-john-helena
    ID2 george-andreas-lisa-anna-matthew-andreas-lisa
    ID3 olivia-niels-peter-lars-niels-lars-olivia-olivia

    Output:

    ID1 helena-nick-john
    ID2 george-lisa-anna-matthew-andreas
    ID3 niels-peter-lars-olivia
Re: How can I make a string unique (quicker than my approach at least)
by jdporter (Paladin) on Apr 02, 2024 at 15:43 UTC
    perl -MList::Util=uniq -ple "s((\S+)$){ join '-', sort { $a cmp $b } uniq split /-/, $1 }e" < 11158635.dat
Re: How can I make a string unique (quicker than my approach at least) (sort)
by LanX (Saint) on Apr 03, 2024 at 14:29 UTC
    While most answers have concentrated on the ambiguity of preserving order or not - which you never clarified - there is another aspect to consider about uniqueness.

    Are nick-john-helena and john-helena-nick really considered two different solutions?

    If not, you'll need to sort the result set.

    Goes without saying that solutions preserving order are overkill in that case.

    FWIW: afaik, my hash solution should always return the keys in the same randomized order, hence sorting wouldn't be strictly necessary, at least within the limits of the same process.

    But personally I'd rather play safe and sort.
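A minimal sketch of the "play safe and sort" variant, assuming a plain string sort is the ordering wanted; the name canonical_names is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: unique names in sorted order, so the same set of names
# always produces the same string no matter how the input was ordered.
sub canonical_names {
    my ($names) = @_;
    my %seen;
    @seen{ split /-/, $names } = ();
    return join '-', sort keys %seen;
}

print canonical_names('nick-john-helena'), "\n";    # helena-john-nick
print canonical_names('john-helena-nick'), "\n";    # helena-john-nick
```

With this form, two lines holding the same set of names compare equal as strings.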

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    see Wikisyntax for the Monastery

Re: How can I make a string unique (quicker than my approach at least)
by harangzsolt33 (Deacon) on Apr 03, 2024 at 03:30 UTC

      Hello harangzsolt33,

      Maybe somebody ... can explain how this works

      Since there is no explicit return, the sub returns the value of its final statement, namely grep !$seen{$_}++, @_;. @_ contains the arguments passed into the sub, and grep filters out those elements that do not make the expression !$seen{$_}++ true. So let’s look at that expression in detail.

      %seen is a hash, initially empty. When reference is made to an element that does not yet exist, that element is autovivified. So if $_ is 'x' and the hash has no 'x' key, a hash element is created with key 'x' and value undef.

      Now the clever part: postfix ++ increments an item’s value, but the increment is delayed until after the current expression has been evaluated. Further, incrementing undef produces the value 1, because undef is taken to be zero. So if the current value of $_ is not already in the hash %seen, the expression !$seen{$_}++ autovivifies a hash value with key $_ and value undef and applies the logical negation operator ! to the value. Since undef is false by definition, its negation is true and the value of $_ passes through the grep filter into the eventual output of the subroutine.

      But the next time $_ has that value, the hash item $seen{$_} exists and has a value of 1 (from the previous application of postfix ++). And since !1 is false, grep filters this item out. In this way, only the first occurrence of any item passes through the filter. So all repeated items are removed from the original list.
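The walkthrough above can be watched in action with a small tracing sketch. The sub trace_uniq is made up for this illustration; it records what postfix ++ handed back for each element and whether grep kept it. One detail worth knowing: perlop documents that post-incrementing an undefined value returns 0 rather than undef, which is just as false, so the filter behaves exactly as described.

```perl
use strict;
use warnings;

# Hypothetical tracing version of the idiom.
sub trace_uniq {
    my @log;
    my %seen;
    my @kept = grep {
        my $before = $seen{$_}++;    # value before the increment (0 for a new key)
        push @log, sprintf '%s: before=%d kept=%s',
            $_, $before, $before ? 'no' : 'yes';
        !$before;                    # true only the first time the item is seen
    } @_;
    return (\@kept, \@log);
}

my ($kept, $log) = trace_uniq(qw(niels lars niels niels));
print "$_\n" for @$log;
print "result: @$kept\n";
# niels: before=0 kept=yes
# lars: before=0 kept=yes
# niels: before=1 kept=no
# niels: before=2 kept=no
# result: niels lars
```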

      Hope that helps,

      Athanasius <°(((><contra mundum סתם עוד האקר של פרל,

      I use that snippet sometimes since some older versions of List::Util do not include a uniq function.

      The code means:

      sub uniq {
          my %h;            # Keep track of things seen.
          grep {            # 4: Return each item the first time it is seen.
              not $h{$_}++  # 2: Item has not (yet) been seen.
                            # 3: ++ would then say item was seen.
          } @_;             # 1: For each input...
      }
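One way to cope with older List::Util versions is to pick the implementation at run time. The sketch below is only a suggestion: portable_uniq and $impl are hypothetical names, and as far as I know uniq first appeared in List::Util 1.45.

```perl
use strict;
use warnings;
use List::Util ();

# Use List::Util::uniq when the installed version provides it; otherwise
# fall back to the %seen idiom. Both keep the first occurrence of each item.
my $impl = List::Util->can('uniq')
    || sub { my %seen; grep { !$seen{$_}++ } @_ };

# Hypothetical wrapper so callers do not care which implementation was chosen.
sub portable_uniq { return $impl->(@_) }

my @names = portable_uniq(split /-/, 'george-andreas-lisa-anna-matthew-andreas-lisa');
print join('-', @names), "\n";    # george-andreas-lisa-anna-matthew
```

The two implementations agree for ordinary strings; they can differ on undef, which the fallback treats like the empty string.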

        Input:

        cat in
        ID1 nick-john-helena
        ID2 george-andreas-lisa-anna-matthew-andreas-lisa
        ID3 olivia-niels-peter-lars-niels-lars-olivia-olivia

        Code:

        perl -MList::Util=uniq -ple '
            s/
                ^           # Beginning of line (BOL).
                \w+         # Any "words".
                \s+         # Any whitespace (like tabs).
                \K          # "Keep" whats to the left.
                (\S+)       # Capture and replace next non whitespace (words).
            /
                join "-",   # 4: n1-n2-n3
                uniq        # 3: [ "n1", "n2", "n3" ]
                split "-",  # 2: [ "n1", "n2", "n1", "n3" ]
                $1          # 1: n1-n2-n1-n3
            /xe             # Freespace regex and eval replacement.
        ' in

        Output:

        ID1 nick-john-helena
        ID2 george-andreas-lisa-anna-matthew
        ID3 olivia-niels-peter-lars
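The same substitution can also live in a script rather than a one-liner. This is a rough sketch with a hypothetical sub name, using the %seen idiom in place of List::Util so it has no version requirement beyond \K (Perl 5.10 or later).

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite only the names column, leaving the ID and the
# whitespace after it untouched thanks to \K.
sub dedupe_in_place {
    my ($line) = @_;
    my %seen;
    $line =~ s{^\S+\s+\K(\S+)}{ join '-', grep { !$seen{$_}++ } split /-/, $1 }e;
    return $line;
}

print dedupe_in_place("ID3\tolivia-niels-peter-lars-niels-lars-olivia-olivia"), "\n";
```

Lines that do not match the pattern, such as blank lines, come back unchanged.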
      Maybe somebody who is more knowledgeable can explain how this works

      I think the FAQ explains it quite well. I would always look there first in preference to StackOverflow anyway.


      🦛

        Agreed.

        Curiously, in this case the top answer at StackOverflow points to the identical faq link you pointed to ... while the second top answer cites perldoc -q duplicate (which emits identical content) and then further goes to the bother of embedding verbatim brian_d_foy's excellent FAQ entry in the SO response! ... so you'd think harangzsolt33 must have seen it (or requires a new pair of glasses) ... maybe he can comment further to clear up this mystery. :)

        👁️🍾👍🦟
