PerlMonks  

Deleting intermediate whitespaces, but leaving one behind each word

by Feneden (Novice)
on Dec 05, 2017 at 06:25 UTC (#1204926)
Feneden has asked for the wisdom of the Perl Monks concerning the following question:

Hello, as you can already see, I am looking for a way to turn this:

Intel(R) Xeon(R) CPU X5660 2.80GHz (with runs of several spaces between the words)

into this:

Intel(R) Xeon(R) CPU X5660 @ 2.80GHz

I thought of methods like this:

@CPU_SPLIT = split(/\ /,$CPU);
~ s\ //g;
join (' ',$CPU_SPLIT[0],$CPU_SPLIT[1],$CPU_SPLIT[2],$CPU_SPLIT[3],$CPU_SPLIT[4],$CPU_SPLIT[5],$CPU_SPLIT[6]);

But I think there is a better way, one that actually works ^^. In Bash there is a command like xargs for this. Regards, Jan, and thank you for your help :)

Replies are listed 'Best First'.
Re: Deleting intermediate whitespaces, but leaving one behind each word
by Athanasius (Chancellor) on Dec 05, 2017 at 06:35 UTC

    Hello Feneden, and welcome to the Monastery!

    No need for the join, just replace each occurrence of one or more whitespace characters with a single whitespace character:

    16:30 >perl -wE "my $s = 'Intel(R) Xeon(R) CPU X5660 2.80GHz '; $s =~ s/(\s)+/$1/g; say qq[>$s<];"
    >Intel(R) Xeon(R) CPU X5660 2.80GHz <

    16:33 >

    Update: Looking again at the thread title, it appears you may also want to remove trailing whitespace at the end of the string:

    use strict;
    use warnings;

    my $s = 'Intel(R) Xeon(R) CPU X5660 2.80GHz ';
    $s =~ s{ (\s)+ }{$1}gx;
    $s =~ s{ \s+ $ }{}x;
    print "\n>$s<\n";

    Output:

    17:26 >perl 1843_SoPW.pl

    >Intel(R) Xeon(R) CPU X5660 2.80GHz<

    17:26 >

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,
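One detail of the `(\s)+` / `$1` form above is worth spelling out: a quantified capture group keeps only its last iteration, so each run is replaced by its final whitespace character, which may be a tab or a newline rather than a space. The sketch below contrasts that with replacing each run by a literal space. The sub names and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse each whitespace run to its LAST character.
sub collapse_keep_last {
    my ($s) = @_;
    $s =~ s/(\s)+/$1/g;
    return $s;
}

# Hypothetical helper: collapse each whitespace run to one literal space.
sub collapse_to_space {
    my ($s) = @_;
    $s =~ s/\s+/ /g;
    return $s;
}

# Made-up input: the first run is a space followed by a tab.
my $sample = "CPU \tX5660  2.80GHz";

# The first run collapses to the tab, the second to a space.
print '>', collapse_keep_last($sample), "<\n";

# Every run collapses to a single space.
print '>', collapse_to_space($sample), "<\n";   # prints >CPU X5660 2.80GHz<
```

If the input can only contain plain spaces, the two forms behave the same; the difference shows up only with mixed whitespace.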

      Hello Athanasius, I'll try this later in my script. For now it seems like a great way. Also, for a Perl beginner, I was not too far away from the solution :) This is my first month of scripting; before that I only had 2 months of experience in Java. All in all, a great "customer" feeling here and a great experience using this forum. I appreciate your help :) See you soon :D

        Perlmonks is the best. So many helpful people. Welcome!

        $PM = "Perl Monk's";
        $MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest";
        $nysus = $PM . ' ' . $MCF;

Re: Deleting intermediate whitespaces, but leaving one behind each word
by Laurent_R (Canon) on Dec 05, 2017 at 09:07 UTC
    Hi Feneden,

    the solutions using a simple regex suggested by Athanasius and haukex are great and definitely better than any solution using split and join (and I would have suggested exactly the same solution as haukex if he had not done it before).

    However, I feel it might be useful for you to see how your original solution could be improved:

    use strict;
    use warnings;

    my $CPU = 'Intel(R) Xeon(R) CPU X5660 2.80GHz';    # sample input from the question
    my @CPU_SPLIT = split /\s+/, $CPU;    # split the string on one or more spaces
    # ~ s\ //g;                           # this line does not make any sense to me, commented out.
    my $result = join ' ', @CPU_SPLIT;    # join can operate directly on an array, no need to list the indices

      definitely better than any solution using split and join
      Well, the best solutions are the ones you understand the most. I doubt there's a performance issue here anyway. So ++ for giving a solution based on the original one.

      NB: split /\s+/, $CPU can be written split " ", $CPU since the latter will be translated to the former, as explained in the doc. Then if you don't need the temp variable:
      my $result = join " ", split " ", $CPU;
      (Okay, maybe in that case using /\s+/ makes it look a little less silly :P)

        Well, the best solutions are the ones you understand the most. I doubt there's a performance issue here anyway.
        Yes, Eily++, I agree with both sentences. I meant "better" only in the sense that I find that the solutions using the s/// substitution operator are just simpler.

        And, yes, I would also avoid the intermediate temp variable by pipe-lining the join and the split as you've shown, but, here, I wanted to stay close to the OP's solution.

        As for the more awkish version of split using the ' ' string for splitting on multiple spaces, I know it exists and I agree it looks somewhat simpler, but I tend to prefer a regex such as /\s+/ because I find it states more explicitly what it is doing; as an example, I wouldn't know for sure (off the top of my head, without looking up in the documentation or testing, that is) whether it would also split on tabs or new line characters.
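On the open question of tabs and newlines: according to the documentation for split, the special-case `' '` pattern splits on any contiguous whitespace, like `/\s+/`, and additionally skips leading whitespace. A small sketch to check this, with a made-up input string and hypothetical helper names:

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two split forms under discussion.
sub fields_awk_style { my ($s) = @_; return split ' ',   $s; }
sub fields_regex     { my ($s) = @_; return split /\s+/, $s; }

# Made-up input with a leading space, a tab, a double space and a newline.
my $text = " Intel(R)\tXeon(R)  CPU\n";

my @awk   = fields_awk_style($text);   # leading whitespace is skipped
my @regex = fields_regex($text);       # leading whitespace yields an empty first field

print scalar(@awk),   " fields: ", join('|', @awk),   "\n";   # 3 fields: Intel(R)|Xeon(R)|CPU
print scalar(@regex), " fields: ", join('|', @regex), "\n";   # 4 fields: |Intel(R)|Xeon(R)|CPU
```

Both forms drop trailing empty fields, so the only visible difference here is the empty field at the front.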

Re: Deleting intermediate whitespaces, but leaving one behind each word
by haukex (Abbot) on Dec 05, 2017 at 06:35 UTC
    my $CPU = 'Intel(R) Xeon(R) CPU X5660 2.80GHz';
    $CPU =~ s/\s+/ /g;
    print "<$CPU>\n";
    # prints <Intel(R) Xeon(R) CPU X5660 2.80GHz>

    Have a look at perlrequick and perlretut.

      Hello, I also appreciate your help and will try this solution too. Thanks for the fast answer :)
Re: Deleting intermediate whitespaces, but leaving one behind each word
by Monk::Thomas (Friar) on Dec 05, 2017 at 11:30 UTC

    RegExp Search/Replace is a better match for this kind of job. But I also would like to show how to do this via split/join. This might come in handy if you want to skip certain fields or need an array representation later on.

    The split pattern

    I see you're using the pattern 'a single space' / /, but since you want to fold whitespace anyway, a better solution is to use 'one or more spaces' / +/ or 'any whitespace' /\s+/. The effect is this:

    my $CPU = 'Intel(R) Xeon(R) CPU X5660 2.80GHz ';
    # field: 0 1 2 3..11 12 13 14    # using / /
    # field: 0 1 2 3 4               # using / +/
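To make the field counts concrete, here is a minimal sketch on a deliberately short, made-up string with three spaces between two words. The helper `show_fields` is hypothetical and only brackets each field so that empty ones stay visible.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap every field in brackets so empty fields show up.
sub show_fields {
    my (@fields) = @_;
    return join ' ', map { "[$_]" } @fields;
}

# Made-up input: exactly three spaces between the two words.
my $input = 'CPU   X5660';

# Each single space is a separator, so two empty fields appear in between.
print show_fields(split / /,  $input), "\n";   # [CPU] [] [] [X5660]

# The whole run of spaces is one separator.
print show_fields(split / +/, $input), "\n";   # [CPU] [X5660]
```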

    Usage of join

    There are multiple ways to feed the generated array into join. In order to get the expected result you can either feed each field separately, use an array slice or just feed the whole array:

    my @CPU_SPLIT = split / +/, 'Intel(R) Xeon(R) CPU X5660 2.80GHz ';

    # 1 - feed each field explicitly
    print join ' ', $CPU_SPLIT[0], $CPU_SPLIT[1], $CPU_SPLIT[2], $CPU_SPLIT[3], $CPU_SPLIT[4];
    print "\n";

    # 2a - use an array slice - explicitly
    print join ' ', @CPU_SPLIT[0,1,2,3,4];
    print "\n";

    # 2b - use an array slice - via a range
    print join ' ', @CPU_SPLIT[0..4];
    print "\n";

    # 3 - feed the whole array
    print join ' ', @CPU_SPLIT;
    print "\n";

      Note that  split / +/, $string doesn't handle leading whitespace gracefully:

      c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
      "my $CPU = ' Intel(R) Xeon(R) CPU X5660 2.80GHz ';
       dd $CPU;
       ;;
       my @CPU_SPLIT = split / +/, $CPU;
       dd \@CPU_SPLIT;
       ;;
       my $t = join ' ', @CPU_SPLIT;
       dd $t;
      "
      " Intel(R) Xeon(R) CPU X5660 2.80GHz "
      ["", "Intel(R)", "Xeon(R)", "CPU", "X5660", "2.80GHz"]
      " Intel(R) Xeon(R) CPU X5660 2.80GHz"
      Here, the special-case  ' ' split pattern is better (if you're going to split/join):
      c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
      "my $CPU = ' Intel(R) Xeon(R) CPU X5660 2.80GHz ';
       dd $CPU;
       ;;
       my @CPU_SPLIT = split ' ', $CPU;
       dd \@CPU_SPLIT;
       ;;
       my $t = join ' ', @CPU_SPLIT;
       dd $t;
      "
      " Intel(R) Xeon(R) CPU X5660 2.80GHz "
      ["Intel(R)", "Xeon(R)", "CPU", "X5660", "2.80GHz"]
      "Intel(R) Xeon(R) CPU X5660 2.80GHz"
      (Dealing with general whitespace was exactly why  ' ' was special-cased.)


      Give a man a fish:  <%-{-{-{-<

Re: Deleting intermediate whitespaces, but leaving one behind each word
by johngg (Abbot) on Dec 05, 2017 at 13:45 UTC

    You have received some excellent suggestions already but, for the sake of completeness, here's a solution that also removes leading spaces if necessary. You can pre-compile a regular expression using qr{ ... } ( see Quote and Quote like Operators ) for use in a later match or substitution and you can use extended syntax to comment the expression. I prefer if possible to match only what we want to remove and replace it with nothing rather than matching what we want to remove and capturing what we want to keep, using the capture ( $1 etc. ) in the replacement part. We want to match spaces preceded by the beginning of the string OR spaces preceded by a single space OR spaces followed by the end of the string. To do this we can use zero-width look-behind and look-ahead assertions ( see "Lookaround Assertions" in Extended Patterns ) to match multiple spaces just where we want. This code:-

    use strict;
    use warnings;
    use feature qw{ say };

    my $str = q{ Intel(R) Xeon(R) CPU X5660 2.80GHz };

    my $rxSpaces = qr{(?x)    # Use regex extended syntax to allow comments
        (?:                   # Open non-capturing group for alternation
            (?<= \A ) \s+     # Spaces preceded by beginning of string
            |                 # or
            (?<= \s ) \s+     # Spaces preceded by a single space
            |                 # or
            \s+ (?= \z )      # Spaces followed by end of string
        )                     # Close group
    };

    # Replace matching spaces by nothing globally.
    #
    $str =~ s{$rxSpaces}{}g;
    say qq{-->$str<--};

    produces this output:-

    -->Intel(R) Xeon(R) CPU X5660 2.80GHz<--

    I hope this is of interest.

    Cheers,

    JohnGG

      Please forgive the nit-picky nature of this reply, but your post raised a number of interesting points.

      my $rxSpaces = qr{(?x)    # Use regex extended syntax to allow comments
          (?:                   # Open non-capturing group for alternation
              (?<= \A ) \s+     # Spaces preceded by beginning of string
              |                 # or
              (?<= \s ) \s+     # Spaces preceded by a single space
              |                 # or
              \s+ (?= \z )      # Spaces followed by end of string
          )                     # Close group
      };

      Many of the details of this regex no doubt have an expository purpose. However, more or less in descending order of importance:

      • In the  (?<= \A ) \s+ and  \s+ (?= \z ) sub-patterns, the zero-width look-around assertions are overkill because  \A and  \z are already zero-width assertions, so the simpler  \A \s+ and  \s+ \z (respectively) are exactly equivalent and IMHO preferable;
      • The  (?: ... ) non-capturing group surrounding the alternation is redundant because the whole  qr// is effectively wrapped in a non-capturing group;
      • Lastly, the  (?x) at the start of the regex is IMHO to be avoided in favor of a standard  /xms tail for this (and every!) regex. (This is my personal regex best practice.)
      Then what you have is a regex like
          qr{ (?<= \s) \s+ | \A \s+ | \s+ \z }xms
      which IMHO is very easy to understand.

      The use of Perl's ordered regex alternation raises the question of the proper order of the sub-patterns. My experience has been that only testing can answer this question reliably:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "use Test::More 'no_plan';
       use Test::NoWarnings;
       ;;
       note 'perl version: ', $];
       ;;
       use constant S => ' Intel(R) Xeon(R) CPU X5660 2.80GHz ';
       use constant T => 'Intel(R) Xeon(R) CPU X5660 2.80GHz';
       ;;
       for my $rxSpaces (
         qr{ (?<= \s) \s+ | \A \s+ | \s+ \z }xms,
         qr{ \A \s+ | (?<= \s) \s+ | \s+ \z }xms,
         qr{ \A \s+ | \s+ \z | (?<= \s) \s+ }xms,
         ) {
         (my $t = S) =~ s{$rxSpaces}{}g;
         ok $t eq T, qq{$rxSpaces -> \n >$t<};
         }
       ;;
       note qq{still with spaces? >${ \S }<};
       done_testing;
      "
      # perl version: 5.008009
      ok 1 - (?msx-i: (?<= \s) \s+ | \A \s+ | \s+ \z ) ->
      #  >Intel(R) Xeon(R) CPU X5660 2.80GHz<
      ok 2 - (?msx-i: \A \s+ | (?<= \s) \s+ | \s+ \z ) ->
      #  >Intel(R) Xeon(R) CPU X5660 2.80GHz<
      ok 3 - (?msx-i: \A \s+ | \s+ \z | (?<= \s) \s+ ) ->
      #  >Intel(R) Xeon(R) CPU X5660 2.80GHz<
      # still with spaces? > Intel(R) Xeon(R) CPU X5660 2.80GHz <
      1..3
      ok 4 - no warnings
      1..4
      Ok, no ordering dependency is seen.

      Now you think, "Gee, with Perl 5.10 there's that neat  \K variable-width look-behind emulation operator I can use to simplify the regex even more!" Unfortunately, after testing (and you always test this stuff, right?) you find a problem:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "use Test::More 'no_plan';
       use Test::NoWarnings;
       ;;
       note 'perl version: ', $];
       ;;
       use constant S => ' Intel(R) Xeon(R) CPU X5660 2.80GHz ';
       use constant T => 'Intel(R) Xeon(R) CPU X5660 2.80GHz';
       ;;
       for my $rxSpaces (
         qr{ (?<= \s) \s+ | \A \s+ | \s+ \z }xms,
         qr{ \A \s+ | (?<= \s) \s+ | \s+ \z }xms,
         qr{ \A \s+ | \s+ \z | (?<= \s) \s+ }xms,
         qr{ \s \K \s+ | \A \s+ | \s+ \z }xms,
         qr{ \A \s+ | \s \K \s+ | \s+ \z }xms,
         qr{ \A \s+ | \s+ \z | \s \K \s+ }xms,
         ) {
         (my $t = S) =~ s{$rxSpaces}{}g;
         ok $t eq T, qq{$rxSpaces -> \n >$t<};
         }
       ;;
       note qq{still with spaces? >${ \S }<};
       done_testing;
      "
      # perl version: 5.010001
      ok 1 - (?msx-i: (?<= \s) \s+ | \A \s+ | \s+ \z ) ->
      #  >Intel(R) Xeon(R) CPU X5660 2.80GHz<
      ok 2 - (?msx-i: \A \s+ | (?<= \s) \s+ | \s+ \z ) ->
      #  >Intel(R) Xeon(R) CPU X5660 2.80GHz<
      ok 3 - (?msx-i: \A \s+ | \s+ \z | (?<= \s) \s+ ) ->
      #  >Intel(R) Xeon(R) CPU X5660 2.80GHz<
      not ok 4 - (?msx-i: \s \K \s+ | \A \s+ | \s+ \z ) ->
      #  > Intel(R) Xeon(R) CPU X5660 2.80GHz <
      #   Failed test '(?msx-i: \s \K \s+ | \A \s+ | \s+ \z ) ->
      #  > Intel(R) Xeon(R) CPU X5660 2.80GHz <'
      #   at -e line 1.
      not ok 5 - (?msx-i: \A \s+ | \s \K \s+ | \s+ \z ) ->
      #  >Intel(R) Xeon(R) CPU X5660 2.80GHz <
      #   Failed test '(?msx-i: \A \s+ | \s \K \s+ | \s+ \z ) ->
      #  >Intel(R) Xeon(R) CPU X5660 2.80GHz <'
      #   at -e line 1.
      ok 6 - (?msx-i: \A \s+ | \s+ \z | \s \K \s+ ) ->
      #  >Intel(R) Xeon(R) CPU X5660 2.80GHz<
      # still with spaces? > Intel(R) Xeon(R) CPU X5660 2.80GHz <
      1..6
      ok 7 - no warnings
      1..7
      # Looks like you failed 2 tests of 7.
      Hmmm... The  (?<= \s) \s+ sub-pattern continues to work just fine everywhere, but the seemingly equivalent  \s \K \s+ sub-pattern only works in the last position in the ordered alternation. Why? (Food for thought, this.)
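One way to read this result, offered as a sketch rather than a definitive answer: `\s \K \s+` has to consume the first whitespace character of a run before `\K` resets the start of the match, so it can match at the very beginning of a run, where `(?<= \s) \s+` cannot. When it is tried before `\A \s+` or `\s+ \z`, it wins on the leading or trailing run, keeps one character, and the search resumes after the run, so the other alternatives never get a chance at that character. The input string and the helper name below are invented for the demonstration, and `\K` needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: delete everything the given regex matches.
sub strip_matches {
    my ($rx, $s) = @_;
    $s =~ s{$rx}{}g;
    return $s;
}

# Made-up input: two spaces before, between and after the letters.
my $in = '  a  b  ';

# \K alternative first: it also matches the leading and trailing runs,
# keeping one space of each.
print '>', strip_matches(qr{ \s \K \s+ | \A \s+ | \s+ \z }xms, $in), "<\n";   # > a b <

# \K alternative last: \A and \z get the first chance at the outer runs.
print '>', strip_matches(qr{ \A \s+ | \s+ \z | \s \K \s+ }xms, $in), "<\n";   # >a b<

# The look-behind consumes nothing, so it cannot match at the start of a run
# and the order of the alternatives does not matter.
print '>', strip_matches(qr{ (?<= \s) \s+ | \A \s+ | \s+ \z }xms, $in), "<\n";   # >a b<
```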

      A lot of these points echo those made by Laurent_R here: regexes are really neat and I love them, but they're not always the ideal tool for the job.


      Give a man a fish:  <%-{-{-{-<

      Yes, johngg,

      TIMTOWTDI, and, yes, I think this is definitely of interest++. Having said that, I feel that using zero-width look-around assertions for such a simple case might be a little bit of an overkill. Well, at least for a beginner who obviously doesn't know very much about regexes at this point.

      Personally, I'm using look-around assertions only from time to time (sometimes, it is really the best solution), but not often enough to always remember the exact syntax by heart, so that when I feel this is the right solution, I usually have to look it up in Johan Vromans's Perl Pocket Reference (or on the net or somewhere else). For such a simple case, I would rather do most of the job with a simple s/\s+/ /g regex, and add one or two simple regexes to handle leading and trailing spaces if needed. My colleagues having to maintain my code will probably thank me for that and I will be even more delighted when the person having to maintain this code a year from now will be... me.

      BTW, Perl 6's regexes have a much cleaner syntax for look-around assertions, so that I would not have the same second thoughts in P6. But that's getting slightly OT, sorry for that.
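The simpler approach described above, one s/\s+/ /g plus two small substitutions for the ends, could look like the following sketch. The sub name and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: three plain substitutions, no look-around needed.
sub normalize_spaces {
    my ($s) = @_;
    $s =~ s/\A\s+//;    # remove leading whitespace
    $s =~ s/\s+\z//;    # remove trailing whitespace
    $s =~ s/\s+/ /g;    # fold every remaining run into a single space
    return $s;
}

# Made-up input with leading, inner and trailing runs, including a tab.
print '>', normalize_spaces("  Intel(R)   Xeon(R) \t CPU  "), "<\n";   # >Intel(R) Xeon(R) CPU<
```

Each line does one thing, which is the maintainability argument made above.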

Re: Deleting intermediate whitespaces, but leaving one behind each word
by Anonymous Monk on Dec 05, 2017 at 13:28 UTC

    Squeezing or squashing repeated characters is what it's called. There's the tr unix utility, and perl has a tr operator, too.

    $ perl -wpe 'tr/ //s;'
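A short sketch of what the tr operator does and does not cover here: `tr/ //s` squeezes runs of the space character only, leaves tabs alone, and does not trim the ends. Listing the tab in the search list maps it to a space before squeezing. The sub names and the sample string are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: squeeze runs of the space character only.
sub squeeze_spaces {
    my ($s) = @_;
    $s =~ tr/ //s;
    return $s;
}

# Hypothetical helper: map tabs to spaces as well, then squeeze.
sub squeeze_spaces_and_tabs {
    my ($s) = @_;
    $s =~ tr/ \t/ /s;
    return $s;
}

# Made-up input: leading spaces, a run of spaces and a run of tabs.
my $sample = "  CPU   X5660\t\t2.80GHz";

# Space runs shrink to one space; the two tabs and one leading space remain.
print '>', squeeze_spaces($sample), "<\n";

# Tabs become spaces too; one leading space still remains.
print '>', squeeze_spaces_and_tabs($sample), "<\n";   # > CPU X5660 2.80GHz<
```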
