Case insensitive string comparison

by DAN0207 (Acolyte)
on Jun 26, 2020 at 06:32 UTC (#11118553, perlquestion)

DAN0207 has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file with the content as follows

SMS,SMS1,20190811,084500,servname,servid,servname1,s1,400,300,300,300,300,300
SMS,SMSh,20190811,084500,servname,servid,servname1,s1,700,300,300,300,300,300
SMS,SMSH,20190811,084500,servname,servid,servname1,s1,600,300,300,300,300,300
SMS,SMSi,20190811,084500,servname,servid,servname1,s1,800,300,300,300,300,300
SMS,SMSI,20190811,084500,servname,servid,servname1,s1,500,300,300,300,300,300

I have written the following line of code for case insensitive string comparison

$$blk_ref = 'SMSblk'
    if $$blk_ref =~ /SMSi/i || $$blk_ref =~ /SMSI/i
    || $$blk_ref =~ /SMSh/i || $$blk_ref =~ /SMSH/i
    || $$blk_ref =~ /SMS1/;

But in the output file, I get the values only for SMSi, SMSh and SMS1. The values of SMSI and SMSH are not taken. Kindly help me correct my code so that all the values are present in the output file.

Replies are listed 'Best First'.
Re: Case insensitive string comparison (updated x2)
by AnomalousMonk (Bishop) on Jun 26, 2020 at 07:42 UTC

    Your code seems ok to me:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my @strings = (
       'SMS,SMS1,20190811',
       'SMS,SMSh,20190811',  'SMS,SMSH,20190811',  'SMS,SMSx,20190811',
       'SMS,SMSi,20190811',  'SMS,SMSX,20190811',  'SMS,SMSI,20190811',
       );
     ;;
     for my $s (@strings) {
       my $ref_s = \$s;
       print qq{'$$ref_s' matches}
         if $$ref_s =~ /SMSi/i || $$ref_s =~ /SMSI/i
         || $$ref_s =~ /SMSh/i || $$ref_s =~ /SMSH/i
         || $$ref_s =~ /SMS1/
         ;
       }
    "
    'SMS,SMS1,20190811' matches
    'SMS,SMSh,20190811' matches
    'SMS,SMSH,20190811' matches
    'SMS,SMSi,20190811' matches
    'SMS,SMSI,20190811' matches

    What am I doing differently from what you're doing?

    Update 1: Here's a variation showing assignment via reference (and aliasing). Again, I think it works the way I think you think it should work. (Maybe take a look at Short, Self-Contained, Correct Example.)

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my @strings = (
       'SMS,SMS1,20190811',
       'SMS,SMSh,20190811',  'SMS,SMSH,20190811',  'SMS,SMSx,20190811',
       'SMS,SMSi,20190811',  'SMS,SMSX,20190811',  'SMS,SMSI,20190811',
       );
     ;;
     for my $s (@strings) {
       my $ref_s = \$s;
       $$ref_s = 'SMSblk'
         if $$ref_s =~ /SMSi/i || $$ref_s =~ /SMSI/i
         || $$ref_s =~ /SMSh/i || $$ref_s =~ /SMSH/i
         || $$ref_s =~ /SMS1/
         ;
       }
     ;;
     dd \@strings;
    "
    [
      "SMSblk",
      "SMSblk",
      "SMSblk",
      "SMS,SMSx,20190811",
      "SMSblk",
      "SMS,SMSX,20190811",
      "SMSblk",
    ]

    Update 2: BTW: I'd tend to write something like this a bit differently:

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my @strings = (
       'SMS,SMS1,20190811',
       'SMS,SMSh,20190811',  'SMS,SMSH,20190811',  'SMS,SMSx,20190811',
       'SMS,SMSi,20190811',  'SMS,SMSX,20190811',  'SMS,SMSI,20190811',
       );
     ;;
     for my $s (@strings) {
       my $ref_s = \$s;
       $$ref_s .= ' is SMSblk' if $$ref_s =~ /SMS[iIhH1]/;
       }
     ;;
     dd \@strings;
    "
    [
      "SMS,SMS1,20190811 is SMSblk",
      "SMS,SMSh,20190811 is SMSblk",
      "SMS,SMSH,20190811 is SMSblk",
      "SMS,SMSx,20190811",
      "SMS,SMSi,20190811 is SMSblk",
      "SMS,SMSX,20190811",
      "SMS,SMSI,20190811 is SMSblk",
    ]

    And maybe also throw in some kind of boundary assertion like  \b so the final regex might look like
        / \b SMS[iIhH1] \b /x
    to prevent a string like  'SMS,xSMSHx,20190811' from matching.
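
A minimal sketch of the boundary-assertion suggestion above, for anyone who wants to try it quickly. The helper name `has_sms_code` and the test lines are made up for illustration; only the regex comes from the post.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns 1 if the line
# contains one of the SMS codes as a whole word, 0 otherwise.
sub has_sms_code {
    my ($line) = @_;
    return $line =~ / \b SMS[iIhH1] \b /x ? 1 : 0;
}

for my $line ('SMS,SMSH,20190811', 'SMS,xSMSHx,20190811', 'SMS,SMSx,20190811') {
    printf "%-22s %s\n", $line, has_sms_code($line) ? 'matches' : 'no match';
}
```

Only the first line should report a match: in the second, `SMSH` is surrounded by word characters so neither `\b` can succeed, and in the third the suffix `x` is not in the character class.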


    Give a man a fish:  <%-{-{-{-<

      I was also confused about what this $$blk_ref was about. And I just punted that issue in my direct post. It would be helpful if the OP showed more of his application. Dereferencing a ref to a single scalar is a relatively rare thing in Perl. That is because Perl array iterator operations are very good at hiding this nastiness.

      For fun, I used your example data and coded the loop a couple of different ways. Neither uses an explicit dereferencing operation.

      use strict;
      use warnings;
      use Data::Dump qw(dd);

      my @strings = (
          'SMS,SMS1,20190811',
          'SMS,SMSh,20190811',  'SMS,SMSH,20190811',  'SMS,SMSx,20190811',
          'SMS,SMSi,20190811',  'SMS,SMSX,20190811',  'SMS,SMSI,20190811',
      );

      # map{} is a logical thought for an array transformation
      #
      # Could assign back to @strings or can make
      # a new array, @strings2
      # Could use an "if" and concatenate a message if true
      # and return $_ in any event.  Ternary operator here
      # gives a place to put a single token that is
      # not "SMSblk"

      my @strings2 = map{/SMS[1HI]/i ? "$_ is SMSblk":$_}@strings;
      dd \@strings2;

      =prints:
      [
        "SMS,SMS1,20190811 is SMSblk",
        "SMS,SMSh,20190811 is SMSblk",
        "SMS,SMSH,20190811 is SMSblk",
        "SMS,SMSx,20190811",
        "SMS,SMSi,20190811 is SMSblk",
        "SMS,SMSX,20190811",
        "SMS,SMSI,20190811 is SMSblk",
      ]
      =cut

      # Or with a for loop instead of map{}
      # to modify original array:
      # Foreach creates an alias and modifying that
      # alias modifies the original array
      # No tricky dereferencing is needed.

      foreach (@strings)
      {
          $_ .= " is SMSblk" if /SMS[1HI]/i;
      }
      dd \@strings;

      =prints:
      [
        "SMS,SMS1,20190811 is SMSblk",
        "SMS,SMSh,20190811 is SMSblk",
        "SMS,SMSH,20190811 is SMSblk",
        "SMS,SMSx,20190811",
        "SMS,SMSi,20190811 is SMSblk",
        "SMS,SMSX,20190811",
        "SMS,SMSI,20190811 is SMSblk",
      ]
      =cut

        ... a ref to a single scalar ...

        My guess about a possible rationale for this is that the strings being handled are in reality very long and DAN0207 wants to avoid making a bunch of copies of these long strings (e.g., to pass to subroutines). In a case like this, I'd agree that taking a reference to a scalar (string) and passing it around could be quite advantageous in the right circumstances. But this is all just guesswork.
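
To make that guess concrete, here is a rough sketch of passing a scalar reference to a subroutine so the string is changed in place rather than copied. The sub name `mark_blk` and the sample record are invented for illustration; the OP's real code is not shown in the thread, so this is only an assumption about how `$$blk_ref` might be used.

```perl
use strict;
use warnings;

# Hypothetical sub (not the OP's code): takes a reference to a string
# and overwrites the referenced string in place when it matches.
# Only the reference is passed, so the string itself is not copied.
sub mark_blk {
    my ($blk_ref) = @_;
    $$blk_ref = 'SMSblk' if $$blk_ref =~ /SMS[iIhH1]/;
    return;
}

my $record = 'SMS,SMSH,20190811';
mark_blk(\$record);
print "$record\n";
```

After the call, `$record` holds the replacement value because the sub wrote through the reference instead of working on a copy.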



Re: Case insensitive string comparison
by Marshall (Abbot) on Jun 27, 2020 at 16:45 UTC
    I have a few comments for you. I will leave the dereferencing code out of my response because my main points have to do with the regex matching part and also I don't really understand what you are doing with your deref of a reference to a scalar.

    First, I would not match against the whole comma-separated line; I would narrow the focus to the field that you are interested in. Below I use a split to get field [1] (the second field). Another poster suggested using a boundary condition in the regex for the same intended purpose (making sure you are matching against what you think you are). We don't know what the other names or IDs in the line look like; perhaps one server is "sms1Master" or whatever.

    Instead of multiple "or" terms, I would use a character set in this case. This makes it easier for me to see what is going to match or not match. Of course mileage varies.

    use strict;
    use warnings;

    while (<DATA>)
    {
        my $SMSfield = (split(',',$_))[1];
        if ($SMSfield =~ /SMS[1HI]/i)
        {
            print "Match $SMSfield\n";
        }
        else
        {
            print "No Match $SMSfield\n";
        }
    }

    =prints
    Match SMS1
    Match SMSh
    Match SMSH
    Match SMSi
    Match SMSI
    Match SmsI   **Note this match** I think in your case, this is fine.
    No Match SMSx
    =cut

    __DATA__
    SMS,SMS1,20190811,084500,servname,servid,servname1,s1,400,300,300,300,300,300
    SMS,SMSh,20190811,084500,servname,servid,servname1,s1,700,300,300,300,300,300
    SMS,SMSH,20190811,084500,servname,servid,servname1,s1,600,300,300,300,300,300
    SMS,SMSi,20190811,084500,servname,servid,servname1,s1,800,300,300,300,300,300
    SMS,SMSI,20190811,084500,servname,servid,servname1,s1,500,300,300,300,300,300
    SMS,SmsI,20190811,084500,servname,servid,servname1,s1,500,300,300,300,300,300
    SMS,SMSx,20190811,084500,servname,servid,servname1,s1,500,300,300,300,300,300

      I agree with matching against a particular field rather than against the entire string, and with using a character class rather than several regex matches in tandem.

      I have some comments regarding implementation details. I'm forced to admit, however, that because I don't really know DAN0207's requirements, these comments may be meaningless. That said, I forge ahead.

      Firstly, the  /SMS[1HI]/i match against the extracted $SMSfield field allows a field like 'xSMSIx' to be accepted. This match could benefit from anchor assertions:  / \A SMS [1HI] \z /xi rejects this field.

      Secondly, I find the use of the global  /i flag problematic. In the OPed code statement
          $$blk_ref = 'SMSblk' if $$blk_ref =~ /SMSi/i || ... || $$blk_ref =~ /SMS1/;
      the  /i modifier is only present in matches with an  i I h H suffix, not with the numeric suffix. This suggests (and again, I'm only guessing) that the 'SMS' subfield of the field in question should not be matched case-insensitively. If that's so, a match of
          / \A SMS [1hHiI] \z /x
      (which I personally prefer) or
          / \A SMS (?i) [1HI] \z /x
      will reject the 'SmsI' field and all like it.
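
A quick way to compare the three anchored patterns discussed above is to run them side by side against a few extracted fields. This is only a sketch: the hash `%pattern`, its labels, and the list of test fields are made up for the comparison, while the regexes themselves are the ones from this reply.

```perl
use strict;
use warnings;

# Labels are illustrative; the patterns are the ones discussed in the reply.
my %pattern = (
    'whole /i'     => qr/ \A SMS [1HI] \z /xi,       # everything case-insensitive
    'explicit set' => qr/ \A SMS [1hHiI] \z /x,      # cases spelled out
    'scoped (?i)'  => qr/ \A SMS (?i) [1HI] \z /x,   # only the suffix ignores case
);

for my $field (qw(SMS1 SMSh SMSI SmsI xSMSIx)) {
    for my $name (sort keys %pattern) {
        printf "%-8s %-13s %s\n", $field, $name,
            $field =~ $pattern{$name} ? 'match' : 'no match';
    }
}
```

All three reject `xSMSIx` because of the anchors. Only the pattern with the trailing `/i` flag accepts `SmsI`; the other two keep the `SMS` prefix case-sensitive.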



        Good points.
        I didn't read too much into the OP's use of the /i modifier because when I saw $$blk_ref =~ /SMSh/i || $$blk_ref =~ /SMSH/i, that led me to believe that perhaps the OP doesn't really understand what /i does. So I gave an example where you have to rely upon the /i operation working. Having said that, in my own code I probably would have used your character set [1hHiI], which explicitly enumerates the possibilities, because this is just H and I. If there were, say, 10 options, all with lowercase and uppercase versions, I'd do it more like I showed in my example in an attempt to avoid missing one possibility.

        I actually did consider the use of anchors. I thought that narrowing the focus to the field of interest would be "good enough". We don't know where this csv data comes from. I suppose that this could potentially come from some spreadsheet or other program which might add "" marks even where not required (but allowed). In that case, something like /^SMS/ would fail.
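
To illustrate the quoting concern, here is a small sketch. It assumes the exporting program wraps a field in one pair of double quotes; the helper `unquote` is hypothetical and is not something proposed in the thread, and a real CSV parser would be the more robust choice for genuinely quoted data.

```perl
use strict;
use warnings;

# Hypothetical helper: strip one pair of surrounding double quotes, if present.
sub unquote {
    my ($field) = @_;
    $field =~ s/\A"(.*)"\z/$1/s;
    return $field;
}

for my $field ('SMSH', '"SMSH"') {
    my $raw     = $field =~ /^SMS/          ? 'match' : 'no match';
    my $cleaned = unquote($field) =~ /^SMS/ ? 'match' : 'no match';
    print "$field: raw $raw, unquoted $cleaned\n";
}
```

The quoted field fails the anchored match as-is and passes once the quotes are removed, which is the failure mode described above.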

        I think it is highly likely that this data comes from another program rather than from user input. In cases like that, I often write regexes that allow more matches than a very rigid interpretation would, because the computer won't "fumble finger" in an extraneous character. All of these types of decisions come down to the exact application, which we just don't know.

        Overall I think this is a good thread. Although I do wish that the OP had provided more code to put his problem into a wider context. The Monks demonstrated some new points for the OP to consider along with adequate explanations. I hope that the OP reads all this stuff and decides what is right for his application.

Node Type: perlquestion [id://11118553]
Front-paged by Corion