PerlMonks  

Match all Non-0 and Letters

by arblargan (Acolyte)
on Jun 24, 2017 at 07:05 UTC

arblargan has asked for the wisdom of the Perl Monks concerning the following question:

I'm relatively new to Perl and am having a terribly difficult time figuring this one out. I'm expecting a string in the following format:

 00000001

Essentially, a normal word will be seven 0's followed by a single digit from 0 to 9 (8 digits total). However, occasionally there is corruption in the file being processed, which produces words like the following:

 FFFFFFFF or  6C163512

I want to skip these lines of corruption and loop until the corruption has been passed. This is where the tricky part comes in (at least for me). I have tried every combination of matching I can think of, but can't seem to get this one squared away. Below are the lines of code I have tried:

    $Disc = get_word();
    $D1 = substr($Disc,0,7);
    $D2 = substr($Disc,7,1);

    if ($D1 !~ /0+/ and $D2 !~ /([0-9]+)/)  ## Catches FFFFFFFF just fine, but not 6C163512
                                            #### $D1 = 6C16351 and $D2 = 2
        ### Get words until corruption is cleared. Works great with FFFFFFFF, but will not catch 6C163512

    if ($D1 !~ /0000000/ and $D2 !~ /\D/)   ## Same as above
        ### Get words until corruption is cleared. Works great with FFFFFFFF, but will not catch 6C163512

    if ($Disc =~ /[1-9a-ZA-Z]{7}\D/         ## Same as above
        ### Get words until corruption is cleared. Works great with FFFFFFFF, but will not catch 6C163512

I've been working on this forever and can't seem to figure out how to dynamically catch this corruption in the event that all F's have migrated from the string word. I created the $D1 and $D2 variables to try and see why the regex patterns weren't matching, but I still can't figure it out.

Lastly, it should be noted that occasionally, the line of corruption will show as 01020102. The corruption value will be dynamic. This is why I simply can't use /\D+/ for the majority of the string as the first 7 digits must be 0 for a valid word.

Replies are listed 'Best First'.
Re: Match all Non-0 and Letters
by CountZero (Bishop) on Jun 24, 2017 at 08:35 UTC
    Regexes are a cool and important part of your Perl-toolchest. But as with any tool, one must use it wisely.

    In this case, you want to distinguish between "good" and "bad" words. Sometimes it is easy to define what is "good" and sometimes it is easier to define what is "bad".

    In this particular case, the definition of a good word is easy: 7 zeroes followed by a digit. It then follows logically that all words that do not comply with this simple format must be "bad". Hence we extract all "good" words and simply drop all the others, without caring in which way they may be bad.

    The only regex you need is therefore qr/0{7}\d/, and depending on how the words are presented to you, you may wish to "anchor" the regex at the front or the back to avoid some false positives.

    By concentrating on the "bad" words, you made it unnecessarily difficult for yourself.
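    As a minimal sketch of that idea (not code from the thread), the check can be wrapped in a small helper. The name is_good_word is hypothetical, and the sketch assumes each word arrives as a bare 8-character string, so the regex is anchored at both ends:

```perl
use strict;
use warnings;

# A "good" word is exactly seven zeroes followed by one digit.
# \A and \z anchor the match to the whole string.
my $GOOD_WORD = qr/\A0{7}\d\z/;

# Hypothetical helper: returns 1 for a well-formed word, 0 for anything else.
sub is_good_word {
    my ($word) = @_;
    return defined $word && $word =~ $GOOD_WORD ? 1 : 0;
}

# Sample words taken from the question.
for my $word (qw(00000001 FFFFFFFF 6C163512 01020102 00000009)) {
    printf "%s %s\n", $word, is_good_word($word) ? 'ok' : 'corrupt';
}
```

    Only 00000001 and 00000009 are reported as ok; every other word is rejected without the code having to know how it is bad.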

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics

      All, thank you very much for the help. My apologies for the confusing post, which I typed out before bed last night in desperation. The word extraction happens farther up in the subroutine than I've shown, but by the time it gets to this point, it will always be 8 continuous digits (or letters if there's corruption) not separated by whitespace.

      I realize that using the $D1 and $D2 variables makes the regex much more difficult than it needs to be, but I created those to try and figure out where the regex was failing. When I tried my initial regex, it looked something like this:

      if ($Disc =~ /[1-9a-zA-Z]{7}\D/)

      However, this still did not do what I wanted. I did try something similar to if ($Disc !~ /0{7}\d/) but I think I may have used \D instead of \d by mistake. I just tried if ($Disc !~ /(0{7})(\d$)/) and the regex worked great!

      Thank you all for the quick replies and showing the correct syntax for what I'm trying to do. As I mentioned before, I'm relatively new to Perl, so I still have quite a ways to go, especially with the regex syntax.

        The word ... will always be 8 continuous digits (or letters if there's corruption) not separated by whitespace.
        ...
        I just tried if ($Disc !~ /(0{7})(\d$)/) and the regex worked great!

        Note that if $Disc can ever possibly be longer than eight characters (update: with extra characters at the beginning), that regex will fail to reject it:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $Disc = 'foo00000008';
         ;;
         if ($Disc !~ /(0{7})(\d$)/) {
           print qq{'$Disc' is bad};
         }
         else {
           print qq{'$Disc' is OK!};
         }
        "
        'foo00000008' is OK!
        If the string can only possibly be exactly eight characters, the  $ end-of-string anchor is redundant. OTOH, I would tend to play it safe and include both start-of-string  ^ and end-of-string anchors: it can't hurt, and may save you someday when one of your upstream assumptions fails you.
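        One further detail about the end anchor, offered as a small sketch rather than something stated in the thread: in Perl, $ also matches just before a trailing newline, whereas \z matches only at the very end of the string. The example assumes a word that still carries its line ending, which may not apply if the upstream code already strips it:

```perl
use strict;
use warnings;

my $word = "00000008\n";    # assumption: the word was read without chomp

# '$' also matches just before a trailing newline, so this succeeds.
my $dollar_matches = $word =~ /^0{7}\d$/ ? 1 : 0;

# '\z' matches only at the very end of the string, so this fails.
my $z_matches = $word =~ /\A0{7}\d\z/ ? 1 : 0;

print "with \$:  $dollar_matches\n";    # prints: with $:  1
print "with \\z: $z_matches\n";         # prints: with \z: 0
```

        Whether the newline-tolerant behaviour is wanted depends on how the word is extracted; \A and \z are the stricter choice.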

        The other thing I notice about the  /(0{7})(\d$)/ regex is that  (0{7}) captures a substring that can't possibly be anything other than '0000000', so why bother? (I assume you have some reason for capturing the trailing digit.)

        So what I might end up with would be something like  m{ \A 0{7} (\d) \z }xms (in a testing matrix):

        c:\@Work\Perl\monks>perl -wMstrict -le
        "for my $Disc (qw(
             00000000 00000001 00000002 00000003 00000004
             00000005 00000006 00000007 00000008 00000009
             0 00 000 0000 00000 000000 0000000 000000000
             FFFFFFFF ffffffff 6C163512
             x00000000 00000000x x00000000x
             x0000000 0000000x x0000000x
             x000000000 000000000x x000000000x
             ), '') {
           ;;
           my $proper_word = my ($rightmost_digit) = $Disc =~ m{ \A 0{7} (\d) \z }xms;
           ;;
           if ($proper_word) {
             print qq{'$Disc' ok, rightmost digit '$rightmost_digit'};
           }
           else {
             print qq{'$Disc' is bad};
           }
         }
        "
        '00000000' ok, rightmost digit '0'
        '00000001' ok, rightmost digit '1'
        '00000002' ok, rightmost digit '2'
        '00000003' ok, rightmost digit '3'
        '00000004' ok, rightmost digit '4'
        '00000005' ok, rightmost digit '5'
        '00000006' ok, rightmost digit '6'
        '00000007' ok, rightmost digit '7'
        '00000008' ok, rightmost digit '8'
        '00000009' ok, rightmost digit '9'
        '0' is bad
        '00' is bad
        '000' is bad
        '0000' is bad
        '00000' is bad
        '000000' is bad
        '0000000' is bad
        '000000000' is bad
        'FFFFFFFF' is bad
        'ffffffff' is bad
        '6C163512' is bad
        'x00000000' is bad
        '00000000x' is bad
        'x00000000x' is bad
        'x0000000' is bad
        '0000000x' is bad
        'x0000000x' is bad
        'x000000000' is bad
        '000000000x' is bad
        'x000000000x' is bad
        '' is bad
        (See also Test::More for more thorough testing possibilities.)
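        A rough idea of what such a test file might look like, using only is, ok and done_testing from Test::More. The helper name rightmost_digit is hypothetical; it simply wraps the regex shown above:

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical helper wrapping the regex from the post above.
# Returns the rightmost digit for a proper word, undef otherwise.
sub rightmost_digit {
    my ($word) = @_;
    my ($digit) = $word =~ m{ \A 0{7} (\d) \z }xms;
    return $digit;
}

is( rightmost_digit('00000007'), '7', 'proper word yields its last digit' );

for my $bad (qw(FFFFFFFF 6C163512 01020102 0000000 000000000 x00000000 00000000x), '') {
    my $got = rightmost_digit($bad);
    ok( !defined $got, "'$bad' is rejected" );
}

done_testing();
```

        Each case reports its own pass or fail line, which makes it easier to spot which input breaks when the regex changes.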


        Give a man a fish:  <%-{-{-{-<

Re: Match all Non-0 and Letters
by Athanasius (Archbishop) on Jun 24, 2017 at 07:23 UTC

    Hello arblargan, and welcome to the Monastery!

    Assuming your “words” are separated by whitespace within each line, the following should do what you want:

    use strict;
    use warnings;

    OUTER:
    while (my $line = <DATA>)
    {
        my @words = split /\s+/, $line;

        for (@words)
        {
            next OUTER unless /^0{7}\d$/;
        }

        print $line;
    }

    __DATA__
    00000000 00000001 00000009
    00000006 FFFFFFFF 00000007
    6C163512 00000000 00000008
    00000003 00000004 01020102

    Output:

    17:21 >perl 1786_SoPW.pl
    00000000 00000001 00000009

    17:22 >

    Hope that helps,

    Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re: Match all Non-0 and Letters
by Laurent_R (Canon) on Jun 24, 2017 at 09:11 UTC
    Hi arblargan,

    as other monks have already mentioned, all you really need is a single regex such as /0{7}\d/ (or perhaps /^0{7}\d$/ if the word you get is just the number).

    You could, however, split your word into two parts as you did, but you made a logical error: you should have an "or", not an "and", in your condition for detecting a corrupt word, because you want to detect if the first part is not made of 0 OR if the second part is not a digit. So, you might fix your code as follows:

    my $Disc = get_word();
    my $D1 = substr($Disc,0,7);
    my $D2 = substr($Disc,7,1);
    print "Word $Disc is corrupt!\n" if $D1 !~ /0+/ or $D2 !~ /[0-9]+/;
    But, again, this was just to explain the error in your code; the solution with /0{7}\d/ is much simpler and better.

    Update: this was intended to show the logical error ("and" instead of "or"). As pointed out by AnomalousMonk just below, the regexes are also wrong in terms of the intended purpose described in the original post.
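    To make the "and" versus "or" point concrete, here is a small sketch. The helper names are hypothetical, the two half-tests are tightened versions rather than the regexes from the original post, and every word is assumed to be exactly 8 characters long:

```perl
use strict;
use warnings;

# Hypothetical helper: split an 8-character word into its first 7
# characters and its last character.
sub split_word {
    my ($word) = @_;
    return ( substr( $word, 0, 7 ), substr( $word, 7, 1 ) );
}

# "and": flags the word only when BOTH halves are wrong.
sub corrupt_with_and {
    my ( $d1, $d2 ) = split_word(shift);
    my $bad = ( $d1 ne '0000000' and $d2 !~ /\A\d\z/ ) ? 1 : 0;
    return $bad;
}

# "or": flags the word when EITHER half is wrong.
sub corrupt_with_or {
    my ( $d1, $d2 ) = split_word(shift);
    my $bad = ( $d1 ne '0000000' or $d2 !~ /\A\d\z/ ) ? 1 : 0;
    return $bad;
}

for my $word (qw(00000001 FFFFFFFF 6C163512 01020102)) {
    printf "%s and=%d or=%d\n", $word, corrupt_with_and($word), corrupt_with_or($word);
}
# prints:
# 00000001 and=0 or=0
# FFFFFFFF and=1 or=1
# 6C163512 and=0 or=1
# 01020102 and=0 or=1
```

    With "and", 6C163512 and 01020102 slip through because their last character happens to be a digit, which matches the behaviour described in the question.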

      ... this was just to explain the error in your code ... /0{7}\d/ is much simpler and better.

      I understand that the intended purpose of the code example is very limited, but I think it's very important to point out that the
          $D1 !~ /0+/
      test ("if there is not at least one '0' in the first 7 digits") is also a fundamental error.


      Give a man a fish:  <%-{-{-{-<

        Yes, AnomalousMonk, you're right. I wanted to point out the logical error ("and" instead of "or" in the conditional), but you're absolutely right that the regexes should also be fixed.

        Perhaps something like:

        print "Word $Disc is corrupt!\n" if $D1 !~ /^0{7}$/ or $D2 !~ /[0-9]/;
        And the first part of the conditional could actually be replaced by a string inequality operator rather than a regex:
        print "Word $Disc is corrupt!\n" if ($D1 ne '0' x 7) or $D2 !~ /[0-9]/;
        Update: s/instead or "or"/instead of "or"/;. Thanks to Discipulus for pointing out the typo.
Re: Match all Non-0 and Letters
by AnomalousMonk (Archbishop) on Jun 24, 2017 at 08:16 UTC

    It's not clear to me just what you want. If you want to extract from a line all "normal" words skipping other words, try something like this:

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "my $normal = qr{ 0{7} [0-9] }xms;
     ;;
     my $line =
       '00000000 FFFFFFFF 00000001 6C163512 00000002 ' .
       'ffffffff 00000003 0000009 00000004 000000009 ' .
       '0 00 000 0000 00000 000000 0000000 000000000 ' .
       '00000005'
       ;
     print qq{line: '$line'};
     ;;
     my @all_ok = $line =~ m{ \b $normal \b }xmsg;
     dd \@all_ok;
    "
    line: '00000000 FFFFFFFF 00000001 6C163512 00000002 ffffffff 00000003 0000009 00000004 000000009 0 00 000 0000 00000 000000 0000000 000000000 00000005'
    [
      "00000000",
      "00000001",
      "00000002",
      "00000003",
      "00000004",
      "00000005",
    ]


    Give a man a fish:  <%-{-{-{-<

Re: Match all Non-0 and Letters
by anonymized user 468275 (Curate) on Jun 25, 2017 at 07:15 UTC
    Sorry for being the wicked witch arriving late at what looks like a decimal-only gabberfest, but I could not help smiling at the term 'corruption' - apparently just because the data is hexadecimal rather than decimal. Or to put it another way, are you sure you should be filtering out the hex rather than taking it at face value?

    What about just converting it to decimal instead, e.g. see https://perldoc.perl.org/functions/hex.html

    Update: If you want to limit the data to a range of values, you should STILL convert from hex to decimal first and then apply the test. In other words, drop the idea that a word such as FFFFFFFF is corrupt merely because it contains letters: 0000000A, for example, is only 10 in decimal, quite a low value, and you might well want to include it!
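    A minimal sketch of that suggestion, using Perl's built-in hex function. The helper name word_to_number is hypothetical, and the 0 to 9 range check is only an assumption based on the format given in the question:

```perl
use strict;
use warnings;

# Hypothetical helper: decode an 8-digit hex word, or return undef when the
# word is not exactly eight hex digits.
sub word_to_number {
    my ($word) = @_;
    return unless defined $word && $word =~ /\A[0-9A-Fa-f]{8}\z/;
    return hex $word;
}

# The 0..9 range is an assumption taken from the question's word format.
for my $word (qw(00000001 0000000A FFFFFFFF)) {
    my $n = word_to_number($word);
    my $in_range = defined $n && $n <= 9 ? 'in range' : 'out of range';
    print "$word = $n ($in_range)\n";
}
```

    Under this reading 0000000A decodes to 10 and FFFFFFFF to 4294967295; whether such values are data or noise is a decision about the input, not something the conversion can settle.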

    One world, one people

      'corruption' - apparently just because the data is hexadecimal rather than decimal

      I think the OP was quite specific in the definition of the input format - "a normal word will be 7 0's followed by a number between 0-9 (8-digits total)". To put some perspective on this from an ECE point of view, I find this kind of corruption is completely "normal", for example, in an RS-232 or wireless serial data stream corrupted by noise. Simply skipping the obviously corrupted values until a good value is seen is a valid approach to regaining synchronization with the stream. Of course there are ways to add error detection and/or correction encodings on the stream on the transmitting end so the corruption is less likely in the first place, but a large number of "modern" devices I've worked with still don't do this.
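      The skip-until-good approach might look roughly like this. The get_word below is only a stand-in for the routine of the same name in the question (its real implementation is not shown in the thread), and next_good_word is a hypothetical helper:

```perl
use strict;
use warnings;

# Stand-in for the get_word() mentioned in the question: returns the next
# word from a fixed list, or undef when the list is exhausted.
my @stream = qw(00000001 FFFFFFFF 6C163512 01020102 00000002 00000003);
sub get_word { return shift @stream }

# Hypothetical helper: read words until a well-formed one turns up.
# Returns undef at end of input.
sub next_good_word {
    while ( defined( my $word = get_word() ) ) {
        return $word if $word =~ /\A0{7}\d\z/;
        # Anything else is treated as corruption and skipped.
    }
    return;
}

my @good;
while ( defined( my $word = next_good_word() ) ) {
    push @good, $word;
}
print "@good\n";    # prints: 00000001 00000002 00000003
```

      The caller never sees the corrupted words, so the rest of the processing loop can stay unchanged.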
