arblargan has asked for the wisdom of the Perl Monks concerning the following question:
I'm relatively new to Perl and am having a terribly difficult time figuring this one out. I'm expecting a string in the following format:
00000001
Essentially, a normal word will be 7 0's followed by a number between 0-9 (8-digits total). However, occasionally there is corruption in the file being processed, causing the format to have something like the following:
FFFFFFFF or 6C163512
I want to skip these lines of corruption and loop until the corruption has been passed. This is where the tricky part comes in (at least for me). I have tried every combination of matching I can think of, but can't seem to get this one squared away. Below are the lines of code I have tried:
$Disc = get_word();
$D1 = substr($Disc,0,7);
$D2 = substr($Disc,7,1);
if ($D1 !~ /0+/ and $D2 !~ /([0-9]+)/) ##Catches FFFFFFFF just fine, but not 6C163512 #### $D1 = 6C16351 and $D2 = 2
###Get words until corruption is cleared. Works great with FFFFFFFF, but will not catch 6C163512
if ($D1 !~ /0000000/ and $D2 !~ /\D/) ## Same as above
###Get words until corruption is cleared. Works great with FFFFFFFF, but will not catch 6C163512
if ($Disc =~ /[1-9a-zA-Z]{7}\D/) ## Same as above
###Get words until corruption is cleared. Works great with FFFFFFFF, but will not catch 6C163512
I've been working on this forever and can't seem to figure out how to dynamically catch this corruption in the event that all F's have migrated from the string word. I created the $D1 and $D2 variables to try and see why the regex patterns weren't matching, but I still can't figure it out.
Lastly, it should be noted that occasionally, the line of corruption will show as 01020102. The corruption value will be dynamic. This is why I simply can't use /\D+/ for the majority of the string as the first 7 digits must be 0 for a valid word.
Re: Match all Non-0 and Letters
by CountZero (Bishop) on Jun 24, 2017 at 08:35 UTC
Regexes are a cool and important part of your Perl toolchest, but as with any tool, one must use them wisely. In this case, you want to distinguish between "good" and "bad" words. Sometimes it is easier to define what is "good" and sometimes it is easier to define what is "bad". In this particular case, the definition of a good word is easy: 7 zeroes followed by a digit. It then follows logically that all words that do not comply with this simple format must be "bad". Hence we extract all "good" words and simply drop all others, and we don't care in which way they may be bad. The only regex you need is therefore qr/0{7}\d/ and, depending on how the words are presented to you, you may wish to "anchor" the regex at the front or the back to avoid false positives. By concentrating on the "bad" words you made it unnecessarily difficult for yourself.
CountZero "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics
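A minimal sketch of the "define what is good, drop the rest" approach described above. It assumes the words are already available as a list; the sample data is taken from the values in the question, and is_good_word is a hypothetical helper name, not something from the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: a word is good only if it is exactly
# seven zeroes followed by one digit.
sub is_good_word {
    my ($word) = @_;
    return $word =~ /\A0{7}\d\z/ ? 1 : 0;
}

# Sample words taken from the question; a real program would get
# them from the file being processed.
my @words = qw(00000001 FFFFFFFF 6C163512 01020102 00000007);

# Keep the good words and drop everything else, without caring
# how the rejected words are malformed.
my @good = grep { is_good_word($_) } @words;

print "@good\n";
```

The regex is anchored at both ends here so that a longer string containing a good word somewhere inside it is not accepted by accident.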
All, thank you very much for the help. My apologies with the confusing post as I typed this out before bed last night in desperation. The word extraction happens farther up in the subroutine than I've shown, but by the time it gets to this point, it will always be 8 continuous digits (or letters if there's corruption) not separated by whitespace.
I realize that using the $D1 and $D2 variables makes the regex much more difficult than it needed to be, but I created those to try and figure out where the regex was failing at. When I tried my initial regex it looked something like this
if ($Disc =~ /[1-9a-zA-Z]{7}\D/)
However, this still did not perform the functions that I was wanting. I did try something similar to if ($Disc !~ /0{7}\d/) but I think I may have used a D by mistake. I just tried if ($Disc !~ /(0{7})(\d$)/) and the regex worked great!
Thank you all for the quick replies and showing the correct syntax for what I'm trying to do. As I mentioned before, I'm relatively new to Perl, so I still have quite a ways to go, especially with the regex syntax.
c:\@Work\Perl\monks>perl -wMstrict -le
"my $Disc = 'foo00000008';
;;
if ($Disc !~ /(0{7})(\d$)/) {
print qq{'$Disc' is bad};
}
else {
print qq{'$Disc' is OK!};
}
"
'foo00000008' is OK!
If the string can only possibly be exactly eight characters, the $ end-of-string anchor is redundant. OTOH, I would tend to play it safe and include both start-of-string ^ and end-of-string anchors: it can't hurt, and may save you someday when one of your upstream assumptions fails you.
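One detail that may matter when choosing anchors: $ also matches just before a trailing newline, whereas \z matches only at the very end of the string. A small sketch of the difference, assuming a word that still carries its newline (the variable names are illustrative only):

```perl
use strict;
use warnings;

my $with_newline = "00000001\n";   # e.g. a line that was never chomp()ed

# $ also matches just before a trailing newline, so this succeeds.
my $dollar_matches = $with_newline =~ /^0{7}\d$/ ? 1 : 0;

# \z matches only at the very end of the string, so this fails.
my $z_matches = $with_newline =~ /\A0{7}\d\z/ ? 1 : 0;

printf "with \$: %d, with \\z: %d\n", $dollar_matches, $z_matches;
```

Whether the stricter \z behaviour is wanted depends on whether the upstream code has already removed line endings.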
The other thing I notice about the /(0{7})(\d$)/ regex is that (0{7}) captures a substring that can't possibly be anything other than '0000000', so why bother? (I assume you have some reason for capturing the trailing digit.)
So what I might end up with would be something like m{ \A 0{7} (\d) \z }xms (in a testing matrix):
c:\@Work\Perl\monks>perl -wMstrict -le
"for my $Disc (qw(
00000000 00000001 00000002 00000003 00000004
00000005 00000006 00000007 00000008 00000009
0 00 000 0000 00000 000000 0000000 000000000
FFFFFFFF ffffffff 6C163512
x00000000 00000000x x00000000x
x0000000 0000000x x0000000x
x000000000 000000000x x000000000x
), '') {
;;
my $proper_word =
my ($rightmost_digit) = $Disc =~ m{ \A 0{7} (\d) \z }xms;
;;
if ($proper_word) {
print qq{'$Disc' ok, rightmost digit '$rightmost_digit'};
}
else {
print qq{'$Disc' is bad};
}
}
"
'00000000' ok, rightmost digit '0'
'00000001' ok, rightmost digit '1'
'00000002' ok, rightmost digit '2'
'00000003' ok, rightmost digit '3'
'00000004' ok, rightmost digit '4'
'00000005' ok, rightmost digit '5'
'00000006' ok, rightmost digit '6'
'00000007' ok, rightmost digit '7'
'00000008' ok, rightmost digit '8'
'00000009' ok, rightmost digit '9'
'0' is bad
'00' is bad
'000' is bad
'0000' is bad
'00000' is bad
'000000' is bad
'0000000' is bad
'000000000' is bad
'FFFFFFFF' is bad
'ffffffff' is bad
'6C163512' is bad
'x00000000' is bad
'00000000x' is bad
'x00000000x' is bad
'x0000000' is bad
'0000000x' is bad
'x0000000x' is bad
'x000000000' is bad
'000000000x' is bad
'x000000000x' is bad
'' is bad
(See also Test::More for more thorough testing possibilities.)
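As a rough illustration of how the same matrix might look with Test::More, here is a short sketch. The helper name rightmost_digit is hypothetical; the regex is the one used in the one-liner above.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical helper wrapping the regex discussed above.
sub rightmost_digit {
    my ($word) = @_;
    my ($digit) = $word =~ m{ \A 0{7} (\d) \z }xms;
    return $digit;    # undef when the word is not well formed
}

# Every well-formed word should yield its final digit.
is( rightmost_digit("0000000$_"), $_, "0000000$_ is ok" ) for 0 .. 9;

# Everything else should be rejected.
ok( !defined( rightmost_digit($_) ), "'$_' is rejected" )
    for qw(0000000 000000000 FFFFFFFF 6C163512 01020102 x00000000 00000000x), '';

done_testing();
```

Each case becomes a named test, so a regression in the regex shows up as a specific failing line instead of a difference in printed output.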
Give a man a fish: <%-{-{-{-<
Re: Match all Non-0 and Letters
by Athanasius (Archbishop) on Jun 24, 2017 at 07:23 UTC
Hello arblargan, and welcome to the Monastery!
Assuming your “words” are separated by whitespace within each line, the following should do what you want:
use strict;
use warnings;
OUTER: while (my $line = <DATA>)
{
my @words = split /\s+/, $line;
for (@words)
{
next OUTER unless /^0{7}\d$/;
}
print $line;
}
__DATA__
00000000 00000001 00000009
00000006 FFFFFFFF 00000007
6C163512 00000000 00000008
00000003 00000004 01020102
Output:
17:21 >perl 1786_SoPW.pl
00000000 00000001 00000009
17:22 >
Hope that helps,
Re: Match all Non-0 and Letters
by Laurent_R (Canon) on Jun 24, 2017 at 09:11 UTC
Hi arblargan,
as other monks have already mentioned, all you really need is a single regex such as /0{7}\d/ (or perhaps /^0{7}\d$/ if the word you get is just the number).
You could, however, split your word into two parts as you did, but you made a logical error: you should have an "or", not an "and", in your condition for detecting a corrupt word, because you want to detect if the first part is not made of 0 OR if the second part is not a digit. So, you might fix your code as follows:
my $Disc = get_word();
my $D1 = substr($Disc,0,7);
my $D2 = substr($Disc,7,1);
print "Word $Disc is corrupt!\n" if $D1 !~ /0+/ or $D2 !~ /[0-9]+/;
But, again, this was just to explain the error in your code; the solution with /0{7}\d/ is much simpler and better.
Update: this was intended to show the logical error ("and" instead of "or"). As pointed out by AnomalousMonk just below, the regexes are also wrong in terms of the intended purpose described in the original post.
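To make both problems concrete, here is a small sketch that runs the sample words from the question through three conditions: the original "and" version, the "or" version with the same unanchored regexes, and an "or" version with anchored regexes. The %verdict hash and the anchored patterns are illustrative additions, not part of the original code.

```perl
use strict;
use warnings;

my %verdict;

for my $Disc (qw(00000001 FFFFFFFF 6C163512 01020102)) {
    my $D1 = substr($Disc, 0, 7);
    my $D2 = substr($Disc, 7, 1);

    # Original condition: both halves must look wrong.
    my $with_and = ($D1 !~ /0+/ and $D2 !~ /[0-9]+/) ? 'corrupt' : 'ok';

    # Same regexes, but either half looking wrong is enough.
    my $with_or = ($D1 !~ /0+/ or $D2 !~ /[0-9]+/) ? 'corrupt' : 'ok';

    # Anchored regexes: all seven characters must be 0, and the last a digit.
    my $anchored = ($D1 !~ /\A0{7}\z/ or $D2 !~ /\A[0-9]\z/) ? 'corrupt' : 'ok';

    $verdict{$Disc} = [ $with_and, $with_or, $anchored ];
    printf "%s: and=%s or=%s anchored=%s\n", $Disc, $with_and, $with_or, $anchored;
}
```

With "and", 6C163512 slips through because its last character is a digit. With "or" and the unanchored /0+/, 01020102 still slips through because its first seven characters contain at least one 0. Only the anchored version flags all three corrupt words.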
Yes, AnomalousMonk, you're right. I wanted to point out the logical error ("and" instead of "or" in the conditional), but you're absolutely right that the regexes should also be fixed.
Perhaps something like:
print "Word $Disc is corrupt!\n" if $D1 !~ /^0{7}$/ or $D2 !~ /[0-9]/;
And the first part of the conditional could actually be replaced by a string inequality operator rather than a regex:
print "Word $Disc is corrupt!\n" if ($D1 ne '0' x 7) or $D2 !~ /[0-9]/;
Update: s/instead or "or"/instead of "or"/;. Thanks to Discipulus for pointing out the typo.
Re: Match all Non-0 and Letters
by AnomalousMonk (Archbishop) on Jun 24, 2017 at 08:16 UTC
It's not clear to me just what you want. If you want to extract from a line all "normal" words skipping other words, try something like this:
c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
"my $normal = qr{ 0{7} [0-9] }xms;
;;
my $line = '00000000 FFFFFFFF 00000001 6C163512 00000002 '
. 'ffffffff 00000003 0000009 00000004 000000009 '
. '0 00 000 0000 00000 000000 0000000 000000000 '
. '00000005'
;
print qq{line: '$line'};
;;
my @all_ok = $line =~ m{ \b $normal \b }xmsg;
dd \@all_ok;
"
line: '00000000 FFFFFFFF 00000001 6C163512 00000002 ffffffff 00000003 0000009 00000004 000000009 0 00 000 0000 00000 000000 0000000 000000000 00000005'
[
"00000000",
"00000001",
"00000002",
"00000003",
"00000004",
"00000005",
]
Give a man a fish: <%-{-{-{-<
Re: Match all Non-0 and Letters
by anonymized user 468275 (Curate) on Jun 25, 2017 at 07:15 UTC
Sorry for being the wicked witch arriving late at what looks like a decimal-only gabberfest, but I could not help smiling at the term 'corruption' - apparently just because the data is hexadecimal rather than decimal. Or to put it another way, are you sure you should be filtering out the hex rather than taking it at face value?
What about just converting it to decimal instead, e.g. see https://perldoc.perl.org/functions/hex.html
Update: If you want to limit the data to a range of values, you should STILL convert from hex to decimal first and then apply the test. In other words just forget the idea that fffffff is corrupt because e.g. 0000000A is only 10 in decimal - quite a low value and you might want to include the value 10!
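A brief sketch of that suggestion, using Perl's built-in hex function on sample words from this thread. The acceptance range of 0 to 15 and the %decimal hash are assumptions chosen purely for illustration; the original post does not state a range.

```perl
use strict;
use warnings;

my %decimal;

# Sample words from the thread, treated as hexadecimal numbers.
for my $word (qw(00000001 0000000A 6C163512 FFFFFFFF)) {
    my $value = hex $word;    # core function: hex string to number
    $decimal{$word} = $value;

    # Hypothetical acceptance range, chosen only for illustration.
    my $in_range = ($value >= 0 && $value <= 15) ? 'in range' : 'out of range';

    printf "%s = %u (%s)\n", $word, $value, $in_range;
}
```

Under this reading 0000000A would be accepted as 10, which is exactly the case the whitelist regex rejects, so the two approaches answer different questions about what counts as valid input.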
'corruption' - apparently just because the data is hexadecimal rather than decimal
I think the OP was quite specific in the definition of the input format - "a normal word will be 7 0's followed by a number between 0-9 (8-digits total)". To put some perspective on this from an ECE point of view, I find this kind of corruption is completely "normal", for example, in an RS-232 or wireless serial data stream corrupted by noise. Simply skipping the obviously corrupted values until a good value is seen is a valid approach to regaining synchronization with the stream. Of course there are ways to add error detection and/or correction encodings on the stream on the transmitting end so the corruption is less likely in the first place, but a large number of "modern" devices I've worked with still don't do this.
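A minimal sketch of that skip-until-good resynchronization loop. The get_word defined here is a hypothetical stand-in for the original poster's routine of the same name, fed from a made-up in-memory stream so the example runs on its own.

```perl
use strict;
use warnings;

# Stand-in for the stream; get_word() here is a hypothetical
# replacement for the original poster's routine of the same name.
my @stream = qw(00000001 FFFFFFFF 6C163512 01020102 00000002 00000003);
sub get_word { return shift @stream }

my @accepted;
my $skipped = 0;

while (defined(my $word = get_word())) {
    if ($word !~ /\A0{7}\d\z/) {
        $skipped++;           # corrupted word: ignore it and keep reading
        next;
    }
    push @accepted, $word;    # back in sync: process the good word
}

print "accepted: @accepted\n";
print "skipped:  $skipped\n";
```

Counting the skipped words is optional, but it gives a cheap indication of how noisy the input was.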