PerlMonks  

Re^3: Match all Non-0 and Letters

by sundialsvc4 (Abbot)
on Jun 26, 2017 at 22:02 UTC (#1193633)


in reply to Re^2: Match all Non-0 and Letters
in thread Match all Non-0 and Letters

This is certainly a reasonable guess, haukex, and it could well be correct. I hope that the OP will elaborate further to remove all doubt.

Certainly, the file appears to consist of hexadecimal data, and I would treat it as such in conversion, if only to quickly and reliably detect any problems in the input file itself. Then, the “valid” values, now converted to a stream of 32-bit integer quantities, would be those that fall within a particular numeric range. Any anomalies, likewise, could now be tackled in the context of that now-successfully-decoded integer (not text ...) data stream.

In general, it does not make good sense to me to attack the file with regular expressions that consider only characters, when a stronger definition of the file’s expected format is that it consists of hexadecimal-encoded integers ... representative of an original data stream which also consisted of integers. “The strings,” one might safely say here, “are merely the encoding” of the actual data-of-interest, and therefore should not be the first object of the attack.
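A minimal sketch of the decode-first approach described above. The sample string, the 8-hex-digits-per-value layout, and the 0..9 valid range are all assumptions made for illustration; the OP's actual specification has not been posted.

```perl
use strict;
use warnings;

# Hypothetical input: assumed to be hex text, 8 hex digits per 32-bit value.
my $text = "000000010000000200000003DEADBEEF";

# Fail early if the input is not well-formed hex of the expected length.
die "not hex, or not a multiple of 8 digits\n"
    unless $text =~ /\A(?:[0-9A-Fa-f]{8})+\z/;

# Decode each 8-digit group into an integer.
my @values = map { hex($_) } $text =~ /([0-9A-Fa-f]{8})/g;

# Assumed valid range; the real range depends on the OP's specification.
my ($min, $max) = (0, 9);
my @valid   = grep { $_ >= $min && $_ <= $max } @values;
my @suspect = grep { $_ <  $min || $_ >  $max } @values;

print "valid: @valid\n";
printf "suspect: %s\n", join ' ', map { sprintf '%08X', $_ } @suspect;
```

Note that this only works if every value really does start on an 8-digit boundary, which is itself an assumption about the input.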

Replies are listed 'Best First'.
Re^4: Match all Non-0 and Letters
by haukex (Monsignor) on Jun 27, 2017 at 08:19 UTC
    This is certainly a reasonable guess

    Yes, it's just a guess, but I felt I wanted to provide a possible alternative to the guess that the OP doesn't know their specifications and doesn't know what hex is.

    anomalies, likewise, could now be tackled in the context of that now-successfully-decoded integer (not text ...) data stream. In general, does not make good sense to me to attack the file with regular expressions

    Well, in my hypothetical situation of a serial data stream corrupted by noise, unfortunately decoding into integers first and then inspecting those integers for bad values won't work. The reason is that the corruption on such streams can include bytes inserted or dropped, meaning that it's entirely possible that none of the incoming data is aligned on 32-bit boundaries. In such a case, one needs a state machine to reacquire synchronization with the data stream, so actually in this case Perl's regular expressions are a decent tool for that job. Note how none of the valid values in the following stream are aligned on 4-byte boundaries:

    my $datastr = "BEEF00000001AB0000000200000700000003F00D";
    print "$_\n" for $datastr =~ /0{7}[0-9]/g;
    __END__
    00000001
    00000002
    00000003
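To make the alignment point concrete, here is a small sketch that runs both approaches on the same sample string. The fixed 8-character chunking stands in for "decode first"; the variable names are illustrative only.

```perl
use strict;
use warnings;

my $datastr = "BEEF00000001AB0000000200000700000003F00D";

# Decode-first: cut the stream into fixed 8-character groups.
my @chunks  = unpack '(A8)*', $datastr;
my @aligned = grep { /\A0{7}[0-9]\z/ } @chunks;

# Scan: let the regex engine find valid records at any offset.
my @scanned = $datastr =~ /0{7}[0-9]/g;

print "chunks:  @chunks\n";
printf "aligned: %d valid\n", scalar @aligned;
print "scanned: @scanned\n";
```

With this input, none of the five fixed-width chunks is a valid record, while the scan recovers all three values.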
      I asked the question 'are you sure it's corruption?' (yet to be answered by the OP) because Occam's razor makes the proposition that the data is always hex more likely than the OP's notion that the correct 'uncorrupted' format should be 8 decimal digits with leading zeros. The latter proposition would require a bizarre explanation (a weirdly written COBOL program?), to say the least, without even getting into how a hypothetical hex gremlin performed the alleged corruption on top of that.

      (Occam's razor: that the simplest of competing theories be preferred to the more complex or that explanations of unknown phenomena be sought first in terms of known quantities.)

      One world, one people

        the latter proposition would require a bizarre explanation to say the least

        What you call "bizarre" is in my experience completely normal. I myself would not design a data format in this way, but I have worked with plenty of binary data formats that make somewhat strange choices, such as storing a value from 0 to 9 in a 32-bit field. Just a month ago I finished implementing a driver for a proprietary network protocol that, among other things, has a 32-bit-wide "flag" field in which only the lowest 3 bits are used. As for how the corruption might have gotten there, I already explained a possibility, which, again, in the ECE world is unfortunately still completely normal, despite being avoidable.

        So as I said, given that the OP seemed to be clear on the expected format, I just wanted to provide a different perspective for the explanation.
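For illustration only, this is roughly what reading such a wide-but-mostly-unused field looks like. The big-endian byte order and the sample value are assumptions; the proprietary protocol mentioned above is not described in this thread.

```perl
use strict;
use warnings;

# Hypothetical 4-byte field as received; big-endian order is an assumption.
my $raw = pack 'N', 0x00000005;

my $field    = unpack 'N', $raw;        # decode as unsigned 32-bit big-endian
my $flags    = $field & 0x7;            # the three bits that carry meaning
my $reserved = $field & 0xFFFFFFF8;     # the remaining 29 bits, expected to be zero

printf "flags=%03b reserved=%08X\n", $flags, $reserved;
warn "unexpected bits set in reserved area\n" if $reserved;
```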
