
Regex trouble w/ embedded 0s?

by roboticus (Chancellor)
on Jul 10, 2014 at 22:19 UTC

roboticus has asked for the wisdom of the Perl Monks concerning the following question:

Hello, all:

I've been fighting this for a little while, and I'm stumped. I'm trying to pull a file apart. My regex works most of the time, but sometimes it fails and I can't figure out why.

Here's an example:

$ cat bug.pl
#!/usr/bin/perl
use strict;
use warnings;

my $str = join("", map { chr }
    0x13, 0x00, 0x00, 0x00, 0xf8, 0x90, 0xbc, 0xac, 0x3a, 0x26,
    0x1c, 0x27, 0xb3, 0x22, 0x22, 0xb3, 0xf6, 0x60, 0x23, 0x2d,
    0x77, 0xbf, 0xdb, 0xda, 0xd1, 0xad, 0x0a, 0x98, 0x1a, 0x38,
    0xae, 0x76, 0xee, 0x77, 0x66, 0x35, 0x66, 0x00, 0x65, 0x00,
    0x74, 0x00, 0x69, 0x00, 0x64, 0x00, 0x5f, 0x00, 0x69, 0x00,
    0x73, 0x00, 0x6c, 0x00, 0x61, 0x00, 0x6e, 0x00, 0x64, 0x00,
    0x5f, 0x00, 0x32, 0x00, 0x2e, 0x00, 0x74, 0x00, 0x67, 0x00,
    0x72, 0x00, 0x00, 0x00, 0xff, 0xfe, 0x76, 0x00, 0x65, 0x00,
    0x72, 0x00, 0x73, 0x00, 0x69, 0x00, 0x6f, 0x00, 0x6e, 0x00,
    0x20, 0x00, 0x31, 0x00, 0x37, 0x00, 0x0a, 0x00, 0x53, 0x00,
    0x69, 0x00, 0x7a, 0x00, 0x65, 0x00, 0x3a, 0x00, 0x20, 0x00,
    0x34, 0x00
);

for my $c (split //, $str) {
    printf "%02x ", ord($c);
}
print "\n";

if ($str=~/^(.{36})(.*?)\0\0\0/) {
    print "Found it!\n";
}
else {
    print "?where is it?\n";
}

$ perl bug.pl
13 00 00 00 f8 90 bc ac 3a 26
1c 27 b3 22 22 b3 f6 60 23 2d
77 bf db da d1 ad 0a 98 1a 38
ae 76 ee 77 66 35 66 00 65 00
74 00 69 00 64 00 5f 00 69 00
73 00 6c 00 61 00 6e 00 64 00
5f 00 32 00 2e 00 74 00 67 00
72 00 00 00 ff fe 76 00 65 00
72 00 73 00 69 00 6f 00 6e 00
20 00 31 00 37 00 0a 00 53 00
69 00 7a 00 65 00 3a 00 20 00
34 00
?where is it?

I expected to see "Found it!", as there's clearly a string of three zeroes on the eighth line. (Note: I manually inserted the line breaks in the output.)

I'm probably doing something crazy, but I can't see it.

I don't expect that it matters, but I'm running 5.14.4 (from cygwin).

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re: Regex trouble w/ embedded 0s?
by Eily (Monsignor) on Jul 10, 2014 at 22:27 UTC

    The dot doesn't match the linefeed (0x0A) character unless you use the /s modifier. Just add it and your code will work :)
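
    In the posted sample, the 27th byte is 0x0a, so it falls inside the span that .{36} has to cover. A minimal sketch of the effect, using a made-up six-byte string rather than the original data (the variable names are only for illustration):

```perl
use strict;
use warnings;

# Hypothetical 6-byte sample: a linefeed (0x0A) sits inside the first four bytes.
my $data = join "", map { chr } 0x41, 0x0a, 0x42, 0x43, 0x00, 0x00;

# Without /s, "." refuses to match the linefeed, so .{4} cannot span it.
my $without_s = $data =~ /^(.{4})\0\0/  ? 1 : 0;

# With /s, "." matches any character, including the linefeed.
my $with_s    = $data =~ /^(.{4})\0\0/s ? 1 : 0;

print "without /s: $without_s, with /s: $with_s\n";
```

    The only difference between the two patterns is the trailing s, which is the one change the original regex should need.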

      Eily:

      Thanks! I knew it had to be something dumb, but I just *couldn't* find it.

      ...roboticus


Re: Regex trouble w/ embedded 0s?
by Laurent_R (Canon) on Jul 11, 2014 at 07:04 UTC
    OK, Eily has provided the solution, but I don't understand why you use such a strange-looking regex:
    /^(.{36})(.*?)\0\0\0/
    Why do you have 36 times any char, followed by the smallest number of any characters still making the rest of the match possible?

      Laurent_R:

      I'm dismantling a large (5GB) binary file archive, and the first 36 bytes of each file entry are stuff I haven't determined the purpose of. Then comes the filename (variable length) and the data. The filename appears to be unicodey and terminated by a 0, so it looks like: (letter, 0, letter, 0, ..., letter, 0, 0, 0). Since the filename is variable length, it felt like a regex would be the simplest way to dismantle it.
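
      A rough sketch of how that layout could be pulled apart, assuming the name is UTF-16LE and ends with a 16-bit NUL (neither is confirmed for this archive format). The sample entry is invented, and parse_entry is just an illustrative helper name. Stepping through the name two bytes at a time keeps the terminator from swallowing the high byte of the last letter, which a plain (.*?)\0\0\0 would do:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical entry: 36 opaque header bytes (one of them a linefeed),
# a UTF-16LE name, a 16-bit NUL terminator, then some payload bytes.
my $header = ("\xAA" x 20) . "\x0A" . ("\xBB" x 15);
my $entry  = $header . encode('UTF-16LE', 'a.tgr') . "\0\0" . "\xFF\xFEdata";

# parse_entry is an illustrative helper, not part of any library.
sub parse_entry {
    my ($bytes) = @_;
    # /s lets "." match the linefeed; (?:..)*? consumes the name in
    # two-byte steps until it reaches an aligned \0\0 pair.
    my ($hdr, $raw_name) = $bytes =~ /^(.{36})((?:..)*?)\0\0/s
        or return;
    my $data_offset = $+[0];    # offset just past the terminator
    return ($hdr, decode('UTF-16LE', $raw_name), $data_offset);
}

my ($hdr, $name, $data_offset) = parse_entry($entry);
print "name=$name, data starts at byte $data_offset\n";
```

      Copying the captures into lexicals before calling decode avoids relying on $1 and $2 staying intact across another function call.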

      Normally when exploring things like this, I take things apart, and as I find the patterns, I improve the parsing. This file seems to freely mix binary, Unicode and normal ASCII, so I'm still thinking about how best to dismantle it. I also don't know much about the internal structure of the file yet, other than from a very gross overview. I could look it up on the 'net, but I like figuring stuff out as much as I can first before looking at the answer in the back of the book.

      ...roboticus


        OK, roboticus, thank you for answering, I now understand your context.

      If you wanted to grab the first 36 chars from the start of a string, then grab the first subsequent group that was terminated by (and did not contain) a \0\0\0 sequence, what regex would you use?

        I dunno, prolly
        /^.{36,}?\0\0\0/
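
        For comparison, a small sketch of how this pattern and the original one behave on a made-up string (both with /s added, per Eily's fix; the sample and variable names are hypothetical). They should cover the same span, but only the original keeps the two pieces as captures:

```perl
use strict;
use warnings;

# Hypothetical sample: 36 filler bytes, a short name, then three NULs.
my $sample = ("x" x 36) . "name" . "\0\0\0" . "rest";

# Original pattern: two captures (36-byte prefix, then the lazy group).
my @caps = $sample =~ /^(.{36})(.*?)\0\0\0/s;
my $end_two_groups = $+[0];

# Suggested alternative: same overall span, but nothing is captured.
my $matched = $sample =~ /^.{36,}?\0\0\0/s;
my $end_one_quantifier = $+[0];

print "group 2: $caps[1]; matches end at offsets ",
      "$end_two_groups and $end_one_quantifier\n";
```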
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        If you wanted to grab the first 36 chars from the start of a string, then grab the first subsequent group that was terminated by (and did not contain) a \0\0\0 sequence, what regex would you use?
        Yes, AnomalousMonk, you are right, if I wanted to do that, I would probably use a regex very similar to what roboticus used. I was really wondering why he wanted to do something a bit strange like that, and he has not provided an answer which explains it all.
