PerlMonks  

Re: Regex trouble w/ embedded 0s?

by Laurent_R (Canon)
on Jul 11, 2014 at 07:04 UTC ( [id://1093192]=note )


in reply to Regex trouble w/ embedded 0s?

OK, Eily has provided the solution, but I don't understand why you use such a strange-looking regex:
/^(.{36})(.*?)\0\0\0/
Why do you have 36 times any char, followed by the smallest number of any characters still making the rest of the match possible?

Replies are listed 'Best First'.
Re^2: Regex trouble w/ embedded 0s?
by roboticus (Chancellor) on Jul 11, 2014 at 12:25 UTC

    Laurent_R:

    I'm dismantling a large (5GB) binary file archive, and the first 36 bytes of each file entry are stuff I haven't determined the purpose of. Then come the filename (variable length) and the data. The filename appears to be unicodey, terminated by a 0, so it looks like: (letter, 0, letter, 0, ..., letter, 0, 0, 0). Since the filename is variable length, it felt like a regex would be the simplest way to dismantle it.
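
A minimal sketch of that layout, for anyone following along. Everything here is an assumption rather than roboticus's actual code: the sample record is made up, `parse_entry` is a hypothetical helper, and the name is guessed to be UTF-16LE with a two-byte 0 terminator. The `/s` modifier is added so that `.` also matches newline bytes, which binary data can easily contain. Note that the `\0\0\0` in the pattern swallows the high byte of the last character, so one `\0` has to be put back before decoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: split one entry into header, filename and data offset.
sub parse_entry {
    my ($record) = @_;

    # 36 opaque bytes, then the shortest run that is followed by \0\0\0.
    # /s lets '.' match "\n" bytes inside the binary data.
    return unless $record =~ /^(.{36})(.*?)\0\0\0/s;
    my ($head, $raw_name, $data_offset) = ($1, $2, $+[0]);

    # The terminator consumed the last character's high byte; restore it
    # so the name is a whole number of 16-bit units (assumed UTF-16LE).
    my $filename = decode('UTF-16LE', $raw_name . "\0");

    return ($head, $filename, $data_offset);
}

# Made-up record: header bytes 1..36 (which include a newline, byte 10),
# the name "abc" as UTF-16LE, a two-byte terminator, then the data.
my $record = pack('C*', 1 .. 36) . "a\0b\0c\0" . "\0\0" . "DATA";

my ($head, $filename, $data_offset) = parse_entry($record);
printf "header bytes: %d, name: %s, data: %s\n",
    length($head), $filename, substr($record, $data_offset);
# prints "header bytes: 36, name: abc, data: DATA"
```

If the names turn out not to be UTF-16LE, only the `decode` line needs to change; the split itself makes no assumption about the encoding beyond the three zero bytes.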

    Normally when exploring things like this, I take things apart, and as I find the patterns, I improve the parsing. This file seems to freely mix binary, unicode and normal ASCII, so I'm still thinking about how best to dismantle it. I also don't know much about the internal structure of the file yet, beyond a very gross overview. I could look it up on the 'net, but I like figuring stuff out as much as I can first before looking at the answer in the back of the book.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      OK, roboticus, thank you for answering, I now understand your context.
Re^2: Regex trouble w/ embedded 0s?
by AnomalousMonk (Archbishop) on Jul 11, 2014 at 12:09 UTC

    If you wanted to grab the first 36 chars from the start of a string, then grab the first subsequent group that was terminated by (and did not contain) a  \0\0\0 sequence, what regex would you use?

      I dunno, prolly
      /^.{36,}?\0\0\0/
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        But that doesn't differentiate between the first 36 chars and the subsequent whatsit, and includes the \0\0\0, and needs the use of  $& or a substr operation to access what was matched. My guess (supported by roboticus's later post) was that the chunks were wanted separately, sans terminator. Given that assumption, the regex didn't seem so strange. But there are many paths...
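
A small sketch of the difference between the two patterns, run against a made-up string (36 filler characters, a chunk, the terminator, and some trailing text). The sample data and variable names are assumptions for illustration, and `/s` has been added to both patterns so that `.` would also match newlines in real binary input.

```perl
use strict;
use warnings;

# Made-up input: 36 header chars, a chunk, the terminator, trailing data.
my $str = ('H' x 36) . "name" . "\0\0\0" . "rest";

# Single undivided match: header, chunk and terminator all land in $&.
my $whole;
if ($str =~ /^.{36,}?\0\0\0/s) {
    $whole = $&;
}

# Two capture groups: header and chunk come back separately,
# and the terminator is matched but not captured.
my ($head, $chunk);
if ($str =~ /^(.{36})(.*?)\0\0\0/s) {
    ($head, $chunk) = ($1, $2);
}

printf "whole: %d chars, head: %d chars, chunk: %s\n",
    length($whole), length($head), $chunk;
# prints "whole: 43 chars, head: 36 chars, chunk: name"
```

With the first pattern, getting at the chunk alone still takes something like `substr($whole, 36, -3)`; the capturing version hands it over directly.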

      If you wanted to grab the first 36 chars from the start of a string, then grab the first subsequent group that was terminated by (and did not contain) a \0\0\0 sequence, what regex would you use?
      Yes, AnomalousMonk, you are right: if I wanted to do that, I would probably use a regex very similar to what roboticus used. I was really wondering why he wanted to do something a bit strange like that, and he has now provided an answer which explains it all.
