PerlMonks  

Search hex string in very large binary file

by westrock2000 (Beadle)
on Feb 06, 2015 at 23:25 UTC ( #1115815=perlquestion )

westrock2000 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to search iTunes M4V movie files for a switch that determines whether the "1080P" metatag has been set. The byte string, in hex, is:

68 64 76 64 00 00 00 11 64 61 74 61 00 00 00 15 00 00 00 00 02 00 00 00

So I have two questions.
The first problem is that my video files are on the order of 4-20GB, and I'm not sure the standard approach

open(FILE, '<', 'video') or die $!;
# do something
close(FILE);

will work, since the file would be much bigger than my memory.
How can I read the file from disk?

Second, how would I do a regular expression for binary data instead of searching for text?

EDIT:
I have used MP4::Info for other parts of my task.
Would it be possible to add the metatag info to the module code?

I have never altered a module before.

The info I found about the iTunes data is

HD Video = hdvd 8-bit integer (boolean)
https://code.google.com/p/mp4v2/wiki/iTunesMetadata

MP4::Info source
http://cpansearch.perl.org/src/JHAR/MP4-Info-1.12/Info.pm

Replies are listed 'Best First'.
Re: Search hex string in very large binary file
by davido (Cardinal) on Feb 07, 2015 at 00:04 UTC

    > BTW: index may be a good alternative to regex

    I agree with LanX: You're searching for a specific sequence, not a pattern. No need to fire up the regex engine to search for something that isn't a pattern. index is a good start.

    I don't know anything about the M4V file format, but wouldn't the string you're searching for be in a header near the beginning of the file? That may also simplify your search.


    Dave

Re: Search hex string in very large binary file
by LanX (Cardinal) on Feb 06, 2015 at 23:46 UTC
    general answers:

    > 1. How can I read the file from disk?

    Use a sliding-window technique: you only need to hold at least twice the length of the searched string in memory at any time. See the LENGTH argument of read.

    That said, chunks sized in multiples of 4 KB seem reasonable.
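
    A minimal sketch of the sliding-window idea, for illustration only. The helper name find_all_offsets, the default chunk size, and the sample data are assumptions made for this example; the only built-ins relied on are read (with its OFFSET argument), index and four-argument substr. The tail of each chunk is kept so that a match straddling two chunks is not missed.

```perl
use strict;
use warnings;

# find_all_offsets() is a hypothetical helper name, not from any module.
# It returns every offset at which $sig occurs in the data behind $fh.
sub find_all_offsets {
    my ($fh, $sig, $chunk_size) = @_;
    $chunk_size ||= 64 * 4096;        # a multiple of 4 KB, chosen arbitrarily
    my $keep   = length($sig) - 1;    # longest tail that can begin a split match
    my $buffer = '';
    my $base   = 0;                   # file offset of the first byte in $buffer
    my @found;

    # read() appends at the given offset, so the kept tail stays in front.
    while (read($fh, $buffer, $chunk_size, length $buffer)) {
        my $pos = 0;
        while (($pos = index($buffer, $sig, $pos)) >= 0) {
            push @found, $base + $pos;
            $pos++;
        }
        # Drop everything except the tail that could start a match.
        my $drop = length($buffer) - $keep;
        if ($drop > 0) {
            substr($buffer, 0, $drop, '');
            $base += $drop;
        }
    }
    return @found;
}

# Demo on an in-memory filehandle, with a tiny chunk size so that each
# occurrence of the signature straddles a chunk boundary.
my $sig  = pack 'H*', '6864766400000011';
my $data = ('x' x 5) . $sig . ('y' x 7) . $sig;
open my $fh, '<:raw', \$data or die "Cannot open in-memory file: $!";
my @hits = find_all_offsets($fh, $sig, 4);
close $fh;
print "Found at offsets: @hits\n";
```

    For a real file, open it with '<:raw' and pass that handle instead; memory use stays at roughly one chunk plus the kept tail regardless of file size.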

    > And then 2nd how would I do a regular expression for binary instead of searching for text?

    Perl strings are just sequences of bytes; you only need to convert¹ your hex into one.

    BTW: index may be a good alternative to regex

    Cheers Rolf

    PS: Je suis Charlie!

    ¹) e.g.

    DB<115> join "", map {chr} 0x20,0x41,0x42 => " AB"

    see also pack for a direct approach.

    DB<123> pack 'H*', '204142' => " AB"
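
    Putting the two points together, here is a small sketch that packs the signature from the question into bytes and then locates it with both index and a regex. The sample $data is made up for the example; with a regex, \Q...\E keeps bytes that happen to be metacharacters from being interpreted.

```perl
use strict;
use warnings;

# The signature from the question, written as space-separated hex.
my $hex = '68 64 76 64 00 00 00 11 64 61 74 61 00 00 00 15 00 00 00 00 02 00 00 00';

# Strip the spaces and pack the hex digits into raw bytes.
(my $digits = $hex) =~ tr/ //d;
my $sig = pack 'H*', $digits;

# Made-up sample data: 10 filler bytes, the signature, then 2 more bytes.
my $data = ("\xFF" x 10) . $sig . "\x01\x02";

# index() compares the strings byte for byte.
my $pos = index $data, $sig;
print "index found it at offset $pos\n";

# A regex works as well; \Q...\E quotes any metacharacters in $sig.
if ($data =~ /\Q$sig\E/) {
    print "regex found it at offset $-[0]\n";
}
```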

Re: Search hex string in very large binary file
by BrowserUk (Pope) on Feb 07, 2015 at 05:03 UTC

    Try this (I was bored:):

    #! perl -slw
    use strict;

    our $BUFN //= 1024; $BUFN *= 4096;
    our $SIG //= '68 64 76 64 00 00 00 11 64 61 74 61 00 00 00 15 00 00 00 00 02 00 00 00';
    $SIG =~ tr[ ][]d;
    $SIG = pack 'H*', $SIG;

    open my $in, '<:raw', $ARGV[0] or die $!;

    my( $offset, $buffer ) = ( 0, '' );
    while( sysread( $in, $buffer, $BUFN, length $buffer ) ) {
        my $pos = 1+index( $buffer, $SIG );
        if( $pos ) {
            print "Found signature at offset: ", $offset + $pos - 1;
            exit;
        }
        $offset += length( $buffer ) - length( $SIG );
        $buffer = substr $buffer, - length $SIG;
    }
    close $in;
    print "Signature not found";
    __END__
    02/02/2015  15:42    10,737,418,241 big.csv

    C:\test>junk71 -BUFN=4096 -SIG="52 5f d7 58 22 0d 0a 61 68 73 68 77 65 2c 38 30 33 37 31 37 38 35 2c 46" big.csv
    Found signature at offset: 1073741817

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked
Re: Search hex string in very large binary file
by GrandFather (Sage) on Feb 07, 2015 at 07:23 UTC

    Take a look at Video::Dumper::QuickTime. I wrote it when I needed to pull apart MP4 files and it's designed to have new metatag decoders plugged into it. Be warned, it's quite a few years since I last looked at it and the documentation may not be as good now as I thought it was when I understood the code!

    Perl is the programming world's equivalent of English
Re: Search hex string in very large binary file
by Anonymous Monk on Feb 07, 2015 at 14:42 UTC

    Just to point out an issue with your approach: If you were simply looking for a specific sequence of 24 bytes within up to 20,000,000,000 bytes, what about false positives? To avoid that, you'd actually have to parse the file and only look in the appropriate places for that flag. Which, if you were to DIY, would be a lot of reading specs and writing code, so it really is best to use an existing tool.

    You're in luck! Someone actually submitted a patch for MP4::Info to add support for the HDVD tag: https://rt.cpan.org/Public/Bug/Display.html?id=101016

    There's a quick & really dirty way to patch the module on your system: "wget -nv https://rt.cpan.org/Ticket/Attachment/1444239/767837/0001-add-support-for-HDVD-tag.patch -O- | patch `perldoc -l MP4::Info`" (you'll probably need to do this as root). However, a somewhat cleaner way would be to patch the module before installation:

    # in the shell:
    $ cd /tmp
    $ wget http://www.cpan.org/authors/id/J/JH/JHAR/MP4-Info-1.13.tar.gz
    $ tar xzf MP4-Info-1.13.tar.gz
    $ cd MP4-Info-1.13/
    $ wget -nv https://rt.cpan.org/Ticket/Attachment/1444239/767837/0001-add-support-for-HDVD-tag.patch -O- | patch

    ... and then install to a local module repository separate from your system's modules. For example, see the instructions under "I don't have permission to install a module on the system!" in A Guide to Installing Modules.
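
    To give a feel for what "parse the file and only look in the appropriate places" involves, here is a rough sketch that walks only the top-level atoms of an MP4-style file. It assumes the usual box header layout (a 4-byte big-endian size, a 4-byte type, and a 64-bit size following the type when the 32-bit size is 1). The helper name list_top_level_atoms and the sample data are made up for the example; this is not how MP4::Info does it, and reaching the hdvd tag would additionally require descending into nested atoms, which is not shown.

```perl
use strict;
use warnings;

# list_top_level_atoms() is a hypothetical helper, not part of MP4::Info.
# It returns [type, offset, size] for each top-level atom behind $fh.
sub list_top_level_atoms {
    my ($fh) = @_;
    my $offset = 0;
    my @atoms;
    while (seek($fh, $offset, 0)) {
        my ($header, $large);
        my $got = read($fh, $header, 8);
        last unless $got && $got == 8;          # end of file or short read
        my ($size, $type) = unpack 'N a4', $header;
        if ($size == 1) {
            # Assumed layout: a 64-bit big-endian size follows the type.
            $got = read($fh, $large, 8);
            last unless $got && $got == 8;
            my ($hi, $lo) = unpack 'N N', $large;
            $size = $hi * 2**32 + $lo;
        }
        push @atoms, [$type, $offset, $size];
        last if $size < 8;    # 0 means "runs to end of file"; 2..7 is invalid
        $offset += $size;     # skip the payload without reading it
    }
    return @atoms;
}

# Made-up sample: three tiny atoms in an in-memory "file".
my $data = pack('N a4 a8', 16, 'ftyp', 'M4V     ')
         . pack('N a4',     8, 'free')
         . pack('N a4 a4', 12, 'mdat', 'abcd');
open my $fh, '<:raw', \$data or die "Cannot open in-memory file: $!";
my @atoms = list_top_level_atoms($fh);
close $fh;
printf "%-4s offset=%d size=%d\n", @$_ for @atoms;
```

    Because each payload is skipped with seek rather than read, even a multi-gigabyte media-data atom costs only one header read.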

      It's a point; but I wonder how many .m4vs you'd have to search before you found "hdvd" & "data" separated by exactly 4 bytes that wasn't part of the required 24 bytes?

      To clarify, in totally random data, there are 256**24 (6.2771e+57) permutations of 24 bytes.

      A 20GB file has 21474836457 sets of 24 consecutive bytes (one starting at each offset except the last 23).

      So the odds of one of them being a false hit is: 3.4211e-48 (0.00000000000000000000000000000000000000000000034211%). And every restriction on those bytes increases the odds.

      Pretty good odds that any hit is a good one.
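
      The arithmetic above can be reproduced with a few lines of Perl. This is only a sketch under the same assumption as the estimate itself, namely that the file's bytes are uniformly random, which real video data is not; "20GB" is taken to mean 20 * 2**30 bytes.

```perl
use strict;
use warnings;

my $sig_len   = 24;
my $file_size = 20 * 2**30;                   # 20 GiB in bytes
my $windows   = $file_size - $sig_len + 1;    # distinct 24-byte positions
my $patterns  = 256 ** $sig_len;              # possible 24-byte strings

# Expected number of chance matches, assuming uniformly random bytes.
my $expected = $windows / $patterns;

printf "windows:  %.0f\n", $windows;
printf "patterns: %.4e\n", $patterns;
printf "expected: %.4e\n", $expected;
```

      The result is on the order of 1e-48, in line with the figure quoted above.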



        Agreed!

        The point was also meant to be more general about the selection of the solution: personally, my Plan A would be "see if there's a module to do it 'right'", and Plan B would be "meh, I'll just grep the whole file", not the other way around (as the OP seems to imply).
