PerlMonks  

string? Or binary garbage?

by argv (Pilgrim)
on Dec 01, 2004 at 00:21 UTC

argv has asked for the wisdom of the Perl Monks concerning the following question:

I'm using Josh Carter's IPTCInfo package from CPAN, which reads the IPTC header from an image (e.g., a JPEG) and fills in fields such as "author", "location", etc. The problem is that some data may be corrupted, or perhaps unintelligible because it's written in a different language. Is there a way to know which?

At first, it seemed simple to just check for normal ASCII, but then it occurred to me that I want to accept certain accented characters, like the é in café, and so on...

Before I go off writing some routine that checks a string for sanity, to see if it really is English text instead of arbitrary gobbledygook, I figured maybe someone already had such a thing. Even if I only look at the first N characters in a string, that'd be fine.

Again, the brute force intuitive step would be to just do something like

$string =~ /([\s\w]{25})/

but this seems like a hornet's nest of little gotchas where people have learned it ain't that simple.
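
For illustration, the "first N characters" idea above could be sketched as follows. This is only a sketch under assumptions: prefix_looks_like_text is a hypothetical helper name, the string is assumed to be already decoded into characters (or to be Latin-1 bytes), and "text" is assumed to mean letters, digits, punctuation, and whitespace. Unicode properties are used instead of \w so that accented letters such as é are accepted regardless of how the string is stored internally.

```perl
use strict;
use warnings;

# Hypothetical helper: do the first $n characters (or the whole string,
# if it is shorter) consist only of letters, digits, punctuation and
# whitespace? Returns 1 or 0.
sub prefix_looks_like_text {
    my ($text, $n) = @_;
    $n = 25 unless defined $n;

    my $prefix = substr($text, 0, $n);
    return 0 if $prefix eq '';          # nothing to judge

    # \p{L} letters, \p{N} digits, \p{P} punctuation; \p{...} always
    # uses Unicode rules, so "\x{e9}" (é) counts as a letter.
    return $prefix =~ /\A[\p{L}\p{N}\p{P}\s]+\z/ ? 1 : 0;
}

my $caption = "Caf\x{e9} Central, Wien";    # accented but ordinary text
my $junk    = "\x{01}\x{02}JFIF\x{00}";     # control bytes mixed in

print prefix_looks_like_text($caption), "\n";
print prefix_looks_like_text($junk), "\n";
```

Note that symbols such as "$" or "+" fall under \p{S} and would be rejected by this particular character class, which is one of the little gotchas that make the exact definition of "text" a judgment call.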

Replies are listed 'Best First'.
Re: string? Or binary garbage?
by Albannach (Monsignor) on Dec 01, 2004 at 00:49 UTC
    It looks like a good place to start would be Lingua::Identify, which 'knows' 26 languages so far. I imagine that the IPTC headers should be short enough that you could check them whole, as I expect a larger sample would give more confidence, but some experimenting may be in order on that point.

    --
    I'd like to be able to assign to an luser

      And if you have any problem with it, you can always bug the author :-) Why don't you send him some examples so that he can tell you if Lingua::Identify is the right tool for that? :-) I'm pretty sure he'd like to help O:-)

      It is possible that it isn't, but it doesn't hurt to try, and L::I hasn't reached a final version yet, so maybe there's room for something like that. Bug the author!! :-) Definitely :-)

Re: string? Or binary garbage?
by davido (Cardinal) on Dec 01, 2004 at 00:28 UTC

    Would File::Type be of any help? You can feed data (not just filenames) to File::Type, and it does its best job of magically figuring out what you're looking at. Not sure if that will help or not, but it might be a start.


    Dave

      Given the expense of the application and that it has to iterate over thousands of images at a time, I don't want to pull in other packages with heavy overhead. Now that I think of it, I just want to see if there are any unprintable control characters... Looking at the perldoc pages, I should be able to use

      $a =~ /\p{IsC}/

      ...which says, "crazy control characters and such." However, that seems to be matching legitimate text in some cases. So, I try

      $a !~ /\p{IsPrint}/

      but that doesn't work either. Then I tried:

      $a !~ /\p{IsASCII}/

      but that's also not working. I can get into the details on why these things aren't working, but I keep thinking someone's going to pipe in with a statement like, "I ran into this ages ago, and here's a routine that finds all the exceptions..."

      If not, I suppose I can elaborate on how this isn't working.
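
The post does not say which inputs misbehaved, so the following is only a sketch of two plausible causes, using made-up sample strings. \p{IsC} covers the whole Unicode "Other" category, which includes ordinary control characters such as newline, so a caption with a line break matches it. And a negated match like !~ /\p{IsASCII}/ is true only when the string contains no ASCII character at all, which is rare for real text.

```perl
use strict;
use warnings;

my $caption = "Sunset over the bay\n";   # ordinary text with a trailing newline
my $accents = "\x{e9}\x{e8}\x{e0}";      # three accented letters, no ASCII at all
my $mixed   = "hello\x{00}";             # ASCII text plus a NUL byte

# Newline is a control character (category Cc), so \p{IsC} matches
# this perfectly legitimate caption.
my $c_hit = $caption =~ /\p{IsC}/ ? 1 : 0;

# "!~" negates the whole match: true only if NO character is ASCII.
# NUL is itself an ASCII character, so this is false for $mixed.
my $no_ascii_mixed   = $mixed   !~ /\p{IsASCII}/ ? 1 : 0;
my $no_ascii_accents = $accents !~ /\p{IsASCII}/ ? 1 : 0;

print "IsC matches caption:       $c_hit\n";
print "mixed has no ASCII at all: $no_ascii_mixed\n";
print "accents have no ASCII:     $no_ascii_accents\n";
```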

        Well, I fixed the problem myself. In my previous example, I had:

        $a !~ /\p{IsPrint}/

        which tests whether $a does NOT match any printing characters. (In other words, the test fails if it DOES match a PRINTABLE character.) So, what I wanted to do is close, but not quite the same thing:

        $a =~ /\P{IsPrint}/

        which basically tests "does $a match the complement of a printing character?" In other words, "are there ANY non-printing characters here?"

        One has to look really closely and speak out the logic into a dark empty room before knowing intuitively which was the right choice ahead of time.
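
The difference between the two tests above can be seen in a small sketch. The helper names has_nonprinting and has_binary_junk are hypothetical, and the sample strings are made up. One caveat worth knowing: tabs and newlines are not printing characters either, so the second helper assumes you want to tolerate whitespace in multi-line fields.

```perl
use strict;
use warnings;

# Hypothetical helper: true if ANY character is non-printing.
sub has_nonprinting {
    my ($text) = @_;
    return $text =~ /\P{IsPrint}/ ? 1 : 0;
}

# Variant that tolerates whitespace (tabs, newlines), which
# \p{IsPrint} does not include.
sub has_binary_junk {
    my ($text) = @_;
    return $text =~ /[^\p{IsPrint}\s]/ ? 1 : 0;
}

my $clean   = "Caf\x{e9} de Paris";          # accented but printable
my $garbage = "Caf\x{e9}\x{00}\x{07}";       # contains NUL and BEL

# The negated match asks "is there NO printing character at all?",
# so it is false for $garbage, which still holds printable letters.
my $negated = $garbage !~ /\p{IsPrint}/ ? 1 : 0;

print "negated match on garbage: $negated\n";
print "has_nonprinting(clean):   ", has_nonprinting($clean), "\n";
print "has_nonprinting(garbage): ", has_nonprinting($garbage), "\n";
print "has_binary_junk(2 lines): ", has_binary_junk("line one\nline two"), "\n";
```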

Re: string? Or binary garbage?
by Anonymous Monk on Dec 02, 2004 at 13:06 UTC
    The problem is, some data may be corrupted, or perhaps unintelligible because it's written in a different language.
    Different language from what? What does IPTC say it has to be?
