PerlMonks  

Re: heuristic to detect (perl) code

by tobyink (Abbot)
on Jan 19, 2013 at 11:45 UTC (#1014200)


in reply to heuristic to detect (perl) code

use 5.010;
use strict;
use warnings;
use File::Slurp qw(slurp);

my $text    = slurp(__FILE__);
my $length  = length $text;
my $perlish = ($text =~ y(@$%;{}[]<>=~)//);
my $metric  = $perlish / $length;

say "Metric is $metric";

if ($metric > 0.10) {
    say "Looks like code";
}
elsif ($metric < 0.03) {
    say "Looks like text";
}
else {
    say "Debatable";
}

I've only tried this on a few sample inputs, but it hasn't failed once. It correctly detects itself.

As you can see, it's a very simple metric (the fraction of characters that are typical Perl punctuation), so it should be trivial to port to JavaScript or anything else.
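To try the metric on arbitrary strings rather than on the script itself, it can be wrapped in a subroutine. This is only a minimal sketch: the names `perlish_metric` and `classify` are made up for illustration, the character set and the 0.10 / 0.03 thresholds are simply the ones used above, and the empty-input guard is an addition.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of characters that are "perlish" punctuation.
# Same character set as the script above.
sub perlish_metric {
    my ($text) = @_;
    my $length = length $text;
    return 0 unless $length;    # guard against division by zero (an addition)

    # tr with an empty replacement list only counts, it does not modify.
    my $perlish = ($text =~ tr/@$%;{}[]<>=~//);
    return $perlish / $length;
}

# Hypothetical helper: map the metric to the three verdicts used above.
sub classify {
    my ($text) = @_;
    my $metric = perlish_metric($text);
    return 'code' if $metric > 0.10;
    return 'text' if $metric < 0.03;
    return 'debatable';
}

print classify('my %h = (a => 1); print $h{a};'), "\n";
print classify('This is an ordinary English sentence without any code.'), "\n";
```

Keeping the counting and the thresholds in separate subs makes it easier to experiment with other cut-off values later.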

perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'


Re^2: heuristic to detect (perl) code
by LanX (Canon) on Jan 19, 2013 at 12:03 UTC
    yep, it's an extended version of my sigil-frequency idea... but it's a good start, THANKS! =)

    But IMHO it would be necessary to roughly strip:

    • comments,
    • strings,
    • here-docs and
    • __DATA__, __END__ sections,

    since adding just a few comments already pushes your example into "debatable".

    BTW: I tried your Nodelet hack in Re^3: CSS Show and Tell: Colored Code but it didn't work for me... :(

    Do you know if CodeMagic provides an API to do code sniffing or is the logic internal?

    Cheers Rolf

      1) Comments:

      my $count = ($line =~ s/[#] .*? \n//xms); $total += $count;

      2) Strings:

      $line =~ s/["'] [^'"\n]* ['"]//gxms;

      3) use statements:

      $count = ($line =~ s/use [^;]+ ;//gxms); $total += $count;

      4) Here docs:

      my $count = ($all_text =~ s/<<(\w+) .*? \1//gxms); $total += $count;

      5) __DATA__, __END__:

      $total++ if $all_text =~ s/(__DATA__|__END__) .*//xms;
      Although, I would argue that if __DATA__ or __END__ appears anywhere in the text, then you couldn't go wrong by declaring then and there that the text has Perl code in it.
        It's not that easy

        for instance:

        1) a comment '#' should (mostly) follow a newline or a semicolon; to be more precise, the '#' shouldn't be preceded by a quote-like operator (s, y, tr, q, qq, qr, qw and so on), where it can act as a delimiter.

        2) strings are closed by the same quote so you need to capture the opening one and check the ending with \1.

        3) __DATA__ must appear at the start of a line; OTOH the existence of __DATA__ is already a good indicator for Perl code.
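Points 1 to 3 could be sketched roughly like this. It is an illustration under simplifying assumptions, not a robust parser: the name `strip_noise` is hypothetical, here-docs and quote-like operators are not handled, and an apostrophe inside a comment can still be mistaken for the start of a string.

```perl
use strict;
use warnings;

# Hypothetical helper: roughly strip non-code parts before computing a metric.
sub strip_noise {
    my ($src) = @_;

    # Point 3: __DATA__ / __END__ only count at the start of a line;
    # drop the marker and everything after it.
    $src =~ s/^__(?:DATA|END)__\b.*//ms;

    # Point 2: capture the opening quote and require the same one (\1)
    # to close the string. Backslash-escaped characters are skipped.
    $src =~ s/(["'])(?:\\.|(?!\1)[^\\])*\1//gs;

    # Point 1 (very rough): a comment starts at the beginning of a line or
    # after whitespace or a semicolon, and runs to the end of the line.
    # This leaves things like $#array and s#a#b# alone.
    $src =~ s/(?:^|(?<=[\s;]))#[^\n]*//mg;

    return $src;
}

my $sample = join "\n",
    q{my $s = "it's # not a comment"; # real comment},
    q{print $s;},
    q{__END__},
    q{free text; with {braces}},
    q{};

print strip_noise($sample);
```

Strings are removed before comments so that a '#' inside a string is not treated as a comment start; the reverse order would fail in the opposite way, which is part of why this is not that easy.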

        I think discussing single strategies is in vain; in the end you have to test and train different criteria against a suitably large number of PerlMonks posts to see whether the code sections are found.

        A Bayes classifier offers a very good mathematical method for combining the probabilities of such criteria.
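To illustrate how such a combination works, here is a minimal naive Bayes sketch. Everything in it is an assumption for illustration: the feature names, the sub name `p_code`, and above all the probabilities, which are placeholders and not measured values. Real numbers would have to come from training on actual posts.

```perl
use strict;
use warnings;

# For each binary indicator: P(indicator present | code) and
# P(indicator present | text).
# All numbers are illustrative placeholders, NOT trained values.
my %likelihood = (
    has_sigils      => { code => 0.90, text => 0.10  },
    has_semicolons  => { code => 0.95, text => 0.20  },
    has_data_marker => { code => 0.05, text => 0.001 },
);

# Hypothetical helper: P(code | observed indicators), treating the
# indicators as independent (the "naive" assumption).
sub p_code {
    my ($prior_code, %seen) = @_;

    # Work in log space so many small factors do not underflow.
    my $log_code = log($prior_code);
    my $log_text = log(1 - $prior_code);

    for my $feature (sort keys %likelihood) {
        my $l = $likelihood{$feature};
        if ($seen{$feature}) {
            $log_code += log($l->{code});
            $log_text += log($l->{text});
        }
        else {
            $log_code += log(1 - $l->{code});
            $log_text += log(1 - $l->{text});
        }
    }

    # Normalize the two joint probabilities to a posterior for "code".
    return 1 / (1 + exp($log_text - $log_code));
}

printf "P(code) = %.3f\n", p_code(0.5, has_sigils => 1, has_semicolons => 1);
```

Each criterion only has to be somewhat better than chance; the classifier weighs them against each other instead of relying on one hard threshold.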

        Some of the products I listed in the OP use this approach; they are just not trained on PerlMonks posts (where tiny code snippets also appear within text) and may have too heavy a footprint to be integrated here.

        update

        For instance highlight.js has a function which returns the guessed language.

        Cheers Rolf
