
Re: heuristic to detect (perl) code

by tobyink (Abbot)
on Jan 19, 2013 at 11:45 UTC (#1014200)

in reply to heuristic to detect (perl) code

use 5.010;
use strict;
use warnings;
use File::Slurp qw(slurp);

my $text    = slurp(__FILE__);
my $length  = length $text;
my $perlish = ($text =~ y(@$%;{}[]<>=~)//);   # count sigils and punctuation
my $metric  = $perlish / $length;

say "Metric is $metric";
if ($metric > 0.10) {
    say "Looks like code";
}
elsif ($metric < 0.03) {
    say "Looks like text";
}
else {
    say "Debatable";
}

I've only tried this on a few sample inputs, but it hasn't failed once. It correctly detects itself.

As you can see, it's a very simple metric, so it should be trivial to port to JavaScript or anything else.
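For reuse outside a one-off script, the same metric can be wrapped in a function that takes a string instead of slurping a file. This is only a sketch: the names perlish_metric and classify are made up for illustration, the empty-input guard is an addition, and the thresholds are the same untuned 0.10 and 0.03 as above.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of characters that are "Perlish" punctuation.
sub perlish_metric {
    my ($text) = @_;
    my $length = length $text;
    return 0 unless $length;    # added guard: avoid dividing by zero on empty input
    my $perlish = ($text =~ tr/@$%;{}[]<>=~//);    # tr with empty replacement only counts
    return $perlish / $length;
}

# Hypothetical helper: same thresholds as the original script.
sub classify {
    my ($text) = @_;
    my $metric = perlish_metric($text);
    return $metric > 0.10 ? 'code'
         : $metric < 0.03 ? 'text'
         :                  'debatable';
}

print classify('my %h = (a => 1); print $h{a};'), "\n";
print classify('The quick brown fox jumps over the lazy dog.'), "\n";
```

Having the thresholds in one place also makes it easier to tune them later against real posts.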

perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'

Replies are listed 'Best First'.
Re^2: heuristic to detect (perl) code
by LanX (Bishop) on Jan 19, 2013 at 12:03 UTC
    yep, it's an extended version of my sigil-frequency idea... but it's a good start, THANKS! =)

    But IMHO it would be necessary to roughly strip:

    • comments,
    • strings,
    • here-docs and
    • __DATA__, __END__ sections,

    since adding some comments makes your example already "debatable".

    BTW: I tried your Nodelet hack in Re^3: CSS Show and Tell: Colored Code but it didn't work for me... :(

    Do you know if CodeMagic provides an API to do code sniffing or is the logic internal?

    Cheers Rolf

      1) Comments:

      my $count = ($line =~ s/[#] .*? \n//xms); $total += $count;

      2) Strings:

      $line =~ s/["'] [^'"\n]* ['"]//gxms;

      3) use statements:

      $count = ($line =~ s/\b use \s+ [^;]+ ;//gxms); $total += $count;

      4) Here docs:

      my $count = ($all_text =~ s/<<(\w+) .*? \1//gxms); $total += $count;

      5) __DATA__, __END__:

      $total++ if $all_text =~ s/(__DATA__|__END__) .*//xms;
      Although, I would argue that if __DATA__ or __END__ appears anywhere in the text, then you couldn't go wrong by declaring then and there that the text has Perl code in it.
        It's not that easy

        for instance:

        1) a comment '#' should (mostly) follow a newline or a semicolon, or to be more precise, the '#' shouldn't be preceded by a quote-like operator (a bare s, y, tr, q, qq, qr, qw or whatever)

        2) strings are closed by the same quote so you need to capture the opening one and check the ending with \1.

        3) __DATA__ must appear at the start of a line; OTOH the mere existence of __DATA__ is already a good indicator of Perl code.
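        The three points above could be sketched roughly like this. It is an illustration under assumptions, not a tested solution: strip_noise is a made-up name, here-docs and quote-like operators such as q{} or s### are not handled at all, and apostrophes in ordinary prose will be mistaken for string delimiters.

```perl
use strict;
use warnings;

# Hypothetical helper: roughly strip non-code noise before counting sigils.
# Returns the stripped text and a flag saying whether __DATA__/__END__ was seen.
sub strip_noise {
    my ($text) = @_;

    # 3) __DATA__/__END__ only count at the start of a line; drop everything after.
    my $has_data = ($text =~ s/^__(?:DATA|END)__[ \t]*(?:\n.*)?\z//ms) ? 1 : 0;

    # 2) The closing quote must be the same as the opening one (\1);
    #    backslash-escaped characters inside the string are skipped.
    $text =~ s/(["'])(?:\\.|(?!\1).)*?\1//gs;

    # 1) Treat '#' as a comment only at line start or after a semicolon.
    #    Strings were removed first, so a '#' inside quotes is already gone.
    $text =~ s/(^|;)[ \t]*#[^\n]*/$1/gm;

    return ($text, $has_data);
}

my ($clean, $has_data) = strip_noise(qq{print "a # b"; # note\n__END__\nprose\n});
print $clean;
print "saw __DATA__/__END__\n" if $has_data;
```

        The order matters: stripping strings first protects a '#' inside quotes, at the price of being confused by a quote character inside a comment.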

        I think discussing individual strategies is in vain; in the end you have to test and train different criteria against a suitably large number of PerlMonks posts to see whether the code sections are found.

        A Bayes classifier offers a very good mathematical method for combining the probabilities from such criteria.
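        To make the idea concrete, here is a minimal naive Bayes sketch. Everything specific in it is an assumption for illustration: the three features, the helper names prob_code and features_of, and above all the probabilities, which are invented and would have to be estimated from labelled posts. It also assumes the features are conditionally independent, which is the "naive" part.

```perl
use strict;
use warnings;

# Invented numbers: P(feature present | code) and P(feature present | text).
my %likelihood = (
    has_sigil_var  => { code => 0.90, text => 0.05  },
    has_semicolon  => { code => 0.95, text => 0.20  },
    has_data_token => { code => 0.10, text => 0.001 },
);

# Hypothetical helper: naive Bayes posterior P(code | features).
sub prob_code {
    my ($features, $prior_code) = @_;
    # Work in log-odds so that multiplying many small probabilities cannot underflow.
    my $log_odds = log($prior_code / (1 - $prior_code));
    for my $name (sort keys %likelihood) {
        my ($pc, $pt) = @{ $likelihood{$name} }{qw(code text)};
        $log_odds += $features->{$name}
            ? log($pc / $pt)                 # feature present
            : log((1 - $pc) / (1 - $pt));    # feature absent
    }
    return 1 / (1 + exp(-$log_odds));        # convert log-odds back to a probability
}

# Hypothetical feature extractor: each criterion is a crude yes/no test.
sub features_of {
    my ($text) = @_;
    return {
        has_sigil_var  => ($text =~ /[\$\@%]\w/          ? 1 : 0),
        has_semicolon  => ($text =~ /;[ \t]*$/m          ? 1 : 0),
        has_data_token => ($text =~ /^__(?:DATA|END)__/m ? 1 : 0),
    };
}

printf "%.3f\n", prob_code(features_of('my $x = 1;'), 0.5);
printf "%.3f\n", prob_code(features_of('Hello there, how are you today?'), 0.5);
```

        Each criterion only contributes a likelihood ratio, so adding or dropping a criterion does not require rethinking the others.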

        Some of the products I listed in the OP use this approach; they are just not trained on PerlMonks posts (where tiny code snippets also appear in text) and may have too heavy a footprint to be integrated here.


        For instance highlight.js has a function which returns the guessed language.

        Cheers Rolf
