heuristic to detect (perl) code

by LanX (Canon)
on Jan 19, 2013 at 08:28 UTC
LanX has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm meditating on a regex-based heuristic to roughly detect whether a text paragraph (multiple lines delimited by '\n\n') is more likely Perl source code than normal text.

The best idea I've had so far: use regexes to count the lines ending in ';' or '}', possibly followed by a '#' comment.
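A minimal sketch of that line-ending count, just to make the idea concrete. The function name `code_line_ratio` is made up for illustration, and what ratio should count as "code" is left open:

```perl
use strict;
use warnings;

# Fraction of non-blank lines that end in ';' or '}',
# optionally followed by whitespace and a trailing '#' comment.
sub code_line_ratio {
    my ($paragraph) = @_;
    my @lines = grep { /\S/ } split /\n/, $paragraph;
    return 0 unless @lines;
    my $hits = grep { /[;}]\s*(?:#.*)?$/ } @lines;
    return $hits / @lines;
}
```

Lines that only open a block (ending in '{') are not counted here, so typical code will score well below 1 while ordinary prose should stay close to 0.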

Another is to check the frequency of words starting with a sigil.
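The sigil check could be sketched like this. `sigil_word_ratio` is a hypothetical name, and "word" is assumed to mean a whitespace-separated token:

```perl
use strict;
use warnings;

# Fraction of whitespace-separated words that start with a sigil
# ($, @ or %) followed by a letter or underscore. Requiring a letter
# keeps prose like "$5" from being counted.
sub sigil_word_ratio {
    my ($paragraph) = @_;
    my @words = split ' ', $paragraph;
    return 0 unless @words;
    my $sigiled = grep { /^[\$\@%][A-Za-z_]/ } @words;
    return $sigiled / @words;
}
```

This misses variables glued to punctuation, such as `($x)`, so it undercounts. For a fuzzy detector that may be acceptable.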

I'm not talking about a valid parser, just a fuzzy detector.

Any better ideas?

One use case could be a JS script that checks the contents of a posting in the monastery and warns about missing <code> tags, offering to include them.

(I'm a bit tired of unreadable posts here, and all the following edit-considerations and replies)

Cheers Rolf

PS: I'm not sure whether this thread belongs better in PM-Discussions.

Update

Other ideas:

(average) line length:
    lines of code are shorter than lines of regular text
indentation:
    text rarely has indented parts
word frequency:
    statistics should show significant frequency differences of keywords in text and code
genetic algorithm trained on archive:
    downloading old posts to optimize the best mix of different metrics
typical starters:
    shebang, use strict; ...
Conditional_probability / Naive_Bayes_classifier:
    combining the results of different checks
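To illustrate the last point, here is a rough sketch of how yes/no results from several checks could be combined naive-Bayes style, using log-odds. The feature names and every probability in `%likelihood` are invented placeholders. Real values would have to be estimated from labelled posts, e.g. from the archive:

```perl
use strict;
use warnings;

# HYPOTHETICAL likelihoods: P(check fires | code) and P(check fires | text).
# These numbers are placeholders, not measurements.
my %likelihood = (
    ends_with_semicolon => { code => 0.70, text => 0.02 },
    has_sigil_word      => { code => 0.80, text => 0.05 },
    is_indented         => { code => 0.60, text => 0.05 },
    short_lines         => { code => 0.75, text => 0.30 },
);

# Naive Bayes in log-odds form: start from the prior odds, then add
# log(P(f|code) / P(f|text)) for every check that fired, and the
# complementary ratio for every check that did not.
sub code_log_odds {
    my ($features, $prior_code) = @_;
    $prior_code //= 0.5;
    my $log_odds = log($prior_code / (1 - $prior_code));
    for my $name (sort keys %likelihood) {
        my ($p_code, $p_text) = @{ $likelihood{$name} }{qw(code text)};
        if ($features->{$name}) {
            $log_odds += log($p_code / $p_text);
        }
        else {
            $log_odds += log((1 - $p_code) / (1 - $p_text));
        }
    }
    return $log_odds;
}
```

A result above 0 means "more likely code than text" under these assumptions. For example, `code_log_odds({ ends_with_semicolon => 1, has_sigil_word => 1 })` treats the two unmentioned checks as not having fired.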

Interesting links
  • highlight.js
  • SyntaxHighlighter.js
  • naive bayes classification course (Perl)
  • identify-programming-languages-with-source-classifier (Ruby)
  • how-to-detect-programming-language-from-a-string (SO)
  • detecting-programming-language-from-a-snippet (SO)
    Re: heuristic to detect (perl) code
    by Anonymous Monk on Jan 19, 2013 at 08:40 UTC
        thanks but you forgot to link to PPI.js

        Cheers Rolf

          thanks but you forgot to link to PPI.js

          PPI is fairly straightforward: s/// is easily converted to .replace, and the regexes are of the simple variety, so it is possible

          OTOH :) Re^2: CSS Show and Tell: Colored Code

    Re: heuristic to detect (perl) code
    by tobyink (Abbot) on Jan 19, 2013 at 11:45 UTC
      use 5.010;
      use strict;
      use warnings;
      use File::Slurp qw(slurp);

      my $text = slurp(__FILE__);
      my $length = length $text;
      my $perlish = ($text =~ y(@$%;{}[]<>=~)//);
      my $metric = $perlish / $length;
      say "Metric is $metric";
      if ($metric > 0.10) {
        say "Looks like code";
      } elsif ($metric < 0.03) {
        say "Looks like text";
      } else {
        say "Debatable";
      }

      I've only tried this on a few sample inputs, but it hasn't failed once. It correctly detects itself.

      As you can see, it's a very simple metric, so should be trivial to port to Javascript or anything else.

      perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
        yep, it's an extended version of my sigil-frequency idea... but it's a good start, THANKS! =)

        But IMHO it would be necessary to roughly strip:

        • comments,
        • strings,
        • here-docs and
        • __DATA__, __END__ sections,

        since adding some comments makes your example already "debatable".

        BTW: I tried your Nodelet hack in Re^3: CSS Show and Tell: Colored Code but it didn't work for me... :(

        Do you know if CodeMagic provides an API to do code sniffing or is the logic internal?

        Cheers Rolf

          1) Comments:

          my $count = ($line =~ s/[#] .*? \n//xms); $total += $count;

          2) Strings:

          $line =~ s/["'] [^'"\n]* ['"]//gxms;

          3) use statements:

          $count = ($line =~ s/use [^;]+ ;//gxms); $total += $count;

          4) Here docs:

          $count = ($all_text =~ s/<<(\w+) .*? \1//gxms); $total += $count;

          5) __DATA__, __END__:

          $total++ if $all_text =~ s/(__DATA__|__END__) .*//xms;
          Although I would argue that if __DATA__ or __END__ appears anywhere in the text, you couldn't go wrong by declaring then and there that the text has Perl code in it.
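Putting the two replies together, one possible sketch strips the noisy parts first and then applies the punctuation metric from above. The names `strip_noise` and `perlish_ratio` are made up, and the stripping order (end sections, here-docs, strings, then comments) is an assumption chosen so that a '#' inside a string is not mistaken for a comment:

```perl
use strict;
use warnings;

# Remove parts that distort the punctuation count. Deliberately rough:
# this is a heuristic, not a parser.
sub strip_noise {
    my ($text) = @_;
    $text =~ s/^__(?:DATA|END)__\b.*//ms;         # __DATA__/__END__ to end of text
    $text =~ s/<<["']?(\w+)["']?.*?^\1$//gms;      # here-docs up to the terminator line
    $text =~ s/"[^"\n]*"|'[^'\n]*'//g;            # simple single-line quoted strings
    $text =~ s/(?<![\$\\])#.*$//gm;                # '#' comments, but not $#array or \#
    return $text;
}

# Share of "perlish" punctuation characters, same character set as the
# metric posted earlier in this thread.
sub perlish_ratio {
    my ($text) = @_;
    my $length = length $text;
    return 0 unless $length;
    my $perlish = ($text =~ tr/@$%;{}[]<>=~//);
    return $perlish / $length;
}
```

Usage would be `perlish_ratio(strip_noise($paragraph))`, compared against thresholds such as the 0.10 and 0.03 suggested above. Note that apostrophes in prose ("don't ... it's") can be misread as a quoted string by the third substitution, so plain text may lose a few words. That should not matter much for the ratio.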
