heuristic to detect (perl) code

by LanX (Chancellor)
on Jan 19, 2013 at 08:28 UTC ( #1014182=perlquestion )
LanX has asked for the wisdom of the Perl Monks concerning the following question:


I'm meditating on a regex-based heuristic to roughly detect whether a text paragraph (multiple lines delimited by '\n\n') is more likely Perl source code than normal text.

The best idea I had so far was: using regexes to count the line endings with ';' or '}' possibly followed with a '#' part.
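A rough sketch of that line-ending count, for illustration only. The sub name code_line_ratio is made up here, and ignoring blank lines is an assumption, not something fixed by the idea itself:

```perl
use strict;
use warnings;

# Fraction of non-blank lines that end in ';' or '}',
# optionally followed by whitespace and a '#' comment.
sub code_line_ratio {
    my ($paragraph) = @_;
    my @lines = grep { /\S/ } split /\n/, $paragraph;
    return 0 unless @lines;
    my $hits = grep { /[;}]\s*(?:#.*)?$/ } @lines;
    return $hits / @lines;
}
```

A ratio near 1 would hint at code and a ratio near 0 at prose; where to put the cut-off is left open.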

Another is to check the frequency of words starting with a sigil.
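The sigil idea could look roughly like this. The sub name sigil_word_ratio is hypothetical, and two choices are assumptions: only '$', '@' and '%' count as sigils, and the sigil must be followed by a letter or underscore so that prices like "$5" in normal text are not counted:

```perl
use strict;
use warnings;

# Fraction of whitespace-separated words that start with a sigil,
# optionally preceded by opening brackets such as '(' or '{'.
sub sigil_word_ratio {
    my ($paragraph) = @_;
    my @words = split ' ', $paragraph;
    return 0 unless @words;
    my $hits = grep { /^[(\[{]*[\$\@%][A-Za-z_]/ } @words;
    return $hits / @words;
}
```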

I'm not talking about a valid parser, just a fuzzy detector.

Any better ideas?

One use case could be a JS that checks the contents of a posting in the monastery and warns about missing <code> tags, offering to include them.

(I'm a bit tired of unreadable posts here, and all the following edit-considerations and replies)

Cheers Rolf

PS: I'm not sure if this thread belongs better in PM Discussions.


Other ideas:

  • (average) line length: lines of code are usually shorter than lines of regular text
  • indentation: text rarely has indented parts
  • word frequency: statistics should show significant differences in keyword frequency between text and code
  • genetic algorithm trained on archive: download old posts to optimize the best mix of different metrics
  • typical starters: shebang, use strict; ...
  • Conditional_probability / Naive_Bayes_classifier: combine the results of different checks
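Three of the ideas above (line length, indentation, typical starters) are cheap to measure. The following is only a sketch: the sub name layout_metrics, the hash keys, and the rule that "indented" means a leading tab or at least two spaces are all assumptions. How to weight or combine the numbers (Naive Bayes or otherwise) is deliberately left out:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Collect simple layout metrics for one paragraph.
sub layout_metrics {
    my ($paragraph) = @_;
    my @lines = grep { /\S/ } split /\n/, $paragraph;
    return +{ avg_length => 0, indented => 0, starter => 0 } unless @lines;

    # Lines starting with a tab or at least two spaces count as indented.
    my $indented = grep { /^(?:\t| {2,})\S/ } @lines;

    # A shebang mentioning perl, or 'use strict' / 'use warnings' at a line start.
    my $starter = $paragraph =~ /^\s*(?:#!.*perl|use\s+(?:strict|warnings)\b)/m ? 1 : 0;

    return +{
        avg_length => sum(map { length($_) } @lines) / @lines,
        indented   => $indented / @lines,
        starter    => $starter,
    };
}
```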

Interesting links
  • highlight.js
  • SyntaxHighlighter.js
  • naive bayes classification course (Perl)
  • identify-programming-languages-with-source-classifier (Ruby)
  • how-to-detect-programming-language-from-a-string (SO)
  • detecting-programming-language-from-a-snippet (SO)
    Re: heuristic to detect (perl) code
    by tobyink (Abbot) on Jan 19, 2013 at 11:45 UTC
      use 5.010;
      use strict;
      use warnings;
      use File::Slurp qw(slurp);

      my $text    = slurp(__FILE__);
      my $length  = length $text;
      my $perlish = ($text =~ y(@$%;{}[]<>=~)//);
      my $metric  = $perlish / $length;

      say "Metric is $metric";

      if ($metric > 0.10) {
          say "Looks like code";
      }
      elsif ($metric < 0.03) {
          say "Looks like text";
      }
      else {
          say "Debatable";
      }

      I've only tried this on a few sample inputs, but it hasn't failed once. It correctly detects itself.

      As you can see, it's a very simple metric, so should be trivial to port to Javascript or anything else.
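Since the original question is about paragraphs rather than whole files, the same metric could be applied per paragraph. This is a sketch under assumptions: the sub name classify_paragraphs is made up, paragraphs are taken to be separated by one or more blank lines, and the 0.10 / 0.03 thresholds are simply reused from the script above without further tuning:

```perl
use strict;
use warnings;

# Classify each blank-line-delimited paragraph as 'code', 'text' or 'debatable'
# from the share of Perl-ish punctuation characters it contains.
sub classify_paragraphs {
    my ($text) = @_;
    my @verdicts;
    for my $para (split /\n{2,}/, $text) {
        next unless $para =~ /\S/;
        # tr/// does not interpolate, so '@', '$' and '%' are counted literally.
        my $perlish = ($para =~ tr/@$%;{}[]<>=~//);
        my $metric  = $perlish / length($para);
        push @verdicts, $metric > 0.10 ? 'code'
                      : $metric < 0.03 ? 'text'
                      :                  'debatable';
    }
    return @verdicts;
}
```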

      perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
        yep, it's an extended version of my sigil-frequency idea... but it's a good start, THANKS! =)

        But IMHO it would be necessary to roughly strip:

        • comments,
        • strings,
        • here-docs and
        • __DATA__, __END__ sections,

        since adding some comments makes your example already "debatable".

        BTW: I tried your Nodelet hack in Re^3: CSS Show and Tell: Colored Code but it didn't work for me...:(

        Do you know if CodeMagic provides an API to do code sniffing or is the logic internal?

        Cheers Rolf

          1) Comments:

          my $count = ($line =~ s/[#] .*? \n//xms); $total += $count;

          2) Strings:

          $line =~ s/["'] [^'"\n]* ['"]//gxms;   # '*' added: without it only one-character strings match

          3) use statements:

          $count = ($line =~ s/use [^;]+ ;//gxms); $total += $count;

          4) Here docs:

          my $count = ($all_text =~ s/<<(\w+) .*? \1//gxms); $total += $count;

          5) __DATA__, __END__:

          $total++ if $all_text =~ s/(__DATA__|__END__) .*//xms;
          Although, I would argue that if __DATA__ or __END__ appear anywhere in the text, then you couldn't go wrong by declaring then and there that the text has perl code in it.
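For the stripping steps alone, the pieces above could be folded into one pass before any metric is computed. This is a rough sketch, not a parser: the sub name strip_noise is hypothetical, it does not keep the per-step counts, and it will misfire on things like '#' inside a regex, $#array, or multi-line strings:

```perl
use strict;
use warnings;

# Remove the parts of a snippet that tend to look like prose.
sub strip_noise {
    my ($text) = @_;

    # Everything from a line starting with __DATA__ or __END__ onward.
    $text =~ s/^__(?:DATA|END)__\b.*//ms;

    # Here-docs: <<TAG, <<"TAG" or <<'TAG' up to the line consisting of TAG.
    $text =~ s/<<(["']?)(\w+)\1.*?^\2$//gms;

    # Quoted strings that stay on a single line.
    $text =~ s/"[^"\n]*"|'[^'\n]*'//g;

    # Comments, from '#' to the end of the line.
    $text =~ s/#.*$//gm;

    return $text;
}
```

Strings are removed before comments so that a '#' inside a quoted string does not swallow the rest of the line.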
    Re: heuristic to detect (perl) code
    by Anonymous Monk on Jan 19, 2013 at 08:40 UTC
        thanks but you forgot to link to PPI.js

        Cheers Rolf

          thanks but you forgot to link to PPI.js

          PPI is fairly straightforward: s/// is easily converted to .replace, and the regexes are of the simple variety, so a port is possible.

          OTOH :) Re^2: CSS Show and Tell: Colored Code
