PerlMonks  

heuristic to detect (perl) code

by LanX (Canon)
on Jan 19, 2013 at 08:28 UTC (#1014182=perlquestion)
LanX has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I'm meditating on a regex-based heuristic to roughly detect whether a text paragraph (multiple lines, delimited by '\n\n') is more likely Perl source code than normal text.

The best idea I've had so far: use regexes to count the lines ending in ';' or '}', possibly followed by a '#' comment.
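A minimal sketch of that line-ending count, assuming the paragraph arrives as one string. The function name `code_line_ratio` and the exact regex are illustrative choices, not something settled in this thread.

```perl
use strict;
use warnings;

# Fraction of non-blank lines that end in ';' or '}', optionally
# followed by whitespace and a '#' comment.
# (Hypothetical helper name; the regex is one possible reading of the idea.)
sub code_line_ratio {
    my ($paragraph) = @_;
    my @lines = grep { /\S/ } split /\n/, $paragraph;
    return 0 unless @lines;
    my $hits = grep { /[;}]\s*(?:#.*)?$/ } @lines;
    return $hits / @lines;
}

my $code = <<'END';
my $x = 1;   # counter
if ($x) {
    print $x;
}
END

my $prose = <<'END';
This is plain prose.
It has no statement terminators at all.
END

printf "code:  %.2f\n", code_line_ratio($code);
printf "prose: %.2f\n", code_line_ratio($prose);
```

Where to put the cut-off between "code" and "text" is left open here; it would have to be tuned on real posts.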

Another is to check the frequency of words starting with a sigil.
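The sigil idea could be sketched like this, assuming "word" means a whitespace-separated token. The helper name `sigil_ratio` is made up for illustration, and variables glued to other punctuation, such as `print($x)`, are deliberately not counted in this simple version.

```perl
use strict;
use warnings;

# Share of whitespace-separated words that start with a sigil
# ($, @, %, &) followed by a letter, underscore or '{'.
# (Hypothetical helper; prices like "$5" are skipped on purpose.)
sub sigil_ratio {
    my ($paragraph) = @_;
    my @words = split ' ', $paragraph;
    return 0 unless @words;
    my $sigiled = grep { /^[\$\@%&][A-Za-z_{]/ } @words;
    return $sigiled / @words;
}

my $code  = 'my $x = shift; my @list = (1, 2); print $x, @list;';
my $prose = 'The price went up by 5% and I paid $5 for it.';

printf "code:  %.2f\n", sigil_ratio($code);
printf "prose: %.2f\n", sigil_ratio($prose);
```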

I'm not talking about a valid parser, just a fuzzy detector.

Any better ideas?

One use case could be a JS snippet that checks the contents of a posting in the monastery and warns about missing <code> tags, offering to include them.

(I'm a bit tired of unreadable posts here, and all the following edit-considerations and replies)

Cheers Rolf

PS: I'm not sure whether this thread belongs in PM-Discussions instead.

Update

Other ideas:

(average) line length: lines of code are shorter than lines of regular text
indentation: text rarely has indented parts
word frequency: statistics should show significant differences in keyword frequency between text and code
genetic algorithm trained on the archive: download old posts to optimize the best mix of different metrics
typical starters: shebang, use strict; ...
conditional probability / naive Bayes classifier: combine the results of different checks
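A rough sketch of three of the metrics listed above (average line length, indentation, typical starters). The function name, the hash keys and the starter patterns are assumptions made for illustration; no thresholds are suggested because none were measured here.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Collect simple per-paragraph features.
# (Hypothetical helper; the "starter" patterns cover only a shebang
# mentioning perl and 'use strict' / 'use warnings' on the first line.)
sub paragraph_features {
    my ($paragraph) = @_;
    my @lines = grep { /\S/ } split /\n/, $paragraph;
    return +{ avg_length => 0, indented => 0, starter => 0 } unless @lines;

    my $avg_length = sum(map { length $_ } @lines) / @lines;
    # A line counts as indented if it starts with a tab or 2+ spaces.
    my $indented = grep { /^(?:\t| {2,})\S/ } @lines;
    my $starter  = $lines[0] =~ /^\s*(?:#!.*\bperl\b|use\s+(?:strict|warnings)\b)/ ? 1 : 0;

    return +{
        avg_length => $avg_length,
        indented   => $indented / @lines,
        starter    => $starter,
    };
}

my $features = paragraph_features(<<'END');
use strict;
for my $i (1 .. 3) {
    print $i;
}
END

printf "%-10s %s\n", $_, $features->{$_} for sort keys %$features;
```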

Interesting links
  • highlight.js
  • SyntaxHighlighter.js
  • naive bayes classification course (Perl)
  • identify-programming-languages-with-source-classifier (Ruby)
  • how-to-detect-programming-language-from-a-string (SO)
  • detecting-programming-language-from-a-snippet (SO)
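For the naive Bayes item in the update, here is a minimal sketch of a token-based (multinomial) classifier with add-one smoothing. Note that it classifies on raw tokens rather than combining the separate checks listed above, so it is a stand-in for the idea, not an implementation of it. All function names are hypothetical, and the training samples are toy strings invented for the example; a real detector would be trained on archived posts.

```perl
use strict;
use warnings;

# Crude tokenizer: words, plus every punctuation character on its own,
# so '$x;' yields '$', 'x', ';'.
sub tokens {
    my ($text) = @_;
    return map { lc $_ } $text =~ /\w+|[^\w\s]/g;
}

# Build per-class token counts from: class name => array ref of samples.
sub train {
    my (%examples) = @_;
    my %model;
    for my $class (keys %examples) {
        for my $sample (@{ $examples{$class} }) {
            $model{docs}{$class}++;
            for my $tok (tokens($sample)) {
                $model{count}{$class}{$tok}++;
                $model{total}{$class}++;
                $model{vocab}{$tok} = 1;
            }
        }
    }
    return \%model;
}

# Return the class with the highest log posterior.
sub classify {
    my ($model, $text) = @_;
    my $vocab_size = keys %{ $model->{vocab} };
    my $all_docs   = 0;
    $all_docs += $_ for values %{ $model->{docs} };

    my %score;
    for my $class (keys %{ $model->{docs} }) {
        # log prior + sum of log likelihoods, add-one (Laplace) smoothed
        my $score = log($model->{docs}{$class} / $all_docs);
        for my $tok (tokens($text)) {
            my $count = $model->{count}{$class}{$tok} // 0;
            $score += log(($count + 1) / ($model->{total}{$class} + $vocab_size));
        }
        $score{$class} = $score;
    }
    my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
    return $best;
}

# Toy training data, made up for this example only.
my $model = train(
    code => [
        'my $x = shift; print $x;',
        'foreach my $i (@list) { push @out, $i * 2; }',
        'use strict; my %h = (a => 1); return $h{a};',
    ],
    text => [
        'I think this is a good idea, thanks for the help.',
        'The best way to do it is to read the documentation first.',
        'Can you show me what you have tried so far?',
    ],
);

print classify($model, 'my @items = sort @list; print $items[0];'), "\n";
print classify($model, 'Thanks, this is what I have tried.'), "\n";
```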
    Re: heuristic to detect (perl) code
    by Anonymous Monk on Jan 19, 2013 at 08:40 UTC
        thanks but you forgot to link to PPI.js

        Cheers Rolf

          thanks but you forgot to link to PPI.js

          PPI is fairly straightforward: s/// is easily converted to .replace, and the regexes are of the simple variety, so a port is possible.

          OTOH :) Re^2: CSS Show and Tell: Colored Code

    Re: heuristic to detect (perl) code
    by tobyink (Abbot) on Jan 19, 2013 at 11:45 UTC
      use 5.010;
      use strict;
      use warnings;
      use File::Slurp qw(slurp);

      my $text    = slurp(__FILE__);
      my $length  = length $text;
      my $perlish = ($text =~ y(@$%;{}[]<>=~)//);
      my $metric  = $perlish / $length;

      say "Metric is $metric";
      if ($metric > 0.10) {
          say "Looks like code";
      }
      elsif ($metric < 0.03) {
          say "Looks like text";
      }
      else {
          say "Debatable";
      }

      I've only tried this on a few sample inputs, but it hasn't failed once. It correctly detects itself.

      As you can see, it's a very simple metric, so should be trivial to port to Javascript or anything else.
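Since the original question is about paragraphs rather than whole files, the same metric could be applied per paragraph along these lines. This is a sketch that reuses the 0.10 and 0.03 thresholds from the script above; the function names are hypothetical, and splitting on blank lines is an assumption about how posts are laid out.

```perl
use strict;
use warnings;

# Classify a single paragraph using the punctuation-ratio thresholds
# from the script above (0.10 and 0.03).
sub classify_paragraph {
    my ($paragraph) = @_;
    my $length = length $paragraph;
    return 'text' unless $length;
    my $perlish = ($paragraph =~ tr/@$%;{}[]<>=~//);   # counts, does not modify
    my $metric  = $perlish / $length;
    return $metric > 0.10 ? 'code'
         : $metric < 0.03 ? 'text'
         :                  'debatable';
}

# Split a post on blank lines and classify each non-empty paragraph.
sub classify_post {
    my ($post) = @_;
    return map { classify_paragraph($_) } grep { /\S/ } split /\n\s*\n/, $post;
}

my $post = <<'END';
Hello monks, why does this not work for me?

my @a = (1, 2);
print $a[0];

Thanks in advance
END

print join(', ', classify_post($post)), "\n";
```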

      perl -E'sub Monkey::do{say$_,for@_,do{($monkey=[caller(0)]->[3])=~s{::}{ }and$monkey}}"Monkey say"->Monkey::do'
        yep, it's an extended version of my sigil-frequency idea... but it's a good start, THANKS! =)

        But IMHO it would be necessary to roughly strip:

        • comments,
        • strings,
        • here-docs and
        • __DATA__, __END__ sections,

        since adding some comments makes your example already "debatable".

        BTW: I tried your Nodelet hack in Re^3: CSS Show and Tell: Colored Code but it didn't work for me... :(

        Do you know if CodeMagic provides an API to do code sniffing or is the logic internal?

        Cheers Rolf

          1) Comments:

          my $count = ($line =~ s/[#] .*? \n//xms); $total += $count;

          2) Strings:

          $line =~ s/["'] [^'"\n]* ['"]//gxms;

          3) use statements:

          $count = ($line =~ s/use [^;]+ ;//gxms); $total += $count;

          4) Here docs:

          my $count = ($all_text =~ s/<<(\w+) .*? \1//gxms); $total += $count;

          5) __DATA__, __END__:

          $total++ if $all_text =~ s/(__DATA__|__END__) .*//xms;
          Although, I would argue that if __DATA__ or __END__ appear anywhere in the text, then you couldn't go wrong by declaring then and there that the text has Perl code in it.
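Putting the stripping steps together with the punctuation ratio from earlier in the thread might look like the sketch below. It is one possible combination, not the code from either post: the function name is hypothetical, the regexes are simplified variants of the ones above, and returning 1 outright for __DATA__ / __END__ is just the shortcut argued for in the previous sentence.

```perl
use strict;
use warnings;

# Strip prose-like parts, then compute the share of "perlish"
# punctuation in what remains. (Hypothetical helper name.)
sub perlish_metric {
    my ($text) = @_;

    # A __DATA__ or __END__ line is treated as a sure sign of Perl.
    return 1 if $text =~ /^__(?:DATA|END)__$/m;

    $text =~ s/<<["']?(\w+)["']?.*?^\1$//gms;   # here-docs, up to the terminator line
    $text =~ s/"[^"\n]*"|'[^'\n]*'//g;          # single-line quoted strings
    $text =~ s/(?<![\$\\])#.*$//gm;             # comments, but not $#array or \#

    my $length = length $text;
    return 0 unless $length;
    my $perlish = ($text =~ tr/@$%;{}[]<>=~//);
    return $perlish / $length;
}

my $commented = <<'END';
# this comment line is long prose that would dilute the metric a lot
my %h = (a => 1);   # another trailing remark in plain english here
print $h{a};
END

printf "metric after stripping: %.3f\n", perlish_metric($commented);
```

Here-docs are removed first so that quoted terminators such as <<"EOT" are still recognizable; stripping strings before comments avoids treating a '#' inside a string as a comment start.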

    Node Type: perlquestion [id://1014182]
    Front-paged by davies