Re^3: heuristic to detect (perl) code

by 7stud (Deacon)
on Jan 20, 2013 at 09:22 UTC (#1014264)

in reply to Re^2: heuristic to detect (perl) code
in thread heuristic to detect (perl) code

1) Comments:

my $count = ($line =~ s/[#] .*? \n//xms); $total += $count;

2) Strings:

$line =~ s/["'] [^'"\n]* ['"]//gxms;

3) use statements:

$count = ($line =~ s/\b use \s+ [^;]+ ;//gxms); $total += $count;

4) Here docs:

my $count = ($all_text =~ s/<<(\w+) .*? \1//gxms); $total += $count;

5) __DATA__, __END__:

$total++ if $all_text =~ s/(__DATA__|__END__) .*//xms;
Although I would argue that if __DATA__ or __END__ appears anywhere in the text, you couldn't go wrong by declaring then and there that the text has Perl code in it.
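
Put together, the five substitutions could be wrapped in one counting routine. This is only a rough sketch: the name count_perl_hints is made up for illustration, everything runs on the whole text instead of line by line, the order is changed so that strings are stripped before comments, and the use pattern gets \b and \s+ so that words like "because" don't trigger it.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): runs the five
# substitutions on a copy of the text and counts what was removed.
sub count_perl_hints {
    my ($text) = @_;
    my $total = 0;

    # 5) __DATA__ / __END__ and everything after it
    $total++ if $text =~ s/(?:__DATA__|__END__) .*//xms;

    # 4) here docs of the bare form <<TAG ... TAG
    $total += ( $text =~ s/<<(\w+) .*? \1//gxms );

    # 2) strings, before comments, so a '#' inside quotes is not counted
    $total += ( $text =~ s/["'] [^'"\n]* ['"]//gxms );

    # 1) comments, from '#' to the end of the line
    $total += ( $text =~ s/[#] [^\n]* \n?//gxms );

    # 3) use statements (\b and \s+ are additions, see above)
    $total += ( $text =~ s/\b use \s+ [^;]+ ;//gxms );

    return $total;
}
```

A total of zero says little by itself, and what threshold counts as "code" would have to be found by trying it on real posts.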

Replies are listed 'Best First'.
Re^4: heuristic to detect (perl) code
by Anonymous Monk on Jan 20, 2013 at 10:25 UTC
Re^4: heuristic to detect (perl) code
by LanX (Bishop) on Jan 20, 2013 at 10:46 UTC
    It's not that easy

    for instance:

    1) a comment '#' should (mostly) follow a newline or a semicolon, or to be more precise, the '#' shouldn't be preceded by a quote-like operator (s, y, tr, q, qq, qr, qw or whatever)

    2) strings are closed by the same quote so you need to capture the opening one and check the ending with \1.

    3) __DATA__ must appear at the start of a line. OTOH, the mere existence of __DATA__ is already a good indicator of Perl code.
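
The three points above might be sketched roughly like this. The sub names are invented for illustration, and these are still heuristics: a comment after a statement without a semicolon is missed, and quote-like operators such as q{} or qq() are not handled at all.

```perl
use strict;
use warnings;

# 1) Count a '#' only where it starts a line or follows a ';'
#    (optionally after blanks), so s#a#b# is not taken for a comment.
sub count_comments {
    my ($text) = @_;
    my $n = () = $text =~ /(?: ^ | ; ) [ \t]* [#] [^\n]*/gxm;
    return $n;
}

# 2) Capture the opening quote and require the same quote (\1) to close.
#    Backslash-escaped characters inside the string are skipped over.
sub count_strings {
    my ($text) = @_;
    my $n = () = $text =~ /(["']) (?: \\. | (?!\1) [^\\\n] )* \1/gx;
    return $n;
}

# 3) Accept __DATA__ or __END__ only at the start of a line.
sub has_data_marker {
    my ($text) = @_;
    return $text =~ /^ __(?:DATA|END)__ \b/xm ? 1 : 0;
}
```

With the backreference, a line like my $s = "it's"; is seen as one string, whereas a pattern that accepts either quote as the closer would stop at the apostrophe.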

    I think discussing single strategies is in vain; in the end you have to test and train different criteria against a suitably large number of PerlMonks posts to see whether the code sections are found.

    With a Bayes classifier there is a very good mathematical method for combining the probabilities of such methods.
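
As a rough illustration of how such a combination could look, here is a minimal naive Bayes sketch. The feature names and all probabilities are invented for the example; real values would have to be estimated from labelled posts. It also assumes the features are independent of each other, which is the usual "naive" simplification.

```perl
use strict;
use warnings;

# Hypothetical likelihoods: [ P(feature | code), P(feature | prose) ].
# These numbers are made up for illustration, not measured.
my %likelihood = (
    has_sigil_var   => [ 0.90, 0.05  ],
    has_use_stmt    => [ 0.40, 0.02  ],
    has_data_marker => [ 0.05, 0.001 ],
);

# Returns P(code | observed features), given the prior P(code) and a
# hash of feature => true/false. Works in log-odds to keep it stable.
sub prob_is_code {
    my ($prior, %seen) = @_;
    my $log_odds = log( $prior / (1 - $prior) );
    for my $f (sort keys %likelihood) {
        my ($p_code, $p_prose) = @{ $likelihood{$f} };
        if ($seen{$f}) {
            $log_odds += log( $p_code / $p_prose );
        }
        else {
            $log_odds += log( (1 - $p_code) / (1 - $p_prose) );
        }
    }
    return 1 / (1 + exp(-$log_odds));
}
```

It would be called like prob_is_code(0.5, has_sigil_var => 1, has_use_stmt => 1), with each flag coming from one of the single criteria discussed in this thread.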

    Some of the products I listed in the OP use this approach; they are just not trained for PerlMonks posts (where tiny code snippets also appear in the text) and maybe have too heavy a footprint to be integrated here.


    For instance highlight.js has a function which returns the guessed language.

    Cheers Rolf
