Fastest way to minimally check that file contains perl code?

DRVTiny has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Fastest way to minimally check that file contains perl code? by vr (Curate) on Mar 13, 2020 at 14:21 UTC
> When i using perl -c it takes too much time to check

Why don't you spend only as much time as you are ready to spare, and not a millisecond more? (Note: I'm on Windows here. Use `Time::HiRes::ualarm` in Linux).

    use strict;
    use warnings;
    use feature 'say';
    use Time::HiRes 'time';
    use Win32::Process qw/ CREATE_NO_WINDOW STILL_ACTIVE /;

    my $timeout = 75; # 75 ms

    for my $fname (
        $0,                     # valid Perl, won't timeout
        'Robot3.pm',            # some valid Perl, will timeout
                                # (~ 250 ms to check normally)
        '../DISTRIBUTIONS.txt'  # list of Strawberry distributions
    ) {
        my $t = time;
        my $obj;
        Win32::Process::Create(
            $obj, $^X, "$^X -c $fname", 0, CREATE_NO_WINDOW, '.'
        ) or die;
        $obj-> Wait( $timeout );
        my $code;
        $obj-> GetExitCode( $code );
        print "$fname is ",
            ( $code == 0 or $code == STILL_ACTIVE )
                ? "valid perl"
                : "something else";
        $obj-> Kill( 0 );
        printf ", we spent %.3fs to check\n", time - $t;
    }

    __END__
    alarm.pl is valid perl, we spent 0.058s to check
    Robot3.pm is valid perl, we spent 0.072s to check
    ../DISTRIBUTIONS.txt is something else, we spent 0.025s to check
Re^2: Fastest way to minimally check that file contains perl code? by Fletch (Bishop) on Mar 13, 2020 at 14:58 UTC
Taking this (neat) idea even further, you could spin up a worker pool (Parallel::ForkManager or MCE) to parallelize the checks and increase the number of files you get through in that same maximal time. The question is whether you have enough files that the parallelism overhead, amortized across all of them, is worth that hit.

The cake is a lie. The cake is a lie. The cake is a lie.
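A rough sketch of the parallel variant, assuming a POSIX system. It uses plain `fork`/`exec` from core Perl rather than Parallel::ForkManager or MCE, starts one `perl -c` child per file with no cap on concurrency, and gives all of them one shared time budget. The name `check_many` and the three verdict strings are made up for illustration; following vr's post, you may want to read "timeout" as "probably Perl".

```perl
use strict;
use warnings;
use POSIX qw(:sys_wait_h);
use Time::HiRes qw(time sleep);

# Run `perl -c` on every file at once and give the whole batch
# $timeout seconds. Returns a hash ref: file => valid|invalid|timeout.
sub check_many {
    my ($timeout, @files) = @_;
    my (%file_for, %verdict);

    for my $file (@files) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: discard perl's chatter, then become `perl -c FILE`.
            open STDOUT, '>', '/dev/null' or POSIX::_exit(127);
            open STDERR, '>', '/dev/null' or POSIX::_exit(127);
            exec $^X, '-c', $file or POSIX::_exit(127);
        }
        $file_for{$pid} = $file;
    }

    # Reap finished children until none are left or time runs out.
    my $deadline = time + $timeout;
    while (%file_for and time < $deadline) {
        my $pid = waitpid(-1, WNOHANG);
        if ($pid > 0 and exists $file_for{$pid}) {
            my $status = $?;
            my $file   = delete $file_for{$pid};
            $verdict{$file} = $status == 0 ? 'valid' : 'invalid';
        }
        else {
            sleep 0.005;    # nothing finished yet, poll again shortly
        }
    }

    # Whatever is still running has used up the budget: kill and reap it.
    for my $pid (keys %file_for) {
        kill 'KILL', $pid;
        waitpid($pid, 0);
        $verdict{ $file_for{$pid} } = 'timeout';
    }
    return \%verdict;
}

# Usage: perl check_many.pl file1 file2 ...   (75 ms for the whole batch)
if (@ARGV) {
    my $result = check_many(0.075, @ARGV);
    print "$_: $result->{$_}\n" for sort keys %$result;
}
```

With many files, a real pool that caps the number of simultaneous children would be kinder to the machine; that is where the modules Fletch mentions come in.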
Re: Fastest way to minimally check that file contains perl code? by LanX (Saint) on Mar 13, 2020 at 12:21 UTC
This thread might be of interest: "heuristic to detect (perl) code". It's actually very detailed and shows a variety of strategies I already forgot about.

HTH! :)

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
Re: Fastest way to minimally check that file contains perl code? by haukex (Archbishop) on Mar 13, 2020 at 12:37 UTC
> For me it is absolutely enough to know that, say, file XXX is a perl source file with a probability of 80%. ... I tried file and ohlohcount utilities as well, but its assumptions is VERY inaccurate.

How inaccurate is `file` exactly? What test cases did it have trouble with? I would have thought that if you don't need accuracy, something like checking the first couple of lines against a regex that looks for the shebang line and/or some `use` statements might be enough for ~80%, but I think you'll need to be more specific in your question for more specific answers.

Only `perl` can parse Perl; I think solutions like PPI will likely be slower. You could try PPR:

    use PPR;

    my $perl_code = <<'END';
    #!/usr/bin/env perl
    use warnings;
    use strict;
    print "Hello, World!\n";
    END

    if ( $perl_code =~ m{ \A (?&PerlDocument) \Z $PPR::GRAMMAR }x ) {
        print "It looks like it could parse as Perl\n"
    }
    else {
        print "It doesn't look like Perl\n"
    }
Re: Fastest way to minimally check that file contains perl code? by LanX (Saint) on Mar 13, 2020 at 12:06 UTC
In general, 80% should be easy enough to achieve: use some rough regex to strip all variables with sigils, function calls and comments, and count the built-in commands...

A quick search didn't show any CPAN modules for that (which is most likely due to my search strategy).

I know about JS libraries for syntax highlighting that guess the language of the code. The recommended technique is to train a classifier based on a "term frequency algo" (see tf-idf) with right and wrong code. (The problem here is that I don't know your wrong code; PHP should pose the biggest problem.)

We also had a similar discussion in the past in order to decide if a poster forgot code tags.

Update: After finding it again, I realized that it's so detailed that it merits an extra reply.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
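One possible reading of the "strip and count" idea, as a minimal sketch. The keyword list is a small hand-picked sample, not a complete list of Perl built-ins, and the function name `builtin_ratio` is invented here. It strips comments and sigiled variables but leaves function calls alone, and any cut-off you compare the ratio against would have to be tuned on your own files.

```perl
use strict;
use warnings;

# A deliberately small, hand-picked sample of Perl keywords and built-ins.
my %BUILTIN = map { $_ => 1 } qw(
    my our local sub use package return print printf push pop shift
    unshift splice keys values defined undef foreach elsif unless
    chomp die warn bless scalar wantarray eval
);

# Fraction of the remaining words that are known Perl built-ins, after
# roughly stripping comments and sigiled variables.
sub builtin_ratio {
    my ($text) = @_;
    $text =~ s/(?:^|[ \t])#.*$//mg;   # comments (rough: ignores strings)
    $text =~ s/[\$\@%]\w+//g;         # $scalar, @array, %hash names
    my @words = $text =~ /\b([A-Za-z_]\w*)\b/g;
    return 0 unless @words;
    my $hits = grep { $BUILTIN{$_} } @words;
    return $hits / @words;
}

# Usage: perl builtin_ratio.pl file1 file2 ...
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $text = do { local $/; <$fh> };
    printf "%s: %.2f\n", $file, builtin_ratio($text);
}
```

Plain prose should score close to zero, and typical Perl noticeably higher. Languages that share keywords with Perl (PHP, as noted above) will sit somewhere in between.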
Re: Fastest way to minimally check that file contains perl code? by roboticus (Chancellor) on Mar 13, 2020 at 12:09 UTC
DRVTiny:

I don't know of any such tool, but if I had to try to create one, I'd probably do something like this:

- Break the suspect file up into chunks
- Use Parse::Perl to try to parse each chunk
- If the ratio of 'good chunks'/'chunks' meets your threshold, assume it's valid.

There are quite a few problems in this approach. Choosing a threshold will be difficult, and the value may be too sensitive to the files you use to test it. If there are enough templates or strings in the code, you'll likely reject the file. If you lower the acceptance ratio to pass those, then you may pass quite a bit of junk. Breaking the file into chunks is also a challenge--if you break them apart poorly, you'll get too many parse failures, so how will you split it up--will you split on newlines? semicolons? curly brackets?

Have you thought about recognizing various patterns of junk and rejecting files based on those patterns? It might be easier.

...roboticus

When your only tool is a hammer, all problems look like your thumb.
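The chunk idea could be sketched roughly as follows. This is not Parse::Perl: as a stand-in, the hypothetical `chunk_compiles` helper writes each chunk to a temporary file and runs `perl -c` on it, which is slow and assumes a POSIX shell for the redirection. Chunks are split on blank lines, which is only one of the options mentioned above. Keep in mind that `perl -c` still executes BEGIN blocks and `use` statements, so it is not safe on untrusted input.

```perl
use strict;
use warnings;
use File::Temp 'tempfile';

# Hypothetical checker: true if `perl -c` accepts the chunk on its own.
sub chunk_compiles {
    my ($chunk) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    print {$fh} $chunk;
    close $fh or die "close failed: $!";
    # Discard all output; only the exit status matters here.
    return system(qq{"$^X" -c "$path" >/dev/null 2>&1}) == 0;
}

# Split on blank lines and return good chunks / all chunks.
# A different checker (code ref) can be passed as the second argument.
sub good_chunk_ratio {
    my ($text, $checker) = @_;
    $checker ||= \&chunk_compiles;
    my @chunks = grep { /\S/ } split /\n\s*\n/, $text;
    return 0 unless @chunks;
    my $good = grep { $checker->($_) } @chunks;
    return $good / @chunks;
}
```

A block that spans a blank line gets cut in two and both halves fail, which is exactly the splitting problem described above.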
Re: Fastest way to minimally check that file contains perl code? by Eily (Monsignor) on Mar 13, 2020 at 12:44 UTC
If it's on Linux, a shebang line is probably a very good sign. The line might be missing more often on Windows. Then you can look at use directives: things that look like `use stuff` before any other line (except the shebang) in the file. strict and warnings are pretty sure signs. CPAN modules give a pretty good clue as well. But even something that looks like a valid use statement, like `use UnknownModuleBecauseIt'sCustom;`, puts the odds in favor of the file being perl.
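A minimal sketch of such a check. The weights (3 for a perl shebang, 2 for strict or warnings, 1 for any other `use Module;` line) and the 20-line window are arbitrary assumptions, and the function name `perl_head_score` is invented. It looks at the first lines of the file instead of strictly "before any other line".

```perl
use strict;
use warnings;

# Score the head of a file: higher means "more likely Perl".
sub perl_head_score {
    my ($text, $max_lines) = @_;
    $max_lines //= 20;                       # assumed window size
    my @lines = split /\n/, $text;
    splice @lines, $max_lines if @lines > $max_lines;

    my $score = 0;
    # Shebang mentioning perl, e.g. #!/usr/bin/perl or #!/usr/bin/env perl
    $score += 3 if @lines and $lines[0] =~ /^#!.*\bperl\b/;

    for my $line (@lines) {
        if ($line =~ /^\s*use\s+(?:strict|warnings)\b/) {
            $score += 2;                     # pretty sure signs
        }
        elsif ($line =~ /^\s*use\s+[A-Za-z_]\w*(?:::\w+)*\b[^;]*;/) {
            $score += 1;                     # any other "use Module ...;"
        }
    }
    return $score;
}

# Usage: perl head_score.pl file1 file2 ...
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $text = do { local $/; <$fh> };
    print "$file: score ", perl_head_score($text), "\n";
}
```

Other languages have `use` statements too (Rust and PHP, for instance), so the generic `use` match deserves less weight than the shebang or strict/warnings.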
Re: Fastest way to minimally check that file contains perl code? by haj (Vicar) on Mar 13, 2020 at 13:39 UTC
Perl is notorious for being able to parse stuff which looks like garbage; there's a whole category for Obfuscated code on PerlMonks. So let's hope your Perl programmers do this in less than 20% of their files ;)

For the general task of classifying data, there's AI::NaiveBayes and AI::Categorizer. They both need some adaptation to sort text into the categories "Perl source code" and "garbage". I would guess that you get 80% accuracy with a filter based on the regular expressions presented by other monks, so only if this fails, training a Bayesian classifier might be an alternative.
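To show what the Bayesian route involves, here is a tiny hand-rolled naive Bayes classifier in core Perl. It does not use the AI::NaiveBayes or AI::Categorizer APIs; the tokenizer, the function names and the two-sample training set in the usage note are all assumptions made for illustration, and a useful model would need far more training data.

```perl
use strict;
use warnings;

# Crude tokens: bare words, a few operators, and every sigiled
# variable collapsed to "$VAR", "@VAR" or "%VAR".
sub tokenize {
    my ($text) = @_;
    my @raw = $text =~ /([\$\@%]\w+|\w+|->|=>|::|[{};])/g;
    return map { /^([\$\@%])/ ? $1 . 'VAR' : $_ } @raw;
}

# train(label => [ texts ], ...) returns a model hash ref.
sub train {
    my (%samples) = @_;
    my %model = (docs => {}, counts => {}, totals => {}, vocab => {});
    for my $label (keys %samples) {
        for my $text (@{ $samples{$label} }) {
            $model{docs}{$label}++;
            for my $tok (tokenize($text)) {
                $model{counts}{$label}{$tok}++;
                $model{totals}{$label}++;
                $model{vocab}{$tok} = 1;
            }
        }
    }
    return \%model;
}

# Pick the label with the highest log-probability (add-one smoothing).
sub classify {
    my ($model, $text) = @_;
    my $n_docs = 0;
    $n_docs += $_ for values %{ $model->{docs} };
    my $vocab_size = keys %{ $model->{vocab} };
    my @tokens = tokenize($text);

    my ($best, $best_score);
    for my $label (sort keys %{ $model->{docs} }) {
        my $total = $model->{totals}{$label} // 0;
        my $score = log($model->{docs}{$label} / $n_docs);
        for my $tok (@tokens) {
            my $count = $model->{counts}{$label}{$tok} // 0;
            $score += log(($count + 1) / ($total + $vocab_size));
        }
        ($best, $best_score) = ($label, $score)
            if !defined $best_score or $score > $best_score;
    }
    return $best;
}
```

Usage would look like `my $m = train(perl => [ 'my $x = shift; print $x;' ], text => [ 'The quick brown fox.' ]);` followed by `classify($m, $file_contents)`.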
Re^2: Fastest way to minimally check that file contains perl code? by LanX (Saint) on Mar 13, 2020 at 16:28 UTC
> there's a whole category for Obfuscated code on PerlMonks

On a side note: It's possible to run Perl::Tidy in a server mode, which is far faster than starting it up for each file. Though I doubt it's faster than `perl -c`, unless using/requiring a large tree of dependencies (like Moose) is causing the lag here.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
Re^3: Fastest way to minimally check that file contains perl code? by hippo (Bishop) on Mar 13, 2020 at 16:46 UTC
The Moose argument is why I like vr's idea. If `perl -c` doesn't bomb out early you can be pretty confident it is actually compiling Perl (with Moose or something equally heavy).
Re^4: Fastest way to minimally check that file contains perl code? by LanX (Saint) on Mar 13, 2020 at 17:22 UTC
Re: Fastest way to minimally check that file contains perl code? by kcott (Archbishop) on Mar 13, 2020 at 18:33 UTC
G'day DRVTiny,

Welcome to the Monastery.

"When i using perl -c it takes too much time to check ..."

As I came to post this, I saw that ++vr had posted a Win32 solution. Our code is functionally similar up to the point of getting the `perl -c` exit code; then our methods diverge somewhat. Our general thinking about the problem was also very similar: don't let `perl -c` run for as long as it likes; let it run for as long as you like.

Here's my code.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use constant {
        TIMEOUT_USECS => 100000,
        MIN_LINES_TO_ASSESS => 10,
        OUT_FILE => 'pm_11114214_minimal_is_perl_test.out',
    };
    use constant CMD_LINE => 'perl -c IN_FILE 2> ' . OUT_FILE;

    use Time::HiRes 'ualarm';

    for my $file (@ARGV) {
        my $cmd = CMD_LINE;
        $cmd =~ s/IN_FILE/$file/;

        eval {
            local $SIG{ALRM} = sub { die };
            ualarm TIMEOUT_USECS;
            `$cmd`;
            $? and die;
            ualarm 0;
            print "$file is valid Perl code.\n";
            1;
        } or do {
            ualarm 0;
            heuristic_check($file);
        };
    }

    sub heuristic_check {
        my ($file) = @_;

        if (-z OUT_FILE) {
            print "$file could be Perl code.\n";
        }
        else {
            my $file_lines = (split ' ', `wc -l $file`)[0];
            my $out_lines = (split ' ', `wc -l @{[OUT_FILE]}`)[0];

            if ($file_lines < MIN_LINES_TO_ASSESS) {
                print "$file is too small to assess. [$file_lines lines]\n";
            }
            elsif ($out_lines > $file_lines) {
                print "$file does not look like Perl code.\n";
            }
            else {
                printf "%s has a %.02f%% chance of being Perl code\n",
                    $file, 100 * ($file_lines - $out_lines) / $file_lines;
            }
        }

        return;
    }

With the timeout set to 1ms, it assessed itself as "a 98.18% chance of being Perl code"; at 10ms it was a tad more confident with "a 98.25% chance of being Perl code"; and at 100ms, it was sure, with "is valid Perl code".

I tested a tiny text file I had in my `tmp` directory. It wasn't Perl and I decided tiny files were too small to assess if they didn't pass `perl -c`. In that case, the output showed "too small to assess. [6 lines]".

However, I did create a file with just `print "Hello, world!\n";`. That gave "hello.pl is valid Perl code." at 100ms but, at 10ms, the output was "hello.pl is too small to assess. [1 lines]".

And I tested a plain text file containing no Perl code: that gave "does not look like Perl code" at 1ms, 10ms and 100ms.

- Ken

