PerlMonks  

Fastest way to minimally check that file contains perl code?

by DRVTiny (Novice)
on Mar 13, 2020 at 11:41 UTC

DRVTiny has asked for the wisdom of the Perl Monks concerning the following question:

"perl -c file" seems very slow; for a large number of Perl sources it is excessive when all I need is to check whether a file contains Perl code or not. I don't need to check strictures or complex conditions such as whether packages can be loaded via "use" directives. For me it is absolutely enough to know that, say, file XXX is a perl source file with a probability of 80%. I don't need to execute anything in BEGIN {} blocks or check whether the file is 100% syntactically correct. My goal is to separate Perl source files from garbage, where both Perl files and garbage files can have any "extension" (because a lot of the Perl files I'm dealing with have names like foo.bar.abc.do-me.good). When I use perl -c it takes too much time to check, even though I use AnyEvent and Proc::FastSpawn in my checker script.

I tried the file and ohlohcount utilities as well, but their assumptions are VERY inaccurate.

So my question is: is there any performance-oriented Perl package (or similar) that checks whether some text is Perl code and gives an estimated probability?

Thanks!


Replies are listed 'Best First'.
Re: Fastest way to minimally check that file contains perl code?
by vr (Curate) on Mar 13, 2020 at 14:21 UTC
    When I use perl -c it takes too much time to check

    Why don't you spend only as much time as you are ready to spare, and not a millisecond more? (Note: I'm on Windows here. Use Time::HiRes::ualarm on Linux.)

    use strict;
    use warnings;
    use feature 'say';
    use Time::HiRes 'time';
    use Win32::Process qw/ CREATE_NO_WINDOW STILL_ACTIVE /;

    my $timeout = 75; # 75 ms

    for my $fname (
        $0,                     # valid Perl, won't timeout
        'Robot3.pm',            # some valid Perl, will timeout
                                # (~ 250 ms to check normally)
        '../DISTRIBUTIONS.txt'  # list of Strawberry distributions
    ) {
        my $t = time;
        my $obj;
        Win32::Process::Create(
            $obj, $^X, "$^X -c $fname", 0, CREATE_NO_WINDOW, '.'
        ) or die;
        $obj-> Wait( $timeout );
        my $code;
        $obj-> GetExitCode( $code );
        print "$fname is ",
            ( $code == 0 or $code == STILL_ACTIVE )
                ? "valid perl"
                : "something else";
        $obj-> Kill( 0 );
        printf ", we spent %.3fs to check\n", time - $t;
    }

    __END__

    alarm.pl is valid perl, we spent 0.058s to check
    Robot3.pm is valid perl, we spent 0.072s to check
    ../DISTRIBUTIONS.txt is something else, we spent 0.025s to check

      Taking this (neat) idea even further, you could spin up a worker pool (Parallel::ForkManager or MCE) and check more files in parallel within that same maximum time. The question is whether you have enough files for the parallelism overhead, amortized across all of them, to be worth it.
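A minimal sketch of that worker-pool idea for a Unix-like system, using only core modules (plain fork/exec instead of Parallel::ForkManager or MCE). The function name `check_files_parallel`, its parameters and the three status strings are made up for illustration; a file that is still compiling at the deadline is treated as "probably perl", following the reasoning in the parent reply. Note that `perl -c` still runs BEGIN blocks and `use` statements of the files it checks.

```perl
use strict;
use warnings;
use POSIX ':sys_wait_h';            # WNOHANG
use Time::HiRes qw(time sleep);
use File::Spec;

# Hypothetical helper: run "perl -c" on each file, at most $max_workers
# at a time, and give every child at most $timeout_s seconds.
sub check_files_parallel {
    my ($files, $max_workers, $timeout_s) = @_;
    my @queue = @$files;
    my (%running, %result);

    while (@queue or %running) {
        # Start new children while there is room in the pool.
        while (@queue and keys(%running) < $max_workers) {
            my $file = shift @queue;
            my $pid  = fork();
            die "fork failed: $!" unless defined $pid;
            if ($pid == 0) {
                # Child: silence perl -c and replace ourselves with it.
                open STDOUT, '>', File::Spec->devnull;
                open STDERR, '>', File::Spec->devnull;
                exec {$^X} $^X, '-c', $file;
                exit 127;           # only reached if exec failed
            }
            $running{$pid} = { file => $file, started => time };
        }

        # Reap finished children and kill the ones that ran out of time.
        for my $pid (keys %running) {
            my $job = $running{$pid};
            if (waitpid($pid, WNOHANG) == $pid) {
                $result{ $job->{file} } = $? == 0 ? 'valid perl' : 'something else';
                delete $running{$pid};
            }
            elsif (time - $job->{started} > $timeout_s) {
                kill 'KILL', $pid;
                waitpid($pid, 0);
                $result{ $job->{file} } = 'probably perl (timed out)';
                delete $running{$pid};
            }
        }
        sleep 0.005 if %running;    # avoid a busy loop while waiting
    }
    return \%result;
}
```

A call such as `check_files_parallel(\@ARGV, 8, 0.075)` would return a hash reference mapping each file name to one of the three status strings. Whether eight workers is a sensible pool size depends on the machine and on how many files there are.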

      The cake is a lie.
      The cake is a lie.
      The cake is a lie.

Re: Fastest way to minimally check that file contains perl code?
by LanX (Saint) on Mar 13, 2020 at 12:21 UTC
Re: Fastest way to minimally check that file contains perl code?
by haukex (Archbishop) on Mar 13, 2020 at 12:37 UTC
    For me it is absolutely enough to know that, say, file XXX is a perl source file with a probability of 80%. ... I tried the file and ohlohcount utilities as well, but their assumptions are VERY inaccurate.

    How inaccurate is file exactly? What test cases did it have trouble with? I would have thought that if you don't need accuracy, something like checking the first couple of lines against a regex that looks for the shebang line and/or some use statements might be enough for ~80%, but I think you'll need to be more specific in your question for more specific answers. Only perl can parse Perl; I think solutions like PPI will likely be slower. You could try PPR:

    use PPR;

    my $perl_code = <<'END';
    #!/usr/bin/env perl
    use warnings;
    use strict;
    print "Hello, World!\n";
    END

    if ( $perl_code =~ m{ \A (?&PerlDocument) \Z  $PPR::GRAMMAR }x ) {
        print "It looks like it could parse as Perl\n"
    }
    else {
        print "It doesn't look like Perl\n"
    }
Re: Fastest way to minimally check that file contains perl code?
by LanX (Saint) on Mar 13, 2020 at 12:06 UTC
    In general:

    80% should be easy enough to achieve, use some rough regex to strip all variables with sigils and function calls and comments and count the built in commands...
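A rough sketch of that counting heuristic, assuming a hand-picked (and certainly incomplete) list of builtins and keywords. The name `perl_keyword_ratio` and the word list are illustrative only; the sketch strips full-line comments and sigil variables but, unlike the suggestion above, does not try to strip function calls. Any acceptance threshold would have to be tuned on real files.

```perl
use strict;
use warnings;

# Illustrative, incomplete list of Perl keywords and builtins.
my %BUILTIN = map { $_ => 1 } qw(
    my our local sub use require package return print printf
    if elsif else unless while until for foreach last next
    push pop shift unshift splice keys values each exists delete defined
    open close chomp die warn eval join split map grep sort scalar ref bless
);

# Returns the fraction of remaining words that are known builtins (0..1).
sub perl_keyword_ratio {
    my ($text) = @_;
    $text =~ s/^\s*#.*$//mg;        # strip full-line comments (and the shebang)
    $text =~ s/[\$\@%&]\w+//g;      # strip variables and &calls with sigils
    my @words = $text =~ /\b([a-z_]\w*)\b/gi;
    return 0 unless @words;
    my $hits = grep { $BUILTIN{$_} } @words;
    return $hits / @words;
}
```

For example, `perl_keyword_ratio($source) > 0.3` could serve as a first filter, with 0.3 being an arbitrary starting point rather than a measured value.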

    A quick search didn't show any CPAN modules for that (which is most likely due to my search strategy).

    I know of JS syntax-highlighting libraries that guess the language of a piece of code.

    The recommended technique is to train a classifier based on a term-frequency algorithm (see tf-idf) with right and wrong code (the problem here is that I don't know what your wrong code looks like; PHP should pose the biggest problem).

    We also had a similar discussion in the past in order to decide if a poster forgot code tags.

    update

    After finding it again, I realized that it's so detailed that it merits an extra reply.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

Re: Fastest way to minimally check that file contains perl code?
by roboticus (Chancellor) on Mar 13, 2020 at 12:09 UTC

    DRVTiny:

    I don't know of any such tool, but if I had to try to create one, I'd probably do something like this:

    • Break the suspect file up into chunks
    • Use Parse::Perl to try to parse each chunk
    • If the ratio of 'good chunks'/'chunks' meets your threshold, assume it's valid.

    There are quite a few problems in this approach. Choosing a threshold will be difficult, and the value may be too sensitive to the files you use to test it. If there are enough templates or strings in the code, you'll likely reject the file. If you lower the acceptance ratio to pass those, then you may pass quite a bit of junk. Breaking the file into chunks is also a challenge--if you break them apart poorly, you'll get too many parse failures, so how will you split it up--will you split on newlines? semicolons? curly brackets?

    Have you thought about recognizing various patterns of junk and rejecting files based on those patterns? It might be easier.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Fastest way to minimally check that file contains perl code?
by Eily (Monsignor) on Mar 13, 2020 at 12:44 UTC

    If it's on Linux, the presence of a shebang is probably a very good sign. The line might be missing more often on Windows. Then you can look at use directives (things that look like use stuff before any other line, except the shebang, in the file). strict and warnings are pretty sure signs. CPAN modules give a pretty good clue as well. But even something that looks like a valid use statement, like use UnknownModuleBecauseIt'sCustom;, puts the odds in favor of the file being perl.
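A minimal sketch of such a header check. The function name `looks_like_perl_header`, the weights (3 for a perl shebang, 2 for strict or warnings, 1 for any other use/require of a module name) and the default of 20 inspected lines are assumptions chosen for illustration, not tested values.

```perl
use strict;
use warnings;

# Score the first lines of a text; a higher score means "more likely Perl".
sub looks_like_perl_header {
    my ($text, $max_lines) = @_;
    $max_lines //= 20;                      # assumed default, tune as needed
    my @lines = split /\n/, $text;
    splice @lines, $max_lines if @lines > $max_lines;

    my $score = 0;
    # A shebang mentioning perl on the very first line is a strong hint.
    $score += 3 if @lines and $lines[0] =~ /^#!.*\bperl\b/;

    for my $line (@lines) {
        if ($line =~ /^\s*use\s+(?:strict|warnings)\b/) {
            $score += 2;                    # the usual pragmas
        }
        elsif ($line =~ /^\s*(?:use|require)\s+[A-Za-z_]\w*(?:::\w+)*\s*[^;]*;/) {
            $score += 1;                    # any module-looking use/require
        }
    }
    return $score;
}
```

A caller might accept a file when the score reaches some small threshold such as 2, and fall back to a slower check for files that score 0.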

Re: Fastest way to minimally check that file contains perl code?
by haj (Vicar) on Mar 13, 2020 at 13:39 UTC

    Perl is notorious for being able to parse stuff which looks like garbage; there's a whole category for Obfuscated code on PerlMonks. So let's hope your Perl programmers do this in less than 20% of their files ;)

    For the general task of classifying data, there's AI::NaiveBayes and AI::Categorizer. They both need some adaptation to sort text into the categories "Perl source code" and "garbage". I would guess that you get 80% accuracy with a filter based on the regular expressions presented by other monks, so training a Bayesian classifier might be an alternative only if that fails.
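To illustrate what such a classifier does, here is a small multinomial naive Bayes sketch in plain Perl with add-one smoothing. It deliberately does not use the AI::NaiveBayes or AI::Categorizer APIs; the function names `tokenize`, `train` and `classify` and the choice of tokens (identifiers plus a few Perl-typical punctuation marks) are assumptions made for this example. Real use would need far more training text than a few snippets.

```perl
use strict;
use warnings;

# Split text into identifiers and a few Perl-typical punctuation tokens.
sub tokenize {
    my ($text) = @_;
    return $text =~ /([A-Za-z_]\w*|[\$\@%]|->|=>|::|[{};])/g;
}

# train(label => [ texts ], ...) returns a model as a hash reference.
sub train {
    my (%samples) = @_;
    my (%count, %total, %docs, %vocab);
    for my $label (keys %samples) {
        for my $text (@{ $samples{$label} }) {
            $docs{$label}++;
            for my $tok (tokenize($text)) {
                $count{$label}{$tok}++;
                $total{$label}++;
                $vocab{$tok} = 1;
            }
        }
    }
    return {
        count      => \%count,
        total      => \%total,
        docs       => \%docs,
        vocab_size => scalar keys %vocab,
    };
}

# Returns the label with the highest log-probability for $text.
sub classify {
    my ($model, $text) = @_;
    my $n_docs = 0;
    $n_docs += $_ for values %{ $model->{docs} };

    my @tokens = tokenize($text);
    my ($best, $best_score);
    for my $label (sort keys %{ $model->{docs} }) {
        my $score = log($model->{docs}{$label} / $n_docs);     # prior
        my $denom = ($model->{total}{$label} // 0) + $model->{vocab_size};
        for my $tok (@tokens) {
            my $c = $model->{count}{$label}{$tok} // 0;
            $score += log(($c + 1) / $denom);                   # add-one smoothing
        }
        ($best, $best_score) = ($label, $score)
            if !defined $best_score or $score > $best_score;
    }
    return $best;
}
```

Training would look like `train(perl => \@perl_texts, garbage => \@other_texts)`, after which `classify($model, $text)` returns either `perl` or `garbage`.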

      > there's a whole category for Obfuscated code on PerlMonks

      On a side note: It's possible to run Perl::Tidy in a server mode, which is far faster than starting it up for each file.

      Though I doubt it's faster than perl -c, unless using/requiring a large tree of dependencies (like Moose) is causing the lag here.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        The Moose argument is why I like vr's idea. If perl -c doesn't bomb out early on you can be pretty confident it is actually compiling Perl (with Moose or something equally heavy).

Re: Fastest way to minimally check that file contains perl code?
by kcott (Archbishop) on Mar 13, 2020 at 18:33 UTC

    G'day DRVTiny,

    Welcome to the Monastery.

    "When I use perl -c it takes too much time to check ..."

    As I came to post this, I saw that ++vr had posted a Win32 solution. Our code is functionally similar up to the point of getting the perl -c exit code; then our methods diverge somewhat. Our general thinking about the problem was also very similar: don't let perl -c run for as long as it likes; let it run for as long as you like.

    Here's my code.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    use constant {
        TIMEOUT_USECS => 100000,
        MIN_LINES_TO_ASSESS => 10,
        OUT_FILE => 'pm_11114214_minimal_is_perl_test.out',
    };
    use constant CMD_LINE => 'perl -c IN_FILE 2> ' . OUT_FILE;

    use Time::HiRes 'ualarm';

    for my $file (@ARGV) {
        my $cmd = CMD_LINE;
        $cmd =~ s/IN_FILE/$file/;

        eval {
            local $SIG{ALRM} = sub { die };
            ualarm TIMEOUT_USECS;
            `$cmd`;
            $? and die;
            ualarm 0;
            print "$file is valid Perl code.\n";
            1;
        } or do {
            ualarm 0;
            heuristic_check($file);
        };
    }

    sub heuristic_check {
        my ($file) = @_;

        if (-z OUT_FILE) {
            print "$file could be Perl code.\n";
        }
        else {
            my $file_lines = (split ' ', `wc -l $file`)[0];
            my $out_lines = (split ' ', `wc -l @{[OUT_FILE]}`)[0];

            if ($file_lines < MIN_LINES_TO_ASSESS) {
                print "$file is too small to assess. [$file_lines lines]\n";
            }
            elsif ($out_lines > $file_lines) {
                print "$file does not look like Perl code.\n";
            }
            else {
                printf "%s has a %.02f%% chance of being Perl code\n",
                    $file, 100 * ($file_lines - $out_lines) / $file_lines;
            }
        }

        return;
    }

    With the timeout set to 1ms, it assessed itself as "a 98.18% chance of being Perl code"; at 10ms it was a tad more confident with "a 98.25% chance of being Perl code"; and at 100ms, it was sure, with "is valid Perl code".

    I tested a tiny text file I had in my tmp directory. It wasn't Perl and I decided tiny files were too small to assess if they didn't pass perl -c. In that case, the output showed "too small to assess. [6 lines]". However, I did create a file with just print "Hello, world!\n";. That gave "hello.pl is valid Perl code." at 100ms but, at 10ms, the output was "hello.pl is too small to assess. [1 lines]".

    And I tested a plain text file containing no Perl code: that gave "does not look like Perl code" at 1ms, 10ms and 100ms.

    — Ken
