Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

hexcoder's scratchpad

by hexcoder (Deacon)
on Aug 12, 2008 at 06:36 UTC ( #703825=scratchpad: print w/replies, xml ) Need Help??

all about autovivification Re^2: Autovivification sucking the life out of me
great book list Which non-Perl books made you a better (?:Perl )?Programmer?

autostopping the debugger

Lets assume, you want to use the debugger. Then the problem might be to get the debugger to break after a warning condition happened.

The warnings contain a line number which is fine, but when the line is part of a loop, this is not enough information. We would want a stop with the current context. Only then can we examine the state of the program in the context that produced the problem.

So what to do?
I replace the signal handler for SIGWARN with my own handler that checks for the typical format of a Perl warning. I do that because I am interested only in warnings from the Perl interpreter. If this format has been detected, the code branches into a special path where we can setup the debugger. We want the debugger to stop and to return to the caller, where there warning was caused. So we set a variable that causes the debugger to stop. This will take effect when the signal handler has returned. After that setup stage, the warning message is printed as before the modification.

The signal handler code should go into the debugger initialization file .perldb. Then I do not have to modify the original source code.

This is the content of file .perldb (place it in the current or in the home directory):

sub afterinit { $::SIG{'__WARN__'} = sub { my $warning = shift; if ( $warning =~ m{\s at \s \S+ \s line \s \d+ \. $}xms ) { $DB::single = 1; # debugger stops automatically after # the line that caused the warning. } warn $warning; }; print "sigwarn handler installed!\n"; return; }
Note: This is mostly stolen from my writeup Re: Debugging a program and Re^3: Debugging a program, but I think it fits better here.

Using suffix- and longest common prefix-arrays to find duplicate code in O(n)

Theory of operation
Example: Instead of real perl code I use the string 'mississippi' for simplicity. A suffix is a partial string of an input string, which ends at the end of the input string. A prefix is a partial string of an input string, which starts at the start of the input string. The suffix array of a string is a list of offsets (each one for a suffix), which is sorted lexicographically by suffix:
Offset Prefix ============= 10: i 7: ippi 4: issippi 1: ississippi 0: mississippi 9: pi 8: ppi 6: sippi 3: sissippi 5: ssippi 2: ssissippi
The other structure needed is the longest common prefix array (LCP). It contains the maximal length of the prefix for this entry shared with the previous entry from the suffix array. It looks like this for this example.
Offset LCP (prefix shown in ()) ================================ 10: 0 () 7: 1 (i) 4: 1 (i) 1: 4 (issi) overlapping, 3 (iss) non overlapping 0: 0 () 9: 0 () 8: 1 (p) 6: 0 () 3: 2 (si) 5: 1 (s) 2: 3 (ssi)
Sorted by LCP value
Offset LCP (prefix shown in ()) ================================ 1: 4 (issi) overlapping, 3 (iss) non overlapping 2: 3 (ssi) 3: 2 (si) 7: 1 (i) 4: 1 (i) 8: 1 (p) 5: 1 (s) 10: 0 () 0: 0 () 9: 0 () 6: 0 ()
Care must be taken to avoid overlapping of the prefixes. This can be achieved by limiting the LCP values when the offsets (1) plus LCP (4) crosses the offset of the previous entry in the suffix array (4).

Similarly care must also be taken to avoid matches are crossing files. By limiting the LCP values when the offsets plus LCP crosses file offsets this can be avoided.

Some smaller repetitions hide in the larger ones and can be ignored. For example

4: 1 (i) 3: 2 (si)

are inside
2: 3 (ssi)
Some matches are exclusive like
1: 3 (iss) non overlapping 2: 3 (ssi)
A good heuristic would be to choose the candidate which would reduce the most copies.

Using these structures one could find repetitions in the 'character stream' in O(n).

For analysing multiple files of Perl code, it is easiest to concatenate them in a long string. Then for reporting matches offsets from the suffix array should be retransformed back to files and line offsets. Also a report should be able to show more than two copies of the same original part.

How to get that theory in working order?

Road map
  1. start with existing C code (8-bit characters only) for suffix array construction
    • use a front end similar to Ovids
    • maintain arrays of file offsets and file-line-offsets for reverse transformations
    • implement a XS wrapper to sais (with LCP) copying arrays to Perl
    • report (whitespace agnostic, but otherweise verbatim) duplicates with the help of offset structures
  2. like previous, but optimize array treatment costs with PDL
    • implement a XS wrapper to sais (with LCP) using PDL to treat arrays
  3. like previous, but optimize array treatment costs without the dependency of PDL
    • implement a XS wrapper to sais (with LCP) and steal array treatment from PDL
  4. like previous, but extend to wide characters (utf8)
    • extend C code for wide characters
  5. independent variant: implement a pure Perl solution
    • replace XS code with pure Perl
  6. like above, but optimize for more general detection of duplicate code parts (parameter renaming)
    • use PPI to get a token stream of the perl source code, then look for duplicates in this stream (experimentally).
    • reduce parameter names to unified ones using scope.
    • unify variants of (if/unless), ...
Comments, suggestions? When the basic mechanics are in place, the more interesting stuff begins. Detect more repetitions when for example
  1. several different forms of loops are are being canonicalized prior to analysis.
  2. naming for parameters, variables and subroutines is generalized (obeying scoping rules)
Log In?

What's my password?
Create A New User
Domain Nodelet?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (4)
As of 2023-10-02 17:16 GMT
Find Nodes?
    Voting Booth?

    No recent polls found