Re: Analysis of Regular Expressions

Assumptions:

You have access to a large subset of the files that these regular expressions will be used against.

Or, if these regular expressions are simply being ranked in the abstract, you have some means of continually pulling in a variety of real-life sample files.

Write a daemon that continuously runs each regular expression against an ever-growing list of files. The daemon updates a table in which each row has two columns: the regular expression and its average line count over the files it has been run against so far.

Your output program simply sorts the table on the average line count column, so it is quick in that regard. However, as the daemon runs each regular expression against more and more files, the ranking may change.
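To make the table idea concrete, here is a rough sketch in Python rather than Perl (the regex syntax differs slightly, but the bookkeeping is the same). It rests on assumptions of mine: "line count" is taken to mean the number of lines in a file that the regular expression matches, the sort is descending, and the names Row, count_matching_lines, process_file and ranked are made up for illustration. A real daemon would persist the table and loop over incoming files.

```python
import re
from dataclasses import dataclass


@dataclass
class Row:
    """One table row: a regular expression and its running average."""
    pattern: str
    files_seen: int = 0
    avg_lines: float = 0.0

    def update(self, line_count: int) -> None:
        # Incremental mean, so no per-file history has to be stored.
        self.files_seen += 1
        self.avg_lines += (line_count - self.avg_lines) / self.files_seen


def count_matching_lines(pattern: str, text: str) -> int:
    # Assumption: "line count" = number of lines the pattern matches.
    rx = re.compile(pattern)
    return sum(1 for line in text.splitlines() if rx.search(line))


def process_file(table: dict, text: str) -> None:
    # What the daemon would do for each new file: update every row.
    for row in table.values():
        row.update(count_matching_lines(row.pattern, text))


def ranked(table: dict) -> list:
    # The "output program": just a sort on the average column.
    # Descending order is an assumption; flip `reverse` if needed.
    return sorted(table.values(), key=lambda r: r.avg_lines, reverse=True)
```

Because only the running average and a file counter are kept per row, the daemon can keep feeding files in indefinitely while the output side stays a plain sort.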

Obviously, newly added regular expressions will have a more volatile rank than older ones that have been run against thousands of files. To combat this problem, you could require a minimum number of files that a regular expression must be run against before it shows up in the table. For speed, you could have the daemon give priority to newly added regular expressions until their rank stabilizes. In fact, these two points should be configurable as tuning parameters of the daemon.
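The two tuning parameters might look something like the following Python sketch. Everything here is hypothetical: the names MIN_FILES, STABLE_AFTER, Stats, visible_rows and next_pattern are mine, the default values are arbitrary, and "rank has stabilized" is approximated by a simple file-count threshold rather than by actually watching the rank settle.

```python
from dataclasses import dataclass

# Hypothetical tuning parameters; the values are arbitrary placeholders.
MIN_FILES = 100      # files required before a regex appears in the table
STABLE_AFTER = 1000  # files after which a regex is treated as "stable"


@dataclass
class Stats:
    pattern: str
    files_seen: int = 0
    avg_lines: float = 0.0
    last_run: int = 0  # daemon tick at which this regex was last run


def visible_rows(rows, min_files=MIN_FILES):
    # Hide regexes that have not yet been run against enough files.
    eligible = [r for r in rows if r.files_seen >= min_files]
    return sorted(eligible, key=lambda r: r.avg_lines, reverse=True)


def next_pattern(rows, stable_after=STABLE_AFTER):
    # New regexes jump the queue until they pass the stability threshold.
    if not rows:
        return None
    unstable = [r for r in rows if r.files_seen < stable_after]
    if unstable:
        return min(unstable, key=lambda r: r.files_seen)
    # Everything is stable: fall back to the least recently run regex.
    return min(rows, key=lambda r: r.last_run)
```

A stricter version could track how much each regex's position has moved over its last N files and call it stable once that drops below some tolerance, but a plain count is the cheapest thing that works.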

What I like about this approach is that it throws all the theoretical junk out the door. Brute force can be ugly, but then again, the map is not the territory, and brute force reveals the territory.
