Efficient code.

murugu has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Efficient code. by merlyn (Sage) on Jan 30, 2006 at 17:44 UTC
It's not clear from the context, but are you just looking for set intersection? I mean, if the comparison is merely "is this line of B also a line in A in its entirety", then the problem is trivial: `open A, "a" or die; my %in_a = map { $_, 1 } <A>; open B, "b" or die; while (<B>) { print if $in_a{$_}; }` [download] -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.	[reply] [d/l]
Re: Efficient code. by holli (Abbot) on Jan 30, 2006 at 17:34 UTC
A group of 400 alternations is something Perl's regex engine can surely handle. Remember to `quotemeta` your lines, when it's possible they contain regex relevant chars (braces for example). Then you can have a look at Regex::Assemble which will do the regex construction and optimize the result. Personally I'd go for a hashed based solution and simply use the lines from "a" as keys for a hash to look up the contents of "b". Alternatively you can sort the lines and use a binary search algorithm. holli, /regexed monk/	[reply] [d/l] [select]
Re^2: Efficient code. by Aristotle (Chancellor) on Jan 30, 2006 at 17:55 UTC
400 alternations is manageable, but don’t forget the match is executed 50,000 times. Regexp::Assemble helps by reducing the necessary backtracking, but don’t forget that Regexp::Assemble itself doesn’t work by magic and will need to do a fair amount of string munging. Most likely, that effort is easily amortised during the actual tests, but you don’t know that yet. Then the question that remains is how similar the lines are; if they all differ greatly, then Regexp::Assemble can’t do much. Without benchmarks I wouldn’t go on the record with any prediction as to what’s going to be faster. And the problem that the code does not match intention closely remains. The hash-based approach will be trivial to get right, and it will be much easier to understand in six months. Makeshifts last the longest.	[reply]
Re^3: Efficient code. by demerphq (Chancellor) on Jan 30, 2006 at 18:23 UTC
Juts FYI: In Perl 5.10 alternations not containing regex metachars will not backtrack at all as they are represented by a trie node that can traverse all of the alternations in a single pass. Id say which version of perl contains this but embarrasingly I appear to have forgotten. --- $world=~s/war/peace/g	[reply]
Re^4: Efficient code. by Aristotle (Chancellor) on Jan 30, 2006 at 20:56 UTC
Re: Efficient code. by dorko (Prior) on Jan 30, 2006 at 17:43 UTC
murugu, To put it differently, you don't know, your friend doesn't know and your friend's boss doesn't know. You're taking a guess. And you're asking us to take a guess. Why don't you code up both approaches and use something like Devel::Dprof to find out for sure which is faster? Personally, I'd use List::Compare to do it and be done with it. (Make sure you look at the `unsorted` option if you do use List::Compare.) Later, if it was judged to be "too slow" then I'd worry about making it faster. Cheers, Brent -- Yeah, I'm a Delt.	[reply] [d/l]
Re: Efficient code. by McDarren (Abbot) on Jan 30, 2006 at 17:43 UTC
I'd do something like this (untested): `#!/usr/bin/perl -w use strict; my %firstfile; open IN_A, "<a.log" or die "$!\n"; while (<IN_A>) { chomp; $firstfile{$_}++; } close IN_A; open IN_B, "<b.log" or die "$!\n"; while (<IN_B>) { chomp; print if exists($firstfile{$_}); } close IN_B;` [download] Cheers, Darren :)	[reply] [d/l]
Re: Efficient code. by Aristotle (Chancellor) on Jan 30, 2006 at 17:30 UTC
Alternations in a pattern involve backtracking. I can’t imagine that a pattern with ~~50,000~~ 400 of them executed 50,000 times will perform like anything other than molasses. The matching engine doesn’t work by magic. If you really only need to compare line 1 in A to line 1 in B, line 2 in A to line 2 in B, etc, then doing it line-by-line will be much faster. 50,000 lines is nothing; computers eat such files for breakfast nowadays. If there isn’t a need to run the comparison extremely frequently, then the effort of optimising it will be wasted. Update: I misread the problem; mea culpa. I suggest using a hash, as holli proposed – they’re designed for exactly the sort of problem your friend has to solve. Of course the above is just conjecture. If I really cared, I’d write some code to benchmark and possibly profile, because a programmer is always wrong about how fast code is and what it spends the most time doing. But never mind the performance question: the next guy who needs to touch the code will actually be able to understand what it’s doing if your friend writes it the straightforward way, and unglamorous as it may be to write simple, straightforward code, it is 100× more important than performance. “Correct code is easier to optimise than optimised code is to correct” and all that. ~~Tell your friend to stop wasting significant programmer time on saving irrelevant amounts of computer time before he even knows there’s a problem.~~ Makeshifts last the longest.	[reply]
Re: Efficient code. by jffry (Hermit) on Jan 30, 2006 at 17:45 UTC
A meta-answer: PBP, p. 456, says to benchmark, not optimize code. Personally, I favor brute force, i.e. line by line, because I'm not too clever.	[reply]
Re: Efficient code. by NiJo (Friar) on Jan 30, 2006 at 20:40 UTC
Bloom filters are very efficient, especially with low hit rates. Speed, linear time behavior and low memory are major advantages. Disadvantages include a tunable rate of false positives and not telling which line did match. Reimplement the algorithm using Bit::Vector if you need more speed.	[reply]


Welcome to the Monastery
	PerlMonks