
Efficient code.

by murugu (Curate)
on Jan 30, 2006 at 17:22 UTC ( #526494=perlquestion: print w/replies, xml ) Need Help??
murugu has asked for the wisdom of the Perl Monks concerning the following question:

My friend has two text files, a and b. File a has 400 lines and file b has 50,000. He needs to compare the files and write the matching lines to a new log file.

My friend's boss thinks that fetching 'a' line by line and comparing each line against the whole of file b is efficient. What my friend suggested instead is to read file 'b' line by line and compare each line against a regular expression built by joining all the lines of a with '|'. My friend asked me which one is better and more efficient, and I thought his suggestion was.

Monks, please suggest which method we should go for. Is there any other efficient way to do this?

Million thanks in advance.

Murugesan Kandasamy
use perl for(;;);

Replies are listed 'Best First'.
Re: Efficient code.
by merlyn (Sage) on Jan 30, 2006 at 17:44 UTC
    It's not clear from the context, but are you just looking for set intersection? I mean, if the comparison is merely "is this line of B also a line in A in its entirety", then the problem is trivial:
    open A, "a" or die;
    my %in_a = map { $_, 1 } <A>;
    open B, "b" or die;
    while (<B>) {
        print if $in_a{$_};
    }

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

Re: Efficient code.
by holli (Monsignor) on Jan 30, 2006 at 17:34 UTC
    A group of 400 alternations is something Perl's regex engine can surely handle. Remember to quotemeta your lines in case they contain regex metacharacters (parentheses, for example). Then have a look at Regexp::Assemble, which will do the regex construction for you and optimize the result.
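
    A minimal sketch of building such an alternation by hand, before reaching for Regexp::Assemble. The data lines are stand-ins; quotemeta protects any metacharacters they contain, and anchoring keeps partial lines from matching:

    ```perl
    use strict;
    use warnings;

    # Stand-ins for the 400 lines read from file "a".
    my @lines = ('foo.bar', 'baz(1)', 'plain line');

    # Escape metacharacters, join with '|', anchor the whole thing.
    my $pattern = join '|', map quotemeta, @lines;
    my $re = qr/^(?:$pattern)$/;

    print "match\n"    if     'baz(1)' =~ $re;   # the parens are literal
    print "no match\n" unless 'baz'    =~ $re;   # anchors reject prefixes
    ```

    Regexp::Assemble would additionally merge common prefixes among the alternatives, which this naive join does not do.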

    Personally I'd go for a hash-based solution and simply use the lines from "a" as keys of a hash to look up the contents of "b". Alternatively, you can sort the lines and use a binary search.
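
    The binary-search alternative might look like this sketch (toy data in place of the real files; `bsearch` is a hypothetical helper name). Sorting the 400 lines of "a" once costs O(n log n), after which each of the 50,000 lookups is O(log n) string comparisons:

    ```perl
    use strict;
    use warnings;

    # Return 1 if $want is in the sorted arrayref $sorted, else 0.
    sub bsearch {
        my ($sorted, $want) = @_;
        my ($lo, $hi) = (0, $#$sorted);
        while ($lo <= $hi) {
            my $mid = int( ($lo + $hi) / 2 );
            my $cmp = $want cmp $sorted->[$mid];
            return 1 if $cmp == 0;
            $cmp < 0 ? ($hi = $mid - 1) : ($lo = $mid + 1);
        }
        return 0;
    }

    my @a = sort qw(delta alpha echo bravo);   # stand-in for file "a"
    print "found\n"  if     bsearch(\@a, 'bravo');
    print "absent\n" unless bsearch(\@a, 'charlie');
    ```

    For this problem size a hash is simpler and at least as fast; binary search mainly pays off when the keys are too many to hold in memory as a hash.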

    holli, /regexed monk/

      400 alternations is manageable, but don’t forget the match is executed 50,000 times.

      Regexp::Assemble helps by reducing the necessary backtracking, but don’t forget that Regexp::Assemble itself doesn’t work by magic and will need to do a fair amount of string munging. Most likely, that effort is easily amortised during the actual tests, but you don’t know that yet. Then the question that remains is how similar the lines are; if they all differ greatly, then Regexp::Assemble can’t do much.

      Without benchmarks I wouldn’t go on the record with any prediction as to what’s going to be faster.

      And the problem remains that the code does not closely match the intent. The hash-based approach will be trivial to get right, and it will be much easier to understand in six months.

      Makeshifts last the longest.

        Just FYI: in Perl 5.10, alternations not containing regex metachars will not backtrack at all, as they are represented by a trie node that can traverse all of the alternations in a single pass. I'd say exactly which version of perl introduced this, but embarrassingly I appear to have forgotten.


Re: Efficient code.
by dorko (Parson) on Jan 30, 2006 at 17:43 UTC

    To put it differently, you don't know, your friend doesn't know and your friend's boss doesn't know. You're taking a guess. And you're asking us to take a guess.

    Why don't you code up both approaches and use something like Devel::DProf to find out for sure which is faster?
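
    For timing the two approaches head to head, the core Benchmark module is probably the quickest route. A scaled-down sketch with toy data standing in for the two files (400 "needles", a few thousand lines of "b" containing 50 hits):

    ```perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my @a = map { "needle$_" } 1 .. 400;
    my @b = ( (map { "straw$_" } 1 .. 5000), @a[0 .. 49] );   # 50 real hits

    # Approach 1: hash lookup.
    my %in_a = map { $_ => 1 } @a;

    # Approach 2: one big quotemeta'd alternation.
    my $pat = join '|', map quotemeta, @a;
    my $re  = qr/^(?:$pat)$/;

    cmpthese(-1, {
        hash  => sub { my $n = grep { $in_a{$_} } @b },
        regex => sub { my $n = grep { /$re/     } @b },
    });
    ```

    cmpthese prints a rate table comparing the two; the absolute numbers will depend on the machine and the perl version, which is exactly why measuring beats guessing here.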

    Personally, I'd use List::Compare to do it and be done with it. (Make sure you look at the unsorted option if you do use List::Compare.) Later, if it was judged to be "too slow" then I'd worry about making it faster.



    -- Yeah, I'm a Delt.
Re: Efficient code.
by McDarren (Abbot) on Jan 30, 2006 at 17:43 UTC
    I'd do something like this (untested):
    #!/usr/bin/perl -w
    use strict;

    my %firstfile;

    open IN_A, "<a.log" or die "$!\n";
    while (<IN_A>) {
        chomp;
        $firstfile{$_}++;
    }
    close IN_A;

    open IN_B, "<b.log" or die "$!\n";
    while (<IN_B>) {
        chomp;
        print if exists($firstfile{$_});
    }
    close IN_B;

    Darren :)

Re: Efficient code.
by Aristotle (Chancellor) on Jan 30, 2006 at 17:30 UTC

    Alternations in a pattern involve backtracking. I can’t imagine that a pattern with 400 of them, executed 50,000 times, will perform like anything other than molasses. The matching engine doesn’t work by magic.

    If you really only need to compare line 1 in A to line 1 in B, line 2 in A to line 2 in B, etc, then doing it line-by-line will be much faster. 50,000 lines is nothing; computers eat such files for breakfast nowadays. If there isn’t a need to run the comparison extremely frequently, then the effort of optimising it will be wasted.

    Update: I misread the problem; mea culpa. I suggest using a hash, as holli proposed – they’re designed for exactly the sort of problem your friend has to solve.

    Of course the above is just conjecture. If I really cared, I’d write some code to benchmark and possibly profile, because a programmer is always wrong about how fast code is and what it spends the most time doing.

    But never mind the performance question: the next guy who needs to touch the code will actually be able to understand what it’s doing if your friend writes it the straightforward way, and unglamorous as it may be to write simple, straightforward code, it is 100× more important than performance. “Correct code is easier to optimise than optimised code is to correct” and all that. Tell your friend to stop wasting significant programmer time on saving irrelevant amounts of computer time before he even knows there’s a problem.

    Makeshifts last the longest.

Re: Efficient code.
by jffry (Hermit) on Jan 30, 2006 at 17:45 UTC
    A meta-answer:

    PBP, p. 456, says: don't optimize code, benchmark it.

    Personally, I favor brute force, i.e. line by line, because I'm not too clever.

Re: Efficient code.
by NiJo (Friar) on Jan 30, 2006 at 20:40 UTC
    Bloom filters are very efficient, especially with low hit rates.

    Speed, linear time behavior and low memory are major advantages. Disadvantages include a (tunable) rate of false positives, and that the filter cannot tell you which line matched.

    Reimplement the algorithm using Bit::Vector if you need more speed.
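
    A toy sketch of the idea in core Perl, using the built-in vec() as the bit vector (Bit::Vector would slot in for speed). The two hash functions are simple FNV-1a- and djb2-style mixes, and the sizes are illustrative only:

    ```perl
    use strict;
    use warnings;

    my $BITS = 2**16;                 # filter size in bits (illustrative)
    my $bv   = "\0" x ($BITS / 8);    # the bit vector, a plain string

    # Two simple, deterministic string hashes.
    sub h1 {
        my $h = 2166136261;
        $h = (($h ^ ord) * 16777619) % 2**32 for split //, $_[0];
        return $h % $BITS;
    }
    sub h2 {
        my $h = 5381;
        $h = (($h * 33) + ord) % 2**32 for split //, $_[0];
        return $h % $BITS;
    }

    sub bloom_add  { vec($bv, h1($_[0]), 1) = 1; vec($bv, h2($_[0]), 1) = 1 }
    sub bloom_test { vec($bv, h1($_[0]), 1) && vec($bv, h2($_[0]), 1) }

    # Load the lines of "a", then test each line of "b".  A hit means
    # "maybe present" (false positives possible, rate tunable via $BITS
    # and the number of hashes); a miss means "definitely absent".
    bloom_add($_) for qw(alpha bravo charlie);
    print "maybe present\n" if bloom_test('bravo');
    ```

    Because a miss is definitive, the filter can cheaply discard the vast majority of the 50,000 lines, and only the rare hits need an exact check against the real set.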
