http://www.perlmonks.org?node_id=1002977

rootcho has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
Any idea what the fastest way is to remove lines from a file (hundreds of thousands of lines) that match any line from another file or array of strings (thousands of lines)?
Should I build one giant regex from the second file and test each line of the first file against it, or is there a faster way?

Replies are listed 'Best First'.
Re: remove lines matching list of strings
by TomDLux (Vicar) on Nov 08, 2012 at 21:51 UTC
    grep -v -F -f file2 file1

    English translation: using the Unix grep command, search file1 for lines which do not match ( -v ) the fixed strings ( -F ) ( as opposed to regular expressions ) found in file2 ( -f file2 ).

    There are versions of grep available for Windows. Mac OS has a Unix basis, so it already has it.

    If you absolutely have to do it in Perl, I would use the lines from file2 as the keys of a hash, assigning the number 1 as the value. Then, as I read the other file, it's trivial to check whether each line is present in the hash.
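
    A minimal sketch of that hash approach, assuming whole-line (exact) matches and lines that have already been chomped. The sub name filter_lines and the sample arrays are made up for illustration; with real files you would fill the hash from file2 and loop over file1's filehandle instead.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of @$lines that do not
# appear (as whole lines) in @$unwanted.
sub filter_lines {
    my ($lines, $unwanted) = @_;

    # Unwanted lines become hash keys, each with the value 1.
    my %skip = map { $_ => 1 } @$unwanted;

    # One hash lookup per input line.
    return grep { !$skip{$_} } @$lines;
}

# Sample data standing in for file1 and file2.
my @file1 = ('alpha', 'beta', 'gamma', 'delta');
my @file2 = ('beta', 'delta');

my @kept = filter_lines(\@file1, \@file2);
print "$_\n" for @kept;    # prints alpha and gamma, one per line
```

    Each lookup costs roughly the same no matter how many keys the hash holds, which is why this tends to scale well for exact matches. It does not help if the strings only need to appear somewhere inside a line.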

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

      Yes, I was thinking along the same lines... if done in Perl, using a hash seems better than building a giant regex.
      Heh... I didn't know about the -F option. Will check it out.
      Thanks

        Neither did I, but I scanned through man grep to make sure I was doing things right. I did have a vague recollection there was an option to search for strings rather than regex ... it helps if you know to search for something.


Re: remove lines matching list of strings
by roboticus (Chancellor) on Nov 08, 2012 at 21:37 UTC

    rootcho:

    The fastest way? If you're on a Unix box, I'd try using grep. It's specialized for that sort of task.

    If you just want to do it with Perl, I think reading the entire file into a scalar and then building the giant regex may be the fastest. But you may want to use the Benchmark module and time the alternatives to find out which is actually faster.
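
    A rough sketch of the giant-regex idea, assuming the strings should be treated as literal text that may appear anywhere in a line. The sub name build_matcher and the sample data are invented for illustration, and this version filters an array of lines rather than one slurped scalar.

```perl
use strict;
use warnings;

# Hypothetical helper: build one regex that matches any of the
# given strings literally, anywhere in a line.
sub build_matcher {
    my (@strings) = @_;

    # With no strings, return a pattern that can never match;
    # an empty pattern would otherwise match every line.
    return qr/(?!)/ unless @strings;

    # quotemeta escapes regex metacharacters so each string is
    # matched as plain text rather than as a pattern.
    my $alternation = join '|', map { quotemeta($_) } @strings;

    return qr/$alternation/;
}

my @patterns = ('foo.bar', 'baz');
my $re = build_matcher(@patterns);

my @lines = ('keep me', 'has foo.bar inside', 'fooXbar stays', 'a baz line');
my @kept  = grep { $_ !~ $re } @lines;
print "$_\n" for @kept;    # prints "keep me" and "fooXbar stays"
```

    Whether this beats a hash lookup or grep depends on the data, so timing both on a realistic sample (for example with Benchmark's cmpthese) is the safer way to decide.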

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: remove lines matching list of strings
by frozenwithjoy (Priest) on Nov 08, 2012 at 21:11 UTC
    Couple questions to start:
    • Are the lines that you want to remove exact matches between the two files?
    • Are the lines in common in the same order for both files?
      - not exact match
      - no, the order is random