PerlMonks  

Filtering Source Text File with 2nd Text File of Terms

by Loops303 (Novice)
on Apr 02, 2012 at 22:57 UTC (#963141=perlquestion)
Loops303 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am a novice at Perl.

I have a SOURCE text file with a list of strings, such as

http://www.google.com
www3.manish.net
www.nov8rix.com
www.thisisannoying.com

and a 2nd text file FILTER TERMS with a list of terms such as

google
manish
www.thisisannoying.com

What I want to do is read the 2nd file and use its list of terms to filter the first file.

The desired end result OUTPUT would be

www.nov8rix.com

It would not write

http://www.google.com --- because it matches the "google"
www3.manish.net --- because it matches the "manish"
www.thisisannoying.com --- because it matches the "www.thisisannoying.com"

Can anyone please help me figure out how to do this?

Here is the code I have thus far (this is the 20th iteration of various attempts, having spent about 5 hours on this already today --- see, I am new at this!)
#!/usr/bin/perl
open (F1, "<filterTerms.txt");
open (F2, "<source.txt");
my %terms = ();
my %source = ();
while (<F1>) {
    my $term=$_;
    chomp ($term);
    $terms{$term}=$term;
}
while (<F2>) {
    my $item=$_;
    chomp ($item);
    $source{$item}=$item;
    foreach (keys %source) {
        if ($source=~m/($term{$term})/) {
            #do nothing
        } else {
            print $1."\n";
        }
    }
}
close (F1);
close (F2);
Thank you.

Re: Filtering Source Text File with 2nd Text File of Terms
by Riales (Hermit) on Apr 02, 2012 at 23:49 UTC

    Your main problem is that when you check whether the source matches any of the terms, you're only checking the last term in the file.

    You're also trying to print a match with the $1 but that's not really what you want.

    Beyond that, is there a particular reason you're choosing to use hashes instead of arrays? I would think arrays are more what you want.

    # Building the array of terms:
    my @terms = ();
    while (my $term = <F1>) {
        chomp $term;
        push @terms, $term;
    }

    This way, when you are checking each term against the source, you just need to do this:

    # Printing sources that do not match any of the terms:
    while (my $source = <F2>) {
        chomp $source;
        print "$source\n" unless grep { $source =~ /$_/ } @terms;
    }
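For anyone who wants to try the two fragments above as one script, here is a minimal sketch. It is not the poster's code: the sub name filter_sources is made up for illustration, the sample data from the question is fed in through in-memory filehandles instead of filterTerms.txt and source.txt, and lexical filehandles with error checks replace the bareword F1 and F2. Terms are still used as raw regex patterns, as in the fragments.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the source lines that match none of the terms.
# Both arguments are open filehandles. Terms are interpolated as regex
# patterns without escaping, same as in the fragments above.
sub filter_sources {
    my ($terms_fh, $source_fh) = @_;

    my @terms;
    while (my $term = <$terms_fh>) {
        chomp $term;
        next if $term eq '';   # skip blank lines; an empty pattern is special in Perl
        push @terms, $term;
    }

    my @kept;
    while (my $source = <$source_fh>) {
        chomp $source;
        push @kept, $source unless grep { $source =~ /$_/ } @terms;
    }
    return @kept;
}

# Sample data from the question, supplied as in-memory files (an assumption
# made so the sketch runs without any files on disk).
my $terms_text  = "google\nmanish\nwww.thisisannoying.com\n";
my $source_text = "http://www.google.com\nwww3.manish.net\n"
                . "www.nov8rix.com\nwww.thisisannoying.com\n";

open my $terms_fh,  '<', \$terms_text  or die "Cannot open terms: $!";
open my $source_fh, '<', \$source_text or die "Cannot open source: $!";

my @result = filter_sources($terms_fh, $source_fh);
print "$_\n" for @result;   # prints: www.nov8rix.com

close $terms_fh;
close $source_fh;
```

To run it against real files, replace the two in-memory opens with open my $terms_fh, '<', 'filterTerms.txt' and the equivalent for source.txt.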
      while my $term (<F1>) {

      That is a syntax error.    Perhaps you meant:

      while ( my $term = <F1> ) {


      foreach my $source (<F2>) {

      Why would you read in the whole file instead of just reading one line at a time?    Perhaps you meant:

      while ( my $source = <F2> ) {

        Argh, you're absolutely right. I guess I was just too eager to fire off my response. I'll change my original post.

        Thanks for catching that.
      I think I was assuming a hash would let me check all the terms at once, as opposed to doing it a line at a time and writing the unfiltered items to the output... but thanks for the tip, I will give it a try. Very helpful. This site rules.
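On the idea of checking all the terms at once: a hash lookup only helps with exact whole-line matches, but substring terms can be folded into a single compiled regex. The following is a sketch of that alternative, not something proposed in the thread. The name build_filter is hypothetical, and quotemeta is used on the assumption that the terms are meant as literal text rather than patterns.

```perl
use strict;
use warnings;

# Hypothetical helper: build one regex that matches if ANY term occurs.
# quotemeta escapes regex metacharacters so each term is matched literally.
sub build_filter {
    my @terms = @_;
    return qr/(?!)/ unless @terms;   # no terms: a pattern that never matches
    my $alternation = join '|', map { quotemeta($_) } @terms;
    return qr/$alternation/;
}

# Sample data from the question
my @terms   = ('google', 'manish', 'www.thisisannoying.com');
my @sources = ('http://www.google.com', 'www3.manish.net',
               'www.nov8rix.com', 'www.thisisannoying.com');

my $filter = build_filter(@terms);
my @kept   = grep { $_ !~ $filter } @sources;
print "$_\n" for @kept;   # prints: www.nov8rix.com
```

The regex is compiled once and each source line is tested a single time, so the source file can still be read line by line.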
Re: Filtering Source Text File with 2nd Text File of Terms
by vitoco (Friar) on Apr 03, 2012 at 17:53 UTC

    Please note that unescaped special characters in strings used as patterns can give unexpected results!

    Example: the term "www.thisisannoying.com" will also match lines with "wwwithisisannoyingacom"...

    If the terms in the list are single words, the test from the previous posts should probably be:

    print "$source\n" unless grep { $source =~ /\b$_\b/ } @terms;

    where \b is used to check for word boundaries, so "googleeee" won't be matched by the "google" term.
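Both points above can be checked with a few lines. This is a small sketch using the strings from this post. The \Q...\E escape (Perl's built-in quotemeta) is added here as one way to make the dots literal; the variable names are only for illustration.

```perl
use strict;
use warnings;

my $term = 'www.thisisannoying.com';

# Unescaped: each "." is a regex wildcard, so the look-alike matches.
my $loose  = 'wwwithisisannoyingacom' =~ /$term/     ? 1 : 0;

# \Q...\E makes the dots literal, so the look-alike no longer matches.
my $strict = 'wwwithisisannoyingacom' =~ /\Q$term\E/ ? 1 : 0;

# Word boundaries: match "google" only as a whole word.
my $word      = 'google';
my $in_url    = 'http://www.google.com' =~ /\b\Q$word\E\b/ ? 1 : 0;
my $in_longer = 'googleeee'             =~ /\b\Q$word\E\b/ ? 1 : 0;

print "loose=$loose strict=$strict in_url=$in_url in_longer=$in_longer\n";
# prints: loose=1 strict=0 in_url=1 in_longer=0
```

One caveat: \b only behaves as expected when the term starts and ends with a word character, which holds for the terms in this thread.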
