Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid

comparing columns using regular expression

by rocky13 (Acolyte)
on Jan 19, 2011 at 03:56 UTC ( #883048=perlquestion: print w/replies, xml ) Need Help??
rocky13 has asked for the wisdom of the Perl Monks concerning the following question:

I have two files. File1 has three columns and file2 has one column. File2 is a subset(w/ some repetitions) of the second column from file1. Here is a subset of the data for each file:

file1: Box of Joe CA12DE 12345 Big Box BA23DF 0123X Small FFFFF3 111XX Big and Small 4F4FGG XCA12 None XXXXXX 00000 file2: BA23DF BA23DF 4F4FGG

I need to print the entire line from file1 that does NOT contain the rows in file2. The output should look like this:

Name Number Count Box of Joe CA12DE 12345 Small FFFFF3 0123X None XXXXXX 00000
#!/usr/bin/perl open(F1 "<file1.txt") or die "Can't open\n"; open(F2 "<file2.txt") or die "can't open\n"; my @field = (); foreach (my $line = <F2>) { push (@field, $line); } my $string = join(",",@field); my @f = split(/,/, $string); my @unique_f = split(/,/, unique(@f)); while ($m_line = <F1>) { if !($m_line =~ m/????????/i) sub unique { my %seen = (); my @r = (); foreach my $a (@_) { unless ($seen{$a}) { push @r, $a; $seen{$a} = 1; } } return @r; } #I have gotten this far but now I want to use regular expression that +matches each entry of the second column of file1 to the entire @uniqu +e_f. Thus, if the match is NOT DEFINED, then it should print the who +le line. Thanks!

Replies are listed 'Best First'.
Re: comparing columns using regular expression
by ELISHEVA (Prior) on Jan 19, 2011 at 05:26 UTC
    File2 is a subset(w/ some repetitions) of the second column from file1. I need to print the entire line from file1 that does NOT contain the rows in file2

    I also question why you would want to use regex's, even if the file doesn't have fixed width fields and only appears that way to make your post look pretty. If file 2 stores the whole value of column 2, why not just store the values of file 2 in a hash? Then when you read file 1, split the row into fields, and look up the value of field 2 in a hash to see if you want the line.

    my @aKeep; while (my $line = <F1>) { # split line into fields - depends on how the line is formatted @aFields = .... if (!exists($hValuesInColumn2{$aFields[1]}) { push @aKeep, $line; } }

    A few efficiency comments as well. There is no need to explicitly "unique" the second file if you use a hash to store the field values. Since a hash key is stored only once it automatically "uniques" the keys.

    Also when reading in the lines of a file, it is better to use while than foreach. foreach slurps in all of the lines into an array and then visits each, whereas while reads the lines in one at a time and is much more memory efficient.

    my %hValuesInColumn2; # use while, not foreach while (my $line = <F2>) { #get rid of trailing new line chomp $line; # strip remaining leading and trailing whitespace # (if that is an issue) $line =~ s/^\s+//; #leading $line =~ s/\s+$//; #trailing # record field value $hValuesInColumn2->{$line} = 1; } # extracting the keys may not even be necessary # see first part of this answer. my @unique_f = keys %hValuesInColumn2;

    Update: reordered paragraphs to put focus on OP's original question rather than efficiency issues.

      Everything makes sense but how do I look up the values of field 2 in file1? As you can see, the number of fields in each line are different. How do you know for sure that you're getting the correct field? Here are the facts: The first column has only word(s)(because they are names/titles). The second and third column will have mix of numbers/letters. Using regex, I though there is a way to isolate the 2nd column by: (starts w/ any comb of word characters and \s)2nd col(ends with space and numbers/letters only) I will try it. If i get it i'll post it. If there is a another way, please let me know. Thanks!

        Put another way, you aren't sure about how to split the line into fields? Once you have the line split up into fields, you just extract the second element of the array, i.e. $aFields[1] and see if it is in the hash.

        How you split up the lines depends heavily on the syntax/grammar of your file. Do you know what that is? You seem a bit uncertain. Are these "fields" from a row of a database or a generated report? Does the file have a documented format? Or are these actually just words in a line?

        If these are just words, you could safely split the line on whitespace, like this

        $line =~ s/^\s+//; # strip leading whitespace from the line $line =~ s/\s+\z//; # strip trailing whitespace from the line my @aFields = split(/\s+/,$line); #extract words

        However, the above won't work if you have whitespace inside a field because it will split the field in two. If on some line of the file, the first column contains three words, then column 2 would end up in the fourth array element and you'd never know.

        What are the rules that actually govern the organization of this file into rows and fields? To know whether or not you need to use regex's you first need to know the file's rules. Regex's aren't always the best solution. thezip has pointed out that unpack would be a better way to split the line into fields if you are dealing with fixed width columns. (each column/field has known-in-advance number of characters.).

        On the other hand, if fields are separated from one another using a separator string or character, regex's are often good for splitting such lines into fields. However, you won't know what regex to use without knowing the format. Rules for separator delimited fields can be very simple or complex.

        A simple rule might be "a tab always means column separator". If that was your rule, you could just use split("\t",$line) to break up the line into fields.

        Or it could be more complicated - columns are separated by whitespace except where the whitespace is quoted or escaped. Or it could be even more complicated: the first character of each line determines the field separator for the rest of the line, plus there is an escaping/quoting mechanism. The possibilities are endless. It would be hard to advise you without knowing the intended rules of the file.

        Update: various clarifications and rewordings.

Re: comparing columns using regular expression
by thezip (Vicar) on Jan 19, 2011 at 04:15 UTC

    Are you sure you want to use a regex to do this? I think it's much more natural to use unpack (or even split, possibly) to do what you're trying to do.

    Update: I suppose I should mention that the *reason* it would be more natural to use unpack is because your columns are fixed-width columns. Unpack is great for extracting columnar data formatted in this way...

    Try this to see if it works for you:

    #!/usr/bin/perl use strict; use warnings; use Data::Dumper; my %hash = ( 'CA12DE' => 1 ); my $line = 'Box of Joe CA12DE 12345'; my @data = unpack 'A15A10A9', $line; print Dumper(\@data); my $searchkey = $data[1]; if (exists($hash{$searchkey})) { print qq(searchkey "$searchkey" exists in the hash...\n); }
    __OUTPUT__ $VAR1 = [ 'Box of Joe', 'CA12DE', '12345' ]; searchkey "CA12DE" exists in the hash...

    What can be asserted without proof can be dismissed without proof. - Christopher Hitchens
Re: comparing columns using regular expression
by JavaFan (Canon) on Jan 19, 2011 at 10:02 UTC
    File1 has three columns
    Really? Your first line is:
    Box of Joe CA12DE 12345
    Why is that three columns, not five?

    For the rest of the post, I will assume your columns are tab separated. (Maybe your data actually does contain tabs, but I cannot see that). Adjust accordingly if columns are defined differently. Either use a hash, or ~~ (untested):

    chomp(my @file2 = <F2>); while (<F1>) { print unless @file2 ~~ (split /\t/)[1]; }
    See also the questions "How can I tell whether a certain element is contained in a list or array?", "How can I remove duplicate elements from a list or array?", and "How do I compute the difference of two arrays? How do I compute the intersection of two arrays?" from the perlfaq. (man perlfaq4).

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://883048]
Approved by ELISHEVA
[PriNet]: go figure, i just tried that, it retains some of the values (not the key) of the pre-assigned value
[PriNet]: i guess i'll just have to use two seperate hashes, there are two, but one is larger than the other, and when i reference the smaller one, it holds onto the values that were assigned to the larger hash (the keynames change, but unless i reassign a new valu

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (3)
As of 2017-06-28 03:02 GMT
Find Nodes?
    Voting Booth?
    How many monitors do you use while coding?

    Results (619 votes). Check out past polls.