Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

Search and Remove

by PyrexKidd (Monk)
on Jun 12, 2010 at 02:03 UTC ( #844319=perlquestion: print w/replies, xml ) Need Help??
PyrexKidd has asked for the wisdom of the Perl Monks concerning the following question:

I have a file of CSV in the format:


1111 1234

I need to remove a list of numbers that I have in another format in the same format.

My original thought was to read everything into an array and remove it that way, the problem I am running into is when I read it in it separates the ID and pin into different entries in the array: ie

@originalList = <LISTFILE> $originalList[0] = "ID" $originalList[1] = "PIN"

so then I thought if I could just parse through the file then I could pull out the items that don't match:

open (ORGLIST, "<", $ARGV[1]); open (BEEPON, ">>", "beep_on.list"); open (BEEPOFF, ">>", "beep_off.list"); my $i = 0; foreach my $line (<ORGLIST>){ my $notLine = 0; open (DELLIST, "<", $ARGV[0]); foreach my $otherLine (<DELLIST>){ if ($line eq $otherLine){ $notLine++; last; } } close (DELLIST); if ($notLine > 0){ print BEEPON $line; $i++; } elsif ($notLine == 0){ print $line; print BEEPOFF $line; } } close (BEEPON); close (BEEPOFF); print "Items removed $i \n"; close (DELLIST);

as you can see this code is defunct too...

This makes a list for every iteration of the file line.

any suggestions please?

I'm not looking for someone to write the code for me, I just need someone to point me in the right direction.

Replies are listed 'Best First'.
Re: Search and Remove
by rjt (Deacon) on Jun 12, 2010 at 10:00 UTC

    You basically want to calculate the intersection of two lists (and the negation of that). The following algorithm will work, assuming relatively small lists and unsorted initial input:

    1. Verify your commandline arguments (did the user specify two files, and do they exist?)
    2. Open dellist, and for each line:
      1. split/parse the line as necessary (I'm not sure what you actually want to key on... just the ID, or the whole mess?)
      2. Put the key into a %del hash, with a value of 1.
    3. Close dellist for good.
    4. Open orglist. For each line:
      1. Parse again as per 2(1), above. (Say, maybe you could write a function to re-use for that.)
      2. If the key exists in %del, you have a match; save to beepon, else save to beepoff

    The more general algorithm, above, runs at near-linear time but will consume memory proportional to the number of elements in your DEL list (which is still much better than your solution, which re-reads the entire DEL list for each line of the ORG list!)

    Less general option: If your initial files are known to be pre-sorted by your key value, you can traverse each with a pair of list cursors. The advantage of this approach is that it runs in strict linear time on the total number of elements across both lists, and uses a small constant amount of memory (i.e., memory does not increase with the size of the DEL list.)

    You might also want to look at List::Compare (specifically, the intersection method) for a reusable solution if your data is small enough where storing both sets in RAM isn't an issue.

    Have fun!

      How about this?

      Only thing is when I include:

      use warnings;

      I get the following errors:

      Use of uninitialized value in string eq at ./ line +27. Use of uninitialized value in print at ./ line 37.
      and the really strange thing is when I get the length of @orgList after deleting all those lines it is the lenghth at the beginning - 1... strange.
      # open files, read into array, sort array open (ORGLIST, "<", $ARGV[0]); open (DELLIST, "<", $ARGV[1]); my @orgList = <ORGLIST>; my @delList = <DELLIST>; close ORGLIST; close DELLIST; @orgList = sort(@orgList); @delList = sort(@delList); # counter for entries removed my $j = 0; # iterate through @delList foreach my $line (@delList){ # iterate through @orgList for ( my $i = 0 ; $i < @orgList ; $i++ ){ # if found delete from array if ($line eq $orgList[$i]){ $j++; delete $orgList[$i]; } # end if } # end for } # end foreach print "Entries Removed: $j \n"; open ( NEWORGLIST, ">", "test.out"); print NEWORGLIST @orgList; close (NEWORGLIST); unlink @orgList; unlink @delList;
        You cannot use length to get the number of array elements, see length and scalar (the latter you should use instead).

        The last line is also strange. Do you really want to unlink the arrays?

        Well, this is getting better, but still doesn't really match the algorithm. You have eliminated a lot of unnecessary reads, but you still have an O(n*m) solution, which can be improved on greatly. To get around this (and make the code much cleaner), you need to implement the hashing step I outlined in the above node.

        Here's just a small snippet to get you thinking. This does the setup of the %del hash.

        # Step 2 my %del; open my $dellist, '<', $delfile or die "Could not open $delfile: $!\n" +; for (<$dellist>) { chomp; $del{$_} = 1; } close $dellist; # Step 3

        Now the 'key' bit (if you'll pardon the pun) is that you never, ever have to iterate through that list again, or open the del file again.

        Step 4 is where you iterate through your org list. The setup will be similar to the del list, except that for each line in your org list, you just need to check whether it exists in your %del hash and act accordingly. For instance, exists $del{1234} returns true if and only if `1234' is a key in your %del hash. Read the exists page for more detail.

        From your code it looks like you have influence from other structured programming languages. (Are you perhaps thinking that unlink is something like C's free(3) to free allocated memory? You don't need to do that in Perl; it handles garbage collection for you automatically.)

        That influence from other languages will certainly help you, but Perl has rich features that you'll want to take advantage of. For instance, the introduction on hashes could be sort of life changing if you're coming from a language that doesn't have a built-in hash/associative array data type.

Re: Search and Remove
by Corion (Pope) on Jun 12, 2010 at 07:35 UTC

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://844319]
Approved by ww
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others chilling in the Monastery: (7)
As of 2017-06-29 14:29 GMT
Find Nodes?
    Voting Booth?
    How many monitors do you use while coding?

    Results (670 votes). Check out past polls.