Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options

Re: Search and Remove

by rjt (Deacon)
on Jun 12, 2010 at 10:00 UTC ( #844340=note: print w/replies, xml ) Need Help??

in reply to Search and Remove

You basically want to calculate the intersection of two lists (and the negation of that). The following algorithm will work, assuming relatively small lists and unsorted initial input:

  1. Verify your commandline arguments (did the user specify two files, and do they exist?)
  2. Open dellist, and for each line:
    1. split/parse the line as necessary (I'm not sure what you actually want to key on... just the ID, or the whole mess?)
    2. Put the key into a %del hash, with a value of 1.
  3. Close dellist for good.
  4. Open orglist. For each line:
    1. Parse again as per 2(1), above. (Say, maybe you could write a function to re-use for that.)
    2. If the key exists in %del, you have a match; save to beepon, else save to beepoff

The more general algorithm, above, runs at near-linear time but will consume memory proportional to the number of elements in your DEL list (which is still much better than your solution, which re-reads the entire DEL list for each line of the ORG list!)

Less general option: If your initial files are known to be pre-sorted by your key value, you can traverse each with a pair of list cursors. The advantage of this approach is that it runs in strict linear time on the total number of elements across both lists, and uses a small constant amount of memory (i.e., memory does not increase with the size of the DEL list.)

You might also want to look at List::Compare (specifically, the intersection method) for a reusable solution if your data is small enough where storing both sets in RAM isn't an issue.

Have fun!

Replies are listed 'Best First'.
Re^2: Search and Remove
by PyrexKidd (Monk) on Jun 12, 2010 at 21:34 UTC

    How about this?

    Only thing is when I include:

    use warnings;

    I get the following errors:

    Use of uninitialized value in string eq at ./ line +27. Use of uninitialized value in print at ./ line 37.
    and the really strange thing is when I get the length of @orgList after deleting all those lines it is the lenghth at the beginning - 1... strange.
    # open files, read into array, sort array open (ORGLIST, "<", $ARGV[0]); open (DELLIST, "<", $ARGV[1]); my @orgList = <ORGLIST>; my @delList = <DELLIST>; close ORGLIST; close DELLIST; @orgList = sort(@orgList); @delList = sort(@delList); # counter for entries removed my $j = 0; # iterate through @delList foreach my $line (@delList){ # iterate through @orgList for ( my $i = 0 ; $i < @orgList ; $i++ ){ # if found delete from array if ($line eq $orgList[$i]){ $j++; delete $orgList[$i]; } # end if } # end for } # end foreach print "Entries Removed: $j \n"; open ( NEWORGLIST, ">", "test.out"); print NEWORGLIST @orgList; close (NEWORGLIST); unlink @orgList; unlink @delList;
      You cannot use length to get the number of array elements, see length and scalar (the latter you should use instead).

      The last line is also strange. Do you really want to unlink the arrays?

      Well, this is getting better, but still doesn't really match the algorithm. You have eliminated a lot of unnecessary reads, but you still have an O(n*m) solution, which can be improved on greatly. To get around this (and make the code much cleaner), you need to implement the hashing step I outlined in the above node.

      Here's just a small snippet to get you thinking. This does the setup of the %del hash.

      # Step 2 my %del; open my $dellist, '<', $delfile or die "Could not open $delfile: $!\n" +; for (<$dellist>) { chomp; $del{$_} = 1; } close $dellist; # Step 3

      Now the 'key' bit (if you'll pardon the pun) is that you never, ever have to iterate through that list again, or open the del file again.

      Step 4 is where you iterate through your org list. The setup will be similar to the del list, except that for each line in your org list, you just need to check whether it exists in your %del hash and act accordingly. For instance, exists $del{1234} returns true if and only if `1234' is a key in your %del hash. Read the exists page for more detail.

      From your code it looks like you have influence from other structured programming languages. (Are you perhaps thinking that unlink is something like C's free(3) to free allocated memory? You don't need to do that in Perl; it handles garbage collection for you automatically.)

      That influence from other languages will certainly help you, but Perl has rich features that you'll want to take advantage of. For instance, the introduction on hashes could be sort of life changing if you're coming from a language that doesn't have a built-in hash/associative array data type.

        From your code it looks like you have influence from other structured programming languages. (Are you perhaps thinking that unlink is something like C's free(3) to free allocated memory? You don't need to do that in Perl; it handles garbage collection for you automatically.)

        Haha, that's what they said about java... and, well--we all know the truth...

        I started with BASIC when I was a wee tot. I only ever learned enough to copy out of the BASIC book. From there I took a VBasic course and learned VB6. From that I was able to apply what I'd learned into automating MANY tasks with VBA. Parallel to that I was trying to learn C++; I've never even gotten a "Hello World" program to work in C++. After that I focused my energy on C# which I grasped rather quickly and banged out several "practice" programs. Also IMPORTANT to note this is when I discovered the "open source" world, and discovered how to shed my Microsucks dependencies...

        Enter Ubuntu...

        At this point I've changed schools and majors (I'm a certified EMT... But I needed a way to pay for my paramedic degree...). Enter Computer Science.

        Computer Science:

        Fortunately as a previous Math major I've completed most of the requirements for my CS degree. As a CS Major, our language of study is java... IMHO on of the most klutzyist languages I've studied.

        In parallel to starting my CS Major I landed a job. This job promoted me programming where I learned to love Perl.

        Anyway... Long Story Short (too late), Out of necessity I've needed to become a Server and Perl Expert over night. Obviously, I still have some learning to do.

        I'm really not sure how making a hash key = 1 is helpful. I understand making "the whole mess" a key is like making it an array.

        I've been studying from "PERL By Examply" By Ellie Quigly. Can you maybe recommend a better book?

        So how about this?
        #!/usr/bin/perl use strict; use warnings; use diagnostics; if (@ARGV = 2){ my %delList; my @orgList; open (BEEPON, >>, "/path/to/beepon.file); open (BEEPOFF, >>, /path/to/beepoff.file); open (DELLIST, <, $ARGV[0]); foreach (<DELLIST>){ %delList{chomp($_)} = 1; } close DELLIST; open (ORGLIST, <, $ARGV[1]); @orgList = <ORGLIST> close (ORGLIST); for (my $i = 0; $i < @orgList; $i++){ if (exists $del{$orgList[$i]}){ print BEEPON $orgList[$i]; } else print BEEPOFF $orgList[$i]; } } close BEEPON; close BEEPOFF;
        I can see how this would be more efficient. I'm not really sure that it is what you outlined... I'm still not entirely understanding the use of hashes. ie why are all the values set to 1?

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://844340]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (8)
As of 2018-06-25 16:29 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (127 votes). Check out past polls.