PerlMonks  

Re: Getting data from second file, based on first files contents;

by kcott (Archbishop)
on Oct 29, 2015 at 05:28 UTC ( [id://1146351]=note )


in reply to UPDATED - Getting data from second file, based on first files contents;

G'day james28909,

"Anyway, let me start off by posting example code and files:"

For future reference, please post a short, representative sample of your data here. I tried to download the zip file you linked to, but

$ wget https://dl.dropboxusercontent.com/u/64707444/monks/monks.zip
--2015-10-29 15:29:58--  https://dl.dropboxusercontent.com/u/64707444/monks/monks.zip
Resolving dl.dropboxusercontent.com (dl.dropboxusercontent.com)... 199.47.217.101
Connecting to dl.dropboxusercontent.com (dl.dropboxusercontent.com)|199.47.217.101|:443... connected.
ERROR: The certificate of `dl.dropboxusercontent.com' is not trusted.
ERROR: The certificate of `dl.dropboxusercontent.com' hasn't got a known issuer.

[Perhaps I could've tried harder to get this but I don't really have the time and I shouldn't have to, anyway.]

Here are some tips on the code you presented.

When opening files, always check for problems. Either use the autodie pragma or hand-craft messages (see open for examples).
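As a rough illustration of both options, here is a minimal sketch. The sub names (open_for_reading, open_with_autodie) are my own, purely for illustration; they are not from your code.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading and die with a
# hand-crafted message (including the OS error in $!) on failure.
sub open_for_reading {
    my ($path) = @_;
    open my $fh, '<', $path
        or die "Can't open '$path' for reading: $!";
    return $fh;
}

# Hypothetical helper: the same thing, but letting the autodie pragma
# do the checking and generate the error message.
sub open_with_autodie {
    my ($path) = @_;
    use autodie;    # lexically scoped: only affects this sub
    open my $fh, '<', $path;
    return $fh;
}
```

Either way, a missing or unreadable file stops the script immediately instead of silently producing empty output.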

Repeatedly opening files in a loop, and reading their entire contents multiple times, is rarely (if ever) a good idea. I see that you've done this in both a while and a for loop. Aim to open and read once. If you need to jump around in an opened file, consider seek and tell.
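To give an idea of what jumping around with seek and tell might look like, here is a small sketch. It assumes you already have an open filehandle; the sub names (index_lines, line_at) are hypothetical.

```perl
use strict;
use warnings;

# Read the handle once, noting the byte offset at which each line starts.
sub index_lines {
    my ($fh) = @_;
    my @offsets;
    my $pos = tell $fh;
    while (my $line = <$fh>) {
        push @offsets, $pos;
        $pos = tell $fh;
    }
    return \@offsets;
}

# Jump straight back to line $n (0-based) without reopening the file.
sub line_at {
    my ($fh, $offsets, $n) = @_;
    seek($fh, $offsets->[$n], 0) or die "Can't seek: $!";
    my $line = <$fh>;
    return $line;
}
```

The file is opened and scanned once; any later access to a particular line is a single seek followed by a single read.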

When you read "file1" (for the first time), it may be better to store the data in a hash. For example, instead of

push( @original, $rightside );

perhaps something closer to

++$original{$rightside};

You can then lose the "for (@original) {...}" loop altogether, and change

if ( $last =~ $_ ) {

to something like

if ($original{$last}) {

Also, your use of a regex match ($last =~ $_) seems questionable. I haven't delved too deeply into this, but a straight equality check ($last eq $_) looks like it might be a better idea.
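To show the difference, here is a small sketch using made-up keys. Note that "$last =~ $_" treats $_ as a pattern, so it succeeds on substrings and interprets any regex metacharacters in the data. The sub names (regex_hit, hash_hit) are hypothetical.

```perl
use strict;
use warnings;

# Keys as they might be read from "file1" (sample values only).
my %original;
++$original{$_} for qw{123 456 789};

# Regex test: counts how many keys occur ANYWHERE inside $last.
sub regex_hit {
    my ($last) = @_;
    return scalar grep { $last =~ $_ } keys %original;
}

# Hash test: true only when $last is exactly one of the keys.
sub hash_hit {
    my ($last) = @_;
    return $original{$last} ? 1 : 0;
}
```

With these keys, regex_hit('91234') reports a match because '123' occurs inside '91234', whereas hash_hit('91234') does not. The hash lookup also avoids looping over every key for every line.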

These suggestions have been intentionally vague. Without any input and only erroneous expected output (you wrote: "EDIT: It seems there is indeed an error in the output"), I am somewhat loath to suggest anything more concrete with regard to the actual processing.

If you do provide sample input and real expected output, I (or another monk) might be able to provide a better answer.

— Ken

Replies are listed 'Best First'.
Re^2: Getting data from second file, based on first files contents;
by james28909 (Deacon) on Oct 29, 2015 at 16:42 UTC
    here is some sample data:

    file1.txt
    123
    456
    789
    file2.txt
    123 string 1
    111 string 1
    script should skip this line
    222 string 1
    333 string 1
    456 string 2
    444 string 2
    it should skip this line as well
    555 string 2
    666 string 2
    789 string 3
    777 string 3
    also skipping this line too
    888 string 3
    999 string 3
    The stuff that gets extracted from file2 is based on file1's contents. It takes the data from file1 and gets the first match it finds in file2, then gets the right-side column and compares that against the entire file.

    Output:
    123 string 1
    111 string 1
    222 string 1
    333 string 1
    456 string 2
    444 string 2
    555 string 2
    666 string 2
    789 string 3
    777 string 3
    888 string 3
    999 string 3
    Thanks for the tips btw :)
    Will update OP

      The following achieves what you want with just one pass over file1.txt and two passes over file2.txt.

      #!/usr/bin/env perl

      use strict;
      use warnings;
      use autodie;

      my ($ref_file, $data_file) = qw{pm_1146340_file1.txt pm_1146340_file2.txt};
      my (%ref_left, %ref_right, @output);

      # Each reference key starts out undefined, meaning
      # "not yet seen in the data file".
      open my $ref_fh, '<', $ref_file;
      while (<$ref_fh>) {
          chomp;
          undef $ref_left{$_};
      }
      close $ref_fh;

      # First pass: for the first line matching each reference key,
      # remember its right-hand value.
      open my $data_fh, '<', $data_file;
      while (<$data_fh>) {
          my ($left, $right) = split ' ', $_, 2;
          next unless exists $ref_left{$left} and not defined $ref_left{$left};
          ++$ref_left{$left};
          ++$ref_right{$right};
      }

      # Rewind and make the second pass: keep every line whose
      # right-hand value was remembered.
      seek $data_fh, 0, 0;

      while (<$data_fh>) {
          my ($left, $right) = split ' ', $_, 2;
          next unless $ref_right{$right};
          push @output, $_;
      }
      close $data_fh;

      print for @output;

      Output:

      123 string 1
      111 string 1
      222 string 1
      333 string 1
      456 string 2
      444 string 2
      555 string 2
      666 string 2
      789 string 3
      777 string 3
      888 string 3
      999 string 3

      If the data in file2.txt is always ordered as shown, i.e. references to file1.txt data always appear first, such as

      123 string 1
      111 string 1

      and never as

      111 string 1
      123 string 1

      you'll only need one pass over file2.txt.
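
      A rough sketch of what that single pass might look like follows. It assumes the ordering described above and keeps the "first match per reference key" rule. The sub name (filter_one_pass) and the use of array references of chomped lines, rather than filehandles, are my own choices for illustration.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: both arguments are array refs of chomped lines.
      sub filter_one_pass {
          my ($ref_lines, $data_lines) = @_;
          my %ref_left = map { $_ => 1 } @$ref_lines;
          my (%ref_right, @output);

          for my $line (@$data_lines) {
              my ($left, $right) = split ' ', $line, 2;
              next unless defined $right;    # blank or single-word line

              # The first time a reference key is seen, remember its
              # right-hand value; delete() makes later repeats of the
              # same key ordinary data lines.
              $ref_right{$right} = 1 if delete $ref_left{$left};

              push @output, $line if $ref_right{$right};
          }
          return \@output;
      }
      ```

      Because the line carrying the reference key comes before the other lines with the same right-hand value, each line can be accepted or rejected as soon as it is read.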

      To more fully test your code, I'd completely jumble up file2.txt and then add additional records, such as

      123 string 4
      111 string 4

      The output should be the same with no instances of "string 4" appearing at all.

      Update: I took my own advice (re "To more fully test your code, ...") and found a problem. I have fixed this by making changes to the first and second while loops. The original code is in the spoiler below.

      — Ken

        Yeah, it SHOULD match 'string 4' (on all occurrences in file2) IF it contains any lines from file1.txt. If you put '123 string 4' inside file2, then it should take '123' from file1 and match the same '123' in file2. Then you get the value directly to the right of the match (in file2) and compare it with the whole of file2. If the value to the right of '123' is 'string 4', then it will most definitely need to match if 123 is in file1, which it obviously is. In essence, you are trying to filter file2.txt; you could say it is like a database or something. Anyway, thanks for replying :)
