PerlMonks  

extract line

by lallison (Novice)
on Jul 20, 2013 at 16:59 UTC (#1045452=perlquestion)
lallison has asked for the wisdom of the Perl Monks concerning the following question:

I have two files and I need to print the line from file1 that matches the part number in file2.
File1 looks like:
3478748:AA:1D:AAA:Descriptions:C:2
3478749:AA:1D:AAA:Descriptions:C:2
3633731:AA:3E:AAA:Descriptions:C:2
File2 looks like:
3478749
3633731

Currently it prints all lines in File1.

The output should be the lines in File1 whose part number appears in File2:

3478749:AA:1D:AAA:ROCKER ARM ASSEMBLIES AND GROUPS:C:2
3633731:AA:3E:AAA:ROCKER ARM ASSEMBLIES AND GROUPS:C:2

<code> #!/usr/bin/perl
use Encode;
use strict;
use warnings;
require 'tools/csvutils.pl';
$|=1;
#open (OUTFILE, '>:crlf', 'final_lines.txt') or die 'Cannot open file';
#open parent file
open (FILE1, '<:crlf', 'file1.txt') or die 'Cannot open file';
#open child file
open (FILE2, '<:crlf', 'file2.csv') or die 'Cannot open file';
my @goodarray =<FILE2>;
while (my $line = <FILE1>) {
    my @filerecord = $line;
    for (@goodarray){
        my $data = $_;
        my @arrfields = $data;
        chomp (@arrfields);
        my $filedata =\@filerecord;
        ${$filedata}[0] =~ s/^\s+|\s+$//g;
        $arrfields[0] =~ s/^\s+|\s+$|-//g;
        #$arrfields[0] =~ s/-//g;
        my $string = ${$filedata}[0];
        if ($string =~ /(^\d{7,8})/ ){
            my $substr = $1;
            if (index($string, $substr) !=-1)
            {
                print "$string\n";
                last;
            }
        }
    }
}

Re: extract line
by frozenwithjoy (Curate) on Jul 20, 2013 at 17:15 UTC
    I made a hash containing all of the part numbers of interest (file 2), then I read through file 1 and printed any lines that have a part number that exists in the hash of part numbers:

    #!/usr/bin/env perl
    use strict;
    use warnings;

    my %part_nums = map { $_ => 1 } qw(3478749 3633731);

    while ( my $line = <DATA> ) {
        my ($part) = split /:/, $line;
        print $line if exists $part_nums{$part};
    }

    __DATA__
    3478748:AA:1D:AAA:Descriptions:C:2
    3478749:AA:1D:AAA:Descriptions:C:2
    3633731:AA:3E:AAA:Descriptions:C:2

    OUTPUT:

    3478749:AA:1D:AAA:Descriptions:C:2
    3633731:AA:3E:AAA:Descriptions:C:2
Re: extract line
by mtmcc (Hermit) on Jul 20, 2013 at 17:26 UTC
    I'm not sure where the "ROCKER ARM ASSEMBLIES AND GROUPS" is coming from, but if I understand what you're trying to do, I think you might be overcomplicating it.

    Something like this should work:

    #! /usr/bin/perl
    use strict;
    use warnings;

    my $serialNumbers = $ARGV[0];
    my $partNumbers = $ARGV[1];
    my @line;
    my %partNumber;

    open (my $fileTwo, "<", $partNumbers);
    while (<$fileTwo>) {
        chomp;
        $partNumber{$_} = 1;
    }

    open (my $fileOne, "<", $serialNumbers);
    while (<$fileOne>) {
        @line = split (":", $_);
        print STDERR "$_" if exists $partNumber{$line[0]};
    }

    I hope that helps.

      Thank you, I will give this a try. This file has over a million lines and is running very slowly. Is there a quick way to control the buffering?
        I'm afraid not that I can see. I think the split probably slows it down, but I don't think it's avoidable. If you do find a faster way, please let me know!

        Best of luck.

      I'm not very familiar with @ARGV. Where do I give the program my file names? I am also printing to an outfile. I tried it as written and received an "uninitialized value $partNumbers" warning, but I believe this is due to the filename.

        $ARGV[0] is the first argument on the command line (name of file with longer 'serial numbers'), and $ARGV[1] is the second argument on the command line (file containing the integers).

        To run it:  script.pl serialnumbers.txt integers.txt

        To print to a third file, add this somewhere before the second while loop:

        open (my $output, ">", 'nameOfOutputFile.txt');

        and change print STDERR to print $output.

        I hope that works!
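
The pieces of this sub-thread (file names from the command line, a hash of part numbers, and a third file for the output) can be put together as in the sketch below. It is only an illustration: the helper name `filter_lines`, the three-argument usage, and the whitespace trimming are assumptions, not code posted by anyone in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): copy to $out every line of
# $lines_fh whose first ':'-separated field is listed in $parts_fh.
# Returns the number of lines written.
sub filter_lines {
    my ($lines_fh, $parts_fh, $out) = @_;

    my %wanted;
    while (my $part = <$parts_fh>) {
        $part =~ s/^\s+|\s+$//g;            # also removes the line ending
        $wanted{$part} = 1 if length $part;
    }

    my $written = 0;
    while (my $line = <$lines_fh>) {
        my ($id) = split /:/, $line, 2;     # only the first field is needed
        next unless defined $id && exists $wanted{$id};
        print {$out} $line;
        $written++;
    }
    return $written;
}

# Runs only when file names are given on the command line, e.g.
#   perl script.pl serialnumbers.txt integers.txt nameOfOutputFile.txt
if (@ARGV) {
    die "Usage: $0 serialnumbers.txt integers.txt output.txt\n"
        unless @ARGV == 3;
    my ($lines_file, $parts_file, $out_file) = @ARGV;

    open my $lines_fh, '<', $lines_file or die "Cannot open $lines_file: $!";
    open my $parts_fh, '<', $parts_file or die "Cannot open $parts_file: $!";
    open my $out,      '>', $out_file   or die "Cannot open $out_file: $!";

    my $n = filter_lines($lines_fh, $parts_fh, $out);
    close $out or die "Cannot close $out_file: $!";
    print STDERR "$n lines written to $out_file\n";
}
```

Checking the result of every open means a missing or misspelled file name is reported with the file name and the system error, rather than as an uninitialized-value warning.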
Re: extract line
by Laurent_R (Vicar) on Jul 20, 2013 at 17:54 UTC

    Yep, that's the right (and standard) way: open file2, load its content into a hash (each part number as a key to the hash), close file2; then scan file1 line by line, check if the part number is in the hash, and print the line if it is. This is also very fast, since lookup in a hash is fast. It breaks only if file2 is so huge that it will not fit into memory.

Re: extract line
by poj (Curate) on Jul 20, 2013 at 18:27 UTC

    If you are curious to know why your script fails, then perhaps this explains it. Essentially, you are matching a file1 value with a substring extracted from the same value:

    while (my $line = <FILE1>) {

      # this sets $filerecord[0] to a file1 value
      my @filerecord = $line;

      for (@goodarray){
        my $data = $_;

        # this sets $arrfields[0] only to a file2 value
        my @arrfields = $data;

        # this removes line ending from $arrfields[0]
        # which holds the file2 value
        chomp (@arrfields);

        # this trims $filedata[0] which refers
        # to $filerecord[0] the file1 value
        my $filedata =\@filerecord;
        ${$filedata}[0] =~ s/^\s+|\s+$//g;

        # this trims the file2 value
        $arrfields[0] =~ s/^\s+|\s+$|-//g;
        #$arrfields[0] =~ s/-//g;

        # this sets $string to trimmed file1 value
        my $string = ${$filedata}[0];

        # this extracts the number from start of file1 value
        if ($string =~ /(^\d{7,8})/ ){

          # $substr holds number from file1
          my $substr = $1;

          # this matches file1 value to the number
          # extracted from file1 value
          # so will always match
          if (index($string, $substr) !=-1){
            #print "$string\n";
            #last;
          }

          # this would work matching file1 value to file2 value
          if (index($string, $arrfields[0]) != -1){
            print "$string\n";
            last;
          }
        }
      }
    }
    poj

      Printing the string this way still gives me all the lines from file1. I even changed your lines:

      if (index($string, $arrfields[0]) != -1){
      print "$string\n";
      last
      to
      if (index(${$filedata}[0], $arrfields[0]) != -1){
      print "${$filedata}[0]\n";
      last

      It is not making the connection to grab the line with that part. It prints them all up to the number that ends the $arrfields[0]. This file has over 1 million lines, so I cannot use one-liners.

        Without seeing all your code and an example of the data set that is failing to connect, it is difficult for me to explain it. However, this line

        if ($string =~ /(^\d{7,8})/ )

        suggests you have both 7- and 8-digit numbers, in which case using index will give you incorrect results. For example, 1234567 will match the numbers 11234567, 21234567, etc. as well as 12345670, 12345671, etc.

        You could use an exact match

        if ($substr eq $arrfields[0]){
          print "$string\n";
          last;
        }

        but if speed is important then I suggest you use one of the hash-based solutions other monks have provided, like this:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # start
        my $t0 = time();
        my $file1 = 'file1.txt';
        my $file2 = 'file2.csv';
        my $outfile = 'final_lines.txt';

        # run once
        #testdata();

        my %file2=();
        open FILE2, '<',$file2 or die "Could not open $file2 $!";
        while (<FILE2>){
          s/[\r\n]//g;
          $file2{$_} = 1;
        }
        my $dur = time() - $t0;
        print "$. records read from $file2 in $dur seconds\n";
        close FILE2;

        $t0 = time();
        open OUTFILE,'>',$outfile or die "Could not open $outfile $!";
        open FILE1, '<',$file1 or die "Could not open $file1 $!";
        my $count_out=0;
        while (<FILE1>){
          my ($id,undef) = split /:/;
          if (exists $file2{$id}){
            print OUTFILE $_;
            ++$count_out;
          }
        }
        $dur = time() - $t0;
        print "$. records read from $file1 in $dur seconds\n";
        close FILE1;
        close OUTFILE;
        print "$count_out records written to $outfile\n";

        # some random data
        sub testdata {
          my $count;
          my @char = ('A'..'Z','a'..'z','0'..'9');
          open OUT1,'>',$file1 or die "$file1 $!";
          open OUT2,'>',$file2 or die "$file2 $!";
          for (my $i=1_000_000;$i<=99_999_999;$i+=99){
            my @chars = map{ $char[int(rand(62))] }(1..60);
            my $line = ':'.(join '',@chars);
            print OUT1 ($i + int rand(99))."$line\n";
            print OUT2 ($i + int rand(99))."\n";
            ++$count;
          }
          close OUT1;
          close OUT2;
          print "$count records created in $file1 and $file2\n";
        }
        poj
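
The index pitfall with mixed 7- and 8-digit numbers described in this reply can be seen in a few lines. This is a small sketch with made-up records (the part numbers are the ones used as examples in the reply; the rest of each line is copied from the sample data), comparing a substring test with an exact comparison of the leading digits.

```perl
use strict;
use warnings;

# Made-up records for illustration: one 7-digit and two 8-digit part numbers.
my @records = (
    "1234567:AA:1D:AAA:Descriptions:C:2",
    "11234567:AA:1D:AAA:Descriptions:C:2",
    "12345670:AA:3E:AAA:Descriptions:C:2",
);
my $wanted = '1234567';

# Substring test: true whenever $wanted occurs anywhere in the record.
my @by_index = grep { index($_, $wanted) != -1 } @records;

# Exact test: compare only the leading run of digits with $wanted.
my @by_eq = grep { /^(\d+)/ && $1 eq $wanted } @records;

printf "index matched %d records, eq matched %d\n",
    scalar @by_index, scalar @by_eq;
# prints: index matched 3 records, eq matched 1
```

A hash lookup keyed on the first field behaves like the exact test, which is one more reason to prefer it over index here.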
Re: extract line
by Loops (Hermit) on Jul 20, 2013 at 18:38 UTC

    There is probably a shorter one-liner than this:

    perl -F: -ane 'BEGIN {open K, "<file2"; $h{0+$_}=1 for <K>} print if exists $h{$F[0]}' file1

      How about this?

      perl -pe'open _,file2;0+$_~~[<_>]or$_=""' file1

      47 chars, versus 91 (84, removing unnecessary whitespace). This one's definitely "just for fun", though, due to now-experimental smartmatch and re-reading of file2.
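
Both one-liners use `0+$_` in place of chomp: adding zero turns the line into a number, which discards the trailing newline. A short sketch of what that does, and of one side effect worth knowing about, follows. The zero-padded value is a made-up example, not data from the thread.

```perl
use strict;
use warnings;

# Numifying a line read from a file drops the trailing newline ...
my $key = 0 + "3478749\n";
print "key is '$key'\n";            # prints: key is '3478749'

# ... but it also drops leading zeros, so a part number written as
# "0012345" (a made-up value) would be stored under the key "12345".
my $padded = 0 + "0012345\n";
print "padded key is '$padded'\n";  # prints: padded key is '12345'
```

For purely numeric part numbers without leading zeros, as in the sample data, the trick is safe; for anything else, chomp keeps the key as written.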

Re: extract line
by kcott (Abbot) on Jul 21, 2013 at 06:45 UTC

    G'day lallison,

    Welcome to the monastery.

    This code does what you describe as being wanted:

    $ perl -Mstrict -Mwarnings -e '
        use autodie;
        use Tie::File;

        my $re = qr{^((\d+).+$)}s;
        my %data_for_part;

        open my $f1, "<", "pm_1045452_file1.txt";
        while (<$f1>) {
            /$re/;
            $data_for_part{$2} = $1;
        }
        close $f1;

        tie my @file2, "Tie::File", "pm_1045452_file2.txt";
        print $data_for_part{$_} for @file2;
        untie @file2;
    '
    3478749:AA:1D:AAA:DescriptionsY:C:2
    3633731:AA:3E:AAA:DescriptionsZ:C:2

    I made a minor change to "File1" to show different Descriptions:

    $ cat pm_1045452_file1.txt
    3478748:AA:1D:AAA:DescriptionsX:C:2
    3478749:AA:1D:AAA:DescriptionsY:C:2
    3633731:AA:3E:AAA:DescriptionsZ:C:2

    "File2" data is as you show it:

    $ cat pm_1045452_file2.txt
    3478749
    3633731

    Notes:

    • You don't need to chomp any input nor add any newlines to the output.
    • There are no temporary arrays to process.
    • Tie::File comes standard with Perl: you won't need to install it.

    "This file has over a million lines and is running very slow."

    Given that you've been provided with a number of solutions, use Benchmark to determine which works best for you. (That module also comes standard with Perl.)
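
As a starting point for such a comparison, here is a minimal Benchmark sketch that times three ways of extracting the part number from one sample line. The sub names and the iteration count are arbitrary choices for illustration; the substr version assumes every line contains a colon. Note that `my ($id) = split ...` already stops splitting early, because Perl applies an implicit limit when split is assigned to a list of scalars, whereas splitting into an array processes every field.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = "3478749:AA:1D:AAA:Descriptions:C:2\n";

# Three ways to pull out the part number (all names are illustrative).
sub id_split_all   { my @f  = split /:/, $_[0]; return $f[0] }   # splits every field
sub id_split_first { my ($id) = split /:/, $_[0]; return $id }   # implicit limit
sub id_substr      { return substr($_[0], 0, index($_[0], ':')) } # no regex at all

# Run each candidate the same number of times and print a comparison table.
cmpthese(500_000, {
    split_all    => sub { id_split_all($line) },
    split_first  => sub { id_split_first($line) },
    substr_index => sub { id_substr($line) },
});
```

The timings depend on the machine and the Perl version, so the table is best read as a relative ranking. For the full job, file I/O is likely to matter at least as much as the extraction step.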

    [Aside: The code you posted is difficult to read due to the <code> tag issue. You appear to have made an effort but were unsuccessful: see Writeup Formatting Tips for how, where and why to do it.]

    -- Ken

      Are you running the file with the cat pm_1045452_file2.txt statement?

        If you're unfamiliar with *nix OSes, perhaps what I posted requires a little further explanation:

        • The actual code I ran is the "perl -Mstrict -Mwarnings -e ' ... '" part (see perlrun).
        • The two lines immediately following that second single quote are the output produced by the print statement.
        • cat is a commonly used *nix command (unrelated to Perl) that prints the contents of file(s). You can read "$ cat pm_1045452_file1.txt" as "Here's the contents of the file pm_1045452_file1.txt:". This is entirely unrelated to the Perl code; it merely shows the data the Perl code is using (which, as stated, I had slightly modified).

        [In case you didn't know, "*nix" is just an umbrella term for any UNIX-like OS.]

        -- Ken

      What should $2 refer to?
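
The thread records no answer to this question, but the pattern in the one-liner can be run on its own to see what it captures. The sketch below uses one line of the modified sample data. Capture groups are numbered by their opening parenthesis, so the outer group is $1 and the inner `(\d+)` is $2.

```perl
use strict;
use warnings;

my $re   = qr{^((\d+).+$)}s;    # the pattern from the one-liner above
my $line = "3478749:AA:1D:AAA:DescriptionsY:C:2\n";

if ($line =~ $re) {
    # $1: outer group, the whole line; with /s the final .+ also
    #     matches the trailing newline
    # $2: inner group, the leading digits (the part number)
    print "part number: $2\n";
    print "whole line:  $1";
}
```

Because $1 keeps the newline, the stored value can be printed as is, which matches the note that no chomp and no added newlines are needed.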
