
Re: extract line

by poj (Monsignor)
on Jul 20, 2013 at 18:27 UTC ( #1045459=note )

in reply to extract line

If you are curious to know why your script fails, then perhaps this explains it. Essentially, you are matching a file1 value against a substring extracted from that same value, so the test always succeeds.

while (my $line = <FILE1>) {
    # this sets $filerecord[0] to a file1 value
    my @filerecord = $line;
    for (@goodarray) {
        my $data = $_;
        # this sets $arrfields[0] only to a file2 value
        my @arrfields = $data;
        # this removes line ending from $arrfields[0]
        # which holds the file2 value
        chomp(@arrfields);
        # this trims ${$filedata}[0] which refers
        # to $filerecord[0] the file1 value
        my $filedata = \@filerecord;
        ${$filedata}[0] =~ s/^\s+|\s+$//g;
        # this trims the file2 value
        $arrfields[0] =~ s/^\s+|\s+$|-//g;
        #$arrfields[0] =~ s/-//g;
        # this sets $string to trimmed file1 value
        my $string = ${$filedata}[0];
        # this extracts number from start of file1 value
        if ($string =~ /(^\d{7,8})/) {
            # $substr holds number from file1
            my $substr = $1;
            # this matches file1 value to the number
            # extracted from file1 value
            # so will always match
            if (index($string, $substr) != -1) {
                #print "$string\n";
                #last;
            }
            # this would work matching file1 value to file2 value
            if (index($string, $arrfields[0]) != -1) {
                print "$string\n";
                last;
            }
        }
    }
}
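The always-true test can be reproduced in isolation. The following is only a minimal sketch: the sub name match_offsets and the sample values are made up for illustration and are not taken from the original script or data.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the offset at which the number captured
# from $string is found in $string itself, followed by the offset at
# which the file2 value is found in $string (-1 means "not found").
sub match_offsets {
    my ($string, $file2_value) = @_;
    return unless $string =~ /(^\d{7,8})/;
    my $substr = $1;    # taken from the start of $string itself
    return (index($string, $substr), index($string, $file2_value));
}

# Made-up file1 line and a file2 value that does not occur in it.
my ($self_pos, $file2_pos) = match_offsets('1234567:some other data', '7654321');

print "offset of own prefix:  $self_pos\n";     # 0, it can never be -1
print "offset of file2 value: $file2_pos\n";    # -1, no match here
```

Because $substr is captured from the start of $string, the first index() call returns 0 for every line, which is why every line of file1 was printed.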

Replies are listed 'Best First'.
Re^2: extract line
by lallison (Novice) on Jul 20, 2013 at 22:52 UTC

    Printing the string this way still gives me all the lines from file1. I even changed your line:

    if (index($string, $arrfields[0]) != -1){
    print "$string\n";

    to:

    if (index(${$filedata}[0], $arrfields[0]) != -1){
    print "${$filedata}[0]\n";

    It is not making the connection to grab the line with that part. It prints them all, up to the number that ends the $arrfields[0]. This file has over 1 million lines, so I cannot use one-liners.

      Without seeing all your code and an example of the data set that is failing to connect, it is difficult for me to explain it. However, this line

      if ($string =~ /(^\d{7,8})/ )

      suggests you have both 7- and 8-digit numbers, in which case using index will give you incorrect results. For example, 1234567 will match the numbers 11234567, 21234567, etc., as well as 12345670, 12345671, etc.
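      The difference between a substring test and an exact comparison can be seen in a small sketch. The sub names and the sample ids below are hypothetical, chosen only to illustrate the point:

```perl
use strict;
use warnings;

# Hypothetical helpers contrasting the two kinds of test.
sub substring_match { my ($id, $wanted) = @_; return index($id, $wanted) != -1 }
sub exact_match     { my ($id, $wanted) = @_; return $id eq $wanted }

my $wanted = '1234567';                               # made-up 7-digit value
my @ids    = qw(1234567 11234567 12345670 7654321);   # made-up candidates

for my $id (@ids) {
    printf "%-9s index: %-3s eq: %s\n",
        $id,
        substring_match($id, $wanted) ? 'yes' : 'no',
        exact_match($id, $wanted)     ? 'yes' : 'no';
}
```

      Only the first id passes the eq test, while index also accepts the two 8-digit values that merely contain the 7-digit number.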

      You could use an exact match

      if ($substr eq $arrfields[0]){ print "$string\n"; last; }

      but if speed is important, then I suggest you use one of the hash-based solutions other monks have provided, like this

      #!/usr/bin/perl
      use strict;
      use warnings;

      # start
      my $t0 = time();
      my $file1   = 'file1.txt';
      my $file2   = 'file2.csv';
      my $outfile = 'final_lines.txt';

      # run once
      #testdata();

      # load every file2 value into a hash for fast lookup
      my %file2 = ();
      open FILE2, '<', $file2 or die "Could not open $file2 $!";
      while (<FILE2>) {
        s/[\r\n]//g;
        $file2{$_} = 1;
      }
      my $dur = time() - $t0;
      print "$. records read from $file2 in $dur seconds\n";
      close FILE2;

      $t0 = time();
      open OUTFILE, '>', $outfile or die "Could not open $outfile $!";
      open FILE1, '<', $file1 or die "Could not open $file1 $!";
      my $count_out = 0;
      while (<FILE1>) {
        # the id is the text before the first colon
        my ($id, undef) = split /:/;
        if (exists $file2{$id}) {
          print OUTFILE $_;
          ++$count_out;
        }
      }
      $dur = time() - $t0;
      print "$. records read from $file1 in $dur seconds\n";
      close FILE1;
      close OUTFILE;
      print "$count_out records written to $outfile\n";

      # some random data
      sub testdata {
        my $count;
        my @char = ('A'..'Z', 'a'..'z', '0'..'9');
        open OUT1, '>', $file1 or die "$file1 $!";
        open OUT2, '>', $file2 or die "$file2 $!";
        for (my $i = 1_000_000; $i <= 99_999_999; $i += 99) {
          my @chars = map { $char[int(rand(62))] } (1..60);
          my $line = ':' . (join '', @chars);
          print OUT1 ($i + int rand(99))."$line\n";
          print OUT2 ($i + int rand(99))."\n";
          ++$count;
        }
        close OUT1;
        close OUT2;
        print "$count records created in $file1 and $file2\n";
      }
