Misprocessed Read From Files?

Napa has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

First, please forgive the length and any undesired formatting; I am a newbie and am just becoming familiar with the forum.

I was wondering if anyone might be able to help me.

I am trying to write code that ultimately will do some complex searching.

The problem I am stuck with is this:

I have two types of files: The first is one single file (example 1), while the second is about 250 files (example 2).

Example 1 is about 1173 pages worth of the following types of lines:

abaci, U, ae 1 b ax 0 s ay 0, 100, 0

All of the example 2 files contain somewhere around 20+ pages and look something like:

47.307796 122 <EXT-I've>; U; U

47.530873 122 lived; l ih v d; l ah v d

Currently I am trying to get Perl to read all of the lines in the first file, split them, and print one column. I am also trying to do the same, using globbing, for the other 250+ files. When I do either of these in isolation, it works perfectly and generates exactly the results I need. However, when I combine the code (see below) I run into results such as:

Only as much of the first file is printed as matches the length of the other input. For example, if I begin the code with example 1 and test with only 2 files from ex. 2 (i.e., about 40 pages of ex. 2), I get either the first 40 pages of ex. 1 or one line (the first or last of the file) repeated over 40 pages. The reverse happens when I swap the two blocks of code.

I have tried (not shown in my code below) things such as switching around loops, changing the foreach loop of the globbing to a while loop, moving around the close statements, moving "}" around, switching the order of which file should be worked with first, opening all files in the glob (including ex. 1) and then trying a sort of conditional split function, altering the file handles and $lines to make them more distinct... Nothing I am doing results in a better output.

Thus if anyone has thoughts of why this is misprocessing the information I need, I would sincerely appreciate it.

The code is:

#Open file matching ex.1;
open (C, "<dic.txt") || die "dictionary";
#open file to write to;
open (B, ">>all.txt") || die "output";
#Making a loop of all lines in example 1 file;
while ($line2 = <C>) {
#Getting rid of the newline;
chomp $line2;
#Split all lines;
@firstgrouping = split(/, |,\s|,\t|,|\s,|\t,| ,/, $line2);
#splitting the lines in $firstgrouping[2] by the numbers so that text before and after number are different indexed scalars;
@actualsyll = split(/\d |\d\s|\d\t|\d|\s\d| \d|\t\d|\t\d\t|\s\d\s| \d /, $firstgrouping[2]); 
#Printing the new version of @firstgrouping[2];
print B "@actualsyll\n";
}
close C;

#Loop gets all files matching ex. 2 opens them;
foreach $file (<s*.words>) {
open (A, "<$file") || die "files";
#open (B, ">>awe.txt") || die "output";
#Making a loop of all lines in each file;
while ($line1 = <A>) {
#There are headers with information I do not need so this essentially cuts them out;
$line1 =~ s/^   |^    |^\s\s\s|\s{3,4}//;
#Chomping of the newline;
chomp;
#Making a loop of all lines in all files from ex. 2 without their headers;
foreach ($line1 =~ /^\d/g) {
#Splitting the files into the numbers to the first space, the 122, the word minus extra markers, the chopped up word before the ";, the final chopped up word;
if ($line1 =~ /\d\s\w|\d\s{1,2}\d|\d\s\s\d|\d  \d|;\s\w/gi) {
$line1 =~ s/\s| |\s\s|  |\s{2}/\t/g;
$line1 =~ s/\t\t|\t{2}/\t/;
($stamp,$extra,$orth,$a,$b,$c,$d,$e,$f,$g, $h,$i,$j,$k,$l, $m, $n,$o,$p,$q,$r,$s,$t,$u,$v,$w,$x,$y,$z) = split(/ <|>;|\t/, $line1);
#splitting all of the information after the first ";" into 2 scalars;
$split = "$a $b $c $d $e $f $g $h $i $j $k $l $m $n $o $p $q $r $s $t $u $v $w $x $y $z";
($canon,$spoke) = split(/; /, $split); 
#Getting rid of some additional extraneous material (i.e., unwanted spaces...);
$orth =~ s/;//;
$spoke =~ s/\s{1,}$|\t{1,}$//g;
#Making an array that will bind everything together (mostly to aid in later coding not yet created);

@general = ($file, $stamp,$extra,$orth,$canon, $spoke, $syll);
}
#combining all of the $orth's into a loop;
foreach ($general[3]) {
#Making each column into its own array;
push(@array0, $general[0]);
push(@array1, $general[1]);
push(@array2, $general[2]);
push(@array3, $general[3]);
push(@array4, $general[4]);
push(@array5, $general[5]);
}}}}
close A;

#Making a loop of each array created above;
foreach (@array0,@array1,@array2,@array3,@array4,@array5) {
#Removing each element one at a time for later (not yet created) conditional searching of each array element;
@shift0 = shift @array0;
@shift1 = shift @array1;
@shift2 = shift @array2;
@shift3 = shift @array3;
@shift4 = shift @array4;
@shift5 = shift @array5;
#Prints out the $orth word of each line on its own line (used mostly as a debugger right now);


print B "@shift3\n";

}

Many thanks,
Napa


Replies are listed 'Best First'.
Re: Misprocessed Read From Files? by ikegami (Patriarch) on Sep 06, 2008 at 06:33 UTC
I think the problem you are trying to describe is the one caused by

    foreach (@array0,@array1,@array2,@array3,@array4,@array5) {
        @shift0 = shift @array0;
        @shift1 = shift @array1;
        @shift2 = shift @array2;
        @shift3 = shift @array3;
        @shift4 = shift @array4;
        @shift5 = shift @array5;
        ...
    }

That line loops over every element of @array0, then over every element of @array1, and so on through @array5. This is further complicated by the fact that you modify those arrays inside the loop (a big no-no).

A solution:

    push(@array0, $general[0]);
    push(@array1, $general[1]);
    push(@array2, $general[2]);
    push(@array3, $general[3]);
    push(@array4, $general[4]);
    push(@array5, $general[5]);
    ...
    while (@array0) {
        my $shift0 = shift @array0;
        my $shift1 = shift @array1;
        my $shift2 = shift @array2;
        my $shift3 = shift @array3;
        my $shift4 = shift @array4;
        my $shift5 = shift @array5;
        ...
    }

Like many other things in your program, that can be simplified to:

    push(@array, [ @general ]);
    ...
    foreach my $general (@array) {
        my ($shift0, $shift1, $shift2, $shift3, $shift4, $shift5) = @$general;
        ...
    }
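A minimal sketch of the two points above: a `foreach` over several arrays walks one flattened list, and keeping each record as one array reference avoids the parallel arrays altogether. The rows here are made-up placeholders (the file name `s01.words` is an assumption, and only four of the fields are shown), so treat this as an illustration, not as the program's real data flow.

```perl
use strict;
use warnings;

# foreach over a list of arrays walks one flattened list:
# every element of @first, then every element of @second.
my @first  = ( 'a', 'b' );
my @second = ( 'c', 'd' );
my $iterations = 0;
$iterations++ for ( @first, @second );
print "iterations: $iterations\n";    # prints "iterations: 4"

# Hypothetical rows standing in for @general: file, stamp, extra, orth.
my @rows;
for my $general (
    [ 's01.words', '47.307796', '122', "<EXT-I've>" ],
    [ 's01.words', '47.530873', '122', 'lived' ],
) {
    push @rows, [ @$general ];    # copy, so each row keeps its own values
}

# One iteration per row; @rows is not modified while looping.
my @orth_column;
for my $row (@rows) {
    my ( $file, $stamp, $extra, $orth ) = @$row;
    push @orth_column, $orth;
}
print "$_\n" for @orth_column;
```

With six parallel arrays of N elements each, the original loop header would therefore run 6 × N times if the arrays were left alone, which is why shifting inside it gives confusing results.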
Re^2: Misprocessed Read From Files? by Napa (Novice) on Sep 07, 2008 at 05:48 UTC
Hello All,

First, thank you all sincerely for the very prompt tips and advice.

After trying out some of the suggestions, I am running into the exact same issue. I realized after testing and playing around with the suggestions that I needed to add an extra note to my original post to help clarify the problem a little further. When I mentioned that this code did not do what I expected, what I really meant was that the problems appear when the first print statement is moved to the end of the code. When I do that, I get what I originally described. If I move the statement `print B "@actualsyll\n";` down to where the second print statement, `print B "@shift3\n";`, is, I get repeats of the first line from the first document. If I flip the code around so that the entire second half (the part that opens and uses the ex. 2 files) is placed above the first half (see below), I get the same thing in reverse.

I am just stuck and would love any further thoughts. I realize my coding is not high quality at the moment, but I have not been programming for long and thus am still learning the more moderate/advanced level stuff.

Before I post the code I wanted to note that I am genuinely sorry if it is not in a nicely formatted fashion (i.e., indented properly). I am totally blind, and as I use speech software to read material on the computer, such indents are unnecessary and actually less helpful for me. Thus, I am not tabbing, because I do not want to produce more confusion for readers by oddly indenting things that are not normally indented. Again, I am sorry if this makes my code more difficult for readers.

The flipped code follows. Please note that it is very close to the original: I have left out most of the suggested changes to keep things simple, so if your changes are not included, it does not mean I ignored them or did not try them.

#Loop gets all files matching ex. 2 opens them;
foreach $file (<s*.words>) {
open (A, "<$file") || die "$file";
open (B, ">>awe.txt") || die "output";
#Making a loop of all lines in each file;
while ($line1 = <A>) {
#There are headers with information I do not need so this essentially cuts them out;
$line1 =~ s/^   |^    |^\s\s\s|\s{3,4}//;
#Chomping of the newline;
chomp;
#Making a loop of all lines in all files from ex. 2 without their headers;
foreach ($line1 =~ /^\d/g) {
#Splitting the files into the numbers to the first space, the 122, the word minus extra markers, the chopped up word before the ";, the final chopped up word;
if ($line1 =~ /\d\s\w|\d\s{1,2}\d|\d\s\s\d|\d  \d|;\s\w/gi) {
$line1 =~ s/\s| |\s\s|  |\s{2}/\t/g;
$line1 =~ s/\t\t|\t{2}/\t/;
($stamp,$extra,$orth,$a,$b,$c,$d,$e,$f,$g, $h,$i,$j,$k,$l, $m, $n,$o,$p,$q,$r,$s,$t,$u,$v,$w,$x,$y,$z) = split(/ <|>;|\t/, $line1);
#splitting all of the information after the first ";" into 2 scalars;
$split = "$a $b $c $d $e $f $g $h $i $j $k $l $m $n $o $p $q $r $s $t $u $v $w $x $y $z";
($canon,$spoke) = split(/; /, $split);
#Getting rid of some additional extraneous material (i.e., unwanted spaces...);
$orth =~ s/;//;
$spoke =~ s/\s{1,}$|\t{1,}$//g;
#Making an array that will bind everything together (mostly to aid in later coding not yet created);
@general = ($file, $stamp,$extra,$orth,$canon, $spoke, $syll);
}
#combining all of the $orth's into a loop;
foreach ($general[3]) {
#Making each column into its own array;
push(@array0, $general[0]);
push(@array1, $general[1]);
push(@array2, $general[2]);
push(@array3, $general[3]);
push(@array4, $general[4]);
push(@array5, $general[5]);
}}}}
close A;
#Making a loop of each array created above;
while (@array0) {
#Removing each element one at a time for later (not yet created) conditional searching of each array element;
$shift0 = shift @array0;
$shift1 = shift @array1;
$shift2 = shift @array2;
$shift3 = shift @array3;
$shift4 = shift @array4;
$shift5 = shift @array5;
#Prints out the $orth word of each line on its own line (used mostly as a debugger right now);
#print B "$shift3\n";
}
#Open file matching ex.1;
open (C, "<dic.txt") || die "dictionary";
#open file to write to;
#open (B, ">>all.txt") || die "output";
#Making a loop of all lines in example 1 file;
while ($line2 = <C>) {
#Getting rid of the newline;
chomp $line2;
#Split all lines;
@firstgrouping = split(/, |,\s|,\t|,|\s,|\t,| ,/, $line2);
#splitting the lines in $firstgrouping[2] by the numbers so that text before and after number are different indexed scalars;
@actualsyll = split(/\d |\d\s|\d\t|\d|\s\d| \d|\t\d|\t\d\t|\s\d\s| \d /, $firstgrouping[2]);
#Printing the new version of @firstgrouping[2];
#print B "@actualsyll\n";
print B "$shift3\n";
}
close C;

Many thanks,
Napa
Re^3: Misprocessed Read From Files? by toolic (Bishop) on Sep 08, 2008 at 15:58 UTC
Welcome to the Monastery, Napa.

"I am genuinely sorry if it is not in a nicely formatted fashion (i.e., indented properly)"

Obviously, given your circumstances, no apologies are necessary. I have re-posted your code below for the benefit of others, with some indentation using the free utility, perltidy.
Re^3: Misprocessed Read From Files? by broomduster (Priest) on Sep 08, 2008 at 17:24 UTC
As toolic said, Welcome to the Monastery.

Since you are working with a large number of large input files, let's do some simple tests to help isolate the problems with the code you posted. It will help us if you make a small number of small files for some test runs (and working with the small files may even make the entire problem immediately clear to you). I suggest three files of five lines each, as follows (be sure to include the header lines, if any, in each of these files):

An "Example 1" file as described in your original post (I don't think this one has headers, but include them if there are any).

Two "Example 2" files, one for each of the example 2 types that you showed in your original post (don't forget the headers).

We also need some output files. As I understand your original post, you started with two separate programs, one to process "Example 1" data and another to process "Example 2" data, and each of those worked as you expected. Please run each of these programs on the appropriate short sample file(s) noted above, and show us the output files for each of the three input files.

Last of all (for now, anyway), run your combined program on the three short input files and show us the output file that it generates. Hopefully, we will be able to see how this output differs from the outputs from the individual programs.

Once we have the samples in front of us, we will be in better shape to help you get things working the way you expect.
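One way to produce such short sample files is to copy the first few lines of each real file. The sketch below is only an illustration: `write_sample` is a hypothetical helper, the output file name is a placeholder, and it assumes any header lines sit at the top of the file so that they are included in the copy.

```perl
use strict;
use warnings;

# Copy the first $count lines of $in to $out (both names supplied by the caller).
sub write_sample {
    my ( $in, $out, $count ) = @_;
    open my $src, '<', $in  or die "$in: $!";
    open my $dst, '>', $out or die "$out: $!";
    while ( my $line = <$src> ) {
        print {$dst} $line;
        last if $. >= $count;    # $. is the line number of the last read on $src
    }
    close $dst or die "$out: $!";
    close $src;
    return;
}

# Example call (the output name is a placeholder):
# write_sample( 'dic.txt', 'dic_sample.txt', 5 );
```

If the five lines after the headers matter more than the headers themselves, the count would need to be raised accordingly.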
Re: Misprocessed Read From Files? by jwkrahn (Abbot) on Sep 06, 2008 at 13:46 UTC
    #Open file matching ex.1;
    open (C, "<dic.txt") || die "dictionary";
    #open file to write to;
    open (B, ">>all.txt") || die "output";

You should at least include the file name and the $! variable in the error message so you know why open failed.

Your program should start with the two lines:

    use warnings;
    use strict;

    @firstgrouping = split(/, |,\s|,\t|,|\s,|\t,| ,/, $line2);

The `\s` character class includes both the `" "` and the `"\t"` characters so that regular expression could be simplified to: `/\s?,\s?/`

    @actualsyll = split(/\d |\d\s|\d\t|\d|\s\d| \d|\t\d|\t\d\t|\s\d\s| \d /, $firstgrouping[2]);

And that regular expression could be simplified to: `/\s?\d\s?/`

    open (A, "<$file") || die "files";

Again, you should include the file name and the $! variable in the error message so you know why open failed.

    $line1 =~ s/^   |^    |^\s\s\s|\s{3,4}//;

That regular expression could be simplified to `/^ {4}|^\s{3}|\s{3,4}/`

    chomp;

You are chomping the $_ variable but you are not using the $_ variable.

    foreach ($line1 =~ /^\d/g) {

`$line1 =~ /^\d/g` returns a list of the matches in `$line1` and stores each match in the $_ variable each time through the loop. However the pattern `/^\d/` will only match once because it is anchored at the beginning of the line. So perhaps that line should be:

    if ( $line1 =~ /^\d/ ) {

    if ($line1 =~ /\d\s\w|\d\s{1,2}\d|\d\s\s\d|\d  \d|;\s\w/gi) {

That regular expression could be simplified to `/[\d;]\s\w|\d\s{2}\d/`

    $line1 =~ s/\s| |\s\s|  |\s{2}/\t/g;

That regular expression could be simplified to `/\s/`

    $line1 =~ s/\t\t|\t{2}/\t/;

That regular expression could be simplified to `/\t\t/`

    $spoke =~ s/\s{1,}$|\t{1,}$//g;

The `/g` option is extraneous because the pattern is anchored at the end of the line and will only match once. That regular expression could be simplified to `/\s+$/`

    }}}}
    close A;

You are closing the `A` filehandle outside of the `foreach` loop, which is OK because perl will automatically close it every time it opens it again.

So, removing all the unneeded variables and adding indentation, your code can be simplified to:

    #!/usr/bin/perl
    use warnings;
    use strict;

    # Open file matching ex.1
    open C, '<', 'dic.txt' or die "dic.txt: $!";
    # open file to write to
    open B, '>>', 'all.txt' or die "all.txt: $!";

    # Making a loop of all lines in example 1 file
    while ( my $line2 = <C> ) {
        # Getting rid of the newline
        chomp $line2;
        # Split all lines
        my $firstgrouping = ( split /\s?,\s?/, $line2 )[ 2 ];
        # splitting the lines in $firstgrouping[2] by the numbers so that text before and after number are different indexed scalars
        my @actualsyll = split /\s?\d\s?/, $firstgrouping;
        # Printing the new version of @firstgrouping[2]
        print B "@actualsyll\n";
    }
    close C;

    # Loop gets all files matching ex. 2 opens them
    my @array3;
    for my $file ( <s*.words> ) {
        open A, '<', $file or die "$file: $!";
        # Making a loop of all lines in each file
        while ( my $line1 = <A> ) {
            # There are headers with information I do not need so this essentially cuts them out
            $line1 =~ s/^ {4}|^\s{3}|\s{3,4}//;
            # Chomping of the newline
            chomp $line1;
            next unless $line1 =~ /^\d|[\d;]\s\w|\d\s{2}\d/;
            $line1 =~ s/\s/\t/g;
            $line1 =~ s/\t\t/\t/;
            my $orth = ( split / <|>;|\t/, $line1 )[ 2 ];
            # Getting rid of some additional extraneous material
            $orth =~ s/;//;
            push @array3, $orth;
        }
        close A;
    }

    # Making a loop of each array created above
    for my $shift3 ( @array3 ) {
        # Prints out the $orth word of each line on its own line (used mostly as a debugger right now)
        print B "$shift3\n";
    }
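To see what the two simplified split patterns do, here is a small sketch run against the single "Example 1" line quoted in the original post. It assumes the real dictionary lines follow the same layout (comma-separated fields, single-digit markers in the third field); lines that differ from that sample may split differently.

```perl
use strict;
use warnings;

# Sample line copied from the original post ("Example 1" format).
my $line = 'abaci, U, ae 1 b ax 0 s ay 0, 100, 0';

# Split on a comma with optional whitespace on either side; keep the third field.
my $third = ( split /\s?,\s?/, $line )[2];

# Split that field on single digits, again with optional surrounding whitespace.
my @actualsyll = split /\s?\d\s?/, $third;

print join( '|', @actualsyll ), "\n";    # prints "ae|b ax|s ay"
```

The trailing digit produces an empty final field, which `split` drops by default, so no cleanup is needed at the end of the list.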
Re: Misprocessed Read From Files? by graff (Chancellor) on Sep 06, 2008 at 15:46 UTC
Based on what you've said so far about your goal:

"Currently I am trying to get Perl to read all of the lines in the first file, split them, and print one column. I also am trying to do the same using globbing for the other 250+ files. When I do either of these in isolation I can do this perfectly and generate exactly the results I need. However, when I combine the code..."

I don't know what sort of output you really want to get when you "combine the code". I gather it has something to do with combining the two sets of data, and "complex searching", but... how would you describe/define that, exactly?

Writing a reduced output version of every input file is an okay thing to do, I guess, but how does it help in moving you toward your real goal (whatever that is)? If you could reach that goal without writing all those output files, wouldn't that be better?

Apart from the very good advice given in the earlier replies, I would encourage you to pay attention to the visual appearance of your code: things like proper indentation and strategic use of an occasional blank line (to delimit logical chunks like loops and conditional blocks) can do wonders for making the code easier to manage and maintain. It does make a difference.

Also, this part of your code looks wrong -- maybe it works for you now, but I would not trust it:

    ($stamp,$extra,$orth,$a,...$z) = split(/ <|>;|\t/, $line1);
    #splitting all of the information after the first ";" into 2 scalars;
    $split = "$a ... $z";
    ($canon,$spoke) = split(/; /, $split);

Please look again at the docs for split. You can split to an array (much easier and less brittle than a long list of scalars), and/or you can limit the number of elements returned by split. Examples:

    # split into two scalars and an array:
    ( $field1, $field2, @rest ) = split( /$separators/, $input_string );

    # or split into three scalars, put everything after "$field2" into $rest:
    ( $field1, $field2, $rest ) = split( /$separators/, $input_string, 3 );

Bear in mind that something like the following would be a mistake, since the first array will take up all the results from split, and any subsequent variable(s) will be empty (undef):

    # wrong:
    ( @fields, $comment ) = split( /[\t#]/, $input_string );
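As an illustration of the limit argument, the sketch below applies it to one of the "Example 2" lines quoted in the original post. It assumes the real lines look like that sample (three semicolon-separated parts, with the time stamp, a number, and the word in the first part); the separators chosen here are a guess based on that one line, not a tested replacement for the original regexes.

```perl
use strict;
use warnings;

# Sample line copied from the original post ("Example 2" format).
my $line = '47.530873 122 lived; l ih v d; l ah v d';

# At most three pieces, separated by a semicolon and optional whitespace.
my ( $head, $canon, $spoke ) = split /;\s*/, $line, 3;

# The head holds the time stamp, the extra number, and the word itself.
my ( $stamp, $extra, $orth ) = split ' ', $head, 3;

print "$orth|$canon|$spoke\n";    # prints "lived|l ih v d|l ah v d"
```

Because the pronunciation fields are captured whole, there is no need for the 26 single-letter variables or for rejoining them into one string before splitting again.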