
Use Schwartzian transform across multiple files

by Sonya777 (Novice)
on Sep 19, 2016 at 12:28 UTC ( [id://1172125]=perlquestion )

Sonya777 has asked for the wisdom of the Perl Monks concerning the following question:

I would like to use the following Schwartzian transform sorting script (which works perfectly as a standalone script) on multiple files in a folder:

#!/usr/bin/perl
use strict;
use warnings;

open my $input, '<' or die "Unable to open input file: $!";
my @file = <$input>;

my @sorted_file =
    map  { $_->[0] }
    sort { $a->[1] <=> $b->[1] }
    map  { my ($x) = $_ =~ /VerNumber:\((\d+)/i; [$_, $x]; }
    @file;

open my $output, '>' or die "Unable to open output file: $!";
print $output $_ for @sorted_file;

The script should take as an input all the files in one folder starting with file*, and sort the content of each one:

file1.txt
file2.txt
...
file1000.txt

As output, I would like the script to create a new folder and place the new files with the sorted content in it, keeping the same file names.

/sorted
file1.txt -> /sorted/file1.txt
file2.txt -> /sorted/file2.txt
...
file1000.txt -> /sorted/file1000.txt

I have made the following script. It does write the files in the output folder, keeping the same file names, but the sorting part is not working and I am getting the same files in the output (even though the sorting script is working fine as a standalone one). Any help?

#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;

my $version="0.2";
my $files_match="";
my $files_dir="";
my $file_name="";
my $help_flag="";
my $version_flag="";

GetOptions(
    'm|match=s'     => \$files_match,
    'd|directory=s' => \$files_dir,
    'h|help'        => \$help_flag,
    'v|version'     => \$version_flag,
);

sub sorting {
    my @file = "$_";
    my @sorted =
        map  { $_->[0] }
        sort { $a->[1] <=> $b->[1] }
        map  { my ($x) = $_ =~ /VerNumber:\((\d+)/i; [$_, $x]; }
        @file;
    print FILE $_;
}

if (($files_match ne "") and ($files_dir ne "")) {
    chdir("$files_dir") or die "$!";
    opendir (DIR, ".") or die "$!";
    my @files = grep {/$files_match/} readdir DIR;
    my $files_size = $#files + 1;
    my $index_file = 1;
    print "Files to process: $files_size\n";
    close DIR;
    foreach (@files) {
        open(FILE, ">./sorted/$_.sort") or die $!;
        my @singlefile = $_;
        print "Processing $index_file of $files_size files: $_\n";
        local @ARGV = @singlefile;
        while(<>){
            sorting($_);
        }
        close(FILE);
        $index_file++;
        print "OK: Sorted @singlefile \n";
    }
}
elsif ((!$help_flag) and (!$version_flag)){printHelp();}

I am a beginner in Perl and any help would be more than welcome! Thank you in advance!

Replies are listed 'Best First'.
Re: Use Schwartzian transform across multiple files
by choroba (Cardinal) on Sep 19, 2016 at 13:17 UTC
    The original Schwartzian transform processed @file , a poorly named array of file lines. Your implementation uses the same array, but assigns only a single line to it:
    my @file = "$_";

    Pass all the lines to the sub, and assign it as (note the better name):

    my @lines = @_;
    # ...
    sorting(<>);
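
As a rough sketch of the difference between passing all lines at once and passing one line per call (the helper name `sort_by_version` and the sample lines are made up for illustration, not taken from the thread): sorting a one-element list always returns it unchanged, so calling the sub once per line reproduces the input order.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): takes ALL lines of one file
# and returns them sorted numerically by the number after "VerNumber:(".
sub sort_by_version {
    my @lines = @_;
    return map  { $_->[0] }
           sort { $a->[1] <=> $b->[1] }
           map  { my ($x) = $_ =~ /VerNumber:\((\d+)/i; [ $_, $x ] }
           @lines;
}

# Made-up sample data standing in for the contents of one file.
my @lines = (
    "item VerNumber:(30)\n",
    "item VerNumber:(4)\n",
    "item VerNumber:(12)\n",
);

# One call with the whole list: the lines come back as 4, 12, 30.
my @sorted_all = sort_by_version(@lines);

# One call per line: each call sorts a one-element list, so the
# original order (30, 4, 12) comes back unchanged.
my @sorted_each = map { sort_by_version($_) } @lines;

print "Sorted in one call:\n",  @sorted_all;
print "Sorted line by line:\n", @sorted_each;
```

Returning the sorted list instead of printing inside the sub is only a choice made for this sketch; it keeps the example free of file handles.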

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
      Hi! Thanks for the comment! I have corrected that. Here is the updated script:

      sonya$ vi sorting.pl
      #!/usr/bin/perl
      use strict;
      use warnings;
      use Getopt::Long;

      my $version="0.2";
      my $files_match="";
      my $files_dir="";
      my $file_name="";
      my $help_flag="";
      my $version_flag="";

      GetOptions(
          'm|match=s'     => \$files_match,
          'd|directory=s' => \$files_dir,
          'h|help'        => \$help_flag,
          'v|version'     => \$version_flag,
      );

      sub sorting {
          my @lines = @_;
          my @sorted =
              map  { $_->[0] }
              sort { $a->[1] <=> $b->[1] }
              map  { my ($x) = $_ =~ /VerNumber:\((\d+)/i; [$_, $x]; }
              @lines;
          print FILE for @sorted;
      }

      if (($files_match ne "") and ($files_dir ne "")) {
          chdir("$files_dir") or die "$!";
          opendir (DIR, ".") or die "$!";
          my @files = grep {/$files_match/} readdir DIR;
          my $files_size = $#files + 1;
          my $index_file = 1;
          print "Files to process: $files_size\n";
          close DIR;
          foreach (@files) {
              open(FILE, ">./sorted/$_.sort") or die $!;
              my @singlefile = $_;
              print "Processing $index_file of $files_size files: $_+\n";
              local @ARGV = @singlefile;
              while(<>){
                  sorting($_);
              }
              close(FILE);
              $index_file++;
              print "OK: Sorted @singlefile \n";
          }
      }
      elsif ((!$help_flag) and (!$version_flag)){printHelp();}
      "sorting.pl" [New] 49L, 1391C written

      However, I am still getting the unchanged files as a result:

      sonya$ ./full.pl -d /home/sonya/test/ -m file
      Files to process: 3
      Processing 1 of 3 files: file1.txt+
      OK: Sorted file1.txt
      Processing 2 of 3 files: file2.txt+
      OK: Sorted file2.txt
      Processing 3 of 3 files: file3.txt+
      OK: Sorted file3.txt
      sonya$ diff /home/sonya/test/file1.txt /home/sonya/test/sorted/file1.txt.sort
      sonya$
      sonya$

      The standalone sorting script is working so there is something wrong with my code.
Re: Use Schwartzian transform across multiple files
by Corion (Patriarch) on Sep 19, 2016 at 12:35 UTC

    In your program main loop, you still iterate over each file and call sorting. If you want to sort the content of all files, either concatenate all files before starting your program or change your program to first read the content of all files and then sort it, instead of reading and writing a single file. This will mean moving some of the code out of sorting, especially the part where it reads in the whole file, and the part where it writes the sorted file.

      Thank you for your quick reply. I cannot use the option of concatenating the files, because I want to sort the content of each file, one by one, and place the new files (with the same names) in the new folder. This is important because of other steps in my procedure. Therefore I need to first read the content of all files and then sort them. When you say moving some of the code out of sorting, which part exactly are you referring to? I am sorry for the additional question, but I am a newbie in Perl and I still find it hard to understand the syntax. Thanks!
Re: Use Schwartzian transform across multiple files
by Marshall (Canon) on Sep 19, 2016 at 20:24 UTC
    Your longer version does not match up with the usage of your single use sorting code (that you say works). The first version sends an input file handle to sorting, while your second version is just sending a simple name. Beware: I didn't test the code below, but this should get you closer to working code. Tested code below as an update.

    One way to do this is to open both the input and output file handles and pass them both to the sorting routine. Your code used a bareword name, FILE, for the output file. It is not so obvious that FILE is actually changing as the program runs. Better to use a lexical file handle and pass it to sorting(). FILE is global in scope and you were using that to pass a variable to sorting(), which is not a good idea.

    I prefer to close() the filehandles at the same level of code as the open().

    I do have some issues with the variable names. These names are important. For the "counter", perhaps $file_num, $file_counter, $filecnt or some such. "$index_file" is confusing. First of all this is not any kind of an "index". Anyway, I suggest you spend a bit more time considering names.

    Unless the for (or foreach) loop is really short and obvious, I prefer to declare an actual name for the loop variable rather than using $_. Once you get into the body of the loop, $_ is so generic that it can be confusing what it really is. An extra "my variable" is a "cheap" thing to create and usually well worth the effort.

    Update: I went ahead and added a few lines to make this a complete program that I could test. Glob is a bit easier than readdir for a very simple case like this. You also had a number of lines of code where I couldn't figure out what the intent was, so I deleted them.

    #!/usr/bin/perl
    use strict;
    use warnings;

    if (!-d "sorted")
    {
        mkdir "sorted" or die "unable to create dir sorted $!";
    }

    my @files2sort = <file*.txt>;   #just use glob to get names
    my $curfilenum =1;

    foreach my $file (@files2sort)
    {
        open my $fh_in, '<', $file or die "$file failed to open $!";
        open(my $fh_out, '>', "./sorted/$file.sort") or die "cannot create out $file.sort $!";

        print "Processing ".$curfilenum++." of ".@files2sort." $file\n";
        sortfile($fh_in, $fh_out);

        close ($fh_out);
        close ($fh_in);
        print "OK: Sorted $file \n";
    }

    sub sortfile
    {
        my ($fh_in, $fh_out) = @_;
        my @lines = <$fh_in>;
        my @sorted = map  { $_->[0] }
                     sort { $a->[1] <=> $b->[1] }
                     map  { my ($x) = $_ =~ /VerNumber:\((\d+)/i; [$_, $x]; }
                     @lines;
        print $fh_out @sorted;
    }
    __END__
    Processing 1 of 2 file1.txt
    OK: Sorted file1.txt
    Processing 2 of 2 file2.txt
    OK: Sorted file2.txt

    #these names are confusing...
    my $files_size = $#files + 1;
    my $index_file = 1;

    #better names?:
    my $num_files = @files;   # the value of an array in a scalar context is the
                              # number of items in the array, no need for $#files+1
                              # or you can just use @files in a scalar context without
                              # creating $num_files at all.
    my $file_counter =1;      # "$index_file" would mean something different
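
The point about scalar context in the comments above can be tried in isolation. This is a small sketch with made-up file names, not part of the tested program:

```perl
use strict;
use warnings;

# Made-up file names for illustration.
my @files = ('file1.txt', 'file2.txt', 'file3.txt');

my $last_index = $#files;          # index of the last element: 2
my $num_files  = @files;           # array in scalar context: 3
my $also_count = scalar @files;    # explicit scalar context: 3

# String concatenation also puts @files in scalar context.
print "Processing 1 of " . @files . " files\n";   # prints: Processing 1 of 3 files
```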
Re: Use Schwartzian transform across multiple files
by Marshall (Canon) on Sep 19, 2016 at 22:29 UTC
    Hi Sonya777!

    I showed some working Schwartzian Transform (ST) code at Re: Use Schwartzian transform across multiple files. As a beginner, I would certainly consider the idea of a more straightforward approach. I recommend that you master the basics before trying to use advanced techniques. I show "another way" for you below.

    The sort routine selects pairs of things to "judge". The user-supplied function's job is to decide: less than, equal, or greater than. In the code below, there is a lot of "extra work" because a regex has to be run twice every time a new pair of "things" is selected for comparison. The ST is faster because it runs the regex only once per line and saves the results in an intermediate array before the actual sort is run.

    However you should consider that often this extra efficiency doesn't matter at all in the overall scheme of things. In fact, for small numbers of lines, the ST can actually be slower due to the overhead of creating the intermediate array and transforming it back to the original representation.

    How fast is "fast enough" depends upon the application. If you are sorting an array of say 80,000 elements, there probably will be a user noticeable difference between algorithms. With 100-200 lines, probably not.
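
To make the "extra work" visible, here is a minimal sketch that counts how often the key is extracted in each approach. The helper `version_of`, the counter, and the generated lines are assumptions for illustration only; the exact count for the plain sort depends on the data and on Perl's sort implementation, so no figure is quoted here.

```perl
use strict;
use warnings;

my $extractions = 0;    # counts how often the regex is run

# Hypothetical helper: pull the number out of one line and count the call.
sub version_of {
    my ($line) = @_;
    $extractions++;
    my ($ver) = $line =~ /VerNumber:\((\d+)/i;
    return $ver;
}

# Made-up data: 1000 lines with distinct version numbers in scrambled order.
my @lines = map { sprintf "entry VerNumber:(%d)\n", ($_ * 7919) % 1009 } 1 .. 1000;

# Plain sort: the key is extracted twice for every comparison.
$extractions = 0;
my @plain = sort { version_of($a) <=> version_of($b) } @lines;
my $plain_count = $extractions;

# Schwartzian transform: the key is extracted once per line.
$extractions = 0;
my @st = map  { $_->[0] }
         sort { $a->[1] <=> $b->[1] }
         map  { [ $_, version_of($_) ] }
         @lines;
my $st_count = $extractions;

print "plain sort: $plain_count extractions\n";
print "ST:         $st_count extractions\n";
```

Both variants produce the same order; only the number of regex runs differs.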

    Once you get your code working, I encourage you to benchmark the code below vs my ST version. Make the comparison as "fair as possible". Also be aware that the second time you run the program, it will run faster because the files will be in the in-memory disk cache, and that speeds things up a lot. But even so, you probably will learn something from doing a simple benchmark exercise. I don't know what OS you are using, but also be aware that on some OSes, Windows in particular, console I/O is an extremely "expensive" operation and takes a lot of execution time. I/O to report benchmark progress can consume so much time that it skews the results.
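
One possible way to set up such a comparison is the core Benchmark module. The sketch below is an assumption about how the test could be arranged, not something from the thread: it sorts made-up in-memory lines so that neither disk nor console I/O is timed, and the wrapper names `plain_sort` and `st_sort` are hypothetical.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up in-memory data, so no file or console I/O ends up in the timing.
my @lines = map { sprintf "entry VerNumber:(%d)\n", ($_ * 7919) % 1009 } 1 .. 1000;

# Comparison sub in the style of by_version: runs the regex on both operands.
sub by_version {
    my ($verA) = $a =~ /VerNumber:\((\d+)/i;
    my ($verB) = $b =~ /VerNumber:\((\d+)/i;
    return $verA <=> $verB;
}

# Plain sort with a named comparison sub.
sub plain_sort { return sort by_version @_ }

# Schwartzian transform: extract the key once per line, sort, then unwrap.
sub st_sort {
    return map  { $_->[0] }
           sort { $a->[1] <=> $b->[1] }
           map  { my ($x) = $_ =~ /VerNumber:\((\d+)/i; [ $_, $x ] }
           @_;
}

# Run each variant 200 times and print a comparison table.
cmpthese(200, {
    plain => sub { my @sorted = plain_sort(@lines) },
    ST    => sub { my @sorted = st_sort(@lines) },
});
```

The numbers will vary by machine and by line count, so it is worth repeating the run with data sizes close to the real files.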

    #!/usr/bin/perl
    use strict;
    use warnings;

    if (!-d "sorted")
    {
        mkdir "sorted" or die "unable to create dir sorted $!";
    }

    my @files2sort = <file*.txt>;   #just use glob to get names
    my $curfilenum =1;

    foreach my $file (@files2sort)
    {
        open my $fh_in, '<', $file or die "$file failed to open $!";
        open(my $fh_out, '>', "./sorted/$file.sort") or die "cannot create out $file.sort $!";

        print "Processing ".$curfilenum++." of ".@files2sort." $file\n";
        sortfile2($fh_in, $fh_out);

        close ($fh_out);
        close ($fh_in);
        print "OK: Sorted $file \n";
    }

    sub sortfile2
    {
        my ($fh_in, $fh_out) = @_;
        my @lines = <$fh_in>;
        @lines = sort by_version @lines;
        print $fh_out @lines;   #can do a sort "in place"
                                #separate @sorted var is not needed.
    }

    sub by_version
    {
        my ($verA) = $a =~ /VerNumber:\((\d+)/i;
        my ($verB) = $b =~ /VerNumber:\((\d+)/i;
        $verA <=> $verB   #returns -1,0,+1
    }
    __END__
    Processing 1 of 2 file1.txt
    OK: Sorted file1.txt
    Processing 2 of 2 file2.txt
    OK: Sorted file2.txt
      Hi Marshall!

      Thank you so much for your detailed response and guidance. I am a newbie in Perl and I got a bit lost with my attempt using, as you say, advanced options, so I really appreciate your comments. I agree that using "glob" is easier and much clearer to me in this specific case.

      I decided to go with the ST code since I have 16000 files with approx. 4000 arrays each. So based on your guidance this would be the faster way.

      Most importantly, the ST script you have suggested works perfectly! I have just tried it on all the files and it was done in less than 5 minutes. Later on I will test and compare both scripts. With ST script you have suggested I got the desired result and I cannot thank you enough.

      It is always nice to see someone ready to share his knowledge with the world and it is a motivation for the rest of us to do the same!

      Cheers!
