PerlMonks  

Re: Read files not subdirectories

by Laurent_R (Canon)
on Jan 29, 2015 at 22:54 UTC ( [id://1114989]=note )


in reply to Read files not subdirectories

To discard directories from your file list and keep only plain files:
next unless -f $file;
Please also note that for or foreach is not the best way to read the lines of a file: the loop evaluates the file handle in list context, so the whole file is read into memory before processing starts. If the file is very big, the program may run out of memory and crash. It is also likely to be slightly slower, although I am not sure the difference will be big or even noticeable. The better way to read a text file is a while loop, for example:
while (my $line = <$fh>) {    # $fh is an open file handle, not a file name
    # ...
}
With the while statement, you are reading the file line by line, so that you only have one line in memory at any given time.
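
As a minimal sketch of the point above (the count_lines helper and the temporary sample file are made up for illustration and are not part of the original post), the while form reads in scalar context and holds one line at a time:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: count lines while holding only one line in memory.
sub count_lines {
    my ($path) = @_;
    open my $fh, '<', $path or die "failed to open '$path': $!";
    my $count = 0;
    while (my $line = <$fh>) {    # scalar context: one line per iteration
        $count++;
    }
    close $fh;
    return $count;
}

# By contrast, this form evaluates <$fh> in list context and reads
# every line into memory before the first iteration runs:
#     foreach my $line (<$fh>) { ... }

# Sample data in a temporary file, for demonstration only.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "line $_\n" for 1 .. 3;
close $tmp_fh;

print count_lines($tmp_path), " lines read\n";
```

For small files the two forms behave the same; the difference only matters once the file is large compared with available memory.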

Je suis Charlie.

Re^2: Read files not subdirectories
by wrkrbeee (Scribe) on Jan 29, 2015 at 23:04 UTC
    That's perfect!! Also, thanks for your advice concerning the use of WHILE in lieu of FOREACH! :-))
Re^2: Read files not subdirectories
by wrkrbeee (Scribe) on Jan 30, 2015 at 02:55 UTC

    Could I please ask another question? After adding "next unless -f $file" the program runs, but nothing after that statement is executed. As a test, I inserted a simple PRINT statement immediately after the "next unless" statement, and it printed nothing. If I uncomment the "next if" statement and omit the "next unless", the PRINT statement works, but the program crashes when it tries to execute the write statement. In sum, it seems that the "next unless" filters out every entry. Does that make any sense?

    #! /usr/bin/perl -w
    use strict;
    use warnings;
    use lib "c:/strawberry/perl/site/lib";
    use HTML::Strip;

    my $hs = HTML::Strip->new();
    my $write_dir = 'G:\research\sec filings 10k and 10Q\data\filing docs\1993\Clean';
    my $files_dir = 'C:\Dwimperl\Perl\1993';
    opendir (my $dir_handle, $files_dir) || die "failed to open '$files_dir' <$!>";
    while (my $file = readdir($dir_handle) ) {
        next unless -f $file;
        #next if $file eq '.' or $file eq '..';
        open my $file_handle, "/dwimperl/perl/1993/$file" or die "failed to open '$file' <$!>";
        while (my $line = <$file>) {
            my $clean_text = $hs->parse( ' ' );
            print $write_dir "$file\n";
            $hs->eof;
        }
    }
    close();
    closedir $dir_handle;

      Consult a beginner-level Perl book ("Beginning Perl", for example) to understand the difference between a file name and a file handle, and how print picks the currently selected file handle in its various forms.

      ...
      my $write_dir = 'G:\research\sec filings 10k and 10Q\data\filing docs\1993\Clean';
      ...
      opendir (my $dir_handle, $files_dir) || die "failed to open '$files_dir' <$!>";
      while (my $file = readdir($dir_handle) ) {
          ...
          open my $file_handle, "/dwimperl/perl/1993/$file" or die "failed to open '$file' <$!>";
          while (my $line = <$file>) {

      Actually use the file handle, not a file path, to read a line.

      ... print $write_dir "$file\n"; ...

      The directory path is a string, not a file handle. Since no open file handle of that name exists, print will fail. To write to a file, open it in write mode to get a file handle, then use the print FILEHANDLE LIST syntax; see print.
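
      A minimal sketch of that advice, assuming a temporary directory as a stand-in for the real output directory and a made-up file name (example.txt); neither comes from the original script:

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Assumption: a temporary directory stands in for the real $write_dir.
my $write_dir = tempdir(CLEANUP => 1);

# Build the path of a file INSIDE the directory; the directory itself
# cannot be opened for writing as if it were a file.
my $out_path = File::Spec->catfile($write_dir, 'example.txt');

# Open the file in write mode to obtain a file handle ...
open my $fh_out, '>', $out_path
    or die "failed to open '$out_path' for write: $!";

# ... and pass that handle to print (print FILEHANDLE LIST).
print {$fh_out} "some text\n";

close $fh_out or die "failed to close '$out_path': $!";
```

      The braces around the handle in print {$fh_out} are optional here; they just make it obvious which argument is the file handle and which is the list to print.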

      To copy or move files, see File::Copy.

      Thank you! I apologize for the inconvenience.

        You are welcome. It was no inconvenience to point out the errors. Actually, the OP's reply may not have been addressed to me, as it was a reply to the OP's own post. Then again, that might just be a case of not being familiar with PerlMonks.

      On some systems, changing a directory while you are scanning it (creating, renaming, or deleting files in it) can interfere with the directory scan, causing it to end prematurely, to list the same file more than once, and so on. (And this would be true no matter what high-level language, e.g. Perl, was being used to do it.)

      Therefore, I suggest that you first retrieve the entire list of files into an in-memory list ... which you can very easily do in Perl just by using the list context.   Then, iterate through the in-memory list that you have just retrieved, checking to see if they are or aren’t directories and so-on.   Start and finish the task of retrieving the list, for any given directory that you are now “in” ... then process the list.
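
      The approach just described could be sketched as follows. The plain_files helper and the temporary demo directory are illustrative assumptions, not code from this thread:

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: read the whole directory listing first, then filter it.
sub plain_files {
    my ($dir) = @_;
    opendir my $dh, $dir or die "failed to open '$dir': $!";
    my @entries = readdir $dh;    # list context: every entry at once
    closedir $dh;                 # the scan is finished before any file is touched

    # readdir returns bare names, so test the full path with -f.
    my @files = grep { -f File::Spec->catfile($dir, $_) } @entries;
    my @sorted = sort @files;
    return @sorted;
}

# Demo data: two plain files and one subdirectory in a temporary directory.
my $demo = tempdir(CLEANUP => 1);
for my $name ('a.txt', 'b.txt') {
    my $path = File::Spec->catfile($demo, $name);
    open my $fh, '>', $path or die "failed to create '$path': $!";
    close $fh;
}
mkdir(File::Spec->catdir($demo, 'sub')) or die "failed to create subdirectory: $!";

print "$_\n" for plain_files($demo);
```

      Because the directory handle is closed before the list is processed, whatever is done to the files afterwards cannot disturb the scan.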

      Of course, “file finding” is such a common requirement that there are many CPAN modules like File::Find.   If you need to “take a walk through a directory tree,” there are plenty of tour-guides . . .
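
      For this thread's case (plain files only, no subdirectories), a File::Find sketch might look like the following. The top_level_files helper and the demo directory are assumptions made for illustration; $File::Find::prune is the module's documented way to stop descending into a directory:

```perl
use strict;
use warnings;
use File::Find;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: names of plain files directly inside $dir, no recursion.
sub top_level_files {
    my ($dir) = @_;
    my @files;
    find(
        sub {
            if (-d $_) {
                # '.' is the starting directory itself; do not descend
                # into any other directory.
                $File::Find::prune = 1 unless $_ eq '.';
                return;
            }
            push @files, $_ if -f $_;
        },
        $dir
    );
    my @sorted = sort @files;
    return @sorted;
}

# Demo data: two top-level files and one file hidden in a subdirectory.
my $demo_dir = tempdir(CLEANUP => 1);
my $sub_dir  = File::Spec->catdir($demo_dir, 'sub');
mkdir($sub_dir) or die "failed to create '$sub_dir': $!";
for my $path (
    File::Spec->catfile($demo_dir, 'a.txt'),
    File::Spec->catfile($demo_dir, 'b.txt'),
    File::Spec->catfile($sub_dir,  'c.txt'),
) {
    open my $fh, '>', $path or die "failed to create '$path': $!";
    close $fh;
}

print "$_\n" for top_level_files($demo_dir);
```

      For a single directory level, opendir and readdir are simpler; File::Find pays off once a whole tree has to be walked.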

Re^2: Read files not subdirectories
by wrkrbeee (Scribe) on Jan 30, 2015 at 17:01 UTC

    Could I ask another question, please? The code below runs, but fails to write/save the HTML-stripped text files. With a simple print statement, I've determined that the "second" WHILE statement must return FALSE, as the program never makes it this far. I am grateful for any insight!

    #! /usr/bin/perl -w
    use strict;
    use warnings;
    use lib "c:/strawberry/perl/site/lib";
    use HTML::Strip;

    my $hs = HTML::Strip->new();

    #Where I will store the end results;
    my $write_dir = 'G:\research\sec filings 10k and 10Q\data\filing docs\1993\Clean';

    #Where the files with the HTML tags are located;
    my $files_dir = 'C:\Dwimperl\Perl\1993';

    #Open the directory where the target files with HTML tags are located;
    #Why am I doing this? Stores file names in a directory handle?
    opendir (my $dir_handle, $files_dir) || die "failed to open '$files_dir' <$!>";

    #Loop through each entry/file in the directory;
    #What is readdir doing here? It's not really reading anything;
    #Is it simply advancing us to the next entry?;
    #Seems like the real READ occurs via the OPEN statement below;
    while (my $file = readdir($dir_handle) ) {
        next unless -f $file;
        #next if $file eq '.' or $file eq '..';

        #Open the current file so I can strip the HTML tags ??? ;
        open my $file_handle, '<', $file or die "failed to open '$file' <$!>";

        #Read the current file one line at a time??;
        while (my $line = <$file_handle>) {
            ########The WHILE statement above must return FALSE cuz the program never makes it here;
            #Strip the HTML tags??;
            my $clean_text = $hs->parse( ' ' );
            #Save the clean (no HTML tags) text file in a new file/location??;
            print $write_dir "$file\n";
            $hs->eof;
        }
    }
    close();
    closedir $dir_handle;

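      Is your script located in the same folder as the HTML files? If not, -f $file tests the bare name returned by readdir relative to the current working directory, so it finds nothing there and every entry is skipped. Prepend the directory to get the full path, like this: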

      #!perl
      use strict;
      use warnings;

      my $files_dir = 'C:\Dwimperl\Perl\1993';
      opendir (my $dir_handle, $files_dir);
      while (my $filename = readdir($dir_handle)){
          next unless -f $files_dir.'/'.$filename;
          print "$filename\n";
      }
      poj

        I'm guessing you want to process each line and write it out (untested)

        #!perl
        use strict;
        use warnings;
        use HTML::Strip;

        my $hs = HTML::Strip->new();
        my $files_dir = 'C:\Dwimperl\Perl';
        my $write_dir = 'G:\research\sec filings 10k and 10Q\data\filing docs\1993\Clean';

        opendir (my $dir_handle, $files_dir);
        while (my $filename = readdir($dir_handle)){
            # test the full path, not the bare name returned by readdir
            next unless -f $files_dir.'/'.$filename;
            print "Processing $filename\n";

            open my $fh_in, '<', $files_dir.'/'.$filename
              or die "failed to open '$filename' for read";
            open my $fh_out, '>', $write_dir.'/'.$filename
              or die "failed to open '$filename' for write";

            my $count=0;
            while (my $line = <$fh_in>) {
                my $clean_text = $hs->parse($line);
                print $fh_out "$clean_text\n";
                ++$count;
            }
            $hs->eof;   # reset the parser before the next file
            print "$count lines read from $filename\n";
        }
        poj
        We're close, writes the files to output location, but the files are empty (size 0 kb). Ideas?
        Works! Very grateful for your time and patience with me. You're the best!
        Hi poj, your script will print the file names. Where are we going here?

        Hi poj, corrected a couple of stupid things on my part (e.g., ensuring my portable hard drive is available/plugged in, and actually opening the output file for output). Now gives me a "failed to open" for the output file at line 12. Here is the revised code. I apologize for the hassle.

        #! /usr/bin/perl -w
        use strict;
        use warnings;
        use lib "c:/strawberry/perl/site/lib";
        use HTML::Strip;

        my $hs = HTML::Strip->new();

        #Where I will store the end results;
        my $write_dir = 'F:\research\sec filings 10k and 10Q\data\filing docs\1993\Clean';
        open (my $outfile_hand, '>', $write_dir) || die "failed to open '$write_dir' <$!>";

        #Where the files with the HTML tags are located;
        my $files_dir = 'C:\Dwimperl\Perl';#\1993';

        #Open the directory where the target files with HTML tags are located;
        #Why am I doing this? Stores file names in a directory handle?
        opendir (my $dir_handle, $files_dir) || die "failed to open '$files_dir' <$!>";

        #Loop through each entry/file in the directory;
        #What is readdir doing here? It's not really reading anything;
        #Is it simply advancing us to the next entry?;
        #Seems like the real READ occurs via the OPEN statement below;
        while (my $file = readdir($dir_handle) ) {
            next unless -f $file;
            #next if $file eq '.' or $file eq '..';

            #Open the current file so I can strip the HTML tags ??? ;
            open my $file_handle, '<', $file or die "failed to open '$file' <$!>";

            #Read the current file one line at a time??;
            while (my $line = <$file_handle>) {
                ########The WHILE statement above must return FALSE cuz the program never makes it here;
                #Strip the HTML tags??;
                my $clean_text = $hs->parse( ' ' );
                #Save the clean (no HTML tags) text file in a new file/location??;
                print $outfile_hand "$file\n";
                $hs->eof;
            }
        }
        close();
        closedir $dir_handle;
