
odd behavior with DATA section

by Nkuvu (Priest)
on Jul 22, 2005 at 23:36 UTC (#477382=perlquestion)

Nkuvu has asked for the wisdom of the Perl Monks concerning the following question:

I have a series of spreadsheets and a large directory structure to move them into. I have a list of the filenames for each sheet, along with the final residing directory. Rather than move these all manually, I decided to write a quick script to do this.

The script worked fine, and moved all of the files into the right directories (and created the directories if they didn't exist) but along the way I ran into some errors that I can't explain. Specifically, some null lines seem to have infested my DATA section...

After changing the names to protect the innocent, this is the script I'm using:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my $base_dir = "C:/Documents/perl";

    while (my $line = <DATA>) {
        chomp $line;
        next if $line =~ /^\s*\#/;
        # next if not $line;

        my ($filename, $filepath) = split '\t', $line;

        if (not -d "$base_dir/$filepath") {
            system "c:/cygwin/bin/mkdir -p $base_dir/$filepath";
            print "Made $base_dir/$filepath\n";
        }

        if (not $filename) {
            warn "Filename is blank, filepath is [$filepath], line is [$line]\n";
        }
        else {
            system "mv $base_dir/$filename.xls $base_dir/$filepath";
            print "Moved $base_dir/$filename.xls to $base_dir/$filepath\n";
        }
        print "\n";
    }

    __DATA__
    filename1 some/file/path
    filename2 some/other/file/path
    filename3 yet/another/file/path/oooh/this/one/is/long

The script produces the following output:

    Made C:/Documents/perl/some/file/path
    Moved C:/Documents/perl/filename1.xls to C:/Documents/perl/some/file/path
    Use of uninitialized value in concatenation (.) or string at C:\Documents\perl\ line 15, <DATA> line 2.
    Use of uninitialized value in concatenation (.) or string at C:\Documents\perl\ line 21, <DATA> line 2.
    Filename is blank, filepath is [], line is []
    Made C:/Documents/perl/some/other/file/path
    Moved C:/Documents/perl/filename2.xls to C:/Documents/perl/some/other/file/path
    Use of uninitialized value in concatenation (.) or string at C:\Documents\perl\ line 15, <DATA> line 4.
    Use of uninitialized value in concatenation (.) or string at C:\Documents\perl\ line 21, <DATA> line 4.
    Filename is blank, filepath is [], line is []
    Made C:/Documents/perl/yet/another/file/path/oooh/this/one/is/long
    Moved C:/Documents/perl/filename3.xls to C:/Documents/perl/yet/another/file/path/oooh/this/one/is/long

Note that I can easily fix this behavior by uncommenting the "next if not $line;" line. The question is not how to avoid this error but why it is happening in the first place.

All of the lines are read, and I don't see anything in the script that would cause me to read another line. I don't recall seeing this sort of behavior before, but I do normally include the check against empty lines.

What simple thing am I overlooking here? I SuperSearched for "__DATA__ null" as well as "__DATA__ empty" with no promising results. I've spent too much time trying to diagnose this problem, though, so I'm throwing in the towel and asking for ideas or information.

(by the way, "This is perl, v5.8.7 built for MSWin32-x86-multi-thread")

Update: I don't see the same behavior on my Mac (running OS X 10.4.1). "This is perl, v5.8.6 built for darwin-thread-multi-2level" So I'm totally baffled.

Replies are listed 'Best First'.
Re: odd behavior with DATA section
by GrandFather (Saint) on Jul 22, 2005 at 23:54 UTC

    A trailing new line on the "last" line will generate a last empty line. Stick your blank line detector back in or remove the trailing new line.

    Perl is Huffman encoded by design.
      For future reference, that's only the case for Windows, or it's at least not true for unix.

        You mean the Unix file system strips trailing new lines from a file? Surely that would be a rather obnoxious thing for any file system to do?

        Perhaps I should make it clear that by new line I mean whatever line-end convention is used by the OS you happen to be on (CR for classic Mac OS, LF for *nix, CRLF for DOS/Windows).

      Unfortunately I'm not just getting a blank line at the end of the DATA block. I'm getting it between each and every line in the DATA block. Just the trailing newline I understand, this issue I don't.
Re: odd behavior with DATA section
by davidrw (Prior) on Jul 23, 2005 at 00:16 UTC
    I personally like this for ditching comments and blank lines (but obviously this is a TMTOWTDI item):

    while (my $line = <DATA>) {
        chomp $line;
        $line =~ s/\s*#.*//;          # strip comments
        next unless $line =~ /\S/;    # skip blank lines
        ...
    }

    Also note that for the path creation you can use File::Path:

    use File::Path;
    my $dir = "$base_dir/$filepath";
    do { mkpath($dir); print "Made $dir\n"; } unless -d $dir;

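
The File::Path suggestion can be taken one step further so the script needs no external mkdir or mv at all. What follows is only a sketch under assumptions: the helper name file_into_dir is hypothetical, File::Copy's move stands in for the mv call in the original script, and the demo runs in a scratch directory from File::Temp rather than under C:/Documents/perl.

```perl
use strict;
use warnings;
use File::Path qw(mkpath);
use File::Copy qw(move);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: move "$base_dir/$filename.xls" into
# "$base_dir/$filepath", creating the target directory (and any
# intermediate directories) first.
sub file_into_dir {
    my ($base_dir, $filename, $filepath) = @_;

    my $dir = File::Spec->catdir($base_dir, $filepath);
    mkpath($dir) unless -d $dir;

    my $source = File::Spec->catfile($base_dir, "$filename.xls");
    my $target = File::Spec->catfile($dir, "$filename.xls");
    move($source, $target) or die "Cannot move $source to $target: $!";

    return $target;
}

# Demo in a scratch directory so nothing real is touched.
my $base = tempdir(CLEANUP => 1);
my $sample = File::Spec->catfile($base, 'filename1.xls');
open my $fh, '>', $sample or die "Cannot create $sample: $!";
close $fh;

my $moved = file_into_dir($base, 'filename1', 'some/file/path');
print "Moved to $moved\n";
```

Because the paths are passed to Perl functions instead of being interpolated into a shell command, names containing spaces should not need extra quoting.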
    As for the original issue, I'm not sure where those blanks are coming from. It doesn't seem like it's one trailing line at the end of DATA. To debug further I would first try this and hope the answer becomes apparent:

    warn "Line: ==$line==";
    my ($filename, $filepath) = split '\t', $line;
    warn "filename: '$filename' --- filepath: '$filepath'";

      Thanks for the note on the File::Path. I meant for this to be a quick throwaway script, and knew that the Cygwin mkdir could do this pretty easily. If I had planned for something any more long term than throwaway I would have looked for a solution better suited to re-use. But nice to know what to look for, anyway.

      I did change the script to output the line before the split, and the file and path after. The output is getting long:

      And no apparent answer jumps at me. Another good suggestion, though, and one I hadn't tried.
Re: odd behavior with DATA section
by GrandFather (Saint) on Jul 23, 2005 at 00:34 UTC

    Actual problem seems to be that your "tabs" are actually spaces, probably due to editor settings. Use the following for your split:

    my ($filename, $filepath) = split '\s+', $line;

    unless of course you have spaces in your file names or paths :)

      In this post, maybe. In the file itself, they're actual tabs. I use emacs, and it only untabifies if I tell it to. Which I haven't.

      I did triple check that the DATA lines in question are using real tabs, by the way.

      But just to be sure that I'm sure, I did change the split to use \s+ and it produced the same behavior.

Re: odd behavior with DATA section
by digiryde (Pilgrim) on Jul 23, 2005 at 02:39 UTC

    Have you checked to make sure the file format is what perl as compiled expects? Did you look at the file with a hex editor to make sure the EOLs are as they are supposed to be?

    The null could be because you have what it thinks are two end of lines back to back. They might not display correctly in windows (if it is expecting a *nix type EOL, and cygwin perl can expect that.)

    In the code section, it would not cause problems since blank lines are okay. I know it is not supposed to work like that, but I have had that issue exactly bite me in the past, and there is no way any of us would see it as things are posted to the web page.

    The fact that it works with the line
    #    next if not $line;
    uncommented makes me even more suspicious.

    The last line of your submission, that it works fine on Mac OS, makes me wonder... Did you create the file on something other than Windows? Or use a text editor for something other than Windows? Happy hunting.

    Good Luck.

      After first running into the issue, I double checked line endings and tab characters in UltraEdit, using the view invisible characters bit. I should clarify, this is not Cygwin Perl, it's ActiveState Perl. I have Cygwin installed, but this is called from a Windows command prompt, not a Cygwin shell (and actually Cygwin Perl isn't even installed). The only reason I was using the Cygwin mkdir is so I didn't have to parse out the path and do repeated mkdir calls (the -p switch, if you're not familiar, creates all intermediate directories as necessary).

      The original list of files and directory structures (with the not-so-innocent names) was created by someone else in Excel. I copied the data to emacs where I created the script. For the innocently named script I copied the source code, then typed the DATA lines by hand, still in emacs. So no possibilities of pasting odd characters in there. For the OS X test I copied the source code and DATA from Perlmonks into emacs.

      So the short story is that I'm fairly sure that the line endings are correct for the operating system under test.

      But all that being said, I would not be too surprised to learn that your suspicions are accurate. I just double checked the output of the script, and noticed that the DATA line errors occur on lines 2, 4, 6, et cetera. Yet what I would expect to be line 2 in the DATA section worked fine.

      Of course now that I'm home I don't have the original script files, so I'll have to triple check line endings when I go back to work on Monday. Thanks for the input.

        Gather more data about what perl is doing. Print out the value of $line, paying close attention to the line ending. Save the return value of chomp $line, print it, and see if it is what you expect.

        I have run into issues like that on UltraEdit in the past where characters did not show up in the edit screen that were there in the file. I still use UltraEdit, but I am a little more wary of things like that now.
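
One way to follow the advice about printing $line and the return value of chomp is to render every non-printable character as an escape, so a stray CR or tab cannot hide. This is a minimal sketch: the helper name visible is hypothetical, and the sample line is made up for illustration rather than taken from the real file.

```perl
use strict;
use warnings;

# Hypothetical helper: show anything outside printable ASCII as \xNN,
# so tabs, CRs and LFs become visible in the output.
sub visible {
    my ($text) = @_;
    (my $shown = $text) =~ s/([^\x20-\x7E])/sprintf('\\x%02X', ord $1)/ge;
    return $shown;
}

# Made-up sample line with a real tab and a CRLF ending.
my $line = "filename1\tsome/file/path\r\n";

# chomp returns the number of characters it removed.
my $removed = chomp(my $copy = $line);

print "before chomp: [", visible($line), "]\n";
print "after chomp:  [", visible($copy), "] (removed $removed)\n";
```

In the real script the same two print statements could go at the top of the while loop, with $line read from DATA instead of the sample string.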
Re: odd behavior with DATA section
by converter (Priest) on Jul 23, 2005 at 12:46 UTC

    You really ought to be using __END__ instead of __DATA__. See the SelfLoader POD explanation of the way the __DATA__ is intended to be used.

    Notice that the first warning is issued for the data produced from the second input record. The fact that next if not $line; allows you to avoid the warnings tells us that $line is an empty string (it could not be uninitialized, since this would terminate the while loop). This means that either the string stored in $/ is being read immediately after the end of the first input record, or that your perl build's IO is broken or confused by the record delimiters in the text. What is the value of the $/ variable?

    For testing, on the line before your while() loop, insert the following assignment:

    $/ = '|';

    Change the text after the __DATA__ token to the following (the data text should be on one line):

    __DATA__
    filename1:some/file/path|filename2:some/other/file/path|filename3:yet/another/file/path/oooh/this/one/is/long

    and split with the pattern: /:/

    If the warnings go away, I suspect there is a problem with your record delimiters (LF, CR/LF, etc.) or the default value of $/.
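
The test described above can also be tried without touching the script file, by reading the same pipe-delimited text from an in-memory filehandle. This is a sketch under assumptions: the sub name parse_records is hypothetical, and an in-memory handle stands in for DATA (this needs Perl 5.8 or later).

```perl
use strict;
use warnings;

# Hypothetical helper: read '|'-terminated records from a string and
# split each one into filename and filepath on ':'.
sub parse_records {
    my ($text) = @_;

    local $/ = '|';    # record separator for this sub only
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";

    my @pairs;
    while (my $record = <$fh>) {
        chomp $record;    # chomp removes the current value of $/
        my ($filename, $filepath) = split /:/, $record;
        push @pairs, [$filename, $filepath];
    }
    close $fh;
    return @pairs;
}

my $data = 'filename1:some/file/path|filename2:some/other/file/path|'
         . 'filename3:yet/another/file/path/oooh/this/one/is/long';

for my $pair (parse_records($data)) {
    print "filename: '$pair->[0]' --- filepath: '$pair->[1]'\n";
}
```

Since no CR or LF appears anywhere in the data, a clean run here while the original script still warns would point at the line endings in the file rather than at the loop itself.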

      You really ought to be using __END__ instead of __DATA__. See the SelfLoader POD explanation of the way the __DATA__ is intended to be used.

      Maybe you can explain this better. I read the SelfLoader POD, and it seems to indicate that __END__ indicates the end of any subroutines in the __DATA__ section. Sort of. It seems to be talking about using the __DATA__ section in packages other than main (I'm referring to sections of the POD that say "works just like __END__ in main"). So if I simply replace __DATA__ with __END__ I can't read from the section. while (my $line = <END>) doesn't work at all. If I add the __END__ token after the __DATA__ then __END__ is read in as a line.

      The SelfLoader documentation, as far as I can tell, is used to replace the AutoLoader to be able to do something. I don't write packages, so I am not understanding a lot of the POD. But my best guess is that the __DATA__ section is used for subroutines that might not be called, so you can load them only when they are used. Which doesn't do anything for me, since I'm not writing packages, I'm tossing some input onto the end of my script. I could put the information into another file, and read that, but I don't really see the point of that.

      So tell me what it is that I am missing, please.

        The DATA filehandle is always used to access the text after the token, no matter if it is the __END__ or the __DATA__ token. The perldata man page gives a few important details that you should be aware of.

        I had always believed that __END__ should be used in the top-level script, with __DATA__ used only in code compiled via require or do. After re-reading the perldata man page I am of the opinion that the __DATA__ token is similar to the __END__ token but with extra features and that its use in the top-level script is fine.

        Someone please correct me if I'm wrong.

      Changed the text after the __DATA__ token like your example. The warnings went away. Undid the changes, the warnings did not reappear.

      So frustration of frustrations, the file seems to be "fixed." I can't reproduce the error, even after recreating the file exactly the same way I did on Friday. If I did not have documented evidence of the output I'd categorize this as temporary insanity.

        Somewhere along the way your text was probably edited with both a Windows editor and a Mac editor, leaving alternating records with an odd combination of CR and LF on the ends, creating "invisible" empty records in the process. Merging the lines together probably removed the offending characters.
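
The original file is gone, so the exact byte sequence can't be checked, but the effect is easy to imitate. The sketch below assumes one possible mix (each record ending in LF followed by CRLF) and approximates a Windows text-mode read by collapsing CRLF to LF before splitting on LF. The sub names read_records and clean_records are hypothetical.

```perl
use strict;
use warnings;

# Read LF-terminated records from a string, roughly as a Windows
# text-mode read would: CRLF pairs collapse to LF first.
sub read_records {
    my ($text) = @_;
    (my $translated = $text) =~ s/\r\n/\n/g;

    open my $fh, '<', \$translated or die "Cannot open in-memory handle: $!";
    binmode $fh;    # no further newline translation on this handle
    local $/ = "\n";

    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        push @records, $line;
    }
    close $fh;
    return @records;
}

# Tolerant variant: drop stray CRs and skip records with no visible text.
sub clean_records {
    my ($text) = @_;
    my @kept;
    for my $record (read_records($text)) {
        $record =~ tr/\r//d;
        next unless $record =~ /\S/;
        push @kept, $record;
    }
    return @kept;
}

# Assumed sample: every record ends in LF followed by CRLF.
my $sample = "filename1\tsome/file/path\n\r\n"
           . "filename2\tsome/other/file/path\n\r\n";

my @raw   = read_records($sample);
my @clean = clean_records($sample);
printf "%d raw records, %d after cleaning\n", scalar @raw, scalar @clean;
# prints "4 raw records, 2 after cleaning"
```

With this sample the empty records land at positions 2 and 4, which matches the <DATA> line numbers in the warnings from the original post.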
