
Re^3: Regular expressions across multiple lines

by afoken (Abbot)
on Apr 24, 2016 at 19:04 UTC (#1161388)


in reply to Re^2: Regular expressions across multiple lines
in thread Regular expressions across multiple lines

chomp() is multi-platform. It will delete <CR><NL> and <NL>, even on Windows. These line endings even if mixed will not matter.

Well, it may look so, but what really happens is different. See chomp:

This safer version of chop removes any trailing string that corresponds to the current value of $/ (also known as $INPUT_RECORD_SEPARATOR in the English module).

Note: Not a single word about the CR or LF control characters, the CR-LF pair, or NL (newline).
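
A minimal sketch of that documented behaviour, using plain strings instead of files. The helper name chomp_with is made up for this illustration; only chomp and $/ come from Perl itself.

```perl
use strict;
use warnings;

# chomp removes one trailing copy of whatever $/ currently holds and
# nothing else; it has no built-in notion of CR or LF.
# chomp_with is a hypothetical helper, not part of Perl.
sub chomp_with {
    my ($separator, $string) = @_;
    local $/ = $separator;    # restored automatically when the sub returns
    chomp $string;
    return $string;
}

my $default = chomp_with("\n",   "text\r\n");  # only "\n" is removed, "\r" stays
my $crlf    = chomp_with("\r\n", "text\r\n");  # the whole CR-LF pair is removed
my $custom  = chomp_with("END",  "textEND");   # any string can act as separator
my $nomatch = chomp_with("\n",   "text\r");    # no trailing "\n": left unchanged
```

With the default $/ of "\n", a trailing CR survives, which is what the demos further down show for non-native files.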

The input record separator $/ is documented, too; it defaults to an abstract "newline" character:

The input record separator, newline by default. This influences Perl's idea of what a "line" is. [...] See also Newlines in perlport.

Now, "newlines". Perl has inherited them from C, which uses two modes for accessing files: text mode and binary mode. In text mode, the system's native line ending, whatever that may be, is translated from or to a logical newline, also known as "\n". In binary mode, file content is not modified during read or write. C has been defined in a way that the logical newline is identical to the native line ending on unix, LF. So, there is no difference between text mode and binary mode ON unix.

Quoting Newlines in perlport:

In most operating systems, lines in files are terminated by newlines. Just what is used as a newline may vary from OS to OS. Unix traditionally uses \012, one type of DOSish I/O uses \015\012, Mac OS uses \015, and z/OS uses \025.

Perl uses \n to represent the "logical" newline, where what is logical may depend on the platform in use. In MacPerl, \n always means \015. On EBCDIC platforms, \n could be \025 or \045. In DOSish perls, \n usually means \012, but when accessing a file in "text" mode, perl uses the :crlf layer that translates it to (or from) \015\012, depending on whether you're reading or writing. Unix does the same thing on ttys in canonical mode. \015\012 is commonly referred to as CRLF.

What happens here is that Perl has reasonable defaults for text handling: it opens files (including STDIN, STDOUT, STDERR) in text mode by default, and $/ defaults to a single logical newline ("\n"). So native line endings are translated to "\n" on input, and chomp then just removes that "\n", on any platform.
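
The translation step can be tried on any platform by asking for the :crlf layer explicitly. The following is only a sketch: it reads from an in-memory file so that no real file is needed, and read_lines is a hypothetical helper name.

```perl
use strict;
use warnings;

my $data = "one\r\ntwo\r\n";   # CRLF-terminated bytes, as in a Windows text file

# Read the same bytes from an in-memory file with a given layer,
# then chomp with the default $/ of "\n".
# read_lines is a hypothetical helper for this illustration.
sub read_lines {
    my ($layer) = @_;
    open my $fh, "<$layer", \$data or die "cannot open in-memory file: $!";
    my @lines = <$fh>;
    close $fh;
    chomp @lines;
    return @lines;
}

my @raw  = read_lines(':raw');   # no translation: each line keeps its "\r"
my @text = read_lines(':crlf');  # CR-LF becomes "\n" first, chomp removes it
```

The difference between @raw and @text comes entirely from the I/O layer; chomp does the same thing in both cases.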

When reading text files using a non-native line ending, things will usually go wrong:

/tmp/demo>file *.txt
linux-file.txt:   ASCII text
mac-file.txt:     ASCII text, with CR line terminators
windows-file.txt: ASCII text, with CRLF line terminators
/tmp/demo>perl -MData::Dumper -E '$Data::Dumper::Useqq=1; for $fn (@ARGV) { open $f,"<",$fn or die; @lines=<$f>; chomp @lines; say "$fn:"; say Dumper(\@lines); }' *.txt
linux-file.txt:
$VAR1 = [
          "A simple file generated",
          "on Linux with Unix",
          "line endings."
        ];

mac-file.txt:
$VAR1 = [
          "A simple file generated\ron Windows with Old Mac\rline endings.\r"
        ];

windows-file.txt:
$VAR1 = [
          "A simple file generated\r",
          "on Windows with Windows\r",
          "line endings.\r"
        ];

/tmp/demo>

Of course, it depends on the system you are using:

H:\tmp\demo>perl -MWin32::autoglob -MData::Dumper -E "$Data::Dumper::Useqq=1; for $fn (@ARGV) { open $f,'<',$fn or die; @lines=<$f>; chomp @lines; say qq<$fn:>; say Dumper(\@lines); }" *.txt
linux-file.txt:
$VAR1 = [
          "A simple file generated",
          "on Linux with Unix",
          "line endings."
        ];

mac-file.txt:
$VAR1 = [
          "A simple file generated\ron Windows with Old Mac\rline endings.\r"
        ];

windows-file.txt:
$VAR1 = [
          "A simple file generated",
          "on Windows with Windows",
          "line endings."
        ];

H:\tmp\demo>

So, chomp is NOT cross-platform. It can handle input from native text files on all platforms out of the box. But if you have to work with ASCII files with mixed line endings (CR, LF, CR-LF, LF-CR), chomp can't work reliably. This is not chomp's fault, nor is it perl's fault.
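
For such mixed input, one possible workaround is to skip chomp and split the raw data on every known line ending. This is a sketch under assumptions, not a general solution: split_lines is a made-up name, the buffer is assumed to have been read without translation (for example after binmode), and a sequence like LF CR is treated as one line ending although it could also be an LF line followed by an empty CR line.

```perl
use strict;
use warnings;

# Split a buffer into lines, accepting CR-LF, LF-CR, lone CR and lone LF.
# The two-character endings come first in the alternation so that a pair
# is not counted as two separate line endings.
# split_lines is a hypothetical helper for this illustration.
sub split_lines {
    my ($buffer) = @_;
    return split /\r\n|\n\r|\r|\n/, $buffer;
}

my @lines = split_lines("unix\nwindows\r\noldmac\rodd\n\rlast\n");
```

Note that split drops trailing empty fields, so the final line ending does not produce an empty last element, but genuinely empty lines at the end of the data are dropped as well.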

Alexander

--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Replies are listed 'Best First'.
Re^4: Regular expressions across multiple lines
by Marshall (Abbot) on Apr 25, 2016 at 00:17 UTC
    This looks mainly right, but with some quibbles.

    1) Correct, the standard text line endings are:
    Unix: <LF> - Line Feed
    Windows: <CR><LF> - Carriage Return, Line Feed
    Network Socket: <CR><LF> - Carriage Return, Line Feed
    Old Mac: <CR> - Carriage Return.

    That a standard network socket (of course even on Unix) uses "Windows" line endings may be news to some.

    2) The chomp description is not 100% clear. When reading in text mode and with the default input record separator of "\n", chomp() will remove any of the line endings that "\n" could mean on any of these 3 platforms between Unix and Windows. The C function getline() will work similarly. Reading a Windows file on Unix will work fine with this text oriented C read function.

    3) Some ancient Unix functions like lp (line print) will not work with Windows line endings. Perl is fine, but lp not. In that case: while(<>){chomp;print;} will set things right. I have used this many times on Unix to convert a mixed file to <LF> endings and vice versa on Windows to convert to <CR><LF>. Although my Windows programs just don't seem to care.

    4) I don't know how these test cases were generated. There is no way to do that without being in Perl bin mode or writing a C program.

    Update: well it appears that Perl doesn't like old Mac endings on my Windows XP machine. This does work with the <LF> ending. So something like "works between Unix and Windows" may be closer to the truth (dual platform) rather than "multi-platform". On Unix, Perl has to be able to read from both hard disk files and network sockets which have different line endings.

    #!/usr/bin/perl
    use warnings;
    use strict;

    open OUT, '>', "unixending.txt" or die "$!";
    binmode OUT;
    print OUT pack "C8", 0x41,0x42,0x43,0x0A,0x44,0x45,0x46,0x0D;
    close OUT;

    open IN, '<', "unixending.txt" or die "$!";
    while (<IN>)
    {
       chomp;
       print "\"$_\"\n";
    }
    __END__
    "ABC"   ## fine for Unix <LF> 0x0A
    "DEF    ## didn't work for old MAC <CR> 0x0D
    "
      Perl (the compiler) has no problem reading Windows or Unix line endings in scripts on any platform.

      Perl programs may have problems when reading a Windows-generated file under Unix or a Unix-generated file under Windows, because Perl applies the conventions of the platform on which it executes and does not know the file has been generated on another OS.

      If you transfer a file by FTP in ASCII mode, FTP will do the conversion for you. If you use binary mode or SFTP, the conversion will not occur and you may end up with problems.

      As for chomp, I use it when I have decent control over where the file has been generated (especially if it is a file I previously generated). When the file is coming from some outside source, I usually use a tr/// or a regex s/// to safely remove line endings.

      > When reading in text mode and with the default input record separator of "\n" , chomp() will remove any of the line endings that "\n" could mean

      chomp is not related to reading in any mode. It just gets a string and changes it. readline might translate the line ending depending on the :crlf IO-layer.

      $ perl -we '$/ = "\n"; @s = ("1\r\n", "2\n\r", "3\n", "4\r"); chomp @s; print @s' | xxd
      0000000: 310d 320a 0d33 340d                      1.2..34.

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Just for completeness:

      4) I don't know how these test cases were generated.

      The windows text file was generated using the Notepad application on Windows 7, on a Samba share mapped as drive H: from a Slackware Linux 14.1 server. The Linux file was generated using joe on the Linux server. The old mac file was generated by using Notepad++ on the Windows machine (I have no old Mac). Re^5: Regular expressions across multiple lines shows the hexdumps of the files.

      There is no way to do that without being in Perl bin mode or writing a C program.

      With my setup, there is a way to generate all three files on Linux, without using binmode. This trick abuses the fact that there is absolutely no difference between text mode and binary mode on unix:

      /tmp/demo2>cat three-os.pl
      #!/usr/bin/perl
      use strict;
      use warnings;

      open OUT,'>','unix.txt' or die "unix.txt: $!";
      print OUT "line 1\x0Aline 2\x0Aline 3\x0A";
      close OUT;
      open OUT,'>','oldmac.txt' or die "oldmac.txt: $!";
      print OUT "line 1\x0Dline 2\x0Dline 3\x0D";
      close OUT;
      open OUT,'>','windows.txt' or die "windows.txt: $!";
      print OUT "line 1\x0D\x0Aline 2\x0D\x0Aline 3\x0D\x0A";
      close OUT;
      exec "file *.txt" or die "exec failed: $!";
      /tmp/demo2>perl three-os.pl
      oldmac.txt:  ASCII text, with CR line terminators
      unix.txt:    ASCII text
      windows.txt: ASCII text, with CRLF line terminators
      /tmp/demo2>

      This won't work on Windows, because for C and Perl on Windows, \n and \x0A are equal. Text mode translation then happens, and every \x0A is replaced with CRLF. Running the same script on Windows (again using the Samba share) will complain about a missing "file" utility and give this result:

      /tmp/demo2>file *.txt
      oldmac.txt:  ASCII text, with CR line terminators
      unix.txt:    ASCII text, with CRLF line terminators
      windows.txt: ASCII text, with CRLF, CR line terminators
      /tmp/demo2>od -tx1 -c windows.txt
      0000000  6c  69  6e  65  20  31  0d  0d  0a  6c  69  6e  65  20  32  0d
                l   i   n   e       1  \r  \r  \n   l   i   n   e       2  \r
      0000020  0d  0a  6c  69  6e  65  20  33  0d  0d  0a
               \r  \n   l   i   n   e       3  \r  \r  \n
      0000033
      /tmp/demo2>

      The output from the file utility is a little bit misleading. The lines in windows.txt are terminated by CR CR LF, as can be seen in the output of od.

      Alexander

Re^4: Regular expressions across multiple lines
by Discipulus (Monsignor) on Apr 24, 2016 at 19:10 UTC
    ++afoken you have the karma of exhaustiveness! i must bookmark as newline gory details

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
