http://www.perlmonks.org?node_id=1014740

perlhappy has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have a quick question about how Perl behaves when reading from STDIN, specifically in one particular piece of code.

Here is the code:

 cat input_file.fq | perl -ne '$s=<>;<>;<>;chomp($s);print length($s)."\n";' > output.txt

input_file.fq is a FASTQ file, a standard format for storing biological sequence data.

e.g.

@HWI-EAS283_0004_FC:1:1:1321:1118#0/1
TTGCTCAGCAGGTTCAACTGCAGGTTGCCCAGGACTTTAC
+HWI-EAS283_0004_FC:1:1:1321:1118#0/1
gg/fgag_ffgcfgeffafSKd\\adfRffff]fa[fffaf
@HWI-EAS283_0004_FC:1:1:1399:1117#0/1
CTTGACGATTCCCCGCAGGCTGTTCCCGCGGGCCGCAATG
+HWI-EAS283_0004_FC:1:1:1399:1117#0/1

Every line beginning with '@' is the ID for the next 3 lines. The second line is the sequence, a string of letters, typically A, T, C and G. The line beginning with '+' just repeats the ID, and the fourth line is the last line of the record. The pattern then repeats for each new 4-line record.

Basically, the above code gets the length of the sequence (ATCG) line for every record, which is great, but I don't understand the behaviour of the $s=<>;<>;<>; part of the code.

Could anyone explain what it's doing, and how it knows to look only at the correct line (which will be line number 2, 6, 10, 14, 18 etc.)? I've played around with this on different file formats and can't figure it out.

Any advice would be greatly appreciated.

Re: command line perl reading from STDIN
by choroba (Cardinal) on Jan 22, 2013 at 17:32 UTC
    <> is a nicer name for readline. In scalar context, as in your code, it reads one line from the input. If you do not assign the returned value to a variable, the line is skipped.
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      cat input_file.fq | perl -ne '$s=<>;<>;<>;chomp($s);print length($s)."\n";' > output.txt
      OK, but how would you describe, step by step, what is going on in the above? From your explanation I thought it would either assign $s on every line, or skip 2 lines, assign the third line to $s, and repeat.
        Actually, I think I've just worked it out in my head. It took a bit of thinking, but this is what I think it is doing:
        The first line is read, but because that <> (before $s) is unassigned, it is skipped. Then the second line is read and assigned to $s, and the rest of the code works on it. Then the third line is read, but because <> is unassigned, it is skipped; the fourth line is read and the same happens. It restarts with the 5th line, which is again skipped because that <> is unassigned; the 6th line is read, assigned to $s, and processed again.
        This continues until the end of the file.

        If this is incorrect, let me know. Otherwise I hope this helps anyone else who might look at it. Thanks for your help, choroba.
Re: command line perl reading from STDIN
by talexb (Chancellor) on Jan 22, 2013 at 17:56 UTC
      cat input_file.fq | \
      perl -ne '$s=<>;<>;<>;chomp($s);print length($s)."\n";' > output.txt

    Just a stylistic note, but this can be restated as the following:

      perl -ne '$s=<>;<>;<>;chomp($s);print length($s)."\n";' \
      <input_file.fq >output.txt

    Unless I'm dumping the contents of a file to the console, I don't use cat that much; head, tail and less are handier.

    Alex / talexb / Toronto

    "Groklaw is the open-source mentality applied to legal research" ~ Linus Torvalds

      Yeah, typically I do the same. However, in this case, although both commands do the same thing, there is a massive difference in run time.

      The file that I'm using is 107,259,832 lines long, and the other 63 files I have are between 100 million and 200 million lines long. Running the original command (cat piped into perl) with the output piped to just 'wc -l' took about a minute. With the change to your command structure, it has taken about 15 minutes so far and is still running.

      I'm not entirely sure why this is the case (probably something to do with how perl handles files vs STDIN), but I thought you should be aware of it, especially as it may be typical for anyone working with files hundreds of millions of lines long.
        This is strange. I cannot replicate the behaviour with files of millions of lines. Are you sure there are no other factors involved?
        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: command line perl reading from STDIN
by AnomalousMonk (Archbishop) on Jan 23, 2013 at 00:48 UTC

    Sometimes it's useful to let Perl tell you what it thinks about the code you give it to execute (e.g., "where does the fourth line read come from?"):

    >perl -MO=Deparse -ne "$s=<>;<>;<>;chomp($s);print length($s).\"\n\"; "
    LINE: while (defined($_ = <ARGV>)) {
        $s = <ARGV>;
        <ARGV>;
        <ARGV>;
        chomp $s;
        print length($s) . "\n";
    }
    -e syntax OK

    Use -MO=Deparse,-p for even gorier details. See B::Deparse and O.

      Thanks. That's really great... It really helps with understanding what's actually going on.