http://www.perlmonks.org?node_id=1041252

rocketperl has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm trying to use a regular expression to capture the digits in my data, which is stored in a file containing alphanumeric data, and write the output to a different file. I am not getting any errors, and the size of my output file has increased after execution, but I do not see any data in it.
    #!/usr/bin/perl
    open (FILE, 'name.tsv');
    open (OFILE, '>probe_dist.tsv');
    while (<FILE>) {
        my $name = (FILE=~ m/:(\d+)/);
        print OFILE "$name\n";
    }
    close (FILE);
    exit;
my input data contains rows like this below
    chr1:4775792-4775851
    chr1:4775842-4775901
    chr1:4775852-4775911
    chr1:4775902-4775961
    chr1:4775952-4776011
    chr1:4776002-4776061
    chr1:4776052-4776111
    chr1:4776102-4776161
    chr1:4776212-4776271
    chr1:4776252-4776311
    chr1:4776302-4776361
    chr1:4776352-4776411
    chr1:4776402-4776461
    chr1:4776452-4776511
    chr1:4776502-4776561
    chr1:4777032-4777091
    chr1:4777082-4777141
My idea is to print the first occurrence of all the continuous digits in my file. pls help

Replies are listed 'Best First'.
Re: handling files using regular expression
by marto (Cardinal) on Jun 28, 2013 at 13:36 UTC

    Earlier today various people pointed out that you should use strict and warnings and check that you're opening files without error, and you accepted that this was the solution to the problem in question. Is there a reason why you've either ignored or forgotten this advice after only a few short hours?

      Yes, sorry, I did forget to include that. But in spite of including use strict and warnings, I did not get any warning, and my output file was not updated either. Kindly help me.
      #!/usr/bin/perl
      use strict;
      use warnings;
      my $input;
      my $start;
      open (FILE, 'name.tsv');
      open (OFILE, '>probe_dist.tsv');
      while (<FILE>){
          ($input)= split ("\t");
          $start= $input;
      }
      while (<FILE>) {
          $start =~ m/:(\d+)/;
          print OFILE "$start\n";
      }
      close (FILE);
      exit;

        Just a few comments:

        • Your first while loop already runs through the whole file while not doing anything useful. Therefore the second while loop has nothing to do. Remove the first while loop altogether.
        • The statement while (<FILE>) of your second loop (now the only one...) will read a line at a time and assign it to $_. So you need to work with $_ within the loop block.
        • Your line $start =~ m/:(\d+)/; is applying the regex to the variable $start but you need to apply it to $_. So you might say $_ =~ m/:(\d+)/; which would work. It would be more Perlish to just say /:(\d+)/; as this would be applied to $_ by default.
        • The result of this match is that what was found in (...) is assigned to $1 so you need to write that to your file: print OFILE "$1\n";.
        This is probably not all but if you re-read the earlier thread you should find more best practices.
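The points above can be put together roughly as follows. This is a minimal sketch, not the poster's final script: the sub name first_starts is made up for illustration, and an in-memory filehandle stands in for name.tsv so the example runs without any files.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample rows copied from the question; a real script would read name.tsv.
my $data = "chr1:4775792-4775851\nchr1:4775842-4775901\n";

# Hypothetical helper: collect the digits that follow the first ':' on
# each line read from the given filehandle.
sub first_starts {
    my ($fh) = @_;
    my @starts;
    while (<$fh>) {                     # each line is assigned to $_
        push @starts, $1 if /:(\d+)/;   # regex applies to $_; capture is in $1
    }
    return @starts;
}

# Assumption: an in-memory filehandle replaces the real input file.
open my $in, '<', \$data or die "Cannot open in-memory data: $!";
my @starts = first_starts($in);
close $in;

print "$_\n" for @starts;
```

There is only one loop over the input, the match is applied to $_, and $1 is used only when the match succeeded.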

        Firstly, when posting asking for help please post the actual code you're running. This code differs significantly from the code you have initially provided. Also, the advice in the post I linked to wasn't restricted to adding use strict; use warnings;, we also told you to use open properly, and to check for errors.
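As a rough illustration of "use open properly, and check for errors": the sketch below uses three-argument open with lexical filehandles and dies with $! when an open fails. The temporary directory and the seeded sample rows are assumptions made only so the example can run anywhere; a real script would open name.tsv and probe_dist.tsv directly.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Assumption: work in a throwaway directory so the sketch is self-contained.
my $dir      = tempdir(CLEANUP => 1);
my $in_path  = "$dir/name.tsv";
my $out_path = "$dir/probe_dist.tsv";

# Seed a small input file with two rows from the question.
open my $seed, '>', $in_path or die "Cannot write $in_path: $!";
print {$seed} "chr1:4775792-4775851\nchr1:4775842-4775901\n";
close $seed or die "Cannot close $in_path: $!";

# Three-argument open, lexical handles, and an error check on every open.
open my $in,  '<', $in_path  or die "Cannot open $in_path for reading: $!";
open my $out, '>', $out_path or die "Cannot open $out_path for writing: $!";

while (my $line = <$in>) {
    print {$out} "$1\n" if $line =~ /:(\d+)/;
}

close $in;
close $out or die "Cannot close $out_path: $!";   # write errors can surface here

# Opening a file that does not exist returns false and sets $!.
my $opened = open my $missing, '<', "$dir/no_such_file.tsv";
print "open failed as expected: $!\n" unless $opened;
```

Without the "or die" checks, a misspelled file name would fail silently and the loop would simply never run.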

        Here you have two while loops for some reason, I've no idea why. Also you're splitting on tab but to one variable, and the input data you provide doesn't have any tabs at all.

        I'd suggest (again) that you take the time to learn the basics, and if you're going to ask for advice which you claim resolves your problem, please either try to remember it or keep notes so that you can apply the same knowledge in future work.

        while (<FILE>){
            ... something ...
        }
        while (<FILE>) {
            ... something else ...
        }

        That second loop will never execute. Can you tell why?
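For anyone who wants to check their answer, here is a small experiment. It assumes an in-memory filehandle in place of FILE, and the counter names are made up for illustration. It counts how many lines each loop sees, then rewinds the handle with seek and counts again.

```perl
use strict;
use warnings;

my $data = "line one\nline two\n";
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

my ($first_pass, $second_pass, $third_pass) = (0, 0, 0);

while (<$fh>) { $first_pass++ }     # reads every line up to end of file
while (<$fh>) { $second_pass++ }    # handle is already at end of file

seek $fh, 0, 0 or die "Cannot rewind: $!";   # move back to the start
while (<$fh>) { $third_pass++ }     # lines are available again

close $fh;
print "first=$first_pass second=$second_pass after_seek=$third_pass\n";
```

In the program from the question there is no need to rewind at all; doing the work in a single loop is simpler.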

Re: handling files using regular expression
by ww (Archbishop) on Jun 28, 2013 at 13:58 UTC

    What the heck does " the first occurrence of all the continuous digits" mean?

    That aside, your attempt to capture the digits won't work as is; you need something like:

    if ( $_ =~ m/:(\d+)/ ) {
        my $name = $1;
        print OFILE "$name\n";
    }

    And, of course, it would be a good idea to close the OFILE as well as your input... and to heed -- as mentioned before -- the advice your last SOPW elicited.
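A short sketch of why the original assignment probably wrote blank lines. It rests on two assumptions: that without strict the bareword FILE on the left of =~ was treated as the plain string "FILE", and that the variable names below are purely illustrative. A match assigned to a scalar yields only success or failure, while a match assigned to a list yields the captured groups.

```perl
use strict;
use warnings;

my $line = "chr1:4775792-4775851";

# Scalar assignment: the match returns true (1) or false (empty string).
my $scalar_ctx = ($line =~ /:(\d+)/);

# List assignment: the match returns the captured groups.
my ($list_ctx) = ($line =~ /:(\d+)/);

# Stand-in for the original "FILE =~ ..." under the assumption above:
# the string "FILE" contains no ':' so the match fails and yields ''.
my $bareword_like = ("FILE" =~ /:(\d+)/);

print "scalar=$scalar_ctx list=$list_ctx bareword_like='$bareword_like'\n";
```

Under those assumptions, printing "$name\n" in the original loop would emit one empty line per input row, which would explain an output file that grows but appears to contain no data.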

    Downvoted this node.

    If you didn't program your executable by toggling in binary, it wasn't really programming!

Re: handling files using regular expression
by sundialsvc4 (Abbot) on Jun 28, 2013 at 15:37 UTC

    The first thing I’d say about this program is ... “always use real variables,” not “implied” stuff like $_ which can very easily get away from you.   Then, “start with a test case.” One example string, that you can verify gets correctly parsed by the regular-expression you intend to use.   You can even use a “Perl one-liner” like this:

    perl -e 'my $str = "chr1:4777082-4777141";
      my ($foo) = $str =~ /([0-9]+)/; print "foo is $foo\n";'
    foo is 1
    ... oops, that's not right ...
    mike$ perl -e 'my $str = "chr1:4777082-4777141";
      my ($foo) = $str =~ /[:]([0-9]+)/; print "foo is $foo\n";'
    foo is 4777082
    ... correct.

    Now, write your program, something like:

    while (my $str = <INFILE>) {
        my ($foo) = $str =~ /[:]([0-9]+)/;    # WE TESTED THIS
        die "Something's Wrong with $str!"
            unless defined $foo;              # NEVER ASSUME, NEVER ASSUME ("0" is a valid value)
        print OUTFILE "FILE=$foo\n";          # NOTICE DOUBLE-QUOTES
    }

    We used the one-liners to verify the actual regular-expression parsing, to quickly get it right (and to uncover a subtle bug in the first attempt), and then wrote code that is, above all else, clear, to do the actual work.

    Within that program, we also added a die statement that makes the program test its assumption that every single record of the input will be handled correctly. We could make this test even more aggressive if we knew that every record of the input file should contain seven digits, preceded by a ":" and followed by a "-", by adding the {7} quantifier and a trailing [-] to the pattern:
    /[:]([0-9]{7})[-]/
    Thus, the very fact that the program runs normally to completion, with a very stringent regular-expression that must be matched every time, is a strong indication that both the input data and the resulting output are correct.
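A minimal sketch of that stricter check, assuming (as the paragraph above does) that every record really carries exactly seven digits between ":" and "-". The sub name parse_start is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the seven-digit start coordinate, or die
# if the record does not have the expected shape.
sub parse_start {
    my ($record) = @_;
    # A list assignment evaluates to the number of values matched, so a
    # failed match gives 0 here and triggers the die, while a captured
    # value of "0" would still count as success.
    my ($start) = $record =~ /:([0-9]{7})-/
        or die "Unexpected record format: $record\n";
    return $start;
}

print parse_start("chr1:4777082-4777141"), "\n";

# A record with too few digits is rejected instead of silently accepted.
my $ok = eval { parse_start("chr1:477-4777141"); 1 };
print "rejected: $@" unless $ok;
```

If the seven-digit assumption does not hold for the real data, the quantifier would need to be loosened, for example back to [0-9]+.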