PerlMonks  

Re^3: Regular expressions across multiple lines

by Marshall (Canon)
on Apr 24, 2016 at 17:11 UTC ( [id://1161368]=note )


in reply to Re^2: Regular expressions across multiple lines
in thread Regular expressions across multiple lines

Is this an ASCII file, or does it contain multi-byte character encodings? A PC that is "too slow" is not a likely cause; some other issue is afoot here, possibly a Unicode issue. Can you cut this down to a simple example, (a) this works and (b) this doesn't work, without huge files? The actual code would also be VERY useful.

Re^4: Regular expressions across multiple lines
by abcd (Novice) on Apr 24, 2016 at 17:27 UTC
    I don't know much about file formats, but the input file I am using is a FASTA file, which stores DNA sequences. I am a beginner doing this as a grad school project, so this is pretty much the actual code and there isn't much else to it. The regular expression is fine: it gives the desired results when I use it on a test file with a few lines, but it doesn't work on larger files.

    To give more context on the actual problem: the 10 random characters are random barcodes flanked by a specific sequence (the abc and def in my example code). Once I get the 5 characters (i.e. DNA bases) before and after this fragment, I will use them to figure out which gene the random barcode was inserted into. In this way I will have each gene associated with a unique barcode.
      I looked at the FASTA format and it is ASCII; however, there could be some other issue with the program that generated this file. Can you open the original file in a text editor, e.g. WordPad, and see the characters displayed properly? chomp() should not affect this. This "I see bizarre characters in the text editor" sounds like a big clue to me that the format is wrong, and that your small example works because it is ASCII.
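
One way to test the "is it really ASCII?" question without relying on how an editor renders the file is to count the bytes that fall outside printable ASCII. The following is only a minimal sketch: the sub name `unexpected_bytes` and the command-line usage are made up for illustration and are not from the original code. If it reports 0x0D, the file has carriage returns (Windows-style line endings); if it reports many 0x00 bytes, the file may be UTF-16 rather than ASCII.

```perl
use strict;
use warnings;

# Count every byte that is neither printable ASCII (0x20-0x7E) nor a line feed.
# Returns a hash: byte value formatted as "0xNN" => number of occurrences.
sub unexpected_bytes {
    my ($fh) = @_;
    binmode $fh;                       # read raw bytes, no translation
    my %count;
    while (my $line = <$fh>) {
        while ($line =~ /([^\x20-\x7E\n])/g) {
            $count{ sprintf('0x%02X', ord($1)) }++;
        }
    }
    return %count;
}

# Hypothetical usage: perl check_bytes.pl reads.fasta
my $path = shift @ARGV;
if (defined $path) {
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my %count = unexpected_bytes($fh);
    close $fh;
    print "$_ seen $count{$_} time(s)\n" for sort keys %count;
    print "Only printable ASCII and line feeds found\n" unless %count;
}
```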

      Update: there are a bunch of modules for working with this BIO FASTA format. Search CPAN for "FASTA". But this sounds easy enough to figure out without a module.
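
A module-free approach could look roughly like the sketch below. It is not the original poster's code: the sub names `read_fasta` and `find_barcodes` are invented here, and 'abc' and 'def' are just the placeholder flanking sequences from the example in this thread. Each record's sequence lines are joined one record at a time, so a barcode that spans a line break is still found without building one very long line for the whole file.

```perl
use strict;
use warnings;

# Placeholders standing in for the real flanking sequences (assumption).
my ($left_flank, $right_flank) = ('abc', 'def');

# Read FASTA text from a filehandle and return a list of [header, sequence]
# pairs, with each record's sequence lines joined into a single string.
sub read_fasta {
    my ($fh) = @_;
    my (@records, $header, $seq);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;          # strip LF or CRLF line endings
        if ($line =~ /^>(.*)/) {       # a '>' line starts a new record
            push @records, [$header, $seq] if defined $header;
            ($header, $seq) = ($1, '');
        }
        elsif (defined $header) {
            $seq .= $line;
        }
    }
    push @records, [$header, $seq] if defined $header;
    return @records;
}

# Return [before, barcode, after] for every hit in one joined sequence:
# 5 characters, left flank, 10-character barcode, right flank, 5 characters.
sub find_barcodes {
    my ($seq) = @_;
    my @hits;
    while ($seq =~ /(.{5})\Q$left_flank\E(.{10})\Q$right_flank\E(.{5})/g) {
        push @hits, [$1, $2, $3];
    }
    return @hits;
}

# Small in-memory demonstration; the barcode deliberately spans a line break.
my $demo = ">read1\nTTTTTGGGGGabcACGTA\nCGTACdefCCCCCAAAAA\n";
open my $fh, '<', \$demo or die "Cannot open in-memory file: $!";
for my $rec (read_fasta($fh)) {
    my ($name, $seq) = @$rec;
    print "$name: @$_\n" for find_barcodes($seq);
}
close $fh;
```

For the demonstration record this prints the 5 bases before the hit, the barcode, and the 5 bases after it. To run it on a real file, open that file instead of the in-memory string.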

        Yes, the original file displays fine in the text editor. Also, I don't really see bizarre characters, just normal characters placed one on top of another, which is why I thought it may be an issue with my PC: the output file I create by removing the newlines has a very, very long single line of text, which my PC may be having problems loading. Anyway, thanks for the help. I will keep messing around and see if I can somehow get this to work, because from the replies I have got, the problem doesn't seem to be with the code itself but with something else.
