|Perl: the Markov chain saw|
Re: How to process with huge data'sby TomDLux (Vicar)
|on Oct 04, 2011 at 17:44 UTC||Need Help??|
700,000 lines is a small input file ...OK, a medium sized input file, especially when each line is 6 characters. It should take a few seconds to read and process.
Using the default variable $_ is great when you have a few lines of code, but when you have a real program, I find it better to use real variables all the time. For one thing, when you extend the code, you may introduce inner or outer loops which re-assign $_, so you are forced to use real variables anyway. And in your case, you explicity specify the variable you are chomping, so there's no saving. And since you assign $_ to $inputtext, why not use that variable right from the while loopo:
Regular expressions are great, when you need them, but they're not always the best solution. I would split the input when I receive it into it's separate components:
Encapsulating a search into a routine is better than having it inline, because the name or the routine provides some documentation. As well, splitting a complicated search into a pair of simpler searches is much easier to understand, and can be faster, too.
I think what you're trying to do is record pairs of numbers$fnum,$snum, and separate the pairs by vertical pipes. This is part of what is slowing your program down. For each line of input, all 700,000 of them, each regular expression searches linearly through the connectionText string... making your program complexity N^2. Instead of storing it as a string, you should use an array of pairs, or a hash of pairs, or a two layer hash.
But if you WERE going to stick with a string, you can simplify the last three lines by checking for blank components on the simple pair, rather than checking the whole string ... after all, you know after each pass that the connectionText has no half-pairs, so you don't need to check the whole string, only the new addition:
Given a list, ($fnum, $snum), grep for the defined components, discarding the rest. (This is a place where default variables ARE useful!) join() the resulting list with commas ... that is place commas BETWEEN components. If there is only one component, there's no comma, so no mess to clean up. But in any case, have you encountered this situation, or is it merely fear that makes you handle missing parts. because, you know, you assign $fnum and $snum from regular expressions that match numerals. If there is no numberal, there would be no match, so else() test would fail, and you'd never enter the else block at all.
I don't think your "trailing pipe or end of string" search (\||$) will ever find anything but a trailing pipe, since you add it with each addition, and preserve it with the changes. So why not simply check for and leave in place the trailing pipe?
But it would be a WHOLE lot simpler to use data structures instead of strings. If you use an array of pairs, you can grep through the list of pairs to search for those that match:
Searching the entire array for each line is still an N^2 algorithmn, but at least cleaner than searching the string character by character. Using hashes would definitely be faster.
It seems you're adding things to what I called 'pairs', to form growning lists. So maybe you want a hash of hashes, or a hash that uses $fnum as the key, and has a n array of $snum as the associated value. If you need to be able to find the sets that include a particular $snum, you could also have similar reverse data structure, keyed on $snum with an array of $fnum that use that $snum. Lots of different ways to generate data structures, depending on your needs, and all of them are better than a long string.
As Occam said: Entia non sunt multiplicanda praeter necessitatem.