The size of the data doesn't matter, it's the maximum size of the regex which counts.
Surely, in functional or algorithmic terms, the size of the maximum possible match is the important thing. But the size of the data matters very much in terms of feasibility. Sometimes it simply can't be done because it would take far too long.
About two years ago, I had a program, written in a PL/SQL-like language, to extract data inconsistencies that would take about 60 days to complete (and that was after heavy optimization; the original version would have taken 180 days). Three or four days would have been acceptable in that context, not 60. The idea was to correct data inconsistencies, and you simply can't make DB corrections based on data that was extracted 60 days earlier. You might be interested to know that I solved the problem thanks to Perl. I removed most of the very complicated business logic (at least the part of it that was taking ages to execute) from the PL/SQL-like program, so that it only extracts raw files, and reprocessed these files with Perl. The program now runs in about 12 hours. The main difference is that Perl has very efficient hashes enabling very fast look-up, whereas the PL/SQL-like language had nothing of the kind, forcing linear searches through relatively large arrays billions of times.
BTW, this success contributed quite a bit to convincing my colleagues to use Perl. When I arrived in the department where I am now, nobody was using Perl for anything more than a few one-liners here and there in shell scripts; all of my colleagues now use Perl almost daily. Even our client has been convinced: I only need to propose rewriting this or that program in Perl to improve performance (of course, I do that only if I have good reasons to think that we will get really improved results), and they allocate the budget almost without further ado.
OK, this was somewhat off-topic, but my point was: if the data is really big (and I work daily with gigabytes or tens of gigabytes of data), the size of the input can really make the difference between what is feasible and what is not.
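To illustrate the hash-versus-linear-search point, here is a minimal sketch. It is not the actual program: the IDs, the data volumes and the sub names are all made up for the example. Both subs count the records whose ID is absent from a reference list, one by scanning the array for every record, the other by building a hash once and testing membership with exists.

```perl
use strict;
use warnings;

# Hypothetical reference data: 2,000 made-up customer IDs.
my @known_ids = map { "CUST$_" } 1 .. 2_000;

# Hypothetical records to check: CUST2, CUST4, ... CUST4000,
# so half of them refer to IDs that are not in the reference list.
my @records = map { "CUST" . ( $_ * 2 ) } 1 .. 2_000;

# Linear search: walk the reference array for each record,
# i.e. O(n) comparisons per look-up.
sub count_missing_linear {
    my ( $known, $records ) = @_;
    my $missing = 0;
    for my $id (@$records) {
        my $found = 0;
        for my $k (@$known) {
            if ( $k eq $id ) { $found = 1; last; }
        }
        $missing++ unless $found;
    }
    return $missing;
}

# Hash look-up: build the index once, then each membership test
# takes constant time on average.
sub count_missing_hash {
    my ( $known, $records ) = @_;
    my %seen = map { $_ => 1 } @$known;
    my $missing = 0;
    for my $id (@$records) {
        $missing++ unless exists $seen{$id};
    }
    return $missing;
}

print "linear: ", count_missing_linear( \@known_ids, \@records ), "\n";
print "hash:   ", count_missing_hash( \@known_ids, \@records ), "\n";
```

Both subs return the same count; only the amount of work differs. With n reference entries and m records, the linear version does on the order of n*m string comparisons, while the hash version does roughly n+m operations, which is where the gap becomes huge once the volumes are large.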