PerlMonks  

Suggestion for regular expression speed improvement.

by bala.linux (Novice)
on Jun 15, 2009 at 11:56 UTC (#771624, perlquestion)
bala.linux has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am using a regular expression which matches 25 columns with grouping, and it is very slow. My Perl version is 5.8.8. Can someone please help me improve its speed? Will updating Perl or using any regexp modules help?

Regular Expression : (.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)\t(.+)

Thanks and Regards,
Bala

Re: Suggestion for regular expression speed improvement.
by Corion (Pope) on Jun 15, 2009 at 12:00 UTC

    It's likely much better to use Text::CSV_XS or a simple split on /\t/ in your case, as a regular expression is overkill in your situation.
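
A minimal sketch of the split approach suggested above, assuming each log line is a plain tab-separated record with no quoting. The sample line and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample line with three tab-separated fields.
my $line = "2009-06-15\tINFO\tserver started";

# The -1 limit keeps trailing empty fields, which split drops by default.
my @fields = split /\t/, $line, -1;

print scalar(@fields), " fields: ", join(" | ", @fields), "\n";
```

The same call works unchanged for 3, 25, or any other number of columns, since split does not need to know the field count in advance.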

      Thanks. Your suggestion works well for properly separated log files such as CSV. But I want my code to work with regular expressions so that I can parse any format of logs. I hope you understand my problem. So, unfortunately, I cannot use split or CSV modules :(
        …so that I can parse any format of logs.

        Can you elaborate how you hope to handle "any format" with regular expressions?

        I do not understand this response. Using a regex such as you described is less flexible than using split, not more flexible: The regex will only match on lines containing at least 25 tab-separated fields. If there are fewer fields, it will fail to match and return no data. If there are more, then some fields will not be separated from each other and will be returned as a single field [1]. split will work with any number of tab-separated fields right out of the box.

        Going beyond split to a proper CSV-handling module, you will be able not only to read arbitrary numbers of tab-separated columns, but also to recognize quoting of the fields, so that they can contain embedded tabs without causing false field separations. Accomplishing this with regexes is messy, at best.

        [1] ...unless you switch from (.+) to ([^\t]+) and anchor the pattern with ^ and $, in which case it will only match lines containing exactly 25 fields.
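
The behaviour described in this reply can be checked with a scaled-down sketch. It assumes a 3-column pattern instead of 25 so the lines stay readable; the sample lines and the %result hash are invented for illustration.

```perl
use strict;
use warnings;

my %result;
for my $line ("a\tb", "a\tb\tc", "a\tb\tc\td") {
    # Greedy pattern in the style of the original question.
    my @greedy = $line =~ /(.+)\t(.+)\t(.+)/;
    # Negated character class, anchored at both ends.
    my @anchored = $line =~ /^([^\t]+)\t([^\t]+)\t([^\t]+)$/;
    # Plain split, keeping trailing empty fields.
    my @split = split /\t/, $line, -1;

    $result{$line} = {
        greedy   => [@greedy],
        anchored => [@anchored],
        split    => [@split],
    };
    printf "%d tabs: greedy=%d anchored=%d split=%d captures\n",
        ($line =~ tr/\t//), scalar @greedy, scalar @anchored, scalar @split;
}
```

On the two-field line both patterns fail and return nothing. On the four-field line the greedy pattern still returns three captures, the first of which contains an embedded tab, while the anchored pattern fails. split returns one element per field in every case.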

Re: Suggestion for regular expression speed improvement.
by moritz (Cardinal) on Jun 15, 2009 at 12:11 UTC
    Corion is right with his suggestions. If you're still interested in how to speed up the regex, here it goes:

    The first .+ initially matches all the characters on the line, then gives up characters one at a time until the following \t can match a tab (the last one on the line). The second .+ then takes what remains, but there is no tab left for the next \t to match, so the first .+ has to give up characters again, and so on.

    To avoid all that backtracking, you should replace each .+ with something that matches everything except tabs, [^\t]+.
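
For anyone curious about the difference, here is a rough sketch that builds both patterns for 25 columns and times them with the core Benchmark module. The test line (f1 through f25) is invented, the anchors on the negated pattern are an addition not mentioned above, and the timings will vary by machine and Perl version, so treat the output as indicative only.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $columns = 25;

# Hypothetical test line: f1, f2, ... f25 joined by tabs (no trailing newline).
my $line = join "\t", map { "f$_" } 1 .. $columns;

# Build both patterns programmatically instead of typing 25 groups by hand.
my $greedy_src  = join '\t', ('(.+)')     x $columns;
my $negated_src = join '\t', ('([^\t]+)') x $columns;

my $greedy  = qr/$greedy_src/;
my $negated = qr/\A$negated_src\z/;

# In list context a successful match returns the captured groups.
my @fields = $line =~ $negated;
print scalar(@fields), " fields captured\n";

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    greedy  => sub { my @f = $line =~ $greedy },
    negated => sub { my @f = $line =~ $negated },
    split   => sub { my @f = split /\t/, $line, -1 },
});
```

Including split in the comparison makes it easy to see how the simpler approach recommended elsewhere in this thread fares on the same input.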

      This sounds good. I will adopt this change and compare the performance. Thanks.
        No, don't. Go with the tips Corion gave you above. It's much more sensible to use split or a module; my explanation was mostly to satisfy academic curiosity, and not meant as a suggestion on how to solve your problem.
Re: Suggestion for regular expression speed improvement.
by QM (Vicar) on Jun 15, 2009 at 17:37 UTC
    Hi bala.linux,

    To put a finer point on the previous responses...

    I see you've just joined Perlmonks, and this seems to be your first post. Given that, everyone will cut you some slack, to a point.

    Regarding Text::CSV_XS (emphasis mine):

    The module accepts either strings or files as input and can utilize any user-specified characters as delimiters, separators, and escapes so it is perhaps better called ASV (anything separated values) rather than just CSV.

    So I would respectfully suggest that you haven't considered how generic a solution T::C provides. I'd suggest rereading the module doc several times, and trying a few toy examples to get the flavor of its capabilities and to judge whether any drawbacks it holds are acceptable for you.

    For myself, I use it in a few scripts, and it is amazingly fast, and yet very configurable. Googling for reviews finds lots of similar opinions.
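
As one such toy example, the sketch below assumes Text::CSV_XS is installed from CPAN and configures it for tab-separated input. The sample line is made up; its second field is quoted and contains an embedded tab, the case that trips up both split and a hand-written regex.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# sep_char sets the field separator; binary allows arbitrary bytes in fields;
# auto_diag makes the module report parse errors itself.
my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1, auto_diag => 1 });

# Hypothetical line: the second field is quoted and contains an embedded tab.
my $line = qq{alpha\t"beta\tgamma"\tdelta};

$csv->parse($line) or die "Cannot parse line: " . $csv->error_diag;
my @fields = $csv->fields;

print scalar(@fields), " fields\n";
print "field $_: [$fields[$_]]\n" for 0 .. $#fields;
```

Changing sep_char (and, if needed, quote_char and escape_char) is how the same parser can be pointed at other delimiter conventions.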

    Otherwise, I'd go with the coderef idea. Give your users a few examples based on log files you've encountered, just to be helpful. But there are a lot of pitfalls here, not even considering malevolent code.

    BTW, come back here often, and read up on anything that catches your fancy. This is the best place I know to get your Perl questions answered. You'll learn a lot by reading, and more by asking questions. Good luck.

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

      Yes, you are right. I may have registered only today, but I have always followed this site, because Google drops me here for any Perl-related query :) Regarding the T::C module, frankly I have not evaluated it so far. I would like to do that to see how best I can use the module in my project. Thanks for your suggestion.
