Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Alternatives for index() ... substr() ?

by zarath (Beadle)
on Nov 23, 2017 at 15:02 UTC ( [id://1204144]=perlquestion: print w/replies, xml ) Need Help??

zarath has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks!

This question is all about keeping the written code as clean, clear and simple as possible.

Please consider the following piece of code (irrelevant parts are replaced by [...]):

[...] if ($text =~ /FATAL: Error while copying /) { my $dt = substr($text,0,20); my $me = index($text,' from '); my $ml = $me - 47; my $msg = substr($text,47,$ml); $me += 6; my $se = index($text,' to '); my $sl = $se - $me; my $source = substr($text,$me,$sl); my $sbs = $me + 2; my $sbl = $se - $sbs; my $sourcebis = q{\\\\ENCNRW0010}.substr($text,$sbs,$sbl); $se += 4; my $de = index($text,', error was:'); my $dl = $de - $se; my $destination = substr($text,$se,$dl); File::Copy::copy $source.$msg,$destination; push (@mail,$dt.$msg.' from '.$sourcebis.' to '.$destination." +\n"); $text =~ s/FATAL: Error while copying /FATAL (Fixed): Error wh +ile copying /g; print $text."\n"; } [...]

2 example lines of what $text could be:

2017-11-16 11:42:20 FATAL: Error while copying MX000017105279_2448299. +1523788.IN.EDI from D:\EnecoEDIELArchive\B2B_ELEK\ to \\ENCNRW0012\En +ecoData\EDIEL_IN\B2B_ELEK\, error was: 2017-11-16 11:42:21 FATAL: Error while copying MX000017105328_3626588. +1523787.IN.EDI from D:\EnecoEDIELArchive\B2B_ELEK\ to \\ENCNRW0012\En +ecoData\EDIEL_IN\B2B_ELEK\, error was:

So, with my code I need to extract several strings from $text. In case of the first example: 2017-11-16 11:42:20, MX000017105279_2448299.1523788.IN.EDI, D:\EnecoEDIELArchive\B2B_ELEK\ and \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\. Every other string found there is a constant by the way, which is why I can use those in my index().

Problem is these strings can be pretty much anything with any number of characters (apart from the datetime of course), the only way I know how to safely extract them is shown in the code snippet. I don't know about you but if I would see this in a code written by someone else, my head would start spinning.

The code works, so I would not classify this as a 'problem', but if for example next year something changes about the way the inputfiles are written, I will be having a hard time deciphering what I did here again. So I was just wondering if there is a clearer way of doing what I do here?

Maybe something like my $msg = "the string that comes between" 'copying ' "and" ' from ';? Is there a function that does what the double-quoted text says?

Thank you in advance!

Replies are listed 'Best First'.
Re: Alternatives for index() ... substr() ?
by choroba (Cardinal) on Nov 23, 2017 at 16:01 UTC
    You seem to already know the tool that can help you: regular expressions. Here's what I tried and got exactly the same output for the given input (i.e. same @mail, @text, and File::Copy::copy called with the same parameters as in your code):
    if (my ($dt, $msg, $source, $destination) = $text =~ /([0-9]{4}-[0-9]{2}-[0-9]{2}[ ][0-9]{2}:[0-9]{2}:[0 +-9]{2}) [ ]FATAL:[ ]Error[ ]while[ ]copying[ ] (.+) [ ]from[ ] (.+) [ ]to[ ] (.+), /x ) { ( my $source_without_drive = $source ) =~ s/^\w://; my $sourcebis = '\\\\ENCNRW0010' . $source_without_drive; File::Copy::copy $source.$msg, $destination; push @mail, "$dt $msg from $sourcebis to $destination\n"; $text =~ s/FATAL: Error while copying /FATAL (Fixed): Error wh +ile copying /g; push @text, $text."\n"; } }

    UPDATE: fixed for paths containing spaces.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Alternatives for index() ... substr() ?
by roboticus (Chancellor) on Nov 23, 2017 at 16:20 UTC

    zarath:

    Generally, I use a combination of techniques to parse text files. The methods I use depend on how the structure of the text varies in the file.

    When the structure is very regular, you can use substr or unpack to parse them. Unpack has the advantage where essentially lets me do multiple substring extractions in a single statement plus the ability to do a few simple type conversions. But if there are only a couple fields, I often fall back on split or substr.

    When the data has variable structure, then I'll follow up with regular expressions to parse out the tougher bits. Regular expressions with capture groups are a very powerful method to let you quickly chop text into pieces. Here's a quick example:

    use strict; use warnings; while (my $line = <DATA>) { next if $line =~ /^\s*$/; my ($date,$time,$severity,$msg) = split /\s+/, $line, 4; my ($r1_file, $r1_src, $r1_dst, $r2_file, $r2_src, $r2_dst); if ($msg =~ /^Error while copying (.*) from (.*) to (.*), error wa +s/) { ($r1_file, $r1_src, $r1_dst) = ($1, $2, $3); } if ($msg =~ /^Error while copying (.*?) from (.*?) to (.*?), error + was/) { ($r2_file, $r2_src, $r2_dst) = ($1, $2, $3); } print "FILE:\t<$r1_file>\n\t<$r2_file>\n"; print "SRC:\t<$r1_src>\n\t<$r2_src>\n"; print "DST:\t<$r1_dst>\n\t<$r2_dst>\n\n"; } __DATA__ 2017-11-16 11:42:20 FATAL: Error while copying MX000017105279_2448299. +1523788.IN.EDI from D:\EnecoEDIELArchive\B2B_ELEK\ to \\ENCNRW0012\En +ecoData\EDIEL_IN\B2B_ELEK\, error was: 2017-11-16 11:42:21 FATAL: Error while copying MX000017105328_3626588. +1523787.IN.EDI from D:\EnecoEDIELArchive\B2B_ELEK\ to \\ENCNRW0012\En +ecoData\EDIEL_IN\B2B_ELEK\, error was: 2017-11-16 11:42:21 FATAL: Error while copying Research Data from JOE +MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchive\B2B_E +LEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was: 2017-11-16 11:42:21 FATAL: Error while copying MX000017105328_3626588. +1523787.IN.EDI from D:\EnecoEDIELArchive from JOE\B2B_ELEK\ to \\ENCN +RW0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was: 2017-11-16 11:42:21 FATAL: Error while copying MX000017105328_3626588. +1523787.IN.EDI from D:\EnecoEDIELArchive to BOB\B2B_ELEK\ to \\ENCNRW +0012\EnecoData\EDIEL_IN\B2B_ELEK\, error was:

    See how easy the code with the regular expressions looks? It's pretty nice to be able to say:

    if ($msg =~ /^Error while copying (.*) from (.*) to (.*), error wa +s/) { my ($filename, $source_dir, $dest_dir) = ($1, $2, $3); ... do something ... }

    Perl can see that we're wanting to match an expression that starts with "Error while copying", followed by a chunk of text, followed by " from ", followed by more text, followed by " to ", yet more text, and ending up with ", error was". Since we used parenthesis to gather the three unspecified chunks of text, we have captured three strings. If the match was successful, we then assign the first captured chunk ($1) into $filename, the second captured chunk ($2) into $source_dir, etc.

    While regular expressions can greatly simplify your code, you must test them thoroughly, or you can be surprised by the results:

    $ perl pm_1204144.pl FILE: <MX000017105279_2448299.1523788.IN.EDI> <MX000017105279_2448299.1523788.IN.EDI> SRC: <D:\EnecoEDIELArchive\B2B_ELEK\> <D:\EnecoEDIELArchive\B2B_ELEK\> DST: <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\> <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\> FILE: <MX000017105328_3626588.1523787.IN.EDI> <MX000017105328_3626588.1523787.IN.EDI> SRC: <D:\EnecoEDIELArchive\B2B_ELEK\> <D:\EnecoEDIELArchive\B2B_ELEK\> DST: <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\> <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\> FILE: <Research Data from JOE MX000017105328_3626588.1523787.IN.EDI> <Research Data> SRC: <D:\EnecoEDIELArchive\B2B_ELEK\> <JOE MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELA +rchive\B2B_ELEK\> DST: <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\> <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\> FILE: <MX000017105328_3626588.1523787.IN.EDI from D:\EnecoEDIELArchi +ve> <MX000017105328_3626588.1523787.IN.EDI> SRC: <JOE\B2B_ELEK\> <D:\EnecoEDIELArchive from JOE\B2B_ELEK\> DST: <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\> <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\> FILE: <MX000017105328_3626588.1523787.IN.EDI> <MX000017105328_3626588.1523787.IN.EDI> SRC: <D:\EnecoEDIELArchive to BOB\B2B_ELEK\> <D:\EnecoEDIELArchive> DST: <\\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\> <BOB\B2B_ELEK\ to \\ENCNRW0012\EnecoData\EDIEL_IN\B2B_ELEK\>

    As you can see, some oddball data can confuse your regular expressions, so you need to test them carefully. Trying to catch everything with a single regular expression can get complicated, so don't be afraid to use several cases in different if/then/else branches.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Alternatives for index() ... substr() ?
by Eily (Monsignor) on Nov 23, 2017 at 15:43 UTC

    IMHO the main issue with your code is not the use of index and substr, it's how unhelpful the variable names are. Not only do the name give no clue on their meaning, they look similar enough that you need extra concentration to tell them appart.

    Maybe something like my $msg = "the string that comes between" 'copying ' "and" ' from ';? Is there a function that does what the double-quoted text says?
    Yes, regexes do what you want. You can read the tutorials Pattern Matching, Regular Expressions, and Parsing for more info. Your specific case is quite straightforward:
    $text =~ /copying(.*)from/ or die "Couldn't find the pattern"; my $msg = $1;
    . means any character, * means several times and the parenthesis capture the result into $1 (another pair of parenthesis would put it in $2, and so on). So overall it's: find "copying", then find several times any character in a row, followed by "from", and put the found characters into $1.

Re: Alternatives for index() ... substr() ?
by Laurent_R (Canon) on Nov 23, 2017 at 16:01 UTC
    Hi zarath,

    The first thing I would recommend is to use variable names that clearly state what they contain.

    My first idea was perhaps to use unpack, but since you are saying that "these strings can be pretty much anything with any number of characters", that is probably not going to work.

    Although this might sound paradoxical, the clearest and cleanest way to retrieve your data might be to use just a long regex using the constant strings as delimiters between the pieces you want to capture. Something like this (to be completed and tested):

    my ($date, $file, $source_dir, $destination ...) = ($text =~ /(.{19}) +FATAL: Error while copying ([^ ]+?) from ([^ ]+?) to .../);

    Update: when I started writing this answer, there was no other response, but it took me so long to write this answer that I finally posted a quarter of an hour later than Eily, and just said almost the same thing.

    Update 2:: note that, the solution above is based on the data you've shown. I have assumed that the file and directory names you want to capture don't contain any space. If there can be spaces in your data to be captured, this would have to be slightly changed (thanks to Eily for pointing out the possibility that some of the data to be captured may contain spaces). Similarly, if the parts you want to capture may contain the strings "to" or "from", this might also be a problem. So you have to know very well your data in order to take your detailed decisions on how exactly to build your regex or regexes.

Re: Alternatives for index() ... substr() ?
by zarath (Beadle) on Nov 24, 2017 at 10:02 UTC

    Thank you for all the suggestions! This place never disappoints!

    Since my seemingly unclear naming of the variables has come up a few times, I'll explain this a bit: the variables that are only used for 'calculating' have a short name, because they are only 'tools', not a part of the end result. For example my $sl is short for 'source length' - the length of the string that will be called my $source later on. Other example: my $de is short for 'destination end' - the index of the last character of what will become my $destination.

    I feel a little bit silly because in a way, I already use the method that I was looking for (my $text = /blabla/), but I did not know I could actually fetch strings in a way that is understandable to me. I have done some searching before posting this, but all I found about this method is stuff that looks like /"(?>(?:(?>[^"\\]+)|\\.)*)"/. I am truly sorry about my noobieness, but when I see something like that, I can't help but think "What the hell am I staring at?"

    On the other hand...

    $text =~ /copying(.*)from/ or die "Couldn't find the pattern"; my $msg = $1;

    ... as suggested by Eily, this I understand and will be testing what I can do with it.

    And finally, is it... is it worth trying Parse::RecDescent? Have gone through it and it looks like it might take some time to get my head around it, I do have some time, but I don't like spending time on something just to see it does not help at all. Have read through the thread linked by LanX and I can't help feeling a bit suspicious about the suggestion.

      You understand the variable names right now, but if you do that often enough, you'll start using the same names to mean different things ($dl for download, destination length, desired language etc...) and will get more confusing. And you shouldn't rely on being able to remember what the names mean to understand the code.

      I'm glad that you understand the solution I proposed (rather than just copy and paste it), be aware that it is not perfect though. As Laurent_R said, if "copying" or "from" can be found somewhere else in the string (in the file names or the paths) the regex might use the wrong ones. The solutions proposed by choroba or Laurent_R might be better fitted in some cases.

      is it worth trying Parse::RecDescent?
      ...
      Not for your problem (although it's a great tool in other cases). Sundial seems to like demonstrating that he has approximate knowledge of many things, so he favours being able to say a lot about something over relevance or even technical accuracy. Just ignore his posts.

      I am truly sorry about my noobieness, but when I see something like that, I can't help but think "What the hell am I staring at?"

      It's that sort of syntax which, at least anecdotally, is blamed for putting a lot of people off Perl. However, syntax is just syntax and the more you use it the less arcane it seems. Regular expressions are an absolute cornerstone of the langauge (and so useful that the PCRE is used in many other tools/languages to their benefit). So it is very worthwhile putting the effort in to learn.

      There's perlretut for the basics and perlre for reference. Then you can dive into the various wonders of Pattern Matching, Regular Expressions, and Parsing. Take it slowly - get comfortable with the basics then add in one new extra feature at a time. You will soon find that they are as ubiquitous as everyone claims. Good luck and enjoy the ride.

Re: Alternatives for index() ... substr() ?
by Anonymous Monk on Nov 24, 2017 at 04:37 UTC

    It appears that the strings you want are separated by whitespace, have no whitespace embedded and $text is in a consistent format. If this holds true, I would split $text on whitespace, then extract the strings you want from the array.

    if ($text =~ /FATAL: Error ... /){ my @parts = split /\s+/, $text; my $dt = "$parts[0] $parts[1]"; my $msg = $parts[6]; my $source = $parts[8]; my $destination = $parts[10]; ... }
A reply falls below the community's threshold of quality. You may see it by logging in.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1204144]
Approved by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (3)
As of 2024-04-16 04:33 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found