PerlMonks  

Perl beginner here, needs a shove in the right direction.

by rfromp (Novice)
on Jun 16, 2015 at 18:28 UTC ( id://1130678 )

rfromp has asked for the wisdom of the Perl Monks concerning the following question:

Hello, new Perl user here with a real-life problem I'm trying to solve.

I've been through the first Llama book from O'Reilly, so I am familiar with Perl in that regard; however, applying the concepts to real-life problems is where I get stuck. I never know where to start, so I try, and then my code gets messy, complicated, and kludgey. I think with more experience it will come more naturally and logically. So that's where I'm at.

For the record, I'm not looking to have the work done for me or to be spoon-fed. Working the problem is the only way I'll become more proficient with Perl, but at the same time I legitimately need some mentorship. Thank you in advance.

For reasons beyond my control, I'm using Perl v5.8.8 on RHEL 5.9.

In a nutshell, here's what I'm trying to do using Perl. I have multiple folders that contain thousands of small text files (all with unique file names), and I need to validate that there is data in certain fields. It is not important what the data is, just that there is data in the fields. The text files vary in length, 20 to sometimes 100+ lines long. The lines that contain the fields I'm interested in all start with the same keyword, which is unique to those lines. If there is data in the fields, I'm happy. However, if a field contains no data or contains a dash, then I need to know. The fields in the lines are delimited by a forward slash. I cannot post the files because they are proprietary and posting them would violate my non-disclosure agreement, but if needed I can post a rough example of how the text files are generally formatted, containing bogus data.

My first solution was not with Perl but on the command line using bash & awk, and looked like this:

    cat *.TXT | awk '/uniqueKeywordAtBeginningOfLine/' | cut -f 6,7,8 -d '/'

That worked, but as the number of files in the directories grew into the thousands, I piped the above to more; paging one screen at a time and eyeballing the data is incredibly time consuming and just plain inefficient. I'm certain Perl can do the job, but I'm having trouble getting started. What I'm trying to avoid is an unwieldy kludge.
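The shell pipeline maps fairly directly onto a short Perl script. Below is a minimal sketch, not a definitive answer: the keyword is the placeholder carried over from the pipeline, and cut's 1-based fields 6,7,8 become 0-based indices 5..7 after split.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Emulate: awk '/^KEYWORD/' | cut -f 6,7,8 -d '/'
# Returns fields 6,7,8 (as cut counts them) of a slash-delimited line,
# or an empty list if the keyword is absent. Assumes matching lines
# have at least 8 fields.
sub wanted_fields {
    my ($line) = @_;
    return unless $line =~ /^uniqueKeywordAtBeginningOfLine/;
    my @fields = split /\//, $line;
    return @fields[5 .. 7];    # 0-based indices for cut's fields 6,7,8
}

for my $file (glob '*.TXT') {
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (my $line = <$fh>) {
        my @f = wanted_fields($line);
        print join('/', @f), "\n" if @f;
    }
    close $fh;
}
```

Run in a directory of .TXT files, this prints the same slash-separated triples the cut command produced, without paging through more.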

My first thought was to incorporate the awk into a Perl script that would analyze the files, but after some Google searching, it seems that using Perl exclusively is the better option.

My trouble now is that I'm not sure where to start. My thinking is to slurp each file into an element of an array and then run a foreach loop to iterate over the files, with an elsif to check the fields of the lines I'm interested in; if all are good, skip to the next element in the array (the next file), but if the fields are empty or contain a dash, report the filename for further manual examination.

If someone can give me their opinion on how I should start this out, I'd be very appreciative. Thanks.





UPDATED to give an example of the data files that are to be processed.

They are similar to the example below, with most having more than one INTERESTING line to check, but sometimes only one.

Fields 6,7,8 of the INTERESTING line(s) are the fields I need to ensure are not blank and do not just contain a dash. The example below would be considered 'good' and would not need further examination.

ZZZZZ ZZZZZZ 1111111-BBBB--CCCCC. DDD EEEEE F 222222G HHH 33 III JJ JJJJ LLLLL MMMMMMMM NNNNN//OO/PPPPP// QQ RRRRRR/SSSSSSSS TT U U U U U U//VVV WW XXX/XXX/XXX/XXX YYY ZZZZ/AA-44/55555556/BBB// CCCCCCC/D6// EEE/7777/8888888/999999G/H000I/-/-/-// JJJJJJ/11// INTERESTING/22/M/NNN3333333333P/-/444.5Q/6.77RR/8.99RR// EEE/7777/8888888F/999999G/H000/-/-/-// JJJJJJ/11// INTERESTING/33/T/UUU4444444444V/-/555.6R/7.88TT/9.11UU// LLLLL/22// MM
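To make the field numbering concrete, here is a small sketch that pulls fields 6, 7 and 8 out of one of the INTERESTING records in the sample above (cut counts fields from 1, while Perl's split indices count from 0):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One INTERESTING record taken from the sample data above.
my $line = 'INTERESTING/22/M/NNN3333333333P/-/444.5Q/6.77RR/8.99RR//';

# cut -f 6,7,8 -d '/' is 1-based; the same fields are split indices 5..7.
my @fields = (split /\//, $line)[5 .. 7];

print "@fields\n";   # prints: 444.5Q 6.77RR 8.99RR
```

Field 5 (the lone dash) is fine here, because only fields 6, 7 and 8 are subject to the blank-or-dash rule.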

Replies are listed 'Best First'.
Re: Perl beginner here, needs a shove in the right direction.
by kcott (Archbishop) on Jun 16, 2015 at 19:47 UTC

    G'day rfromp,

    Welcome to the Monastery.

    Based on your command line solution, here are some pointers.

    Firstly, take a look at Re^3: Opening multiple log files. I wrote that a couple of days ago. Change .log to .TXT to get your *.TXT files.

    Then open each in turn and read line-by-line. See perlintro: Files and I/O; follow the links in that section for more details.

    Check each line for /uniqueKeywordAtBeginningOfLine/. You'll want to anchor your regex to the start of the line. See perlintro: Regular expressions; follow the links in that section for more details.

    When you get a match, use split and an array slice for the three wanted fields (i.e. to emulate cut -f 6,7,8 -d '/'). See perlintro: Perl variable types; follow the links in that section for more details.

    Now perform whatever checks you need on those fields. You may want to write the results to some log file for subsequent analysis.
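    Put together, those steps might look something like the sketch below (not Ken's code: the keyword and the 6,7,8 field positions are placeholders carried over from the original cut command):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if any of the three wanted fields of a slash-delimited line
# is missing, empty, or just a dash. Field positions are placeholders.
sub line_has_problems {
    my ($line) = @_;
    my @fields = (split /\//, $line)[5 .. 7];   # like cut -f 6,7,8 -d '/'
    return scalar grep { !defined $_ or $_ eq '' or $_ eq '-' } @fields;
}

for my $file (glob '*.TXT') {
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (my $line = <$fh>) {
        next unless $line =~ /^uniqueKeywordAtBeginningOfLine/;  # anchored
        if (line_has_problems($line)) {
            print "check $file (line $.)\n";   # or write to a log file
            last;                              # one report per file
        }
    }
    close $fh;
}
```

    Note that split drops trailing empty fields, so the !defined check also catches matching lines that are simply too short.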

    If you need further help, please address individual issues with a short piece of code that illustrates your problem. Do not post proprietary data: make up dummy data. All of this is explained in How do I post a question effectively?.

    -- Ken

      Thank you Ken, your "Opening log files" sounds like it will be super-helpful. I was thinking about the regex as well, I will work on this and report back.
Re: Perl beginner here, needs a shove in the right direction.
by stevieb (Canon) on Jun 16, 2015 at 19:01 UTC

    Usually we don't write full-blown scripts if the Seeker of Perl Wisdom doesn't post any example code, but I put something together here that should get you started.

    I'm not sure if glob would be faster, as I did not test it, so you can play around.

    This code looks at all files located (recursively) in the directory specified as the second argument to find(). If it's a file, it opens it, reads it line by line, and if a line has the keyword (HELLO) at the start, it checks fields 1, 3 and 4 (after being split() on a forward slash) to ensure they are not empty. If the file contains a line starting with 'HELLO' and at least one such line is missing data in any of those fields, it logs the filename to a file and continues on.

    You'll need to research how to print the directory path with the file in the log (if you need it), sort out your indexing, replace your keyword, etc.

    #!/usr/bin/perl

    use warnings;
    use strict;

    use File::Find;

    open my $log, '>', 'log.txt'
        or die "Can't open the log file!: $!";

    find(\&check_data, "./test");

    sub check_data {
        my $file = $_; # for clarity

        return if ! -f $file;

        open my $fh, '<', $file
            or die "Can't open file $file: $!";

        for my $line (<$fh>){
            if ($line =~ /^HELLO/){
                my @parts = split(/\//, $line);
                for my $index (qw(1 3 4)){
                    if (! $parts[$index] or $parts[$index] =~ /[\s+-]/){
                        print $log "$file is missing data.\n";
                        return;
                    }
                }
            }
        }
    }

    Example files:

    $ cat test/hello.pl
    adsfasdfasdf
    HELLO/asdf/asdf/asdf/asdf/asdf

    $ cat test/testing.txt
    asdfasdf/dfasdf/12351234132r
    HELLO/asdf//////////

    Output:

    $ cat log.txt
    testing.txt is missing data.

    -stevieb

    EDIT: Added check for hyphen in line elems

      Thank you stevieb, I will study this and report back.

      Hi stevieb, I've worked on your example but I've run into a snag.

      For my testing, I have only 6 files in my directory. 5 of the files have all of the necessary data; in the 6th file I have purposely left out data in one of the fields the Perl script is supposed to check, and in another field I have put a dash. So my thinking is that once run, the log will record that only one of the files needs attention.

      Below is the Perl, with line numbers added for reference:

       1  #!/usr/bin/perl -w
       2
       3  use warnings;
       4  use strict;
       5
       6  use File::Find;
       7
       8  open my $log, '>', 'log.txt'
       9      or die "Can't open the log file!: $!";
      10
      11  find(\&check_data, "./test");
      12
      13  sub check_data {
      14
      15      my $file = $_; # for clarity
      16
      17      return if ! -f $file;
      18
      19      open my $fh, '<', $file
      20          or die "Can't open file $file: $!";
      21
      22      for my $line (<$fh>){
      23          if ($line =~ /^HELLO/){
      24              my @parts = split(/\//, $line);
      25              for my $index (qw(1 3 4)){
      26                  if (! $parts[$index] or $parts[$index] =~ /[\s+-]/){
      27                      print $log "$file is missing data.\n";
      28                      return
      29                  }
      30              }
      31          }
      32      }
      33  }

      I'm having problems with line 11. What I believe is happening there is that the subroutine check_data is being called with a parameter that is the name of a file called "test". In other words, the ./ is telling Perl to look in the present directory for a file called "test" and then run the subroutine on that file.

      But what I need to do is have perl look in the directory and check all of the files. I've tried several ways to do this but getting either one of these two errors:

      I named the Perl script 'stevieb.pl' and I'm running it from the dir in which the files are located.

      Error #1

      No such file or directory at ./stevieb.pl line 11

      Error #2

      The second is not an error per se, but in the log.txt all of the files are listed as having missing data, which isn't correct.

      What I'm trying to do is get all of the files into the script instead of only reading one file, as in your example, but I can't get the syntax to do this. I've tried several variations of line 11:

      find(\&check_data, ".\*.TXT" ); # look in this dir for all files that end in .TXT Gives error #1

      find(\&check_data, "./" ); # look in this dir for anything. # Gives error #2

      find(\&check_data, "./*"); # look in this dir for everything. # Gives error #1

      find(\&check_data, "."); # look here (as in pwd) Gives error #2

      find(\&check_data, "/actualNameOfTheDirectory"); # Gives error #2

      find(\&check_data, ".\/"); # tried escaping the slash thinking that was the problem. Gives error #2

      find(\&check_data, ".\/*TXT" ); # tried escaping the slash thinking that was the problem. Gives error #1

      Long story short, I can't seem to get all the files in the directory to be read without them all producing an error in the log. I've verified that the "good" files are properly formatted.
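      A sketch of the shape find() expects may help here: find() takes a code reference (note the \&) plus one or more starting directories; it never accepts a wildcard pattern like "./*.TXT", so extension filtering belongs inside the callback. The .TXT filter below is an assumption based on the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;

my @flagged;

# Only .TXT files are of interest; everything else is skipped.
sub is_wanted {
    my ($name) = @_;
    return $name =~ /\.TXT\z/i;
}

sub check_data {
    return unless -f $_ && is_wanted($_);
    # field checks would go here, as in stevieb's script
    push @flagged, $File::Find::name;   # full path, not just the basename
}

# A code reference (\&) plus a starting DIRECTORY -- not a glob pattern.
find(\&check_data, '.');

print "$_\n" for @flagged;
```

      With the filter in place, log.txt and any other non-.TXT files in the tree no longer get checked, which also removes one source of spurious "missing data" reports.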

Re: Perl beginner here, needs a shove in the right direction.
by aaron_baugher (Curate) on Jun 16, 2015 at 20:07 UTC

    I wouldn't recommend slurping the files, since you want to look at individual lines. Just go through the files one by one, checking each file line-by-line. Since you're looking for design guidance rather than code, the logic would go something like this:

    create a hash which has the keywords to watch for as its keys
    foreach file (probably using File::Find) do
        open file (error check)
        foreach line in file
            split the line into fields
            is the first field from this line in my hash of important keywords?
            if yes, are there empty or dash fields in this line?
            if yes, report this file (and keyword and/or line number, optionally)
                and move on to next file
        end of lines
        close file
    end of files

    Take a look at File::Find; all the indented code above would be part of the callback you'd pass to File::Find's find() method.

    Aaron B.
    Available for small or large Perl jobs and *nix system administration; see my home node.

      Thank you Aaron B; design is one of my weaknesses. You specifically say to "split the line into fields"; however, the lines are already delimited into fields by a forward slash, so is there something extra needed there?

      The line of the file I'm interested in looks something like this:

      DATA/-/data123/data456//data789/-/AZ

        I'm talking about using the split function to split the line into an array of fields, like this:

        my $line = 'DATA/-/data123/data456//data789/-/AZ';
        my @fields = split '/', $line;

        that will put the fields in that array. Then you can check the first element of the array, $fields[0], to see if it's in your hash of important keywords. If it is, you can grep the rest of the fields to see if any are the empty string or a dash. Here's an example with the sample line you gave:

        #!/usr/bin/env perl
        use 5.010;
        use strict;
        use warnings;

        my %keys = ('DATA' => 1);   # set up a hash of keywords

        my $line = 'DATA/-/data123/data456//data789/-/AZ';
        my @fields = split '/', $line;   # split line into fields on a slash

        if( $keys{$fields[0]} ){         # is the first element in my hash of keywords?
            my $keyword = shift @fields; # remove the keyword from the fields array
            if( grep { $_ eq '' or $_ eq '-' } @fields ){ # are any elements empty or a dash?
                say "Line has problems, keyword $keyword";
            }
        }

        Aaron B.
        Available for small or large Perl jobs and *nix system administration; see my home node.

Re: Perl beginner here, needs a shove in the right direction.
by Laurent_R (Canon) on Jun 16, 2015 at 20:54 UTC
    A couple of comments. First, don't embed awk in Perl: Perl can do everything that awk can do, most of the time more efficiently and in a simpler manner.

    If you need to recursively traverse a directory tree, then File::Find is definitely the module you want. But if you have a single directory, or a bunch of directories not related in a hierarchy, then I would rather use the glob function on each such directory, which is simpler for a beginner to use.

    Given what you intend to do, I would recommend against slurping the files, just read them one by one, each one line by line, and apply a split or a regex on each line.

    Then, you have to think about your output, about which you said very little. I would imagine, from what you said about the number of files, that you probably only want to output the lines (with file names) that don't satisfy your rules. Printing to the screen might be sufficient, but you may want to consider printing to a file (or several files).

    Yes, it would be useful if you provided a sample of the data format, even with bogus content.

    Finally, I would recommend that you start out writing some code and show it here; I am sure you will get guidance from experienced monks and learn a lot in the process. You obviously don't really know where to start. Start with something simpler than what you need. Break it down into smaller tasks that are easier to master.

    For example, you might start with a program that reads all the files of a directory and just prints their contents to the screen (use a dummy directory with just 2 or 3 files for a start). Once this works, you can add more of the functionality that you need, such as several directories instead of just one, filtering the data that you need to display, and writing to a file instead of displaying on the screen.

    You are on the verge of undertaking a great journey. You just have to dare to start, do it, go ahead, dare, nothing wrong can happen. I sincerely hope that you will enjoy it as much as I did when I started out writing programs about 35 years ago. And that you will soon share my passion for that.

      Thank you Laurent_R for the advice and encouragement. Reading the files line by line instead of slurping seems to be the consensus. I'm not at work anymore, but I will post a sample of the format; I don't want to go by memory now, as memory is another one of my weaknesses. Tonight though, I'll do like you said and create some dummy content and see what works and what doesn't.

Re: Perl beginner here, needs a shove in the right direction.
by neilwatson (Priest) on Jun 16, 2015 at 18:35 UTC
      Thank you Neil, others have also commented on using "File::Find". I will research to learn more and report back.
Re: Perl beginner here, needs a shove in the right direction.
by RichardK (Parson) on Jun 16, 2015 at 22:56 UTC

    Split your problem into pieces; you have two separate things to do:

    1) Check that a single file is in the correct format. Maybe write a function that takes a file name and checks if it's OK.

    2) Find all the file names that need checking. glob is a good place to start.

    Then you can work on each piece and get it working on its own before getting them working together.
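    Those two pieces might be stubbed out as below (a sketch only: the keyword and the 6,7,8 field positions are placeholders from the original post, and file_is_ok's real checks are whatever the format requires):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Piece 1: is a single file in the correct format?
# Returns true if every interesting line has data in fields 6,7,8.
sub file_is_ok {
    my ($name) = @_;
    open my $fh, '<', $name or die "Can't open $name: $!";
    while (my $line = <$fh>) {
        next unless $line =~ /^uniqueKeywordAtBeginningOfLine/;
        my @fields = (split /\//, $line)[5 .. 7];
        return 0 if grep { !defined $_ or $_ eq '' or $_ eq '-' } @fields;
    }
    close $fh;
    return 1;
}

# Piece 2: find all the file names that need checking.
for my $name (glob '*.TXT') {
    print "$name needs attention\n" unless file_is_ok($name);
}
```

    Keeping the per-file check in its own function means it can be tested on a single known-bad file before the glob loop is wired up.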

Re: Perl beginner here, needs a shove in the right direction.
by einhverfr (Friar) on Jun 17, 2015 at 06:22 UTC

    First for the shove in the right direction: It sounds like you might be doing EDI X12 stuff. If so, search CPAN for relevant libraries. Chances are very good you are reinventing a wheel.

    Now, this seems like a case where you can likely write some nice, elegant code. Not being entirely familiar with what you are doing, let's go over a quick example:

    sub invalid_line {
        my ($line) = @_;
        ...
        return 1 if $invalid;
    }

    my @files; # todo, get list of files

    for my $file (@files) {
        open FILE, '<', $file;
        if (scalar grep { invalid_line($_) }
                   grep { $_ =~ /^$keyword/ } <FILE>) {
            # file is invalid
        } else {
            # file is valid
        }
        close FILE;
    }
Re: Perl beginner here, needs a shove in the right direction.
by QM (Parson) on Jun 17, 2015 at 10:25 UTC
    I would have chosen a different route. I've rarely needed File::Find when *nix find, xargs, and egrep are available. If the final decision is too complicated, you can feed that to Perl:
    find . -iname \*.txt | xargs egrep -H '^keyword\b' | perl -lane 'length($F[2]) and ($F[2] ne q/-/) and print' > some_output_file

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of
