CSV regex with hash/array program plan

by campus1plb (Initiate)
on Nov 23, 2014 at 15:01 UTC

campus1plb has asked for the wisdom of the Perl Monks concerning the following question:

Dear all, I'm new to the site (and returning to Perl after quite a long break, though I was never a master of any sort).

I'm trying to plan out a program and I'd like to sanity check my objectives, just to make sure I'm not a: making more work than I need to, and b: planning something stupid.

Task:
read a CSV file containing columns for |Degree Subject|Entry requirement|

The entry requirement field contains strings such as:

"A minimum of 3 A Levels at ABB for First Year Entry or a minimum of AAB for Second Year Entry. Must include Mathematics and Physics at AB."

"For First Year Entry a minimum of 3 A Levels at BBB or 4 AS at AABB. For Second Year Entry a minimum of an A in the subject selected for Single Honours plus BB, or AB in the subjects selected for Joint Honours plus a further B."

"Three A Levels at ABB. AB required in Mathematics and Physics or a B in Design & Technology or a B in Engineering. If applicant presents with B in Physics, Design & Technology or Engineering, Mathematics must be A grade."

Program plan:
Read in the CSV file to an array/hash (more on this later)

Use regular expressions to determine which subjects are required for each degree subject, and create a column specific to EACH subject and mark whether it is present or not

Write this array/hash to a csv file for output.

Example output:
|Degree Subject|Entry requirement* |Grades|Maths|Physics|Engineering|etc etc

|Chemical Eng |A minimum of 3 A Levels at XXX* for First Year Entry..|ABB |A/B |A/B | |
*(use s/ in regex to indicate which parts have been "detected" for manual checking)

Problems/Puzzles:

1/ Would it be more straightforward to use Text::CSV with the full matrix of columns created manually up front and then assign values to the relevant fields, or to check for entries in the dataset and "create" columns at runtime?

2/ My gut feeling is to use an array (instead of a hash) for this as by my nature (and as C was my first language many moons ago) it seems nice and orderly. Speed performance isn't a critical issue here.

3/ The regex component is going to be quite complicated, as there are many variants of the field. I'm contemplating one of two strategies:
a) using a first pass to pull out all unique entries and attempt a regex on them, then using this as a reference key to screen the remainder
b) doing the regex in one pass
N.B. there may be as many as 46,000 row entries, but in terms of unique entries it may be more like 5,000 (which is still loads, but perhaps easier to check over).
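The first-pass idea in strategy a) might look something like the sketch below. It assumes the requirement strings have already been read into an array; the sample data and the names `@requirements`, `%count` and `@unique` are made up for illustration and are not from the real file.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the "Entry requirement" column.
my @requirements = (
    'Three A Levels at ABB.',
    'A minimum of 3 A Levels at BBB.',
    'Three A Levels at ABB.',
);

# Count each distinct string, so every unique entry only needs parsing once.
my %count;
$count{$_}++ for @requirements;

# Most common entries first: checking these by hand covers the most rows.
my @unique = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;

printf "%5d  %s\n", $count{$_}, $_ for @unique;
```

The parsed result for each unique string could then be stored in a hash keyed by that string and looked up for every row, which is one way to read "reference key" above.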

Delighted for any guidance, even just pointers.

Best wishes, Phil

Replies are listed 'Best First'.
Re: CSV regex with hash/array program plan
by roboticus (Chancellor) on Nov 23, 2014 at 17:19 UTC

    campus1plb:

    1. I'd create the columns during runtime: that way, if someone adds a new subject, you don't need to change your program.
    2. Either data structure is fine. I'd personally go with a hash table, but only because that would feel more natural for me. If you're comfortable with an array, go with it.
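A rough sketch of what "columns created at runtime" could look like with a hash of hashes. The data in `%detected` is invented for illustration, and the pipe-separated output merely mimics the example table in the question; for real CSV output a module such as Text::CSV would be the safer choice.

```perl
use strict;
use warnings;

# Hypothetical rows: degree subject => subjects detected in its requirement text.
my %detected = (
    'Chemical Eng'   => { Maths => 'A/B', Physics => 'A/B' },
    'Product Design' => { 'Design & Technology' => 'B' },
);

# Collect the set of column names from the data itself.
my %seen;
for my $row (values %detected) {
    $seen{$_} = 1 for keys %$row;
}
my @columns = sort keys %seen;

# Build a pipe-separated table; subjects absent from a row become empty cells.
my @lines = join '|', 'Degree Subject', @columns;
for my $degree (sort keys %detected) {
    push @lines, join '|', $degree, map { $detected{$degree}{$_} // '' } @columns;
}
print "$_\n" for @lines;
```

A new subject appearing in the data simply becomes a new key in `%seen`, so no column list has to be maintained by hand.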

    Your third question is the fun part. I don't find that writing the regexes is that difficult. The problem is more 'are you sure you're getting everything correctly?'.

    I generally attack it like this: write regexes for the first five or ten lines of data. Then make a program to match and delete all the requirements that it can. Then look at the next few lines of what's left, and add new regexes and/or alter existing ones. After a few iterations, you'll have regexes that can handle most of the data. You may have a few stragglers (misspellings, etc.) that require a bit of playing with. You might, for example, repair misspellings first, before matching requirements.
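One possible shape for that match-and-delete loop, as a minimal sketch. The three patterns and the sample sentence (with a deliberate misspelling of "Physics") are assumptions for illustration only; the real pattern list would grow over several passes.

```perl
use strict;
use warnings;

# Hypothetical starter patterns, tried in order.
my @patterns = (
    [ grades  => qr/\b[A-E]{3,4}\b/ ],
    [ maths   => qr/\bMath(?:ematic)?s\b/i ],
    [ physics => qr/\bPhysics\b/i ],
);

my $text = 'Three A Levels at ABB. Must include Mathematics and Phyiscs at AB.';

my %found;
my $left = $text;
for my $p (@patterns) {
    my ($name, $re) = @$p;
    # Record every match, then delete it so only unhandled text remains.
    push @{ $found{$name} }, $1 while $left =~ s/($re)//;
}

# Whatever is left shows what the patterns still miss (here, the misspelling).
print "found $_: @{ $found{$_} }\n" for sort keys %found;
print "left over: $left\n";
```

Sorting the leftovers by frequency across the whole file would show which new pattern is worth writing next.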

    Have fun with it!

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Thanks for the replies folks,

      The sort of thing i had in mind is

      use feature "switch";
      for (to be determined) {
          when ((s/(\b[A-E]{3}\b)/XXX/g)) {@array[i,1] = $1}
          when (s/((Science)|(science))/XXX/g) {@array[i,2] = $1}
          when (s/((Math)|(math))/XXX/g) {@array[i,3] = $1}
          when - for all remaining subjects...
          when (s/((first|First)\W+(?:\w+\W+){1,10}?([A-E]{3})) #or some similar REGEX returning only the first year entry grades
          default {}
      }

      but i'm concerned that i'm thinking very "C" in my iteration loop for the array

      To answer Anonymous monk,

      To start with, i'd be very happy with an output that looks like below:

      (subject)..|ABB|Maths|Physics|Design&Technology|Engineering

      If I can get the program to do the above, then I'll try to refine it to pull out specific grades

      eg ([A-E]{1} in Design|design) returns the grade preceding that subject
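A hedged sketch of the "grade preceding a subject" idea, using one of the sample strings from the question. The subject list in the pattern and the hash name `%grade_for` are illustrative assumptions, and the pattern deliberately handles only the simple "GRADE in Subject" shape: it does not pick up the "and Physics" that shares the AB.

```perl
use strict;
use warnings;

my $text = 'AB required in Mathematics and Physics or a B in Design & Technology '
         . 'or a B in Engineering.';

# Capture one or more grade letters followed by "in <Subject>".
# The (?i:...) group makes only the subject name case-insensitive,
# so the grade letters must still be upper-case.
my %grade_for;
while ($text =~ /\b([A-E]+)\s+(?:required\s+)?in\s+((?i:design|engineering|mathematics))/g) {
    $grade_for{ lc $2 } = $1;
}

print "$_: $grade_for{$_}\n" for sort keys %grade_for;
```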

      Roboticus, thanks for that. I may start with a defined array (there are ~200 A level possibilities, but they change very infrequently) and then, once I get the hang of it, develop it into one which adds subjects when detected.

      You have also hit the nail on the head regarding the REGEXes, i was thinking about having a list produced of ones that the regex struggled with, or ones which didn't get any hits to see how i'm missing things too. I'd not thought about multiple passes however, nor repairing misspellings!

      Really appreciate the input. I just need to plough through some text books and remind myself of (or learn) the things appropriate to the task in hand.

      best wishes, Phil

        when ((s/(\b[A-E]{3}\b)/XXX/g)) {@array[i,1] = $1}

        Capture groups don't work the same way in s/// substitution versus m// matching:

        c:\@Work\Perl>perl -wMstrict -le
        "my $s = 'xxx aBc yyy dE zzz';
         print qq{\$s: '$s'};
         ;;
         print qq{\$1: '$1'} if $s =~ s{ [AaBbCcDdEe]+ }{XXX}xmsg;
         print qq{\$s: '$s'};
        "
        $s: 'xxx aBc yyy dE zzz'
        Use of uninitialized value $1 in concatenation (.) or string at -e line 1.
        $1: ''
        $s: 'xxx XXX yyy XXX zzz'
        Capture groups work in a potentially surprising way in s/// substitution and m// matching:

        c:\@Work\Perl>perl -wMstrict -le
        "my $s = 'xxx aBc yyy dE zzz';
         print qq{\$s: '$s'};
         ;;
         print qq{\$1: '$1'} if $s =~ s{ ([AaBbCcDdEe]+) }{XXX}xmsg;
         print qq{\$s: '$s'};
        "
        $s: 'xxx aBc yyy dE zzz'
        $1: 'dE'
        $s: 'xxx XXX yyy XXX zzz'

        (note that only the last group matched is captured). Please see perlre, perlrequick, and perlretut.
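If every matched group is wanted rather than just the last one, two common approaches are a list-context m//g before substituting, or the /e flag so that code runs for each match. A small sketch on the same test string (the variable names `@all`, `@seen` and `$marked` are just for illustration):

```perl
use strict;
use warnings;

my $s = 'xxx aBc yyy dE zzz';

# List-context m//g returns every capture, not just the last one.
my @all = $s =~ m{ ([AaBbCcDdEe]+) }xmsg;

# Alternatively, the /e flag evaluates the replacement as code for each
# match, so each $1 can be saved while the substitution is happening.
my @seen;
(my $marked = $s) =~ s{ ([AaBbCcDdEe]+) }{ push @seen, $1; 'XXX' }xmsge;

print "all:    @all\n";
print "seen:   @seen\n";
print "marked: $marked\n";
```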
        Also: I don't think @array[i,1] = $1 is going to work the way you think it will, whatever the value of $1 may be (but I'm not sure just what you expect from this expression). Please see Slices in perldata.

        Update: Something like this works for hashes: see $; in perlvar. There's a more complete discussion of this old trick somewhere, but I can't locate it right now. Anyone know where it is?

        Update: Anonymonk informs me this is Multi-dimensional array emulation in perldata. This section was apparently added with Perl version 5.16.0 or 5.16.1. I only had 5.14 available locally and so missed it.
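To make the distinction concrete, here is a small sketch contrasting a real two-dimensional structure, an array slice, and the `$;` hash emulation. It is only a guess at what the original `@array[i,1]` was meant to do (a row/column assignment); the data values and variable names are invented.

```perl
use strict;
use warnings;

# A real two-dimensional structure: an array of array references.
my @table;
my $i = 0;
$table[$i][1] = 'ABB';
$table[$i][2] = 'Science';

# By contrast, @row[0, 1] is a slice: elements 0 and 1 of a single array.
my @row  = ('Chemical Eng', 'ABB', 'Science');
my @pair = @row[0, 1];

# The old hash trick: $h{$i, 1} joins the subscripts with $; into one key.
my %h;
$h{$i, 1} = 'ABB';
my ($key) = keys %h;

print "table: $table[0][1]\n";
print "slice: @pair\n";
print "key matches: ", ($key eq join($;, 0, 1) ? 'yes' : 'no'), "\n";
```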

Re: CSV regex with hash/array program plan
by GrandFather (Saint) on Nov 24, 2014 at 01:35 UTC

    Have you freedom to use a database instead of .csv files? SQLite (see DBD::SQLite) may fit the task better.

    Perl is the programming world's equivalent of English

      Hi Grandfather, yes, I have the freedom to solve this problem in any way I see fit. Alas, I've no experience with SQL, but I'd be willing to learn (and at some point soon I shall need to, as the eventual destination for this data will be an SQL database).

      My plan was to do the regex and matching first, then to populate a database with that (I actually need to create a mathematical construct or coding system for the subject "hits" before that, however).

      Any suggestions on an alternative are welcome. I'm trying to spend lots of time on the planning whilst I re-skill and upskill myself in Perl, so I'm open to any wisdom!

      best wishes Phil

        It's refreshing to have someone open to, nay keen to, embrace new technologies! To dip your toe into databases writ Perl and SQL you may find Databases made easy helpful.

        I'd be somewhat inclined to skip the CSV phase if you can because it will skew the way you think about structuring the database. Call back here when you have a structure in mind and ask for comment.

Re: CSV regex with hash/array program plan
by Anonymous Monk on Nov 23, 2014 at 16:01 UTC
    "Three A Levels at ABB. AB required in Mathematics and Physics or a B in Design & Technology or a B in Engineering. If applicant presents with B in Physics, Design & Technology or Engineering, Mathematics must be A grade."
    How are you going to parse that, and what should the output look like for that entry?
