PerlMonks  

Parse file, split

by mewoq (Initiate)
on May 28, 2013 at 00:05 UTC ( #1035484=perlquestion )
mewoq has asked for the wisdom of the Perl Monks concerning the following question:

I have a file that is a list of cars with the year/make/model as the description, e.g.
2011 Chevy Camaro
Some of the cars have extended model names like:
2011 Dodge Ram Crew Cab Short Bed
What I want to do is parse each line and split it into year/make/model, and I have this:

open (FILE, 'E:\ptest\cars.txt');
while (<FILE>) {
    chomp;
    ($year, $make, $model) = split(" ");
    print "$year ";
    print "$make ";
    print "$model \n";
}
close (FILE);
exit;
I'm not sure how to adjust for the model name being more than one word.

Re: Parse file, split
by LanX (Canon) on May 28, 2013 at 00:15 UTC
    add a field limit, see split

    DB<108> $line='2011 Dodge Ram Crew Cab Short Bed'
     => "2011 Dodge Ram Crew Cab Short Bed"

    DB<109> split ' ',$line,3
     => (2011, "Dodge", "Ram Crew Cab Short Bed")
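
Applied to the loop from the question, the field limit could look like the sketch below. It uses an in-memory list in place of cars.txt so that it runs on its own, and the helper name parse_car is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split a description into at most 3 fields,
# so everything after the make stays together in the model.
sub parse_car {
    my ($line) = @_;
    my ($year, $make, $model) = split ' ', $line, 3;
    return ($year, $make, $model);
}

# Sample lines standing in for the contents of cars.txt.
my @lines = (
    '2011 Chevy Camaro',
    '2011 Dodge Ram Crew Cab Short Bed',
);

for my $line (@lines) {
    my ($year, $make, $model) = parse_car($line);
    print "$year | $make | $model\n";
}
```

With a limit of 3, split stops after the second delimiter, so the remaining words end up in the last field unchanged.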

    If this is not what you want please provide us with better data enclosed in code-tags.

    You certainly have a bad delimiter character! =)

    Cheers Rolf

    ( addicted to the Perl Programming Language)

    I'm not familiar with American car manufacturers and models, and to me "Ram Crew Cab Short Bed" sounds rather like IMDB keywords for a porn movie...

    ...

    ... btw: what does Cab mean? ;-)

      ... what does Cab mean?

      At least in North America, the 'cab' of a Pickup truck is the driver/passenger compartment, usually enclosed. Similar to the cab of a locomotive: the operator's compartment. And yes, now you mention it, that description does sound kinda porny...

Re: Parse file, split
by JockoHelios (Scribe) on May 28, 2013 at 00:16 UTC

    If you split on a comma instead of a space, the model should come back as one field of text. Unless you don't have commas separating the fields in your source file :)

    And unless you have commas in the model for some reason. Can you post a sample of the source file ?
    Dyslexics Untie !!!
      So here's the text file with the car names.
      2011 Chevy Camaro
      2011 Dodge Ram Crew Cab Short Bed
      2011 Ford F150 Platinum
      2011 Ford Flex
      2011 Ford Transit
      2011 GMC Cargo Van Extended
      2011 Hyundai Genesis Coupe
      2011 Kia Sol
      2011 Nissan Cube
      2011 Toyota Prius
      This came from the names of the folders (each of these is a folder name containing images of that car) that I parsed into a text file. I plan (later with more changes) on importing this into a table in a db.

        Here's a demonstration of how using regular expression pattern matching instead of string split might be more correct, robust and extensible.

        #!perl
        use strict;
        use warnings;

        my $valid_vehicle_description_pattern = qr{
            ((?:19|20)\d\d)                     # $1 is Year
            \s+
            (                                   # $2 is Make
                British\s+Leyland
              | Chev(?:y|rolet)
              | Dodge
              | Ford
              | (?:General\s+Motors|GMC?)
              | Hyundai
              | Kia
              | Nissan
              | Toyota
            )
            \s+
            (\S.*)                              # $3 is Model
        }ix;

        while (my $vehicle = <DATA>) {
            chomp $vehicle;

            if ($vehicle =~ $valid_vehicle_description_pattern) {
                my ($year, $make, $model) = ($1, $2, $3);
                print "Year: $year\tMake: $make\tModel: $model\n";
            }
            else {
                warn "Invalid vehicle description: $vehicle\n";
            }
        }

        __DATA__
        1970 British Leyland Triumph Spitfire
        2011 CHEVROLET CAMARO
        2011 Chevy Camaro
        2011 Dodge Ram Crew Cab Short Bed
        2011 Ford F150 Platinum
        2011 Ford Flex
        2011 Ford Transit
        2011 GMC Cargo Van Extended
        2011 Hyundai Genesis Coupe
        2011 Kia Sol
        2011 Nissan Cube
        2011 Toyota Prius
        2015 Apple iCar

        If the dataset is assumed valid and you avoid multi-word 'make' fields, LanX's approach would certainly seem to do the trick with this dataset. Note that  $extended_model has to be 'fixed' if the field does not exist, otherwise it's undefined.

        >perl -wMstrict -le
        "my @records = (
           '2011 Chevy Camaro',
           '2011 Dodge Ram Crew Cab Short Bed',
           '2011 Ford F150 Platinum',
           '2011 GMC Cargo Van Extended',
           );
         ;;
         for my $record (@records) {
           my ($year, $make, $model, $extended_model) = split ' ', $record, 4;
           $extended_model //= '';
           print qq{'$year' '$make' '$model' '$extended_model'};
           }
        "
        '2011' 'Chevy' 'Camaro' ''
        '2011' 'Dodge' 'Ram' 'Crew Cab Short Bed'
        '2011' 'Ford' 'F150' 'Platinum'
        '2011' 'GMC' 'Cargo' 'Van Extended'

        > This came from the name of the folder (each of these are a folder name containing images of that car) that I parsed into a text file.

        Please! Just change the delimiter when writing to something impossible in filenames, like "\t" (hopefully) and your problems when reading are all gone!

        In comparison all other approaches are just insane hacks!

        Or just avoid any intermediate files.
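
A minimal sketch of the tab-delimiter idea, assuming the year, make and model are already known as separate values at the point where the list is written (the records below are stand-ins, not data from the thread). It writes to an in-memory filehandle rather than a real file:

```perl
use strict;
use warnings;

# Assumed sample records: [year, make, model], already separated.
my @cars = (
    [ 2011, 'Chevy', 'Camaro' ],
    [ 2011, 'Dodge', 'Ram Crew Cab Short Bed' ],
);

# Write one tab-delimited record per line into an in-memory "file".
my $buffer = '';
open my $out, '>', \$buffer or die "Cannot open in-memory file: $!";
print {$out} join("\t", @$_), "\n" for @cars;
close $out;

# Read it back: splitting on the tab is unambiguous even though
# the model itself contains spaces.
open my $in, '<', \$buffer or die "Cannot reopen in-memory file: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($year, $make, $model) = split /\t/, $line;
    print "$year | $make | $model\n";
}
close $in;
```

This only helps if the fields are distinct when the file is written; a bare folder name would still need to be parsed first.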

        Cheers Rolf

        ( addicted to the Perl Programming Language)

        I'm assuming then that each line is the name of a folder, spaces included. A bit of RegEx would do, as in the code below.

        I'm sure the RegEx section could be redone to be cleaner and smaller, but I'm not at that level yet :) so I just did it in separate steps.

        use strict;

        my $Car0 = "2011 Chevy Camaro";
        my $Car1 = "2011 Dodge Ram Crew Cab Short Bed";
        my $Car2 = "2011 Ford F150 Platinum";
        my $Car3 = "2011 Ford Flex";
        my $Car4 = "2011 Ford Transit";
        my $Car5 = "2011 GMC Cargo Van Extended";
        my $Car6 = "2011 Hyundai Genesis Coupe";
        my $Car7 = "2011 Kia Sol";
        my $Car8 = "2011 Nissan Cube";
        my $Car9 = "2011 Toyota Prius";

        my $OneCar = "";
        my $Year = 0;
        my $Make = "";
        my $Model = "";

        push( my @CarInfo, ( $Car0, $Car1, $Car2, $Car3, $Car4, $Car5, $Car6, $Car7, $Car8, $Car9 ) );

        foreach $OneCar( @CarInfo )
        {
            # initialize the variables
            $Year = $OneCar;
            $Make = $OneCar;
            $Model = $OneCar;

            # drop everything after the Year including the first space char
            $Year =~ s/\s.*//;

            # drop the year and the first space char
            $Make =~ s/\d*\s*//;
            # drop everything after the Make including the first space char
            $Make =~ s/\s.*//;

            # drop the year and the first space char, same as with $Make
            $Model =~ s/\d*\s//;
            # drop everything up to and including the first space char
            $Model =~ s/\w*\s//;

            print "$Year\t\t$Make\t\t$Model\n";
        }
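
The separate substitutions above can also be collapsed into a single match. This is only a sketch; the helper name split_car is made up, and like the step-by-step version it assumes the make is exactly one word.

```perl
use strict;
use warnings;

# Hypothetical helper: Year = leading digits, Make = next word,
# Model = the rest of the line. Returns an empty list on no match.
sub split_car {
    my ($OneCar) = @_;
    return $OneCar =~ /^(\d+)\s+(\S+)\s+(.+)$/;
}

my @CarInfo = (
    '2011 Chevy Camaro',
    '2011 Dodge Ram Crew Cab Short Bed',
    '2011 GMC Cargo Van Extended',
);

foreach my $OneCar (@CarInfo) {
    if ( my ($Year, $Make, $Model) = split_car($OneCar) ) {
        print "$Year\t\t$Make\t\t$Model\n";
    }
    else {
        warn "Could not parse: $OneCar\n";
    }
}
```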
        Dyslexics Untie !!!
Re: Parse file, split
by Jim (Curate) on May 28, 2013 at 00:37 UTC

    For this text parsing task, I think you should use regular expression pattern matching instead of a simple string function like split. This way, you can be assured the year is a valid Gregorian year, the car maker is a legitimate one, etc. With pattern-based parsing, you can easily handle two-word vehicle manufacturers such as General Motors and International Harvester.

      Oops. Multiple-word manufacturers. My RegEx example does _not_ account for these, unless by happenstance.

      Meaning, it will work if they are all initials, like GMC was. Otherwise only the first word will go into Make.

      And it would be a shame to miss a Manufacturer like Elfin Sports Cars, if there happen to be any of those in your list :)
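
One way to cover multi-word manufacturers is to build the make part of the pattern from a list of known names. A rough sketch, assuming a hand-maintained list: @known_makes and parse_vehicle are illustrative names, the list is nowhere near complete, and the sample descriptions are invented.

```perl
use strict;
use warnings;

# Assumed, incomplete list of makes; multi-word names are allowed.
my @known_makes = ('Elfin Sports Cars', 'General Motors', 'GMC', 'Dodge', 'Ford');

# Longest names first, so a longer name is tried before a shorter one.
my $make_alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @known_makes;

my $vehicle_pattern = qr/^(\d{4})\s+($make_alternation)\s+(\S.*)$/i;

# Hypothetical helper: returns (year, make, model), or an empty list
# when the description does not start with a year and a known make.
sub parse_vehicle {
    my ($description) = @_;
    return $description =~ $vehicle_pattern;
}

for my $description ('2011 General Motors Cargo Van', '2011 Elfin Sports Cars Some Model') {
    my ($year, $make, $model) = parse_vehicle($description)
        or next;
    print "Year: $year\tMake: $make\tModel: $model\n";
}
```

Because the names are passed through quotemeta, each multi-word make must appear with single spaces between its words, exactly as listed.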

      Dyslexics Untie !!!

Node Type: perlquestion [id://1035484]
Approved by Jim