
RegEx Riddle

by ferddle (Initiate)
on May 28, 2008 at 05:39 UTC (#688802)
ferddle has asked for the wisdom of the Perl Monks concerning the following question:


I have a KML file that I need to parse. I would like to do this without the help of a module. Here is an example:
    <Placemark>
    <name>This is the title</name>
    ...
    <coordinates>12,34,0</coordinates>
    </Placemark>
    <Placemark>
    <name>This is another title</name>
    </Placemark>
    <Placemark>
    <name>One more</name>
    <coordinates>56,78,0</coordinates>
    ...
    <coordinates>-99,-88,0</coordinates>
    </Placemark>
I would like to create hashes from all occurrences of <name> and its corresponding <coordinates>.
    foreach (@kmlFile) {
        if (m/<name>(.*)<\/name>/) {
            %{$1} = (name => "<name>$1</name>");
        }
    }
Great, all I need is a way to find any <coordinates> tags that occur on lines after <name>, so that each coordinate is stored in a hash with its corresponding name. One of the problems with this is that there are some <name> tags that exist without <coordinates> (center three lines in the example above).

In other words, the code need only extract the next <coordinates> that occurs after a <name>. These coordinates would be associated only with the <name> that precedes them; all other <coordinates> would be ignored until the next <name> is found.

These occurrences are not necessarily on consecutive lines.

From the above, I would need:
    %This is the title = (
        name        => "<name>This is the title</name>",
        coordinates => "<coordinates>12,34,0</coordinates>",
    )
    %One more = (
        name        => "<name>One more</name>",
        coordinates => "<coordinates>56,78,0</coordinates>",
    )

If a mod is the only option, please let me know...

Thanks for any help,


Replies are listed 'Best First'.
Re: RegEx Riddle
by moritz (Cardinal) on May 28, 2008 at 06:44 UTC
    Why does everybody need to parse the same piece of XML? Is this some kind of homework? Or a very popular description format of some kind?

    That said, you shouldn't try to parse XML with regexes - use a decent module from CPAN instead.

      It seems to be used by Google in Google Earth and Maps, and is referred to as KML.
Re: RegEx Riddle
by prasadbabu (Prior) on May 28, 2008 at 05:47 UTC

    Hi ferddle,

    You can use a negative look-ahead in the regex to do the job. If you have the whole file in a string $kml, then:

    use strict;
    use warnings;
    use Data::Dumper;

    # Slurp the whole KML document (placed after __DATA__) into one string.
    my $kml = do { local $/; <DATA> };
    my %hash;

    while ($kml =~ m/<name>((?:(?!<name>).)*)<\/name>\s*(<coordinates>(?:(?:(?!<coordinates>).)*)<\/coordinates>)/gs) {
        $hash{$1} = $2;
    }

    print Dumper \%hash;

    output:
    -------
    $VAR1 = {
              'One more' => '<coordinates>56,78,0</coordinates>',
              'This is the title' => '<coordinates>12,34,0</coordinates>'
            };

    You can learn more about negative look-ahead in perlre.
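To see what the look-ahead buys you, here is a minimal sketch comparing a plain lazy match with the look-ahead form. The sample string and variable names are made up for illustration, and the second pattern is a loosened variation on the regex above (it allows other text between </name> and <coordinates>), not the exact pattern from this reply.

```perl
use strict;
use warnings;

# Hypothetical sample: the first <name> has no coordinates of its own.
my $text = '<name>first</name> <name>second</name> <coordinates>1,2,0</coordinates>';

# Lazy version: .*? is free to run across the next <name>, so "first"
# ends up paired with the coordinates that follow "second".
my ($lazy_name) = $text =~ m{<name>(.*?)</name>.*?<coordinates>}s;

# Look-ahead version: each character is consumed only if <name> does not
# start at that position, so a match can never span a second <name>.
my ($tempered_name) =
    $text =~ m{<name>((?:(?!<name>).)*?)</name>(?:(?!<name>).)*?<coordinates>}s;

print "lazy:     $lazy_name\n";        # lazy:     first
print "tempered: $tempered_name\n";    # tempered: second
```

The (?:(?!<name>).)* construct is what stops a name without coordinates from borrowing the coordinates of the next Placemark.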

    updated, added code. Thanks to ikegami for pointing out the mistake.


      I'd take a different tack, and loop through the placemark nodes.

      I've also assumed the OP wants all coords, rather than the last one as per your example, so have dumped them in a hash of arrays...

      my $xml = 'your xml...';
      my %coords_by_name;

      while ($xml =~ m{<Placemark>(.*?)</Placemark>}gs) {
          my $placemark_snippet = $1;
          if ($placemark_snippet =~ m{(<name>.*?</name>)}gs) {
              my $name = $1;
              while ($placemark_snippet =~ m{(<coordinates>.*?</coordinates>)}gs) {
                  push @{$coords_by_name{$name}}, $1;
              }
          }
      }

      use Data::Dumper;
      print Dumper(\%coords_by_name);
      I agree with the other comments, though: you need a good reason not to use a proper XML parser. The code above won't be very robust, and it supports only a fraction of the valid ways you could write that XML.
      my name's not Keith, and I'm not reasonable.
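For comparison, here is a rough sketch of the parser route, assuming XML::LibXML from CPAN is installed (it is not a core module, so the sketch checks for it at run time). The sample document, the <Document> wrapper element, and the sub name first_coords_by_name are made up for illustration. Real KML files usually declare a default namespace, in which case the XPath expressions would need a registered prefix via XML::LibXML::XPathContext; this sketch assumes no namespace, as in the OP's sample.

```perl
use strict;
use warnings;

# Assumption: XML::LibXML is available; load it at run time so the
# script can report a missing module instead of failing to compile.
my $have_libxml = eval { require XML::LibXML; 1 };

# Hypothetical sample, wrapped in a single root element so it is well-formed.
my $kml = <<'KML';
<Document>
<Placemark>
  <name>This is the title</name>
  <coordinates>12,34,0</coordinates>
</Placemark>
<Placemark>
  <name>This is another title</name>
</Placemark>
<Placemark>
  <name>One more</name>
  <coordinates>56,78,0</coordinates>
  <coordinates>-99,-88,0</coordinates>
</Placemark>
</Document>
KML

# Returns a hash ref mapping each name to the text of the first
# <coordinates> element found anywhere inside the same <Placemark>.
sub first_coords_by_name {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);
    my %coords;
    for my $placemark ($doc->findnodes('//Placemark')) {
        my $name = $placemark->findvalue('name');
        my ($first) = $placemark->findnodes('.//coordinates');
        next unless defined $first;    # Placemark without coordinates
        $coords{$name} = $first->textContent;
    }
    return \%coords;
}

if ($have_libxml) {
    my $coords = first_coords_by_name($kml);
    print "$_ => $coords->{$_}\n" for sort keys %$coords;
}
else {
    print "XML::LibXML is not installed\n";
}
```

Unlike the regex versions, this does not care about whitespace, attribute order, or elements nested between <name> and <coordinates>.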
Re: RegEx Riddle
by punch_card_don (Curate) on May 28, 2008 at 18:28 UTC
    What about some good old plain logic?

    What I read is that you want to associate the very next occurrence of coordinates, and the next occurrence only, with the last occurring name. Then:

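    # $kml_file is assumed to hold the path to the KML file.
    my %coords;
    my $current_name;
    my $coords_found_flag = 0;

    open(my $kml_fh, '<', $kml_file) or die("cannot open file $kml_file: $!");
    while (my $line = <$kml_fh>) {
        if ($line =~ m/<name>(.*)<\/name>/) {
            # A new name: remember it and start looking for coordinates again.
            $current_name = $1;
            $coords_found_flag = 0;
        }
        elsif ($line =~ m/<coordinates>(.*)<\/coordinates>/ && $coords_found_flag == 0) {
            # Only the first coordinates after a name are kept.
            $coords{$current_name} = $1;
            $coords_found_flag = 1;
        }
    }
    close($kml_fh);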

    Forget that fear of gravity,
    Get a little savagery in your life.
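The OP asked for one hash per name. Creating variables named after the data, as in %{$1}, relies on symbolic references, which use strict forbids; the usual substitute is a hash of hashes keyed by name. Here is a small sketch of the line-by-line logic above run on in-memory lines and storing that structure. The @lines sample and the %placemarks name are made up for illustration, and an undefined $current_name stands in for the separate flag.

```perl
use strict;
use warnings;

# Hypothetical sample, one element per line as in the OP's example.
my @lines = (
    '<Placemark>',
    '<name>This is the title</name>',
    '<coordinates>12,34,0</coordinates>',
    '</Placemark>',
    '<Placemark>',
    '<name>This is another title</name>',
    '</Placemark>',
    '<Placemark>',
    '<name>One more</name>',
    '<coordinates>56,78,0</coordinates>',
    '<coordinates>-99,-88,0</coordinates>',
    '</Placemark>',
);

my %placemarks;      # name => { name => ..., coordinates => ... }
my $current_name;    # most recent <name>; undef once its coordinates are taken

for my $line (@lines) {
    if ($line =~ m{<name>(.*?)</name>}) {
        $current_name = $1;
    }
    elsif (defined $current_name
        && $line =~ m{<coordinates>(.*?)</coordinates>})
    {
        # First coordinates after a name: store them, then ignore the
        # rest until the next <name> shows up.
        $placemarks{$current_name} = {
            name        => $current_name,
            coordinates => $1,
        };
        undef $current_name;
    }
}

for my $name (sort keys %placemarks) {
    print "$name => $placemarks{$name}{coordinates}\n";
}
# One more => 56,78,0
# This is the title => 12,34,0
```

Names that never get coordinates, such as "This is another title", simply never make it into %placemarks.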
