record separator causing problems

perlperlperl has asked for the wisdom of the Perl Monks concerning the following question:

The input file file.xml contains well-formed XML with a root html element that has content in it. I have set the record separator to undef so that the entire file is slurped into $records as one big string. In the regex match below, I am trying to capture the contents of the html element. This works when I match against a string literal, but when I match against the variable $records I get 'Use of uninitialized value $text in string at... ' at run time. Why? Am I not capturing the result of the match into $text?

 
use strict;
use warnings;

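# Three-argument open with error checks, so a missing file is reported
open FILE_IN, '<', 'file.xml' or die "Cannot open file.xml: $!";
open FILE_OUT, '>', 'results.txt' or die "Cannot open results.txt: $!";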

$/ = undef;
my $records = <FILE_IN>;

my $text = "";

($text) = $records =~ m/<html>(.*)<\/html>/;

print "$text";


close FILE_IN;
close FILE_OUT;

Puzzling.


Replies are listed 'Best First'.

Re: record separator causing problems
by hippo (Bishop) on Dec 19, 2013 at 09:41 UTC

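You are not using the /s regex modifier, so . does not match newlines. Your regex will therefore match only if there are no newlines between the opening and closing html tags, which is unlikely (but you haven't shown the data, so we can't know for sure). When the match fails, the list assignment leaves $text undefined, hence the warning.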
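
To illustrate the point about /s, here is a minimal sketch. It assumes a small multi-line document as a stand-in for file.xml (the real data was not shown) and reads it through an in-memory filehandle so the example is self-contained. The variable names $without_s and $with_s are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the contents of file.xml
my $xml = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>\n";

# Read it through an in-memory filehandle, as if it were a file
open my $in, '<', \$xml or die "Cannot open in-memory file: $!";

# Slurp everything; 'local' limits the change to $/ to this block
my $records = do { local $/; <$in> };
close $in;

# Without /s, '.' stops at a newline, so the match fails and the
# list assignment leaves the variable undefined
my ($without_s) = $records =~ m/<html>(.*)<\/html>/;

# With /s, '.' matches newlines too, so the capture succeeds
my ($with_s) = $records =~ m/<html>(.*)<\/html>/s;

print defined $without_s ? "without /s: matched\n" : "without /s: no match\n";
print defined $with_s    ? "with /s: matched\n"    : "with /s: no match\n";
```

Checking defined-ness (or the return value of the match) before printing is what avoids the 'uninitialized value' warning when the input does not match.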


Re: record separator causing problems
by Athanasius (Archbishop) on Dec 19, 2013 at 10:32 UTC

In addition to the solution given by ++hippo, note that if there is more than one html element then to get the first one you need to make the quantifier non-greedy:

m{<html>(.*?)</html>}s
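
As a rough sketch of the difference, assume an input string that holds two html elements. This is a made-up input for illustration only; a single well-formed XML document would have just one root element.

```perl
use strict;
use warnings;

# Hypothetical input with two html elements
my $records = "<html>first</html>\n<html>second</html>\n";

# Greedy: .* runs to the LAST </html> in the string
my ($greedy) = $records =~ m{<html>(.*)</html>}s;

# Non-greedy: .*? stops at the FIRST </html>
my ($lazy) = $records =~ m{<html>(.*?)</html>}s;

# With /g in list context, the non-greedy form returns every element
my @all = $records =~ m{<html>(.*?)</html>}sg;

print "greedy: [$greedy]\n";
print "lazy:   [$lazy]\n";
print "all:    [@all]\n";
```

Here the greedy capture spans both elements, including the closing tag of the first and the opening tag of the second, while the non-greedy capture is just the contents of the first element.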

That said, you’d most likely be better off (in the long run) using one of the dedicated XML modules: XML::Simple, XML::Twig, XML::Compile, XML::Tiny, XML::LibXML, and so on.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^2: record separator causing problems

by jellisii2 (Hermit) on Dec 19, 2013 at 12:45 UTC

XML::Twig


Re^3: record separator causing problems

by hippo (Bishop) on Dec 19, 2013 at 13:45 UTC

Quite. If only there were ... oh, I don't know ... some sort of poll or something to bring it to everyone's attention.


Re: record separator causing problems
by sundialsvc4 (Abbot) on Dec 19, 2013 at 16:08 UTC

That really can’t be emphasized enough: don’t try to do XML without a proper library, be it XML::Twig, XML::LibXML (my personal favorite), or something else.

In my experience, every XML data-feed that you’re ever going to receive was library-generated ... most commonly with libxml2 (the shared library or DLL), which is exactly what the Perl module XML::LibXML wraps. Everything is there ... XSLT, XPath expressions, and so on. So you can arrange to be reading the file with the same software that was used to create it, driving the bus with Perl or Python or whatever language you please. You can focus on what you want to do with the file, and the code required to do it suddenly isn’t complicated at all. You’ve got much better things to do with your time than monkeying around with regular expressions and record separators ...
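
For comparison with the regex approach, a minimal sketch using XML::LibXML might look like the following. It assumes the module is installed from CPAN (it is not part of core Perl) and uses a made-up document in place of file.xml, since the original data was not shown. The XPath /html/body/p only fits this sample.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical stand-in for the contents of file.xml
my $xml = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>\n";

# Parse the string; load_xml dies if the input is not well-formed
my $doc  = XML::LibXML->load_xml(string => $xml);
my $root = $doc->documentElement;

# Serialize the children of the root, i.e. the contents of <html>
my $inner = join '', map { $_->toString } $root->childNodes;

# Pick out a specific node with an XPath expression
my ($p) = $doc->findnodes('/html/body/p');

print "root element: ", $root->nodeName, "\n";
print "paragraph text: ", $p->textContent, "\n";
print "inner content:", $inner;
```

To read from the file itself, load_xml also accepts location => 'file.xml' in place of the string argument. Newlines and greedy quantifiers stop being a concern because the parser, not a regex, decides where the element begins and ends.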
