PerlMonks  

multi line regex

by metalfan (Novice)
on Jan 09, 2006 at 14:30 UTC (id://521941, perlquestion)

metalfan has asked for the wisdom of the Perl Monks concerning the following question:

hi,
why doesn't the regex work?
    #!/usr/bin/perl
    use warnings;
    use strict;
    use diagnostics;
    use WWW::LEO;
    use Data::Dumper;

    open(INFO, "< /home/metalfan/todo/dictionaries/Vokabeln\ -\ absolution\ gap.html")
        || die("can't open datafile: $!");
    my @file=<INFO>;
    {
        my $result;
        for my $entry (@file) {
            # file format:
            # <TR>
            #     <TD WIDTH=10% HEIGHT=61 VALIGN=TOP>
            #         <P ALIGN=LEFT>Wrought</P>
            #     </TD>
            #     <TD WIDTH=90%>
            #         <P><BR>
            #         </P>
            #     </TD>
            # </TR>
            if ($entry =~ m/
                <TR>
                <TD\sWIDTH=\d{1,2}%\sHEIGHT=\d{1,3}\sVALIGN=TOP>
                <P\sALIGN=LEFT>(.+)<\/P>
                <\/TD>
                <TD\sWIDTH=\d{1,2}%>
                <P><BR>
                <\/P>
                <\/TD>
                <\/TR>
                /sx) {
                print "$1\n";
            }
        }
    }
greets

Replies are listed 'Best First'.
Re: multi line regex
by Happy-the-monk (Canon) on Jan 09, 2006 at 14:39 UTC

    You read the file into an array of lines, then you compare single lines against something that looks suspiciously like a multi-line structure. You probably have to slurp the whole file into a scalar and drop the for loop:

    my $entry = do { local $/; <INFO> }; # slurp the whole file

    To which Corion adds:

    With the /x modifier, whitespace inside the pattern is ignored, so you will have to match all the whitespace explicitly. That means putting \s+ everywhere you might expect whitespace in the input.

    Cheers, Sören
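
    Putting the two remarks above together, a minimal sketch might look like the following. It is not the original poster's code: the sub name first_column_words and the inline sample (which stands in for the slurped file) are made up for illustration, and the /i flag, the non-greedy (.+?) and the wider \d{1,3} for WIDTH are assumptions about what the real file may contain.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first-column words from rows shaped like the sample in the
# question. The whole document is passed in as ONE string (slurped).
sub first_column_words {
    my ($html) = @_;
    my @words;
    # Under /x, literal whitespace in the pattern is ignored, so every place
    # where the input may contain spaces or newlines is spelled out as \s.
    while ($html =~ m{
        <TR> \s*
        <TD \s+ WIDTH=\d{1,3}% \s+ HEIGHT=\d{1,3} \s+ VALIGN=TOP> \s*
        <P \s+ ALIGN=LEFT> (.+?) </P> \s*
        </TD> \s*
        <TD \s+ WIDTH=\d{1,3}%> \s*
        <P> <BR> \s* </P> \s*
        </TD> \s*
        </TR>
    }gsxi) {
        push @words, $1;
    }
    return @words;
}

# Hypothetical sample; with a real file you would slurp it instead:
#   my $html = do { local $/; <INFO> };
my $html = <<'END_HTML';
<TR>
    <TD WIDTH=10% HEIGHT=61 VALIGN=TOP>
        <P ALIGN=LEFT>Wrought</P>
    </TD>
    <TD WIDTH=90%>
        <P><BR>
        </P>
    </TD>
</TR>
END_HTML

print "$_\n" for first_column_words($html);
```

    This still breaks as soon as the markup deviates from the sample, which is why the replies below recommend a real HTML parser.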

Re: multi line regex
by matija (Priest) on Jan 09, 2006 at 14:54 UTC
    This is wrong in so many ways. First of all, you're parsing HTML with a regex. Don't do that. Use HTML::Parser instead.

    Otherwise, there are just too many ways in which you can be tripped - tags with extra white space, tags with newlines, quotes missing or present in unexpected places, escaping of this, that or the other thing, javascript code fooling you into thinking you're in another tag when you really aren't, etc.

    Second, you're trying to extract data from an HTML table using regex. Don't do that. Use HTML::TableExtract instead. It will save you a LOT of hairpulling.

      Looks good. Sorry for asking, but how can I use this to
      get the word in the first column?

      1.column | 2.column
      english word | german word
      ....

      thx for help
        Read the manual pages for HTML::TableExtract - once it parses the table, the first column will be the first element of the row array.
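
        As a rough sketch of that advice, assuming HTML::TableExtract is installed from CPAN: the sub name first_column and the sample table are hypothetical, and the whitespace trimming is an addition, since cell text may come back with surrounding newlines. Constructing the extractor without headers, depth or count constraints makes it consider every table in the document.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Return the text of the first cell of every row, for every table found.
sub first_column {
    my ($html) = @_;
    my $te = HTML::TableExtract->new();   # no constraints: all tables
    $te->parse($html);
    my @words;
    for my $table ($te->tables) {
        for my $row ($table->rows) {      # each row is an array reference
            my $cell = $row->[0];         # first column
            next unless defined $cell;
            $cell =~ s/^\s+|\s+$//g;      # trim surrounding whitespace
            push @words, $cell if length $cell;
        }
    }
    return @words;
}

# Hypothetical sample in the style of the question's file.
my $html = <<'END_HTML';
<TABLE>
<TR>
    <TD WIDTH=10% HEIGHT=61 VALIGN=TOP><P ALIGN=LEFT>Wrought</P></TD>
    <TD WIDTH=90%><P>geschmiedet</P></TD>
</TR>
</TABLE>
END_HTML

print "$_\n" for first_column($html);
```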
Re: multi line regex
by murugu (Curate) on Jan 09, 2006 at 15:00 UTC

    Hi,

    It won't match, because you hold the contents of the HTML file in an array where each element contains a single line of the file, but your pattern describes a structure that spans several lines.

    It's better to use HTML::TokeParser to parse the HTML file and get the attribute values.

    Regards,
    Murugesan Kandasamy
    use perl for(;;);

      Another alternative is HTML::TokeParser::Simple. Whichever HTML parsing module you eventually choose will obviously be up to yourself. The main point is that you should definitely, definitely use one of them, rather than trying to create custom regexps.
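
      For comparison, a token-based version could look roughly like this. It assumes HTML::TokeParser (part of the HTML-Parser distribution on CPAN) is installed; the sub name first_cells and the sample markup are invented for illustration. It relies on get_tag to skip ahead to the next tag of a given name and on get_trimmed_text to collect the text up to the closing tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the text of the first <td> in every <tr>.
sub first_cells {
    my ($html) = @_;
    # A reference to a scalar tells the parser to read from that string.
    my $p = HTML::TokeParser->new(\$html) or die "cannot set up parser";
    my @words;
    while ($p->get_tag('tr')) {            # advance to the next row
        $p->get_tag('td') or last;         # then to its first cell
        my $text = $p->get_trimmed_text('/td');
        push @words, $text if length $text;
    }
    return @words;
}

# Hypothetical sample in the style of the question's file.
my $html = <<'END_HTML';
<TABLE>
<TR>
    <TD WIDTH=10% HEIGHT=61 VALIGN=TOP>
        <P ALIGN=LEFT>Wrought</P>
    </TD>
    <TD WIDTH=90%>
        <P><BR>
        </P>
    </TD>
</TR>
</TABLE>
END_HTML

print "$_\n" for first_cells($html);
```

      One caveat of this sketch: a row without any <td> would make the loop pick up the first cell of the following row.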
Re: multi line regex
by ptum (Priest) on Jan 09, 2006 at 14:41 UTC

    Hi, metalfan. It looks to me as though you're reading the contents of your file into an array (@file) and then processing the file one line at a time. If you want the whole file to be read in by <INFO> into a single scalar variable, then I think you need to set $/ (the input record separator) to undef instead of the default \n (note that an empty string '' selects paragraph mode rather than slurping), and replace @file with $file.
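
    How $/ changes what <INFO> returns can be checked with a small sketch. The sub name read_records and the sample text are made up, and an in-memory filehandle stands in for the real file. With "\n" the read returns one record per line, with '' (paragraph mode) records are split at blank lines, and with undef the whole input comes back as a single record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read all records from an in-memory "file" using the given record separator.
sub read_records {
    my ($text, $separator) = @_;
    open(my $fh, '<', \$text) or die "can't open in-memory file: $!";
    local $/ = $separator;    # restored automatically when the sub returns
    my @records = <$fh>;
    close $fh;
    return @records;
}

# Hypothetical sample containing one blank line.
my $text = "<TR>\n<TD>\n\n</TD>\n</TR>\n";

printf "line mode:      %d records\n", scalar read_records($text, "\n");
printf "paragraph mode: %d records\n", scalar read_records($text, '');
printf "slurp mode:     %d records\n", scalar read_records($text, undef);
```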

Re: multi line regex
by Perl Mouse (Chaplain) on Jan 09, 2006 at 14:40 UTC
    Because either the regex matches something different than you expect, or the line you match against contains something different than you expect.

    What does the line you match against contain, and do you expect the regexp to match, or to fail?

    Perl --((8:>*
