PerlMonks  

$nextline not working

by qingxia (Novice)
on Mar 22, 2013 at 00:47 UTC (#1024849=perlquestion)

qingxia has asked for the wisdom of the Perl Monks concerning the following question:

hi monkers:

I have the following problem while using $nextline. Basically, I am parsing a string (N here) from the second line after matching a target on the first line, e.g.

(line1) target some text here
(line2) some more text here N

I apply this to find a couple of variables. The code I have at hand is:
#!/usr/bin/perl
print "Borrower\tactiveDate\tAmount\n";
$Borrower = "null";
$activeDate = "null";
$Amount = "null";
$on = 0;
open(FILE, $ARGV[0]);
while ($line = <FILE>) {
    if ($on eq 0 and $line =~ /<tr/) {
        $on = 1;
    }
    if ($on eq 1) {
        if ($line =~ /<div(.*?)>Borrower:<\/td>/) {
            $nextline = <FILE>;
            if ($nextline =~ /<td(.*?)>(.*?)(\s)/) { $Borrower = $2; }
            if ($nextline =~ /<td(.*?)>(.*?)<\/td>/) { $Borrower = $2; }
        }
        if ($line =~ /(\s)active(\s)date\:<\/td>/) {
            $nextline = <FILE>;
            if ($nextline =~ /<td(.*?)>(.*?)<\/td>/) { $activeDate = $2; }
        }
        if ($line =~ /<td(.*?)>Amount:<\/td>/) {
            $nextline = <FILE>;
            if ($nextline =~ /<td(.*?)>(.*?)(\s)/) { $Amount = $2; }
            if ($nextline =~ /<td(.*?)>(.*?)<\/td>/) { $Amount = $2; }
        }
    }
    if ($line =~ /<\/tr>/) {
        $on = 0;
        if ($Borrower ne "null") {
            print "$Borrower\t$activeDate\t$Amount\n";
            $Borrower = "null";
            $activeDate = "null";
            $Amount = "null";
        }
    }
}
close(FILE);

The code works for the first variable, 'Borrower', but goes wrong for the rest. For the first borrower, 'activeDate' and 'Amount' are 'null', and for every later borrower they lag one position behind: the correct 'activeDate' and 'Amount' values of the first borrower show up with the second borrower, the second's with the third, and so on.

I've been stuck for a while and hope someone can help me out. Thanks a lot in advance. Regards,

Replies are listed 'Best First'.
Re: $nextline not working
by davido (Cardinal) on Mar 22, 2013 at 04:17 UTC

    "Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems."

    -- Jamie Zawinski

    "You cannot parse [X]HTML with regex."

    -- bobince on StackOverflow

    I saw this coming with the previous question, and had a response typed, but then thought perhaps I should not come down so strong against the practice. But since my fears have begun to play out, I'll post the gist of what I was going to say before:

    Regular Expressions are not an ideal tool for parsing HTML. In the trivial cases, everything will seem to work out fine. But then you'll discover that your problem isn't as trivial as it first appeared, and consequently you will come back here to ask how to further refine (or expand) the capability of the regular expressions you're using. This cycle will repeat until eventually you've got a big nasty unmaintainable, fragile heap of regular expressions that stops working every time you look away. Your time will be wasted trying to fix the HTML parsing code, when it really ought to be spent on something more interesting that hasn't already been solved before, robustly, maintainably, and cleanly.

    Look at Mojo::DOM, HTML::Parser::Simple, or HTML::Parser. I prefer the former; it's just so darn easy to use. You will spend an hour or two getting accustomed to using whichever of these tools you settle upon, and will save many hours of headaches as a result.

    Abandon the notion that your problem is simple enough to just throw a few regexes at it. Simplicity is elusive, and more so when you introduce regular expressions to a non-regular task.


    Dave

Re: $nextline not working
by McA (Priest) on Mar 22, 2013 at 03:00 UTC

    Hi,

    in your thread "parsing html" you got the entirely correct advice to use well-known, proven packages for HTML parsing. And now you're asking a question because you're stuck parsing HTML the hard and buggy way. IMHO, follow the advice. Do it right (or at least more right ;-)). Rest assured that the time invested in learning one of the packages will pay off, probably not with this problem, but with the very next HTML parsing problem. (Assume the following: everyone likes what you've done and this process gets established; then one day someone changes the line breaks in your HTML, and your script breaks.)
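To make the line-break concern concrete, here is a minimal sketch using only core Perl. The helper name `value_after_label` and both sample layouts are invented for illustration; they are not taken from the original poster's data. The sketch applies the "match a label, then look at the next line" idea to the same markup laid out two ways.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: find the line holding $label, then take the cell
# text from the line that follows it (the "next line" approach).
sub value_after_label {
    my ($label, @lines) = @_;
    for my $i (0 .. $#lines - 1) {
        next unless $lines[$i] =~ /<td[^>]*>\Q$label\E<\/td>/;
        return $1 if $lines[$i + 1] =~ /<td[^>]*>(.*?)<\/td>/;
        return;    # label found, but the following line has no cell
    }
    return;        # label not found at all
}

# Same markup, two layouts (sample data invented for this sketch).
my @one_cell_per_line = ('<tr>', '<td>Borrower:</td>', '<td>Someone</td>', '</tr>');
my @cells_on_one_line = ('<tr>', '<td>Borrower:</td><td>Someone</td>', '</tr>');

for my $layout (\@one_cell_per_line, \@cells_on_one_line) {
    my $value = value_after_label('Borrower:', @$layout);
    print defined $value ? "found: $value\n" : "nothing found\n";
}
```

The value is found only in the first layout. In the second, the label and its value share a line, so the line after the label is the closing row tag and the lookup comes back empty. A DOM-based parser does not care where the line breaks fall.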

    As stated in the other thread: Have a look at Mojo::DOM which has a very nice API and is simple to use.

    Best regards
    McA

    If you still insist on solving your problem this way, give us a snippet of your html to see the structure.

Re: $nextline not working
by McA (Priest) on Mar 22, 2013 at 03:54 UTC

    Hi,

    a solution based on many assumptions:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use Data::Dumper;
    use lib './lib/lib/perl5';
    use Mojo::DOM;

    my $html = q(
    <html>
      <head><title>Some list</title> </head>
      <body>
        <div>
          <table>
            <tr>
              <td>Borrower:</td> <td>Someone</td>
              <td>active date:</td> <td>2013-03-22</td>
              <td>Amount:</td> <td>100.00</td>
            </tr>
            <tr>
              <td>Borrower:</td> <td>SomeoneElse</td>
              <td>active date:</td> <td>2013-03-20</td>
              <td>Amount:</td> <td>10.50</td>
            </tr>
          </table>
        </div>
      </body>
    </html>
    );

    my $dom = Mojo::DOM->new($html);
    my $table = $dom->at('table');
    for my $record ($table->children('tr')->each) {
        my %record = map { $_->text } $record->children('td')->each;
        print Dumper(\%record), "\n";
    }

    McA

Re: $nextline not working
by farang (Chaplain) on Mar 22, 2013 at 17:16 UTC

    Hi qingxia. With respect to those monks who have already replied and are correctly trying to convince you to consider a different approach to parsing even simple HTML, I think you need to start at the beginning before doing that. The very beginning consists of learning how to read documentation and placing these lines at the top of every program you work on, unless you know why you are leaving them out.

    use strict; use warnings;

    In addition to strict and warnings, you may wish to add diagnostics as well.

    use strict; use warnings; use diagnostics;

    If you do this on your current program, you'll find it won't compile until you make some simple changes. Specifically, you'll need to declare your variables with my, which can be done as follows (but be aware that automatically declaring all variables at file scope like this is NOT in general good programming practice).

    my $Borrower = "null"; my $activeDate = "null"; my $Amount = "null"; my $on = 0; my $line; my $nextline;

    Once you've made these changes, the Perl interpreter will be able to help you by giving feedback when you edit your code in certain problematic ways it can identify. It is standard and highly recommended to enable 'strict' and 'warnings'.
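As a small illustration of the kind of feedback meant here, the sketch below compiles a tiny snippet under strict in which a variable name is misspelled. The snippet and the misspelling `$activDate` are made up for this example; the string eval is used only so the compile error can be caught and printed instead of stopping the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A tiny snippet compiled under strict; the assignment misspells $activeDate.
my $snippet = <<'END_SNIPPET';
use strict;
my $activeDate = "null";
$activDate = "2013-03-22";   # typo: missing "e"
1;
END_SNIPPET

# eval returns undef when the snippet fails to compile, and $@ holds the reason.
my $ok = eval $snippet;
print $ok ? "compiled fine\n" : "strict refused it: $@";
```

Without strict, the misspelled name would silently become a new global variable and $activeDate would stay "null", which is exactly the sort of bug that is hard to spot by eye.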

    Next, it is the responsibility of a programmer to know what each line of code is doing, and to read the relevant documentation when necessary. You use the expression <FILE> both in a while loop and in nested if blocks, which is sure to be confusing even if you somehow manage to get the code to do what you want. On top of that you are assigning its value to two different variables, $line and $nextline, thereby creating more confusion. Do some reading about syntax, and specifically next for better ideas. Perl syntax is rich, and with some reading and experimentation you ought to be able to easily replace what you are trying to do with your clunky variable $on with much clearer code and you may be able to see just why you aren't getting the results you desire.
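One way to read from the filehandle in a single place is sketched below, using only core Perl. The sample input and the names `$pending` and `%field` are assumptions made for the example, and the layout (each label cell followed by its value cell on the next line) is a guess at the real file. The loop remembers which label it saw on the previous line instead of pulling an extra line inside an if block.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input invented for this sketch.
my $html = <<'END_HTML';
<tr>
<td>Borrower:</td>
<td>Someone</td>
<td>active date:</td>
<td>2013-03-22</td>
</tr>
END_HTML

# Read the string through an in-memory filehandle, as if it were a file.
open my $fh, '<', \$html or die "cannot open in-memory file: $!";

my %field;      # label => value collected so far
my $pending;    # label seen on the previous line, if any

while (my $line = <$fh>) {    # the only place a line is read
    if (defined $pending) {
        # This line should carry the value for the label we just saw.
        $field{$pending} = $1 if $line =~ /<td[^>]*>(.*?)<\/td>/;
        undef $pending;
        next;
    }
    $pending = $1 if $line =~ /<td[^>]*>(Borrower|active date):<\/td>/;
}
close $fh;

print "$_ => $field{$_}\n" for sort keys %field;
```

Because every line passes through the same loop header, no line can be consumed without also being checked against the label and value patterns.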

    Standard programming practice recommends that functionally distinct ideas be separated into different blocks of code. So structuring your program to both process a file and produce output in the same block is likely to hinder your ability to improve it. Instead, try to rewrite it with better structure.

    pseudocode:

    while ( <FILE> ) {
        do_stuff_to_process_file();
        get_needed_info();
    }
    do_output_related_stuff_outside_the_above_block();
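The pseudocode above could be fleshed out roughly as follows. This is only a sketch under stated assumptions: the sub names `parse_records` and `print_records` are hypothetical, the sample table is invented, and it assumes each label cell is followed by its value cell on the next line. It still uses regexes on lines, so the earlier advice about a real HTML parser applies; the point here is only the separation of parsing from output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical parser: returns one hash reference per <tr>...</tr> block.
sub parse_records {
    my ($fh) = @_;
    my (@records, %current, $pending);
    while (my $line = <$fh>) {
        if (defined $pending) {
            # The line after a label is expected to hold its value.
            $current{$pending} = $1 if $line =~ /<td[^>]*>(.*?)<\/td>/;
            undef $pending;
        }
        elsif ($line =~ /<td[^>]*>(Borrower|active date|Amount):<\/td>/) {
            $pending = $1;
        }
        elsif ($line =~ /<\/tr>/) {
            # End of row: store a copy of what was collected, then reset.
            push @records, { %current } if %current;
            %current = ();
        }
    }
    return @records;
}

# Hypothetical output routine, kept apart from the parsing.
sub print_records {
    my (@records) = @_;
    print "Borrower\tactiveDate\tAmount\n";
    for my $r (@records) {
        print join("\t", map { $r->{$_} // 'null' } 'Borrower', 'active date', 'Amount'), "\n";
    }
}

# Sample input invented for this sketch.
my $html = <<'END_HTML';
<table>
<tr>
<td>Borrower:</td>
<td>Someone</td>
<td>active date:</td>
<td>2013-03-22</td>
<td>Amount:</td>
<td>100.00</td>
</tr>
<tr>
<td>Borrower:</td>
<td>SomeoneElse</td>
<td>active date:</td>
<td>2013-03-20</td>
<td>Amount:</td>
<td>10.50</td>
</tr>
</table>
END_HTML

open my $fh, '<', \$html or die "cannot open in-memory file: $!";
my @records = parse_records($fh);
close $fh;

print_records(@records);
```

With the records held in a list, the output format can change (tabs, CSV, a report) without touching the parsing code, and the parser can be tested on its own.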
