http://www.perlmonks.org?node_id=934485

emelianenko has asked for the wisdom of the Perl Monks concerning the following question:

Hello. I don't want to imagine how annoying it is to have a newbie asking questions of gurus, but the other forums just did not work: I could not log in, I would get an "akismetspam" sentence not found error, and there was nothing for me to answer. So, sorry to bug you; I promise it will be just one question:

I have a paragraph like this:

<a href="page.aspx?a=c4bc46eswsdw32fcc">John.Martines </a></li><li><a href="page.aspx?a=0a2b-a99d-3754eb2f5e35">Mary Jones</a></li><li><a href="page.aspx?a=1ef7b100-8dc4-4b40-871c-68b1d0">Fernando Praderas</a></li><li><a href="page.aspx?a=e8ec1d77-ee83-4797-b9c4-7676053a4926">

I just want to print the first and last names that you see scattered there, always between the same patterns. But if I tell the program to read line by line, the names are divided differently: sometimes the content to print is split between two lines, and sometimes it is on the second line.

if ($line =~ /<a href="page.aspx?.*>.*</a>$/
it would have to remove all the junk I don't want and print just what remains, so:
$line =~s/(<a href="page.aspx?*>$//.*s/</a>//

That is, replace that with nothing (so remove it); then I was thinking I would also remove the final </a>, leaving $line equal to the name. But I don't think I am doing it right. I only saw regexes 10 years ago, for two weeks, and took them up again today. Any help would be greatly appreciated. Regards


Replies are listed 'Best First'.
Re: A regex question
by roboticus (Chancellor) on Oct 28, 2011 at 20:22 UTC

    emelianenko:

    Here's a quick bit of code to get you started:

    use strict;
    use warnings;

    $/=undef;
    while (my $line = <DATA>) {
        for ($line =~ m/<a[^>]*>(.*?)<\/a>/gs) {
            print "Name '$_'\n";
        }
    }

    __DATA__
    <a href="foo">Jon.Martinez</a><li>gabba, gabba, hey!</li><a href=bar>Mary
    Jones</a><p>Gazebo!</p><a href="baz">Rob Oticus</a><a>Joe Blow</a>

    Note that we slurp the whole file in at once ($/=undef); otherwise we can't find names spread over two lines (like Mary Jones). We also need the 's' switch on the regular expression to let '.' match newlines (again, to pick up Mary Jones!).

    Running it gives you:

    $ perl foo.pl 1
    Name 'Jon.Martinez'
    Name 'Mary
    Jones'
    Name 'Rob Oticus'
    Name 'Joe Blow'

    Now, having said all that: remember to review perlre and perlop. Also, you may want to use a real HTML parser instead of hacking away with regular expressions. Otherwise you may run into difficulties with unexpected formatting.
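
    To make the point about the 's' switch concrete, here is a minimal sketch that runs the same pattern with and without it. The sample string and the sub name link_texts are made up for illustration; they are not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical two-line sample: the name "Mary Jones" is split by a newline.
my $html = qq{<a href="foo">Jon.Martinez</a><a href="bar">Mary\nJones</a>};

# Return the text of every <a>...</a> element found in $text.
# The second argument toggles the /s switch ('.' also matches newlines).
sub link_texts {
    my ($text, $dot_matches_newline) = @_;
    return $dot_matches_newline
        ? $text =~ m/<a[^>]*>(.*?)<\/a>/gs
        : $text =~ m/<a[^>]*>(.*?)<\/a>/g;
}

my @plain  = link_texts($html, 0);   # misses the name that spans two lines
my @dotall = link_texts($html, 1);   # finds both names

s/\s+/ /g for @dotall;               # tidy the embedded newline for printing

print "without /s: ", join(', ', @plain),  "\n";
print "with /s:    ", join(', ', @dotall), "\n";
```

    Without the switch only Jon.Martinez comes back, because '.' stops at the newline inside "Mary Jones".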

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

    Update: changed 'e' to 's' (thanks for catching that, hbm!)

      Thank you wholeheartedly. I am going to study what you wrote. I definitely want to add Perl to my baggage, but I am finishing C now. Right after that I will, because I am a fanatic about managing information. Thank you again. Best regards
Re: A regex question
by ww (Archbishop) on Oct 29, 2011 at 04:40 UTC

    roboticus has shown how to deal with the multi-line paragraph problem, so I'm omitting that. But as the Anonymous Monk in this thread suggests, using an appropriate module or modules can save you much grief.

    What follows is NOT code to adopt or emulate; rather, it is intended to suggest the pitfalls you may encounter in dealing with HTML. Web pages may be fully compliant, well-structured (for readability and maintenance), and reasonably consistent within an <ol>, <ul>, or <table> (and those are just "fer'instances"), but don't bet the farm on that!

    That which you're scraping may well be less than well-formed, compliant or consistent. And that makes parsing difficult. This bad example confines itself to some minor variance and only in the rendered portion of a link:

    #!/usr/bin/perl
    use Modern::Perl;
    # 934485

    my ($name, @names, @scrapings, $scraping);

    while ( <DATA> ) {
        $scraping = $_;
        push @scrapings, $scraping;
    }

    for $scraping(@scrapings) {
        $scraping =~ /(\d\w+      # begin capture, num followed by 1 or more wordchars
                       [. ]*      # charclass of dot or space
                       (?:        # non-capture grouping
                         \w+      # wordchars
                         [. ]+    # one or more of a literal dot or space
                       )*         # group is OPTIONAL by quantifying it as "0 or more"
                       [\w-]+     # wordchars or hyphens
                       \s*        # zero or more spaces
                      )           # end capture before
                      <\/a>       # close link_renderable and href
                     /x;          # extended format
        if ($1) {
            $name = $1;
            push @names, $name;
        }
    }

    for $name (@names) {
        say $name;
    }
    say "\n\t Done";

    __DATA__
    <li><a href="page.aspx?1ar3xlr29">1Mary Mary QuiteContrary </a></li>
    <li><a href="page.aspx?43xlr17">2Sam Samuels</a></li>
    <li><a href="page.aspx?3719rlr17qt">3Joe.Bones</a></li>
    <li><a href="page.aspx?a=c4bc46eswsdw32fcc">4John.Martines </a></li>
    <li><a href="page.aspx?a=0a2b-a99d-3754eb2f5e35">5Mary Jones</a></li>
    <li><a href="page.aspx?a=1ef7b100-8dc4-4b40-871c-68b1d0">6Fernando Praderas</a></li>
    <li><a href="page.aspx?a=e8ec1d77-ee83-4797-b9c4-7676053a4926"> 7blifstik</a></li>
    <li>foobar baz blivitz <a href="page.aspx?a=e8ec1d77" class="b">8Frederick B. Ohlmsted</a> more blah blah blah and so on ad nauseum....</li>
    <li><a href="page.aspx?a=c4bc46eswsdw32fcc">9Ernesto Maria Santiago-Cortez</a></li>

    Executing that script produces what I believe to be data consistent with your spec:

    1Mary Mary QuiteContrary
    2Sam Samuels
    3Joe.Bones
    4John.Martines
    5Mary Jones
    6Fernando Praderas
    7blifstik
    8Frederick B. Ohlmsted
    9Ernesto Maria Santiago-Cortez

         Done

    But look at the hoops we jumped thru to get here -- for what are fundamentally small-potatoes differences in the style of the names. Strong Suggestion: Don't do it this way.
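
    For comparison, a looser pattern sidesteps most of those hoops: grab whatever sits between the opening <a ...> tag and </a>, then tidy the whitespace afterwards. This is only a sketch. The sub name link_names and the sample lines are made up for illustration, and it is still regex scraping, so the caveats about malformed HTML apply in full.

```perl
use strict;
use warnings;

# Hypothetical helper: return the trimmed text of each <a>...</a> in $html.
sub link_names {
    my ($html) = @_;
    my @names;
    while ($html =~ m{<a\b[^>]*>(.*?)</a>}gis) {
        my $name = $1;
        $name =~ s/^\s+|\s+$//g;   # trim leading and trailing whitespace
        $name =~ s/\s+/ /g;        # collapse inner runs (including newlines)
        push @names, $name if length $name;
    }
    return @names;
}

# Sample input (an assumption): a few of the variations from the data above.
my $sample = <<'HTML';
<li><a href="page.aspx?1ar3xlr29">1Mary Mary QuiteContrary </a></li>
<li><a href="page.aspx?a=e8ec1d77"> 7blifstik</a></li>
<li>foobar <a href="page.aspx?a=e8ec1d77" class="b">8Frederick B. Ohlmsted</a> more blah</li>
<li><a href="page.aspx?a=c4bc46eswsdw32fcc">9Ernesto Maria
Santiago-Cortez</a></li>
HTML

print "$_\n" for link_names($sample);
```

    The pattern no longer cares what the name looks like, only where the link text starts and ends, which is the part of the markup least likely to vary.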

Re: A regex question
by Anonymous Monk on Oct 28, 2011 at 20:40 UTC

    See Re: Help With Online Table Scraper, Re: Formating a HTML document to show certain text.

    $ lwp-download "http://perlmonks.com/?abspart=1;displaytype=displaycode;node_id=934485;part=1" junk.html
    281 bytes received

    $ perl htmltreexpather.pl junk.html _tag a
    HTML::Element=HASH(0xb31bcc) 0.1.0
    John.Martines
    /html/body/a
    /html/body/a
    /html/body/a[@href='page.aspx?a=c4bc46eswsdw32fcc']
    ------------------------------------------------------------------
    HTML::Element=HASH(0xb31d2c) 0.1.1.0.0
    Mary Jones
    /html/body/ul/li/a
    /html/body/ul/li/a
    /html/body/ul/li/a[@href='page.aspx?a=0a2b-a99d-3754eb2f5e35']
    ------------------------------------------------------------------
    HTML::Element=HASH(0xb31e0c) 0.1.1.1.0
    Fernando Praderas
    /html/body/ul/li[2]/a
    /html/body/ul/li[2]/a
    /html/body/ul/li[2]/a[@href='page.aspx?a=1ef7b100-8dc4-4b40-871c-68b1d0']
    ------------------------------------------------------------------
    ##################################################################

    HTML::Query

    use HTML::Query qw{ Query };
    print "$_\n" for Query( file => q{junk.html} )->query( q{a[href~=page.aspx]} )->as_text ;
    __END__
    John.Martines
    Mary Jones
    Fernando Praderas

      Thank you, very much appreciated. The formatting that I received via wget was definitely something to consider in addition to the rest of the work. And I will study this too. Kind regards
Re: A regex question
by Marshall (Canon) on Oct 29, 2011 at 10:28 UTC
    I like the code from roboticus. A slight reformulation is below.

    You can capture all of the names from the "match global" in one expression.

    For most of my "web scraping" code, very fast performance is not that important. Neither is being super general purpose. If I can write a short regex in 5 minutes that gets me what I want, then I go with it and if the web page changes in a year, then I write another 5 minute regex.

    What makes sense in your application depends on what you are "scraping", how often the page format changes, and what the impact of that will be (maybe the boss calling you at midnight, or just something that you have to get "done this week"). Mileage varies.

    There are some very fine HTML parsing modules and they can be used to make much more general solutions. However, I often write one regex to get to a hunk of html that has what I want and then write another regex like below to extract what I want from that hunk of stuff. Write as much code as you need, but don't write more than you have to. And no matter what you do, this HTML stuff is a very "fragile" interface - meaning that your code will break at the whim of the web developer.

    #!/usr/bin/perl -w
    use strict;

    $/=undef;
    my $data =<DATA>;
    $data =~ tr/\n/ /;   #turn \n's into spaces

    my (@names) = $data =~ m/<a[^>]*>(.*?)<\/a>/g;

    foreach (@names) {
        print "$_\n";
    }

    =prints
    Jon.Martinez
    Mary Jones
    Rob Oticus
    Joe Blow
    =cut

    __DATA__
    <a href="foo">Jon.Martinez</a><li>gabba, gabba, hey!</li><a href=bar>Mary
    Jones</a><p>Gazebo!</p><a href="baz">Rob Oticus</a><a>Joe Blow</a>
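
    The two-step approach mentioned above (one regex to isolate a hunk of HTML, a second to extract from it) might look roughly like this. The page layout, the class name "people", and the sub name names_in_list are all assumptions made up for the sketch.

```perl
use strict;
use warnings;

# Hypothetical page: only the links inside <ul class="people"> are wanted.
my $page = <<'HTML';
<a href="home.aspx">Home</a>
<ul class="people">
<li><a href="page.aspx?a=1">John.Martines </a></li>
<li><a href="page.aspx?a=2">Mary
Jones</a></li>
</ul>
<a href="about.aspx">About</a>
HTML

sub names_in_list {
    my ($html) = @_;

    # Step 1: isolate the hunk of interest; give up if it is not there.
    my ($hunk) = $html =~ m{<ul class="people">(.*?)</ul>}s
        or return;

    # Step 2: extract the link text from that hunk only.
    $hunk =~ tr/\n/ /;                          # turn newlines into spaces
    my @names = $hunk =~ m{<a[^>]*>(.*?)</a>}g;
    s/\s+$// for @names;                        # drop trailing spaces
    return @names;
}

print "$_\n" for names_in_list($page);
```

    Narrowing to the hunk first keeps the navigation links (Home, About) out of the results without making the second pattern any more complicated.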