how to use regular expressions read some string from a htm file

by weihe (Initiate)
on Aug 02, 2006 at 01:04 UTC ( [id://565156]=perlquestion )

weihe has asked for the wisdom of the Perl Monks concerning the following question:

The source text file looks like this:

<html><head><title>my page></title></head>
<body>
<table><tr><td>
<a href="http://mysite/bbsui.jsp?id=dxpwd">dxpwd</a>
</td><td>
<a href="http://mysite/bbsui.jsp?id=jimeth">jimeth</a>
</td><td>
<a href="http://mysite/bbsui.jsp?id=jone28">jone28</a>
</td><td>
<a href="http://mysite/bbsui.jsp?id=25528">25528</a>
</td></tr>
.....
</body></html>

I need to read this source file and extract its URLs, like this:

http://mysite/bbsui.jsp?id=dxpwd
http://mysite/bbsui.jsp?id=jimeth
http://mysite/bbsui.jsp?id=jone28
http://mysite/bbsui.jsp?id=25528

How do I write a regular expression for this purpose?
Thanks.

Formatting cleaned up by GrandFather

Replies are listed 'Best First'.
Re: how to use regular expressions read some string from a htm file
by Zaxo (Archbishop) on Aug 02, 2006 at 01:10 UTC

    Don't use a regular expression, use HTML::LinkExtor;

    It will drive you mad to try this with regexen.
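
    A minimal sketch of what the HTML::LinkExtor approach might look like for the markup in the question. The variable names ($html, @urls, $extor) are made up for illustration, and the sample markup is shortened; the callback simply collects the href of every <a> tag it is handed.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Shortened sample markup based on the question (illustrative only).
my $html = q{
<table><tr><td>
<a href="http://mysite/bbsui.jsp?id=dxpwd">dxpwd</a>
</td><td>
<a href="http://mysite/bbsui.jsp?id=jimeth">jimeth</a>
</td></tr></table>
};

my @urls;

# The callback receives the tag name followed by attribute => value pairs
# for each link-carrying tag the parser encounters.
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @urls, $attr{href} if $tag eq 'a' && defined $attr{href};
});

$extor->parse($html);
$extor->eof;    # signal end of document so the parser flushes its buffer

print "$_\n" for @urls;
```

    No base URL is passed to the constructor here, so the href values come back exactly as written in the markup.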

    After Compline,
    Zaxo

Re: how to use regular expressions read some string from a htm file
by GrandFather (Saint) on Aug 02, 2006 at 01:13 UTC

    You don't. You use HTML::TreeBuilder or some such similar module. Life is too short to bother reinventing that particular wheel. Markup is hard to write regexen to parse because there are many special cases for handling things like white space. Try something like:

    use strict;
    use warnings;
    use HTML::TreeBuilder;

    my $str = <<'STR';
    <html><head><title>my page></title></head>
    <body>
    <table><tr><td>
    <a href="http://mysite/bbsui.jsp?id=dxpwd">dxpwd</a>
    </td><td>
    <a href="http://mysite/bbsui.jsp?id=jimeth">jimeth</a>
    </td><td>
    <a href="http://mysite/bbsui.jsp?id=jone28">jone28</a>
    </td><td>
    <a href="http://mysite/bbsui.jsp?id=25528">25528</a>
    </td></tr>
    </body></html>
    STR

    my $tree = HTML::TreeBuilder->new;
    $tree->parse ($str);
    print $_->attr ('href') . "\n" for $tree->find ('a');

    Prints:

    http://mysite/bbsui.jsp?id=dxpwd
    http://mysite/bbsui.jsp?id=jimeth
    http://mysite/bbsui.jsp?id=jone28
    http://mysite/bbsui.jsp?id=25528

    DWIM is Perl's answer to Gödel
      Thanks for answering my question, but I have another one: what should I do if my source text file includes some URLs that I don't want? For example:

      <html><head><title>my page></title></head>
      <body>
      <table><tr><td>
      <a href="http://mysite/bbsui.jsp?id=dxpwd">dxpwd</a>
      </td><td>
      <a href="http://mysite/bbsui.jsp?id=jimeth">jimeth</a>
      </td><td>
      <a href="http://mysite/bbsui.jsp?id=jone28">jone28</a>
      </td><td>
      <a href="http://mysite/bbsui.jsp?id=25528">25528</a>
      </td></tr>
      </table>
      <a href="http://www.idontneed.com">i don't need this url</a>
      <a href="http://www.asdf.com">i don't need this url too</a>
      <a href="http://www.cnn.com"> this are also no need</a>
      </body></html>

      In the result I just want to get the following:

      http://mysite/bbsui.jsp?id=dxpwd
      http://mysite/bbsui.jsp?id=jimeth
      http://mysite/bbsui.jsp?id=jone28
      http://mysite/bbsui.jsp?id=25528

        Please see the replies to "i wann get some character from a string", which contain some good general advice and at least a partial solution for you.

        Also note that you should not post the same question multiple times. This follow up question is fine to elicit more information, but you should not repost it elsewhere. Remember that the questions that you ask and the answers they receive are a resource other people will use. If you ask the same question in different places the answers will be fragmented and the resource will be less useful to other people.

        Please note too that we are not trying to be obstructive when we say "go read the documentation". We are trying to point you to the information that you need so that you have the tools to solve similar problems in the future and so that you know where and how to look for related information.

        Generally you will find that we give direct answers to questions that ask "why doesn't this code work? It does this thing and I expected it to do that thing.". We give more general "here's where to look it up" answers to "how do I tackle this general problem" questions.


        DWIM is Perl's answer to Gödel
Re: how to use regular expressions read some string from a htm file
by rsriram (Hermit) on Aug 02, 2006 at 01:34 UTC

    Hi. It is smarter to use a module than regular expressions when working with HTML files. But if you are set on using a regex, try this:

    open(my $fh, '<', $ARGV[0]) or die "Can't open the file $ARGV[0]: $!\n";
    while (my $line = <$fh>)
    {
       # /g in a while loop finds every link on the line, not just the first
       print "$1\n" while $line =~ /<a href="([^"]+)">/g;
    }
    close $fh;

    In the above script, the HTML file named by the first command-line argument is read line by line through the filehandle $fh.

Re: how to use regular expressions read some string from a htm file
by reneeb (Chaplain) on Aug 02, 2006 at 01:48 UTC
    You can use HTML::Parser:
    #! /usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser;

    my @links;
    my $string = qq~<a href="url1">linktext1</a> Ein anderer Text <a href="url2">linktext2</a> text~;

    my $p = HTML::Parser->new();
    $p->handler(start => \&start_handler,"tagname,attr,self");
    $p->parse($string);

    foreach my $link(@links){
        print "Linktext: ",$link->[1],"\tURL: ",$link->[0],"\n";
    }

    sub start_handler{
        return if(shift ne 'a');
        my ($class) = shift->{href};
        my $self = shift;
        my $text;
        $self->handler(text => sub{$text = shift;},"dtext");
        $self->handler(end => sub{push(@links,[$class,$text]) if(shift eq 'a')},"tagname");
    }
Re: how do i get special string from source of text file
by gellyfish (Monsignor) on Aug 02, 2006 at 04:09 UTC

    I'd suggest using HTML::LinkExtor to extract the URLs from the <a /> elements and then throwing away the ones you don't want afterwards. However, as you don't say how to distinguish between the ones you do want and the ones you don't, I'm not going to guess and give you an example.
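
    Purely for illustration, the extract-then-discard idea could be sketched as below. The selection rule is an assumption, since the thread never states one: here a link is kept only if it starts with http://mysite/bbsui.jsp?id=. The names $wanted, @all and @keep are hypothetical.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Shortened sample markup based on the follow-up question (illustrative only).
my $html = q{
<table><tr><td>
<a href="http://mysite/bbsui.jsp?id=dxpwd">dxpwd</a>
</td><td>
<a href="http://mysite/bbsui.jsp?id=25528">25528</a>
</td></tr></table>
<a href="http://www.idontneed.com">i don't need this url</a>
<a href="http://www.cnn.com">this are also no need</a>
};

# ASSUMPTION: wanted links are exactly those beginning with this prefix.
my $wanted = qr{^http://mysite/bbsui\.jsp\?id=};

# Step 1: collect every href found on an <a> tag.
my @all;
my $extor = HTML::LinkExtor->new(sub {
    my ($tag, %attr) = @_;
    push @all, $attr{href} if $tag eq 'a' && defined $attr{href};
});
$extor->parse($html);
$extor->eof;

# Step 2: throw away the links that do not match the assumed rule.
my @keep = grep { $_ =~ $wanted } @all;

print "$_\n" for @keep;
```

    A different criterion (for example, only links inside the table) would need a different filter, or a tree-based module instead of a flat link list.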

    /J\

Re: how to use regular expressions read some string from a htm file
by planetscape (Chancellor) on Aug 02, 2006 at 13:45 UTC

    As many others have pointed out, you don't.

    In addition to the other excellent examples above, you could also use mech-dump, which comes with WWW::Mechanize, e.g.:

    mech-dump --links http://www.perlmonks.org

    You would still, of course, need to do some post-processing to get just the links you want, but so far you have posted no criteria for determining which ones those are.

    HTH,

    planetscape

Node Type: perlquestion [id://565156]
Approved by GrandFather