PerlMonks  

Re^4: Perl Possibilities

by graff (Chancellor)
on Mar 18, 2016 at 22:05 UTC


in reply to Re^3: Perl Possibilities
in thread Perl Possibilities

I suggested the code above several hours before you provided a sample of data ("Filing Example" below), so I was unaware that you were dealing with HTML data. That changes things. For example, in HTML, some "whitespace" looks like this:
... Board recommends a vote FOR Proposal No. 2. ...
and given the variety of distinct sources (which presumably use distinct HTML/CSS formats and styles), I'd expect a variety of structural differences in the tags that appear in and around the patterns of interest.

BTW, on the matter of "html" vs. "txt", it doesn't matter what a given file name looks like - what matters is what the content looks like. If the content has HTML tags, it's HTML data, and needs to be treated as such, regardless of what the file name might be.

If it's typical for texts of this sort to always include a single table near the top of the document that lists the proposals with number, name, and result, it may be that your best bet is Corion's idea about HTML::TableExtract. It's just a matter of knowing which table in the overall file is the one you want.
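As a rough illustration of picking out "the one you want", here is a minimal sketch that selects a table by its column headers instead of by its position in the file. The sample HTML and the header names (Number, Proposal, Result) are made up for the example; your filings will almost certainly use different labels and messier markup, so treat this as a starting point only.

```perl
use strict;
use warnings;
use HTML::TableExtract;    # CPAN module, not part of core Perl

# Hypothetical sample data; real filings will differ in markup and headers.
my $html = <<'HTML';
<html><body>
<table><tr><td>Layout table, not the one we want</td></tr></table>
<table>
<tr><th>Number</th><th>Proposal</th><th>Result</th></tr>
<tr><td>1</td><td>Election of directors</td><td>FOR</td></tr>
<tr><td>2</td><td>Ratification of auditors</td><td>FOR</td></tr>
</table>
</body></html>
HTML

# Only tables containing all of these (assumed) column headers are extracted.
my $te = HTML::TableExtract->new( headers => [ 'Number', 'Proposal', 'Result' ] );
$te->parse($html);

my @proposals;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        # Cells can be undef or padded with whitespace, so tidy them up.
        my ( $num, $name, $result ) = map { defined $_ ? $_ : '' } @$row;
        s/^\s+|\s+$//g for $num, $name, $result;
        push @proposals, { number => $num, name => $name, result => $result };
    }
}

printf "%s | %s | %s\n", @{$_}{qw(number name result)} for @proposals;
```

If the headers vary from source to source, you could loosen the header strings (they are matched as patterns) or fall back to inspecting every table and checking its first row yourself.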

Aside from that, any other practical approach will involve parsing the HTML first to get its plain-text content before you do anything that involves string comparisons or regex matches.
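Here is a minimal sketch of that "text first, regex second" idea using HTML::Parser. The sample markup is invented, and I'm assuming the odd "whitespace" is something like `&nbsp;` entities, which decode to non-breaking spaces (character 0xA0) that a plain `\s` will not always match. The regex at the end is only an illustration of the kind of pattern you might apply.

```perl
use strict;
use warnings;
use HTML::Parser;    # CPAN module, not part of core Perl

# Hypothetical sample; the tags and entities in real filings will vary.
my $html = '<p>The Board&nbsp;recommends a vote <b>FOR</b>'
         . '&nbsp;Proposal&nbsp;No.&nbsp;2.</p>';

# Collect the decoded text of the document, ignoring script/style content.
my @chunks;
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { push @chunks, shift }, 'dtext' ],    # entities decoded
);
$parser->unbroken_text(1);                  # deliver each text run in one piece
$parser->ignore_elements(qw(script style));
$parser->parse($html);
$parser->eof;

# Join with spaces so adjacent elements don't fuse, then normalize whitespace.
# \xA0 is listed explicitly because \s may not match it on non-UTF-8 strings.
my $text = join ' ', @chunks;
$text =~ s/[\s\xA0]+/ /g;
$text =~ s/^ | $//g;

# Only now do the string comparison / regex match, on plain text.
my ( $vote, $number );
if ( $text =~ /vote (FOR|AGAINST) Proposal No\. (\d+)/ ) {
    ( $vote, $number ) = ( $1, $2 );
    print "Proposal $number: board recommends $vote\n";
}
```

Joining the chunks with a space is a judgment call: it keeps neighboring paragraphs from running together, but it would also split a word whose letters are wrapped in separate inline tags, so check that against your own data.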
