Re^4: Perl Possibilities

I suggested the code above several hours before you provided a sample of data ("Filing Example" below), so I was unaware that you were dealing with HTML data. That changes things. For example, in HTML, some "whitespace" looks like this:

... Board recommends a vote FOR Proposal No.&nbsp;2. ...

and given the variety of distinct sources (which presumably use distinct HTML/CSS formats and styles), I'd expect a variety of structural differences in the tags that appear in and around the patterns of interest.
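To make the &nbsp; problem concrete, here's a minimal sketch. It assumes HTML::Entities (part of the HTML-Parser distribution on CPAN) is installed, and the "naive" pattern is only my guess at what a regex written for plain-text input might look like:

```perl
use strict;
use warnings;
use HTML::Entities;    # exports decode_entities() by default

# Sample text taken from the line quoted above.
my $raw = 'Board recommends a vote FOR Proposal No.&nbsp;2.';

# Hypothetical pattern written with plain-text input in mind.
my $naive   = qr/Proposal No\.\s+(\d+)/;
my $raw_hit = $raw =~ $naive ? 1 : 0;    # "&nbsp;" is six ordinary characters here

# Decoding turns &nbsp; into the single character "\xA0" (no-break space).
my $decoded = decode_entities($raw);

# Accept the no-break space explicitly, alongside ordinary whitespace.
my ($num) = $decoded =~ /Proposal No\.[\s\xA0]+(\d+)/;

print "raw matched: $raw_hit, proposal after decoding: $num\n";
```

Decoding entities only deals with this one symptom; it does nothing about tags that land in the middle of a phrase you're trying to match.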

BTW, on the matter of "html" vs. "txt", it doesn't matter what a given file name looks like - what matters is what the content looks like. If the content has HTML tags, it's HTML data, and needs to be treated as such, regardless of what the file name might be.

If texts of this sort typically include a single table near the top of the document that lists the proposals with number, name, and result, your best bet may be Corion's idea about HTML::TableExtract. It's then just a matter of knowing which table in the overall file is the one you want.
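One way to pick out "the one you want" is to select the table by its column headers. This is only a sketch: the sample HTML and the header names ('Proposal', 'Description', 'Result') are made up for illustration, so you'd have to adjust them to whatever your filings actually use.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up sample: a layout table first, then the proposals table.
my $html = <<'HTML';
<html><body>
<table><tr><td>Annual Meeting</td><td>Page 1</td></tr></table>
<table>
  <tr><th>Proposal</th><th>Description</th><th>Result</th></tr>
  <tr><td>1</td><td>Election of directors</td><td>Approved</td></tr>
  <tr><td>2</td><td>Ratification of auditors</td><td>Approved</td></tr>
</table>
</body></html>
HTML

# Only tables with a header row matching all three patterns are kept.
my $te = HTML::TableExtract->new(
    headers => [ 'Proposal', 'Description', 'Result' ],
);
$te->parse($html);

my @proposals;
for my $table ( $te->tables ) {
    for my $row ( $table->rows ) {
        # Guard against undefined cells before using the values.
        my ( $num, $name, $result ) = map { defined $_ ? $_ : '' } @$row;
        push @proposals, { number => $num, name => $name, result => $result };
    }
}

printf "%s: %s (%s)\n", @{$_}{qw(number name result)} for @proposals;
```

If the sources don't agree on header wording, HTML::TableExtract can also select tables by depth and count instead, but then you need to know where the table sits in each layout.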

Aside from that, any other practical approach will involve parsing the HTML first to get its plain-text content before you do anything that involves string comparisons or regex matches.
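Something along these lines, as a rough sketch. I'm assuming HTML::TreeBuilder (from the HTML-Tree distribution) here; the sample markup and the regex are invented for illustration, and real filings will surely need a looser pattern:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented sample: a tag and an entity sit inside the phrase of interest.
my $html = '<html><body><p>The <b>Board</b> recommends a vote '
         . 'FOR Proposal No.&nbsp;2.</p></body></html>';

# Parse the markup, pull out the text content, then free the tree.
my $tree = HTML::TreeBuilder->new_from_content($html);
my $text = $tree->as_text;
$tree->delete;

# Turn no-break spaces into plain spaces and collapse whitespace runs.
$text =~ tr/\xA0/ /;
$text =~ s/\s+/ /g;

# Only now do the string matching, against plain text.
my ( $vote, $num ) =
    $text =~ /recommends a vote (FOR|AGAINST) Proposal No\. ?(\d+)/;

print "vote: $vote, proposal: $num\n" if defined $vote;
```

Note that as_text simply concatenates the text nodes, so text from adjacent block elements (table cells, paragraphs) can end up run together with no space in between. If that matters for your patterns, you'd walk the tree and join the pieces yourself.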
