"...they end up being in .txt file however still formatted in html (so you see all the html code in the .txt surrounding the actual text...."
This shouts "I haven't bothered to understand either html or the various meanings of 'text'." The last word in the quote above uses "text" in the sense of "textual content." The references to ".txt" refer to a file format; in this case, a document (something.html) that is composed of ASCII or UTF-8 characters.
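To make the distinction concrete, here is a minimal sketch (in Python, using only the standard library's html.parser) that pulls the textual content out of a string of html markup. The names TextExtractor and text_content and the sample string are my own inventions for illustration, not anything from your files, and this naive version would also keep the contents of script and style elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the character data found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def text_content(markup):
    """Return the textual content of an html string, chunks joined by spaces."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(parser.chunks)


# Hypothetical sample: the markup is the "format", the words are the "text".
sample = "<html><body><h1>Proposal 1</h1><p>Elect the directors.</p></body></html>"
print(text_content(sample))
```

The file on disk is plain characters either way; what differs is whether you are looking at the markup or at the content it wraps.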
Since you say the html markup ("code") is still present (visible), you'll almost certainly get html-formatted files merely by changing the file extension from .txt to .htm.
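The renaming itself is trivial to script. A rough sketch in Python, assuming the files sit in one directory and are readable as UTF-8; the function names and the short list of tags used to guess "this is html" are assumptions of mine, so adjust them to your data:

```python
import re
from pathlib import Path

# Crude heuristic (an assumption, not a real html detector): a few common tags.
TAG_RE = re.compile(r"<\s*(html|head|body|p|div|table|br|a)\b", re.IGNORECASE)


def looks_like_html(path):
    """Guess whether a file holds html by inspecting its first 4096 characters."""
    head = path.read_text(encoding="utf-8", errors="replace")[:4096]
    return bool(TAG_RE.search(head))


def rename_html_txt(directory):
    """Rename *.txt files that contain html markup to *.htm; return the new paths."""
    renamed = []
    for txt in sorted(Path(directory).glob("*.txt")):
        target = txt.with_suffix(".htm")
        # Never overwrite an existing .htm file of the same name.
        if looks_like_html(txt) and not target.exists():
            txt.rename(target)
            renamed.append(target)
    return renamed


# Usage (hypothetical directory name):
# rename_html_txt("downloads")
```

Files without recognizable markup are left alone, so genuinely plain-text files keep their .txt extension.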
But you've asked quite enough questions [1] that reflect an utter lack of personal effort. This is going to be your project, your thesis, and your future, not ours. So build a good foundation by taking the trouble to understand at least the basics of the relevant technology (and, as has already been suggested, understand how, when and why to seek help here).
[1] To wit: Re: Perl Possibilities, Re^3: Perl Possibilities, and Re: Perl Possibilities, where the link leaves the tedium of finding the material to which you refer (ANNUAL MEETING PROPOSALS) to the Monk seeking to help. The observations here also apply to the node to which this is addressed (Re^3: Perl Possibilities). My point is that those who seek the benefit of the Monks' effort should maximize their own beforehand.