http://www.perlmonks.org?node_id=984953


in reply to Parsing SGML-ish Data Files

It is pretty difficult to answer you with the little information you have given us, but I'll try anyway ;--(

If the data is not SGML, XML or HTML, I don't think you should try to use SGML/XML/HTML tools on it. SGML and XML tools will simply reject the data, and HTML tools will do their best, but their guess may not be what you expect.

It really depends on the format of your data files, but I would probably try to first convert the data to XML, using regexps, and then use XML tools which are usually pretty fast. But that's because I am used to processing XML, and my output is usually either XML or HTML, so an XML transformation gives me the result I want.
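As a rough illustration of the "convert with regexps, then use XML tools" idea, here is a minimal sketch in Python (the same regex works in Perl). The input format is purely an assumption, since the real one is unknown: one field per line, written as `<name>value` with no closing tag, inside `<record>` ... `</record>` blocks. The names `FIELD_RE`, `to_xml` and `SAMPLE` are made up for the example.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# ASSUMED format: a field is a line of the form "<tag>value" with no closing tag.
FIELD_RE = re.compile(r"^<(\w+)>(.+)$", re.MULTILINE)

def to_xml(raw):
    """Close every one-line field, escape its text, and wrap it all in a root."""
    def close_field(m):
        name, value = m.group(1), m.group(2).strip()
        return f"<{name}>{escape(value)}</{name}>"
    return "<records>\n" + FIELD_RE.sub(close_field, raw) + "\n</records>"

# Hypothetical sample data, not taken from the original question.
SAMPLE = """<record>
<name>Alice
<city>Paris & Lyon
</record>
<record>
<name>Bob
<city>Oslo
</record>"""

# Once the data is well-formed XML, an ordinary XML parser can take over.
root = ET.fromstring(to_xml(SAMPLE))
for rec in root.findall("record"):
    print(rec.find("name").text, "/", rec.find("city").text)
```

Lines that hold only a tag (`<record>`, `</record>`) are left alone because the regex requires some text after the opening tag. A real file will almost certainly need more rules than this one.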

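Also, the problem with your finite state machine may be the tokenizer. If you scan the input one character at a time, C-style, that is probably not optimal; tokenizing with regexps may be faster, because the regex engine consumes a whole token per match instead of running your own code once per character.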
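A regexp-based tokenizer could look something like the sketch below (Python here, but the pattern translates almost directly to Perl). The token grammar is an assumption for illustration only: opening tags, closing tags, runs of text, and a stray `<` that does not start a tag. `TOKEN_RE` and `tokenize` are hypothetical names.

```python
import re

# ASSUMED token grammar for an SGML-like input; adapt it to the real format.
TOKEN_RE = re.compile(r"""
    </(?P<close>[A-Za-z][\w.-]*)\s*>     # closing tag: capture the name
  | <(?P<open>[A-Za-z][\w.-]*)[^<>]*>    # opening tag: attributes are ignored
  | (?P<text>[^<]+)                      # run of character data
  | (?P<stray><)                         # lone '<' that does not start a tag
""", re.VERBOSE)

def tokenize(data):
    """Yield (kind, value) pairs, one regex match per token."""
    for m in TOKEN_RE.finditer(data):
        kind = m.lastgroup          # name of the alternative that matched
        yield kind, m.group(kind)

for token in tokenize("<rec><name>Bob</name> x < y</rec>"):
    print(token)
```

The state machine then switches on `kind` rather than on individual characters, so the per-character loop moves out of your code and into the regex engine.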