PerlMonks  

Parsing SGML-ish Data Files

by coolmichael (Deacon)
on Aug 01, 2012 at 17:10 UTC (#984837)
coolmichael has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks, it's been a while (only seven years!).

I'm working on a project to parse, validate, and transform some large data files which look like SGML, but most definitely are not. For example, <FOO=4> is perfectly valid and well defined, and the tags do not have to nest properly as they do in SGML and XML. I've tried the SGML:: tools on CPAN, but they don't quite work. I've also tried HTML::Parser, but it chokes on attributes which use "smart quotes" (U+201D in Unicode).

I've written a pure perl finite state machine parser (and test suite) which creates a data structure I can validate, but it is very slow. Like 45 seconds on a 900Kb file. The bottleneck is the parsing phase, so I'd like to speed that up somehow.

I've squeezed as much performance out of it as I can with Devel::NYTProf, but I think if I want to get it down to 10 seconds a file I need to rewrite the parser somehow. I could go the C/XS route for it, but that would be a massive learning curve.

I haven't tried Parse::RecDescent or Parse::Yapp yet. What are your thoughts on them?

If you were writing a parser for something SGMLish (but not SGML), where would you start?

Re: Parsing SGML-ish Data Files
by Anonymous Monk on Aug 01, 2012 at 22:09 UTC

    Like 45 seconds on a 900Kb file.

    I beat you: this can take 40 seconds on a 44-char string :)

    The bottleneck is the parsing phase, so I'd like to speed that up somehow.

    I hear Marpa::XS is good for speed if you can write a BNF grammar for your language, but I've no practical experience with it :|

Re: Parsing SGML-ish Data Files
by mirod (Canon) on Aug 02, 2012 at 06:47 UTC

    It is pretty difficult to answer you with the little information you give us, but I'll try anyway ;--(

    If the data is not SGML, XML or HTML I don't think you should try to use SGML/XML/HTML tools on it. SGML and XML tools will simply not accept the data, and HTML tools will try their best, but their guess may not be what you expect.

    It really depends on the format of your data files, but I would probably try to first convert the data to XML, using regexps, and then use XML tools which are usually pretty fast. But that's because I am used to processing XML, and my output is usually either XML or HTML, so an XML transformation gives me the result I want.

    Also, the problem with your finite state machine may be the tokenizer. If you scan the input one character at a time, C-style, this may not be optimal; tokenizing using regexps may be faster.
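
    As a rough sketch of what regexp tokenizing could look like, the following uses \G with the /gc flags so each pattern picks up where the previous match stopped, pulling out a whole tag or text run per step instead of one character. The token shapes (<NAME>, <NAME=value>, </NAME>) and the tokenize name are assumptions for illustration, since the real format isn't described here.

```perl
use strict;
use warnings;

# Assumed token shapes: <NAME>, <NAME=value>, </NAME>, and plain text.
# These patterns are guesses; adjust them to the real format's rules.
sub tokenize {
    my ($buf) = @_;
    my @tokens;
    pos($buf) = 0;
    while (pos($buf) < length($buf)) {
        if ($buf =~ /\G<\/([A-Za-z][\w.-]*)>/gc) {
            push @tokens, [ 'close', $1 ];
        }
        elsif ($buf =~ /\G<([A-Za-z][\w.-]*)(?:=([^>]*))?>/gc) {
            # $2 is undef when the tag carries no "=value" part.
            push @tokens, [ 'open', $1, $2 ];
        }
        elsif ($buf =~ /\G([^<]+)/gc) {
            push @tokens, [ 'text', $1 ];
        }
        else {
            # A '<' that starts no recognizable tag: keep it as text.
            $buf =~ /\G(.)/gcs;
            push @tokens, [ 'text', $1 ];
        }
    }
    return \@tokens;
}

# Usage: print one token per line.
for my $tok (@{ tokenize('<FOO=4>some text</FOO>') }) {
    my ($type, @rest) = @$tok;
    print join(' ', $type, map { defined $_ ? $_ : '' } @rest), "\n";
}
```

    Because the tokenizer only emits a flat list of tokens, it doesn't care whether tags nest properly; that decision is left to whatever consumes the list.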

      Well, the end goal is converting to XML. Regular expressions for the conversion aren't going to work very well, as the tags aren't properly nested as they are in XML/HTML/SGML. For example <a><b></a></b> is considered valid.
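
      To make the nesting point concrete, here is a small sketch of the stack check that XML-style nesting implies. The is_properly_nested name and the simplified tag pattern are made up for this illustration; the overlapping example fails the check even though it is valid in this format.

```perl
use strict;
use warnings;

# Returns 1 if every close tag matches the most recently opened tag, else 0.
# The tag pattern is a simplification, not the real format's grammar.
sub is_properly_nested {
    my ($input) = @_;
    my @stack;
    while ($input =~ /<(\/?)([A-Za-z]\w*)[^>]*>/g) {
        my ($is_close, $name) = ($1, $2);
        if (!$is_close) {
            push @stack, $name;
            next;
        }
        # A close tag must match the innermost open tag.
        return 0 unless @stack && $stack[-1] eq $name;
        pop @stack;
    }
    return @stack ? 0 : 1;
}

for my $sample ('<a><b></b></a>', '<a><b></a></b>') {
    print "$sample: ", (is_properly_nested($sample) ? "nested" : "overlapping"), "\n";
}
```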

      I do think the speed problem is in the tokenizer. I am doing the scan one character at a time (from a buffer in memory, at least). I'm not sure how I could do that with regular expressions, but it's a good idea to look into.

        If you can show us enough of the actual structure of the data and describe the constraints on tags, attributes, etc., we should be able to at least sketch a regex-based solution or offer other alternatives for you.

        True laziness is hard work
