html parser generator with GUI?

by dannoura (Pilgrim)
on Mar 28, 2008 at 12:32 UTC
Recently I've been writing lots of HTML parsers, first in Perl (OK), then in Java (not so OK). Writing parsers is fairly boring, so I thought I could ask one of the non-technical people in my company to do it with a parser generator. I know there are plenty of them around, but they all use their own grammar, which is not very suitable for non-technical people. Does anyone know of a GUI application where you can mark elements in a web page and have it create some sort of extraction grammar, or even code, to extract them?


Re: html parser generator with GUI?
by Fletch (Chancellor) on Mar 28, 2008 at 12:46 UTC

    That's actually kind of an interesting idea. I don't know of anything offhand, but you could probably repurpose or reuse one of the Firefox web dev packages like Firebug or XPather. You'd use those to find the XPath that describes the parts you're interested in and then plug that into something that pulls the relevant chunks out of the tree.

    At the least, something like that might let you pawn the tedious "dig through the page's source and figure out what you want" stage off on a non-technical person, and you just tweak a standard skeleton as necessary.
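The "standard skeleton" idea above might look roughly like the following sketch. It is only an illustration under several assumptions: the sample page, the field names, and the `extract` helper are all hypothetical, and the page is assumed to be well-formed XHTML because Python's standard `xml.etree.ElementTree` only parses XML and only supports a limited subset of XPath. Real-world HTML and full XPath expressions copied from a browser tool would need a more lenient parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page. It is deliberately well-formed, since
# ElementTree cannot parse sloppy real-world HTML.
PAGE = """<html><body>
<div id="news">
  <h2 class="title">First headline</h2>
  <a href="/a1">Read more</a>
  <h2 class="title">Second headline</h2>
  <a href="/a2">Read more</a>
</div>
</body></html>"""

# The part a non-technical colleague would supply:
# field name -> (path expression, attribute to read or None for text).
RULES = {
    "titles": (".//div[@id='news']/h2[@class='title']", None),
    "links": (".//div[@id='news']/a", "href"),
}

def extract(page, rules):
    """Apply each path in `rules` to the parsed page and collect the results."""
    root = ET.fromstring(page)
    out = {}
    for field, (path, attr) in rules.items():
        nodes = root.findall(path)
        if attr is None:
            # No attribute requested: take the element's text content.
            out[field] = ["".join(n.itertext()).strip() for n in nodes]
        else:
            out[field] = [n.get(attr) for n in nodes]
    return out

if __name__ == "__main__":
    for field, values in extract(PAGE, RULES).items():
        print(field, values)
```

With this split, only the `RULES` table changes from site to site; the skeleton around it stays the same.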

Re: html parser generator with GUI?
by zentara (Archbishop) on Mar 28, 2008 at 17:16 UTC
Re: html parser generator with GUI?
by Anonymous Monk on Mar 28, 2008 at 12:46 UTC
Re: html parser generator with GUI?
by Anonymous Monk on Jul 18, 2008 at 13:49 UTC
    You could try Visual Web Task and Happy Harvester, both of which allow you to extract text from HTML using a simple GUI. Happy Harvester is better if you want to parse lots of sites quickly, but requires you to enter start and end tags for each class of data you want to extract (e.g. an article title or link). Visual Web Task is entirely code-free, but doesn't have as many options for parsing multiple sites. Good luck, Harry
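The start-and-end-tag style of extraction mentioned above can be illustrated with a small sketch. This is not how Happy Harvester itself works internally (that is not known here); the `extract_between` helper and the sample markup are hypothetical and only show the general idea of pulling out whatever sits between two fixed markers.

```python
def extract_between(text, start, end):
    """Return every chunk of `text` found between a `start` and an `end` marker."""
    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break  # no further start marker
        begin += len(start)
        stop = text.find(end, begin)
        if stop == -1:
            break  # start marker without a matching end marker
        results.append(text[begin:stop].strip())
        pos = stop + len(end)
    return results

if __name__ == "__main__":
    # Hypothetical markup for illustration only.
    html = '<h2 class="title">One</h2><p>filler</p><h2 class="title"> Two </h2>'
    print(extract_between(html, '<h2 class="title">', "</h2>"))
```

Marker-based extraction like this is easy to configure but brittle: it breaks as soon as the site changes the exact text of the surrounding tags.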

