PerlMonks  

conversion from doc to html

by srocks (Initiate)
on Dec 05, 2011 at 13:38 UTC ( [id://941846]=perlquestion )

srocks has asked for the wisdom of the Perl Monks concerning the following question:

I need to generate HTML by reading each line of a Word file, including headings, contents, etc., all in the same order. As a beginner, could you please share some sample programs for converting a Word file into an HTML file, so that I can get a better idea of how to do it?

Replies are listed 'Best First'.
Re: conversion from doc to html
by moritz (Cardinal) on Dec 05, 2011 at 13:53 UTC
Re: conversion from doc to html
by wfsp (Abbot) on Dec 05, 2011 at 15:16 UTC
    Some googling turned up this by Util. ++ to him. This is a cut down version that could get you started.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use Win32::OLE;
    use Win32::OLE::Enum;

    # Attach to the running Word instance and its open document.
    my $word = Win32::OLE->GetActiveObject('Word.Application')
        or die "Word is not running\n";
    my $document   = $word->ActiveDocument;
    my $paragraphs = $document->Paragraphs();
    my $enumerate  = Win32::OLE::Enum->new($paragraphs);

    while ( my $paragraph = $enumerate->Next() ) {
        my $style = $paragraph->{Style}->{NameLocal};
        my $text  = $paragraph->{Range}->{Text};
        $text =~ tr{\n\r}{}d;     # strip paragraph terminators
        $text =~ tr{\x0b}{\n};    # vertical tab (manual line break) -> newline
        printf qq{%s -> ***%s***\n}, $style, $text;
    }
    It assumes a document is open in Word. My simple document parsed as
    Heading 1 -> ***Heading 1 text***
    Heading 2 -> ***Heading 2 text***
    Normal -> ***Normal***
    For producing HTML I would consider something like HTML::Element.
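To show how the style/text pairs printed above could become HTML, here is a minimal sketch. It deliberately uses plain string handling instead of HTML::Element so that it runs with core Perl only. The `%tag_for` mapping, the sub names, and the sample data are all assumptions made for illustration; in real use the pairs would come from the Win32::OLE loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical mapping from Word style names to HTML tags; extend as needed.
my %tag_for = (
    'Heading 1' => 'h1',
    'Heading 2' => 'h2',
    'Normal'    => 'p',
);

# Escape the characters that are special in HTML text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Turn one (style, text) pair into an HTML element; unknown styles become <p>.
sub paragraph_to_html {
    my ($style, $text) = @_;
    my $tag = $tag_for{$style} // 'p';
    return sprintf '<%s>%s</%s>', $tag, escape_html($text), $tag;
}

# Sample data standing in for what the Word loop would produce.
my @paragraphs = (
    [ 'Heading 1', 'Heading 1 text' ],
    [ 'Heading 2', 'Heading 2 text' ],
    [ 'Normal',    'Fish & chips' ],
);

print paragraph_to_html(@$_), "\n" for @paragraphs;
```

Tables and images are not handled here; they would need their own cases.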
Re: conversion from doc to html
by Khen1950fx (Canon) on Dec 05, 2011 at 16:23 UTC

    This is as simple as I could make it for you. Using:

    Text::FromAny
    HTML::FromText
    #!/usr/bin/perl -slw
    use strict;
    use warnings;
    use Text::FromAny;
    use HTML::FromText ();

    my $any  = Text::FromAny->new(file => 'test-basic.doc');
    my $text = $any->text;

    my $t2h  = HTML::FromText->new({ paras => 1 });
    my $html = $t2h->parse($text);
    print $html;

      It is not working on my machine. It shows this message:

      Can't locate Text/FromAny.pm in @INC (@INC contains: C:/strawberry/perl/site/lib C:/strawberry/perl/vendor/lib C:/strawberry/perl/lib .) at sample_html.pl line 5.
      BEGIN failed--compilation aborted at sample_html.pl line 5.

      Please confirm and let me know.

        So why don't you install the missing module?
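A "Can't locate ... in @INC" error means Perl could not find the module's file in any of the listed directories. As a rough sketch, the check below reports which of the two modules from the script above are loadable and prints an install hint for the rest. The sub name `module_available` is made up for this example, and the hint assumes the `cpan` client is on your PATH.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return 1 if the named module can be loaded from @INC, 0 otherwise.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # Text::FromAny -> Text/FromAny.pm
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(Text::FromAny HTML::FromText)) {
    if ( module_available($module) ) {
        print "$module is installed\n";
    }
    else {
        print "$module is missing; try: cpan $module\n";
    }
}
```

What it prints depends on what is installed on the machine where it runs.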
Re: conversion from doc to html
by vinian (Beadle) on Dec 05, 2011 at 15:06 UTC

    When I searched CPAN, I found that Win32::OLE can read a doc file ($wd = Win32::OLE->GetObject("D:\\Data\\Message.doc");), and there are many more modules on https://metacpan.org/, but Win32::OLE does not work on my Debian box. There is an example that reads a doc file and writes its content to a plain text file: using-perl-to-read-microsoft-word-documents. Maybe you can use this module to read the doc file and then put the words in the proper position in the HTML file. When you write the HTML file, a template module such as HTML::Template can ease your work.

Re: conversion from doc to html
by TJPride (Pilgrim) on Dec 05, 2011 at 13:44 UTC
    Here's a stupid suggestion - why not use Microsoft Word to export as HTML? Pretty sure it can do that.
      1) MS Word's conversion to .html has been and still is badly borked (TTBOMK - I haven't checked the latest version), with enormous bloating to add non-standard MS tags. Don't use it, unless you don't care.

      2) Why not learn a little HTML -- an hour or so with a decent tut (w3schools comes to mind) -- and you can do the conversion yourself... by saving as text and adding the necessary tags. Generally, unless the Word .doc is extraordinarily complex, that's a quick and painless operation.
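If adding the tags by hand gets tedious, the "save as text, then tag it" idea can also be scripted. The following is only a sketch under simple assumptions: paragraphs are separated by blank lines, everything becomes a `<p>` element, and headings, tables, and images are ignored. The sub name `text_to_html` and the sample text are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Wrap each blank-line-separated block of plain text in <p> tags.
sub text_to_html {
    my ($text) = @_;
    my @html;
    for my $block ( split /\n\s*\n/, $text ) {
        $block =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next unless length $block;   # skip empty blocks
        $block =~ s/&/&amp;/g;       # escape HTML-special characters
        $block =~ s/</&lt;/g;
        $block =~ s/>/&gt;/g;
        push @html, "<p>$block</p>";
    }
    return join "\n", @html;
}

# Sample input standing in for a document saved as plain text.
my $sample = "First paragraph.\n\nSecond paragraph, with 1 < 2.\n";
print text_to_html($sample), "\n";
```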

        Some old version of htmltidy had a handy word-2000 option:

        word-2000

        Type: Boolean
        Default: no
        Example: y/n, yes/no, t/f, true/false, 1/0

        This option specifies if Tidy should go to great pains to strip out all the surplus stuff Microsoft Word 2000 inserts when you save Word documents as "Web pages". Doesn’t handle embedded images or VML. You should consider using Word’s "Save As: Web Page, Filtered".

        I used it some years ago, with sufficient (but not very pretty) results.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      It is not working. My Word doc has a table and an image, which produce garbage values during conversion.
