Re: conversion from doc to html
by moritz (Cardinal) on Dec 05, 2011 at 13:53 UTC
|
Another option is to try unoconv, which uses OpenOffice or LibreOffice to do the actual work. Doing it all yourself would be a lot of work.
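A minimal sketch of calling unoconv from Perl. It assumes the unoconv command is installed and on your PATH, and that your version accepts html as a value for its -f (format) option. The sub name unoconv_command is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the argument list for converting one document to HTML.
# Assumption: unoconv accepts 'html' as an output format for -f.
sub unoconv_command {
    my ($file) = @_;
    return ('unoconv', '-f', 'html', $file);
}

# Only run the conversion when a file name is given on the command line.
if (@ARGV) {
    my @cmd = unoconv_command($ARGV[0]);
    # The list form of system() bypasses the shell, so file names with
    # spaces or quotes need no extra escaping.
    system(@cmd) == 0
        or die "unoconv failed (status $?)\n";
}
```

Run it as perl convert.pl yourfile.doc. By default unoconv should write the converted file next to the input, but check that against your installation.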
| [reply] |
Re: conversion from doc to html
by wfsp (Abbot) on Dec 05, 2011 at 15:16 UTC
|
Some googling turned up this by Util. ++ to him. This is a cut-down version that could get you started.
#!/usr/bin/perl
use strict;
use warnings;
use Win32::OLE;
use Win32::OLE::Enum;

# Attach to an already-running Word instance (undef if none is active).
my $word = Win32::OLE->GetActiveObject('Word.Application')
    or die "Word is not running\n";
my $document = $word->ActiveDocument
    or die "No document is open in Word\n";
my $paragraphs = $document->Paragraphs();
my $enumerate = Win32::OLE::Enum->new($paragraphs);
while ( my $paragraph = $enumerate->Next() ) {
    my $style = $paragraph->{Style}->{NameLocal};
    my $text  = $paragraph->{Range}->{Text};
    $text =~ tr{\n\r}{}d;     # drop the paragraph terminators
    $text =~ tr{\x0b}{\n};    # vertical tab is Word's manual line break
    printf qq{%s -> ***%s***\n}, $style, $text;
}
It assumes a document is open in Word. My simple document parsed as
Heading 1 -> ***Heading 1 text***
Heading 2 -> ***Heading 2 text***
Normal -> ***Normal***
For producing HTML I would consider something like HTML::Element. | [reply] [d/l] [select] |
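To illustrate the HTML::Element suggestion above, here is a rough sketch that turns style/text pairs like the ones printed by the script into HTML. The %tag_for mapping and the sub name paragraphs_to_html are assumptions made for this example, and the sample data is hard-coded rather than read from Word. HTML::Element comes with the HTML-Tree distribution on CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Element;

# Hypothetical mapping from Word style names to HTML tags.
my %tag_for = (
    'Heading 1' => 'h1',
    'Heading 2' => 'h2',
    'Normal'    => 'p',
);

# Turn a list of [style, text] pairs into an HTML string.
sub paragraphs_to_html {
    my @paragraphs = @_;
    my $body = HTML::Element->new('body');
    for my $pair (@paragraphs) {
        my ($style, $text) = @$pair;
        my $tag = $tag_for{$style} || 'p';    # unknown styles fall back to <p>
        my $element = HTML::Element->new($tag);
        $element->push_content($text);        # special characters are escaped on output
        $body->push_content($element);
    }
    # undef: default entity encoding, '  ': indent string, {}: emit every end tag
    return $body->as_HTML(undef, '  ', {});
}

print paragraphs_to_html(
    ['Heading 1', 'Heading 1 text'],
    ['Heading 2', 'Heading 2 text'],
    ['Normal',    'Fish & chips'],
);
```

In a real script the pairs would be collected inside the while loop instead of being printed.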
Re: conversion from doc to html
by Khen1950fx (Canon) on Dec 05, 2011 at 16:23 UTC
|
#!/usr/bin/perl -slw
use strict;
use warnings;
use Text::FromAny;
use HTML::FromText ();
# Extract the plain text from the Word document.
my $any = Text::FromAny->new(file => 'test-basic.doc');
my $text = $any->text;
# Mark the text up as HTML, treating blocks of text as paragraphs.
my $t2h = HTML::FromText->new({ paras => 1 });
my $html = $t2h->parse( $text );
print $html;
| [reply] [d/l] |
|
So why don't you install the missing module?
| [reply] |
Re: conversion from doc to html
by vinian (Beadle) on Dec 05, 2011 at 15:06 UTC
|
When I searched CPAN, I found that Win32::OLE can read .doc files ($wd = Win32::OLE->GetObject("D:\\Data\\Message.doc");), and there are many more modules on https://metacpan.org/. However, Win32::OLE does not work on my Debian box. There is an example that reads a .doc file and writes its content to a plain text file: using-perl-to-read-microsoft-word-documents.
Maybe you can use this module to read the .doc file and then put the words in the proper position in the HTML file. When you write the HTML file, a template module such as HTML::Template can ease your work.
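A small sketch of the HTML::Template idea, assuming the text has already been extracted into a title and a list of paragraphs. The template, the parameter names (title, paragraphs, text) and the sub name render_page are all invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Template;

# Hypothetical page template; ESCAPE="HTML" makes the module escape
# characters such as < and & in the inserted values.
my $template_text = <<'END_TEMPLATE';
<html>
<head><title><TMPL_VAR NAME="title" ESCAPE="HTML"></title></head>
<body>
<TMPL_LOOP NAME="paragraphs"><p><TMPL_VAR NAME="text" ESCAPE="HTML"></p>
</TMPL_LOOP></body>
</html>
END_TEMPLATE

# Fill the template with a title and a list of paragraph strings.
sub render_page {
    my ($title, @paragraphs) = @_;
    my $template = HTML::Template->new(scalarref => \$template_text);
    $template->param(
        title      => $title,
        # TMPL_LOOP expects an array reference of hash references.
        paragraphs => [ map { +{ text => $_ } } @paragraphs ],
    );
    return $template->output;
}

print render_page('Message', 'First paragraph.', 'Second paragraph, with a < b.');
```

Keeping the markup in a template means the page layout can change without touching the extraction code.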
| [reply] [d/l] |
Re: conversion from doc to html
by TJPride (Pilgrim) on Dec 05, 2011 at 13:44 UTC
|
Here's a stupid suggestion - why not use Microsoft Word to export as HTML? Pretty sure it can do that. | [reply] |
|
1) MS Word's conversion to .html has been and still is badly borked (TTBOMK - I haven't checked the latest version), with enormous bloating to add non-standard MS tags. Don't use it, unless you don't care.
2) Why not learn a little HTML -- an hour or so with a decent tut (w3schools comes to mind) -- and you can do the conversion yourself... by saving as text and adding the necessary tags. Generally, unless the Word .doc is extraordinarily complex, that's a quick and painless operation.
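For a long document, the "save as text and add the tags" step can be scripted. A minimal sketch using only core Perl, assuming the saved text separates paragraphs with blank lines (adjust the split if your export puts each paragraph on its own line). The sub names are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the three characters that are special in HTML text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first, or it would re-escape the others
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Wrap each blank-line-separated block of text in <p>...</p>.
sub text_to_html {
    my ($text) = @_;
    my @out;
    for my $para (split /\n\s*\n/, $text) {
        $para =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
        next unless length $para;
        push @out, '<p>' . escape_html($para) . "</p>\n";
    }
    return join '', @out;
}

# When a file name is given, read the whole file and print the HTML.
if (@ARGV) {
    local $/;    # slurp mode
    print text_to_html(scalar <>);
}
```

Headings, lists and links still have to be tagged by hand, which is where the hour with a tutorial comes in.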
| [reply] |
|
Some old version of htmltidy had a handy word-2000 option:
word-2000
Type: Boolean
Default: no
Example: y/n, yes/no, t/f, true/false, 1/0
This option specifies if Tidy should go to great pains to
strip out all the surplus stuff Microsoft Word 2000 inserts
when you save Word documents as "Web pages".
Doesn’t handle embedded images or VML. You should
consider using Word’s "Save As: Web Page,
Filtered".
I used it some years ago, with sufficient (but not very pretty) results.
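A sketch of running tidy with that option from Perl, assuming a tidy binary on your PATH that still supports word-2000 and accepts configuration options as --option value on the command line. The sub name tidy_command and the file names are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the tidy command line for cleaning up a Word-exported HTML file.
sub tidy_command {
    my ($in, $out) = @_;
    return ('tidy', '--word-2000', 'yes', '-o', $out, $in);
}

# Usage: perl cleanup.pl word-export.html cleaned.html
if (@ARGV == 2) {
    my @cmd = tidy_command(@ARGV);
    system(@cmd);
    die "could not run tidy: $!\n" if $? == -1;
    my $exit = $? >> 8;
    # tidy exits with 1 when it only has warnings and 2 when it found errors.
    die "tidy reported errors\n" if $exit >= 2;
}
```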
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
| [reply] |