
Re: Can Perl generate a page break character that Microsoft Word will recognize?

by jcb (Priest)
on Dec 31, 2019 at 00:42 UTC (#11110789)


in reply to Can Perl generate a page break character that Microsoft Word will recognize?

In other words, is a Word "page break" an actual character or some other object from the Stygian Depths of Redmond?

Try an ASCII FF (form feed) character, Control-L or "\014". If that does not work, you will need to use COM Windows-isms to build up the text in Word bit by bit. Or try another trick: WordPad actually wrote RTF with a .doc extension and Word will silently accept RTF documents, so you might be able to output RTF and get Amazon to process it.
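The RTF route suggested above could look roughly like the following. This is only a sketch: `rtf_escape` and `pages_to_rtf` are made-up names, the font table is an arbitrary choice, and the input is assumed to be plain ASCII (non-ASCII text would need RTF Unicode escapes, which are not handled here). The page break itself is the RTF control word `\page`.

```perl
use strict;
use warnings;

# Escape the three characters that are special in RTF: \ { }
sub rtf_escape {
    my ($text) = @_;
    $text =~ s/([\\{}])/\\$1/g;
    return $text;
}

# Hypothetical helper: turn a list of plain-text pages into one RTF
# document, with the RTF \page control word between pages.
sub pages_to_rtf {
    my @pages = @_;
    my @body;
    for my $page (@pages) {
        my $rtf = '';
        # Each input line becomes one paragraph (\par).
        $rtf .= rtf_escape($_) . "\\par\n" for split /\n/, $page;
        push @body, $rtf;
    }
    return "{\\rtf1\\ansi\\deff0{\\fonttbl{\\f0 Times New Roman;}}\n"
         . join("\\page\n", @body)
         . "}\n";
}

print pages_to_rtf("First page, line 1\nFirst page, line 2", "Second page");
```

Redirect the output to a file with an .rtf (or .doc) extension and open it in Word to see whether the break lands where expected.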


Replies are listed 'Best First'.
Re^2: Can Perl generate a page break character that Microsoft Word will recognize?
by harangzsolt33 (Friar) on Jan 01, 2020 at 01:45 UTC
    In the old DOC file format, a Word paragraph break is a single CR character (0x0D, "\r"), as the hex dump below shows. Newer Word documents are DOCX files, which are essentially ZIP files containing several XML documents, one of which is called document.xml. This one contains the document text itself. For example, I created a simple document with two lines, "AAA" and "BBB". This was the content of the document.xml file:

    <w:body>
      <w:p w:rsidR="00D96BA8" w:rsidRDefault="00D96BA8">
        <w:r>
          <w:t>AAA</w:t>
        </w:r>
      </w:p>
      <w:p w:rsidR="00D96BA8" w:rsidRDefault="00D96BA8">
        <w:r>
          <w:t>BBB</w:t>
        </w:r>
      </w:p>
      <w:sectPr w:rsidR="00D96BA8" w:rsidSect="00354B3C">
        <w:pgSz w:w="12240" w:h="15840" />
        <w:pgMar w:top="1008" w:right="1008" w:bottom="1008" w:left="1008" w:header="720" w:footer="720" w:gutter="0" />
        <w:cols w:space="720" />
        <w:docGrid w:linePitch="360" />
      </w:sectPr>
    </w:body>

    And this is a hex dump from somewhere in the middle of the DOC file. I am not going to copy the entire file here. I know, some of you are like "whew!" lol

    Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
    000009B0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    000009C0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    000009D0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    000009E0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    000009F0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000A00  41 41 41 0D 42 42 42 0D 00 00 00 00 00 00 00 00  AAA.BBB.........
    00000A10  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000A20  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000A30  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000A40  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00000A50  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................

      Interesting. Word seems to use ASCII CR as paragraph break, so does it use ASCII LF or ASCII FF as page break? (There is also a forced end-of-line produced by Shift-Enter that does not start a new paragraph. Simply pressing Enter actually starts a new paragraph, which starts a new line as a side-effect.)
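One way to answer that question by experiment, rather than by guessing, is to tally the control bytes in the text run of a saved file. The sketch below uses a made-up helper, `control_byte_counts`, and runs it on the eight text bytes from the hex dump above; NUL padding is deliberately ignored.

```perl
use strict;
use warnings;

# Count every control byte from 0x01 to 0x1F in a chunk of raw bytes.
# NUL (0x00) is skipped because the DOC file pads with it.
sub control_byte_counts {
    my ($bytes) = @_;
    my %count;
    $count{ ord($1) }++ while $bytes =~ /([\x01-\x1F])/g;
    return \%count;
}

# The text run from the dump: "AAA", 0x0D, "BBB", 0x0D.
my $sample = "AAA\x0D" . "BBB\x0D";
my $counts = control_byte_counts($sample);

printf "0x%02X x %d\n", $_, $counts->{$_}
    for sort { $a <=> $b } keys %$counts;
# prints: 0x0D x 2
```

Running the same tally over the text run of a file saved with a Control-Enter page break (and another with a Shift-Enter line break) should show which extra control byte, if any, Word writes for each.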

      If we want to consider producing DOCX, it would be fairly easy to input AAA [Control-Enter to insert a page break] BBB and see what turns up in document.xml. Word DOC format uses Microsoft's "OLE Container" format, which turns out to be a miniature FAT filesystem, complete with its own allocation tables, and (if I remember correctly) a second FAT filesystem with smaller blocks stored inside a "file" in the outer container file. At least they only did that to one level of recursion, instead of producing a "filesystems all the way down" crawling horror.
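For the DOCX experiment, "see what turns up in document.xml" can be done from Perl with the core IO::Uncompress::Unzip module. A minimal sketch, with some assumptions: the file name is made up, `read_document_xml` is a hypothetical helper, and the member is assumed to live at word/document.xml, which is where Word normally puts it.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw(unzip $UnzipError);

# Return the contents of word/document.xml from a DOCX (ZIP) file.
sub read_document_xml {
    my ($docx_path) = @_;
    my $xml;
    unzip $docx_path => \$xml, Name => 'word/document.xml'
        or die "Cannot read word/document.xml from $docx_path: $UnzipError\n";
    return $xml;
}

# Hypothetical file: AAA [Control-Enter] BBB, saved from Word as DOCX.
my $file = 'pagebreak-test.docx';
if (-e $file) {
    my $xml = read_document_xml($file);
    print $xml, "\n";
}
```

Comparing the output for a document with and without the page break should make the extra markup easy to spot.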

        Or just look up the XML to do what you want:

        <?xml version="1.0" encoding="UTF-8"?>
        <w:document
            xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
            xmlns:o="urn:schemas-microsoft-com:office:office"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:v="urn:schemas-microsoft-com:vml"
            xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
            xmlns:w10="urn:schemas-microsoft-com:office:word"
            xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
            xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
          <w:body>
            <w:p w:rsidR="00D479B1" w:rsidRDefault="00D479B1">
              <w:r>
                <w:t>1234</w:t>
              </w:r>
            </w:p>
            <w:p w:rsidR="00D479B1" w:rsidRDefault="00D479B1">
              <w:r>
                <w:t>5678</w:t>
              </w:r>
            </w:p>
            <w:sectPr w:rsidR="00D479B1" w:rsidSect="00D479B1">
              <w:pgSz w:w="11906" w:h="16838" />
              <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="708" w:footer="708" w:gutter="0" />
              <w:cols w:space="708" />
              <w:docGrid w:linePitch="360" />
            </w:sectPr>
          </w:body>
        </w:document>

        becomes:

        <?xml version="1.0" encoding="UTF-8"?>
        <w:document
            xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math"
            xmlns:o="urn:schemas-microsoft-com:office:office"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:v="urn:schemas-microsoft-com:vml"
            xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"
            xmlns:w10="urn:schemas-microsoft-com:office:word"
            xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"
            xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
          <w:body>
            <w:p w:rsidR="00D479B1" w:rsidRDefault="00D479B1">
              <w:r>
                <w:t>1234</w:t>
              </w:r>
            </w:p>
            <w:p>
              <w:r>
                <w:br w:type="page" />
              </w:r>
            </w:p>
            <w:p w:rsidR="00D479B1" w:rsidRDefault="00D479B1">
              <w:r>
                <w:t>5678</w:t>
              </w:r>
            </w:p>
            <w:sectPr w:rsidR="00D479B1" w:rsidSect="00D479B1">
              <w:pgSz w:w="11906" w:h="16838" />
              <w:pgMar w:top="1440" w:right="1800" w:bottom="1440" w:left="1800" w:header="708" w:footer="708" w:gutter="0" />
              <w:cols w:space="708" />
              <w:docGrid w:linePitch="360" />
            </w:sectPr>
          </w:body>
        </w:document>
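If the pages are already available as strings in Perl, the paragraphs for the body could be generated along these lines. This is a sketch only: `xml_escape` and `pages_to_body_xml` are made-up names, the optional w:rsidR attributes are left out, and the result is just the paragraph markup that goes inside <w:body>, not a complete document.xml or DOCX package.

```perl
use strict;
use warnings;

# The page-break paragraph taken from the second document above.
use constant PAGE_BREAK => '<w:p><w:r><w:br w:type="page" /></w:r></w:p>';

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: one <w:p> per input line, with a page-break
# paragraph between consecutive pages.
sub pages_to_body_xml {
    my @pages = @_;
    my @chunks;
    for my $page (@pages) {
        my $xml = '';
        for my $line (split /\n/, $page) {
            $xml .= '<w:p><w:r><w:t>' . xml_escape($line) . '</w:t></w:r></w:p>';
        }
        push @chunks, $xml;
    }
    return join(PAGE_BREAK, @chunks);
}

print pages_to_body_xml('1234', '5678'), "\n";
```

The output still has to be placed before the <w:sectPr> element of an existing document.xml and zipped back into the DOCX container.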

        See also the other links already provided in this thread, and their associated links. To be honest, your workflow ('I'm using Perl to scrape text from a JavaScript that printed out one page at a time..') seems somewhat convoluted, but you don't go into much detail.
