portal entry | What iT iS dESign studios

Title:

Extracting text/html out Word (.docx) files

November 03, 2018 04:44:50 PM GMT

7 Comments

<p>Repositories https://github.com/jmohler1970/WordExtractor https://github.com/jmohler1970/WordExtractor_demo Introduction We are going to be extracting out HTML from a Word (.docx) file. .docx is an example of an Open Document Format for Office Applications (ODF) file. It is a ZIP of an XML document. By unzipping the file and locating the appropriate XML file, we can process the data an generate HTML Resources https://en.wikipedia.org/wiki/OpenDocument https://helpx.adobe.com/coldfusion/cfml-reference/coldfusion-tags/tags-u-z/cfzip.html https://helpx.adobe.com/coldfusion/cfml-reference/coldfusion-functions/functions-t-z/xmlparse.html</p>
<p>The post <a rel="nofollow" href="https://coldfusion.adobe.com/2018/11/extracting-text-html-out-word-docx-files/">Extracting text/html out Word (.docx) files</a> appeared first on <a rel="nofollow" href="https://coldfusion.adobe.com">ColdFusion</a>.</p>

Labels: Blog, Learning, blog, cfscript, cfzip, learning, programming

Comments:

Well done, James! I really enjoyed this walk through. Been a loooooOOOOooooong time since I've seen any great CF tutorials. And I think I can use this. I've been wanting to convert my word documents into markdown files. Thanks for the head start!

BTW: I had no idea docx files were really just zip files. (mind-blown)

Comment by chrisg57685480

1362 | November 05, 2018 06:38:01 PM GMT

Glad you liked it!

Comment by James Mohler

1365 | November 06, 2018 04:52:25 AM GMT

Excellent content James.  This was really well done.

Comment by David Byers

1369 | November 07, 2018 09:13:37 PM GMT

Glad you liked it!

Comment by James Mohler

1370 | November 07, 2018 09:58:03 PM GMT

Awsome James. This content is very useful.

Comment by ksaravn

2046 | May 10, 2019 08:41:47 AM GMT

Nice James.
Today I used you code to do some extractions. It is difficult to do a nice commit to your git hub.

See bellow some enhancement and issue solving.

component output="false" {
this.xmlPara = ""; // parsed into XML nodes
this.xmlString = ""; // raw text
this.proofErr = "spellEnd";
VARIABLES.listcounter = 1; // used to set order list numbers
VARIABLES.listcounterOutlinelevel = [0,0,0,0,0,0,0,0,0,0]; // used to set order list numbers with outlinelevel
VARIABLES.listCounterLen = ArrayLen(VARIABLES.listcounterOutlinelevel);
variables.CRLF = Chr(13) & Chr(10);
this.headingMax = 6;
private string function ReadNode (required xml Node) {
var result = "";
var wpPr = "";
var wrPr = ""; // Does bold, italic
var wnumPr = ""; // ordered or unordered in html
var wnumID = "";
var wVal = "";
var basedOn ="";
var startElementName = "";
var wrPrNodeElement = "";
var outlinelevel = 0;
var ilvl = 0;
var pHTMLtag ="p";
if (StructIsEmpty(arguments.Node)) {
return "";
}
for (var Element in arguments.node.xmlChildren) {
startwVal = "";
/* Start Tags*/
switch (Element.xmlName) {
case "w:p" :
wVal = ""; // default paragraph style
wnumid = ""; // This actually the type of list
if (ArrayLen(Element.XMLChildren) != 0) {
/* pPr ParagraphProperties*/
if (Element.XMLChildren[1].xmlName == "w:pPr") {
wpPr = Element.XMLChildren[1];
cfloop(array=wpPr.XMLChildren,index=PPropertyIndex,item=PProperty) {
if (PProperty.xmlName == "w:pStyle") {
wVal = PProperty.XMLAttributes["w:val"];
}
if (PProperty.xmlName == "w:outlineLvl") {
outlinelevel = PProperty.XMLAttributes["w:val"];
}
if (PProperty.xmlName == "w:numPr") {
cfloop(array=PProperty.XMLChildren,index=NumropertyIndex,item=NumProperty) {
if (NumProperty.xmlName == "w:numID") {
wnumid = NumProperty.XMLAttributes["w:val"];
}
if (NumProperty.xmlName == "w:ilvl") {
ilvl = NumProperty.XMLAttributes["w:val"];
VARIABLES.listcounterOutlinelevel[ilvl+1]=0;
resetCounterValues(ilvl);
}
}
}
}
}
}
switch (wVal) {
case "ListParagraph" :
if (wnumid == 2) {
result &= '<ol start="#listcounter#"><li>#ReadNode(Element)#</li></ol>#variables.crlf#';
}
else {
result &= '<li>#ReadNode(Element)#</li>#variables.crlf#';
}
variables.listcounter++;
break;
default :
variables.listcounter = 1;
/* normal paragraph*/
if(wVal neq ""){
/* find if the style has numbering and its outlinelevel*/
/*https://docs.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.numberingproperties?view=openxml-2.8.1*/
pstyle =XmlSearch(this.xmlStyles,"/w:styles/w:style[@w:styleId='#wVal#']");
outlinelevel = JavaCast("int",XmlSearch(pstyle[1],"number(w:pPr/w:outlineLvl/@w:val)"));
ilvl = JavaCast("int",XmlSearch(pstyle[1],"number(w:pPr/w:numPr/w:ilvl/@w:val)"));
wnumId = JavaCast("int",XmlSearch(pstyle[1],"number(w:pPr/w:numPr/w:numId/@w:val)"));
basedOn = XmlSearch(pstyle[1],"string(w:basedOn/@w:val)");
pHTMLtag ="p";
for(var headingNumber =1; headingNumber lte this.headingMax;headingNumber++){
if(basedOn eq "Heading"&headingNumber OR wVal eq "Heading"&headingNumber){
pHTMLtag ="h"&headingNumber;
resetCounterValues(headingNumber);
}
}
if(wnumId gt 0 AND isnumeric(ilvl) AND ilvl gt 0 ) {
VARIABLES.listcounterOutlinelevel[ilvl+1]=0;
VARIABLES.listcounterOutlinelevel[ilvl]=VARIABLES.listcounterOutlinelevel[ilvl]+1;
result &= '<#pHTMLtag# class="#wVal#">#VARIABLES.listcounterOutlinelevel[ilvl]#. #ReadNode(Element)#</#pHTMLtag#>#variables.crlf#'; //add style id as html class name
} else {
variables.listcounter = 1;
result &= '<#pHTMLtag# class="#wVal#">#ReadNode(Element)#</#pHTMLtag#>#variables.crlf#'; //add style id as html class name
}
} else {
variables.listcounter = 1;
result &= '<p>#ReadNode(Element)#</p>#variables.crlf#';
}
}
break; // end of w:p
case "w:r" : // This handles bolds and italics
wrPr = "";
wrPrNodeElement = ReadNode(Element);
/* multiple children*/
for(var elChild in Element.XMLChildren){
if ( isArray(elChild.XMLChildren) && !arrayIsEmpty(elChild.XMLChildren)) {
/* loop */
cfloop(array=elChild.XMLChildren,index=wrPrIndex,item=wrPritem) {
wrPr = elChild.XMLChildren[wrPrIndex].XMLName;
switch (wrPr) {
case "w:b" :
wrPrNodeElement = "<b>#wrPrNodeElement#</b>";
break;
case "w:i" :
wrPrNodeElement = "<i>#wrPrNodeElement#</i>";
break;
case "w:u" :
wrPrNodeElement = "<u>#wrPrNodeElement#</u>";
break;
default :
/* Other */
break;
}
}
}
}
result &= wrPrNodeElement;
break;
case "w:t" :
result &= Element.xmlText;
break;
case "w:ProofErr" :
/* Word divides this into separate areas*/
/*skip this.proofErr = Element.XMLAttributes["w:type"];*/
break;
case "w:pStyle" :
/* skip variables.currentTag = Element.XMLAttributes["w:val"];*/
break;
case "w:instrText" :
/* skip*/
break;
default :
result &= Element.xmlText;
}
/* Inner text*/
/* result &= readNode(Element);*/
} /* End for loop on Element*/
return result;
} /* End function*/
private function resetCounterValues(required numeric depth) {
for (var i = ARGUMENTS.depth+1; i lte VARIABLES.listCounterLen;i++){
VARIABLES.listcounterOutlinelevel[i] = 0;
}
}
string function extractDocx(required string pathToDocX) {
cfzip(action="read", file=arguments.pathToDocx, entrypath="word\document.xml", variable="this.xmlString",charset = "utf-8");
cfzip(action="read", file=arguments.pathToDocx, entrypath="word\styles.xml", variable="this.xmlStyles",charset = "utf-8");
this.xmlPara = xmlparse(this.xmlString).document.body;
return ReadNode(this.xmlPara);
}
}

Comment by dingdongiuytiyt_t

2473 | October 25, 2019 03:39:23 PM GMT

Hi,

great post, thx. Is there a way, also to put the Image out of the Word-Document, which is in "w drawing", to HTML? I've read many posts (Java and PHP > mostly found payable plugins), but no clue how to do it in CF.

There must be 2 ways, inline and floating.
I've found this post usefull:<a href="https://www.toptal.com/xml/an-informal-introduction-to-docx" rel="nofollow">https://www.toptal.com/xml/an-informal-introduction-to-docx</a>

But there is no explanation how to put that into code to show it then in HTML.

Thx for any Answer

Corrado

Comment by delsalsa

4770 | June 19, 2020 07:00:25 PM GMT