tracker issue : CF-4206330

Title:

DWG files are not indexing correctly in Solr

Status/Resolution/Reason: Closed/Withdrawn/ThirdParty

Reporter/Name(from Bugbase): rojin t. / ()

Created: 12/12/2019

Components: Text Search, Solr

Versions: 2018

Failure Type: Data Loss

Found In Build/Fixed In Build: CF 2018 /

Priority/Frequency: Normal / All users will encounter

Locale/System: English / Win 2016

Vote Count: 0

Problem Description: I am trying to index DWG files in Solr, but the CONTEXT field returns only the file name (xyz.dwg) instead of the text inside the DWG file. Likewise, I am unable to index the contents of other AutoCAD file types (dxf, dfg, rvt, etc.).

Steps to Reproduce:

<!--- Index starts here --->
<cfindex action="update" 
	extensions=".*" 
	type="file"
	collection="#collection#"
	recurse="yes" 
	key="C:\xyz\1344464051.dwg"
	urlpath="/CAD" 
	status = "insert"
/>
<cfdump var="#insert#">
<!--- Index ends here --->


<!--- Search starts here --->
<cfsearch 
	type="standard"
	collection="#collection#" 
	status="docsearchstatus" 
	name="docsearch" 
	criteria='"1344464051.dwg"' 
	suggestions="Never" 
	contextpassages="1" 
	contextbytes="300"
>
<cfdump var="#docsearch#">
<!--- Search ends here --->

Actual Result: 1344464051.dwg

Expected Result: All text from the DWG file

Any Workarounds:

Attachments:

Comments:

Rojin, Solr uses the Tika library to extract text from indexed files. Tika apparently can extract only metadata from Autodesk files. Ref.: https://tika.apache.org/1.23/formats.html#Full_list_of_Supported_Formats Could you update the metadata and verify whether the search result returns the updated information in the "summary" field? With the sample file I tried, the summary field returns only "Autodesk".
Comment by Piyush K.
32025 | January 01, 2020 04:04:28 PM GMT
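
The check suggested in the comment above can be sketched as follows. This is a minimal illustration, not a confirmed fix: it assumes the same `collection` variable and search criteria as the reproduction steps, and it assumes the file's metadata has been edited and the file re-indexed with `cfindex` beforehand. It prints the `key`, `title`, `summary`, and `context` columns of the `cfsearch` result side by side, so it is easy to see whether updated metadata reaches the index even though the drawing's text does not.

```cfml
<!--- Sketch: show what Solr stored for the matching DWG document. --->
<!--- Assumes #collection# is defined as in the reproduction steps. --->
<cfsearch
	collection="#collection#"
	name="docsearch"
	criteria='"1344464051.dwg"'
	contextpassages="1"
	contextbytes="300"
>

<!--- One block per matching document; values are HTML-encoded for display. --->
<cfoutput query="docsearch">
	Key: #encodeForHTML(docsearch.key)#<br>
	Title: #encodeForHTML(docsearch.title)#<br>
	Summary: #encodeForHTML(docsearch.summary)#<br>
	Context: #encodeForHTML(docsearch.context)#<br>
	<hr>
</cfoutput>
```

If `summary` changes after the metadata is edited while `context` still shows only the file name, that would be consistent with Tika extracting metadata but not body text for this format.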
This looks like a design limitation of the Tika library: it extracts only the file metadata, not the content. Closing this.
Comment by Piyush K.
33177 | February 25, 2020 11:20:51 AM GMT