tracker issue | What iT iS dESign studios

Title:

cfindex should be able to strip html tags as well as decode html entity before indexing for type="custom"


Status/Resolution/Reason: Closed/Fixed/Fixed

Reporter/Name(from Bugbase): Henry Ho / Henry Ho (Henry Ho)

Created: 02/23/2015

Components: Text Search, Solr

Versions: 11.0

Failure Type: Enhancement Request

Found In Build/Fixed In Build: CF11_Final / 2018.0.0.310840

Priority/Frequency: Trivial / Unknown

Locale/System: English / Win CE

Vote Count: 0

Right now cfindex indexes both the HTML tags and the content when type="custom", whereas type="file" or type="path" seems to work fine with .html files.

----------------------------- Additional Watson Details -----------------------------

Watson Bug ID:	3944307

External Customer Info:
External Company:  
External Customer Name: Henry
External Customer Email:  
External Test Config: My Hardware and Environment details:

CF10 u15

Attachments:

Comments:

Henry,

Thank you for logging this ER. 
Could you briefly describe how you are using this feature and what kind of data/content you are indexing?
Can you share a use case that necessitates passing the HTML files/content as a custom data type (type="custom") instead of as a file or a directory containing those files (type="file"/"path")?

Comment by Piyush K.

8292 | February 26, 2015 08:22:24 AM GMT

This feature would be very helpful for a CMS that stores content as XML/HTML fragments in the DB as varchar.

Currently it's difficult to feed those columns directly into cfindex because the tags and comments will also be indexed.

I tried HTMLStripCharFilterFactory in schema.xml, and after some trial and error I more or less got it to stop indexing the markup, but the content is somehow still stored together with the entry. http://stackoverflow.com/questions/28687200/does-htmlstripcharfilterfactory-solr-3-4-strip-out-html-for-returned-fields

I have moved on to using SQL Server full-text search because CF is still using the ancient Solr 3.4 and the Data Importer is only available in the Enterprise edition.

Comment by External U.

8293 | February 26, 2015 12:43:28 PM GMT

Hi Piyush,

CMS is also the use case where I'd find this useful. I have HTML content stored in the db. Currently, I run reReplaceNoCase("<[^>]*>", "", "all") and decodeForHTML() on that content before indexing.

It'd be handy if cfindex did that by default, or had an attribute like cleanHTML=true|false.

Thanks!
-Aaron

Comment by Aaron N.

29046 | June 15, 2018 07:00:53 AM GMT
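
The pre-processing workaround described in the comment above (strip the tags with a regex, then decode the HTML entities) can be illustrated with a minimal sketch. It is written in Python rather than CFML purely so it is self-contained; the helper name clean_html is hypothetical, and this is not how cfindex or Solr implements the fix. The regex is the same one quoted in the comment, so it shares its limits: it does not treat script/style bodies or comments containing ">" specially.

```python
import html
import re

# Same pattern as the regex quoted in the comment: anything between "<" and ">".
TAG_PATTERN = re.compile(r"<[^>]*>")


def clean_html(fragment: str) -> str:
    """Hypothetical helper: strip tags first, then decode HTML entities."""
    # Tags are removed before decoding so that escaped text such as
    # "&lt;fresh&gt;" is not mistaken for a real tag and dropped.
    without_tags = TAG_PATTERN.sub("", fragment)
    return html.unescape(without_tags)


if __name__ == "__main__":
    # Text as it might come out of a varchar column in a CMS table.
    print(clean_html("<p>Fish &amp; chips &lt;fresh&gt;</p>"))
```

The order of the two steps matters: decoding first would turn escaped angle brackets into literal ones, and the tag regex would then delete legitimate text. Replacing tags with an empty string can also join words from adjacent elements; substituting a space instead may be preferable for search indexing.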

tracker issue : CF-3944307
