tracker issue : CF-3944307

Title:

cfindex should be able to strip HTML tags and decode HTML entities before indexing when type="custom"

Status/Resolution/Reason: Closed/Fixed/Fixed

Reporter/Name(from Bugbase): Henry Ho / Henry Ho (Henry Ho)

Created: 02/23/2015

Components: Text Search, Solr

Versions: 11.0

Failure Type: Enhancement Request

Found In Build/Fixed In Build: CF11_Final / 2018.0.0.310840

Priority/Frequency: Trivial / Unknown

Locale/System: English / Win CE

Vote Count: 0

Right now, cfindex indexes both the tags and the content when type="custom", whereas type="file" and type="path" seem to handle .html files correctly.

----------------------------- Additional Watson Details -----------------------------

Watson Bug ID:	3944307

External Customer Info:
External Company:  
External Customer Name: Henry
External Customer Email:  
External Test Config: My Hardware and Environment details:

CF10 u15

Attachments:

Comments:

Henry, thank you for logging this ER. Could you briefly describe how you are using this feature and what kind of data or content you are indexing? Can you share a use case that requires passing HTML content as custom data (type="custom") rather than as a file or a directory containing those files (type="file" or type="path")?
Comment by Piyush K.
8292 | February 26, 2015 08:22:24 AM GMT
This feature would be very helpful for CMSs that store content as XML/HTML fragments in the DB as varchar. Currently it is difficult to feed those columns directly into cfindex because the tags and comments are indexed as well. I tried HTMLStripCharFilterFactory in schema.xml and, after some trial and error, more or less got it to stop indexing the markup, but the original content, markup included, is somehow still stored with the entry: http://stackoverflow.com/questions/28687200/does-htmlstripcharfilterfactory-solr-3-4-strip-out-html-for-returned-fields I have since moved on to SQL Server full-text search because CF is still using the ancient Solr 3.4 and the Data Importer is only available in the Enterprise edition.
Comment by External U.
8293 | February 26, 2015 12:43:28 PM GMT
Hi Piyush, CMS is also the use case where I'd find this useful. I have HTML content stored in the db. Currently, I run reReplaceNoCase(content, "<[^>]*>", "", "all") and decodeForHTML() on that content before indexing. It'd be handy if cfindex did that by default, or had an attribute like cleanHTML=true|false. Thanks! -Aaron
Comment by Aaron N.
29046 | June 15, 2018 07:00:53 AM GMT
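
The workaround described in the last comment (strip tags with a regular expression, then decode entities, before handing the text to the indexer) can be sketched roughly as follows. This is an illustration in Python, not CFML and not what cfindex does internally; the function name clean_html and the extra comment-stripping and whitespace-collapsing steps are assumptions added for the example. The tag pattern is the same naive `<[^>]*>` used in the comment, so it is not a full HTML parser.

```python
import html
import re

# HTML comments are removed first, since a comment body may contain ">".
_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Same naive tag pattern as in the comment above: "<", anything but ">", ">".
_TAG_RE = re.compile(r"<[^>]*>")
_WS_RE = re.compile(r"\s+")


def clean_html(fragment: str) -> str:
    """Return plain text suitable for indexing (hypothetical helper)."""
    without_comments = _COMMENT_RE.sub(" ", fragment)
    # Replace tags with a space so adjacent words do not run together.
    without_tags = _TAG_RE.sub(" ", without_comments)
    # Decode entities only after stripping, so a decoded "&lt;" is never
    # mistaken for the start of a tag.
    decoded = html.unescape(without_tags)
    return _WS_RE.sub(" ", decoded).strip()


if __name__ == "__main__":
    print(clean_html("<p>Fish &amp; chips</p><!-- internal note -->"))
```

Stripping before decoding is a deliberate ordering choice in this sketch: text that was escaped in the source (for example `&lt;b`) survives as literal characters instead of being removed as markup.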