tracker issue : CF-3183072

Title:

xmlParse() does not correctly identify UTF-8 files


Status/Resolution/Reason: Closed/Fixed/

Reporter/Name(from Bugbase): Adam Cameron / Adam Cameron (Adam Cameron)

Created: 05/06/2012

Components: Language

Versions: 9.0.1

Failure Type: Data Corruption

Found In Build/Fixed In Build: 9.0.1 / 287710

Priority/Frequency: Critical / Some users will encounter

Locale/System: English / Windows 7

Vote Count: 1

I have a UTF-8-encoded file, which xmlParse() appears to read as ISO-8859-1, or whatever encoding CF assumes for all files unless told otherwise.

The result is that any non-ASCII characters are mangled.
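
The mangling is consistent with UTF-8 bytes being decoded with a single-byte charset. A minimal sketch of that effect, assuming ISO-8859-1 is the charset being wrongly applied (the report itself is not certain which one CF uses); chr(232) stands in for the "è" in the sample data so the snippet does not depend on the template's own encoding:

```cfml
<!--- chr(232) is "è" (U+00E8), as found in "società" in the sample XML --->
<cfset original = chr(232)>
<!--- Encode the character to its UTF-8 bytes: C3 A8 --->
<cfset utf8Bytes = charsetDecode(original, "utf-8")>
<!--- Decode those two bytes as ISO-8859-1 (assumed wrong charset): one character per byte --->
<cfset misread = charsetEncode(utf8Bytes, "iso-8859-1")>
<!--- One character has become two: "Ã" (C3) followed by "¨" (A8) --->
<cfoutput>#original# becomes #misread# (#len(misread)# characters)</cfoutput>
```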

The code I'm using is:
<cfset x = xmlParse(expandPath("./junk.xml"))>
<cfdump var="#x#">


The XML I am using is:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<DATAPACKET Version="2.0">
<ROWDATA>
<ROW Data="20000717" NotePub="La società debitrice è in liquidazione coatta amministrativa. Abbiamo provveduto a comunicare al Commissario liquidatore l'importo complessivo del credito. Ammessi al chirografo per L.2.473.460." AttIst="FALSE" Concordato="TRUE" Fallimento="FALSE" RagSocDeb="ORTO PIU' SOC. COOP. A R.L." CodDeb="RP49" NumDec="" Acconti="0.0000" Emissione="20001230" ValMat="0.0000" Notifica="20010103" Deposito="20001115" ValCap="1001.1899" IDCliente="3" IDPratica="196"/>
<ROW Data="20030120" NotePub="DIFFIDA INVIATA E NON RICEVUTA. 1 RICEVUTA Si è depositata la domanda di ammissione al passivo AMMESSO TUTTO AL CHIROGRAFO" AttIst="FALSE" Concordato="FALSE" Fallimento="TRUE" RagSocDeb="NIKO MARKET S.R.L." CodDeb="VQ23" Acconti="0.0000" ValMat="0.0000" ValCap="430.4600" IDCliente="3" IDPratica="1375" UdiVer="20030922"/>
</ROWDATA>
</DATAPACKET>

NB: this code, on the other hand, works fine:
<cfset s = fileRead(expandPath("./junk.xml"), "UTF-8")>
<cfset x = xmlParse(s)>
<cfdump var="#x#">

If xmlParse() is to read from the file system, it needs to understand that files can use different encodings. If it cannot work that out, then it should take an argument telling it which encoding to use (but it should just be able to get it right).
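
As a rough illustration of what "getting it right" could look like from user code, here is a sketch that reads the raw bytes, takes the charset from the XML declaration, and only then decodes and parses. The function name xmlParseFile is hypothetical and not part of ColdFusion, and this is not a description of how the eventual fix works. It assumes an ASCII-compatible encoding and no byte order mark, so UTF-16 files and files starting with a BOM would need extra handling.

```cfml
<cfscript>
// Hypothetical helper (not a built-in): parse an XML file using the
// charset named in its own XML declaration.
function xmlParseFile(required string path) {
    var bytes = fileReadBinary(arguments.path);
    // ISO-8859-1 maps each byte to exactly one character, so the ASCII-only
    // XML declaration can be inspected safely before the real charset is known.
    var head = left(charsetEncode(bytes, "iso-8859-1"), 200);
    var match = reFindNoCase("encoding\s*=\s*[""']([A-Za-z0-9._-]+)[""']", head, 1, true);
    // UTF-8 is what XML prescribes when no encoding is declared (and no BOM is present)
    var charset = "utf-8";
    if (match.pos[1] gt 0 and arrayLen(match.pos) gte 2) {
        charset = mid(head, match.pos[2], match.len[2]);
    }
    // Decode the whole file with the detected charset, then parse the string
    return xmlParse(charsetEncode(bytes, charset));
}
</cfscript>

<cfset x = xmlParseFile(expandPath("./junk.xml"))>
<cfdump var="#x#">
```

This is the fileRead() workaround with the "UTF-8" argument replaced by whatever the file declares.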

-- 
Adam

----------------------------- Additional Watson Details -----------------------------

Watson Bug ID:	3183072

Deployment Phase:	Release Candidate

Comments:

Since ColdFusion's default page encoding became UTF-8, developers naturally expect functions and tags to read files as UTF-8 by default. Therefore, please ensure that the functions and tags that read file content, such as xmlParse(), fileRead(), cfxml and cffile(action=read), default to UTF-8. That will save developers time. As things stand, productive time is lost discovering missing or corrupted characters, debugging the underlying encoding issue and finding a workaround.
Vote by External U.
19537 | May 07, 2012 07:19:27 AM GMT
Added a fix to the xmlParse() function so that it identifies the encoding from the given XML file.
Comment by S V.
19536 | January 13, 2014 01:21:02 AM GMT