tracker issue : CF-3039120

Title:

Bug 78494: extracttext does not correctly extract Unicode text

Status/Resolution/Reason: Closed/Fixed/

Reporter/Name(from Bugbase): Dave Ferguson / dave Ferguson (dave_jf)

Created: 07/02/2009

Components: Document Management, PDF manipulation

Versions: 9.0

Failure Type: Unspecified

Found In Build/Fixed In Build: 9,0,0,233019 / 241875

Priority/Frequency: Major / Unknown

Locale/System: English / Win All

Vote Count: 0

Problem:

The cfpdf extracttext action does not extract Unicode text correctly. The output XML document is declared and written as UTF-8, but every non-ASCII character in the extracted text is garbled and appears as "?".
Method:

<cfpdf action="extracttext" source="./docs/israel.pdf" destination="./textout/textdoc.txt" type="xml" overwrite="yes">


result:

<?xml version="1.0" encoding="UTF-8"?>
<DocText xmlns="http://ns.adobe.com/DDX/DocText/1.0/">
<TextPerPage>
<Page pageNumber="1">TEKL?F R??VET DOLU B?R A?K TOUCHSTONE PICTURES SUNAR B?R MANDEVILLE FILMS YAPIMI B?R ANNE FLETCHER F?LM? SANDRA BULLOCK RYAN REYNOLDS ? THE PROPOSAL ? MALIN AKERMAN CRAIG T . NELSON MARY STEENBURGEN M?Z?K VE BETTY WHITE BUCK DAMON M?Z?K AARON ZIGMAN S?PERV?Z?R? KOST?M TASARIMI CATHERINE MARIE THOMAS KURGU PRISCILLA NEDD FRIENDLY , A.C.E . YAPIM G?R?NT? TASARIMI NELSON COATES Y?NETMEN? OLIVER STAPLETON , BSC SRUMLU ALEX KURTZMAN VE ROBERTO ORCI MARY MCLAGLEN YAPIMCILAR YAPIMCILAR DAVID HOBERMAN TODD LIEBERMAN SENARYO PETER CHIARELLI SE?KiN SiNEMALARDA S E ? K i N S i N E M A L A R D A SE?KiN SiNEMALARDA Y?NETMEN ANNE FLETCHER DA?ITIM WALT DISNEY STUDIOS MOTION PICTURES ?</Page>
</TextPerPage>
</DocText>
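
The "?" pattern in the output above is what a lossy character-set conversion typically produces. The following Python sketch only illustrates that symptom; it is not ColdFusion code, and the report does not state the actual cause inside cfpdf. The Turkish source string is an assumed reconstruction of the first words of the page (the report shows only the garbled form), and `lossy_roundtrip` is a hypothetical helper name.

```python
# Illustration only: shows how a lossy charset conversion produces the same
# "?" pattern seen in the report. The real cause inside cfpdf is not stated.

# Assumed reconstruction of the first words on the page (Turkish, non-ASCII
# letters written as escapes so the file encoding cannot affect them).
original = "TEKL\u0130F R\u00dc\u015eVET DOLU B\u0130R A\u015eK"


def lossy_roundtrip(text: str, charset: str = "ascii") -> str:
    """Encode to `charset`, replacing unencodable characters, then decode."""
    return text.encode(charset, errors="replace").decode(charset)


# Every character outside ASCII is replaced by "?".
garbled = lossy_roundtrip(original)
print(garbled)

# A UTF-8 round trip keeps the text intact, which is what the XML output
# should contain given its encoding="UTF-8" declaration.
preserved = lossy_roundtrip(original, "utf-8")
print(preserved == original)
```

Under these assumptions, the garbled string matches the start of the `<Page>` content in the report, which suggests the characters were lost before or while the UTF-8 file was written rather than by the viewer used to open it.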

Result:

----------------------------- Additional Watson Details -----------------------------

Watson Bug ID:	3039120

External Customer Info:
External Company:  
External Customer Name: dave Ferguson
External Customer Email: 333762A94460DE1A992015D5
External Test Config: 07/02/2009

Attachments:

Comments: