Monday, January 26, 2009

How to write XML data with Xerces that can be edited with a Unicode text editor

Serializing XML data with Xerces

The DOMBuilder class provides an API for parsing XML documents and building the corresponding DOM document tree, while the DOMWriter class provides an API for serializing (writing) a DOM document out as an XML document. To serialize XML data, first load it into a DOM tree using a DOMBuilder, then use a DOMWriter to write out the DOM tree. For example:


Listing 1. Serializing XML data
// DOMImplementationLS contains factory methods for creating objects
// that implement the DOMBuilder and the DOMWriter interfaces
static const XMLCh gLS[] = { chLatin_L, chLatin_S, chNull };
DOMImplementation *impl =
    DOMImplementationRegistry::getDOMImplementation(gLS);

// construct the DOMBuilder
DOMBuilder* myParser = ((DOMImplementationLS*)impl)->
    createDOMBuilder(DOMImplementationLS::MODE_SYNCHRONOUS, 0);

// parse the XML data, assume it is saved in a local file
// called "theXMLFile.xml"
// the DOMBuilder will parse the data and return it as a DOM tree
DOMNode* aDOMNode = myParser->parseURI("theXMLFile.xml");

// construct the DOMWriter
DOMWriter* myWriter = ((DOMImplementationLS*)impl)->createDOMWriter();

// optionally, set some DOMWriter features
// set the format-pretty-print feature
if (myWriter->canSetFeature(XMLUni::fgDOMWRTFormatPrettyPrint, true))
    myWriter->setFeature(XMLUni::fgDOMWRTFormatPrettyPrint, true);

// set the byte-order-mark feature
if (myWriter->canSetFeature(XMLUni::fgDOMWRTBOM, true))
    myWriter->setFeature(XMLUni::fgDOMWRTBOM, true);

// serialize the DOMNode to a UTF-16 string
XMLCh* theXMLString_Unicode = myWriter->writeToString(*aDOMNode);

// release the memory
XMLString::release(&theXMLString_Unicode);
myWriter->release();
myParser->release();



Both DOMBuilder and DOMWriter are constructed using factory methods from DOMImplementationLS. When finished, they both need to be released explicitly to relinquish any associated resources. Also, the returned string from writeToString is owned by the caller, who is responsible for releasing the allocated memory.
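
Because these objects are freed by calling release() rather than delete, it is easy to leak them on an early return or an exception. One way to make the explicit release automatic is a std::unique_ptr with a custom deleter. The sketch below is only an illustration: Releaser, Released and FakeWriter are hypothetical names that are not part of Xerces-C++, and FakeWriter merely stands in for any class that exposes a release() member.

```cpp
#include <memory>

// Deleter that calls release() on the object instead of delete.
// Hypothetical helper, not part of Xerces-C++.
struct Releaser {
    template <typename T>
    void operator()(T* p) const {
        if (p) p->release();
    }
};

// Owning pointer that releases its object when it goes out of scope.
template <typename T>
using Released = std::unique_ptr<T, Releaser>;

// Hypothetical stand-in for a factory-created object with a release()
// member; it counts how often release() has been called.
struct FakeWriter {
    int* releaseCount;
    explicit FakeWriter(int* counter) : releaseCount(counter) {}
    void release() {
        ++*releaseCount;  // record the call before the object goes away
        delete this;
    }
};
```

With a real parser or writer, the same wrapper would presumably be used as Released<DOMWriter> around the pointer returned by createDOMWriter(), assuming the class has an accessible release() member as described above. The string returned by writeToString is a different case: it is freed with XMLString::release(&ptr), so it would need its own deleter.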

You can also opt to set some features that control the behavior of the DOMWriter. Xerces-C++ has implemented a number of DOMWriter features that are specified in the W3C DOM Level 3 Load and Save Specification. A complete list can be found in the Xerces-C++ programming guide, DOMWriter Supported Features (see Resources). A couple of them are worth highlighting:

format-pretty-print -- This formats the output by adding line breaks and indentation to produce a pretty-printed, human-readable form. The W3C DOM Level 3 Load and Save Specification does not define the exact transformations, so Xerces-C++ applies its own interpretation. Releases prior to Xerces-C++ 2.2 (or XML4C 5.1) pretty-print only the prologue and the epilogue and leave the content within the root element untouched. From Xerces-C++ 2.2 (or XML4C 5.1) onwards, turning on this feature also formats the content within the root element.
byte-order-mark -- This is a non-standard extension added in Xerces-C++ 2.2 (or XML4C 5.1) that enables writing the Byte-Order-Mark (BOM) to the resulting XML stream. The BOM is written at the beginning of the stream if and only if a DOMDocument node is being serialized and the output encoding is one of the following:

UTF-16
UTF-16LE
UTF-16BE
UCS-4
UCS-4LE
UCS-4BE
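
A Unicode-aware text editor typically uses these leading bytes to detect the encoding of the file, which is why the feature matters for hand-editing the output. As a rough illustration of what the BOM looks like on disk, the sketch below maps each encoding name in the list to its byte sequence. The functions bomFor and startsWithBom are hypothetical helpers, not Xerces-C++ APIs. For the unqualified names UTF-16 and UCS-4, the sketch assumes the caller supplies the byte order; how the serializer actually picks it is not stated in the text above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the byte-order-mark bytes for one of the encoding names listed
// above, or an empty vector for any other name.
// ASSUMPTION: for plain "UTF-16" and "UCS-4" the byte order is taken from
// the bigEndian flag passed in by the caller.
std::vector<unsigned char> bomFor(const std::string& encoding, bool bigEndian)
{
    if (encoding == "UTF-16LE" || (encoding == "UTF-16" && !bigEndian))
        return {0xFF, 0xFE};
    if (encoding == "UTF-16BE" || (encoding == "UTF-16" && bigEndian))
        return {0xFE, 0xFF};
    if (encoding == "UCS-4LE" || (encoding == "UCS-4" && !bigEndian))
        return {0xFF, 0xFE, 0x00, 0x00};
    if (encoding == "UCS-4BE" || (encoding == "UCS-4" && bigEndian))
        return {0x00, 0x00, 0xFE, 0xFF};
    return {};
}

// Returns true when data begins with the (non-empty) BOM sequence.
bool startsWithBom(const std::vector<unsigned char>& data,
                   const std::vector<unsigned char>& bom)
{
    if (bom.empty() || data.size() < bom.size())
        return false;
    for (std::size_t i = 0; i < bom.size(); ++i) {
        if (data[i] != bom[i])
            return false;
    }
    return true;
}
```

Note that the UTF-16LE mark (FF FE) is a prefix of the UCS-4LE mark (FF FE 00 00), so a detector built on these helpers should test the four-byte marks before the two-byte ones.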
