DocFormats/core/src/xml/DFDOM.h - incubator-retired-corinthia - Git at Google

 // Licensed to the Apache Software Foundation (ASF) under one
 // or more contributor license agreements.  See the NOTICE file
 // distributed with this work for additional information
 // regarding copyright ownership.  The ASF licenses this file
 // to you under the Apache License, Version 2.0 (the
 // "License"); you may not use this file except in compliance
 // with the License.  You may obtain a copy of the License at
 //
 //   http://www.apache.org/licenses/LICENSE-2.0
 //
 // Unless required by applicable law or agreed to in writing,
 // software distributed under the License is distributed on an
 // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
 // KIND, either express or implied.  See the License for the
 // specific language governing permissions and limitations
 // under the License.

 #ifndef DocFormats_DFDOM_h
 #define DocFormats_DFDOM_h

 /** \file

  # DocFormats Document Object Model (DOM)

  This file defines data structures and functions for manipulating parsed XML data, represented in
  memory as a tree. It is inspired by the Document Object Model (DOM) API commonly used for
  processing XML data, but does not follow the API strictly.

  The two primary classes are DFDocument and DFNode. A DFDocument represents a parsed XML or HTML
  document, and acts as a container for all of the nodes within the document. A DFNode object most
  commonly represents either an element (represented textually as
  `<element-name> ... </element-name>`) or a text node (containing literal text from the XML
  document, residing inside an element). Comments, CDATA sections, and processing instructions are
  also DFNode objects, created with the corresponding DFCreate functions declared below.

  Every DFDocument object has a root node, and every DFNode object has a doubly-linked list of
  children. You can traverse the tree of nodes by using the first and next fields defined in the
  DFNode struct, determine the element or node type via the \ref tag field, and read the textual
  content (for text nodes) via the value field.
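
  As a minimal sketch of the traversal described above, the following walks a subtree through the
  first and next links and counts its text nodes. MiniNode, MINI_TEXT, and countTextNodes are
  hypothetical stand-ins introduced only for this illustration; MiniNode keeps just the DFNode
  fields the loop needs, and MINI_TEXT mirrors the value of DOM_TEXT.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for DFNode, reduced to the fields used here.
typedef struct MiniNode MiniNode;
struct MiniNode {
    int tag;            // node type; MINI_TEXT marks a text node
    MiniNode *first;    // first child, or NULL
    MiniNode *next;     // next sibling, or NULL
    const char *value;  // text content for text nodes
};

enum { MINI_TEXT = 2 }; // assumption: same value as DOM_TEXT

// Count the text nodes in the subtree rooted at node, including node itself.
static size_t countTextNodes(const MiniNode *node)
{
    assert(node != NULL);
    size_t count = (node->tag == MINI_TEXT) ? 1 : 0;
    for (const MiniNode *child = node->first; child != NULL; child = child->next)
        count += countTextNodes(child);
    return count;
}
```

  The same loop shape applies to a real DFNode, since it exposes first and next with the same
  meaning.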

  ## Tags

  Naming of XML elements and attributes is considerably complicated by the use of namespaces.
  Conceptually, every element is identified by a (namespace URI, local name) pair, which determines
  its semantic meaning. However, these pairs are not directly specified in source files, as namespace
  URIs can be quite long. Instead, the XML Namespace specification [1] specifies a mechanism to map
  *prefixes* to namespace URIs, and elements in the textual representation are named based on a
  (prefix, local name) combination. With this mechanism, a program reading an XML document must first
  look up the prefix mapping to determine the namespace URI, before it can correctly identify an
  element. Worse, the mapping between prefixes and namespace URIs can differ between different parts
  of the document.

  DocFormats uses an in-memory representation of XML elements in which the name is replaced by a
  numeric *tag*. Each possible tag corresponds to a (namespace URI, local name) pair, which
  simplifies checking the types of elements to a simple integer comparison, rather than a complicated
  symbol resolution algorithm. Furthermore, these tags can optionally be declared as pre-defined
  constants, as in \ref DFXMLNames.h, enabling them to be used in switch statements, which improves
  performance of code that has to check for many different element types.
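
  To illustrate why integer tags are convenient, here is a small sketch of dispatching on an
  element type with a switch. The EX_HTML_ constants and describeTag are hypothetical names made up
  for this example; they are not the constants declared by the library.

```c
#include <assert.h>
#include <string.h>

// Hypothetical tag constants, chosen only for this example.
enum {
    EX_HTML_P = 10,
    EX_HTML_H1 = 11,
    EX_HTML_TABLE = 12
};

// Classify an element by its tag with one switch; no string comparison is needed.
static const char *describeTag(int tag)
{
    switch (tag) {
        case EX_HTML_P:
            return "paragraph";
        case EX_HTML_H1:
            return "heading";
        case EX_HTML_TABLE:
            return "table";
        default:
            return "other";
    }
}
```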

  Each document has a \ref DFNameMap object which stores the information necessary to map between
  numeric tags and (namespace URI, local name) pairs, and also stores the default prefix to be used
  for each namespace URI. During parsing, as carried out by DFParseXMLFile() and DFParseXMLString(),
  the textual names given in the source file are resolved to tags. Each tag in the constructed tree
  is therefore either a pre-defined tag from the default name map, or a tag whose numeric value was
  dynamically allocated during parsing. During serialisation, namespace mappings based on the
  default prefix for each namespace URI are set on the root element of the output file, and tags
  are translated into (prefix:local name) combinations for the actual start and end tags of
  elements.
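
  The idea of allocating tags dynamically for names seen during parsing can be sketched as
  follows. This is not how DFNameMap is implemented; ExNameMap, exTagForName, and the capacity
  limit are assumptions made for the example, and a linear search is used purely for brevity.

```c
#include <assert.h>
#include <string.h>

#define EX_MAX_NAMES 64
#define EX_FIRST_TAG 10   // assumption: same value as MIN_ELEMENT_TAG

// Hypothetical fixed-capacity name map: slot i holds the name for tag EX_FIRST_TAG + i.
typedef struct {
    const char *uri[EX_MAX_NAMES];
    const char *local[EX_MAX_NAMES];
    int count;
} ExNameMap;

// Return the tag for (uri, local), allocating the next free tag the first time a
// pair is seen. Returns 0 if the map is full. Strings are borrowed, not copied.
static int exTagForName(ExNameMap *map, const char *uri, const char *local)
{
    assert(map != NULL && uri != NULL && local != NULL);
    for (int i = 0; i < map->count; i++) {
        if (strcmp(map->uri[i], uri) == 0 && strcmp(map->local[i], local) == 0)
            return EX_FIRST_TAG + i;
    }
    if (map->count == EX_MAX_NAMES)
        return 0;
    map->uri[map->count] = uri;
    map->local[map->count] = local;
    return EX_FIRST_TAG + map->count++;
}
```

  Looking up the same pair twice yields the same tag, while the same local name in a different
  namespace yields a different one, which is the property the tag scheme relies on.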

  ## Memory allocation

  DFDocument objects are reference-counted, but the memory occupied by all DFNode objects is managed
  by their containing documents. As you create new nodes, the document itself takes care of
  allocating memory for them, and also keeping track of all the nodes it has allocated. When a
  document is freed --- that is, when its reference count drops to zero --- all nodes allocated by
  that document are also freed.

  This approach is for performance reasons. In an earlier version of the library, each DFNode object
  was allocated separately, as a reference-counted Objective C object. This implied large overheads
  both for incrementing and decrementing the reference count, and for traversing through the node
  tree releasing all the references individually. The approach used now, implemented by DFAllocator,
  makes only a few calls to `malloc` to allocate large chunks of memory, and then allocates portions
  of those memory blocks itself for the DFNode objects. When a DFDocument object is freed, those
  blocks are simply released in one go, without the need to individually inspect each node to check
  its reference count. The same memory blocks are used to allocate strings for attribute values,
  which also get freed along with the document.
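
  The block-allocation strategy described above can be sketched with a simple arena. This is an
  illustration only, not the DFAllocator implementation: ExArena and its functions are
  hypothetical, the arena holds a single block, and the 16-byte alignment is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Hypothetical single-block arena; a fuller allocator would chain further blocks.
typedef struct {
    unsigned char *block;
    size_t size;
    size_t used;
} ExArena;

// Reserve one large block up front. Returns 1 on success, 0 on failure.
static int exArenaInit(ExArena *arena, size_t size)
{
    assert(arena != NULL);
    arena->block = malloc(size);
    arena->size = (arena->block != NULL) ? size : 0;
    arena->used = 0;
    return arena->block != NULL;
}

// Hand out the next n bytes, rounded up to a multiple of 16 so that each
// allocation stays aligned. Returns NULL when the block is exhausted.
static void *exArenaAlloc(ExArena *arena, size_t n)
{
    const size_t align = 16;
    size_t rounded = (n + align - 1) / align * align;
    if (rounded < n || rounded > arena->size - arena->used)
        return NULL;
    void *result = arena->block + arena->used;
    arena->used += rounded;
    return result;
}

// Release everything in one call; individual allocations are never freed.
static void exArenaFree(ExArena *arena)
{
    free(arena->block);
    arena->block = NULL;
    arena->size = 0;
    arena->used = 0;
}
```

  Each allocation is just a pointer bump, and teardown is a single free regardless of how many
  objects were handed out.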

  The previous paragraph describes an implementation detail which you don't need to worry about
  when using documents. Just call DFDocumentNew() or DFDocumentNewWithRoot() to create a document,
  and DFDocumentRelease() when you are finished with it. All memory that has been allocated for
  nodes, attribute values, and text node values will automatically be freed when the document's
  reference count drops to zero, as it would after a simple new/release sequence.
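
  The retain/release lifecycle can be sketched with a small reference-counted owner. ExDoc and
  its functions are hypothetical stand-ins, not the DFDocument implementation; in particular,
  exDocRelease returns a flag only so the example can show when the object is destroyed, whereas
  DFDocumentRelease() returns void.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical reference-counted owner; storage stands in for everything it owns.
typedef struct {
    size_t retainCount;
    void *storage;
} ExDoc;

// Create an object with a reference count of 1, or return NULL on failure.
static ExDoc *exDocNew(void)
{
    ExDoc *doc = calloc(1, sizeof(ExDoc));
    if (doc == NULL)
        return NULL;
    doc->storage = malloc(256);
    if (doc->storage == NULL) {
        free(doc);
        return NULL;
    }
    doc->retainCount = 1;
    return doc;
}

// Take an additional reference and return the same pointer.
static ExDoc *exDocRetain(ExDoc *doc)
{
    if (doc != NULL)
        doc->retainCount++;
    return doc;
}

// Drop one reference. Returns 1 if this call freed the object, 0 otherwise.
static int exDocRelease(ExDoc *doc)
{
    if (doc == NULL)
        return 0;
    assert(doc->retainCount > 0);
    if (--doc->retainCount > 0)
        return 0;
    free(doc->storage);
    free(doc);
    return 1;
}
```

  A plain new/release pair frees the object at once; each extra retain must be balanced by one
  more release before that happens.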

  */

 #include "DFXMLNamespaces.h"
 #include "DFXMLNames.h"
 #include <DocFormats/DFXMLForward.h>
 #include "DFBuffer.h"
 #include <stdarg.h>

 #define DOM_DOCUMENT                 1
 #define DOM_TEXT                     2
 #define DOM_COMMENT                  3
 #define DOM_CDATA                    4
 #define DOM_PROCESSING_INSTRUCTION   5

 #define MIN_ELEMENT_TAG              10

 typedef struct {
     Tag tag;
     char *value;
 } DFAttribute;

 ////////////////////////////////////////////////////////////////////////////////////////////////////
 //                                                                                                //
 //                                             DFNode                                             //
 //                                                                                                //
 ////////////////////////////////////////////////////////////////////////////////////////////////////

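 /**

  A DFNode is a single node in a document tree: an element, a text node, a comment, a CDATA
  section, a processing instruction, or the document node itself. The tag field holds either one of
  the DOM_ constants defined above or, for elements, a value of at least MIN_ELEMENT_TAG. The
  parent, first, last, next, and prev fields link the node into the tree, and doc points to the
  document that owns the node.

  */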
 struct DFNode {
     Tag tag;
     DFNode *parent;
     DFNode *first;
     DFNode *last;
     DFNode *next;
     DFNode *prev;
     unsigned int seqNo;
     struct DFDocument *doc;
     void *js;
     int changed;
     int childrenChanged;
     DFNode *seqNoHashNext;
     DFAttribute *attrs;
     unsigned int attrsCount;
     unsigned int attrsAlloc;
     char *target;
     char *value;
 };

 ////////////////////////////////////////////////////////////////////////////////////////////////////
 //                                                                                                //
 //                                           DFDocument                                           //
 //                                                                                                //
 ////////////////////////////////////////////////////////////////////////////////////////////////////

 #define SEQNO_HASH_SIZE 3571

 /**

  The DFDocument class represents an XML or HTML document in memory which has either been parsed from
  a file, or has been newly-created and is destined to be serialised to a file.

  */
 struct DFDocument {
     size_t retainCount;
     struct DFAllocator *allocator;
     DFNode *seqNoHashBins[SEQNO_HASH_SIZE];
     struct DFHashTable *nodesByIdAttr;
     DFNode **nodes;
     size_t nodesCount;
     size_t nodesAlloc;

     struct DFNameMap *map;
     DFNode *docNode;
     DFNode *root;
     unsigned int nextSeqNo;
 };

 DFDocument *DFDocumentNew(void);
 DFDocument *DFDocumentNewWithRoot(Tag rootTag);
 DFDocument *DFDocumentRetain(DFDocument *doc);
 void DFDocumentRelease(DFDocument *doc);

 void DFDocumentReassignSeqNos(DFDocument *doc);

 DFNode *DFNodeForSeqNo(DFDocument *doc, unsigned int seqNo);
 DFNode *DFElementForIdAttr(DFDocument *doc, const char *idAttr);

 ////////////////////////////////////////////////////////////////////////////////////////////////////
 //                                                                                                //
 //                                               DOM                                              //
 //                                                                                                //
 ////////////////////////////////////////////////////////////////////////////////////////////////////

 // Strings

 char *DFCopyString(DFDocument *doc, const char *str);

 // Document methods

 DFNode *DFCreateElement(DFDocument *doc, Tag tag);
 DFNode *DFCreateTextNode(DFDocument *doc, const char *data);
 DFNode *DFCreateComment(DFDocument *doc, const char *data);
 DFNode *DFCreateCDATASection(DFDocument *doc, char *data);
 DFNode *DFCreateProcessingInstruction(DFDocument *doc, const char *target, const char *content);

 // Node methods

 void DFInsertBefore(DFNode *parent, DFNode *newChild, DFNode *refChild);
 void DFAppendChild(DFNode *parent, DFNode *newChild);
 DFNode *DFCreateChildElement(DFNode *parent, Tag tag);
 DFNode *DFCreateChildTextNode(DFNode *parent, const char *data);
 void DFRemoveNode(DFNode *node);
 void DFRemoveNodeButKeepChildren(DFNode *node);
 void DFSetNodeValue(DFNode *node, const char *value);

 // Element methods

 const char *DFGetAttribute(DFNode *node, Tag tag);
 const char *DFGetChildAttribute(DFNode *parent, Tag childTag, Tag attrTag);
 void DFSetAttribute(DFNode *element, Tag tag, const char *value);
 void DFVFormatAttribute(DFNode *element, Tag tag, const char *format, va_list ap);
 void DFFormatAttribute(DFNode *element, Tag tag, const char *format, ...) ATTRIBUTE_FORMAT(printf,3,4);
 void DFRemoveAttribute(DFNode *element, Tag tag);
 void DFRemoveAllAttributes(DFNode *element);

 // Tree traversal

 DFNode *DFPrevNode(DFNode *node);
 DFNode *DFNextNodeAfter(DFNode *node);
 DFNode *DFNextNode(DFNode *node);

 // Names

 Tag DFLookupTag(DFDocument *doc, const char *URI, const char *name);
 const char *DFTagName(DFDocument *doc, Tag tag);
 const char *DFTagURI(DFDocument *doc, Tag tag);
 const char *DFNodeName(DFNode *node);
 const char *DFNodeURI(DFNode *node);

 // Misc

 void DFNodeTextToBuffer(DFNode *node, DFBuffer *buf);
 char *DFNodeTextToString(DFNode *node);
 void DFStripIds(DFNode *node);
 DFNode *DFChildWithTag(DFNode *parent, Tag tag);
 void DFRemoveWhitespaceNodes(DFNode *node);
 int DFIsWhitespaceNode(DFNode *node);
 int identicalAttributesExcept(DFNode *first, DFNode *second, Tag except);
 void DFStripWhitespace(DFNode *node);

 #endif