Bit by Bit
Teradata XML Publishing streams large XML documents to make processing more manageable.
Srinivas Pandrangi
Extensible markup language (XML) is one of the most popular stand-alone data formats. The means to produce and consume XML messages for database applications is provided through Teradata XML Services. The key components of Teradata XML Services are XML Publishing (publish database data as XML documents) and XML Shredding (shred values from XML documents into database tables). These components are implemented through a set of functions and stored procedures (SPs).
The need for streamed processing
XML documents come in different sizes and structures to match a wide variety of uses. When XML is used as a message format or a format for configuration files, the documents are likely to be smaller, up to a few hundred kilobytes. It is also likely that the document structure will vary widely in these cases. When XML is used for formatting application logs or moving data between databases, the documents could be much larger, hundreds of megabytes or even a few gigabytes. The document structures in these cases are often simple and repetitive. So what is the implication of this to the way Teradata XML Services is used?
Processing arbitrarily large XML documents can be resource intensive. For instance, creating a large document in its entirety is not very efficient or even feasible in some circumstances. The amount of memory that can be allocated in any function invocation has an upper limit, so it might be impossible to hold a large document in memory either in literal or object form for shredding or publishing. Teradata's Streamed XML Publishing overcomes this limitation by processing the large XML documents in small fragments so they can be worked on within resource constraints.
Teradata XML Publishing overview
The Teradata XML Publishing framework converts the results of user queries into XML documents. The publishing process includes a design phase and a run phase. In the design phase, the developer defines how the tabular structure of a query result set maps to the hierarchical structure of an XML document. The output of this phase is a mapping definition. Teradata XML Services stores this mapping definition in the form of an extensible stylesheet language transformation (XSLT) document. This mapping definition is registered with the database to be used subsequently in the run phase for publishing.
In the run phase, the user invokes a publishing SP with a query string and the mapping definition. The output of the SP is one or more XML documents representing the input query results. The typical publishing SP returns complete documents, not document fragments. When publishing large XML documents, the process can be hindered by resource limitations such as the maximum amount of memory available per function call. The streaming version of the publishing SP returns the documents in a stream of XML fragments that, when concatenated, result in the final XML document. By working with smaller fragments rather than documents serialized in their entirety, we can work around memory limitations.
Streaming XML Publishing
With Streaming XML Publishing, the entire document is not created at one time; rather, fragments of the document are constructed and returned in a result set. Before Teradata 12, SPs did not return result sets. Consequently, the sysxml.PUBLISH_STREAM SP returns a query string that, upon evaluation, returns a row set composed of a column named xmlelement. When the xmlelement column values are concatenated, the result is the intended, published XML document.
For Teradata 12 and later, the PUBLISH_XML and PUBLISH_XML_REF SPs directly return a result set of the xmlelement column contents. These can be concatenated to get the result document.
Streaming XML Publishing is best illustrated through an example that walks through the process of generating an XML stream. The example publishes the "Order" information for all of the "Customers" in the database as a single XML document. The tables involved are provided below.
The SQL query that joins these tables and returns the required information is shown here:
SELECT C_CUSTKEY, C_NAME, C_ADDRESS, C_PHONE, O_ORDERKEY, O_ORDERSTATUS, O_ORDERDATE,
O_TOTALPRICE, L_LINEITEMKEY, L_CODE, L_QUANTITY, L_PRICE
FROM CUSTOMER, ORDERS, LINEITEMS
WHERE C_CUSTKEY=O_CUSTKEY and O_ORDERKEY=L_ORDERKEY
The results of the query are mapped to the following XML structure during the design phase:
<DATA xmlns:td="http://www.teradata.com/xmlpub">
<Customer ID="1">
<Name>Customer#000001407</Name>
<Address>xxxyyyzzz</Address>
<Phone>650-555-1234</Phone>
<Order ID="1">
<Status>O</Status>
<Date>97/04/17</Date>
<TotalPrice>54924.38</TotalPrice>
<LineItem ID="1">
<Code>ESM-48</Code>
<Quantity>8</Quantity>
<Price>87.60</Price>
</LineItem>
<LineItem ID="2">
<Code>ESM-23</Code>
<Quantity>1</Quantity>
…
…
…
Details of the design phase, including how the mapping is done between the tabular and hierarchical structures, are provided in the Teradata XML Services Orange Book "Teradata XML Services Release 1 User's Guide." The mapping is stored as an XSLT document, which looks something like this:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:td="http://www.teradata.com/xmlpub" version="1.0">
<xsl:template match="QuerySchema">
<DATA>
<xsl:apply-templates select="ROW" />
</DATA>
</xsl:template>
<xsl:template match="ROW">
<Customer teradata_group="C_CUSTKEY">
<xsl:attribute name="ID">
<xsl:value-of select="C_CUSTKEY" />
</xsl:attribute>
<Name>
<xsl:value-of select="C_NAME" />
</Name>
<Address>
<xsl:value-of select="C_ADDRESS" />
</Address>
<Phone>
<xsl:value-of select="C_PHONE" />
</Phone>
<Order teradata_group="O_ORDERKEY">
<xsl:attribute name="ID">
<xsl:value-of select="O_ORDERKEY" />
</xsl:attribute>
<Status>
<xsl:value-of select="O_ORDERSTATUS" />
</Status>
…
…
…
Up to this point, from a user's perspective, no difference exists between the streaming and the non-streaming versions of Teradata XML Publishing. A user can perform non-streaming publishing by invoking an SP like sysxml.PUBLISH or sysxml.PUBLISH_LOB. To perform streaming publishing, the user will invoke a streaming publishing SP such as PUBLISH_STREAM, PUBLISH_STREAM_REF, PUBLISH_XML or PUBLISH_XML_REF. The following example shows a call to the SP:
call sysxml.publish_stream_ref(
'SELECT C_CUSTKEY, C_NAME, C_ADDRESS, C_PHONE, O_ORDERKEY, O_ORDERSTATUS, O_ORDERDATE,
O_TOTALPRICE, L_LINEITEMKEY, L_CODE, L_QUANTITY, L_PRICE
FROM CUSTOMER, ORDERS, LINEITEMS
WHERE C_CUSTKEY=O_CUSTKEY and O_ORDERKEY=L_ORDERKEY', 'publish_sample', NULL,
'pubsmp', 'pubsmp', qs, res );
One of the output parameters for the above SP is an intermediate query string. When evaluated, this string returns a row set with a single column named xmlelement, as shown below. The PUBLISH_XML and PUBLISH_XML_REF SPs (with Teradata 12) return the result set directly, thereby avoiding the extra step of evaluating another SQL query.
xmlelement
-------------------------------------------------------------------------
<?xml version="1.0" encoding="ISO-8859-1"?>
<DATA xmlns:td="http://www.teradata.com/xmlpub">
<Customer ID ="1"><Name>Customer#000001407</Name><Address>zZsTZ3nI1rG5X</Address>…
<Order ID ="1"><Status>O</Status><Date>97/04/17</Date>
<TotalPrice>54924.38</TotalPrice>
<LineItem ID ="1"><Code>ESM-48</Code><Quantity>8</Quantity>
<Price>87.60</Price></LineItem>
<LineItem ID ="2"><Code>ESM-23</Code><Quantity>1</Quantity>
<Price>55.40</Price></LineItem>
</Order>
<Order ID ="10"><Status>O</Status><Date>94/06/23</Date>
<TotalPrice>26501.70</TotalPrice>
<LineItem ID ="1"><Code>ESM-34</Code><Quantity>1</Quantity>
<Price>45.43</Price></LineItem>
<LineItem ID ="2"><Code>QSM-98</Code><Quantity>8</Quantity>
<Price>8.40</Price></LineItem>
</Order>
</Customer>
…
…
Concatenating the values of this column provides a well-formed XML document, which conforms to the mapping definition. Additionally, if the publishing of a large document is interrupted for any reason, the user can re-evaluate a modified version of the intermediate query to skip the rows already processed.
Conserve resources
Streaming XML processing is an important option for applications that must publish large XML documents while conserving resources on both the server and client sides. At the expense of an extra step (evaluating the intermediate query in database versions before Teradata 12), users can efficiently publish large, gigabyte-size documents.
Srinivas Pandrangi is the architect of Teradata XML Services. He has more than 10 years of experience implementing industry standards, including those related to XML.
ONLINE LINKS
For another article on Teradata XML Services, see "Speak the language" at TeradataMagazine.com. Watch for a third online article on XML Shredding that will appear with the next issue of Teradata Magazine.