com.ricebridge.xmlman
Class RecordSpec

java.lang.Object
  extended by com.ricebridge.xmlman.RecordSpec

public class RecordSpec
extends Object

Specify the XPath expressions to pull record data out of your XML document.

This is the most important class in the XML Manager component after the XmlManager class. You use this Record Specification class to specify the location of your data records, and the data fields of those records. You do this using XPath expressions that point at the data you need.

Let's start with this XML document:

  <?xml version='1.0' encoding='UTF-8'?>
  <root>
    <record name="r1">
      <foo>f1</foo>
    </record>
    <record name="r2">
      <foo>f2</foo>
    </record>
    <record name="r3">
      <foo>f3</foo>
    </record>
  </root>
  

You can pull out each data record by using the XPath expression:

/root/record

This will return each record element in the order that it appears in the document. Now to get the record data, we use another set of XPath expressions, one for each data field. These XPath expressions are evaluated relative to the record element. For example:

@name

will extract the value of the name attribute from each record element. And

foo

will get the text content of the child foo element in each record element.

To specify all this we create a new RecordSpec object, like so:

new RecordSpec( "/root/record", new String[] {"@name","foo"} );

If we hand this record specification over to XML Manager, then we will obtain the following data:

Name  Foo
r1    f1
r2    f2
r3    f3

Here you can see that we have extracted three records from the XML document.
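
The meaning of the record path and the relative field paths can be checked with the XPath support that ships with the JDK. The following is a minimal sketch, not XML Manager itself: it loads the whole document into memory with javax.xml.xpath instead of streaming it, and the class and method names (RecordPathSketch, extract) are hypothetical names chosen for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of XML Manager: evaluates a record path and
// then each field path relative to every matched record element.
final class RecordPathSketch {
    static List<String[]> extract(String xml, String recordPath, String[] fieldPaths) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        List<String[]> rows = new ArrayList<>();
        try {
            // Select every record element, in document order.
            NodeList records = (NodeList) xpath.evaluate(
                    recordPath, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            for (int i = 0; i < records.getLength(); i++) {
                Node record = records.item(i);
                String[] row = new String[fieldPaths.length];
                for (int f = 0; f < fieldPaths.length; f++) {
                    // The record element is the context node for the field path.
                    row[f] = xpath.evaluate(fieldPaths[f], record);
                }
                rows.add(row);
            }
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
        return rows;
    }
}
```

Calling extract with the example document, the record path "/root/record" and the field paths "@name" and "foo" yields one String[] per record element, in the same shape as the table above.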

Record-based Parsing
XML Manager parses and loads XML documents using the standard SAX Java interface. This interface loads the XML document one element at a time, and does not store any of the elements in memory. This means that very large XML documents can be loaded quickly and easily. XML Manager then applies your XPath expressions to this continuous stream of elements, picking out the ones that match.

XML Manager is thus a record-oriented XML parser. That means that it is best suited to handling documents with a regular structure, that is, with repeated data records. Although XML Manager can in fact extract any data out of any XML document, some documents are easier to deal with than others. Luckily, most of the XML documents used for data exchange, web services and configuration are very easy to work with using XML Manager.
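
To give a feel for what record-oriented SAX processing involves, here is a minimal sketch that uses only the JDK's SAX API. It is hand-written for the first example document (record elements with a name attribute and a foo child) and is not how XML Manager is implemented; the class name StreamingRecordsSketch is hypothetical. Only the current record's values are held in memory at any time.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration: collects {name, foo} for each record element
// while the document streams past as SAX events.
final class StreamingRecordsSketch {
    static List<String[]> readRecords(String xml) {
        List<String[]> rows = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private String name;        // name attribute of the current record
            private String fooValue;    // text of the current record's foo child
            private StringBuilder foo;  // non-null only while inside a foo element

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if (qName.equals("record")) {
                    name = atts.getValue("name");
                    fooValue = "";
                } else if (qName.equals("foo")) {
                    foo = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (foo != null) {
                    foo.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if (qName.equals("foo")) {
                    fooValue = foo.toString();
                    foo = null;
                } else if (qName.equals("record")) {
                    rows.add(new String[] {name, fooValue}); // record is complete
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
        return rows;
    }
}
```

Even for this tiny document the handler needs explicit state tracking for every field, which is the kind of code that a pair of XPath expressions in a RecordSpec replaces.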

XPath Expressions
The use of XPath expressions is what makes XML Manager so effective. Instead of writing large amounts of node traversal code, as with the DOM interface, you can just write a few XPath expressions and get your data out directly.

However, because XML Manager is designed for high speed and low memory usage, and because it uses the SAX API, it cannot support the full functionality of XPath. XML Manager cannot see beyond the current element, and it cannot return to previous elements. That means that only the common XPath expression types are supported, such as normal/child/elements, descendant//elements and element[predicates]. In particular, you can't use the .. expression.

This might seem like a bit of a problem, but XML Manager provides an alternative. Record field paths do not have to be relative; they can also be absolute. In this case, XML Manager provides you with the most recently seen value of the expression. In practical terms, this works the same way as the .. path element.

Here is an example to demonstrate this method:

  <?xml version='1.0' encoding='UTF-8'?>
  <root>
    <group code="g1">
      <record name="r1">
        <foo>f1</foo>
      </record>
      <record name="r2">
        <foo>f2</foo>
      </record>
    </group>
    <group code="g2">
      <record name="r3">
        <foo>f3</foo>
      </record>
    </group>
  </root>
  

In order to access the group element, we can use the following XPath expressions:

These expressions will produce the following data records:

Name  Group-Code
r1    g1
r2    g1
r3    g2
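
The "most recently seen value" behaviour can be sketched with a plain SAX handler. This is an illustration of the idea only, not XML Manager's implementation. It assumes a record path of /root/group/record with the fields @name and /root/group/@code; these paths are inferred from the result table and are an assumption, as is the class name LastSeenValueSketch.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration: remembers the last value seen for an absolute
// path and attaches it to each record that follows.
final class LastSeenValueSketch {
    static List<String[]> readRecords(String xml) {
        List<String[]> rows = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final Deque<String> path = new ArrayDeque<>(); // open elements
            private String lastGroupCode = "";  // latest /root/group/@code

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                path.addLast(qName);
                String current = "/" + String.join("/", path);
                if (current.equals("/root/group")) {
                    lastGroupCode = atts.getValue("code");
                } else if (current.equals("/root/group/record")) {
                    // The absolute field resolves to the most recent group code.
                    rows.add(new String[] {atts.getValue("name"), lastGroupCode});
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                path.removeLast();
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
        return rows;
    }
}
```

Because the enclosing group element is always opened before its records, the remembered value is the one a .. step would have reached.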

XML Manager fully supports XPath predicates, so that you can say things like:

/root/record[@name='r1']

to extract the record where the name attribute has the value r1.

XML Manager also supports most of the XPath functions, so you can use expressions such as

concat(foo,bar)

to concatenate the text content of two elements.
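
Predicates and concat are standard XPath 1.0, so expressions like these can be tried out with the JDK's own XPath engine before handing them to XML Manager. The sketch below assumes nothing about XML Manager; XPathExpressionSketch and evaluate are hypothetical names, and the bar element in the usage note is assumed for illustration since the example documents do not contain one.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates an XPath 1.0 expression against an XML
// string and returns the result converted to a string.
final class XPathExpressionSketch {
    static String evaluate(String xml, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

For a document whose record elements contain both a foo and a bar child, evaluate(xml, "/root/record[@name='r1']/foo") returns the foo text of the record named r1, and wrapping two such paths in concat(...) joins their text content.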

Of course, some functions cannot be supported by XML Manager, due to the streaming nature of its input. In particular, the last function is not fully supported, as XML Manager cannot tell whether another element of the same kind follows the current one.

XML Manager also provides access to the text content of elements. The expression

foo

concatenates the child text nodes of the element foo, and of all its child elements. Note that all XML elements are dropped and only the text content is returned. The expression

foo/text()

concatenates the child text nodes of the element foo, but does not include the text content of child elements. The expression

foo/text()[n]

returns the contents of the nth child text node. XML Manager can also return the actual XML text of any element; see the Special Functions section for more details.
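
The three kinds of text access described above can be mimicked on a small in-memory DOM tree. This is a sketch of the described results only, written against the JDK's DOM API rather than XML Manager, and the names TextContentSketch, allText, childText and nthChildText are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration of the three text-access variants.
final class TextContentSketch {
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Like "foo": text of the element and of all its descendants, tags dropped.
    static String allText(Element e) {
        return e.getTextContent();
    }

    // Like "foo/text()" as described above: direct child text nodes only.
    static String childText(Element e) {
        StringBuilder sb = new StringBuilder();
        NodeList children = e.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.TEXT_NODE) {
                sb.append(children.item(i).getNodeValue());
            }
        }
        return sb.toString();
    }

    // Like "foo/text()[n]": the nth direct child text node (1-based), or "".
    static String nthChildText(Element e, int n) {
        int seen = 0;
        NodeList children = e.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE && ++seen == n) {
                return child.getNodeValue();
            }
        }
        return "";
    }
}
```

For the element <foo>a<b>x</b>c</foo>, allText gives "axc", childText gives "ac", and nthChildText with n = 2 gives "c".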

Which Constructor Should I Use?
The RecordSpec object provides a number of convenience constructors. However, most of these simply allow you to specify additional secondary RecordSpecs without having to create a List object to put them in.

The essential parameters of every RecordSpec are always the record XPath expression String, and the String[] array containing the field XPath expressions. These are always the first two parameters. Thus, the simplest way to create a RecordSpec is with the RecordSpec(String,String[]) constructor.

Sometimes you will need to provide names for the data fields. You can do this using a second String[] array. This should be the same length as the field expressions array, and contain the name of each data field. This extra information is required by some of the load methods in XmlManager, such as the XmlManager.loadBeans(File,RecordSpec,BeanSpec) method. This parameter, if present, is always the third parameter.

Finally, you may wish to use secondary RecordSpecs. These secondary RecordSpecs are used to extract extra information from the XML document at the same time as the primary data records are extracted. For more information about using them, see the Using Multiple RecordSpecs section below. To add secondary RecordSpecs to your primary RecordSpec, add them at the end of the constructor parameter list (RecordSpec(String,String[],RecordSpec)). If there are more than three secondary RecordSpecs, then you will have to put them in a List; see the RecordSpec(String,String[],List) constructor.

If you already have a set of RecordSpecs, and you want to create a primary RecordSpec from one of them and add the rest as secondaries, use the RecordSpec(RecordSpec,RecordSpec) constructors. These use the record specification details of the first argument to create a new primary RecordSpec.

Special Functions
XML Manager provides some additional functions to aid you in accessing the data in your XML documents. These functions are placed in the http://www.ricebridge.com/xmlman namespace, which has the default prefix rb (you can change this prefix). These functions are:

The rb:trim function provides exactly the same functionality as the Java String.trim method. It removes any whitespace characters from the start and end of its string argument.

The rb:xml function allows you to get at the actual XML of your document. In cases where you need to process the XML afterwards, but you still want to use XML Manager to parse the XML file, this function is the one to use. It is especially useful in cases where XML content is the data that you want. This occurs, for example, in the content element of the Atom specification.

Using our first example XML document above, consider the RecordSpec:

new RecordSpec("/root/record", new String[]{"rb:xml(foo)"})

For the first data record, this will return the value:

<foo>f1</foo>
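
For comparison, the same kind of result can be produced with the JDK alone by selecting a node and serializing it back to text. This is a sketch of what "the actual XML of an element" means, not a description of how rb:xml works internally; XmlTextSketch and xmlOf are hypothetical names, and unlike XML Manager this approach builds the whole document in memory.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: returns the markup of the first node matched by path.
final class XmlTextSketch {
    static String xmlOf(String xml, String path) {
        try {
            Node node = (Node) XPathFactory.newInstance().newXPath().evaluate(
                    path, new InputSource(new StringReader(xml)), XPathConstants.NODE);
            if (node == null) {
                return ""; // nothing matched
            }
            // An identity transform writes the node back out as XML text.
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

Applied to the first example document with the path /root/record/foo, xmlOf returns the markup of the first foo element, tags included.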

You can also define your own functions. See the XmlSpec.addFunction method.

Using Multiple RecordSpecs
Sometimes your XML document will contain more than one type of data record. Many documents also contain header information. To handle these cases, XML Manager allows you to use more than one RecordSpec at a time. In this case, every time any RecordSpec matches a data record, the data fields are delivered back to you. Let's look at an example to show how this works:

  <root>
    <title name="t1" />
    <other bar="b1" />
    <other bar="b2" />
    <record name="r1">
      <foo>f1</foo>
    </record>
    <record name="r2">
      <foo>f2</foo>
    </record>
  </root>
  

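From this data we want to get:

  the name attribute of the title element
  the bar attribute of each other element
  the name attribute and the foo content of each record element

It is impossible to get all this data in one pass using just a single RecordSpec. Rather, we need three:

  new RecordSpec("/root/title", new String[]{"'title'","@name"})
  new RecordSpec("/root/other", new String[]{"'other'","@bar"})
  new RecordSpec("/root/record", new String[]{"'record'","@name","foo"})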

Notice that we have put a constant value in the first field. This serves as a marker to identify where the data came from. This is very useful if you are accessing the data using the convenience methods such as load(File,RecordSpec). Even if you are using your own RecordListener, this is probably a good idea, as you cannot rely on the order of the elements in the source XML document.

In order to use these RecordSpecs together, we use the appropriate constructor of RecordSpec:

  ArrayList secondaries = new ArrayList();
  secondaries.add( new RecordSpec("/root/title",new String[]{"'title'","@name"}) );
  secondaries.add( new RecordSpec("/root/other",new String[]{"'other'","@bar"}) );
  RecordSpec primary = new RecordSpec("/root/record",new String[]{"'record'","@name","foo"}, secondaries );

This makes the record RecordSpec the primary RecordSpec, and the other two RecordSpecs are then secondary RecordSpecs.

What's the difference between a primary and secondary RecordSpec? The field names of the primary RecordSpec are the only field names that are used. Field names are used by the XmlManager.loadBeans(File,RecordSpec,BeanSpec) method, for example, to identify the property methods of the Java Beans to be loaded. Any field names defined in the secondary RecordSpecs are ignored. Normally you don't have to worry about this, especially if you are just using the load and loadAsLists methods. As a rule of thumb, always access the main data record using the primary RecordSpec.

Let's look at the data that is returned by the three RecordSpecs:

First Element  Second Element  Third Element
title          t1
other          b1
other          b2
record         r1              f1
record         r2              f2

You can see how the undefined data fields of the secondary RecordSpecs are simply returned as empty strings. When you process this list of String[] arrays, you can use the first element of each array to identify the type of data: title, other or record.
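
Sorting the returned arrays by their marker needs nothing beyond the standard collections. The sketch below assumes the rows arrive as a List of String[] with the marker in element 0, as in the table above; MarkerDispatchSketch and groupByMarker are hypothetical names and not part of XML Manager.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups mixed rows by the constant marker in field 0.
final class MarkerDispatchSketch {
    static Map<String, List<String[]>> groupByMarker(List<String[]> rows) {
        // LinkedHashMap keeps the markers in the order they were first seen.
        Map<String, List<String[]>> groups = new LinkedHashMap<>();
        for (String[] row : rows) {
            groups.computeIfAbsent(row[0], marker -> new ArrayList<>()).add(row);
        }
        return groups;
    }
}
```

After grouping, groups.get("record") holds only the main data records, while the title and other rows can be handled separately, whatever order they appeared in.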

Limitations
XML Manager is designed for speed and stable memory use. This means you can use it on arbitrarily large XML documents and you will be able to process them without running into resource problems. In order to achieve this, XML Manager only makes one pass through the XML file, using a SAX parser. It uses the SAX events to construct an internal view of the XML document that can then be checked against the XPath expressions you are using to extract your data.

Because only one pass is made, XML Manager cannot "see into the future": XPath expressions that refer to elements ahead of the current element cannot be used. Also, because we want to prevent memory errors, XML Manager does not try to keep a record of all the XML it has already seen. With these limitations in mind, let's take a look at the subset of XPath that XML Manager does support:

Note: the last function always returns true, as this is the most useful default. But this does mean that it will be true on the last element, and since XML Manager only uses the most recent value of an expression, this is most often what you want, especially in data field expressions.

The XPath specification relies heavily on the idea of a context. XML Manager can only provide a partial context, because it is not possible to determine the context size, and the context node is not a real node in a document model. This means that XML Manager has no concept of a node set, or rather, that all node sets contain just one node, the current one. As a result, the following parts of the XPath specification are not implemented: