How to Compare Large Files
Table of Contents
- Introduction
- White space
- XML file structure
- Number of differences
- PCDATA
- Changes only or Full delta
- Version of Java
- Java Heap Size
- SAX input sources
- Metrics
- Conclusion
1. Introduction
It is possible to compare very large datasets using DeltaXML; example test data has demonstrated that files over 400MB in size can be loaded from disk and compared in under 5 minutes.
NOTE: The Professional Named User licence for DeltaXML has a node limit and therefore will NOT process very large files. Please contact us if you need an evaluation to check processing of large files.
There are many different factors that affect performance with large files apart from the CPU type and speed, and the amount of physical memory, both of which must be adequate for the job. Some of the more important are discussed in the following sections. We also provide some typical metrics for DeltaXML on basic machines.
2. White space
White space nodes are generally significant in XML files. Each white space, e.g. newline or space, can be treated as a node in the XML file. This can increase the memory image size and slow the comparison process. It can also result in differences being identified that are not significant.
In many situations white space nodes are not important and can be ignored. If a file has a DTD, an XML parser can use this as the file is read in to identify whether white space is ignorable or not. If a white space node is ignorable, for example because it appears between two markup tags, DeltaXML will ignore it in the comparison process.
If there is no DTD, white space nodes should be removed, either using an editor or by processing with an XSL filter such as normalize-space.xsl, though using XSL can be time-consuming for large files. The delta files generated by DeltaXML have no white space added to them: if you look at them in an editor you will see that new lines are added only inside tags. This may look strange at first, but it is an effective way to have shorter lines without adding white space nodes to an XML file. White space inside tags will be ignored by any XML parser.
Remember also that indentation of PCDATA within a file has an effect: often white space in PCDATA and attributes should be normalized before comparison. Otherwise, again, there will be a lot of differences reported that are not important.
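As an illustration of removing white space nodes without XSL, the following is a minimal sketch of a SAX filter built only on standard JDK classes. The class name WhitespaceStripFilter and the helper strip are hypothetical and not part of DeltaXML, and the sketch assumes that any text consisting solely of white space between tags can be dropped. It does not handle comments or CDATA boundaries specially.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical filter (not a DeltaXML class): drops white-space-only text. */
class WhitespaceStripFilter extends XMLFilterImpl {
    private final StringBuilder text = new StringBuilder();

    WhitespaceStripFilter(XMLReader parent) {
        super(parent);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one text node in several chunks, so buffer them.
        text.append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Already flagged as ignorable by a DTD-aware parser: drop it.
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        flushText();
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        flushText();
        super.endElement(uri, local, qName);
    }

    @Override
    public void processingInstruction(String target, String data) throws SAXException {
        flushText();
        super.processingInstruction(target, data);
    }

    /** Forwards the buffered text unless it is empty or white space only. */
    private void flushText() throws SAXException {
        if (text.toString().trim().length() > 0) {
            char[] chars = text.toString().toCharArray();
            super.characters(chars, 0, chars.length);
        }
        text.setLength(0);
    }

    /** Demo helper: parses xml through the filter and serializes the result. */
    static String strip(String xml) {
        try {
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);
            XMLReader reader = spf.newSAXParser().getXMLReader();
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            identity.transform(
                new SAXSource(new WhitespaceStripFilter(reader),
                              new InputSource(new StringReader(xml))),
                new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

For large files the same filter could wrap a reader over a file-based InputSource instead of a string, so that the document is streamed rather than held in memory.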
3. XML file structure
There is a performance difference between comparing 'flat' XML files, i.e. those with a large number of records at one level, and more deeply nested files, which tend to require less processing because there are fewer nodes at each level. Comparison of orderless data is generally slower.
4. Number of differences
The number of differences affects performance: no differences is quickest! The more differences there are the slower the comparison process because the software is trying to find a best match between the files. The Wu algorithm used in DeltaXML for pattern matching ordered sequences has optimal performance for small numbers of differences and slows significantly for large numbers of differences.
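To show why the cost grows with the number of differences, here is a minimal sketch of an O(NP) sequence comparison, assuming that "the Wu algorithm" refers to the algorithm published by Wu, Manber, Myers and Miller. It is not DeltaXML's code: the class name OnpDistance is hypothetical, and it compares characters of two strings rather than XML nodes. The outer loop runs once per value of p (the number of deletions from the shorter sequence), so identical inputs finish in a single pass.

```java
import java.util.Arrays;

/** Hypothetical sketch of an O(NP) comparison; not taken from DeltaXML. */
class OnpDistance {

    /** Returns the number of insertions plus deletions that turn a into b. */
    static int editDistance(String a, String b) {
        // The algorithm assumes the first sequence is the shorter one.
        if (a.length() > b.length()) {
            String tmp = a;
            a = b;
            b = tmp;
        }
        final int m = a.length();
        final int n = b.length();
        final int delta = n - m;   // diagonal on which the end point lies
        final int offset = m + 1;  // shifts diagonal k to a valid array index
        int[] fp = new int[m + n + 3];
        Arrays.fill(fp, -1);       // furthest y reached on each diagonal

        int p = -1;
        do {
            p++;
            // Diagonals below delta, moving upwards.
            for (int k = -p; k < delta; k++) {
                fp[k + offset] = snake(a, b, k,
                    Math.max(fp[k - 1 + offset] + 1, fp[k + 1 + offset]));
            }
            // Diagonals above delta, moving downwards.
            for (int k = delta + p; k > delta; k--) {
                fp[k + offset] = snake(a, b, k,
                    Math.max(fp[k - 1 + offset] + 1, fp[k + 1 + offset]));
            }
            // Finally the diagonal delta itself.
            fp[delta + offset] = snake(a, b, delta,
                Math.max(fp[delta - 1 + offset] + 1, fp[delta + 1 + offset]));
        } while (fp[delta + offset] != n);

        return delta + 2 * p;
    }

    /** Follows matching items along diagonal k starting at row y. */
    private static int snake(String a, String b, int k, int y) {
        int x = y - k;
        while (x < a.length() && y < b.length() && a.charAt(x) == b.charAt(y)) {
            x++;
            y++;
        }
        return y;
    }
}
```

Each extra round of the do-loop visits more diagonals than the previous one, which is consistent with the slowdown described above when the inputs differ heavily.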
5. PCDATA
DeltaXML shares text strings, so many different text strings will result in a larger memory image and may cause the program to hit memory size limitations sooner. On the other hand, files with many identical strings will be stored very efficiently.
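The effect of string sharing can be sketched with a simple pool that hands back one canonical instance per distinct value. This is only an illustration of the general technique, not DeltaXML's implementation; the class name StringPool and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical string pool; illustrates sharing, not DeltaXML internals. */
class StringPool {
    private final Map<String, String> pool = new HashMap<String, String>();

    /** Returns the shared instance that is equal to s, adding s if it is new. */
    String share(String s) {
        String existing = pool.get(s);
        if (existing != null) {
            return existing;   // identical text: no extra memory retained
        }
        pool.put(s, s);        // new text: the pool grows by one entry
        return s;
    }

    /** Number of distinct strings held; this is what drives memory use. */
    int distinctCount() {
        return pool.size();
    }
}
```

With such a pool, a million records that all contain the same text retain one string, while a million records with unique text retain a million strings.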
6. Changes only or Full delta
The DeltaXML API has the ability to generate a delta file with 'changes-only' or a 'full delta' that includes unchanged data as well. It will be slower to output the full delta, of course. In general the full delta option should be avoided for large files.
7. Version of Java
Performance using JDK 1.4 and later versions is significantly better than that provided by earlier releases.
Some tests comparing J2SE 1.4.2_02 and 1.3.1_09 showed the newer version halved the comparison runtimes.
8. Java Heap Size
The size of the JVM heap is one of the main factors that determine the size of the datasets DeltaXML can process. The size of the heap, the amount of available RAM and other JVM configuration options affect both capacity and performance (too small a heap will result in excess garbage collection; similarly, not enough RAM will cause performance degradation). The following guidelines are suggested:
- The java -Xmx option can be used to increase the fairly small default JVM heap size. For example, invoking with the command-line argument java -Xmx512m ... sets the maximum JVM heap size to half a gigabyte.
- Performance is poor if there isn't enough RAM available to support the requested JVM heap size. Using disk-based swapping to support the heap caused significant slowdowns. We suggest ensuring that the heap size specified with the java -Xmx argument is available as free RAM.
- The J2SE server JVM can provide much better performance than the client JVM (in some cases twice as fast), but at the expense of increased memory consumption. If enough RAM is available, adding java -server ... is recommended for best performance.
- 32 bit Operating Systems and processors can limit the process virtual address space and thus the amount of memory that you can dedicate to JVM heap usage. Some Operating Systems divide the 32 bit process address space into space for system/kernel and space for user code. For example, Windows™, Linux™ (most distributions/kernels) and MacOSX™ do a 50/50 split, which on a 32 bit machine leaves around 2 GBytes of space available for the Java heap, even on machines which have larger amounts of RAM installed. 32 bit processes on Solaris Sparc™ (7, 8 & 9) avoid the 50/50 split and make most of the 4 GBytes available to the Java heap; for example, java -Xmx3900m is possible.
- To exceed the 2 or 4 GByte Java heap size limits, a 64 bit JVM is usually required. However, for this to work usefully and to see benefits, 64 bit processors, corresponding Operating System support (some Operating Systems available for 64 bit processors only support a 32 bit address space, for example MacOSX™ 10.3) and more than 4 GBytes of RAM will be needed.
- The use of Multiple Page Size Support (java -XX:+UseMPSS ...) on Solaris provided a 5% runtime improvement in testing, with no measurable memory overhead.
- Using the incremental garbage collector (java -Xincgc ...) showed no benefit when tested.
- It was hoped that the use of the parallel garbage collector (java -XX:+UseParallelGC ...) would provide improved run times on multiprocessors, as garbage collection could occur concurrently on a separate CPU. It actually had the opposite effect, doubling the elapsed runtime and trebling the CPU time consumed.
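When experimenting with the options above, it can help to confirm what the running JVM actually received. The following minimal sketch uses only the standard Runtime and System classes; the class name HeapCheck is hypothetical. The reported maximum is typically close to, but not necessarily identical to, the value given with -Xmx.

```java
/** Hypothetical helper that reports the heap limit of the running JVM. */
class HeapCheck {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** True if the configured maximum heap is at least the given size. */
    static boolean heapAtLeast(long requiredMegabytes) {
        return maxHeapMegabytes() >= requiredMegabytes;
    }

    public static void main(String[] args) {
        // Example invocation: java -server -Xmx512m HeapCheck
        System.out.println("JVM:            " + System.getProperty("java.vm.name"));
        System.out.println("Max heap (MB):  " + maxHeapMegabytes());
        System.out.println("Processors:     " + Runtime.getRuntime().availableProcessors());
    }
}
```

Running it once with and once without -server shows which JVM variant was selected, since java.vm.name usually contains the word Client or Server on JVMs that offer both.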
9. SAX input sources
Reading from disk-based files, for example using the command.jar command-line interpreter, is typically slower than processing SAX events produced from an existing in-memory data representation. As well as the reduced disk I/O, a more significant speedup arises from avoiding the lexical analysis/tokenization that a SAX parser would otherwise perform. If you need to read XML files from disk, we also recommend testing different SAX parsers and comparing their performance on your own data.
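Producing SAX events from data already held in memory can be sketched as follows, using only the standard org.xml.sax interfaces. The class InMemorySaxSource, the element names records and record, and the counting handler are all hypothetical; how the events are handed to a comparison is not shown here, because that depends on the DeltaXML API in use.

```java
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical producer of SAX events from an in-memory list of records. */
class InMemorySaxSource {

    /** Emits records/record elements directly, with no parsing step. */
    static void emit(List<String> records, ContentHandler handler) throws SAXException {
        AttributesImpl noAttributes = new AttributesImpl();
        handler.startDocument();
        handler.startElement("", "records", "records", noAttributes);
        for (String record : records) {
            handler.startElement("", "record", "record", noAttributes);
            char[] chars = record.toCharArray();
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "record", "record");
        }
        handler.endElement("", "records", "records");
        handler.endDocument();
    }

    /** Stand-in consumer that counts elements; a real consumer would compare. */
    static class ElementCounter extends DefaultHandler {
        int elements;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            elements++;
        }
    }

    /** Demo helper: feeds the events to the counter and returns the total. */
    static int countElements(List<String> records) {
        ElementCounter counter = new ElementCounter();
        try {
            emit(records, counter);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return counter.elements;
    }
}
```

Because the events are generated from existing objects, no characters are read from disk and no tokenization takes place, which is the source of the speedup described above.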
10. Metrics
It is difficult to give accurate performance metrics for the reasons outlined above. But some examples may help as an indication. Example files of test data were generated according to customer supplied metrics (the actual customer data was commercially sensitive) and tested with the following hardware and software:
- Test hardware was a Sun V240, with:
  - 2 * 1 GHz UltraSparc IIIi CPUs
  - 4 GBytes of RAM
  - Internal 36 GByte, 10k rpm SCSI disks
  - Solaris 9 04/2004, with recommended and security patch cluster
  - J2SE 1.4.2_04
- Test software was a pre-release of DeltaXML Core API 2.8.2
One example data set has 27 million elements contained in 3 levels of hierarchy (generated using three nested loops of 300 iterations each, i.e. 300³) inside the root element; each element contained a short piece of 4 character PCDATA. When written to a disk file the size of this data was 430 MBytes. This file was compared with itself to produce a 'null delta' (input data with differences will take longer to process, but this type of testing is very useful for gauging capacity-related issues). Some results were:
- Reading the data from disk files took 4 minutes 45 seconds using the server JVM and consumed just under 2 GBytes of memory for the Java process. -Xmx2048m was used to limit the JVM heap size.
- A smaller Java process size was possible with the client JVM: the process size was reduced to 1250 MB, but the runtime increased to 8 minutes 29 seconds.
- The fastest run time, 2 minutes 14 seconds, was obtained reading the data from SAX events rather than disk-based XML files.
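For readers who want to build a comparable data set for their own capacity tests, the following minimal sketch writes nested test data with three loops, assuming a loop count of 300 per level reproduces the 27 million elements (300 × 300 × 300) mentioned earlier. The class name TestDataGenerator, the element names a, b and c, and the text value are hypothetical; the original generator and its exact layout are not described in this document.

```java
import java.io.Writer;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical generator of nested test data; layout is an assumption. */
class TestDataGenerator {

    /** Writes n*n*n leaf elements in three nested levels; returns the leaf count. */
    static long generate(int n, Writer out) {
        long leaves = 0;
        try {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartDocument();
            xml.writeStartElement("root");
            for (int i = 0; i < n; i++) {
                xml.writeStartElement("a");
                for (int j = 0; j < n; j++) {
                    xml.writeStartElement("b");
                    for (int k = 0; k < n; k++) {
                        xml.writeStartElement("c");
                        xml.writeCharacters("abcd");   // 4 character PCDATA
                        xml.writeEndElement();
                        leaves++;
                    }
                    xml.writeEndElement();
                }
                xml.writeEndElement();
            }
            xml.writeEndElement();
            xml.writeEndDocument();
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return leaves;
    }
}
```

Passing a buffered FileWriter and n = 300 would stream the full-size file to disk without holding it in memory; no white space is written between elements, in line with the earlier advice on white space.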
11. Conclusion
Be sure to remove white space from large input files. Performance depends on file structure and text content so needs to be evaluated on your own data.
However, it is clear from the above that DeltaXML can be used successfully with very large XML datasets.