Speed up Java XML validation with DTD caching
To cache DTDs on the filesystem:
- Download them into a specific shared directory such as /mydtddir, and create an OASIS standard catalog.xml file in that directory which refers to each of your saved DTDs.
- Tell your SAXBuilder to use the Apache XML Commons Resolver.
- Specify where the resolver should look for the catalog.xml file, using either a properties file on the classpath called CatalogManager.properties or the system property xml.catalog.files.
Hey presto, all your Java apps which use the commons resolver will then read the cached DTDs from disk instead of fetching them over the network on every parse, which is where the massive performance gain comes from.
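The steps above can be sketched end to end with the JDK alone. Note that this swaps in the built-in javax.xml.catalog API (Java 9 and later) for the Apache resolver, and a plain SAX XMLReader for SAXBuilder, purely so the example has no external dependencies. The class name CatalogDemo, the "-//Example//DTD Note 1.0//EN" public ID, the example.invalid system ID and the note.dtd file are all hypothetical, made up for illustration.

```java
import java.io.StringReader;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class CatalogDemo {
    // Hypothetical identifiers, used only for this illustration.
    static final String PUBLIC_ID = "-//Example//DTD Note 1.0//EN";
    static final String SYSTEM_ID = "http://example.invalid/dtd/note.dtd";

    /** Saves a one-element DTD plus a catalog.xml that points at it; returns the catalog URI. */
    static URI writeCatalog(Path dir) throws Exception {
        Files.writeString(dir.resolve("note.dtd"), "<!ELEMENT note (#PCDATA)>\n");
        Path catalog = dir.resolve("catalog.xml");
        Files.writeString(catalog,
            "<?xml version=\"1.0\"?>\n"
            + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
            + "  <public publicId=\"" + PUBLIC_ID + "\" uri=\"note.dtd\"/>\n"
            + "  <system systemId=\"" + SYSTEM_ID + "\" uri=\"note.dtd\"/>\n"
            + "</catalog>\n");
        return catalog.toUri();
    }

    /** Validates xml against its DTD, with the DTD looked up through the catalog. */
    static boolean isValid(URI catalog, String xml) throws Exception {
        // The default features use strict resolution: an unmatched ID throws
        // instead of silently falling back to a network fetch.
        CatalogResolver resolver =
            CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalog);
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        reader.setEntityResolver(resolver);
        boolean[] valid = {true};
        reader.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) {
                valid[0] = false; // validity errors are reported here, not thrown
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return valid[0];
    }
}
```

Because the system ID points at a host that can never resolve, a successful validation shows the DTD really came from the local directory. With the Apache resolver the hook is the same: an org.apache.xml.resolver.tools.CatalogResolver is also an EntityResolver, so it can be handed to SAXBuilder.setEntityResolver in the same way.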
The next obvious question is how and where to obtain and store the DTDs on your system. In theory, under Debian you can just install a package containing the DTDs you need, for example for XHTML:
apt-get install w3c-dtd-xhtml
and then just point your resolver at /etc/xml/catalog.
Unfortunately (as usual) the OS package is overly complicated. It relies on a large set of files distributed all over the place which delegate to each other. Apart from making resolution slower due to multiple parses and file opens, at the time of writing the files simply seem to be broken and contain unresolvable OASIS extension DTDs.
For example, the default catalog for XHTML is /usr/share/xml/xhtml/schema/dtd/1.0/catalog.xml.
Its doctype refers to a GlobalTransCorp DTD, which fails to resolve:
$ xmlcatalog /etc/xml/catalog "-//GlobalTransCorp//DTD XML Catalogs V1.0-Based Extension V1.0//EN"
No entry for PUBLIC -//GlobalTransCorp//DTD XML Catalogs V1.0-Based Extension V1.0//EN
Therefore I usually choose to manage my own DTDs under /etc/dtds.
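One way to keep such a self-managed directory simple is a single flat catalog.xml with no delegation and no doctype of its own, so there is nothing extra to open or resolve. The following is only a sketch of generating that file. The class name FlatCatalog and its methods are hypothetical, and it assumes you already know the public ID of each DTD you saved.

```java
import java.util.Map;

final class FlatCatalog {
    /** Escapes the characters that are unsafe inside a double-quoted XML attribute. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    /** Builds the text of one catalog.xml from (public ID -> DTD file name) pairs. */
    static String build(Map<String, String> dtdsByPublicId) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\"?>\n");
        // No DOCTYPE on the catalog itself, so reading it needs no further lookups.
        sb.append("<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n");
        for (Map.Entry<String, String> e : dtdsByPublicId.entrySet()) {
            // File names are relative, so they resolve against the catalog's own directory.
            sb.append("  <public publicId=\"").append(escape(e.getKey()))
              .append("\" uri=\"").append(escape(e.getValue())).append("\"/>\n");
        }
        sb.append("</catalog>\n");
        return sb.toString();
    }
}
```

For instance, FlatCatalog.build(Map.of("-//W3C//DTD XHTML 1.0 Strict//EN", "xhtml1-strict.dtd")) yields a catalog with one public entry, which could then be written to /etc/dtds/catalog.xml next to the saved DTD.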