[lyo-dev] Fwd: Evolving RDF/XML support and ARP.

FYI.

I was going through my unread emails that accumulated over a long period of time and noticed this email on the jena dev list. The corresponding PR 1774 landed in Jena 4.8.0, which is what Lyo 5.1.1 ships with. 

The impact of this change is not fully understood yet. As far as I understand, it only affects apps that rely on RDF 1.0 features that were removed from RDF 1.1 and no OSLC spec ever relied on RDF 1.0. However, I imagine that really old Jazz apps may use some undocumented XML features. In any case, Lyo code relying on Jena 4.8.0 still seems to work with 3 installs of ELMv7.

Please report if you encounter any problems communicating with old Jazz installations using Lyo 5.1.1 or later versions.

Cheers,
Andrew

Begin forwarded message:

From: Andy Seaborne <andy@xxxxxxxxxx>
Subject: Evolving RDF/XML support and ARP.
Date: 24 February 2023 at 15:16:47 CET
Reply-To: <dev@xxxxxxxxxxxxxxx>

Jena's RDF/XML parser, ARP, was originally a separate subsystem that could be configured for different possible directions of the RDF 1.0 working group and different treatment of IRIs that were possible at the time (this is before RFC3986/3987). It is the "xmlinput" package in jena-core.

It has a close coupling to jena-iri with features such as customization of errors, and an idiosyncratic approach to relative IRIs (if called directly). These are outside normal use of RDF/XML.  When used from model.read or a RIOT API, these features aren't accessible.

Both jena-iri and ARP are hard to maintain.

xmlinput is the last part of Jena that uses jena-iri directly.

Jena has an IRI abstraction - IRIx - that allows switching IRI providers. The Jena releases use jena-iri as the provider through the IRIx abstraction - error messages are the same as before.

There is a test suite for compatibility - on a pass/warning/error basis, not error message text, that gives the expected behaviour of an IRIx implementation.


RFCs and W3C documents that define the URIs, IRIs, and the specific URI schemes evolve so maintenance is necessary.

RDF 1.1 removed the special "RDF URI reference" in favour of RFC 3987.
W3C has a REC about DIDs (a new "did:" URI scheme).
RFC 6874 changes the core URI grammar of RFC 3986, adding support for IPv6 zones.
RFC 8089 defines "file:" as it is actually used.
RFC 8141 replaces the definition of URNs with a new RFC.


My long-term aspiration is to have an RDF/XML parser and IRI handling that is:

1/ Maintainable.
2/ For use as a parser in Jena and only for that.

That means making RDF/XML handling much simpler, with functionality for reading conformant RDF/XML and not variations that are not used by Jena users. The test suite has good coverage.

For IRIs, switch from jena-iri to a new IRI library that has up-to-date support for IRIs. jena-iri also has scheme-specific rules for a large number of legacy schemes (gopher:, telnet:, fax:, ...). This extensibility causes a very high cost to maintain. It has not been remade from the original configuration files for many years (that step is not in the build).

New IRI library:
https://github.com/afs/x4ld/tree/main/iri4ld

jena-iri is also slower than iri4ld and this is visible in parsing (the impact is 5-10% of parsing speed on N-triples.)

Error messages do change, hopefully to ones that are easier to understand. jena-iri error messages are quite technical.

This all applies to xmloutput as well but that's already converted to IRIx.


I have a new PR in-progress that converts RDF/XML parsing to use IRIx.
It does change the behaviour for directly using RDFXMLReader when relative URIs are given as the base. A fully legacy setup exists that passes all the tests for normal parsing use but does not pass some detailed local behaviour tests in the RDF/XML writer.
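
To illustrate why a relative URI given as the base is a special case, here is a minimal sketch using only the JDK's java.net.URI. It is not Jena code and says nothing about how ARP or IRIx actually behave; the class name RelativeBaseDemo and the example URIs are made up for the illustration. It only shows that resolving against an absolute base gives an absolute result, while resolving against a relative base leaves the result relative, so a parser has to decide what to do in that situation.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of Jena.
public class RelativeBaseDemo {

    // Resolve a reference against a base using the JDK's RFC 2396/3986-style resolution.
    static String resolve(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    // True when the string parses as a URI with a scheme (e.g. "http:").
    static boolean isAbsolute(String uri) {
        return URI.create(uri).isAbsolute();
    }

    public static void main(String[] args) {
        // Normal case: an absolute base yields an absolute result.
        String fromAbsolute = resolve("http://example.org/dir/file.rdf", "other#frag");
        System.out.println(fromAbsolute + " absolute=" + isAbsolute(fromAbsolute));

        // Legacy-style case: the base itself is relative, so the result has no scheme.
        String fromRelative = resolve("dir/file.rdf", "other#frag");
        System.out.println(fromRelative + " absolute=" + isAbsolute(fromRelative));
    }
}
```

In RDF terms the second result is not usable as an IRI for a resource until it is resolved further, which is where implementations can differ.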

Roadmap:

Eventually have multiple packages, until we decide that migration has happened and they are getting in the way.

Packages used by RIOT/model.read are essential maintenance only.


* xmlinput0 - this is ARP xmlinput as it is in Jena 4.7.0.

* xmlinput1 - this is ARP switched to use IRIx.

* xmlinput2 - an RDF/XML parser (starting with ARP and cutting out the unused parts) that covers Jena needs and not trying to do everything ARP does. xmlinput2 does not yet exist.

The new PR gets the codebase to xmlinput1 (as "xmlinput").

If all goes well, we can have 4.8.0 default to use xmlinput1, switchable back to xmlinput0.

When called from model.read or RIOT, it should not make a difference.

It would be great to have users test but any affected users are using legacy features and they are less likely to upgrade regularly. Reports about direct use of ARP have been very infrequent.

   Andy


