Re: [rdf4j-dev] IRI Validation

> On 1 Jun 2017, at 04:14, James Leigh <james.leigh@xxxxxxxxxxxx> wrote:
> 
> Hi all,
> 
> I want to add (optional) IRI validation to all the parsers. However,
> I've run into trouble and hope some of you can help.
> 
> My validation fails in the Turtle test suite on
> localName_with_assigned_nfc_PN_CHARS_BASE_character_boundaries[1]. You can
> see the IRI in an encoded form in the nt file and inline in the ttl file.
> 
> The last character of the IRI is U+E01EF, which, as far as I can tell,
> is not part of a valid IRI.

When I look at it, the last character shows up as U+2FA1D, which is allowed (it falls in the %x20000-2FFFD range of ucschar). It could be that my editor is messing things up, though.

> 
> RFC3987[2] (IRIs) says the following UCS characters are permitted:
>    ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
>                   / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
>                   / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
>                   / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
>                   / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
>                   / %xD0000-DFFFD / %xE1000-EFFFD
> 
>    iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD
>  
> 
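To check a given code point against the ranges quoted above, something like the following sketch might help. The class name IriCodePoints and its methods are made up for illustration and are not part of RDF4J; the ranges are copied from the quoted ucschar and iprivate productions.

```java
/** Hypothetical helper, not an RDF4J class: code-point checks for the quoted RFC 3987 ranges. */
final class IriCodePoints {

    private IriCodePoints() {}

    /** True if cp matches the ucschar production quoted above. */
    static boolean isUcschar(int cp) {
        if (cp >= 0xA0 && cp <= 0xD7FF) return true;
        if (cp >= 0xF900 && cp <= 0xFDCF) return true;
        if (cp >= 0xFDF0 && cp <= 0xFFEF) return true;
        // Planes 1 to 13: all code points except the last two of each plane (xFFFE, xFFFF).
        if (cp >= 0x10000 && cp <= 0xDFFFD) return (cp & 0xFFFF) <= 0xFFFD;
        // Plane 14 only starts at E1000, so E0000-E0FFF is excluded.
        return cp >= 0xE1000 && cp <= 0xEFFFD;
    }

    /** True if cp matches the iprivate production quoted above. */
    static boolean isIprivate(int cp) {
        return (cp >= 0xE000 && cp <= 0xF8FF)
                || (cp >= 0xF0000 && cp <= 0xFFFFD)
                || (cp >= 0x100000 && cp <= 0x10FFFD);
    }

    /** Last code point of a non-empty string, reading a trailing surrogate pair as one code point. */
    static int lastCodePoint(String s) {
        return s.codePointBefore(s.length());
    }
}
```

With this, isUcschar(0x2FA1D) is true and isUcschar(0xE01EF) is false, so the disagreement comes down to which of the two characters is really in the test file. Calling lastCodePoint on the parsed IRI string and printing it in hex would show what the parser actually sees, independent of any editor.
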
> Also of note is this URL[1], which is also not a valid IRI because an
> IRI can only have at most one "#".

It’s also not a legal URI, because RFC 3986 does not allow more than one # either. However, in the obsolete RFC 2396 it _is_ allowed, basically because that RFC enforces no validation on the fragment (which is, strictly speaking, not actually part of the URI) and just says “any character goes”.

So strictly speaking it’s malformed, but my gut feeling is that the most graceful way to handle this is to allow it and simply treat the second # as part of the fragment id. Perhaps this is a case for allowing different levels of severity in validation?
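
As a rough sketch of what I mean by severity levels (the names FragmentCheck, Severity, STRICT and LENIENT are hypothetical, nothing like this exists in RDF4J as far as this thread is concerned), the check could look at the number of # characters only and let the caller decide how strict to be:

```java
/** Hypothetical sketch, not an RDF4J API: handling a second '#' at two severity levels. */
final class FragmentCheck {

    /** STRICT rejects a second '#'; LENIENT treats it as part of the fragment. */
    enum Severity { STRICT, LENIENT }

    private FragmentCheck() {}

    /** Checks only the number of '#' characters, nothing else about the IRI. */
    static boolean accept(String iri, Severity severity) {
        int first = iri.indexOf('#');
        if (first < 0) {
            return true; // no fragment at all
        }
        boolean extraHash = iri.indexOf('#', first + 1) >= 0;
        return !extraHash || severity == Severity.LENIENT;
    }

    /** Everything after the first '#', or null if there is no fragment. */
    static String fragment(String iri) {
        int first = iri.indexOf('#');
        return first < 0 ? null : iri.substring(first + 1);
    }
}
```

For an input such as http://example.org/doc#a#b (an invented example), accept rejects it under STRICT and accepts it under LENIENT, and fragment returns a#b, i.e. the second # simply ends up inside the fragment id.
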

Cheers,

Jeen 
