Re: [rdf4j-dev] thoughts on RDFa ?

Hi Bart,

I have not used RDFa for anything. I do know that the metadata embedded in images is often RDF, and also that Google is pushing for more JSON-LD on web pages.

My experience with both SAX and jsoup is good. I usually use jsoup when I need to crawl web pages, and for that it is the best library I have used: very robust and simple to use.

I use SAX in my XmlToRdf converter for performance. I can convert 100 MB of XML to Turtle with only 20 MB of RAM in less than 2 seconds on my laptop. It even works all the way down to 3 MB of RAM, but then the parsing time jumps to around 10 seconds because of GC.
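
To show why a SAX-based converter can run in so little memory, here is a minimal sketch of the streaming idea. It is not the XmlToRdf code: the class name, the `http://example.org/` namespace, and the mapping (one blank node per open element, one N-Triples-style line per leaf element with text) are assumptions made up for illustration. The handler keeps only the stack of open elements and the current text buffer, and hands each triple to a sink as soon as it is known.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxToTriplesSketch {

    // Hypothetical namespace for generated predicates (illustration only).
    static final String NS = "http://example.org/";

    /** Streams the XML once and hands each generated triple to the sink. */
    static void convert(String xml, Consumer<String> sink) {
        DefaultHandler handler = new DefaultHandler() {
            // One blank node label per currently open element.
            private final Deque<String> nodes = new ArrayDeque<>();
            private final StringBuilder text = new StringBuilder();
            private int counter = 0;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                nodes.push("_:b" + counter++);
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                nodes.pop(); // discard this element's own node
                String value = text.toString().trim();
                text.setLength(0);
                // Emit a triple on the parent node when the element carried text.
                if (!value.isEmpty() && !nodes.isEmpty()) {
                    String literal = value.replace("\\", "\\\\").replace("\"", "\\\"");
                    sink.accept(nodes.peek() + " <" + NS + qName + "> \"" + literal + "\" .");
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        List<String> out = new ArrayList<>();
        convert("<people><person><name>Ada</name><age>36</age></person></people>", out::add);
        out.forEach(System.out::println);
    }
}
```

In a real converter the sink would write straight to an output stream instead of a list, so memory use stays tied to the nesting depth of the XML, not to the file size.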

I would recommend SAX for pure XML use cases with perfect syntax, JAXB when you want Java objects, and jsoup for everything else.

Håvard

On 18 Apr 2019, at 18:46, Bart Hanssens (BOSA) <bart.hanssens@xxxxxxxxxxxx> wrote:

Hi,


For scraping purposes, I'm looking into RDFa/RDFa-Lite and I'm thinking about writing a Rio parser (see also issue #512).


IIRC James did some experimental work on RDFa as well, but I think it was based on SAX,

so probably assuming that the source would be perfectly formatted XHTML... which is rarely the case
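
To illustrate the concern about SAX and real-world HTML, here is a minimal sketch using only the JDK's built-in SAX parser. The class and method names are hypothetical, and the markup snippets are made-up examples. It checks whether a fragment survives a SAX parse at all: an unclosed `<br>` or an unquoted attribute value, both common in scraped pages, is a fatal error for an XML parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class WellFormedCheck {

    /** Returns true when the markup is well-formed XML, false when SAX rejects it. */
    static boolean parsesWithSax(String markup) {
        try {
            // DefaultHandler rethrows fatal errors, so malformed input ends up in the catch below.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Unexpected parser failure", e);
        }
    }

    public static void main(String[] args) {
        // Well-formed XHTML fragment with an RDFa attribute.
        System.out.println(parsesWithSax("<div property=\"name\">Ada</div>"));
        // Typical HTML: <br> is never closed.
        System.out.println(parsesWithSax("<div property=\"name\">Ada<br></div>"));
        // Typical HTML: attribute value without quotes.
        System.out.println(parsesWithSax("<div property=name>Ada</div>"));
    }
}
```

A lenient HTML parser would be expected to accept all three fragments, which is the main argument for building on one of them instead of on SAX.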


So currently I'm looking at using either attoparser (smaller, event-driven) or jsoup (more frequently updated, DOM-interface),

and there is a wonderful test suite available at http://rdfa.info/test-suite/


So I was wondering

- are there other HTML parsers I should look into (Jodd Lagarto? NekoHTML?)

- where should the test suite go (if it gets CQ approval)? I remember some emails about moving the rdf4j-testsuite back into the main repo, but I'm not sure what the conclusion was



Thanks


Bart

