[rdf4j-dev] Question about parsing large files

Hi everyone!

I'm currently working on a project that involves bulk loading larger files (right now limited to N3, Turtle, and the associated family of formats). I was trying to parse about 100 million triples (~13 GB), and the parser ran out of memory even with the JVM heap set to 32 GB.
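For reference, a simplified sketch of the kind of load I mean (illustrative only, not my actual code): everything is parsed into an in-memory Model via Rio.parse before being handed to the store, which I suspect is where the heap gets exhausted.

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;

    import org.eclipse.rdf4j.model.Model;
    import org.eclipse.rdf4j.rio.RDFFormat;
    import org.eclipse.rdf4j.rio.Rio;

    public class WholeFileLoad {
        public static void main(String[] args) throws Exception {
            // Parse the entire file into an in-memory Model before adding it to the
            // store -- with ~100M statements this is where a 32 GB heap runs out.
            try (InputStream in = new BufferedInputStream(new FileInputStream(args[0]))) {
                Model model = Rio.parse(in, "", RDFFormat.TURTLE);
                System.out.println("parsed " + model.size() + " statements");
                // ... adding the model to the repository would go here ...
            }
        }
    }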

I took a quick look at the parser implementation, and it doesn't seem possible to configure the parser to iterate statement by statement. So what I'm doing now is chunking the larger triple files into a series of smaller ones and loading each individually (rough sketch below), but this is proving error prone and unmaintainable in the long term across different formats.
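To make the workaround concrete, here is roughly the shape of the chunking for the simplest case only (line-oriented N-Triples, one statement per line); Turtle and N3 need prefix and multi-line handling, which is exactly where it gets fragile.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class ChunkNTriples {
        public static void main(String[] args) throws Exception {
            Path input = Paths.get(args[0]);
            long maxLines = 1_000_000;   // statements per chunk
            long lineNo = 0;
            int chunk = 0;
            BufferedWriter out = null;
            try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while ((line = in.readLine()) != null) {
                    // start a new chunk file every maxLines statements
                    if (out == null || lineNo % maxLines == 0) {
                        if (out != null) out.close();
                        out = Files.newBufferedWriter(
                                Paths.get(input + ".part" + chunk++), StandardCharsets.UTF_8);
                    }
                    out.write(line);
                    out.newLine();
                    lineNo++;
                }
            } finally {
                if (out != null) out.close();
            }
        }
    }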

Does anyone have insights into a better approach, or a way to get streaming parsing to work? Also, I'm newer to the codebase, so any pointers in case I've missed something would be appreciated!

Thank you!
- Benjamin Herber



