semweb – Semantic Web and Expert Systems – A comparison of six RDF stores



February 2015

A comparison of six RDF stores by means of a simple use case

RDF stores are nowadays the physical carriers for knowledge graphs in the semantic web world. They deliver portions of graphs through their SPARQL endpoint, usually via an HTTP-based REST interface. Although they all share the ability to process SPARQL, they can differ considerably with regard to capacity, speed, security, scalability and support. This article is based on experience gathered between May and December 2014; the situation may already have changed by the time you read it.
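To make the REST-based access concrete, here is a minimal sketch of how a SPARQL query travels to such an endpoint over HTTP: per the SPARQL 1.1 Protocol, the query text can simply be URL-encoded into a `query` parameter of a GET request. The endpoint URL below is a placeholder; each store exposes its own path (Bigdata's NanoSparqlServer, for instance, uses a different one).

```python
from urllib.parse import urlencode

# Placeholder endpoint URL -- every store exposes its own path.
ENDPOINT = "http://localhost:8080/sparql"

query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"

# SPARQL 1.1 Protocol: a query may be sent as an HTTP GET with the
# query text URL-encoded in the 'query' parameter.
url = ENDPOINT + "?" + urlencode({"query": query})
print(url)
```

An HTTP client pointed at this URL would receive the result set, typically as SPARQL Results JSON or XML depending on the `Accept` header.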

Used stores and the use case
In our work with RDF stores we used a simple use case, a thesaurus matcher, which extracts portions of graphs to link thesaurus entities with words from PDF documents. You should find a demonstration of this at /Semweb1/. The RDF stores used are listed at /semweb2/, in lexicographical order: Bigdata™, GraphDB™, MarkLogic™, PoolParty™, SPARQLverse™ and Virtuoso™.
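The core lookup such a matcher performs can be sketched as a SPARQL query template: given a word extracted from a PDF, find thesaurus concepts whose SKOS label matches it. The query below is our illustration under that assumption, not the query actually used in the project.

```python
def build_lookup_query(word: str) -> str:
    """Illustrative SPARQL template: match a word against SKOS labels."""
    # Escape double quotes so the literal stays well-formed.
    safe = word.replace('"', '\\"')
    return f"""
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept ?label WHERE {{
  ?concept skos:prefLabel|skos:altLabel ?label .
  FILTER (lcase(str(?label)) = lcase("{safe}"))
}}
"""

print(build_lookup_query("ontology"))
```

A real matcher would add language tags, stemming or fuzzy matching, but the pattern of pulling graph portions per word is the same.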

Bigdata can be downloaded as a Java source project and launched quickly by using “ant” to compile and run the application. It works smoothly and offers, besides a nice REST API (NanoSparqlServer), a SAIL API for direct access. GraphDB is a Sesame-based bundle of Java libraries that can be executed directly; it starts without problems and offers a SAIL and a REST interface. MarkLogic Server is a ready-to-use application which can store documents alongside knowledge graphs; it offers REST and Java interfaces and data access through SPARQL and XQuery. PoolParty was used from the cloud (so no installation was needed); it offers a Sesame SPARQL endpoint through a REST interface. SPARQLverse is a Linux-based C++ program bundle which can be linked with Hadoop for highly scalable infrastructures. Virtuoso is a C application which installs directly on your operating system.

We have not yet tested the scalability of each of the RDF stores. Bigdata, GraphDB, MarkLogic, SPARQLverse and Virtuoso all claim to be highly scalable, processing billions of RDF records. For further reading on this topic, we recommend the paper “An evaluation of the performance of triple stores on biological data” in /ncbi2014/.

Importing data
Once you have installed an RDF store you need to (possibly quickly) import data into it. With the exception of PoolParty, whose data is imported and maintained via the well-known PoolParty Suite (an excellent thesaurus and ontology management tool), the RDF stores allow you to import RDF data via their REST interfaces: GraphDB, MarkLogic and SPARQLverse all take this route, Virtuoso lets you import data directly within its highly developed console, the “Virtuoso Conductor”, and Bigdata additionally comes with an excellent direct Java bulk loader. Every RDF store allows you to partition graphs, using concepts that differ in name but are essentially the same in the end: Bigdata has namespaces (a namespace is an indexable portion of a knowledge graph), GraphDB has “repositories”, PoolParty uses “projects”, and Virtuoso has probably the most detailed way to structure knowledge, uploading into several named graphs of its quad store. SPARQLverse allows you to import RDF data programmatically into several graphs.
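As an illustration of REST-based import, the SPARQL 1.1 Graph Store HTTP Protocol lets you push serialized RDF at a graph with a plain HTTP request; several of the stores support some variant of this. The sketch below only builds the request (endpoint path and graph IRI are placeholders; check each store's documentation for the real ones).

```python
from urllib.parse import quote
from urllib.request import Request

# Placeholders -- each store documents its own graph-store endpoint.
STORE = "http://localhost:8080/rdf-graph-store"
GRAPH = "http://example.org/graphs/thesaurus"

turtle = """@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
<http://example.org/concepts/1> skos:prefLabel "ontology"@en .
"""

# POST merges the payload into the target graph; PUT would replace it.
req = Request(
    STORE + "?graph=" + quote(GRAPH, safe=""),
    data=turtle.encode("utf-8"),
    headers={"Content-Type": "text/turtle"},
    method="POST",
)
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the actual import against a running store.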

Security issues
Once you have imported your data into the RDF store – or maybe even before that – you start asking yourself which security concepts are provided to protect your data from unwanted read/write access. Without any doubt, the most advanced security concepts are provided by MarkLogic, Virtuoso and PoolParty, followed by the others with practically no native data protection; these remain dependent on standard web security mechanisms, such as protecting access in configuration files like web.xml. Bigdata has a flag to make the knowledge base “read-only”.
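For stores without native protection, the web.xml fallback mentioned above amounts to a servlet-container security constraint of roughly this shape; the URL pattern and role name here are hypothetical placeholders, not taken from any of the tested stores.

```xml
<!-- Hypothetical fragment: restrict the SPARQL endpoint to an
     authenticated role at the servlet-container level. -->
<security-constraint>
  <web-resource-collection>
    <web-resource-name>SPARQL endpoint</web-resource-name>
    <url-pattern>/sparql/*</url-pattern>
  </web-resource-collection>
  <auth-constraint>
    <role-name>rdf-reader</role-name>
  </auth-constraint>
</security-constraint>
```

This protects access to the endpoint as a whole; it cannot express per-graph or per-triple permissions of the kind the more advanced stores provide.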

Compliance to SPARQL 1.1
Since Bigdata, GraphDB and PoolParty are Sesame-based, they present a consistent response to SPARQL queries. However, we should underline the fact that every store uses its own implementation of the SPARQL 1.1 W3C recommendation. The most complete SPARQL implementations were found in GraphDB, PoolParty and Virtuoso, whereas the other stores currently have restrictions on negation or filtering, which regularly forced a reformulation of SPARQL queries in order to get the same data.
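To illustrate the kind of reformulation meant here (our own example, not a query from the tests): SPARQL 1.1 offers two closely related ways to express negation, `MINUS` and `FILTER NOT EXISTS`, and a store with an incomplete implementation may accept one form but not the other.

```python
# Two formulations of "concepts without an English prefLabel".
# With the shared ?c variable they yield the same solutions here,
# but stores differ in which form they accept or evaluate well.
with_minus = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?c WHERE {
  ?c a skos:Concept .
  MINUS { ?c skos:prefLabel ?l . FILTER (lang(?l) = "en") }
}
"""

with_not_exists = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?c WHERE {
  ?c a skos:Concept .
  FILTER NOT EXISTS { ?c skos:prefLabel ?l . FILTER (lang(?l) = "en") }
}
"""
print(with_minus)
print(with_not_exists)
```

Note that `MINUS` and `FILTER NOT EXISTS` are not interchangeable in general; when the patterns share no variables their results diverge, which is one more reason such rewrites need checking per store.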

Loading speed and performance
Concerning loading speed we found SPARQLverse to be the best candidate, followed by Bigdata, GraphDB and Virtuoso, with MarkLogic in last position. Since PoolParty maintains its data inside the PoolParty Management Suite, we did not run loading-speed tests with it. Operating speed refers here to selecting and delivering data through a REST interface: GraphDB and Bigdata deliver graph portions very quickly, followed by Virtuoso, PoolParty, SPARQLverse and MarkLogic.

Documentation & Support
A further precondition for using an RDF store in commercial projects is the availability of documentation and the presence of a support team; we should note that for these tests we used the non-commercial evaluation version of every store, so the commercial versions should be considered separately. Bigdata and SPARQLverse offer essentially well-structured documentation and a blog for submitting questions or problems; we often received valuable support from Bigdata. GraphDB offers excellent documentation but, at the moment, very poor technical support. We have not yet needed to ask MarkLogic for support; PoolParty and SPARQLverse offered excellent support, and OpenLink (for Virtuoso) sometimes answered with some delay when technical issues occurred. PoolParty, MarkLogic and Virtuoso offer very nice documentation.

(is there a) Conclusion (?)
Which of the available RDF stores you should use depends on your use case and on your actual need for knowledge graphs. We cannot recommend any of the evaluated stores as being the “best” RDF store, but we did find that some of the tested systems are stronger on speed or on modeling. If your application is in the industrial sector, it seems advisable to look first at security and scalability, a solid roadmap, and the real presence of an efficient support team. If you want to run your application in the research field, you perhaps need a store that is fast and full-featured, so that you can reach your targets quickly along your research path. In the competition between RDF stores we will always find the claim that one store is the best from every possible perspective; we advise keeping your distance from such claims and staying on the well-established golden mean.