Updates on the Sindice SPARQL Endpoint

SPARQL is really hard to beat as a tool for experimenting with data integration across datasets. For this reason we get many requests about what data we have in sparql.sindice.com, how often it is updated, and so on.

We admit there has been some disappointment: in the past months we were not able to keep our promise of having the whole content also available via SPARQL and updated in real time. Sorry about that; it worked under certain conditions, but extending those conditions has so far proved elusive.

While the technology is steadily improving, we are happy to report recent improvements that should bring back SPARQL.sindice.com as a useful tool in your data explorations.

These changes have to do with what data we load and how often.

First of all, there are two different kinds of datasets living in Sindice:

  • Websites - these are harvested by the crawlers; the best ones are those connected to a sitemap.xml file or acquired via custom acquisition scripts (which are possible, just ask us). Websites can be added by anyone: just submit a sitemap and Sindice will start processing it.
  • Datasets - these are in general LOD cloud datasets, or other notable RDF datasets that are not connected to the Web in a way that is easy for Sindice to crawl and process. Datasets are added manually whenever we get requests.
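For illustration, a minimal sitemap.xml that a site owner might submit could look like the sketch below (the URLs are placeholders; the schema is the standard sitemaps.org one):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- each <url> entry points to a page exposing structured data -->
  <url>
    <loc>http://example.org/data/page1</loc>
    <lastmod>2012-01-15</lastmod>
  </url>
  <url>
    <loc>http://example.org/data/page2</loc>
  </url>
</urlset>
```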

SPARQL is undoubtedly a hardware-intensive business. For the SPARQL endpoint we use 8 machines with 48 GB of RAM each, divided into two clusters of 4 machines. Under this configuration, keeping all the content of Sindice (close to 80 billion triples) is not feasible at the moment, as it would require at least 2 to 4 times as much hardware.

So we select the data that goes in the SPARQL endpoint as follows:

1) Websites (3 billion triples) are loaded based on their current ranking within Sindice, plus a manual list (please ask if you would like a specific website to be added). The current list is available on request.

2) Datasets (1.5 billion triples) are all loaded in full; see the list below.

Datasets currently held in the SPARQL endpoint can be retrieved with the SPARQL query below (run it at sparql.sindice.com) or from the list at the end of this post:

Note that in the case of website datasets, we group the data by (second-level) domain, so all triples coming from the same domain belong to the same website dataset.
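As a sketch of that grouping rule, here is a naive version that simply keeps the last two host labels (it deliberately ignores public-suffix subtleties such as .co.uk domains):

```python
from urllib.parse import urlparse

def website_dataset_key(url):
    """Group URLs by second-level domain (naive: keep the last two labels)."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host

# Both pages map to the same website dataset:
print(website_dataset_key("http://data.example.org/page1"))  # example.org
print(website_dataset_key("http://www.example.org/page2"))   # example.org
```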

 

prefix dataset: <http://vocab.sindice.net/dataset/1.0/>
prefix void: <http://rdfs.org/ns/void#>
SELECT ?dataset_uri ?dataset_name ?dataset_type ?triples ?snapshot
FROM <http://sindice.com/dataspace/default/dataset/index>
WHERE {
  ?dataset_uri dataset:type ?dataset_type ;
               dataset:name ?dataset_name .
  OPTIONAL {
    ?dataset_uri dataset:void ?void_graph ;
                 dataset:snapshot ?snapshot .
    GRAPH ?void_graph {
      ?dataset_uri void:triples ?triples .
    }
  }
}

 

See the results here.
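If you prefer to query the endpoint programmatically, a minimal sketch using the standard SPARQL protocol over HTTP could look like this (the endpoint URL and the `format` parameter name are assumptions; `query` is the standard SPARQL protocol parameter):

```python
from urllib.parse import urlencode
from urllib.request import urlopen

ENDPOINT = "http://sparql.sindice.com/sparql"  # assumed endpoint URL

def build_query_url(query, fmt="application/sparql-results+json"):
    """Build a GET URL for a SPARQL protocol endpoint."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": fmt})

url = build_query_url("SELECT ?s WHERE { ?s ?p ?o } LIMIT 1")
# Uncomment to run against the live endpoint:
# print(urlopen(url).read().decode("utf-8"))
```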

As a result of this new setup and of the new procedures, we believe the reliability of the SPARQL endpoint is now much greater. We look forward to your feedback on this.

What’s next?

The next thing that's happening is that you will be able to search from the homepage and see directly in the UI what's in SPARQL and what's not. You will also be able to request that a dataset (or a website) be added. Adding datasets will go through moderation, but it will be relatively fast.

Also, for the moment the SPARQL endpoint will be updated once a month or on request. In the future we'll switch to at least twice a month.

Real-time updated results are of course always available via the regular Sindice API (courtesy of Siren).

List of DUMP datasets

dbpedia,   triples = 698.420.083
europeana, triples = 127.002.152
cordis,    triples =   7.107.144
eures,     triples =   7.101.481
ookaboo,   triples =  51.075.884
geonames,  triples = 114.390.681
worldbank, triples = 370.447.929
omim,      triples =     366.247
sider,     triples =  18.887.105
goa,       triples =  18.952.170
chembl,    triples = 133.026.204
ctd,       triples = 101.951.745
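As a quick arithmetic check, the counts above (copied verbatim, dots being thousands separators) sum to roughly 1.65 billion triples, in the same ballpark as the "1.5 billion triples" figure earlier in the post:

```python
# Triple counts from the dump-dataset list above
counts = {
    "dbpedia":   698420083,
    "europeana": 127002152,
    "cordis":      7107144,
    "eures":       7101481,
    "ookaboo":    51075884,
    "geonames":  114390681,
    "worldbank": 370447929,
    "omim":         366247,
    "sider":      18887105,
    "goa":        18952170,
    "chembl":    133026204,
    "ctd":       101951745,
}
total = sum(counts.values())
print(total)  # 1648728825, i.e. about 1.65 billion triples
```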
Post filed under Sindice.

