How we ingested 100M semantic documents in a day (and where they come from)


How to get some unexpected big data satisfaction.
First: build an infrastructure to process millions of documents. Instead of just doing it home-brew, however, do your big data homework, no shortcuts. Second: unclog some long-standing clogged pipe :)


The feeling is that of “it all makes sense”, and it happened to us the other day when we started the dataset indexing pipeline with a queue of a dozen large datasets (more on this below). After doing that, we just sat back and watched the Sindice infrastructure, which usually takes in 1-2 million documents per day, reason over and index 50-100 times as much in the same timeframe, no sweat.

The outcome showed up on the homepage shortly after:

 

Sindice ingests 100m documents in 24h, screenshot

 

But what exactly is this “Dataset Pipeline”?
Well, it's simply a per-dataset approach... to ingesting datasets.
In other words, we decided to do away with basic URL-to-URL crawling for the LOD cloud websites and are instead taking a dataset-by-dataset approach, choosing just the datasets we consider valuable to index.


This, unfortunately, is a bit of manual work. Yes, it is sad that LOD (an initiative born to make data more accessible) hasn’t really succeeded in adopting a way to collect the “deep semantic web” in the same way that the simple sitemaps.org initiative has for the normal web. But that is the case.


So what we do now in terms of data acquisition is:

  • For normal websites that expose structured data (RDF, RDFa, Microformats, Schema.org, etc.), we support Sitemaps. Just submit one and you’ll see the site indexed in Sindice. For these websites we can stay in sync (24h delay or less): just expose a simple sitemap “lastmod” field and we’ll get your latest data (see the sketch after this list).
  • For RDF datasets (e.g. LOD) you have to ask and send us a pointer to the dump :). We’ll add you to a manual watchlist and possibly to the list of those that also make it to SPARQL. Refreshes can be scheduled at regular intervals.
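
To make the sitemap case concrete, here is a minimal sketch (not our actual crawler code) of how a consumer can use the “lastmod” field to fetch only what changed since the last visit. The sitemap URL and the cutoff date below are placeholders.

```python
# Minimal sketch: parse a sitemap and pick out URLs whose <lastmod>
# is newer than our last crawl. The sitemap URL and cutoff are placeholders.
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from urllib.request import urlopen

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
LAST_CRAWL = datetime(2012, 7, 1, tzinfo=timezone.utc)  # hypothetical cutoff

def changed_urls(sitemap_url):
    """Yield (loc, lastmod) pairs that changed since LAST_CRAWL."""
    tree = ET.parse(urlopen(sitemap_url))
    for url in tree.iter(SITEMAP_NS + "url"):
        loc = url.findtext(SITEMAP_NS + "loc")
        lastmod = url.findtext(SITEMAP_NS + "lastmod")
        if loc and lastmod:
            # lastmod is a W3C datetime; a date-only value is also allowed
            when = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
            if when.tzinfo is None:
                when = when.replace(tzinfo=timezone.utc)
            if when > LAST_CRAWL:
                yield loc, when

for loc, when in changed_urls("http://example.org/sitemap.xml"):
    print(loc, when.isoformat())
```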
Which brings us to the next question: which data makes it to the homepage search and which to the SPARQL endpoint? Is it the same?


No, it’s not the same.


The frontpage search of Sindice is based on SIREn and indexes everything (including the full materialization of the reasoning). This index is also updated multiple times a day, e.g. in response to pings or sitemap updates. As a downside, while our frontpage advanced search (plus the Cache API) can certainly be enough to implement some very cool apps (see http://sig.ma, built on a combination of these APIs), they won’t suffice to answer all possible questions.
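
For illustration, here is a rough sketch of how an app might combine a frontpage search query with a cache lookup, sig.ma-style. The endpoint paths and parameter names below are assumptions for illustration only, not the documented API; please check the Sindice API documentation for the real ones.

```python
# Hypothetical sketch of combining the frontpage search API with the Cache
# API. Endpoint paths and parameter names are ASSUMED, not documented here.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SEARCH_ENDPOINT = "http://api.sindice.com/v3/search"  # assumed path
CACHE_ENDPOINT = "http://api.sindice.com/v3/cache"    # assumed path

def search(term):
    """Run a term query and return the parsed JSON results (assumed format)."""
    qs = urlencode({"q": term, "qt": "term", "format": "json"})
    with urlopen(f"{SEARCH_ENDPOINT}?{qs}") as resp:
        return json.load(resp)

def cached_document(url):
    """Fetch Sindice's cached copy of a document (assumed parameters)."""
    qs = urlencode({"url": url, "format": "json"})
    with urlopen(f"{CACHE_ENDPOINT}?{qs}") as resp:
        return json.load(resp)
```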


The SPARQL endpoint, on the other hand, has full query capabilities but only contains a selected subset of the data in Sindice. The Sindice SPARQL dataset consists of the first few hundred websites (in terms of ranking) - see a list here - and, at the moment, the following RDF LOD datasets (a small query sketch against the endpoint follows the list):


dbpedia: 698.420.083 triples
europeana: 117.118.947 triples
cordis: 7.107.144 triples
eures: 7.101.481 triples
ookaboo: 51.088.319 triples
geonames: 114.390.681 triples
worldbank: 87.351.631 triples
omim: 358.222 triples
sider: 18.884.280 triples
goa: 11.008.490 triples
chembl: 137.449.790 triples (and by the way, also see hcls.sindicetech.com for bioscience data and assisted tools)
ctd: 103.354.444 triples
drugbank: 456.112 triples
dblp: 81.986.947 triples
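
If you want to poke at these datasets yourself, a query against the endpoint can be as simple as the sketch below. The endpoint URL here is an assumption (check the Sindice site for the current address); the query itself is plain SPARQL sent over the standard SPARQL protocol.

```python
# Small sketch of querying a SPARQL endpoint over HTTP.
# The endpoint URL is an assumption for illustration.
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://sparql.sindice.com/sparql"  # assumed endpoint URL

QUERY = """
SELECT ?p ?o
WHERE { <http://dbpedia.org/resource/Dublin> ?p ?o }
LIMIT 10
"""

def run_query(query):
    """POST a SPARQL query and return the result bindings as JSON."""
    req = Request(
        ENDPOINT,
        data=urlencode({"query": query}).encode(),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

for row in run_query(QUERY):
    print(row["p"]["value"], row["o"]["value"])
```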


There is, however, some “impedance mismatch” between the SIREn and SPARQL approaches. SIREn is a document index (in Sindice we arbitrarily limit the size of documents to 5MB). So you might ask: how do we manage to index, in our frontpage search, big datasets that come in “one chunk” like the above?


The answer is that we first “slice them” per entity.


The above-mentioned Dataset Pipeline contains a MapReduce job that turns a single big dataset into millions of small RDF files containing ‘entity descriptions’, much like what you get when resolving a URI that belongs to a LOD dataset. This way we generated (out of the above datasets) the descriptions for approximately 100 million entities, which became the 100M documents that went through the pipeline in well under one day (on our DERI/Sindice.com 11-machine Hadoop cluster).
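
To give an idea of the slicing step, here is a much-simplified sketch (not the actual Sindice job) of a Hadoop Streaming mapper/reducer pair that groups N-Triples by subject and emits one small ‘entity description’ per subject. The real pipeline of course writes proper per-entity RDF files and enforces the size limits mentioned above.

```python
# Simplified sketch (not the actual Sindice job): slice a large N-Triples
# dump into per-entity descriptions. Intended as a Hadoop Streaming
# mapper/reducer pair (job invocation details omitted).
import sys

def mapper():
    # Key every triple by its subject so the shuffle groups an entity's data together.
    for line in sys.stdin:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        subject = line.split(None, 1)[0]  # naive N-Triples split; works for <uri> and _:bnode subjects
        print(f"{subject}\t{line}")

def reducer():
    # All triples for one subject arrive consecutively (sorted by key);
    # emit them as one small "entity description", documents separated by a blank line.
    current = None
    for line in sys.stdin:
        subject, triple = line.rstrip("\n").split("\t", 1)
        if subject != current:
            if current is not None:
                print()  # document boundary
            current = subject
        print(triple)

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```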


A hint of what’s next: all the above in a comprehensible UI.


From the standard Sindice search you’ll be able to see what’s updated, what’s not, what’s in SPARQL, what’s not, what’s in each dataset (with links to in-depth RDF analytics for each site) and more. You’ll also be able to “request” that a normal website (e.g. your site) be included in the list of those that make it to the SPARQL endpoint.


Until that happens (likely August), may we recommend you subscribe to the http://sindicetech.com Twitter feed. Via SindiceTech, in the coming weeks and months, we’ll make selected parts of the Sindice.com infrastructure available to developers and enterprises (e.g. you might have seen our recent SparQLed release).
—-


Acknowledgments


Thanks go to the KISTI institute (Korea), which is now supporting some of the work on Sindice.com, and to OpenLink for kindly providing the Virtuoso Cluster edition powering our SPARQL endpoint. Thanks also go to the LOD2 and LATC EU projects, to SFI, and to the others we list here.

