Microformat Support

Overview

Microformats parsed out of HTML pages are converted into an RDF representation. This way you do not need to care about the original data format and can simply query for the right RDF properties.

In general you can (and should) write queries that are agnostic to the original semantic representation format, simply asking for the right RDF-mapped microformat class.
It is possible, however, to restrict a query to documents that were originally obtained from a microformat-enabled HTML page or from a page containing RDFa.

Status

  • 2008-04-21: support rel-license, hListing, EXFN. Change archive format to accommodate new spec.
  • 2008-07-28: generally improved documentation
  • 2008-08-10: first public release

General HTML extraction

An HTML page can contain any number of microformats. If more than one is detected, the resulting RDF graph simply contains the RDF conversion of all the microformats, plus the RDFa found in it.

If at least one microformat is detected, the HTML <title> is also extracted and attached to the page URI with dc:title, and subsequent extraction passes are run over it. See [Extraction strategy].

Each format is tagged with the format metadata field. Every microformat has a specific name, written in all uppercase, which is the value used here. All of them (except RDFa, which is not a microformat) also carry MICROFORMAT as a special catch-all format.
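The tagging rule above can be sketched in a few lines of Python. This is only an illustration: the function name format_tags is hypothetical, and the assumption that RDFa is tagged as RDFA simply follows the "all uppercase" rule.

```python
def format_tags(format_name: str) -> list[str]:
    """Return the values of the format metadata field for one extracted format."""
    name = format_name.upper()
    # RDFa is not a microformat, so it does not get the catch-all tag
    # (the tag value "RDFA" is an assumption based on the uppercase rule).
    if name == "RDFA":
        return [name]
    return [name, "MICROFORMAT"]


print(format_tags("hCard"))  # ['HCARD', 'MICROFORMAT']
print(format_tags("RDFa"))   # ['RDFA']
```

This is why a query on format:MICROFORMAT matches a page carrying any microformat, while format:HCARD matches only pages with hCards.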

XFN

XFN (XHTML Friends Network) allows hyperlinks to be marked up as typed links that represent a human relationship between the authors or owners of two web pages. A rel attribute is added to the HTML hyperlink, e.g. rel="contact", rel="sweetheart", rel="co-worker". A special case is rel="me", which connects different pages owned or created by the same person. This means that each link basically just transforms into a triple.

The guide is available here. In RDF:

An XFN me link from page.html to other.html in RDF
[
    a foaf:Person;
    foaf:isPrimaryTopicOf <page.html>;
    foaf:isPrimaryTopicOf <other.html>;
] .
An XFN friend link from page.html to other.html in RDF
[
    a foaf:Person;
    foaf:isPrimaryTopicOf <page.html>;
    xfn:friend [
        a foaf:Person;
        foaf:isPrimaryTopicOf <other.html>;
    ];
] .

The xfn: namespace expands to http://gmpg.org/xfn/11# (which, technically, is namespace squatting).

We expand the XFN domain by adding direct triples, in the form

<uri.html> xfnrdf:friend <other.html>
<uri.html> xfnrdf:me <other.html>
<uri.html> xfnrdf:sweetheart <other.html>

Where xfnrdf is a vocabulary (namespace http://vocab.sindice.com/xfn) described in this guide. This is actually done via the same code, so the mapping is 1:1.
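As a rough illustration of how each XFN link becomes one direct triple, here is a minimal Python sketch. It is not the crawler's code: the class and function names are hypothetical, XFN_RELS lists only the rel values mentioned on this page, and predicates are kept as prefixed names (xfnrdf:friend) rather than expanded URIs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Only the rel values mentioned on this page; the real XFN list is longer.
XFN_RELS = {"me", "friend", "contact", "sweetheart", "co-worker"}


class XfnLinkParser(HTMLParser):
    """Collects one (page, xfnrdf:<rel>, target) triple per XFN rel value."""

    def __init__(self, page_uri):
        super().__init__()
        self.page_uri = page_uri
        self.triples = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href, rel = attrs.get("href"), attrs.get("rel")
        if not href or not rel:
            return
        target = urljoin(self.page_uri, href)  # resolve relative links
        for value in rel.lower().split():      # rel may hold several values
            if value in XFN_RELS:
                self.triples.append((self.page_uri, "xfnrdf:" + value, target))


def xfn_direct_triples(page_uri, html):
    parser = XfnLinkParser(page_uri)
    parser.feed(html)
    parser.close()
    return parser.triples


page = "http://example.org/page.html"
html = '<a href="other.html" rel="friend">Joe</a> <a href="/home" rel="me">me</a>'
for triple in xfn_direct_triples(page, html):
    print(triple)
```

The foaf:Person form shown earlier would be produced from the same link data; only the direct form is sketched here.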

Example searches that may/should return microformats (term, advanced):

Joe format:XFN
 * <http://sindice.com/exfn/0.1/met-hyperlink> *
class:Person

hCard

We rely on the hCard profile, which basically uses the vCard ontology.

Given something like:

<div class="vcard">
  <a class="fn org url" href="http://www.commerce.net/">CommerceNet</a>
  <div class="adr">
    <span class="type">Work</span>:
    <div class="street-address">169 University Avenue</div>
    <span class="locality">Palo Alto</span>,
    <abbr class="region" title="California">CA</abbr>
    <span class="postal-code">94301</span>
    <div class="country-name">USA</div>
  </div>
  <div class="tel">
   <span class="type">Work</span> +1-650-289-4040
  </div>
  <div class="tel">
    <span class="type">Fax</span> +1-650-289-4041
  </div>
 </div>

We get

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix v:       <http://www.w3.org/2006/vcard/ns#> .

[]    rdf:type v:VCard ;
      v:adr   [ rdf:type v:Address ;
                v:countryName "USA" ;
                v:locality "Palo Alto" ;
                v:postalCode "94301" ;
                v:region "CA" ;
                v:streetAddress "169 University Avenue"
              ] ;
      v:fn    "CommerceNet" ;
      v:org   [ rdf:type v:Organization ;
                v:organization-name "CommerceNet"
              ] ;
      v:tel   <tel:+1-650-289-4041> , <tel:+1-650-289-4040> ;
      v:url   <http://www.commerce.net/> .

As example queries you may try

format:HCARD
class:VCard class:Organization
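To make the adr part of the mapping above concrete, the following Python sketch pulls the address sub-properties out of markup like the example and names them as in the Turtle output. It is a simplification under stated assumptions: the names AdrParser and extract_adr are hypothetical, the class-to-property table is copied from the example output only, and the sketch does not check that the elements actually sit inside a class="adr" container.

```python
from html.parser import HTMLParser

# hCard class name -> vCard property, as seen in the example output above.
ADR_PROPERTIES = {
    "street-address": "v:streetAddress",
    "locality": "v:locality",
    "region": "v:region",
    "postal-code": "v:postalCode",
    "country-name": "v:countryName",
}
# Elements without a closing tag must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class AdrParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # one entry per open element: a property or None
        self.address = {}  # property -> text content

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        prop = next((ADR_PROPERTIES[c] for c in classes if c in ADR_PROPERTIES), None)
        self.stack.append(prop)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1]:
            prop = self.stack[-1]
            self.address[prop] = self.address.get(prop, "") + text


def extract_adr(html):
    parser = AdrParser()
    parser.feed(html)
    parser.close()
    return parser.address
```

Run over the adr block of the example, this yields "CA" for v:region (the element text, not the title attribute), matching the output shown above.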

Geo

Geo uses the same ontology as hCard but covers only some of the types in that ontology. So, for example, you can ask for:

class:Location
format:GEO
* <http://www.w3.org/2006/vcard/ns#latitude> "51.5217"

Adr

As above, Adr covers just the Address class of the vCard ontology. So you can ask for:

class:Address
format:ADR
* <http://www.w3.org/2006/vcard/ns#country-name> "Germany"

hCalendar

Based on the RDF Calendar spec.

Given:

<div class="vevent">
 <a class="url" href="http://www.web2con.com/">http://www.web2con.com/</a>
  <span class="summary">Web 2.0 Conference</span>:
  <abbr class="dtstart" title="2007-10-05">October 5</abbr>-
  <abbr class="dtend" title="2007-10-20">19</abbr>,
 at the <span class="location">Argent Hotel, San Francisco, CA</span>
 </div>

We get:

@prefix r:       <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix c:       <http://www.w3.org/2002/12/cal/icaltzd#> .
<>    r:type  c:Vcalendar ;
      c:component
              [ r:type  c:Vevent ;
                c:dtend "2007-10-20"^^<http://www.w3.org/2001/XMLSchema#date> ;
                c:dtstart "2007-10-05"^^<http://www.w3.org/2001/XMLSchema#date> ;
                c:location "Argent Hotel, San Francisco, CA" ;
                c:summary "Web 2.0 Conference" ;
                c:url   <http://www.web2con.com/>
              ] .

Every page that contains any calendar data yields at least one Vcalendar entity. Where the calendar is only implied (a page can contain just a Vevent), it is added automatically.
Each calendar may have one or more components, which can be Vevent, Vtodo, Vjournal, Vfreebusy etc.

So you can look for calendars with queries like:

class:Vcalendar
Eric Clapton class:Vevent
Live show format:HCALENDAR
opera ontology:icaltzd
* <http://www.w3.org/2002/12/cal/icaltzd#dtstart> "20080527T1900+0100"
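The rule that a Vcalendar is added automatically when a page only has bare components can be sketched as follows. This is a hypothetical illustration, not the extractor itself: entities are modelled as plain dicts with a "type" key, and the function name ensure_vcalendar is invented for the example.

```python
COMPONENT_TYPES = {"Vevent", "Vtodo", "Vjournal", "Vfreebusy"}


def ensure_vcalendar(page_uri, entities):
    """Wrap bare calendar components in an implied Vcalendar for the page."""
    # The page already declares a calendar: nothing to add.
    if any(e["type"] == "Vcalendar" for e in entities):
        return entities
    components = [e for e in entities if e["type"] in COMPONENT_TYPES]
    # No calendar data at all: nothing to add either.
    if not components:
        return entities
    others = [e for e in entities if e["type"] not in COMPONENT_TYPES]
    # As in the example output, the page itself is the Vcalendar subject.
    calendar = {"type": "Vcalendar", "subject": page_uri, "component": components}
    return others + [calendar]


events = [{"type": "Vevent", "summary": "Web 2.0 Conference"}]
print(ensure_vcalendar("http://www.web2con.com/", events))
```

With this wrapping, a class:Vcalendar query also finds pages whose markup contains only vevent elements.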

hReview

hReview is still in draft, but it seems interesting and some people use it. The reference spec is RDF Review.

The output should be something along the lines of:

@prefix dc:      <http://purl.org/dc/elements/1.1/> .
@prefix review:  <http://www.purl.org/stuff/rev#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .


[]    rdf:type review:Review ;
      dc:title "A wonderful night on acid at MoS!" ;
      review:rating "5" .

Look for it with queries like:

sigur ros class:Review
ontology:rev ipod
format:HREVIEW

Notice that the RDF Review ontology actually has some deployment in its normal RDF form!

rel-license

Extracts the license links (simple rel="license") and transforms them into:

<this.html> dcterms:license <someLicense.html>

Multiple licenses on the same page are extracted as well, one triple per license.
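A minimal Python sketch of this extraction, including the multiple-license case, could look like the following. The names LicenseParser and license_triples are hypothetical, and accepting rel="license" on both <a> and <link> elements is an assumption of the sketch rather than a statement about the crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LicenseParser(HTMLParser):
    """Collects the targets of rel="license" links, without duplicates."""

    def __init__(self, page_uri):
        super().__init__()
        self.page_uri = page_uri
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):  # assumption: both element types count
            return
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href")
        if href and "license" in rels:
            uri = urljoin(self.page_uri, href)
            if uri not in self.licenses:
                self.licenses.append(uri)


def license_triples(page_uri, html):
    parser = LicenseParser(page_uri)
    parser.feed(html)
    parser.close()
    return [(page_uri, "dcterms:license", uri) for uri in parser.licenses]


html = (
    '<a rel="license" href="http://creativecommons.org/licenses/publicdomain/">PD</a>'
    '<a rel="license" href="someLicense.html">other</a>'
)
for triple in license_triples("http://example.org/this.html", html):
    print(triple)
```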

Simple query:

jazz format:LICENSE
* <http://purl.org/dc/terms/license> "http://creativecommons.org/licenses/publicdomain/"

hListing

hListing is unstable, and we extract it tentatively, but as of now we are still working on a proper ontology for it. 

The current parser accommodates this by being extremely liberal. As the format stabilizes we can tighten the rules.

The vocabulary used is namespaced as http://sindice.com/hlisting/0.1/, and is defined in HLISTING.java in the crawler sources. For a Kelkoo listing it produces data like this:

@prefix hl:    <http://sindice.com/hlisting/0.1/> .

[]    a
              hl:Listing ;
      hl:action
              hl:offer;
      hl:description
              "..." ;
      hl:item
              [ a
                        hl:item ;
                hl:itemName
                        "Benq MP622 - DLP Projector - 2700 ANSI lumens - XGA..." ;
                hl:itemPhoto
                        "http://img.kelkoo.com/uk/medium/675/496/00117250662929509422269096808645163496675.jpg" ;
                hl:itemUrl
                        "http://bob.example.com/"
              ] ;
      hl:lister
              [ a
                        hl:Lister ;
                hl:listerLogo
                        "http://bob.example.com/data/merchantlogos/4621623/pcworld.gif" ;
                hl:listerName
                        "PC World Business" ;
                hl:listerOrg
                        "PC World Business" ;
                hl:listerUrl
                        "http://bob.example.com/m-4621623-pc-world-business.html"
              ] ;
      hl:price
              "£480.17".

Obviously this still needs a lot of work and fine-tuning in the source code, but the ontology also needs work: most of it can be reused from existing ontologies so as to produce more useful data (e.g. a lister is a vCard; foaf:depiction, page, related etc. can be reused; foaf:Person and vCard can be connected).

hResume

In HTML, an hResume is a special combination of hCards and hCalendars. Thus we extract both the calendars and the vCards.

In addition, we transform the resume using the DOAC vocabulary. Here is an example from a LinkedIn page; notice the overlapping information:

@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

<http://www.linkedin.com/in/grenzi>
      <http://purl.org/dc/elements/1.1/title>
              "Gabriele Renzi - LinkedIn" .

[]    rdf:type <http://xmlns.com/foaf/0.1/Person> ;
      <http://ramonantonio.net/doac/0.1/#affiliation>
              [ rdf:type <http://www.w3.org/2006/vcard/ns#VCard> ;
                <http://www.w3.org/2006/vcard/ns#fn>
                        "stacktrace member" ;
                <http://www.w3.org/2006/vcard/ns#logo>
                        <http://media.linkedin.com/media/p/1/000/00a/29d/0844af4.png> ;
                <http://www.w3.org/2006/vcard/ns#org>
                        [ rdf:type <http://www.w3.org/2006/vcard/ns#Organization> ;
                          <http://www.w3.org/2006/vcard/ns#organization-name>
                                  "stacktrace member"
                        ] ;
                <http://xmlns.com/foaf/0.1/topic>
                        [ <http://xmlns.com/foaf/0.1/name>
                                  "stacktrace member"
                        ]
              ] ;
      <http://ramonantonio.net/doac/0.1/#affiliation>
              [ rdf:type <http://www.w3.org/2006/vcard/ns#VCard> ;
                <http://www.w3.org/2006/vcard/ns#fn>
                        "Professionisti @ Ruby member" ;
                <http://www.w3.org/2006/vcard/ns#logo>
                        <http://media.linkedin.com/media/p/3/000/00a/361/1a69bc0.png> ;
                <http://www.w3.org/2006/vcard/ns#org>
                        [ rdf:type <http://www.w3.org/2006/vcard/ns#Organization> ;
                          <http://www.w3.org/2006/vcard/ns#organization-name>
                                  "Professionisti @ Ruby member"
                        ] ;
                <http://xmlns.com/foaf/0.1/topic>
                        [ <http://xmlns.com/foaf/0.1/name>
                                  "Professionisti @ Ruby member"
                        ]
              ] ;
      <http://ramonantonio.net/doac/0.1/#affiliation>
              [ rdf:type <http://www.w3.org/2006/vcard/ns#VCard> ;
                <http://www.w3.org/2006/vcard/ns#fn>
                        "Forbes.com Personal Technology Forum member" ;
                <http://www.w3.org/2006/vcard/ns#logo>
                        <http://media.linkedin.com/media/p/1/000/000/005/0a596c4.gif> ;
                <http://www.w3.org/2006/vcard/ns#org>
                        [ rdf:type <http://www.w3.org/2006/vcard/ns#Organization> ;
                          <http://www.w3.org/2006/vcard/ns#organization-name>
                                  "Forbes.com Personal Technology Forum member"
                        ] ;
                <http://xmlns.com/foaf/0.1/topic>
                        [ <http://xmlns.com/foaf/0.1/name>
                                  "Forbes.com Personal Technology Forum member"
                        ]
              ] ;
      <http://ramonantonio.net/doac/0.1/#summary>
              "Coding Monkey, Geek, Professional Student" ;
      <http://xmlns.com/foaf/0.1/isPrimaryTopicOf>
              [ rdf:type <http://www.w3.org/2006/vcard/ns#VCard> ;
                <http://www.w3.org/2006/vcard/ns#adr>
                        [ rdf:type <http://www.w3.org/2006/vcard/ns#Address> ;
                          <http://www.w3.org/2006/vcard/ns#locality>
                                  "Rome Area, Italy"
                        ] ;
                <http://www.w3.org/2006/vcard/ns#fn>
                        "Gabriele Renzi" ;
                <http://www.w3.org/2006/vcard/ns#n>
                        [ rdf:type <http://www.w3.org/2006/vcard/ns#Name> ;
                          <http://www.w3.org/2006/vcard/ns#family-name>
                                  "Renzi" ;
                          <http://www.w3.org/2006/vcard/ns#given-name>
                                  "Gabriele"
                        ] ;
                <http://www.w3.org/2006/vcard/ns#title>
                        "Coding Monkey, Geek, Professional Student" ;
                <http://www.w3.org/2006/vcard/ns#url>
                        <http://www.riffraff.info> , <http://riffraff.blogsome.com> ;
                <http://xmlns.com/foaf/0.1/topic>
                        [ <http://xmlns.com/foaf/0.1/name>
                                  "Gabriele Renzi"
                        ]
              ] .

[]    rdf:type <http://xmlns.com/foaf/0.1/Person> ;
      <http://xmlns.com/foaf/0.1/isPrimaryTopicOf>
              <http://www.linkedin.com/in/grenzi> , <http://www.riffraff.info> ;
      <http://xmlns.com/foaf/0.1/weblog>
              <http://www.linkedin.com/in/grenzi> , <http://www.riffraff.info> .

So you can do searches like

* <http://ramonantonio.net/doac/0.1/#organization> 'something'
format:HRESUME

RDFa

RDFa is not really a microformat, because it allows embedding of any schema, so we extract the triples verbatim, based on the work done here.

Other microformats

These are not yet supported (or not yet documented here). See the list of microformats for a fairly complete overview. A lot of work can be done to improve the current extractors before moving on to others. A major problem could be the nesting of a format inside itself (e.g. vCard agents inside vCards), which was not implemented because there was no interesting data using it, but which should be done for correctness.

WONTUSE formats

Some of the microformats are not planned to be extracted, for now. These include:

  • relNofollow - uninteresting
  • relTag - I remember we wanted to ignore this, but can't remember why -gabriele
  • XOXO - uninteresting by itself, already used when included in others