Sunday, April 05, 2015

RDF Faceting: Query Facets + Range facets = Object Ranges Queries Facets

Faceting on semi-structured data like RDF is definitely (at least for me) an interesting topic.

Issues #28 and #47 track the progress of that feature in SolRDF: RDF Faceting.
I just committed a stable version of one of those kinds of faceting: Object Ranges Queries Facets (issue #28). You can find here a draft of the documentation describing how faceting works in SolRDF.

In a previous article I described how plain, basic SPOC faceting works; here I introduce a new type of faceting: Object Ranges Queries Facets.

Range faceting is a built-in feature of Solr: you can get these facets on all fields that support range queries (e.g. dates and numerics). For instance, asking for something like this:

facet.range=year
facet.range.start=2000
facet.range.end=2015
facet.range.gap=1

you will get the following response:

<lst name="facet_ranges">
    <lst name="year">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...  
       </lst>      
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2015</int>
    </lst>
    ...

Plain range faceting on RDF schema? mmm....

SolRDF indexes semi-structured data, so we don't have arbitrary fields like year, creation_date, price and so on. We always have these fields:
  • s(ubject)
  • p(redicate)
  • o(bject) 
  • and optionally a c(ontext)
So here comes the question: how can I get the right domain values for my range facets? I don't have an explicit "year" or "price" or whatever attribute.
See the following data, which is a simple RDF representation of two projects (#xyz and #kyj):

@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix abc: <http://a.b.c#> .
@prefix cde: <http://c.d.e#> .

<#xyz> 
    abc:start_year "2001"^^xsd:integer ;
    abc:end_year "2003"^^xsd:integer ;
    cde:first_prototype_date "2001-06-15"^^xsd:date ;
    cde:last_prototype_date "2002-06-30"^^xsd:date ;
    cde:release_date  "2003-10-10"^^xsd:date .

<#kyj> 
    abc:start_year "2002"^^xsd:integer ;
    abc:end_year "2007"^^xsd:integer ;
    cde:first_prototype_date "2003-09-27"^^xsd:date ;
    cde:last_prototype_date "2005-08-24"^^xsd:date ;
    cde:release_date  "2007-03-10"^^xsd:date .

The following table illustrates how the same data is indexed within Solr:
S(ubject)    P(redicate)                          O(bject)
#xyz         http://a.b.c#start_year              "2001"^^xsd:integer
#xyz         http://a.b.c#end_year                "2003"^^xsd:integer
#xyz         http://c.d.e#first_prototype_date    "2001-06-15"^^xsd:date
...

As you can see, the "logical" name of the attribute that each triple represents is in the P column, while the value of that attribute is in the O column. This is the main reason why plain Solr range faceting wouldn't work here: a request like this:

facet.range=o

would mix apples and bananas. In addition, without knowing in advance the domain of the target value (e.g. integer, double, date, datetime) how could we express a valid facet.range.start, facet.range.end and facet.range.gap?

Requesting the same thing for s or p attributes doesn't make any sense at all because the datatype (string) doesn't support this kind of faceting. 

Object Ranges Queries Facets

In order to enable a range faceting that makes sense on SolRDF, I replaced the default FacetComponent with a custom subclass that does something I called Object Ranges Queries Facets, which is actually a mix between facet ranges and facet queries.
  • Object because the target field is the o(bject)
  • Facet because, of course, the final results are facets
  • Range because what we are going to compute are facet ranges
  • Queries because instead of indicating the target attribute in the request (by means of the facet.range parameter), this kind of faceting requires a facet.range.q, which is a query (by default parsed by the Solr Query Parser) that selects the objects (i.e. the "o" attribute) of all matching triples (i.e. SolrDocument instances) and then calculates the ranges on top of them.
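The idea behind this kind of faceting can be sketched in a few lines of plain Python, outside Solr. This is only an illustration of the concept (select the triples by predicate first, then bucket their objects), not the actual SolRDF implementation; the `TRIPLES` list and the `object_range_facets` function are hypothetical names introduced here.

```python
# Hypothetical in-memory stand-in for the indexed triples (s, p, o),
# taken from the two example projects.
TRIPLES = [
    ("#xyz", "http://a.b.c#start_year", 2001),
    ("#xyz", "http://a.b.c#end_year", 2003),
    ("#kyj", "http://a.b.c#start_year", 2002),
    ("#kyj", "http://a.b.c#end_year", 2007),
]


def object_range_facets(triples, predicate, start, end, gap):
    """Count the objects of the triples having the given predicate,
    grouped in buckets [lower, lower + gap) between start and end."""
    counts = {lower: 0 for lower in range(start, end, gap)}
    for _subject, p, o in triples:
        # The "query" step: keep only the objects of the matching triples.
        if p == predicate and start <= o < end:
            # The "range" step: find the lower bound of the bucket.
            lower = start + ((o - start) // gap) * gap
            counts[lower] += 1
    return counts


facets = object_range_facets(TRIPLES, "http://a.b.c#start_year", 2000, 2010, 1)
for lower, count in facets.items():
    print(lower, count)
```

Because the predicate filter runs before the bucketing, the objects of abc:end_year (or of the date predicates) never end up in the same ranges as the start years.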
Returning to our example, we could issue a request like this:

facet.range.q=p:<http://a.b.c#start_year>
facet.range.start=2000
facet.range.end=2010
facet.range.gap=1

or like this:

facet.range.q=p:<http://c.d.e#release_date>
facet.range.start=2000-01-10T17:00:00Z
facet.range.end=2010-01-10T17:00:00Z
facet.range.gap=+1YEAR
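When such a request is sent over HTTP, the parameters have to be URL-encoded; in particular the leading "+" of a date gap must become %2B, otherwise it is decoded as a space. The following is a minimal sketch that only builds the URL. The endpoint in `SOLRDF_SELECT_URL` is a hypothetical placeholder, the parameter values are illustrative, and I am assuming that the usual Solr `q` and `facet=true` parameters are required here as well.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint: adjust host, port and core name to your installation.
SOLRDF_SELECT_URL = "http://localhost:8080/solr/store/select"


def build_range_facet_url(base_url, range_query, start, end, gap):
    """Build a URL carrying one facet.range.q and its range parameters."""
    params = [
        ("q", "*:*"),        # assumption: match all triples
        ("facet", "true"),   # assumption: standard Solr switch for faceting
        ("facet.range.q", range_query),
        ("facet.range.start", start),
        ("facet.range.end", end),
        ("facet.range.gap", gap),
    ]
    # urlencode percent-encodes reserved characters such as "+", "<", ">" and ":".
    return base_url + "?" + urlencode(params)


url = build_range_facet_url(
    SOLRDF_SELECT_URL,
    "p:<http://c.d.e#release_date>",
    "2000-01-10T17:00:00Z",
    "2010-01-10T17:00:00Z",
    "+1YEAR",
)
print(url)

# Decoding the query string gives back the original values.
decoded = parse_qs(urlsplit(url).query)
print(decoded["facet.range.gap"])
```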

You can have more than one facet.range.q parameter. In this case the facet response will look like this:

<lst name="facet_ranges">
    <lst name="p:<http://a.b.c#start_year>">
      <lst name="counts">
          <int name="2000">3445</int>
          <int name="2001">2862</int>
          <int name="2002">2776</int>
          <int name="2003">2865</int>
          ...
       </lst>
       <int name="gap">1</int>
       <int name="start">2000</int>
       <int name="end">2010</int>
    </lst>
    <lst name="p:<http://c.d.e#release_date>">
      <lst name="counts">
          <int name="2000-03-29T17:06:02Z">2516</int>
          <int name="2001-04-03T21:30:00Z">1272</int>
          ...
       </lst>
       <int name="gap">+1YEAR</int>
       <int name="start">2000-01-10T17:00:00Z</int>
       <int name="end">2010-01-10T17:00:00Z</int>
    </lst>
    ...
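On the client side, a response like this can be turned into a plain dictionary keyed by the facet.range.q value. The sketch below uses only the Python standard library and a shortened, illustrative response fragment; it assumes the real response is well-formed XML, so the "<" and ">" of the query are escaped inside the name attribute. `parse_facet_ranges` is a hypothetical helper, not part of SolRDF.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment: in well-formed XML the "<" inside an attribute
# value is escaped as &lt; (and ">" usually as &gt;).
SAMPLE = """<lst name="facet_ranges">
  <lst name="p:&lt;http://a.b.c#start_year&gt;">
    <lst name="counts">
      <int name="2000">3445</int>
      <int name="2001">2862</int>
    </lst>
    <int name="gap">1</int>
    <int name="start">2000</int>
    <int name="end">2010</int>
  </lst>
</lst>"""


def parse_facet_ranges(xml_text):
    """Map each range query to a {bucket lower bound: count} dictionary."""
    root = ET.fromstring(xml_text)
    result = {}
    for facet in root.findall("lst"):
        counts = facet.find("lst[@name='counts']")
        result[facet.get("name")] = {
            item.get("name"): int(item.text) for item in counts.findall("int")
        }
    return result


ranges = parse_facet_ranges(SAMPLE)
for query, buckets in ranges.items():
    print(query, buckets)
```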

You can do more with request parameters, query aliasing and shared parameters. Please have a look at the SolRDF Wiki.

As usual, feedback is warmly welcome ;)
