SampLD

Structural Properties as Proxy for Semantic Relevance

Laurens Rietveld

http://presentations.laurensrietveld.nl/now

Quantity over Quality

  • Datasets become too large to run on commodity hardware
  • We use only a small portion
  • Can't we extract the part we are interested in?


Dataset                | #triples | #queries | coverage
DBpedia                | 459M     | 1640     | 0.003%
Linked Geo Data        | 289M     | 891      | 1.917%
MetaLex                | 204M     | 4933     | 0.016%
Open-BioMed            | 79M      | 931      | 0.011%
BIO2RDF (KEGG)         | 50M      | 1297     | 2.013%
Semantic Web Dog Food  | 0.24M    | 193      | 62.4%

Relevance Based Sampling

  • Find the smallest possible RDF subgraph that covers the maximum number of potential queries

  • How can we determine which triples are relevant, and which are not?
  • Can we implement a scalable sampling pipeline?
  • Can we evaluate the results in a scalable fashion?

How to determine relevance of triples

Informed Sampling

  • We know exactly which queries will be asked
  • Extract those triples needed to answer the queries
  • Problem: only a limited number of queries known

Uninformed Sampling

  • We do not know which queries will be asked
  • Use information contained in the graph to determine relevance
  • Rank triples by relevance, and select the k best triples (0 < k < size of graph)

Approach

  • Use the topology of the graph to determine relevance (network analysis)
  • Evaluate the relevance of our samples against the queries that we do know
  • Is network structure a good predictor for query answerability?

Network Analysis

  • Example: explaining real-world phenomena
  • Find central parts of the graph
  • Betweenness Centrality
  • Google PageRank

  • We apply
    • In Degree
    • Out Degree
    • PageRank

Ranked List of Triples

  • Rewrite the RDF graph into a plain network
  • Apply network analysis
  • Node weights → triple weights:
    W(triple) = max(W(subject), W(object))
  • Weighted list of triples → sample (see the sketch below)
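
A minimal sketch of this ranking step (not the SampLD implementation), assuming networkx is available; the toy triples, the simple subject→object rewrite, and the value of k are illustrative:

```python
# Rough sketch: rank triples by node centrality and keep the k best ones.
import networkx as nx

triples = [
    (":Laurens", ":bornIn", ":Amsterdam"),
    (":Amsterdam", ":capitalOf", ":NL"),
    (":Stefan", ":bornIn", ":Berlin"),
    (":Berlin", ":capitalOf", ":Germany"),
    (":Rinke", ":bornIn", ":Heerenveen"),
]

# Rewrite step (simplified): treat the RDF graph as a plain directed graph
# of subject -> object edges; predicates are dropped in this toy rewrite.
g = nx.DiGraph()
g.add_edges_from((s, o) for s, _, o in triples)

# Node weights from one of the network-analysis algorithms (here PageRank;
# in degree or out degree would be drop-in replacements).
node_weight = nx.pagerank(g)

# Triple weight: W(triple) = max(W(subject), W(object)).
def triple_weight(triple):
    s, _, o = triple
    return max(node_weight[s], node_weight[o])

# Weighted list of triples -> sample: keep the k highest-ranked triples.
k = 3
sample = sorted(triples, key=triple_weight, reverse=True)[:k]
print(sample)
```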

Evaluation

  • Sample sizes: 1% - 99%
  • Baselines:
    • Random Sample (10x)
    • Resource Frequency

Naive evaluation does not scale



$t_e(t_d) = \sum\limits_{i=1}^{99} \frac{i}{100} \cdot \mathit{methods}_s \cdot \mathit{methods}_b \cdot t_d$

where $t_d$ is the time to load the full dataset, and $\mathit{methods}_s$ and $\mathit{methods}_b$ are the numbers of sampling methods and baselines.


  • Over 15,000 datasets and over 1.4 trillion triples
  • Requirements
    • Fast loading of samples
    • Powerful hardware
  • Not scalable: loading all sample triples, executing all queries, and calculating recall (rough calculation below)
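
For intuition only, a back-of-the-envelope calculation with invented numbers; the method counts and the load time t_d below are assumptions, not the actual figures:

```python
# Plug assumed numbers into the evaluation-time formula above.
methods_s = 12   # number of sampling methods (assumption)
methods_b = 2    # number of baselines (assumption)
t_d = 5.0        # hours to load one full dataset (assumption)

# Sum over sample sizes 1%..99%, loading each sample for each method.
t_e = sum(i / 100 * methods_s * methods_b * t_d for i in range(1, 100))
print(f"~{t_e:.0f} hours of loading per dataset")  # ~5940 hours here
```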

Scalable Approach

  • Retrieve which triples are used by a query
  • Use a Hadoop cluster to find the weights of these triples
  • Analyze whether these triples would have been included in the sample

  • Scalable: each query is executed only once

Example

Query

?person   | ?country
:Laurens  | :NL
:Stefan   | :Germany

Dataset

Subject     | Predicate   | Object       | Weight
:Laurens    | :bornIn     | :Amsterdam   | 0.6
:Amsterdam  | :capitalOf  | :NL          | 0.1
:Stefan     | :bornIn     | :Berlin      | 0.5
:Berlin     | :capitalOf  | :Germany     | 0.5
:Rinke      | :bornIn     | :Heerenveen  | 0.1

Query results → Triples

Triples used in the result set:

Subject     | Predicate   | Object
:Laurens    | :bornIn     | :Amsterdam
:Amsterdam  | :capitalOf  | :NL
:Stefan     | :bornIn     | :Berlin
:Berlin     | :capitalOf  | :Germany

Corresponding answers:

?person   | ?city       | ?country
:Laurens  | :Amsterdam  | :NL
:Stefan   | :Berlin     | :Germany

Triples → Recall

Which answers would we get with a sample of 60%? (worked out below)

Dataset (sorted by weight):

Subject     | Predicate   | Object       | Weight
:Laurens    | :bornIn     | :Amsterdam   | 0.6
:Stefan     | :bornIn     | :Berlin      | 0.5
:Berlin     | :capitalOf  | :Germany     | 0.5
:Amsterdam  | :capitalOf  | :NL          | 0.1
:Rinke      | :bornIn     | :Heerenveen  | 0.1

Triples used in query result sets:

Subject     | Predicate   | Object
:Laurens    | :bornIn     | :Amsterdam
:Amsterdam  | :capitalOf  | :NL
:Stefan     | :bornIn     | :Berlin
:Berlin     | :capitalOf  | :Germany
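
With these weights, a 60% sample keeps the three highest-weighted of the five triples; :Amsterdam :capitalOf :NL falls outside it, so the :Laurens answer is lost and only the :Stefan answer survives (recall 0.5). A minimal Python sketch of this check, with the weights and answer-to-triple mappings copied from the tables above:

```python
# An answer survives only if every triple it depends on is in the sample.
dataset = [  # (triple, weight), sorted by descending weight
    ((":Laurens", ":bornIn", ":Amsterdam"), 0.6),
    ((":Stefan", ":bornIn", ":Berlin"), 0.5),
    ((":Berlin", ":capitalOf", ":Germany"), 0.5),
    ((":Amsterdam", ":capitalOf", ":NL"), 0.1),
    ((":Rinke", ":bornIn", ":Heerenveen"), 0.1),
]

# Triples each query answer depends on (from "Query results -> Triples").
answers = {
    (":Laurens", ":Amsterdam", ":NL"): [
        (":Laurens", ":bornIn", ":Amsterdam"),
        (":Amsterdam", ":capitalOf", ":NL"),
    ],
    (":Stefan", ":Berlin", ":Germany"): [
        (":Stefan", ":bornIn", ":Berlin"),
        (":Berlin", ":capitalOf", ":Germany"),
    ],
}

# A 60% sample keeps the 3 highest-weighted of the 5 triples.
k = round(0.6 * len(dataset))
sample = {t for t, _ in dataset[:k]}

kept = [a for a, deps in answers.items() if all(t in sample for t in deps)]
print(kept)                       # only (:Stefan, :Berlin, :Germany)
print(len(kept) / len(answers))   # recall = 0.5
```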

Evaluation

  • Better specificity than regular recall
    • Scalable: Pig instead of SPARQL
  • Special cases, e.g. GROUP BY, LIMIT, DISTINCT, OPTIONAL, UNION

Special case: UNION

Subject   | Predicate   | Object
:Laurens  | rdfs:label  | "Laurens"
:Laurens  | foaf:name   | "Laurens"

Here either triple alone yields the same binding, so an answer produced by a UNION is counted as retrievable when the triples of at least one branch are in the sample.

Evaluation Datasets

Dataset                | #triples | #queries | coverage
DBpedia                | 459M     | 1640     | 0.003%
Linked Geo Data        | 289M     | 891      | 1.917%
MetaLex                | 204M     | 4933     | 0.016%
Open-BioMed            | 79M      | 931      | 0.011%
BIO2RDF (KEGG)         | 50M      | 1297     | 2.013%
Semantic Web Dog Food  | 0.24M    | 193      | 62.4%

Results


Observations

  • The influence of a single triple
  • DBpedia
    • Good: Path + PageRank
    • Bad: Path + Out Degree
    • Queries: 2/3 require literals
  • Other Observations
    • # properties vs 'Context Literals' rewrite method
    • # query triple patterns

Conclusion

  • Scalable pipeline: network analysis algorithms + rewrite methods
  • Able to evaluate over 15,000 datasets and 1.4 trillion triples
  • Number of query sets too limited to learn significant correlations

  • Topology of the graphs can be used to determine good samples
  • Mimic semantic relevance through structural properties, without an a priori notion of relevance



Special thanks to Semantic Web Science Association (SWSA)