SampLD
Structural Properties as Proxy for Semantic Relevance
Laurens Rietveld
Quantity over Quality
- Datasets become too large to run on commodity hardware
- We use only a small portion
- Can't we extract the part we are interested in?
| Dataset | #triples | #queries | Coverage |
|---|---|---|---|
| DBpedia | 459M | 1640 | 0.003% |
| Linked Geo Data | 289M | 891 | 1.917% |
| MetaLex | 204M | 4933 | 0.016% |
| Open-BioMed | 79M | 931 | 0.011% |
| BIO2RDF (KEGG) | 50M | 1297 | 2.013% |
| Semantic Web Dog Food | 0.24M | 193 | 62.4% |
Relevance Based Sampling
- Find the smallest possible RDF subgraph that covers the maximum number of potential queries
- How can we determine which triples are relevant, and which are not?
- Can we implement a scalable sampling pipeline?
- Can we evaluate the results in a scalable fashion?
How to determine relevance of triples
Informed Sampling
- We know exactly which queries will be asked
- Extract those triples needed to answer the queries
- Problem: only a limited number of queries known
Uninformed Sampling
- We do not know which queries will be asked
- Use information contained in the graph to determine relevance
- Rank triples by relevance, and select the k best triples (0 < k < size of graph)
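The ranking step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the SampLD implementation: the helper name `sample_top_k` is ours, and the weights are copied from the worked example later in these slides.

```python
def sample_top_k(triples, score, k):
    """Return the k triples with the highest relevance score (0 < k < |graph|)."""
    assert 0 < k < len(triples)
    return sorted(triples, key=score, reverse=True)[:k]

# Illustrative weights (same values as the worked example in these slides).
weights = {
    (":Laurens", ":bornIn", ":Amsterdam"): 0.6,
    (":Amsterdam", ":capitalOf", ":NL"): 0.1,
    (":Stefan", ":bornIn", ":Berlin"): 0.5,
    (":Berlin", ":capitalOf", ":Germany"): 0.5,
    (":Rinke", ":bornIn", ":Heerenveen"): 0.1,
}

sample = sample_top_k(list(weights), weights.get, k=3)
```

With `k=3` this keeps the three highest-weighted triples and drops the two weight-0.1 ones.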
Approach
- Use the topology of the graph to determine relevance (network analysis)
- Evaluate the relevance of our samples against the queries that we do know
- Is network structure a good predictor for query answerability?
Network Analysis
- Example uses: explain real-world phenomena
- Find central parts of the graph:
  - Betweenness Centrality
  - Google PageRank
- We apply:
  - In Degree
  - Out Degree
  - PageRank
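A minimal sketch of these three scores on the directed graph formed by subjects and objects. This is plain Python for illustration only; the actual pipeline computes them at scale on Hadoop, and the example nodes are from the worked example later in these slides.

```python
def degrees(edges):
    """In- and out-degree per node of a directed edge list."""
    indeg, outdeg = {}, {}
    for s, o in edges:
        outdeg[s] = outdeg.get(s, 0) + 1
        indeg[o] = indeg.get(o, 0) + 1
    return indeg, outdeg

def pagerank(edges, d=0.85, iters=50):
    """Power-iteration PageRank with damping factor d."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for s, o in edges:
        out[s].append(o)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) / len(nodes) for n in nodes}
        for s in nodes:
            if out[s]:
                share = d * rank[s] / len(out[s])
                for o in out[s]:
                    nxt[o] += share
            else:  # dangling node: spread its rank evenly
                for n in nodes:
                    nxt[n] += d * rank[s] / len(nodes)
        rank = nxt
    return rank

edges = [(":Laurens", ":Amsterdam"), (":Amsterdam", ":NL"),
         (":Stefan", ":Berlin"), (":Berlin", ":Germany")]
indeg, outdeg = degrees(edges)
pr = pagerank(edges)
```

Nodes that receive links (e.g. `:NL`) end up with a higher PageRank than pure sources (e.g. `:Laurens`); how node scores are lifted to per-triple weights is a separate design choice.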
Evaluation
- Sample sizes: 1% - 99%
- Baselines:
- Random Sample (10x)
- Resource Frequency
Naive evaluation does not scale
$t_e(t_d) = \sum\limits_{i=1}^{99} \frac{i}{100} \cdot \text{methods}_s \cdot \text{methods}_b \cdot t_d$
($t_d$: time to load the full dataset; $\text{methods}_s$, $\text{methods}_b$: number of sampling methods and baselines)
- Over 15,000 datasets, and over 1.4 trillion triples
- Requirements
- Fast loading of samples
- Powerful hardware
- Not Scalable: load all triples, execute queries, and calculate recall
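A back-of-the-envelope instance of the formula above shows why this blows up. The numbers here are purely illustrative assumptions (3 sampling methods, 2 baselines, a 1-hour load time), not figures from the evaluation:

```python
methods_s, methods_b = 3, 2   # assumed counts of sampling methods / baselines
t_d = 1.0                     # assumed time to load the full dataset, in hours

# Sum of i/100 for i = 1..99 is 49.5, so even one dataset costs
# 49.5 * 3 * 2 = 297 hours of loading time alone.
t_e = sum(i / 100 for i in range(1, 100)) * methods_s * methods_b * t_d
print(t_e, "hours")  # 297.0 hours, i.e. over 12 days per dataset
```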
Scalable Approach
- Retrieve which triples are used by a query
- Use a Hadoop cluster to find the weights of these triples
- Analyze whether these triples would have been included in the sample
- Scalable: each query is executed only once
Example
Query
| ?person | ?country |
|---|---|
| :Laurens | :NL |
| :Stefan | :Germany |
Dataset

| Subject | Predicate | Object | Weight |
|---|---|---|---|
| :Laurens | :bornIn | :Amsterdam | 0.6 |
| :Amsterdam | :capitalOf | :NL | 0.1 |
| :Stefan | :bornIn | :Berlin | 0.5 |
| :Berlin | :capitalOf | :Germany | 0.5 |
| :Rinke | :bornIn | :Heerenveen | 0.1 |
Query results → Triples

| Subject | Predicate | Object | ?person | ?city | ?country |
|---|---|---|---|---|---|
| :Laurens | :bornIn | :Amsterdam | :Laurens | :Amsterdam | :NL |
| :Amsterdam | :capitalOf | :NL | | | |
| :Stefan | :bornIn | :Berlin | :Stefan | :Berlin | :Germany |
| :Berlin | :capitalOf | :Germany | | | |
Triples → Recall
Which answers would we get with a sample of 60%?
Dataset (sorted by weight)

| Subject | Predicate | Object | Weight |
|---|---|---|---|
| :Laurens | :bornIn | :Amsterdam | 0.6 |
| :Stefan | :bornIn | :Berlin | 0.5 |
| :Berlin | :capitalOf | :Germany | 0.5 |
| :Amsterdam | :capitalOf | :NL | 0.1 |
| :Rinke | :bornIn | :Heerenveen | 0.1 |
Triples used in query result sets

| Subject | Predicate | Object |
|---|---|---|
| :Laurens | :bornIn | :Amsterdam |
| :Amsterdam | :capitalOf | :NL |
| :Stefan | :bornIn | :Berlin |
| :Berlin | :capitalOf | :Germany |
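The 60% question above, executed as a sketch: rank the dataset by weight, keep the top 60%, and count an answer as kept only if every triple that produced it survives. The data is taken from the example tables; the code itself is illustrative, not the Pig implementation.

```python
# Triple weights from the example dataset.
weights = {
    (":Laurens", ":bornIn", ":Amsterdam"): 0.6,
    (":Stefan", ":bornIn", ":Berlin"): 0.5,
    (":Berlin", ":capitalOf", ":Germany"): 0.5,
    (":Amsterdam", ":capitalOf", ":NL"): 0.1,
    (":Rinke", ":bornIn", ":Heerenveen"): 0.1,
}

# Each query answer maps to the set of triples it was derived from.
answers = {
    (":Laurens", ":NL"): {(":Laurens", ":bornIn", ":Amsterdam"),
                          (":Amsterdam", ":capitalOf", ":NL")},
    (":Stefan", ":Germany"): {(":Stefan", ":bornIn", ":Berlin"),
                              (":Berlin", ":capitalOf", ":Germany")},
}

k = int(0.6 * len(weights))  # 60% sample -> top 3 of 5 triples
sample = set(sorted(weights, key=weights.get, reverse=True)[:k])

kept = [a for a, used in answers.items() if used <= sample]
recall = len(kept) / len(answers)
print(recall)  # 0.5: Stefan's answer survives, Laurens's does not
```

Laurens's answer is lost because `:Amsterdam :capitalOf :NL` (weight 0.1) falls outside the 60% sample, even though the other triple it needs is included.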
Evaluation
- Better specificity than regular recall
- Scalable: PIG instead of SPARQL
- Special cases, e.g. GROUP BY, LIMIT, DISTINCT, OPTIONAL, UNION
Special case: UNIONS
| Subject | Predicate | Object |
|---|---|---|
| :Laurens | rdfs:label | "Laurens" |
| :Laurens | foaf:name | "Laurens" |
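Why UNION is a special case: each UNION branch can produce the same answer on its own, so the answer should count as answerable if the triples of at least one branch are in the sample; demanding all of them would be wrong. A sketch of that disjunctive check, using the two triples above (the variable names are ours):

```python
# Each set of triples is one UNION branch that alone yields the answer.
alternatives = [
    {(":Laurens", "rdfs:label", '"Laurens"')},
    {(":Laurens", "foaf:name", '"Laurens"')},
]

# Suppose only the foaf:name triple made it into the sample.
sample = {(":Laurens", "foaf:name", '"Laurens"')}

naive = all(alt <= sample for alt in alternatives)       # conjunctive: False
answerable = any(alt <= sample for alt in alternatives)  # disjunctive: True
print(answerable)  # True: one surviving branch is enough
```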
Evaluation Datasets
| Dataset | #triples | #queries | Coverage |
|---|---|---|---|
| DBpedia | 459M | 1640 | 0.003% |
| Linked Geo Data | 289M | 891 | 1.917% |
| MetaLex | 204M | 4933 | 0.016% |
| Open-BioMed | 79M | 931 | 0.011% |
| BIO2RDF (KEGG) | 50M | 1297 | 2.013% |
| Semantic Web Dog Food | 0.24M | 193 | 62.4% |
Observations
- The influence of a single triple
- DBpedia
- Good: Path + PageRank
- Bad: Path + Out Degree
- Queries: 2/3 require literals
- Other Observations
- # properties vs 'Context Literals' rewrite method
- # query triple patterns
Conclusion
- Scalable pipeline: network analysis algorithms + rewrite methods
- Able to evaluate over 15,000 datasets and 1.4 trillion triples
- Number of query sets too limited to learn significant correlations
- Topology of the graphs can be used to determine good samples
- Mimic semantic relevance through structural properties, without an a priori notion of relevance
Special thanks to Semantic Web Science Association (SWSA)