SampLD
Structural Properties as Proxy for Semantic Relevance
Laurens Rietveld
Quantity over Quality
- Datasets become too large to run on commodity hardware
- We use only a small portion
- Can't we extract the part we are interested in?
| Dataset | #triples | #queries | Coverage |
|---|---|---|---|
| DBpedia | 459M | 1640 | 0.003% |
| Linked Geo Data | 289M | 891 | 1.917% |
| MetaLex | 204M | 4933 | 0.016% |
| Open-BioMed | 79M | 931 | 0.011% |
| BIO2RDF (KEGG) | 50M | 1297 | 2.013% |
| Semantic Web Dog Food | 0.24M | 193 | 62.4% |
Relevance Based Sampling
- Find the smallest possible RDF subgraph that covers the maximum number of potential queries
- How can we determine which triples are relevant, and which are not?
- Can we implement a scalable sampling pipeline?
- Can we evaluate the results in a scalable fashion?
How to determine relevance of triples
Informed Sampling
- We know exactly which queries will be asked
- Extract those triples needed to answer the queries
- Problem: only a limited number of queries known
Uninformed Sampling
- We do not know which queries will be asked
- Use information contained in the graph to determine relevance
- Rank triples by relevance, and select the k best triples (0 < k < size of graph)
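The ranking step above can be sketched in a few lines of plain Python. This is a minimal illustration, not the SampLD implementation: the helper name `sample_top_k` is ours, and the weights are copied from the worked example later in these slides.

```python
def sample_top_k(triples, score, k):
    """Return the k triples with the highest relevance score (0 < k < |graph|)."""
    assert 0 < k < len(triples)
    return sorted(triples, key=score, reverse=True)[:k]

# Illustrative weights (same values as the worked example in these slides).
weights = {
    (":Laurens", ":bornIn", ":Amsterdam"): 0.6,
    (":Amsterdam", ":capitalOf", ":NL"): 0.1,
    (":Stefan", ":bornIn", ":Berlin"): 0.5,
    (":Berlin", ":capitalOf", ":Germany"): 0.5,
    (":Rinke", ":bornIn", ":Heerenveen"): 0.1,
}

sample = sample_top_k(list(weights), weights.get, k=3)
```

With `k=3` this keeps the three highest-weighted triples and drops the two weight-0.1 ones.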
Approach
- Use the topology of the graph to determine relevance (network analysis)
- Evaluate the relevance of our samples against the queries that we do know
- Is network structure a good predictor for query answerability?
Network Analysis
- Example uses: explain real-world phenomena
- Find central parts of the graph:
  - Betweenness Centrality
  - Google PageRank
- We apply:
  - In Degree
  - Out Degree
  - PageRank
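A minimal sketch of these three scores on the directed graph formed by subjects and objects. This is plain Python for illustration only; the actual pipeline computes them at scale on Hadoop, and the example nodes are from the worked example later in these slides.

```python
def degrees(edges):
    """In- and out-degree per node of a directed edge list."""
    indeg, outdeg = {}, {}
    for s, o in edges:
        outdeg[s] = outdeg.get(s, 0) + 1
        indeg[o] = indeg.get(o, 0) + 1
    return indeg, outdeg

def pagerank(edges, d=0.85, iters=50):
    """Power-iteration PageRank with damping factor d."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for s, o in edges:
        out[s].append(o)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) / len(nodes) for n in nodes}
        for s in nodes:
            if out[s]:
                share = d * rank[s] / len(out[s])
                for o in out[s]:
                    nxt[o] += share
            else:  # dangling node: spread its rank evenly
                for n in nodes:
                    nxt[n] += d * rank[s] / len(nodes)
        rank = nxt
    return rank

edges = [(":Laurens", ":Amsterdam"), (":Amsterdam", ":NL"),
         (":Stefan", ":Berlin"), (":Berlin", ":Germany")]
indeg, outdeg = degrees(edges)
pr = pagerank(edges)
```

Nodes that receive links (e.g. `:NL`) end up with a higher PageRank than pure sources (e.g. `:Laurens`); how node scores are lifted to per-triple weights is a separate design choice.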
Evaluation
- Sample sizes: 1% - 99%
- Baselines:
- Random Sample (10x)
- Resource Frequency
Naive evaluation does not scale
$t_e(t_d) = \sum\limits_{i=1}^{99} \frac{i}{100} \cdot \text{methods}_s \cdot \text{methods}_b \cdot t_d$
($t_d$: time to load the full dataset; $\text{methods}_s$, $\text{methods}_b$: number of sampling methods and baselines)
- Over 15,000 datasets, and over 1.4 trillion triples
- Requirements
- Fast loading of samples
- Powerful hardware
- Not Scalable: load all triples, execute queries, and calculate recall
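A back-of-the-envelope instance of the formula above shows why this blows up. The numbers here are purely illustrative assumptions (3 sampling methods, 2 baselines, a 1-hour load time), not figures from the evaluation:

```python
methods_s, methods_b = 3, 2   # assumed counts of sampling methods / baselines
t_d = 1.0                     # assumed time to load the full dataset, in hours

# Sum of i/100 for i = 1..99 is 49.5, so even one dataset costs
# 49.5 * 3 * 2 = 297 hours of loading time alone.
t_e = sum(i / 100 for i in range(1, 100)) * methods_s * methods_b * t_d
print(t_e, "hours")  # 297.0 hours, i.e. over 12 days per dataset
```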
Scalable Approach
- Retrieve which triples are used by a query
- Use a Hadoop cluster to find the weights of these triples
- Analyze whether these triples would have been included in the sample
- Scalable: each query is executed only once
Example
Query
| ?person | ?country |
|---|---|
| :Laurens | :NL |
| :Stefan | :Germany |
Dataset

| Subject | Predicate | Object | Weight |
|---|---|---|---|
| :Laurens | :bornIn | :Amsterdam | 0.6 |
| :Amsterdam | :capitalOf | :NL | 0.1 |
| :Stefan | :bornIn | :Berlin | 0.5 |
| :Berlin | :capitalOf | :Germany | 0.5 |
| :Rinke | :bornIn | :Heerenveen | 0.1 |
Query results → Triples

| Subject | Predicate | Object | ?person | ?city | ?country |
|---|---|---|---|---|---|
| :Laurens | :bornIn | :Amsterdam | :Laurens | :Amsterdam | :NL |
| :Amsterdam | :capitalOf | :NL | | | |
| :Stefan | :bornIn | :Berlin | :Stefan | :Berlin | :Germany |
| :Berlin | :capitalOf | :Germany | | | |
Triples → Recall
Which answers would we get with a sample of 60%?
Dataset (sorted by weight)

| Subject | Predicate | Object | Weight |
|---|---|---|---|
| :Laurens | :bornIn | :Amsterdam | 0.6 |
| :Stefan | :bornIn | :Berlin | 0.5 |
| :Berlin | :capitalOf | :Germany | 0.5 |
| :Amsterdam | :capitalOf | :NL | 0.1 |
| :Rinke | :bornIn | :Heerenveen | 0.1 |
Triples used in query result sets

| Subject | Predicate | Object |
|---|---|---|
| :Laurens | :bornIn | :Amsterdam |
| :Amsterdam | :capitalOf | :NL |
| :Stefan | :bornIn | :Berlin |
| :Berlin | :capitalOf | :Germany |
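The 60% question above, executed as a sketch: rank the dataset by weight, keep the top 60%, and count an answer as kept only if every triple that produced it survives. The data is taken from the example tables; the code itself is illustrative, not the Pig implementation.

```python
# Triple weights from the example dataset.
weights = {
    (":Laurens", ":bornIn", ":Amsterdam"): 0.6,
    (":Stefan", ":bornIn", ":Berlin"): 0.5,
    (":Berlin", ":capitalOf", ":Germany"): 0.5,
    (":Amsterdam", ":capitalOf", ":NL"): 0.1,
    (":Rinke", ":bornIn", ":Heerenveen"): 0.1,
}

# Each query answer maps to the set of triples it was derived from.
answers = {
    (":Laurens", ":NL"): {(":Laurens", ":bornIn", ":Amsterdam"),
                          (":Amsterdam", ":capitalOf", ":NL")},
    (":Stefan", ":Germany"): {(":Stefan", ":bornIn", ":Berlin"),
                              (":Berlin", ":capitalOf", ":Germany")},
}

k = int(0.6 * len(weights))  # 60% sample -> top 3 of 5 triples
sample = set(sorted(weights, key=weights.get, reverse=True)[:k])

kept = [a for a, used in answers.items() if used <= sample]
recall = len(kept) / len(answers)
print(recall)  # 0.5: Stefan's answer survives, Laurens's does not
```

Laurens's answer is lost because `:Amsterdam :capitalOf :NL` (weight 0.1) falls outside the 60% sample, even though the other triple it needs is included.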
Evaluation
- Better specificity than regular recall
- Scalable: PIG instead of SPARQL
- Special cases, e.g. GROUP BY, LIMIT, DISTINCT, OPTIONAL, UNION
Special case: UNIONS
| Subject | Predicate | Object |
|---|---|---|
| :Laurens | rdfs:label | "Laurens" |
| :Laurens | foaf:name | "Laurens" |
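Why UNION is a special case: each UNION branch can produce the same answer on its own, so the answer should count as answerable if the triples of at least one branch are in the sample; demanding all of them would be wrong. A sketch of that disjunctive check, using the two triples above (the variable names are ours):

```python
# Each set of triples is one UNION branch that alone yields the answer.
alternatives = [
    {(":Laurens", "rdfs:label", '"Laurens"')},
    {(":Laurens", "foaf:name", '"Laurens"')},
]

# Suppose only the foaf:name triple made it into the sample.
sample = {(":Laurens", "foaf:name", '"Laurens"')}

naive = all(alt <= sample for alt in alternatives)       # conjunctive: False
answerable = any(alt <= sample for alt in alternatives)  # disjunctive: True
print(answerable)  # True: one surviving branch is enough
```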
Evaluation Datasets
| Dataset | #triples | #queries | Coverage |
|---|---|---|---|
| DBpedia | 459M | 1640 | 0.003% |
| Linked Geo Data | 289M | 891 | 1.917% |
| MetaLex | 204M | 4933 | 0.016% |
| Open-BioMed | 79M | 931 | 0.011% |
| BIO2RDF (KEGG) | 50M | 1297 | 2.013% |
| Semantic Web Dog Food | 0.24M | 193 | 62.4% |
Observations
- The influence of a single triple
- DBpedia
- Good: Path + PageRank
- Bad: Path + Out Degree
- Queries: 2/3 require literals
- Other Observations
- # properties vs 'Context Literals' rewrite method
- # query triple patterns
Conclusion
- Scalable pipeline: network analysis algorithms + rewrite methods
- Able to evaluate over 15,000 datasets and 1.4 trillion triples
- Number of query sets too limited to learn significant correlations
- Topology of the graphs can be used to determine good samples
- Mimic semantic relevance through structural properties, without an a priori notion of relevance
Special thanks to Semantic Web Science Association (SWSA)