SampLD
Structural Properties as Proxy for Semantic Relevance
Laurens Rietveld
Quantity over Quality
- Datasets become too large to run on commodity hardware
- We use only a small portion
- Can't we extract the part we are interested in?
| Dataset | #triples | #queries | coverage |
|---|---|---|---|
| DBpedia | 459M | 1640 | 0.003% |
| Linked Geo Data | 289M | 891 | 1.917% |
| MetaLex | 204M | 4933 | 0.016% |
| Open-BioMed | 79M | 931 | 0.011% |
| BIO2RDF (KEGG) | 50M | 1297 | 2.013% |
| Semantic Web Dog Food | 0.24M | 193 | 62.4% |

(coverage: the fraction of a dataset's triples actually used to answer the known queries)
Relevance Based Sampling
- Find the smallest possible RDF subgraph that covers the maximum number of potential queries
- How can we determine which triples are relevant, and which are not?
- Can we implement a scalable sampling pipeline?
- Can we evaluate the results in a scalable fashion?
How to determine relevance of triples
Informed Sampling
- We know exactly which queries will be asked
- Extract those triples needed to answer the queries
- Problem: only a limited number of queries known
Uninformed Sampling
- We do not know which queries will be asked
- Use information contained in the graph to determine relevance
- Rank triples by relevance, and select the k best triples (0 < k < size of graph)
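A minimal sketch of this top-k selection, assuming triple weights are already computed (the weights and `k` are hypothetical inputs; how the weights are derived is the topic of the next slides):

```python
# Minimal sketch of uninformed top-k sampling: rank triples by a
# precomputed relevance weight and keep the k best. The weights are
# hypothetical inputs here; SampLD derives them from network analysis.
def sample_top_k(weighted_triples, k):
    """weighted_triples: iterable of ((s, p, o), weight) pairs."""
    ranked = sorted(weighted_triples, key=lambda pair: pair[1], reverse=True)
    return [triple for triple, _ in ranked[:k]]
```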
Approach
- Use the topology of the graph to determine relevance (network analysis)
- Evaluate the relevance of our samples against the queries that we do know
- Is network structure a good predictor for query answerability?
Network Analysis
- Example: explain real-world phenomena
- Find central parts of the graph
  - Betweenness Centrality
  - Google PageRank
- We apply:
  - In Degree
  - Out Degree
  - PageRank
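A sketch of how such scores can be computed, assuming the RDF graph is reduced to a plain directed subject→object graph and using networkx (both are illustrative assumptions, not necessarily SampLD's implementation):

```python
import networkx as nx

# Toy RDF graph, encoded as a directed subject -> object graph
# (predicates dropped; one possible encoding, assumed here).
triples = [
    (":Laurens", ":bornIn", ":Amsterdam"),
    (":Amsterdam", ":capitalOf", ":NL"),
    (":Stefan", ":bornIn", ":Berlin"),
]
g = nx.DiGraph()
g.add_edges_from((s, o) for s, _, o in triples)

pagerank = nx.pagerank(g)        # node -> PageRank score
in_degree = dict(g.in_degree())  # node -> number of incoming edges
out_degree = dict(g.out_degree())

# One hypothetical way to turn node scores into triple weights:
weights = {(s, p, o): max(pagerank[s], pagerank[o]) for s, p, o in triples}
```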
Evaluation
- Sample sizes: 1%–99%
- Baselines:
  - Random Sample (10x)
  - Resource Frequency
Naive evaluation does not scale
$t_e(t_d) = \sum\limits_{i=1}^{99} \frac{i}{100} \cdot \text{methods}_s \cdot \text{methods}_b \cdot t_d$
- $t_d$: time to load the full dataset; $\text{methods}_s$, $\text{methods}_b$: number of sampling methods and baselines (worked instance after this list)
- Over 15,000 datasets, and over 1.4 trillion triples
- Requirements
- Fast loading of samples
- Powerful hardware
- Not Scalable: load all triples, execute queries, and calculate recall
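A back-of-the-envelope instance of the formula with hypothetical numbers, to show why this is prohibitive:

```python
# Hypothetical numbers: loading the full dataset takes 1 hour, and we
# compare 3 sampling methods against 2 baselines. Every sample size
# (1%..99%) of every method/baseline combination must be loaded.
t_d = 1.0        # hours to load the full dataset (assumed)
methods_s = 3    # number of sampling methods (assumed)
methods_b = 2    # number of baselines (assumed)

t_e = sum(i / 100 * methods_s * methods_b * t_d for i in range(1, 100))
print(t_e)       # 297.0 hours, for a single dataset
```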
Scalable Approach
- Retrieve which triples are used by a query
- Use a Hadoop cluster to find the weights of these triples
- Analyze whether these triples would have been included in the sample
- Scalable: each query is executed only once
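The core check, as a minimal sketch (the triple weights, the sample cutoff, and the set of triples behind each answer are inputs produced by the earlier steps; names are hypothetical):

```python
# An answer survives a sample iff every triple it relies on ranks
# above the sample cutoff. This needs only one execution of the query
# on the full dataset, to record which triples each answer uses.
def answer_survives(used_triples, weights, cutoff_weight):
    return all(weights[t] >= cutoff_weight for t in used_triples)
```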
Example
Query
| ?person | ?country |
|---|---|
| :Laurens | :NL |
| :Stefan | :Germany |
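The slide only shows the results; a plausible reconstruction of the query itself (an assumption, consistent with the dataset and bindings below), embedded as a Python string:

```python
# Hypothetical reconstruction of the example query: find each
# person's country via the city they were born in. ?city is the
# join variable shown in the binding table later on.
query = """
SELECT ?person ?country WHERE {
  ?person :bornIn    ?city .
  ?city   :capitalOf ?country .
}
"""
```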
Dataset
| Subject | Predicate | Object | Weight |
|---|---|---|---|
| :Laurens | :bornIn | :Amsterdam | 0.6 |
| :Amsterdam | :capitalOf | :NL | 0.1 |
| :Stefan | :bornIn | :Berlin | 0.5 |
| :Berlin | :capitalOf | :Germany | 0.5 |
| :Rinke | :bornIn | :Heerenveen | 0.1 |
Query results → Triples
Bindings:
| ?person | ?city | ?country |
|---|---|---|
| :Laurens | :Amsterdam | :NL |
| :Stefan | :Berlin | :Germany |

Triples used:
| Subject | Predicate | Object |
|---|---|---|
| :Laurens | :bornIn | :Amsterdam |
| :Amsterdam | :capitalOf | :NL |
| :Stefan | :bornIn | :Berlin |
| :Berlin | :capitalOf | :Germany |
Triples → Recall
Which answers would we get with a sample of 60%?
Dataset (sorted by weight):
| Subject | Predicate | Object | Weight |
|---|---|---|---|
| :Laurens | :bornIn | :Amsterdam | 0.6 |
| :Stefan | :bornIn | :Berlin | 0.5 |
| :Berlin | :capitalOf | :Germany | 0.5 |
| :Amsterdam | :capitalOf | :NL | 0.1 |
| :Rinke | :bornIn | :Heerenveen | 0.1 |
Triples used in query result sets:
| Subject | Predicate | Object |
|---|---|---|
| :Laurens | :bornIn | :Amsterdam |
| :Amsterdam | :capitalOf | :NL |
| :Stefan | :bornIn | :Berlin |
| :Berlin | :capitalOf | :Germany |
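Working this out as a minimal sketch (data copied from the tables above): a 60% sample keeps the top 3 of 5 triples, i.e. those with weight ≥ 0.5.

```python
# Data copied from the tables above.
weights = {
    (":Laurens", ":bornIn", ":Amsterdam"):  0.6,
    (":Amsterdam", ":capitalOf", ":NL"):    0.1,
    (":Stefan", ":bornIn", ":Berlin"):      0.5,
    (":Berlin", ":capitalOf", ":Germany"):  0.5,
    (":Rinke", ":bornIn", ":Heerenveen"):   0.1,
}
# A 60% sample keeps the top 3 triples; the cutoff weight is 0.5.
cutoff = sorted(weights.values(), reverse=True)[2]

# Triples each answer relies on (from the result-set table above).
answers = {
    ":Laurens": [(":Laurens", ":bornIn", ":Amsterdam"),
                 (":Amsterdam", ":capitalOf", ":NL")],
    ":Stefan":  [(":Stefan", ":bornIn", ":Berlin"),
                 (":Berlin", ":capitalOf", ":Germany")],
}
kept = [answer for answer, triples in answers.items()
        if all(weights[t] >= cutoff for t in triples)]
print(kept)  # [':Stefan']
```

So at 60%, :Stefan's answer survives, but :Laurens's is lost because :Amsterdam :capitalOf :NL falls below the cutoff: answer recall is 1/2.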
Evaluation
- More specific than plain recall: measured per answer, not per query
- Scalable: Pig instead of SPARQL
- Special cases, e.g. GROUP BY, LIMIT, DISTINCT, OPTIONAL, UNION
Special case: UNION
| Subject | Predicate | Object |
|---|---|---|
| :Laurens | rdfs:label | "Laurens" |
| :Laurens | foaf:name | "Laurens" |
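Both triples above yield the same answer "Laurens" via different UNION branches, so the recall bookkeeping must treat them disjunctively: an answer survives if any one branch's triples are sampled. A minimal sketch of that check (names are hypothetical):

```python
# UNION branches are alternatives: the answer survives if ANY branch
# has all of its triples above the sample cutoff.
def union_answer_survives(branches, weights, cutoff_weight):
    """branches: list of triple-lists, one per UNION branch."""
    return any(all(weights[t] >= cutoff_weight for t in branch)
               for branch in branches)
```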
Evaluation Datasets
| Dataset | #triples | #queries | coverage |
|---|---|---|---|
| DBpedia | 459M | 1640 | 0.003% |
| Linked Geo Data | 289M | 891 | 1.917% |
| MetaLex | 204M | 4933 | 0.016% |
| Open-BioMed | 79M | 931 | 0.011% |
| BIO2RDF (KEGG) | 50M | 1297 | 2.013% |
| Semantic Web Dog Food | 0.24M | 193 | 62.4% |
Observations
- The influence of a single triple
- DBpedia
  - Good: Path + PageRank
  - Bad: Path + Out Degree
  - Queries: 2/3 require literals
- Other Observations
  - #properties vs. the 'Context Literals' rewrite method
  - #query triple patterns
Conclusion
- Scalable pipeline: network analysis algorithms + rewrite methods
- Able to evaluate over 15,000 datasets, and 1.4 trillion triples
- Number of query sets too limited to learn significant correlations
- Topology of the graphs can be used to determine good samples
- Mimic semantic relevance through structural properties, without an a priori notion of relevance
Special thanks to Semantic Web Science Association (SWSA)