Apply on single table, or ranked lists of tuples ordered by individual attributes
Ranked inputs in the same or different servers (centralized or distributed data)
Standalone query or operator in a more complex query plan
Incremental retrieval of objects with highest scores (k is not predefined)
Top-k joins
代码语言:javascript
复制
SELECT h.id, s.id
FROM House h School s
WHERE h.location=s.location
ORDER BY h.price + 10 ∗ s.tuition
LIMIT 5
Probabilistic/approximate top-k retrieval
Random and/or sorted accesses at ranked inputs
Top-k Query Evaluation
Most solutions assume distributive, monotone aggregate functions (e.g. f=sum)
distributive: f(x,y,z,w)= f(f(x,y),f(z,w))
e.g., A+B+C+D = (A+B) + (C+D)
monotone: if x<y and z<w, then f(x,z)<f(y,w)
Solutions based on 1-D ordering and merging sorted lists (rank aggregation)
Solutions based on multidimensional indexing
Rank Aggregation
Solutions based on 1-D ordering and merging sorted lists (rank aggregation)
Assume that there is a total ranking of theobjects for each attributethat can be used in top-kqueries
These sorted inputs canbe accessed sequentiallyand/or by random accesses
Advantages and Drawbacks
Advantages:
can be applied on any subset of inputs (arbitrary subspace)
appropriate for distributed data
appropriate for top-k joins
easy to understand and implement
Drawbacks:
slower than index-based methods
require inputs to be sorted
TA: Threshold Algorithm
Introduction
Iteratively retrieves objects and their atomic scores from the ranked inputs in a round-robin fashion.
For each encountered object x, perform random accesses to the inputs where x has not been seen.
Maintain top-k objects seen so far.
T = f(l_1, . . . , l_m) is the score derived when applying the aggregation function to the last atomic scores seen at each input. If the score of the k-th object is no smaller than T, terminate.
Example of TA(k=1,f=sum)
STEP 1
top-1 is c, with score 2.0
T=sum(0.9,0.9,0.9)=2.7
T>top-1, we proceed to another round of accesses
STEP 2
top-1 is b, with score 2.2
T=sum(0.8,0.8,0.9)=2.5
T>top-1, we proceed to another round of accesses
STEP 3
top-1 is b, with score 2.2
T=sum(0.6,0.6,0.8)=2.0
T≤top-1, terminate and output (b,2.2)
Properties of TA
Used as a standard module for merging ranked lists in many applications
Usually finds the result quickly
Depends on random accesses, which can be expensive
random accesses are impossible in some cases
e.g., an API allows to access objects incrementally by ranking score, but does not provide the score of a given object
NRA: No Random Accesses
Introduction
Iteratively retrieves objects and their atomic scores from the ranked inputs in a round-robin fashion.
For each object x seen so far at any input maintain:
f_x_ub: upper bound for x’s aggregate score (f_x)
f_x_lb: lower bound for x’s aggregate score (f_x)
W_k = k objects with the largest f^lb.
If the smallest f^lb in W_k is at least the largest f_x_ub of any object x not in W_k, then terminate and report W_k as top-k result.
Example of NRA(k=1,f=sum)
STEP 1
STEP 2
STEP 3
STEP 4
NRA Properties
More generic than TA, since it does not depend on random accesses
Can be cheaper than TA, if random accesses are very expensive
NRA accesses objects sequentially from all inputs and updates the upper bounds for all objects seen so far unconditionally.
Cost: O(n) per access (the expected distinct number of objects accessed so far is O(n))
No input list is pruned until the algorithm terminates
LARA: LAttice-based Rank Aggregation
LARA: An efficient NRA implementation
Based on 3 observations about the top-k candidates
Operates differently in the two (growing, shrinking) phases
Takes its name from the lattice used in the shrinking phase