SQL Window Functions: Top-k Elements Example

SQL Window functions are similar to aggregate functions, such as count(), sum() or average(), but has different usage. Window functions are related to ordering like rank(), row_number(), while aggregate functions are related to summary of set of values like count(), sum(). SQL Window functions specification is ISO/ANSI standard since 2003. You can use SQL window functions on HiveQL, MySQL, Transact-SQL as well as Spark SQL.

In processing data, there are several cases to split data into groups or partitions and get representative values for each group. Someone may solve this problem by classifying data by making keys first and compute values for each key. You may use map() and reduce() in MapReduce framework, or use map() and reduceByKey() or aggregate() in Spark RDD API.

If the argument function of a reduce() is the add (_ + _), the query can be written as sum() in SQL. If it is the increment(++), it can be transformed to count() in SQL. But, if the function of the reduce() is a something like ‘top-10 largest values’, it would be vague to find correspondent SQL statements or functions. And some data engineers or scientists may think this query cannot be written in SQL statements, and decide to code in MapReduce or Spark.

In this post, I will show an example to solve top-k frequent elements problem using the rank() function which is one of SQL window functions.

Top-k Frequent Elements

Top-k frequent elements problem is finding the most frequently appearing elements from the given sequence or array of elements, not only the most frequent one, but also k-th frequent element.

Assume that you are given web server log, accessLog in a table format like as follows:

remoteAddr time request status bytes
172.19.0.1 2018-10-22T17:34:46Z GET / HTTP/1.1 200 820

If your work is find the most frequently visited remoteAddr and its hits, that is count of logs, how do you make a SQL query for the log?

Top-k most frequent remote addresses

If the work requires the most 3-frequent remoteAddr’s for the whole log, the following query may suffice:

1
2
3
4
5
SELECT remoteAddr, count(*) AS hits
FROM accessLog
GROUP BY remoteAddr
ORDER BY hits DESC
LIMIT 3

Top-k most frequent IPs for each days

But, if you need top-3 remoteAddr’s for each day, How do you query the log database? Here is the place where a window function make it.

1
2
3
4
5
6
7
8
9
10
11
12
SELECT date, remoteAddr, hits
FROM (
SELECT date, remoteAddr, hits,
rank() OVER (PARTITION BY date ORDER BY hits DESC) AS rank
FROM (
SELECT cast(time AS date) AS date, remoteAddr, count(*) AS hits
FROM accessLog
GROUP BY date, remoteAddr
)
)
WHERE rank <= 3
ORDER BY date, hits DESC

How this works? Let’s see step by step from the innermost SELECT statement.

In line 6 ~ 8, the innermost SELECT counts hits for each date and remoteAddr, and generates an intermediate table like as following:

date remoteAddr hits
2018-10-22 172.19.0.1 34867
2018-10-22 192.168.0.20 2334
2018-10-22 192.168.0.21 13223
2018-10-22 192.168.0.23 9385
2018-10-23 172.19.0.1 28762
2018-10-23 192.168.0.20 13534
2018-10-23 192.168.0.21 2331
2018-10-23 192.168.0.22 22937

In line 3 ~ 4, rank() generates ranks by ‘hits’ for each ‘date’ from the above table, and the rank value is added as an column named ‘rank’.

date remoteAddr hits rank
2018-10-22 172.19.0.1 34867 1
2018-10-22 192.168.0.20 2334 4
2018-10-22 192.168.0.21 13223 2
2018-10-22 192.168.0.23 9385 3
2018-10-23 172.19.0.1 28762 1
2018-10-23 192.168.0.20 13534 3
2018-10-23 192.168.0.21 2331 4
2018-10-23 192.168.0.22 22937 2

In line 11, the rows with rank over 3 are omitted.

date remoteAddr hits rank
2018-10-22 172.19.0.1 34867 1
2018-10-22 192.168.0.21 13223 2
2018-10-22 192.168.0.23 9385 3
2018-10-23 172.19.0.1 28762 1
2018-10-23 192.168.0.20 13534 3
2018-10-23 192.168.0.22 22937 2

In line 1, rank columns are removed.

date remoteAddr hits
2018-10-22 172.19.0.1 34867
2018-10-22 192.168.0.21 13223
2018-10-22 192.168.0.23 9385
2018-10-23 172.19.0.1 28762
2018-10-23 192.168.0.20 13534
2018-10-23 192.168.0.22 22937

In line 12, the rows are ordered by date and hits. Then you can get only most 3 frequent remoteAddr’s for each day.

date remoteAddr hits
2018-10-22 172.19.0.1 34867
2018-10-22 192.168.0.21 13223
2018-10-22 192.168.0.23 9385
2018-10-23 172.19.0.1 28762
2018-10-23 192.168.0.22 22937
2018-10-23 192.168.0.20 13534

The window function rank() generates rank of each days (date). And WHERE clause in outermost SELECT statement removes needless output.

References