SQL window functions are similar to aggregate functions such as count(), sum(), or avg(), but they are used differently. Window functions compute a value for each row over a related set of rows, as in rank() or row_number(), while aggregate functions collapse a set of values into a single summary, as in count() or sum(). Window functions have been part of the ISO/ANSI SQL standard since SQL:2003, and you can use them in HiveQL, MySQL, Transact-SQL, and Spark SQL.
When processing data, there are many cases where you need to split the data into groups or partitions and compute a representative value for each group. One way to solve this is to build keys first and then compute a value for each key: you might use map() and reduce() in the MapReduce framework, or map() with reduceByKey() or aggregate() in the Spark RDD API.
If the argument function of reduce() is addition (_ + _), the query can be written as sum() in SQL. If it is an increment, it can be transformed to count() in SQL. But if the function passed to reduce() is something like 'take the top-10 largest values', it is not obvious which SQL statement or function corresponds to it. Some data engineers or scientists may conclude that such a query cannot be written in SQL at all, and decide to code it in MapReduce or Spark instead.
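The correspondence above can be sketched in plain Python (the sample values are made up for illustration): reduce() with addition behaves like SQL's sum(), reduce() with an increment behaves like count(), while "top-3 largest values" needs extra state in the fold, sketched here with heapq.

```python
from functools import reduce
from operator import add
import heapq

values = [5, 3, 8, 1, 9, 2, 8, 7]  # illustrative data

# reduce with addition corresponds to SQL's sum()
total = reduce(add, values)
# reduce with an increment corresponds to SQL's count()
count = reduce(lambda acc, _: acc + 1, values, 0)
# "top-3 largest values" does not map to one obvious standard aggregate;
# as a fold it needs to carry extra state, e.g. a bounded heap
top3 = heapq.nlargest(3, values)

print(total, count, top3)  # 43 8 [9, 8, 8]
```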
In this post, I will show how to solve the top-k frequent elements problem using rank(), one of the SQL window functions.
Top-k Frequent Elements
The top-k frequent elements problem is finding the most frequently appearing elements in a given sequence or array: not only the single most frequent element, but the k most frequent ones.
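Outside SQL, this problem is nearly a one-liner in many languages; in Python, for instance, collections.Counter solves it directly (the element list below is made up):

```python
from collections import Counter

elements = ["a", "b", "a", "c", "b", "a", "d", "b", "b"]
k = 2

# most_common(k) returns the k most frequent elements with their counts,
# ordered from most to least frequent
top_k = Counter(elements).most_common(k)
print(top_k)  # [('b', 4), ('a', 3)]
```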
Assume that you are given a web server log, accessLog, stored in a table like the following:
remoteAddr | time | request | status | bytes |
---|---|---|---|---|
172.19.0.1 | 2018-10-22T17:34:46Z | GET / HTTP/1.1 | 200 | 820 |
… | … | … | … | … |
If your task is to find the most frequently visited remoteAddr and its hits, that is, its count of log entries, how do you write a SQL query over this log?
Top-k most frequent remote addresses
If the task is to find the three most frequent remoteAddr's over the whole log, the following query may suffice:
```sql
SELECT remoteAddr, count(*) AS hits
FROM accessLog
GROUP BY remoteAddr
ORDER BY hits DESC
LIMIT 3
```
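As a quick sanity check, a grouped count along these lines can be run against SQLite's in-memory database; the rows below are made up, and only remoteAddr is populated since the other columns do not matter here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accessLog "
    "(remoteAddr TEXT, time TEXT, request TEXT, status INT, bytes INT)"
)
# made-up traffic: 3, 2, and 1 requests per address
rows = [("172.19.0.1",)] * 3 + [("192.168.0.20",)] * 2 + [("192.168.0.21",)]
conn.executemany("INSERT INTO accessLog (remoteAddr) VALUES (?)", rows)

top3 = conn.execute(
    """SELECT remoteAddr, count(*) AS hits
       FROM accessLog
       GROUP BY remoteAddr
       ORDER BY hits DESC
       LIMIT 3"""
).fetchall()
print(top3)  # [('172.19.0.1', 3), ('192.168.0.20', 2), ('192.168.0.21', 1)]
```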
Top-k most frequent IPs for each day
But if you need the top-3 remoteAddr's for each day, how do you query the log? This is where a window function comes in.
```sql
 1  SELECT date, remoteAddr, hits
 2  FROM (
 3    SELECT date, remoteAddr, hits,
 4      rank() OVER (PARTITION BY date ORDER BY hits DESC) AS rank
 5    FROM (
 6      SELECT date(time) AS date, remoteAddr, count(*) AS hits
 7      FROM accessLog
 8      GROUP BY date(time), remoteAddr
 9    )
10  )
11  WHERE rank <= 3
12  ORDER BY date, hits DESC
```
How does this work? Let's walk through it step by step, starting from the innermost SELECT statement.
In lines 6-8, the innermost SELECT counts hits for each date and remoteAddr, and generates an intermediate table like the following:
date | remoteAddr | hits |
---|---|---|
2018-10-22 | 172.19.0.1 | 34867 |
2018-10-22 | 192.168.0.20 | 2334 |
2018-10-22 | 192.168.0.21 | 13223 |
2018-10-22 | 192.168.0.23 | 9385 |
2018-10-23 | 172.19.0.1 | 28762 |
2018-10-23 | 192.168.0.20 | 13534 |
2018-10-23 | 192.168.0.21 | 2331 |
2018-10-23 | 192.168.0.22 | 22937 |
In lines 3-4, rank() generates a rank by hits within each date over the table above, and the rank value is added as a column named rank:
date | remoteAddr | hits | rank |
---|---|---|---|
2018-10-22 | 172.19.0.1 | 34867 | 1 |
2018-10-22 | 192.168.0.20 | 2334 | 4 |
2018-10-22 | 192.168.0.21 | 13223 | 2 |
2018-10-22 | 192.168.0.23 | 9385 | 3 |
2018-10-23 | 172.19.0.1 | 28762 | 1 |
2018-10-23 | 192.168.0.20 | 13534 | 3 |
2018-10-23 | 192.168.0.21 | 2331 | 4 |
2018-10-23 | 192.168.0.22 | 22937 | 2 |
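One detail worth noting: the sample data above happens to have no ties in hits. When ties occur, rank() assigns equal ranks and leaves gaps, so a tie at rank 1 pushes the next row to rank 3, while dense_rank() leaves no gaps. A minimal SQLite sketch of the difference (window functions need SQLite 3.25+; the data is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (addr TEXT, hits INT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [("a", 10), ("b", 10), ("c", 7)])  # a and b tie

rows = conn.execute(
    """SELECT addr,
              rank()       OVER (ORDER BY hits DESC) AS rnk,
              dense_rank() OVER (ORDER BY hits DESC) AS drnk
       FROM t"""
).fetchall()
# rank() gives a and b rank 1 and skips to 3 for c;
# dense_rank() gives c rank 2 with no gap
print(sorted(rows))  # [('a', 1, 1), ('b', 1, 1), ('c', 3, 2)]
```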
In line 11, the rows with rank greater than 3 are filtered out.
date | remoteAddr | hits | rank |
---|---|---|---|
2018-10-22 | 172.19.0.1 | 34867 | 1 |
2018-10-22 | 192.168.0.21 | 13223 | 2 |
2018-10-22 | 192.168.0.23 | 9385 | 3 |
2018-10-23 | 172.19.0.1 | 28762 | 1 |
2018-10-23 | 192.168.0.20 | 13534 | 3 |
2018-10-23 | 192.168.0.22 | 22937 | 2 |
In line 1, the rank column is dropped, since the outermost SELECT does not include it.
date | remoteAddr | hits |
---|---|---|
2018-10-22 | 172.19.0.1 | 34867 |
2018-10-22 | 192.168.0.21 | 13223 |
2018-10-22 | 192.168.0.23 | 9385 |
2018-10-23 | 172.19.0.1 | 28762 |
2018-10-23 | 192.168.0.20 | 13534 |
2018-10-23 | 192.168.0.22 | 22937 |
In line 12, the rows are ordered by date and by hits in descending order. The result is the three most frequent remoteAddr's for each day.
date | remoteAddr | hits |
---|---|---|
2018-10-22 | 172.19.0.1 | 34867 |
2018-10-22 | 192.168.0.21 | 13223 |
2018-10-22 | 192.168.0.23 | 9385 |
2018-10-23 | 172.19.0.1 | 28762 |
2018-10-23 | 192.168.0.22 | 22937 |
2018-10-23 | 192.168.0.20 | 13534 |
The window function rank() generates a rank within each day (date), and the WHERE clause in the outermost SELECT statement removes the needless rows.
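The whole pipeline can be reproduced with SQLite's in-memory database (window functions need SQLite 3.25+). Here I start from the intermediate per-day counts shown above rather than raw log rows, store them in a hypothetical dailyHits table, and alias the rank as rnk to sidestep keyword clashes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dailyHits (date TEXT, remoteAddr TEXT, hits INT)")
# the intermediate per-day counts from the walkthrough above
conn.executemany("INSERT INTO dailyHits VALUES (?, ?, ?)", [
    ("2018-10-22", "172.19.0.1",   34867),
    ("2018-10-22", "192.168.0.20",  2334),
    ("2018-10-22", "192.168.0.21", 13223),
    ("2018-10-22", "192.168.0.23",  9385),
    ("2018-10-23", "172.19.0.1",   28762),
    ("2018-10-23", "192.168.0.20", 13534),
    ("2018-10-23", "192.168.0.21",  2331),
    ("2018-10-23", "192.168.0.22", 22937),
])

result = conn.execute("""
    SELECT date, remoteAddr, hits
    FROM (
        SELECT date, remoteAddr, hits,
               rank() OVER (PARTITION BY date ORDER BY hits DESC) AS rnk
        FROM dailyHits
    )
    WHERE rnk <= 3
    ORDER BY date, hits DESC
""").fetchall()
for row in result:
    print(row)  # top-3 addresses per day, highest hits first
```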
References
- SQL window function: https://en.wikipedia.org/wiki/SQL_window_function
- Yin Huai and Michael Armbrust, “Introducing Window Functions in Spark SQL”, Databricks Engineering Blog, July 15, 2015, https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
- Databricks Documentation > SQL Guide > Select > Window functions