Window Functions

1. Explain Window Functions and provide examples.

Window functions provide a way to perform calculations across rows within a result set, without creating separate subqueries.
Examples: AVG, SUM, MAX, MIN, RANK, DENSE_RANK, LEAD, LAG,

2. OVER

The OVER clause defines the window (set of rows) over which the window function is applied.

3. PARTITION BY

The PARTITION BY clause is used to divide a result set into partitions based on specified columns.

4. ORDER BY

The ORDER BY clause within the OVER clause is used to specify the order of rows within each partition.

5. OVER, PARTITION & ORDER BY (Syntax and Example).

Syntax

<WindowFunction>() OVER (
    PARTITION BY <expression>[{,<expression_2>...}]
    ORDER BY <expression> [ASC|DESC], [{,<expression_2>...}]
)

Example

SELECT 
    department_id, 
    employee_id, 
    salary, 
    SUM(salary) OVER (PARTITION BY department_id ORDER BY employee_id) AS cumulative_salary
FROM employees;

PARTITION BY department_id: Divides the result set into partitions based on department_id. The cumulative salary is calculated separately for each department.
ORDER BY employee_id: Orders the rows within each department by employee_id before calculating the cumulative sum.

6. ROW_NUMBER

The ROW_NUMBER assigns a unique row number to each row within a partition.

7. RANK

The RANK() function assigns a unique rank to each distinct row within a partition of a result set. It produces gaps in ranking values if there are ties.

Syntax:

RANK() OVER (PARTITION BY column_name ORDER BY column_name [ASC|DESC])

8. DENSE_RANK

The DENSE_RANK() function also assigns ranks to rows within a partition of a result set but does not produce gaps in rank values when there are ties.

Syntax:

DENSE_RANK() OVER (PARTITION BY column_name ORDER BY column_name [ASC|DESC])

9. LEAD

The LEAD() function allows you to access data from a row that follows the current row in the result set. It provides a way to look ahead within a partition or the entire dataset.

Syntax:

LEAD(column_name, [offset], [default_value]) OVER (PARTITION BY column_name ORDER BY column_name [ASC|DESC])

column_name: The column whose value you want to retrieve from the next row.
offset (optional): Number of rows to look ahead (default is 1 if omitted).
default_value (optional): Value to return if the lead row does not exist (e.g., if the current row is near the end of the partition).

Example

SELECT 
    sale_date, 
    sales_amount, 
    LEAD(sales_amount, 1, 0) OVER (ORDER BY sale_date) AS next_day_sales
FROM sales;

Explanation: This query retrieves the sales_amount for the next day. If the next day’s sales data is not available, it returns 0.

10. LAG

The LAG() function allows you to access data from a row that precedes the current row in the result set. It provides a way to look back within a partition or the entire dataset.

Syntax:

LAG(column_name, [offset], [default_value]) OVER (PARTITION BY column_name ORDER BY column_name [ASC|DESC])

column_name: The column whose value you want to retrieve from the previous row.
offset (optional): Number of rows to look back (default is 1 if omitted).
default_value (optional): Value to return if the lag row does not exist (e.g., if the current row is near the start of the partition).

Example

SELECT 
    sale_date, 
    sales_amount, 
    LAG(sales_amount, 1, 0) OVER (ORDER BY sale_date) AS previous_day_sales
FROM sales;

Explanation: This query retrieves the sales_amount from the previous day. If the previous day’s sales data is not available, it returns 0.

11. ROWS

The ROWS clause is used to define the window frame that specifies the range of rows that should be included in the calculation.

12. ROWS PRECEDING / FOLLOWING / UNBOUNDED / CURRENT ROW

Image Source: https://www.andreaperlato.com/theorypost/bias-variance-trade-off/

PRECEDING: Includes rows before the current row.
FOLLOWING: Includes rows after the current row.
UNBOUNDED: Specifies an unbounded number of rows.
CURRENT ROW: Represents the current row.

13. Explain the difference between RANK and DENSE_RANK.

Gaps in Ranking:
- RANK(): Produces gaps in ranking values if there are ties. For example, if two rows are tied for rank 1, the next rank will be 3.
- DENSE_RANK(): Does not produce gaps in ranking values. The next rank after a tie is incremented by 1.
Use-Cases:
- RANK(): Useful when the presence of gaps in ranking is acceptable or desired.
- DENSE_RANK(): Useful when you need a continuous sequence of ranks without gaps.

14. List some of the common aggregate functions to apply window function.

SUM(), AVG(), MIN(), MAX(), COUNT(): These aggregate functions can also be used as window functions.

15. Explain the difference between WINDOW Functions and GROUP BY.

GROUP BY is used for grouping rows and aggregating data, resulting in fewer rows in the output.
Window functions are used to calculate additional information across a set of rows while retaining all original rows.

Last updated on 22 Aug 2024