Clifford's Portfolio-SQL distinct on

The DISTINCT keyword used in a select statment is used to remove duplicates from a query result. It keeps one row from each group of duplicates.

This query shows its usage:

SELECT 
    DISTINCT column_1
FROM
    table1
;

It can be used on multiple columns. In that instance all the columns the distinct values are based on the combination of all the specified columns.

Like so:

SELECT 
    DISTINCT column_1
    , column_2
FROM
    table1
;

Sometimes however, you’d want to only keep the distinct values for just a single column. This is where the Postgres DISTINCT ON comes in.

SELECT 
    DISTINCT ON (column_1) alias
    , column_2
FROM
    table1
ORDER BY
    column_1
    , column_2
;

In this instance, the duplicates are only evaluated on the column specified by the DISTINCT ON. It is a good idea to add an order by clause to prevent unexpected results.

Note that, the column on which the distinct on is performed should be the first specified in the select statement and also in the order by clause.

Redshift however doesn’t support the DISTINCT ON feature although it is built on postgres. To bypass however you can use a combination of ROW_NUMBER and PARTITION BY.

This example assumes column_1 is the column on which the distinct on should be evaluated:

SELECT  
    *
FROM
    (SELECT column_1
            , column_2
            , column_3
            ROW_NUMBER() OVER (PARTITION BY column_1
                             ORDER BY column_2, column_3) AS column_1_ranked
    FROM table1
    ORDER BY column_1
             , column_2
             , column_3
    LIMIT 10) AS ranked
WHERE ranked.column_1_ranked = 1;

SQL distinct on

Sources: