Many organisations like to count things. Counting the number of distinct elements (i.e. the cardinality) of a set is a challenge when the cardinality of the set is large. I see many use cases where an approximate result for a given query would be sufficient, rather than a completely accurate result. For example, you may have a query: “How many distinct individuals visited our website last week?”. Do you need an exact number? I would like to bring your attention to a new SQL function introduced in 12c – APPROX_COUNT_DISTINCT()
The APPROX_COUNT_DISTINCT function was added, but not documented, in Oracle 11g to improve the speed of calculating the number of distinct values (NDV) when gathering statistics using the DBMS_STATS package. Oracle Database 12c (12.1.0.2) now includes the function in the documentation so we are free to use it in our applications as a supported SQL function.
Syntax
Purpose
APPROX_COUNT_DISTINCT
returns the approximate number of rows that contain distinct values of expr
.
This function provides an alternative to the COUNT
(DISTINCT
expr
)
function, which returns the exact number of rows that contain distinct values of expr
. APPROX_COUNT_DISTINCT
processes large amounts of data significantly faster than COUNT
, with negligible deviation from the exact result.
For expr
, you can specify a column of any scalar data type other than BFILE
, BLOB
, CLOB
, LONG
, LONG
RAW
, or NCLOB
.
APPROX_COUNT_DISTINCT
ignores rows that contain a null value for expr
. This function returns a NUMBER
.
Basic Usage
In previous database versions, if we wanted to perform a count of distinct values, we would probably have done something like the following.
SELECT COUNT(DISTINCT object_name) AS obj_count FROM all_objects; OBJ_COUNT ---------- 47171 1 row selected.
This query gives us the exact count of the distinct values based on Oracle’s read-consistency model. As such, we see all the committed data, along with any uncommitted changes made by the current session.
In contrast, the new APPROX_COUNT_DISTINCT
function does not provide exact results, but should give “negligible deviation from the exact result” according to the documentation.
SELECT APPROX_COUNT_DISTINCT(object_name) AS obj_count FROM all_objects; OBJ_COUNT ---------- 46788 1 row selected.
Performance
DROP TABLE t1 PURGE; CREATE TABLE t1 AS SELECT level AS data FROM dual CONNECT BY level <= 10000; INSERT /*+ APPEND */ INTO t1 SELECT a.data FROM t1 a CROSS JOIN t1 b; COMMIT; EXEC DBMS_STATS.gather_table_stats(USER,'T1');
Now we have a table with a little over 100 million rows and 10,000 distinct values. We can see the difference in performance of the two methods is more significant.
SET TIMING ON SELECT COUNT(DISTINCT data) AS data_count FROM t1; DATA_COUNT ---------- 10000 1 row selected. Elapsed: 00:00:19.66 SQL> SELECT APPROX_COUNT_DISTINCT(data) AS data_count FROM t1; DATA_COUNT ---------- 10050 1 row selected. Elapsed: 00:00:10.46
Conceptually, for some reason this technique reminds me of a similar problem in the web development arena. Sometimes, close enough is just enough.
The “Blur Up” Technique for Loading Background Images