APPROX_COUNT_DISTINCT– Quick Distinct Count (12c)

Many organisations like to count things. Counting the number of distinct elements (i.e. the cardinality) of a set is a challenge when the cardinality of the set is large. I see many use cases where an approximate result for a given query would be sufficient, rather than a completely accurate result. For example, you may have a query: “How many distinct individuals visited our website last week?”. Do you need an exact number? I would like to bring your attention to a new SQL function introduced in 12c – APPROX_COUNT_DISTINCT()

The APPROX_COUNT_DISTINCT function was added, but not documented, in Oracle 11g to improve the speed of calculating the number of distinct values (NDV) when gathering statistics using the DBMS_STATS package. Oracle Database 12c (12.1.0.2) now includes the function in the documentation so we are free to use it in our applications as a supported SQL function.

Syntax

Description of approx_count_distinct.gif follows

Purpose

APPROX_COUNT_DISTINCT returns the approximate number of rows that contain distinct values of expr.

This function provides an alternative to the COUNT (DISTINCT expr) function, which returns the exact number of rows that contain distinct values of expr. APPROX_COUNT_DISTINCT processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact result.

For expr, you can specify a column of any scalar data type other than BFILE, BLOB, CLOB, LONG, LONG RAW, or NCLOB.

APPROX_COUNT_DISTINCT ignores rows that contain a null value for expr. This function returns a NUMBER.

Basic Usage

In previous database versions, if we wanted to perform a count of distinct values, we would probably have done something like the following.

SELECT COUNT(DISTINCT object_name) AS obj_count
FROM   all_objects;

 OBJ_COUNT
----------
     47171

1 row selected.

This query gives us the exact count of the distinct values based on Oracle’s read-consistency model. As such, we see all the committed data, along with any uncommitted changes made by the current session.

In contrast, the new APPROX_COUNT_DISTINCT function does not provide exact results, but should give “negligible deviation from the exact result” according to the documentation.

SELECT APPROX_COUNT_DISTINCT(object_name) AS obj_count
FROM   all_objects;

 OBJ_COUNT
----------
     46788

1 row selected.

Performance

DROP TABLE t1 PURGE;

CREATE TABLE t1 AS
SELECT level AS  data
FROM   dual
CONNECT BY level <= 10000;

INSERT /*+ APPEND */ INTO t1
SELECT a.data FROM t1 a
CROSS JOIN t1 b;

COMMIT;

EXEC DBMS_STATS.gather_table_stats(USER,'T1');

Now we have a table with a little over 100 million rows and 10,000 distinct values. We can see the difference in performance of the two methods is more significant.

SET TIMING ON

SELECT COUNT(DISTINCT data) AS data_count
FROM   t1;

DATA_COUNT
----------
     10000

1 row selected.

Elapsed: 00:00:19.66
SQL>


SELECT APPROX_COUNT_DISTINCT(data) AS data_count
FROM   t1;

DATA_COUNT
----------
     10050

1 row selected.

Elapsed: 00:00:10.46

Conceptually, for some reason this technique reminds me of a similar problem in the web development arena. Sometimes, close enough is just enough.
The “Blur Up” Technique for Loading Background Images

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s