Approximate Query Processing in Oracle Database (12.2)

The growing market for Big Data has created a lot of interest around approximate query processing (AQP) as a means of achieving interactive response times. In my previous post I provided an explanation of one of the more well-known approximate functions – APPROX_COUNT_DISTINCT. Here was an example of how approximate query processing obtains approximate results with negligible deviation from the exact result while dramatically saving processing resources. In this post I will describe how to use some of the other new features related to approximate query processing.

About Approximate Query Processing

Approximate query processing uses SQL functions to provide real-time responses to
explorative queries where approximations are acceptable. A query containing SQL
functions that return approximate results is referred to as an approximate query. Business  intelligence (BI) applications extensively use aggregate functions, including
analytic functions, to provide answers to common business queries. For some types of
queries, when the data set is extremely large, providing exact answers can be resource
images (1)
For example, counting the number of unique customer sessions on a website
or establishing the median house price within each post code across a state. In certain
scenarios, these types of queries may not require exact answers because you are more
interested in approximate trends or patterns, which can then be used to drive further
analysis. Approximate query processing is primarily used in data discovery
applications to return quick answers to explorative queries. Users typically want to
locate interesting data points within large amounts of data and then drill down to
uncover further levels of detail. For explorative queries, quick responses are more
important than exact values.

Oracle provides a set of SQL functions that enable you to obtain approximate results with negligible deviation from the exact result. There are additional approximate functions that support materialized view based summary aggregation strategies. The functions that provide approximate results are as follows:

Approximate query processing can be used without any changes to your existing
code. When you set the appropriate initialization parameters, Oracle Database
replaces exact functions in queries with the corresponding SQL functions that return
approximate results.

Running Queries Containing Exact Functions Using SQL Functions that Return Approximate Values

The interesting part is that you can utilize the approximate functions without changing code. There are three initialization parameters introduced to control which functions should be treated as an approximate function during runtime:

The initialization parameters are:

  • approx_for_aggregation
  • approx_for_count_distinct
  • approx_for_percentile

To replace only the COUNT(DISTINCT) function in queries with the APPROX_COUNT_DISTINCT function, set the approx_for_count_distinct initialization parameter to TRUE for the current session or for the entire database.

To replace percentile functions with the corresponding functions that return approximate results, set approx_for_percentile to PERCENTILE_CONT, PERCENTILE_DISC, or ALL (replaces all percentile functions) for the current session or for the entire database.

To run all queries using the corresponding SQL functions that return approximate results instead of the specified SQL functions, set the approx_for_aggregation initialization parameter to TRUE for the current session or for the entire database.

When Should Approximate Query Processing (AQP) Be Used?

If you work in a billing department, stop reading right now 🙂

1. Use AQP when you care about interactive response times
Exploratory data analytics is typically an interactive and iterative process: you form an initial hypothesis (e.g., by visualizing and looking at the data), running some queries, modifying your queries based on the previous answers, and so on until you find a satisfactory explanation. You cannot be productive if you have to wait half an hour every time you run a new query to test a hypothesis.

2. Use AQP when you can make perfect decisions with imperfect answers
Exact results have no advantage over approximate ones if they both lead to the exact same conclusion/decision. This obviously depends on the quality of your approximation and the application logic.

3. Use AQP when your data is incomplete or noisy
Believe it not, most of the data that is collected and used in the real world is extremely noisy. So the idea that processing your entire data gives you 100% accurate answers is usually an illusion anyway. In other words, if my data is noisy to begin with, I might as well take advantage of approximation, which uses a controlled degree of noise in my computation in exchange for significant speed ups. This noise in your data can be because of many reasons – human error, missing values, white noise, data extraction errors, data conversion errors.

4. Use AQP when your goal is to predict something
In general, there are many cases where the output of database queries are consumed by predictive analytics or machine learning algorithms, and in those cases, returning smaller samples of the original data can be a great solution for gaining considerable performance benefits.

Additional Reading: Oracle Database New Features Guide 12c Release 2 (12.2)


APPROX_COUNT_DISTINCT– Quick Distinct Count (12c)

Many organisations like to count things. Counting the number of distinct elements (i.e. the cardinality) of a set is a challenge when the cardinality of the set is large. I see many use cases where an approximate result for a given query would be sufficient, rather than a completely accurate result. For example, you may have a query: “How many distinct individuals visited our website last week?”. Do you need an exact number? I would like to bring your attention to a new SQL function introduced in 12c – APPROX_COUNT_DISTINCT()

The APPROX_COUNT_DISTINCT function was added, but not documented, in Oracle 11g to improve the speed of calculating the number of distinct values (NDV) when gathering statistics using the DBMS_STATS package. Oracle Database 12c ( now includes the function in the documentation so we are free to use it in our applications as a supported SQL function.


Description of approx_count_distinct.gif follows


APPROX_COUNT_DISTINCT returns the approximate number of rows that contain distinct values of expr.

This function provides an alternative to the COUNT (DISTINCT expr) function, which returns the exact number of rows that contain distinct values of expr. APPROX_COUNT_DISTINCT processes large amounts of data significantly faster than COUNT, with negligible deviation from the exact result.

For expr, you can specify a column of any scalar data type other than BFILE, BLOB, CLOB, LONG, LONG RAW, or NCLOB.

APPROX_COUNT_DISTINCT ignores rows that contain a null value for expr. This function returns a NUMBER.

Basic Usage

In previous database versions, if we wanted to perform a count of distinct values, we would probably have done something like the following.

SELECT COUNT(DISTINCT object_name) AS obj_count
FROM   all_objects;


1 row selected.

This query gives us the exact count of the distinct values based on Oracle’s read-consistency model. As such, we see all the committed data, along with any uncommitted changes made by the current session.

In contrast, the new APPROX_COUNT_DISTINCT function does not provide exact results, but should give “negligible deviation from the exact result” according to the documentation.

SELECT APPROX_COUNT_DISTINCT(object_name) AS obj_count
FROM   all_objects;


1 row selected.



SELECT level AS  data
FROM   dual
CONNECT BY level <= 10000;



EXEC DBMS_STATS.gather_table_stats(USER,'T1');

Now we have a table with a little over 100 million rows and 10,000 distinct values. We can see the difference in performance of the two methods is more significant.


FROM   t1;


1 row selected.

Elapsed: 00:00:19.66

FROM   t1;


1 row selected.

Elapsed: 00:00:10.46

Conceptually, for some reason this technique reminds me of a similar problem in the web development arena. Sometimes, close enough is just enough.
The “Blur Up” Technique for Loading Background Images


12 Days of Denormalization

On the first day of denormalization, my design gave to me
a really fast-running query.

On the second day of denormalization, my design gave to me
2 less tables.

On the third day of denormalization, my design gave to me
3 more indexes.

On the fourth day of denormalization, my design gave to me
4 larger disks.

On the fifth day of denormalization, my design gave to me
5 brand new reqs.

On the sixth day of denormalization, my design gave to me
6 times the locking.

On the seventh day of denormalization, my design gave to me
7 longer updates.

On the eighth day of denormalization, my design gave to me
8 more requirements.

On the ninth day of denormalization, my design gave to me
9 invalid rows.

On the tenth day of denormalization, my design gave to me
10 delays deleting.

On the eleventh day of denormalization, my design gave to me
11 questionable queries.

On the twelfth day of denormalization, my design gave to me
12 lessons learned.

Introduction to Bitmap Indexes

What is a bitmap index?

A bitmap index is a special type of index designed for efficient query processing on multiple keys. It is a binary valued two-dimensional array created with an indexed column for every record in the table. Bitmap indexes use bit arrays (commonly called bitmaps) and answer queries by performing bitwise logical operations on these bitmaps. As the number of distinct values increases, the size of the bitmap increases exponentially.  The bit in a row of bitmap is “1” if the record has the value v for the indexed attribute, or “0” otherwise.



To create a bitmap index in Oracle:

 (x, y);

Advantage of Bitmap Indexes

One belief concerning bitmap indexes is that they are only suitable for indexing low-cardinality data. This is not necessarily true, and bitmap indexes can be used successfully for indexing columns with many thousands of different values.

You use bitmaps predominantly when you want to merge together many of them at run time and there is no sensible way to B-Tree index them. For example: in an ad-hoc query environment where the users might pick any N columns out of M and query on them – no way to B-Tree index them – but you could create a series of single column bitmaps that we’ll merge together using AND/OR’s to create a new bitmap index on the fly to find your rows. Just like B-Tree indexes, bitmap indexes are best leveraged when the combination of them makes it very selective (returns only a small number of rows).

It is for this reason that bitmap indexes are largely used in data warehousing environments. OLAP environments typically have large amounts of data and ad-hoc queries but a low level of concurrent DML transactions. For such applications, bitmap indexing provides:

  • Reduced response time for large classes of ad-hoc queries
  • Reduced storage requirements compared to other indexing techniques
  • Dramatic performance gains even on hardware with a relatively small number of CPUs or a small amount of memory
  • Efficient maintenance during parallel DML and loads

Fully indexing a large table with a traditional B-Tree index can be prohibitively expensive in terms of space because the indexes can be several times larger than the data in the table. Bitmap indexes are typically only a fraction of the size of the indexed data in the table. In ad-hoc queries and similar situations, bitmap indexes can dramatically improve query performance. AND/OR conditions in the WHERE clause of a query can be resolved quickly by performing the corresponding Boolean operations directly on the bitmaps before converting the resulting bitmap to rowids. If the resulting number of rows is small, the query can be answered quickly without resorting to a full table scan.

Bitmap indexes are also useful in data warehousing applications for joining a large fact table to smaller dimension tables such as those arranged in a star schema.

Disadvantage of Bitmap Indexes

If your tables are not read-only during query time, do NOT use bitmap indexes!

Coming a close second to the “rebuild indexes regularly” myth is the “a column with a small number of distinct values is a good candidate for a bitmap index” legend. The reason for confining bitmap indexes to data warehouses is that the overhead on maintaining them is enormous. A modification to a bitmap index is significantly more expensive on behalf of the system than a modification to a B-Tree index. In addition, the concurrency for modifications on bitmap indexes is dreadful. Therefore they are not suitable for write-intensive environments.

The classic example of using a bitmap index on a gender column (male/female) is a horrible one in my opinion. If there are only two values, and there is an even distribution of data, 50% selectivity is too large and thus not a good candidate for a bitmap index. Would you use any index to access 50% of a table?

If a session modifies the indexed data, then all of the rows that index entry points are effectively locked in most cases. Oracle cannot lock an individual bit in a bitmap index entry; it locks the entire bitmap index entry. Any other modifications that need to update the same bitmap index entry will be locked out.