There are two approaches in relational database design.
* TOP DOWN: From Data Modeling (eg. ER Model) to Relational Logical Model for implementation.
* BOTTOM UP: Normalization of Relations
Normalisation is a formal process for deciding which attributes should be grouped together in a relation so that all anomalies are removed. Hence the aim is to successively reduce relations to produce smaller, well structured relations.
The simplest kind of dependency is called functional dependency (FD). The dependencies are best explained through examples.
For example, LecturerID -> LecturerName
is a valid FD because:
For each LecturerID there is at most one LecturerName, or
LecturerName is determined by LecturerID , or
LecturerName is uniquely determined by LecturerID , or
LecturerName depends on LecturerID .
Each of the above statements is equivalent.
The FD X -> Y is a full dependency if no attribute can be removed from X.
LabDate, SubjectCode -> Tutor is a full dependency, that is, Tutor is fully dependent on both LabDate AND SubjectCode.
The FD X -> Y is a partial dependency if an attribute can be removed from X.
LecturerID, SubjectCode -> LecturerName is a partial dependency, that is, LecturerName is partially dependent on LecturerID AND SubjectCode.
(to determine LecturerName, I only need to know LecturerID).
Dependencies can be transitive.
For example, if one lecturer can teach one subject and each subject only has one tutor, then we might have the dependencies:
LecturerID -> SubjectCode
SubjectCode -> Tutor
and, transitively LecturerID -> Tutor.
Functional dependencies can be used to decide whether a schema is well designed.
For example, in the following relation:
LecturerSubject (LecturerID, LecturerName, SubjectCode, SubjectName)
If there is a new subject which has not been allocated a lecturer, can you record the details of this subject in the above table? (Insert Anomaly)
If an existing subject changes the name, can you do the changes to one instance only? (Update Anomaly)
If a lecturer resigns and the details are to be deleted, would there be a chance that some subjects will be removed permanently and we won’t have any track record of those subjects anymore? (Delete Anomaly)
Design errors in relations, such as the potential for certain kinds of anomalies, can be categorised. These categories of error can be successively eliminated by decomposing relations into normal forms.
The major/main normal forms are first (1NF), second (2NF), third (3NF), and Boyce Codd (BCNF). Higher/advanced normal forms including fourth (4NF), and fifth (5NF). Because problems with 4NF and 5NF rarely occur, moreover database designers in industry normally do not need to use the highest possible NF for practical reasons. We will focus on satisfying 3NF level.
First Normal Form (1NF):
A relation is in 1NF if:
- There are no repeating groups
- A primary key has been defined, which uniquely identifies each row in the relation.
- All attributes are functionally dependent on all or part of the key.
- Attributes should be stored as atomic values -> Each field entry can only contain one piece of data. E.g. A name field containing “Fred Smith” has surname and first name, violating 1NF.
Second Normal Form (2NF):
A relation is in 2NF if:
- The relation is in 1 NF
- All non-key attributes are fully functionally dependent on the entire key (partial dependency has been removed).
Third Normal Form (3NF):
A relation is in 3NF if:
- The relation is in 2NF
- All transitive dependencies have been removed. (Transitive dependency: non-key attribute dependent on another non-key attribute.)
Normalize the ORDER form below:
From the ORDER FORM (user view) we can derive ORDER relation:
Currently in UNF (Un-normalized Form)
(Order #, Customer #, Customer Name, Customer Address, City, State, PostCode, Order Date, (Product #, Description, Quantity, Unit Price))
Note that the order form is not in 1NF because there is a repeating group
To convert the above relation into 1NF, the repeating group must be removed by creating a new relation based on the repeating group along with the primary key of the main relation.
(Order#, Customer#, Customer Name, Address, City, State, PostCode, OrderDate)
(Order#, Product#, Description, Quantity, Unit Price) —-> Note that Order# is also a foreign key as well as a PK
Insertion Anomalies: cannot insert a new product until there is an order for that product.
Deletion Anomalies: if an order is deleted the whole detail of the product will also be deleted.
Update Anomalies: if the detail of a particular product needs to be updated, each order that contains that product has to be updated.
2NF (Partial dependencies):
The ORDER_PRODUCT relation is not in 2NF because not all non-key attributes are fully dependent on the entire key (e.g. the PK is the combination of order# and product#. But description and unit price depend on product#, not order#)
To convert the ORDER_PRODUCT relation into 2NF, a new relation must be created which consists of part of the keys (becomes the primary key of the new relation) and all non key attributes that are dependent on the partial key.
(Order#, Product#, Quantity)
(Product#, Description, Unit_Price)
The ORDER relation is already in 2NF as there are no non key attributes that are dependent on partial key (ORDER only has a single key).
Insert Anomalies: a new customer cannot be inserted until he/she has an order.
Delete Anomalies: if an order is deleted, the whole information of the customer is also deleted.
Update Anomalies: if a customer detail is to be updated, all orders for that customer need to be updated.
3NF (Transitive Dependencies):
The ORDER relation is not in 3NF because there is a transitive dependency (non-key attribute dependent on another non-key attribute). e.g. customerName, city, Address, etc. all depend on customer#, which is currently a non-key attribute.
To convert the relation into 3NF, a new relation must be created for the non-key attributes that are dependent to another non-key attribute.
(Customer#, Customer Name, Customer Address, City, State, PostCode)
(Order#, Customer#, Order Date) —-> Remember to always maintain FK links
Both the order ORDER_PRODUCT and the PRODUCT relations are already in 3NF.
ORDER_PRODUCT(Order#, Product#, Quantity)
PRODUCT (Product#, Description, Unit Price)
Example Solution: Final Relations in 3NF and BCNF
(Order#, Customer#, OrderDate)
(Customer#, CustomerName, CustomerAddress, City, State, PostCode)
(Order#, Product#, Quantity)
(Product#, Description, UnitPrice)
Now an example for everyone at home to try:
Here is a suggested solution:
1NF: Identify the PK.
PATIENTHISTORY(PatientNo, name, address, suburb, date, time, drNo, drName, visitCode, description)
2NF: Remove partial dependencies. Notice that currently the PK is the combination of PatientNo, date, time. name, address, and suburb only depend on PART OF THE KEY (patientNo). patientNo, date, time -> drNo so this is not a partial dependency. Same as visitcode. We will deal with drName and description shortly.
PATIENT (PatientNo, name, address, suburb)
PATIENT_HISTORY(PatientNo, date, time, drNo, drName, visitCode, description) —> Note that PatientNo is now a foreign key as well as part of the key.
3NF: Remove transitive dependency. Notice how drName is dependency on the non key drNo. Same principle for description. description is determined by visitCode.
DOCTOR (drNo, drName)
ILLNESS (visitCode, description)
PATIENT (PatientNo, name, address, suburb)
PATIENT_HISTORY (PatientNo, date, time, DrNo, visitCode) —> DrNo and visitCode are foreign keys (pointing to the doctor and illness tables, respectively).
We were hiring developers for a large, complicated application. I happened to mention to one of the more promising candidates that the application was fairly write-heavy and we might experience some performance concerns later. This developer, with years of experience and skills across multiple technologies that might have been of use to us, replied “you’re going to either need to shard your database or switch to NoSQL.”
That was enough to guarantee they didn’t get the job.
Before relational databases, all databases were NoSQL. That’s why we have relational databases.
In the beginning was the disk array, and all was empty and raw, and Unix moved over the face of the platters. And the DBA said: Let there be Oracle. And there was Oracle. And the environmental variables were set and the disks were striped and mirrored and the OFA was established, and behold spindle was rent asunder from spindle. And the DBA saw that all was in spec.
And it was day and it was evening of the first day.
And the DBA said: Let there be scripts. And sql.bsq brought forth myriad crawling things upon the face of the array. And catalog.sql brought forth all manner of tables and views that swim unseen beneath the waters. And catproc.sql brought forth all the built-in programs and all the hosts of the air, that the users might be given wings and take fight over the data.
And it was day and it was evening of the second day.
And the DBA said: Let there be tablespaces. And there were tablespaces. And the network administrator looked upon the disk array and did see what the tablespaces had wrought upon the disk arrays, and he did gnash his teeth and seek a new work upon the Internet with an engine of search.
And it was day and it was evening of the third day.
And the DBA created users. Male and female he created them. And he said unto the users: Thou mayest create tables and views as thou wilt. Yea, though mayest create even indexes upon the data. Only meddle not with the system tablespace, for it is a holy place, and on the day wherein thou treadest upon it, on that day thy create session shall surely be revoked. And the serpent crept among the users and whispered to them, saying: Thine roles shall not be revoked. Taste ye all of the system tablespace, for ye shall know of b-trees and hints and ye shall be as DBAs. And the users heeded the serpent and filled the system tablespace with crap. And the instance did crash and the client did wax wroth at the DBA. And the DBA did gnash his teeth and partake of the fruit of the vine, for behold the users were permanent employees and the DBA was but a contractor and could not revoke their create session.
And it was day and it was evening of the fourth day.
And the DBA did set default tablespaces and temporary tablespaces and did lock down all that was upon the face of the array with roles and profiles and all manner of quotas, yea even from the rollback segments even unto the archived redo logs.
And it was day and it was evening of the fifth day.
And the DBA created synonyms and links and did tune the server and apply patches upon the face of the database.
And it was day and it was evening of the sixth day.
And on the seventh day the DBA did rest from all the labors of the creation. And his pager did ring and he ceased from resting and did spend his sabbath on the telephone with Oracle support. And by the time the DBA got through to someone who knew whereof they spake behold it was day and it was evening of the eighth day.
And the DBA waxed wroth.
About Approximate Query Processing
Oracle provides a set of SQL functions that enable you to obtain approximate results with negligible deviation from the exact result. There are additional approximate functions that support materialized view based summary aggregation strategies. The functions that provide approximate results are as follows:
Running Queries Containing Exact Functions Using SQL Functions that Return Approximate Values
The interesting part is that you can utilize the approximate functions without changing code. There are three initialization parameters introduced to control which functions should be treated as an approximate function during runtime:
The initialization parameters are:
To replace only the
COUNT(DISTINCT) function in queries with the
APPROX_COUNT_DISTINCT function, set the
approx_for_count_distinct initialization parameter to
TRUE for the current session or for the entire database.
To replace percentile functions with the corresponding functions that return approximate results, set
ALL (replaces all percentile functions) for the current session or for the entire database.
To run all queries using the corresponding SQL functions that return approximate results instead of the specified SQL functions, set the
approx_for_aggregation initialization parameter to
TRUE for the current session or for the entire database.
When Should Approximate Query Processing (AQP) Be Used?
If you work in a billing department, stop reading right now 🙂
1. Use AQP when you care about interactive response times
Exploratory data analytics is typically an interactive and iterative process: you form an initial hypothesis (e.g., by visualizing and looking at the data), running some queries, modifying your queries based on the previous answers, and so on until you find a satisfactory explanation. You cannot be productive if you have to wait half an hour every time you run a new query to test a hypothesis.
2. Use AQP when you can make perfect decisions with imperfect answers
Exact results have no advantage over approximate ones if they both lead to the exact same conclusion/decision. This obviously depends on the quality of your approximation and the application logic.
3. Use AQP when your data is incomplete or noisy
Believe it not, most of the data that is collected and used in the real world is extremely noisy. So the idea that processing your entire data gives you 100% accurate answers is usually an illusion anyway. In other words, if my data is noisy to begin with, I might as well take advantage of approximation, which uses a controlled degree of noise in my computation in exchange for significant speed ups. This noise in your data can be because of many reasons – human error, missing values, white noise, data extraction errors, data conversion errors.
4. Use AQP when your goal is to predict something
In general, there are many cases where the output of database queries are consumed by predictive analytics or machine learning algorithms, and in those cases, returning smaller samples of the original data can be a great solution for gaining considerable performance benefits.
Additional Reading: Oracle Database New Features Guide 12c Release 2 (12.2)
Many organisations like to count things. Counting the number of distinct elements (i.e. the cardinality) of a set is a challenge when the cardinality of the set is large. I see many use cases where an approximate result for a given query would be sufficient, rather than a completely accurate result. For example, you may have a query: “How many distinct individuals visited our website last week?”. Do you need an exact number? I would like to bring your attention to a new SQL function introduced in 12c – APPROX_COUNT_DISTINCT()
The APPROX_COUNT_DISTINCT function was added, but not documented, in Oracle 11g to improve the speed of calculating the number of distinct values (NDV) when gathering statistics using the DBMS_STATS package. Oracle Database 12c (126.96.36.199) now includes the function in the documentation so we are free to use it in our applications as a supported SQL function.
APPROX_COUNT_DISTINCT returns the approximate number of rows that contain distinct values of
This function provides an alternative to the
) function, which returns the exact number of rows that contain distinct values of
APPROX_COUNT_DISTINCT processes large amounts of data significantly faster than
COUNT, with negligible deviation from the exact result.
expr, you can specify a column of any scalar data type other than
APPROX_COUNT_DISTINCT ignores rows that contain a null value for
expr. This function returns a
In previous database versions, if we wanted to perform a count of distinct values, we would probably have done something like the following.
SELECT COUNT(DISTINCT object_name) AS obj_count FROM all_objects; OBJ_COUNT ---------- 47171 1 row selected.
This query gives us the exact count of the distinct values based on Oracle’s read-consistency model. As such, we see all the committed data, along with any uncommitted changes made by the current session.
In contrast, the new
APPROX_COUNT_DISTINCT function does not provide exact results, but should give “negligible deviation from the exact result” according to the documentation.
SELECT APPROX_COUNT_DISTINCT(object_name) AS obj_count FROM all_objects; OBJ_COUNT ---------- 46788 1 row selected.
DROP TABLE t1 PURGE; CREATE TABLE t1 AS SELECT level AS data FROM dual CONNECT BY level <= 10000; INSERT /*+ APPEND */ INTO t1 SELECT a.data FROM t1 a CROSS JOIN t1 b; COMMIT; EXEC DBMS_STATS.gather_table_stats(USER,'T1');
Now we have a table with a little over 100 million rows and 10,000 distinct values. We can see the difference in performance of the two methods is more significant.
SET TIMING ON SELECT COUNT(DISTINCT data) AS data_count FROM t1; DATA_COUNT ---------- 10000 1 row selected. Elapsed: 00:00:19.66 SQL> SELECT APPROX_COUNT_DISTINCT(data) AS data_count FROM t1; DATA_COUNT ---------- 10050 1 row selected. Elapsed: 00:00:10.46
Conceptually, for some reason this technique reminds me of a similar problem in the web development arena. Sometimes, close enough is just enough.
The “Blur Up” Technique for Loading Background Images
On the first day of denormalization, my design gave to me
a really fast-running query.
On the second day of denormalization, my design gave to me
2 less tables.
On the third day of denormalization, my design gave to me
3 more indexes.
On the fourth day of denormalization, my design gave to me
4 larger disks.
On the fifth day of denormalization, my design gave to me
5 brand new reqs.
On the sixth day of denormalization, my design gave to me
6 times the locking.
On the seventh day of denormalization, my design gave to me
7 longer updates.
On the eighth day of denormalization, my design gave to me
8 more requirements.
On the ninth day of denormalization, my design gave to me
9 invalid rows.
On the tenth day of denormalization, my design gave to me
10 delays deleting.
On the eleventh day of denormalization, my design gave to me
11 questionable queries.
On the twelfth day of denormalization, my design gave to me
12 lessons learned.