| This page is user contributed documentation. See the bottom of the page for information about the author. | |
author: Richard Huxton
revision: 1.0
date: 2003-03-14
This is a short guide to the use of nulls in SQL databases. It is written with Postgresql in mind but should be applicable to any SQL-based DBMS.
Thanks to the members of the psql-sql mailing list for their assistance in preparing this guide.
You can get further information in: Any good relational database book (try something written by Date or Pascal) Bruce's book My Postgresql Notes
A null is not an empty string. A null is not a value like others. A null is the absence of a value
The SQL standard defines it as a "special value" but its behaviour is so different to all the other values of that type that I prefer to view it as an absence of a value. If you don't see the difference, just ignore it.
Well, they should mean one of two things: 1. There is no applicable value 2. There is a value but it is unknown
Example 1: Imagine you have a customer table with name and sex
fields.
If you get a new customer "ACME Widgets Ltd", the sex field is
meaningless since your customer is a company (case 1).
A new customer "Jackie Smith" might be male or female but you
might not know (case 2).
Actually, since you are trying to store a company in the ACME
example that might indicate that you need to rethink your design.
Example 2: You have an address table with
(street,city,county,postalcode) fields.
You might insert an address ("10 Downing
Street","London",Null,"WC1 1AA") since you don't have a valid
county.
You might also insert an address ("1 Any
Street","Maidstone","Kent",Null) where there must be a
valid postalcode, but you don't know what it is.
It might be useful to be able to distinguish between these two cases - not applicable and unknown, but there is only one option "Null" available to us, so we can't.
There is one very important rule when dealing with nulls. A null is unknown and thus not equal to, less than or greater than any value it is compared to.
Example: with the customer table above you could run the following queries:
SELECT FROM customer WHERE sex='M';
SELECT FROM customer WHERE sex<>'M';
Now you might think this returns all customers, but it will miss
those where sex is null. You've asked for all rows where the
value of sex is M and all those with values not
equal to M but not rows with no value at
all
It might help to think of a database as a set of statements you know to be true. A null indicates that you cannot say anything at all about that field. You can't say what it is, you can't say what it isn't, you can only say there is some information missing.
So, to see all the customers with unknown or inapplicable sex you would need:
SELECT * FROM customer WHERE sex IS NULL;
Note that the following will not work, you need to use "IS NULL"
SELECT * FROM customer WHERE sex=NULL;
There are actually three possible results for a test in SQL - True (the test passed), False (the test failed) and Null (unknown or can't say). The table below indicates the result of using AND/OR operations on a,b for values of True,False and Null.
a | b | a AND b | a OR b
------+-------+---------+--------
TRUE | TRUE | TRUE | TRUE
TRUE | FALSE | FALSE | TRUE
TRUE | NULL | NULL | TRUE
FALSE | FALSE | FALSE | FALSE
FALSE | NULL | FALSE | NULL
NULL | NULL | NULL | NULL
In the example of a=True,b=Null, (a AND b) is Null (which gets treated as false for the purposes of WHERE clauses). However (a OR b) is True since if a is True, we don't care what b is.
If you try to perform an operation on nulls, again the result is always null. So the results of all of the following are null:
SELECT abc || null;
The first case can be especially confusing. Concatenating Null to a text value will return null, not the original value. This can catch you out if you are joining first_name to last_name and one of them contains nulls.
You can think of each null-able field/column having a separate "is_null" flag attached to it. So, if you have a column "a" of type integer, in addition to space required to store the number, there is another bit which says whether the item is null and the value should be ignored. Of course, there are optimisations that get made, but that is the general idea.
If you define a unique index on a column it prevents you inserting two values that are the same. It does not prevent you inserting as many nulls as you like. How could it? You don't have a value so it can't be the same as any other.
Example: We create a table "ta" with a unique constraint on column "b"
CREATE TABLE ta (
a int4,
b varchar(3),
PRIMARY KEY (a)
);
CREATE UNIQUE INDEX ta_b_idx ON ta (b);
INSERT INTO ta VALUES (1,aaa); -- succeeds
bbb); -- succeeds
bbb); -- fails
Given the definition of what a null is, you only have two choices: allow multiple nulls or allow no nulls. If you want no nulls, define the column as NOT NULL when creating the table.
No column that is part of a primary key can be null. When you define a PRIMARY KEY, none of the columns mentioned can take a null value. Postgresql makes sure of this by defining the columns as NOT NULL for you.
Example: With table "ta" we just created, \d ta will show column a as being not null. Otherwise, we could insert several rows with "a" set to null and have no way to tell them apart. If the primary key was defined as being over (a,b) then neither could be null.
You need to be careful using count() if a column can contain nulls. The count() function is defined as counting values and so skips nulls. The same applies to other aggregates like sum() or max() but these behave more intuitively.
Example: we have a table ta (a int4, b int4) with the following data.
a | b
---+----
1 | 10
2 | 20
3 |
SELECT count(*) as num_rows, count(a) as num_a, count(b) as num_b
FROM ta;
num_rows | num_a | num_b
----------+-------+-------
3 | 3 | 2
SELECT sum(b) FROM ta;
sum
-----
30
If you were trying to calculate the average of column b then sum(b)/count(b) gives a different result from sum(b)/count(*).
You need to think carefully about how the above rules impact sub-queries, especially something like NOT IN.
Example: Assume we have a companies table and a diary table. Diary entries are usually related to a particular company but not always.
SELECT co_id,co_name FROM companies;
co_id | co_name
-------+--------------------------
1 | Worldwide Partnership UK
2 | British Associates PLC
3 | Global Enterprises INC
SELECT dy_id,dy_company FROM diary;
dy_id | dy_company
-------+------------
101 | 1
102 | 2
103 |
SELECT co_name FROM companies WHERE co_id IN (SELECT dy_company
FROM diary);
co_name
--------------------------
Worldwide Partnership UK
British Associates PLC
SELECT co_name FROM companies WHERE co_id NOT IN (SELECT
dy_company FROM diary);
co_name
---------
(0 rows)
What happened in this last case? Where did co_id=4 go to - it's certainly not in the diary. Actually, that's not true. Since there is a null in dy_company, we can't say for sure whether 3 is there or not - that null throws everything into a state of uncertainty.
Looking at what's happening in the case of co_id=3:
WHERE co_id NOT IN (SELECT dy_company FROM diary)
WHERE 3 NOT IN (1,2,Null)
WHERE NOT (3=1 OR 3=2 OR 3=Null)
WHERE NOT (False OR False OR Null)
WHERE NOT (Null)
WHERE Null
You can see that the OR-ing a Null with false values gives Null and Not(Null)=Null. We can't prove that co_id=3 isn't in the diary so we can't return that row.
Remember that advice is worth exactly what you paid for it.
Think carefully about whether a particular field/column should be able to contain nulls. If not define it as NOT NULL when creating the table.
In cases where a column holds an enumerated type (like the sex column previously) you can define your own values for not applicable and unknown. So rather than values of (M,F) you could allow values of (M,F,N,U). This makes it clear whether the value is NOT APPLICABLE or UNKNOWN.
For timestamps there is the special value "infinity" which is later than any other value (and "-infinity" which is earlier). Note that this only applies to timestamps not date types (as of v7.3.2).
If you have a numeric or free-text field then you can't define your own replacements for null (except in the case when you can restrict permissable values). It's also not necessarily a good idea. Using negative or very large numbers to represent unknown is one of the reasons null was invented.
If you have a text field, you need to decide what you store when the user leaves it blank while using your application. Do you treat that as null or an empty string? The answer, I'm afraid will depend on context but if a field can be null you'll need a way to set/display it differently from an empty string.
Using nulls to mean not applicable can indicate you haven't normalised correctly.
Example 1: In the company/diary example earlier, we got tripped up with a null in the dy_company column. We could avoid this by splitting into three tables: companies, diary and co_diary with the last being the relevant id's from the other two tables. This makes it simple to identify what companies have/don't have diary entries.
Example 2: Assume we have a bird_description table to record sightings by bird-spotters. You could structure it as (sighting_id, weight, wingspan, belly_colour, wing_colour, wingtip_colour, beak_colour, feading_behaviour, mating_behaviour, ...) but for any individual sighting most of these values will be null. An alternative would be to define the table as (sighting_id,aspect,value) so you could store ("S0137","wingspan","15cm") - this means you are only storing the information supplied and also makes it easier to add new aspects of a sighting.