Normalising the data model

Sometimes I see on forums someone who is trying to get some SQL statement to wield data in a particular way but the data model is just thwarting their attempts, or if they do get something to work the SQL statement that does the job is horrendously complex. This tends to happen because the data is not normalised (or “normalized” if you are using the American spelling) to the third normal form. Denormalised data models tend to happen for two reasons. Firstly, because the modeller is inexperienced and does not realise the faux pas they have made in the model, and, secondly, because the modeller has found the a properly normalised data model just doesn’t have the performance needed to do the job required.

The Scenario

In the example scenario that I am going to present an private education company is looking to build a system that helps track their tutors and students. So as not to be overwhelming I am only going to concentrate on one aspect of the system – the tutor. A tutor may be multilingual and can teach in a variety of languages and they may also be able to teach a number of subjects. The Tutor table has joins to a table for languages and a table for the subjects. The model looks like this:

The denormalised data model
Partially denormalised data model

As you can see there are 3 joins from Tutors to Languages and 4 joins from Subjects to Tutors. This makes joins between these tables particularly complex. For example, to find out the languages that a tutor speaks then a query like this would have to be formed.

SELECT  l1.Name AS LanguageName1,
        l2.Name as LanguageName2,
        l3.Name as LanguageName3
FROM Tutors AS t
LEFT OUTER JOIN Languages AS l1 ON l1.LanguageID = t.Language1
LEFT OUTER JOIN Languages AS l2 ON l2.LanguageID = t.Language2
LEFT OUTER JOIN Languages AS l3 ON l3.LanguageID = t.Language3
WHERE t.TutorID = @TutorID

So, what happens if the tutor is fluent in more than three languages? Either the system cannot accept the fourth language it will have to be changed to accommodate it. If the latter option is chosen imagine the amount of work needed to make that change.

A similar situation occurs with the join to the Subjects table.

Solution

A better way to handle this sort of situation is with a many-to-many join. Many database systems cannot directly create a many-to-many join between two tables and must create an intermediate table. For those database systems that appear to be able to model a many-to-many join directly (GE-Smallworld comes to mind) what is actually happening is that an intermediate table is being created in the background that isn’t normally visible and the database takes care of this automatically.

The resulting data model will look like this

The normalised data model
Normalised data model

This allows a tutor to be able to register any number of languages or subjects. It also means that any joins on the data are easier as there are no duplicate joins for each Language or Subject. The above SELECT statement can be rewritten as:

SELECT  l.Name AS LanguageName
FROM Tutors AS t
INNER JOIN TutorLanguage as tl ON tl.TutorID = t.TutorID
INNER JOIN Languages as l ON tl.LanguageID = l.LanguageID
WHERE t.TutorID = @TutorID

This will result in one row being returned for each language rather than all the languages being returned into one row. It is possible to pivot the results back to one row, but currently in SQL Server 2000 that would add more complexity to the query than I am willing to discuss in this article. If you want to know how to pivot results in SQL Server 2000 then see the page on Cross-Tab Reports in the SQL Server books-online. SQL Server 2005 will allow PIVOTed results directly. For more information between the SQL Server 2000 and 2005 way of doing things see: Pivot (or Unpivot) Your Data – Windows IT Pro

Migrating existing data

Naturally, if you have existing data using the denormalised schema and you want to migrate it to the normalised schema then you will need to be careful about the order in which changes are made lest you lose your data.

  1. Create the intermediate table.
  2. Change any stored procedures using the denormalised schema to the normalised schema.
    • You may also need to change code outside the database. If you find yourself needing to do this then I strongly recommend that you read about the benefits of stored procedures.
  3. Perform an insert for each of the denormalised joins into the intermediate table
  4. Remove the old joins.

If possible the above should be scripted so that the database changes occur as quickly as possible as, depending on your situation, you may have to take your production system off-line while making the change. Testing the changes in a development environment first should ensure that the scripts are written well and don’t fall over when being run on the production database.

To move the denormalised Language joins to the normalised schema some SQL like this can be used.

INSERT INTO TutorLanguage
    SELECT TutorID, Language1 AS LanguageID
    FROM Tutors
    WHERE Language1 IS NOT NULL
UNION
    SELECT TutorID, Language2 AS LanguageID
    FROM Tutors
    WHERE Language2 IS NOT NULL
UNION
    SELECT TutorID, Language3 AS LanguageID
    FROM Tutors
    WHERE Language3 IS NOT NULL

It can, of course, be written as a series of individual INSERT INTO…SELECT statements rather that a large UNIONed SELECT

NOTE: This was rescued from the Google Cache. The original date was Sunday 3rd April 2005.

Tags:

Types of join

Occasionally there is a post on a forum asking what a certain type of join is all about, so I thought it would probably be good to have a stock explanation to refer people to so that I don’t re-write near enough the same response each time the question arises.

First lets consider these two tables.

A

Key         Data
----------- ----------
1           a
2           b

B

Key         Data
----------- ----------
1           c
3           d

We can see that the only match is where Key is 1.

INNER JOIN

In an INNER JOIN that will be the only thing returned. If we use the query

SELECT A.[Key] AS aKey, A.Data AS aData, B.[Key] AS bKey, b.Data AS bData
FROM A
INNER JOIN B ON a.[Key] = b.[Key]

the returned set will be

aKey        aData      bKey        bData
----------- ---------- ----------- ----------
1           a          1           c

In the case of the various outer joins non-matches will be returned also.

LEFT OUTER JOIN

In a LEFT OUTER JOIN everything on the left side will be returned. Any matches on the right side will be returned also, but if there is no match on the right side then nulls are returned instead.

The query

SELECT A.[Key] AS aKey, A.Data AS aData, B.[Key] AS bKey, b.Data AS bData
FROM A
LEFT OUTER JOIN B ON a.[Key] = b.[Key]

returns

aKey        aData      bKey        bData
----------- ---------- ----------- ----------
1           a          1           c
2           b          NULL        NULL

RIGHT OUTER JOIN

The RIGHT OUTER JOIN is very similar to the LEFT OUTER JOIN, except that, of course, the matching is reversed. Everything on the right side is returned, and only matches on the left side are returned. Any non-matches will be filled with nulls on the left side.

The query

SELECT A.[Key] AS aKey, A.Data AS aData, B.[Key] AS bKey, b.Data AS bData
FROM A
RIGHT OUTER JOIN B ON a.[Key] = b.[Key]

returns

aKey        aData      bKey        bData
----------- ---------- ----------- ----------
1           a          1           c
NULL        NULL       3           d

FULL OUTER JOIN

A FULL OUTER JOIN returns a set containing all rows from either side, matched if possible, but nulls put in place if not.

The query

SELECT A.[Key] AS aKey, A.Data AS aData, B.[Key] AS bKey, b.Data AS bData
FROM A
FULL OUTER JOIN B ON a.[Key] = b.[Key]

returns

aKey        aData      bKey        bData
----------- ---------- ----------- ----------
1           a          1           c
2           b          NULL        NULL
NULL        NULL       3           d

CROSS JOIN

The CROSS JOIN doesn’t obey the same set of rules as the other joins. This is because it doesn’t care about matching rows from either side, so there is no ON qualifier within the join clause. This is a simple join that joins all rows on the left side to all rows on the right side. Where the inner join and left/right outer join cannot return more rows than exist in the most populous of the source tables and the full outer join’s maximum result set if the sum of the source rows, the CROSS JOIN will return the product of rows from each side. If you have 5 rows in Table A, and 6 rows in Table B it will return a set containing 30 rows.

The query

SELECT A.[Key] AS aKey, A.Data AS aData, B.[Key] AS bKey, b.Data AS bData
FROM A
CROSS JOIN B

returns

aKey        aData      bKey        bData
----------- ---------- ----------- ----------
1           a          1           c
2           b          1           c
1           a          3           d
2           b          3           d

NOTE: This was rescued from the Google Cache: The original date was Monday, 27th February 2006.

Tags:

The Normal Forms of Database Normalization

I’m doing a clear out at the moment as I’m getting the house ready to sell – I’m planning to move to West Lothian – and I came across some old notes about database normalisation. It was a summary about the first 5 normal forms so I thought I’d share them.

First Normal Form: Every column contains just one value. That means that if you have a person’s name it should be split up to forename and surname, an address is split up so that the city and postcode are in separate columns, and so on.

Second Normal Form: The First Normal Form plus every table has a primary key. That means that everything can be uniquely referenced through the primary key.

Third Normal Form: The Second Normal Form plus every non-key column depends on the primary key. So for example, if you have an Orders table then columns such as DateOfOrder and DateOfDispatch are valid but listing the client’s details such as name, address and so on are not valid.

Fourth Normal Form: Remove repeating columns to a new table. Repeating columns are indicative of a missing relationship. So, if you have an Orders table it will not have item1, item2, and item3 columns. They will be removed to a separate table and form part of a join to the orders table.

Fifth Normal Form: Remove repetition within columns. Simply, this is enumerated values. For example, a company may accept Visa, Mastercard and American Express for payment. Rather than repeat those strings in the order payment table, they can be places in a small lookup table and the order payment table can refer to their smaller primary key.

However, I should point out that my notes are at odds with other references that I found. It would seem that the correct Fifth Normal Form is to do with ternary many-to-many relationships. In fact a quick search on the internet found that what ever I had written down as the Fifth Normal Form isn’t to be found. More research is in order to find out where my idea of the Fifth Normal Form comes from

NOTE: This was rescued from the Google Cache. The original date was Saturday, 22nd April 2006.

Delete the row, or set a deleted flag?

For auditing purposes some database system run without deleting rows from the database – they just flag the row as being deleted and ensure that queries that run on the table add a line to the WHERE clause to filter out the “deleted” rows.

However, as Daniel Turini points out on Code Project, this can hurt the database performance.

Specifically he says:

Bear in mind that having a flag leads to bad index performance. In my (rather old) SQL Server DO’s and DONT’s[^], I suggest you not to index on the “sex” or “gender” field:

First, let’s understand how indexes speed up table access. You can see indexes as a way of quickly partitioning a table based on a criteria. If you create an index with a column like “Sex”, you’ll have only two partitions: Male and Female. What optimization will you have on a table with 1,000,000 rows? Remember, mantaining an index is slow. Always design your indexes with the most sparse columns first and the least sparse columns last, e.g, Name + Province + Sex.

Creating a “deleted” flag will lead to bad performance, on several common cases. What I’d suggest is to move those rows to a “Deleted” table, and have a view using UNION ALL so you can easily access all the data for auditing purposes.

Not to mention that it’s easy to forget an “AND NOT IsDeleted” in some of your queries.

However, others recon the performance hit is negligable. Or that moving the data to a new table could damage the data. It is an interesting debate and you can read it here, from the start.

NOTE: This was rescued from the Google Cache. The original date was Thursday, 29th June 2006.

Tags: