The relational database model has become the de-facto standard for the design of databases both large and small. While the concepts involved are not terribly complex, it can be difficult at first to get a handle on the concept. I'll try and give you a little introduction to the following topics on this page:
The simplest model for a database is a flat file. You have only a single table which includes fields for each element you need to store. Nearly everyone has worked with flat file databases, at least in the form of spreadsheets. The problem with flat files is that they waste storage space and are problematic to maintain. Lets consider the classic example of a customer order entry system. Assume that you're managing the data for a company with a number of customers, each of which will be placing multiple orders. In addition, each order can have one or more items.
Before moving on, lets describe the data that we wish to record for each component of the application:
It doesn't take a database design expert to see what the problem is in using a flat file to represent this data. Each time an order is placed, you'll need to repeat the customer information, including the Customer Number, Company Name, etc. What's worse is that for each item, you not only need to repeat the order information such as the Order Number and Order Date, but you also need to continue repeating the customer information as well. Lets say there's one customer who has placed two orders, each with four line items. To maintain this tiny amount of information, you need to enter the Customer Number and Company Name eight times. If the company should send you a change of address, the number of records you need to update is equal to the sum of product of orders and order line items. Obviously this will quickly become unacceptable in terms of both the effort required to maintain the data and the likelihood that at some point there will be data entry errors and the customer address will be inconsistent between records.
The solution to this problem is to use a relational model for the data. Don't let the terminology get you down - the concept isn't that hard to understand. It simply means that in this example each order entered is related to a customer record, and each line item is related to an order record. A relational database management system (RDBMS) is then a piece of software that manages groups of records which are related to one another. Let's take our flat file and break it up into three tables: Customers, Orders, and OrderDetails. The fields are just as they are shown above, with a few additions. To the Orders table, we will add a Customer Number field, and to the OrderDetails table we will add an Order Number field. Here's the list again with the required additional fields and modified field names.
I'll digress here a little to describe how I came up with the field names. This is totally a personal preference, but I've found it works well for me. Every time I create a table, I decide on a field name prefix to be used. As you can see, I've used "Cust" for the Customers table, "Ord" for the Orders table, and "OD" for the OrderDetails table. I do this for a couple of reasons. First, by using a field name prefix, I can identify the table associated with a field by simply reading the field name. Second, I avoid name collisions in queries because every field in a database, no matter how large and complex the structure, will always be unique - I'll never end up with two fields in a query sharing a single name. I find this particularly helpful in distinguishing between primary and foreign keys - a subject I'll take up a bit later. One other thing you may have noticed is a table naming convention I use, which is to name tables using the plural form of the data they contain.
OK, back to the tables. What we've done besides the name change is to add fields to the Orders and OrderDetails tables. Each have key fields used to provide a link to the associated Customers and Orders records, respectively. These additional fields are called foreign keys.
Like most other programming disciplines, relational databases are ripe with jargon. A key is simply a field which can be used to identify a record. In some cases, key fields are a part of the data you are storing or derived from that data, but they are just as often an arbitrary value. For the Customers table, you could use the company name as a key, but if you ever had two companies with the same name, your system would be broken. You could also use some derivation of the company name in an effort to preserve enough of the name to make it easy for users to derive the name based on the key, but that often breaks down when the tables become large. I find it easiest to simply use an arbitrary whole number. You can completely hide the use of the numbers from the end users, or expose the data. Its your choice to make based on the needs and abilities of the users.
There are two types of key fields we are dealing with: primary keys and foreign keys. A primary key is a field that uniquely identifies a record in a table. No two records can have the same value for a primary key. Each value in a primary key will identify one and only one record. A foreign key represents the value of primary key for a related table. Foreign keys are the cornerstone of relational databases. In the Orders table, the OrdCustID field would hold the value of the CustID field for the customer who placed the order. By doing this, we can attach the information for the customer record to the order by storing only the one value. We'll discuss how to put the data back together again next.
So far we've broken down our order entry system into three tables and added foreign keys to the Orders and OrderDetails tables. Now, rather than repeating the Customers table data for each Orders table record, we simply record a customer number in the OrdCustID field. By doing this, we can change the information in the Customers table record and have that change be reflected in every order placed by the customer. This is accomplished by using queries to reassemble the data.
One of the inherent problems of any type of data management system is that ultimately the human users of the system will only be able to view data in two dimensions, which in the end become rows and columns in a table either on the screen or on paper. While people can conceptualize objects in three dimensions, its very difficult to represent detail data in anything other than a flat table. After all the effort we went through to break down the original flat file into three tables, we are now going to undo that effort and make a flat file again.
We're going to accomplish this amazing feat of backwards progress by using queries. A query is simply a view of data which represents the data from one or more tables. Lets say we want to see the orders placed by our customers. We can link the Customers and Orders tables using the CustID field from Customers and the OrdCustID field from Orders - remember, the value of the OrdCustID field represents a related record in the Customers table and is equal to the CustID value from that record. By joining together the two tables based on this relationship, we can add fields from both tables and see all orders along with any pertinent customer data.
Queries are built in a relational database using Structured Query Language, or SQL. (Just in case you're wondering, some spell it out, and others say the word sequel, I use the latter but have no idea which is correct.) SQL is the standard language for relational databases and includes the capability of manipulating both the structure of a database and its data. In its most common form, SQL is used to create a simple SELECT query. Without getting into all the details now, suffice it to say that if you will be doing any serious work with databases, you're going to need to learn SQL.
Let's take the earlier example and build a query to look at customer orders. Here's the SQL for it:
SELECT CustName, CustCity, CustState, OrdDate FROM Customers INNER JOIN Orders ON Customer.CustID = Orders.OrdCustID;
That wasn't too tough. Lets look in a little more detail. This query starts with the SELECT keyword. Most of the queries you'll be building will be SELECT queries. SELECT simply means that we wish to "select" records, or retrieve records from the tables. Following the SELECT keyword is the list of fields. Next comes the FROM keyword. This is used to indicate where the data is coming from. In this case, its coming from the Customers table and the Orders table. The key to this query is the INNER JOIN. There are two basic types of joins which can be done between tables: inner joins and outer joins. An inner join will return records for which only the matching fields in both tables are equal. An outer join will return all the records from one table, and only the matching records from the other table. Outer joins are further divided into left joins and right joins. The left or right specifies which side of the join returns all records. The balance of the example query specifies which fields are used to join the table. In this case we are matching the CustID field from Customers to the OrdCustID field (the foreign key) in Orders.
One thing that should be noted is that this is Jet SQL. Each RDBMS has its own particular dialect of SQL, just as Visual Basic is derived from some original BASIC language somewhere, Jet SQL is a variation of SQL particular to Microsoft's Jet database engine. For a complete description of the features of Jet SQL, search the VB help files for the topics "Jet SQL" and "reserved word, Jet".
Lets consider what happens when you start manipulating the records involved in the order entry system. You can edit the customer information at will without any ill effects, but what would happen if you needed to delete a customer? If the customer has orders, the orders will be orphaned. Clearly you can't have an order placed by a non-existent customer, so you must have a means in place to enforce that for each order, there is a corresponding customer. This is the basis of enforcing referential integrity. There are two ways that you can enforce the validity of the data in this situation. One is by cascading deletions through the related tables, the other is by preventing deletions when related records exist.
Database applications have several choices available for enforcing referential integrity, but if possible, you should let the database engine do its job and handle this for you. The latest advanced database engines allow you to use declarative referential integrity. You specify a relationship between tables at design time, indicating if updates and deletes will cascade through related tables. If cascading updates are enabled, changes to the primary key in a table are propagated through related tables. If cascading deletes are enabled, deletions from a table are propagated through related tables.
Looking again at our order entry system, if cascading updates are enabled, a change to the CustID for a Customers table record would change all of the related OrdCustID values in the Orders table. If cascading deletes are enabled, deleting a record from Customers would delete any related records in the Orders table. In contrast, if cascading updates or deletes are not enabled, you would be prevented from changing the primary key or deleting a record from Customers if any related records exist in the Orders table.
Keep in mind also that if you have enforced referential integrity in the relationship between Orders and OrderDetails, this relationship can also have an effect on your ability to manage records in Customers. Just as you can't delete a customer with orders, neither can you delete an order with detail items. The result is passed along as far as necessary. If you cascade deletes from Customers to Orders, but not from Orders to OrderDetails, you will be prevented from deleting a record in Customers if there are any Orders records which have related OrderDetails records.
Before you go ahead and enable cascading deletes on all your relationships keep in mind that this can be a dangerous practice in some situations. Lets say you have a table called States, which lists the U.S.P.S two letter state abbreviation for each of the states in the country, along with the full name of the state. You use this table as a lookup table and to enforce the validity of the state entered in the Customers table. If you define a relationship between the States table and the Customers table with cascading deletes enabled, then delete a record from States, you will delete all Customers table records where the customer is located in that state. In most cases, I've chosen to let my applications handle cascading deletes. This gives me a bit of a buffer against errors in the application and helps to prevent the loss of data. The database engine will prevent deletions if related records exist, forcing me to account for those records explicitly.
Normalization is a subject that is made overly confusing by most literature on the subject that I've read. In a nutshell, its simply the process of distilling the structure of the database to the point where you have removed repeating groups of data into separate tables. In our example, we have normalized customers and orders by creating a separate table for the orders. If you look around, you can probably find a list of books on normalization theory as long as your arm. Read on if you wish, but the bottom line is that you need to design your database to be efficient and reliable. At times you may need to sacrifice normalization to practicality.
If you look at the Customers table, you can see that it isn't really necessary to include the CustCity and CustState fields since a US ZIP Code uniquely defines a city and state in the US. I'm not aware of how postal codes in other countries work, but I'm going to guess that a postal code will in most cases define a city uniquely enough to get the mail delivered. However, when taken to extremes, you'll pay a performance penalty for excessive normalization. If you were to fully normalize the Customers table, you would need to remove the CustCity and CustState fields and create a table, perhaps called ZIPCodes, which included these fields, then include only the CustZIP field and join the Customers table to the ZIPCodes table in order to reconstruct the full address. The problem with this is that you add the overhead of an additional join in every query where you need to have the full address available.
There aren't any hard and fast rules for when to stop normalizing a database. You need to make your own choices based on the practicality of the data structures and the performance trade-offs involved. If possible, you should at least try to design the application so that you can restructure the data to accomodate normalizing or denormalizing the tables.
That's a lot of information to absorb, and its only the "tip of the iceberg" in database design. The key concepts that you must understand in order to design a database properly are primary and foreign keys, which are used to define relationships, referential integrity, which is used to maintain the validity of the data, and normalization, which is used to develop a data structure. Once you have these concepts down, the rest of the details will fall into place more easily.
For a more in-depth discussion of building a relational database, you can also visit my Database Design Fundamentals page.
| © 1997 Joe Garrick
|
I don't know if this email is valid anymore, the
webpages are no longer available on the Internet