Database Design 101

Queries work well if the databases are designed correctly. Here's what you need to know to build the foundations for your SQL Server databases.

Most IT professionals and developers understand the basic purpose of databases - to store data. But there's more to it than that. For example, you need to decide how the data should be stored. Good decisions will lead to an extensible database. Poor decisions may meet immediate requirements can lead to many headaches in the long run.

If you're strictly a database administrator (or a systems administrators who's responsible for the database servers), there's a good chance that the contents of the databases you administer are "black boxes." You understand the importance of database management issues - performing backups, testing restores, managing database file growth and monitoring and optimizing performance - but you're probably not familiar with the basics of how table structures are designed.

On the other hand, if you're a database developer, it's critically important that you understand how to design tables, views and other database objects. A seemingly simple decision about how to arrange Customers and Order information within a database can have severe impact on usability, manageability and performance. Regardless of your background with relational databases and how they work, a solid understanding of database design issues can form a great foundation for successful database design and management. As you might have guessed from the title, that's the topic of this article - the basics of database design.

Note
In the previous three articles in this series, I covered the basics of database objects and how SQL queries can be used to retrieve information. Although this month's topic is somewhat independent of the subject of those articles, it may help if you understand their content before reading this column.

In this column, I'll provide an overview of database design concepts. Instead of focusing on the specifics of such concepts as referential integrity and database normalization, I'll provide some basic info and point you to resources. The goal will be to understand the factors that make for good design decisions.

Determining Business Requirements
As with any technology-related project, you should clearly define the types of business problems you're trying to solve before you even think about solutions. The design and implementation of databases is no exception. The first step is to determine whether you need a database at all. All too often, it seems that when people want to collect information, this is the way to go. Most of the time, it probably is.

What reasons would compel you to house your important information in a relational database? Some reasons might include support for concurrency (that is, multiple people can access and change the data at the same time), data security (allowing people permissive access) and data manageability (modifying and reporting on large quantities of data). If you determine these features to be the basic requirements of your solution, chances are that a relational database is the best choice.

Before you begin the design of a database, ask these questions:

  • What is the purpose of the database? Often, the answer to this high-level question is obvious. For example, someone might request a database to track customer information such as names and addresses. You might need a method to keep track of inventory. In any case, you should understand whether the database will be used primarily for the storage of transactional information, or whether it will be used mainly for reporting purposes.
  • Who are the subject matter experts? In the ideal world, the developer would be the world's foremost expert on the business usage of the planned database. But it's the real world, so before you embark on a quest to design the perfect database, be sure you know which individuals within your company can answer questions that you might have. For example, if you're designing a financial database, you might need to know about the smallest and largest acceptable values for an "Amount" column. Knowing to whom the question should be asked is half of the solution.
  • What are the detailed requirements for the system? Be sure you can identify all of the entities (such as Customers, Orders and Invoices) that must be tracked within the database. You should also understand how these entities are related. It's important to be specific when you answer this question as a few small changes in the requirements can call for significant rework of a database design.
  • What types of users and applications will access this database? It's important to understand how the data you choose to track will be used. For example, will sales people frequently be viewing Customer information by region? Will hundreds of data entry clerks be entering thousands of Order records per day? The more information you have about the usage of your data, the more appropriate will be your database design.
  • What are the plans for the future of this database? Is it expected that users will be happy with the types of information we are planning to collect? Or, is it likely that they're want to add more to the database and/or integrate this information with other systems within the company.

If you have the above information, you're probably in good shape. So far, you know what you need. Now, let's move on the look at some considerations for designing a database.

The Basics of Database Design
When faced with "designing" a database, novice database implementers often try to think only about the storage of data. However, the method in which information will be retrieved from your database is just as important. Given a set of business constraints, I may be able to come up with multiple database designs. Some of these, however, will make it difficult to retrieve data in the format in which I need it. When you're designing a database, it's important to first think about the types of entities that you want to track. Let's continue to use the example of a Sales Order Entry system (so far, I've mentioned Customers and Orders). In such a system, you should be able to identify the characteristics of Customers, the characteristics of Orders, and how these entities interact.

A typical statement that describes entity relationships might be something like the following: "A Customer can have zero or more Orders, and all Orders must be linked to an existing Customer. An Order can contain one or more Line Items. Each line item includes information about exactly one Product that has been ordered. Each of these Products is identified by a unique number."

Figure 1 shows an example of a basic database schema that shows these relationships. The schema was created in SQL Server 2000 and is provided in the "pubs" sample database. However, many different database design tools allow you to easily create tables and define their relationships. For now, don't worry about the details of the lines between the various tables that I've created.

Table Relationships, on the next
Figure 1. Viewing table relationships in the pubs database schema in SQL Server 2000's Enterprise Manager. (Click image to view larger version.)

Now that we've covered a simple example of how a database can be designed, let's build upon this concept.

Dealing with Relationships
OK, so this title might sound like something out of a self-help book. Well, it sort of is - self-help for database developers. It's important to define and understand the relationships between objects in your database. If you've been following the examples thus far, you know that it's important to be able to enforce the rules that the database objects must follow. For example, you may want to make sure that a column named ZIPCode contains only valid United States ZIP code formats.

Note
This article primarily focuses on database design issues, in general. In some cases, SQL Server 2000 tools and terminology will be used to illustrate specific concepts. If your environment does not use SQL Server 2000, rest assured that the same information applies to most relational database systems that are available today.

In this section, we'll look at ways to make sure that only the right types of information are entered into your database. We'll focus on ways to ensure that your information is kept consistent and the relationships within and between tables are retained.

Defaults, Rules, and Constraints
When you define a column within a table, there's a good chance that you know the types of values that are acceptable. A PhoneNumber field might allow only integers, whereas an Address field might allow alphanumeric characters. The first step in determining acceptable types of data is in the table definition. When you create a table, you must choose from among the data types that are available in the table designer (see Figure 2). In SQL Server, for example, you might choose an int (an integer data type), a varchar (a variable-length character field), a text (long text fields), or a boolean (true/false) value for a column. Your choices here will determine what types of information can be stored in these columns.

SQL Server Table Designer
Figure 2. You create the table definition using the SQL Server 2000 table designer. (Click image to view larger version.)

Choosing a data type limits the types of data that can be stored in a column, but sometimes you want to take it even further. In addition to restrictions defined by data types, you can place other constraints on the types of information that are acceptable in a column of data. SQL Server provides several different types of constraints:

  • NOT NULL - A value must be specific for this column.
  • CHECK - The values supplied must meet the criteria specified in the constraint. A database developer can define a CHECK constraint to ensure that an entered integer is an even, positive value.
  • UNIQUE - No values in this column may be duplicates of another. This might be used, for example, in an employee information table to prevent duplicate employee numbers from being entered.
  • PRIMARY KEY - Defines which column or columns uniquely identify each row in the database. No two rows can have the same values for the primary key.
  • FOREIGN KEY - Enforces data integrity between tables by creating a reference for specific types of information.

We'll discuss the PRIMARY KEY and FOREIGN KEY constraints in the next section. CHECK constraints place limitations on the types of information that can be stored in a specific column of data. For example, I might want to restrict the value in a phone number column to a 10-digit format (with no dashes or other characters). A constraint can ensure that information is entered in numeric format. If the information does not meet the criteria, an error is returned to the user.

Constraints can be placed on one or more columns and can be quite complex, if that's what your business rules call for:

  • Defaults are settings placed on a table that specify which values should be used if none is specified. This type of constraint is commonly used in situations where the database assumes that certain values should be used unless they are otherwise specified by a user or an application. For example, I might want to create default responses for the columns that represent questions in a simple customer survey. If no value is provided for these columns, I may want the value to default to "N/A".
  • Rules function similarly to constraints but have the added benefit of existing as database objects. In contrast to constraints-which are defined as part of a column's definition-rules can be "bound" or "unbound" to columns. This allows the flexibility of disabling a rule without losing its definition. However, only one rule may apply to a column's definition. Rules are provided mainly for backward-compatibility with SQL Server applications. Microsoft recommends that, wherever possible, CHECK constraints be used instead of rules.

Now that we have a good idea of the types of column-level constraints that are available, let's look at the various types of referential integrity that are possible.

Domain, Entity, and Referential Integrity
It is possible to create a database in which all the information stored in tables is completely unstructured and unrelated. This would lead to many problems, however, as is often learned by those who do not take the time to adequately plan the structure of their databases. Generally, difference pieces of information stored in your database objects relate to each other in some way. Again, let's revisit the commonly used example of a sales database. Each sale might be tied to a customer, but the actual information about the customer (including shipping address and purchase history) might be stored in other tables. In this case, a clear relationship between the two tables must be kept intact. Additionally, business rules might require that each customer have a unique customer number that should never be reused.

Part of playing the database game is to ensure that the rules are followed; that is, you must be able to enforce the relationships between entities in your database. Consider the case in which you have a simple Customer - Orders relationship. Suppose that in your system, a Customer can have zero or more orders, and all orders must be attached to a customer. This might seem like an easy rule to define, and you can reasonably code for it within your application. But what happens if another developer forgets the rule and allows users to delete a customer who has several orders without first deleting all of the orders themselves? In this case, you'll have violated the integrity of your database by creating "orphan" records in the Orders table. The simple answer is that this shouldn't be allowed to happen, and the developer or end user should receive an error. Stopping short of this, you run the risk of entering invalid information into your database (something that can be a very serious problem and can be difficult to correct after it occurs). So what's a good database designer to do? Fortunately, there's a relatively easy solution.

Integrity constraints are created to ensure that these relationships are maintained in a consistent manner. There are three major types of integrity that database designers must keep in mind:

  • Domain integrity - Ensures that values stored within a column are consistent and in accordance with business rules. Domain integrity can be based on constraints such as UNIQUE and CHECK constraints that specify what values are acceptable for each column.
  • Entity integrity -- Refers to information stored in rows (remember that each row in a table stores information about one entity of the type that the table describes). This type of constraint makes sure that the information stored in rows within a table is consistent and follows the rules specified. For example, each row must contain the same number of columns (although some values may be left blank).
  • Referential integrity - Applies across tables and ensures that information between these objects is consistent. Referential integrity includes relationships between tables. The actual columns that match between the tables are known as foreign keys and primary keys, and they can be defined using PRIMARY KEY and FOREIGN KEY constraints (mentioned earlier). Referential integrity ensures that related information remains consistent. It solves, for example, the problem I mentioned earlier: ensuring that only valid customers are used for all orders placed in the database and avoids the problem of "orphan rows". Orphans might occur when a customer row is deleted from the database, but the customer still has orders. In this case, the orders are orphans since their parent row (the customer information) no longer exists. When it comes time to fulfill the orders, users will find that they do not have enough information.

Figure 3 illustrates the three main types of database integrity.

I think it's mirrors...
Figure 3. How domain, entity, and referential database integrity types compare.

With the basic ideas of database integrity out of the way, let's move on to another important topic: database normalization.

Understanding Database Normalization
Remember earlier in this article when I referred to the fact that there can be several good database designs that meet a given set of requirements? There's definitely a subjective aspect to database design. One important concept is that of normalization. Normalization refers to the level of separation of data into multiple tables. There are many rules of normalization, and they have been described in various ways. To remain at a high-level, I'll just provide the basics. One rule of normalization is that every column within a table should refer only to one entity. For example, if I have a Customer table, every piece of information within that table (like the customer's name and address) should correspond only to that customer. Additionally, each row within the Customers table should have a unique identifier of some sort (such as a Customer Number value that is unique for each customer).

A highly normalized database schema is characterized by having many "narrow" tables. By narrow, I mean the tables tend to have few columns. A denormalized database schema, on the other hand, tends to have fewer, wide tables. In general, a normalized database is easier to understand since tables refer to very specific entities.

So what should you consider when you decide how normalized your database should be? Well, performance is an important issue that is addressed by normalization. Highly normalized tables can improve performance in transaction processing systems as they limit the performance impacts of locking. If your database experiences a large number of data modification queries (INSERT, UPDATE and DELETE statements), then normalization is the way to go. On the other end of the spectrum, a denormalized database schema provides better performance for reporting functionality (SELECT statements). Since you'll have to join with fewer tables in your queries, reporting, especially with large sets of data, can be much quicker. Of course, denormalization comes at a cost: You may have redundant data, which requires additional storage space.

Remember that the concept of normalization is a continuum. That is, it's not like a database is either normalized or it's not. Rather, it can be "somewhat denormalized", or "highly normalized". Books have been written about database normalization techniques and recommendations. We don't have room in this article to dive into the details. Remember, this is an overview!

Summary
So, there you have it - the basics of database design. In this article, we looked at the basic design goals that you should keep in mind when you are developing a database solution. We covered the all-important concepts of referential integrity and database normalization. Both of these aspects form the basis for much longer discussions about how databases should be designed. The good news is that, if you understand these concepts, you're well on your way to designing efficient databases that meet your business requirements. The bad news is that we only scratched the surface within this article. Nevertheless, be sure you take the time determine your business requirements and design an appropriate database. Sometime soon, someone will be glad you did!

Additional Information
Microsoft wants to test your ability to design real-world database solutions and to take advantage of all of the many features available in SQL Server 2000. With exams that are related to database design and implementation, it can be difficult to judge when you're really ready. I recommend having at least several months of experience in working with SQL queries and designing databases, to give you a strong foundation.

With this information in mind, look at all of the new features in SQL Server 2000 that you have NOT used: don't forget about XML features, materialized views, partitioning and other features, just because you don't use them. The following resources should help you obtain more information about SQL Server 2000:

SQL Server 7.0 and SQL Server 2000 exams are available concurrently. If you're an MCPmag.com Premier member, you can check out the following articles:

I also review the 70-229 SQL Server 2000 Implementation exam in an upcoming issue.

Featured