Text of Chapter 11, Java Database Development, Martin Rinehart, Osborne/McGraw-Hill, 1998,

Chapter 11

Designing Object and Relational Databases

Do you know traditional database design theory? If you do, don't skip this chapter! It presents a nontraditional method. My method is simple to follow, easy to explain or teach and guaranteed to give you results superior to fourth-normal form. I'll bet you're skeptical right now. Read on.

If you don't know normalization theory, fifth-normal form is roughly equal to perfect design. The lower forms start at first-normal form, which simply says the data is stored in tables. Most database designers are content with third-normal form.

If you do know normalization theory, before you finish this chapter you'll probably be amazed at how simple this design business can be. You'll understand jargon like "full functional dependency," but you'll know you won't need it for designing databases.

We'll begin with the basics: objects and events. Designing a database begins with identifying objects and events. (Yes, I do know Chen's ER work. If you do, too, see if you don't agree that this blows it away.) When you identify objects and events, you list the characteristics of each that are important for your database's purposes.

You examine the characteristics to see what other tables you might need. Relational databases need separate tables for detail categories. Object databases can track this information in separate classes, or within the parent objects themselves.

I'll give you a simple, six-step summary of the whole procedure. When you design, you'll perform all these steps more or less simumtaneously. They'll serve as a checklist after you're done.

The whole procedure is so simple that there's a section explaining classic normalization and showing how our designs compare. You'll see that our designs compare very favorably. I'll show you that your designs will be in Boyce-Codd normal form. Those who haven't met normalization before will come away from this section with a brand-new vocabulary.

Finally, I'll hava a section on the intelligent use of redundant data. We'll leave theory behind and discuss how to make intelligent use of redundant data to help the users without taking unnecessary risks.

Let's begin with the basics, objects and events.

Beginning with Objects and Events

A database is a model of some useful portion of the real world. When you design a database, you have some application or applications for the data. You may want to account for past transactions or forecast future events. Whatever you have in mind, you have to design a good database if you want a good application.

Let's do the design job very badly, so you can see some of the problems you might have. We'll design the accounting system for a software distributor. For our starting design, we'll have just one table. With each sale we'll record the name and addresss of the customer, list the items sold (we'll allow enough columns for the highest number of items that might be included in a single sale), and we'll record the invoice number and any other data.

What's wrong with this database? For one, if we delete a single sale, we could lose all the information about a customer. If a customer makes a second purchase, the customer's data is entered in two places. What happens when we get a change-of-address notice from an active customer?

That customer's data is repeated in many records. Chances are excellent that the data is inconsistent. We'll get the new address recorded in one record, but not in others. There is no convenient way to ask questions such as, "How much software has X purchased?"

Our designs will be fundamentally sound because they'll be anchored in the world they model. We'll have none of these problems. You'll see that our design work doesn't even ask whether the database is object or relational until late in the process. That won't be very important for our designs.

Objects

Step 1 is to list the classes of objects we'll deal with. These are real-world objects, not software objects. The simple definition is, "If you can kick it, it's an object."

Classes of objects include customers, products, vendors, buildings, and departments. These are all things you can kick. (Just because you can kick something doesn't mean that you should kick it.)

Of course, nonkickable things are also objects. For examples, mathematical theories or musical compositions could be usefully cataloged in a database. A better definition is that an object is anything that is permanent, or that exists over a long time and has an independent existence — it doesn't require other objects.

Actually, I use the "kickable" definition when I design databases. It covers most practical work. Events will teach you something about what objects are not.

Events

In Java, Events are Objects, too. In our database design, they're definitely not objects. An event is something that happens at a point in time involving objects.

Making a sale is an event. Admitting a patient is an event. Shipping a product is an event.

In a sale, a customer is united with one or more products at a moment in time. This establishes a permanent relationship between the customer and the product(s): owner. (The customer might resell or otherwise dispose of the product later, but if that's not relevant to our system, we'll ignore it.)

Admitting a patient associates a person with a hospital at a point in time. (Presumably there will be a discharge event that later breaks the association.)

Shipping a product happens at a point in time. If our database is concerned with recognizing revenue, we might record the day the product is shipped. If our application is supposed to help us manage the loading dock, we'll record the exact minute each carton is loaded. Whatever the case, we'll have something that happened at a point in time. The event has to relate objects that are part of our system in a way that's meaningful.

The software distributor records customer information: name, address, phone, and so on. The burger joint does not know its customers. In the burger-joint system, there is no customer object. There's just a sale. The product disappears from our inventory — we don't track it beyond this point.

With no objects, we'll have no events. Any event that involves only objects that are outside our database is, by definition, not of interest. Our objects remain unchanged by that event. If you find an event that seems important but doesn't relate to your objects, ask to what objects it does relate. You've overlooked something.

Processes and Other Classes

Are there more things beyond objects and events? We'll meet more shortly, but before we go on, let's think about taxonomies in general. Here we're creating a taxonomy of the types of things about which you could record data.

The biologist divides animals into groups like mammals and reptiles. Mammals have hair, give birth to live young, and suckle their young. Reptiles have scales, lay eggs, and don't suckle their young.

Biology was stood on its ear with the discovery of the duck-billed platypus. This contrarian has hair, lays eggs, and then suckles its young. It clearly defies placement as mammal or reptile.

I'm not concerned about my taxonomy meeting its own version of the platypus. I see them all the time. Consider processes, for example.

I define a process as a long-duration event. No event is truly instantaneous, but for most events our databases record events at a single point in time. The product was sold on the 24th. The patient was admitted at 8:00 A.M. But suppose your business constructs homes. The process of building the home starts one day and ends many weeks later. During the process, your other objects (workers, building supplies) are brought together to create the product. A customer may be involved before the home is completed, or even before the home is started. You can kick the house, but you can't kick the process. It's a long-duration event.

If you can design well for objects and events, you can fit in processes or other platypus-like real-world phenomena. This taxonomy is a starting point. The fact that it's incomplete doesn't mean it's not useful.

Biologists, by the way, eventually accepted the platypus. They realized that their taxonomies were useful, but not necessarily complete. Mother Nature is more given to continuums than to neat classifications. Naming colors red, orange, yellow, and so on, doesn't bunch colors at those points in the spectrum. But it's still a useful taxonomy.

Characteristics

Once you've listed the classes of objects and events that your system needs to record, you list the important characteristics of each object and event. A characteristic is important if you need it for your application(s).

For a person class you'd want to know name, addresses, phone numbers, and other data. Height and weight are characteristics of people. If you're running an outpatient treatment facility, you'll record height and weight. If your gallery sells old masters, you won't record these measurements.

The first item in every list of characteristics will be an ID. When we first enter an individual object, we'll assign it a unique ID. We'll always refer to the object by that ID. We'll never reassign that ID. (Did I say "never"? Make that "NEVER.")

Our events will have an ID, a time stamp, ID(s) for the participating object(s), and any other data. This is a design for a store that sells products to customers:


PRODUCT:            CUSTOMER:           SALE:
    product ID          customerID          sale ID
    name                name                timestamp
    weight              address             customer ID
    color               phone number(s)     products sold
    number on hand      email address
    price
	
Each product will have one record in the product table. The important characteristics will be the columns in the table. The intersection of a record (row) and column is a single field in the record.

Note how our SALE records the customer ID, not the customer data. This is called a foreign key. It identifies a specific customer, but it doesn't have any data about the customer. You look up data about that customer in the CUSTOMER table.

If you need to distinguish the customer ID field in the CUSTOMER table from the customer ID field in the SALE table, you call the former a primary key. When a primary key is used outside its home table, it becomes a foreign key.

Notice that this design handles a change-of-address notics without a hitch. Each customer's name and address are recorded in exactly one place. You edit that one record and you're done.

Beware of Paperwork

Data was collected long before computers existed. Many paper-based forms were used to record transactions. Many are still in use today. For example, there's an invoice associated with most sales.

Those papers were our predecessor model. An invoice says that Fred bought five widgets on Tuesday. The invoice isn't Fred. It didn't give Fred the widgets. It didn't take Fred's money. It simply records facts about the transaction. It's a model.

When you design a database, model the real world. Don't model someone else's model. You should be able to generate your own invoices from your system. If you need to coexist with a paperwork system, you could enter the invoice number, for example, as a characteristic of the sale.

It's dangerous to have an INVOICE table. Invoices are kickable. They're objects. The underlying reality is an event. Model reality and attach any necessary paperwork to your model. Don't model the paperwork.

Relationships

Relational databases have nothing to do the the relationships among your data items. The word "relation" is the mathematical term for a table. A relational database is one in which the data is stored in tables. In fact, handling real-world relationships is one of the weak points of the relational database.

There are relationships among your objects that may be important. Again, an important relationship is one that your application needs to know about. The rest are unimportant as far as your database design is concerned.

Consider your enterprise's medical benefits. If they apply to the family of the employee, you'll need to know about the family relationships between your employees. If Fred and Sally are married, you don't want to pay for Fred's wife and Sally's husband after you pay for Fred and Sally.

For each class of relationship, you'll need a table. Unless the relationships are permanent (sibling relationships are permanent, marital relationships are less permanent), your characteristics list will include an ID, the start and end dates of the relationship, IDs of the related objects and any other information about the relationship.

Here's a sample for your health insurance:


	FAMILY MEMBERS:
		relation ID
		start_date
		end_date
		employee ID
		related employee ID
		relationship type
Events are, as we noted earlier, a type of relationship. Don't repeat them here. Events usually relate one class of object (for example, customer) with another class (for example, product). The relationships we're talking about here usually relate objects in a class with other objects in the same class. (In my work, they've always related objects in the same class. I don't see any reason, however, to preclude cross-class relationships.)

These relationships have an inherent and unsolvable problem. One database design principle is to eliminate redundant data. (Record everything in exactly one place, and you can never have conflicting data.) Transitive relationships will complicate your life.

For you designmeisters, I don't mean the transitive relationships that third-normal form talks about, I mean the relationships that are transitive in the real world: if Sally is Sarah's sister, it's also true that Sarah is Sally's sister.

For nonredundant data storage, you would record this relationship just once. Database queries will be complicated by the fact that you often have to search two columns. But this is a minor problem.

The major problem occurs when you hire Sandra, who is sister to Sally and Sarah. The minimal amount of data is achieved by entering Sandra as sister of Sally. Then you do some fancy programming so that your software recognizes that if Sandra is sister of Sally and Sally is sister of Sarah, then Sandra is sister of Sarah.

If you get that code debugged, you'll realize that you've made a mistake. If the three women are children of the same parents, you'll reach the correct result. But once you begin to allow for death, divorce, remarriage, and so on, you'll see that only some sibling relationships are transitive. Explicitly entering all the relationships is the only way to prevent mistakes.

When you explicitly enter all the relationships, you will have entered at least some redundant data. While you are puzzling over the best compromise for your database, make sure you ask yourself this question: what happens when one of these women quits? If your application doesn't collapse, you'll be OK,.

I should point out that most databases are designed to accompany aopplications, but databases generally end up serving multiple applications. I concentrate on the application at hand because I've learned that it's easy to specify database elements; it's difficult and expensive to collect data.

Detail Tables

Once you have your objects, events, and necessary relationships listed along with the characteristics of each, it's time to look at the characteristics to find necessary detail tables. In a relational database, you'll need a separate table for each repeating characteristic.

For example, if your sale event can transfer multiple products to the customer (this is almost always the case), it includes one or more products, quantities, and prices. This requires a separate table in a relational database.

If you're designing for an object database, it's not necessary to create a separate class for these details. They can be stored along with the parent object. Putting the details in a separate class may shrink the parent data by an order of magnitude, however. Always consider the relational approach as a possible alternative.

Characteristics change over time. People change their hair color quite routinely. With today's contact lenses, eye color is hardly permanent. Less often, but still commonly, people change their names (perhaps taking a spouse's name after marriage).

Our tables all incorporate an ID, so changing characteristics isn't a major problem. Sometimes, however, you'll also need to create detail tables (or, for objects, array data members) to record history details.

Repeating Characteristics

Let's revisit that simple store database. It looked like this:

PRODUCT:            CUSTOMER:            SALE:
    product ID          customerID           sale ID
    name                name                 timestamp
    weight              address              customer ID
    color               phone number(s)      products sold
    number on hand      email address
    price
	
There are two plural characteristics. For "phone number(s)" we can break down specific, individual fields this way:

CUSTOMER:
    customer ID
    name
    address
    office phone
    home phone
    mobile phone
    email address

For "products sold" in the sales event table, we can't do that. Novice designers may try to allow several columns for multiple instances of a single characteristic. Experienced designers reject that approach in almost every case. The right approach is to create a separate table for sale detail records. Its design might look like this:

SALE:            SALE_DTL:
    sale ID          sale detail ID
    timestamp        sale ID
    customer ID      product ID
                     number sold
                     price
					
Suppose Fred (remember Fred? We say him last in seat 17A, editing JDB.INI) walks into the store and buys a widget, two doohickeys (on sale), and three thingamabobs. This is what the relevat records in the database will contain:

CUSTOMER:
    23, Fred,        someplace,        (999) 999-9999, ...
    
PRODUCT:
    14, widget,          14.2,      red,   1024, 19.95
   144, doohickey         8.8,      blue,   127, 29.95
   145, thingamabob      18.0,      white,    4,  3.95
   
SALE:
  12345,    19980201, 23
  
SALE_DTL:
  123456,    12345,         14,    1,     19.95
  123457,    12345,        144,    2,     24.95
  123458,    12345,        145,    3,      3.95
  
All the tables start with their ID values, Fred is customer 23, for example. (Relational databases are completely independent of column ordering. IDs are placed first by tradition. Placing them elsewhere will be no problem for the computer, but we humans will get confused.)

The PRODUCT table will probably have numeric color codes, not words like "red." And commans aren't part of the records except for data import/export operations. Otherwise these are typical tables. Of course, the many thousands of other records in these tables aren't shown here. We're looking at the handful relevant to sale 12345.

The SALE record is very simple. This is not typical of actual SALE records. Many will incorporate sales tax rates, some will need shipping data, and so on. Real databases have to reflect all the messy details of the real world. This SALE record says that customer with ID 23 made a purchase on 2/1/98. What did he buy?

You look in the SALE_DTL table for records where the sale ID is 12345. Our example has completed this search and come up with three records. The first is for product ID 14, one unit at $19.95. This happens to be the same price that product 14 shows in the PRODUCT table. The next detail record shows two products 144 purchased for $24.95. The $29.95 price of product 144 in the PRODUCT table shows that blue doohickeys were on sale when Fred bought his.

Assembling this sale to show, for example, an invoice on the screen is very fast, thanks to the magic of indexed lookups. In an earlier book I went through the time calculations for the necessary lookups to assemble a sale such as this one. I showed that if there were 10,000 customers, 100,000 sales and 1,000,000 sale detail records, the whole lookup process still provided a subsecond response. Those calculations were for a 25MHz, 80386-based machine. The moral for today: Don't worry about it. Any lookup delay gets hidden in the screen refresh process.

Use Sequential (not Random) Keys

This is a good time to mention the ID values. They are often called keys. The characteristics of the key are that it is a unique identifier and that it is never reassigned. The latter characteristic is redundant if you take the former seriously enough.

The simple way to generate such an ID is to assign 0 to the first record entered, 1 as the key of the next record, and so on. Initially the IDs will be the same as the record numbers. Once you delete a record, this stops being true.

A small table records the highest key assigned. The important point is that deletions never be allowed to affect the next key assigned. If your system assigned the value 100 last and then someone deletes four records at the end of the table, the highest remaining key will be 96. The next record must be assigned key 101.

Some DBMSes get this all wrong. The have an automatic unique ID capability, but assign random keys. They generate a random number, check it for uniqueness and hand it out if it's not in the table. This is completely unacceptable.

Your customer, 12345, moved to another continent. A year or so later you've rolled the inactive customers (including 12345) onto backup media. 12345 is no longer in the database. The random, not-really-unique key mechanism gives out another 12345. Now what happens when you do an ever-to-date search of your data warehouse for customer 12345?

Right. Both customers 12345 seem to have made each other's purchases. If you start at the sales table and then look up customers based on their ID, there's no telling who you'll find. On average, you'll be right half the time. If you're trying to regenerate invoices for 12345, your accountants (or the accountants for both customers 12345) will have unkind things to say.

Your ID value must be unique and must permanently identify exactly one record in the table. It must do that job forever. ("Forever" means from the start to the end of our system's life.)

Uniqueness applies within, not across tables. Every table I build starts with key 0. That's not a problem.

One final note on repeating characteristics. Ask yourself if the characteristic permits one or more occurrences, or if it permits zero or more occurrences. The database design will be the same, but your application software needs to know this. Occasionally you'll find some other lower limit, and you'll also find upper limits in some instances. Document these things so they get coded correctly. The design shown here will handle zero or more occurrences (which means it will certainly handle one or more) up to whatever fills your disk drive.

Characteristic Histories

In addition to these detail records for repeating characteristics, you may need detail records for characteristic histories. We said in the sample earlier that the doohickeys Fred bought for $24.95 were on sale. The PRODUCT table showed the regular price was $29.95. This is the relevant data:

CUSTOMER:
    23, Fred,        someplace,        (999) 999-9999, ...
    
PRODUCT:
    14, widget,          14.2,       red,  1024, 19.95
   144, doohickey         8.8,       blue,  127, 29.95 <--
   145, thingamabob      18.0,       white,   4,  3.95    |
                                                          |
SALE:
  12345,    19980201, 23                            discount?
  
SALE_DTL:                                                 |
  123456,    12345,         14,    1,     19.95           |
  123457,    12345,        144,    2,     24.95 <---------
  123458,    12345,        145,    3,      3.95
  
Was this a sale price? Maybe the price is $29.95 today, but it was only $24.95 when Fred made his purchase. From the data given, we don't know. The $29.95 price is presented as if it were an immutable fact of life for blue doohickeys. That's probably not the truth.

When you need to maintain a historic record, you add another type of detail record. For PRODUCT prices, this design will work:


PRODUCT:            PROD_PRICE:
    product ID          prod_price ID
    name                product ID
    weight              start date
    color               price
    number on hand

    
These are the entries in the PROD_PRICE table for blue doohickeys:

PROD_PRICE:
       32,        14,        19940201,    49.95
     4279,        14,        19950201,    39.95
    14396,        14,        19960201,    34.95
    27819,        14,        19970201,    39.95
    
To find the current price of a blue doohickey, you find the last entry for product ID 14 in the PROD_PRICE table. (I assume this table is sorted by date within product ID.) Note that the start date alone is sufficient. The end date is emplied by the next record's start date. If there isn't a next record, you're looking at the start date of the latest price.

When you program this, you'll find it's even simpler if you do a descending sort on the start date. That way, the first record is the current price. The first record that has a start date earlier than or equal to a target date holds the price on the target date.

Method Summary

These methods are often called "methodologies," which is too sloppy and too polysyllabic for me. In fact, even "polysyllabic" is too polysyllabic for me. A method for doing something is a method. Like biology or theology, methodology is the study of methods.

This is my method:

Step 1. Objects

You create one table for each class of object. Loosely, an object is anything you can kick.

Objects can be abstract things such as mathematical formulae or laws of physices. A better definition of object is an entity with a separate existence and a relatively long or permanent life.

In practice, focusing on kickable things lets you identify most objects.

Step 2. Events

Each class of events also gets a separate table. An event is something that happens at a point in time. Nothing useful actually gets done at a point in time, of course. But if our database works when you record a single time stamp, that's close enough to a point for our design work.

Events create a relationship among our objects at their point in time. You sell products to a customer, for example.

Events have time stamps. Objects don't. Events happen to objects. Objects exist by themselves.

(If an important event is identified in step 2 that doesn't involve any objects from step 1, you'll have at least one new class of object that should have been listed in step 1. Real design is like that.)

Step 3 Characteristics

Each object listed in step 1 and each event from step 2 is a data table. In this step you list their important characteristics. These are the fields (columns in the table). A characteristic is importnat if it's needed in your intended application.

Although a database should be able to serve multiple applications, you should hesitate to put in data that is not immediately useful. Collecting reliable data is never cheap.

Every table's first characteristic is an ID. For permanent uniqueness, I use integer IDs that are never reassigned after they are handed out. Deleting a record effectively deletes the ID, too.

Step 4. Relationships

An additional table is created for each important relationship. Again, a relationship is important if it's necessary for your applications.

For example, family relationships are important to avoid duplicate billing of insurance benefits.

The characteristics of a relationship table are:

Step 5. Repeating Characteristics

Examine your characteristics for multiples or plurals. In some cases, you can break down a plural into multiple characteristics, such as phone numbers becoming home, office, and cellular. In most cases, however, a repeating characteristic requires a new detail table.

Characteristics of a repeating detail table are:

The example given was the common sales table. A sale includes one or more products sold. The detail record includes the ID of the product sold, the number of units and the price.

For object databases, these details may be a separate class, or they may be included within their parent objects.

Step 6. Characteristic Histories

When you need to track the changes in a characteristic that occur over time, you need an additional table. (Again, the object designer may include this data within the parent object.)

A history table's characteristics include

Real Design Rules

These steps are idealized. Most designers write out rough lists of characteristics as soon as they identify a table. Objects suggest events and events suggest objects, so steps 1, 2, and 3 are done together.

Steps 4 and 5 are done as each table's characteristics are enumerated. Experienced designers go right to step 4, rather than listing a repeating characteristic in step 3 and then removing it in step 4. The same applies to step 5.

Steps 4 and 5 have to be repeated for the characteristics of the new tables created during steps 4 and 5. You're done when there are no more repeating characteristics and all that need histories have histories.

These steps are not useful if you attempt to let them dictate the order of the design's development. They are, however, useful as a checklist to ensure that nothing is forgotten.

I have never worked on a real-world design where the number of tables has not been far in excess of early estimates. I have not always resisted the temptation to "simplify" the design by combining tables. I have always regretted the result when I failed to resist this temptation.

The complexity of the implementation is linearly proportional to the number of tables. The complexity is geometrically proportional to the number of "simplifications" you make after you've decided your design has too many tables.

Normalization

If you know normalization, I'm going to show here that the preceding simple procedure is provably superior to a classic approach to, say, third-normal form design. If you don't know normalization, I'll teach you enough of the jargon here so you'll be able to go toe-to-toe with any classically trained database administrator.

First-Normal Form

There are five numbered normal forms, first through fifth, and there are other named intermediate forms. Each of the numbered normal forms is better than and inclusive of all the lower forms. A design in fourth-normal form, for example, meets al the requirements of third-normal form and adds other desirable qualities.

If you want to become proficient with normalization, you'll need a good text and about a week. If you use my design method, your results will be consistently superior to all but the very best traditional normalization-based designs.

First-normal form specifies only that the data be stored in tables. In our first three steps we list tables, but we leave nontable data, such as repeating characteristics. Eliminating these in steps 4 and 5 puts our database design into first-normal form.

Second-Normal Form

"A database is in second-normal form if it is in first-normal form and every attribute is fully functionally dependent on the primary key."
Would you like that in English?

Let's see. An attribute is what we'd call the content of a field. If we had a table for books, the value in the Author field for this book would be "Rinehart." That's the author attribute for this book.

Now I'll try primary key. Our ID values are the primary keys in our designs. They uniquely identify a particular record. What else could we use for a unique identifier?

In the case of books, titles are almost unique. Publishers try to make them unique. For my last book, I remember seeing a book on the shelves at my local bookstore sporting my book's working title. I called my editor immediately. We changed our book's name. If I hadn't happened to be browsing that day, or if my bookstore hadn't happened to carry that title, we would have published the second book with that title, which goes to show that you can't uniquely identify a book by its title. You can, however, combine the title and publisher fields to form a unique identifier.

You might combine other attributes to try to create a key. You could, for instance, combine title and author. That combination might uniquely identify books. In normalization theory it's called a candidate key. The primary key is the one you select as a unique identifier. Let's choose title plus publisher, just for the sake of this discussion.

Bear with me now. I told you this was jargon rich. The "functional dependence" bit comes next.

An attribute is functionally dependent on a key when specification of the key specifies a single attribute value. Search your book table's title and publisher for this book, and you'll find "Rinehart" in the author field. Always. That attribute is functionally dependent on that key. Search on just the publisher, and you'll find lots of different values in the Author field. (They publish lots of books, of course.) Author is not functionally dependent on publisher. Search on the book's title, and we fervently hope that you'll always come up with "Rinehart" in the Author field, but we can't be sure some other publisher isn't working on the same title.

Now let's add "fully" to the functional dependence concept.

An attribute is fully functionally dependent on a key if it is functionally dependent on the entire key, but not on any combination of key attributes less than the entire key. In our book table, "Rinehart" is uniquely determined by the combination of title and publisher. It is not uniquely determined by the publisher. We'd like to think that the title will uniquely determine "Rinehart" as the author attribute, but we're not sure. In fact, if your book table is large enough, you're sure to find an instance where one title identifies multiple author attributes, so the author is not functionally dependent on the title. Since the author attribute is functionally dependent on the full primary key, the author attribute is fully functionally dependent on the primary key.

All of which means that our book table is in second-normal form. Now will you permit me to simplify things?

By assigning an abstract key (one not based on the data in the record) and by adopting a mechanism to ensure that these abstract keys are unique within the table, we achieve full functional dependence with no muss, no fuss, no bother. If this book is the book with ID 12345 in your table, for example, its author is "Rinehart." Look up ID 12345 and you'll always get the same title, publisher, and author. Each attribute is functionally dependent on the ID.

Our IDs are a single item, not a compound of other attributes. If your key isn't compound, functional dependence is also full functional dependence. The attributes can't depend on just part of the key, because it hasn't got parts. Our ID values make all this a lot simpler, don't they?

Third-Normal Form

"A database is in third-normal form if it is in second-normal form and contains no transitive attribute dependencies."
This one's a little easier to translate, but may be harder to understand.

If an attribute is functionally dependent on a key, you can say that the key implies the attribute. Our book table's ID for this book implies that the author is "Rinehart." A transitive dependency is one in which

    A implies B
    B implies C

    transitively, A implies C
Of course, if you have precisely one key, which you do if you assign an ID as I've suggested, you'll never have transitive dependencies. Our tables are also in third-normal form.

Almost. Using my method or using traditioal normalization, you can get trapped by the problem of unintended (or even unknown) dependencies. These are usually created by someone else and left around to trap unsuspecting database designers.

Suppose you've created a simple person table, which includes name, address, and phone numbers. Fred (remember Fred?) lives on the upper-east side of Manhattan. His home phone is (212) 123-4567. His ZIP code (the U. S. postal code) is 10028. His city and state are New York, NY. Have you counted the redundancies here?

The area code (the first three digits of the phone number) is 212. New York City's central borough, Manhattan, is so populous that it has its own area code: 212. The postal code 10028 identifies a small part of Manhattan. Both city and state are implied by both phone and ZIP code. The ZIP code implies the first three digits of the phone number.

This problem exists whenever your database includes portions of someone else's identification system, such as postal codes and phone numbers. You can ignore it in many applications, or you can incorporate (if they're available) the outside source's tables.

You'll find when you try to model the real world that your model probably can't avoid other people's models. The real world is a messy place.

Boyce-Codd Normal Form

Boyce-Codd normal form is a very interesting form. It has a relatively simple definition, and it is based on first-normal form, not a higher one. But it has been proven that a table (relation) in Boyce-Codd normal form is also in third-normal form.

Tables in third-normal form are not necessarily in Boyce-Codd normal form, so you can think of Boyce-Codd as 3.5-normal form. Boyce-Codd eliminates some database problems found in third-normal form designs. This is the definition:

"A relation, R, is in Boyce-Codd normal form if its key, K, implies all nonkey attributes — a, b, c, ... — and K is a superkey."
Right. That's just what you were thinking, isn't it?

No? Just in case you're not a student of this arcane area of computer science, I'll translate this into English.

Begin with this notation to describe a table (relation) R that contains fields (attributes) a, b, c, ...


	R(a, b, c, ... )
	
This notation describes a table (relation) with a key K that implies the values of the rest of the fields (attributes):

	R(K -> a, b, c, ... )
	
(The traditional relational literature uses all uppercase and underscores key fields. My version does the same job but avoids typesetting problems.)

The key K may itself be a concatenation of individual attributes, such as our title and publisher concatenation. For example:


	BOOK( TITLE + PUBLISHER -> author )
	
The key is called a superkey if it implies all other attributes and if no portion of K less than the whole implies all other attributes. In other words, none of the components of K can be removed. Our concatentated key is a superkey because neither title nor publisher by itself will identify just one author.

In relational math, the record or row in the table, is called a tuple (which rhymes with "couple"). If K is a single attribute and K implies all the fields (attributes) of the record (tuple), then K is a superkey.

If you use a unique abstract key, such as our ID values, then you can describe your relations this way:


	R( ID -> a, b, c, . . . )
	
ID is always a superkey if you define a unique ID for each record (tuple) in your table (relation). So our tables are always in Boyce-Codd normal form. Guaranteed.

I'll bet you're glad to know that. It means that you can forget about transitive dependencies and functional dependencies, including full functional dependency. You can, that is, unless you have to deal with a Traditional Database Administrator (TDA). If you have to deal with a TDA, you can launch zingers like this:

TDA: Have you removed transitive dependencies?

You: Transitive dependency? You must be stuck in third-normal form. Get with it! We're 100% in Boyce-Codd normal form.

If you're going to use an object database, you'll call our ID an object identifier, not a key. You'll see that the characteristics of an object identifier are that it must be unique within the class and it must be permanently unique. Sound familiar?

More Normal Forms

There are normal forms past Boyce-Codd, including fourth, fifth, domain-key, and projection-join normal forms. Fifth-normal form has been proven to be the end of the line. (Some observers suspect that there are additional, undiscovered avenues that will invalidate this proof and provide additional normal forms.)

Fourth-normal form designs are in Boyce-Codd normal form, but not all Boyce-Codd designs are fourth-normal form designs. Just as Boyce-Codd is better than third-normal form, fourth-normal form is better than Boyce-Codd.

Domain-key normal form is third-normal form with additional constraints. It's roughly parallel to fourth-normal form. Not all fourth-normal form designs are domain-key normal form designs. Not all domain-key normal form designs are in fourth-normal form.

Being in both domain-key normal form and fourth-normal form is better than being in just one or the other. Both imply Boyce-Codd.

Projection-join normal form is between fourth- and fifth-normal forms.

In an earlier book, I showed that my design yielded designs that were in both fourth-normal form and domain-key normal form. It took a whole chapter to demonstrate that, however. If you want to look it up, it's Client-Server dBASE Programming (Addison-Wesley, 1994). I'd suggest a trip to the library, since most of the content is about coding in dBASE IV for DOS.

[This is a link to the relevant chapter in that book.]

In the same book I cheerfully confess that the question of whether my method actually gets to fifth-normal form is beyond my capabilities, or maybe, beyond my patience. I believe my designs are fifth-normal form, since I've never seen any of my designs show the problems that are associated with non-fifth-normal form designs.

Maybe one of you who is real wizard with normalization could prove it, one way or another?

Redundant Data

I've just inflicted a rather heavy dose of condensed theory on you. If you didn't have a lot of fun with it, I apologize. In this section we're going to leave theory behind and get on to constructive design cheats. These are the little, practical violations of good design that can make a system really sing.

I've learned that cheating on the design by attempting to reduce the number of tables is a prescription for trouble. Don't do it.

But I've also learned that adding deliberate redundant data can be very constructive. I'll suggest two examples here. You'll find more.

Before you start adding redundant data, however, remember that you are working close to the cliff's edge. Remember these rules:

Make sure that everyone involved with the data knows about the redundant data and understands that there is a possibility of inconsistency whenever there is redundancy. Storing one fact in two or more places means that the fact could be correct here and wrong there.

Database updates should be restricted to the source data. The redundant data should be strictly read-only. Edits to the source data should automatically update the redundant data.

Batch procedures should be available that re-create the redundant data from the source data. Do not assume that the online procedures for updating redundant data are adequate. (Never ignore Murphy's Law.)

With those warnings, let's begin with totals.

Totals

If you record totals in your database, you'll be storing redundant data. Actually, this isn't just a problem with totals. Any fact that is derived by the application of some formula to the other data in your database is redundant.

On the other hand, storing totals can be very useful. You can total the individual detail items and store the total in the SALES table. That way you can see the total for a sale without accessing the detail table (which will be your largest table). Forming and answering a query such as, "What are the total sales for this customer?" is much easier.

Chances are, there are levels above the sale where additional totals can be very handy. You might total sales by product, total product sales by product categories, and so on.

If your system automatically pushes totals upward through a hierarchy, you'll have a lot of useful information available instantly. If you provide a sales total in the PRODUCT table, for example, you can look up the product and see its sales immediately.

Program each sale to post up to PRODUCT. Don't forget returns and adjustments. Make sure that the value in PRODUCT can't be edited. Keep a batch procedure that you can run every evening that recomputes the totals from scratch, just in case.

From PRODUCT, you can continue to post changes upward into product class, storewide totals, regional totals if you have multiple stores, and so on. You can get a lot of information this way. (Information is your target output. It's what happens when you aggregate a significant amount of accurate data.)

Current Values

Remember those characteristic history tables? If prices will vary, you need a product price history table, instead of just storing the price in the PRODUCT table. You get the current price by looking into the price history table for the latest value.

You can also duplicate the current price in the PRODUCT table, so you won't need a secondary lookup to find this frequently requested item. Again, the same rules apply.

Make sure the software that updates the price history also updates the PRODUCT table's value. Ensure that the value in the PRODUCT table can't be edited. And hava a batch procedure to re-create the latest prices in the PRODUCT table by reading all the history data.

Or take another approach.

Make the PRODUCT table price the source data. Let it be editable. Whenever that value is changed, post a new entry to the product price history table. The latter becomes the redundant data that is available on a read-only basis.

The batch procedure that updates the history tables will look at these tables and the PRODUCT table. It adds a record to the history table if the values are different.

It's not important that one table or the other be the source. The important point is that you decide clearly which item is source data and which item is redundant. The source data can be edited. The redundant data is read-only. Edits to the source get posted to the redundant data. Edits to the redundant data are as close to impossible as you can make them. Batch programs re-create redundant data from source data.

Summary

In this chapter I've shown you my method for designing databases. I use the same approach for relational and object databases.

Begin by identifying the classes of objects (kickable things) that are important to your system. Then identify the events (things that happen to your objects at a point in time) that are important. These are the first tables in your system.

If there are important relationships among your objects, these are described in additional tables. To avoid duplicate payments for health insurance, for example, you need to know which of your employees are related.

List the characteristics of each object, event and relationship. These become the fields (or object data members) in each table. For the first characteristic in every table, use a unique ID value. Use integer IDs that are incremented as each record is added, and ensure that they never get deleted. Some DBMSes use random IDs that aren't dependably unique. These are unacceptable.

Then examine the characteristics for repeating (one or more, or zero or more) fields. In a relational design these require separate tables. In an object design you can make them a separate class, or keep the data as part of the parent object.

Also examine the characteristics for ones that need a history of changes. These too become separate tables in a relational design.

Keep refining your design until none of the tables has repeating characteristics and all the characteristics that need histories have associated history tables. Once that's done, your design is ready.

Next I showed that the tables in our design are automatically in Boyce-Codd normal form. In fact, they're in fourth-normal form and domain-key normal forms, too. You who were new to normalization met a lot of jargon from the classic normalization approach to database design.

Finally, I suggested that you introduce some redundant data. Keeping totals, for example, simplifies many queries. Keeping the current item in the parent table eliminates a lot of looking up in history tables. This can work acceptably if you ensure that only the source data is editable. The redundant versions must be read-only, and a batch process must regularly update the redundant values from the source.

Now if only we had a tool to create data tables, we'd be able to put this knowledge to use. That, as you probably guessed, is the direction we'll be heading in Chapter 12.