Monday, January 09, 2006

Database 101: In Databases, normal is GOOD!

Let's say you've got your E-commerce web site online, selling Wodgets. Mr. John Smith logs on, punches in his name as a brand-new customer "John Smith," and orders a month's supply.

When the time comes for Mr. Smith to order more Wodgets, he's in a hurry, and puts in a name of "J S" on the second order. And on the third, he's off on a business trip, and asks his girlfriend to place the order for him. She gets to the website, punches in his name, "J Smith," and the third order is shipped.

Now, you offer a special deal: "Any customer who's ordered at least three times from us gets his next order for FREE." (Hey, I never said my examples would be sane or logical!)

Our friend Mr. Smith is elated to hear this; it's time for his fourth order anyway. He leaps to the computer, punches in his name "John Smith" (the whole thing, this time there's free stuff involved, it's got to be perfect), and waits for the response.

The website searches the database, finds one match--the very first one--and reports back "I'm sorry, you've only ordered from us once, so you aren't eligible for this month's special."

"Stupid site," Mr. Smith complains, "I know I've ordered three times." He narrowly avoids putting his fist through the monitor out of frustration, and decides to order his Wodgets from a different supplier from now on.

So what went wrong? Why didn't the website understand that "John Smith," "J S," and "J Smith" were all the same person? The answer: Normalization.

Normalization is one of those database geek words, like Relational, Left Outer Join, and Whizbang, and it's used to describe a well-designed database. Normalized databases are idiot-proof...well, about as idiot-proof as a database can be, anyway.

Take phone numbers, for example. If you put a text box on your website called "Phone," how many different ways can a user enter a phone number?

Without area code: 123-4567
Without area code or symbols: 1234567
With area code: 899-123-4567
With extra symbos: (899) 123-4567

Here's the problem. When the database tries to do anything with all of these phone numbers, it's going to choke. It won't know that 1234567 is the exact same number as 123-4567. As far as the computer is concerned, these are two entirely different chunks of text.

If we Normalize, though, we can easily solve the problem. One way to do this would be to use three text boxes instead of one: Area code, then prefix, then suffix. The database can store these separately, or, more likely, the "save record" program in the website will merge all three together into one phone number to store in the database. All of the phone numbers will be stored in an identical format, and there won't be any confusion trying to pull the information out of the database. This same strategy can be applied in a bunch of different ways, like the Customer Name example above, or US States (Arizona, Ariz, AZ).

Normalization means protecting the database from the user by only allowing them to enter good data. To correct Mr. Smith's issue, a returning customer might be required to choose his name from a list rather than entering it fresh on every visit. That's actually a good idea across the board--the more places the user is required to pick from a list, rather than type in data, the better.

The key point with Normalization, though, is that it has to be part of database design--it can't be tacked on as an afterthought. You can tell if the database is not normalized if searches look like "If the state equals Arizona, or if the state equals arizona, or if the state equals AZ, or if the state equals az, or if the state equals Ariz..."

There's nothing more frustrating to a database geek than to be forced to deal with non-normalized data. "Oh, we had six different data entry people, and we just entered it any old way; can you fix it up real quick and merge it into the master database in the next fifteen minutes?" That's the kind of sentence that can lead database programmers into investing in mercenary magazine subscriptions and large ammunition collections.

There's more to Normalization than this, but this is the general idea. Keep it in mind as we explore other database topics, because it really comes in handy when design problems show up.

-=ad=-

No comments: