Most common data modeling errors
I will describe mistakes that development teams may make in an application development project during development and implementation of a logical and physical data model. I have encountered these situations many times over the last 13 years (and even made some of them myself in the earlier years). The goal of the article is to allow you to learn from other people’s experiences and increase the chances of success on the first try. My experience has been concentrated on Oracle databases, so most of the recommendations deal with techniques you can use with Oracle.
Even though many of the recommendations will seem obvious, you may recognize your own projects in some of these mistakes. But perhaps it will be comforting to know that you are in good company, as these errors are far too common, even though Oracle and other relational database systems have been available for years.
Oracle is a sophisticated relational database management system that supports complex data structures. While there is enormous flexibility in building a data model, many of the options in the Data Definition Language (DDL) have default choices, which encourages the temptation to just “build it and go,” focusing exclusively on the logical functionality of the application without planning for the physical operation of the database. However, the complexities also require some planning in advance, to allow the application to perform adequately as it scales up to thousands of users, millions of records, or hundreds of gigabytes of storage. The following table summarizes common mistakes and the problems that they can cause.
|Mistake||Impact||Symptoms||Benefit of Correct Action|
|Failure to address long-term growth.||Severe||Chronic space shortages.
DBAs in constant reaction mode.
Expense of acquiring storage space.
|Predictable cost for storage space.
Space is available before it is needed.
Time for maintenance tasks (backup, recovery, upgrade) can be predicted and scheduled.
|No routine way to clean out old data.||Severe||Unbounded growth of old data.
Expense of acquiring storage space.
Degrading performance over time.
Difficulty dealing with ad hoc deletions on massive scale.
Excessive production of redo logs; recovery time increases.
|Reduced cost of storage space.
Efficient methods for discarding old data.
Predictable steady-state operation.
|Overuse of synthetic keys.||Minor to moderate||Server maintains unnecessary index.
Insert transactions serialize on unnecessary sequence generator.
|Improved performance of high-insert transactions.|
|Inadequate planning of indexes.||Moderate to major||Server wastes time and space maintaining unnecessary indexes.Suboptimal query plans.||Improved performance.|
Failure to Address Long-term Growth
Logically, all that is needed to construct a data model is CREATE TABLE, CREATE INDEX, and other related commands. Using these commands, your logical entities become your physical tables, and you’re done with the data model, right? But operations that work fine on small tables may perform inadequately when a table grows to production size. I have seen many databases for which someone just created tablespaces with a few gigabytes, and then created the tables and indexes; then these creators had to deal with endless crises as they found that the tablespaces were not big enough. Such crises are even more acute when solving that space shortage involves buying more disk drives, since the procurement and installation of disk is rarely fast enough to solve a crisis.
To avoid this, you must focus more attention on the physical storage of large tables during the design of the data model.
- Estimate both the size of the table and its long-term growth pattern. Extreme precision is not required, but the estimate should be based on the actual needs of the table. A table exists to store some real-world information, so use the underlying business needs of the table to produce a simple model (perhaps in a spreadsheet) estimating the number of rows. The business domain experts should provide input to this process. How many customers, clients, or patients will there be? How will the population grow over time? How many accounts, orders, bills, payments, and so on, will a typical customer have over time? How many detailed records with there be on average for a given parent key? How long will information be retained? Answering these questions may be difficult in the case of a new business or application, but frequently there is enough experience in the business to provide at least a crude estimate (of course, it is better to err on the high side than the low side).
- Estimate the physical size of the tables. Again, extreme precision is not required, but the result should be based on reality. Something as simple as the sum of average column lengths times the number of rows, plus some overhead for PCTFREE, will yield at least a reasonable order of magnitude. Obviously, if the database already exists, then the actual tables can be used as an accurate gauge of the storage needs, though one should still extrapolate growth into the future.
- Don’t forget to estimate the physical size of the indexes. This is more difficult than estimating table sizes, because the actual size can depend on how compactly Oracle stores keys in non-leaf blocks. But even a crude estimate is better than none at all. A good starting point would be the average length of indexed columns times the number of rows, plus some overhead for PCTFREE. For bitmap indexes, a good estimate would be the number of unique values of the indexed column times the number of rows divided by bits per block. A word of caution: in my experience, indexes may consume up to twice as much space as these simple models would predict. You have to try it on your database to know for sure.
- Use this information to estimate the need for disk space both now and several years in the future. It is not really helpful to know a table’s current size if it is going to triple in size in the next three years. Modeling the growth will allow the DBA to construct tablespaces of adequate size without having to react to ever more critical shortages of space. It also allows you to allocate adequate time for maintenance activities such as backups or upgrades, since the time required for these activities is proportional to the size of the database. It may also reveal that a table needs to be archived or cleaned out on a regular basis.
- During development, validate the predictions by comparing predicted bytes per row with the actual row sizes of the tables and indexes in the development or test environments, and adjust the prediction model accordingly. This is especially important for the indexes, for which actual space usage may differ from the predictions by as much as a factor of two.
No Routine Way to Clean Out Old Data
Some parts of your data will be dated, and will have a limited lifetime depending on the needs of the business. Perhaps the business needs to keep orders or payments online for three years, and user activity logs for the last three months. Frequently, programmers will focus on the routine operation of an application, and will pay great attention to the modules that insert the new records. But the application itself takes no specific action to “expire” the old data, so no one is concerned with doing this efficiently, if at all. Naturally, this leads to unbounded growth. On a large table, this can be a disaster. When there is no plan for cleaning out old data, database operations people will typically respond to emergency shortages of space either by scrambling to delete huge numbers of rows or by adding space on an ad hoc basis, neither of which are good long-term solutions.
In Oracle, the DELETE operation is intended for online transaction support, and as such it is most efficient only for small numbers of rows. For every row deleted, the table and index blocks containing the row must be read, recorded in a rollback segment, modified, written to the redo logs, and written back to the disk. All this disk I/O is essential for integrity of the database (which is of course why you are using Oracle anyway), but it is not efficient for large-scale clean-out from a big table. Furthermore, the amount of rollback and redo produced will increase the number log archives and thus increase the time required to recover from a backup.
Oracle’s solution to this problem is range partitioning, first introduced in version 8.0. A range-partitioned table is physically separated into many different partitions, and Oracle automatically adds rows to a particular partition based on the value of a specific column in the row (called the partition key). Each partition accepts values in a specific range (hence the name), and the ranges of the partitions are consecutive and non-overlapping. There are many advantages to partitioning large tables, including the following:
- All rows in a given range can be eliminated with a single small, efficient DDL transaction by simply dropping or truncating a partition.
- For queries that have a condition limiting the value of the partition key, such as partition key BETWEEN :lower_bound AND :upper_bound, the query optimizer can skip partitions that will have no matching rows, reducing I/O required to process the query, an operation known as partition pruning.
- Maintenance operations can proceed on individual partitions while the rest of the table is still available to support users.
More details about partitioning can be found in chapter 11 of Oracle 9i Database Concepts and chapter 17 of Oracle 9i Database Administrator's Guide, including specifics on how to set up and managed partitioned tables.
In order to reap the benefits of partitioning, several choices must be made in advance:
- Which column to use as the partition key.
- Which intervals to use as the partition ranges.
- Whether to partition the indexes, and if so, how to do so.
- When to add new partitions and drop old ones.
There are plenty of resources to advise on these decisions, so I’ll not cover them in detail here. The main point to remember is that partitioning is by far the most efficient way to deal with large tables that must be cleaned out on a regular basis.
Overuse of Synthetic Keys
As you know, the primary key of a table is a set of one or more attributes such that every row in the table has a unique value. When these attributes are things that the business uses routinely, the key is called a “natural key.” Examples could include Social Security numbers, vehicle identification numbers, customer or patient IDs, account numbers, and so on. However, sometimes a table has no obvious natural key, or the natural key has many columns. Database designers frequently create a key using a sequence generator. This is called a “synthetic key,” because it serves the purpose of a primary key but has no real meaning to the business. Some database designers and reference materials use the term, “surrogate key.”
I have seen data models in which every table in the whole schema was keyed by synthetic keys. While there are valid reasons to use a synthetic key, it is going overboard to extend the practice to every table in a data model, for these reasons:
- There are tables for which a synthetic key simply serves no purpose. The primary key index is a waste of space, and the server must waste time updating the index during inserts and deletes. One example would be a table of log entries; it probably does not need a key at all. Another example would be the detail table in a master/detail relationship. Detail records are typically fetched by the master key; there is little benefit in giving the detail rows their own synthetic key.
- Simultaneous transactions that insert new rows must serialize on the sequence generator as they each take turns getting new key values. If there are dozens or hundreds of such transactions running in parallel, as in an OLTP system, then such serializing will limit performance. (Sequence caching can mitigate this serializing effect in Oracle Parallel Server or Oracle Real Application Cluster, but it does not help at all on a single-instance database.)
So when should you use a synthetic key? I have encountered these situations for which a synthetic key was useful:
- The natural key is something that might change over time, such as a car's license number. A synthetic key can remain fixed while the other attributes change.
- The natural key has many columns, and there are many places in your application where a program must fetch a single row by key value. A synthetic key would allow the application to identify a row by only one value.
- The natural key has many columns, and the table is the target of foreign references. It is generally impractical to repeat all of the natural key columns in the child tables; a synthetic key allows you to create foreign references with just one column.
You may find other reasons for using a synthetic key. But rather than habitually defining a synthetic key for every table, it is a good idea to make sure that the benefit of a single-column key outweighs the cost of indexing and the cost of generating sequence numbers.
Inadequate Planning for Indexes
One of the most challenging tasks in creating a logical data model is to decide which columns to include in indexes. Unfortunately, there are no magic criteria that are guaranteed to tell you which indexes need to exist and which ones don’t, but there are some rules of thumb that can serve as a starting point. These generally require a decent understanding of the way the tables will be accessed by the application programs, report programs, and (if possible) ad hoc analysis. It is beyond the scope of this article to explore all of the situations for which an index is a good or a bad idea. The Oracle9i Application Developer’s Guide — Fundamentals has some very good advice in a section titled, “Indexing the Correct Tables and Columns,” in chapter 5.
So why did I bring this up in a discussion of common mistakes? Because in many cases developers are tempted to apply simple rules of thumb globally across the entire data model without any consideration of whether the resulting indexes are really useful or not.
Foreign Key Indexes
Some data modeling tools have options to create indexes on all foreign keys, and some developers may be tempted just to turn on that option. After all, the rule of thumb says to index columns that are used in joins, right? In reality, a foreign key index is useful only if at least one of the following is true:
- When a query plan in your application starts by fetching rows in the parent table, and then fetches rows in the child table by way of the foreign key, then an index on that foreign key will help. Note that the join order and the search method are both important; if the plan starts with the child table first, or uses a full scan of the child table, or uses a different index on the child table’s primary key, your foreign key index is not helping.
- When your application deletes rows in the parent table, or updates key columns in the parent table, Oracle can use the foreign key index on the child table to enforce the foreign key constraint.
If neither of these situations apply, then indexing the foreign key is a waste of space and time. Furthermore, if the foreign key is the first column of the child table, then a separate index on just the foreign key would be redundant, and hence totally unnecessary.
For low-cardinality columns, if you really need an index, you might consider using a bitmap index, since bitmap indexes have many benefits when dealing with low-cardinality columns. (See Oracle 9i Concepts for a thorough discussion of bitmap indexes.) However, because a single block in a bitmap index can contain thousands of rows, concurrency can be a problem during inserts or deletes. The only way to know for sure is to try a realistic benchmark with and without the index.
Order of Indexed Columns
The order of the columns in the index can also affect its usefulness, though this is less of an issue with the new skip-scanning feature of Oracle 9i. Generally speaking, columns that are high-cardinality (i.e., have many distinct values in the table) and are used in the WHERE clauses should generally be first in the index. The idea is to allow the query processor to find the rows you want by searching as few blocks as possible.
To understand this concept, consider a telephone book as an analogy. The entries are sorted by last name, first name, street name, and house number. When you are searching for a person by last name, the alphabetical organization allows you to proceed to the correct line very quickly. But it does not help at all if you are searching by street name, the third field in the sort order. Instead, you would need a book where street name is the primary sort field.
Reviewing Index Definitions
During development, it is important to figure out which indexes you will and will not need. Do not defer this activity until production deployment.
- Determine indexes you need by checking the query plans for important queries that your application uses. Make sure the indexes are adequate to support those queries.
- Determine indexes you don’t need by using Oracle 9i’s new index monitoring feature, which can find unused indexes for you. See the ALTER INDEX command in Oracle 9i SQL Reference for the specifics on how to enable this feature. For this to be useful, your application must be running under realistic conditions that exercise all of the features that will be employed in production.
The only purposes for indexes are to enforce constraints (primary key, unique, and foreign key) and to improve query performance. Any index that does not do one of these two things wastes space, and wastes the server’s time during DML operations.
You may have noticed that this discussion of indexes does not have many hard-and-fast rules that will tell you that a given index is or is not a good idea. But the main point of this piece is that you cannot apply such rules indiscriminately; you must test your database to verify that useful indexes exist and useless ones don’t.
We have seen a number of problems that a database project can encounter when developers focus exclusively on the logical functionality of a database without planning for performance or space management. Taking the time to plan for these problems during development can help you avoid crises in production, and avoid the time and expense of reorganizing your database later to repair the problems.