Notes on data modeling from Handbook of Relational Database Design.
I’ve been working through Handbook of Relational Database Design by Candace Fleming and Barbara von Halle, first released in 1989, which ambitiously attempts both to define best practice for modeling data in a database-agnostic way and to address the concerns of implementing and tuning databases. I’ll write up notes on both halves eventually; these are my notes on the first half of the book, on defining data models.
Quick summary: I haven’t taken a college-level databases course, but from chatting with a few folks who have, this book covers much of the same material. Aspects of the book are dated, but it’s a great review of the fundamentals, and it filled in a bunch of gaps left from being a practitioner first, without the theory behind it. Solidly no regrets for spending time with it.
Three-schema approach
The three-schema approach asks data modelers to develop three kinds of schema: an external schema, a conceptual schema, and an internal schema. Modeling begins with the development of a number of external schemas, each of which represents the view of a user cohort (in a rental application, perhaps the renter’s view or the landlord’s view). After collecting enough of these user views, the modeler combines them into a conceptual schema, which reconciles the many user views into a unified model owned by the database administrator but independent of any particular storage technology. The final step is to translate that conceptual schema into an internal schema, the schema used by the underlying storage technology (e.g. a literal MySQL schema).
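To make the three layers concrete, here’s a minimal sketch in Python using SQLite as a stand-in for the internal schema. The table and column names (property, renter_view, landlord_view) are hypothetical, not from the book:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Internal schema: the literal DDL for one storage technology (SQLite here).
# The property table stands in for a PROPERTY entity in the conceptual schema.
conn.executescript("""
CREATE TABLE property (
    property_id   INTEGER PRIMARY KEY,
    address       TEXT NOT NULL,
    monthly_rent  INTEGER NOT NULL,
    landlord_name TEXT NOT NULL,
    renter_name   TEXT
);

-- External schema: the renter's view of the conceptual entity.
CREATE VIEW renter_view AS
    SELECT address, monthly_rent FROM property;

-- External schema: the landlord's view of the same conceptual entity.
CREATE VIEW landlord_view AS
    SELECT address, monthly_rent, renter_name FROM property;
""")
```

In the book’s ordering you would work the other way around: sketch the two user views first, reconcile them into the conceptual entity, and only then commit to any particular DDL.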
I had a few primary takeaways from this concept. First, there is an actual standards document for this, proposed in 1975 by the ANSI/SPARC committee. Second, it does a very deliberate job of decoupling technology constraints from data modeling. I’ve personally been quite guilty of modeling from a technology-first perspective, and on reflection I’m certain my models are worse for having taken that initial approach.
The rest of these notes focus mostly on the conceptual schema, which the book often also refers to as the logical data model. These notes use those terms somewhat interchangeably, although techniques for developing the conceptual schema are equally appropriate for developing external schemas.
Properties of an effective data model
A good data model is (1) correct, (2) consistent, (3) sharable and (4) flexible. Such a model will be easy to use and remain useful as requirements evolve.
- Correct means that the model reflects the reality of the process it models. It doesn’t hide the complexity of the real-world behavior in creative ways; it represents it fully.
- Consistent is ACID’s Consistency: state should never reach invalid configurations. If you’re modeling rental homes, a renter’s home and the home’s renter should never reflect different realities (see the sketch after this list).
- Sharable models are usable by many different applications. An accurate representation of the underlying process of renting homes should make it possible to support an application for renters applying for homes, an application for home owners managing their rental properties, a billing application, an integration with a maintenance company for repairs, and so on.
- Flexible models are evolvable, preserving their correctness as the real systems they represent continue to change.
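As a minimal sketch of the consistency property, in Python with SQLite and with hypothetical table names: storing the renter-home relationship in exactly one place means the two perspectives can never disagree.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite doesn't enforce foreign keys by default

conn.executescript("""
CREATE TABLE renter (renter_id INTEGER PRIMARY KEY, name    TEXT NOT NULL);
CREATE TABLE home   (home_id   INTEGER PRIMARY KEY, address TEXT NOT NULL);

-- The renter<->home relationship lives in one table, rather than as a column
-- on both renter and home, so "the renter's home" and "the home's renter"
-- cannot drift into different realities.
CREATE TABLE occupancy (
    home_id   INTEGER NOT NULL UNIQUE REFERENCES home(home_id),
    renter_id INTEGER NOT NULL        REFERENCES renter(renter_id)
);
""")
```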
What I particularly like about these properties is that they focus on the value of modeling (accurately reflect real systems) and acknowledge that most of the work in maintaining a successful model is done after the initial modeling. Successful systems must adapt to a series of increasingly unexpected requirements in order to remain successful.
Developing a data model
The most important aspect of modeling data is to take a data-driven approach. Instead of starting from implementation constraints, ignore those constraints entirely in your first pass, and focus on what the business wants to accomplish.
Within that approach, the steps are:
Process to develop logical data model
- Build skeletal user views. Identify the kinds of users you are building for, and for each user document the major entities and relationships from their perspective. Use this to build a data dictionary of all data within the system (you’ll be maintaining this forever).
- Add keys to user views. First, determine the combination of attributes that uniquely identifies an entity (the primary key), and also the combinations of attributes that will be used for retrieval (alternate keys). Second, determine the foreign keys that will operate across models. Finally, identify the business rules that govern mutation (e.g. what are legal insert, delete, and update behaviors). There’s a sketch of keys and rules after this list.
- Add detail to user views. Add attributes which are not used in primary keys (“nonkey attributes”). This section makes a particular callout against using enumerated values (e.g. “1032” as a coded value for say “MEDICAL INDUSTRY”), as they make data hard to understand. That said, the later normalization technique would likely want you to make industry into a separate table anyway.
- Validate user views through normalization. The book introduces a number of “normal forms”, each of which is a transformation of a schema with redundant information into a schema with less redundant information. I started typing them all up, but wow, nope, not doing that; just read an article on database normalization instead. If you’re less excited to memorize those rules, a rule of thumb is roughly: any modification you make to any entity should be achievable with a narrowly scoped update, e.g. the `where` clause on your query should only edit a single row.
- Determine additional attribute business rules. To quote, “It is useless to understand the structure of information within the business without understanding rules pertaining to that information’s integrity.” This section goes through defining domains for each attribute (what are the legal values for each attribute based on the thing they’re representing, maximum values, minimum values, and so on), ensuring primary keys are unique, defining rules about foreign key relationships (which are optional, which are valid, etc.), and defining triggers (what are database-enforced checks on mutation events).
- Integrate user views. Now that you’ve refined each of your user views, merge them together into a unified model that incorporates each user view’s needs.
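Here’s a minimal sketch of a few of these steps (keys, pulling coded values into their own table, and attribute domain rules), again in Python with SQLite; the table and column names are hypothetical rather than taken from the book’s case studies:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity in SQLite

conn.executescript("""
-- Enumerated codes like "1032" for "MEDICAL INDUSTRY" become a lookup table,
-- so renaming an industry is a single-row UPDATE instead of a sweep across
-- every row that embedded the code.
CREATE TABLE industry (
    industry_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);

CREATE TABLE landlord (
    landlord_id INTEGER PRIMARY KEY,                       -- primary key
    tax_id      TEXT NOT NULL UNIQUE,                      -- alternate key, used for retrieval
    name        TEXT NOT NULL,
    industry_id INTEGER REFERENCES industry(industry_id)   -- foreign key
);

CREATE TABLE rental_listing (
    listing_id   INTEGER PRIMARY KEY,
    landlord_id  INTEGER NOT NULL REFERENCES landlord(landlord_id),
    monthly_rent INTEGER NOT NULL CHECK (monthly_rent > 0) -- attribute domain rule
);
""")
```

The normalization rule of thumb shows up here too: because industry lives in its own table, renaming an industry touches exactly one row.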
Something the book does quite nicely is walk through case studies of applying each of these particular techniques to a problem, demonstrating the techniques as they go.
Tips for modeling
Tips for developing an effective logical data model are:
- Work closely with users. Often users are brought into a modeling process late, particularly in cases where folks are taking an implementation-centric approach as opposed to a data-centric approach. This leads to less correct models. Instead, start with users and work back to the implementation second.
- Follow a structured methodology. Designing a data model is a form of product management, and benefits from the same rigorous, intentional methodology.
- Use structural and integrity constraints. Define explicit constraints on the domain of legal values for each attribute, and control the modification of the relations between objects to ensure the underlying data remains correct.
- Use conceptualization and normalization techniques. Use the numerous formal normalization methods to simplify the data model into its least redundant possible form.
- Use diagrams whenever possible. Prose models are harder for readers to intuit than diagrammed models, and diagrams have much higher information density.
- Build a data dictionary. An automated or manual collection of all information related to your data model, kept in document or textual form, to pair with your diagrams and serve as an enduring source of truth (a minimal sketch follows this list).
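The book doesn’t prescribe a format for the data dictionary, but as a sketch of what an entry might capture (the fields and example rows below are my own assumptions, not the book’s):

```python
from dataclasses import dataclass

@dataclass
class DataDictionaryEntry:
    entity: str       # the entity the attribute belongs to
    attribute: str    # attribute name as used in the model
    definition: str   # business meaning, in the users' language
    domain: str       # legal values: type, range, allowed codes
    key_role: str     # "primary", "alternate", "foreign", or "nonkey"

data_dictionary = [
    DataDictionaryEntry("home", "home_id",
                        "surrogate identifier for a rental home",
                        "positive integer", "primary"),
    DataDictionaryEntry("home", "monthly_rent",
                        "rent charged per month, in whole dollars",
                        "integer greater than zero", "nonkey"),
    DataDictionaryEntry("occupancy", "home_id",
                        "the home covered by this occupancy",
                        "must exist in home", "foreign"),
]
```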
Vocabulary
- “Entity is a person, place, thing or concept about which you wish to record facts”
- “Relationship is association between two entities”
- “Attribute is a nondecomposable piece of information describing an entity”
- Primary key is the “identifying attribute or set of attributes”
- “Foreign key is an attribute or set of attributes that completes a relationship by identifying the associated entity. The term foreign conveys the idea that the attribute ‘belongs’ or refers to another… entity.”
- One-to-many relationship describes one entity (e.g. an author) relating to many others (e.g. books).
- One-to-one relationship describes an entity (e.g. a book) relating to exactly one other, and vice versa (e.g. its ISBN record)
- Business rules describe valid model states. They preserve integrity by constraining the values the model can assume, for example required fields and nullable references
- Triggering operations are a kind of business rule applied during mutation (insert, update, and delete); there’s a sketch of one after this list
- Integrity comes in three flavors: entity, referential, and domain. Entity integrity describes individual attributes, in particular that all components of a primary key exist. Referential integrity governs foreign keys across tables, in particular that the row a foreign key refers to actually exists. Domain integrity applies to columns, for example the data type, default value, uniqueness, and so on.
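As a minimal sketch of a triggering operation, again in Python with SQLite and with hypothetical names, rejecting a mutation that would violate an attribute’s domain:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE rental_listing (
    listing_id   INTEGER PRIMARY KEY,
    monthly_rent INTEGER NOT NULL
);

-- A triggering operation: a business rule enforced by the database on insert,
-- rejecting listings whose rent falls outside the attribute's legal domain.
CREATE TRIGGER listing_rent_domain
BEFORE INSERT ON rental_listing
BEGIN
    SELECT RAISE(ABORT, 'monthly_rent must be positive')
    WHERE NEW.monthly_rent <= 0;
END;
""")

try:
    conn.execute("INSERT INTO rental_listing (monthly_rent) VALUES (-100)")
except sqlite3.IntegrityError as exc:
    print(exc)  # monthly_rent must be positive
```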
Some commentary
Stepping away from the role of notetaker, I really enjoyed reading through this vision of structured data modeling. This book was written before Agile or Kanban or whatnot, in a world that was distinctly waterfall-oriented, a world with more monoliths and fewer services, and a world with no internet-scale applications.
Much of the early NoSQL movement was about no longer exposing functionality that didn’t scale, forcing data modeling to directly confront the consequences of internet scale. That is the opposite of the approach described here, which asks us to ignore technology constraints until after developing the logical data model.
Similarly, most new products are created without product-market-fit, hoping to find fit through rapid iteration. This makes the idea of upfront modeling more challenging, but I also see how the absence of deliberate modeling cripples the productivity of most companies at some point in their growth curve until they’re able to remodel their data, often hidden within a larger rearchitecture project.
Another aspect of the three-schema concept that I find fascinating is the emphasis on user views, which I’ve effectively not seen used during my time in the industry. They are altogether a bit of an odd duck, perhaps no longer as relevant in a world where many production databases are modeled by the product engineers building on them rather than a database administrator.
Also, wow, triggers! One of the core tenets of modern development is keeping logic together where it can be modified atomically through deploys (that can be deployed, run in split mode, and reverted), and database triggers have fallen heavily out of use because they struggle to provide these properties. It’s fascinating to think back to trigger-oriented development while reading through this book, given how far they’ve fallen out of favor in industry circles I’ve spent time in. Are there still niches that rely heavily on triggers in new development?
As a final thought, I think the concept of logical data models being sharable is particularly interesting when intersected with service boundaries. We want services to own their data management, but we still need a shared logical data model to reason across the service boundaries, which is one place I could imagine this three-schema approach being particularly interesting to apply in modern architectures.