September 13, 2019.
I’ve been working through Handbook of Relational Database Design by Candace Fleming and Barbara von Halle, first released in 1989, which ambitiously attempts to define best practice for both modeling data in a database agnostic way and also address implementation concerns for implementing and tuning databases. I’ll write up note for both eventually, and these are my notes on the first half of the book on defining data models.
Quick summary: I haven’t taken a college-level databases course, and chatting with a few folks who have, this book covers much of the material. Aspects of the book are dated, but it’s a great review of the fundamentals, and it filled in a bunch of gaps that I missed from being a practioner first without the theory behind it. Solidly no regrets for spending time with it.
The three-schema approach asks data modelers to develop three kinds of schema: an external schema, a conceptual schema, and an internal schema. Modeling begins with the development of a number of external schema, each of which represents the view of a user cohort (in a rental application, perhaps the renter’s view or the landlord’s view). After collecting enough of these user views, the modeler collects them into the unified conceptual schema that reconciles the many user views into a unified schema owned by the database administrator but independent of any particular storage technology. The very final step is to translate that conceptual schema into an internal schema, which is the schema used by the underlying storage technology (e.g. a literal MySQL schema).
A few primary takeaways for me from this concept. FIrst, there is an actual standards document for this proposed in 1975 by the ANSI Committee. Second, it does a very deliberate job of decoupling technology constraints from data modeling. I’ve personally been quite guilty of modeling from a technology-first perspective, and on reflection I’m certain my models are worse for having taken that initial approach.
The rest of these notes focus mostly on the conceptual schema, which the book often also refers to as the logical data model. These notes use those terms somewhat independently, although techniques for developing the conceptual schema are equally appropriate for developing external schema.
A good data model is (1) correct, (2) consistent, (3) sharable and (4) flexible. Such a model will be easy to use and remain useful as requirements evolve.
What I particularly like about these properties is that they focus on the value of modeling (accurately reflect real systems) and acknowledge that most of the work in maintaining a successful model is done after the initial modeling. Successful systems must adapt to a series of increasingly unexpected requirements in order to remain successful.
The most important aspect of modeling data is to take a data-driven approach. Instead of starting from implementation constraints, ignore those constraints entirely in your first pass, and focus on what the business wants to accomplish.
Within that approach, the steps are:
Process to develop logical data model
where
clause on your query should only edit a single row.Something the book does quite nicely is walk through case studies of applying each of these particular techniques to a problem, demonstrating the techniques as they go.
Tips for developing an effective logical data model are:
Stepping away from the role of notetaker, I really enjoyed reading through this vision of structured data modeling. This book was written before Agile or Kanban or whatnot, in a world that was distinctly waterfall oriented, a world with more monoliths and services, and in a world where there were no internet-scale applications.
Much of the early NoSQL movement was about no longer exposing functionality that didn’t scale, forcing data modeling to directly confront the consequences of internet-scale, and that approach is the opposite of the approach described here that wants us to ignore the technology constraints until after developing the logical data model.
Similarly, most new products are created without product-market-fit, hoping to find fit through rapid iteration. This makes the idea of upfront modeling more challenging, but I also see how the absence of deliberate modeling cripples the productivity of most companies at some point in their growth curve until they’re able to remodel their data, often hidden within a larger rearchitecture project.
Another aspect of the three-schema concept that I find fascinating is the emphasis on user views, which I’ve effectively not seen used during my time in the industry. They are altogether a bit of an odd duck, perhaps no longer as relevant in a world where many production databases are modeled by the product engineers building on them rather than a database administrator.
Also, wow, triggers! One of the core tenets of modern development is keeping logic together where it can be modified atomically through deploys (that can be deployed, run in split mode, and reverted), and database triggers have fallen heavily out of use because they struggle to provide these properties. It’s fascinating to think back to trigger-oriented development while reading through this book, given how far they’ve fallen out of favor in industry circles I’ve spent time in. Are there still niches that rely heavily on triggers in new development?
As a final though, I think the concept of logical data models being sharable is particularly interesting when intersected with service boundaries. We want services to own their data management, but we still need a shared logical data model to reason across the service boundaries, which is one place I could imagine this three-schema approach being particularly interesting to apply in modern architectures.