
Scaling Capacities: Why we swapped Dgraph for PostgreSQL

When building a tool for thought like Capacities, the data model is the product. Our users literally create webs of connected notes, so choosing a graph database initially felt like the most authentic architectural decision to the founders. Dgraph was chosen because it promised native graph traversal and a schema that matched the user’s mental model: it made it easy to move fast and to express queries in a way that closely matched how we think about data.
However, as time went on, the team realized that the “perfect” theoretical fit was becoming an operational nightmare, and with the release of Full Offline Mode, Capacities was much less reliant on having a graph architecture on the backend. When I joined as the new backend engineer, switching to a more efficient database became my first project.
The Problem: The Graph Tax
Our primary challenge wasn’t functionality; it was CPU consumption. Dgraph’s resource usage was unpredictably high, even with modest datasets. We found ourselves at a crossroads: horizontally scale a complex cluster (without the traffic to justify it) or vertically scale our servers to much more expensive machines.
We weren’t alone; community reports confirmed our suspicions.
After doing some research and having many discussions, we finally decided it was time to move to something “boring” and reliable: PostgreSQL on a cloud provider.
Redesigning the Graph in a Relational PostgreSQL Database
Migrating from a loosely enforced graph schema to a strictly typed relational database is a massive normalization challenge. I worked closely with Steffen, our founder, to audit every assumption I had about the data. The migration wasn’t just a data transfer; it was a complete structural redesign. My first task was to deconstruct our Dgraph schema. We needed to translate “nodes and edges” into a performant relational model.
The “Links” and “Notes” Schema
The main challenge was normalization. Dgraph’s data model is inherently denormalized, but to take full advantage of PostgreSQL we had to redesign our schema around normalization principles. This forced us to rethink how we modeled data and to carefully review every backend query, ensuring we weren’t merely trading CPU bottlenecks in a graph database for slow joins and higher response times in a relational system.
We eventually landed on a clean, highly indexed structure where tables could be categorized as:
- objects tables: contain the core nodes and their properties.
- links tables: dedicated tables representing the edges (from node A to node B).
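As a rough illustration of that split (the table and column names here are hypothetical, not our exact schema), the Kysely-style table types look something like this:

```typescript
import { Generated } from "kysely";

// "objects" table: one row per node, with its properties.
interface ObjectsTable {
  id: string; // UUID primary key
  space_id: string; // the space the object belongs to
  type: string; // object type, e.g. note, tag, page
  properties: unknown; // JSONB column holding structured properties
  created_at: Generated<Date>;
  updated_at: Generated<Date>;
}

// "links" table: one row per edge between two objects.
interface LinksTable {
  id: Generated<number>;
  from_object_id: string; // source node (indexed)
  to_object_id: string; // target node (indexed)
  link_type: string; // e.g. reference, backlink
}

export interface Database {
  objects: ObjectsTable;
  links: LinksTable;
}
```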
To safely replace the engine while the car was moving, we followed a three-step technical strategy:
- The DatabaseService Interface: We wrote an abstraction layer in our backend (a rough sketch follows this list). This allowed us to rewrite Dgraph interactions in one place and provided a roadmap of every query we needed to port.
- Choosing Kysely to interact with Postgres: It gave us the best of both worlds: strong TypeScript typing and the freedom to write raw-like SQL for complex traversals.
- Recursive Power: Traversing connected notes in SQL is no small feat. We utilized WITH RECURSIVE Common Table Expressions (CTEs), heavily testing them with EXPLAIN ANALYZE to optimize our indexing strategy.
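To make the abstraction concrete, here is a minimal sketch of what such an interface could look like; the method and type names are illustrative, not Capacities’ actual code:

```typescript
// Hypothetical shape of the DatabaseService abstraction (names are illustrative).
export interface StoredObject {
  id: string;
  type: string;
  properties: Record<string, unknown>;
}

export interface DatabaseService {
  getObject(spaceId: string, objectId: string): Promise<StoredObject | undefined>;
  upsertObject(spaceId: string, object: StoredObject): Promise<void>;
  deleteObject(spaceId: string, objectId: string): Promise<void>;
  // Graph-style reads become explicit methods that each backend implements.
  getLinkedObjects(
    spaceId: string,
    objectId: string,
    depth: number
  ): Promise<StoredObject[]>;
}

// Two implementations sit behind the same contract and are wired in
// via dependency injection:
//   class DgraphDatabaseService implements DatabaseService { ... }
//   class PostgresDatabaseService implements DatabaseService { ... }
```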
Tooling: Why Kysely?
We spent significant time evaluating ORMs and other alternatives. We ultimately chose Kysely for three reasons:
- Type safety: It provides end-to-end type safety without a heavy runtime overhead.
- Raw SQL control: Unlike traditional ORMs that hide the SQL, Kysely feels like writing raw SQL without losing strong typing.
- Compatibility with complex queries: Our most critical queries, which traversed the graph, required the use of WITH RECURSIVE and good column indexing (a sketch follows this list).
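For illustration, a traversal like this can be expressed with Kysely’s sql template tag; the query below reuses the hypothetical Database type from the schema sketch above, and the depth-limited shape is an assumption, not our exact production query:

```typescript
import { Kysely, sql } from "kysely";

// Hypothetical traversal: fetch every object reachable from a starting note,
// up to a maximum depth, using a recursive CTE over the links table.
// The depth limit also guards against cycles in the link graph.
async function getConnectedObjectIds(
  db: Kysely<Database>,
  startId: string,
  maxDepth: number
): Promise<string[]> {
  const result = await sql<{ id: string }>`
    WITH RECURSIVE connected(id, depth) AS (
      SELECT to_object_id, 1
      FROM links
      WHERE from_object_id = ${startId}

      UNION ALL

      SELECT l.to_object_id, c.depth + 1
      FROM links l
      JOIN connected c ON l.from_object_id = c.id
      WHERE c.depth < ${maxDepth}
    )
    SELECT DISTINCT id FROM connected
  `.execute(db);

  return result.rows.map((row) => row.id);
}
```

The interpolated values are sent as bound parameters, so the query stays safe while still reading like plain SQL.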
Optimizing with EXPLAIN ANALYZE
The recursive logic was the core of the new system. We spent days running EXPLAIN ANALYZE on our queries to determine the perfect indexing strategy. Kysely’s migration tool made it incredibly simple to iterate on these indexes as we discovered bottlenecks in our local Docker environments.
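As a sketch of that workflow (the index and table names are again hypothetical), adding an index in a Kysely migration looks roughly like this; after applying it, we would re-run EXPLAIN ANALYZE on the traversal query to confirm it switched to an index scan:

```typescript
import { Kysely, Migration } from "kysely";

// Hypothetical migration: index the link endpoints used by the recursive CTE.
export const addLinksIndexes: Migration = {
  async up(db: Kysely<any>): Promise<void> {
    await db.schema
      .createIndex("links_from_object_id_idx")
      .on("links")
      .column("from_object_id")
      .execute();

    await db.schema
      .createIndex("links_to_object_id_idx")
      .on("links")
      .column("to_object_id")
      .execute();
  },

  async down(db: Kysely<any>): Promise<void> {
    await db.schema.dropIndex("links_to_object_id_idx").execute();
    await db.schema.dropIndex("links_from_object_id_idx").execute();
  },
};
```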
The Zero-Downtime Rollout
We couldn’t just turn off the lights. We followed a rigorous rollout plan:
- Double Writing: We modified our backend to write every change to both Dgraph and RDS.
- Consistency testing: During local and staging testing, we compared read results from both databases to ensure consistency in our queries.
- The Two-Week Migration: We ran a background job that slowly moved historical data from Dgraph to Postgres. This took two weeks to ensure we didn’t overwhelm the production instance.
- Feature Flags: Using a flag system, we toggled individual DatabaseService modules from Dgraph to Postgres one by one. This allowed us to monitor real-world performance for specific queries and “roll back” instantly if we saw a spike (see the sketch after this list).
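Putting the double-writing and flag-toggling steps together, the wiring can be sketched roughly as below, reusing the hypothetical DatabaseService interface from earlier; the flag mechanism shown is an illustration, not our actual flag system:

```typescript
// Hypothetical rollout wrapper: double-writes to both backends and routes
// reads based on a per-module feature flag.
class MigrationDatabaseService implements DatabaseService {
  constructor(
    private readonly dgraph: DatabaseService,
    private readonly postgres: DatabaseService,
    private readonly readFromPostgres: () => boolean // backed by a feature flag
  ) {}

  async upsertObject(spaceId: string, object: StoredObject): Promise<void> {
    // Double writing: keep both stores in sync during the migration window.
    await Promise.all([
      this.dgraph.upsertObject(spaceId, object),
      this.postgres.upsertObject(spaceId, object),
    ]);
  }

  async deleteObject(spaceId: string, objectId: string): Promise<void> {
    await Promise.all([
      this.dgraph.deleteObject(spaceId, objectId),
      this.postgres.deleteObject(spaceId, objectId),
    ]);
  }

  async getObject(spaceId: string, objectId: string): Promise<StoredObject | undefined> {
    // Reads flip per module via the flag, so a rollback is just a flag change.
    const backend = this.readFromPostgres() ? this.postgres : this.dgraph;
    return backend.getObject(spaceId, objectId);
  }

  async getLinkedObjects(
    spaceId: string,
    objectId: string,
    depth: number
  ): Promise<StoredObject[]> {
    const backend = this.readFromPostgres() ? this.postgres : this.dgraph;
    return backend.getLinkedObjects(spaceId, objectId, depth);
  }
}
```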
The “Dirty” Data Challenge
When we moved to staging on RDS, we hit a wall: Legacy Data. A few years ago, Capacities used string literals like "default", "root", or "basic" where we now expected UUIDs. We wanted to keep our Postgres columns as strictly typed UUID types for performance, but we couldn’t just delete old user data.
The Solution: We built a custom QueryBuilder module on top of Kysely. It included a literal mapper that swapped these legacy strings for “fake” but consistent UUID values. To ensure this was safe, we ran a background job on production to extract all unique legacy values and verify the set was small enough to map manually.
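The mapper itself can be pictured as something like the following; the placeholder UUIDs are invented for illustration:

```typescript
// Hypothetical literal mapper: legacy string literals are swapped for fixed,
// "fake" but consistent UUIDs before values reach the strictly typed UUID columns.
const LEGACY_LITERAL_TO_UUID: Record<string, string> = {
  default: "00000000-0000-4000-8000-000000000001",
  root: "00000000-0000-4000-8000-000000000002",
  basic: "00000000-0000-4000-8000-000000000003",
};

const UUID_PATTERN =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function toUuid(value: string): string {
  // Real UUIDs pass through unchanged.
  if (UUID_PATTERN.test(value)) return value;

  // Known legacy literals map to their stable placeholder UUIDs.
  const mapped = LEGACY_LITERAL_TO_UUID[value];
  if (mapped !== undefined) return mapped;

  // Anything else is unexpected and should fail loudly rather than silently.
  throw new Error(`Unmappable legacy identifier: ${value}`);
}
```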
After this, we thought we were in the clear until we started the background migration job. Suddenly, the logs were filled with errors: invalid byte sequence for encoding "UTF8".
Dgraph, written in Go, was surprisingly permissive with what it stored. It had allowed “broken” Unicode values, null bytes, and certain malformed emoji sequences to sit in the database for years. PostgreSQL, however, is a strict guardian of data integrity. It rejected these values outright.
We had to build a sanitization layer into our migration script to strip out these “ghost” characters and fix malformed UTF-8 sequences, including broken emoji encodings on the fly. It was a stark reminder that moving data isn’t just about moving bits; it’s about translating between two different philosophies of data validation.
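The sanitization step boiled down to something along these lines; this is a minimal sketch, and on newer runtimes String.prototype.toWellFormed() can cover the surrogate handling:

```typescript
// Minimal sanitization sketch: Postgres text columns reject NUL bytes and
// invalid UTF-8, so both are cleaned up before the value is written.
function sanitizeForPostgres(value: string): string {
  return (
    value
      // Strip NUL bytes, which Postgres does not allow inside text values.
      .replace(/\u0000/g, "")
      // Replace lone surrogate halves (e.g. from broken emoji) with U+FFFD,
      // since they cannot be encoded as valid UTF-8.
      .replace(/[\uD800-\uDBFF](?![\uDC00-\uDFFF])/g, "\uFFFD")
      .replace(/(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/g, "\uFFFD")
  );
}
```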
The Results: 1/10th of the Cost
The impact was immediate and dramatic. As we toggled each module, we watched the server CPU metrics drop.
By moving to Postgres, we reduced our database costs to 1/10th of what they were, which translated into 70% annual savings on overall infrastructure costs.
Key Learnings
- Boring is Beautiful: Postgres has decades of community support, documentation, and tooling; when things go wrong, the answer is usually one Google search away. Postgres can also be highly optimized for a wider range of use cases than other database technologies, which made us feel safer: we could prototype different implementations and compare metrics without completely switching to another technology.
- Interfaces are your friend: The DatabaseService abstraction was incredibly helpful for gathering every Dgraph call throughout the codebase and understanding exactly what functionality a new PostgresService module needed to provide. At Capacities we’re big fans of dependency injection, and this was one more example of how useful it can be.
- Managed Services > Self-Hosting: For a small team, the “premium” of a managed database instance is a bargain compared to the engineering hours spent managing self-hosted Dgraph clusters. Capacities is not a database company; our focus is better spent innovating in knowledge work than in managing infrastructure.

