Skip to main content

Command Palette

Search for a command to run...

The Unglamorous Work That Makes Migrations Succeed

Three years maintaining a legacy system taught me more about what survives technological change than any greenfield project ever could.

Updated
10 min read
The Unglamorous Work That Makes Migrations Succeed
K
I’ve spent a decade keeping legacy systems alive while quietly replacing them with zero downtime, leaning on a team of genuinely talented people to make it happen. This is where I write about what I learned.

The question everyone is asking wrong

There's a conversation happening right now about which software work survives technological change. I hear it in every engineering channel, every conference talk, every coffee chat with someone anxious about where the industry is heading. The answers tend to cluster around the same ideas: creative work, strategic work, "high-level" work. Work that requires taste, judgment, vision.

I don't disagree with any of that. But I think we're missing something more specific, something that doesn't show up in job titles or performance reviews, and that most developers actively avoid when given the choice.

In October 2020, I joined a project as a sustainment developer. The "real" team was building something new. My job was to keep the old thing alive. It wasn't the glamorous assignment, and I knew it.

What I didn't know was how much that would matter later.


What sustainment actually means

Sustainment sounds like maintenance. It isn't.

Maintenance is passive: you wait for something to break and then fix it. Sustainment, at least in the context I'm describing, was active knowledge work on a system that nobody else wanted to touch. While the project team built the new stack in relative isolation, sustainment meant being in production every single day. It meant knowing which bugs were real and which were expected behavior that someone had quietly decided to live with. It meant understanding the patterns of thousands of users: when load spiked, where sessions dropped, what error codes never surfaced in the logs but showed up constantly in support tickets.

None of that knowledge existed in documentation. There was no onboarding guide that said "by the way, this module behaves unexpectedly on Tuesdays because of a scheduling dependency that was never removed." You learned it by being there. By being the person who couldn't hand it off, because there was nobody to hand it off to.

This is the part that's easy to undervalue, including if you're the one doing it. Sustainment work doesn't generate demos. It doesn't produce architecture diagrams. It doesn't show up in sprint velocity metrics in a way that feels meaningful. What it produces is context, and context is invisible until the moment it becomes the only thing that matters.

That moment came in Q2 2021.


"Let's start with the one fewer people use"

The logic was sound: start the migration with the module that had the smallest user surface. Less exposure, less risk, more room to learn before touching the modules where everyone would notice a problem. The client's architect had already defined the stack. The team had been building in Flutter. This was supposed to be the contained part, the proof of concept before the larger modules followed.

The interesting thing about "contained" briefs is what they don't account for.

Flutter Web in 2021 was not the routing-friendly framework it has since become. The approved dependency list for this client, a large enterprise with strict governance around third-party libraries, didn't include a solution that actually worked for the navigation patterns we needed. The only pragmatic option was a library called AutoRouter, which was not on that list.

What followed was not a technical conversation. It was a conversation about risk, trust, and tradeoffs, conducted with a client who had legitimate reasons to be conservative about dependencies. Making the case for AutoRouter meant the team had to articulate not just why it solved our problem, but what kind of problem we'd have if we didn't use it, what the maintenance surface looked like, and why the risk of including it was lower than the risk of building around its absence. That argument had to hold at every level, from the engineers who'd maintain it to the architect who had to sign off.

That argument was only possible because of the previous eight months in sustainment, though not in the way you might expect. I wasn't the Flutter expert in the room. I was coming from a JavaScript background, still learning the framework alongside the migration itself. When the developer who suggested AutoRouter brought it to me first, my initial instinct was uncertainty. But I knew something he needed to validate: routing failures in a sustainment context are serious. They surface as incidents, as user-reported bugs, as support tickets that don't trace cleanly to a root cause. That operational knowledge, not Flutter expertise but production experience with the system we were migrating, was what turned uncertainty into a position worth pushing.

What neither of us fully anticipated was how consequential that decision would turn out to be. The migration of this first module was, in effect, a proving ground. When the main mobile app later needed to compile for web, the routing architecture we had established made that path significantly faster. A dependency decision that looked like a tactical workaround ended up shaping the trajectory of a much larger migration.

This is what I mean when I say the decisions that matter aren't between good options and bad options. They're between different types of debt. Someone has to understand the debt well enough to choose deliberately, and to explain that choice to the people who have to live with it.


Zero downtime is a product decision

The migration strategy was straightforward in principle: connect the new Flutter Web frontend to the existing backend services. For users, the experience would be seamless. The new UI, the old data layer, nothing visibly changed from where they sat.

In practice, "seamless for users" meant the team absorbed a lot of invisible cost. Manual processes, workarounds, operational overhead that didn't show up in a sprint but absolutely showed up in your week. We knew this going in. The 1:1 migration meant we were carrying legacy architectural decisions forward into the new system deliberately, because the alternative, a harder cutover, carried more risk than we were willing to take at that stage. It was the right call. It was also a call that created work we'd have to come back for.

Feature toggles gave us control over the rollout: the ability to move incrementally, validate behavior module by module, and pull back if something unexpected surfaced. They also gave the client visibility into the process, which mattered as much as the technical safety net. A client who can see what's happening is a client who can calibrate their trust.

Zero downtime migrations are frequently described as technical achievements, and they are, in part. But the technical mechanisms are not the hard part. The hard part is understanding what you're trading away to get them, and making sure the people who need to know, know. If you don't know what it's costing, the bill arrives later, usually at the worst possible moment.

In our case, we knew exactly what the bill was. We planned for it. That planning was only possible because the tradeoff had been named out loud, not left as something everyone understood but nobody said.


The moment context becomes leverage

By August 2021, all the UI modules had been migrated. The system underneath was still the same: same services, same data sources, same architectural decisions from years before any of us had joined the project.

This is the phase where accumulated context stops being invisible and starts being operational.

Knowing which modules had implicit dependencies that weren't in any spec. Knowing which bugs were inherited from the original system and which were introduced by the new development team working in parallel. Knowing what the client had already accepted as a known limitation and what they would never accept, regardless of technical justification. None of that was written down anywhere. It lived in having been present, in the sustainment work, in the incident responses, in the hundred small conversations with stakeholders that don't make it into meeting notes.

This kind of knowledge isn't special. It doesn't require unusual intelligence or rare technical skill. It requires time, attention, and a willingness to stay close to a system that isn't glamorous. What makes it valuable is that it can't be transferred quickly. You can't summarize it in a handoff document. You can't reconstruct it from a codebase. It exists in the person who accumulated it, and when that person isn't in the room, decisions get made without it.

The same principle applies to any sufficiently complex system, and to any sufficiently powerful tool. Judgment formed in context, real context, under real pressure, with real consequences, is not something that can be generated on demand. Tools change. The need for someone who understands the system deeply doesn't.


How trust becomes technical capital

I want to be honest about something: a lot of what made the later stages of this migration work wasn't technical at all. It was relational. Three years of showing up, flagging problems early, delivering what we said we'd deliver, and occasionally pushing back on scope in ways that turned out to be right. That's not a glamorous story. It's also not something that happens by accident.

The last major migration, moving the remaining services, was the first time the client gave us room to improve rather than just replicate. That permission wasn't in a brief or a business case. It came from somewhere harder to measure: the specific confidence a client develops when a team has consistently understood the system, named the tradeoffs accurately, and not surprised them with problems that should have been seen coming.

Technical trust works like capital. It accumulates slowly, through exactly the kind of work that's hardest to make visible. It depletes quickly. One surprise, one commitment that wasn't kept, one problem that should have been flagged earlier. The last migration being better than the first wasn't luck. It was the compounded return on three years of staying close to both the system and the people responsible for it.


What survives the cycles

The job title "developer" has been renamed several times in my career, and it will be renamed again. Each cycle removes something from the definition: some task that turns out to be automatable, some skill that turns out to be a commodity. What survives isn't the stack or the title. It's the capacity to accumulate context in real systems, under real constraints, and use that context to make decisions that tools alone can't make.

Sustainment taught me this in the most concrete way possible. The work that looks peripheral is often the work that holds everything together. The developer who stays close to production while everyone else is building the next thing is accumulating something that doesn't show up on a performance review. It shows up later, when the migration starts, and someone needs to know why the system works the way it works.

That knowledge is not glamorous. It doesn't demo well. But it's what makes the difference between a migration that goes smoothly and one that surfaces surprises at the worst possible moment.

Start there, if you have the chance. Stay longer than feels comfortable. The context compounds.


There's a longer version of this story, one about what this kind of work costs personally, and what I'd do differently knowing what I know now. That one is harder to write. I'm working on it.

What Survives

Part 1 of 1

Retrospectives on context, trust, technical debt, and what actually survives when the stack changes underneath you.