In Limbo: Managing Transitional States

00:00:00.320 All right, welcome to "In Limbo: Managing Transitional States." I'm really excited to be here. As Fetalina said, I'm Jeremy Smith. I am a product-focused Rails developer. I run a tiny one-person Rails studio called Hybrid. I've been doing that since 2013. I co-host the Indie Rails podcast with my good friend Jess.

00:00:05.440 I also have stickers in the back on the sticker table, so feel free to grab one if you want. My latest side project is called Liminal; it’s a fresh take on old-school forums. If you miss forums and are tired of chats, check that out! I'd love to talk about that with you. You can find me most places online at JeremySmith.co.

00:00:20.240 The title of this talk is somewhat cryptic, so we're going to start here. What do I mean by "limbo"? Your first thought might be the contest that involves bending over backwards and passing under a horizontal pole, lowered slightly for each successive pass. But instead, I'm referring to the secondary definitions of limbo: an intermediate or transitional place or state, or a state of uncertainty.

00:00:44.160 When I started working on this talk, I remembered that "In Limbo" is the name of a Radiohead song found on their album, Kid A. In it, we find these lyrics: "Lundy, Fastnet, Irish Sea, I've got a message I can't read. I'm lost at sea; don't bother me, I've lost my way." It turns out Lundy, Fastnet, and Irish Sea are three nearby areas on the BBC's shipping forecast. This song is about the transitional state of being at sea. Being in limbo is akin to embarking on a sea voyage: you have left one shore of safety and have not yet reached the next. The sea is an ancient symbol of chaos and danger. While there are great benefits to be gained by going to sea, there are also significant risks.

00:01:44.960 In seafaring, a decision is made to undertake a lengthy and involved move from one position of relative safety to another to gain some perceived advantage. During the transition, risk and uncertainty are increased. It’s possible to get lost; the available resources are fewer, and the conditions may be more difficult and dangerous. But during the transition, all functions must continue and cannot be suspended.

00:02:22.599 I think the same situation arises when building software. Therefore, when I refer to being in limbo, I'm talking about the incremental process of making a complex change to a software system that is already in use. Let me break that down a bit. Why do I say "in use"? If you are making a complex change to a greenfield application with no users right now, there’s essentially zero risk to landing all those changes at once. Or if you're building a feature in an existing application that is not yet connected to anything else and is hidden away from users, again you can probably just make that change.

00:03:38.560 Why complex? Well, if the change is simple, can be easily reviewed, and can be rolled back with little or no impact to users, then the risk is low in making that change all at once. And why do I say "software system"? Because these changes aren't limited just to code; they can also involve schema changes, data infrastructure, and conceptual as well as operational changes for the people involved, including the end users of the system and those tasked with understanding and maintaining it.

00:04:09.000 It might be helpful to talk about some examples of different kinds of complex changes to systems currently in use. Let's say you work in a Rails app that uses the "paranoia" gem for soft deletion. Paranoia takes an implicit approach to soft deletion by overriding ActiveRecord behavior, which can lead to some confusion, as the "destroy" method soft deletes records while the "really_destroy!" method actually deletes them. You discover that you're losing data when a model with soft deletes has a "has_many" association to another model without soft deletes, because the association is set to "dependent: :destroy." So, you decide to switch to the "discard" gem, which takes a more explicit approach and requires you to filter out all soft-deleted records in your queries.

00:05:05.360 Alternatively, let's say you work on an app where users have one of three possible roles managed by a role enum on the users' table. After some time, the team has determined that the app needs to allow users to have multiple roles tied to their team memberships, but all the privileges and authorization checks for existing users must be maintained during that transition.

00:05:36.559 Or let’s say you work on a system that analyzes documents with highly variable content and generates consistent structured outputs that surface to users. For simplicity, let’s say that one of the most important outputs is the unique word count of the document. However, you discover that the unique word count algorithm currently in use isn't always handling word boundaries correctly and is extremely slow when processing certain documents. You need to rewrite the algorithm and regenerate the unique word count for hundreds of thousands of documents in production, ensuring that the accuracy of the new word count is the same or better than the previous one.
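As a rough illustration of that third scenario, here is a minimal Ruby sketch of checking a rewritten algorithm against the old one before regenerating anything. Both algorithms, the method names, and the sample documents are hypothetical stand-ins and not the code from the project described in the talk.

```ruby
# Hypothetical "old" algorithm: splits on spaces only, so trailing
# punctuation sticks to words and "cat" / "cat." count separately.
def legacy_unique_word_count(text)
  text.downcase.split(" ").uniq.size
end

# Hypothetical "new" algorithm: scans for runs of letters, digits, and
# apostrophes, so punctuation and line breaks act as word boundaries.
def unique_word_count(text)
  text.downcase.scan(/[[:alnum:]']+/).uniq.size
end

# Run both algorithms over every document and report where they differ,
# so each difference can be reviewed before the new value replaces the old.
def word_count_discrepancies(documents)
  documents.filter_map do |id, text|
    old_count = legacy_unique_word_count(text)
    new_count = unique_word_count(text)
    { id: id, old: old_count, new: new_count } unless old_count == new_count
  end
end

documents = {
  1 => "the cat sat on the mat",
  2 => "The cat sat. The cat."
}

discrepancies = word_count_discrepancies(documents)
discrepancies.each { |d| puts "document #{d[:id]}: old=#{d[:old]} new=#{d[:new]}" }
```

Only the second document is reported, because the old algorithm treats "sat." and "cat." as distinct words. Reviewing a report like this is one way to judge whether the new count is "the same or better" before touching production data.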

00:06:11.840 Now, someone might push back at this point and ask, "Do these complex changes have to be incremental? Why can't we just make an all-at-once switch over?" We could assign one team member to build out all changes in a separate branch over several weeks and then pause all merges to our development branch, merging in the new changes, resolving all conflicts, blocking off time over a weekend to deploy the changes and cut the system over, remaining on call in case there are problems.

00:06:25.840 I'm sympathetic to this idea; it seems like the incremental approach will take twice as long as the all-at-once switchover. It may even seem there’s no reasonable way to break these changes into smaller steps, but this perspective is overly optimistic. First, it assumes that the work of maintaining and then merging a long-running branch can be accomplished with few defects and regressions within the limited time frame that other changes are being held back. It also assumes that the volume of problems faced after the switchover can be reasonably managed by the team and resolved before the weekend is over. It doesn't factor in the potential for a failed cutover, which involves the work of reverting changes and trying again.

00:07:07.680 It puts an unnecessary cognitive load on team members, requiring them to hold a full understanding of both the before and after states of the system during reviews. It expects them to notice a needle in a haystack: the smallest bug in a gigantic diff. It also puts undue stress on team members by shifting all risk and uncertainty around the change to a small window of time. If things go badly, it risks depleting team morale. I'm going to share two stories of different kinds of system transitions that I've made to Rails applications on client projects over the past couple of years. As a fractional Rails developer, I work on small applications with small teams.

00:08:43.200 Your circumstances may involve a larger team, more lines of code, higher request volume, scaling challenges, connected systems, and so on. While my examples may be simpler than the systems you work with, I'd like to suggest that higher complexity makes it even more critical to identify these limbo states and establish good practices within them. A couple of years ago, I was working on a client project where we decided to migrate from Bootstrap 3 to Tailwind. I had some constraints: we couldn't stop all work on this app until the migration was over. Bug fixes were still needed, minor dependency upgrades were necessary, and some features had to be built along the way.

00:09:24.080 This migration needed to be able to be paused for a week or a month without causing problems. From experience, I knew I didn’t want to maintain a long-running branch for this work, and I didn’t want to do an all-in-one switchover to Tailwind. The long-running branch would make integration difficult and build risk to a point of crisis when that final merge happened. Changes to views are often particularly heavy diffs with lots of line changes. I find there's a significant fatigue factor when reviewing changes to the view layer, making it very difficult to spot errors.

00:10:08.880 While it wouldn’t be ideal to live for a period of time straddling two CSS frameworks, I knew that an all-in-one switchover would be asking for surprises that could lead to production incidents, support issues, rollbacks, and potential rework.

00:10:39.640 There’s another thing I knew I didn't want during this project: no side quests that would increase the scope and put the main goal—switching from Bootstrap to Tailwind—at risk. Let me underscore this: side quests are a huge temptation. It's understandable to think that while you're already in all these views, making these changes to migrate to a new CSS framework, it would also be a good time to do design updates to the app, extract new helper methods, switch all partials to strict locals, or even migrate to view components.

00:11:16.360 In general, I think this is a mistake; doing so makes it much more difficult to maintain continuity in the app. The change sets become even larger and more difficult to review, and in the case of design updates, it may unnecessarily tie dependency changes to the design approval process, leading to slower reviews and more friction. With this in mind, here's how I approached the project: first, I reviewed how we were using Bootstrap in the app. I had a good idea of the use case since I had been working on this app for a year or two. However, it's important to perform a thorough review to understand how changes to dependencies will affect the system.

00:12:02.240 For example, what default components are being utilized from Bootstrap? What Bootstrap JavaScript functions are being used? Is the provided icon set being utilized, or is it another? Are there any Bootstrap extensions being used? How has the framework been extended or modified? How much custom CSS relies on Bootstrap classes? What Bootstrap classes are relied upon for custom JavaScript? Out of the discovery phase, I needed to be clear on two things: destination and inventory.

00:12:46.760 During discovery, it's vital to articulate what the end goal is and what parity will mean for the project. In other words, to be successful, what needs to remain the same before, during, and after the transition, and what is acceptable if it differs? When it comes to CSS framework changes, the closer you need to be to interface perfection, the more expensive it becomes to achieve. Fortunately, my client did not have a high requirement. My goal was to be reasonably close so that navigating between a Bootstrap page and a Tailwind page would not be jarring to users, but I didn’t want to go so far as to require all views to be visually identical.

00:13:37.840 The other piece that came out of the discovery was inventory. Having an inventory is important to understand the scope of individual changes, make estimates easily, and track progress across items. In this case, I needed an inventory of views, partials, and helpers relying on Bootstrap. I also needed an inventory of other Ruby and JavaScript libraries that made use of Bootstrap, as well as any custom extensions. Along the way, I identified tasks that could be performed even before starting the framework migration, which would simplify the process.
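One way to start an inventory like this is a small script that lists which files reference Bootstrap classes. The sketch below is only an illustration: the class-name pattern is a hypothetical, incomplete subset of Bootstrap 3, the file paths are made up, and the files are passed in as strings so the example runs on its own.

```ruby
# Hypothetical subset of Bootstrap 3 class names to look for; a real
# inventory would use a list drawn from the app's own review.
BOOTSTRAP_CLASS_PATTERN = /
  \b(?:
    btn(?:-[\w-]+)? |
    col-(?:xs|sm|md|lg)-\d+ |
    form-control |
    navbar(?:-[\w-]+)? |
    panel(?:-[\w-]+)? |
    alert(?:-[\w-]+)?
  )\b
/x

# Given a hash of path => file contents, return each path that references
# Bootstrap classes along with the distinct classes found in it.
def bootstrap_inventory(files)
  files.each_with_object({}) do |(path, contents), inventory|
    matches = contents.scan(BOOTSTRAP_CLASS_PATTERN).uniq.sort
    inventory[path] = matches unless matches.empty?
  end
end

files = {
  "app/views/users/index.html.erb" => '<a class="btn btn-primary" href="/users/new">New</a>',
  "app/views/shared/_flash.html.erb" => '<div class="alert alert-danger">Oops</div>',
  "app/views/pages/about.html.erb" => '<p class="text-lg">About us</p>'
}

inventory = bootstrap_inventory(files)
inventory.each { |path, classes| puts "#{path}: #{classes.join(', ')}" }
```

In an actual app, the `files` hash could be built by reading everything matched by `Dir.glob("app/views/**/*.erb")`. A simple pattern like this will produce some false positives, so the output is a starting point for a manual review and not a finished inventory.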

00:14:21.479 We had an admin layout that had very few differences from the default application layout, so I consolidated the two and added a conditional check for when it was rendered in the admin context. Navigation is often one of the most complex areas in CSS frameworks. I simplified multiple navigation partials and extracted navigation-related helper methods to reduce duplication and minimize markup and class changes when updating to Tailwind.

00:15:01.360 This was also a good place to decouple Bootstrap classes from custom JavaScript and tests. For example, you might have JavaScript event listeners on form control inputs or specs that check flash text in an alert div. With my inventory in hand and preliminary tasks completed, I needed to do all the setup to support using Tailwind without hindering Bootstrap use. In my application controller, I added a CSS framework helper method that defaults to Bootstrap but can be overridden with a query string parameter to Tailwind, allowing me to convert controllers one at a time with a way to manually test by switching views.
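The helper described above might look something like the following sketch. It uses a plain Ruby class in place of a real Rails controller so that it runs on its own, and the parameter name `css_framework`, the constant, and the `AdminReportsController` subclass are all assumptions for illustration.

```ruby
# Plain-Ruby stand-in for a Rails controller so the sketch runs on its own;
# in the app this would be ApplicationController using Rails' `params`.
class ApplicationController
  SUPPORTED_CSS_FRAMEWORKS = %w[bootstrap tailwind].freeze

  attr_reader :params

  def initialize(params = {})
    @params = params
  end

  # Defaults to Bootstrap. A query string such as ?css_framework=tailwind
  # overrides it for manual testing. The parameter name is an assumption.
  def css_framework
    requested = params[:css_framework]
    SUPPORTED_CSS_FRAMEWORKS.include?(requested) ? requested : default_css_framework
  end

  private

  def default_css_framework
    "bootstrap"
  end
end

# A (hypothetical) controller whose views have been converted
# switches its default to Tailwind.
class AdminReportsController < ApplicationController
  private

  def default_css_framework
    "tailwind"
  end
end

puts ApplicationController.new.css_framework                            # unconverted controller
puts ApplicationController.new(css_framework: "tailwind").css_framework # manual override
puts AdminReportsController.new.css_framework                           # converted controller
```

Layouts and asset tags can then branch on `css_framework` to decide which entry point and partials to load. Restricting the override to a list of known values keeps an arbitrary query string from selecting a framework that has no assets.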

00:15:59.640 I added Tailwind dependencies, set up a new entry point, and conditionally loaded assets based on the CSS framework. Here’s what you may notice: we were also in limbo between Sprockets and Webpacker at the time. Next, I split out separate versions of Bootstrap and Tailwind partials for common and shared parts of the layout, such as navigation. I also split out separate versions of Bootstrap and Tailwind helper methods used for Navbar and links.

00:16:43.760 One trick here is to make the destination the default and the origin the special case. The original Bootstrap version of all the partials was moved into a Bootstrap subdirectory, leaving the Tailwind partials in their original position. I prefixed all the original Bootstrap helpers while leaving the Tailwind versions as defaults. This way, when Bootstrap was finally removed, it led to no line changes for all the Tailwind-related helpers and partials. I could have done better in at least two cases; first, you might have noticed already that I could have set the CSS framework helper method to default to Tailwind and overridden it to Bootstrap for all subclass controllers, removing it as I worked through each controller.

00:17:42.640 The second case was the form builder; I needed to add a new Tailwind-specific form builder. So, as I went along, I updated each form to specify that new builder, but it might have been better to first designate a Bootstrap form builder and update all forms to explicitly use it, then create a Tailwind form builder and set it as the new default, so that each form fell back to it once I removed the explicit Bootstrap builder.

00:18:23.680 With the structure in place to convert to Tailwind, I started the process of updating views. I worked roughly from least to greatest risk, starting with low-traffic admin views for internal staff. I would typically convert all views for a single controller at once by overriding the CSS framework method for that subclass controller and updating its related views together. When doing this, it’s important to watch for views that may make use of partials in other directories that might primarily be used by controllers that haven’t been converted yet.

00:19:09.360 This is one reason why having that inventory and clarifying relationships between inventory items is critical. This work is tedious, but because it’s one controller or section at a time, it’s easier to change, review, release, and move on without having to retain the context over the long term. As these items get crossed off, you start building momentum and can see what’s left to do.

00:19:43.840 At this stage, you will inevitably notice ways you could simplify views, abstractions you could extract, and other improvements to markup and styling, but I would suggest resisting the temptation to act on them immediately. Document those ideas for later. The greatest risk is allowing your well-intentioned desire to improve the code you’re working on to prevent you from achieving the immediate goal of completing this transition.

00:20:10.720 In one final pull request, I removed the Bootstrap Sass gem, the Bootstrap CSS entry point, the stylesheets needed only for Bootstrap, the CSS framework helper and all its conditional checks, and all the Bootstrap-namespaced partials and helpers. The fatigue of these incremental steps may tempt us to skip this final phase, but that last red pull request should be seen as a reward.

00:20:59.839 Even though there’s no functional difference to the system in this last change set, it’s symbolic. I think we underappreciate the psychology of developers. We want to do the right thing; we want things to be clean, tidy, and neat. The satisfaction of completion provides a needed refueling and reinvigoration for the toil and tedium of the conversion stage. This strengthens our resolve for the next migration project and period of limbo. This entire process took about three months of calendar time with a total of 30 pull requests.

00:21:50.080 Now, let’s talk about a different kind of transition—one that involves architecture and data modeling. I was working on a client project facing challenges because Stripe billing and subscriptions were coupled to the user model. It’s common wisdom now that if you’re building a SaaS application, you should start with an account or team model and connect your billing and subscriptions to that, rather than directly to the user model.

00:22:27.600 However, that wasn’t well understood a decade ago when payments were first introduced into this app. We knew that subscriptions and billing ultimately didn’t belong as responsibilities on the user, but the work to migrate to that new solution would be both extensive and critical. As with the previous example, I didn’t want to maintain a long-running branch for this project. I also didn’t want to create a crisis point with an all-in-one switchover. I couldn’t block other work being done in this app, but I needed to minimize transition time as much as possible by avoiding side quests.

00:23:16.160 So, with all that in mind, here's how I approached the project: first, I did a review of the system. This involved understanding what subscription and billing-related data was being stored within the system, categorizing and mapping all the billing-related processes in the app. As I didn’t write the billing code myself, I documented all the classes and methods related to billing and subscriptions, describing their purposes in plain English, listing all other parts of the system that called or relied on them.

00:23:55.360 I also documented questions I had about how the system was intended to work, particularly concerning parts that appeared not to be in use anymore, so I could discuss and get confirmation with the product owner. In the CSS framework migration, we could live with close but not precise visual parity. Here, we needed all aspects of subscriptions to work exactly the same as before. In the future, extracting billing from the user model would open up additional benefits, but immediately following the migration, there should be no functional difference to the system.

00:24:40.000 My review resulted in several inventories. One was a list of all the database columns related to billing and their respective ActiveRecord models, which would inevitably need to be extracted from a direct relationship to the user. The other was a list of all parts of the system, down to the method level, that related to subscriptions and billing and would need to be changed. During my review process, I looked for tasks that could be accomplished ahead of the migration work to simplify the process. I found a lockout feature that would lock the user out of their account in case of a Stripe billing dispute.

00:25:34.720 Lockout modeling and functionality were tied to the user, which wouldn't make sense to continue after moving billing up to the account level, but most of this feature was separate from billing other than an event handler responding to a dispute event from Stripe. I moved this functionality to the account level ahead of the migration. With a good understanding of the system, my list of inventories, and preliminary tasks completed, I set up everything needed to start the migration.

00:26:28.320 My plan was to create a new model holding billing and subscription data, work through the system to dual-write to both the old and new locations, and have a verification script check that the values matched between the origin and destination. After completing the dual-write and verification steps, I would change the source of truth to the new model and go back through the system to stop writing to the old location. For setup, I created a new billing accounts table and the billing account ActiveRecord model with a way to sync individual records from their origin.

00:27:20.000 I created a script that forced the sync of all records from origin to destination, and a checklist for working through the entire system to make changes. I began creating a billing account record upon user signup and ran a backfill job to create and sync billing account records for all existing accounts so that all would have an associated billing account with a current snapshot of data. From there, I was ready to move into the incremental migration portion of this project. These migration steps are broken down into three main sections: dual write, change source of truth, and remove original.

00:28:58.640 For dual write, I would take a method or a background job and change its arguments from the user to the account or billing account. In cases where data was persisted, I would create a transaction around the original persistence call and add the corresponding call to change the billing account, leaving a to-do for myself at each incision point so I would know where to return and remove the old call after verifying that dual write had maintained data parity. I was then ready for the change-source-of-truth phase, which involved relying on that new billing account data within the system, whether it was the trial period, expiration status, etc.
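To make the dual-write step concrete, here is a minimal sketch using in-memory structs in place of ActiveRecord models so it runs without Rails. The `User` and `BillingAccount` attributes, the `change_plan` method, and the tiny `transaction` helper are hypothetical; in a Rails app the block would be wrapped in a real database transaction.

```ruby
# In-memory stand-ins for the ActiveRecord models so the sketch runs
# without Rails. Names and attributes are hypothetical.
User = Struct.new(:id, :plan, keyword_init: true)
BillingAccount = Struct.new(:user_id, :plan, keyword_init: true)

# Minimal substitute for a database transaction: if the block raises,
# every record is restored to its prior values and the error is re-raised.
def transaction(*records)
  snapshots = records.map(&:to_h)
  yield
rescue StandardError
  records.zip(snapshots).each do |record, snapshot|
    snapshot.each { |attribute, value| record[attribute] = value }
  end
  raise
end

# Dual write: the original write and the new write succeed or fail together.
def change_plan(user, billing_account, plan)
  transaction(user, billing_account) do
    # TODO: remove this original write once data parity has been verified.
    user.plan = plan
    # New write to the destination model.
    billing_account.plan = plan
  end
end

user = User.new(id: 1, plan: "basic")
billing_account = BillingAccount.new(user_id: 1, plan: "basic")

change_plan(user, billing_account, "pro")
puts "user=#{user.plan} billing_account=#{billing_account.plan}"
```

Wrapping both writes together is what keeps the old and new locations from drifting apart when one write fails, and the TODO comment marks the spot to revisit during the remove-original phase.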

00:29:54.720 At this point, I was still using the verification script to ensure we maintained data parity with the new model. Finally, for the remove-original phase, I worked back through the system to eliminate all the original persistence calls along with the transactions wrapping them. At this point, I would no longer be able to rely on the verification script, but I had confidence that we maintained parity and were now fully relying on the new model as the truth of the system.

00:30:37.200 Each of these change sets was a scoped-down pull request that could be reviewed and deployed easily. They could also be rolled back if necessary and monitored to ensure successful use in production. By giving each change time in production, I could gain confidence by observing the code's usage as well as verifying data parity with the verification script. Discrepancies didn’t happen often, but I caught one instance where I persisted a user’s email address upon signup. Devise automatically downcased the email on the user record, but I hadn’t done the same in the billing account record.

00:31:03.520 So, my case-sensitive comparison script captured the difference. To fix this, I mimicked the Devise functionality to downcase and strip the email using ActiveRecord normalization, deployed, reran my sync from origin to destination, and gave it additional time to surface other issues. Finally, I was ready for the teardown. After several months of work on this project, it might have been easy to stop here and not complete the last step, but doing so denies the full satisfaction of witnessing the system simplified and denies others the opportunity to release their former knowledge about how the system previously worked.
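A verification script of the kind described above can be quite small. This sketch compares emails between origin and destination with a case-sensitive check and then shows the discrepancy disappearing once the destination is normalized. The struct names, sample data, and `normalize_email` helper are hypothetical, and the records are in-memory so the example runs without Rails.

```ruby
# Hypothetical in-memory records standing in for users and billing accounts.
UserRecord = Struct.new(:id, :email, keyword_init: true)
BillingRecord = Struct.new(:user_id, :email, keyword_init: true)

# Mirrors the downcase-and-strip behavior described in the talk.
def normalize_email(email)
  email.to_s.strip.downcase
end

# Case-sensitive comparison of origin and destination. Returns one entry
# per mismatch, including users whose billing record is missing entirely.
def email_mismatches(users, billing_records)
  billing_by_user_id = billing_records.to_h { |record| [record.user_id, record] }
  users.filter_map do |user|
    billing = billing_by_user_id[user.id]
    next if billing && billing.email == user.email

    { user_id: user.id, origin: user.email, destination: billing&.email }
  end
end

users = [
  UserRecord.new(id: 1, email: "ada@example.com"),
  UserRecord.new(id: 2, email: "grace@example.com")
]
billing_records = [
  BillingRecord.new(user_id: 1, email: "ada@example.com"),
  BillingRecord.new(user_id: 2, email: " Grace@Example.com ")
]

before_fix = email_mismatches(users, billing_records)

# Apply the same normalization to the destination, then verify again.
billing_records.each { |record| record.email = normalize_email(record.email) }
after_fix = email_mismatches(users, billing_records)

puts "mismatches before: #{before_fix.size}, after: #{after_fix.size}"
```

In a Rails 7.1 or later app, the normalization itself could be declared on the model with `normalizes :email, with: ->(email) { email.strip.downcase }`, which is presumably close to what "ActiveRecord normalization" refers to here. Keeping the comparison strict is what let a difference like this surface at all.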

00:31:52.800 I removed all remaining references to the old billing and subscription data from the user, including tests and factories, and eliminated the now unused columns on the users table. This entire process took around six months with a total of 42 pull requests. As I worked on various kinds of migration projects like these over the past few years, I've felt that there's perhaps some overarching pattern here, regardless of whether the change was to dependencies, schema, or data processes.

00:32:46.960 You've seen it twice already, but let me lay it out plainly. I see three main stages representing before, during, and after the transition. Within those, we have the following phases: we start with the discovery of the system and what it will mean to make the transition. We may have an idea of the changes to be made, but we need a robust understanding of how the system currently works, as well as clarity about what the end state will look like.

00:33:33.640 Discovery should produce two things: destination and inventory. The destination may be obvious in some cases, but it may not always be. We might define a new data model to replace an existing structure or rewrite an algorithm, but no matter what, we need to define what parity will mean between the current system and that goal state. In the first example, parity was more subjective; we didn’t require a pixel-perfect match between CSS frameworks, but we wanted to avoid a jarring experience for users navigating between old and new layouts. In the second example, parity was more objective; we needed precisely the same functionality before and after the transition.

00:34:26.720 Inventories will differ depending on the type of migration. In the first case, it included a list of views, partials, helpers, Bootstrap extensions, and so on. In the second, it comprised all the billing attributes being stored, as well as a list of all the methods calling or changing those attributes. In both cases, this required understanding dependencies among those items. The inventory helps us plan steps, track progress, and can also be useful for estimating time and complexity.

00:35:13.760 Now that we have a good sense of our goal state as well as the project scope and complexity, it’s a good checkpoint to pause and reconsider whether the migration is still worthwhile for the team. If we want to continue, there’s likely more we can do to reduce both the complexity and calendar time of the project. During discovery, we likely identified preliminary tasks that can be completed before we start the transition, which improve the system regardless of whether we continue.

00:36:12.640 The more we can accomplish ahead of the transition, the better our chances for success as we navigate through that limbo period. When preliminary tasks are complete, we will again have more information about what it will take to make this transition, and we may once more evaluate if we want to continue.

00:36:58.720 Now, we enter the setup phase. At this point, we’ve begun the transition and, until we reach the teardown stage, we must hold a conceptual understanding of both the current state and the final state of the system. Because of this, we want to keep this timeframe as brief as possible. Setup work differs from preliminary tasks in that this is the specific staging needed to allow incremental changes leading to that desired final state.

00:37:39.440 This may mean adding new dependencies, new columns in tables, duplicating views, helper methods or classes, setting up structures to switch back and forth, adding the ability to transform and sync data, and building methods of verification. Now we enter a phase of incremental steps, usually iterating over those inventory items as smaller units of work. Not all units will follow the same process, but many will. The important part is that each step is short, easy to review, verifiable, and reversible.

00:38:31.760 In my experience, it’s not essential to plan out all these incremental steps in order beforehand. Instead, if you have the goal state in mind, the necessary staging, and inventories to guide you, you can often identify the next piece of work that can be done without disturbing other parts. When I was a kid, we played a game called pick-up sticks, where colored sticks were dropped in a pile, and each player took turns trying to remove one stick from the top of the pile without disturbing the others.

00:39:23.040 It would have been nearly impossible to plan the removal of each stick in order at the start of the game, but on your turn, you only had to choose the next easiest stick to remove, making the task achievable. One benefit of incremental steps is that they uncover misunderstanding and incorrect assumptions earlier. As much as you may feel you understand your app as it currently exists, and that end goal state, you will inevitably be surprised by something along the way. Small, incremental changes increase understanding and surface insights that help you make adjustments sooner, improving system and project confidence.

00:39:58.440 While the pattern for these incremental changes will differ based on the type of migration project, each step should involve some verification, possibly at both the unit and system levels. Automated test coverage is one part of this, but it may also involve manual testing and verification scripts. For the CSS framework migration, the unit-level verification was a manual page-by-page visual comparison. For the billing migration project, system-level verification checked data parity in production across accounts between the old and new data structures.

00:40:32.400 Some migration projects will involve a change in the source of truth, which may not happen at the end of the incremental steps. In my second example, there was an initial pass through the system to dual write to both the old and new data locations. After verifying parity, we made a second pass to change the source of truth and then removed writing to the old location.

00:41:07.280 At this point, we’ve completed the transition, but the work isn’t fully done. If you repaint your living room, the work isn’t complete as soon as the wall color changes; the painters’ tape still has to be removed, the paint cans, ladders, and drop cloths put away, and the room put back in order. The teardown work is the final phase, removing remnants of the original state and structures created just for the incremental steps.

00:41:41.760 Once the teardown is finished, we can finally release the knowledge we’ve held about how the system worked prior. This phase is often easy to neglect due to significant fatigue after the potential long haul of those incremental steps. However, completing this phase provides emotional relief, releases cognitive burdens for the team, and offers satisfaction that should not be underestimated.

00:42:31.680 So why discuss all this? What are the benefits of naming the scenario and defining a process for these transitions? Here are my reasons: first, the work involved in these transitions is often underappreciated, ignored, or simply not recognized. As Carl Jung noted, until you make the unconscious conscious, it will direct your life, and you will call it fate. Without recognition and appreciation for this work, we are subject to forces we assume are external and outside our control.

00:43:27.600 All the work we do deserves to be visible, conscious, recognized, and appreciated. It’s important for understanding the total cost of software ownership, for team morale, for planning and resource allocation, and for system knowledge and comprehension. Second, because this work is less visible and less recognized compared to something like feature development, we often start transitions that we don’t finish.

00:44:20.560 This happens because we underestimate the time and effort it will take to complete the transition, we lack a process that keeps us from getting lost or sidetracked in the middle, and we fail to persuade our team of the benefits gained by reaching the destination. Instead, we embark on ambitious work optimistically, become distracted by side quests, lose momentum as time expands beyond expectations, and the work eventually gets deprioritized as costs increase and the initial urgency fades.

00:45:12.000 Third, when this work is deprioritized, we are still left in limbo. The software system and our understanding of it remain adrift; we’ve left one shore of safety and haven’t reached the other. The additional risks and efforts we believed were temporary become permanent. Leaving these transitions unfinished teaches our teams that this is normal and sets precedents for future transitions, diminishing the perceived benefits to be had and reducing both the credibility of our plans and certainty in the minds of team members.

00:46:02.560 Risk and cognitive load compound the more projects linger in limbo, but we gradually learn to accept this overhead. Perhaps the only time this is questioned is when we onboard a new team member and must explain why half the codebase uses one pattern, library, or data source while the other half uses a different one.

00:46:56.600 Ultimately, we deny ourselves the resolution and satisfaction of arrival—the relief and release from cognitive burden when we can forget how the system worked before, along with the confidence to complete extensive cross-cutting work when we set out to do it. This doesn’t have to be our fate. By correctly identifying our circumstances, creating clearer processes, and recognizing hidden work, we can increase our chances for success and enjoy the rewards of reaching the destination when we find ourselves in limbo.

00:47:29.920 Thanks!
