A Decade of Rails Bug Fixes
Jean Boussier • Tokyo, Japan & online • Talk

Date: October 28, 2023
Published: November 02, 2023
Announced: unknown

https://kaigionrails.org/2023/talks/byroot/

[Abstract]
In this talk we'll go over two Rails bugs I fixed 10 years apart. For both I will detail how the bug was found, how I debugged it, and how I finally fixed it. I will also reflect on what I learned in the process. If you are a beginner you may learn some debugging techniques; if you are more experienced you may enjoy the war stories and learn some gritty details about Ruby and Rails internals.

[Speaker]
byroot
GitHub https://github.com/byroot

Kaigi on Rails is a web-focused technical conference that everyone from beginners to advanced developers can enjoy.
https://kaigionrails.org/

Kaigi on Rails 2023

00:00:00.919 Hello everyone! I apologize for not being here in person, but I hope you will still enjoy my presentation.
00:00:06.839 In this talk, I will present two Rails bugs: one from about 10 years ago and another from just a few months back.
00:00:12.120 For each bug, I will detail how I found it, how I debugged it, and how I finally fixed it.
00:00:20.000 I will also reflect on what I learned in the process.
00:00:26.080 If you are a beginner, you may learn some debugging techniques; if you are more experienced, you may enjoy the stories and learn some gritty details about Ruby and Rails internals.
00:00:32.000 Let me introduce myself. I am Jean Boussier, better known as byroot.
00:00:39.800 I'm a member of the Rails Core Team as well as a Ruby committer.
00:00:46.199 I also maintain a number of popular gems such as Redis, Bootsnap, and MessagePack.
00:00:52.520 Professionally, I work for Shopify on their Ruby and Rails infrastructure team. When I'm working for that team, you'll usually see me interact on GitHub as casperisfine.
00:01:04.000 I do a bit of everything, but my focus is mainly on performance and stability.
00:01:08.840 The first bug I'd like to present, which also happens to be my very first Rails contribution, occurred nearly 10 years ago and involves counter caches.
00:01:15.880 But first, let me provide a bit of context.
00:01:22.000 At that time, I joined the newly created Shopify office in Montreal.
00:01:27.200 Shopify was growing rapidly, so we frequently had new team members in the office.
00:01:32.960 I often didn't have specific tasks assigned, so I started to fix bugs I found in open issues on GitHub.
00:01:39.000 Then, one day, my boss mentioned a slow database query issue.
00:01:49.000 The way we were finding slow queries back then, and probably still do today, was through the MySQL slow query logs.
00:01:54.759 Each time a query took longer than a set threshold, it would be logged for us to review.
00:02:00.640 In this instance, it was a count query that was counting products inside product collections.
00:02:09.440 It was taking 126 milliseconds, which is quite a long time, even if it was executed from a background job.
00:02:15.200 This was stressing our database more than we wanted.
00:02:21.200 To avoid slow count queries, the usual way is to use Active Record's counter caches.
00:02:26.599 This is a fairly straightforward denormalization technique: you simply create a column on the parent model, and Active Record takes care of incrementing and decrementing that column whenever records are created or destroyed.
00:02:39.000 Under the hood, it uses atomic queries to prevent race conditions.
00:02:47.200 In the same transaction where you create a record, it emits a query to automatically increment the counter; when you delete a record, it automatically decrements that counter.
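In Rails, this denormalization is a one-line declaration on the child model (the model and column names below are illustrative, following the Rails convention of a products_count column on the parent table):

```ruby
class Product < ApplicationRecord
  # counter_cache: true keeps `collections.products_count` in sync:
  # alongside each INSERT/DELETE of a product, Active Record emits an
  # atomic update such as:
  #   UPDATE collections SET products_count = products_count + 1 WHERE id = ?
  belongs_to :collection, counter_cache: true
end
```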
00:02:55.080 So, I went to look for where the query was originating from, and I noticed it wasn’t using a counter cache.
00:03:02.800 I thought I would simply add one, but when I went to add it, I found that a counter cache already existed.
00:03:09.640 I used git blame to determine where the line of code was coming from.
00:03:16.519 It turned out that barely six months prior, someone had changed that line of code to stop using the counter caches.
00:03:22.200 Clearly, there was more to this problem than a simple fix.
00:03:31.000 By following issues and threads on GitHub, I managed to find an even older issue that mentioned desynchronized counter caches.
00:03:39.000 Some customers reported, through our interface, collections with a negative number of products in them, which is obviously impossible.
00:03:46.560 There were attempts to resolve this: some involved moving those counter caches to Redis and other strategies, but none had solved the problem yet.
00:03:54.719 My colleagues resorted to using count queries again.
00:04:03.360 In my initial assessment, I didn't believe it could be a concurrency issue: that was what previous developers attempting to fix the bug had assumed, and they had failed.
00:04:09.920 I was also very confident that Active Record's atomic incrementation was quite reliable—essentially foolproof.
00:04:15.080 Counter caches had been in Rails pretty much since the very first version.
00:04:20.639 Surely, if this bug existed, someone would have noticed it a long time ago.
00:04:27.360 The next step was to look at the implementation to see what else could be going on.
00:04:34.000 I started exploring the implementation to see if we were misusing it somehow.
00:04:40.120 The code was quite cryptic due to extensive meta-programming.
00:04:46.400 However, if we remove the meta-programming layer, the logic becomes quite straightforward.
00:04:54.400 Here's a simplified version of how they were implemented: it simply used two Active Record callbacks.
00:05:03.520 After a record is created, it invokes the incrementer; before a record is destroyed, it decrements.
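Stripped of the meta-programming, the wiring amounted to roughly the following sketch (not the actual Rails source; `increment_counter` and `decrement_counter` are real Active Record class methods):

```ruby
class Product < ApplicationRecord
  belongs_to :collection

  # Both adjustments are unconditional: nothing checks whether the
  # DELETE actually removed a row, which is what made the bug possible.
  after_create   { Collection.increment_counter(:products_count, collection_id) }
  before_destroy { Collection.decrement_counter(:products_count, collection_id) }
end
```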
00:05:10.960 My first thought was that since it used model callbacks, we might be doing updates without triggering the callbacks.
00:05:16.880 However, a careful audit of the application didn’t reveal anything like that, so I was back to square one.
00:05:22.560 I had no idea what was going on.
00:05:29.000 So, I went to analyze the data I could observe.
00:05:34.560 Looking at the affected records in production, the desynchronized caches were always lower than they should have been, never higher.
00:05:41.800 This indicated that the issue logically resided in the destroy code path, not on both paths.
00:05:49.000 So, perhaps it was a concurrency issue after all.
00:05:55.080 I went to try and reproduce the problem in isolation.
00:06:02.240 I created a minimal Rails application from scratch with a similar cache setup, even a bit simplified.
00:06:09.680 I set it up with Unicorn using eight processes to simulate a significant amount of concurrency even locally.
00:06:16.120 I followed the usual Rails scaffold process and created a few products.
00:06:28.000 Then I played around with it until I decided to click the same destroy button very rapidly.
00:06:34.000 I was finally able to reproduce the bug.
00:06:40.400 As you can see up there, the counter cache indicates a single product, but I had two listed under it.
00:06:47.960 This was exactly what we were witnessing in production, confirming it was indeed a concurrency issue.
00:06:54.240 The next step was to understand better what had happened.
00:07:02.000 Since I could reproduce it on my local machine, I could observe the issue as much as I wanted.
00:07:08.520 I could add logs to the application to help me track down the issue.
00:07:14.400 The first thing I looked at was the Rails log to understand the course of events.
00:07:21.680 Here you can see an explanation of the bug.
00:07:27.960 I highlighted two processes in red and blue.
00:07:33.000 The red process loads the record, and then the blue process loads the same record.
00:07:41.920 Both processes then issue delete queries and update the counter.
00:07:48.320 Two processes managed to destroy the same record without failing.
00:07:54.960 Having understood that, I was able to simplify the reproduction further.
00:08:01.680 I could do it with very simple Ruby code in the same process; no need for two processes.
00:08:08.640 You could simply load the same record twice and then destroy it twice.
00:08:16.480 Now that I had such a simple reproduction, I decided to analyze it directly in the MySQL command line.
00:08:22.959 That's where it hit me: MySQL DELETE statements are idempotent.
00:08:31.920 A DELETE succeeds even if the row is already gone; you can repeat it and end up with the same result, the record remains deleted.
00:08:37.840 Therefore, I wanted to figure out a proper fix for this issue.
00:08:43.760 The best way to work on a fix is usually to have a good feedback loop: a fast way of reproducing and validating changes.
00:08:50.560 Ideally, I would write a test case.
00:08:56.760 This would serve as a good bug report if I couldn't figure out the fix myself.
00:09:03.160 Before working on a bug fix, I searched for existing pull requests and issues about it.
00:09:10.560 Such a significant bug surely had been encountered by someone else.
00:09:17.520 I found an open pull request that unfortunately wasn't merged.
00:09:24.680 The reason it wasn't merged was that it used locks, which have a detrimental effect on performance.
00:09:31.480 Using locks created a lot of contention in the database.
00:09:39.000 If I wanted this bug to be fixed in Rails, I needed to find a more effective solution.
00:09:44.680 That's when it hit me: while I was messing around in the MySQL shell, the answer was right under my nose.
00:09:58.000 In the first query, MySQL tells us one row was deleted; in the next, it indicates that no row was deleted.
00:10:06.000 If we could access this information in Ruby, we could simply avoid issuing the decrement if no record was affected.
00:10:12.240 It turns out this information is already present in Ruby: if you call the delete method on a relation in Rails, you actually receive an integer back, indicating the number of affected rows.
00:10:19.320 So it was just a matter of wiring everything together to make it work.
00:10:27.960 But the problem was that since counter caches were implemented as callbacks, it was hard to access this information.
00:10:35.320 Callbacks run in a fairly isolated context: they don't receive parameters and have no access to the query's return value.
00:10:41.720 The first thing I did was refactor them to be first-class features—not inside callbacks anymore.
00:10:48.120 Instead, they should be a regular module in Active Record that simply executes before and after create and destroy.
00:10:53.760 Once that was done, it was just a matter of adding a simple condition: only perform the decrement if the delete actually affected a row.
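The essence of the fix can be simulated with plain Ruby, no database required. `FakeTable` is a stand-in I made up for this illustration; it mimics an idempotent SQL DELETE that reports how many rows it removed:

```ruby
class FakeTable
  def initialize(ids)
    @rows = ids.to_a.dup
    @counter = @rows.size
  end

  attr_reader :counter

  # Like a SQL DELETE: idempotent, and returns the number of rows
  # actually removed (0 if the row was already gone).
  def delete(id)
    before = @rows.size
    @rows.delete(id)
    before - @rows.size
  end

  # The fix: only decrement the counter when a row was really deleted.
  def destroy(id)
    @counter -= 1 if delete(id) > 0
  end
end

table = FakeTable.new([1, 2])
table.destroy(1)
table.destroy(1)  # concurrent double-destroy: the DELETE matches nothing
table.counter     # => 1
```

Without the `> 0` guard, the second `destroy` would drive the counter down to zero even though only one row was ever deleted, which is exactly the desynchronization seen in production.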
00:11:00.640 I tested this and was happy with the results.
00:11:06.879 But before submitting the fix upstream, I wanted to test it in production.
00:11:12.240 I needed to ensure it would work at scale, not just on my machine.
00:11:17.520 I also didn’t want to wait for the next Rails release to fix this issue in our application.
00:11:24.240 Perhaps this wasn’t the only bug, and there might be another underlying issue.
00:11:30.080 So, I implemented the fix as a monkey patch that I could apply to our application.
00:11:38.720 I let the monkey patch run in production for several weeks, regularly checking queries to identify any desynchronized counters.
00:11:45.440 To my surprise, after several weeks, I couldn’t find a single desynchronized counter.
00:11:52.760 I was able to re-enable the cache and make that slow query disappear.
00:11:59.320 Now that I had a fix, the next step was to contribute it back upstream.
00:12:05.760 Maintaining a monkey patch is quite annoying and painful, and there was no reason not to fix Rails itself.
00:12:12.720 Since other people were experiencing this bug, we might as well share the fix with them.
00:12:19.120 It felt really good to submit a clean solution for such a longstanding bug that had been plaguing Shopify for over a year.
00:12:25.440 The Rails core team member reviewing my pull request ended up being Aaron Patterson.
00:12:32.640 Unfortunately, things didn't go quite as I imagined.
00:12:40.760 His response indicated he didn’t properly understand the bug or the fix I proposed.
00:12:47.720 I felt a bit destabilized! My English and general communication skills were not great at that time.
00:12:54.240 Now that I maintain many projects, including Rails, I have a better understanding of how things work.
00:13:01.760 Maintainers can be wrong, especially with large projects like Rails where there isn't a single maintainer for all the code.
00:13:09.200 Sometimes they are discovering the code at the same time as they review your pull request.
00:13:16.560 I had spent days and weeks thinking about this problem, so I often had fresher context on the feature than the maintainers did.
00:13:24.000 When you’ve identified and fixed a bug, your mind is full of context.
00:13:31.840 The challenge lies in identifying the relevant information and effectively documenting it in the pull request.
00:13:38.400 I didn’t do a great job of that; I assumed the problem and solution were obvious since it was fresh in my head.
00:13:45.360 Fortunately for me, Aaron is really nice, so we just talked about my blunder.
00:13:52.560 Many maintainers wouldn’t have taken it as well, but the pull request was ultimately merged, along with a couple of cleanup efforts.
00:13:58.960 That was my first major Rails contribution.
00:14:06.000 What I think we can learn from this is that the debugging technique I used resembles the scientific method.
00:14:13.600 You theorize about what it could be and conduct an experiment to confirm your theory.
00:14:20.480 If it works, great—you’ve found your bug. If not, you have learned more that allows you to formulate another theory.
00:14:27.680 This doesn’t need to be followed too strictly; sometimes prior knowledge or intuition can help you jump directly to the bug.
00:14:34.480 But when you don’t know where to start, this approach is extremely helpful.
00:14:41.720 Another important takeaway is that your dependencies are code that someone else has written, perhaps a colleague who left the company.
00:14:48.560 It doesn't matter whether you wrote it or someone else did; you should be able to debug and fix it yourself if needed.
00:14:55.200 To me, open source and dependencies are the same: they are written and maintained by other people.
00:15:02.080 But you should still be able to debug issues and submit a fix or at least a good bug report.
00:15:08.080 This means that it shouldn't require any authorization from a manager; it’s just like debugging your own code.
00:15:14.840 One significant aspect of debugging is to get a reproduction.
00:15:20.640 Unless it’s a really trivial bug, you’re unlikely to get anywhere until you can reproduce the issue.
00:15:26.560 The faster you can reproduce it, the better: it enables more iterations of theorizing, observing, and so on.
00:15:34.640 And even if you cannot find a fix yourself, a reliable reproduction will be incredibly helpful.
00:15:41.840 Those with more context on the issue will likely be able to fix it.
00:15:48.640 Lastly, communication is crucial. It doesn’t matter if you’re right; if you come off as rude, you won’t get anywhere.
00:15:55.120 It’s essential to be nice and agreeable, in addition to being correct.
00:16:02.640 You want the maintainer to be inclined to help you.
00:16:09.920 There’s no point in being right if nobody is listening to you.
00:16:16.720 Now, I'd like to present the second bug, which is from earlier this year in March.
00:16:24.040 In the ten years since my first bug, I became a Ruby committer as well as a Rails committer, and later a Rails core member.
00:16:31.440 I am still working at Shopify in the Ruby on Rails infrastructure team, which didn’t exist back in 2013.
00:16:39.920 Now, bugs don’t scare me anymore.
00:16:48.000 So, when I saw a bug posted on ruby.social, a Mastodon instance (Mastodon being a kind of open-source alternative to Twitter), I figured I would take a look.
00:16:54.640 Someone was asking for help with a bug in the source code of Mastodon, which happened to be a Rails application.
00:17:01.920 I had a bit of time to help, so I decided to wrap this up quickly.
00:17:08.000 The application had been made compatible with Ruby 3.2, thanks to Aaron.
00:17:15.360 He wanted to be able to test the latest YJIT developments on Mastodon.
00:17:23.120 After some users upgraded to Ruby 3.2, they started experiencing errors in production.
00:17:30.320 I just glanced at the backtrace: it seemed to be reading an attribute on a model and throwing a nil error.
00:17:37.680 So, most likely, I thought it was trying to access the ID of an Active Record instance.
00:17:44.000 Probably this instance didn't have an ID.
00:17:51.520 Since I didn’t have a lot of time, I assumed this was a simple situation.
00:17:58.720 I provided them a quick workaround and promised to submit a small pull request.
00:18:05.600 However, while trying to implement the fix on the Rails side, I quickly realized I was totally off.
00:18:12.960 There were many clues in the thread that I had entirely overlooked.
00:18:20.000 I had jumped to conclusions way too quickly.
00:18:26.600 I decided to rewind and look at the source code more thoroughly.
00:18:33.600 I noticed that the problem was not that the model didn’t have an ID column.
00:18:41.920 It was trying to access the ID of a model before its attributes were loaded.
00:18:48.000 That was a crucial point I had missed.
00:18:54.040 Another important clue shared by another user later in the thread was that this issue only manifested itself on Ruby 3.2.
00:19:01.120 Any explanation that didn’t account for a change in behavior between Ruby 3.1 and Ruby 3.2 wouldn’t suffice.
00:19:07.600 It had to be a change in how Ruby behaves.
00:19:14.840 Since the bug was triggered by loading a record from the cache with Marshal, I began looking for suspicious changes in the Marshal implementation.
00:19:22.000 As a Ruby committer, I was able to navigate the Ruby codebase relatively easily.
00:19:30.000 I checked for all changes in the Marshal implementation between Ruby 3.1 and Ruby 3.2.
00:19:36.640 I was hoping to find something obvious, but nothing stood out.
00:19:43.000 To be honest, from experience, combing through the git log is always a long shot.
00:19:50.240 Unless you know that a change happened within a small timeframe, you're unlikely to find a bug by simply looking at the git log.
00:19:57.200 You first need to understand the bug before you can identify what introduced it.
00:20:03.440 At this stage, I had a better idea of what was happening, but it still wasn’t enough for me to reproduce the issue.
00:20:09.520 I couldn’t investigate further, so I waited for users of Mastodon to report more information.
00:20:16.160 The next day, another user posted an interesting observation.
00:20:23.680 They tried loading one of the broken Marshal payloads with Ruby 3.1 and encountered the same issue.
00:20:31.040 Before that, we believed the bug was in Marshal, but we didn't know whether Ruby 3.2 was failing to load valid payloads or generating invalid ones.
00:20:37.680 Thanks to this observation from the user, we realized that the bug was in the Marshal dump: Ruby 3.2 was producing invalid payloads.
00:20:44.480 Both Ruby versions behaved the same when loading them.
00:20:52.040 At this stage, the main thing preventing me from investigating deeper was obtaining one of those corrupted payloads.
00:20:58.840 I needed to determine what was wrong with them.
00:21:06.640 However, the cache contained a lot of personal data.
00:21:12.880 My stance on data privacy is pretty strict regarding not leaking sensitive information.
00:21:20.000 Those payloads couldn’t be shared without first anonymizing the data, complicating matters.
00:21:27.360 Soon after, I received another useful hint.
00:21:34.240 Someone created a decompiler for the Marshal binary representation.
00:21:41.680 They transformed it into something resembling Ruby source code.
00:21:47.600 This allowed them to strip away personal data before sharing the structure with me.
00:21:54.520 Though I couldn't directly use it to reproduce the error, I still looked for clues.
00:22:00.640 One clue was a link entry, marking a circular reference in the payload.
00:22:07.520 Circular references usually lead to various problems.
00:22:13.360 The presence of a circular reference led me to formulate a new theory.
00:22:19.840 My hypothesis suggested that an Active Record instance was being used as a hash key in a circular way.
00:22:27.760 Here's a quick snippet to illustrate my theory.
00:22:35.040 If you allocate an Active Record instance without initializing it, you can't access any instance variables.
00:22:42.640 So, if we try to insert it into a hash without it being initialized, Ruby will call the hash method on it, and it's going to fail.
00:22:50.320 This would explain the problem reported in production.
00:22:58.720 To explore this further, I needed to understand Marshal better.
00:23:04.400 The key thing about Marshal is that it doesn't load objects automatically.
00:23:11.440 From your perspective, you’re calling a single method and loading an entire object graph in one go.
00:23:18.240 However, under the hood in the Ruby virtual machine, it allocates the object and initializes attributes one by one.
00:23:25.760 There are various moments where the object exists but is not yet fully initialized.
00:23:32.320 Because of this, in cases of circular references, Marshal could insert a partially initialized object as a key.
00:23:39.200 It would automatically invoke the hash method on the object to know where to store it in the hash.
00:23:45.920 Such a scenario perfectly explains the bug.
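You can observe this two-phase behavior (allocate first, fill in contents afterwards) with a self-referencing structure:

```ruby
# Marshal supports circular references: it allocates the array first,
# then fills in its elements, one of which points back at the array.
a = []
a << a
restored = Marshal.load(Marshal.dump(a))
restored.equal?(restored[0])  # => true: the cycle is preserved
```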
00:23:53.680 Now, I was searching for a pattern where an Active Record instance could be used as a key.
00:24:02.080 Lucky for me, I knew Active Record well enough to suspect the association cache.
00:24:09.200 This is an instance variable that stores loaded records when following an association.
00:24:16.080 However, we needed a case where Marshal would load the association cache before the attributes of the record.
00:24:22.640 Marshal serializes an object's instance variables in the order they are defined.
00:24:29.200 This didn't explain my bug, because when instantiating an Active Record model and inspecting its instance variables, @attributes is defined before @association_cache.
00:24:36.720 So, either this was not it, or in some cases, the order would be different.
00:24:43.040 To understand how the order could vary, I looked at how instance variable order behaves in Ruby.
00:24:50.080 On the surface, it's relatively simple: take two classes that define the same two instance variables in reverse order.
00:24:56.960 The first class sets @a and then @b; the other class sets @b and then @a.
00:25:03.040 Each preserves its insertion order, like a hash.
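A minimal version of that experiment:

```ruby
class AB
  def initialize
    @a = 1
    @b = 2
  end
end

class BA
  def initialize
    @b = 2
    @a = 1
  end
end

# instance_variables reports them in the order they were assigned:
AB.new.instance_variables  # => [:@a, :@b]
BA.new.instance_variables  # => [:@b, :@a]
```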
00:25:10.080 But there are trickier situations: what if the same class defines instance variables in different orders?
00:25:17.920 This change occurred relatively recently.
00:25:24.960 The Ruby committers' Slack has a very helpful bot that runs snippets of code.
00:25:32.400 It executes the snippet across various Ruby versions and reports back on what changed over time.
00:25:39.040 In this example, we found a significant clue.
00:25:48.000 In Ruby 3.2, the instance variable order could differ between objects of the same class, whereas in 3.1 the order was uniform.
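A snippet in the spirit of what the bot ran: a single class that sets its instance variables in two different orders depending on an argument. According to the talk, Ruby 3.1 reported a uniform order for both objects, while Ruby 3.2 reports each object's own insertion order; no expected output is shown here since it depends on the Ruby version:

```ruby
class Flexible
  def initialize(reversed)
    if reversed
      @b = 2
      @a = 1
    else
      @a = 1
      @b = 2
    end
  end
end

# Whether these two report the same order depends on the Ruby version.
Flexible.new(false).instance_variables
Flexible.new(true).instance_variables
```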
00:25:57.600 This was caused by the introduction of object shapes.
00:26:04.160 I had to do quite a bit of digging, but until I worked on fixing this bug, I never realized that shapes could cause this semantic difference in Ruby.
00:26:12.000 However, the ordering of instance variables is an implementation detail—it’s not specified anywhere.
00:26:19.360 It can change at any time.
00:26:28.080 The Ruby committers changed it, and it didn't break any tests because the order wasn't specified anywhere.
00:26:35.920 I looked for patterns again, specifically focusing on places where the association cache would be assigned before attributes.
00:26:42.560 In my investigation, I found that when duplicating a record, the association cache was assigned before the rest.
00:26:51.680 This could create corrupted caches.
00:26:58.880 I had kept an IRB session open in Mastodon to try and reproduce the issue locally.
00:27:05.440 When I duplicated a record, it indeed created the situation!
00:27:13.840 I was ecstatic, as you can see by the number of exclamation points in my writing.
00:27:20.560 I had been searching for this bug for two or three days by then.
00:27:27.760 But I realized that my reproduction was merely by chance.
00:27:34.720 I had this console open for so long, but when I tried to reproduce it again, it wouldn’t work.
00:27:41.760 I realized that it wasn't just a shapes issue; it was a "shape too complex" issue.
00:27:49.200 To clarify what "shape too complex" means: object shapes are an optimization that speeds up access to instance variables.
00:27:55.920 The problem is that they rely on insertion order.
00:28:02.320 Each unique insertion order creates a new shape, so code that sets instance variables in a non-deterministic order can create an explosion of shapes, which leads to performance issues.
00:28:09.920 When a class accumulates more than eight shape variations, it gets flagged as "shape too complex".
00:28:18.080 When this occurs, Ruby falls back to using a hash internally to store the instance variables.
00:28:25.760 This explains why the bug was so challenging to reproduce; it would only occur after the application had been running long enough to create numerous shapes.
00:28:32.480 You had to let the application run for quite some time before it happened.
00:28:39.120 Now that we understood what the bug was, we had to address it.
00:28:46.000 The bug lay within Ruby itself, and it should be fixed there.
00:28:53.920 I could not accept that Rails was incompatible with the latest Ruby version.
00:29:00.240 My users required a fix.
00:29:06.480 Moreover, relying on such undefined behavior in Rails was simply a liability.
00:29:13.840 To me, it was a combination of two bugs that had to be fixed in two places.
00:29:19.440 Before that, however, I realized that Shopify had been running Ruby 3.2 for over three months and had not encountered this bug.
00:29:27.000 The reason was that we had replaced Marshal as a serializer for Rails caches.
00:29:36.000 We had our reasons: not this bug specifically, but many other potential issues with Marshal.
00:29:43.600 Marshal makes it too easy to store data that you didn’t expect.
00:29:49.880 The next step was then to implement a simpler serializer that worked similarly to Marshal to avoid the problem.
00:29:58.480 I didn’t want to force dependencies on our projects, so I built something easier to monkey-patch.
00:30:06.440 Instead of passing the whole object tree to Marshal, I would first convert it into simple objects.
00:30:14.000 This approach allowed me to control the order and manually reconstitute the structure of the objects.
00:30:20.800 In doing so, I avoided potential circular reference issues.
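A hypothetical, heavily simplified version of that idea (the names are mine, not the Mastodon patch): convert the record to plain objects before handing anything to Marshal, so Marshal never walks the full object graph:

```ruby
# Entry stands in for an Active Record model: just attributes plus a
# persisted flag, which is all a cache entry really needs.
Entry = Struct.new(:attributes, :persisted)

module CacheSerializer
  # Dump only plain objects: no association caches, no circular
  # references, and full control over the order of what is written.
  def self.dump(entry)
    Marshal.dump([entry.attributes, entry.persisted])
  end

  # Manually reconstitute the object from the plain representation.
  def self.load(payload)
    attributes, persisted = Marshal.load(payload)
    Entry.new(attributes, persisted)
  end
end

entry = Entry.new({ "id" => 1, "name" => "Widget" }, true)
restored = CacheSerializer.load(CacheSerializer.dump(entry))
restored.attributes  # => {"id"=>1, "name"=>"Widget"}
```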
00:30:28.000 In the meantime, the size of the cache entries dropped by an impressive 47%.
00:30:35.200 They were also approximately twice as fast to generate.
00:30:41.520 This was because Marshal's performance is directly tied to the size of the payload it produces.
00:30:48.000 Most of what Marshal serialized was internal state and temporary cache data that isn't actually needed.
00:30:56.320 What you really need is simply the attributes and whether the record was persisted in the database.
00:31:03.520 So, that was the monkey patch.
00:31:11.360 We were happy with it, but it was time to fix Rails so that other apps wouldn’t face the same issue.
00:31:18.560 I applied a similar technique but in a more advanced way.
00:31:25.160 A lesser-known fact is that objects can define special methods, marshal_dump and marshal_load.
00:31:31.760 These allow the object to control its own serialization in Marshal.
00:31:38.560 Rather than letting Marshal handle everything, they offer a different representation that gets serialized.
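These hooks are plain Ruby, so the idea is easy to demonstrate. `CachedRecord` is an illustrative stand-in, not the Rails implementation:

```ruby
class CachedRecord
  attr_reader :attributes

  def initialize(attributes)
    @attributes = attributes
    @association_cache = {}  # transient state we don't want to persist
  end

  # Marshal calls these hooks when present, letting the object choose
  # its own compact representation instead of dumping every ivar.
  def marshal_dump
    @attributes
  end

  def marshal_load(attributes)
    @attributes = attributes
    @association_cache = {}  # rebuilt fresh, never serialized
  end
end

record = CachedRecord.new({ "id" => 1 })
copy = Marshal.load(Marshal.dump(record))
copy.attributes  # => {"id"=>1}
```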
00:31:45.040 I utilized this concept within my own implementation, optimizing the method further.
00:31:52.880 As a result, the payload sizes became even smaller than what I had produced for Mastodon.
00:31:59.840 It also became even faster to produce.
00:32:07.760 That patch made it into Rails 7.1, which you’ll be able to upgrade to and benefit from.
00:32:15.040 We needed to fix the issue in Ruby as well.
00:32:23.840 Although Rails was at fault, Ruby's behavior wasn't ideal.
00:32:31.360 Not being able to rely on something being predictable is problematic.
00:32:38.680 I let Aaron and Jemma know about the issue.
00:32:47.920 After some discussions, we agreed that the fallback implementation using a hash internally should use an ordered hash.
00:32:55.120 Previously, it had utilized an unordered hash.
00:33:03.520 I won't go into the details of the patch, because it's written in C and the diff is quite extensive.
00:33:10.080 What I want to highlight is that it switched away from a hash table whose keys could only be symbols.
00:33:16.320 It moved to the hash table that backs Ruby's own Hash implementation.
00:33:23.200 So the fallback is now an ordered hash.
00:33:30.080 This patch was backported on the Ruby 3.2 branch and should become available as soon as Ruby 3.2.3 is released.
00:33:36.960 What I've learned from this bug, in stark contrast to the previous one, is that I wasn’t directly experiencing it.
00:33:44.800 With the previous bug, I could observe it myself and iterate quickly; this time, debugging was more difficult.
00:33:52.960 I could only rely on user feedback, and responses took hours to return.
00:33:59.760 It's much like helping a friend debug their computer over the phone—it’s a tiring experience.
00:34:06.240 I couldn't directly access production data to investigate myself.
00:34:12.960 This is a significant advantage of open-source software—you can debug your own implementation.
00:34:20.000 Secondly, be careful using Marshal—I really advise caution.
00:34:27.680 There are valid use cases, but consider the possibility of serialization issues.
00:34:35.200 Avoid using it where serialized data depends on class definitions; it can lead to significant problems.
00:34:42.640 Instead, prefer more limited serialization formats such as MessagePack or JSON.
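For example, with the stdlib JSON, the payload can only ever contain plain data (hashes, arrays, strings, numbers), so it cannot smuggle in class-dependent state the way a marshalled object graph can:

```ruby
require "json"

# Only plain data crosses the serialization boundary; anything that
# depends on a class definition has to be converted explicitly.
payload = JSON.generate({ "id" => 1, "name" => "Widget" })
JSON.parse(payload)  # => {"id"=>1, "name"=>"Widget"}
```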
00:34:50.400 Finally, don't hesitate to explore lower layers.
00:34:57.920 You will learn so much from understanding the underlying systems, even if it feels intimidating.
00:35:05.000 Semi-deterministic behavior should be avoided at all costs.
00:35:12.800 It's one thing for something to be entirely random, and quite another for it to be deterministic until it suddenly isn't.
00:35:19.680 That kind of semi-determinism is truly treacherous.
00:35:25.440 Be sure not to rely on such behavior.
00:35:32.320 Not every Ruby behavior is intentional and guaranteed to remain stable over time.
00:35:40.000 Some behaviors may be implementation-dependent and can change in future versions of MRI.
00:35:46.560 That's all I had! Thank you very much for your attention!
00:35:51.040 If there are any questions, I’d be happy to answer!