Summarized using AI

Benchmark and profile every single change

Daisuke Aritomo • April 17, 2025 • Matsuyama, Ehime, Japan • Talk

In this talk, Daisuke Aritomo discusses the importance of profiling and benchmarking in software development, particularly in the context of Ruby on Rails applications. The main focus is an approach he calls benchmark-driven development (BenchDD), in which benchmarks are designed before the actual code is written, allowing developers to streamline performance optimization.

Key Points Discussed:
- Introduction to Profiler Updates: Aritomo shares updates on his Ruby profiler, Pf2, which has reduced memory consumption by 90% and gained new features for comparing profiles. He emphasizes that while profilers help identify performance issues, they do not automatically make applications faster.

  • Benchmark-Driven Development (BenchDD): He elaborates on the BenchDD process, comparing it to test-driven development (TDD): benchmarks are created first to define the expected performance outcome.

  • Real-world Application: Aritomo applies BenchDD in developing a new web framework called Xinatra, a Sinatra-compatible framework designed for efficiency and speed. Following this methodology, the framework's routing and handling logic becomes roughly 100 times faster than Sinatra's, though in real-world benchmarks this translates to around a 2% improvement in overall app performance.

  • Importance of Continuous Benchmarking: He advocates for continuous benchmarking throughout the coding process, suggesting that every change should be benchmarked to understand its impact on performance.

  • Tools for Benchmarking: Aritomo presents various tools and frameworks developed to support benchmarking within the coding workflow, emphasizing that realistic workloads and meaningful metrics are crucial for accurate assessment. Differential flame graphs help visualize performance changes before and after an optimization.

  • Performance Optimization Techniques: He discusses multiple optimization techniques, such as smart algorithm choices for routing and efficient data handling. He notes that both high-level architectural changes and small, incremental improvements can significantly enhance overall performance.

  • Conclusion and Insights: The talk concludes with key insights on how even minor improvements in Ruby code can lead to substantial overall gains, especially as applications scale. Aritomo stresses the importance of keeping the benchmarking environment close to production conditions for accurate results and addresses common pitfalls in benchmarking practices.

The session showcases not just the methodologies for improving speed and efficiency in Ruby frameworks but also instills best practices around software performance monitoring and continuous integration of benchmarking into development workflows.

Benchmark and profile every single change
Daisuke Aritomo • Matsuyama, Ehime, Japan • Talk

Date: April 17, 2025
Published: May 27, 2025

RubyKaigi 2025

00:00:08.720 Hello, Matsuyama! Hello, hello. I'm Daisuke Aritomo, and today we're going to talk about profiling and benchmarking every change. Some of you might know me as osyoyu; that's my online handle. I work at SmartBank, Inc., which is a credit card company. We're hosting the hack space on the second floor, so please pay a visit if you want to hack away or just want to relax. I usually work as a Rails developer at SmartBank.
00:00:46.480 Before we go into the main topic, I'd like to deliver some updates on my own Ruby profiler, Pf2, which received an honorable mention at the keynote. I'm sure I don't have to explain what a profiler is, so I'll just go over the recent updates: I've changed some internals to reduce memory consumption by 90%, I've expanded the features for comparing profiles (which I'll explain later), and I have rewritten the core in Rust. One thing I have realized this year is that profilers don't make programs fast. Just like debuggers don't find bugs, profilers by themselves don't make the program fast.
00:01:36.040 So today I'm going to address that. I'm going to speak about three things: first, introducing benchmark-driven development, which I will refer to as BenchDD; next, I'll show you the toolset for doing BenchDD; and lastly, I'll show how I applied benchmark-driven development to build a 100-times-faster version of Sinatra. Also, this year I have attached short Japanese translations to the slides as an experiment. That part is pure translation and doesn't contain any flavor text, so if you don't read Japanese you can safely ignore it.
00:02:21.239 I have created a web framework called Xinatra. It implements a fair part of the Sinatra API, so it can be used as a drop-in replacement: you can simply change your app's base class from Sinatra::Base to Xinatra::Base. I first thought to name it Cinatra, since C is 100 in Roman numerals, but I soon realized that it's too hard to distinguish Cinatra and Sinatra when spoken, so I went with Xinatra as the name. I realized later that X is 10 in Roman numerals, so it might only be 10 times faster, but let's just ignore that.
00:03:06.480 Is it really 100 times faster? Well, it depends on the definition of fast. The routing and handling logic is 100 times faster, which does not mean that whole real-world apps become 100 times faster. Of course, the framework's request handling isn't the most expensive part of a web application; obviously, the app itself has the most weight. But even in real-world benchmarks, making routing 100 times faster can lead to a 2% gain. So I'd say it's like a free lunch: you can just change the base class and get a 2% performance gain.
00:03:50.319 Now, how did I make it 100 times faster? That's the main part: I'm going to introduce a technique called benchmark-driven development, along with a toolset. Benchmarking itself is a common technique used to measure the performance of a program. The Ruby world has many benchmarking libraries, including benchmark, which comes as a standard library, and benchmark-ips, a gem, which is what I'm showing on the slide. In this case, a program which adds up the numbers 1 to 100 is benchmarked, revealing that it takes 2.5 microseconds to execute the code block.
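The slide's code isn't captured in the transcript; a minimal benchmark-ips example along these lines (the report label and block are assumptions) would be:

```ruby
require "benchmark/ips"

Benchmark.ips do |x|
  # Measured block: add up the integers 1 through 100
  x.report("sum 1..100") do
    (1..100).sum
  end
end
```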
00:04:33.080 Given that, let's talk about development driven by benchmarks. The concept of benchmark-driven development is simple: before starting to write code, design a benchmark; then start writing code; then make it fast; and the cycle continues. This can be compared to TDD, which stands for test-driven development. In TDD, we start by writing a failing test, then make it pass, then refactor. Do you see the similarity?
00:05:05.840 The reason why I propose writing a benchmark first can be explained using this matrix. There are four quadrants: broken and slow code, fast but broken code, working but slow code, and finally working and fast code. We all know that turning slow-but-working code into fast-and-working code is very difficult. It's easier to write fast code from the beginning.
00:05:37.400 Now, when we start talking about bottlenecks, the first thing many people mention is that you should focus on the bottlenecks: if there's some significant bottleneck, you should work on that part, or you should focus on the algorithm and make it magnitudes faster, or you could rework the entire architecture. Of course, that is perfectly valid advice and you should stick to these principles, but that is not everything. In reality, there's not always something significant; no particular bottleneck may exist. For example, in this flame graph, a lot of blocks have the same length, so there's no particularly significant bottleneck. And even if your code is running a logarithmic algorithm, it could still fail to meet performance needs.
00:06:36.840 Lots of slight slowdowns will impact performance as a whole. However, those are hard to find, precisely because each one is slight, even though they may be easy to fix. We should remember that "not slow" is not the same as "fast". In this example, we want to take any even number from the array numbers. The slow version always iterates through the entire array, while the fast version bails out early when it hits the first even number. This might not sound realistic, but we see a lot of this kind of code in real apps. The two versions have the same time complexity, but the latter is 10 times faster.
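The slide's code isn't in the transcript either; a plausible reconstruction of the two versions (the array and method names are assumed) is:

```ruby
numbers = (1..1000).to_a

# Slow: select walks the entire array before we take the first match
def first_even_slow(numbers)
  numbers.select(&:even?).first
end

# Fast: find bails out as soon as it hits the first even number
def first_even_fast(numbers)
  numbers.find(&:even?)
end

first_even_slow(numbers)  # => 2
first_even_fast(numbers)  # => 2, without scanning all 1000 elements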
00:07:22.960 To address this problem, while developing Xinatra I benchmarked really every single change. On each pull request I attached a benchmark and compared the before-and-after performance. Now, this sounds cool: you could run benchmarks as often as possible to catch slow code, maybe on every pull request just like I showed you, or on every commit, or maybe even more often: every time you type a character in your editor, you could run a benchmark. I have created tools and frameworks to keep benchmarking in the loop. Combined together, they form a framework called BenchDD. Now let me walk through how to do it.
00:08:28.199 Benchmarking Xinatra started from setting a measurable performance goal. That means defining what, exactly, needs to be 100 times faster, so I wrote a benchmark for it first. My first benchmark looks very simple: it just takes an app and calls it. This is a Rack app, so I called call and passed an almost empty env hash. Now let's see the results: Sinatra took 35,000 nanoseconds, that is, 35,000 nanoseconds per request. Another interesting benchmark target I found was Roda, which is known as a very fast web framework. Now, if I wanted to make something 100 times faster, it has to complete one request in 350 nanoseconds, and the empty Rack app I had just built takes 250 nanoseconds per request. This graph looks a little squashed, so let's put it on a log scale to make it easy to read. Comparing these: if we want a Sinatra that completes a request in 350 nanoseconds, we only have a headroom of about 135 nanoseconds per request for the framework to do its heavy lifting, and everything we need for all the features has to fit in there. That's what we found through benchmarks. So in code, the question is: how much time can we spend here?
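A sketch of what such a benchmark could look like (the lambda app and the env contents are assumptions standing in for the real Sinatra or Xinatra app):

```ruby
require "benchmark/ips"

# Stand-in Rack app; in the talk this would be the Sinatra or Xinatra app
app = ->(env) { [200, { "content-type" => "text/plain" }, ["hello"]] }

# An almost empty Rack env, as described in the talk
env = { "REQUEST_METHOD" => "GET", "PATH_INFO" => "/" }

Benchmark.ips do |x|
  x.report("request") { app.call(env) }
end
```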
00:10:31.320 Now, about setting the target: 100 times is a fun target, but maybe a different target could be set. I initially wanted to overtake Hono, which is a JavaScript web framework known to be very fast. The sad part is that full-featured Hono was faster than an empty Rack app: Hono could complete a request in 51 nanoseconds. I wanted to overtake it, but it turned out to be practically impossible to do that on top of Rack, so I dropped that goal. Sometimes benchmarking is useful for setting the goal itself.
00:11:25.120 Anyway, we now know our goal, so now it's time to write real code. The process goes like this: start Vim, write code, run a benchmark, start Vim again, and benchmark again. So I started writing code, beginning with the routing code, a method that takes a Rack env and checks it against the registered routes (I'll talk about this later). But the problem I found is that this is not fun at all. Look at the timestamps: I was running something like ten benchmarks in five minutes while editing code. It was kind of fun, but not really an ideal process.
00:12:19.160 So, to remedy this, I created a benchmarking framework. It provides a way to describe benchmarks: a DSL, like RSpec's, that has a setup part, a dataset part, and a scenario part. The setup part is not measured, and the scenario part is the part that gets benchmarked.
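The framework itself isn't shown in the transcript; a toy implementation of such a setup/dataset/scenario DSL (all names here are assumptions, not the talk's actual framework) might look like:

```ruby
require "benchmark"

class Bench
  def self.run(name, &block)
    bench = new(name)
    bench.instance_eval(&block)  # collect the setup/dataset/scenario blocks
    bench.execute
  end

  def initialize(name)
    @name = name
  end

  def setup(&block)
    @setup = block
  end

  def dataset(&block)
    @dataset = block
  end

  def scenario(&block)
    @scenario = block
  end

  def execute
    instance_eval(&@setup)           # setup: not measured
    data = instance_eval(&@dataset)  # dataset: not measured
    took = Benchmark.realtime do
      data.each { |item| instance_exec(item, &@scenario) }  # scenario: measured
    end
    puts format("%s: %d ns/iteration", @name, took / data.size * 1e9)
  end
end

Bench.run "multiply" do
  setup    { @multiplier = 2 }
  dataset  { Array.new(10_000) { |i| i } }
  scenario { |i| i * @multiplier }
end
```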
00:13:01.760 We see that designing the workload is very important. Workloads should be realistic and representative, and compact at the same time. For Xinatra I prepared multiple tiers of workloads. The small one is a randomly generated set of requests, consisting of 10,000 requests. The large one is a log collected from a real Sinatra app, consisting of 100,000 requests. We use the small one when running benchmarks in a tight loop: when you're writing code and experimenting, the small dataset gets you results fast. To get more accurate results, we use the large one.
00:13:59.160 Another problem is that benchmarking does not provide us enough information. What benchmarking tells us is the time per iteration of the current code. However, what we really need to know is how the performance changed from before to after, and why it changed. With a benchmark alone, the only information we have is the time per iteration. This is where profiling comes in.
00:14:37.480 Explaining performance is exactly what profilers do. As I maintain my own profiler, I've added a new view that shows the performance diff between two revisions. This is called a differential flame graph. Unlike a normal flame graph, some frames are colored red and blue. The graph is generated from two benchmark results and their two profiles: the blue frames are the ones whose performance improved relative to the previous benchmark, and the red ones are the ones that degraded since the last run. Seeing which parts got better and which got worse gives us hints on where to optimize.
00:15:41.560 Actually, this screenshot is not from my profiler; it's a screenshot from Brendan Gregg's site, and there is a reason: his books are in the conference bookstore. Brendan Gregg is a great person and you really should buy his book, so go to the bookstore!
00:16:07.759 I have implemented another thing: editor integration. I felt that reading flame graphs is kind of tedious and not something you want to do every single time, so I made an editor integration that shows the time spent per line as ghost text. Do you notice the gray text? For example, the continuation check on line 22 is spending 82,000 nanoseconds per iteration, indicating that it is a possible hot spot.
00:16:50.160 This is implemented in cooperation with the benchmark framework and Pf2. When a scenario runs, it automatically engages Pf2 profiling. The results are recorded in a temporary directory, and the diff engine in Pf2 generates the differential flame graph and the data for the editor integration.
00:17:18.880 So now we have walked through a single cycle of BenchDD. It is important to write a benchmark for each feature, not just one overall. In the case of Xinatra, I wrote benchmarks for routing, handler actions, and more. Now we'll see how I improved performance for each feature in detail. So: making Xinatra 100 times faster. As a reminder, we need to fit all features within the roughly 135-nanosecond budget.
00:17:49.720 We have to start from optimizing the most significant part: routing. Routing is the largest part of Sinatra, and of web frameworks in general; it's the equivalent of what routes.rb does in Rails. Some algorithms come to mind, like trie-based routing or linear routing. There are multiple algorithm options here. One is linear: it tries to match every single registered route against the request. Another is trie-based, built on the data structure called a trie, whose cost is logarithmic rather than linear. So there's a trade-off, and it's important to know where the line is: the point at which the simple linear algorithm loses to the more complex logarithmic one. Benchmarking found that for 10 to 20 routes, linear routing was faster, so Xinatra automatically switches algorithms based on the number of routes registered.
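A minimal sketch of that adaptive switch (the class names and exact cut-over are assumptions, and a hash table stands in for the trie matcher, which would also handle dynamic segments like "/users/:id"):

```ruby
# Routes are [[http_method, path, handler], ...]
class LinearRouter
  def initialize(routes)
    @routes = routes
  end

  def match(meth, path)
    # Try every registered route in order
    @routes.find { |m, p, _| m == meth && p == path }&.last
  end
end

class TableRouter
  # Stand-in for the trie-based matcher
  def initialize(routes)
    @table = routes.to_h { |m, p, h| [[m, p], h] }
  end

  def match(meth, path)
    @table[[meth, path]]
  end
end

module Router
  LINEAR_THRESHOLD = 20  # benchmarks showed linear winning up to ~10-20 routes

  def self.build(routes)
    routes.size <= LINEAR_THRESHOLD ? LinearRouter.new(routes) : TableRouter.new(routes)
  end
end

router = Router.build([["GET", "/", :index], ["GET", "/about", :about]])
router.match("GET", "/about")  # => :about
```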
00:19:03.960 Now that we have implemented routing: it turns out that routing consumed about 50 nanoseconds, so we have 84 nanoseconds of headroom to go. We have many more features to implement to make Xinatra a useful web framework; routing alone does not make a framework. We need to implement params access, before and after actions, and cookies. Those may not be bottlenecks, but it is still challenging to fit them in 84 nanoseconds per request.
00:19:46.039 Sinatra has a params API: if you call params in a handler, you get the parameters passed with the request. The problem is that params is a method call, and method calls are actually expensive. If we could make params an instance variable, that would be a gain. However, that is an API change. An @params instance variable is obviously faster, but it's mutable and can do less work. Xinatra supports both, so users can gradually switch to the instance-variable version and gain performance. There's a lesson here: performance can influence API design. If you start with a params method and make it a public API, you can't change it easily without introducing a breaking change. You have to carefully check every option.
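A small benchmark in the spirit of that comparison (illustrative only, not Xinatra's actual code):

```ruby
require "benchmark/ips"

class Handler
  attr_reader :params  # accessor method, as in classic Sinatra

  def initialize(params)
    @params = params
  end

  def via_method
    params[:id]   # goes through a method call
  end

  def via_ivar
    @params[:id]  # reads the instance variable directly
  end
end

h = Handler.new(id: 1)

Benchmark.ips do |x|
  x.report("params method") { h.via_method }
  x.report("@params ivar")  { h.via_ivar }
  x.compare!
end
```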
00:20:54.799 Another example is the request object. request is another well-known API: it returns a request object, which is kind of useful. Can we change this to gain some performance? Unlike params, there is further usage: you can call methods like request.path or request.params on it to get other objects. There are multiple options for implementing this: option one is making it a plain class, option two is making it a Data, option three is making it a Struct. Data was the fastest of the three. One interesting thing is that these results get reversed when YJIT is off: without YJIT, the plain class is actually the fastest. So the lesson here is that it is important to experiment in an environment near production.
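A sketch of such an experiment (attribute names assumed); which variant wins can flip depending on the Ruby version and whether YJIT is enabled, which is exactly the point:

```ruby
require "benchmark/ips"

class RequestClass
  attr_reader :path

  def initialize(path)
    @path = path
  end
end

RequestData   = Data.define(:path)  # Ruby 3.2+
RequestStruct = Struct.new(:path)

Benchmark.ips do |x|
  x.report("class")  { RequestClass.new("/").path }
  x.report("Data")   { RequestData.new(path: "/").path }
  x.report("Struct") { RequestStruct.new("/").path }
  x.compare!
end
```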
00:22:10.360 Another feature is before and after actions. Before and after actions can be registered, and they are saved at app startup; they then get called on every request, so they can be used for authentication and other tasks. In Ruby there are a handful of ways to call a proc. One is Proc#call, where you just call the proc; another is instance_exec; and another is instance_eval. These have very different performance. Proc#call is the fastest, but in this case it was unusable, since the execution context changes. So the options left were instance_exec and instance_eval, and benchmarking found that instance_eval was somewhat faster, so I used instance_eval here.
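A minimal comparison of the three invocation styles (illustrative; relative results vary by Ruby version):

```ruby
require "benchmark/ips"

hook = proc { 1 + 1 }
app  = Object.new  # the context the hook should run in

Benchmark.ips do |x|
  x.report("Proc#call")     { hook.call }                 # keeps the proc's original context
  x.report("instance_exec") { app.instance_exec(&hook) }  # runs with self set to app
  x.report("instance_eval") { app.instance_eval(&hook) }  # also runs with self set to app
  x.compare!
end
```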
00:23:15.480 However, at 422 nanoseconds we were only 95 times faster than Sinatra; we were falling short of the 100-times line. So we had to come up with another strategy. One way is to just make the before action an actual method: instead of saving a block, I defined a real method with define_method and called it later using send. Calling methods is faster than calling blocks, so this gained us significant nanoseconds. However, send is still kind of slow. What if we could just call the method statically? We could do that, but then multiple before hooks cannot be defined, so it's kind of a breaking change. Still, this got us to 144 times faster.
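A sketch of the three dispatch strategies for a stored hook (names assumed, not Xinatra's actual internals):

```ruby
require "benchmark/ips"

class App
  HOOK = proc { @authed = true }

  # The same hook promoted to a real method via define_method
  define_method(:before_hook) { @authed = true }

  def run_block
    instance_eval(&HOOK)  # stored block, evaluated in the app's context
  end

  def run_send
    send(:before_hook)    # dynamic method call
  end

  def run_static
    before_hook           # static call: fastest, but only one hook is possible
  end
end

app = App.new

Benchmark.ips do |x|
  x.report("instance_eval") { app.run_block }
  x.report("send")          { app.run_send }
  x.report("static call")   { app.run_static }
  x.compare!
end
```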
00:24:20.360 Now, one last feature I'd like to introduce: Rack::Session. This is technically not part of Sinatra, but many Sinatra apps use it. Session handling wasn't in the original benchmark, so I'm not showing numbers here, but profiling showed that Rack::Session usually consumed quite a lot of CPU. I tried implementing an equivalent in Rust, and it was quite fast; that's just for your information. And there's more: I reduced hash accesses and object allocations. For instance, there's a pattern you see a lot in Ruby code that actually accesses a hash key twice, which costs some performance, and I reduced a lot of such things.
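The exact snippet isn't captured in the transcript; a common instance of the double-lookup pattern, and a single-lookup rewrite, looks like this:

```ruby
hash = { "session_id" => "abc123" }
key  = "session_id"

# Double lookup: key? and [] each walk the hash
value = hash[key] if hash.key?(key)

# Single lookup: fetch with a sentinel default touches the hash once
NOT_FOUND = Object.new
v = hash.fetch(key, NOT_FOUND)
value = v unless v.equal?(NOT_FOUND)
```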
00:25:12.520 So, wrapping up this part: building Xinatra was about removing a ton of small debris. Achieving high performance was not about making the code fast, but about not making it slow, because ten 10% slowdowns compound into code that is 150% slower.
00:25:30.360 Now I'd like to introduce some tips for better benchmarking. Does this presentation even matter? Yes, because in Ruby and Rails, CPU time is very precious. Many people say that the database is the bottleneck and Ruby code won't matter, but actually Rails isn't that I/O-bound: it's only blocked on I/O for something like 50 to 60% of the time. Reducing one millisecond in Ruby code can go very far, especially when you scale.
00:26:06.200 Another important thing: don't do gacha. (Gacha machines, by the way, are in the photo on the left: you pay 100 or 200 yen to get some random mikan goods.) Repeating a benchmark command until you get a good result is tempting and fun, but it's mostly a waste of time. If you are unsure about a result, do a statistical hypothesis test.
00:26:41.640 Why is enabling YJIT during benchmarks important? Just as I said, keeping the benchmarking environment close to production is important. Why do methods get faster after being called a few times? That's warm-up, so a short warm-up run should suffice. And when profiling alongside a benchmark: yes, you will see lower scores with the profiler enabled, but that's okay as long as the overhead is consistent.
00:27:17.320 One more thing: benchmarking in CI. The continuous benchmarking idea kind of sounds like a thing to do in CI. Why do it locally when we have GitHub Actions or something? Well, that's because CI environments are very unstable: you don't get the same results from the same code. They run on varying base CPUs, you might have noisy neighbors, libraries could get automatically updated, and hyperthreading could be enabled or disabled.
00:27:53.720 So, wrapping up: do you now feel like benchmarking? Some of the optimizations I covered today won't be easy to retrofit after the code is written, but at the instant you're writing the code, they are possible. So always run benchmarks while writing code, and find problems before you commit. Thank you!