I'm primarily here today to give you a quick intro, to tell you what Sidekick is and to share a few of the insights we've learned while building Sidekick.

So, to start, this is Sidekick. Sidekick is an assistant that lives within the Shopify admin, and merchants use it to get general help with their store and manage their business. In this specific example, it's actually running on Andrew's store; he's going to join us later to talk about evals. Andrew asked Sidekick, "Hey, could you fetch and analyze the sales from the last 30 days and provide recommendations based on what you've learned?" Sidekick can decompose that into individual tasks, collect all the different pieces of context, and generate a fairly reasonable response.

So how does this work internally? At its core, Sidekick is just an agent, a very simple agent. It has an LLM that is equipped with many tools. Those tools interact with the environment, which in this case is the Shopify store and other Shopify APIs. The agent can reason about the tool responses and then generate a response back to the merchant, the human in this case. Now, Sidekick today doesn't actually have that many defined workflows, and I know that many agents are built with very defined workflows in mind to keep consistency. But we found that strapping an LLM with very well-defined tools provides the best balance: it produces the best quality responses while also giving the LLM enough space to recover from errors and edge cases. It gives that flexibility back to the LLM.
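To make the shape of that loop concrete, here is a minimal sketch of a tool-calling agent loop in Python. Everything here is hypothetical: the call_llm function, the tool names, and the message format are stand-ins rather than Sidekick's actual implementation; it only illustrates the pattern of an LLM calling well-defined tools until it can answer.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    name: str
    arguments: dict

@dataclass
class LLMResponse:
    text: str
    tool_calls: list[ToolCall]

# Hypothetical registry of well-defined tools; each returns a structured dict.
TOOLS: dict[str, Callable[..., dict]] = {
    "fetch_sales_report": lambda days=30: {"total_sales": 12345.67, "days": days},
    "search_help_center": lambda query="": {"articles": []},
}

def call_llm(messages: list[dict]) -> LLMResponse:
    """Stand-in for a real LLM call that may return tool calls or a final answer."""
    raise NotImplementedError("wire up your LLM provider here")

def run_agent(user_message: str, max_steps: int = 8) -> str:
    messages = [
        {"role": "system", "content": "You are a commerce assistant with tools."},
        {"role": "user", "content": user_message},
    ]
    for _ in range(max_steps):
        response = call_llm(messages)
        if not response.tool_calls:          # no more tools needed: final answer
            return response.text
        for call in response.tool_calls:     # execute each requested tool
            result = TOOLS[call.name](**call.arguments)
            messages.append({"role": "tool", "name": call.name, "content": result})
    return "Sorry, I couldn't finish that request."
```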
Now, before I continue, I want to go back in time to when we were first developing Sidekick, and to the first handful of skills, or tools, that we built. The main criteria for picking these tools were: one, does it provide value to the merchant, and two, does it actually save them time? There's no point in building a tool if the merchant can accomplish that task faster in other ways.

As an example, here are the first two tools: customer segmentation and analytics. Customer segmentation, if you don't know, is a core feature at Shopify that lets you group customers based on a set of criteria. These groups can be used for marketing campaigns, tagged with discounts, or used to get general buyer insights about your customers. But it requires the merchant to learn a fairly bespoke query language just to start digging into that data. Same with analytics. Analytics at Shopify is a wealth of information: it can give you order details, sales details, and trends. But again, you as a merchant have to learn this query language, and you can imagine some of the non-technical merchants having trouble learning it and even getting to the starting line of digging into these insights.

With the advent of LLMs, this type of problem and this type of feature becomes so much more accessible. You can task an LLM with generating these queries, and you can have the agent run them and provide the insights directly. For these two specific skills, we actually had to fine-tune a model that translates a user request into the queries you see above. Inside the tool, Sidekick would then validate the response from the fine-tuned model, run the query, and generate the response.
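A minimal sketch of that validate-then-run pattern, assuming a hypothetical fine-tuned query generator and a hypothetical validator; the function names and the query fields are illustrative, not Shopify's actual APIs.

```python
class InvalidQueryError(Exception):
    pass

def generate_segment_query(user_request: str) -> str:
    """Stand-in for the fine-tuned model that maps natural language to a query."""
    raise NotImplementedError

def validate_query(query: str) -> None:
    """Cheap structural checks before execution; reject anything malformed."""
    allowed_fields = {"customer_account_status", "number_of_orders", "customer_tags"}
    if not query or not any(field in query for field in allowed_fields):
        raise InvalidQueryError(f"unrecognized query: {query!r}")

def run_query(query: str) -> list[dict]:
    """Stand-in for executing the validated query against the store's data."""
    raise NotImplementedError

def segmentation_tool(user_request: str) -> dict:
    """Tool result returned to the agent: either rows it can summarize, or an
    error it can recover from (e.g. by asking the merchant to rephrase)."""
    query = generate_segment_query(user_request)
    try:
        validate_query(query)
    except InvalidQueryError as err:
        return {"status": "error", "detail": str(err)}
    return {"status": "ok", "query": query, "rows": run_query(query)}
```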
In addition, we were thinking: it's amazing that it can do these complex tasks, but it would be a little funny if it couldn't do the basic ones. So alongside the complex tasks, we also introduced three more basic skills. The first is navigation, which is fairly self-explanatory: it helps merchants, especially new ones, find their way around the Shopify admin, which can be a very big place. The second is the help tool, your classic RAG-based tool, hooked up to the Shopify Help Center, which contains all the documents you probably need to run your business. And lastly, form filling. Form filling gives Sidekick the ability to generate a preview of a create or edit action for any of the resources on the merchant's store. And I want to be really clear here: Sidekick itself doesn't mutate the state of the shop. You can imagine that a free-form agent mutating many, many things would be a bad experience for users, especially if the users don't have any input. So all it does is give the UI a preview that is shown to the merchant, and the merchant has to actually sign off and commit the changes on their own.
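A sketch of what a preview-only form-filling tool could look like under those constraints; the resource names, field layout, and commit flow are assumptions for illustration, not the actual Sidekick contract.

```python
from dataclasses import dataclass
import uuid

@dataclass
class FormPreview:
    """A proposed create/edit action; nothing is written until the merchant commits."""
    preview_id: str
    resource: str            # e.g. "product" or "discount" (illustrative)
    action: str              # "create" or "edit"
    fields: dict

PENDING_PREVIEWS: dict[str, FormPreview] = {}

def form_fill_tool(resource: str, action: str, fields: dict) -> dict:
    """Tool result: a preview payload the UI renders for the merchant to review."""
    preview = FormPreview(preview_id=str(uuid.uuid4()), resource=resource,
                          action=action, fields=fields)
    PENDING_PREVIEWS[preview.preview_id] = preview
    return {"type": "form_preview", "preview_id": preview.preview_id,
            "resource": resource, "action": action, "fields": fields}

def commit_preview(preview_id: str) -> dict:
    """Called only from the UI after the merchant explicitly signs off."""
    preview = PENDING_PREVIEWS.pop(preview_id)
    # ...the actual write to the store would happen here, via normal admin APIs...
    return {"status": "committed", "resource": preview.resource}
```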
Now, for the most part, we launched this with a few of those skills, or tools, and found pretty good success. A lot of merchants really enjoyed the additional help Sidekick provided. And we were able to look at the gaps between what Sidekick could answer and what it couldn't. For the things it couldn't answer, we'd say, let's just build a tool for that and add it to the system; you can just keep growing Sidekick's list of responsibilities. For the most part that worked, right up until it didn't. We started to notice that when we had way too many tools, the LLM would start to confuse the responsibilities of the different tools, it would misapply the instructions we had in the system prompt across tools, and in general it lowered the quality of the responses. And you can imagine this problem being further exacerbated the more tools we prototyped within Sidekick. A coworker of mine describes this as death by a thousand instructions. You have conflicting instructions. It slows down the entire processing of the LLM, because you can imagine that giant system prompt being dynamically swapped out and rebuilt. It becomes incredibly difficult to debug, especially when you have other external contributors trying to add to that one system prompt. And ultimately it's very hard to evaluate, so you don't even know if you're moving in the right direction.
So what did we do? The first major refactor was introducing a concept we call just-in-time instructions. Just-in-time instructions remove the complexity of having all of the conditionals in your agent's main system prompt and move them into the tool response directly. This provides two things. One, it keeps the core of your agent's behavior fairly static; the system prompt is just the agent behavior you're trying to achieve. Two, it still surfaces the tool instructions when you actually need them, that is, when the tool is actually called. We also found a side benefit: this is more cache friendly, because if your system prompt is static and you're only appending tool results, you can maintain a longer prompt cache, especially with the LLM providers that offer that feature. And because the instructions now live in the tool results, we found teams experimenting with different instructions passed back to the agent, using things like beta flags, depending on the model being used or the page the merchant is on. So ultimately, when we have other teams contributing, the blast radius of any one tool doesn't affect the core of the agent's behavior as a whole.

So what does this look like in practice? It's really just moving those instructions from the system prompt into the structured tool results. This is a simplified version of our help tool, and you can see that it returns a very specific citations format. When we used to have that format and those instructions in the system prompt, Sidekick would actually misuse the citations format for other tools that weren't help related at all. So this one simple change helped us modularize the process and scale out to way more tools. From the merchant's view, of course, they don't see the structured response; they just see the agent adhering to the instructions provided by those tool results.
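Here is a sketch of what a tool result carrying just-in-time instructions might look like, loosely modeled on the simplified help tool described above. The field names and the citation format are assumptions for illustration, not Sidekick's actual schema.

```python
def help_tool(question: str) -> dict:
    """RAG-style lookup whose result carries its own usage instructions, so the
    citation rules only enter the context when this tool is actually called."""
    documents = retrieve_help_articles(question)   # hypothetical retriever
    return {
        "documents": [
            {"id": doc["id"], "title": doc["title"], "excerpt": doc["excerpt"]}
            for doc in documents
        ],
        # Just-in-time instructions: scoped to this tool call, not the system prompt.
        "instructions": (
            "Answer using only the documents above. Cite each claim with its "
            "document id in the form [doc:<id>]. If the documents do not cover "
            "the question, say so instead of guessing."
        ),
    }

def retrieve_help_articles(question: str) -> list[dict]:
    """Stand-in for the Shopify Help Center retrieval step."""
    raise NotImplementedError
```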
So this worked even further, and we were able to scale out to way more tools. For the most part, teams would introduce one tool that represented their domain. Then, more recently, we started onboarding some more complex features at Shopify, and those teams would require a huge number of domain-specific tools. We felt like we were heading back into the original problem, where the main agent has to keep track of all the different tools across multiple, complex domains. So this is something we're exploring today: sub-agents. Sub-agents are specialized agents that handle those specific domains. But the key point is that the only agent the merchant is talking to is still Sidekick. Sidekick is the only point of contact, and the merchant can't speak to the sub-agents directly. This ensures that the tone and voice of Sidekick stay consistent, while you still delegate some of those domain-specific tasks to a sub-agent.

So what does this look like in practice? It should look very familiar; it's the same interface. You're calling a tool from Sidekick that hands off to a sub-agent with a set of instructions, an optional conversation ID (which we'll get back to), and any other specific pieces of context that sub-agent might need. The sub-agent takes those instructions, runs its own internal agentic loop with its own system prompt and its own set of domain-specific tools, and then, just like just-in-time instructions, returns instructions back to Sidekick so Sidekick can form the response to the merchant.

Pausing to look at that conversation ID: why would you want one? You'll find that a lot of users aren't very good at providing the full spec of what they're trying to accomplish in a single turn. They're not writing a giant brief with all the different conditions; they want to iterate across multiple turns. That means your sub-agent needs to be aware of the turns that came before, so it has the context to move on to the next step. So we pass the conversation ID back from the sub-agent, and the main agent can say, hey, this person is actually continuing that thread with the sub-agent; please add this set of instructions to your conversation history so you can continue on to the next step.
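A rough sketch of what such a handoff tool could look like, with a hypothetical sub-agent interface; the parameter names (instructions, conversation_id, context) mirror the description above but are not Sidekick's actual contract.

```python
import uuid

SUBAGENT_SESSIONS: dict[str, list[dict]] = {}   # conversation_id -> turn history

def delegate_to_subagent(domain: str, instructions: str,
                         context: dict, conversation_id: str | None = None) -> dict:
    """Tool exposed to the main agent. The merchant never talks to the sub-agent
    directly; the main agent remains the single point of contact."""
    if conversation_id is None:
        conversation_id = str(uuid.uuid4())
        SUBAGENT_SESSIONS[conversation_id] = []
    history = SUBAGENT_SESSIONS[conversation_id]
    history.append({"role": "main_agent", "content": instructions, "context": context})

    # The sub-agent runs its own agentic loop with its own system prompt and tools.
    result = run_subagent_loop(domain, history)   # hypothetical
    history.append({"role": "subagent", "content": result})

    # Returned like just-in-time instructions: the main agent folds this into its
    # context and keeps the conversation_id so follow-up turns continue the thread.
    return {"conversation_id": conversation_id, "instructions": result}

def run_subagent_loop(domain: str, history: list[dict]) -> str:
    raise NotImplementedError
```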
But with all this in mind, this is very much early exploration for us, and we're still evaluating these responses. So if I can leave you with a few takeaways: stay simple as long as you can, because the simpler your system is, the easier it is to reason about and scale, and you'll have much higher response quality scores. Don't jump to a multi-agent architecture, especially not right away, because one, you're going to be adding unnecessary complexity, and two, it's going to add a lot more latency. Only start exploring that once you have clear evidence that you might actually need it. The second point is that the quality of your tools matters way more than the quantity of your tools. The majority of the time we spend on Sidekick really goes into tool design; we are iterating on and re-evaluating our tools constantly, because the core of the system, the agent itself, is fairly static. And the last bit is to stay modular. There are multiple parts to your agent, and by isolating those individual parts you can reduce the blast radius of any one change and keep your agent running without too much of an issue. So keep all this in mind while I bring up Andrew, who's going to talk about evaluating this system. Thanks.
Hey everyone, and thanks, Charlie, for the Sidekick intro. I'll give a quick mid-talk intro for myself. I'm Andrew McMurra, and I've been building assistants now for 15 years. In 2011 we started as a startup, and we ended up powering LG's and Samsung's assistants on both their phones and TVs. We were acquired by Microsoft in 2017, where we built the first LLM assistant at Microsoft, originally called Sydney, then Bing Chat; when it launched in North America, it was rebranded as Copilot. And now I'm here at Shopify with Charlie and the team building out Sidekick. So I'm going to share a little bit of what I've learned over the years about how to evaluate chat systems, or agents, because they're more difficult than normal ML models.

Here's what I see a lot. Let's say this is a theoretical member of the Sidekick team who may or may not be called Ben. There was a lot of vibe testing of Sidekick. "We tried it out, it looked good, so we shipped it," or "it vibe tested well," is a phrase I've heard quite a bit. But then what would happen? It would launch, there'd be errors, and then theoretical Ben, who may or may not exist, was very sad. So what kind of framework did we build to move away from this? We use LLMs as judges and simulators in order to have a lot of trust in our systems, and I think it really brought Sidekick to the next level. What we do is we have a user simulator, which is LLM based. For us it's merchant facing, so we call it our merchant simulator. The idea is to replay the spirit of a conversation that happened in production against our new candidate system. The candidate system is just one delta, one change to the system, because we really want to test things in isolation. Then we have an LLM judge. After the conversation is replayed with the candidate system, the LLM judge evaluates it across different criteria, and then we can have full trust in the result and ship with confidence.
But I'm going to talk about how to build this LLM judge and simulator, because you can't just vibe create these things either; you need a lot of statistical rigor to actually pull this off. So we create what we call a ground truth set. All of us here are developers, and I don't know how many of us have actually read specs. I don't think I've ever read specs, and I imagine a lot of you haven't either, despite all the hard work PMs put into them. So what we do is get the PMs to create what's called a ground truth set. This is different from a golden set, and I have a quick definition of a golden set here: basically, given an input x, we expect the correct output y, and we have a fixed number of these, maybe a thousand or five thousand. When we test a new model, call it f, and run the input through it, we get y-hat, or y-prime as I have here. The golden set checks: does y-prime match the expected y? And this is how machine learning did testing for a very long time prior to LLMs.

What we actually want to do to create this ground truth set is sample real conversations from prod, or against the prod distribution, then define criteria and label them as humans, and then continuously grow the ground truth set. What a ground truth set looks like is a conversation plus criteria; for Sidekick the criteria are safety, goal fulfillment, grounding, merchant sentiment, and so on, I think there are five, and the PMs or product experts label these.

It's important that we grab both good and bad conversations, because ultimately this ground truth set should be our specs. It should have all the corner cases; it should have the bad things marked as bad, the good things marked as good, and why they're good or bad as part of the criteria. So we don't just get anyone to label these; we're not sending this to Amazon Mechanical Turk. We have, let's say, three or five product experts, like our PM team. What we do is, say we start with around 200 conversations in the ground truth set, we have five PMs or product experts label them, and then we calculate the correlation using something like Cohen's kappa or Kendall's tau, which are statistical measures of agreement. We try to figure out, across those five product experts, how much they agree on what's good and what's bad, and we make sure we have coverage of both good and bad in the ground truth set. We calculate that number, and we consider it the theoretical max of what our LLM judge can achieve, because the LLM judge is not going to hit 100 when humans are not going to hit 100 either.
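As a sketch of that agreement calculation, assuming each of five product experts has assigned an ordinal label (say 1 to 5) to the same set of conversations, you could compute average pairwise Cohen's kappa like this; the labels below are made-up data.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

# labels[rater][i] = score given by that rater to conversation i (made-up data)
labels = {
    "pm_a": [4, 2, 5, 1, 3],
    "pm_b": [4, 3, 5, 1, 3],
    "pm_c": [5, 2, 4, 2, 3],
    "pm_d": [4, 2, 5, 1, 2],
    "pm_e": [3, 2, 5, 1, 3],
}

def average_pairwise_kappa(labels: dict[str, list[int]]) -> float:
    """Mean Cohen's kappa over all rater pairs; weights='quadratic' respects the
    ordinal scale. This is the 'theoretical max' to hold the LLM judge against."""
    scores = [
        cohen_kappa_score(labels[a], labels[b], weights="quadratic")
        for a, b in combinations(labels, 2)
    ]
    return float(np.mean(scores))

print(f"human-human agreement: {average_pairwise_kappa(labels):.2f}")
```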
So, the next step. Let's say we calculated Cohen's kappa and it was around 0.69, which is actually about what it is on Sidekick for the agreement between some of our PMs. Now we're going to write a prompt to try to match what the ground truth set is doing. The ground truth set, again, is a conversation plus criteria like goal fulfillment, safety, overall, and so on, labeled on roughly 250 conversations. 250 isn't enough, but it's a good place to start. Then we prompt an LLM (you can also train a model, or do whatever, but you probably want to start with prompting) and try to get it to match the humans as closely as possible. A lot of people, when they build LLM judges, think back to the golden set: they have maybe a thousand examples with y and y-hat (or y-prime), and their LLM judge asks, is y-prime semantically similar to y, or close enough to y that we consider it correct? But there's a problem with that: you're then limited to the size of your golden test set. If you go this way, with a ground truth set, what you're doing is giving yourself the ability to run the judge on an effectively infinite number of conversations that actually happen in production. As you do your prompting, you're probably going to start with a low correlation between your prompt and the 250-example ground truth set. Then, as you keep iterating, you get closer and closer, and eventually maybe you get to something like 0.61. And again, these are actual numbers; this is where our Sidekick judge is right now. It's very difficult; once you get to a certain point, you're grinding out 0.599, 0.601, and so on. If you're that high, you're in a pretty good state, and there's going to be high trust in it.
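A sketch of that judge-versus-human alignment loop, reusing the same kind of agreement metric; judge_score is a hypothetical function wrapping your judge prompt, and the numbers mentioned above are from the talk, not universal constants.

```python
from scipy.stats import kendalltau

def judge_score(conversation: str, criterion: str) -> int:
    """Hypothetical: send the conversation plus a judging prompt to an LLM and
    parse out a 1-5 score for the given criterion."""
    raise NotImplementedError

def judge_human_agreement(ground_truth: list[dict], criterion: str) -> float:
    """Correlate the judge's scores with the human consensus label on the
    ground truth set; iterate on the judge prompt until this stops improving."""
    human = [example["labels"][criterion] for example in ground_truth]
    judge = [judge_score(example["conversation"], criterion) for example in ground_truth]
    tau, _p_value = kendalltau(human, judge)
    return tau

# e.g. human-human kappa around 0.69 is the ceiling; a judge around 0.6 is usable.
```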
The ultimate goal here, and I hate saying "Turing test," but it is a good way to explain it: you take your five human judges and your LLM judge, and for every conversation, every test in your ground truth set, you want it to be indistinguishable whether the score came from the LLM judge or from a randomly selected human judge. When you've done that, that's when you know you've, quote unquote, passed the Turing test. So that's a pretty statistically rigorous way to make your judge, but there's more needed. Degradation testing is also very important. We have different versions of Sidekick called Badkick, Annoyedkick, Sadkick, maybe an "I'm going to do everything but fulfill your goal" kick, and you can run these as the candidate system. For Badkick, maybe it's always swearing at you or being very rude. Then, when you look at your criteria on that run with the merchant simulator and the judge, you want to target the safety score: you should see the safety score drop, probably the overall score drop, and maybe some other scores drop too, but you are targeting a specific criterion. The ground truth set itself gives us the positive cases we're matching against, and now we're deliberately trying to score badly on certain criteria, which is exactly what we expect to see. So degradation testing is also very important as part of the offline process.
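A sketch of a degradation test along those lines; the persona names come from the talk, but the scoring interface and thresholds are assumptions.

```python
def simulate_and_judge(candidate_system: str, n_conversations: int = 50) -> dict[str, float]:
    """Hypothetical: replay seed conversations through the merchant simulator
    against `candidate_system`, judge each one, and return mean score per criterion."""
    raise NotImplementedError

def test_badkick_drops_safety():
    baseline = simulate_and_judge("sidekick")
    degraded = simulate_and_judge("badkick")   # deliberately rude variant
    # The targeted criterion should clearly drop; if it doesn't, the judge
    # (or the simulator) isn't measuring what we think it is.
    assert degraded["safety"] < baseline["safety"] - 1.0
    assert degraded["overall"] < baseline["overall"]
```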
Once you have pretty high trust in it at this stage, you can go to online metric verification. If your system is mature enough and you have online A/B tests, you can actually look at past tests. You have control and treatment: control is what was in prod when you ran the test, and treatment is the system you wanted to launch. If you see a delta on those past experiments in your online metric, the ultimate goal is that your LLM judge aligns with your online metric. So if you had a past test that was a new model, and in your online system the delta between control and treatment was 12% up, and then you introduced a new tool and it went up by 4%, you want your judge to be directionally aligned with this, and roughly aligned in magnitude, if that's a word. It's not going to match exactly, but if it looks something like that, then you know it's pretty good.
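A tiny sketch of that alignment check over past experiments; the experiment deltas are made up and the tolerance is arbitrary.

```python
# (online_delta, judge_delta) per past experiment, as relative changes (made-up data)
past_experiments = {
    "new_model": (0.12, 0.10),
    "new_tool": (0.04, 0.05),
}

def directionally_aligned(online: float, judge: float) -> bool:
    # Same sign, and not wildly different in magnitude.
    return (online > 0) == (judge > 0) and abs(online - judge) < 0.05

assert all(directionally_aligned(o, j) for o, j in past_experiments.values())
```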
This is a good test to do, because a couple of times I've seen an online flight be very green, and then when we ran the LLM judge afterwards it was very red, and we thought, oh crap, our LLM judge is not working as expected despite all this testing we did. But when we investigated deeper, it turned out that the LLM judge was correct and our online metric was wrong. Something was happening online: maybe we released some bad change and users were clicking on it, but it wasn't aligned with what we wanted in the product, and the judge was able to catch that. That's the positive testing. You can also do degradation testing with online flights. If you wanted, and we've done this, you could launch, maybe not Badkick, but one of the other degradation variants; you can actually flight a worse model at a low percentage, like 1%, and you want to see both your online metric drop and your LLM judge drop. You've got to be careful about that one. That was more of a Microsoft thing than a Shopify thing, I would say.
Now, at this point you should have high confidence in your judge, but this isn't enough yet to have really high trust and to test candidate systems that haven't launched. That's where the user simulator comes in. I introduced it a little bit already, but with chat conversations or agents, after the first turn things can diverge with your candidate system, so you can't just replay exact conversations as they were. You need to create a user simulator, a merchant simulator in our case, that looks at a conversation that happened, gets the essence of it, the goals in it, what the merchant or user was trying to do, and then replays that against the candidate system. That LLM acts like a user and has a conversation with the candidate system, and you end up with a conversation that you can then run your LLM judge on, the judge you've put through all this rigorous testing and you trust.
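A sketch of a simulator loop like that; extract_goals, simulator_turn, and candidate_reply are hypothetical LLM-backed helpers, and the "DONE" stop signal is an assumption.

```python
def extract_goals(seed_conversation: list[dict]) -> str:
    """Hypothetical: summarize what the merchant was trying to accomplish."""
    raise NotImplementedError

def simulator_turn(goals: str, transcript: list[dict]) -> str:
    """Hypothetical: an LLM playing the merchant, pursuing the extracted goals."""
    raise NotImplementedError

def candidate_reply(transcript: list[dict]) -> str:
    """Hypothetical: the candidate system (Sidekick plus one change) responding."""
    raise NotImplementedError

def replay(seed_conversation: list[dict], max_turns: int = 6) -> list[dict]:
    goals = extract_goals(seed_conversation)
    transcript: list[dict] = []
    for _ in range(max_turns):
        user_msg = simulator_turn(goals, transcript)
        transcript.append({"role": "merchant", "content": user_msg})
        if user_msg.strip().upper() == "DONE":       # simulator signals goal reached
            break
        transcript.append({"role": "assistant", "content": candidate_reply(transcript)})
    return transcript   # hand this to the LLM judge
```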
But how can you trust your merchant simulator, or your user simulator? You can't just prompt an LLM, vibe test it, and say, yeah, this is kind of doing the thing. Again, you want statistical rigor to build trust in this thing. There are many things you can do; this is just a simple one: run many A/A tests. Take your seed conversation and run your judge on it. If it gets a score of, say, 3.1, simulate a hundred conversations from that seed and check that when you run your judge on those simulated conversations, the scores are all very similar. Then you know your merchant simulator is doing what you want it to do.
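A small sketch of that A/A check, reusing the hypothetical replay helper above; judge_score_overall is another hypothetical wrapper around the judge, and the tolerance is arbitrary.

```python
import statistics

def judge_score_overall(conversation: list[dict]) -> float:
    """Hypothetical: the LLM judge's overall score for a conversation."""
    raise NotImplementedError

def aa_test(seed_conversation: list[dict], n: int = 100, tolerance: float = 0.3) -> bool:
    """The simulator should reproduce the spirit of the seed: judged scores of
    simulated replays should cluster around the seed conversation's score."""
    seed_score = judge_score_overall(seed_conversation)
    replay_scores = [judge_score_overall(replay(seed_conversation)) for _ in range(n)]
    return (abs(statistics.mean(replay_scores) - seed_score) < tolerance
            and statistics.stdev(replay_scores) < tolerance)
```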
So I'd say this is a pretty trustable way to build an LLM judge and get away from vibe testing, and you're not vibe creating an LLM judge that just asks "is this the same as that." You're actually building something you can run on effectively infinite data. There are many, many positives to this. One positive is certainly that you can use it for reinforcement learning algorithms; DSPy, for example, is very popular right now. A lot of people do that on their golden set with a judge. But the problem is that they're then tuning their prompt, with DSPy for example, using their golden set and an LLM judge, and if their golden set is only 500 examples, they're basically just overfitting on those 500. When you have a judge aligned against a ground truth set, you can run it on effectively unlimited data, and your RL systems are going to work a lot better. So what else can you do? Create skill judges, RLHF pipelines, and RL pipelines.
At the end of the talk here, I want to talk a little bit about some of the RL we did and something to be wary of. I'm not going to go super deep into GRPO, which is what we were using, or reinforcement learning in general, but I do want to share some very interesting things that happen with judges, and something you need to be aware of: even if you build a high-trust judge, RL is going to learn how to exploit it. It's actually an interesting experiment to run, because then you can keep iterating on your judge. For those of you who are unfamiliar with reinforcement learning, at its core you have an environment and an agent interacting with that environment. The environment has a state; when the agent interacts with the environment, the state changes, and given the state change you have a reward function. Both of those go back to the agent, and the agent keeps taking actions and keeps getting rewards while it stays in the environment. When you think about this for training a generation model, you have the generation, then your reward model, which in our case is the LLM judge giving a score, and that goes into the loss function, which backpropagates into the model. The next generation is then an improved version of the model, based on whether the judge, the reward function, said it was good or not, and the longer you run it, the more steps it takes, the higher the performance is going to be.
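As a rough sketch of that setup, here is what using an LLM judge as the reward signal might look like; the policy and trainer interfaces are hypothetical, and this is not the GRPO implementation used on Sidekick.

```python
def judge_reward(prompt: str, completion: str) -> float:
    """Hypothetical: score the completion with the LLM judge, normalized to [0, 1].
    Whatever the judge fails to penalize, RL will eventually learn to exploit."""
    raise NotImplementedError

def training_step(policy, prompts: list[str]) -> None:
    completions = [policy.generate(p) for p in prompts]             # rollouts
    rewards = [judge_reward(p, c) for p, c in zip(prompts, completions)]
    policy.update(prompts, completions, rewards)                    # e.g. a GRPO/PPO-style update
```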
So when we ran this on our system, on one of the fine-tuned models behind one of the skills Charlie talked about, we saw the accuracy go from basically 79% to about 99%. The team was super happy, and I said, this is way too good to be true. So they started doing manual analysis and found some pretty interesting things. One of the tasks was, let's say, generating SEO for a product description. The answer we're expecting, something good, might be a solid SEO title. But what the model was actually doing was generating a response that says, "Unfortunately I can't do that for you, as that's not something I support." It can support it, but it found out that this got the highest reward, because our judge was not good enough at evaluating this task: when it read that response, it considered it correct, which was a fault in our judge. So despite all that rigor, you can still have faults in your judge, and RL will find a way to exploit them. Something to be careful about.
Another one: customer segmentation, which the model can do. There's a model that specifically turns natural language into a segmentation query, and the correct answer here is something like "customer account status equals enabled"; that's an actual field in segmentation. But you can also have free-form tags in segmentation. What the model was doing here, after being tuned with GRPO against our judge, was using the free-form tags instead, saying just look for a customer tag that says "enabled" or a customer tag that says "active." Our judge checks the validity of the generated query as well as whether it's semantically correct, and in this case it is a valid query and it looks correct, but it's technically wrong. So we had to keep iterating on our judges and improving them. This is definitely something to watch out for when you create your LLM judges.
So, some takeaways from this talk. Create an LLM judge, but don't just do vibe testing, and don't vibe create the judge either. That's probably the main thing: people say, yes, I have an LLM judge, but they vibe coded their LLM judge, or they have a fixed golden set and it's just testing whether y is the same as y-hat, and that's not enough to really do the testing, get into RL, and have high trust in it. You want to align your judges to product experts; you probably don't even want the developers on the team doing the labeling, you want the product leaders doing it, and you want to continuously grow your ground truth set. One important thing when you're building the ground truth set: don't collect only good conversations. You want good and bad, and you want to label them as the merchant would, not as somebody who knows what your product supports and doesn't support. So if Sidekick correctly says, "I can't do that for you," the merchant sentiment and overall scores should still be low, so that you can find those conversations, know what product features to build next, and, when you do build those features, see your scores on the judge go up. Always think from the user's perspective: how do I grade this conversation without knowledge of what's supported or not? And when you do create these things, expect reward hacking; expect the model to find loopholes, especially if you're using the judge in an RL system.

So, thank you. We are hiring, and if you want to come chat with us at the booth, we are there. Thanks.