I'm primarily here today to give you a quick intro, to tell you what Sidekick is and to share a few of the insights we've learned while building Sidekick.

So, to start, this is Sidekick. Sidekick is an assistant that lives within the Shopify admin, and merchants use it to get general help with their store and manage their business. In this specific example, it's actually running on Andrew's store; he's going to join us later to talk about evals. Andrew asked Sidekick, "Hey, could you fetch and analyze the sales from the last 30 days and provide recommendations based on what you've learned?" Sidekick can decompose that into individual tasks, collect all the different pieces of context, and generate a fairly reasonable response.

So how does this work internally? At its core, Sidekick is just an agent, a very simple agent. It has an LLM that is equipped with many tools. Those tools interact with the environment, which in this case is the Shopify store and other Shopify APIs. The agent can reason about the tool responses and then generate a response back to the merchant, the human in this case. Now, Sidekick today doesn't actually have that many defined workflows, and I know that many agents are built with very defined workflows in mind to keep consistency. But we found that strapping an LLM with very well-defined tools provides the best balance: it produces the best quality responses while also giving the LLM enough space to recover from errors and edge cases. It gives that flexibility back to the LLM.
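To make the shape of that loop concrete, here is a minimal sketch of a tool-calling agent loop in Python. Everything here is hypothetical: the call_llm function, the tool names, and the message format are stand-ins rather than Sidekick's actual implementation; it only illustrates the pattern of an LLM calling well-defined tools until it can answer.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolCall:
    name: str
    arguments: dict

@dataclass
class LLMResponse:
    text: str
    tool_calls: list[ToolCall]

# Hypothetical registry of well-defined tools; each returns a structured dict.
TOOLS: dict[str, Callable[..., dict]] = {
    "fetch_sales_report": lambda days=30: {"total_sales": 12345.67, "days": days},
    "search_help_center": lambda query="": {"articles": []},
}

def call_llm(messages: list[dict]) -> LLMResponse:
    """Stand-in for a real LLM call that may return tool calls or a final answer."""
    raise NotImplementedError("wire up your LLM provider here")

def run_agent(user_message: str, max_steps: int = 8) -> str:
    messages = [
        {"role": "system", "content": "You are a commerce assistant with tools."},
        {"role": "user", "content": user_message},
    ]
    for _ in range(max_steps):
        response = call_llm(messages)
        if not response.tool_calls:          # no more tools needed: final answer
            return response.text
        for call in response.tool_calls:     # execute each requested tool
            result = TOOLS[call.name](**call.arguments)
            messages.append({"role": "tool", "name": call.name, "content": result})
    return "Sorry, I couldn't finish that request."
```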
Now, before I continue, I want to go back in time to when we were first developing Sidekick, and to the first handful of skills, or tools, that we built. The main criteria for picking these tools were: one, does it provide value to the merchant, and two, does it actually save them time? There's no point in building a tool if the merchant can accomplish that task faster in other ways.

As an example, here are the first two tools: customer segmentation and analytics. Customer segmentation, if you don't know, is a core feature at Shopify that lets you group customers based on a set of criteria. These groups can be used for marketing campaigns, tagged with discounts, or used to get general buyer insights about your customers. But it requires the merchant to learn a fairly bespoke query language just to start digging into that data. Same with analytics. Analytics at Shopify is a wealth of information: it can give you order details, sales details, and trends. But again, you as a merchant have to learn this query language, and you can imagine some of the non-technical merchants having trouble learning it and even getting to the starting line of digging into these insights.

With the advent of LLMs, this type of problem and this type of feature becomes so much more accessible. You can task an LLM with generating these queries, and you can have the agent run them and provide the insights directly. For these two specific skills, we actually had to fine-tune a model that translates a user request into the queries you see above. Inside the tool, Sidekick would then validate the response from the fine-tuned model, run the query, and generate the response.
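A minimal sketch of that validate-then-run pattern, assuming a hypothetical fine-tuned query generator and a hypothetical validator; the function names and the query fields are illustrative, not Shopify's actual APIs.

```python
class InvalidQueryError(Exception):
    pass

def generate_segment_query(user_request: str) -> str:
    """Stand-in for the fine-tuned model that maps natural language to a query."""
    raise NotImplementedError

def validate_query(query: str) -> None:
    """Cheap structural checks before execution; reject anything malformed."""
    allowed_fields = {"customer_account_status", "number_of_orders", "customer_tags"}
    if not query or not any(field in query for field in allowed_fields):
        raise InvalidQueryError(f"unrecognized query: {query!r}")

def run_query(query: str) -> list[dict]:
    """Stand-in for executing the validated query against the store's data."""
    raise NotImplementedError

def segmentation_tool(user_request: str) -> dict:
    """Tool result returned to the agent: either rows it can summarize, or an
    error it can recover from (e.g. by asking the merchant to rephrase)."""
    query = generate_segment_query(user_request)
    try:
        validate_query(query)
    except InvalidQueryError as err:
        return {"status": "error", "detail": str(err)}
    return {"status": "ok", "query": query, "rows": run_query(query)}
```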
In addition, we were thinking: it's amazing that it can do these complex tasks, but it would be a little funny if it couldn't do the basic ones. So alongside the complex tasks, we also introduced three more basic skills. The first is navigation, which is fairly self-explanatory: it helps merchants, especially new ones, find their way around the Shopify admin, which can be a very big place. The second is the help tool, your classic RAG-based tool, hooked up to the Shopify Help Center, which contains all the documents you probably need to run your business. And lastly, form filling. Form filling gives Sidekick the ability to generate a preview of a create or edit action for any of the resources on the merchant's store. And I want to be really clear here: Sidekick itself doesn't mutate the state of the shop. You can imagine that a free-form agent mutating many, many things would be a bad experience for users, especially if the users don't have any input. So all it does is give the UI a preview that is shown to the merchant, and the merchant has to actually sign off and commit the changes on their own.
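A sketch of what a preview-only form-filling tool could look like under those constraints; the resource names, field layout, and commit flow are assumptions for illustration, not the actual Sidekick contract.

```python
from dataclasses import dataclass
import uuid

@dataclass
class FormPreview:
    """A proposed create/edit action; nothing is written until the merchant commits."""
    preview_id: str
    resource: str            # e.g. "product" or "discount" (illustrative)
    action: str              # "create" or "edit"
    fields: dict

PENDING_PREVIEWS: dict[str, FormPreview] = {}

def form_fill_tool(resource: str, action: str, fields: dict) -> dict:
    """Tool result: a preview payload the UI renders for the merchant to review."""
    preview = FormPreview(preview_id=str(uuid.uuid4()), resource=resource,
                          action=action, fields=fields)
    PENDING_PREVIEWS[preview.preview_id] = preview
    return {"type": "form_preview", "preview_id": preview.preview_id,
            "resource": resource, "action": action, "fields": fields}

def commit_preview(preview_id: str) -> dict:
    """Called only from the UI after the merchant explicitly signs off."""
    preview = PENDING_PREVIEWS.pop(preview_id)
    # ...the actual write to the store would happen here, via normal admin APIs...
    return {"status": "committed", "resource": preview.resource}
```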
Now, for the most part, we launched this with a few of those skills, or tools, and found pretty good success. A lot of merchants really enjoyed the additional help Sidekick provided. And we were able to look at the gaps between what Sidekick could answer and what it couldn't. For the things it couldn't answer, we'd say, let's just build a tool for that and add it to the system; you can just keep growing Sidekick's list of responsibilities. For the most part that worked, right up until it didn't. We started to notice that when we had way too many tools, the LLM would start to confuse the responsibilities of the different tools, it would misapply the instructions we had in the system prompt across tools, and in general it lowered the quality of the responses. And you can imagine this problem being further exacerbated the more tools we prototyped within Sidekick. A coworker of mine describes this as death by a thousand instructions. You have conflicting instructions. It slows down the entire processing of the LLM, because you can imagine that giant system prompt being dynamically swapped out and rebuilt. It becomes incredibly difficult to debug, especially when you have other external contributors trying to add to that one system prompt. And ultimately it's very hard to evaluate, so you don't even know if you're moving in the right direction.
So what did we do? The first major refactor was introducing a concept we call just-in-time instructions. Just-in-time instructions remove the complexity of having all of the conditionals in your agent's main system prompt and move them into the tool response directly. This provides two things. One, it keeps the core of your agent's behavior fairly static; the system prompt is just the agent behavior you're trying to achieve. Two, it still surfaces the tool instructions when you actually need them, that is, when the tool is actually called. We also found a side benefit: this is more cache friendly, because if your system prompt is static and you're only appending tool results, you can maintain a longer prompt cache, especially with the LLM providers that offer that feature. And because the instructions now live in the tool results, we found teams experimenting with different instructions passed back to the agent, using things like beta flags, depending on the model being used or the page the merchant is on. So ultimately, when we have other teams contributing, the blast radius of any one tool doesn't affect the core of the agent's behavior as a whole.

So what does this look like in practice? It's really just moving those instructions from the system prompt into the structured tool results. This is a simplified version of our help tool, and you can see that it returns a very specific citations format. When we used to have that format and those instructions in the system prompt, Sidekick would actually misuse the citations format for other tools that weren't help related at all. So this one simple change helped us modularize the process and scale out to way more tools. From the merchant's view, of course, they don't see the structured response; they just see the agent adhering to the instructions provided by those tool results.
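Here is a sketch of what a tool result carrying just-in-time instructions might look like, loosely modeled on the simplified help tool described above. The field names and the citation format are assumptions for illustration, not Sidekick's actual schema.

```python
def help_tool(question: str) -> dict:
    """RAG-style lookup whose result carries its own usage instructions, so the
    citation rules only enter the context when this tool is actually called."""
    documents = retrieve_help_articles(question)   # hypothetical retriever
    return {
        "documents": [
            {"id": doc["id"], "title": doc["title"], "excerpt": doc["excerpt"]}
            for doc in documents
        ],
        # Just-in-time instructions: scoped to this tool call, not the system prompt.
        "instructions": (
            "Answer using only the documents above. Cite each claim with its "
            "document id in the form [doc:<id>]. If the documents do not cover "
            "the question, say so instead of guessing."
        ),
    }

def retrieve_help_articles(question: str) -> list[dict]:
    """Stand-in for the Shopify Help Center retrieval step."""
    raise NotImplementedError
```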
So this worked even further, and we were able to scale out to way more tools. For the most part, teams would introduce one tool that represented their domain. Then, more recently, we started onboarding some more complex features at Shopify, and those teams would require a huge number of domain-specific tools. We felt like we were heading back into the original problem, where the main agent has to keep track of all the different tools across multiple, complex domains. So this is something we're exploring today: sub-agents. Sub-agents are specialized agents that handle those specific domains. But the key point is that the only agent the merchant is talking to is still Sidekick. Sidekick is the only point of contact, and the merchant can't speak to the sub-agents directly. This ensures that the tone and voice of Sidekick stay consistent, while you still delegate some of those domain-specific tasks to a sub-agent.

So what does this look like in practice? It should look very familiar; it's the same interface. You're calling a tool from Sidekick that hands off to a sub-agent with a set of instructions, an optional conversation ID (which we'll get back to), and any other specific pieces of context that sub-agent might need. The sub-agent takes those instructions, runs its own internal agentic loop with its own system prompt and its own set of domain-specific tools, and then, just like just-in-time instructions, returns instructions back to Sidekick so Sidekick can form the response to the merchant.

Pausing to look at that conversation ID: why would you want one? You'll find that a lot of users aren't very good at providing the full spec of what they're trying to accomplish in a single turn. They're not writing a giant brief with all the different conditions; they want to iterate across multiple turns. That means your sub-agent needs to be aware of the turns that came before, so it has the context to move on to the next step. So we pass the conversation ID back from the sub-agent, and the main agent can say, hey, this person is actually continuing that thread with the sub-agent; please add this set of instructions to your conversation history so you can continue on to the next step.
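A rough sketch of what such a handoff tool could look like, with a hypothetical sub-agent interface; the parameter names (instructions, conversation_id, context) mirror the description above but are not Sidekick's actual contract.

```python
import uuid

SUBAGENT_SESSIONS: dict[str, list[dict]] = {}   # conversation_id -> turn history

def delegate_to_subagent(domain: str, instructions: str,
                         context: dict, conversation_id: str | None = None) -> dict:
    """Tool exposed to the main agent. The merchant never talks to the sub-agent
    directly; the main agent remains the single point of contact."""
    if conversation_id is None:
        conversation_id = str(uuid.uuid4())
        SUBAGENT_SESSIONS[conversation_id] = []
    history = SUBAGENT_SESSIONS[conversation_id]
    history.append({"role": "main_agent", "content": instructions, "context": context})

    # The sub-agent runs its own agentic loop with its own system prompt and tools.
    result = run_subagent_loop(domain, history)   # hypothetical
    history.append({"role": "subagent", "content": result})

    # Returned like just-in-time instructions: the main agent folds this into its
    # context and keeps the conversation_id so follow-up turns continue the thread.
    return {"conversation_id": conversation_id, "instructions": result}

def run_subagent_loop(domain: str, history: list[dict]) -> str:
    raise NotImplementedError
```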
But with all this in mind, this is very much early exploration for us, and we're still evaluating these responses. So if I can leave you with a few takeaways: stay simple as long as you can, because the simpler your system is, the easier it is to reason about and scale, and you'll have much higher response quality scores. Don't jump to a multi-agent architecture, especially not right away, because one, you're going to be adding unnecessary complexity, and two, it's going to add a lot more latency. Only start exploring that once you have clear evidence that you might actually need it. The second point is that the quality of your tools matters way more than the quantity of your tools. The majority of the time we spend on Sidekick really goes into tool design; we are iterating on and re-evaluating our tools constantly, because the core of the system, the agent itself, is fairly static. And the last bit is to stay modular. There are multiple parts to your agent, and by isolating those individual parts you can reduce the blast radius of any one change and keep your agent running without too much of an issue. So keep all this in mind while I bring up Andrew, who's going to talk about evaluating this system. Thanks.
Hey everyone, and thanks, Charlie, for the Sidekick intro. I'll give a quick mid-talk intro for myself. I'm Andrew McMurra, and I've been building assistants now for 15 years. In 2011 we started as a startup, and we ended up powering LG's and Samsung's assistants on both their phones and TVs. We were acquired by Microsoft in 2017, where we built the first LLM assistant at Microsoft, originally called Sydney, then Bing Chat; when it launched in North America, it was rebranded as Copilot. And now I'm here at Shopify with Charlie and the team building out Sidekick. So I'm going to share a little bit of what I've learned over the years about how to evaluate chat systems, or agents, because they're more difficult than normal ML models.

Here's what I see a lot. Let's say this is a theoretical member of the Sidekick team who may or may not be called Ben. There was a lot of vibe testing of Sidekick. "We tried it out, it looked good, so we shipped it," or "it vibe tested well," is a phrase I've heard quite a bit. But then what would happen? It would launch, there'd be errors, and then theoretical Ben, who may or may not exist, was very sad. So what kind of framework did we build to move away from this? We use LLMs as judges and simulators in order to have a lot of trust in our systems, and I think it really brought Sidekick to the next level. What we do is we have a user simulator, which is LLM based. For us it's merchant facing, so we call it our merchant simulator. The idea is to replay the spirit of a conversation that happened in production against our new candidate system. The candidate system is just one delta, one change to the system, because we really want to test things in isolation. Then we have an LLM judge. After the conversation is replayed with the candidate system, the LLM judge evaluates it across different criteria, and then we can have full trust in the result and ship with confidence.
But I'm going to talk about how to build this LLM judge and simulator, because you can't just vibe create these things either; you need a lot of statistical rigor to actually pull this off. So we create what we call a ground truth set. All of us here are developers, and I don't know how many of us have actually read specs. I don't think I've ever read specs, and I imagine a lot of you haven't either, despite all the hard work PMs put into them. So what we do is get the PMs to create what's called a ground truth set. This is different from a golden set, and I have a quick definition of a golden set here: basically, given an input x, we expect the correct output y, and we have a fixed number of these, maybe a thousand or five thousand. When we test a new model, call it f, and run the input through it, we get y-hat, or y-prime as I have here. The golden set checks: does y-prime match the expected y? And this is how machine learning did testing for a very long time prior to LLMs.

What we actually want to do to create this ground truth set is sample real conversations from prod, or against the prod distribution, then define criteria and label them as humans, and then continuously grow the ground truth set. What a ground truth set looks like is a conversation plus criteria; for Sidekick the criteria are safety, goal fulfillment, grounding, merchant sentiment, and so on, I think there are five, and the PMs or product experts label these.

It's important that we grab both good and bad conversations, because ultimately this ground truth set should be our specs. It should have all the corner cases; it should have the bad things marked as bad, the good things marked as good, and why they're good or bad as part of the criteria. So we don't just get anyone to label these; we're not sending this to Amazon Mechanical Turk. We have, let's say, three or five product experts, like our PM team. What we do is, say we start with around 200 conversations in the ground truth set, we have five PMs or product experts label them, and then we calculate the correlation using something like Cohen's kappa or Kendall's tau, which are statistical measures of agreement. We try to figure out, across those five product experts, how much they agree on what's good and what's bad, and we make sure we have coverage of both good and bad in the ground truth set. We calculate that number, and we consider it the theoretical max of what our LLM judge can achieve, because the LLM judge is not going to hit 100 when humans are not going to hit 100 either.
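As a sketch of that agreement calculation, assuming each of five product experts has assigned an ordinal label (say 1 to 5) to the same set of conversations, you could compute average pairwise Cohen's kappa like this; the labels below are made-up data.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

# labels[rater][i] = score given by that rater to conversation i (made-up data)
labels = {
    "pm_a": [4, 2, 5, 1, 3],
    "pm_b": [4, 3, 5, 1, 3],
    "pm_c": [5, 2, 4, 2, 3],
    "pm_d": [4, 2, 5, 1, 2],
    "pm_e": [3, 2, 5, 1, 3],
}

def average_pairwise_kappa(labels: dict[str, list[int]]) -> float:
    """Mean Cohen's kappa over all rater pairs; weights='quadratic' respects the
    ordinal scale. This is the 'theoretical max' to hold the LLM judge against."""
    scores = [
        cohen_kappa_score(labels[a], labels[b], weights="quadratic")
        for a, b in combinations(labels, 2)
    ]
    return float(np.mean(scores))

print(f"human-human agreement: {average_pairwise_kappa(labels):.2f}")
```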
So, the next step. Let's say we calculated Cohen's kappa and it was around 0.69, which is actually about what it is on Sidekick for the agreement between some of our PMs. Now we're going to write a prompt to try to match what the ground truth set is doing. The ground truth set, again, is a conversation plus criteria like goal fulfillment, safety, overall, and so on, labeled on roughly 250 conversations. 250 isn't enough, but it's a good place to start. Then we prompt an LLM (you can also train a model, or do whatever, but you probably want to start with prompting) and try to get it to match the humans as closely as possible. A lot of people, when they build LLM judges, think back to the golden set: they have maybe a thousand examples with y and y-hat (or y-prime), and their LLM judge asks, is y-prime semantically similar to y, or close enough to y that we consider it correct? But there's a problem with that: you're then limited to the size of your golden test set. If you go this way, with a ground truth set, what you're doing is giving yourself the ability to run the judge on an effectively infinite number of conversations that actually happen in production. As you do your prompting, you're probably going to start with a low correlation between your prompt and the 250-example ground truth set. Then, as you keep iterating, you get closer and closer, and eventually maybe you get to something like 0.61. And again, these are actual numbers; this is where our Sidekick judge is right now. It's very difficult; once you get to a certain point, you're grinding out 0.599, 0.601, and so on. If you're that high, you're in a pretty good state, and there's going to be high trust in it.
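A sketch of that judge-versus-human alignment loop, reusing the same kind of agreement metric; judge_score is a hypothetical function wrapping your judge prompt, and the numbers mentioned above are from the talk, not universal constants.

```python
from scipy.stats import kendalltau

def judge_score(conversation: str, criterion: str) -> int:
    """Hypothetical: send the conversation plus a judging prompt to an LLM and
    parse out a 1-5 score for the given criterion."""
    raise NotImplementedError

def judge_human_agreement(ground_truth: list[dict], criterion: str) -> float:
    """Correlate the judge's scores with the human consensus label on the
    ground truth set; iterate on the judge prompt until this stops improving."""
    human = [example["labels"][criterion] for example in ground_truth]
    judge = [judge_score(example["conversation"], criterion) for example in ground_truth]
    tau, _p_value = kendalltau(human, judge)
    return tau

# e.g. human-human kappa around 0.69 is the ceiling; a judge around 0.6 is usable.
```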
The ultimate goal here, and I hate saying "Turing test," but it is a good way to explain it: you take your five human judges and your LLM judge, and for every conversation, every test in your ground truth set, you want it to be indistinguishable whether the score came from the LLM judge or from a randomly selected human judge. When you've done that, that's when you know you've, quote unquote, passed the Turing test. So that's a pretty statistically rigorous way to make your judge, but there's more needed. Degradation testing is also very important. We have different versions of Sidekick called Badkick, Annoyedkick, Sadkick, maybe an "I'm going to do everything but fulfill your goal" kick, and you can run these as the candidate system. For Badkick, maybe it's always swearing at you or being very rude. Then, when you look at your criteria on that run with the merchant simulator and the judge, you want to target the safety score: you should see the safety score drop, probably the overall score drop, and maybe some other scores drop too, but you are targeting a specific criterion. The ground truth set itself gives us the positive cases we're matching against, and now we're deliberately trying to score badly on certain criteria, which is exactly what we expect to see. So degradation testing is also very important as part of the offline process.
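A sketch of a degradation test along those lines; the persona names come from the talk, but the scoring interface and thresholds are assumptions.

```python
def simulate_and_judge(candidate_system: str, n_conversations: int = 50) -> dict[str, float]:
    """Hypothetical: replay seed conversations through the merchant simulator
    against `candidate_system`, judge each one, and return mean score per criterion."""
    raise NotImplementedError

def test_badkick_drops_safety():
    baseline = simulate_and_judge("sidekick")
    degraded = simulate_and_judge("badkick")   # deliberately rude variant
    # The targeted criterion should clearly drop; if it doesn't, the judge
    # (or the simulator) isn't measuring what we think it is.
    assert degraded["safety"] < baseline["safety"] - 1.0
    assert degraded["overall"] < baseline["overall"]
```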
Once you have pretty high trust in it at this stage, you can go to online metric verification. If your system is mature enough and you have online A/B tests, you can actually look at past tests. You have control and treatment: control is what was in prod when you ran the test, and treatment is the system you wanted to launch. If you see a delta on those past experiments in your online metric, the ultimate goal is that your LLM judge aligns with your online metric. So if you had a past test that was a new model, and in your online system the delta between control and treatment was 12% up, and then you introduced a new tool and it went up by 4%, you want your judge to be directionally aligned with this, and roughly aligned in magnitude, if that's a word. It's not going to match exactly, but if it looks something like that, then you know it's pretty good.
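A tiny sketch of that alignment check over past experiments; the experiment deltas are made up and the tolerance is arbitrary.

```python
# (online_delta, judge_delta) per past experiment, as relative changes (made-up data)
past_experiments = {
    "new_model": (0.12, 0.10),
    "new_tool": (0.04, 0.05),
}

def directionally_aligned(online: float, judge: float) -> bool:
    # Same sign, and not wildly different in magnitude.
    return (online > 0) == (judge > 0) and abs(online - judge) < 0.05

assert all(directionally_aligned(o, j) for o, j in past_experiments.values())
```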
This is a good test to do, because a couple of times I've seen an online flight be very green, and then when we ran the LLM judge afterwards it was very red, and we thought, oh crap, our LLM judge is not working as expected despite all this testing we did. But when we investigated deeper, it turned out that the LLM judge was correct and our online metric was wrong. Something was happening online: maybe we released some bad change and users were clicking on it, but it wasn't aligned with what we wanted in the product, and the judge was able to catch that. That's the positive testing. You can also do degradation testing with online flights. If you wanted, and we've done this, you could launch, maybe not Badkick, but one of the other degradation variants; you can actually flight a worse model at a low percentage, like 1%, and you want to see both your online metric drop and your LLM judge drop. You've got to be careful about that one. That was more of a Microsoft thing than a Shopify thing, I would say.
Now, at this point you should have high confidence in your judge, but this isn't enough yet to have really high trust and to test candidate systems that haven't launched. That's where the user simulator comes in. I introduced it a little bit already, but with chat conversations or agents, after the first turn things can diverge with your candidate system, so you can't just replay exact conversations as they were. You need to create a user simulator, a merchant simulator in our case, that looks at a conversation that happened, gets the essence of it, the goals in it, what the merchant or user was trying to do, and then replays that against the candidate system. That LLM acts like a user and has a conversation with the candidate system, and you end up with a conversation that you can then run your LLM judge on, the judge you've put through all this rigorous testing and you trust.
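A sketch of a simulator loop like that; extract_goals, simulator_turn, and candidate_reply are hypothetical LLM-backed helpers, and the "DONE" stop signal is an assumption.

```python
def extract_goals(seed_conversation: list[dict]) -> str:
    """Hypothetical: summarize what the merchant was trying to accomplish."""
    raise NotImplementedError

def simulator_turn(goals: str, transcript: list[dict]) -> str:
    """Hypothetical: an LLM playing the merchant, pursuing the extracted goals."""
    raise NotImplementedError

def candidate_reply(transcript: list[dict]) -> str:
    """Hypothetical: the candidate system (Sidekick plus one change) responding."""
    raise NotImplementedError

def replay(seed_conversation: list[dict], max_turns: int = 6) -> list[dict]:
    goals = extract_goals(seed_conversation)
    transcript: list[dict] = []
    for _ in range(max_turns):
        user_msg = simulator_turn(goals, transcript)
        transcript.append({"role": "merchant", "content": user_msg})
        if user_msg.strip().upper() == "DONE":       # simulator signals goal reached
            break
        transcript.append({"role": "assistant", "content": candidate_reply(transcript)})
    return transcript   # hand this to the LLM judge
```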
But how can you trust your merchant simulator, or your user simulator? You can't just prompt an LLM, vibe test it, and say, yeah, this is kind of doing the thing. Again, you want statistical rigor to build trust in this thing. There are many things you can do; this is just a simple one: run many A/A tests. Take your seed conversation and run your judge on it. If it gets a score of, say, 3.1, simulate a hundred conversations from that seed and check that when you run your judge on those simulated conversations, the scores are all very similar. Then you know your merchant simulator is doing what you want it to do.
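A small sketch of that A/A check, reusing the hypothetical replay helper above; judge_score_overall is another hypothetical wrapper around the judge, and the tolerance is arbitrary.

```python
import statistics

def judge_score_overall(conversation: list[dict]) -> float:
    """Hypothetical: the LLM judge's overall score for a conversation."""
    raise NotImplementedError

def aa_test(seed_conversation: list[dict], n: int = 100, tolerance: float = 0.3) -> bool:
    """The simulator should reproduce the spirit of the seed: judged scores of
    simulated replays should cluster around the seed conversation's score."""
    seed_score = judge_score_overall(seed_conversation)
    replay_scores = [judge_score_overall(replay(seed_conversation)) for _ in range(n)]
    return (abs(statistics.mean(replay_scores) - seed_score) < tolerance
            and statistics.stdev(replay_scores) < tolerance)
```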
So I'd say this is a pretty trustable way to build an LLM judge and get away from vibe testing, and you're not vibe creating an LLM judge that just asks "is this the same as that." You're actually building something you can run on effectively infinite data. There are many, many positives to this. One positive is certainly that you can use it for reinforcement learning algorithms; DSPy, for example, is very popular right now. A lot of people do that on their golden set with a judge. But the problem is that they're then tuning their prompt, with DSPy for example, using their golden set and an LLM judge, and if their golden set is only 500 examples, they're basically just overfitting on those 500. When you have a judge aligned against a ground truth set, you can run it on effectively unlimited data, and your RL systems are going to work a lot better. So what else can you do? Create skill judges, RLHF pipelines, and RL pipelines.
At the end of the talk here, I want to talk a little bit about some of the RL we did and something to be wary of. I'm not going to go super deep into GRPO, which is what we were using, or reinforcement learning in general, but I do want to share some very interesting things that happen with judges, and something you need to be aware of: even if you build a high-trust judge, RL is going to learn how to exploit it. It's actually an interesting experiment to run, because then you can keep iterating on your judge. For those of you who are unfamiliar with reinforcement learning, at its core you have an environment and an agent interacting with that environment. The environment has a state; when the agent interacts with the environment, the state changes, and given the state change you have a reward function. Both of those go back to the agent, and the agent keeps taking actions and keeps getting rewards while it stays in the environment. When you think about this for training a generation model, you have the generation, then your reward model, which in our case is the LLM judge giving a score, and that goes into the loss function, which backpropagates into the model. The next generation is then an improved version of the model, based on whether the judge, the reward function, said it was good or not, and the longer you run it, the more steps it takes, the higher the performance is going to be.
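As a rough sketch of that setup, here is what using an LLM judge as the reward signal might look like; the policy and trainer interfaces are hypothetical, and this is not the GRPO implementation used on Sidekick.

```python
def judge_reward(prompt: str, completion: str) -> float:
    """Hypothetical: score the completion with the LLM judge, normalized to [0, 1].
    Whatever the judge fails to penalize, RL will eventually learn to exploit."""
    raise NotImplementedError

def training_step(policy, prompts: list[str]) -> None:
    completions = [policy.generate(p) for p in prompts]             # rollouts
    rewards = [judge_reward(p, c) for p, c in zip(prompts, completions)]
    policy.update(prompts, completions, rewards)                    # e.g. a GRPO/PPO-style update
```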
So when we ran this on our system, on one of the fine-tuned models behind one of the skills Charlie talked about, we saw the accuracy go from basically 79% to about 99%. The team was super happy, and I said, this is way too good to be true. So they started doing manual analysis and found some pretty interesting things. One of the tasks was, let's say, generating SEO for a product description. The answer we're expecting, something good, might be a solid SEO title. But what the model was actually doing was generating a response that says, "Unfortunately I can't do that for you, as that's not something I support." It can support it, but it found out that this got the highest reward, because our judge was not good enough at evaluating this task: when it read that response, it considered it correct, which was a fault in our judge. So despite all that rigor, you can still have faults in your judge, and RL will find a way to exploit them. Something to be careful about.
Another one: customer segmentation, which the model can do. There's a model that specifically turns natural language into a segmentation query, and the correct answer here is something like "customer account status equals enabled"; that's an actual field in segmentation. But you can also have free-form tags in segmentation. What the model was doing here, after being tuned with GRPO against our judge, was using the free-form tags instead, saying just look for a customer tag that says "enabled" or a customer tag that says "active." Our judge checks the validity of the generated query as well as whether it's semantically correct, and in this case it is a valid query and it looks correct, but it's technically wrong. So we had to keep iterating on our judges and improving them. This is definitely something to watch out for when you create your LLM judges.
So, some takeaways from this talk. Create an LLM judge, but don't just do vibe testing, and don't vibe create the judge either. That's probably the main thing: people say, yes, I have an LLM judge, but they vibe coded their LLM judge, or they have a fixed golden set and it's just testing whether y is the same as y-hat, and that's not enough to really do the testing, get into RL, and have high trust in it. You want to align your judges to product experts; you probably don't even want the developers on the team doing the labeling, you want the product leaders doing it, and you want to continuously grow your ground truth set. One important thing when you're building the ground truth set: don't collect only good conversations. You want good and bad, and you want to label them as the merchant would, not as somebody who knows what your product supports and doesn't support. So if Sidekick correctly says, "I can't do that for you," the merchant sentiment and overall scores should still be low, so that you can find those conversations, know what product features to build next, and, when you do build those features, see your scores on the judge go up. Always think from the user's perspective: how do I grade this conversation without knowledge of what's supported or not? And when you do create these things, expect reward hacking; expect the model to find loopholes, especially if you're using the judge in an RL system.

So, thank you. We are hiring, and if you want to come chat with us at the booth, we are there. Thanks.