Summarized using AI

Parsing and generating SQLite's SQL dialect with Ruby

Stephen Margheim • April 16, 2025 • Matsuyama, Ehime, Japan • Talk

In the talk "Parsing and Generating SQLite's SQL Dialect with Ruby," Stephen Margheim introduced his project named Plume, which addresses the challenge of creating an SQLite-compatible SQL parser in Ruby. As SQLite grows in popularity, the need for effective tools in the Ruby ecosystem that can work with its dialect becomes essential. Unlike existing tools, Margheim's parser aims for 100% compatibility with SQLite’s grammar, which is critical for developers creating applications that harness SQLite effectively.

Key points discussed include:

  • Background: Margheim's passion for Ruby and open source led him to contribute significantly to both Rails and the SQLite Ruby driver. He previously spoke about how Rails and SQLite work well together to enhance the development experience.

  • Purpose of Plume: The project began as a response to a bug in Rails related to parsing SQL statements and evolved into a full-featured SQLite parser. Margheim emphasized that existing parsers in other languages often fail to parse valid SQLite SQL, spurred him to create a new solution that would ensure proper compatibility.

  • Parser Structure: The core of Plume is a handwritten parser that utilizes a recursive descent approach to manage tokenization. This efficient structure ensures that only essential objects are allocated during processing, minimizing memory usage and enhancing performance.

  • Abstract Syntax Tree (AST): Plume generates a Concrete Syntax Tree, which can be presented as an Abstract Syntax Tree (AST) for easier manipulation in Ruby. This structured representation of SQL statements allows detailed introspection and modification of SQL statements programmatically.

  • Community Engagement: Margheim encouraged collaboration and community involvement for further development of Plume. He invites contributions to enhance its capabilities, including hunting for any incompatibilities between SQLite’s SQL interpretation and Plume’s parsing.

  • Future Aspirations: The presentation also outlined future work, which includes broadening the support for various SQL statements and creating tools to programmatically generate SQL queries using Ruby. Margheim envisions Plume as a foundational tool that will enable new innovative developments in SQLite tooling for Ruby.

In conclusion, Margheim's presentation at RubyKaigi 2025 effectively outlined the importance of creating a reliable SQLite parser in Ruby. With Plume, he aims to empower developers by providing essential compatibility with SQLite's syntax, paving the way for more robust SQL tools within the Ruby ecosystem. The success of this project relies heavily on community participation and collaboration to build towards a more seamless integration with SQLite.

Parsing and generating SQLite's SQL dialect with Ruby
Stephen Margheim • Matsuyama, Ehime, Japan • Talk

Date: April 16, 2025
Published: May 27, 2025
Announced: unknown

SQLite's popularity is on the rise, and likewise the ecosystem of tools around it is growing. Unfortunately, SQLite does not expose its parser for 3rd parties to use independently. This greatly limits the ability of developers to build tools that must interact with SQLite's SQL dialect. And so, I have hand-written a 100% Ruby, 100% compatible parser for SQLite's SQL dialect. In addition, having a complete AST permits us to also generate SQL queries from terse, structured Ruby code. In this talk, I will demonstrate how we ensure that the parser is 100% compatible with SQLite's SQL dialect. We will also explore how the parser is implemented and what kind of AST it produces. Then, we will dive into how to use the parser to build tools that can analyze and manipulate SQL queries. Finally, we will look at how to use the generator to build tools that can generate SQL queries programmatically. As Ruby's only full SQLite SQL parser, this library opens up a world of possibilities for developers.

https://rubykaigi.org/2025/presentations/fractaledmind.html

RubyKaigi 2025

00:00:06.400 well if you don't mind I'm going to
00:00:08.240 stand here to the side of the podium
00:00:10.400 because I don't like feeling imprisoned
00:00:13.519 by the podium but nice to meet you all
00:00:16.000 my name is Stephen and I am really
00:00:20.320 honored to be here at Ruby Kai this is
00:00:22.560 my first time at a Kai this is my first
00:00:24.960 time in Japan and it has already been a
00:00:27.920 really lovely experience
00:00:30.640 you can find me basically everywhere on
00:00:33.040 the internet as
00:00:35.480 fractalmind and I write on my blog
00:00:40.040 fractalmind.github.io i'm here today
00:00:41.840 because I love Ruby and I love open
00:00:45.040 source i've had the privilege over the
00:00:47.120 last couple of years to be a regular
00:00:49.280 contributor to Rails as well as to the
00:00:53.360 SQLite Ruby driver
00:00:56.399 and if you do know my name it's likely
00:00:58.719 because of the work I have done in the
00:01:00.480 last couple of years to make the Ruby
00:01:02.760 ecosystem the single best ecosystem in
00:01:06.479 the world to build with the SQLite
00:01:08.799 database
00:01:09.720 engine i had the opportunity last
00:01:12.479 September to speak at Rails World about
00:01:15.680 how Rails 8 and SQLite together can
00:01:19.360 supercharge the experience of building
00:01:21.439 web applications as solo developers and
00:01:25.280 if That's a talk that you haven't seen
00:01:28.000 it is on YouTube in my completely
00:01:30.479 unbiased opinion it's great and you
00:01:33.360 should watch it and if it increases the
00:01:35.759 view count that's just a a nice side
00:01:38.360 effect today though I want to focus on
00:01:42.000 pure Ruby and and not web applications
00:01:44.560 specifically and I want to talk to you
00:01:46.000 about the next project in my crusade to
00:01:49.200 make Ruby the best ecosystem for working
00:01:51.360 with SQLite and that is a project I'm
00:01:53.520 calling Plume
00:01:56.159 plume is a tool set for working with
00:01:59.040 SQLite
00:02:00.520 elegantly and what I'm going to talk
00:02:02.479 about today is the heart of that project
00:02:04.560 which is a
00:02:06.840 parser you can find plume now publicly
00:02:09.520 available open source on GitHub here at
00:02:12.319 this URL
00:02:14.040 and what I want to do today is walk
00:02:17.239 through why I am writing a parser for
00:02:20.959 SQLite's dialect of SQL how you can use
00:02:24.000 Plume and how it works and then talk
00:02:27.120 about what's next and how you can get
00:02:30.040 involved so why would anyone write a SQL
00:02:34.480 parser let alone a SQL parser for just
00:02:36.959 SQLite's dialect of
00:02:39.000 SQL and it's a pretty fair question
00:02:41.760 because if you did do some research you
00:02:43.440 would see that there are already a fair
00:02:45.680 number of projects that are aiming to be
00:02:49.519 a SQLite specific parser across a
00:02:52.400 variety of languages and frameworks and
00:02:55.760 tool
00:02:56.840 sets and you would think that this
00:03:00.720 incredibly niche market is already
00:03:02.879 saturated
00:03:04.640 but the fact of the matter is that
00:03:07.440 nearly every single one of these
00:03:09.120 existing projects would fail to parse
00:03:11.599 this SQL
00:03:13.159 statement and the reason why is right
00:03:17.319 here in well- formatted SQL this would
00:03:20.720 be a
00:03:21.959 comma but SQLite is perfectly happy to
00:03:25.040 accept just blank spaces between table
00:03:27.760 constraints and if SQLite is able to
00:03:32.239 parse something but a third-party parser
00:03:35.080 cannot that is a problem and it's a
00:03:38.480 problem because it limits the potential
00:03:40.480 impact of that project it'll never be as
00:03:42.640 impactful as something like
00:03:44.680 PGQuery pgquery is used in over 1,000
00:03:48.720 applications and packages and the number
00:03:51.760 one reason why is that it is the parser
00:03:55.519 that Postcrest uses which means it has
00:03:57.519 100% compatibility and that
00:03:59.760 compatibility is the foundation that is
00:04:01.840 needed to build an ecosystem of tools on
00:04:04.720 top of a
00:04:07.080 parser when it comes to parsing
00:04:09.920 compatibility is
00:04:11.640 king if there is any SQL statement that
00:04:14.319 the SQLite CLI will happily parse but a
00:04:17.680 third party parser does not that's a
00:04:19.600 problem and if there is a statement that
00:04:21.600 SQLite would error on and the parser
00:04:23.919 doesn't error on it that's also a
00:04:25.440 problem we have to get to perfect
00:04:27.880 compatibility and so the question
00:04:30.280 is how do we get there and what's sort
00:04:32.720 of the state of things today there are a
00:04:35.759 couple of projects when I was doing
00:04:37.360 initial research here that were pretty
00:04:39.360 close this one is a rustbased parser
00:04:43.600 generator that uses the exact same
00:04:45.840 source grammar
00:04:49.240 as this one is a rustbased um parser
00:04:53.440 generator that uses the exact same
00:04:55.040 source grammar file as the SQLite parser
00:04:58.960 but it has uh a couple of really key
00:05:01.560 limitations the first is that not every
00:05:05.280 grammar declaration is supported and the
00:05:07.759 second even bigger one is that it
00:05:09.520 regularly panics so it is unable to be
00:05:14.320 this foundational tool for the ecosystem
00:05:17.280 to build on top of as we can see in in
00:05:19.280 the number of projects that are using it
00:05:22.080 this JavaScript
00:05:23.880 parser would actually successfully parse
00:05:26.400 the example we gave at the beginning and
00:05:28.400 it has good momentum there are over 300
00:05:30.320 projects that are using
00:05:32.199 this but it was archived nearly four
00:05:34.880 years
00:05:35.560 ago so I think that there's a real need
00:05:38.800 for this kind of project and that's why
00:05:42.479 I started Plume with this vision to
00:05:45.039 build a foundation for the next wave of
00:05:47.280 SQLite tooling but if I'm being honest
00:05:50.560 this project didn't start with a grand
00:05:52.400 vision for the next wave of tooling it
00:05:54.720 started with fixing a bug in Rails
00:05:58.320 i found a bug when I went to execute
00:06:01.600 this
00:06:03.000 migration and the problem was that I
00:06:06.000 wanted to define a virtual column with
00:06:09.440 an expression that had a comma inside of
00:06:12.600 it and everything in Rails blew up when
00:06:14.960 I went to run this migration so I I went
00:06:16.560 diving through the source and I found
00:06:19.039 this line rails's active record SQLite
00:06:22.639 adapter needs to understand your schema
00:06:25.600 and in order to understand SQLite schema
00:06:28.720 it has to do some parsing of the create
00:06:31.120 table statement and Rails uses some
00:06:32.960 reaxes and some basic stream
00:06:34.319 manipulation to try to make sense of
00:06:35.680 things so I found a pretty simple split
00:06:38.080 on commas probably that'll just be comma
00:06:40.240 definitions but if you have an
00:06:41.280 expression with a comma in it all of a
00:06:42.800 sudden everything breaks so to fix this
00:06:45.039 bug I wrote this incredibly fun reax i'm
00:06:48.800 not even going to spend a lot of time on
00:06:49.919 it don't read it it's horrible it fixed
00:06:52.400 that bug even as I wrote it I was aware
00:06:54.479 of the other possible edge cases that
00:06:57.360 would lead to very similar bugs this is
00:06:58.960 not resilient and when I put this pull
00:07:02.639 request in and got it merged I was happy
00:07:04.639 to have fixed that bug but I couldn't
00:07:06.080 help but daydream about what it could
00:07:07.919 have been if I could have fixed the bug
00:07:10.240 with code like this because if we did
00:07:13.440 have a parser and if we could inspect
00:07:15.919 the schema of our tables and get back
00:07:19.039 structured Ruby objects that were full
00:07:21.120 and correct it would make Rails' life so
00:07:24.240 much easier and these kinds of bugs
00:07:25.840 would
00:07:27.400 disappear so this is why I wanted to
00:07:31.680 start this project i really do want to
00:07:34.240 set a foundation for a new wave of tools
00:07:36.960 for SQLite it's obviously gained a lot
00:07:39.440 of momentum here in the last couple of
00:07:41.039 years and I think that there's so much
00:07:42.639 opportunity for tooling to come in the
00:07:45.520 future years and maybe just maybe we
00:07:48.800 will be able to use this tool to secure
00:07:51.440 Rails's introspection capabilities to
00:07:53.680 drive active record when working with
00:07:55.479 SQLite and if I'm being honest I wanted
00:07:59.039 to challenge myself uh I am a
00:08:01.280 self-taught programmer i never went to
00:08:03.039 computer science uh classes let alone
00:08:06.479 get a full degree i didn't really know
00:08:08.400 much about parsing and I wanted to and
00:08:10.560 this seemed like a great opportunity to
00:08:12.479 dive in and learn
00:08:14.280 more so let's transition now to like how
00:08:17.599 you can use Plume and how it
00:08:20.199 works and to do that we're going to look
00:08:22.240 at a simple create table statement like
00:08:24.280 this now if we parse this SQL and we get
00:08:27.759 the the statement
00:08:29.160 node it'll look like this and let's
00:08:32.560 break this down bit by bit and see how
00:08:35.279 the syntax tree is being formed and how
00:08:37.760 we can work with
00:08:39.000 it so the create table token sequence is
00:08:43.360 what is going to determine that we have
00:08:44.880 a create table statement
00:08:47.399 node we see that the table we are
00:08:50.399 creating is called users so our node
00:08:52.880 will have a name property and or a table
00:08:56.959 property that has a
00:08:59.600 uh let's look at this table name node
00:09:02.640 the table value is our string
00:09:05.160 users our first column is called ID so
00:09:09.760 inside of our columns array we're going
00:09:11.120 to have a column definition node it'll
00:09:12.800 have a name property that'll be the
00:09:14.240 string
00:09:15.399 ID that column is defined as an integer
00:09:18.480 so inside of that column definition we
00:09:20.240 will also have a type property that type
00:09:22.560 property will have a column type node
00:09:24.640 that will have the text integer and
00:09:26.720 it'll also tell us the affinity for that
00:09:29.279 particular type sqlite's loose typing
00:09:31.839 system um is fully supported here
00:09:36.000 we also see that we have a constraint on
00:09:37.839 this column so this is a primary key
00:09:39.680 constraint with the auto increment
00:09:41.760 keyword so inside of our constraints
00:09:44.000 array we see a primary key column
00:09:46.000 constraint node and the auto increment
00:09:48.320 option is indeed
00:09:49.959 true our second column is called email
00:09:52.720 so we'll have another column definition
00:09:54.320 node property name value email we see
00:09:57.760 the string is the type another type
00:10:00.560 another column type we see the affinity
00:10:03.680 this actually has two constraints it's a
00:10:05.200 notnull column so we'll see that node in
00:10:08.240 the column in the constraints array and
00:10:11.040 it is also a unique column and so we
00:10:14.399 have a second constraint node in our
00:10:16.560 constraints
00:10:18.200 array this a structure provides every
00:10:22.000 single bit of information that we had in
00:10:24.399 our SQL string structurally in a way
00:10:27.120 that is easy to use in any Ruby program
00:10:30.959 and the parser is able to handle fully
00:10:33.279 complicated create table statements this
00:10:35.040 is the most complicated create table
00:10:36.480 statement I could write and it would
00:10:39.519 create a syntax tree like
00:10:42.760 this pretty easy to
00:10:45.079 follow yeah I think we all get it
00:10:48.680 um in order to support parsing large
00:10:52.480 complex SQL statements as quickly and
00:10:54.240 efficiently as possible the architecture
00:10:56.399 is inspired by Aaron Patterson's talk
00:10:59.040 that he gave last year at Rails World
00:11:00.720 where he broke down three key principles
00:11:03.519 for parsing efficiently in Ruby so the
00:11:08.160 first is to write your parser in pure
00:11:10.480 Ruby you want to minimize the number of
00:11:11.920 calls going back and forth between C and
00:11:14.320 Ruby so if you keep everything in Ruby
00:11:16.160 and you allow YJet to optimize um you're
00:11:18.959 going to get speed improvements for free
00:11:22.399 from the just in time compiler you then
00:11:25.839 want to minimize allocations that means
00:11:27.680 leaning on integers and symbols which
00:11:30.800 are objects that aren't going to
00:11:32.000 allocate to sort of drive the primitives
00:11:33.839 of your parser and then when you do need
00:11:36.399 to allocate and obviously you are going
00:11:37.920 to need to allocate some strings to make
00:11:39.760 sense of what's there you want to do
00:11:42.079 that lazily so only when needed do you
00:11:45.519 provide the values that users ask for
00:11:47.440 and that's going to help keep everything
00:11:48.640 as efficient as possible so these
00:11:51.040 principles drove the architecture for
00:11:54.399 how plume is implemented it has a lexer
00:11:57.600 that lexer will work token by token
00:12:00.880 keeping track of the cursor offsets that
00:12:04.560 are driving the tokenization process
00:12:08.800 such that you can ask for a value if and
00:12:12.800 when you need it and only at that moment
00:12:14.959 will we go and allocate that string and
00:12:16.639 pull it out for you that lexer is used
00:12:18.720 internally by the parser the parser is
00:12:21.519 also going to work token by token we do
00:12:24.560 use a buffer so that we can peak ahead
00:12:26.959 so we keep track of the the token so the
00:12:29.839 lexer can sometimes be one two or three
00:12:32.639 tokens ahead of what the parser is
00:12:34.480 looking at but that's just managed
00:12:35.760 easily with our buffer array in order to
00:12:39.040 make the actual parsing logic relatively
00:12:41.959 maintainable there are a number of
00:12:43.839 helper methods that allow me to encode
00:12:46.959 the grammar and all of the different
00:12:48.800 flexibility that is logically required
00:12:52.160 and to sort of walk through how those
00:12:55.760 helper methods are being used and how
00:12:58.320 the parser itself is running I want to
00:13:00.000 walk through the create table statement
00:13:01.519 parsing um so that we all feel
00:13:04.000 comfortable with how this gem is working
00:13:06.240 because I am going to ask all of you to
00:13:08.560 come and help me and contribute so you
00:13:10.560 won't have any excuses by the end of
00:13:12.000 this talk
00:13:14.000 if we look at the SQLite documentation
00:13:16.079 we will see a railroad diagram like this
00:13:18.160 that really clearly explains the logic
00:13:21.040 of parsing the create table statement it
00:13:23.600 is worth noting you should be aware that
00:13:26.399 these diagrams are telling you how to
00:13:28.079 write SQL they are not telling you how
00:13:30.079 to parse SQL as we can see they say you
00:13:32.560 should put commas after table
00:13:34.320 constraints and SQLite is happy to allow
00:13:37.279 you to not put commas there this right
00:13:39.040 here I believe is the primary reason
00:13:41.200 that most parsers have a number of
00:13:43.680 incompatibilities because their
00:13:45.040 grammarss are defined on these diagrams
00:13:46.959 and these diagrams are filled with lies
00:13:50.880 so buy or
00:13:52.360 beware if we jump into the logic for how
00:13:56.079 we can convert this diagram into uh our
00:13:58.959 actual execution we will see starting
00:14:02.240 here at the beginning of the create
00:14:03.440 table statement this where the first
00:14:05.360 thing we're going to do is ensure that
00:14:07.440 we have a create to token right so if we
00:14:10.160 expect that we are parsing our create
00:14:11.440 table statement we must have the create
00:14:12.959 tok to token first so we use the require
00:14:15.440 helper that ensures that that token is
00:14:17.839 there and consumed or we error the
00:14:20.959 grammar next says that we might or might
00:14:23.040 not have either the temp token or the
00:14:25.279 temporary token that is why we use the
00:14:27.040 maybe one of helper to say maybe we'll
00:14:29.519 have a token maybe not and if we do
00:14:32.160 it'll be one of temp or temporary so if
00:14:34.480 either of those tokens are the current
00:14:35.680 token we consume them if not we just
00:14:37.519 return nil and everything's okay the
00:14:40.320 next point is requiring the table token
00:14:44.240 as a small note yes you can just define
00:14:46.399 a require method in your class or module
00:14:48.480 and use it um am I maybe writing not so
00:14:52.000 beautiful Ruby because I'm overwriting
00:14:54.320 uh the require method possibly but I
00:14:56.480 like the way it sounds
00:14:58.760 so we do it after you have a table you
00:15:02.160 might or might not have a sequence of
00:15:04.079 tokens if not exists so the maybe all of
00:15:08.399 helper uh allows us to check do we have
00:15:11.680 this exact sequence of tokens fully from
00:15:14.160 start to finish and if we do consume all
00:15:16.160 of them and grab the full span but if we
00:15:19.360 don't just return nil and continue
00:15:22.199 on next we get the table name and that
00:15:25.760 calls out to a separate rule definition
00:15:28.160 method that has all of the logic for
00:15:30.240 getting the table name that might be
00:15:31.920 qualified and structured in a couple of
00:15:34.639 different ways um structurally you might
00:15:37.519 expect that we would use a require
00:15:40.079 helper method with a block to include
00:15:43.040 the method uh and that's actually what
00:15:44.959 we do with with may if you want to check
00:15:47.199 if you can parse another rule you can
00:15:49.600 call maybe with a block with the rule
00:15:51.720 method but in the case of requiring this
00:15:54.160 just adds overhead for very little value
00:15:56.399 so for rules that are required we just
00:15:58.720 call them directly and I like to add a
00:16:01.360 lot of spaces so things line up in a
00:16:03.279 nice columner way if we shift our focus
00:16:06.160 into the actual parsing of the
00:16:08.320 definition of a table we'll see the rest
00:16:10.399 of our helpers here so we can use the
00:16:12.959 maybe helper to check whether or not the
00:16:14.720 current token is the opening parenthesis
00:16:16.720 if it is we know that we are parsing a
00:16:18.399 create table statement with column
00:16:19.959 definitions and you must have at least
00:16:22.320 one column defined so there we can use
00:16:24.399 the require sum helper which says
00:16:26.720 there's one or more uh rules in this
00:16:31.360 case we are looking at the column defaf
00:16:33.519 rule that is another helper method uh a
00:16:36.320 rule definition method we're not going
00:16:37.920 to look at its internals but uh it
00:16:40.800 includes all of the logic for parsing
00:16:43.040 out a full column definition and we tell
00:16:45.759 the require some helper that we are
00:16:48.240 going to have those column definitions
00:16:50.800 separated by commas that come after each
00:16:54.040 definition after we've collected all the
00:16:56.240 columns we look to the constraints here
00:16:58.880 there might be zero or more constraints
00:17:01.120 so we use maybe
00:17:02.759 some in this case the table constraints
00:17:07.439 only are used inside of create column so
00:17:09.760 we don't need to extract it into a
00:17:11.120 separate rule method we can just define
00:17:12.559 all the logic directly in the
00:17:15.000 block and we specify that the trailing
00:17:18.079 separator could be either a comma or nil
00:17:20.720 and this is of course the exact logic
00:17:23.039 that we need to be fully compatible with
00:17:25.679 SQL's handling of table constraints
00:17:28.559 being either uh separated by a comma or
00:17:31.679 not
00:17:33.840 so once we have collected all the
00:17:35.760 constraints we must have our closing
00:17:38.240 parenthesis so we use require one more
00:17:40.160 time and at the very end we might or
00:17:42.880 might not have a couple of table options
00:17:44.720 so we use maybe some once again and that
00:17:47.280 gives us everything we need to create
00:17:49.039 our create table statement node which as
00:17:51.679 we have seen looks like this now the
00:17:54.880 only other final point to make is a
00:17:58.000 small confession uh this slide is a bit
00:18:01.120 of a lie
00:18:02.799 this isn't actually exactly what the
00:18:05.120 parser will generate this is an abstract
00:18:07.919 tree the parser actually produces a
00:18:11.520 concrete tree uh that keeps track of the
00:18:14.559 locations and
00:18:15.960 spans uh but I personally find this a
00:18:20.360 bit aggressive on the eyes and so when
00:18:24.559 you inspect the column or any particular
00:18:28.480 node you get back a lazy representation
00:18:31.360 of the abstract version of that node uh
00:18:34.240 but you have full access to all of the
00:18:36.080 span and location information we just
00:18:38.720 pretty print it um for you so as you can
00:18:42.919 see even this simple statement has a lot
00:18:45.840 of information on it you have access to
00:18:48.320 all of that information and if you
00:18:49.760 wanted to or needed to use some of this
00:18:52.000 fine grain information that's all there
00:18:54.000 but for the most part we're going to
00:18:56.000 keep it simple just to break down
00:18:57.600 exactly how it's working we're going to
00:18:59.440 have these token spans that are composed
00:19:01.840 of four parts we have the tokens one or
00:19:05.679 more tokens the start offset the end
00:19:08.400 offset these are offsets into the
00:19:10.240 original source uh string and then the
00:19:13.200 type of token that this is this helps in
00:19:15.120 some of the the helper methods that are
00:19:16.880 there to extract the value
00:19:19.200 in order to use these offsets every
00:19:21.360 single node includes a reference to the
00:19:24.160 full SQL source string that was passed
00:19:26.080 into the parser and that is what allows
00:19:28.160 us to lazily produce the values that are
00:19:30.640 needed so what happens in an IRB console
00:19:33.360 um for example if I just look at the
00:19:35.919 node it's going to access it and at that
00:19:39.440 moment it goes and lazily grabs all of
00:19:41.440 these values to say okay what is the
00:19:43.600 table name what is the column name uh
00:19:46.240 the column type text but in reality you
00:19:48.880 could write your tools on top of this
00:19:51.360 and if you didn't need any of that stuff
00:19:52.960 and you didn't ask for it none of those
00:19:54.559 objects would be
00:19:56.280 allocated so the way that Plume works
00:20:00.080 this is a handwritten recursive descent
00:20:02.679 parser it is minimally allocating uh
00:20:06.919 objects mostly integers and symbols in
00:20:11.160 arrays it works token by token um so
00:20:15.360 there's never huge arrays in memory the
00:20:18.480 grammar is encoded with these hopefully
00:20:20.799 very literate helper methods and that
00:20:23.600 allows us to produce a concrete syntax
00:20:25.760 tree that we can then present to users
00:20:28.000 as an abstract syntax tree so that you
00:20:30.159 get the best of both worlds the
00:20:31.440 usability of the abstractness and the
00:20:33.600 fullness of the
00:20:36.200 concreteness
00:20:37.720 okay what is
00:20:39.880 next well I'll be honest i didn't 100%
00:20:45.200 finish it before today so there are more
00:20:49.039 statements to parse and I would love for
00:20:53.360 you to help
00:20:54.919 me i'm tracking progress here in the
00:20:57.600 readme now I confess that this looks
00:20:59.840 like I'm barely
00:21:02.120 started the truth is that create table
00:21:06.320 statement is one of the most complicated
00:21:08.720 statements that has a number of
00:21:10.440 dependencies so getting full expression
00:21:13.039 parsing and select uh statement parsing
00:21:16.400 uh those two are like the hardest two
00:21:20.000 and then create table statement is the
00:21:21.760 most useful one and it allowed me to get
00:21:24.159 all of the foundations right and if you
00:21:26.960 actually go and look at most of these
00:21:29.000 statements they're actually really easy
00:21:31.200 to parse so there are a lot of
00:21:33.360 opportunities for you to contribute
00:21:34.960 without having to go into very insane
00:21:38.720 difficult parts of the grammar a lot of
00:21:40.559 these statements are really
00:21:41.679 straightforward and would be very uh
00:21:43.760 easy to con to write the uh logic for
00:21:47.440 using the helper methods so I really
00:21:50.120 genuinely think that all of you
00:21:52.960 incredibly smart people could get
00:21:55.120 involved and contribute to this
00:21:57.240 project uh but even if I'm forced to do
00:22:01.280 it alone it won't take a ton of time uh
00:22:04.320 now that all of the foundations are set
00:22:07.120 but I'm not trying to hide the fact that
00:22:09.760 yeah there are more statements to handle
00:22:11.600 luckily create table statements I think
00:22:13.600 are the most important feature both for
00:22:16.480 Rails and for some of the tools that I
00:22:18.080 want to build on top of this um but if
00:22:20.240 you have needs for parsing other
00:22:22.120 statements let's do it
00:22:24.280 together next I would also love any
00:22:28.159 community help in hunting for
00:22:30.919 incompatibilities right so as I said if
00:22:33.840 there's any difference between how
00:22:35.440 SQLite parses a statement and how Plume
00:22:37.360 parses a statement that is a bug and I
00:22:39.120 would love that bug to re be reported so
00:22:41.360 that we can fix it put it in as a
00:22:43.600 regression test and ensure that we get
00:22:45.919 to full um compatibility now there
00:22:49.200 shouldn't be many bugs with create table
00:22:51.440 statement parsing because there are lots
00:22:52.880 and lots and lots and lots and lots and
00:22:54.880 lots of tests um I wrote a script that
00:22:59.120 scrapes every single SQL statement in
00:23:01.200 the public test suite of SQLite uh there
00:23:04.720 are I don't know 20,000 or so uh and we
00:23:08.159 parse all of those and then I'm also
00:23:10.880 handwriting tests for every single
00:23:12.640 logical permutation of the grammar to
00:23:15.760 get to 100% branch coverage um and so
00:23:20.200 hopefully when you go hunting for
00:23:22.240 incompatibilities you don't find many
00:23:24.240 but I hope you find some and we can fix
00:23:27.200 them because that makes the project
00:23:29.360 stronger and stronger and that is what
00:23:30.880 we need to treat this as a foundational
00:23:33.200 project for the ecosystem as it grows in
00:23:35.200 the coming
00:23:37.320 years as those of you who were here for
00:23:40.480 Marco's talk have already seen he's much
00:23:43.440 smarter and much faster than me so he
00:23:45.200 already built a web as explorer for his
00:23:47.919 parser uh that's something in the future
00:23:49.919 for me but if you saw his demo you would
00:23:53.039 see how incredibly useful that can be
00:23:55.120 you can put the SQL in you can see
00:23:57.440 exactly how the parser is going to make
00:23:58.880 sense of it and really get a feeling for
00:24:02.720 how to make use of the parser to start
00:24:04.960 building your tools on top of it how you
00:24:06.640 can get into the bits that you're going
00:24:08.720 to need um so I look forward to stealing
00:24:13.279 much of his code and porting it over to
00:24:16.320 this parser
00:24:19.520 the other thing that I have been working
00:24:22.000 on uh I didn't have a ton of code
00:24:25.760 samples i didn't want to get into the
00:24:27.200 implementation but I do want to talk
00:24:28.640 about um the results is using this
00:24:33.679 syntax tree as the foundation for SQL
00:24:36.679 generation um yes we have ARL uh no I
00:24:42.240 don't love it and so I have been
00:24:44.559 exploring some possibilities for
00:24:46.960 different ways to bring the full
00:24:49.520 expressive power of SQL to rubists
00:24:53.600 without some of the object-oriented
00:24:56.679 cumbersomeness of uh Arell so just to
00:25:00.799 give you some teasers of some of the
00:25:04.000 things that um I've been building this
00:25:07.039 is a fluid Ruby interface to produce SQL
00:25:11.600 um very much influenced by the SQL gem
00:25:15.840 um and the idea here is that we can
00:25:19.360 build the same syntax tree through Ruby
00:25:22.960 and then have that syntax tree produce
00:25:25.279 SQL for us that will be incredibly nice
00:25:28.159 but more exciting to me is work I've
00:25:30.720 been doing to define what I haven't yet
00:25:34.000 named perfectly but for lack of a sexy
00:25:37.440 term uh hashbased
00:25:40.360 SQL which is using structured Ruby
00:25:43.279 primitives to encode the full expressive
00:25:46.400 power of
00:25:47.480 SQL in Ruby and I think that this is
00:25:51.520 particularly exciting because this
00:25:53.760 bridges the gap for the biggest missing
00:25:56.960 feature in SQL in my opinion which is
00:26:01.000 composability it is quite difficult
00:26:04.080 unless you're doing select statements
00:26:05.520 and you use common table expressions to
00:26:08.000 compose parts of SQL together sql is
00:26:10.480 just strings and so you have to use the
00:26:13.520 primitive tools available to you in any
00:26:15.440 programming language that you're working
00:26:16.640 with to compose strings and it's it's
00:26:18.320 brittle and it's not particularly
00:26:21.840 um resilient to various kinds of needs
00:26:24.400 that we have as programmers but if we
00:26:27.200 were able to encode SQL in a way that is
00:26:31.440 literate in a way that isn't burdensome
00:26:34.080 or ownorous on the
00:26:36.039 eyes but used Ruby structures then you
00:26:39.360 could imagine taking that hash that is
00:26:41.760 the key of the exist the value of the
00:26:43.520 exist key there we could generate that
00:26:46.080 in a method we could store that in a
00:26:48.640 constant we could build that up piece by
00:26:51.679 piece across a module uh however it's
00:26:54.559 there it can live in in a completely
00:26:56.400 different part of our codebase used in
00:26:58.640 completely other SQL queries maybe we
00:27:00.880 just call it directly or we can inject
00:27:04.640 it as the value of this exist query
00:27:07.640 composing this smaller query into the
00:27:10.320 larger query as needed when needed
00:27:12.159 however we produce it um and the
00:27:16.240 generation step to take that into uh
00:27:19.120 full SQL using the a is um something I
00:27:23.120 have been building as well into plume
00:27:25.279 and one of the things I'm most excited
00:27:27.600 to push to all of the edges of
00:27:32.760 SQL and the last thing for what is
00:27:36.080 coming next is your contributions i
00:27:39.440 would love to see you in GitHub i would
00:27:41.440 love to see you open pull requests to
00:27:43.120 open issues to submit incompat
00:27:45.520 compatibilities that you have found this
00:27:47.840 is a project that has the opportunity to
00:27:52.559 be vital to the emerging next generation
00:27:55.440 ecosystem of SQLite tooling and it's
00:27:58.159 only going to be as strong as we all
00:28:01.039 make it and if it's just on my shoulders
00:28:03.279 it'll be mediocre but if we all come
00:28:05.919 together it could be great and so I hope
00:28:09.360 that I have given you enough initial
00:28:12.320 interest and confidence that you join me
00:28:14.880 in this longer term project of helping
00:28:18.159 to make Ruby the world's best ecosystem
00:28:22.480 for working with SQLite so find the
00:28:26.480 Plume repository give it a star numbers
00:28:29.679 matter and I'm vain
00:28:33.120 go to
00:28:34.039 GitHub and together let's make this big
00:28:39.200 and let's make it real and let's make it
00:28:42.000 elegant which is to say let's make it
00:28:44.360 Ruby thank you
Explore all talks recorded at RubyKaigi 2025
+66