
Sean Falconer
DataStax is known for its expertise in scalable data solutions, particularly for Apache Cassandra, a leading NoSQL database. Recently, the company has focused on enhancing platform support for AI-driven applications, including vector search capabilities. Jonathan Ellis is the co-founder of DataStax. He maintains a technical role at the company and has recently worked on developing their vector search product. Jonathan joins the show to talk about his passion for being in a technical role, where AI fits into the DataStax platform, and developing vector search. He also reflects on his gradual adoption of AI into his workflows and where he thinks AI development is headed in the coming years. This episode is hosted by Sean Falconer. Check the show notes for more information on Sean's work and where to find him. Jonathan, welcome to the show.
Jonathan Ellis
Thanks Sean, glad to join you.
Sean Falconer
Yeah, absolutely. Thanks for being here. So you've been working on DataStax for nearly 15 years. What's kept you engaged all that time? You know, what's kept you excited?
Jonathan Ellis
Writing code. So my career arc went in the executive direction, then I realized that did not spark joy for me, and then I came back to writing code pretty much full time, most recently on our vector search products. So that's what makes me happy to get up in the morning: looking forward to taking code and, at the end of the day, it does something that it couldn't do before.
Sean Falconer
Yeah, I think it's good that you're able to recognize that because I think it's like a typical path that a lot of people, especially if you're an entrepreneur or you start a company but like you're in technical positions and then if you're successful a lot of times you end up getting promoted away from doing a lot of the technical work and end up doing a lot of, you know, more people management. And there might be technical aspects, but it's definitely a different skill set and you have to find joy in different ways than actually like you know, day to day building something.
Jonathan Ellis
Yeah, I tried and I think I might have said this on a Software Engineering Daily episode, but I really tried to convince myself that it was just as fun to build companies or build teams as it is to build code, but for me it isn't. So at the end of the day I had to stop trying to pretend and embrace my inner code monkey, I guess.
Sean Falconer
Yeah, I still feel like I have those inner battles every six months or so with myself as well.
Jonathan Ellis
Oh shoot.
Sean Falconer
You know, you mentioned your latest work with adding vector support to DataStax. What is DataStax doing now in the context of AI? Like, where is it sort of thinking about itself fitting within this modern world of the LLM stack?
Jonathan Ellis
We use the word stack and that's in our name, and that's what we're trying to provide: a one-stop stack for building generative AI applications. So we built vector search for Cassandra. We acquired a company called Langflow. We are partnering with Nvidia to provide embeddings computation and other models through Nvidia GPUs. And so we kind of approached this a bit at a time, where we started with the database and we started with the vector search and we saw people saying, well, okay, but how do I host the rest of my application? You've got this hosted database, but how do I deploy the rest of this and how do I get the embeddings into the database? Instead of having to pull together OpenAI's embeddings model and manually stitch that with the Astra database, you can just tell us, here's my OpenAI key, and we'll go and integrate with them. Or you can say, hey, I want to use these Nvidia embeddings, and we'll compute them on the GPUs for you. So we're really trying to remove the complexity as much as possible and let you focus on building your application.
Sean Falconer
Yes, it's more of like a platform approach than being essentially like a point solution for vector storage.
Jonathan Ellis
It's like exactly.
Sean Falconer
Serve all needs essentially. There are a lot of things that you have to stitch together to build like one of these applications and actually put it in front of users.
Jonathan Ellis
That's what we're seeing. And not only is there a lot of things to stitch together, but there's kind of a lot of unspoken knowledge or it's not necessarily clear what the best practices are. When you have a LangChain that offers five zillion ways to chunk your documents to embed them, like which way is, you know, what should my default be? What should I start with? What am I most likely to succeed with? And so we're trying to bring those into the mix as well through the Langflow platform that we're offering.
Sean Falconer
Yeah, that's interesting because I think you're right. Like if you're building something like a RAG pipeline, there's so many decision points, and at each decision point you're essentially potentially giving up something in terms of, like, accuracy. You're making some sort of compromise. And that series of compromises could lead to kind of like a disastrous result. And it is very hard to trace back, like, what part of that decision-making process led to the bad result. And going back and trying to figure that out ends up being a lot of, like, tinkering and massaging the various inputs and stuff like that. And for somebody who really just wants to, like, hey, I just want to build this application and integrate AI into it, having a great default experience would be hugely beneficial to those people, reducing the friction and barrier to entry.
Jonathan Ellis
Right? And that's like the dark side of generative AI. The good side is it's a magic wand and it just works. Except when it doesn't, how do you debug that? Right? And often it's because of garbage in, garbage out: you made a mistake in your pipeline and it's not getting the context that it needed to give you useful answers. So, yeah, that's why we acquired Langflow. It's this visual environment that gives you best practices in reusable components that you can easily connect together to build your application.
Sean Falconer
This direction with the product, did it feel natural or did it require some kind of, like, rethinking to come up with? Like realizing that this approach made sense for DataStax?
Jonathan Ellis
I think the rethinking was primarily around, you know, we were a database company for 13 years, and so realizing that we needed to embrace a wider role in providing infrastructure for GenAI applications. That was the main rethinking, I think.
Sean Falconer
Okay, so I read this really interesting article that you wrote about how AI helped us add Vector Search to Cassandra in six weeks. And one of the things that you say in the article is that you're never going to go back to writing everything by hand. So first of all, prior to that project, what was your level of skepticism or enthusiasm for using AI to actually help you code?
Jonathan Ellis
So by nature, I'm a little bit of a late adopter. I've told so many people, like, I spend enough time debugging my own code. I don't want to debug other people's code too. Let other people shake the problems out and I'll be happy to use it once it's stable and production ready. And using AI for coding has been a real exception to that for me because it's just so useful that it's worth putting up with all the sharp corners and rough edges. So when OpenAI launched ChatGPT in October of '22, I think it was, and people started throwing, hey, can you write some code to do this, this was GPT-3.5 at the time, and it could solve small problems pretty well. And so that was really a big light bulb for me that, oh, wow, like, this is going to change my job. And then when GPT-4 came out a couple months later, you know, it went from being, okay, this is going to be useful someday, to this is useful now. And so I think I wrote that article that you're talking about in June or July of '23. So at that point I'd been using AI to help me write code for around six months. So, you know, GPT-4 in January of that year is when it started getting real for me.
Sean Falconer
Yeah, I feel like I was pretty skeptical about these things at first and now, like, I mean, it's pretty undeniable, like, how valuable they are. Like, you know, I remember over the summer I wrote my first program ever in Go, and I certainly could have struggled through getting that program to work by reading through all the documentation and references and using Stack Overflow and the sort of traditional places you would go. But it would have taken me way longer to get that program working than it did leveraging, like, just ChatGPT and, you know, throwing prompts at it for what I needed, or even taking some of the code that I'd written and asking for, you know, help improving it and stuff like that. Or you can even write in another language and say, you know, translate this over into this other language, and it'll do that with, like, a reasonable output.
Jonathan Ellis
So I think I see it being useful in two areas primarily. One is to get up to speed in a domain that I'm not familiar with. So I've been writing Java code for, oh man, longer than I want to think about, actually. It's like 25 years now. But I recently wanted to experiment with hybrid search in Python. And so having Claude write most of that code for me just really helped me get up to speed in terms of, like, you know, it's importing all of these packages that I would have had to read up on manually, and, you know, it definitely sped me up by at least a factor of five and maybe 10 in that use case of getting up to speed on code that's not very complex, but it's in a language or using libraries that I'm not familiar with. And the other is, like, even if I am familiar with something, if it's a bunch of boilerplate code or scaffolding, it's really good at taking that out of my workday and making code more fun. And so that's part of what I talked about in the article: not only am I more productive, but I'm having more fun because I've got this AI intern to do kind of the boring parts and I can concentrate on the interesting parts. So the interesting parts are still there. Like just today I was writing some code to fine-tune an open source embeddings model and Claude got me 90% of the way. But then it misunderstood the dimension of the tensors it was using and it just couldn't figure that out. I tried two or three times, saying here's the error it's getting, and it couldn't resolve it. And so it's like, okay, at this point it's time to just dig into the code and solve it the old-fashioned way. So it's a good mix. I'm really, really happy with the challenge and the intellectual puzzles that programming in 2024 with AI looks like.
Sean Falconer
One of the other values I've heard from people who are kind of new to programming too is that it gives you sort of like a non judgmental, like assistant to ask questions to where you don't feel like someone's going to be mean to you because you're asking a question that you perceive is like not an intelligent question to ask or something like that. So it kind of makes for a psychologically safe zone to ask questions that maybe otherwise you would hold back on if it was a person that you are trying to ask.
Jonathan Ellis
Yeah, and it's not just questions like, how do I write this code? It's also questions about a code base. Right. So Cassandra, off the top of my head, I'm going to guess is roughly between 100,000 and 200,000 lines of code. So not huge, but it's big enough that it's tough to wrap your mind around all at once, as a new developer or even as an experienced developer. I haven't touched the Cassandra compaction code in 12 years, give or take. And so it's really, really useful to have an AI assistant where you can say, hey, how does Cassandra make sure that it doesn't compact away a data file that's actively being used by the read threads? And so you can't just paste your whole data set into GPT-4 or Claude, but what you can do is use one of the tools that uses vector search to provide appropriate areas of the code to the LLM to answer your question from. So Cursor is very good at this. Augment Code is also very good at this. And there's the open source, free, text-mode AI code authoring tool called Aider. I believe the author's French. That's A-I-D-E-R. It's not quite as good as those other two, but it is better than nothing. And all of those will get you an answer in seconds. And you know, if 60% of the time that's good enough, great, I've saved my time that much more often. But then, you know, if it's not, then you can still do it the old-fashioned way and ask a coworker. But you know, having that assistant to answer questions, like the non-judgmental thing, that's great. Absolutely. But also just the speed, the latency of getting your questions answered now. And this is part of the category of things of, like, yeah, it's not perfect, it gets it wrong, you know, 40% of the time. But it's much easier for me to verify the answer than to try to generate that answer in the first place manually. And so that's still a big win, even if it's only 60% accurate.
Sean Falconer
Yeah, I mean, I think those are kind of like the perfect use cases for where Gen AI is right now. If you're looking at something like sort of the needle in the haystack search or summarization, and there's tons of industries where this is widely applicable. If you look at like legal, you know, there's whole teams of paralegals whose job is to like go and look stuff up in case files and stuff like that. Like, and to be able to pull that back in seconds to your point, even if it's only 60% accurate and you can verify it. Well, great. Now you've reserved your resources for those 40% of cases where you can actually answer that or your own personal time to dig in and so forth. In terms of the project to extend Cassandra to support vector search, what were some of the big challenges that you ran into with actually integrating vector search into Cassandra?
Jonathan Ellis
The first one was just adding a vector type to start with, which isn't something that Cassandra had before. I think there's two interesting and challenging components. One is just the vector index in the first place. Like, how do you map vectors to their nearest neighbors and do that in logarithmic time at search time? So that's a problem that has a fair amount of prior art around it. And in fact that's what we reached for. We reached first for HNSW and then DiskANN, which is a more advanced index type that lets you scale outside of memory. But then the other piece is, how do you wire that into the rest of the database? How do you build a query execution engine that can do a vector search but also, say, restrict it to documents that contain the word red, or restrict it to documents that Jonathan authored last week? And that's something that Cassandra hasn't traditionally been good at: doing multiple predicates like that and saying, okay, now take the result and order it globally by this other index. And so, you know, we built a cost-based query optimizer. We built models of how expensive a vector search was going to be versus a keyword search versus a numeric predicate. So there was a lot of work on both of those sides, both the vector index itself and then integrating it with the rest of the database.
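To make the "vector search plus a predicate" idea concrete, here is a minimal brute-force sketch in Python. It is not how Cassandra executes such a query: a real index avoids the linear scan, and the optimizer Jonathan describes decides in what order to apply the predicate and the vector search. The row layout, the function names, and the toy two-dimensional embeddings are all assumptions made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def filtered_vector_search(rows, query_vector, predicate, k):
    """Keep only rows that satisfy the predicate, then rank them by similarity."""
    candidates = [row for row in rows if predicate(row)]
    candidates.sort(
        key=lambda row: cosine(row["embedding"], query_vector), reverse=True
    )
    return candidates[:k]


# Hypothetical rows with toy 2-dimensional embeddings.
rows = [
    {"id": 1, "text": "a red bicycle", "embedding": [0.9, 0.1]},
    {"id": 2, "text": "a blue bicycle", "embedding": [1.0, 0.0]},
    {"id": 3, "text": "red paint", "embedding": [0.1, 0.9]},
]

# Row 2 is the closest vector overall, but it does not contain the word "red",
# so the predicate removes it before ranking.
hits = filtered_vector_search(
    rows, [1.0, 0.0], lambda row: "red" in row["text"].split(), k=1
)
print([hit["id"] for hit in hits])
```

Filtering first is cheap when the predicate is very selective and wasteful when it matches almost everything, which is the kind of trade-off a cost model has to weigh.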
Sean Falconer
And then what's this do in terms of like sizing, you know, presumably like a vector, like an embedding could be a couple thousand parameters in length, all floating point numbers. Like it could be larger than the entire record associated with it essentially.
Jonathan Ellis
Yeah. And especially with indexes, you're not normally indexing values that are 4 kilobytes or more in size. And so I mentioned that we started with HNSW, and HNSW says, hey, all my vectors fit in memory. That makes things simple. But that doesn't work so well when your vectors are that large. You run out of memory relatively quickly. And so the DiskANN design that we moved to says, hey, let's leave those raw full-size vectors on disk, and then we'll keep a compressed version in memory, and we can push the compression up to 64x. So a lot of people are really excited about binary quantization, which gets you 32x. You're taking a float32 and turning it into a 1 or a 0. But you can actually get to 64x with product quantization on a lot of these real-world data sets. So the OpenAI embeddings, the Cohere embeddings, and, if I remember correctly, Google's Gecko embeddings, you can compress all of those at 64x and still get accurate enough results that reranking them from disk gets you good results. And so now you've turned a problem of, hey, I need expensive memory for these large vectors, into, hey, I need cheap disk for these large vectors. And so that's a much more tractable problem. And we're pretty happy with where we ended up on that.
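The 32x and 64x figures are easy to check with a little arithmetic. The Python sketch below assumes a 1024-dimensional float32 vector, which comes out to the 4 kilobytes mentioned above, and a product quantization layout of one 1-byte code per 16 dimensions. Both numbers are illustrative assumptions, not DataStax's actual settings, and the size of the PQ codebooks themselves is ignored.

```python
def raw_size_bytes(dimensions):
    """A float32 takes 4 bytes, so a raw vector takes 4 bytes per dimension."""
    return dimensions * 4


def binary_quantize(vector):
    """Keep one bit per dimension: 1 if the value is positive, else 0."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits.to_bytes((len(vector) + 7) // 8, "big")


def pq_size_bytes(dimensions, dims_per_subspace):
    """Product quantization with one 1-byte code per subspace (assumed layout)."""
    return dimensions // dims_per_subspace


DIMENSIONS = 1024  # illustrative embedding size

raw = raw_size_bytes(DIMENSIONS)                # 4096 bytes
bq = len(binary_quantize([0.5] * DIMENSIONS))   # 128 bytes
pq = pq_size_bytes(DIMENSIONS, 16)              # 64 bytes

bq_ratio = raw // bq
pq_ratio = raw // pq
print(f"binary quantization: {bq_ratio}x, product quantization: {pq_ratio}x")
```

The compressed codes are only used to find candidates in memory. The full-size vectors stay on disk and are read back to rerank those candidates, which is why the lossy compression is tolerable.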
Sean Falconer
Okay. In the article that you wrote that talks about this project and the different AI tools that you use to assist you, you used a lot of different things that you mentioned in the article. And I'm curious about in terms of your experience with things like GitHub Copilot, like, what was that tool good at? And like, what are some of the limitations?
Jonathan Ellis
Yeah, that article is over a year old at this point, and so there's newer tools that we can talk about. But GitHub Copilot is still part of my arsenal. And really what they've targeted it at, and I think it's really good at doing it, is guessing what you're going to type on the current line, or maybe the current line and a couple more. But you're not giving it instructions like you do with ChatGPT. Rather, it's looking at your code, it's looking at what you're typing, and inferring from that what you might want to type next. And so it's autocomplete on steroids, is what it is. And so this is another place where, you know, hey, maybe it's wrong 40% of the time, but verifying that is just a few milliseconds, a few tens of milliseconds. It's very, very quick to read what it proposes and decide whether to hit tab to complete it or to say, no, you know what, that's not what I wanted, and just keep on going on your own. But I have noticed, and I've seen other people comment on this as well, that you kind of get used to working with an AI partner like Copilot. And so I'll start writing a line, and then I'll just think to myself, okay, Copilot should be able to take it from here. And so I'll pause for half a second to see if it jumps in with a suggestion. And sometimes it doesn't, and I'm disappointed that I have to keep going, but you kind of get an intuition for what it's good at completing and what it isn't, and so for when it's an appropriate time to pause. And so I realized on my last plane trip, which was two weeks ago, that I don't like coding offline anymore because, you know, I don't have Copilot. I don't have Aider and Claude. I don't have Augment Code to ask questions about my code base. So, yeah, fortunately, that's happening at the same time that airlines are getting better and faster Internet connectivity. But, yeah, I can still write code the old-fashioned way. It's just not as much fun.
Sean Falconer
I think, like, for myself, there's certain languages that I only ever learned to use with, like, an IDE that had term completion and things like that. And you get so used to those things that you don't necessarily build up, like, the muscle memory around, oh, I know exactly where this library is in order to import it. And there's other languages where, you know, my main coding interface was, like, Vim or Emacs or something like that. And, like, syntax was not a problem. I could write it from scratch, and there I'd be more comfortable in offline mode, even though most of those languages I never used outside of, I don't know, programming competitions and school projects essentially. So I totally understand, and I think AI is just kind of supercharging that dependency on some of these tools.
Jonathan Ellis
Yeah. And it does become a dependency. Right. Like, if I wanted to find the first occurrence of a substring in Java, is it String.indexOf or is it String.substring? I couldn't tell you off the top of my head with certainty what that is. I would take the string object and then I would type a dot in my IDE and look at the methods that it proposed and say, oh, okay, that's the one that I want. And in the same way, I do think that, you know, the part of my brain that used to write code manually all the time is atrophying a little bit and I'm relying on the AI to do that for me, which makes me a little bit uncomfortable. But at the same time, I remember I had a coworker who was writing Java in Vim for years at DataStax and I kept trying to convince him, use IntelliJ, use IntelliJ, it will make you more productive. And finally he did. Finally he started using IntelliJ and he was 30% more productive. As brilliant an engineer as he was, to write Java the hard way in Vim, using a good tool did make him more productive. And I'm sure that the Vim-using part of his brain atrophied a little bit. And he doesn't have that mental encyclopedia of Java methods quite the way that he used to. But it's probably the right trade. Like, I don't see how it's not the right trade. So there might be a point where I'm just giving instructions to the AI to the point where it's like, oh, now it's like I'm managing a team and this isn't fun anymore. At which point, you know, I'll start a movement for artisanal handcrafted code. But right now it's a productivity enhancement and it's a fun enhancement.
Sean Falconer
What are your thoughts on, you know, the potential impact, I guess, to people who are, you know, students and trying to learn? Like, even when higher-level languages like Python and stuff like that were introduced, that incited certain riots within the world of engineering and computer science, where some people feel like, oh, you have to suffer through, you know, memory leaks and compiler errors and, you know, kernel dumps and stuff like that in order to earn your stripes and really understand what's going on. And that stuff gets abstracted away by certain languages. And now if you start to become very reliant on some of these AI tools, you're even further away from necessarily understanding, like, the guts of it, or at least having to understand the guts of it in order to be successful, you know.
Jonathan Ellis
Yeah, that's a good question. I think right now one of the things that junior engineers struggle with, at least in my limited experience coaching junior engineers, and actually I taught high school CS for half a semester, so I've got a little bit of experience with the very, very junior end as well. One of the biggest mistakes they make is just kind of looking at a problem and saying, oh, maybe this is the solution. And then immediately they start trying to implement that instead of slowing down and thinking a little bit harder: what's my world model here, and what would have to be correct for this to be the actual solution? And AI can make that worse, right? You can just say, okay, hey, try to make this change. Okay, that didn't work. Try this, okay? Or at the extreme you're just pasting stack traces into ChatGPT and then pasting what it says back into your IDE and hitting the test button again. So it can definitely exacerbate bad habits. But at the same time there's never been a better tool for, like you said, non-judgmentally saying, hey, this is happening that I don't expect, what's going on? I hope that junior engineers make the right trade-off there. But I could see that it would be possible to overuse it. The reason why I'm optimistic on that is that if you are overusing it, then it's self-limiting. Like, you hit those limits where the AI gets stuck and can't solve your problem, at which point you do need to be able to solve it yourself.
Sean Falconer
Do you think that there's certain aspects of coding that should never be AI powered? For example, writing unit tests or something like that? Should those be handcrafted or other parts?
Jonathan Ellis
Oh, man. Unit tests are the first things on the chopping block for me. Like, hey, Claude, test this code for me. I do find that if I just say test this code, it does a terrible job. But if I say write tests that exercise this path or write tests that use this kind of data, then it's much better when you give it a little bit of direction beyond just write tests.
Sean Falconer
Yeah. I mean, I think to get value out of most of these tools, whether it's from, like, a coding aspect or even other things, like just writing a blog post or something like that, you have to have enough knowledge to give it, like, very clear instructions in order to get the thing out. I don't think, like, my mom, who has no engineering background, is going to go and be able to build an app with any of these tools right now that would actually do anything useful. They're not there. You still need enough knowledge about how systems actually work, how to program things, in order to get value.
Jonathan Ellis
You know, coming back to what I was saying earlier about being optimistic, if I were to be pessimistic, I might be a little bit pessimistic about, like, the next generation of engineers having that, you know, intuition of how to direct the AI appropriately. In other words, you know, the AI's first inclination is often to write the simplest possible solution to the task it's been given, which means that a dozen tasks in, you've got what we used to call spaghetti code. So I had a situation today where I was directing Claude how to refactor some code that had gotten a little bit out of hand. And because I had, you know, these 25 years of experience writing code the hard way, I was able to say, here's what the API should look like. Now go make it conform to that. But how successful is it without that experience to say, here's what it should look like? I don't know, but I guess over the next couple years we'll find out. Maybe there is a way to get that experience coding with AI all the time. I hope there is.
Sean Falconer
So you've mentioned Claude a couple times. Is that your main tool that you're using these days to help?
Jonathan Ellis
Claude's my go-to LLM, and they just released an update two days ago that I don't think I have enough data points personally to say is better. But the Aider benchmark that Paul Gauthier puts out says that it's better, and that's a pretty high-quality benchmark. So yeah, Claude is better than GPT-4 in general for writing code. And yeah, that's my go-to. And then I usually use it through this tool called Aider, which lets you say, here's the files that I want the LLM to edit. And then it's going to give it what Aider calls a repo map, which is basically a graph, a network of how your code's connected. And then, looking at the file that you're editing, it says, okay, here's the classes that it calls and the classes that call it. And so I'm going to give those connected pieces to Claude as context rather than trying to give it the entire code base. So it's a really powerful approach, and, you know, that lets it edit the files in place without having to copy back and forth into a browser window. That's my go-to today.
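The idea of handing the LLM only the connected pieces can be sketched as a small graph lookup. The Python below is an illustration of the concept as described in this answer, not Aider's actual implementation. The class names and the list of call edges are hypothetical, and a real tool would extract them by parsing the code base.

```python
from collections import defaultdict


def build_graph(calls):
    """Index (caller, callee) edges in both directions."""
    callees = defaultdict(set)
    callers = defaultdict(set)
    for source, target in calls:
        callees[source].add(target)
        callers[target].add(source)
    return callees, callers


def context_for(target, calls):
    """Return the classes the target calls plus the classes that call it."""
    callees, callers = build_graph(calls)
    return sorted((callees[target] | callers[target]) - {target})


# Hypothetical call edges for a small code base.
CALLS = [
    ("QueryPlanner", "VectorIndex"),
    ("QueryPlanner", "KeywordIndex"),
    ("VectorIndex", "Compressor"),
    ("Compactor", "DiskWriter"),
]

# Editing VectorIndex: send its caller and its callee as context,
# and leave unrelated classes such as Compactor out of the prompt.
context = context_for("VectorIndex", CALLS)
print(context)
```

Limiting the context to direct neighbors keeps the prompt small. A real tool could rank or widen that neighborhood when the question needs more of the code base.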
Sean Falconer
Is that the main sort of AI tool stack that you're using? Are you using other stuff as well?
Jonathan Ellis
Yeah, I mentioned the others: Copilot for the autocompletion, and Cursor and/or Augment Code for asking questions of the code base. And the reason I put that footnote there is that my understanding is Augment Code is still in closed beta. DataStax is talking to them, and so I have access to that, and it integrates with JetBrains IDEs like IntelliJ, which is the killer feature for me, but it's also very good at answering those questions about the code base. But if you don't have access to Augment Code, then Cursor is also good at that answering-questions use case.
Sean Falconer
How do you think we get to a place where like some of this stuff is like a more consolidated, unified experience rather than having to like you use three, four different tools.
Jonathan Ellis
So right now it's early enough. This is the industry pendulum, right? Like, when there's something new, then you need to use best-of-breed tools for each individual use case because it's new. Like, nobody's consolidated them into one tool that does everything. And then gradually over time, you know, you do get that consolidation, until someone figures out, oh, here's this new aspect of the problem that I can deliver an order of magnitude better benefits for, in which case, you know, you start over. So we are at that beginning stage right now. I do think that we'll get to that consolidation stage over the next couple years. How quickly that happens, I don't think I could guess.
Sean Falconer
I think you're exactly right. That's like a sort of normal path for all, like, new technology innovation. Like, people forget that there was a point, even with early messaging clients, like peer to peer, when, you know, you had ICQ, AOL, and then people would build these, like, super apps that aggregated them all together eventually. Now you have maybe a handful of those things that all provide kind of a similar level of experience.
Jonathan Ellis
Certainly Cursor would tell you that they can do all three use cases today. I personally prefer, you know, mix and matching, but yeah, that's where we are.
Sean Falconer
I want to switch gears a bit and talk about this project, ColBERT Live. Your team introduced this, and my understanding is it's aimed at making, like, vector databases smarter. Could you explain a little bit about the project and, like, how it actually helps or enhances the functionality of.
Jonathan Ellis
There's a little bit of background here. So there's ColBERT Live, which is the name of the library that we open sourced, and that's based on a project that I'll call Stanford ColBERT, which is a series of research papers and an open source library produced by... I'm not going to risk saying his name because I'm probably going to get it wrong. But he was a grad student at Stanford when he wrote it. And so the problem that it solves is that vector search is really good at capturing semantic similarity. So "it is sunny outside" and "it is a bright day": those would compare as semantically very similar, even though they don't share keywords. Sunny and outside versus bright and day, those are completely different words. If you're trying to do a text-based search, you will not see those appropriate points of comparison. The flip side is that vector search is not good at keyword search, and it's not good at terms that the embeddings model was not trained on. So terms that it's not trained on, or that it doesn't see often in its training data, include proper names, for instance. So Jonathan Ellis, it might be okay at searching for that. Like, those are both pretty common names. Sean Falconer, I would guess that it would do less well at searching for your name just because it's less common. And so right now the state of the art in the industry is to say, okay, well, let's take the vector search results and let's take keyword search results that you get from something like a BM25 algorithm, and let's give both of those sets of results to a reranking model and then let it sort out which ones are actually best. That does work pretty well. Actually, surprisingly well, especially with the most recent generation of reranking models. Voyage AI, for instance, released one in September that in my opinion is the most accurate available right now.
But the problem is that these RE ranking models are expensive, but not just in like, hey, I have to pay for the service to use them, but expensive in terms of time. So Voyage's Rerank 2 model takes just about exactly half a second, a little bit under half a second to re rank a list of 40 documents and say which ones are the top five results for a given query. I mean, it's certainly acceptable for a lot of use cases when the alternative is not getting the right answer. But that's definitely slower than you'd like when your underlying vector search is, you know, 50 milliseconds and your BM25 search is probably faster than that. So what Stanford Colbert does is it says, hey, instead of representing our documents or our passages with a single vector, let's create a semantic vector for each token that's influenced by the surrounding tokens in the passage and we'll index all of those and then we'll do the same thing with the query. We'll break, break up the query into tokens and create a semantically influenced vector for each of those query tokens. And so now I'm comparing, you know, 16 or 32 query vectors with my database of all the vectors that were involved for this passage. And what that does is it now instead of having a vector for it's bright outside or it's a sunny day, now I have vectors for sunny influenced by its day and a vector for bright influenced by its and outside. And so I get the best of both worlds from my search result. I get the semantic matching, but I also get something that's very, very similar to keyword matching. So when you do that, now I don't need to throw in BM25, now I don't need to re rank anymore. And so it's a for it's both a more theoretically satisfying approach as well as potentially a faster one as well. So with Stanford Colbert, what you get is a standalone vector index that's specialized for doing Colbert searches, but you can't do any predicate filtering. It doesn't integrate with anything else out there. 
And so Colbert Live says, hey, bring me your vector database, like data stacks, Astra, like Apache, Cassandra, like PGvector, like SQLitevec, whatever it is. And Colbert Live wraps the intelligence of how to do Colbert style indexing and searches with multiple vectors in a standard single vector database.
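The per-token comparison described in this answer is commonly scored with what the ColBERT papers call MaxSim: for each query token vector, take its best similarity against any token vector in the passage, then sum those maxima. The sketch below is only an illustration of that scoring rule in plain Python. The function names and the two-dimensional "embeddings" are made up for the example and do not come from ColBERT Live or any real model.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best cosine similarity to any document token."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Toy 2-d token vectors, hand-picked for illustration (assumption, not model output).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a close match for each query token
doc_b = [[1.0, 1.0], [1.0, 1.0]]              # only loosely related to each query token

score_a = maxsim(query, doc_a)
score_b = maxsim(query, doc_b)
print(score_a, score_b)
```

Because every query token has to find its own best match, a passage that matches each token closely (doc_a) outscores one that is only vaguely similar overall (doc_b), which is the keyword-like behavior described above.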
Sean Falconer
Is that overall, like going to be a better search for vectors than using sort of traditional vectors? Is that like a wholesale replacement of the traditional vector search?
Jonathan Ellis
I mean, gazing into the future, is it going to be a wholesale replacement? Probably. So if there are use cases where single vector search is giving you adequate results, then ColBERT won't replace that, because ColBERT isn't adding value when you're already getting adequate results, and it's inherently slower since it's dealing with multiple vectors per query. But if you do need more relevant results than you get from single vector search, which I think is most use cases, then I do think the ColBERT style search has the potential to replace those.
Sean Falconer
Yeah. So essentially, if you need to rely on a reranker today, then potentially this is a better, faster option that gives you a more accurate result.
Jonathan Ellis
So we've been talking about just pure text passages, but the advantage of the ColBERT approach is magnified even more when you're talking about multimodal data. So a group of French researchers created a model called ColPali, which applies the ColBERT approach to searching images. So I can give a text query and it will match that with images whose vectors are similar to the query that I gave it. And the way it does that is they trained a model to map both the text queries and the images that it's indexing to the same vector space. And so the alternative, what people are doing in industry, is they're taking those images and they're running OCR against them, and they're training models to recognize tables and charts and pull those out and describe those in a way that can also be indexed with traditional vector search. And so you've got this really complex pipeline that's a little bit slow and a little bit fragile, versus, oh, I can use this ColPali model and I can index it with ColBERT Live and all of that complexity goes away. So I think, even more than the traditional vector search and getting better results, I'm more excited for the image search side of things. And ColBERT Live supports both of those.
Sean Falconer
What's the state of the project right now?
Jonathan Ellis
As far as I know, I'm the only person who's actually used it. So I would love to get some feedback that says, here's what was useful, here's what was hard. It supports Astra and it supports sqlite-vec today. What I did in an attempt to smooth the learning curve is I created two cheat sheets. One is for, hey, I'm working with Claude or I'm working with GPT and I want it to help me use ColBERT Live. And so you can give it the cheat sheet, which has the API and the docstrings and so forth. It's a Python library. And the other is, hey, I want to use Weaviate or I want to use Qdrant, or I want to use Pinecone, or, you know, I want to use some vector database that you haven't implemented. So I also have a cheat sheet for, here's how you extend it, here's how you implement the database class for ColBERT Live. Because, you know, it's an Apache licensed project. The intent really is for it to be more than just DataStax.
Sean Falconer
How does the integration work with these various, like, vector stores? Like, presumably it's creating an index outside of them. So then how does it actually, like, you know, use the index to find the data that's stored within the vector store?
Jonathan Ellis
This is something that I wish I could find a better way to do. I haven't found a better way. So what I did was, there's an abstract database class and it has two methods that you need to implement. One is for running the individual vector searches per query vector, and the other is, given a unified list of documents that I've identified as the best candidates, fetch all of the vectors associated with those documents so that I can compute the ColBERT MaxSim score for each of those. And the reason why it's at such a high level is that I want you to be able to add in predicate filtering. I want you to be able to add in, you know, any other aspect of your database that you want to, ACLs, whatever. And so, you know, it's hard to find a common denominator across Cassandra, which is a very different animal from PostgreSQL, which is a very different animal from Pinecone. So I left it at this fairly high level. And then if you drill down into the Astra implementation, there it gets more opinionated about, here's what your code should look like. And similarly, I did it for SQLite, and it's a little more opinionated there. And so that's why I have that cheat sheet for, hey, I want to implement this for this other database, and there are some examples for you to follow, and hopefully that helps.
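The two-method design described in this answer might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not ColBERT Live's actual API: the class name `VectorDB`, the methods `ann_search` and `fetch_vectors`, the `search` helper, and the in-memory backend are all hypothetical names invented for the example, and a brute-force scan stands in for a real vector index.

```python
import math
from abc import ABC, abstractmethod


def _unit(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


class VectorDB(ABC):
    """Hypothetical two-method interface; names are illustrative only."""

    @abstractmethod
    def ann_search(self, query_vec, limit):
        """Return ids of the documents owning the `limit` nearest token vectors."""

    @abstractmethod
    def fetch_vectors(self, doc_ids):
        """Return {doc_id: [token vectors]} for the given candidate documents."""


class InMemoryDB(VectorDB):
    """Toy backend: brute-force scan instead of a real ANN index."""

    def __init__(self, docs):
        self.docs = {d: [_unit(v) for v in vecs] for d, vecs in docs.items()}

    def ann_search(self, query_vec, limit):
        q = _unit(query_vec)
        scored = [(_dot(q, v), d) for d, vecs in self.docs.items() for v in vecs]
        scored.sort(reverse=True)
        return [d for _, d in scored[:limit]]

    def fetch_vectors(self, doc_ids):
        return {d: self.docs[d] for d in doc_ids}


def search(db, query_vecs, k, per_token_limit=2):
    # Step 1: one vector search per query token; union the candidate documents.
    candidates = set()
    for qv in query_vecs:
        candidates.update(db.ann_search(qv, per_token_limit))
    # Step 2: fetch every vector of each candidate and score it with MaxSim.
    unit_q = [_unit(qv) for qv in query_vecs]
    scores = {
        d: sum(max(_dot(q, v) for v in vecs) for q in unit_q)
        for d, vecs in db.fetch_vectors(candidates).items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Toy data, hand-picked for illustration (assumption, not real embeddings).
db = InMemoryDB({"sunny": [[1, 0], [0, 1]], "mixed": [[1, 1]], "other": [[-1, 0]]})
results = search(db, [[1, 0], [0, 1]], k=2)
print(results)
```

Keeping the interface this coarse means a real backend could apply predicate filters or ACL checks inside either method without the scoring code needing to know about them.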
Sean Falconer
And in terms of where we started the conversation, around what DataStax is doing in the AI space, becoming this full end-to-end platform, what is the state there? Can I come to DataStax today and do everything that I need to do in order to build an AI-powered RAG application?
Jonathan Ellis
Everything is a big word, but you can definitely build AI applications completely on top of DataStax, especially if your application has a family resemblance to, shares some DNA with, chatbot applications, because that's what Langflow was originally created to target. And so, you know, if you wanted to build a backend for Cursor to compete with them today, you'd definitely need to do some customization, and that's not going to be something that we're going to do out of the box, but we will give you the building blocks and we can help you figure out how to do the missing pieces.
Sean Falconer
Yeah, I mean, I think that's going to be the case with any fully fledged platform today. If you need to do really advanced stuff, you're going to have to roll up your sleeves, I think, and get your hands dirty and do some custom work. So what's next for DataStax?
Jonathan Ellis
We see the journey in the industry overall as going from 2023 being a year of experimentation and testing this new tool that we have, and 2024 really being where people have been able to successfully turn that into production applications. And we believe that 2025 is where people are going to go from automating and enhancing their existing products and existing workflows to addressing things that weren't possible before. And just as an example, a very simple example, I mentioned earlier that I was fine-tuning an embeddings model, and the way that I got the data to do that fine-tuning was by asking Gemini Flash to OCR a bunch of PDF documents for me. And you know, a couple years ago, building a data set to train or fine-tune a model was considered one of the most difficult things that you could do. And it required an army of human labelers to do it. And now, you know, you've been able to reduce the time to do that by three orders of magnitude, maybe four. So I think that you're going to start seeing, you know, just like the internal combustion engine. It started off, I think they called it a horseless carriage. Right. Like they just thought, oh, I know things with wheels, carriages, right? That's what we're going to build. And now, you know, a modern Tesla has very, very little resemblance to a carriage. And so I think that's the trend that you're going to see. And DataStax wants to help people make that transition.
Sean Falconer
I think that's right. If you look at any kind of big technology shift that's happened in terms of how consumers interact with technology, whether that's the Internet, desktop to mobile computing, the cloud, usually the first couple of years is experimentation and setting up infrastructure, and then it takes a couple of years for the really net new consumer experiences, the ones baked in that technology, to happen. You know, Uber and Instagram, which were these landmark mobile-first companies, didn't happen immediately when the iPhone was released. It took several years, because you had to build essentially all of the tooling and infrastructure to be able to even serve or create that kind of use case, and have people thinking about it who grew up with the technology and stuff like that. And I think, you know, 2025, 2026, that's when I think we'll start to see that in the AI world as well. Right?
Jonathan Ellis
I think so too.
Sean Falconer
Well, Jonathan, this has been really interesting. Thanks so much for being here.
Jonathan Ellis
All right, thanks again, Sean. Cheers.
Episode: DataStax and the Future of Real-Time Data Applications with Jonathan Ellis
Release Date: November 19, 2024
Host: Sean Falconer
Guest: Jonathan Ellis, Co-Founder and Technical Lead at DataStax
In this episode of Software Engineering Daily, host Sean Falconer welcomes Jonathan Ellis, co-founder of DataStax, to discuss the company's evolution and its pivot towards AI-driven applications. Jonathan, who has worked on DataStax for nearly 15 years, shares his passion for coding and the technical challenges he tackles at the company.
Notable Quote:
Jonathan Ellis [01:14]: “Writing code... looking forward to taking code and then at the end of the day it does something that it couldn't do before.”
DataStax has been enhancing its platform to support AI-driven applications, particularly focusing on vector search capabilities. Jonathan explains how DataStax aims to be a comprehensive stack for building generative AI applications by integrating components like their Vector Search for Cassandra, the acquisition of Langflow, and partnerships with NVIDIA for embeddings computation.
Notable Quotes:
Jonathan Ellis [02:49]: “We're really trying to remove the complexity as much as possible and let you focus on building your application.”
Sean Falconer [04:03]: “Yes, it's more of like a platform approach than being essentially like a point solution for vector storage.”
Jonathan highlights the challenges developers face when stitching together various tools for AI applications, such as embedding models and databases. DataStax's approach simplifies this by allowing seamless integration with services like OpenAI and NVIDIA, handling complexities behind the scenes.
Notable Quote:
Jonathan Ellis [04:18]: “There's a lot of unspoken knowledge or it's not necessarily clear what the best practices are... we're trying to bring those into the mix as well through the Langflow platform.”
The conversation shifts to Jonathan's experiences with AI-assisted coding tools, including GitHub Copilot, Claude, and others. He discusses the initial skepticism he had towards using AI for coding, which transformed into enthusiasm as tools like ChatGPT and GPT-4 demonstrated significant productivity gains.
Notable Quotes:
Jonathan Ellis [06:36]: “For me, it's just so useful that it's worth putting up with all the sharp corners and rough edges.”
Sean Falconer [09:04]: “...ChatGPT and, you know, throwing prompts at it for what I needed... It does that with like a reasonable output.”
Jonathan elaborates on how AI tools have not only increased his productivity but also made coding more enjoyable by handling repetitive and boilerplate tasks. However, he acknowledges limitations, such as instances where AI falls short, requiring manual intervention.
Notable Quotes:
Jonathan Ellis [08:16]: “...I'm having more fun because I've got this AI intern to do kind of the boring parts and I can concentrate on the interesting parts.”
Jonathan Ellis [10:53]: “It's a good mix. I'm really, really happy with the challenge and the intellectual puzzles that programming in 2024 with AI looks like.”
Jonathan points out that AI tools provide a non-judgmental environment for developers to ask questions, enhancing learning and problem-solving without the fear of judgment. This rapid access to information accelerates the development process.
Notable Quote:
Jonathan Ellis [11:19]: “...having that assistant to answer questions like the non judgmental thing, that's great. Absolutely. But also just like the speed, the latency of getting your questions answered now.”
Delving deeper into technical aspects, Jonathan discusses the challenges DataStax faced while integrating vector search into Apache Cassandra. This included adding a new vector type, developing efficient vector indexing algorithms like HNSW and DiskANN, and building a query execution engine capable of handling multiple predicates.
Notable Quotes:
Jonathan Ellis [14:10]: “How do you wire that into the rest of the database?... building a cost-based query optimizer.”
Jonathan Ellis [15:59]: “We can push the compression up to 64x... a much more tractable problem.”
Jonathan shares his optimistic perspective on AI's role in software engineering, predicting that AI will continue to enhance productivity while also raising concerns about over-reliance. He emphasizes the importance of balancing AI assistance with personal coding skills to maintain a deep understanding of the codebase.
Notable Quotes:
Jonathan Ellis [22:22]: “If you are overusing it, then it's self-limiting... it's a good balance.”
Jonathan Ellis [24:03]: “Unit tests are the first things on the chopping block for me... if you give it a little bit of direction beyond just write tests.”
Jonathan introduces ColBERT Live, an open-source library based on the Stanford ColBERT project. ColBERT-style search creates a semantic vector for each token in a document, combining semantic matching with something close to keyword matching, so that relevant results no longer depend on slow, expensive reranking models.
Notable Quotes:
Jonathan Ellis [29:38]: “Instead of representing our documents or our passages with a single vector, let's create a semantic vector for each token...”
Jonathan Ellis [34:38]: “...ColBERT style search has the potential to replace those... more relevant results.”
Jonathan explains how ColBERT Live integrates with various vector databases through an abstract database class with two methods, which leaves developers free to add predicate filtering, ACLs, and other database-specific behavior. He expects ColBERT-style search to replace single-vector search wherever more relevant results are needed, and he is especially excited about its use with multimodal data such as images.
Notable Quotes:
Jonathan Ellis [35:27]: “Even more than the traditional vector search and getting better results, I'm more excited for the image search side of things.”
Jonathan Ellis [37:55]: “It's an Apache licensed project... the intent really is for it to be more than just DataStax.”
Looking ahead, Jonathan envisions 2025 as a pivotal year where AI enables the development of entirely new applications and workflows that were previously unattainable. DataStax aims to support this transition by providing robust infrastructure and tools to harness AI's full potential.
Notable Quotes:
Jonathan Ellis [40:38]: “2023 being a year of experimentation... 2025 is where people are going to go from automating and enhancing their existing products and existing workflows to addressing things that weren't possible before.”
Jonathan Ellis [42:11]: “DataStax wants to help people make that transition.”
Jonathan and Sean conclude the episode by reflecting on the rapid advancements in AI and its transformative impact on software engineering. Jonathan expresses enthusiasm for the future, highlighting DataStax's commitment to leading the charge in AI-powered data solutions.
Notable Quote:
Jonathan Ellis [43:08]: “All right, thanks again, Sean. Cheers.”
Comprehensive AI Integration: DataStax is evolving into a full-stack platform for generative AI applications, simplifying the development process by integrating various AI tools and services.
Enhanced Developer Productivity: AI-assisted coding tools like GitHub Copilot and Claude significantly boost productivity and make coding more enjoyable, despite some limitations.
Innovative Vector Search Solutions: ColBERT Live brings ColBERT-style multi-vector search to standard vector databases, aiming for more relevant results without the latency and cost of reranking models.
Future Outlook: The AI landscape in software engineering is poised for transformative growth, with companies like DataStax leading the way in enabling new applications and workflows.
This episode offers valuable insights into the intersection of AI and software engineering, showcasing how companies like DataStax are pioneering efforts to streamline and enhance real-time data applications. Jonathan Ellis provides a candid look at both the opportunities and challenges presented by AI, emphasizing the importance of balancing automation with foundational coding skills.