
A
Welcome to Just Now Possible with Teresa Torres.
B
My name is Ernesto Garcia. I'm a front end product engineer at Doist. I've been at Doist for a little over seven years now, and lately my work has been more around how AI shows up in our products and also how our products interface or integrate with external AI systems. In particular, in Ramble, which is a product that we're going to talk about, I was involved mostly in the early stage of exploration and then in integrating it into the web application UI.
C
My name is Thomas. I'm a backend software engineer at Doist. I've been working here for about seven years now. Lately I've been working a lot on security as well as database schemas and resharding and all sorts of stuff about database organization. Before that I was working on Ramble on the backend side: creating the microservices that make it work, writing prompts, doing all the testing, and making sure that we have a high quality backend service that we can offer to our clients and then to our users.
D
My name is Hugo. I'm one of the product managers at Doist and I've been at Doist for more than 10 years now. That hurts. It's my first job and yeah, at the moment I'm working on a new bet, but previously I worked on taking Ramble from a prototype to the high quality launch product that it is. So we're excited to talk about that today.
A
Excellent. It says a lot about Doist that you all have long tenures there. Someone tell me a little bit about what Doist does as a company.
D
So yeah, Doist is a software company and we're building productivity software. The main product that we have and work on is Todoist. It's a to-do list app and project manager. It's great for tracking your tasks as a personal user, but also with your team, going from capturing the task to working on the task and then completing the task. That's the main product we work on. We also have Twist, which is a team communication app steering more towards async communication for teams like us that are pretty remote and covering the world. Yeah, and at the moment we're also working on a few new bets that we might touch on at the end of the call.
A
Excellent. And then tell me about, is it Ramble?
D
Yeah, so Todoist Ramble is our voice-to-tasks feature. You can just ramble anything that you have in mind and we're going to capture it as tasks. So yeah, it's a pretty delightful feature that we worked on last year. And I think it's our very first pure AI feature that we built into the product.
A
I love the name because it takes any pressure off of having to like have structured thoughts, especially for to dos. Right. It feels like I might have to know exactly how to say a to do, but just in the name of the product. It's just Ramble. And is that how you built it? Is the intent? Is this just fully unstructured input and your AI figures it out?
B
We didn't start with a Ramble feature in mind and then decide to use AI. We started with an AI exploration, given all the advances in AI and the new things that are possible with the new AI technologies. So we had this phase of exploration to see how we could introduce AI into some features, into Todoist, in a way that made sense and was not just for the sake of it. And during that exploration, we came up with a few prototypes of different features, some of them not related at all to what Ramble does, but also some that started to look more and more similar to what Ramble ended up being. And Ramble came up as one of the few top contenders, features that actually made sense, that were solving a user problem and not just putting AI in the product for no other reason.
A
Yeah, yeah, I love this. Okay, so it started as. Let's just explore how we might use AI. Tell me a little bit about. I imagine as a to do company, this is something you've heard before. Did you have any type of voice interface before? Did you have any feedback from customers that this was something they wanted? Give me a sense for, like, how did you know this was a problem worth solving?
D
One of the main USPs for Todoist has always been Quick Add. It's our feature to capture tasks and we've always made a point of making it super fast. We were one of the first to introduce natural language parsing, maybe 15 years ago. So you could say stuff like, I need to call my mom tomorrow at 10, and we're just parsing the text and figuring out the attributes of the task. So it's always been something we wanted to improve. And as far as I remember, when I joined, we were already talking about how we could make voice happen, and the technology was just not really there. So I think once we saw some models come up around voice, we were very excited about trying them out and seeing how we could make capturing tasks even faster, more useful, and frictionless. So I think that's really what drove us to make it one of the top contenders for using AI in the tool. It was always there. And we also backed it up with a bit of research. We've got a continuous discovery process as well here. And as part of the research, we saw that there was a bit of a cold start problem in Todoist where you don't really know what to do. You don't know yet what you need to do, and sometimes you need to brainstorm that a little bit. And a lot of people have been sharing that they use pen and paper to figure out the tasks and then copy them into Todoist, or they use the new voice feature in ChatGPT to brainstorm a little bit what they need to do and then dump that into Todoist. So it felt like an opportunity for us to cover that use case of a brain dump. And that's what we called it internally at the beginning. And we also had this image from The Devil Wears Prada, where Miranda has this moment where she's just dumping tasks on her assistant. So we also had that in mind when building the feature. And we even tried that scene with the feature as we went, to see if it was able to capture a lot of tasks at once.
A
That's amazing. Taking inspiration from a movie scene is great and actually using it as like your test case is excellent. So I love that you started with. I can imagine the very simple case of this is where you started with just natural language where I click on maybe a microphone icon. I'm literally stating a task the way that I would probably type it in the tool, which means I'm thinking about it from the tool's point of view of the title of a task, maybe when the task is due. But it sounds like ramble is different from this. Ramble is I can just brain dump, as you said, and it's going to try to figure out what to do with all of that. And I love this visual of that movie scene. So, like I'm imagining I could go for a walk and just stream of consciousness. Here's what's in my head that I need to capture. Is that fair?
B
Yeah, I think that's fair.
D
Yeah.
A
Okay. And so let's talk a little bit about that. I love that you were seeing this idea of people starting on pen and paper and then transferring it over to Todoist. What did you learn about what's happening in that pen and paper session? I'm just trying to understand this problem of a brain dump. Is it that people just haven't thought about how to structure them as tasks? Tell me a little bit about what you know about that.
D
Yeah, so I think the behavioral thing behind it is that once it's in Todoist, you've committed to it. So I think people are figuring out a way to be more unstructured before figuring out the actual tasks that they want to add. For many people, they go straight to Quick Add, just adding the task. But we've discovered that some people need to think through their plan or what they want to do first, and then add the tasks once it's clear. So, yeah, I think with Ramble, it's trying to bridge to that behavioral thing that we saw in the research. And we're also working on other types of cases like pen and paper, like taking an image and converting that into tasks, and also just dumping a lot of text and then figuring out the tasks in it. So we're going even beyond Ramble at the moment. It's exciting because we are able to bridge to what people actually do in real life. So, yeah, that's where AI is becoming pretty useful for us.
A
This sentiment that people don't want to put it into Todoist because it's like a commitment, that now I have to do it, is a really nice insight. Right. Like pen and paper doesn't feel real yet. But as soon as it goes into the to-do app, now I have to commit to doing it. Okay, so it sounds like you were very aware that there was this process of, let me just brain dump. You saw what was possible with LLMs. This was one of the first ideas that came up that you started prototyping around. How did you evaluate that this was a problem an LLM could be good at?
B
We started, as I said, with an AI exploration phase, and we actually started with an idea similar to what ended up being Ramble, but using mostly text input. So it was still text input, but you could dump something unstructured there, just your thoughts and stuff like that. And those early prototypes with that kind of approach did work pretty well in terms of showing us that LLMs were already, at the time, which was maybe a little over a year ago, good enough to take a set of structured tasks for the user out of that unstructured text, those unstructured thoughts. So that was one of the first evaluations that we did. But there was still the friction that you had to type all of that, which is not the idea. And then the next step was testing whether voice models could also be up to the task. We ran a few tests internally, with initial rough prototypes to try it out before even putting it in our products, and things looked good. So we started to invest in actually giving it a shape that would be suitable to integrate into the product.
A
Great. Then tell me a little bit about. Let's get into the. How does it work? I imagine your user is hitting a ramble button and they're rambling for a little while and then what happens?
C
So what happens when a user starts a Ramble session is that the browser opens a connection to our backend, through our microservice dedicated to that, and it sends the raw audio from the microphone, and then we forward that to our LLM provider. So this is Google Vertex, the Vertex API. And then we send our prompts to that model and all the tool definitions. And we use a very nice model that has live audio processing, so it can call tools and do all sorts of processing while the user is speaking. We don't have to wait until the user has finished talking to start getting information from the LLM. And thanks to that, we can start displaying tasks to the user as they are talking, even if they have not finished talking. And then they can iterate on that. They can correct themselves. And we have the right tools for that: to add new tasks and to edit the tasks that they are adding, so that they can correct things if they said something wrong or if the LLM understood something wrong. It can just change that in real time and show the updated tasks immediately in our client.
A
So let me make sure I understand. One model is doing both the speech to text and also doing on the fly tool calls while the person continues to talk. So as I ramble, I'm seeing tasks being created that I can then refine as I keep talking.
C
Yes, that is correct. We initially tried a little bit with using a first model to do a transcription of the audio and then processing that. But in the end it wasn't worth it. It adds a lot of latency and the results are not much better, because when you're talking, there's more than text. There are also the pauses, there's the intonation that you have. All of that can carry meaning as well. And if you just have a text transcription, you lose all of that. So when you have a model that can process the audio directly, you get a richer input, and we can have a more accurate capture of what the user actually means than if we just process text.
A
Oh, so the model is. There's not a transcription step, it's actually just processing the audio directly.
C
Yes.
A
Yeah, fascinating. Okay, and is this available on desktop or mobile? Both?
C
It's available on both. It's on desktop and mobile applications. We even started a beta for Android Wear version, so you can use it from your smartwatch if you have an Android watch. So that's really cool.
A
Yeah, very cool. Okay, so as I ramble, I'm getting. I'm seeing on the screen, tasks start to pop up. And this is just midstream, so I can start, I can stop what I'm saying and say, no, I meant this, not that I can say, retitle this, add this to the notes, whatever. This is fascinating because it feels very visual. And the thought that immediately came to mind is I imagine people use this as they're driving and they remember a task. So I imagine there must be like, you must have had to think through like just the right UX of. Do you support different modalities of. I actually want to see what's on the screen. I have audio only. Has this come up at all in your testing? I'm curious about just what use cases you have for this.
D
Yeah, a lot of people drive and ramble. That's the thing. And based on that insight, we added sound effects. So when you add a task, there's a sound that confirms that you've added a task and that we actually captured it from your ramblings. Then also when you edit a task, when you say, oh, sorry, I meant this and that, or this is not priority one, this is priority four, we have a different sound that means it's edited and you can move on to the next thing. So we've added the sound effects to make sure that people drive and look at the road ahead. But yeah, it's very visual. Most people use it that way, whether on desktop or mobile. You have a dedicated screen where you see the tasks popping up, coming in, and then you can edit them and you see that happening in real time. And yeah, it's very good to feel you're being heard and that you can trust that the tool is actually understanding. Previously, I think at the very beginning, we tried an actual live audio thing where the tool would respond back with voice. That was a little slow. And we also tried it where you just ramble and then at the end you have all your tasks added, but then you don't have this confirmation step where you can make sure that what you said is what is going to be added to the list. So I think we struck the right balance of UX, where it's live, but it's also visually confirming to you, and not just at the end or through voice. So it feels really nice that you actually feel heard by Todoist.
A
Yeah, I like that you're using both visual and audio cues. It's a nice. I can imagine it feels very alive.
D
Yeah, I thought. And our designer Michael did a great job there. The tasks are actually floating around so they don't feel as static and confirmed. It's more of a live thing you can play with and edit. And we also added some tips that are going to cycle through so you can also learn about what voice commands you can use, such as removing a task or editing or adding some attributes like a label or a priority or a description. Yeah, it feels very live.
A
Yeah, very nice. Okay, so I love that there's already complexity just in the UI. It sounds like the simplest feature ever: I'm going to hit record, I'm going to start rambling. But already we've started to uncover that you're turning this into tasks, you're showing them what tasks were created, they can edit those things, they can add metadata to those things, and they might be driving, so there are sound cues. You said that you experimented with the AI talking back and there were some latency issues. Do you do anything with voice, like text to voice at all, like at the end of a session, or did you decide that wasn't a part of the feature?
C
No, we're not doing any text to speech at all. It doesn't work really well and it would also be very complicated because we support many languages and so you have to also depend on that. It can be very tricky to get right.
A
Yeah. I can also imagine for your people that are looking at their phone or on their desktop, the visual cue is exactly what they need. I guess in the driving sense it sounds like your audio cues have been enough for people to feel confident that what they're saying is getting captured appropriately.
D
Yeah, I think so. No accidents were reported so far. So it's working. The data shows it's working.
A
Excellent. Okay, so let's get a little bit into the technical details of how this works. So you've got a model that is processing the audio. It can call tools. First of all, I'm curious about your background in this area. You mentioned this was one of the first AI features at Todoist. Had any of you built AI features before? Were you learning this on the fly? Was it part of this exploration? Tell me a little bit about how this even came about.
B
This was, for me and I think for everyone involved at the time, among the first few relatively big AI powered features that we had worked on. But as I said, Ramble came up during an exploration phase and it was not the first thing that we came up with. During that time, about two to three months in which we decided to explore how AI could help us, we built other internal prototypes, most of them simpler features, just to test them out. So in that sense, it was not the exact first thing. And before that, there were still some much smaller features that we had already worked on. In Todoist, we have this amazing, super powerful filtering capability, but it requires a special syntax for how you type the filters. People had a hard time writing the syntax because it reads like something for programmers: someone more technical could do it, but someone less technical would struggle. We had a lot of help center articles with example filters so that they could copy and use them. And our customer support team was usually helping people achieve the filtering that they wanted, because the syntax was super powerful but complex to come up with. So one of the first things that we did, even before this AI exploration phase, was an AI model that learns from a prompt how the filtering syntax works. Users can then express what they want to filter in natural language, a short sentence, and the AI model gives them back the filter query in the slightly more complex language, which they can then use and save, and they have the filter set up. I think that's the most important one. We also had one extension. It was not part of the core product, but people could install an extension where you could break down a task into subtasks. So you have a task, but it's a very broad one, and you just click a button and the AI model will tell you what potential subtasks you could break it down into. So those were the ones, and other people at Doist were building those features already before we started the exploration phase.
A
Okay, I love that you took a few months to just explore what's possible. It sounds like that was probably a really fun time period and also probably helped you level up a little bit on just, how do we get comfortable with this new technology? Yeah. All right. Okay, so we've got an audio input model processing the input. Thomas, you mentioned it can call tools. Let's talk a little bit about what types of tools it has access to. How did you think about tool design? Let's get into some of those details.
C
Yeah, sure. So our model has access to a very limited number of tools, actually. It's basically add task, edit task, delete task, and that's pretty much it. And these tools have several parameters every time: the task itself, like the task title, the task description, the due date, deadline, priority, label, project ID, and that's basically it. So yeah, the whole magic is actually in the prompt, telling the model how to parse what the user is saying and to make the right tool calls based on that. So especially, not trying to do what the user is talking about, not trying to achieve those tasks, but just capturing them without over-interpreting them. So being as faithful as possible when understanding what the user is talking about, while also doing some more advanced things. For example, I mentioned we have due dates and deadlines. The LLMs are very bad at dealing with dates. So we have to give special instructions about how to deal with dates so that it plays well with the natural language date handling that we've already had for 15 years, as Hugo mentioned. So yeah, there's a lot of prompting involved to do all of that right. But the tools themselves are pretty limited in scope, I think. And what actually happens with these tool calls is that the backend doesn't do anything with them. We just pass them directly to the clients, which use them to show the tasks in their UI to the users and to play the audio cues as needed, so the clients themselves build the whole resulting task and show the UI to the users.
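To make the flow Thomas describes concrete, here is a minimal sketch of a client applying a stream of add, edit, and delete tool calls to an in-memory task list and queueing the matching audio cue. It is an illustration under assumptions, not Doist's implementation: the `Task` and `TaskBoard` names, the `task_id` argument, the dictionary shape of a tool call, and the cue names are all hypothetical. Only the three tools and the parameter list come from the conversation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    # Parameters named in the interview; the field names are assumptions.
    title: str
    description: str = ""
    due: Optional[str] = None        # natural-language string, e.g. "in 3 days"
    deadline: Optional[str] = None
    priority: int = 4
    labels: list = field(default_factory=list)
    project_id: Optional[str] = None


@dataclass
class TaskBoard:
    """Hypothetical client-side state built purely from tool calls."""
    tasks: dict = field(default_factory=dict)
    cues: list = field(default_factory=list)   # audio cues the client would play

    def apply(self, call: dict) -> None:
        name, args = call["name"], dict(call["args"])
        task_id = args.pop("task_id")          # hypothetical identifier
        if name == "add_task":
            self.tasks[task_id] = Task(**args)
            self.cues.append("added")
        elif name == "edit_task":
            task = self.tasks[task_id]
            for key, value in args.items():
                if not hasattr(task, key):
                    raise ValueError(f"unknown task field: {key}")
                setattr(task, key, value)
            self.cues.append("edited")
        elif name == "delete_task":
            self.tasks.pop(task_id, None)
        else:
            raise ValueError(f"unknown tool: {name}")


board = TaskBoard()
board.apply({"name": "add_task",
             "args": {"task_id": "t1", "title": "Call mom", "due": "tomorrow at 10"}})
board.apply({"name": "edit_task", "args": {"task_id": "t1", "priority": 1}})
print(board.tasks["t1"], board.cues)
```

Keeping the tool surface this small means the client only has to handle three event types, which fits the point that the backend just forwards the calls.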
A
I want to clarify something. So first of all, you mentioned two things I want to come back to, which are working with dates and also getting the AI to just be literal about the task. These are things that I've encountered when I've played with building an AI driven task management system, and so they resonate. But before we get there, I want to highlight that it sounds like the model only returns tool calls. There's not a back and forth between the person talking and the model. The person talks and the model calls tools. And those tools are telling your system, create this task, edit this task, delete this task. Is that correct?
C
That is correct. There's no other output from the model. There's no text and no audio. It's just tool calls that happen while the user is talking, so that we can show the updated tasks directly, with very minimal latency, in the apps.
A
What's nice about this is you're really constraining what the model can do. And I imagine this really helps with output quality.
C
Yeah, absolutely. We don't want the model to try to be overly smart. It just needs to fit within the boundaries that we are setting, which match what we do in Todoist and how the product works. So we are very restrictive about it. And that's also why it works so well, I think.
A
So I imagine you had to teach the model, like, here's what a task is. You have to teach the model, like, here's how to identify what's relevant and what you can safely ignore. And then I imagine you also have to, like you said, deal with this challenge with dates, and decide what stuff is associated with a task. I imagine people don't just talk cleanly about here's a task. They ramble, maybe they come back and add something to an earlier task. I can imagine this took a lot of iteration on prompt design, but also on understanding how you're evaluating it as well. Were there other hard challenges that came up in terms of teaching the model how to do this well?
C
So, yeah, it took a lot of iterations. It was a lot of us testing it ourselves in our development environment. And of course, we also built a whole system to evaluate the quality, especially because we want it to work well with the many languages that our Todoist users actually use. So the first thing is we built a system with an LLM judge to evaluate the quality of the tool calls made by the model. Basically, we record the whole session and we can replay it with a different prompt. For example, we record the audio, we replay it with a different prompt, and then we can see the output from the model. And we ask another model to evaluate how good it is compared to a transcript that we provide manually: did it capture everything? Did it get some things wrong? We use a smarter model to do that, which is also a bit slower and not live, but which is really good for that. And that is really helpful for running many different tests very quickly. Then later we updated that a lot to add support for many languages, as I said. So we decided on four different scenarios. We wrote them in plain English and we asked our staff to record them in different languages. We have over 100 people in more than 35 countries, so we have many people speaking many different languages. And in the end we have recordings in over 20 languages with different accents, different talking speeds, and different recording quality, because this can also influence the LLM result. And we built a whole system to be able to replay that, to check the quality for every language separately, to catch regressions when we change anything, and just to see how good the quality of the results is.
A
Yeah, amazing. And it sounds like you're measuring on a few different dimensions. You mentioned, did it capture everything? Did it capture things correctly from a quality standpoint? Are there other things that your evals are measuring?
C
It's mostly about, did it capture everything? Did it miss anything? Or is there something that it didn't understand? Then our LLM judge gives a score out of 5, and we consider it successful if the score is at least four out of five.
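The replay-and-judge loop described here can be sketched roughly as follows. This is a minimal illustration under assumptions, not the team's system: `Recording`, `evaluate`, and `regressions` are hypothetical names, and the model and the judge are passed in as plain callables so the sketch runs with stubs. The only details taken from the conversation are the manually written reference transcript, the score out of 5 with a pass mark of 4, and the per-language breakdown used to catch regressions.

```python
from dataclasses import dataclass


@dataclass
class Recording:
    language: str
    audio_path: str    # hypothetical path to a recorded session
    transcript: str    # manually written reference transcript


def evaluate(recordings, run_model, judge, pass_mark=4):
    """Replay every recording and return the pass rate per language.

    run_model(audio_path) -> tool calls produced with the prompt under test
    judge(transcript, tool_calls) -> integer score from 1 to 5
    """
    passed, total = {}, {}
    for rec in recordings:
        tool_calls = run_model(rec.audio_path)
        score = judge(rec.transcript, tool_calls)
        total[rec.language] = total.get(rec.language, 0) + 1
        if score >= pass_mark:
            passed[rec.language] = passed.get(rec.language, 0) + 1
    return {lang: passed.get(lang, 0) / count for lang, count in total.items()}


def regressions(baseline, current):
    """Languages whose pass rate dropped compared with a previous run."""
    return sorted(lang for lang, rate in current.items()
                  if rate < baseline.get(lang, 0.0))


# Stubs stand in for the live audio model and the slower judge model.
recordings = [
    Recording("en", "en_1.wav", "buy milk"),
    Recording("en", "en_2.wav", "call mom"),
    Recording("fr", "fr_1.wav", "acheter du lait"),
]
stub_model = lambda path: [{"name": "add_task"}]
stub_judge = lambda transcript, calls: 5 if "mom" not in transcript else 3

rates = evaluate(recordings, stub_model, stub_judge)
print(rates, regressions({"en": 1.0, "fr": 1.0}, rates))
```

Because the recordings are fixed, the same set can be replayed after every prompt change and compared language by language.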
A
Okay, and so let's get into some of the challenges you ran into with just prompt design and trying to teach the LLM you mentioned dates were a challenge. This resonates with me because this was like the first problem I had to solve. But I'm not sure all my listeners are familiar with this problem. Do you want to just share a little bit about some of the challenges you had with dates?
C
Of course. So the first one, and the easiest one actually, is that the model doesn't know the current date. So if we just say in three days, maybe it will write in three days. And that's fine because our natural language parser will understand that. But sometimes it will also use some date that is based on when the model was created. And so it will say some date in 2024, except that's two years ago, so that doesn't really work. Then, most LLMs are really bad at math, like at doing basic arithmetic. So if we're saying that something is in 90 days, it will probably get it wrong. Our system can handle that, but the LLM itself can't. So we have to teach it how to use our date parser without trying to do too much by itself. But even then, we had some bad cases that were a bit trickier to get. For example, we had some feedback from a user saying that Ramble didn't get it when he said that something was due in two months and a half, and it was writing in 2.5 months, which is understandable by a human being, but not by our date parser. So we had to be a bit smart here and say, okay, if it's something like that, then say in 60 days, in 70 days, or 75 days, or something like that, which our parser can handle.
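As a worked example of the arithmetic involved, the sketch below rewrites phrases such as "in 2.5 months" into a whole number of days. To be clear about the assumptions: the team describes handling this through prompt instructions, not through post-processing code, so `to_day_phrase` is a hypothetical helper, and treating a month as 30 days is an approximation chosen here because it yields the 75 days mentioned for two and a half months.

```python
import re

# Approximate lengths; the 30-day month is an assumption for illustration.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30}

PATTERN = re.compile(r"in (\d+(?:\.\d+)?) (day|week|month)s?")


def to_day_phrase(phrase: str) -> str:
    """Rewrite 'in N days/weeks/months' as a whole number of days.

    Phrases that do not match the pattern are returned unchanged so that a
    downstream natural language date parser can still try them.
    """
    match = PATTERN.fullmatch(phrase.strip().lower())
    if not match:
        return phrase
    amount, unit = float(match.group(1)), match.group(2)
    days = round(amount * DAYS_PER_UNIT[unit])
    return f"in {days} day" + ("" if days == 1 else "s")


print(to_day_phrase("in 2.5 months"))   # 2.5 * 30 = 75
print(to_day_phrase("in 2 weeks"))      # 2 * 7 = 14
print(to_day_phrase("next Tuesday"))    # left as-is
```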
A
Yeah, this is fascinating because I'm hearing there are a few sides to this problem. I imagine given your history building a to-do app, you already had a lot of natural language parsing around dates, which already is a hard problem. And I think it's one of those things where, when I use products, it's like the most delightful part of a product to just say in three days or next Tuesday and have it really know what you mean. So it sounds like you had a lot of that in place already. And then there's this second challenge of, can the LLM communicate in a way that your natural language parser understands? And then there's this third challenge that LLMs are just naturally not great with dates, for two reasons. Their sense of time is wrong: it tends to be whenever their training cutoff is. But also they're not great at basic arithmetic, and arithmetic around dates is tricky. This example of two and a half months is a great example of this. Okay, let's talk about how you solved some of this. So it sounds like you were able to update the prompt to guide it on how to talk about dates in a way that your natural language parsing system could handle. Was that enough? Was it enough to just constrain how it talks about dates? And how did you solve that it doesn't know when today is? Tell me a little bit about what you did there.
C
So to let it know what today's date is, we just inject that in the prompt, at the beginning of the prompt. So this is a really easy one to solve. And for the rest, yeah, we basically explained that instead of talking about months or weeks, it should talk about days. And that's it. That's the most important part of it. Then there are some other details. For example, our model supports many languages, actually more than Todoist itself. So sometimes the user is using a language that is not supported by Todoist and by our date parser. And of course this doesn't work. So we had to tell the model to always talk about dates in English. This way it works even if the user is talking in another language. Our clients get dates in English and we can handle that.
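A minimal sketch of that prompt assembly might look like this. The wording of the instructions and the `build_system_prompt` function are assumptions made for illustration; the actual prompt is not shown in the conversation. What comes from Thomas's description is the ordering (today's date first) and the two rules: speak about dates in days, and always in English.

```python
from datetime import date

# Hypothetical wording of the two date rules described in the interview.
DATE_RULES = (
    "Always express dates in English, even if the user speaks another language. "
    "Express relative dates in days rather than in weeks or months."
)


def build_system_prompt(base_prompt: str, today: date) -> str:
    """Put today's date at the very beginning, then the date rules, then the rest."""
    header = f"Today's date is {today.isoformat()}."
    return "\n\n".join([header, DATE_RULES, base_prompt])


prompt = build_system_prompt(
    "Capture each task the user mentions as a separate tool call.",
    date.today(),
)
print(prompt)
```

Building the header at request time, not baking a date into a stored prompt, is what keeps the model from falling back on dates near its training cutoff.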
A
Ah, interesting. Okay, so there may be this translation layer: they say it in whatever language they're speaking, you translate it to English, and your natural language date parser can handle that. Yeah. It's funny, these problems aren't always the hardest to solve, but I think most teams that I talk to are surprised that they need solving at all. It's like the gap between what a human just intuitively understands, two and a half months, and what a literal LLM struggles with a little bit. I remember when I first encountered the problem of the LLM not knowing today's date, I was shocked. Like, why can't it know that? And then it made perfect sense. Of course, how would it know that? But I think the very first thing I did was build a Python script that just tells the LLM, here's today's date, here's yesterday's date, here's tomorrow's date. Don't even worry about it, just use the script. Okay, let's get a little bit into task boundaries. I imagine with rambling, just understanding where one task starts and another task ends wasn't trivial. Are there things you had to do there to help the LLM understand where a task started and a task ended?
C
No, actually the LLM was pretty good with that. It's one of the surprising things, like dealing with dates is complicated, but dealing with most of the rest is pretty easy. It's really good at understanding what people are talking about. And yeah, with some very light, with a few sentences at the beginning of the prompt, it can understand that. Okay, each task that the user is talking about needs to be captured separately. And that's basically enough. Okay, it's really surprising. It can get that very easily.
A
And then you mentioned you had to do a little bit of work to guide the LLM towards you're capturing the task, you're not doing the task. Tell me a little bit about that.
C
I can give you a very simple example. If you start a Ramble session and say, okay, I need to do grocery shopping and buy all the ingredients to make a carrot cake, for example, it can either add one task, buy ingredients for a carrot cake, or it can list many tasks with all the ingredients for a carrot cake. That's the basic difference. And the thing is, it's not perfect. Sometimes it does add all the ingredients for the carrot cake, but the users are pretty happy about that. I saw that several times on Reddit. To me it's not supposed to work like that, but some users are happy about it. So yeah, we try to do it. It's not perfect, but it works well enough and the users are happy with it. So we made a compromise.
A
I play with this a lot. So just for context, I built a Markdown-based task manager that I use with Claude. And so whenever I add a task to my to-do list, I'm literally telling Claude, new task. And then I type what the task is. And one of the problems I ran into is that it would summarize. I want it to just literally put exactly what I wrote on the task. And sometimes it would summarize it, sometimes it would expand it, sometimes it would start doing it. And so there was this boundary I had to set of, no, I want you to literally capture what I put on the task. But then there are some edge cases. Like, I have task templates. So then I would be like, use the blog post template. And it would literally write on the task, use the blog post template. And I had to be like, no, use the blog post template to create this task. And so these are just silly examples where there are instances where you want the LLM to be literal and instances where you want the LLM to be smart. I've even started to move into, can you enrich the task? So I'll have, here's my notes, and then here's my notes to the LLM. What I love about it is it really exposes where there's ambiguity in the way that humans think and how good other humans are at parsing through that ambiguity. I'm curious if you've run into challenges with this. It sounds like your carrot cake example is one that's very like this. But is this something that comes up, and how are your users managing this? Or how are you managing this for your users?
D
Yeah. Something we discovered pretty quickly is that the LLM was getting very creative. And in some ways it's great, because if you say, I need to plan a marathon in Paris next year, it's probably going to give you a plan of exactly what you need to do. Buy shoes, et cetera. So it's nice, you get a plan. But that's not really what's expected, since it's supposed to just capture new tasks as they are. So in some ways the creativity was great. In others, it could get too creative, and you would drift too far from just capturing a list of tasks. I remember with Thomas, we played around with the temperature of the model. At the beginning it was set to one, so it was very creative. And then we tried to lower that to the point where it was almost too literal, and we were missing some of the delightful stuff that it would do. For example, in the prompt, we were making sure that the task is always actionable, so we always add a verb even though the person doesn't really give one. So we're making it actionable for the user. But if you lower the temperature too much, it's going to just add the words that you said. So it wasn't as helpful as it had been. So we had to play around with that notion of temperature in the LLM. Does that make sense?
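To make the temperature trade-off concrete, here is a minimal sketch of how temperature is commonly described for sampling: the model's scores (logits) are divided by the temperature before being turned into probabilities. The function name and the example logits are hypothetical, and this illustrates the general idea only, not how the Gemini models used by Ramble work internally.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # almost always picks the top token
high = softmax_with_temperature(logits, 1.0)  # spreads probability more evenly

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

At a low temperature nearly all of the probability lands on the single most likely token, which matches the "almost too literal" behavior described above. At a higher temperature the alternatives keep a real chance of being chosen, which is where the more creative output comes from.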
A
Yeah. And if we have listeners that aren't familiar with the temperature setting, it's basically, like you said, a creativity setting. How much variation in responses is there going to be? That's probably the easiest way to think about it. And it is fascinating to see the change in behavior just based on a really simple setting. Okay, let's dig a little bit into your evals, because it sounds like you've got a pretty robust system, especially with different languages. So there are a few things I already heard you share. Did they capture all the tasks? Did they capture them well? And I can imagine you could have errors at different levels. Like you have speakers that are speaking different languages, and they might have accents. So is it really all categorized around those two errors, or are you trying to understand, is it a translation error? Tell me a little bit about that. It sounds like you're recording interactions, you're scoring them with a judge, and you've got a threshold where it's an error or not. What happens to those errors? What are you doing as a team when you find errors?
C
So this was mostly used to catch regressions and to see how we could improve the prompt and what impact it had on realistic scenarios. But then we also realized when doing that that the biggest problem is actually that the model has very different quality of support for different languages. Some languages like English, French, and Spanish are handled very well. If we're starting with Arabic or Bengali, for example, it's way more tricky. So in the end it's far from perfect. If we look at the numbers, there are a lot of tests that never pass. And yeah, we accept that it's one of the problems. We know about it. We hope that with future versions of the model it will improve. But it's mostly used to have a baseline to make sure that we are not making things worse when we make any change.
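The baseline idea can be sketched as a per-language comparison of pass rates before and after a prompt change. This is a minimal illustration under stated assumptions: the function name, the language codes, the pass rates, and the tolerance are all hypothetical, and the source does not describe how Doist's eval system stores or compares its scores.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    """Return languages whose pass rate dropped by more than `tolerance`.

    A language missing from `candidate` counts as a pass rate of 0.0.
    """
    return sorted(
        lang
        for lang, base_rate in baseline.items()
        if candidate.get(lang, 0.0) < base_rate - tolerance
    )


# Hypothetical pass rates from an LLM-judge eval run, keyed by language.
baseline = {"en": 0.95, "fr": 0.93, "bn": 0.40}
candidate = {"en": 0.96, "fr": 0.85, "bn": 0.39}

regressions = find_regressions(baseline, candidate)
print(regressions)
```

Comparing against the previous run rather than against a fixed target fits the situation described above: a language that has always scored poorly does not block a change, but a language that gets noticeably worse does.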
A
I see. And I recall you shared that you got your employees to create almost like your data set for evaluating. So you've got a whole bunch of recordings from different employees, different environments, different languages. That's your eval data set, and when you make prompt changes, you're running against that. It's not that you're recording everything your users are doing and constantly evaluating there. Okay. Yeah, okay. I was starting to think about a data policy there and how you're communicating that. And I can imagine with to-do lists that can get very sensitive. So you're using internal data created by employees and you have an evaluation set for evaluating prompt iterations. Is there anything you're doing in production to get feedback from customers? Do you have a way to collect feedback that this is working well for them?
D
When we were actively working on Ramble, so during the last semester, we had this feedback button in the Ramble UX, so you could access it, and it was pretty open. People would just be able to give us feedback on how they would rate Ramble and what they found was working well versus not working for them. And it was just grinding through all this feedback that we were getting from users and then figuring out, is it a problem we can solve, or is it something that we just need to accept from the model and its quality? We made a lot of changes, small iterations on the prompt over time, just going through all the feedback we were getting and trying to guide the Ramble experience to the right place. There was no magic there. It was just looking at the feedback, using a bit of LLM help to get through the feedback, and going through the prompt again, and using these evals was really helpful. So we got a few regressions where you tried to make the prompt smarter in a way, and then you were like, oh yeah, it's actually adding subtasks in the description. We don't support subtasks in the Ramble experience, so we had to go back and figure it out. Testing again and again.
A
You mentioned you had to decide, is this something we could fix or is it just something we have to accept as a limitation of the model? How did you distinguish between those two things?
D
Probably just trial and error: you amend the prompt, it breaks again, you try and fix it again, it breaks again. And then you just decide that giving the LLM a shorter prompt is probably going to be easier, and you actually get better results. So we're also actually trying to reduce the prompt, so it's not trying to direct too much. But yeah, I mean, I think it was just actually trying stuff. Whenever Thomas was publishing a new prompt update, we were actively working through Ramble, trying the different test cases that we had and just seeing what works and what doesn't.
A
So really trying to fix it. And then if you go through iterations and it's just not getting better, just recognizing this might be a limitation we have to live with.
D
Yeah, pretty much, yeah.
A
Okay. And have you seen, you mentioned you built this a year ago. Have you seen model improvements that have helped with your quality?
D
Yeah, I think maybe. Ernesto, do you remember the first model we used from Gemini, from Google? I think it was 2.0 or 2.5.
B
Or we may have iterated between both of them while still playing with it internally. But I wouldn't be able to tell right now if we have upgraded.
D
Okay, yeah, we've upgraded a couple of times. Now it's using the 3.0 Flash model.
A
Yep.
D
And yeah, I remember we did the switch with Thomas, and it was both faster and you would feel like it understood the tasks better. So I don't think we've quantified the improvement. It was more of a feel for it. There was really an improvement. And that's great, because it's not something you had before: you worked on a feature, and it wasn't possible for the feature to just upgrade itself over time. That's the nice thing with AI, that Ramble is going to get better. Even though we don't touch the prompt or touch anything in the product, it's probably going to figure out tasks even faster and in a more elegant way over time. So that's a nice thing about it: you get, like, auto-maintenance.
A
Yeah. There's this mantra of build for the model that's going to come out six months from now, which breaks my brain a little bit. Like, how do I know what that model will be able to do? But I think it's this idea that you can assume the quality will improve over time. And so there are these little things that you could spend a lot of time trying to optimize for. But as anybody who's done prompt optimization in the context of a product knows, it's a little bit of one step forward, two steps back, two steps forward, one step back. And maybe it's just recognizing that all these little edge cases don't necessarily need to be solved, because the model will get better with time. And then, Hugo, I love how you just framed this. Like, it's pretty cool that the brains of our features get better without us having to invest time and energy into them. Okay, is there anything about Ramble that we haven't covered that we should have?
B
There's something that comes to mind that I was thinking about when you and Thomas were talking about the difficulties. At first it was only capturing tasks, but then we needed to make the model aware of its surroundings: the projects users can mention when they're adding a task. At first it wasn't possible, but users wanted to be able to say, put this task in this project, put this task in this other project. And they refer to their projects in loose ways. So the model needs to be made aware of what the user's projects are, to see which one matches how the user referred to the project. They can just use a single word out of the entire project name. The same applies to labels. This is in a way similar to the issue with dates, but a different thing, because dates are something that the model already knows about from its training. But the model doesn't know about the particular situation in the Todoist user account. So we need to make the model aware of that. And that introduced some complexity around whether it is picking the correct project, since there could be ambiguous project names, and there may be other stuff aside from projects. Like I said, labels as well. The users can have different labels to categorize tasks across different projects, and they can just say it out loud, and stuff like that.
D
Yeah, so projects, labels, and also assignees. I think we're releasing assignee support in a couple of weeks. It's one of these tricky features where if you're in a personal, non-shared project, there's no assignee. If you are in a workspace with a team and you have a team project, then you have multiple assignees. So there's also figuring out how to recognize when to add the assignee, and not trying to add a new user at all if you are in a private project. And I think with Thomas we worked a lot on adding support for more and more attributes, and that raises complexity, and you also need to figure out how you inject that context. So I think with Thomas we worked a little bit on, do we use RAG, or do we just inject the context, all the structured data of the user, all the projects that they have, in the system prompt? I remember we just came back to that simple solution of adding all the projects the user has into the system prompt. So then the model is able to figure out the project from that list of projects without having to make an API call, which would mean a lot more latency for the tool. So eventually it just worked pretty well. It's actually able to recognize emoji. So if you have a frog for the project name, if you're doing Eat the Frog as a productivity system, then if you say add that to the frog project, it's going to recognize that it's the frog emoji. So it's going to add it to the right project. Which is pretty cool.
A
That is pretty cool. Okay, I'm glad, Ernesto, that you brought this up, because a really common challenge for folks is, what is the right context to give the model for it to be good at the job at hand? And I can see how different users set up their projects completely differently. I can see how project names are ambiguous. I love this emoji example. And so it sounds like you found a simple solution of just sharing the project structure in the system prompt. Is that correct? Okay, Thomas is nodding, which for people listening is hard to hear.
C
Yeah, I can expand on that a little bit. Yes, it is correct. In the end, we just inject a list of projects into the prompt that we send to the model, and that is enough. And it was actually quite surprising to us. We had a bit of a debate about whether that would work or not, or if we needed something more complex, like adding another tool call to let the LLM request a list of projects that start with something, or something like that, which would have required more model calls, a full RAG pipeline, that kind of stuff. And in the end, just injecting the list of project names works. So this is amazing for us, because it's of course much simpler. For us engineers it's much better because it's easier to maintain and to debug if anything goes wrong. It's also probably cheaper, because we need fewer LLM calls, and it works just fine. This would have been trickier maybe even a year before, because models had much smaller context windows. But nowadays the context windows are so large that even with audio sessions that last for several minutes and prompts that have thousands of tokens, everything just works without needing to add too much complexity to our systems.
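A minimal sketch of the "just inject the list" approach described here: the user's projects are rendered as plain text and appended to the system prompt, so the model can match a loose mention against them without a tool call. The `Project` class, `build_system_prompt`, the IDs, and the prompt wording are all hypothetical; the source does not show the actual prompt or data format Doist uses.

```python
from dataclasses import dataclass


@dataclass
class Project:
    id: str    # hypothetical identifier the client can map back to a project
    name: str  # display name, which may include emoji


def build_system_prompt(base_instructions: str, projects: list[Project]) -> str:
    """Append the user's projects to the system prompt as a plain list."""
    lines = [base_instructions, "", "The user's projects (id: name):"]
    for project in projects:
        lines.append(f"- {project.id}: {project.name}")
    lines.append("")
    lines.append(
        "When the user refers to a project loosely, pick the closest match "
        "from this list. If nothing matches, leave the project unset."
    )
    return "\n".join(lines)


# Hypothetical account data, including an emoji project name.
projects = [Project("p1", "🐸 Eat the frog"), Project("p2", "Home renovation")]
prompt = build_system_prompt("Capture the tasks the user says.", projects)
print(prompt)
```

The design choice is the one Thomas describes: one prompt, no extra model calls, and a failure mode that is easy to inspect because the full list the model saw is right there in the request.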
A
Yeah, it's so easy to over-engineer some of this stuff. And I love the fact that models are just smart and we can just give them a little more context. Hugo, I want to go back to something you said. You're working on assignees, which gets into names. You already mentioned one challenge: don't create a new user. But I can also imagine names are really ambiguous. If I'm rambling and I say invite John to this meeting, you need to know which John. So I'm curious about what you're doing there and what challenges you're facing.
D
Yeah, that's a good question. We're not doing much, and it's the same for projects, labels, or sections. Some people have the same names for sections, like to do, done, et cetera. Same for names. If there are multiple Thomases at Doist, they could just say the other one, and it actually works. So the LLM is going to do its best to figure out what is matching. But then if it's not the right match, the good thing is that we have this edit task tool call, and people can just say it's not this one, it's the other one. And usually it's going to pick up the right project or assignee. So I guess as far as this goes, it's as intelligent as a human. If you're talking to someone and there are two Thomases in front of you, and you say, hey Thomas, you don't give enough context for the human to understand which Thomas you were referring to. And in that case the LLM is the same. So just saying the other one actually works pretty decently.
A
You're highlighting something that I think we often underestimate, which is that language is really easy to correct. Right? So it's not as critical that you get it right the first time, as long as it's dead simple to correct. And I think this is one of the things that's fascinating about natural language interfaces: if you observe a conversation between two humans, there's probably a lot of correcting happening and we don't even think twice about it. That gives us a little bit of leeway in how well the model performs, as long as it's easy for us to correct the model. I like that a lot.
D
Yeah. And it's going to make as many assumptions as we do as humans. The thing is it doesn't know as much from the context as you. So that's also why we need to figure out where in the prompt we need to be more explicit about the exact stuff that it needs to understand, versus where it's just going to get it through talking, the same way humans can understand. That was pretty interesting for me to see happening in that project.
A
Yeah, very cool. Okay. Anything else on Ramble that we haven't covered.
D
People should try it. It's pretty cool. It's pretty cool.
A
Okay, let me ask this. What's next? I know you worked on this a year ago. There may not be something next for Ramble. So if you want, you can share what are you working on now?
D
I can share a little bit about what's next in terms of task capture, because it's a continuation. At the moment, there's a team that has been working on new modalities since January. So we did a long sprint on adding voice support. But as Ernesto shared a few minutes ago, we also had in mind that you can turn text, a large amount of text, or a file or an image into tasks. So we've baked that into Todoist and it's live in beta at the moment. And so what's next is really polishing and figuring out how these multimodal experiences work as a cohesive experience and not just a bunch of buttons that you can click. So, yeah, it's super cool. It's working really well. And again, we're using the Gemini models under the hood, and they have this capability of parsing an image for tasks or parsing a long blob of text for tasks. So it's pretty cool that we're able to do that these days. And there's also some work on Apple Watch, and there's also some work on improving Ramble as it is, with assignees, stuff like that. But there are a lot more AI projects going on at Doist. It's everywhere. I know Ernesto is working on something too.
B
Yeah. Something to allow users to automate some tasks, or create some automations that connect Todoist to other systems and do things automatically for you.
D
Yeah, as well.
A
Yeah. I feel like there's a fun jump from manage my tasks to actually start doing some of my tasks, which task managers are uniquely situated to help with. You know, something you said about a giant text blob to tasks made me think about. I'm seeing more and more like meeting transcription software is trying to identify tasks and go straight from like meeting to task list. Is that something you guys are experimenting with?
D
Yeah, it's something we've been thinking about. I think the first step would be integrating with these tools. Ernesto, correct me if I'm wrong, but I think Granola, for example, is in that list of tools that you can automate. And so we could take that blob of text from the meeting notes, also from Google Meet notes, and turn it into tasks for the user. So I think there's more to it in general in terms of capturing the tasks from other work tools. Whether it's in Slack or in Microsoft Teams or meeting notes from Granola, there's a lot we can do to make sure that the stuff you agreed to do is actually on your list so you don't forget. Yeah, I'm excited to see what we can do there.
A
That's funny. I rolled my own task management tool because I feel like task management is so idiosyncratic. I want it to just match the way my brain works. But all this talk about where tasks come from and all these integrations, you've got me thinking. I'm like, oh, I don't want to do all that myself. So maybe I do need to find a task management tool that's doing all these integrations but still allows me to create the idiosyncratic workflows.
D
So there's a Todoist CLI if you want to use that.
A
I have heard all about it, and I still built my own. Dominique Jost has tried to get me to play with it quite a bit. Yeah, it's such a fun space. I actually really love this space of just task management, because it's almost like a human-computer interface. Right? We've got to get this stuff out of our heads onto paper, into the computer, onto our phone, whatever it is, so that we then know what to do. And there's the get it out of your head part, but then there's also consume it and process it and do stuff with it. And I feel like every human is so specific in the way they want to do those things. I feel like it's a fascinating product problem: how do we do this in a way that supports the way a wide variety of humans think? It sounds like you guys have spent a lot of time on the capture part, which I think is one of the biggest, most important parts. So I'll definitely go check it out.
D
Yeah. And the next steps, there are many. Once you have a lot of tasks and the source of truth is in Todoist, now you have a thousand tasks. You need help to plan them, to triage them, to figure out their priority, to actually work on them and execute them. And so, yeah, there are infinite possibilities, which is pretty cool for us as builders: figuring out how we can help people commit to the right stuff and do it. Because obviously just capturing tasks is not the end of the world. You're not actually doing stuff, you're just capturing. So it's the first step, and we recognize that the next thing is really being able to plan them, understand what you need to do at the right time, and also maybe sometimes prune some tasks that you just don't need to do. So I think eventually you want to be able to understand that context and use it to really just help people do the right thing at the right time for themselves.
A
Yeah, it's funny how it's easy to look at this space and think it's really simple, but as you all mentioned in your intros, you've been working at Doist for years and there are always new problems to solve. It's an evergreen space of just, how do we do our work better? Which is really nice. All right, I really appreciate you taking the time to share your story with me. Task management is something that I have always nerded out on, and it's fun to see a company being so thoughtful about it. So thank you. If you enjoyed this conversation, please subscribe in your favorite podcast app and give us a rating, as it helps others find the show. Thanks. I appreciate it.
Host: Teresa Torres
Guests: Ernesto Garcia (Frontend Engineer), Thomas (Backend Engineer), Hugo (Product Manager)
Date: April 16, 2026
This episode dives deeply into the origins, design, and ongoing evolution of Todoist's "Ramble" feature—a real-time, AI-driven voice-to-task system designed to seamlessly capture unstructured thoughts and transform them into actionable tasks. Host Teresa Torres interviews three core members of the Doist team (the makers of Todoist and Twist), unraveling how the idea was born, the initial technical and UX challenges, and what lies ahead for multimodal AI task capture.
Evaluation system:
Quote: “We built a system with an LLM judge...it would evaluate how good the model did compared to what was said. We use a smarter model to do that, which is slower but really good for evaluation.” — Thomas [27:00]
Problem: LLMs don’t know the current date and struggle with date arithmetic ("in 90 days", "two and a half months"), compounded by language/localization.
Solution: Inject today’s date into the prompt; directives for the model to talk about durations in days and always output dates in English for consistent parsing (32:10).
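The date fix summarized above can be sketched as a small helper that states today's date in the prompt, plus ordinary date arithmetic done in code once a duration is expressed in days. The helper names and prompt wording are hypothetical, and resolving the offset in code rather than in the model is an assumption made for this illustration; the source only says that the date is injected and that the model is directed to talk about durations in days and output dates in English.

```python
from datetime import date, timedelta

# Spelled out explicitly so the output does not depend on the system locale.
WEEKDAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def date_context(today: date) -> str:
    """Build the line that tells the model what 'today' is."""
    weekday = WEEKDAYS[today.weekday()]
    return (
        f"Today's date is {today.isoformat()} ({weekday}). "
        "Express relative durations as a number of days, "
        "and always write dates in English."
    )


def resolve_days_offset(today: date, days: int) -> date:
    """Turn 'in N days' into a concrete calendar date."""
    return today + timedelta(days=days)


today = date(2026, 4, 16)
context_line = date_context(today)
due = resolve_days_offset(today, 90)

print(context_line)
print(due.isoformat())
```

Keeping the model's output to a day count or an English date means one parser can handle every input language, which is the consistency benefit the summary points to.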
Regressions & Prompt Iteration:
Model version upgrades:
Upgraded through Gemini models, notably to the faster and more capable 3.0 Flash (44:53).
Impact: System performance and task extraction improved naturally as underlying models improved.
Quote: “The nice thing with AI is that Ramble is going to get better even though we don't touch the prompt or touch anything of the product.” — Hugo [44:54]
Ongoing work:
Future themes:
“We didn't start with a Ramble feature in mind and then decided to use AI. We started with an AI exploration...Ramble came up as one of the few top contenders of features that actually made sense and that solved a user problem.” — Ernesto [03:26]
“One of the main USPs for Todoist has always been the quick add...But we've discovered that for some people, they need to think through their plan or what they want to do first and then add the tasks once it's clear...We called it internally at the beginning: brain dump.” — Hugo [04:43]
“We added sound effects so that people driving get confirmation their task was added without looking away from the road.” — Hugo [15:08]
“The tools themselves are pretty limited...add task, edit task, delete task...the backend doesn't do anything with those. We just pass them directly to the clients which use them to show the tasks in their UI or play the audio cues.” — Thomas [22:05]
“It's going to make as many assumptions as we do as humans. The thing is it doesn't know as much from the context as you. So that's also why we need to figure out where in the prompt we need to be more explicit about the exact stuff.” — Hugo [53:53]
“The nice thing with AI is that Ramble is going to get better—even if we don't touch the prompt or touch anything in the product, it's probably over time going to figure out tasks even faster and in a more elegant way.” — Hugo [44:54]
This episode provides a deeply practical and transparent look into how Doist’s team leveraged AI—particularly live audio LLM processing—to bridge the natural “brain dump” behavior of users with structured, real-time task capture in Todoist. From prompt engineering details to UX in the car, from eval systems to “just inject the data” pragmatism, and a relentless focus on actual user needs over tech-for-tech’s sake, this is a must-listen (and read) for any builder shipping AI-powered workflow features.
Try Todoist’s Ramble if you haven’t yet!
As Hugo puts it:
"People should try it. It's pretty cool." — Hugo [54:26]