Small AI Models with Yoeven Khemlani - Software Engineering Daily


Podcast Summary: Software Engineering Daily – Small AI Models with Yoeven Khemlani

Release Date: July 24, 2025

1. Introduction

In this episode of Software Engineering Daily, host Gregor Vand sits down with Yoeven Khemlani, the founder of Jigsaw Stack. The discussion covers using small AI models for backend applications and the advantages of specialized, lightweight models over large language models (LLMs).

2. Yoeven's Background

Yoeven Khemlani shared his journey into entrepreneurship and technology. Starting as a game developer, he transitioned through various industries, including banking and corporate sectors, before founding his first startup, StayRWayback, a hotel aggregation service in Southeast Asia. After selling his initial venture, Yoeven became passionate about creating tools for developers, leading to the inception of Jigsaw Stack.

Yoeven Khemlani [01:33]: "I love building and I love building for people, right? It's like, how can I make money from the things that I build."

3. The Genesis of Jigsaw Stack

Jigsaw Stack was born out of Yoeven's desire to automate backend tasks using AI. Observing that existing LLMs like GPT-3 and GPT-4 excelled in front-end, human-in-the-loop applications but fell short in generating structured, actionable data, Yoeven envisioned a platform that could handle backend processes without manual intervention.

Yoeven Khemlani [01:33]: "Can we bring this technology to backend applications where there's no humans in the loop? It just works and takes away processes that used to require a lot of manual intervention."

4. The Concept of Small AI Models

Jigsaw Stack differentiates itself by utilizing small, specialized AI models instead of relying on massive LLMs. This approach focuses on creating models that are efficient, cost-effective, and highly accurate for specific tasks.

Yoeven Khemlani [04:32]: "Can we take the same and bring it to a 70B model? Because then we reduce our cost and increase efficiency."

Yoeven explains that while larger models are powerful, they are often overkill for many backend tasks. By fine-tuning smaller models (around 70B parameters), Jigsaw Stack achieves high accuracy (97-98%) while maintaining deployability and affordability.

5. Key Applications and Models

Yoeven Khemlani [04:10]: "Jigsaw Stack is a suite of small models to automate your backend task."

Jigsaw Stack offers a suite of small models tailored to various data extraction and transformation tasks:

Web Scraping: Automates the extraction of structured data from websites without the need for writing complex Puppeteer or Playwright code.
Optical Character Recognition (OCR): Combines traditional OCR with vision LLMs to provide more accurate text extraction with bounding boxes.
Speech-to-Text: Uses an optimized version of Whisper 3, which Yoeven describes as one of the fastest speech-to-text services available.
Translation: Uses a model trained specifically for translation, aimed at users who are moving away from Google Translate and other traditional translation providers.
Translating Text in Images: Beyond basic OCR, Jigsaw Stack is developing capabilities to translate text within images while preserving the original style and layout using diffusion models.

Yoeven Khemlani [08:03]: "Is there a way that we can take an existing image, understand the text on that image, then translate it and diffuse it back with the same style?"

6. The Prompt Engine

One of the standout features of Jigsaw Stack is the Prompt Engine, designed to streamline prompt management, model routing, and the application of prompt techniques. This engine intelligently selects the best model for a given prompt by running the input across multiple models and aggregating the results to ensure high accuracy and reliability.

Yoeven Khemlani [09:33]: "We took different data sets in different industries and trained a really small model that can make decisions on which model to pick at runtime based on your input of your prompt."

Gregor Vand shared his personal experience using the Prompt Engine for scheduling across multiple time zones, noting that it respected his constraints (such as no meetings before 6 a.m. for any participant) more reliably than the individual LLMs he tried.

Gregor Vand [12:47]: "Prompt Engine came back pretty reliably covering that."
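
The aggregation idea described in this section (send one prompt to several models, then keep the answer most of them agree on) can be sketched roughly as follows. This is only an illustration under stated assumptions: the episode describes a small trained model acting as a judge and merging similar outputs, whereas this sketch swaps that judge for plain string matching. The names `normalize`, `majority_answer`, and `run_across_models` are hypothetical and are not part of any Jigsaw Stack library.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def normalize(answer: str) -> str:
    """Collapse case and whitespace so trivially different answers compare equal."""
    return " ".join(answer.lower().split())


def majority_answer(answers: Sequence[str], min_votes: int = 2) -> Optional[str]:
    """Return the most common answer if at least `min_votes` models agree, else None.

    Assumption: exact match after normalization stands in for the
    LLM-as-a-judge step described in the episode.
    """
    if not answers:
        return None
    counts = Counter(normalize(a) for a in answers)
    winner, votes = counts.most_common(1)[0]
    if votes < min_votes:
        return None
    # Hand back the first original (un-normalized) answer that matches the winner.
    return next(a for a in answers if normalize(a) == winner)


def run_across_models(
    prompt: str, models: Sequence[Callable[[str], str]]
) -> Optional[str]:
    """Call every model with the same prompt and aggregate by majority vote."""
    return majority_answer([model(prompt) for model in models])
```

In this sketch, a lone outlier (for example, one model proposing a 5 a.m. slot) is outvoted as long as at least two other models return the same answer, which is the intuition behind the behavior Gregor describes.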

7. Speed and Infrastructure Advantages

Jigsaw Stack emphasizes speed and cost-efficiency by deploying small models that require fewer resources compared to their larger counterparts. This approach not only reduces operational costs but also enhances deployability, allowing enterprises to integrate Jigsaw Stack's models seamlessly into their existing infrastructure without the burden of managing bulky GPU resources.

Yoeven Khemlani [18:05]: "We started with the idea of GPU poor... focus on training the model to be deployable and cheap to run anywhere."

8. Developer Experience and APIs

Jigsaw Stack prioritizes developer experience by offering intuitive APIs modeled on familiar platforms like Stripe and Supabase. Developers install the library with npm or pip, add an API key, and rely on fully typed libraries that reduce the need to consult documentation.

Yoeven Khemlani [20:57]: "From day zero, every field needs to be a named field, everything needs to be descripted, everything needs to be typed."

9. Pricing Strategy

Initially, Jigsaw Stack adopted a usage-based pricing model, charging per API call. However, as their product offerings expanded, Yoeven identified the limitations of this approach, especially for services like speech-to-text where usage can vary significantly. In response, Jigsaw Stack is transitioning to a token-based pricing model, aligning more closely with industry standards and offering greater flexibility for developers.

Yoeven Khemlani [23:16]: "We're shifting to a token-based pricing where we're estimating it to be around $1.40 per million tokens."

Additionally, Jigsaw Stack will introduce a free tier, providing developers with a generous number of free tokens monthly to encourage adoption and experimentation.
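
As a rough worked example of the token-based estimate quoted above, cost scales linearly with token count. The sketch below assumes the approximate figure of $1.40 per million tokens from the episode; it is not a published price, and the function name `estimate_cost_usd` is hypothetical. The free tier is not modeled because no allowance is stated.

```python
# Estimate quoted in the episode ("around $1.40 per million tokens"),
# used here as an assumption rather than an official price.
PRICE_PER_MILLION_TOKENS_USD = 1.40


def estimate_cost_usd(
    tokens: int, price_per_million: float = PRICE_PER_MILLION_TOKENS_USD
) -> float:
    """Return the estimated cost in USD for a given number of tokens."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return tokens / 1_000_000 * price_per_million
```

Under this assumption, a workload of 250,000 tokens would come to roughly $0.35.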

10. Community and Developer Usage

Jigsaw Stack primarily targets startups and indie developers who value high-quality developer tools. The company fosters a close-knit community by actively engaging with users, addressing bugs promptly, and iterating based on real-time feedback. Hackathons and direct interactions with developers play a crucial role in shaping the product and ensuring it meets the needs of its user base.

Yoeven Khemlani [28:20]: "The feedback loop is really good, like in real time when you do stuff like that from the startup community."

Yoeven notes that while startups hold high expectations, especially in the U.S., they also exhibit a forgiving nature when it comes to documentation issues, focusing instead on critical functionality like uptime and reliability.

11. Future Roadmap and Product Direction

Looking ahead, Jigsaw Stack plans to deepen its focus on two primary areas:

Data Extraction: Continuing to enhance models for OCR, segmentation, object detection, and other extraction tasks.
Data Transformation: Expanding capabilities in translation and other data manipulation processes.

Instead of broadening the product range, Yoeven emphasizes improving the quality and efficiency of existing models, enhancing developer experience, and optimizing infrastructure for better performance.

Yoeven Khemlani [37:07]: "We're going super deep into some of the detection space and the embedding space."

12. Funding and Team Growth

Jigsaw Stack is raising $1.5 million in funding to support the growth of a lean team focused on product excellence. Yoeven plans to expand to a five-member team, maintaining agility while advancing the platform's capabilities.

Yoeven Khemlani [35:23]: "We're raising one and a half million with the goal to grow the team to like a five-man team."

The company is actively hiring, particularly seeking a founding full-stack AI engineer who showcases passion through side projects, reflecting Jigsaw Stack's commitment to building a stellar team.

Yoeven Khemlani [37:54]: "We only hire star players. We have three criteria... Do you have a side project?"

13. Conclusion

Yoeven Khemlani's insights into Jigsaw Stack reveal a focused and innovative approach to utilizing small AI models for backend automation. By prioritizing speed, cost-efficiency, and developer-friendly experiences, Jigsaw Stack is poised to offer robust alternatives to large LLMs, catering especially to startups and developers seeking reliable, scalable AI solutions.

Yoeven Khemlani [39:40]: "Solo founders just have to build their team better and that's the only challenge. It's not a pain point for me."

As the company transitions out of its beta phase and continues to enhance its product offerings, listeners can expect Jigsaw Stack to play a significant role in the evolving landscape of developer tools and backend automation.


Transcript

[00:00]
Yevan Kemlani
Jigsaw Stack is a startup that develops a suite of custom small models for tasks such as scraping, forecasting, vOCR and translation. The platform is designed to support collaborative knowledge work, especially in research-heavy or strategy-driven environments. Yoeven Khemlani is the founder of Jigsaw Stack and he joins the podcast with Gregor Vand to talk about making use of small models for diverse applications. Gregor Vand is a CTO and founder currently working at the intersection of communication, security and AI and is based in Singapore. His latest venture, Wintik AI, reimagines what email can be in the AI era. For more on Gregor, find him at vand.hk or on LinkedIn.
[00:59]
Gregor Van
Hi, welcome to Software Engineering Daily. My guest today is Yoeven Khemlani. Welcome, Yoeven.
[01:04]
Yevan Kemlani
Hey. Hey. Good to be here.
[01:06]
Gregor Van
So Yoeven, you are the founder of Jigsaw Stack, ultimately a Singapore company, but you're now in the US, which is a pretty well-trodden path I think for this part of the world. So before we get into Jigsaw Stack, which I cannot wait to get into, it's a very interesting product and very timely. I'd love to just hear a bit more about your background. I know this is, as they say, not your first rodeo when it comes to having founded a company. So yeah, just tell us about your journey to Jigsaw Stack.
[01:34]
Yevan Kemlani
Yeah, I think it didn't start very exciting. It's like every other engineer, I'm an engineer like everyone else. I love building products, I love exploring new technologies and that's kind of like where I started. So I started as a game developer, which many choose not to do because it's one of the hardest industries to kind of break into from a revenue standpoint or from like a salary standpoint, like your initial career. But it's something I enjoyed, did that for a bit. Loved building, didn't love the industry, went into like the banking industry, didn't love the corporate industry, decided to leave that. And as I went to study my master's at Imperial, Covid hit right at that moment. So I decided to drop out. I didn't want to pay $200,000 just to sit in my dorm room. So I decided to drop out. And eventually that's when I got inspired to start this company called Stay R Way back, I think it was like four years ago, it was a hotel aggregation company in Southeast Asia that basically did short-term bookings for like the pool, the gym and different things within the hotel because domestic markets were growing back then. Then that grew to a point where we went into co-working spaces. And eventually we sold it. And I think that kind of kicked off my path of like, okay, startups is like where I want to be. I love building and I love building for people, right? It's like, how can I make money from the things that I build? And that's when I realized that I specifically loved building tech for developers and tools and software. And that's when I went into this rabbit hole of what should I do next? And this was like during GPT-3 when it came out, I was like, AI space is getting interesting and a lot of the technology being built is for front-end, human-in-the-loop applications. Right? Chatbots or tools that people can use in the loop. And I was like, can we bring this technology to backend applications where there's no humans in the loop? 
It just works and takes away processes that used to require a lot of manual intervention. The most common thing right now is like web scraping, right? Like everybody's trying to scrape the web. Can we structure it in a way where you don't need to write Puppeteer code anymore or Playwright code anymore? And basically you just give a URL and you prompt fields that you want to extract and it does the work for you at 98% accuracy, not just markdown, like actual fields that you can pull out. So that's the pain point that we realized: can we automate these backend tasks? And we went down this rabbit hole of fine-tuning models and training and how can we increase the accuracy to 97, 98 and then get it up there? So that's kind of like how Jigsaw Stack started. And then we started narrowing our field there.
[04:02]
Gregor Van
Nice. So I think you've kind of hinted at it there, but Jigsaw Stack, what is in sort of, I don't know, one sentence, what is Jigsaw Stack?
[04:10]
Yevan Kemlani
So Jigsaw Stack is a suite of small models to automate your backend task. That's the way I like to phrase it.
[04:16]
Gregor Van
Awesome. Okay, so let's kind of dive just into that you kind of already touched on. There was a pain point around pure web scraping. But then I guess when did that morph into small models versus just going down like, oh, we're just going to make a web scraping model? Basically.
[04:32]
Yevan Kemlani
Yeah. So the first way we tried to tackle this problem is like, can we use GPT-3 or GPT-4, one of the big, large LLMs, the best in the class at that point, to solve this problem? Most of the time we get good solutions for very human-in-the-loop things. You get markdown generated from a website that you can use to pass into a chatbot to then, you know, talk with that markdown. But you don't get structured data that you can actually use. For example, if you want to do pricing comparisons between two products on Amazon, typically a developer will need to write some code to kind of extract that specific price field that they need. And OpenAI, or any kind of scraper tool, can't achieve that at the same accuracy a developer would be able to. So that's when we realized that you had to fine-tune. So when we started fine-tuning, we tried 400B models like the Llama 3.1 at that time and then we started scaling down. We said, can we take the same and bring it to a 70B model? Because then we reduce our cost, increase efficiency. And then we started going down even further, like can we bring that down to a 13B model? 13B didn't work really well. So now we're kind of on average at like the 70B scale. And that's where we see it's easy to deploy, it's cheap. And since we're so specialized in that one use case, we can get rid of a lot of the other generic use cases built into the model and kind of retrain a lot of it specifically for that one thing. So that's where the small model kind of utility really came in.
[05:53]
Gregor Van
Nice. So let's sort of maybe go through a few, I guess. Well, you have many small models is my understanding. So could you kind of go through a few of those you've touched on kind of web scraping that which is still very much in the product today, I believe. What are some of the other ones?
[06:07]
Yevan Kemlani
Yeah, so when we launched the web scraping, that blew up and it kind of opened this door of like, where should we go next? We didn't want to dive super deep into web scraping because we realized the technology that we built for web scraping, it was 50% the model, 50% the infrastructure, because we trained the model specifically for Puppeteer, for example, so it controls Puppeteer and injects JavaScript code. So now the question is like, how can we broaden the scope to do other forms of data extraction? And that's when we explored data extraction as one pillar. So OCR is an example of pulling out text from things. A traditional OCR, like optical character recognition, uses machine learning to kind of recognize the characters. A very traditional way of doing it and a very inaccurate way of doing it. Now we combined it with today's tool of vision LLMs, and that gives us a whole new realm of OCR with bounding boxes and these kind of tool sets. And then we went into things like speech-to-text. It's basically extracting text from audio. That's where we took Whisper 3, optimized the shit out of it and basically made it one of the fastest speech-to-text models. And we're actually faster than Groq at this point. So that's the direction we were taking. So data extraction is just one example. Eventually we realized that generative or data transformation was another big pillar. A lot of the users were asking like, hey, I scraped this chunk of data from let's say a Mexican passport, like they were doing passport verifications with our OCR, and then now they need to translate some of the information. Translation became a huge pillar by itself where people were moving away from Google Translate and the typical translation providers because of the quality of the LLMs that exist today. So can we take a model, train it specifically for translation, and then add languages as we go and then become a specialized translation model? 
So that's what we basically started to do for data transformation as well.
[07:56]
Gregor Van
And I saw you give a presentation a couple weeks ago and you were talking about also about translating text in images. That's a kind of interesting challenge, right?
[08:04]
Yevan Kemlani
It is. I think people have seen this in consumer apps like Google where you point the app at a match and it kind of like overlays a bunch of text in a different language. They have this in Apple, Microsoft, and we were looking into this and we're like, we're surprised there's no API for that. Literally all these cloud providers don't provide an API for that specific service. I think one reason could be because it's not high quality. They're actually just overlaying like a blurred text on top of the current text. So what we are exploring is that we saw that image ads and image-generated content is becoming a big thing in the market, right. Is there a way that we can take an existing image, understand the text on that image, then translate it and diffuse it back with the same style? Right. So basically don't affect the style and the shape of the text. And that's something we're exploring. So with diffusion you can really do a lot of that. So you can train a model on glyphs, on text glyphs, to then basically translate things. So I think we're like two weeks away from version one, and then version two is going to be the full diffusion model that we're going to launch. So right now we're going for English and Chinese as the two major languages to kind of diffuse against. And then we're going to include Hindi and Arabic and different languages.
[09:11]
Gregor Van
Awesome. I'd like to touch on another, I guess part of the product offering and you can correct me whether this is sort of an amalgamation of bits of the product or totally standalone. But the prompt engine, that's something that we've talked a bit about before previously outside the podcast and I had a sort of specific use case which I can always talk about. But could you just describe what is prompt engine and like, what is that solving?
[09:34]
Yevan Kemlani
So Prompt Engine, it solves three big problems. First is prompt management, which there's a lot of providers in the market that do that for you. The second is prompt model routing. Again, another suite of products in the market that does that for you. And the last is prompt techniques, techniques that you can apply to obviously increase the quality of the output, reduce your token cost and all these other things. So we see in the market there's so many challenges for developers using three different products just to solve this. Like, they use LangChain, plus they use another product to track their tokens and another product to evaluate their model outputs and stuff. So what we asked is, can we train a really small model, like 1B, 13B size, to basically make these decisions for you at runtime? So we took different data sets in different industries, law, history, English, mathematics, and we trained a really small model that can make decisions on which model to pick at runtime based on your input of your prompt. So whenever you give an input, say, generate a poem, it will understand. Okay, it's a writing-focused product. It needs to generate poems. Then we have a data set that it has been trained on. Okay, so poems are really good with GPT 4.5. Like they're really good at writing. It'll automatically pick that model for you, but it will also run the same prompt against like five other models. And when you get the output, every single time you run that same prompt, it narrows down the best model based on this concept called mixture of agents. Have you heard of that idea? It basically runs it and then uses LLMs as a judge to basically say, hey, three out of five of these give the same output. Typically this is going to be the correct output. And then we converge all these three into one output and then it gives that. So it's a chain of techniques that kind of basically gives you this thing. 
So from a user perspective, you don't pick Claude 3.7 or GPT 4.5 or Gemini 2. You basically get everything and the selector model just picks that for you. And you can also store the prompts and rerun it and stuff.
[11:37]
Gregor Van
Yeah, I mean, the example I spoke with you about previously was around time zones. And I think we all know in programming, time and time zones is still a hilariously difficult thing to handle across various scenarios. So that was kind of what I at least ran through Prompt Engine initially to kind of get an understanding of what is this product and how can it actually help us. And it was super interesting because it was far and away better from a results perspective than if I put it into just pure GPT, and I'm talking about GPT over the API, or just go to Claude and put it in there. You go to Prompt Engine and say, hey, we've got three people in three different time zones. Here's a few rules though, like don't start a meeting before 6am for any of these participants. Strangely, the other two really struggle with this. You have to keep correcting them. They come up with, oh, here's the three times. Yeah, one's at 5:00am. It's like, I explicitly said not before 6:00am. Oh, yes, you're right. And then duh, duh, duh, duh. Whereas Prompt Engine came back pretty reliably covering that. So, I mean, can you sort of help explain why it's able to produce that result?
[12:47]
Yevan Kemlani
So when you send your request, what happened behind the scenes is basically it sent a request across five models. Right. And typically what would happen is that maybe two or three out of those five models gave the right answer. And then we have the smaller model that we trained that acts as the judge. Right. So small models are really good at consuming data, but are horrible at generating. So they can make a decision if you give it a lot of information, to be like, yeah, this is the right one based on its context. But it's really bad at saying, hey, I can give you the right answer. So if you tell it, okay, these are five of the answers, pick the best answer, it automatically uses a weighted average and is like, okay, these two answers are coming up more often. Plus, from my understanding, this is the initial prompt, this is the answer. And it will automatically start merging those top two answers together. So what happens is that now you're always guaranteed the best answer and you're not going to get the worst answer, because 2 out of 3 or 3 out of 5 of the models are basically giving you that same answer or similar answers and it merges the output. So even if one model adds, let's say, AM and the other model adds PM, when you combine it, it overwrites based on the initial prompt. So mixture of agents is just one technique that we apply on it. Recently, using the o1 concept of chain of thought is another thing, another idea. But implementing an entire model like R1 or o1-mini or o3-mini into Prompt Engine is going to take tons of token cost and take a really long time to process. But we applied the same technique of chain of thought into our models from scratch way before the whole DeepSeek R1 came out. But obviously at a smaller scale. So it runs a lot faster and cheaper for the user. So that's why it tends to work more accurately, especially even for structured data. 
So if you ask it for a structured response, you're almost guaranteed to get a structured response.
[14:35]
Gregor Van
Very simple, I think. One thing I'm curious about though: a lot of users who are relying on LLMs, they're not actually tuning models themselves. That's kind of usually a sort of secondary exercise once you have proved product-market fit or so forth. And clearly the next step is to then tune a model and become more deterministic for their use case. So at least where I sit, I kind of get used to understanding the likelihood of what's going to come back from a specific model. For example, if I look at the use case we have, it doesn't really matter the details of it, but if I put that use case to a Llama model versus to GPT, I get very different responses back and I'm like, okay, we're always going to go with GPT for this one because I'm just confident in what's coming back. In the Prompt Engine example, how does that work, in the sense that the result might be, quote, better, but I guess would the model behind the scenes have flipped? I'm just trying to get my head around, like, how much can the same prompt going to Prompt Engine end up then changing, I don't want to say from a quality perspective, because quality clearly is what the product's about, but changing from a sort of structure perspective, or like, oh, that's completely different to what I was expecting initially?
[15:45]
Yevan Kemlani
Yeah, 100%. That's an interesting question. So the way we do that, we have two parts to this system. You create a prompt first and then you execute that prompt, because that's where the prompt management layer comes in. The reason we suggest the user always create the prompt and then run it: with the typical LLM, you just execute the prompt, right? Like you just generate from it. But in our approach, you create, because you're storing this thing. There's a level of memory that takes place, meaning every time that you run your prompt, we store a generated version of that prompt, of like how it outputted every single time. So let's say if you run this prompt 6, 7, 8, 10 times, it basically will always store the version of the output previous to the next run in the database. And then we use that as a baseline as well. So if you keep changing your prompt and you keep tuning your prompt, we know clearly something is wrong. But if you kind of run the same execution or the same prompt, then we know, okay, hey, this is positive, this is positive, this is positive. This is how the structure should kind of look like. This is the output that it should kind of look like. So we give it some form of point system at the back where we're saying, this is good, this is bad, this is good, this is bad. And the more you run, the more accurate it gets actually, and the less models that we run against. So initially it runs against five. By the time you're on your fifth run, it's running against two models. By your tenth run, you're running just one model and we stick to that model, so you don't change your model after your 10th run. So let's say it automatically picks GPT-4o and GPT-4o keeps being the best model for your answer. It will just stick to GPT-4o throughout the lifetime of that prompt that you created so that you don't have changes and like breaking changes as you scale, unless the model fails. 
So that's where we fall back. So let's say the model is down and then we fall back to another model.
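
The narrowing and fallback behavior described in this answer can be sketched as follows. This is a minimal illustration, not Jigsaw Stack's implementation: the transcript only gives three data points (five models at first, two by the fifth run, one by the tenth), so the cut-offs between those points are an assumption, and `models_to_query` and `call_with_fallback` are hypothetical names.

```python
from typing import Callable, Sequence


def models_to_query(run_number: int) -> int:
    """How many models to fan out to on the Nth run of a stored prompt.

    Assumed schedule: runs 1-4 use five models, runs 5-9 use two,
    and run 10 onward sticks to a single model.
    """
    if run_number < 1:
        raise ValueError("run_number starts at 1")
    if run_number >= 10:
        return 1
    if run_number >= 5:
        return 2
    return 5


def call_with_fallback(
    prompt: str, ranked_models: Sequence[Callable[[str], str]]
) -> str:
    """Try the preferred model first; move to the next one only if it fails."""
    last_error = None
    for model in ranked_models:
        try:
            return model(prompt)
        except Exception as exc:  # e.g. the provider is down
            last_error = exc
    raise RuntimeError("all models failed") from last_error
```

Pinning to one model after enough runs keeps the output shape stable for a stored prompt, while the fallback list only comes into play when the pinned model is unavailable.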
[17:27]
Gregor Van
Nice.
[17:29]
Yevan Kemlani
This episode is sponsored by Mailtrap, an email platform developers love. Go for fast email delivery, high inboxing rates and live 24/7 expert support. Get 20% off for all plans with our promo code sedaily.
[17:47]
Gregor Van
Something you touched on earlier, and I know is sort of quite a big piece of the product when it comes to, I guess, advertising this to developers, is speed. Can you talk a bit about how that works? Let's just talk about small models versus large models maybe specifically first, and then move on to Jigsaw Stack's infrastructure and generally how is that making things so much faster?
[18:06]
Yevan Kemlani
So we started with the idea of GPU poor. I mean we don't have a lot of GPUs, right? We don't even have an H100 physically. A lot of AI companies you see today, the first thing is like, let's buy our own GPUs for training. So we came in with the idea of like, hey, we're going to be GPU poor as much as possible. We need to work on A100s, A10Gs, any of the smaller GPUs. I mean at that time it was not small, but now it's considered small. We have to kind of be restricted to that scale. That was one of the ideas or like the methodology that we stick with. The second is deployability. The biggest issue with I think LLMs or big providers like OpenAI right now is that distribution has been a big blocker for them. They are distributed with Azure or people have to use OpenAI directly. And that's why you see a bunch of these enterprise AI companies coming out trying to train open source models, because you can't self-host OpenAI's models anywhere, being the best in class. And that was one of the biggest blockers currently. So we looked at this problem like we don't want to be in that realm.
[19:05]
Gregor Van
Right.
[19:05]
Yevan Kemlani
We cannot afford to basically have that as a blocker where enterprises want to use our OCR model and we're like, oh, we can't self-host because we rely on 30 different proprietary APIs that require like this much work. And that's when the small model idea came in. Because most enterprises don't have the resources to run large language models at 400B scale. So if we can focus on training the model to be deployable and cheap to run anywhere, it became an easier distribution thing for us from the get-go. It's hard to build for sure, but in the long run the cost kind of pays off because we own this proprietary model that we can eventually then go out to the market with and be like, we can distribute on your systems easily, no restrictions. You want AWS, you want Azure, you want GCP, you want your own in-house GPU. Or eventually when we get big enough, we can even host on Groq to be even faster. And it's like the kind of opportunity that we have with smaller models. Cost efficiency is a big part of it.
[20:02]
Gregor Van
Yeah, so just so I kind of read that back. Groq is behind the scenes from an inference perspective, or multiple, or for Prompt Engine?
[20:09]
Yevan Kemlani
We use a bunch of Groq under the hood as well. Llama Guard is a feature that they have that's pretty interesting, there's a lot of prompt guarding. And we didn't want to add Llama Guard into our platform without Groq because it's one of the fastest and we didn't want that bottleneck eventually. So we do speak to Groq and the goal is like, you know, if we can get big enough volumes, we can also host. Because the cost of hosting on Groq is huge. You need millions and millions of users to kind of like be using you to even host one small model on Groq. So we want to obviously reach that scale where we can start hosting our models on infrastructure dedicated for small models.
[20:42]
Gregor Van
Okay, so let's, I guess, then talk about just general DevEx here. Obviously we've got a very technical listener base, and this is precisely who should be using Jigsaw Stack. So what does the product look like to a developer on day zero when they're getting up and running?
[20:58]
Yevan Kemlani
Yeah, right now it looks like an npm install. Right. So you just, honestly, npm install Jigsaw Stack, get your API key, that's it. Everything else works. We build the libraries in a way where you don't need the documentation. So as long as you install the library, you should have enough typings that kind of guide you in what you need to do. So from day zero, every field needs to be a named field, everything needs to be described, everything needs to be typed. Even our Python library is fully typed. As a developer, my best experience is using Stripe or a library like Supabase, where you literally type stripe and you get your options: customer, subscribe. We want it to be that intuitive. Right. Where you don't need the documentation in most scenarios. Obviously if you need advanced configs, you go there, but you should be able to get started with just the library. And that's kind of the developer experience thought that we had. Like, can the library just be self-sufficient enough that they npm install, they'll figure it out. They pip install, they'll figure it out. And that's kind of the design structure that we went with.
[22:00]
Gregor Van
Yeah, I like that you mentioned both Stripe and Supabase. I think they're both great reference points. Stripe especially on the API side, and Supabase just for pure DevEx, I think, when it comes to, yeah, what that product is. My understanding is that the API itself has a pretty consistent structure across all the services. Is that the way to think about it, or?
[22:19]
Yevan Kemlani
Yeah, exactly. So we have a system where we structure every API to be pretty consistent in the way you call it and in the way you structure your JSON body to pass that data over. For all the keys, we use the same key across all the APIs so you don't get confused. Everything is just URL or file store key. So there's no confusion in that sense.
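The consistent request shape described here can be sketched in Python. This is only an illustration under assumptions: the key names `url` and `file_store_key`, the `FileSource` class, and the `build_body` helper are hypothetical stand-ins chosen for this sketch, not the library's confirmed API. The idea is that every endpoint accepts the same two ways of pointing at a file, and endpoint-specific options are plain named fields alongside.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class FileSource:
    """Hypothetical shared input: a public URL or a stored-file key, never both."""

    url: Optional[str] = None
    file_store_key: Optional[str] = None

    def __post_init__(self) -> None:
        # Exactly one of the two fields must be provided.
        if (self.url is None) == (self.file_store_key is None):
            raise ValueError("provide exactly one of url or file_store_key")


def build_body(source: FileSource, **options: object) -> Dict[str, object]:
    """Build a JSON-ready body: the shared source key plus named endpoint options."""
    body: Dict[str, object] = {
        key: value for key, value in asdict(source).items() if value is not None
    }
    body.update(options)
    return body


if __name__ == "__main__":
    # The same source object could be reused for any endpoint in this sketch.
    print(build_body(FileSource(url="https://example.com/receipt.png"), language="en"))
```

Because the source keys never change between endpoints, a caller only has to learn the endpoint-specific options, which typed fields can surface in an editor.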
[22:40]
Gregor Van
Nice. And I guess we kind of have to talk about pricing and how you charge for this. I think for most of these products in AI, and GenAI especially, it's a usage-based model. Could you talk a bit about how you have determined, at the moment anyway, the best balance between accessibility for developers and sustainable business economics, that kind of thing? How have you looked at it so far, and how do you see that evolving?
[23:09]
Yevan Kemlani
Yeah, I think it's a good time to announce as well. So we're changing our pricing, right?
[23:12]
Gregor Van
Okay. So we're recording on the 12th of March today. So just the 12th of March.
[23:16]
Yevan Kemlani
Yeah, exactly. So we'll be changing it in like a week or two-ish. The reason is that pricing, I think, is an ongoing change. We have been learning a lot since the time we launched. We've removed so many products as well, and we've added a lot of new products. The reason we've done that is because we speak to customers, speak to developers, and we understand truly what they want. When we started the product, we built the things that we wanted, and now it's based on what developers wanted the whole time. So we reached a point where we realized that the pricing that we have doesn't make sense for a lot of the products that we are offering. The big issue is that, for example, we charge right now 0.05 cents per API call. It made sense for things like the AI scraper and the OCR, because you have a fixed cost that you can scale with and then obviously get discount pricing the more you use. And it worked in competition with providers like AWS and GCP. It's a very similar pricing model: you have a fixed price per API call. What happened is that we started releasing products like speech to text and text to speech. And if you're charged 0.05 cents for transcribing one hour of audio, you're still getting charged the same for five seconds. It didn't make sense to a lot of users at scale, right? And managing discounts manually for each customer didn't make sense. A lot of the users came back to us like, hey, why not just use token-based pricing? I'm like, yeah, why didn't we just use token-based pricing? Every LLM provider is using it. Every developer is used to token-based pricing, and if we can just shift to that kind of idea, it gives us, one, the flexibility to charge for usage in a more minute way. Meaning if you use less, we charge you less; if you use more, based on the output, we'll charge you more accordingly.
Secondly, it lets us explore more technologies, because now we can allow you to configure specific APIs to run for a longer period, because we're not capped by that 0.05 cents underlying cost behind the scenes. So now we're shifting to this token-based pricing, where we're estimating it to be around $1.40 per million tokens, which is actually pretty good, because we're trying to keep ourselves as cheap as possible to most users for scale. Right. So $1.40; I think Claude is at 15 bucks. So $1.40 for infra plus a specialized model is something that we think is fair. Obviously we're still going to test more. We've been getting a lot of feedback. We haven't run a lot of experiments yet because we have not launched it. But when we launch it, we'll get more feedback, figure it out, and we'll still keep adjusting pricing to what developers feel comfortable with at the end of the day.
[25:44]
Gregor Van
And I guess, well, I mean today, but I guess with this future pricing model there'll still be an intro free tier of some description, or?
[25:52]
Yevan Kemlani
Yeah, oh yeah, for sure, 100%. So it'll be like a million free tokens every month.
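To make the pricing comparison concrete, here is a minimal Python sketch of the two billing models discussed. The $1.40 per million tokens and the one million free tokens per month are the estimates quoted in the conversation; how the free allowance is applied (subtracted before billing) and the function names are assumptions made for this sketch. The flat per-call price is left as a parameter because the quoted figure is ambiguous as spoken.

```python
from decimal import Decimal

# Estimates quoted in the conversation, not a published price list.
PRICE_PER_MILLION_TOKENS = Decimal("1.40")  # USD per million tokens
FREE_TOKENS_PER_MONTH = 1_000_000           # monthly free allowance


def flat_cost(calls: int, price_per_call: Decimal) -> Decimal:
    """Per-call pricing: every request costs the same, whatever its size."""
    return calls * price_per_call


def token_cost(tokens_used: int) -> Decimal:
    """Token pricing: bill only tokens above the free allowance (assumed rule)."""
    billable = max(0, tokens_used - FREE_TOKENS_PER_MONTH)
    return Decimal(billable) / Decimal(1_000_000) * PRICE_PER_MILLION_TOKENS


if __name__ == "__main__":
    per_call = Decimal("0.05")  # hypothetical flat price, for illustration only
    # Under flat pricing a five-second clip and a one-hour file cost the same call.
    print("one call, any size:", flat_cost(1, per_call))
    # Under token pricing the bill follows usage.
    for tokens in (200_000, 1_000_000, 3_000_000):
        print(tokens, "tokens ->", token_cost(tokens))
```

The point of the comparison is the one made in the conversation: a flat price cannot distinguish a tiny request from a huge one, while a per-token price scales with the work actually done.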
[25:57]
Gregor Van
This is sort of not pricing exactly, but it is related like I guess context window. I'm trying to get my head around how does context size play with small models? Because I guess if I think about, without you saying anything further, I think, well, context window, small model, surely it's just a really small context window. But what is it?
[26:14]
Yevan Kemlani
Yeah, it is a small context window, but we don't really rely on the context window as much. They're not all language models. A lot of them are models trained to control infrastructure. So take the AI scraper example I gave, which basically takes in fields, right? Like, I need the price, I need the description of this place, I need the image on the left side of the whatever. So very small context on the input, huge output on the output level. That's where you see the model basically generate JS Puppeteer code, which gets injected into Puppeteer, and then Puppeteer outputs a huge context. But it doesn't need to go back to the model.
[26:55]
Gregor Van
Right.
[26:56]
Yevan Kemlani
And that goes to the user, and that context gets easily post-processed. So context is not a big problem for us, because ours is more of an infra-side problem. Obviously, if you're building a chatbot or some form of chat system that requires a huge amount of context, then yes, I think context makes a big difference, typically for most chat applications or human-in-the-loop applications. But we've never really had a context problem at this point.
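The data flow described above, small input to the model and large output straight to the user, can be sketched in Python. Everything here is a stand-in: `generate_script` plays the role of the small model, `run_script` plays the role of the browser engine, and the generated code is Python rather than the JavaScript the conversation mentions, so that the sketch runs without a browser. None of these names come from the product itself.

```python
from typing import Callable, Dict, List


def generate_script(fields: List[str]) -> str:
    """Stand-in for the small model: turn a short field list into automation code."""
    return "\n".join(f"result[{name!r}] = extract({name!r})" for name in fields)


def run_script(script: str, extract: Callable[[str], str]) -> Dict[str, str]:
    """Stand-in for the browser engine: run the generated code against a page."""
    result: Dict[str, str] = {}
    # exec only mimics "injecting" generated code; real generated code
    # would need sandboxing before being executed.
    exec(script, {"extract": extract, "result": result})
    return result


def scrape(fields: List[str], extract: Callable[[str], str]) -> Dict[str, str]:
    """One model call on a tiny input; the large result never returns to the model."""
    script = generate_script(fields)    # small context in
    return run_script(script, extract)  # large payload out, straight to the caller


if __name__ == "__main__":
    fake_page = {"price": "$120", "description": "A long description. " * 500}
    data = scrape(["price", "description"], fake_page.__getitem__)
    print(len(", ".join(fake_page)), "chars in;", sum(map(len, data.values())), "chars out")
```

Because the model only ever sees the field names, the size of the scraped page has no bearing on the model's context window in this arrangement.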
[27:22]
Gregor Van
Yeah, yeah, okay, that's super interesting.
[28:00]
Gregor Van
So I want to talk about the future of the product as well as, currently, the developer community that you have. So let's start actually with current, which is the developer community who's using it. I'm aware you guys did a hackathon maybe a couple of weeks ago, which looked very interesting. So what were some of the results of that, what is the current community that is using Jigsaw Stack, and how do they talk to each other?
[28:21]
Yevan Kemlani
I think there's two big communities. One is the typical startup, indie hacker kind of devs, you know, people who love the product, love to play with it, try it out. But I think for a devtool, like Supabase, Vercel, or any product on the market, we always face this kind of problem of when you need the product versus when you discover the product. Right? So a lot of times you discover the product, but you don't need it at that point, and then when you need it, you forget about that product. So existing constantly in front of developers, to the point where they need it and then get reminded of us, is a key thing for us. So that's why we target a lot of startups. While they might not need us today, they enjoy the product. Eventually, when their startup scales or they build a product that does need Jigsaw Stack, we're the first one they give a try, give a shot, or compare against GCP and AWS, the competitors that we're going against, and then basically be like, hey, yep, this is a better product, let me try it out, let me go for it. So startups are our key thing. So a Series A company is the smallest at this point. We do one or two enterprises, but enterprises are a bit more challenging for us at scale. We're happy to do it, but we are still a small team. It's like three to five of us at this point. And as we scale, we'll take on more and more larger companies that are Series A and bigger. They have special needs, like, you know, they need it to be deployed only in Australia or only in specific parts of the US and stuff like that, which we want to do at a later scale. But right now it's mostly startups, and we get better feedback, to be honest.
[29:49]
Gregor Van
Right.
[29:50]
Yevan Kemlani
Like at this point, because it's so high. Like the hackathon. Because of the hackathon, on that day, I was not even building. I was building Jigsaw Stack. Everybody came to me and they were like, hey, I have this bug over here. Or, I didn't even know this field existed over here for the AI scraper. And I'm like, okay, updating the documentation and fixing this field. So the feedback loop is really good, like in real time, when you do stuff like that with the startup community. And that's what I look for.
[30:15]
Gregor Van
The startup community has, I think, both a very high bar in some places and quite a low bar in others. When you look at the product, there's a very high bar for "this should just work, I should be able to get running in 10 seconds", and then a fairly low bar when it comes to, oh, maybe this piece of documentation isn't great, but that's okay. Yeah. Is that kind of how it feels?
[30:35]
Yevan Kemlani
Oh, for sure. It depends on where you talk about it. On Reddit, the bar is very high. Everyone's so anonymous. They're like, this product is shit. And I'm like, oh my God. You speak to the same guy in real life, he's like, oh, yeah, no, I love your products. Sometimes you don't know. But yeah, I think you're right, basically. Developers are very intuitive. I think the more senior a developer you speak to, or a developer that's comfortable with their skill, they're very intuitive. So if the product works, or they can figure it out, they love to fix things. Right. So if they can fix the product for you, they would, right? That's why, if the documents are broken or something, it's not that big of a problem for most engineers, right? And that's what they're used to. So they will go and they'll take screenshots and be like, can you fix this? Can you update this? And even the biggest of companies, like Google, Vertex AI doesn't work half the time. So the expectation from a startup is.
[31:33]
Gregor Van
I have experience of that, and just thinking like, no, surely it's not actually Google, just the API doesn't work right now. No, it doesn't.
[31:42]
Yevan Kemlani
Exactly. If Google can do that. Like, you know, the expectation of a Jigsaw-style startup, surprisingly, from companies is actually a lot higher. Like in the US, with a lot of the startups that we meet there, the expectation of a startup is like, we expect you to be way better than GCP.
[31:56]
Gregor Van
Right.
[31:56]
Yevan Kemlani
Like, I think in Southeast Asia it's a bit different, where they're like, yeah, I can expect you to be worse than GCP because, you know, you're a small guy, which is good. I kind of like the expectation in the US, where the expectation is like, hey, I expect you to be better than GCP, and that's why I'm going with you in the first place. I'm not going with GCP. So downtime and these things are very important to us, and the clarity and feedback that we get from a lot of these companies and engineers is very upfront and very clear, and they only complain about the things that are real problems. If it's actually down, they will complain about it. If the docs are wrong, it's not a big issue; they will just come with a screenshot, can you fix that? And they've solved the problem already. So it's a forgiving community when it's the right thing that goes down.
[32:38]
Gregor Van
And I guess now just looking at the future and, obviously, roadmap, and obviously it's super early days for you guys. So roadmap is always one of these very difficult things, sometimes, I think, to pin down. But I mean, for example, I saw a post I think you guys put out very recently about how you compare to, say, Mistral when it comes to OCR. Because Mistral had this big splash with how their OCR is in theory far and away better than anyone else. And you made some good comparisons of why, in many cases, I think you argue that you're actually better. So is the future direction sort of, let's take these cases that the big guys are doing and try and perfect them? Or are there just completely other use cases you're adding, or thinking of adding?
[33:18]
Yevan Kemlani
So we want to focus on two big pillars. The first is data extraction, the second is data transformation. So we're going to stick in this realm. So as long as it falls within this realm, we're competing with any big guy that comes into this market. Right? Mistral was never in this market. They were always in the LLM market. And then one day they dropped an OCR model and we're like, oh, okay, cool. They want to get into the small model kind of space as well. And that's when it got exciting. So I'm like, let's try it out. And after trying it out, it wasn't as good as they claimed it to be. In the title of their article, they used the words "World's Best". I don't like shitting on certain things, and I love Mistral. I especially love the first few models that they launched. They kind of initiated a lot of the open source scene, way before Llama and those guys did. So I love them for what they did in the open source world. But when they did this Mistral OCR, I'm like, "World's Best", like, really? So I re-benchmarked it, and it was very clearly far apart from the world's best. And so I just had to put it out there and kind of show like, okay, hey, there's a lot of room for growth, and I'm happy that they're doing it, and it kind of ups the OCR market. So let's actually be the world's best and then claim that title, right? So we did the benchmark. We're like, hey, yeah, it took us a lot of time to get our OCR model out, and we were surprised that there's a "World's Best". Especially when Google, with Gemini-level resources, couldn't be the world's best, I'm surprised that Mistral did. So we were benchmarking it and we were like, oh yeah, so a lot of room, a lot of room for improvement, and I mean.
[34:54]
Gregor Van
In terms of future of just the company team. Very recent announcement, a couple of days before us recording this, funding has increased, which is awesome. And what is that going to enable you to do? And how do you see the team going up or not going up? I mean, obviously we're in the golden age of small teams now, which I love because I've actually always been a small team person and was very frustrated through the 2010-20 period when you had to keep defending why you had a small team. So how does it look for you guys for the next couple years?
[35:23]
Yevan Kemlani
So we raised one and a half million. The goal is to grow the team to like a five-man team, including myself, keep it very lean, and then the goal is to get the product to a solid standpoint. We're in beta still. We want to get ourselves out of beta. We are launching two significant products. The first one is the embedding model. It's a multimodal embedding model. We realized that embedding is only text based, but everybody embeds PDFs, images and a bunch of different documents. We need an embedding model that can support all these document types natively. So we're launching that, and the image-to-image translation that we spoke about. One new idea that we had in the data extraction space: I think Microsoft released a segmentation model where you can segment buttons and different fields in a site. Can we combine that with object detection and other forms of detection into a single model? Right, so segmentation, object detection and a few things in one. And that's something that we're exploring. So we're going super deep into some of the detection space and the embedding space. So the next, I think, one year is about diving way deeper into the technologies that we've built, improving the quality of each thing rather than scaling into more products. This is kind of where we are stopping the product roadmap. Now we're just going to go deeper. How can we improve the developer experience? How can we make it even more seamless? Can we improve our AI scraper to make it even cheaper and faster by updating the engine under the hood? There are so many new Puppeteer alternative engines coming out that are faster to run. Can we scale that, make that faster? So I think the next year is really about the quality of the product, the developer experience, how deep we can go, and distribution, rather than scaling the product roadmap. So the products I shared with you right now are basically going to be the last three additions, and after that we're just going deep.
[37:08]
Gregor Van
That's a great strategy I think for anyone listening and kind of getting going with the product. That's nice to know in the sense that what you're using is just going to improve and you've already been through that phase of shaking out what bits actually make up the product and what kind of gets people excited and obviously useful. Are you going to be hiring? We've got a great technical listener base because it's great if you say yes or no because then you either don't get flooded with emails or you do.
[37:30]
Yevan Kemlani
But just, yeah, go to jigsawstack.com/careers. We are always hiring at this point, and the significant role we're hiring for right now is a founding full stack AI engineer. We only hire A-star players. We have three questions. One of the questions is, do you have a side project? That's my benchmark as an engineer. If you don't have a side project you're working on for yourself or for fun, then you don't need to apply. Just don't apply.
[37:54]
Gregor Van
I think that's great. Probably a caveat there is if you're already working on a startup but then realize this is a better opportunity that effectively is your side project, right? Yeah. If you're perhaps in a sort of regular role but itching to kind of get into something more interesting, side project is always kind of looked upon very favorably by people like yourselves. Just a kind of I guess closing question generally was I think you founded this effectively solo, is that correct?
[38:19]
Yevan Kemlani
Yeah, I'm a solo founder. It's not that I wanted to be a solo founder. In my previous company I was the technical one, because both my co-founders were non-technical. When I go for a non-technical product that I'm building, it's easy to find a lot of co-founders that specialize in a specific space. I think in the same way that non-technical founders find it challenging to find a technical founder, it's difficult for a technical founder to find another technical founder. It's even more difficult, because the issue is that you have a belief system of how things are built, and you're always kind of in this realm of the way you see things, so finding another technical founder is even more challenging. I've tried, couldn't find a perfect fit, and that's why I'm like, let's just find the specific roles, and then eventually those people become co-founders. Right. And that's why I'm hiring the founding team that you eventually scale. You get equity and your equity grows. And I can give away, honestly, more equity because I'm a solo founder. So that's a good benefit there.
[39:17]
Gregor Van
I think it's great to call out. I mean, there's just. There's so much out there basically saying, like, almost don't even try unless you find your co founder. And I think there's quite a few examples in the last couple of years where I've seen solo founders doing incredibly well. So I just kind of thought it was good to call out.
[39:34]
Yevan Kemlani
So 100%. Like, I don't think I'm really solo. I have a really great team. I don't feel alone in the company. And that's the thing.
[39:41]
Gregor Van
Right.
[39:41]
Yevan Kemlani
Like, I think solo founders just have to build their team better and that's like the only challenge. And so it's not a pain point. I think. It's not a big pain point for me.
[39:49]
Gregor Van
Yeah, yeah, absolutely. And yeah, I mean, you sort of founded it effectively in Singapore, I believe. Kind of moving. Well, is it west or east? It's kind of equal distance, pretty much. Well, let's just say you're going east of the States. Is that right?
[40:02]
Yevan Kemlani
Yeah, yeah. So moving to San Francisco. One big reason is that when we launched in Singapore, naturally I'm here, so it's easy to launch here. We put it on Hacker News, we put it on Reddit, we put it everywhere we could. It's just that the majority of our customers and users are from the US. At this point, at least at the initial stage, we want to focus on selling to US customers or in that region. Our second biggest market is actually the UK, and so we just think being in that region makes more sense for me to speak to customers, get a better feedback loop, and just be in that same energy and at the forefront of technology in that space. Southeast Asia is still a big market, obviously, but we'd require a lot more capital to tackle Southeast Asia, and that's something that we want to come back to down the line.
[40:42]
Gregor Van
Yeah, for sure. Well, Yevan, it's been so great to catch up and to hear all about Jigsaw Stack at this stage in the journey. I'm sure we'll catch up again in time, when Jigsaw Stack is probably 10 times the size in only a couple of years or something like that. So, yeah, just wishing you guys all the best, and we'll be following along 100%.
[41:00]
Yevan Kemlani
Thanks for having me on.