
A
Modern software systems are composed of many independent microservices spanning front ends, back ends, APIs, and AI models, and coordinating and scaling them reliably is a constant challenge. A workflow orchestration platform addresses this by providing a structured framework to define, execute, and monitor complex workflows with resilience and clarity. Orkes is an enterprise-scale agentic orchestration platform that builds on the open source Conductor project, which was pioneered at Netflix. The platform coordinates AI agents, humans, and APIs with a focus on scalability, compliance, and trust. It further expands on the Conductor core by adding features like security, governance, and long-running workflows. Viren Baraiya is the founder and CTO at Orkes and the creator of Netflix Conductor. Viren joins the show with Gregor Vand to talk about building Conductor at Netflix, the challenge of orchestrating microservices, rule-based versus programmatic workflow orchestration, agentic orchestration, MCP integration, and much more. Gregor Vand is a security-focused technologist, having previously been a CTO across cybersecurity, cyber insurance, and general software engineering companies. He is based in Singapore and can be found via his profile at vand.hk or on LinkedIn.
B
Hello and welcome to Software Engineering Daily. My guest today is Viren Baraiya from Orkes. We're going to be talking all about what Orkes does, and especially Orkes Conductor. Some of you may already know the name Conductor from some other companies, and we're going to be talking about that too. But yeah, welcome, Viren.
C
Thanks Gregor for having me here and I'm very excited to be here.
B
Yeah, so as is tradition on this podcast, we like to get an understanding of your background. You've definitely worked at quite a few interesting companies that I think our audience will be fairly familiar with. But yeah, just walk us through your career path, I guess, to Orkes.
C
Yeah, I started Orkes close to four years ago now, and prior to Orkes I spent time in a very different set of industries, right? Before Orkes I spent a few years at Google, mostly working on developer products, so Firebase and Google Play. This is one place where I got to work with developers as one of the audiences: how do they interact with systems, and how do you build for them? That was also one of the things that motivated me to build Orkes. More interestingly, before Google I spent my time at Netflix, where I was part of an infrastructure team that was responsible for building the platform for Netflix's, back then, very ambitious studio project. Basically the goal was to build the largest production studio in the world. To enable that, there is, of course, the whole production side, but at the same time there was the engineering side in terms of building out the products and tools. And my team was responsible for building out the platform. This is where Conductor originated, amongst many other things that we built. And interestingly, before I moved to Netflix, I was in a very different kind of industry. I spent almost six years at Goldman Sachs working on the investment banking technology side. That was a totally different experience altogether. So I was on the East Coast, moved to the West Coast, all the way from investment banking to entertainment, then consumer internet, and now SaaS.
B
Yeah, yeah, that's very interesting. I don't think we have many guests who have actually been in investment banking tech and then managed to move into, I guess, Silicon Valley tech. So that's super interesting. And obviously Netflix, especially in the years that you were there, that was such a pivotal time for that company. And speaking of Netflix, some of the audience may be familiar with Netflix Conductor, and I think we should talk about that to begin with. So you were working on that part of the technology. But let's just talk about what Netflix Conductor was, what challenge it was solving, and what that was all about.
C
Yeah. So if you look at Netflix's history, right? Netflix has historically been a pioneer when it comes to frontier technology. They were one of the very first big tech companies to be completely on the cloud. They invested very heavily in microservices. And when I joined, one of the things that was very surprising was how fully it had embraced microservices. There were a lot of microservices, and that enabled teams to move very fast. At the same time, one of the biggest challenges was how do you coordinate the work across different microservices, right? Because by definition, a microservice is not going to implement the entire business flow. It just implements part of it, which means that you have to stitch together multiple microservices. And the traditional way of doing that is through some sort of eventing system. Back in the day, that used to be message buses, service buses, and so forth. And given Netflix and its scale, we had our own internal implementation of the service bus, pushing through billions of messages a day. And that worked very well. However, the major issue that you run into with something like that is not when things are working fine, but when things are not working fine. This is where you start to see the brittleness of the system. What happens when something is not working? To understand exactly what's going on, you have to dig through the records, talk to 20 different people, manage hundreds of different queues, and things like that. So that is where the motivation for Conductor started: we want to continue investing in the same principles of building through microservices. We like this whole loosely coupled aspect of building distributed systems. What we don't like is coordinating directly through code, so let's have an orchestrator do that for us.
At the same time, we wanted the orchestrator to work at Netflix scale and be built using chaos engineering principles. And that's how we started working on Conductor, and that's what Conductor set out to do. The whole idea was: let's tame all the microservices and bring order to the chaos without sacrificing what they deliver at the end of the day, right?
B
Yeah. And for those that maybe aren't familiar, chaos engineering is this principle of just switching things off without telling anyone and seeing what happens, watching how a system fails and how that's responded to. So I think it'll be interesting to get into that a bit more as we go through what Orkes is doing. And with Netflix, am I right in saying Conductor wasn't originally open source, but then it was open sourced at some stage? Is that correct?
C
Yeah. So when we started, of course, we were just building it. At one point in time, what we realized was that the product was good. There was a lot of adoption inside the company. And what we saw was that, hey, if we open source this, we get the benefit of the community contributing to it. We can also leverage it as a way to recruit good people and build collateral for the team itself. And more importantly, Netflix has always been very active about pushing things back to open source, especially things which are not proprietary, not key to the business. And Conductor is a very generic piece of software. It has nothing to do with encoding or streaming or anything. So we decided to make it open source. And it has been open source for quite some time now.
B
Yeah, yeah. So we'll come back to the transition from it being associated with Netflix, what happened there, and then Orkes coming out as a company on its own, so to speak. But I think we should also just take a step back. You already mentioned orchestration there. So let's just give a baseline: what is workflow orchestration?
C
I think that's a great question. And the reason is that orchestration is a lot of things. In the end, the moment you start coordinating work across different systems, that's orchestration. But then you could have orchestration of containers, if you are talking about infrastructure. It could be messaging, it could be data pipelines, or services, in our case, for example. And in general, if you look at workflow orchestration and workflow engines, it's not a new concept, right? It's been around for a very, very long time, decades probably, if not longer. And the primary reason for that is that when you look at business processes, everything is a workflow. One thing that I like to say is that whether you like it or not, whether you know it or not, everybody is building a state machine one way or another. And the hardest part of building a system is not how to implement a particular piece of business logic, but how to maintain the state in a way that it remains consistent and coherent with the business goals. So if you can offload that entire responsibility to a workflow engine, then how you develop systems, how you add resiliency and scale and everything else, becomes much easier to deal with. In the end, if you think about it, if you have a serverless lambda or a completely stateless service and you want to scale up, you can just scale horizontally because it maintains no state. But the moment you add state, it becomes challenging, because if there are failures, you have to recover the state. When you are scaling, you have to ensure that state can be managed in a distributed environment. So scaling stateful systems is much harder. And this is one of the places where we have seen developers end up spending a lot of time. So a workflow engine solves that problem, the right ones especially, because you also don't want the workflow engine to be a single point of failure: if the workflow engine is down, then everything is down.
So that itself has to be resilient.
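As a rough illustration of the point about offloading state, here is a minimal Python sketch of a runner that checkpoints after every step so that a crashed run can resume where it left off. This is not Conductor's API: `run_workflow`, the step functions, and the in-memory `store` dict (standing in for a durable database) are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List

# Hypothetical step functions; each takes and returns the workflow's data.
def reserve(data: dict) -> dict:
    return {**data, "reserved": True}

def charge(data: dict) -> dict:
    return {**data, "charged": True}

def ship(data: dict) -> dict:
    return {**data, "shipped": True}

STEPS: List[Callable[[dict], dict]] = [reserve, charge, ship]

def run_workflow(workflow_id: str, store: Dict[str, dict], fail_at: int = -1) -> dict:
    """Run STEPS in order, recording progress in `store` after each step.

    `store` stands in for a durable database (an assumption of this sketch).
    Calling run_workflow again with the same id resumes from the last
    recorded step instead of starting over.
    """
    state = store.setdefault(workflow_id, {"next_step": 0, "data": {}})
    while state["next_step"] < len(STEPS):
        i = state["next_step"]
        if i == fail_at:
            raise RuntimeError(f"simulated crash before step {i}")
        state["data"] = STEPS[i](state["data"])
        state["next_step"] = i + 1  # checkpoint: step i is now recorded as done
    return state["data"]

store: Dict[str, dict] = {}
try:
    run_workflow("order-1", store, fail_at=2)  # crashes before `ship`
except RuntimeError:
    pass
result = run_workflow("order-1", store)  # resumes at `ship`; earlier steps are not re-run
```

The step functions themselves stay stateless; only the runner knows about progress, which is the separation described above.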
B
Yeah, absolutely. So we have this concept of a workflow engine. And before we get to pure Conductor, there are two ways you could go about a workflow engine, is that right? We've got rule based and Conductor style. So maybe just walk us through what rule based is. I assume rule based came before Conductor style, and maybe people are more familiar with it. But let's talk about that and maybe its limitations, and then we can go on to Conductor.
C
I think the roots of rule-based workflow engines are in the idea that you can define a set of rules for how the process should be orchestrated, and then somebody who owns the business, this could be a product manager or a business analyst, can come and define the rules, which works well. That is how in practice things happen. It also simplifies things, because now you can write rules based on your business-specific needs. Where it starts to break down is that, one, you are constrained by the rules that you can write and by the underlying implementation. So if you were to make systems a bit more complicated, that starts to become challenging. The second is that for the simplest cases it is completely okay, but as it gets more complex, it becomes much harder to understand what is going on. And then you are constantly translating those rules into code, right, to see how this is going to get executed. That kind of choreography is constantly happening in your mind when you are debugging things, when you are trying to understand what's going on. So that becomes a bit of a challenge. And the third thing that we realized was that in theory it looks great that as a business owner you can come and define rules. In practice, they will write up a doc, and a developer is supposed to then go and implement and write the rules. As a developer, you deal with code, not rules. So Conductor takes a different approach: let's keep that idea of being able to define orchestration as a DAG, a directed acyclic graph, but instead of making it a very business-specific or rule-based system, make it programmatic. Just like you write code, you should be able to write a workflow, and it should follow the same principles and the same kind of semantics as code, in terms of being able to run things in parallel, decision cases, loops.
So as a developer, the way you would think about it, you can just write the workflow as it is. And the fundamental principle, the rule of thumb that we followed, was that if you can write code, you should be able to define a workflow. It should be one to one; there should be no missing cases. And it should follow the exact same principles, with variables, state, and things like that, essentially making your code completely durable. That has worked very well because, one, as a developer, when you are implementing it, you are not constantly fighting against a different type of system. You are doing it exactly how you think about it. When you are debugging, it becomes pretty clear what you are debugging. And then to give the business owners visibility, you can always transpose that into a business-specific set of dashboards. So it keeps everybody happy.
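To make the idea of a workflow with code-like semantics concrete, here is a small, self-contained Python sketch with parallel branches, a decision case, and a loop. It is a toy interpreter written for this illustration, not the Conductor SDK: the `Task`, `Sequence`, `Parallel`, `Switch`, and `Loop` classes and the example values are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Ctx = Dict[str, Any]  # shared workflow variables

@dataclass
class Task:
    fn: Callable[[Ctx], None]
    def run(self, ctx: Ctx) -> None:
        self.fn(ctx)

@dataclass
class Sequence:
    steps: List[Any]
    def run(self, ctx: Ctx) -> None:
        for step in self.steps:
            step.run(ctx)

@dataclass
class Parallel:
    branches: List[Any]
    def run(self, ctx: Ctx) -> None:
        # Fork every branch, then join by waiting on all of the futures.
        with ThreadPoolExecutor() as pool:
            futures = [pool.submit(branch.run, ctx) for branch in self.branches]
            for future in futures:
                future.result()

@dataclass
class Switch:
    key: Callable[[Ctx], str]
    cases: Dict[str, Any]
    def run(self, ctx: Ctx) -> None:
        self.cases[self.key(ctx)].run(ctx)

@dataclass
class Loop:
    body: Any
    condition: Callable[[Ctx], bool]
    def run(self, ctx: Ctx) -> None:
        # Do-while: run the body once, then repeat while the condition holds.
        while True:
            self.body.run(ctx)
            if not self.condition(ctx):
                break

def set_key(key: str, value: Any) -> Task:
    def fn(ctx: Ctx) -> None:
        ctx[key] = value
    return Task(fn)

def bump(ctx: Ctx) -> None:
    ctx["attempts"] = ctx.get("attempts", 0) + 1

workflow = Sequence([
    Parallel([set_key("audio", "ok"), set_key("video", "ok")]),
    Switch(lambda c: "hd" if c["width"] >= 1280 else "sd",
           {"hd": set_key("profile", "hd"), "sd": set_key("profile", "sd")}),
    Loop(Task(bump), lambda c: c["attempts"] < 3),
])

ctx: Ctx = {"width": 1920}
workflow.run(ctx)
```

Because the workflow is plain data, the same structure that executes could also be rendered as a graph for debugging or for a business-facing dashboard.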
B
Yeah, because when you talked about rule based, it sounded like the focus was ultimately more on non-technical people. And Conductor style is where the developer actually has a lot more control over that process.
C
That is correct, yeah.
B
Yeah. Awesome. So let's, I guess, look at how Orkes came to be. I believe there was something around Netflix deciding to stop supporting, or stop contributing to, Conductor as an open source project. But I think there was a group of you that weren't actually working at Netflix by that stage, but then came back together. So what was that story?
C
Yeah, so when we started Orkes, we weren't at Netflix, right? I had left Netflix about four years earlier, and by then I was at Google. And when we started Orkes, one of the things that we did was start working with Netflix. We started contributing back to Conductor, and at some point we became one of the larger contributors compared to Netflix. And it just made sense to say, let's take it out from under the Netflix umbrella and put it into its own project repository, giving the community more control over the project and increasing the velocity. Because in the end, the motivation for maintaining the open source project is going to be very different between Netflix, the open source community, and us, and we have a much bigger team in terms of being able to support it and take it forward. So that's where we worked with Netflix to help them archive the repository which covered the code base. And we did it in such a way that we are not losing all the previous contributions, contributor history, and everything. So it remains that way. And it remains completely compatible, because it's the same source code in the end. We see that as an evolution of the open source project. Many of them go through a similar path where they start at a particular company. Kafka is a good example, going from LinkedIn to Apache and then mostly shepherded by Confluent. You see a little bit of the same with Databricks and Spark.
B
Yeah, exactly. We've seen this, as you say, across a bunch of open source projects that usually come out of a company. And then there are just various reasons why it makes sense to fully open source it and say, hey, we're just not going to support this anymore, and whoever wants to can come in and work on it. So that makes a lot of sense. Let's then talk about Orkes, if you could explain what Orkes is. Maybe let's start with: what is it ultimately? What has it built on top of the Conductor project?
C
Yeah. So when we started Orkes, right, the primary motivation was that we had the open source project, and there was a very clear fit between the project and the market, because by then we had seen thousands of companies, some of them very well-known companies, using it in their production flows. Personally I was getting a lot of pings on LinkedIn, sometimes people asking me to review their PR or asking for some help. And in the end we felt that the market was ready. Typically what tends to happen is that companies like Netflix are on the bleeding edge of innovation. So the problems that they see and solve, the industry starts to see three or four years later. So in a way the timing was right: this is the time when companies are going to start looking at it and realizing that as you move your systems to the cloud, you need to break down your monoliths at the same time. The cloud, yes, gives you elasticity of infrastructure, but at the same time, unlike a data center, you have to start thinking about the resiliency aspect also. And this is where they will start thinking about workflow engines. And we were right in that sense also. So that's how we started Orkes as a company. In terms of the business model and how we go about it, by this time other open source companies had already paved the way in terms of how you should think about building an open source project, monetizing it, how you differentiate, and things like that, right? So that became the foundation for how we think about Orkes as a company and Conductor as the open source project that we monetize. And then coming back to your other question: how does it differentiate? What we have been doing is that you have the open source project, and we use the open source. So in the end Orkes is open core, right?
Where I think enterprises wanted us to be able to support them was in terms of adding enterprise features. Because if you want to run a project inside your company, let's say a bank or a healthcare company, you need to have things around security, compliance, governance, those things. Open source, I like to think about it more like the Linux kernel, right? You can take a kernel and build your own distribution, but as a company, you probably want to get a distribution that is vetted by a vendor and has all the security features and everything. So that is exactly the way we did it. The other part is that Conductor is a very plug-and-play system, so it supports multiple different back ends. What we do is take the right ones and optimize for performance, cost, and manageability on top of that. And that's essentially what we deliver. What our customers are essentially paying us for is, in the end, how we take the project and run it reliably in their environment, because that's the key challenge that we solve for them. If you need three nines or four nines of availability, that's something we can deliver without them having to worry about it.
B
Yeah. So again, that's walking what I think we have now seen is quite a familiar path, where an open source project can actually benefit hugely if there is a commercial arm around it that, as you say, is able to provide the security, the stability, the compliance, especially in the enterprise setting, which is what Orkes really caters to. Having all that just taken care of, there's often a huge need for that, on the basis that the original project has a bunch of people using it. And I tracked back, and it was interesting on Hacker News at the time. There were a lot of comments from people saying, oh, this is awesome. It's so great to see that some people are taking this on as a proper company. We can just buy it from them as opposed to needing to try and run it ourselves.
C
As a matter of fact, our first few customers were the open source adopters. It was like, glad that you guys started the company. Can you help us?
B
So yeah, and especially because quite a few of you were very much the core contributors in the first place. So that's awesome to see. Let's move on to how Orkes has evolved, because I think what we've just been talking about started back in 2022 or so. We're going to get into agentic orchestration and AI, because that's where it can help in a big way with things that are very pertinent now. But maybe just walk us through how the product has evolved.
C
Yeah, yeah. So I think when we started, our initial focus was: hey, this is open source, how do we make it enterprise ready and run it in the cloud? So we focused on that. And of course quite a few of our customers tend to be in very regulated industries, which means that you have to support different modalities, whether it's running fully hosted by Orkes, or bring your own cloud, or in some cases running in data centers. So that is one area where we spent time, making sure that we can take the software and run it at scale, highly reliably, for the customers. And then we started thinking: as a company, if you are leveraging something like Conductor, you don't want different tools for different problems. When you think about workflow orchestration, you want to be able to do any number of things with it, right? So we started to add some of the features that we got as clear feedback from our customers. Sometimes they had built out their own internal versions, but they wanted us to support them by adding it as a proper feature inside Conductor. So one of the things that we did: workflow engines traditionally do asynchronous orchestration, meaning the workflows can run anywhere from a few minutes to hours to days. We added support so that your workflows can run for much longer periods of time as well, like months and months or even years in some cases. And we do have some use cases like that. And then on the other extreme, if you are orchestrating services, you have HTTP services, you have gRPC services, and you want to orchestrate them, those are not going to run for seconds. They are probably going to finish the entire flow in tens of milliseconds. So how can we run workflows synchronously? True microservices orchestration, but very much synchronous and very low latency. So that's another area that we focused on.
And I think that's one of our key capabilities that is very unique to Conductor, that you typically don't find in other workflow engines. And then as the industry was starting to think about AI and LLMs, that's where we started to invest in how we can let workflow engines orchestrate language models. Today I think it has become very commonplace that you need a workflow engine to orchestrate your agents. But back in the day, people were still writing Python code to just call LLMs. And that's where we started to build integration suites and everything. It's core to our nature, right? In the end we are not a solution, we are a platform, which means we want people to be able to use it in whatever ways and formats they want to use it. So one area where we focused was: let's start to integrate and provide support for pretty much every foundation model that is out there. And today I think we support pretty much every possible model out there. You can switch back and forth, you can run them together in the same workflow, and things like that. So that's one of the areas where Orkes has evolved into a true LLM orchestration platform. Right. If you have multiple agentic models, that then allows you to kind of build. If you think about traditional workflows, those are deterministic flows, right? You could have switch cases which could take different paths, but in the end it is still very deterministic. Given the same input, it is always going to produce the exact same output path. Now if you add language models inside that, you start to see the non-determinism aspect of the workflow, because even for the same input it could take a different path. And then we started to support those things. I think that's one area the industry in general is also moving towards, and we are continuing to invest in that area as well.
B
Yeah, so maybe again, just to baseline this: I'm sure the majority of the audience are familiar with what an AI agent is, or what they think it is. But at the same time, I think it's always helpful to get your definition as well, because you could probably pull up five different definitions of what an AI agent is, especially in the orchestration sense. For example, when we say agentic orchestration, are we saying that these are multiple agents that get orchestrated, or are we saying that ultimately an orchestration could itself be termed an AI agent? You can help me out here, I think.
C
Yeah, that's a good question. I think agent is a very confusing term because it's a very general-purpose thing, right? Pretty much anything can be thought of as an agent. But in the end, I think the textbook definition is that an agent is something which has agency: it has its own autonomy in terms of how it can plan and execute its goals. Now if you translate that into agentic systems, it means, I think, three different things. In my opinion, an agent could be purely an orchestration where you have language models deciding the path. There has to be some sense of autonomy, otherwise you just have a very deterministic system. So agents by definition have some level of autonomy, and therefore non-determinism, built into them. Now you can think about a workflow with a single language model that is either running in a loop or in a single execution path. In that case you have a single agent operating inside that workflow. We have also heard a lot about humans in the loop and humans as guardrails. You can think about a human also as an agent. So the moment you put a human inside a workflow with an LLM, you are starting to think about multi-agent systems, where now you have two agents and they have very clear responsibilities. Maybe the LLM has the responsibility to come up with a plan, and the human has the responsibility to vet that plan, approve or reject it, and then continue executing on it. Similarly, you could add more LLMs and build true multi-agent systems where several LLMs are participating and each one has a pretty well-defined role. A very good example, I would say, is what we see with AI coding tools like Cursor and Windsurf. You could think about one agent which takes your instructions and generates the code. A second agent could be responsible for compiling, and a third one could be responsible for testing and checking against your input goals.
And they are all coordinating, running in a loop until the goal is achieved. So that's a true multi-agent system in the end. And as a human, as a developer, you are also an agent who is saying, yeah, this looks good, approved, commit the code. So that's now a true multi-agent system. But in the end, agents are, if you think about it, heuristic workflows. The way I like to think about it is that you don't have a very set, defined path, but you have a very high-level definition of how it should be done. Whether it is done that way or not depends upon how the LLMs think about doing it.
B
Yeah, I think that's really helpful. And the code orchestration example is a good one. I think on the Orkes website there are also examples of flows, and an example that you sometimes pull out is around inventory management, or more like claims management, for example. Maybe that'd be quite interesting to understand. How does it differ compared to, say, a traditional rule-based system? What are the things that can be done differently and better, I guess, when we're now talking about agentic AI orchestration applied to these quite clear business use cases?
C
I think the biggest thing that can be done better is this. If you have a non-agentic system, because by definition it is a very deterministic system, every time you have a different use case you have to build a new workflow, a new system around it, which essentially creates an explosion of different use cases. What you see is that, hey, if I were to approve a claim, for example, depending upon different requirements you have different claim systems or different parts of the claims and things like that. But if you were to add something new, you go back to development mode, rebuild or build a new feature, and it takes time. With an agentic system, I think the biggest change is that instead of writing the entire system end to end, you focus on writing tools. A tool can be something that sends an email. A tool can be something that looks at the claim information and pulls up the customer information or the claimant information, or looks up the policy. Now, if you think about it, you can put those tools together in any particular combination. So now we are talking about a combinatorial explosion, right? If you were to build deterministic systems, you end up building a large number of different use cases and paths, which is why most software projects take months and months to develop, because you have to cater to all the different possibilities. But if you break it down and say, I have got N tools, they can be used in any combination, and an LLM can decide which one to use, now your thinking changes. You're no longer thinking about putting them together by yourself. You are building stateless tools, very similar to microservices if you think about it. But instead of you, as a developer, putting them together, an LLM is taking your input and deciding on the fly: how should I do this?
Which means that your development process becomes simplified: you can introduce a new tool without having to change everything and start incorporating it. So the way it differs from traditional rule-based systems is that it allows you to go from 0 to 1 and 1 to N very quickly by just incorporating more and more tools, and you are no longer catering to the combinatorial explosion of different use cases. You can just do it out of the box. And we are starting to see that. I would say that's how a lot of new systems are starting to be built out, through agents. Agents can do pretty much anything as long as they have the right tools and context given to them.
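The tools-plus-planner idea can be sketched in a few lines of Python. Everything below is an assumption made for illustration: the tool names, the placeholder values they return, and especially `stub_planner`, which stands in for an LLM call with a hard-coded rule so the example runs without any model.

```python
from typing import Callable, Dict, List

Claim = Dict[str, object]
TOOLS: Dict[str, Callable[[Claim], Claim]] = {}

def tool(fn: Callable[[Claim], Claim]) -> Callable[[Claim], Claim]:
    """Register a stateless function as a tool the planner may choose."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def lookup_claimant(claim: Claim) -> Claim:
    return {**claim, "claimant": "placeholder claimant"}

@tool
def lookup_policy(claim: Claim) -> Claim:
    return {**claim, "policy_active": True}

@tool
def send_email(claim: Claim) -> Claim:
    return {**claim, "emailed": True}

def stub_planner(goal: str, available: List[str]) -> List[str]:
    """Stand-in for an LLM call: picks tool names, in order, for a goal.

    A real planner would prompt a model with the goal and tool descriptions;
    this hard-coded rule only keeps the sketch runnable.
    """
    plan = ["lookup_claimant", "lookup_policy"]
    if "notify" in goal:
        plan.append("send_email")
    return [name for name in plan if name in available]

def run_agent(goal: str, claim: Claim) -> Claim:
    # The planner decides which tools to combine; the tools stay independent.
    for name in stub_planner(goal, list(TOOLS)):
        claim = TOOLS[name](claim)
    return claim

outcome = run_agent("review claim and notify claimant", {"claim_id": "C-1"})
```

Adding a capability means registering one more function with `@tool`; none of the existing tools or the runner has to change.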
B
Yeah, I mean, to use a slightly overused term, it's this idea of basically setting the first principles of what can be done and then letting the orchestration aspect deal with how it wants to go about that. And you touched on the deterministic or, especially in this case, non-deterministic aspect. I think that's something a lot of the audience is probably curious about: how does that work? Because that's basically the crux of all this: how do we allow the system to take its own decisions, and what constitutes, say, a first principle that can be laid down while the rest is allowed? How does Orkes deal with this, and how can somebody using Orkes feel confident that the non-deterministic aspect is taken care of?
C
I guess the analogy that I like to think about is when you have a car that can do self-driving, there are two aspects to it. One is the notion of control: that I can take the steering wheel at any point in time and do whatever. So guardrails, humans who can be in the loop wherever you need them to be. The second part, which I think is more critical if you think about it: if you just treat the LLM as a black box and say, here are the tools, just go and do it and come back with a result, how do you know what the thought process was and what it did? So the second part is basically showing me what it sees, saying this is what I'm thinking, this is my plan, and this is exactly how the graph of this execution is going to look. So as a human, now I can look at it and say, this makes sense that you're going to execute steps one and two, and based on the output of step two, take three or three prime, and then go and execute step number four. So now this graph is something that I can see and say, this is what you are thinking about doing, this makes sense to me that you should do it this way, go and do it. So humans in the loop become critical, along with that entire aspect of being able to visualize the execution graph. I think that's tremendous, because now you start to build confidence that this works. The next part is when to apply guardrails. A good example that I like to give here is if I am building a DevOps system and I'm using an agent to manage my Kubernetes clusters. When it decides to execute an operation to get the list of pods and deployments, nothing bad is going to happen, so just do it. Even if you execute that command on a wrong cluster or a production cluster, you are just going to execute a read operation, nothing bad is going to happen, and that's completely fine. But if you are going to destroy a cluster, you better check with me first.
Maybe you send me a Slack message or an email and let me approve it, because you might hallucinate, you might end up taking the wrong decision or making a typo, and destroy a production cluster. So I don't want you to do it. So when to apply a guardrail is another aspect. This is where we are spending a lot of time, to say that as a builder of the agent, you should have full control. So instead of saying, here is the LLM, you give it the tools and let it execute everything, our approach is fundamentally different, in the sense that we tell the LLM: here are the tools that you can use, tell me what you are going to use, and then based on the outcome I can decide and build that inside my workflow. So now the workflow becomes a combination of some set of algorithms and some set of non-determinism. Right, you add determinism when you need it, and otherwise let non-determinism take care of everything else.
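The read-versus-destroy distinction above can be sketched as a deterministic check that sits outside the model. This is an illustrative assumption, not the Orkes API: the tool names, the `GUARDED` set, and the `approve` callback (which in practice might post to Slack or email and wait for a human) are all hypothetical.

```python
from typing import Callable, Dict, Set

# The developer, not the LLM, decides which tools need approval.
GUARDED: Set[str] = {"delete_cluster"}

def list_pods(cluster: str) -> str:
    return f"pods in {cluster}"          # read-only: safe on any cluster

def delete_cluster(cluster: str) -> str:
    return f"{cluster} deleted"          # destructive: must be approved first

TOOLS: Dict[str, Callable[[str], str]] = {
    "list_pods": list_pods,
    "delete_cluster": delete_cluster,
}

def execute(tool: str, cluster: str, approve: Callable[[str, str], bool]) -> str:
    """Run a tool the agent selected, enforcing the guardrail deterministically."""
    if tool in GUARDED and not approve(tool, cluster):
        return f"{tool} on {cluster} blocked: approval denied"
    return TOOLS[tool](cluster)

# Stand-ins for a human approver.
def deny_all(tool: str, cluster: str) -> bool:
    return False

def approve_all(tool: str, cluster: str) -> bool:
    return True
```

Because the guardrail is keyed on the tool name, it fires every time that tool is selected, regardless of what the model intended.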
B
Yeah, and I believe Orkes is really focused on exactly this trust aspect, because I think that is what everyone, I say everyone, but especially enterprise, is concerned about. The potential productivity gains from allowing agents to run a bunch of stuff are in theory fantastic. It is just that developer example you just gave, of a cluster being destroyed or there being some typo somewhere, that I think a lot of the non-technical folk in companies especially are very concerned about. They're sort of like, this all sounds great, but there's no way that this could actually do it reliably. So can you maybe walk us through a few mechanisms? I don't expect it's all solved today, but how is Orkes actually approaching this? What kind of tools and mechanisms are there to help the developer? And what could that then help the developer say to the non-technical person to help them feel more at ease about all of this?
C
The way we are approaching it is, as I was trying to explain, in two ways. One is being able to add guardrails, and to add them when you think an operation is something that you want someone to take a look at. And a guardrail doesn't necessarily have to be a human, right? There are a lot of systems for automated guardrails. You can also use an agent as a guardrail, so you can delegate it to another agent, which can get tricky, because what if that one also hallucinates and the two of them agree and do something bad? But depending upon the use case, depending upon the contextual need, we allow developers to put in the right guardrails. And adding guardrails is a deterministic step. We are not asking the LLM to decide when to use a guardrail, because that brings back a cyclic dependency and the trust problem. Instead, we let developers say, this is where you will add a guardrail. And that's a very, very deterministic step: if you have a guardrail set up for a specific tool, it will get executed. So that is one part of it, and it takes away the whole aspect of the LLM doing something bad without your approval. The second part is understanding what really happened. It's completely possible that the LLM did exactly what it was supposed to do, no hallucinations or anything. But then there are questions about why. And that's important, right? Let's say it's a claim processing system and I approved or denied a claim, and there's a question as to why. The answer cannot be, because my AI said so. It has to be, hey, this is the thought process, this is how we evaluated the claim, and therefore this is the outcome. So it's less about LLMs making decisions and more about the LLM defining the flow. But you need to have complete visibility. So the other aspect is that we give the full-blown graph of exactly what happened, step by step, every step.
What was the input given, what was the output that came out of it, and exactly what decision was made based on that. So now, as an operations person or a human, you can look at it and explain exactly what happened and why it happened. So that takes away the other concern, which is "I can't explain what happened." Now we can completely explain what happened. You can control it. And these two things combined, we think, give you enough guardrails. And of course, one other aspect of Conductor and workflow engines in general is that they keep a trail of everything. So every conversation, every execution that happens is captured, stored, and can be kept in storage for whatever your retention policy is. So if you were to go back and see what was happening and how things were happening, those things can be queried later. The other part is access control. One thing that is built into Orkes is that just because you have a tool does not mean anybody can use it. So even to use a tool you need to have the right access control. A good example I like to give here is that if you are building an agent that can do all things HR for you, you should not be able to ask the agent to give yourself a promotion unless you are an HR admin and it goes through the proper approval process. So that is also built into Orkes. So with the right level of access control, visibility, and the human guardrails, we think that's going to be enough for someone to say, hey, we can trust the system.
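The two ideas in this answer, a step-by-step trail of inputs and outputs plus per-tool access control, can be sketched together. Everything here is a hypothetical illustration (the `AuditedRunner` class, the role names, the `promote` tool); it is not how Orkes implements either feature.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set


@dataclass
class Step:
    """One recorded step: who called which tool, with what input and result."""
    tool: str
    user: str
    payload: Any
    output: Any
    allowed: bool


@dataclass
class AuditedRunner:
    tools: Dict[str, Callable[[Any], Any]]
    permissions: Dict[str, Set[str]]  # tool name -> roles allowed to use it
    trail: List[Step] = field(default_factory=list)

    def call(self, tool: str, user: str, roles: Set[str], payload: Any) -> Any:
        allowed = bool(self.permissions.get(tool, set()) & roles)
        output = self.tools[tool](payload) if allowed else None
        # Record the step whether or not it was allowed, so denials are explainable too.
        self.trail.append(Step(tool, user, payload, output, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not use {tool}")
        return output


def promote(employee: str) -> str:
    return f"{employee} promoted"


runner = AuditedRunner(tools={"promote": promote}, permissions={"promote": {"hr_admin"}})

# An ordinary employee asking the agent for a promotion is refused...
try:
    runner.call("promote", user="sam", roles={"employee"}, payload="sam")
except PermissionError as err:
    denied_message = str(err)

# ...while an HR admin is allowed, and both attempts land in the trail.
granted = runner.call("promote", user="kim", roles={"hr_admin"}, payload="sam")
```

The trail is what lets an operations person answer "why" later: each entry carries the input, the output, and whether the caller was permitted.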
B
Yeah, good explanation. You've mentioned the graph aspect of it, and let's mix this in with the access control for a second. Does it present the same across all types of person, or are you able to present different views that make the most sense for the explanation? Is this an ops person who's going to understand what happened in one way, which is quite different at times from a developer who wants to see it in a slightly different way? But maybe this has been solved in one pane of glass, again to use a slightly overused phrase. But yeah, tell us about that.
C
Yeah, I think the short answer is yes. The slightly longer answer is that as a developer you are able to see every step, but it depends upon how you construct the whole thing. As an ops person, you can look at the high-level blocks, saying step one, step two, step three, where step two could be a lot more complex in ways that, as an ops person, you may not need to understand and know. And of course, because one thing about Conductor is that it's pretty much an API-driven system, you can then go and build very business-specific views of it, which might make sense for your business users and follow your process flows and definitions and everything around them. So it decouples those two aspects. As a developer, you don't have to think about building everything with a rule for the business. And as a business user, you don't have to think, I don't understand this, somebody come and explain it to me.
B
Yeah, so bringing this back and forwards, I guess, to where the developer sits. Before we get on to getting up and running, so to speak, one thing we haven't touched on is how MCP and MCP servers come into this. And I believe you have open sourced your MCP server for Conductor. When I was getting my head around what Orkes and Conductor do in the first place, I instantly started to think about MCP, because I thought, well, isn't this what MCP is for? So maybe talk to us about where the intersection is and how they work together.
C
Yeah. So MCP's focus is primarily on how you expose your tooling and APIs as something that LLMs can understand, right? And that simplifies a great deal in terms of LLMs being able to call the tools. What we do is allow developers to bring their own MCP servers. We are also working to bring in most of the common ones as part of the out-of-the-box capabilities inside our Enterprise edition, and then it can basically use them as tools. So if you want to send an email, and the LLM decides that it needs to notify the user, and you have an integration through MCP with Outlook or Twilio, it can send you an email. So that's the primary role for MCP. The Conductor MCP server does a very similar thing, but it acts as a tool to generate the execution graphs, to say, hey, I have this goal, can you give me a Conductor workflow for this that I can then go and execute? Right? So that's a stepping stone for us to build fully autonomous systems. Because if you think about it, what LLMs essentially do today, in programming terminology, is a look-ahead of one. They look at the current context and say, what's the next set of tools that I'm going to execute? Where we are going is that I can look at the goal and define the entire execution graph with a look-ahead of N, N being a decently finite number. So that improves performance, and also the cost aspect, because you are making fewer LLM calls, and more importantly reliability, because that output can be pretty much deterministic, or deterministic enough, across multiple iterations. So that's the primary goal of the MCP server.
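The cost argument for a look-ahead of N over a look-ahead of one can be shown with a toy counter. `CountingPlanner` is a hypothetical stand-in for an LLM, not part of Conductor or its MCP server; it only counts how many times it is consulted under each strategy.

```python
from typing import List, Optional


class CountingPlanner:
    """Stand-in for an LLM that knows the right steps and counts its own calls."""

    def __init__(self, steps: List[str]) -> None:
        self.steps = steps
        self.calls = 0

    def next_step(self, done: int) -> Optional[str]:
        # Look-ahead of one: asked again after every executed step.
        self.calls += 1
        return self.steps[done] if done < len(self.steps) else None

    def full_plan(self) -> List[str]:
        # Look-ahead of N: asked once for the whole execution graph.
        self.calls += 1
        return list(self.steps)


def run_stepwise(planner: CountingPlanner) -> List[str]:
    executed: List[str] = []
    while (step := planner.next_step(len(executed))) is not None:
        executed.append(step)
    return executed


def run_planned(planner: CountingPlanner) -> List[str]:
    # The plan is fixed up front, so it can be inspected before anything runs.
    return [step for step in planner.full_plan()]


steps = ["look_up_claim", "check_policy", "send_email"]
stepwise, planned = CountingPlanner(steps), CountingPlanner(steps)
same_result = run_stepwise(stepwise) == run_planned(planned)
```

Both strategies execute the same three steps here, but the stepwise planner is consulted once per step plus a final call to learn it is done, while the up-front planner is consulted once.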
B
Got it. And I mean it is effectively completely optional. It's not sort of. Yeah, it's not sort of required in terms of. And I mean other, I mean MCP is obviously it's a protocol that was ultimately developed by Anthropic. Are you looking to do any support for any of the other competing, shall we say protocols or sort of does MCP make sense as the one to kind of sit with?
C
I think MCP is a great one for being able to call the tools. We added some features to fill gaps, some of the things that MCP as a protocol definition lacks, things like access control. It does now support a notion of authentication, but access control is the other part that we have added. Then the other one is A2A. When you start thinking about multi-agent protocols, I think A2A is coming out to be something that people are starting to think about for agent coordination. So that's another area where we are going to add support pretty soon.
B
Nice. Awesome. So let's talk about getting up and running. First of all, where does the developer go? And then could you talk us through what is a high-impact, say, first 10 minutes of getting started with Orkes? I believe there are some template-type workflows you can run out of the box.
C
Yeah.
B
So what's the kind of. Yeah, what's like a high impact 10 minute place for someone who's never used this or let's just say for argument's sake they've never even used orchestration before. This is the first time they're actually approaching this. Like what does that look like?
C
So I would say there are three main categories. Right. One is, if you are looking to orchestrate APIs, you can create a workflow. Conductor has a notion of system tasks, things which are pre-built. You don't have to write code for them, and even if you wrote the code, it would do the same thing. So you can orchestrate multiple HTTP endpoints and see for yourself. There are a lot of example API endpoints available on the Internet. You can just put them together and see how it orchestrates them and gives you the visibility. So that's the API orchestration use case, which you can test out very quickly. The second part is, if you are trying to build an agent, you can try to build out a simple chat-complete agent. You can put a chat complete inside a loop and it will keep on running until your loop terminates. And you can actually put in two agents: you can take two chat-complete agents and give them some instructions, and you will see that they start talking to each other in a conversational way. That's pretty fun to see and quite interesting sometimes. And the third part is, if you are building a workflow, you can take an existing business process that you have, let's say order management or claim processing, and we have templates for it. You can try it out, mock up the actual implementation, and see for yourself how easy it is to change, modify, and get visibility into it. I would say those are some of the things that can be done in 10 minutes. And we have a developer edition, so anybody can go to developer.orkescloud.com and get started pretty quickly without having to worry about how to download and run it locally, which you can always do if you want to. But nothing beats one click: go to this URL and start working on it.
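For the first category, orchestrating HTTP endpoints with pre-built system tasks, a workflow definition might look roughly like the following. This is a sketch under assumptions: the field names follow the open-source Conductor workflow definition format as commonly documented (`tasks`, `taskReferenceName`, `type: "HTTP"`, `inputParameters.http_request`), and should be checked against the current docs. The workflow name, task names, and `example.com` URLs are invented for illustration.

```python
import json


def http_task(ref: str, uri: str, method: str = "GET") -> dict:
    """Build one HTTP system task entry; no worker code is written for it."""
    return {
        "name": ref,
        "taskReferenceName": ref,
        "type": "HTTP",
        "inputParameters": {"http_request": {"uri": uri, "method": method}},
    }


# Hypothetical two-step flow: fetch a user, then fetch that user's orders.
workflow = {
    "name": "two_step_api_demo",
    "description": "Calls two example HTTP endpoints in sequence",
    "version": 1,
    "schemaVersion": 2,
    "tasks": [
        http_task("get_user", "https://example.com/api/users/1"),
        # Assumed expression syntax: reads the first task's response body.
        http_task(
            "get_orders",
            "https://example.com/api/orders?user=${get_user.output.response.body.id}",
        ),
    ],
}

# JSON text that could be pasted into a workflow editor or sent to a metadata API.
workflow_json = json.dumps(workflow, indent=2)
```

The definition is plain data, which is what makes the step-by-step visibility mentioned above possible: the engine knows every task and how its inputs are wired.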
B
Yeah, exactly. I think that's where I went. So that's developer.orkescloud.com, Orkes spelled O-R-K-E-S. So yeah, head there. It's pretty foolproof. You can either choose templates or just hit start from scratch, sign up, and then off you go. So, awesome. You touched on where you might go, say agent-to-agent things, but from what you can share, just before we wrap up, what do the next, say, six months look like for Orkes, and what are you looking to add or develop?
C
I would say that the industry is slowly moving towards agentic workflows. Everyone is thinking about how they can incorporate language models into their business processes and leverage them to accelerate the pace at which they can innovate and get ahead of the curve. And that's one area where we are focusing. And most importantly, as I said, the trust and safety aspect is the most important one, because that's what enterprises care about more than anything else. That's one area where we are spending a lot of effort to see how we can simplify those things. And the other part that is coming up pretty quickly is this: traditionally, when you think about software, as a developer you would build the end-to-end stack. With agentic systems, that role might shift: as a developer you will build tools, and the agents will be built by the business users. Going back to our original discussion about rule-based workflow engines, I think they are coming back, and I would say this time with a vengeance, saying, hey, we are going to let you do it, but now no more DSLs, no more quirky rule engine; rather, just describe what you want to do and I'll figure it out and do it for you. So I think that's going to be a pretty interesting area to see, and that's one area where we are also investing, to see how we can bring business and developers together to accelerate the speed at which they can innovate for companies.
B
Yeah, awesome. Sounds really powerful. I mean, especially as you've called out, given that Orkes is super focused on enterprise-grade reliability and trust in this sense, if it's going to be possible to do it in an enterprise setting, then this is the place to come to. And obviously it's been proven at a base level, given it came out of places like Netflix and is used in big settings, big companies. So yeah, very exciting. Well, thanks so much for coming on, Viren. I think we've learned a lot. And again, for anyone who wants to get up and running, that's developer.orkescloud.com. Just head there and give it a try. So yeah, Viren, thank you so much. I hope we get to catch up again in the future.
C
Yeah, thank you. Thanks for having me here.
Episode Title: Orkes and Agentic Workflow Orchestration with Viren Baraiya
Podcast: Software Engineering Daily
Date: October 2, 2025
Guests:
Main Theme:
This episode explores the evolution and challenges of workflow orchestration in modern, microservices-heavy software architectures. Viren Baraiya discusses his journey from building Netflix Conductor to founding Orkes—an enterprise-scale platform for agentic workflow orchestration that extends Conductor's open source foundation with AI and compliance-focused features.
Focused on making Conductor “enterprise ready”—support for highly regulated industries, deployment flexibility (SaaS, BYOC, on-prem).
Extended Conductor beyond asynchronous workflow:
Orkes addresses operational needs often unmet by open source alone (monitoring, reliability, performance, compliance).
The term “agent” is broad but typically refers to an autonomous LLM or system that can plan and execute towards a goal.
Emphasis on “non-determinism” and flexibility—agents make real-time decisions rather than following fixed scripts.
Orkes prioritizes “trust and safety,” essential for enterprise adoption.
These measures help businesses and non-technical stakeholders “feel more at ease.”
Orchestrating APIs with system tasks
Building basic agent/chatbot flows (including multi-agent conversational demos)
Adapting business processes (order management, claims, etc.) using templates
Quote:
“Nothing beats like, you know, one click, go to this URL and start working on it.” — Viren (43:28)
| Timestamp | Topic | Key Points |
|---|---|---|
| 02:01–04:25 | Viren's background & Conductor origins | Building platforms at Netflix, state headaches, need for orchestration |
| 07:51–10:35 | Orchestration Types & Philosophy | Rule-based vs. programmatic, state machines, developer-centric approaches |
| 13:21–15:31 | Open source & Orkes founding | Moving Conductor out of Netflix, open core business model |
| 15:31–19:50 | Orkes enterprise evolution | Features for reliability, compliance, AI/LLM support |
| 19:50–26:32 | Agentic orchestration | Determinism vs. autonomy, multi-agent workflows, business tooling focus |
| 29:29–37:11 | Trust, guardrails, and enterprise safety | Deterministic controls in agentic systems, explainability, audit trails |
| 38:32–41:39 | MCP protocols, integration & extensibility | Tool exposure, cross-protocol support, agent-to-agent comms |
| 41:39–43:50 | Developer onboarding & quickstart | Templates, playground, API/agent/business use cases |
| 44:24–45:42 | Future roadmap | Democratizing workflow automation, safe LLM/AI enablement for business users |
This episode offers a deep look at the evolution of workflow orchestration technology, centering on how Viren Baraiya’s work brought about tools for modern microservices at scale—from Netflix Conductor to the enterprise-grade Orkes platform. The discussion balances practical engineering challenges, product philosophy shifts (from rule-based to agentic, AI-driven orchestration), and concrete enterprise concerns like trust, compliance, and usability. It should be helpful for engineers, architects, and business leaders interested in the next generation of workflow automation and AI integration.
Quick Start: Try Orkes at developer.orkescloud.com (O-R-K-E-S) — one-click template or custom workflow builds.