
Tech Talks are in-depth technical discussions. Oleg Ilyenko is the primary creator of Sangria, a GraphQL implementation used by Twitter, The New York Times and many other companies. We discuss the problems that GraphQL solves, how Sangria works and the ...
A
I think GraphQL API is quite helpful if you have this fast evolving data model which you would like to expose to specific clients and you would like to provide a lot of flexibility for those clients.
B
Today I talked to Oleg Ilyenko about GraphQL and his GraphQL implementation, Sangria. Welcome to the podcast.
A
It's nice to meet you. Thank you for having me here.
B
Yeah, it's great to meet you. So, about a month ago, I was working on this API and I had to add some new endpoints to it because they're building this new React client and they just wanted some new endpoints that return slightly different data. And as I started to dig into what seemed like a small thing, I came to the realization that API design is maybe a lot more complicated than I thought. Around that time I started to look into GraphQL, and then I came across Sangria, which you made, which is a GraphQL tool for Scala. So I thought we could start by talking about some of these API difficulties, then transition into talking about GraphQL, and then finally your product. Let me describe the problem I had, because I think it's an interesting way to illustrate where GraphQL could be a solution, maybe. So there was a user endpoint and it returns some sort of user object, and it's being used by certain clients and returns just basic user information. Now we have a new client who'd like to call this API, and either they have to make extra requests because all the information isn't there, or I need to add more information to this call. So how would I solve this problem, say, without using GraphQL?
A
So without using GraphQL, there are different approaches to this. In many cases, if you just have one REST API, you would probably just introduce this change in the backend and try to think about how it will look for all existing clients, who may not necessarily need this information. Another approach is to build a dedicated endpoint specifically for this particular client. This is especially important for mobile applications, which are very sensitive to the type and amount of data that is returned to them. With GraphQL, it's a little bit more flexible because every single client always needs to provide information about what kind of data it needs in the form of a GraphQL query. You can think of this GraphQL query as JSON where you skip the values: you just have the names of the fields, the structure, and you send it to the server, and then the server fulfills those requirements. It provides the missing part, the values. So the shape of the response the client gets back is the same as the initial query, with the exception that it now has the values.
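As a rough illustration of the idea that the response mirrors the shape of the query, here is a minimal Python sketch. It is a toy, not a GraphQL implementation: the selection is modeled as a nested dictionary of field names instead of parsed query text, and the `USER` record and `select` function are invented for this example.

```python
# Toy illustration only: a selection is a nested dict of field names,
# standing in for a parsed GraphQL query. This is not a GraphQL parser.

USER = {  # hypothetical backend record
    "id": 7,
    "name": "Ada",
    "email": "ada@example.com",
    "organization": {"id": 1, "name": "Acme", "plan": "free"},
}


def select(data, selection):
    """Return only the fields named in `selection`, keeping its shape."""
    result = {}
    for field, sub_selection in selection.items():
        value = data[field]
        # A non-empty sub-selection means "descend into this nested object".
        result[field] = select(value, sub_selection) if sub_selection else value
    return result


# Roughly the query: { name organization { name } }
new_client = select(USER, {"name": {}, "organization": {"name": {}}})

# Roughly the query: { name }
old_client = select(USER, {"name": {}})
```

The older client never names `organization`, so it never receives it, while the newer client gets exactly the nested fields it listed.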
B
So for example, if I was using GraphQL, rather than building two endpoints, one client could just specify that they actually want whatever the organization name as a part of the request and they would get that back where the other client could not include that, and then they wouldn't receive that.
A
Exactly. So with a GraphQL API, you would still need to add this capability to the server, but it will not affect any existing clients, because if a client doesn't know about a new field like organization, it will never ask for it. So this means other clients wouldn't be affected by this change, and the new clients can now ask for this field. As a side effect, it also makes it easier to track usage of the API. We often talk, especially in the context of mobile applications, about different versions of an API and about the maintenance of those versions. Because as soon as you deploy your mobile application, as soon as it went through the review process and landed on, let's say, Android or iPhone, it is very hard to make users update. So this means you either need to tolerate those older clients that ask for older versions, or you just drop support for those older clients. With GraphQL, you no longer think in terms of versions, you think in terms of data requirements. So the client says, okay, I need an organization. It doesn't describe the version of the API, it just tells exactly which fields it needs, and this is pretty much it.
B
Well, so what if there's a field that, for instance, in a traditional setup I was returning some field as part of V1 API, and then in the V2 I'm no longer returning that. How would I handle that in GraphQL?
A
What is often done, and what I do, for example, with our API, is this. GraphQL first of all has a query language, which you normally use to say, okay, give me a user with a name and an organization. In this case you'll get a user back. But what GraphQL also has in addition to that is a type system. So at the same time as you ask about the user, you can also ask an introspection API to give you information about the structure of the user type. So this means the user type and scalar types like int and string are all part of the type system, which is exposed together with the actual data in the actual API. And this metadata includes things like documentation. So every single field and type has a description, and every field has deprecation information. The fields are, I think, called deprecated and deprecation reason. So this means when you would like to remove, for example, this organization field in the future, you would first deprecate it. This information is only understood and used by the different clients. For example, GraphiQL, which is a kind of in-browser IDE for writing your queries, will automatically hide those fields. It will not show them to users. So you can still use those fields, but they are explicitly marked as deprecated and not shown to users who don't know about them. Given this, and given that the server always knows what clients use, we can track the usage of particular fields. For example, in our case all our metrics go to InfluxDB and are then presented in Grafana. And we actually built a dashboard that shows us which fields are in use, and especially which deprecated fields are still in use. So we know precisely when a particular field is no longer used by any clients. We can define thresholds, we can define the time frame for how long we wait until we drop this field. But when we drop this field, we have a lot of confidence that it is not used by any client.
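The field-level tracking described here can be sketched in a few lines of Python. This is only an assumption-laden toy: the `USER_FIELDS` metadata, the `record_query` hook and the in-memory counter are invented for illustration and stand in for real schema metadata and a real metrics pipeline such as the InfluxDB and Grafana setup mentioned above.

```python
from collections import Counter

# Hypothetical schema metadata: field name -> deprecation reason (None = active).
USER_FIELDS = {
    "name": None,
    "organization": "Use `organizations` instead.",
    "organizations": None,
}

usage = Counter()  # in-memory stand-in for a metrics store


def record_query(requested_fields):
    """Count each requested field, as a server could do while executing a query."""
    for field in requested_fields:
        if field not in USER_FIELDS:
            raise KeyError(f"unknown field: {field}")
        usage[field] += 1


def deprecated_fields_in_use():
    """Deprecated fields that some client still asks for, with request counts."""
    return {f: n for f, n in usage.items() if USER_FIELDS[f] is not None}


record_query(["name", "organization"])   # an older client
record_query(["name", "organizations"])  # a newer client
```

Once `deprecated_fields_in_use()` stays empty for the chosen time frame, the deprecated field can be dropped with some confidence that no client still depends on it.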
B
That is very cool because otherwise you end up with all these endpoints that you're not sure they're being used. I mean, I guess you could do this monitoring if you just had a V1 endpoint as well.
A
Sure. One of the issues I have with this global API versioning is that it is very imprecise. You declare version one of your whole API, and sometimes you just want to change one field or remove one field. So you need to be very cautious about introducing new changes and coordinating those changes, because you can't maintain a lot of versions. So maybe you do one version once in six months or something like this. You can also be more precise about it. We also have a REST API, and we try to avoid versioning there as much as possible, just because of maintenance, and we are not a big team. What we did is we introduced a header. As soon as a user or client sends us something that is deprecated, like a deprecated query parameter, we add this additional header. I think it's called X-Deprecated or something like this, and this header tells you that, okay, you're using this query parameter, but it's no longer supported or it will be dropped in the future. And we do the same for our responses. But it's much, much harder with responses, because we always need to send this information whether the client actually needs this field or not, because we simply don't know.
B
How do you send it? Like it's just included in the response.
A
Exactly. So it's a response header and it just contains a list of all the things that are deprecated.
B
Yeah, makes sense. I agree with you about the global versioning, right? Because say you want to just make one change to the API and now you need to have a whole new version. And it does discourage like innovation on the API, right? Keeps it very static where I suppose with GraphQL, if you're just adding a new field to an existing thing, clients who, clients who aren't using that, it doesn't matter. Like they won't be asking for it, they won't get it back.
A
Exactly.
B
So you mentioned overfetching. To my understanding, overfetching is just when there's more data returned than the client actually needs.
A
Yes.
B
One way I've seen this dealt with in just a non GraphQL API is that you actually specify the fields. Like the client will just give you a list of hey, get me this user id 7 and I want name and I don't know some other field. What do you think of that?
A
Indeed, I saw those different approaches of doing this. In fact, our REST API includes this type of information as well. We don't have a black or white list of fields, but we have field expansion, so you can expand on references. I also saw, for example, JSON API. JSON API is a kind of specification for REST APIs which includes a lot of concepts like pagination, black and white listing of fields, and so on. So it has structure, it has semantics in it. One of the problems is that often there's no single way of doing this. As far as I saw in my personal experience, different APIs implement it in different ways, which are not always compatible with each other. So every single time you see a new REST API, you need to learn the concepts that this API uses, the way it lets you do the black and white listing of fields. In particular, it becomes more complicated when we are talking about nested structures. For example, if you're talking about the GitHub API, you have a user with a list of repositories, and in each repository you have a list of projects or a list of commits. In this case, you need a special language to express yourself. You need something like JSONPath where you say, okay, inside of a user, inside of the list of repositories, I would like to select a particular field. And GraphQL provides an advantage because it is a specification which describes precisely those concepts. It allows you to do it in a standard, well-specified way. And all GraphQL APIs work in exactly the same way, they implement exactly the same semantics, which means you can build very powerful tools, especially considering that a GraphQL API also exposes full introspection. So you not only know that this is a GraphQL API and it follows particular semantics, you can also discover everything there is to know about this API, including documentation.
B
So in that way you start with this idea where you want to make your API better for each client. So you add these fields like expand this field, don't include this field. But a solution to this problem already exists, which is GraphQL. Facebook's already gone through this and came up with this protocol for deciding what to return in a uniform way.
A
Exactly. And another important aspect: when Facebook started working on GraphQL, they started around 2012 and, funny enough, they actually started using it in the context of mobile applications. Even though now, if you look at GraphQL and the whole ecosystem, it is often used in the context of web applications, together with React, with Angular 2, with Vue.js and so on. But in fact, the origins of GraphQL come from the mobile space, from iOS and Android applications. Actually, at Facebook they started with a REST API. They tried to use it, but unfortunately it was not a very good fit for a mobile API. And another thing is that conceptually it is a different approach. In most cases, when we do the data model, just think about Scala code. In Scala, we define all those concepts like a user, an organization, a repository and so on. All those things are connected and somehow related to each other. So we define the data model in terms of types and relationships. Then when we go to define a REST API, we suddenly start to think about resources, flat resources that somehow encapsulate a particular part of the API. But they don't have much of a notion of relationships between data, especially deep and more complicated relationships. And this is where GraphQL helps a lot, because it allows you to expose not only the data itself, but also those relationships between data. So you can directly ask for the organizations of a user at the same time. And it's just a normal field, so you're no longer working with those references. So the conceptual model with GraphQL is a graph. It's about types and relationships instead of tables and foreign keys, for example, or resources and links.
B
And this is a graph in the computer science sense of nodes that you can follow to another node along edges.
A
Exactly. Conceptually, on the server, you can think of the whole GraphQL API, the data inside of it, as a graph, but when you query it, you need to start somewhere. This is a kind of entry point. With GraphQL, you normally start with a query type, and then what you actually get back is a tree. So you get a tree, a traversal, so to say, of this data. Because graphs often have loops and recursive data structures. So when you as a client ask for particular data, you don't get the graph shape, but you get the tree shape of the data. Kind of a projection of the graph.
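To make the graph-versus-tree point concrete, here is a small Python sketch under invented assumptions: the `USERS` and `ORGS` dictionaries form a cyclic graph (a user belongs to an organization whose members include that same user), and the nested selection dictionary stands in for a query. The traversal terminates because the selection is finite, even though the underlying data loops.

```python
# Hypothetical in-memory graph with a cycle:
# user -> organizations -> members -> the same user again.
USERS = {"u1": {"name": "Ada", "org_ids": ["o1"]}}
ORGS = {"o1": {"name": "Acme", "member_ids": ["u1"]}}


def resolve_user(user_id, selection):
    """Project one user node into a tree, following only selected fields."""
    user = USERS[user_id]
    out = {}
    if "name" in selection:
        out["name"] = user["name"]
    if "organizations" in selection:
        out["organizations"] = [
            resolve_org(org_id, selection["organizations"])
            for org_id in user["org_ids"]
        ]
    return out


def resolve_org(org_id, selection):
    """Project one organization node into a tree."""
    org = ORGS[org_id]
    out = {}
    if "name" in selection:
        out["name"] = org["name"]
    if "members" in selection:
        out["members"] = [
            resolve_user(uid, selection["members"]) for uid in org["member_ids"]
        ]
    return out


# Roughly: { name organizations { name members { name } } }, starting at user u1.
tree = resolve_user(
    "u1", {"name": {}, "organizations": {"name": {}, "members": {"name": {}}}}
)
```

The same user appears twice in the result, once at the root and once as a member, which is what a tree projection of a cyclic graph looks like.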
B
Yeah, because you could say, for instance, get me these four users and these users organizations and the organizations repositories or something. Right. Does that make sense?
A
Yeah, exactly, exactly. And you not only can ask for, let's say, a list of organizations of a user, you can also ask for a particular page. You can do pagination inside of the data structures, because the only thing that GraphQL defines is fields, maybe with a nested selection of fields, and every single field can have arguments. And this is how you do pagination and sorting and filtering. So you can have, for example, a list of users that are the top 10 users. And then inside of a user type, inside of a user object, you can ask for a list of organizations, but not all of them. Let's say the last three organizations that the user had activity in.
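A minimal Python sketch of fields that take arguments, assuming invented data and invented argument names (`first` and `last`). The GraphQL specification only says that fields may have arguments; what they are called and what they mean is decided by the server developer.

```python
# Hypothetical data: each user's organizations are ordered by last activity,
# oldest first.
USERS = [
    {"name": "Ada", "score": 90, "organizations": ["A", "B", "C", "D"]},
    {"name": "Bob", "score": 75, "organizations": ["E"]},
    {"name": "Cy", "score": 99, "organizations": ["F", "G"]},
]


def users_field(first):
    """Top-level field with a `first` argument: the top users by score."""
    return sorted(USERS, key=lambda u: u["score"], reverse=True)[:first]


def organizations_field(user, last):
    """Nested field with a `last` argument (assumed >= 1): most recent entries."""
    return user["organizations"][-last:]


# Roughly: { users(first: 2) { name organizations(last: 3) } }
result = [
    {"name": u["name"], "organizations": organizations_field(u, last=3)}
    for u in users_field(first=2)
]
```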
B
Makes sense. So I think I kind of understand the motivating examples for GraphQL. So what is GraphQL? It's not a library, it's not an implementation. What is it exactly?
A
So GraphQL is a query language for an API. It was first developed by Facebook in 2012, and in 2015 it was publicly released as a specification. So this means that GraphQL is a specification, which you can think of as just a PDF document that describes the syntax and execution semantics of this query language. And in my opinion, this is an important point, because this was a key factor for this whole GraphQL ecosystem to emerge. If you look now, the GraphQL specification is implemented in more than 15 different programming languages, including Scala with Sangria, and languages like Ruby, JavaScript, Python, Elixir, Erlang and so on. So all major programming languages have an implementation. And I think the specification was a key to this success. So yes, GraphQL is a specification, and we have different implementations of the specification to help people build GraphQL servers. There's also a reference implementation written in JavaScript, which is maintained by Facebook.
B
Like from the sounds of it on the outside being a query language, it sounds like somehow it's a database front end. Is it a database front end?
A
It is often misrepresented or perceived as similar to SQL. Many people, when they hear about GraphQL, think about SQL or this kind of more complex and very expressive language. But the fact is that GraphQL is not really similar to SQL, it's quite different. It's more similar to a REST API than to SQL. Because with GraphQL, whatever you allow your client to ask for needs to be explicitly specified. This means every single time you, for example, provide a list of organizations, it is explicitly specified that the user has a field, and this field is called organizations, and you can ask for it on a user, but maybe not in any other way. Also, if you would like to sort this list of organizations in a particular way, you need to explicitly define an argument for this field so that the user can specify a limited subset of sort criteria. But all of this is done by the server developer, and they need to explicitly allow this field to be available on the user type. Otherwise GraphQL doesn't have any generic type of aggregation like SQL does. So this means you can anticipate what a client would be able to ask, and you can optimize for it. You can be sure that everything you expose through your GraphQL API is always optimized in some way.
B
So GraphQL, it's a query language, but the person on the server is responsible for writing the evaluation of this query.
A
Yes, exactly. Another thing I haven't mentioned yet is that GraphQL is completely agnostic of the protocol. It can be HTTP, it can also be just a simple TCP connection. And it also doesn't care about the actual data format. It can be JSON, it can be XML, it can be a binary data format. For example, I used GraphQL with the MessagePack and Amazon Ion formats, which are binary data formats. And when you implement a GraphQL API, for every single field you need to specify a function, and this is your function. So it's all up to you how and what you do when particular fields are requested. Of course, there are a lot of tools that help you build those resolve functions. There are things like macros that will help you derive the structure and implementation based on case classes. But in some cases you actually need to go to the database and fetch the data. And since it's just a normal function, you can do it, but you need to do it yourself. So this means GraphQL also has no notion of any kind of data storage, and it's all up to you how you implement it. A nice side effect of this is that you can query not only a single data store, but multiple data stores or databases at once. For example, in our API we use MongoDB and Elasticsearch. A single GraphQL query can be fulfilled using data loaded from both MongoDB and Elasticsearch. And I think even a part of it can come from some internal REST API.
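The "one function per field" idea can be sketched as follows. Everything here is a stand-in: the two dictionaries only play the roles of MongoDB and Elasticsearch, and the `RESOLVERS` table and `execute` function are invented for illustration, not taken from any GraphQL library.

```python
# Hypothetical stand-ins for two separate data stores.
DOCUMENT_STORE = {7: {"name": "Ada"}}                 # plays the role of MongoDB
SEARCH_INDEX = {7: ["scala-notes", "graphql-intro"]}  # plays the role of Elasticsearch

# One resolve function per field; each decides where its data comes from.
RESOLVERS = {
    "name": lambda user_id: DOCUMENT_STORE[user_id]["name"],
    "recentDocuments": lambda user_id: SEARCH_INDEX[user_id],
}


def execute(user_id, requested_fields):
    """Call only the resolvers for the fields the client asked for."""
    return {field: RESOLVERS[field](user_id) for field in requested_fields}


# One request, answered from two different stores.
response = execute(7, ["name", "recentDocuments"])
```

A client that asks only for `name` never triggers the search lookup, because the resolver for `recentDocuments` is simply not called.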
B
Yeah, or you could have, you know, if you had some sort of microservice architecture, there's like a whole bunch of various services and your API endpoint is kind of gathering those up to fulfill the request.
A
Yeah.
B
So one of these tools for implementing the server side of GraphQL request is Sangria, which you created. So what led you to create this?
A
Yes, exactly. So Sangria is a Scala GraphQL implementation that helps you build the server and expose a GraphQL API. And it provides a lot of tools to help you work with GraphQL queries, to help you validate them and maybe build tools on top of it. For me, the motivation was a lot of problems I personally faced with REST APIs. I had been building REST APIs for a long time, and the first time I heard about GraphQL, I saw how it works and how it feels. I got immediately excited about this technology, even before it was publicly released. It was released in 2015, I think it was July 2015, when it was announced at React Europe. I was there and I was excited about the specification, and I immediately started working on a Scala implementation, because of course none was available for Scala. And it was a huge help that GraphQL has not only a specification but also a reference implementation. I'm not a very good JavaScript developer, and it is written in JavaScript. But it is very helpful if you're trying to implement a specification in a different language, because if there's some ambiguity, if something is not quite clear based on the specification text, you can always go and look into the algorithms. You can look at the specific implementation, you can test it out on the actual thing and just implement the semantics. So this was a huge help, and it took, I think, about a month. And after a month it was kind of the first feature-complete implementation of GraphQL.
B
How did you find combining Scala with GraphQL? Do the technologies seem to fit well together or not?
A
I think it fits quite well. What always concerned me is that we often define case classes. We define and think through our data model, we define very precise relationships between all our types to help us reason about our type system, to help us reason about the application. But then as soon as we start to implement a REST API, suddenly we throw away all of these types, all of this data model, and what we expose is just JSON blobs. So we just expose endpoints and JSON blobs. Some people do expose things like OpenAPI, like a schema, but those schemas are often maintained separately, so they are often out of date and maybe not as precise as one would wish for. And for me personally it was a huge selling point of GraphQL that it has a type system. It's not as powerful as the Scala type system, but I think it's also not a good idea to expose something like the Scala type system through an API. It needs to be something simpler and something that is usable by many different clients. In many scenarios companies implement their server side in Scala with Sangria, and on the client side they have a whole team of front-end developers who build React applications, who build React Native applications and maybe iOS and so on. So it is important that those people can take advantage of this information about the type system, including in all of the tooling that they use.
B
And the OpenAPI standard, that's like Swagger. It makes the Swagger document, right? It does tell you the types and how you would call things and stuff. But it's not enforced, like there's nothing to say that it's correct, and it's also, as you were saying, completely outside of your API. Like I just maintain some giant YAML document that produces that. And yeah, it can lag. So how do types work in GraphQL? How do I get the type of a certain call I'd like to make?
A
So it's actually the introspection API. It's not different from anything else. You just make a normal GraphQL request, you just say __schema, and this field is always available on all GraphQL APIs, and you get the full introspection of the type system. It includes things like which types are available, like user or profile or maybe organization, and which fields those types have. You have information about interfaces and union types, about scalar types like int, string, long and so on. Then you also have enum types and enum values. And I think the last thing is that you also have input types. So there's a difference between output and input types. Input types can be provided as complex values. You can think of it as a JavaScript object or JSON object, but it can be provided as an argument for.
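For reference, an introspection request is an ordinary query that starts at `__schema`. The query text below uses field names defined by the GraphQL specification (`__schema`, `types`, `name`, `description`, `fields`). The `TYPES` list and the `introspect` function are a hypothetical Python stand-in for a server answering from its own metadata, shown only to illustrate the shape of the data.

```python
# An introspection query; the field names used here come from the GraphQL
# specification and are available on any conforming GraphQL API.
INTROSPECTION_QUERY = """
{
  __schema {
    types {
      name
      description
      fields { name description }
    }
  }
}
"""

# Hypothetical type metadata that a server would already hold.
TYPES = [
    {
        "name": "User",
        "description": "A registered user",
        "fields": [{"name": "name", "description": "Display name"}],
    },
]


def introspect():
    """Toy stand-in for a server answering the query above from its metadata."""
    return {"__schema": {"types": TYPES}}


type_names = [t["name"] for t in introspect()["__schema"]["types"]]
```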
B
Like updating something or adding something.
A
Exactly. I think it's also an important point that GraphQL is not read-only. It also has a mutation part and subscriptions. When you say just a query, it means you just want to read the data. In this case you are working with output types. But you can also say a mutation. So you can send a mutation query, which looks very, very similar. It's just a list of fields which have arguments, but they are intended as mutations of your data. So you can have things like add new product, create a user, change username and so on. Those kinds of fields.
B
So does that mean I can send an update that affects like a whole graph, like several objects that could be in several data stores?
A
It's possible because GraphQL doesn't know anything about the business logic of your application. It doesn't try to prescribe a specific way of modeling the data, it just provides tools to model data and to describe its structure via the type system. So in this case you just define a field which is called, for example, subscribe user, and on the server side you provide a function, it's often called a resolver or resolve function, where it's all up to you what you would like to do and which data stores you would like to talk to in order to subscribe a user. It might be just a single data store. You can also, for example, talk to the Mailchimp API, so you can create a user and subscribe the user to a particular mailing list.
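A mutation field is resolved by an ordinary function too, so it can touch as many systems as needed. The Python sketch below assumes an invented `subscribe_user` resolver; the dictionary and list are stand-ins for a primary data store and an external mailing-list service, and no real service API is called.

```python
USERS_DB = {}      # hypothetical primary data store
MAILING_LIST = []  # hypothetical stand-in for an external mailing-list service


def subscribe_user(email, name):
    """Resolve function for a hypothetical `subscribeUser` mutation field."""
    user = {"id": len(USERS_DB) + 1, "email": email, "name": name}
    USERS_DB[user["id"]] = user  # write to the data store
    MAILING_LIST.append(email)   # and update the second system
    return user                  # the client can select fields from this result


# Roughly: mutation { subscribeUser(email: "...", name: "Ada") { id } }
created = subscribe_user("ada@example.com", "Ada")
```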
B
This is the beauty of it just being a protocol, right, Is that the details behind it can vary.
A
Yeah, exactly. I think this is a very powerful concept. For example, Twitter uses GraphQL. They started to use GraphQL a while back, and I think this demonstrates the power of this independence of the transport and of the data format. Because on the surface you can have an HTTP API that returns JSON, but then internally you can take the same query and orchestrate its execution across multiple different microservices, which also talk GraphQL, but maybe use a TCP protocol with something like Protobuf or Thrift, a more efficient binary data format.
B
And is Twitter using Sangria to do this?
A
Yes, there's actually a very interesting talk from GraphQL Summit 2017, I think it was in October or November, and the last talk is from Twitter, where they talked about how they implemented subscriptions with GraphQL. So GraphQL subscriptions with Sangria.
B
That's awesome. I'll have to check it out. What were your thoughts when you, when you found out this library you built was being used by Twitter?
A
Definitely, I was very excited and maybe a little bit scared because, you know, you never know, maybe something is wrong and something will cause a disaster. But it's been a while now, and we actually had some communication and they contributed some improvements to the library. Actually, there are a lot of companies that use GraphQL now, and a lot of them also use Sangria. I'm very happy about it, and I think it helps to improve the library, because all those companies use Sangria and they contribute back. They contribute back not only code, but also feedback, and maybe reports of problems when something goes wrong or when there are some performance issues, and we can figure it out, we can discuss it. So this helps a lot.
B
Yeah. I can understand why you'd be a little fearful at first, but at this point, if Sangria is being used by a number of big companies, including Twitter, the number of GraphQL requests that it's served, it must be pretty battle hardened at this point.
A
I guess. As far as I remember, last time as I tweeted about it, it was like, I think it was about 2 billion requests per day at the moment.
B
Wow.
A
Which is not. I mean, on a Twitter scale, it's not that big, but in terms of this new technology or relatively new technology, it is quite an amazing thing to see. So, yes, it's definitely a good proof that this works and it scales and maybe there are some issues along the way, but I'm pretty sure that there's nothing that we cannot solve or address in some way.
B
Yeah. Wow, that's some big numbers. I mean, at least from my perspective. So who else are there other exciting companies using Sangria?
A
Definitely. So I know of Twitter and The New York Times, and Coursera also uses Sangria. Coursera provides educational videos, and they also have courses about functional programming with Martin Odersky.
B
Interesting. So if I were to use Sangria and getting into the details, if I have my user case class and I want to return that as part of my GraphQL request, like what do I have to do?
A
So if you just have a case class, for example, and you have a way to get this case class from some place like a database, what you need to do is define an object type. The object type is kind of meta information about your type. It contains a name, a description and a list of fields. And every field has a name, a description and a resolve function. So you define additional meta information, you describe the user type in more detail. As soon as you have this type, you can create the schema. So it's just a simple case class, and you can execute queries against this schema, against this type. So it's actually very little work you need to do in order to expose a particular type via GraphQL.
B
So if I have a service, whatever it's called, get user and it takes like a GUID and it returns this user case class. So then I write an object type which basically contains the documentation for my user object, like what types the fields are, is that right?
A
Exactly, exactly. It defines the names of the fields, possibly the documentation, and the type of each field. And you also need to define this resolve function, which provides the information: given a user case class, how do I get a field like name, for example? So it's just a simple function. What you can also do is use a macro. Sangria provides derivation macros. It's a macro that looks at the structure of the case class, or any class, and generates this meta information for you. So you just derive the structure from a case class and expose it via a GraphQL API.
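The structure described here, an object type with a name, a description and a list of fields that each carry a resolve function, can be modeled in a few lines. This Python sketch only mirrors the concept; the `Field`, `ObjectType` and `execute` names are invented for the example and are not Sangria's actual Scala API.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class User:  # plays the role of the case class
    id: str
    name: str


@dataclass
class Field:
    name: str
    description: str
    resolve: Callable[[Any], Any]  # given a value, produce this field's value


@dataclass
class ObjectType:
    name: str
    description: str
    fields: List[Field]


# Meta information describing how a User value is exposed.
USER_TYPE = ObjectType(
    "User",
    "A registered user",
    [
        Field("id", "Unique identifier", lambda u: u.id),
        Field("name", "Display name", lambda u: u.name),
    ],
)


def execute(object_type, value, requested):
    """Resolve only the requested fields of `value` using the type's metadata."""
    by_name = {f.name: f for f in object_type.fields}
    return {name: by_name[name].resolve(value) for name in requested}


only_name = execute(USER_TYPE, User("42", "Ada"), ["name"])
```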
B
Because in most of the cases I'm just returning this object, basically. In most of the cases it's just a straight translation: I have this object and now I want to make this object type that just describes its fields with some description. So I can generate that with a macro.
A
Exactly.
B
But if I really want to, because there are cases I'm thinking of where I don't want to return this particular value, is that when I would skip the macro?
A
In fact, you can go pretty far with the macro, because I personally believe that macros or macro-based derivation shouldn't be all or nothing. In many cases when I see a macro, it's kind of all or nothing. If the macro does what you like, you can just use it. If you want to do some small customizations to the result of this macro, you need to go to the more explicit style. This is not the case with Sangria, because you can, for example, derive the structure of a case class, but you can still provide a description for fields, or you can exclude or include particular fields. You can even replace fields or add new fields, but still derive the base structure from the case class. And it's all type safe, because the macro executes at compile time. So this means if you, for example, exclude a field that doesn't exist, it would be a compilation error. But if you have something completely custom, something completely new that doesn't have a case class behind it, in this case, yeah, you can use the explicit style. It's not that much boilerplate, and you can be very explicit. Actually, some people do prefer the explicit style, and one of the big reasons is that you often want to keep your API data model or API type system separate from your internal data model. Your internal data model represents how you implement your business logic. But the things you expose, this is what you would like your users to see.
B
Yeah, because I suppose if I have this user object and I just use a macro to expose it, and then somebody else adds a new field to the user object, like they might not actually realize that they're exposing that via the API if it's not explicit. I could see that happening.
A
Exactly.
B
So I think you slightly touched on this. But if I want to return something in my user object that is not in fact a field, that is like a function, like say I have employee salary, which is actually a calculation based on some other properties. But I don't want to have that calculation have to be redone on the client. I'd rather return it. How would I expose that?
A
Exactly. So for example, if you're using the macro, you say derive the object type for user, and then as an argument you provide a kind of setting. It's just a list of settings, and they are analyzed at compile time. And there you can say something like add a new field. And from this point on, just for this field, you define the field explicitly, which means you can provide a name, description and resolve function. And inside of this resolve function you can do whatever you want. You can communicate, for example, with an external service and fetch this data from somewhere else, maybe from some existing RESTful API.
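To make the idea of deriving fields and then customizing them concrete, here is a minimal sketch in Python. It is not Sangria's API: `derive_fields`, the `User` dataclass and its attributes, and the `salary` calculation are all hypothetical names chosen for illustration. It also differs from what is described above in one important way: a mistake such as excluding a field that does not exist is only caught at runtime here, whereas the Sangria macro reports it at compile time.

```python
from dataclasses import dataclass, fields
from typing import Any, Callable, Dict, Iterable, Optional

# A field is modelled as a name mapped to a resolve function.
Resolver = Callable[[Any], Any]


@dataclass
class User:  # hypothetical internal data model
    name: str
    hourly_rate: float
    hours_per_year: int
    password_hash: str


def derive_fields(
    cls: type,
    exclude: Iterable[str] = (),
    add: Optional[Dict[str, Resolver]] = None,
) -> Dict[str, Resolver]:
    """Derive one resolver per dataclass field, then apply customizations."""
    names = [f.name for f in fields(cls)]
    excluded = set(exclude)
    unknown = excluded - set(names)
    if unknown:
        # Runtime check only; this sketch has no compile-time equivalent.
        raise ValueError(f"cannot exclude unknown fields: {sorted(unknown)}")
    # Derived fields simply read the attribute of the same name.
    derived: Dict[str, Resolver] = {
        n: (lambda obj, n=n: getattr(obj, n)) for n in names if n not in excluded
    }
    # Explicitly added fields can compute anything inside their resolver.
    derived.update(add or {})
    return derived


user_fields = derive_fields(
    User,
    exclude=["password_hash"],
    add={"salary": lambda u: u.hourly_rate * u.hours_per_year},
)

bob = User("Bob", 50.0, 2000, "not-exposed")
print({name: resolve(bob) for name, resolve in user_fields.items()})
```

The derived fields stay in sync with the data class, while the excluded and added fields are the only parts written by hand, which is the middle ground between a fully derived and a fully explicit schema.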
B
And this would be kind of the same way I would use it to follow references? Like if I want to say that there's something hanging off my user that is organization name, however I actually have to go out and perform a query to get that on the back end. Would I expose it the same way?
A
So every single field is just a function. Most of the functions will just return data that is already there. But some functions will need to load data from some data store. In particular, there's a root type in GraphQL, a root query type. This is kind of your entry point. Those fields are available to you when you just open curly braces and start to type your fields. This is the query type. So those are the top-level fields that you expose to the client. And for most of those fields you will need to go to the database or some kind of data storage to load the data, like load the user case class, and then the user type will work with its fields, but those fields are already loaded. And it works the same way for references; it is recursive in that sense. So on a user you can have a field organizations, and the field itself can go to the database, fetch a list of organizations for this user, and give it back. I think one of the problems that many people ask about after the introduction of those kinds of things is that, well, you will end up with this N+1 problem where you load the same thing over and over again.
B
Yeah.
A
Or you have a lot of these N+1 queries to the database. And with Sangria there's a notion of deferred value resolution. So what you can do, instead of loading an organization, is return a deferred value. This says that, okay, in this place I would like to load an organization. What the execution engine does is collect all of those deferred values, and at the right moment, when there's nothing more to collect, it calls another function. And this function just gets a list of those deferred values as an argument and can very, very efficiently load all this data at once. So at the end of the day you will not end up with the N+1 problem. You can load all this data at once in a single place.
B
Yeah, I think this is a really cool feature. So a person makes a request and they get back all the users with the first name Bob, but they also want to include with that Bob's organization. So if 10 records come back, in theory it's got to make 10 requests out to organizations. But I believe what you can do in Sangria is that you get back all these IDs, all the organization IDs, and feed them as a list into some request that's like, get me all these organizations, and then it builds the graph back for you. Right. So you end up with two requests instead of 11.
A
Exactly. So if you're familiar with libraries like Fetch, I think from 47 Degrees, or Clump, this is a very, very similar concept. You give back IDs and you say, okay, eventually use this ID to load this data. But you do it in bulk, or in batch. And a lot of people actually use it. They either do a batch SQL query or they maybe have a REST API that accepts a list of IDs and gives back just a list of JSON objects. And yeah, exactly, you can do it all at once. In terms of the structure, your execution can branch, but in many cases, if you have the same type of data, you will end up with one query per nesting level. So if you have a deeply nested query, say you ask for a user, and then the organization of a user, and then the users of this organization, you will end up with at most three SQL queries, or maybe HTTP requests to internal microservices.
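The batching idea described above can be sketched in a few lines of Python. This is a simplified model, not Sangria's execution engine: `DeferredOrg`, `load_users_named`, `load_orgs_by_ids`, `execute` and the in-memory `USERS` and `ORGS` tables are hypothetical stand-ins for a real schema and data store. The field resolver returns a placeholder instead of loading anything, and the executor resolves all placeholders with a single bulk call.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class DeferredOrg:
    """Placeholder meaning: load this organization later, in bulk."""
    org_id: int


# Hypothetical in-memory "backend" plus a log of the requests made to it.
ORGS: Dict[int, str] = {1: "Acme", 2: "Globex"}
USERS: List[dict] = [
    {"name": "Bob", "org_id": 1},
    {"name": "Bob", "org_id": 2},
    {"name": "Bob", "org_id": 1},
    {"name": "Alice", "org_id": 2},
]
calls: List[object] = []


def load_users_named(name: str) -> List[dict]:
    calls.append("users")
    return [u for u in USERS if u["name"] == name]


def load_orgs_by_ids(ids: Set[int]) -> Dict[int, str]:
    calls.append(("orgs", sorted(ids)))
    return {i: ORGS[i] for i in ids}


def resolve_organization(user: dict) -> DeferredOrg:
    # The field resolver does no I/O; it only records what it needs.
    return DeferredOrg(user["org_id"])


def execute(name: str) -> List[dict]:
    users = load_users_named(name)
    rows = [
        {"name": u["name"], "organization": resolve_organization(u)} for u in users
    ]
    # Collect every deferred value, then load them all with one request.
    wanted = {row["organization"].org_id for row in rows}
    loaded = load_orgs_by_ids(wanted)
    for row in rows:
        row["organization"] = loaded[row["organization"].org_id]
    return rows


if __name__ == "__main__":
    print(execute("Bob"))
    print(calls)
```

However many users come back, the log holds exactly two backend requests: one for the users and one for the distinct organization IDs. A nested query would repeat the collect-then-load step once per nesting level.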
B
And as an implementer, it just means I have to write a service that can get a user, a service that can get a list of users, a service that can get an organization by ID, and a service that can get organizations in bulk by a list of IDs. Right. And then Sangria is building this data model, sort of. It seems like effectively you're doing a SQL join, but you're doing it in memory using maps. Is that.
A
Yeah, it's kind of similar, in fact. So there are different approaches to this. And in many cases it's actually not very efficient to make those giant SQL queries where you have tens of joins. It is quite inefficient. So what people often do is separate this big SQL query into multiple smaller queries that work in bulk. So for example, you can ask for a user and you load the user information, and then a user has a list of organizations. In this case you just say, okay, give me all of those organizations by ID. And in many cases this is actually more efficient than making a huge SQL query with nested joins and so on.
B
Makes sense. It probably in some ways can scale better because you can have a whole bunch of things. Like you can have a whole bunch of GraphQL serving boxes sitting out there, right?
A
Yeah, exactly. It's much more flexible because you are not tied to a specific data storage. For example, especially if you are working with NoSQL databases, like in our case where we work with MongoDB and Elasticsearch, part of the data actually comes from Elasticsearch, part of it comes from MongoDB, part of it might come from an external or internal microservice, and you need to orchestrate the fulfillment of the GraphQL query across all those data storage engines. And in this case it's quite flexible that you can actually separate the API from the data storage.
B
Gives a little give there. You could move things from one data store to another and it wouldn't even matter. Well, it would matter, but not as much. Do you think there's cases where GraphQL isn't an appropriate solution for an API?
A
I think so. I think a GraphQL API is quite helpful if you have this fast-evolving data model which you would like to expose to specific clients, and you would like to provide a lot of flexibility for those clients. But in many cases, or at least in some cases, you have a well-established data model and you have a huge number of clients. The actual storage of the data is distributed. Just think about the Wikipedia API. The Wikipedia API doesn't change that often. It provides a lot of media, and it needs to be very cacheable. It is very helpful to have a cache, for example of the JSON of a particular Wikipedia article. And in this case I think a REST API might fit much better, because it's so tightly coupled with HTTP and the way HTTP works. So we can use all of those reverse proxies and caches along the way, and they all understand things like ETags and the cache headers and so on. So this caching part is much easier with a REST API, because you just expose a resource and it's much easier to cache it. With a GraphQL API, if you expose it via HTTP, you normally have just a single endpoint, and what you would like to get you specify via the GraphQL query. There are ways to cache it, but it's more complicated than with a REST API, which is tightly coupled to HTTP and the way HTTP works.
B
Because I guess there's some advantages then of the protocol being tied into the transport protocol, I guess.
A
Exactly. So there are different advantages to this. But for example, clients like Apollo Client, GraphQL clients, try to mitigate this problem. Apollo Client, for example, maintains a normalized cache on the client side. So this is a big help if deduplication is important in your application. But otherwise you don't really take huge advantage of HTTP caching with a GraphQL API.
B
So what do you find is a stumbling block for people learning about GraphQL and about sangria?
A
Maybe just a different data model. With a REST API you model data in a very different way than with GraphQL APIs. And this might be a big roadblock, where people try to apply the same concepts, like resources, on top of a GraphQL API, and they don't take advantage of the more expressive data model that GraphQL provides: all of those relationships and connections between different parts of the data. Relationships between types are maybe not modeled.
B
If you're just thinking in terms of resources and the properties that hang off those resources. GraphQL actually is more about how these things relate. So you might miss the way that you can model these relationships.
A
Yeah, exactly. People don't even realize that GraphQL has a very powerful type system behind it. And this type system is exposed through an introspection API, which is always available for every GraphQL API that is out there. And I think it's a very, very powerful notion. So for me personally, I would say one of the biggest features of a GraphQL API is that it has a type system and it has all those nice qualities.
B
And that leads to discoverability as well. Right. So with a REST API, there might be Swagger documentation, there might be some document somewhere describing things. But if it's a GraphQL endpoint, it's always going to be self-documenting.
A
Exactly. And you can rely on it. I think this is also an important point, that you can actually rely on it. So if you have a GraphQL endpoint, just a URL, you know precisely how you will figure out what types it provides and what things you can do with it. And because of this we actually see a lot of tools built for GraphQL APIs. For example, there was recently a small Scala library that loads the schema from two different places and compares the types and fields with each other. So there's a helper in Sangria, and it just prints a list of breaking changes between those two different schemas. And this is very helpful because of what you can do with it, and what people are actually doing. When I was talking to people who use it in production, they had integrated it as part of the CI build pipeline, and they compare schema changes between the staging environment and the production environment.
B
So it's a tool that takes a GraphQL endpoint, says, like, okay, get me the types. Okay, now get all the fields of the types. And then it says, okay, the last time I looked, there was this field, and now it's gone. So this is a breaking change.
A
Exactly.
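The schema comparison discussed here can be sketched as a small function over two schema snapshots. This is an illustrative model in Python, not the Sangria helper mentioned above: the `breaking_changes` function and the dictionary shape (type name to field name to type name) are assumptions, and in practice the snapshots would come from each endpoint's introspection query rather than from literals. Treating every type change as breaking is also a deliberate simplification.

```python
from typing import Dict, List

# Assumed snapshot shape: {type name: {field name: field type name}}.
Schema = Dict[str, Dict[str, str]]


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """List changes that could break clients written against `old`."""
    changes: List[str] = []
    for type_name, old_fields in old.items():
        if type_name not in new:
            changes.append(f"type {type_name} was removed")
            continue
        new_fields = new[type_name]
        for field_name, old_type in old_fields.items():
            new_type = new_fields.get(field_name)
            if new_type is None:
                changes.append(f"field {type_name}.{field_name} was removed")
            elif new_type != old_type:
                # Simplification: any type change is reported as breaking.
                changes.append(
                    f"field {type_name}.{field_name} changed type "
                    f"from {old_type} to {new_type}"
                )
    # Added types and fields are not reported: existing queries still work.
    return changes


production: Schema = {
    "User": {"name": "String", "email": "String", "age": "Int"},
    "Organization": {"name": "String"},
}
staging: Schema = {
    "User": {"name": "String", "age": "String", "nickname": "String"},
}

for change in breaking_changes(production, staging):
    print(change)
```

A CI step could fail the build whenever the returned list is non-empty, which matches the staging-versus-production check described in the conversation.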
B
That's very cool. I never thought of that. That's a neat idea. Well, like, I want to be considerate of your time, so thank you so much for talking with me. I've learned a lot about GraphQL. It's been a lot of fun.
A
Yeah. Thank you very much. Thank you for having me.
Host: Adam Gordon Bell
Guest: Oleg Ilyenko, creator of Sangria (GraphQL for Scala)
Date: April 18, 2018
This episode explores the world of GraphQL through an in-depth conversation with Oleg Ilyenko, the creator of Sangria, a GraphQL implementation for Scala. Host Adam Gordon Bell discusses common API challenges, how GraphQL addresses them, the strengths and weaknesses of both REST and GraphQL, and gets into the technical nitty-gritty of building APIs with Sangria.
“You just want to change one field or want to remove one field. So you need to be very cautious about introducing new changes...” – Oleg (08:23)
Query Flexibility: Clients specify exactly what data they require, avoiding over-fetching and under-fetching (03:46-04:09).
No More Global Versioning: Instead of versioning the whole API, GraphQL focuses on the shape of the response clients need (04:09-05:33).
“With GraphQL, you no longer think in terms of the version, you're thinking in terms of data requirements.” – Oleg (04:09)
Field Deprecation & Usage Tracking:
Easy Schema Building:
Custom Fields & Batch Loading:
Separation of Data Model and API Model:
“Last time as I tweeted about it... it was about 2 billion requests per day at the moment.” – Oleg (35:07)
“In this case I think REST API might fit much better because it's so tightly coupled with HTTP and the way HTTP works.” – Oleg (49:28)
“If you can have a GraphQL endpoint, just a URL, you know precisely how you will figure out like what types it provides, what things you can do with it.” – Oleg (53:41)
On Field-Level Deprecation & Safety:
“Given this, and given that server always knows what clients uses, we can track the usage of particular fields... We can define thresholds, we can define the time frame, for how long do we wait until we drop this field. But when we drop this field, we have a lot of confidence that this field is not used by any client.” – Oleg (07:10)
On Discoverability and Tooling:
“All GraphQL APIs work in exactly the same way... you can build very powerful tools, especially considering that GraphQL API also has a full introspection exposed.” – Oleg (13:12)
On Batch/Bulk Loading:
“Instead of loading an organization, you can... return a deferred value. What execution engine does, it collects all of those deferred values and... can very, very efficiently load all this data at once.” – Oleg (43:59)
On Open Source Responsibility:
“I was very excited and maybe a little bit scared because, you know, you never know, like maybe something is wrong and something will cause disaster.” – Oleg on Twitter’s Sangria adoption (33:37)
| Timestamp | Segment Description |
|-----------|--------------------|
| 00:32–02:21 | Adam describes API versioning problems |
| 03:46–05:33 | Oleg explains GraphQL's approach to field selection and avoiding breaking changes |
| 05:49–08:07 | Field deprecation and usage monitoring in GraphQL |
| 13:57–16:13 | History of GraphQL at Facebook & graph data modeling |
| 18:34–20:07 | GraphQL as a specification, implementations across languages |
| 24:38–26:38 | Motivation and process behind creating Sangria |
| 29:07–31:03 | Exploring the introspection API and complex mutations |
| 43:57–46:53 | Batch data loading and avoiding "N+1" problem |
| 49:28–51:08 | Situations where REST is a better choice than GraphQL |
| 53:20–54:45 | Schema diffing and tooling enabled by introspection |
Oleg’s Reaction to Twitter using Sangria:
“I was very excited and maybe a little bit scared because, you know, you never know... But it was a while now and… they actually contributed some of the improvements to the library.” (33:37)
Production Scale:
“It was about 2 billion requests per day at the moment.” (35:07)
Macro-based Schema Generation:
“You can derive the structure of a case class, but you can still provide a description for fields, or you can exclude or include particular fields. You can even replace field or add new fields…” (38:59)
This conversation gives a comprehensive, nuanced perspective on GraphQL’s strengths, trade-offs, and real-world use, both at small and massive scale. Oleg’s insights into how GraphQL (and Sangria) empower type-safe, client-driven APIs, while acknowledging REST’s continued relevance, provide valuable guidance for anyone designing modern APIs.
For further details, live demos, and production stories, tune into the full episode!