
In this conversation, Jillian Forde interviews Dave Bechberger, Principal Graph Architect at AWS, wh
A
This is episode 687 of the AWS podcast released on September 30, 2024.
B
Welcome everyone to the AWS Podcast. I'm your host for today, Jillian Forde. And we've got a really interesting topic for you today on graph. I feel like it's been emerging the past couple of years and we're really gonna unpack it. We as in Dave, because I'm just the host. Dave is the one who actually knows about this stuff, the expert, the one and only Dave. So why don't you introduce yourself to all of our amazing listeners here today.
A
Hi everyone, my name is Dave Bechberger. I'm a principal graph architect on the Amazon Neptune service team. I've been working in software for 20-plus years at this point and working with graphs for about the last 10. So definitely happy to be here and happy to talk about graphs and how they can help people solve problems.
B
Let's start with the foundations since a lot of people now maybe think that they want a graph, but a lot of people also don't really know what a graph is. So how would you describe to someone who is new to this what a graph is?
A
From a very technical side, a graph is a way to look at your data as entities and connections between entities. But from a practical side, it's something that we interact with all day, every day. Any social network, Facebook, LinkedIn, things like that, is a great classic example of a graph, where you have people, and those people are connected to each other, and those people are connected to other people. So that's really the basis of what a graph is. It's these concepts of entities or things and relationships between them. In the graph world we call those nodes and edges, or nodes and relationships. At a high level, this is how most of the world works: interconnections between things, and that's a graph. From a technical side, these are things we use all day, every day in our normal work. Things like linked lists and trees are all graph structures.
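A minimal sketch of the "nodes and edges" idea, assuming an undirected friendship graph held in memory as an adjacency map. The people and the `build_adjacency` helper are made up for illustration and are not part of any particular graph product.

```python
from collections import defaultdict

# Hypothetical friendships: each pair is one edge between two person nodes.
edges = [
    ("Jillian", "Dave"),
    ("Dave", "Ana"),
    ("Ana", "Raj"),
]


def build_adjacency(edge_list):
    """Build an undirected adjacency map: node -> set of neighbouring nodes."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


graph = build_adjacency(edges)
print(sorted(graph["Dave"]))  # the nodes directly connected to Dave
```

A linked list or a tree fits the same representation; they are simply graphs with extra restrictions on which edges may exist.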
B
What's interesting about those use cases, social networks and now a lot of generative AI, which you didn't mention but I know is on people's minds as a graph use case, is that they have only been around within the past 20 years, while graphs have been around a lot longer. So maybe you can explain to us, I'd love to know, the history of graphs.
A
A graph at its very core is a mathematical construct. The original paper, if you want, or the original theory behind it, came from a man named Leonhard Euler, and the whole Seven Bridges of Königsberg problem is kind of where it came from. That came out in the 1700s, I think, is roughly where it was. So the concepts of what graphs are and what you can do with them have been around for a long time on the mathematical side. As we moved into computer programming, computer science and data structures, graphs became very prevalent there. But even at the actual database level, graph concepts became some of the basis of some of the very earliest computers and database systems out there. And then as they've progressed throughout the years, they've come in and out a little bit in different mechanisms, but most recently, since about 2000 or so, they've started to really come into their own as an underlying database technology themselves.
B
It's really interesting, now with this database technology, as people are starting to think about the different types of database that they should use depending on the use case. When should someone use a graph?
A
That's a really classic question we get: when should you use a graph, and when should you use a graph database? The answer there is when the types of data you have are highly connected data, things like social networks, and the types of questions you want to answer are the types of questions where you need to be able to quickly and easily move between those entities across those sorts of relationships. Those are really good use cases for graphs in general. And graph databases are just an extension of that. They're a type of NoSQL database that stores that data in that native data structure. You could store graphs in relational databases. You can store graphs in DynamoDB or other things like that, and then process those on your application side. Graph databases are good where you need to be able to push that processing, the movement between those connections and the connecting of those pieces of data together, down to the database level. We often call it highly connected data or highly connected data questions. Those are the sorts of things where a graph is really good.
B
Yeah. Can you maybe be a little more specific? So highly connected: let's say in the relational database world we're just joining, for example, two tables, so maybe we're creating that one connection. In a graph, what is the additional highly connectedness, and I don't even know if that's a word, that it's meant to represent?
A
It's a valid word. One of the biggest differences between a relational database and a graph database is that in a graph database, I always like to say, relationships are treated more as first-class citizens. In graph terms, those are called edges. In a relational database, if you have two tables and a join between them, that join is done at runtime and it's implicit through the use of foreign keys. That's how you maintain consistency across that data and how you do those joins, using inner join, outer join, those things that we're all very familiar with. In a graph database, not only do you store those entities as nodes, which you can sort of think of as rows in a relational table, but you also store the connections between different tables, or different entities in graph terms, as data itself. So if you want to connect that data together, which in graph terms we call traversing those edges, meaning moving from one to the other, you do that very quickly because it's a lookup versus a calculated thing. So that's one of the reasons you use it. When do you want to use this? If you're doing something like a join in a relational database, to go back to our social network example, if I wanted to say who are my friends, this is pretty easy to do in a relational database too. Who are the friends of my friends? Okay, so that's two joins. But if I want to be able to say how are Jillian and Dave connected, that's something that's very difficult to do in a relational database. 
You're going to have to do some sort of recursive traversal, and if you've ever written those, they are hard to write, hard to understand and impossible to optimize. In a graph database, because those relationships are first-class citizens, and because of the nature of graph query languages and the way you actually query it, those sorts of recursive traversals are very, very simple.
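The "how are Jillian and Dave connected?" question can be sketched as a breadth-first search over stored edges. This is a plain in-memory illustration under assumed data, not how any particular graph database executes the query; the `friends` map and `connection_path` function are hypothetical.

```python
from collections import deque

# Hypothetical friendship graph; each edge is stored in both directions.
friends = {
    "Jillian": {"Ana"},
    "Ana": {"Jillian", "Raj"},
    "Raj": {"Ana", "Dave"},
    "Dave": {"Raj"},
    "Lee": set(),  # someone with no connections at all
}


def connection_path(graph, start, goal):
    """Return the shortest chain of people from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        person = path[-1]
        if person == goal:
            return path
        # Each hop is a lookup of stored edges, with no join to compute.
        for neighbour in sorted(graph[person]):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


print(connection_path(friends, "Jillian", "Dave"))
```

The number of hops is not known up front, which is what forces a recursive query in SQL. Here the loop simply keeps following edges until it reaches the goal or runs out of graph.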
B
Super interesting. I think we should do a little bit of a vocabulary lesson, because I know there are a lot of people probably listening who are thinking, yeah, I want a graph, or maybe they're saying I want a knowledge graph. But I know when they come to you, you like to distinguish between a graph versus a graph engine versus a graph database. So can we break those down for everyone?
A
I mean, at the very core, as we talked about, a graph is that data structure. It is a way of representing data as entities and connections, using nodes and edges. This is something that's very common in programming. Linked lists are a type of graph. Trees are a type of graph, all of those sorts of things. So you can use graphs without having to use a graph engine or a graph database. It comes down to what your angle is. Do you want to represent the data? That's where a graph is really great. Do you want to be able to analyze that data, potentially in some sort of in-memory library? That's where we start getting into what I call graph engines or graph analysis engines, where you're loading that data, maybe into memory, maybe not, and using it to answer the sorts of queries and business questions you want to answer. This is really common, especially from a data science perspective. You may load your data into a pandas DataFrame and then run an analysis library like NetworkX over top of that data to get out some information. And then there's also a graph database, which takes some of those concepts that you get with graph engines, but also adds that persistence layer and all the other things that you get from a database on top of that.
B
Got it. So it sounds like what you're saying is you've also got to know what it is that you really want. Do you want a graph, a graph engine, or a graph database? What are the kinds of relationships that you would want to be able to understand? Are they highly connected? And maybe even understanding the types of queries that you'd want to be able to run, since it sounds like within the graph world there are certain things you can do that are awesome and that you can't necessarily do, or maybe you could, but it would be much harder, on a relational or non-relational database.
A
Well, I mean, the way I always look at it is that pretty much any database at some level is an abstraction built on top of key-value stores to solve specific types of questions. At its very lowest level, that's what it is, with some very nice abstractions on top. And graphs have abstractions built on top that are very good for answering those sorts of questions. Can you use a graph to answer a question like what is Dave's name? Yes, you can absolutely do those sorts of point lookups, but that's not something that model is optimized for. It's optimized for finding how Dave and Jillian are connected. How do I get from point A to point B inside of this transportation network? How is this transaction associated with potentially fraudulent transactions? Those sorts of questions, where you need to be able to look across different things to see how they're connected, are where graph questions come in. And that leads you to the next question. Now that I know the type of question and that a graph is the right answer, is this something I'm going to run one time and never do again? Okay, well, maybe then an analysis engine is exactly what you want to look for. Are you doing more pure data science, or do you need to take what you've learned, productionize it and make it part of your everyday application? That's where you may want to use a database that has those durability aspects to it.
B
Got it. So let's break down some of the different types of graph queries, or graph things that you can do. The first is link prediction. Can you explain what that is and the use case for it?
A
Yeah, I mean, with link prediction, there are multiple different ways you can go about doing it in graphs, but at its very core it comes down to looking for connections in your data that are implicit, based on how things are connected, not necessarily explicit. A really common one is something that in graph terms is known as a triadic closure. If you wanted to put that into a social network context again, it would be: find me all of the friends of my friends that I am not also connected to, and then make that a recommendation. Product recommendations are another example here: what products are commonly purchased together? By looking at just the straight data, you're not going to be able to see that. But by connecting them together you can start to see, oh, if you buy this product, a lot of people also buy that product, things of that nature, from a product recommendation perspective. So there are multiple different ways to do it. You can do it using the graph topology itself, so the shape of the graph and the shape of the data. That's basically a deterministic sort of thing: if I run this a hundred times, I'm going to say these people are the same every time. One of the newer areas of this is what's known as graph machine learning, graph neural networks, things like that, where you can take that graph topology and use it as part of a machine learning model to get those non-deterministic, inferred edges, where we think that these two people are connected, or these are the same people, with a 70% confidence, something like that. So there are two different ways to solve that same problem.
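A small sketch of the topology-based (deterministic) approach, assuming a friendship graph in memory: candidates are friends of friends who are not yet connected, ranked by how many mutual friends would close the triangle. The `network` data and `recommend_friends` function are illustrative only.

```python
from collections import Counter

# Hypothetical friendship graph; each edge is stored in both directions.
network = {
    "Jillian": {"Ana", "Raj"},
    "Ana": {"Jillian", "Dave", "Lee"},
    "Raj": {"Jillian", "Dave"},
    "Dave": {"Ana", "Raj"},
    "Lee": {"Ana"},
}


def recommend_friends(graph, person):
    """Rank people two hops away by the number of mutual friends."""
    direct = graph[person]
    mutual_counts = Counter()
    for friend in direct:
        for candidate in graph[friend]:
            # Skip the person themselves and anyone already connected.
            if candidate != person and candidate not in direct:
                mutual_counts[candidate] += 1
    # Most mutual friends first; ties broken alphabetically.
    return sorted(mutual_counts.items(), key=lambda item: (-item[1], item[0]))


print(recommend_friends(network, "Jillian"))
```

Running this repeatedly on the same graph always gives the same ranking, which is the contrast with a graph machine learning model that returns an inferred edge with a confidence score.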
B
And then those algorithms that you were just describing, the machine learning use cases, are those the same as graph analytical algorithms?
A
They are a type of graph algorithm, I would put it that way. Graph algorithms are a bigger category, and there are types of graph algorithms that go beyond just machine learning algorithms. There are pathfinding algorithms: what is the shortest path between two locations? We all probably use Google Maps or Apple Maps and do this all of the time. That may be done as a graph; I'm not sure how they do it behind the scenes. But going from here to there, finding the shortest path or the least expensive path, is a graph problem. That's one kind of graph algorithm, pathfinding. There are similarity algorithms, which find out how many common neighbors these people have. Those can be useful in things like fraud detection: if all of a sudden you find there's a lot of overlap between two people, and one of those people is fraudulent, this may be useful for you. And then there are others, like centrality. There are multiple different ways to measure centrality, but they all come down to how important or influential a specific entity is inside of the graph. The really classic one there is the PageRank algorithm, which was developed for the original functioning of Google to determine how important a webpage is, in order to rank it higher in your search results. That's a graph algorithm itself. And then there's another set we call community detection: finding groups of people that are tightly connected together in different ways. Community detection is a common one that's used in things like fraud detection, because it depends a little bit on your domain and what you're looking for. 
But it could be that in your domain you expect everyone to be connected together or you expect no one to be connected together. So if you have a graph and you kind of know that piece of information, you run your community detection algorithm and you find the things that don't match what your expectations are. Like, everybody's connected except this little set of transactions of people over here. Well, I might want to go investigate that a little bit more. Or the flip is true. We expect everybody to be disconnected and we have this huge group of people together.
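As a rough sketch of the "find what doesn't match expectations" idea, the code below uses connected components, the simplest form of grouping, as a stand-in for a real community detection algorithm. It assumes a domain where accounts are expected to be mostly disconnected; the `accounts` data and the size threshold are hypothetical.

```python
# Hypothetical account graph; each edge is stored in both directions.
accounts = {
    "a1": {"a2"},
    "a2": {"a1"},
    "a3": set(),
    "a4": {"a5", "a7"},
    "a5": {"a4", "a6"},
    "a6": {"a5", "a7"},
    "a7": {"a6", "a4"},
}


def connected_components(graph):
    """Group nodes into sets where every node can reach every other node."""
    seen = set()
    components = []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components


def unexpected_groups(graph, max_expected_size):
    """Return the groups that are larger than the domain says they should be."""
    return [c for c in connected_components(graph) if len(c) > max_expected_size]


print(unexpected_groups(accounts, max_expected_size=2))
```

For the opposite expectation, where everyone should be connected, the same components list can be filtered for the small groups instead.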
B
It's so fascinating to hear about how all these different use cases that we're using in the real world relate back to some type of graph, maybe graph analytical algorithm using link prediction. And there's another one that I wanted to ask you about, which is ontology. Can you explain what that is?
A
Yeah, I mean, especially with all of the excitement around generative AI, knowledge graphs, things like that, you'll hear that word thrown around a lot. While there are a few really niche differences at the low level, an ontology can really just be thought of as the schema of your graph. What entities are in your graph, and what relationships between those entities are you expecting? So I think it's a very big word for a concept we're all familiar with. Like I said, there are a few little things where people may disagree with me on some of that. But at a high level, when you hear the word ontology, you can think of it as the schema of your graph.
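Taking "ontology as schema" at the high level described here, a minimal sketch could list which relationships are allowed between which entity types and check edges against that list. The entity types, relationship names and `validate_edge` helper are all made up for illustration; real ontology languages carry considerably more than this.

```python
# Hypothetical schema: (source type, relationship, target type) triples
# that the graph is expected to contain.
ALLOWED_EDGE_TYPES = {
    ("Person", "WORKS_AT", "Company"),
    ("Person", "KNOWS", "Person"),
}

# Hypothetical nodes and the entity type of each one.
node_types = {
    "Dave": "Person",
    "Jillian": "Person",
    "ExampleCorp": "Company",
}


def validate_edge(types, source, relation, target):
    """Return True if the edge matches one of the allowed schema triples."""
    triple = (types.get(source), relation, types.get(target))
    return triple in ALLOWED_EDGE_TYPES


print(validate_edge(node_types, "Dave", "WORKS_AT", "ExampleCorp"))
print(validate_edge(node_types, "ExampleCorp", "WORKS_AT", "Dave"))
```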
B
That sounds a lot more approachable. Schema of your graph, then. Okay, perfect. I thought it was maybe a new medical practice or something, but great, awesome. So when I hear you describe these graphs as requiring a highly connected data set, to me it sounds like you need a large data set to really use a graph. Is that the case?
A
No, I mean, actually I would say many, if not most, graphs are relatively small. And when I say small, I mean it could be anywhere from tens to hundreds of nodes, which is probably not super common, but tens of thousands, hundreds of thousands, a million nodes, something like that is absolutely a thing we see customers using all the time. It's not so much how large your data set is, it's what those entities and those connections represent that can be very important. I was working with one very, very large customer that we would all know, and they were doing a sort of fraud analysis. But in their fraud analysis, they were only looking at a million nodes in the entire graph. I want to say that the million nodes at that point were something like 50 megs of data. It wasn't a huge amount of data, because what they were representing with those nodes was a very summarized version of their data. That's how they thought about fraud. In their specific use case, they didn't need a million properties. They just knew that a node represented, in that case, a user, and they were looking at how those users were connected together to look for specific patterns.
B
Super interesting. And in that use case, like you were saying that they already had a million nodes. So it sounds like that's a million.
A
They essentially had a million users, and they were looking at how those users were connected together to look for what they knew looked like fraud in their scenario. When you're looking at graphs, especially when you're looking at analytics on them, what you're really looking for is things that are outside of what you would expect. Especially in a fraud case, I want to find: here's how customers use my system, so why is this one doing it differently? Anomaly detection is another word that you could use for it. How can I find the things that don't match what we expect?
B
Things you expect people to be doing in that use case. So the person then already knows what to expect. It sounds like they already know what the relationships in the graph are. What if someone doesn't know the relationships? Do they need to do other prerequisites, or can they already start to use a graph?
A
I mean, at some level you have to have some sort of useful model of your domain. I think one of the unique aspects of graphs compared to some other technologies is that the schema of graphs tends to be very flexible. It's very easy to add new connections as you figure things out, to add new node types and new properties associated with them as you go, and to dynamically build it up without having to do the ETL-style process that you would probably be familiar with from the relational database world. It's sort of implicit versus explicit schema. So you have to have some amount of that model. But in any company, if you go to people and ask them to describe what their data looks like, they can give you some basic information: oh, that file contains a company, and this file contains the people that work at that company, and things like that. The relationships are already kind of known, so you need to map those in, and then you can start actually looking at your data and start to see and infer additional information around that. That's where link prediction tasks are very common. Once you know how these things are connected, you can start to ask, but aren't these things also connected? Maybe it's because we go from a transaction to a phone to an email to another transaction. Well, now I can make some sort of inference that these two transactions are related to the same person.
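The transaction-to-phone-to-email-to-transaction inference can be sketched as a traversal that walks through identifier nodes and stops when it reaches another transaction. The node naming scheme (`txn:`, `phone:`, `email:` prefixes), the sample edges and both helper functions are assumptions made for this illustration.

```python
from collections import defaultdict, deque

# Hypothetical edges linking transactions to the identifiers they used.
edge_list = [
    ("txn:1", "phone:555-0100"),
    ("phone:555-0100", "email:a@example.com"),
    ("email:a@example.com", "txn:2"),
    ("txn:3", "phone:555-0199"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def related_transactions(graph, start):
    """Find transactions reachable from start through identifier nodes only."""
    found = set()
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour in visited:
                continue
            visited.add(neighbour)
            if neighbour.startswith("txn:"):
                found.add(neighbour)  # record it, but do not walk through it
            else:
                queue.append(neighbour)  # keep following phones and emails
    return found


g = build_graph(edge_list)
print(related_transactions(g, "txn:1"))
```

Whether two linked transactions really belong to the same person is still an inference; the traversal only surfaces the candidates.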
B
Got it. Okay. So you start by knowing some general idea about the relationships between your data. And then of course, as new use cases evolve and your business operates, you can use maybe link prediction, or maybe there are other ways to be able to...
A
Or you start adding new data sources. A lot of times, data isn't static. Data evolves over time. So now all of a sudden you have access to a new data source, and you can start to link that together and start to make a bigger, more complete graph. I mean, there is no graph of the whole world, so everything is only part of it. Adding more data starts to let you see different information, potentially.
B
Really cool. Now let's use an example to make this more concrete for people. Let's say someone's a data scientist, they've got their data in S3, and they're already doing graph analysis. How does that data scientist scale their graph analysis into production?
A
Here's a great example: a fraud scenario, because that's a very common one. You have an entire team that's analyzing data from the data science perspective, looking for what fraud looks like in our domain, to go back to that. They have that data, and generally they're going to be working against a small subset of it. Maybe they're using Spark or EMR or just their laptop to run some sort of smaller data analysis on it. They look at their data and they realize that a normal pattern of usage of our system is that no user is connected to more than two email addresses and no user has more than 10 transactions a month, something like that. They figure out what the standard is in that domain. That's great information. But now how do you make that into something that really impacts the business? Well, you probably want to start taking that information and pushing it towards more of a real-time system, which is when you may want to take the work that's been done on that engine and move it to more of a graph database application. Now that you know what some of these patterns are, you can run analysis on those patterns in real time to potentially flag something. Maybe it's fraud, maybe it's not. What we do know is that it's outside of our expectations. And then this becomes an iterative process, because in a fraud scenario, for example, what fraud looks like is constantly evolving. It's a cat and mouse game, because as soon as you figure out one type of fraud, they find a different one. 
So as you're adding the new information, you rework through this process of looking for the patterns that you expect and then applying those patterns to incoming data, in a never-ending cycle. It's like any software development project. You put out the first version. Well, there are going to be iterative versions. You're going to have to update things, and you're going to find things that work and don't work.
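A minimal sketch of turning the learned pattern (no more than two email addresses and no more than 10 transactions a month per user) into a check that can run over incoming edges. The relationship names, the sample data and the `flag_unusual_users` function are hypothetical, and the edges are assumed to be already limited to a single month.

```python
from collections import defaultdict

# Hypothetical edges for one month: (user, relationship, target).
month_edges = (
    [("u1", "HAS_EMAIL", "e1"), ("u1", "MADE_TRANSACTION", "t1")]
    + [("u2", "HAS_EMAIL", f"e{i}") for i in range(2, 5)]  # 3 emails
    + [("u2", "MADE_TRANSACTION", f"t{i}") for i in range(100, 111)]  # 11 transactions
)


def flag_unusual_users(edges, max_emails=2, max_transactions=10):
    """Return users whose connections fall outside the expected pattern."""
    emails = defaultdict(set)
    transactions = defaultdict(set)
    for user, relation, target in edges:
        if relation == "HAS_EMAIL":
            emails[user].add(target)
        elif relation == "MADE_TRANSACTION":
            transactions[user].add(target)

    flagged = {}
    for user in sorted(set(emails) | set(transactions)):
        reasons = []
        if len(emails[user]) > max_emails:
            reasons.append("too many emails")
        if len(transactions[user]) > max_transactions:
            reasons.append("too many transactions")
        if reasons:
            flagged[user] = reasons
    return flagged


print(flag_unusual_users(month_edges))
```

A flag here only means "outside expectations", not "fraud"; the thresholds are parameters precisely because they get revised as the patterns evolve.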
B
I love it. Yeah, it really is like any other type of use case of being able to utilize data and not having stale data. And a big one that you're probably getting asked about every single day is generative AI with graphs. So I'd love to understand, what are the use cases that you're starting to see?
A
That is absolutely something we get asked about a lot. A lot of times customers will come to us and talk about knowledge graphs and generative AI, or GraphRAG, which are probably the terms that you've heard. And really it comes down to one of a few different types of use cases. What I've found is that when people come to me and talk about that, they're using the term GraphRAG to mean anything relating to generative AI plus knowledge graphs, and they use the term knowledge graph to really mean graph. So what can you do with graphs and generative AI? We find there are two large buckets, with a couple of different examples inside each one. The first is people that already have a graph, or have already invested in a graph. They want to expose that out to a lot more of their user base, and they want to do that by safely allowing their customers to access that data with natural language questions. So users don't have to learn graph query languages. They don't have to learn all the underlying aspects of this. They just want to be able to say, who should I connect to? What product should we recommend to somebody who's given a five-star review to this sort of thing? Those are high-level questions where, under the covers, it's a graph that's running it, but your end user doesn't need to know that it's a graph. Your end user doesn't know graph query languages, doesn't know graph modeling, things like that. They want to be able to ask those questions and get back the relevant information. And those fall into two subcategories. The first is to just take the natural language question, use an LLM to convert it into a graph query and run the graph query. I call that natural language querying. 
The other one I call knowledge graph retrieval, which is a similar thing. But if you're working in a domain where you still need to have control over your data, you use the LLM to figure out what sort of question the customer is asking. Is this a question that we're allowing them to ask? And then what entities do they want to start at? And then you take that and run templated queries that you've written behind the covers. The example I always like to use for the difference is, let's say I'm making a chatbot for a bank, to be able to interact with your bank account information. Well, if I used pure natural language querying, I could ask things like, what is Jillian's bank account balance? That is not something I would want to expose. You need to control that information a little bit. So in that scenario, you could use the templated knowledge graph retrieval approach to pull that data out. Whereas if it's an internal application where users would have access to all of this data anyway, then just allowing them to run natural language queries on top of it is perfectly acceptable. Natural language querying is for open question answering, so users need to have the ability to access that whole data set. So that's the one big bucket. The other big bucket is customers that have data and want to be able to leverage it with other data through something like a RAG application or a knowledge graph enhanced RAG application. You have maybe a bunch of PDF documents or something like that, and you want to build a graph off of that and use it in conjunction with, or as a replacement for, a semantic search, to get better, meaning more complete or more explainable, context that you're giving to an LLM.
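The templated knowledge graph retrieval approach for the bank chatbot could look roughly like the sketch below. Everything here is an assumption for illustration: the graph schema in the query templates (`Customer`, `OWNS`, `Account`), the intent names, and `classify_intent`, which uses keyword matching purely as a stand-in for the LLM call. The code only builds a parameterized query and does not execute it.

```python
# Hypothetical allowlist of questions the chatbot may answer, each mapped to
# a pre-written, parameterized openCypher-style query.
TEMPLATES = {
    "list_my_accounts": (
        "MATCH (c:Customer {id: $customer_id})-[:OWNS]->(a:Account) "
        "RETURN a.name"
    ),
    "my_balance": (
        "MATCH (c:Customer {id: $customer_id})-[:OWNS]->(a:Account) "
        "RETURN a.name, a.balance"
    ),
}


def classify_intent(question):
    """Stand-in for an LLM: map a question to an intent name, or None."""
    text = question.lower()
    if "balance" in text:
        return "my_balance"
    if "account" in text:
        return "list_my_accounts"
    return None


def build_query(question, authenticated_customer_id):
    """Return (query, parameters) for an allowed question, else None."""
    intent = classify_intent(question)
    if intent not in TEMPLATES:
        return None  # not a question we allow
    # The starting entity comes from the logged-in session, never from the
    # question text, so asking about someone else cannot change the scope.
    return TEMPLATES[intent], {"customer_id": authenticated_customer_id}


print(build_query("What is Jillian's bank account balance?", "cust-dave"))
```

With pure natural language querying, the generated query would follow whatever the question names; with templates, the question only selects which pre-approved query runs.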
B
So cool. I am seeing a lot of the RAG use case that you were talking about, especially for tech-enabled companies, where the business model is something else. For example, I see it a ton in the life sciences, where you now have scientists who want a knowledge graph of biological data, and they don't want to write queries. They want to be able to just ask a question about, say, a protein, or whatever the biological data is, and have the graph come back to them to help with their own experiments. And I know a lot of them have been really excited about this idea of finally being able to use a graph because, A, now it seems more accessible with Neptune Analytics, since their data is already in S3, and we'll talk about that a bit more. And then B, with LLMs they don't have to write these queries, which just makes it much more accessible to them.
A
Yeah, the other common one I've seen coming up more and more is: I want to be able to do some sort of hybrid search. To go back to your healthcare and life sciences example, maybe I have a bunch of document data, like patient reports or symptom reports, inside of a traditional RAG vector similarity application. And I have this biological knowledge graph, and I want to be able to combine them and give all of that context from two different data sources to an LLM, to provide additional context back to the end user. It's the "total is more than the sum of its parts" sort of use case. You can get certain information from that biological knowledge graph. You can get certain information from a RAG application. You can also combine these together in a hybrid search, a hybrid RAG application, to get even better information, because you're pulling from different data sources.
B
Really cool. And there are a lot of different data stores that people are starting to think about with these RAG and graph use cases. So can you maybe break down when someone would want to use Amazon OpenSearch versus Amazon Neptune if they're doing vector similarity?
A
A lot of that comes down to what it is you're looking to achieve. With our release of Neptune Analytics last year, we added vector similarity search as well, so you can actually store your vectors there. That's very useful if you want to directly combine your vectors with a graph traversal. If you want to use vector similarity to start, you basically find the top K most similar nodes and then traverse out from there. So it's useful in that scenario where the vectors and the graph represent the same data and you want to combine them to get an answer. Another common pattern we see is someone using either Neptune Database with OpenSearch, or Neptune Database Serverless with OpenSearch Serverless if you want to go the whole serverless way. There you're using OpenSearch as the vector store and you're using Neptune to do the graph traversals, either in a combined RAG approach or in that hybrid search RAG approach.
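The "find the top K most similar nodes, then traverse out from there" pattern can be illustrated in memory as below. This is not the Neptune Analytics API; the embeddings, the graph, and the `cosine_similarity` and `top_k_then_expand` functions are assumptions made for this sketch, with a one-hop expansion standing in for a real traversal.

```python
import math

# Hypothetical node embeddings and graph edges over the same nodes.
embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
}
graph = {
    "doc_a": {"protein_x"},
    "doc_b": {"protein_y"},
    "doc_c": {"protein_z"},
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_then_expand(vectors, edges, query_vector, k):
    """Pick the k most similar nodes, then add their direct neighbours."""
    ranked = sorted(
        vectors,
        key=lambda node: cosine_similarity(vectors[node], query_vector),
        reverse=True,
    )
    seeds = ranked[:k]
    context = set(seeds)
    for seed in seeds:
        context |= edges.get(seed, set())  # one hop out from each seed
    return seeds, context


seeds, context = top_k_then_expand(embeddings, graph, [1.0, 0.0], k=2)
print(seeds, sorted(context))
```

Keeping the vectors and the edges on the same nodes is what lets both steps happen in one query; with a separate vector store, the seed lookup and the traversal become two calls that the application stitches together.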
B
I think we should also just give people a good overview of Neptune vs. Neptune Analytics. How would you describe the difference to someone?
A
Yeah, no, that's actually a great question. So Amazon Neptune is the purpose-built set of databases we have at AWS for running graph workloads. And inside that there are two different options you can choose. There's Neptune Database, which is a highly available, scalable, more transactional-focused graph database. Last year at re:Invent we also announced Neptune Analytics, which is an in-memory optimized analytics database engine. Neptune Analytics is, in a lot of ways, focused on more ephemeral use cases: I have data stored in S3, I want to load it up, run analysis, run more analytical queries, algorithms, things like that, and then take that data, store it out, and use it somewhere else. I shut that graph down until I maybe want to do it again next week or next month. Whereas Neptune Database is really for those always-on, 24/7 applications where you need to run queries like "Is this transaction fraud?" You start from that specific transaction, go out a couple of hops, look at that information, and look for those patterns of expected activity.
B
So it sounds like, from that comparison, Neptune Database is for the use cases where you need the lowest latency possible types of responses. And then Neptune Analytics is maybe more suited for those RAG use cases, or for internal search types of operations. Did I get that right?
A
Yeah, Neptune Database is absolutely targeted at those transactional, always-on, 24/7 sorts of use cases. Neptune Analytics is targeted at what I call more exploratory or analytical sorts of queries. Something like RAG could be either one of those. Part of that comes down to your end goal. If this is something that needs to dynamically scale and be up 24/7, you may look at something like Neptune Database, potentially in combination with OpenSearch, to support that use case. If this is something you are trying out, or if you want to use vectors with your graph data (I should say not just in conjunction with it, but very specifically as part of the same query), you may want to look at Neptune Analytics.
B
Really cool. We've covered a lot. So for the folks who are new to graph and they want to be able to learn more and be able to get started with like their first graph, what do you recommend for them?
A
So there are actually quite a few resources for getting started with graphs on our Neptune Developer Resources page, because this is a very common question that we get: okay, where do I start? It's a new database technology for most people, so going to our developer resources page is a good first step. If you're more interested in some of the generative AI workflows, we do have a generative AI samples repo up there under AWS Samples. We've also done quite a lot with the open source community around integrations with LangChain, LlamaIndex, and things like that, so that users who are already working with those very common LLM and RAG style tools can quickly get started trying out a graph and seeing if it improves their applications.
B
Really exciting. And anything else for the people who are maybe already using Neptune Analytics or have a Neptune database in production?
A
For those users, yet again, in those same areas we have a lot of ongoing blog content, ongoing samples, and things like that coming out for more and more advanced use cases. Especially in the generative AI space, this is new to everybody, so we're working closely with customers and figuring out what the best practices are as we go through this journey with them. It's a rapidly evolving space. So keep an eye on those different areas and on the databases blog, where we're consistently trying to put out information that highlights not only what people can do, but how you can be successful and what tips and tricks we're learning along the way as well.
B
Love it. Dave, thank you so much for being here on the AWS podcast.
A
Thank you for having me.
AWS Podcast Episode #687: Graph Analytics Breakdown
Release Date: September 30, 2024
Host: Jillian Forde
Guest: Dave Bechberger, Principal Graph Architect, Amazon Neptune Service Team
In episode #687 of the AWS Podcast, released on September 30, 2024, host Jillian Forde explores graph analytics with Dave Bechberger from the Amazon Neptune service team. The episode covers graph databases, their applications, and their integration with emerging technologies like generative AI.
Defining a Graph
Dave begins by demystifying what a graph is, making it accessible for newcomers.
“From a very technical side, a graph is a way to actually look at your data as entities and connections between entities...”
[00:57]
He emphasizes that graphs represent data through nodes (entities) and edges (relationships), akin to social networks like Facebook or LinkedIn. These structures mirror everyday interactions, making them intuitive for various applications.
Historical Context
Graphs are rooted in mathematical theory, originating from Leonhard Euler's work on the Seven Bridges of Königsberg in the 1700s.
“As we moved into more computer programming, computer science, data structures, graphs are very prevalent there.”
[02:27]
Over the decades, graph concepts have evolved, becoming foundational in computer science and database systems, ultimately leading to specialized graph databases in the 21st century.
Highly Connected Data
Dave addresses the classic question: when should one opt for a graph or graph database?
“The types of data you have are those highly connected data... social networks... questions where you need to be able to quickly and easily move between those entities...”
[03:44]
Graph databases excel in scenarios where data is richly interconnected, allowing efficient traversal and querying of relationships, which is cumbersome in traditional relational databases.
Relational vs. Graph Databases
He contrasts relational databases with graph databases, highlighting that in graph databases, relationships are first-class citizens.
“In a graph database, those are called edges... you can store those connections between different tables or different entities as data itself.”
[05:15]
This distinction enables simpler and faster recursive queries, such as determining connections between entities beyond immediate relationships.
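As a rough illustration of the kind of "how are these two entities connected" question mentioned above, the sketch below walks an adjacency list breadth-first to count the hops between two people. It is plain Python over a made-up graph, not a graph database query; the `FRIENDS` data and the `hops_between` helper are assumptions for the example.

```python
from collections import deque

# Hypothetical social graph stored as an adjacency list.
FRIENDS = {
    "ana": ["ben"],
    "ben": ["ana", "cara"],
    "cara": ["ben", "dev"],
    "dev": ["cara"],
    "eli": [],
}

def hops_between(graph, start, goal):
    """Return the fewest edges from start to goal, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nbr in graph.get(node, []):
            if nbr == goal:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None
```

In a relational schema the same question typically needs a recursive join over a link table, which is the contrast the episode draws.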
Graph vs. Graph Engine vs. Graph Database
Jillian prompts Dave to clarify the differences between a graph, a graph engine, and a graph database.
“An ontology is really can just be thought of as the schema of your graph... What is an ontology?”
[15:23]
Dave explains:
Link Prediction
One of the key graph queries discussed is link prediction, which identifies potential or implicit connections within data.
“Link prediction... find me all of the friends of my friends that I am not also connected to...”
[11:13]
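The friends-of-friends query quoted above can be sketched in a few lines. This is a minimal plain-Python version over a made-up graph, not how a graph database executes it; the `FOLLOWS` data and the `friends_of_friends` helper are assumptions for the example.

```python
# Hypothetical undirected friendship graph: person -> set of direct friends.
FOLLOWS = {
    "ana": {"ben", "cara"},
    "ben": {"ana", "dev"},
    "cara": {"ana", "dev", "eli"},
    "dev": {"ben", "cara"},
    "eli": {"cara"},
}

def friends_of_friends(graph, person):
    """People two hops away who are not already direct friends."""
    direct = set(graph.get(person, ()))
    candidates = set()
    for friend in direct:
        candidates.update(graph.get(friend, ()))
    # Drop existing friends and the person themselves.
    return sorted(candidates - direct - {person})
```

Each returned name is a candidate link: someone reachable through a shared friend but not yet connected.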
Use cases include:
Graph Analytical Algorithms
Dave elaborates on various graph algorithms, including:
“Community detection is a common one that's used in like things like fraud detection...”
[15:06]
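To give a feel for grouping related nodes, the sketch below finds connected components, which is the simplest form of grouping and a stand-in here for a real community detection algorithm rather than one itself. The account and device names and the `connected_components` helper are assumptions for the example.

```python
def connected_components(edges):
    """Group nodes that are reachable from one another via any path."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = set()
    groups = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first walk from an unvisited node collects one component.
        stack = [start]
        seen.add(start)
        group = []
        while stack:
            node = stack.pop()
            group.append(node)
            for nbr in adjacency[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    stack.append(nbr)
        groups.append(sorted(group))
    return groups
```

In a fraud setting, two accounts landing in the same group because they share a device is the kind of signal an analyst might then investigate.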
Defining Ontology
Jillian asks about the concept of ontology, often mentioned alongside generative AI and knowledge graphs.
“Can you explain what [ontology] is?”
[15:23]
Dave clarifies that an ontology serves as the schema of a graph, defining the types of entities and relationships expected within the graph.
“An ontology is really can just be thought of as the schema of your graph.”
[16:03]
This schema provides structure, ensuring consistency and clarity in how data is interconnected.
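One way to picture an ontology acting as a schema is as a set of allowed (source type, relationship, target type) triples that edges are checked against. The sketch below is a minimal illustration under that assumption; the `ONTOLOGY` contents and the `invalid_edges` helper are hypothetical and not part of any Neptune API.

```python
# Hypothetical ontology: the only edge shapes the graph is expected to contain.
ONTOLOGY = {
    ("Person", "KNOWS", "Person"),
    ("Person", "WORKS_AT", "Company"),
}

def invalid_edges(node_types, edges, ontology=ONTOLOGY):
    """Return the edges whose (source type, label, target type) is not allowed."""
    bad = []
    for src, label, dst in edges:
        triple = (node_types.get(src), label, node_types.get(dst))
        if triple not in ontology:
            bad.append((src, label, dst))
    return bad
```

An edge such as a company that "KNOWS" a person would be reported, since that shape is not in the ontology.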
Fraud Detection
A prominent use case discussed is fraud analysis, where graphs help identify anomalous patterns by examining connections between users and transactions.
“What fraud looks like is constantly evolving. It's a cat and mouse game...”
[23:08]
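The fraud check described in the episode (start from a specific transaction, go out a couple of hops, and look at what is nearby) can be sketched as a bounded traversal. This is a minimal plain-Python illustration; the graph contents, the flagged set, and both helper functions are assumptions for the example.

```python
def within_hops(graph, start, max_hops):
    """Return every node reachable from start in at most max_hops edges."""
    frontier = {start}
    seen = {start}
    for _ in range(max_hops):
        # Expand one hop, skipping nodes that were already visited.
        frontier = {nbr for node in frontier for nbr in graph.get(node, ())} - seen
        seen |= frontier
    return seen - {start}

def near_known_fraud(graph, txn, flagged, max_hops=2):
    """List previously flagged nodes found within max_hops of a transaction."""
    return sorted(within_hops(graph, txn, max_hops) & flagged)
```

For example, a transaction whose card is also linked to a flagged account would surface that account at two hops, while an unrelated transaction would return an empty list.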
Generative AI Integration
The integration of graphs with generative AI (Graph RAG) is highlighted as a transformative application:
“If you're making a chatbot for a bank... you need to control that information.”
[26:59]
Life Sciences and Hybrid Search
In sectors like life sciences, graphs combined with AI enable sophisticated analyses, such as understanding biological data or enhancing search capabilities by merging graph data with traditional document searches.
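The hybrid idea of merging document search results with graph data before handing them to an LLM can be sketched as a simple context builder. This is only an illustration of the combining step; the retrieval itself is assumed to have happened already, and the `build_context` helper and its input shapes are hypothetical.

```python
def build_context(doc_hits, graph_facts, limit=3):
    """Merge document snippets and graph triples into one text block for an LLM prompt."""
    lines = ["Documents:"]
    lines += [f"- {text}" for text in doc_hits[:limit]]
    lines.append("Graph facts:")
    lines += [f"- {s} {p} {o}" for s, p, o in graph_facts[:limit]]
    return "\n".join(lines)
```

Here `doc_hits` stands for snippets returned by a vector similarity search and `graph_facts` for (subject, relationship, object) triples returned by a graph traversal.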
Amazon Neptune
Dave describes Amazon Neptune as AWS's purpose-built set of graph databases. Within it, Neptune Database is the highly available, scalable option focused on transactional workloads.
“Amazon NEPTUNE is the purpose built set of databases we have... for running graph sort of workloads.”
[30:18]
Neptune Analytics
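Conversely, Neptune Analytics is an in-memory optimized engine tailored for exploratory and analytical queries, often on graphs that are loaded, analyzed, and shut down again. Dave notes that RAG can fit either option, with Neptune Analytics being the choice when vector similarity needs to run as part of the graph query itself.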
“NEPTUNE Analytics is really focused on a lot of ways... run analytical queries, algorithms...”
[30:18]
Choosing Between the Two
The choice between Neptune Database and Neptune Analytics depends on the use case:
“Neptune Database is really for those always on 24/7 applications...”
[31:29]
Resources and Tools
For newcomers who want to get started with graphs, Dave recommends the Neptune Developer Resources page, which offers getting-started material and samples, along with integrations with popular tools like LangChain and LlamaIndex.
“Going to our developer resource page, being able to use that... quickly get started trying out a graph...”
[32:52]
Community and Continuous Learning
He also highlights the importance of staying engaged with ongoing blog content and customer collaborations to adopt best practices and leverage the latest advancements in graph analytics and generative AI.
“We're consistently trying to put out information there to really kind of highlight... what tips and tricks we're learning along the way...”
[33:43]
The episode concludes with Dave emphasizing the evolving landscape of graph analytics and its synergistic potential with generative AI. He encourages listeners to explore Amazon Neptune's offerings and leverage available resources to harness the power of graph databases in their applications.
“Dave, thank you so much for being here on the AWS podcast.”
[34:32]
“Thank you for having me.”
[34:37]
Key Takeaways:
For more information, visit the Amazon Neptune Developer Resources.