Transcript
Jonathan Lowe: My name is Jonathan Lowe. Have you ever been working really hard on a piece of code and it’s just stuck in your head but you cannot break through? You can’t squash the bug or finish the algorithm or make the procedure run, and you try and try until you are exhausted and you go to bed. Then you sleep the night through, and the next morning, as you wake up, the answer is right there. You think, it can’t be true. I can’t have just come up with this answer in my sleep. You go back to the screen, you try it out, and sure enough, it’s the right answer. You squash the bug. You resolve the algorithm. It just feels so strange to have your embedded LLM working for you while you sleep. Have any of you ever had this before? Maybe it’s a programmer thing.
Roadmap
The topic today is chatting with your knowledge graph: how to enable an LLM to chat directly with structured graph data. All of this is available both in a GitHub repo, if you want to replicate it for yourself, and in a published article on Medium. It’s all there in great detail in a variety of formats if you need it later.
What are you going to get out of this presentation? It’s really a rapid prototype that I did just for myself to better understand how this collection of tools and capabilities work together. If you’re like me, then you’ll pick up a lot of stuff just as a way of framing and illustrating data modeling, graph databases, knowledge graphs, and semantic search. I say generative AI at the end, but it’s the punchline that I’m still learning about myself. Maybe we’ll learn something about that, maybe not.
Chatting with Data – What Does it Mean?
Let’s talk about what does it mean to chat with data. On the left, a human being is saying, LLM, what brands of carbonated soft drinks should I buy for my son’s birthday party? What does the LLM say? I don’t know your son, but Fanta is popular lately. Is that chatting with data? Yes, sure, an LLM is all about data, but it’s not current data, it’s not recent data, and it’s not private data, it’s not your data. Some of it may be, maybe you published something and that got into the LLM in the training process, but it’s not really what we’re talking about today. What if you, on the right, extended what the LLM knows by offering it up a collection of private documents?
If you asked, LLM, given my family’s private email history and recent grocery store receipts, what drinks should I buy for Jimmy’s party? The LLM, with help from these documents, with help from a sentence transformer and the generative AI aspects of an LLM, could say, buy Ginger Ale, Pepsi, and Coca-Cola, they’re his favorites now. Is that chatting with data? It definitely is. It’s all the power of the LLM itself plus the additional data in the documents. What is this example of chatting with your data? This is really Retrieval-Augmented Generation, or RAG. In that process, a sentence transformer converts the user’s question into an embedded vector, and also sentence transformers convert the private documents into vectors.
Then, the question vector is compared to the document vectors. The most closely matched vectors from the documents are pulled into the prompt, and the prompt then uses them as context to answer the question more intelligently. There’s also an approach to chatting with data where you use software like LlamaIndex to take those documents and augment them with a graph. The graph takes subjects, verbs, and objects from the documents, turns them into nodes and relationships, deduplicates terms to form standard ontologies, and then uses the graph to make the LLM less likely to hallucinate and give more accurate answers.
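That retrieval loop can be sketched in a few lines. This is a toy illustration, not the talk’s actual code: the embedder here is a bag-of-words stand-in for a real sentence transformer, and the document snippets are invented.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence transformer: a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

documents = [
    "Jimmy asked for Ginger Ale and Pepsi at his last party",
    "The lawn mower needs a new spark plug",
]

# Embed the question, rank the documents by similarity, and put the
# best match into the prompt as context ahead of the question.
question = "what drinks should I buy for Jimmy's party"
q_vec = embed(question)
ranked = sorted(documents, key=lambda d: cosine(q_vec, embed(d)), reverse=True)
prompt = f"Context:\n{ranked[0]}\n\nQuestion: {question}"
```

A real RAG pipeline swaps `embed` for a dense sentence-embedding model, but the shape of the loop is the same.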
This presentation isn’t really about that either. As Picasso said, good artists borrow, great artists steal. We’re going to steal from this. It’s not really about that because it’s just about the pure ability to talk directly to that graph database. The rapid prototype goal here was with free software on a 24-gigabyte laptop, which happens to be with us here today, and no internet connection.
Could we ask an LLM a natural language question about data in a structured graph database and receive an accurate natural language answer, borrowing from GraphRAG, but beginning and ending exclusively in that graph database, and not calling the graph database with Cypher, which is the graph version of SQL. Not calling it, not generating Cypher, calling the graph, taking the data back and prompting with that. No, direct conversation in some way. Is it possible? It is. We’ve all worked with databases. I’m sure many of you work with very large databases.
How exciting is it to think that so many more people could communicate with the large databases we’ve all come to know and love without needing to learn SQL, or Cypher if it were a graph database. That was the rapid prototype goal. Just a quick reference: what is the nature of the setup? It’s an M4 MacBook Pro, no Wi-Fi, running Ollama with a variety of different LLMs (Llama 3, Qwen, QwQ, DeepSeek), running a graph database called Neo4j, and using Python to tie it all together in a Jupyter Notebook. As you can see from the little screenshot there in the top of the Mac, you’ll see on the left part the symbol for Neo4j next to the Ollama symbol, and then Wi-Fi disabled. This is all privately run, no token charge whatsoever, with all these LLMs downloaded from the internet ahead of time.
Direct Chats with Structured Databases
What’s the story we’re going to walk through today? It’s a seven-step journey that starts with that natural language question. The question gets embedded just as in GraphRAG. The embedding is compared to a graph database that also has embeddings in it for semantically similar embeddings. Then the graph is traversed from those matches to bring in even more relevant data, which we then combine and feed back to the LLM in the form of a system and user prompt. The LLM then uses that as context to answer the original question. That’s the simple path. Now we’ll walk through the whole thing.
Data Modeling
The part that’s not on this chart is preparing the knowledge graph for this whole experience. We’ve got to set up shop. To do that, I’d like to do a quick recap of data modeling. The other night, I asked my companions, so what is data modeling to you? There were a variety of answers. I think it’s a lost art. Let’s take a look. What is data modeling? It’s representing the world with data, strings, numbers, dates, and sometimes binary large objects of various kinds. Here’s a real project from my checkered past with an agency that subsidizes farming. If the government didn’t provide farmers with subsidy funding, this particular country would not have been able to grow enough food to keep itself feeling secure in times of political conflict. To subsidize farming, they had to figure out how much to give each farmer.
The rules around that subsidization were quite complicated. I was asked to come in and build a data model that they could then use to build applications to help manage the financial subsidies to all the farmers over time. To do that, the data modeling process is, you talk to a lot of people and you say, how does it work? What are the rules? In this example, the real world is on the far left. It almost always seems to start with the real world, doesn’t it? The real world was walls and fences and ditches and earth and areas where they were growing some crop, in this case, wheat. The way that this agricultural subsidy government department looked at the world was, the growth area where the actual plants were coming out of the ground was the agricultural parcel, but the fenced in area was the reference parcel. It got more complicated. The more you asked, the more details seemed to come out.
For instance, it wasn’t just what was within the fence that mattered as the reference parcel: two different farmers could be growing two different crops within that one fence. They called it the farmer’s block 1 and the farmer’s block 2, and the whole thing, the physical block. As you dig in, you find out that it’s not just the real world anymore; there are also invisible things like renting. How does it work if somebody is renting land from another farmer? Does the renter get the subsidy or does the owner get the subsidy? What about variations on where the farmer lives versus where they should send the check, and the interplay of the land with the county and parish, and this thing called holding boundaries? All of this played into the subsidies. How do you create a data model for that? When you’re encountering something like this as a data modeler, you buckle down, maybe you consult your favorite data modeling book, I highly recommend Steve Hoberman’s work, and you start building relational tables.
At least this is the way it used to go. Maybe there’s a place of business that someone talked to you about in subsidy farming, and it needs an ID and an address to identify it. Maybe there’s a production unit that’s defined by three different area descriptions, a county, a parish, and a holding. Maybe they’re related to each other. You’ve got a table that joins them up by combining their two keys and describes the connections with these funny crow’s foot lines known as cardinality relations. You eventually end up with a logical data model that is describing all the tables in your database and lines describing the relationships and cardinality between them. It’s beautiful in a way. It was my favorite part of any application development project.
This is for relational databases. How does this differ for a graph? We’ll get to how a graph works. For a graph, instead of tables and invisible relationships that need to be instantiated by SQL queries with joins at the time of writing the query, the graph database instantiates everything: the nodes and the relationships. As you can see here, that same collection of place of business, production unit, and the join between them, the POB/prod-unit table, are represented by a production unit node type, a place of business node type, and an OVERSEES relationship. Each of the nodes and the relationships, in some cases, have properties that make them unique in the database. The name of the node, that place of business or production unit, is called the label. OVERSEES, for the relationship, is called the type. All the detailed data items that go with them are called properties. This is called a property graph. Nodes, relationships, and properties.
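A property graph can be pictured as plain data: nodes carry a label and properties, relationships carry a type, two endpoints, and properties of their own. This sketch is purely illustrative; the label names, the OVERSEES type, and all property values are assumptions based on the example above.

```python
# Minimal picture of a property graph: labels name the nodes, types
# name the relationships, and both carry key/value properties.
nodes = {
    "n1": {
        "label": "PlaceOfBusiness",
        "properties": {"id": "POB-1", "address": "1 Farm Lane"},
    },
    "n2": {
        "label": "ProductionUnit",
        "properties": {"county": "Kent", "parish": "Ash", "holding": "H7"},
    },
}

# The join table from the relational model becomes a first-class,
# directed relationship with its own properties.
relationships = [
    {"type": "OVERSEES", "from": "n1", "to": "n2", "properties": {"since": 2019}},
]
```

In a real graph database these are stored natively, so traversing a relationship is a pointer hop rather than a join computed at query time.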
Graph Databases
Let’s take a look at a sample dataset that explains this whole process that we’re going to step through, which has really only two different node labels, person and beverage, and only a few relationships, two of which are called out here, MARRIED_TO and LIKES.
In this tiny example, it calls out one of the cool things about graph, which is, even people that don’t know how to do anything with technical databases can read them like a newspaper. It’s easy to see that Jane Doe, whose gender is female, was married to John Doe, whose gender is male, from 2010 to 2023. It’s easy to see that John Doe likes Ginger Ale, which costs 75 cents a can. You can read it like a newspaper. Sometimes people say the graph is like the whiteboard. When you’re figuring all this stuff out on the whiteboard, drawing the circles and the arrows, it just translates straight into a graph.
As a longtime data modeler, I found it much easier to do than a relational design. How do you talk to this data? The language is not SQL, it’s Cypher. This is an example of the Cypher for asking the question, how much would it cost to buy Jane Doe’s former husband a can of his favorite soft drink? It’s a really interesting language that uses ASCII art to depict the nodes and the relationships in the query itself. You can see the parentheses around the person, MATCH, Person, MARRIED_TO is the relation, and there’s a little arrow that shows the direction of the relationship. There’s other traditional stuff that you’re all very familiar with, I’m sure, like SQL clauses like where, and so on, but it is its own language. The responses can either be another picture of a graph, which we’ll see, or a traditional table response. Like in this case, the answer is, the beverage is Ginger Ale and the cost is 75 cents.
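The query the slide describes can be sketched roughly as follows. The Cypher in the comment is an approximation (label, relationship, and property names are assumptions), and the Python below walks the same path over an in-memory copy of the toy data.

```python
# Roughly the Cypher described on the slide (all names are assumptions):
#   MATCH (:Person {name: 'Jane Doe'})-[:MARRIED_TO]->(ex:Person)
#         -[:LIKES]->(b:Beverage)
#   RETURN b.name, b.cost

# The same traversal over an in-memory copy of the toy graph:
married_to = {"Jane Doe": "John Doe"}       # MARRIED_TO edges
likes = {"John Doe": "Ginger Ale"}          # LIKES edges
beverages = {"Ginger Ale": {"cost": 0.75}}  # Beverage nodes

ex_husband = married_to["Jane Doe"]
favorite = likes[ex_husband]
answer = (favorite, beverages[favorite]["cost"])
# answer matches the table response described on the slide:
# Ginger Ale at 75 cents a can.
```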
Knowledge Graphs
Before we get any further into the details of a graph, what are knowledge graphs? This is where you go all in on some aspect of your business. Why would you want to? Here are some places where you might consider using a knowledge graph rather than a relational database or some other form of datastore to hold key production data. Things like social networks, hierarchies, genealogy, a series of ingredients in a recipe, for instance, or a family tree. Routing and supply chains are also a popular one. Networks, people model their infrastructure with graphs sometimes. Anything to do with a linked sequence. I was talking to someone about the patient journey, where a person begins a journey in trying to figure out what their ailment might be, and meets with various technicians and nurses and doctors, and you track that person’s history and everything that happens to them along the way. Turns out that graphs are really good for that kind of use case. Or metrics.
Graphs are helpful for taking a journey like that, for instance, and then attaching metrics along the way. For some reason, being able to have these tags along a sequence works very well in graph, and so does aggregation. If you’re trying to express something in terms of, what do we show to the shop floor level? What do we show to the manager level? What do we show to the C-suite level? The aggregations that are connected to the underlying data can be well modeled in a graph.
What we’re going to look at to test this all out is a toy graph with made-up data. Part of the reason I have made this data up is I originally did it with the royal family in England and actually created a graph with the real royal family people, their relationships, their children. Then I realized that the data in the LLM itself was influencing the results of all the questions that I might ask of this LLM with additional data. I decided we’ll do it with fake people, fake data, and that way the answers have to match or have to diverge from what’s in this dataset, and I’ll know whether the LLM is being accurate or not. What is this? It starts out with Adam Adams and Adelle Adams, the family patriarch and matriarch, who are married and had some kids, Angie and Brittany, who then married and had their own kids, and so on. At the end of this genealogical family tree, we have some of the beverage preferences of all the great grandkids, Belle and Ben and Brandon and Carla and Christy and so on.
Part of the other plan for this is it’s easy to tell if the LLM gets the answers right or wrong because most of the answers have the same first name if they’re getting into the right generation. That was part of the plan too. Let’s see, how does it work? First you load this data, and you can load data in a variety of ways. Like most software products, there’s all kinds of ways to get data in and out. This is an example just in pure Python code where the data is in the form of text and you run it as a Cypher statement that loads the data into an empty Neo4j database. If you wanted to see your data, there are browsers that allow you to look at a graph database and visualize not just the contents in text format but in actual graphical nodes and edge relationship formats. This is a snapshot from Neo4j’s browser-based interface.
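As a hedged sketch of that load step: generating Cypher CREATE statements from Python data. The driver calls in the comment use the official `neo4j` Python package; the connection details and the data fragment are invented for illustration.

```python
# A fragment of the made-up family data (the full dataset has several
# generations plus beverages).
people = [
    {"name": "Adam Adams", "gender": "male"},
    {"name": "Adelle Adams", "gender": "female"},
]

# Build one CREATE statement per person. Strings like these could be
# run against an empty Neo4j database via the official driver, e.g.
# (connection details are assumptions):
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687",
#                                 auth=("neo4j", "password"))
#   with driver.session() as session:
#       for statement in statements:
#           session.run(statement)
statements = [
    f"CREATE (:Person {{name: '{p['name']}', gender: '{p['gender']}'}})"
    for p in people
]
```

In production you would pass the values as query parameters rather than interpolating strings, but for a rapid prototype this shows the shape of the load.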
As you can see, the married relationship between Chris Bond and Connie Bond is highlighted yellow, and that means, on the right, it shows the relationship properties. You can see there are a couple of inbuilt IDs, the element ID and the ID, but the from and the to are the dataset’s own properties that were intentionally put into the dataset for our purposes. This view labels the specifics of each node, but the labels themselves could also be displayed instead. We use color coding to convey the fact that there are people and beverages in this picture. You can see some of the other relationship labels, is father of, is mother of, and so on.
We talked about how do you query it. We use Cypher. What if, as we talked about, we wanted to be able to query it with a non-Cypher natural language question? Graph, how much would it cost to buy Jane’s ex-husband a can of Ginger Ale? Chatting with a graph database requires help from an LLM because we don’t know what the human language question will be, we need the LLM to interpret it, but LLMs work with sentences, and the graph is structured data. We’re going to have to do something to turn that graph into sentences that an LLM could understand.
Interestingly, graphs are set up to hold subjects, verbs, and objects: node, subject, relationship, verb, other node, object. It’s really easy to turn a graph into a collection of mini documents in a way. Here in orange, you can see some of the ways that you might choose for this graph to turn the data into sentences. Here’s the code for how to do it. It’s a series of Cypher queries that find the relevant nodes, turn them into regular sentences, and then store them in the place that you think they should live in the graph. It ends up adding properties to each of those things. Whereas before we just had from and to for our married relationships, now we have a sentence property. In the middle one, Jane Doe is married to John Doe from 2010 to 2023.
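A minimal sketch of that sentence-generation step, assuming relationship types like MARRIED_TO and LIKES from the toy graph; the phrasing templates are my own, not the talk’s exact ones.

```python
def relationship_sentence(source, rel_type, target, props):
    """Render one relationship as a short natural-language sentence.
    The phrasing templates are illustrative."""
    if rel_type == "MARRIED_TO":
        return f"{source} is married to {target} from {props['from']} to {props['to']}."
    if rel_type == "LIKES":
        return f"{source} likes {target}."
    # Fall back to a readable version of the relationship type itself.
    return f"{source} {rel_type.replace('_', ' ').lower()} {target}."

sentence = relationship_sentence(
    "Jane Doe", "MARRIED_TO", "John Doe", {"from": 2010, "to": 2023}
)
```

In the prototype, a sentence like this is written back onto the relationship as a `sentence` property alongside the existing `from` and `to`.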
Semantic Search
There’s one last thing we have to do before we can set this up for business, and that is we have to add embeddings to our graph. Fortunately, Neo4j allows us to store that data format. Each of those sentences will be turned into an embedding. What are sentence embeddings? A rough approximation is that you’re turning a natural language sentence into a series of numbers. Those numbers express vectors. Vectors can be compared for meaning, not just keyword combinations, but meaning. Take the three sentences shown in the middle of this slide: Ginger Ale is a sweet carbonated beverage that aids digestion. Pepsi’s flavor profile includes caramel and citric acid. My boss drank the Kool-Aid about AI agents. Which two of those are more similar to each other than to the third? I would say the first two; Ginger Ale and Pepsi are talking about two drinks, whereas the reference to Kool-Aid in the third one is less about the Kool-Aid and more about the boss and AI agents.
The vector representations of those three sentences would bring the Pepsi sentence and the Ginger Ale sentence closer together, making them more similar from a meaning standpoint. Here’s how it works. This is how you turn sentences into embeddings with Python code that uses the sentence transformer called paraphrase-MiniLM-L6-v2. It’s pretty old. I used it because a friend said it was good. It maps to a fairly small set of dimensions, just 384 dimensions of dense vector space. It’s a workhorse. It ran quickly. We run it using Cypher again and store the results in the graph. Now what we used to have as a from and a to for that married relationship has become a from, a to, an embedding, and a sentence. It’s all in the graph. It’s all ready to go in one place. Let’s start using it. Here’s the code for the indexing. A lot of this code I’m sharing just to show you that there’s not much to it. It’s pretty simple stuff.
The Cypher takes a while to figure out, maybe a couple of days, but the rest of it is pretty basic Python code. Just to note, you can see there the vector dimensions were 384, so you have to know which sentence embedder you’re using, and you have to use that same sentence-embedding model for everything you do in this entire process, because the embeddings have to match. Then the similarity function, which we’ll get to, is cosine similarity. Other than that, it’s pretty standard.
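For reference, cosine similarity itself is only a few lines of Python, and the commented Cypher shows roughly what a vector index definition looks like in Neo4j 5 syntax. The index, label, and property names in the comment are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors:
    1.0 for identical directions, 0.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Roughly what the index definition looks like (Neo4j 5 syntax; the
# index, label, and property names here are assumptions):
#   CREATE VECTOR INDEX sentence_index IF NOT EXISTS
#   FOR (n:Person) ON (n.embedding)
#   OPTIONS {indexConfig: {
#     `vector.dimensions`: 384,
#     `vector.similarity_function`: 'cosine'
#   }}
```

The dimension count in the index has to match the embedder, which is why the 384 of paraphrase-MiniLM-L6-v2 shows up in the index configuration.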
We’ve prepared our knowledge graph, let’s get into it. We’re going to ask a question, embed it, and try semantic similarity. What questions can we ask of this data? I came up with nine. From the first to the ninth are increasingly difficult setups for the LLM to answer. It starts out with simple stuff, like, who are Angie Bond’s children? More difficult stuff, which of Bob Bond’s marriages produced children, and how many? A really hard one at the end, Bob Bond is throwing a party exclusively for his grandchildren.
First, identify who Bob’s grandchildren are. Next, identify each grandkid’s favorite beverage. If the grandkid is a boy, allocate him 5 cans of the beverage. If the grandkid is a girl, allocate her 10 cans of the beverage. Then produce a list of beverage types, counts, and costs. These aren’t difficult things for you or me to do, but for the LLM, they represent some pretty challenging things given this dataset. It’s got to traverse a family tree. It’s got to group things by a particular set of categories. It’s got to do math. None of those things are particularly easy for LLMs, at least at the time I ran this. They are getting better all the time, but still, these are challenges. It may be worth calling out that since doing this rapid prototype, MCP has become increasingly popular.
I can’t tell you much about it today, but I’ve experimented with giving the LLM a tool to allow it to traverse a graph. Using a graph makes traversal really easy. That is another way to accomplish this, especially when the question might be pretty lengthy, if it would have to traverse 5, 10, 20, 25 different hops. I think MCP would be the way to go about it. This particular dataset is small enough, so as you’ll see, it didn’t turn out to be a problem, but worth considering.
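The bookkeeping that hardest question asks for is easy to sketch outside the LLM, which is part of why it makes a good stress test. The grandchildren, drinks, and prices below are hypothetical placeholders, not the talk’s actual dataset.

```python
from collections import defaultdict

# Hypothetical grandchildren and prices; the real dataset's values differ.
grandkids = [
    {"name": "Daisy Bond", "gender": "female", "favorite": "Ginger Ale"},
    {"name": "Dan Bond", "gender": "male", "favorite": "Pepsi"},
]
can_cost = {"Ginger Ale": 0.75, "Pepsi": 0.65}

# Per the question's rules: girls get 10 cans of their favorite,
# boys get 5.
counts = defaultdict(int)
for kid in grandkids:
    counts[kid["favorite"]] += 10 if kid["gender"] == "female" else 5

# The requested list of beverage types, counts, and costs.
order = {
    bev: {"cans": n, "cost": round(n * can_cost[bev], 2)}
    for bev, n in counts.items()
}
```

The LLM has to reproduce exactly this kind of grouping and arithmetic from retrieved sentences alone, with no ability to loop over the data.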
Let’s start. We’ve got our question, now we’ll do the sentence embedding. The sentence embedding in Python is one line. It’s really easy: question_embedding = model.encode(question_9). Ollama makes this really simple. We take that big question about Bob Bond and costing out the drinks, and we turn it into an embedding. That’s done. Now semantic similarity. Again, remember, this is where we’re comparing embeddings to each other to see how close in meaning they are, not keywords. What’s the keyword distinction? For so long, before all the embedding stuff came along, keywords were the way to go. Remember fuzzy matching? With keywords, you’d say, I’m searching through my data for something to do with dogs. You might get back anything that had dogs, doggy, dog food, sheep dog, but nothing about canines or Labradors, because they don’t have D-O-G. The keyword search would know nothing about relevance. You’d get back everything that matched, whether it was dog food or sheep dog. It’s just everything that matches those characters.
With semantic search, when you search for dog, the meaning gives you canine, puppy, man’s best friend, all the dog breed names. If it’s sentences that you’re looking for, it’ll rank them by relevance. Because it’s not just the match, it’s how similar in meaning the search term is to the search data. That’s amazing, isn’t it? You might wonder, what is the meaning of meaning? Here’s a wonderful explanation from a Google research article called Advances in Semantic Textual Similarity. The top question, how are you, has all but one of the same words as the second question, how old are you? They’re very semantically different. Whereas the second question, how old are you, has no words in common with the third question, what is your age? They’re semantically very similar.
This Google article explains, it’s the answer to the question that helps you figure out how similar they are. If the answer is very similar, then the questions are very similar. How old are you? I’m 20 years old. How are you? Great. Different. Versus, how old are you? What is your age? Same answer.
Here’s how you do a semantic search in Python code. This is all built into the Neo4j graph database. It’s basically Cypher stuff. Then we sort the results by relevance, and this gives us all the matching sentences in the graph. We can cut it off at any level of relevance. You can see on the right, it’s the sentence followed by parentheses with the relevance score, which is basically a percentage. The top one is 0.708, that’s about 71%. The bottom one is 61%. We take the top X number of relevant sentences, depending on how much you want to ladle onto your LLM. Before we give it back to the LLM, we take advantage of another cool thing about graph, which is connections also indicate relevance. Let’s traverse the graph for all the hits that were relevant to that particular question, semantic similarity, and work our way out.
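In outline, those scored hits come back from a vector-index query and get cut off at a relevance threshold. The Cypher in the comment uses Neo4j 5’s vector index procedure with an assumed index name; the hit sentences and scores below are invented stand-ins for the slide’s results.

```python
# Roughly the query behind the scored hits (Neo4j 5's vector index
# procedure; the index name and top-k are assumptions):
#   CALL db.index.vector.queryNodes('sentence_index', 10, $question_embedding)
#   YIELD node, score
#   RETURN node.sentence AS sentence, score ORDER BY score DESC

# Invented (sentence, score) pairs standing in for the slide's results:
hits = [
    ("Bob Bond is married to Betty Bond from 1995 to 2008.", 0.708),
    ("Carla Bond likes Pepsi.", 0.652),
    ("Adam Adams is the father of Angie Bond.", 0.610),
]

# Keep only the hits above a relevance cutoff, best first.
CUTOFF = 0.65
relevant = [sentence for sentence, score in hits if score >= CUTOFF]
```

Where to set the cutoff, and how many sentences to keep, is exactly the “how much to ladle onto your LLM” decision.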
If the hits were just two things, one, this lower left node, and two, this upper right red relationship, it stands to reason that in this related collection of data, the nearest neighbors might also be relevant to the hits. We can run a program that steps outwards from the node: we’ll go relationship, node, relationship, node, in all possible paths outward. From the red relationship, we’ll go node, relationship, node, in all possible paths outward. This, in a way, solves our traversal problem of the family tree genealogies. Here’s what it looks like. Again, it’s heavy in Cypher, light in Python. The steps are, you get a sentence from the source node, you get all the sentences from all the relationships. You step out, do it again, step out, and return all those. The assumption here is that all of those will be of about equal relevance.
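The stepping-outward idea reduces to a bounded breadth-first walk. A minimal sketch over an adjacency list, with an invented family fragment; in the prototype the equivalent happens in Cypher with variable-length path patterns.

```python
def neighbors_within(graph, start, hops):
    """Collect all nodes reachable from `start` in at most `hops` steps."""
    frontier = {start}
    seen = {start}
    for _ in range(hops):
        # Expand one relationship-node step outward in all directions.
        frontier = {nxt for node in frontier for nxt in graph.get(node, [])} - seen
        seen |= frontier
    return seen - {start}

# Invented adjacency-list fragment of the family graph.
family = {
    "Bob Bond": ["Betty Bond", "Carl Bond"],
    "Carl Bond": ["Bob Bond", "Daisy Bond"],
    "Daisy Bond": ["Carl Bond"],
}

one_hop = neighbors_within(family, "Bob Bond", 1)
two_hops = neighbors_within(family, "Bob Bond", 2)
```

Collecting the `sentence` property from every node and relationship visited this way yields the extra context that goes into the prompt.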
GenAI Prompts and Outputs
We’ve taken the question, embedded it, run semantic similarity, traversed the graph for additional relevant material, and now we’re going to give it back to the LLM. If you haven’t worked with this stuff before, I’m sure many of you have, but the weird thing is the question comes at the end and all the context goes before it. You flip the story of the way we’ve been going around this seven-step chart by giving back all the relevant stuff first, and saying, this is the context, LLM, and then, here’s the question that was asked. What does the LLM do with it? Here come some of the big decision points. What LLM should you use? What temperature should you use? How much of your relevant results should you give back? How long are you willing to wait? I experimented with QwQ, which is an Alibaba LLM, Llama 3.2, Qwen2.5, which I think is also Alibaba, and two different versions of DeepSeek, and I found it was a real balance between response times, level of hallucination, and accuracy of results. QwQ did the best in this particular time and place. It’s changing a lot. It changes all the time. Let’s see what actually happened here.
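That inversion, context first and question last, is just how the prompt gets assembled. The roles and wording here are illustrative; with the `ollama` Python package, a messages list like this could then be passed to something like `ollama.chat(model="qwq", messages=messages)`.

```python
def build_prompt(context_sentences, question):
    """Assemble the chat messages: all retrieved context first,
    the original question last."""
    context = "\n".join(f"- {s}" for s in context_sentences)
    system = "Answer using only the facts provided in the context."
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_prompt(
    ["Bob Bond is the father of Carl Bond.", "Daisy Bond likes Ginger Ale."],
    "Bob Bond is throwing a party exclusively for his grandchildren.",
)
```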
All of this, again, is running on a trusty little 24 gigabyte of memory laptop with no internet connection. We’ll get to why that’s important. Here we go. We’re going to run it. Here’s the answer. QwQ started answering in just an amazingly organized way. I don’t know if I should read it all to you, but it goes through the original question, Bob Bond is having a party for his grandchildren. Then steps through the whole relationship chain, the son, the marriage, the grandchildren, the favorite beverages of the grandchildren, the beverages by gender, the number of cans for each, how many cans total, and how much it all costs, and that’s not even the end. Then it summarizes.
The total cost, $25.35, but wait, you only asked for a list of beverage types and their counts and costs, so not the total cost. I’ll just list them out, and here’s the list. Then I’m going to give it to you in a format that you could easily represent in a more programmatic environment. It still amazes me, like, how does it work? How can it do this? The LLM was able to talk to our graph, and it did just as well with all the easier questions too.
When you do this kind of stuff, the considerations I would offer would be data security. If you can’t be bothered to set up LLMs on your own local hardware, like this example shows, you can definitely use online ones: OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, or DeepSeek. However, that means sharing your prompt data externally, and if any of the data in your knowledge graph is private or sensitive, then using a local model will keep your data safe locally. As we learned in a previous presentation, it turns out OpenAI is potentially going to be required to keep every single exchange, every text inquiry, permanently. I know in a corporate setting, this is a big issue, so running things locally can be very helpful.
Then, legally, I recommend reading the license agreements. If you’re using LLMs to help write software, do you own the software at the end of the day, or could there be an ownership dispute with the owner of the LLM? There are plenty of them that have good licenses like MIT and Apache where there are absolutely no strings attached and you can use them however you want. Not all LLMs are built and run equally. Here’s a comparison between the QwQ answer that we just looked at and the answer from Llama 3.2. To be fair to Llama 3.2, on this laptop, I used a pretty small version. The larger ones and the more recent ones, I’m sure, would do much better. In this case, Llama’s answer for the same question with the same data and the same graph was only correct on one line. It only got Daisy Bond as the granddaughter right, and everything else was wrong: the beverages, the math, the relationships. This one I find particularly tricky.
In a previous presentation, a really nice one by a Red Hat gentleman, one of the tricks for evaluating the results was: if the words in the answer are different from the context and the question, that might raise a red flag. In this example, the words are all the same. They’re just mixed-up relationships and wrong math. I’m not quite sure; maybe some of you have figured out better ways to evaluate stuff like this, maybe with other LLMs to test it, and so on. It still feels like a very thorny area: how to be sure that you’re getting it right. That closes the circle from one to seven: we talked to our graph with an LLM. You can find all these materials online.
Questions and Answers
Participant 1: What about step 6.1, adding RBAC to the source data that we add into the LLM? Adding RBAC, role-based access control. Assuming this is an organizational solution, not everybody can see the pay slips. Have you stumbled across a solution around that?
Jonathan Lowe: Around role based? Unfortunately, I have not.
Participant 2: You talked about doing a semantic search of your sentences using the question. Semantically the meaning of the question is different from the sentences. Is it a trick in the embedding? How does the search come up with the relevant sentences?
Jonathan Lowe: When taking a sentence and comparing it to the sentences in the graph, why are there any matches at all if the question is not similar in meaning?
Participant 2: Yes. The meaning of the question is completely different from the meaning of the sentences themselves.
Jonathan Lowe: In this particular example, the meaning of the question, which was about the birthday and the drink costs and the relatives, there are pieces of meaning in the question that match pieces or specific sentences in the graph. That’s why the relevance scores were not 100%. They started around 70%, and went down even below 60% if you take back all the hits. They’re saying things like, Bob Bond. There’s a question that has Bob Bond in it, and he’s going to throw a party. Is there a Bob Bond reference in the graph? There is. There’s a node about him. Then there are relationships he has to different people through marriage or through childbirth. Those come back with a partial relevance hit. I think context is the big answer there. By getting back all these different pieces of the answer, you get a broad context that the LLM then stitches together on its own to give a meaningful answer to the original question.
Participant 3: I was wondering if the added step and complexity of turning your data into a graph actually had a big impact, because math and numbers can still go wrong with a smaller LLM, even with a graph database. Did you notice a big difference in accuracy or adherence to context if you just threw in unstructured data versus graphing it out first and then passing it in?
Jonathan Lowe: Was there a big difference in the LLM's answer if the data source was graph data versus unstructured text documents? Not really, though there's a nuanced answer. The neat thing about the graph sentences is that they come from a structured source, so they're tighter than typical document sentences. They're smaller and more specific. In a document, you're writing prose. You might have a sentence that runs for 10 lines and another that runs for one, and a single sentence might contain a variety of different pieces of information. In this graph approach, the sentences are typically short, tight, and specific, and that specific data helps the LLM more than extensive, complex data. It's easier for an LLM to work with really focused inputs and produce a more accurate output. Yes, to that part.
The other thing I've noticed about working with graphs, and the effort it takes to get to that point, is that it generally saves my team a lot of ramp-up time, especially with new joiners, because of that ability to read the data like a newspaper. We have needed so much less effort to understand the data we already have, and when a project needs to reconfigure that data, we just keep building on the graph. Every project adds new relationships and potentially new nodes that all the downstream people can then take advantage of. It's all in one place, because the graphs can live together in one database. That's been an advantage, too. It's somewhat different from your question about accuracy, but yes.
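The "read the data like a newspaper" quality comes from rendering each relationship as one short, single-fact sentence. A minimal sketch of that step, with made-up node names and relationship types (the Cypher query in the comment is illustrative, not from the talk):

```python
# Hypothetical triples as they might come back from a Cypher query like:
#   MATCH (a)-[r]->(b) RETURN a.name, type(r), b.name
triples = [
    ("Bob Bond", "MARRIED_TO", "Ann Bond"),
    ("Bob Bond", "PARENT_OF", "Tim Bond"),
    ("Tim Bond", "LIKES", "Fanta"),
]

def triple_to_sentence(subj: str, rel: str, obj: str) -> str:
    """Render one relationship as a short, single-fact sentence."""
    verb = rel.replace("_", " ").lower()  # MARRIED_TO -> "married to"
    return f"{subj} is {verb} {obj}." if verb.endswith(("to", "of")) else f"{subj} {verb} {obj}."

sentences = [triple_to_sentence(*t) for t in triples]
# -> ["Bob Bond is married to Ann Bond.",
#     "Bob Bond is parent of Tim Bond.",
#     "Tim Bond likes Fanta."]
```

Each output sentence carries exactly one fact, which is why these inputs are so much tighter than document prose.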
Participant 4: Compared to SQL, what if you just asked the LLM to run a SQL query on your data, or a query on the graph?
Jonathan Lowe: If you just ask the LLM to run a SQL query on your relational database or a Cypher query on your graph database, why not just do that? There's a company I encountered in February at an AI conference in New York, called PromptQL, and they were very bold about getting close to 100% accuracy on converting a human language question into SQL, returning that answer from a database, and then converting it back into a human language answer. In their demonstration, it didn't work perfectly, but I think they're well on the way. I haven't seen a similarly successful conversion for Cypher yet. I don't have a lot of confidence that even with SQL and PromptQL, if the data source were very complicated and the question were very complicated, it would be as reliable as 100%. This could be a variation on the theme.
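Text-to-Cypher in practice usually means handing the LLM the graph schema alongside the question. A hedged sketch of such a prompt builder; the function name, prompt wording, and schema string are all illustrative assumptions, not PromptQL's or anyone else's actual API:

```python
def build_text_to_cypher_prompt(schema: str, question: str) -> str:
    """Assemble a prompt asking an LLM to translate a question into Cypher.
    The schema string grounds the model in real labels and relationship types."""
    return (
        "You translate questions into Cypher for a Neo4j database.\n"
        f"Graph schema:\n{schema}\n"
        "Return only a single Cypher query, with no explanation.\n"
        f"Question: {question}\n"
        "Cypher:"
    )

prompt = build_text_to_cypher_prompt(
    schema="(:Person)-[:PARENT_OF]->(:Person), (:Person)-[:LIKES]->(:Drink)",
    question="Which drinks do Bob Bond's children like?",
)
# The prompt would then go to an LLM, and the returned Cypher should be
# validated (e.g. with EXPLAIN) before execution, since, as noted above,
# accuracy on complicated schemas and questions is far from guaranteed.
```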
Participant 5: To come back to the difference between the structured graph data you've got and unstructured document data: you said that in documents you're writing prose in fancier language, as opposed to having just these noun-verb relationships; you have adverbs and other filler words in there. Could you not take that unstructured data and model it using that same Cypher approach, adverbs, filler words, prose and all?
Jonathan Lowe: Take a look at LlamaIndex. This is a company that takes unstructured documents, extracts the data out of them into a graph format, attempts to deduplicate and standardize labels, converts adverbs into properties as necessary, and then uses that for similar purposes in conjunction with the documents. It's not 100% perfect at deduplicating yet, but they're also on that path. That's one way of attacking the problem. What gets to me about this whole space, where we're trying to just talk to documents, is that for decades we've all worked with all these different structured databases. We've put such effort into modeling the data and getting it into a database so that we can expose it in our applications. It feels like gold just waiting to be mined if we could give non-technical people a clean way to access it. Because the more people use these databases, the more accurate they become, and the more valuable they become.