LLM Knowledge Bases Explained
Andrej Karpathy is someone whose content I personally enjoy. He was the Director of AI at Tesla, where he worked on computer vision for Autopilot, and he is also a founding member of OpenAI. His YouTube channel is pretty awesome too: he teaches deep learning not just to professionals but also to the general public.
Not so long ago he posted on X about a new idea: leveraging LLMs to build personal knowledge bases. You use LLMs to reason over and synthesize information across documents, as well as to answer queries. He gave a cool analogy to tie these ideas together, which I’ll explain now.
Ok, so if you have coded before you will know the basics of how programs are run. We humans write the source code in a language, be it Python, Java etc. But our machines can’t directly understand this, so we run the source code through a compiler or an interpreter to turn it into instructions that the machine can execute. It is a transformation from one format to another, and the result is what gets executed. Applying this idea to our setup, we have that:
- Raw documents such as research papers, notes, articles etc. are our source code. This is the raw information; it is messy and unstructured.
- The LLM is the compiler. It will take the raw documents and compile them into something it can understand and reason over to generate responses.
- The executable code is a wiki. It is created by the LLM and is structured to store entities, topics and LLM-generated overviews/summaries of our data.
So the idea is simple. Based on our raw documents, the LLM will generate and update our wiki and use it to answer queries. The raw documents themselves remain unchanged. The pages in the wiki contain references to entities and topics, which help the LLM follow connections across pages. In his original post on X, Karpathy said that he even likes to feed the outputs of his queries back into the knowledge base so that later queries can build on them. So really what you are looking at is a simple but neat system to compose your own knowledge bank and enhance it over time.
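To make the "compile" step a bit more concrete, here is a minimal sketch of how raw documents could be turned into wiki pages. It is not Karpathy's implementation or the code in my repo. The function `compile_wiki`, the page naming scheme (`summaries/...`, `topics/...`) and the two helpers are all hypothetical, and the helpers are crude stand-ins for what would really be LLM calls.

```python
from typing import Callable, Dict, List

def compile_wiki(
    raw_docs: Dict[str, str],
    summarize: Callable[[str], str],
    extract_topics: Callable[[str], List[str]],
) -> Dict[str, str]:
    """Build wiki pages from raw documents without modifying the documents."""
    wiki: Dict[str, str] = {}
    topic_sources: Dict[str, List[str]] = {}

    # One summary page per raw document, linking out to its topics.
    for name, text in raw_docs.items():
        topics = extract_topics(text)
        links = ", ".join(f"[[topics/{t}]]" for t in topics)
        wiki[f"summaries/{name}"] = f"{summarize(text)}\n\nTopics: {links}"
        for topic in topics:
            topic_sources.setdefault(topic, []).append(name)

    # One page per topic, linking back to every summary that mentions it.
    for topic, sources in topic_sources.items():
        refs = "\n".join(f"- [[summaries/{s}]]" for s in sources)
        wiki[f"topics/{topic}"] = f"# {topic}\n\nMentioned in:\n{refs}"
    return wiki

# Stand-ins for LLM calls (assumptions, for illustration only).
def first_sentence(text: str) -> str:
    return text.split(".")[0].strip() + "."

def keyword_topics(text: str) -> List[str]:
    vocabulary = ("RAG", "embeddings", "wiki")
    return [word for word in vocabulary if word.lower() in text.lower()]

raw_docs = {
    "note1": "RAG uses embeddings. More text.",
    "note2": "A wiki links pages. RAG is different.",
}
wiki = compile_wiki(raw_docs, first_sentence, keyword_topics)
print(wiki["topics/RAG"])
```

The topic pages are what let a reader (or an LLM) hop from one document's summary to another's, which is the cross-referencing described above.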
He also mentioned a health check that can be run every now and then. It checks the wiki for contradictions, gaps and inconsistencies in the data.
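Part of such a health check can be done without an LLM at all. The sketch below is my own illustration, not something from Karpathy's post: it assumes the wiki is a dictionary of page name to markdown text using Obsidian-style `[[wikilinks]]`, and the function name `health_check` is hypothetical. It only finds structural gaps (links to pages that do not exist, pages nothing links to). Spotting contradictions would still need the LLM to read the pages.

```python
import re
from typing import Dict, List

# Captures the target of [[Target]], [[Target|alias]] or [[Target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")

def health_check(wiki: Dict[str, str]) -> Dict[str, object]:
    """Report broken links and orphan pages in a wiki."""
    broken: Dict[str, List[str]] = {}
    linked = set()
    for page, body in wiki.items():
        for target in WIKILINK.findall(body):
            target = target.strip()
            if target in wiki:
                linked.add(target)
            else:
                broken.setdefault(page, []).append(target)
    # Orphans are pages that no page links to.
    orphans = sorted(page for page in wiki if page not in linked)
    return {"broken_links": broken, "orphans": orphans}

example_wiki = {
    "index": "See [[RAG]] and [[Embeddings]].",
    "RAG": "Uses [[Embeddings|vectors]] and [[Chunking]].",
    "Embeddings": "Vector representations of text.",
}
report = health_check(example_wiki)
print(report)
```

Here the `Chunking` page is referenced but was never written, which is exactly the kind of gap you would want to hand back to the LLM to fill.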
You may also come across RAG (Retrieval Augmented Generation), which is another way to have LLMs answer questions over documents. The basic idea is that you chunk your documents (since the LLM may not be able to read the entire thing), use an embedding model to convert these chunks to vectors and store them in a vector database. When querying, the query is converted to an embedding by the same model and a similarity search (using something like cosine similarity) is done to get the most relevant chunks, which are then injected into the LLM’s context to answer the question. The issue is that this is fragmented and gives the LLM a local rather than a global understanding of the data. You may not even retrieve the chunks needed to answer the query! The wiki approach, by contrast, is more agentic: the LLM navigates the wiki itself, and the wiki is structured so that the LLM can understand it well.
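The RAG retrieval steps can be sketched in a few lines. This is a toy version under loud assumptions: the `embed` function below is just a bag-of-words count standing in for a real learned embedding model, a plain Python list stands in for the vector database, and all function names are my own. Only the shape of the pipeline (chunk, embed, rank by cosine similarity) matches what real systems do.

```python
import math
import re
from collections import Counter
from typing import List

def chunk(text: str, size: int = 8) -> List[str]:
    """Split text into fixed-size chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

document = (
    "Obsidian stores notes as markdown files on disk. "
    "Embedding models map text to vectors for similarity search. "
    "The compiler turns source code into executable code."
)
chunks = chunk(document)
top = retrieve("how do embedding models map text to vectors", chunks, k=1)
print(top[0])
```

Notice that the fixed-size chunking cuts "similarity search" in half, with "search." landing in the next chunk. That is a small-scale version of the fragmentation problem: whatever the retriever returns, the LLM only sees those pieces.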
Having said that, there are some potential downsides. For one, this would probably work for maybe 100 or so documents but may not scale well beyond that. It is a known weakness of LLMs that as the context grows, their ability to retain and process that information can degrade noticeably, even if the context limit has not been reached. Additionally, since the wiki consists of LLM-generated summaries and overviews, it may be devoid of the subtle details you might be after in a query. Nonetheless, it does seem like a promising way to interact with, reason over and build connections across information.
The whole idea seemed pretty interesting to me, so I decided to try it out for myself. You can see my repo here if you are looking to do something similar or use it. You can clone this repo and open it in an agentic coding tool of your choice, like Claude Code, OpenCode, Antigravity etc., since these can access your file system for reading and writing. To use this repo you will need Obsidian, a note-taking application that is completely free to download and that also has a Web Clipper extension that can turn web articles into markdown files.
To make the knowledge base actionable I decided to add a quiz component that allows you to create multiple choice and written questions on a topic of your interest to test yourself. You can do what you like with this idea so play around with it! 😀