
Hosting my own Large Language Model (FreddyGPT)


Large Language Models (LLMs) are an incredibly powerful productivity tool, but a number of problems with them remain, including prompt injection attacks, information leakage, and hallucinations.


As someone with a history of self-hosting applications, I was curious whether I could self-host my own LLM, not only for privacy and security reasons but also to better understand how they work. Additionally, if I hosted my own LLM, I could let it answer questions about my own files without putting those sensitive files on the cloud (I self-host my own version of Dropbox on my NAS as well).


There are two ways in which an LLM can be augmented with local files and information. The best-known method is “fine-tuning,” in which the model is re-trained with the additional information included in its data set. This is similar to the technique I use with Stable Diffusion, where I train it to understand a new “concept” so it can generate photos or artwork of a specific subject (such as myself). Fine-tuning is particularly useful for teaching a model a specific style of text, but it requires a large amount of compute, and the model must be re-trained whenever new data is added. That makes it unsustainable for data sets that are constantly evolving (such as my meeting notes).


Another method is Retrieval Augmented Generation (RAG). RAG takes all the documents in a specific folder, splits them into relatively small snippets of text (from roughly a sentence to a paragraph in length), and then “vectorizes” each snippet into a numerical representation of what the text says. When a prompt comes in, the system vectorizes it the same way and does a quick search for stored vectors that are close to the question. (A search engine like Bing or Google Bard does the same thing, except that instead of vectorizing a folder of files, it vectorizes the first few results of a web search on the prompt.) The matching snippets are then mapped back to their original text and added to the LLM’s prompt as part of its “context window”. For example, if I were to ask my LLM “What did Freddy Dopfel major in in college?”, the information actually sent to the LLM would be: “resume.doc - ‘EDUCATION - UC Berkeley, College of Engineering (2008-2012) - B.S. Engineering Physics, Minor Electrical Engineering and Computer Science. Stanford University, College of Engineering (2016-2018) - M.S. Management Science and Engineering.’ What did Freddy Dopfel major in in college?” The LLM would use the data from the file to answer the question and could cite the file the data came from. However, the LLM may still hallucinate, for example by listing my Masters program instead of my undergraduate degree.
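To make the retrieval step concrete, here is a minimal sketch of that RAG pipeline in Python. It substitutes a toy bag-of-words vector for the learned embeddings a real system would use, and the file names and snippet texts are made up for illustration:

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words vector. A real setup would use a
    sentence-embedding model instead of word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(snippets, question, k=1):
    """Return the k snippets whose vectors are closest to the question."""
    q = embed(question)
    ranked = sorted(snippets, key=lambda s: cosine(embed(s["text"]), q),
                    reverse=True)
    return ranked[:k]

def build_prompt(snippets, question):
    """Prepend the retrieved text (with its source file) to the question,
    mirroring the 'resume.doc - ...' example above."""
    context = "\n".join(f"{s['file']} - '{s['text']}'" for s in snippets)
    return f"{context}\n{question}"

# Illustrative stand-ins for a vectorized document folder.
snippets = [
    {"file": "resume.doc",
     "text": "Freddy majored in Engineering Physics at UC Berkeley"},
    {"file": "notes.txt",
     "text": "Met with a robotics startup in June"},
]
prompt = build_prompt(retrieve(snippets, "What did Freddy major in?"),
                      "What did Freddy major in?")
```

Everything after `build_prompt` runs would normally be handed to the LLM; the point is that the model only ever sees the question plus a few retrieved snippets, never the whole document folder.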


RAG is great for asking questions about a continuously changing data set and for reducing (though not completely eliminating) hallucinations, since the output can cite the files it pulled its data from. That makes it possible to check the LLM’s work, something that is essential as we rely more heavily on LLMs in our daily workflows.


Although most users running single-GPU systems (myself included) lack the resources to fine-tune a full LLM and can therefore only use RAG, many larger corporations may benefit from both approaches. For example, a law firm might fine-tune an LLM on every legal opinion it can find on the web so the model learns to “talk like a lawyer,” then use RAG over its own documents and legal opinions for more useful and accurate results.


The explosion in popularity of LLMs has led not only to a large number of increasingly capable models, but also to quite a few GUIs that let people (with a powerful enough PC) run these models quite easily. Thankfully, I’ve built a PC specifically for working with AI models. My criteria for an LLM manager were that it support running on both CPU and GPU (my GPU is faster, but has only 24GB of VRAM, whereas the CPU can access 64GB of system RAM for larger, slower models), that all compute happen locally, and that it support RAG.
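As a rough illustration of that memory trade-off, the sketch below picks a device for a given model size the way I reason about it by hand. The 24GB/64GB defaults are my machine's specs; the 1.2x headroom factor is a guessed rule of thumb for activations and cache, not anything these tools actually compute:

```python
def pick_device(model_size_gb, vram_gb=24, ram_gb=64):
    """Prefer the GPU when the model weights (plus some headroom) fit
    in VRAM; otherwise fall back to slower CPU inference in system RAM."""
    headroom = 1.2  # guessed allowance for activations / KV cache
    if model_size_gb * headroom <= vram_gb:
        return "gpu"
    if model_size_gb * headroom <= ram_gb:
        return "cpu"
    return "too-large"
```

So a ~13GB quantized model lands on the GPU, a ~40GB model runs slowly on the CPU, and anything much bigger simply doesn't fit on this machine.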


GPT4All provides a convenient graphical user interface, supports RAG (including on documents stored on my NAS), and is generally easy to use. I particularly liked that if the GPU ran out of memory while running an LLM, GPT4All would automatically fall back to running the model on the CPU (albeit more slowly). This reliability, even at the expense of performance, is critical for future projects I aim to build.
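That fallback behavior can be sketched as a generic try-in-order pattern. The `load_fn` below is a hypothetical stand-in for whatever loader the GUI calls under the hood; this is my guess at the logic, not GPT4All's actual code:

```python
def load_with_fallback(load_fn, devices=("gpu", "cpu")):
    """Try loading a model on each device in order, returning the first
    one that succeeds along with the device it landed on."""
    last_err = None
    for device in devices:
        try:
            return load_fn(device), device
        except (RuntimeError, MemoryError) as err:
            last_err = err  # e.g. out of GPU memory; try the next device
    raise RuntimeError(f"no device could load the model: {last_err}")
```

The appeal is exactly what I saw in practice: an out-of-memory error on the GPU degrades to slower CPU inference instead of a crash.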



h2oAI provided more customization of the models but was less user-friendly, and I had accuracy issues when it was performing RAG. I liked that h2oAI was accessible via a web interface, which could make for easier deployment; however, my biggest qualm was that (at least in my testing) it deleted its RAG data every time it restarted, making it a poor fit for long-term use. h2oAI did have one really exciting feature, though: built-in optical character recognition (OCR). OCR would let it search through and index (for RAG) a much wider variety of files, including scans of documents like tax returns, receipts, or anything else I had saved.
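One way an indexer might decide when OCR is needed is a simple dispatch on file type before vectorizing. The mapping below is hypothetical, sketching the idea rather than h2oAI's actual implementation:

```python
from pathlib import Path

# Hypothetical mapping from file type to extraction step. Plain text is
# read directly; image-like scans go through an OCR engine first.
EXTRACTORS = {
    ".txt": "read_text",
    ".md": "read_text",
    ".pdf": "pdf_text_layer",  # fall back to OCR if the layer is empty
    ".png": "ocr",
    ".jpg": "ocr",
    ".tiff": "ocr",
}

def pick_extractor(path):
    """Return the extraction step for a file, skipping types we can't
    turn into text at all."""
    return EXTRACTORS.get(Path(path).suffix.lower(), "skip")
```

With a step like this in front of the vectorizer, a scanned tax return gets indexed right alongside ordinary text documents.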



Using my RAG data set of meeting notes, personal documents, and years of saved industry reports, I found interacting with the AI to be a useful starting point for researching a market, or for recalling the name of a company I had previously met with based on facts I remembered about it. Although I am using this system primarily as an advanced search engine for my personal data, I expect to eventually integrate it into more of my workflow.


Next Steps:


I plan on adapting this work for both personal productivity and for my team at LG Tech Ventures. 


For LG Tech Ventures, this technology can substantially improve team productivity. There are concerns at LG around using public LLMs, but hosting our own eliminates the risk of data leakage. Most of the difficult implementation work is in complying with security rules around handling notes, pitch decks, and other information that LG Tech Ventures has custody of. We can then build a computer (likely with specs very similar to my own) and load it with the relevant data. Operationally, we will also have to figure out how to continually update that data without using cloud providers or blacklisted file-sharing software. Finally, we will need to build a web portal for interacting with the LLM and ensure that it complies with LG security standards. The team is excited about this idea, especially when it can be paired with automatic transcription of meetings and founder pitches: all of that additional data, painful to sort through manually, could be extremely useful once ingested into an LLM.



For my own personal productivity, I plan on dedicating an NVIDIA Jetson development board to run GPT4All 24/7 in my networking closet. GPT4All is the better solution for this use case because it has automatic CPU fallback, plays nicely with my existing file management system, and, perhaps most importantly, can emulate the OpenAI API. This last point is particularly important because it means I can re-route traffic from apps built for OpenAI to my own LLM. HomeAssistant, the software I use to control my home, already supports ChatGPT, so by editing that extension to use my own LLM, and using my own Wi-Fi microphones for collecting voice data, I could build a fully local Amazon Echo / Google Home for private, intelligent, personalized conversational AI. I also hope to integrate this into my self-hosted search engine (more on that in another post).
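To show what “emulating the OpenAI API” means in practice, here is a self-contained sketch using only the standard library: a stub server that speaks the chat-completions request/response shape, and a client that can be pointed at any base URL. In a real deployment the stub would be replaced by GPT4All's local server, and apps like the HomeAssistant extension would simply have their base URL changed:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ChatHandler(BaseHTTPRequestHandler):
    """Stub of the one endpoint most OpenAI-style apps call. A real
    setup would have the local LLM server answering here instead."""
    def do_POST(self):
        if self.path != "/v1/chat/completions":
            self.send_error(404)
            return
        length = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(length))
        user_msg = body["messages"][-1]["content"]
        reply = {"choices": [{"message": {
            "role": "assistant", "content": f"echo: {user_msg}"}}]}
        data = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # keep the demo quiet
        pass

def ask(base_url, prompt):
    """Send an OpenAI-shaped chat request to whatever server base_url
    points at; swapping this URL is the whole re-routing trick."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps({"model": "local", "messages": [
            {"role": "user", "content": prompt}]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

server = HTTPServer(("127.0.0.1", 0), ChatHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
answer = ask(f"http://127.0.0.1:{server.server_port}", "hello")
server.shutdown()
```

Because the client only knows a base URL and a JSON shape, any app written against OpenAI's chat endpoint can be redirected to a self-hosted model without changing the app's code.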




