Hosting my own LLM - Part 2

I previously wrote about my attempts at hosting my own LLM on my AI Machine, but I wanted to take that working prototype and turn it into something actually usable in my everyday life.


Fortunately, other hackers have been similarly interested in hosting their own LLMs, and the open-source community has been active in making self-hosted, private LLMs accessible to everyone. Two projects in particular have made huge strides in accessibility: Ollama and OpenWebUI. The former is an easy, cross-platform way to manage and host LLMs, and the latter is a self-hostable user interface that lets anyone chat with those models through a web browser.


Starting with the user interface, I decided to break with my usual philosophy of a dedicated device for each application and instead host OpenWebUI in a Docker container on my Synology NAS. I am still not a huge fan of Docker generally, but it was an easy way to get the interface up and running quickly and reliably. Most importantly, the NAS is where all of my files are stored and where my search engine is hosted. Because 8080 is a commonly attacked port, I chose a different port for the WebUI (and no, I'm not posting it here).
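For anyone trying to replicate this, the container setup is essentially a one-liner. A minimal sketch, assuming Docker is already installed on the NAS (the host port here is a placeholder; OpenWebUI listens on 8080 inside the container):

```bash
# Run OpenWebUI in a Docker container; chat data persists in a named volume.
# 3000 is a placeholder host port -- pick your own non-default one.
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -v open-webui:/app/backend/data \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```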


I set up the models themselves using Ollama. Although I had previously experimented with other self-hosting technologies, I chose Ollama because it seems to be one of the easiest to work with and to integrate with other tools (more on that in a later post). Ollama also runs on Mac, Windows, and Linux, making it easy to deploy across multiple devices. I have it running on my AI PC, my primary PC, my Minecraft server in the closet (after giving it a RAM upgrade), and in a Docker container on my NAS. The machines with more GPU horsepower run more powerful models with more parameters, while the less powerful ones (the NAS and the Minecraft server) run smaller models on their CPUs. Ollama manages the setup and running of these models and automatically detects what can run on GPU versus CPU.
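For reference, getting a model running on any one of these machines only takes a couple of commands. A sketch, assuming a Linux box (the model names are just examples; pick sizes to match the hardware):

```bash
# Install Ollama, then pull one larger model for the GPU machines
# and one smaller model for the CPU-only boxes.
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3   # bigger model for GPU machines
ollama pull phi3     # smaller model for the NAS and Minecraft server
ollama run llama3 "Tell me a joke."
```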


Next, I pointed OpenWebUI at my various Ollama servers on the network, in order of priority (determined by the power of each machine's GPU and CPU). If my AI PC is offline, it falls back to my office PC; if my office PC is offline, it falls back to the Minecraft server; and if that is offline, it falls back to the NAS. The only real difficulty is ensuring that all of these servers have the same models installed and that Ollama on each machine is configured to allow access over the network. That meant setting an environment variable, something I was unfamiliar with until now: OLLAMA_HOST=0.0.0.0. This opens up Ollama's API for other apps, such as OpenWebUI, to access.
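On Linux, where Ollama runs as a systemd service, the cleanest way I know to set that variable is a service override (on Windows and macOS it can be set as an ordinary user environment variable instead). A sketch, with a placeholder LAN IP for the verification step:

```bash
# Make Ollama listen on all interfaces instead of just localhost.
sudo systemctl edit ollama.service
#   In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
sudo systemctl daemon-reload
sudo systemctl restart ollama

# From another machine, confirm the API is reachable on Ollama's
# default port, 11434 (the IP below is a placeholder):
curl http://192.168.1.50:11434/api/tags
```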

Because open-source models only know about the world up to the time they were released, I wanted to give them more flexibility by adding web search functionality. Thankfully, OpenWebUI supports RAG on search results from SearXNG, which I use as my self-hosted search engine.
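As I recall, this is wired up through a few environment variables on the OpenWebUI container (variable names per the OpenWebUI docs at the time I set this up; the SearXNG URL is a placeholder for my own instance, and SearXNG needs its JSON output format enabled for this to work):

```bash
# Recreate the OpenWebUI container with SearXNG-backed web search enabled.
docker rm -f open-webui
docker run -d \
  --name open-webui \
  -p 3000:8080 \
  -e ENABLE_RAG_WEB_SEARCH=true \
  -e RAG_WEB_SEARCH_ENGINE=searxng \
  -e SEARXNG_QUERY_URL="http://searxng.local:8888/search?q=<query>" \
  -v open-webui:/app/backend/data \
  --restart always \
  ghcr.io/open-webui/open-webui:main
```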

OpenWebUI supports RAG, so I can upload any file or website and talk to the AI about it. What I really want, though, is to run RAG over all my personal and work documents simultaneously so I can ask difficult questions like "What were the names of the silicon anode companies I was talking to?" OpenWebUI supports this, but unfortunately its RAG does not scale well as the number of files in its database grows: the accuracy and relevance of results drop dramatically when it searches through a large number of files. This seems to be a problem with most RAG systems today, and one I hope the open-source community will fix in time. For now, I upload documents one at a time for analysis, but I am experimenting with RAG in a limited way over my own files, creating expert chatbots for different aspects of my work and life (startup searcher, market researcher, instruction-manual troubleshooter, etc.).

Finally, I wanted to edit the models themselves. Although I do not have the horsepower to properly fine-tune a model, I can do some system-level prompt engineering to give it a personality that fits what I am looking for. In this case, I adjusted the system prompts to encourage it to make dad jokes and puns, and gave it a little information about itself and me, its creator.
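For anyone doing this at the Ollama layer rather than in the WebUI, a Modelfile is the standard way to bake in a system prompt. A sketch (the persona text is just an illustration, and "dadbot" is a made-up name):

```bash
# Build a custom Ollama model with a personality in its system prompt.
cat > Modelfile <<'EOF'
FROM llama3
SYSTEM """You are a helpful home AI running entirely on local hardware.
You love dad jokes and work puns into your answers whenever you can."""
EOF
ollama create dadbot -f Modelfile
ollama run dadbot "Introduce yourself."
```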


And if that weren't enough, OpenWebUI also supports image generation! I have fine-tuned my own Stable Diffusion models in past projects, and a web interface is a great way to interact with them. I can use Automatic1111 to run my own fine-tuned models for generating images of myself and my friends, then allow prompting through this interface. Alternatively, I can use ComfyUI and its support for Stable Diffusion 3 (whereas I only managed to get Stable Diffusion 1.5 working in Automatic1111) for creating images that are not fine-tuned (i.e., not of myself or my friends). I settled on ComfyUI since I have it working with more powerful image-generation models, and its UI downsides are moot when prompting happens through the chat interface.

I decided to create a new model called "Image Generator Bot" that responds only with Stable Diffusion prompts, so I can simply describe what I want, have the bot respond with a prompt, and send it to ComfyUI to generate an image. (Note: you must add "--listen" to the command that launches ComfyUI if you want to access it over the network.)
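For reference, that launch command looks something like this (the ComfyUI path is a placeholder for wherever your checkout lives):

```bash
# Start ComfyUI listening on all interfaces so OpenWebUI can reach it
# over the network; ComfyUI's default port is 8188.
cd ~/ComfyUI
python main.py --listen 0.0.0.0
```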


Finally, I did some port forwarding on my firewall to make the interface accessible over the internet and shared the link with my family so they can each make their own account. Now my whole family can enjoy a free, private, and customized OpenAI alternative.
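The port-forwarding rules themselves are specific to my firewall, so there is nothing useful to paste here, but for anyone wary of exposing the raw port, a TLS reverse proxy in front of the WebUI is a common alternative. A one-line sketch using Caddy (the domain and internal port are placeholders):

```bash
# Caddy terminates HTTPS (fetching certificates automatically) and
# forwards traffic to the WebUI on the internal port.
caddy reverse-proxy --from chat.example.com --to localhost:3000
```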

