Everyone's talking about AI. Your accountant wants you to use ChatGPT to draft emails. Your director saw a demo on LinkedIn. And meanwhile, you're wondering: "if I paste my biggest client's contract into ChatGPT, where exactly does that data end up?"
It's a fair question. When you use ChatGPT, Claude, or Gemini through their web interface, your data passes through servers in the United States. Depending on the terms of service, it may be used to train future models. For a generic email, that's fine. For a confidential contract, HR data, or a financial document, it's a different story.
The good news: there are now AI models you can install on-premise, on your own server. Your data never leaves your network. And it's much more accessible than you might think.
What exactly is a local LLM?
An LLM (Large Language Model) is the engine behind ChatGPT, Claude, and the rest. It's a program that has "read" billions of web pages and can then generate text, answer questions, summarize documents, translate, and draft content.
Normally, these models run on massive data centers. But since 2023-2024, the open-source community has closed much of the gap. Models like Llama 3 (Meta), Mistral (French startup), Gemma (Google), and Phi (Microsoft) are available for free, with licenses that allow commercial use. And they're compact enough to run on reasonable hardware.
A local LLM is simply one of these models installed on a server you control. Your employees ask it questions, it answers. The data stays within your walls.
Ollama: the simplest entry point
Ollama is an open-source tool that radically simplifies LLM deployment. Think of it as a "model manager": one command to download a model, one command to start it. No need to compile code, configure dependencies, or fight with GPU drivers for three days.
In practice, installing Ollama on a Linux server takes about five minutes. Then, downloading and launching a model is a single line:
ollama run llama3.3
That's it. The model downloads, loads into memory, and you can start talking to it. Ollama automatically handles GPU detection, memory allocation, and performance optimization.
Ollama runs on Linux, macOS, and Windows. It exposes an API compatible with the OpenAI standard, which means most tools that work with ChatGPT can also work with your local model. More on that below.
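To make that API point concrete, here's a minimal Python sketch using only the standard library. It assumes Ollama's defaults: the server listening on port 11434 and the OpenAI-compatible `/v1/chat/completions` endpoint. The model name is whatever you've pulled (here, `llama3.3`).

```python
import json
import urllib.request

# Ollama's default local address and OpenAI-compatible endpoint
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model: str, prompt: str) -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(model: str, prompt: str) -> str:
    """Send the request to the local Ollama server and return the reply text."""
    data = json.dumps(build_chat_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Requires a running Ollama server with the model already pulled
    print(chat("llama3.3", "Summarize this in one sentence: Ollama serves local LLMs."))
```

Because the endpoint follows the OpenAI format, any client library that lets you override the base URL can talk to it the same way.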
Open WebUI: the interface your employees will actually use
Ollama on its own is a command-line tool: perfect for a technician, unusable for the rest of the team. That's where Open WebUI comes in.
Open WebUI is a free, open-source web interface that connects to Ollama and offers an experience very similar to ChatGPT: a chat window, conversation history, the ability to attach files. Your employees don't need to know there's a local LLM behind it: for them, it's "the company's ChatGPT."
The project is very active (over 45,000 stars on GitHub, weekly updates) and includes features that matter for organizations: user management with roles, conversation history, document upload for contextual questions (RAG), voice support with Whisper, and even a built-in code editor.
Installation is done with Docker in a few minutes. Point Open WebUI at the Ollama server, and you're good to go.
What hardware do you need?
This is the question that always comes up, and the honest answer is: it depends on the model you want to run.
| Model | Parameters | Minimum RAM/VRAM | Good for |
|---|---|---|---|
| Phi-3 Mini | 3.8B | 4 GB | Simple tasks, short summaries, testing |
| Mistral 7B | 7B | 8 GB | Good quality/size ratio, multilingual |
| Llama 3 8B | 8B | 8 GB | General purpose, large community |
| Gemma 2 9B | 9B | 10 GB | Reasoning, high quality for its size |
| Llama 3.3 70B | 70B | 48 GB | Near GPT-4 quality, but demanding |
In practice, for an SMB looking to get started, here's what we recommend:
Minimal budget (around $2,000): a server or dedicated PC with 32 GB of RAM and an NVIDIA graphics card with 8 GB of VRAM (like an RTX 3060 or 4060). That comfortably runs 7-8B models, which are sufficient for most office tasks.
Comfortable budget ($5,000-8,000): a server with 64 GB of RAM and an NVIDIA GPU with 24 GB of VRAM (RTX 4090 or A5000). At that level, you can run larger models, serve multiple users simultaneously, and response quality improves noticeably.
Without a GPU: it's possible, but slower. Ollama can run models on CPU alone. For a small model (3-7B) used by a few people, it's still functional. Responses take a few seconds instead of being instant.
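The figures in the table above follow from a rough rule of thumb: Ollama serves most models quantized to about 4 bits per parameter, so memory needs scale with parameter count. Here's a hedged Python sketch of that arithmetic; the 20% overhead factor for the KV cache and runtime buffers is our own approximation, not an official number.

```python
def estimate_memory_gb(params_billion: float, bits_per_param: int = 4,
                       overhead_factor: float = 1.2) -> float:
    """Rough memory estimate for a quantized model.

    bits_per_param=4 reflects the ~4-bit quantization Ollama applies to most
    library models by default; overhead_factor adds headroom for the KV cache
    and runtime buffers. A rule of thumb, not a guarantee.
    """
    bytes_needed = params_billion * 1e9 * bits_per_param / 8
    return bytes_needed * overhead_factor / 1e9

# A 7B model at 4 bits lands around 4-5 GB, fitting an 8 GB card with room
# to spare; a 70B model needs tens of GB, hence the 48 GB row in the table.
```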
What it actually changes in an SMB
We're not going to sell you AI as the solution to all your problems. But there are concrete use cases where a local LLM saves time without putting your data at risk:
Summarizing long documents. A 40-page report, meeting minutes, a request for proposals: feed it to the model, and you get a structured summary in 30 seconds. It doesn't replace reading, but it gives you a starting point.
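When a document exceeds the model's context window, a common workaround is to split it into overlapping chunks, summarize each chunk, then summarize the summaries. A minimal chunker sketch, using character counts as a crude stand-in for tokens:

```python
def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
    """Split a long document into overlapping chunks that fit the model's
    context window. The overlap keeps sentences that straddle a boundary
    from being lost between two chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks
```

Each chunk then goes to the model with a "summarize this section" prompt, and the per-section summaries are combined in a final pass.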
Drafting text. Emails, letters, job descriptions, template responses: the model produces a first draft that you refine. That saves 10-15 minutes on repetitive tasks.
Searching your internal documents. With Open WebUI's RAG (Retrieval-Augmented Generation), your employees can ask questions in plain language about your procedures, policies, or technical documentation. The model searches for relevant information in your files and formulates an answer.
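Under the hood, RAG retrieves the most relevant passages and injects them into the prompt so the model answers from your documents rather than from memory. Here's a toy Python illustration of that pattern; the naive keyword scoring stands in for the embedding search a real system like Open WebUI uses.

```python
def score(doc: str, query: str) -> int:
    """Naive relevance score: count query words that appear in the document."""
    words = {w.lower() for w in query.split()}
    text = doc.lower()
    return sum(1 for w in words if w in text)

def build_rag_prompt(docs: dict[str, str], question: str, top_k: int = 2) -> str:
    """Rank documents by relevance, keep the top_k, and pack them into a
    prompt that instructs the model to answer only from that context."""
    ranked = sorted(docs.items(), key=lambda kv: score(kv[1], question),
                    reverse=True)
    context = "\n\n".join(f"[{name}]\n{text}" for name, text in ranked[:top_k])
    return (f"Answer using only the context below.\n\n{context}\n\n"
            f"Question: {question}")
```

A production setup replaces `score` with vector similarity over embeddings, but the prompt-assembly step works the same way.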
Translating. Recent models are surprisingly good at translation, especially between French and English. For a Quebec SMB juggling both languages daily, it's handy.
Analyzing text data. Classifying customer feedback, extracting structured information from a batch of emails, identifying trends in meeting notes.
Integration with your existing tools
A local LLM doesn't live in a vacuum. Since Ollama exposes a standard API, it can integrate with other tools in your infrastructure.
Nextcloud: if you already use Nextcloud for your files and calendars, you should know that Nextcloud's AI Assistant can connect to a local Ollama server. Document summaries, text generation, translation: all of that directly from your document management system, without your files ever leaving your server.
Scripts and automation: Ollama's API lets you automate repetitive tasks. For example, a script that reads new support emails, classifies them by urgency, and drafts a reply. Or one that automatically summarizes meeting minutes dropped into a shared folder.
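Here's a sketch of the classification step of such a script. The helper names are ours, and the actual call to the Ollama API is omitted; the point is to constrain the model to a fixed set of labels and parse its free-form reply defensively.

```python
URGENCY_LEVELS = ("high", "medium", "low")

def urgency_prompt(email_body: str) -> str:
    """Ask for exactly one word so the reply is easy to parse."""
    return ("Classify the urgency of this support email as exactly one word: "
            "high, medium, or low.\n\n" + email_body)

def parse_urgency(model_reply: str) -> str:
    """Extract a clean label from free-form model output. Defaults to
    'medium' so a rambling reply never breaks the pipeline."""
    reply = model_reply.lower()
    for level in URGENCY_LEVELS:
        if level in reply:
            return level
    return "medium"
```

The same pattern (fixed label set, defensive parsing, safe default) applies to any classification task you automate with an LLM.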
Odoo and other ERPs: with a bit of integration, you can connect a local LLM to Odoo to enrich client records, summarize chatter exchanges, or draft proposals from templates.
We've already published a comparison of commercial AI models (GPT, Claude, Gemini). Local AI is the complement: your sensitive data stays with you, and tasks that need a more powerful model go through an external service when needed.
Available models: which one to pick?
There are dozens of models available on Ollama. Here are the ones we recommend to get started:
Llama 3 family (Meta): the most versatile. Llama 3.1 comes in 8B and 70B variants; Llama 3.3 is a refined 70B release. Large community, many specialized variants (code, conversation, instruction). 128K token context, which means it can ingest long documents. It's our first choice for general use.
Mistral 7B and Mixtral: excellent quality/size ratio. Mistral is a French startup, and their models are particularly strong in French. Mixtral uses a "Mixture of Experts" architecture: it has 46.7B total parameters but only activates 12.9B per request. Result: big-model quality at small-model speed. Apache 2.0 license, so zero commercial restrictions.
Phi-3 and Phi-4 (Microsoft): the lightweight champions. Phi-3 Mini runs with 4 GB of RAM and delivers surprising performance for its size. Ideal for math reasoning tasks and heavily resource-constrained environments.
Gemma 2 (Google): very good quality, especially in 9B and 27B. Fast inference speed. A solid pick if you need quality responses without tying up a large GPU.
The limitations to keep in mind
We'd rather be upfront about this than let you discover the limitations after you've invested.
Local models are less powerful than GPT-4 or Claude. A 7B model is impressive for its size, but it doesn't have the reasoning capacity of a frontier model running across thousands of GPUs in a data center. For complex tasks (detailed legal analysis, high-level creative writing, sophisticated code), commercial models remain superior.
Hallucinations still happen. A local LLM will make up facts with the same confidence as a commercial one. RAG helps (it forces the model to rely on your documents), but you should always verify outputs before relying on them for important decisions.
It requires dedicated hardware. You can't run an LLM on Marie's workstation alongside her Excel and Outlook. You need a dedicated server, and someone to maintain it. The models themselves are free; the hardware and the time to keep it running are not.
Performance depends on concurrent users. A server with an 8 GB GPU works great for 2-3 users. If 15 people send requests at the same time, things slow down. You need to size the hardware accordingly.
No real-time training. Your local LLM doesn't "remember" what you tell it between sessions (unless you set up RAG with your documents). It doesn't improve on its own with use. To truly customize it, you need fine-tuning, which is a project in itself.
French language support varies. Larger models (Llama 3.3 70B, Mixtral) handle French well. Smaller models (3-7B) are often less comfortable and may mix languages or make errors. Mistral has an edge here, given its French origins.
Our recommendation for getting started with local AI:
- Identify 2-3 concrete use cases in your organization (summaries, emails, document search)
- Install Ollama + Open WebUI on a test server with a 7-8B model
- Test with a small group for 2-4 weeks
- Measure time saved vs. hardware cost
- Decide whether it's worth rolling out broadly
What we deploy for our clients
At Blue Fox, we help our clients deploy local AI when it makes sense. We install Ollama and Open WebUI on a server hosted in Quebec, choose the model that fits the needs (and the hardware budget), and connect everything to the existing infrastructure: Nextcloud, Odoo, automation scripts.
We don't recommend local AI for everything. If you need the power of GPT-4 or Claude for complex tasks, we'll advise you to use those services, but with awareness of what you're sending. The hybrid approach is often the most sensible: confidential data goes through the local LLM, the rest goes to the cloud when quality demands it.
What we don't do: install a tool and disappear. We train your teams, configure permissions, and make sure adoption actually happens.
Where to start?
If this sounds like it could work for you, no need for a big project. We can start with a 30-minute call to understand your needs, then set up a test environment in a few hours. You try it with your team for a few weeks, and we see if it sticks.
Interested? Let's explore it together.
Sources
- Ollama: official site, documentation, list of supported models
- Open WebUI: open-source web interface for local LLMs
- Llama 3 (Meta): open models, benchmarks, licenses
- Mistral AI: European models, Mixture of Experts architecture
- Nextcloud AI as a Service: Ollama integration with Nextcloud
- LocalLLM.in: hardware requirements guide for Ollama (VRAM, RAM)