Marco Sbragi

Posted on May 29

LLM-Manager: Orchestrating Ollama and Llama.cpp with Pure Bash

#bash #ai #opensource #devops

LLM-Manager is a lightweight, modular Bash suite with a dual JSON/Interactive interface designed to manage local and remote inference engines across Linux and WSL2.

When I started experimenting with Large Language Models (LLMs) to build an On-Premise RAG (Retrieval-Augmented Generation) application, I hit a massive roadblock: environment fragmentation.

Managing multiple inference engines like Ollama and Llama.cpp meant memorizing different command-line flags, environment variables, and configurations. Once my frontend and backend prototypes were ready for testing, I realized I was spending too much time manually starting, stopping, loading, and unloading models.

I looked online for solutions. Most people suggested complex Python scripts, heavy Docker setups, n8n workflows, or complicated web dashboards.

I didn't want the bloat. I wanted something lightweight that executed commands as if I were doing them manually, but with zero cognitive load.

That is why I built LLM-Manager: a modular orchestration suite written entirely in pure Bash.

Why Bash?

Choosing Bash wasn't about being old-school; it was a pragmatic engineering decision:

Zero Overhead: No python virtual environments, no npm install, no runtime dependencies. It’s native and lightning-fast.
OS-Level Access: It can natively probe hardware metrics (CPU load, RAM, Disk, GPU VRAM) and manage OS processes.
Cross-Platform via WSL2: By utilizing minor PowerShell bridges only when necessary (like starting a server on the Windows host side), the exact same Bash scripts run flawlessly across both native Linux and Windows/WSL2 environments.

The Architecture

The system is designed with a strict plug-and-play modular layout. At the center sits a single entry-point orchestrator (engine-run.sh) that validates arguments against whitelists and routes actions to engine-specific scripts.

.
├── engine.conf               # Global configuration constants
├── engine-models.json        # Model registry with per-engine metadata
├── engine-templates.json     # Prompt/Model templates by family
├── engine-run.sh             # Main orchestrator & entry-point
├── engine-common.sh          # Shared utilities (OS detection, JSON formatting)
├── engine-status.sh          # Cross-engine status aggregation
├── engine-system.sh          # Hardware metric probing
├── logs/                     # Centralized logs
├── llama/                    # Llama.cpp backend scripts
└── ollama/                   # Ollama backend scripts

Every engine directory implements a consistent interface (start.sh, stop.sh, status.sh, load.sh, unload.sh, show.sh, remove.sh). If an engine doesn't support a specific action, a simple stub script that exits with 0 keeps the pipeline happy.

The Dual-Output Contract (Human + Machine)

One of the core features of LLM-Manager is how it handles output.

Interactive text (Logs, Help, Errors) is routed to stderr.
Structured JSON data is routed to stdout.

This dual nature makes it perfect for local interactive use, but also means it acts as a local proxy. You can run it over Remote SSH and pipe the clean JSON straight into another monitoring script, custom Web UI, or automation tool.

Example 1: Global Status Check

Running ./engine-run.sh status probes the system metrics and queries active network ports, spitting out a comprehensive payload:

{
  "timestamp": "2026-05-29T06:38:30Z",
  "status": "success",
  "action": "status",
  "engine": "all",
  "data": {
    "system": {
      "os_type": "wsl",
      "memory": { "total_mb": 5927, "available_mb": 4555 },
      "gpu": { "detected": true, "name": "AMD Radeon(TM) Graphics", "vram_total_mb": 512 },
      "cpu": { "cores": 4, "load_1m": 1.11 }
    },
    "engines": {
      "ollama": { "state": "stopped", "port": 1234 },
      "llama": { "state": "stopped", "port": 12345 }
    }
  }
}

Example 2: Interactive Help Interface

If a command fails or is called without parameters, the machine gets the JSON error contract, and the human operator gets a clean, human-readable usage menu:

Error: LLM Manager
Usage: engine-run.sh <action> [engine] [args...]
actions:
    config                           Global config
    models [-h]                      List available models (-h human readable)
    status <engine>                  Show global or engine status
    start <engine> [model] [users]   Start an engine
    stop <engine>                    Stop an engine
...

Dynamic Modelfile Generation

Managing raw .gguf files on Ollama can be a chore since it requires a Modelfile. LLM-Manager abstracts this entirely in the backend via model loading strategies in engine-models.json.

If a model is configured with a gguf loader, the load.sh script dynamically generates the required Modelfile on the fly, injecting correct prompt templates based on the model family, and loading it into Ollama seamlessly. It also supports native strategies to pull directly from the official Ollama registry, or auto to fallback if the local file is missing.

Check out the Code

The project is fully open-source. If you want to see how the WSL2/PowerShell bridges are handled, how the dynamic Modelfiles are generated, or if you want to use it to clean up your own local LLM testing environment, check out the repository:

msbragi / LLM-Manager

Large Language Model management for on-premise installation

LLM-Manager

A lightweight, modular Bash orchestration suite to manage, start, stop, and monitor local and remote LLM inference engines (Ollama, Llama.cpp) with a dual interactive/JSON interface.

Developed primarily to solve the complexity of managing ibrid environments (like Windows hosts from WSL2) and remote deployments via SSH without the overhead of heavy Python or dashboard solutions.

Prerequisites

Before running the orchestrator, ensure your environment has the following tools installed:

Bash (v4.0 or higher recommended)
jq — Crucial for parsing and formatting JSON outputs (sudo apt install jq on Debian/Ubuntu).
curl — Used for engine health checks and API interactions.
PowerShell (Windows/WSL2 host setups only) — Required strictly for launching/stopping engine services on the Windows host side when managed from WSL2.

Key Features

Multi-Engine Support: Native orchestration for ollama and llama.cpp (with vLLM planned).
Cross-Platform & Hybrid Environments: Supports native Linux, WSL2, and Windows hosts (orchestrating Windows processes from WSL2 using…

View on GitHub

I am currently working on completing the vLLM engine integration and refining the startup health-checks into proactive retry loops.

Let me know what you think or if you've built similar lightweight alternatives for your AI workflows!

Top comments (2)

Harjot Singh • May 31

Environment fragmentation across inference engines is the real, unglamorous tax of running local LLMs, and it's exactly the kind of problem a Bash suite is right for, no heavy runtime, just a thin uniform layer over Ollama and llama.cpp so you stop memorizing per-engine flags. The dual JSON/interactive interface is the smart part: interactive for humans poking around, JSON for the moment an agent or script needs to drive it programmatically, which is the difference between a tool you use and infrastructure other things build on. The deeper value is the abstraction boundary, once you have one manager in front of multiple engines, you can swap or route between them without rewriting callers, same reason a gateway in front of cloud providers wins. For on-prem RAG specifically that flexibility matters because the right engine depends on the model and the hardware, and you want to change that decision without touching the app. Keeping it pure Bash also means near-zero dependencies, which is its own reliability win on the kind of box this runs on. That uniform-layer-over-fragmented-backends instinct is core to how I think about model orchestration in Moonshift. Does the manager handle routing/fallback between Ollama and llama.cpp, or is it one-active-engine-at-a-time for now?

Marco Sbragi • Jun 1

Thank you for the analysis. You hit the nail on the head: the goal is to transform the "noise" of engine-specific CLI tools into a predictable, programmatic infrastructure.

Regarding your question on orchestration: currently, llm-manager does not perform autonomous routing or fallback. It is designed for deterministic management driven by configuration. In engine.conf, you define whether multi-engine support is enabled. If disabled, the orchestrator enforces an exclusive state—you must stop one engine before initializing another. This is a deliberate design choice to maintain predictability in production environments where hardware constraints (VRAM allocation) are the priority.

engine-run.sh handles everything, enforcing a unified schema across all backends. Here is an example of the standardized JSON output for a system status check, ready for consumption by any agent or script:

$ ./engine-run.sh status

{
  "timestamp": "2026-06-01T13:20:17Z",
  "status": "success",
  "action": "status",
  "data": {
    "system": {
      "os_type": "wsl",
      "gpu": { "detected": true, "vram_total_mb": 4095 }
    },
    "active_engine": "ollama",
    "engines": {
      "ollama": { "state": "running", "port": 1234 },
      "llama": { "state": "stopped" }
    }
  }
}

And this is the consistent response format when loading a model:

$ ./engine-run.sh load ollama "phi-3-mini"

{
  "timestamp": "2026-06-01T13:22:04Z",
  "status": "success",
  "action": "load",
  "data": {"model": "phi-3-mini", "status": "loaded"}
}

This structure eliminates the need for brittle regex scraping of raw engine output, allowing you to build control logic directly on top of this abstraction layer.