TensorPanel is a multi-tenant SaaS platform for teams that want to run open-source AI models on their own GPU infrastructure rather than paying per-token to a cloud API provider. The platform connects to GPU servers — Hetzner dedicated servers, AWS instances, RunPod deployments, or bare-metal hardware — and provides a management layer for model deployment, fine-tuning, API access, and team permissions.
The architecture is split across three components: a Laravel control plane that handles the SaaS logic, a Go agent that runs on each GPU server, and a Flutter mobile application that gives users a private ChatGPT-style interface to their own models.
TensorAgent: The Go Binary on the GPU Server
The design decision that defines TensorPanel's architecture is where the GPU server management logic runs. It would have been simpler to SSH into servers from the Laravel backend — send commands, parse output, manage state remotely. Instead, TensorPanel installs a lightweight Go binary (TensorAgent) on each GPU server that acts as a local gateway.
TensorAgent runs an HTTPS API on port 8080. All communication between the Laravel control plane and the GPU server goes through this API. The agent handles model deployment (downloading from HuggingFace, spawning inference containers), fine-tuning job execution (running Docker containers with unsloth or axolotl), hardware monitoring (parsing nvidia-smi output for GPU metrics, gopsutil for CPU/RAM/disk), and rate limiting enforcement.
Using Go for the agent was deliberate. Python agents are common in ML infrastructure tooling, but Python's startup time and memory footprint make it a poor choice for a lightweight daemon that needs to be always-on and responsive. A compiled Go binary starts in milliseconds, uses ~10MB of RAM at idle, and handles concurrent HTTP requests efficiently without the GIL considerations that would affect a Python service doing the same work.
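A minimal sketch of what the monitoring side of that API could look like in Go. The /v1/metrics route and the response shape are assumptions (the real paths aren't documented here); the gopsutil calls and nvidia-smi query flags are the standard ones, but everything else is illustrative.

```go
// Sketch of a TensorAgent-style hardware-monitoring endpoint.
// Route and response shape are assumptions; the approach (gopsutil for
// CPU/RAM, nvidia-smi for GPU metrics) follows the description above.
package main

import (
	"encoding/json"
	"net/http"
	"os/exec"
	"strings"

	"github.com/shirou/gopsutil/v3/cpu"
	"github.com/shirou/gopsutil/v3/mem"
)

type metrics struct {
	CPUPercent float64  `json:"cpu_percent"`
	RAMPercent float64  `json:"ram_percent"`
	GPUs       []string `json:"gpus"` // raw nvidia-smi CSV rows: util %, mem used MiB, mem total MiB
}

func metricsHandler(w http.ResponseWriter, r *http.Request) {
	var m metrics

	if pct, err := cpu.Percent(0, false); err == nil && len(pct) > 0 {
		m.CPUPercent = pct[0] // aggregate CPU utilization since the last sample
	}
	if vm, err := mem.VirtualMemory(); err == nil {
		m.RAMPercent = vm.UsedPercent
	}
	// One CSV row per GPU: utilization %, memory used (MiB), memory total (MiB).
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=utilization.gpu,memory.used,memory.total",
		"--format=csv,noheader,nounits").Output()
	if err == nil {
		for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
			if line != "" {
				m.GPUs = append(m.GPUs, line)
			}
		}
	}

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(m)
}

func main() {
	http.HandleFunc("/v1/metrics", metricsHandler)
	// The real agent serves HTTPS on 8080; plain HTTP keeps the sketch short.
	http.ListenAndServe(":8080", nil)
}
```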
One-Click Model Deployment
The model marketplace in TensorPanel lists curated open-source models: Llama 3, Mistral, DeepSeek, Qwen, and others. Each model entry includes its VRAM requirements. When a user selects a model to deploy on a specific server, TensorPanel checks the server's available VRAM (total VRAM minus VRAM currently in use by running models) before allowing the deployment.
Multiple models can run simultaneously on a single GPU server if VRAM permits. TensorPanel tracks running models per server and their VRAM consumption, automatically assigns available ports to new deployments, and updates the available-VRAM calculation in real time. A server with 80GB VRAM might run several smaller models simultaneously rather than a single large one.
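A simplified sketch of that admission check, with hypothetical type and field names standing in for TensorPanel's actual schema:

```go
// Sketch of the pre-deployment VRAM check and port assignment described above.
// Types, fields, and numbers are illustrative only.
package main

import (
	"errors"
	"fmt"
)

type RunningModel struct {
	Name   string
	VRAMGB int
	Port   int
}

type Server struct {
	TotalVRAMGB int
	Running     []RunningModel
	nextPort    int
}

// AvailableVRAM is total VRAM minus what running models already claim.
func (s *Server) AvailableVRAM() int {
	used := 0
	for _, m := range s.Running {
		used += m.VRAMGB
	}
	return s.TotalVRAMGB - used
}

// Deploy admits a model only if its VRAM requirement fits, then assigns the next free port.
func (s *Server) Deploy(name string, requiredGB int) (RunningModel, error) {
	if requiredGB > s.AvailableVRAM() {
		return RunningModel{}, errors.New("insufficient VRAM on server")
	}
	m := RunningModel{Name: name, VRAMGB: requiredGB, Port: s.nextPort}
	s.nextPort++
	s.Running = append(s.Running, m)
	return m, nil
}

func main() {
	srv := &Server{TotalVRAMGB: 80, nextPort: 8001}
	a, _ := srv.Deploy("mistral-7b", 16)     // fits: 80 GB free
	b, _ := srv.Deploy("qwen-14b", 30)       // fits: 64 GB free
	_, err := srv.Deploy("llama-3-70b", 140) // rejected: only 34 GB free
	fmt.Println(a.Port, b.Port, err)
}
```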
The deployment itself is triggered through TensorAgent: the agent downloads the model from HuggingFace using an encrypted HuggingFace token stored in the tenant settings, then spawns a Docker container running vLLM (for production deployments) or Ollama (for prototyping). The container exposes an inference endpoint that TensorPanel's API proxy can route to.
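As a rough illustration of the container-spawning step, the sketch below shells out to Docker to start vLLM's OpenAI-compatible server image. The image name and flags follow vLLM's published Docker usage; the helper name and the port/token plumbing are assumptions.

```go
// Sketch of how an agent might spawn a vLLM inference container.
// Helper name, port handling, and token handling are illustrative.
package main

import (
	"fmt"
	"os/exec"
)

// startVLLM launches vLLM's OpenAI-compatible server for the given model,
// mapping the chosen host port and passing the tenant's HuggingFace token.
func startVLLM(modelID, hfToken string, hostPort int) error {
	cmd := exec.Command("docker", "run", "-d",
		"--gpus", "all",
		"-p", fmt.Sprintf("%d:8000", hostPort),
		"-e", "HUGGING_FACE_HUB_TOKEN="+hfToken,
		"-v", "/root/.cache/huggingface:/root/.cache/huggingface", // reuse downloaded weights
		"vllm/vllm-openai:latest",
		"--model", modelID,
	)
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("docker run failed: %v: %s", err, out)
	}
	fmt.Printf("started vLLM container for %s on host port %d\n", modelID, hostPort)
	return nil
}

func main() {
	_ = startVLLM("meta-llama/Meta-Llama-3-8B-Instruct", "hf_xxx", 8001)
}
```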
OpenAI-Compatible API Proxy
TensorPanel exposes an API at /api/v1/chat/completions that is compatible with the OpenAI API specification. A team already using the OpenAI Python SDK or any tool that targets the OpenAI API can switch to TensorPanel by changing the base URL and API key — no other code changes required.
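For example, a plain Go client can target the proxy directly. The base URL below is a placeholder, and the request body is the standard OpenAI chat-completions shape:

```go
// Sketch of calling the OpenAI-compatible endpoint from Go.
// Base URL, API key, and model name are placeholders.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	body, _ := json.Marshal(map[string]any{
		"model": "llama-3-8b-instruct", // a model deployed through TensorPanel
		"messages": []map[string]string{
			{"role": "user", "content": "Summarize our deployment options."},
		},
	})

	req, _ := http.NewRequest("POST",
		"https://tensorpanel.example.com/api/v1/chat/completions",
		bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer YOUR_TENSORPANEL_API_KEY")
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	reply, _ := io.ReadAll(resp.Body)
	fmt.Println(string(reply)) // same response shape as the OpenAI API
}
```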
The proxy layer handles routing, token usage tracking, and quota enforcement. Each API key is associated with a role, and roles have configurable RPM (requests per minute) and monthly token quotas. Rate limits are enforced at the control plane and also synchronized to TensorAgent for local enforcement on the GPU server itself — a double check that prevents quota evasion by calling the agent directly.
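A stripped-down illustration of the per-key RPM check, using a fixed one-minute window for brevity. This is not TensorPanel's actual quota logic, and monthly token quotas are omitted:

```go
// Simplified per-API-key RPM enforcement as it might run locally in the agent.
// Fixed-window counting keeps the sketch short; the real logic is not shown here.
package main

import (
	"fmt"
	"sync"
	"time"
)

type rpmLimiter struct {
	mu     sync.Mutex
	limit  int       // requests allowed per minute for this key's role
	count  int       // requests seen in the current window
	window time.Time // start of the current one-minute window
}

func (l *rpmLimiter) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if now.Sub(l.window) >= time.Minute {
		l.window, l.count = now, 0 // roll over to a new window
	}
	if l.count >= l.limit {
		return false // over quota: reject before the request reaches the inference engine
	}
	l.count++
	return true
}

func main() {
	lim := &rpmLimiter{limit: 3, window: time.Now()}
	for i := 0; i < 5; i++ {
		fmt.Println("request", i, "allowed:", lim.Allow())
	}
}
```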
Fine-Tuning as a First-Class Feature
Fine-tuning is not an add-on in TensorPanel — it's built into the core interface. Users upload a training dataset in JSON format, configure hyperparameters (LoRA vs QLoRA vs full fine-tuning, learning rate, batch size, epoch count), and submit the job. TensorAgent executes the fine-tuning run in a Docker container, streaming loss and epoch metrics back to the control plane in real time.
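A rough sketch of what a job submission might carry, mirroring the hyperparameters above. The field names and JSON layout are assumptions rather than TensorPanel's actual API contract:

```go
// Illustrative shape of a fine-tuning job submission.
// Field names and JSON layout are assumptions.
package main

import (
	"encoding/json"
	"fmt"
)

type FineTuneJob struct {
	BaseModel    string  `json:"base_model"`
	DatasetPath  string  `json:"dataset_path"` // uploaded JSON training set
	Method       string  `json:"method"`       // "lora", "qlora", or "full"
	LearningRate float64 `json:"learning_rate"`
	BatchSize    int     `json:"batch_size"`
	Epochs       int     `json:"epochs"`
}

func main() {
	job := FineTuneJob{
		BaseModel:    "mistral-7b",
		DatasetPath:  "datasets/support-tickets.json",
		Method:       "qlora",
		LearningRate: 2e-4,
		BatchSize:    4,
		Epochs:       3,
	}
	payload, _ := json.MarshalIndent(job, "", "  ")
	fmt.Println(string(payload)) // the kind of body the agent would receive for the job
}
```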
The fine-tuning interface shows a live loss curve as the job runs, not just a "job running" indicator. When the job completes, the resulting model adapter is available for deployment alongside the base model. This workflow — from dataset upload to running inference on a fine-tuned model — is entirely self-contained within TensorPanel without requiring any ML engineering expertise from the user.
Global Guardrails
TensorPanel includes a content guardrail system: blocklists and allowlists applied to system prompts and completion requests. These are configured at the tenant level and enforced by TensorAgent at request time — the enforcement happens locally on the GPU server before requests reach the inference engine, not just at the control plane level.
This local enforcement matters for compliance use cases. If a team needs to guarantee that certain content never enters or exits their AI models, having the enforcement happen at the GPU server (rather than at a remote control plane that could theoretically be bypassed) provides a stronger guarantee.
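The sketch below illustrates the general idea of a request-time blocklist/allowlist check running locally on the server before a prompt reaches the inference engine. The matching here is plain substring matching and the types are hypothetical:

```go
// Minimal sketch of request-time guardrail enforcement on the GPU server.
// Types and matching strategy are illustrative only.
package main

import (
	"fmt"
	"strings"
)

type Guardrails struct {
	Blocklist []string // terms that must never appear in prompts or completions
	Allowlist []string // if non-empty, at least one term must appear
}

func (g Guardrails) Check(text string) error {
	lower := strings.ToLower(text)
	for _, term := range g.Blocklist {
		if strings.Contains(lower, strings.ToLower(term)) {
			return fmt.Errorf("blocked term %q found", term)
		}
	}
	if len(g.Allowlist) > 0 {
		for _, term := range g.Allowlist {
			if strings.Contains(lower, strings.ToLower(term)) {
				return nil
			}
		}
		return fmt.Errorf("no allowlisted term present")
	}
	return nil
}

func main() {
	g := Guardrails{Blocklist: []string{"internal-codename"}}
	fmt.Println(g.Check("Summarize the internal-codename roadmap")) // rejected locally
	fmt.Println(g.Check("Summarize the public roadmap"))            // passes
}
```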
TensorScripts: One-Command Server Bootstrap
Connecting a new GPU server to TensorPanel takes one command: curl -sL https://tensorpanel.talivio.com/agent/install.sh?token=YOUR_TOKEN | sudo bash. The TensorScripts handle NVIDIA driver installation, CUDA toolkit setup, Docker and NVIDIA Container Toolkit installation, and TensorAgent deployment and registration with the control plane. A fresh GPU server goes from bare OS to ready-to-deploy in a single terminal session.