Guide to Using DeepSeek Model in MSAEZ (RunPod Cloud GPU Environment)
MSAEZ now supports the use of DeepSeek AI inference models in a private cloud environment.
DeepSeek models are available in various parameter sizes including 7B and 67B, trained on over 2 trillion tokens of data. This data includes code, mathematical problems, and general text, making it applicable across various fields. Notably, most models are open-source under MIT or Apache 2.0 licenses.
By utilizing the Ollama tool to install DeepSeek AI models directly in a local environment, you can reduce costs and dependencies on cloud-based AI services while freely using AI capabilities in an on-premises environment.
In particular, DeepSeek AI enables requirement analysis and Domain-Driven Design (DDD) based cloud-native modeling through human-in-the-loop communication with designers, allowing you to build more sophisticated microservice architectures while maintaining data consistency and design flexibility.
This guide explains how MSAEZ users can run DeepSeek AI models in a RunPod cloud GPU environment and integrate them with MSAEZ. It is intended for developers looking to build AI-based microservices using MSAEZ.
Cloud GPU Service Configuration for DeepSeek Environment
Setting up DeepSeek Model Environment Using RunPod
1. Create and deploy a new Pod through the Pods menu in RunPod.
- The `deepseek-ai/DeepSeek-R1-Distill-Qwen-32B` model currently requires a VM with at least 80GB.
- While both community cloud and secure cloud options are available, we recommend the community cloud due to current instability issues with secure cloud.
- We recommend the `4x RTX 4000 Ada` configuration; if unavailable, choose an instance with similar performance.
2. Click Edit Template to configure the template.
- For template configuration, SGLang-based options like `Qwen 2.5 Coder 32B - SGLang by Relis` are stable.
- `--tensor-parallel-size` activates tensor parallel processing and determines how many GPUs the model is distributed across. This helps overcome single-GPU memory limitations and improves inference speed through parallel processing. Generally, the optimal value should match the number of available GPU instances. For example, when using four RTX 4000 Ada GPU instances, set `--tensor-parallel-size` to 4.
- `--mem-fraction-static` sets what proportion of GPU memory to reserve statically before model execution. While GPU memory can be allocated dynamically, memory shortage errors are likely when long context sizes are passed; pre-reserving memory at the specified ratio prevents this. Generally, start with 0.8-0.9 and adjust based on model size, context length, and the GPU memory situation.
python3 -m sglang.launch_server --model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B --context-length 131072 --host 0.0.0.0 --port 8000 --tensor-parallel-size [Number of GPU instances used] --api-key [API key for LLM requests] --mem-fraction-static 0.9 --disable-cuda-graph
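The flag values in the command above follow mechanically from the rules just described; as an illustration, a small helper (hypothetical, not part of MSAEZ or SGLang) can assemble them from the GPU count:

```javascript
// Hypothetical helper that assembles the sglang.launch_server flags
// following the guidance above: tensor-parallel-size equals the number
// of GPUs, mem-fraction-static starts at 0.9. Names are illustrative.
function buildLaunchFlags(gpuCount, apiKey, memFraction = 0.9) {
  return [
    "--model-path deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
    "--context-length 131072",
    "--host 0.0.0.0 --port 8000",
    `--tensor-parallel-size ${gpuCount}`,
    `--api-key ${apiKey}`,
    `--mem-fraction-static ${memFraction}`,
    "--disable-cuda-graph",
  ].join(" ");
}
```

For example, `buildLaunchFlags(4, "my-key")` reproduces the flags used above for a four-GPU Pod.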
- For Volume Disk, allocate about 90GB of initial capacity to cover model caching and the various configuration files for the current Qwen 2.5 Coder 32B model.
- Set to On-Demand for stable operation without service interruption.
Verifying DeepSeek Model Configuration
Checking Logs
- Check logs through `Log > Container`. The message "The server is fired up and ready to roll" indicates that the system is actually ready for use.
Accessing `Connect > HTTP Service`
- The access URL is the path to the deployed Pod.
https -v POST <Request Pod URL>/v1/chat/completions \
Authorization:"Bearer <API key for LLM requests>" \
model="deepseek-ai/DeepSeek-R1-Distill-Qwen-32B" \
messages:='[{"role": "user", "content": "What is the capital of France?"}]'
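The same request can also be made programmatically. As a sketch (assuming Node 18+ for built-in `fetch`; `askPod` and `buildChatRequest` are hypothetical names, and the Pod URL and API key are placeholders for your deployment's values):

```javascript
// Hypothetical helper: builds the JSON body for the Pod's
// OpenAI-compatible /v1/chat/completions endpoint.
function buildChatRequest(model, userContent) {
  return {
    model,
    messages: [{ role: "user", content: userContent }],
  };
}

// Sends the request with Node's built-in fetch (Node 18+).
async function askPod(podUrl, apiKey, question) {
  const res = await fetch(`${podUrl}/v1/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(
      buildChatRequest("deepseek-ai/DeepSeek-R1-Distill-Qwen-32B", question)
    ),
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```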
Configuration for Using RunPod-based DeepSeek Model in MSAEZ
MSAEZ provides three model configurations to utilize DeepSeek models for various purposes: complexModel, standardModel, and simpleModel.
- `complexModel`: Used for complex tasks requiring high performance, such as policy generation.
- `standardModel`: Used for most general AI functions (e.g., text generation, Q&A). MSAEZ's core AI features are provided through `standardModel`.
- `simpleModel`: Used for relatively simple tasks requiring quick processing, such as JSON object error correction.
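Conceptually, each request is routed to one of the three tiers, with any user-set override taking precedence over the platform default. A minimal sketch (hypothetical; the actual MSAEZ routing logic may differ):

```javascript
// Hypothetical sketch of tier-based model selection. An override map
// (e.g. values read from localStorage) takes precedence over defaults;
// an empty-string override falls back to the default, matching the
// reset-to-empty behavior described in step 3 below.
function resolveModel(tier, overrides, defaults) {
  const key = `${tier}Model`; // "complex" | "standard" | "simple"
  return overrides[key] || defaults[key];
}
```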
1. Run the related proxy server.
- The `server.js` proxy server mediates communication between MSAEZ and RunPod, letting MSAEZ provide its various AI functions through the model configurations described above.
node ./server.js
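While the actual `server.js` implementation may differ, its essential job can be sketched as rebuilding the browser's request for the RunPod endpoint and attaching the API key server-side (all names below are hypothetical):

```javascript
// Hypothetical sketch of what a proxy like server.js does: rebuild the
// browser's request for the RunPod endpoint, attaching the API key on
// the server side so it is never exposed to the client.
function buildForwardedRequest(incomingBody, runpodUrl, apiKey) {
  return {
    url: runpodUrl, // e.g. "<Request Pod URL>/v1/chat/completions"
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(incomingBody),
    },
  };
}
```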
2. Modify localStorage values to use the related models.
localStorage.complexModel = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
localStorage.standardModel = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
localStorage.simpleModel = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
localStorage.runpodUrl = "<Request Pod URL>/v1/chat/completions"
3. After testing, reset to empty values to return to default model usage.
localStorage.complexModel = ""
localStorage.standardModel = ""
localStorage.simpleModel = ""
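The set-and-reset steps above can be wrapped in small console helpers (hypothetical, not part of MSAEZ) so the keys stay consistent:

```javascript
// Hypothetical console helpers for toggling the RunPod overrides.
// Pass window.localStorage in the browser; a plain object also works.
const MODEL_KEYS = ["complexModel", "standardModel", "simpleModel"];

function applyRunpodOverrides(storage, model, runpodUrl) {
  for (const key of MODEL_KEYS) storage[key] = model;
  storage.runpodUrl = runpodUrl;
}

function clearRunpodOverrides(storage) {
  // Empty string restores the default model, as in step 3 above.
  for (const key of MODEL_KEYS) storage[key] = "";
}
```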