Preface
Large language models have been blowing up lately, so it feels like a duty to learn a solid end-to-end workflow from dataset --> train --> deploy, if only for bragging rights. Without further ado, let's get into it.
Training Framework
There are plenty of usable frameworks out there, such as BELLE and ChatGLM; today I'd like to recommend LLaMA-Factory.
Environment Setup
git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
conda create -n llama_factory python=3.10
conda activate llama_factory
pip install -e .[torch,metrics]
Yes, you read that right: just those few steps and the environment is ready.
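If the install went through, the llamafactory-cli entry point should now be on your PATH; at the time of writing, llamafactory-cli version (or just llamafactory-cli to list its sub-commands) works as a quick sanity check, but command names can change between releases, so defer to the README if it complains.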
Dataset
The dataset uses a generic format, like the following:
[{"instruction": "你好","input": "","output": "您好,我是XX大模型,一个由XXX开发的 AI 助手,很高兴认识您。请问我能为您做些什么?"},{"instruction": "你好","input": "","output": "您好,我是XX大模型,一个由XXX打造的人工智能助手,请问有什么可以帮助您的吗?"}
]
For a text-to-text task, put your prompt in instruction and the source and target text in input and output respectively.
Note: many other data formats are supported; see the project's README for details.
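To make the text-to-text mapping concrete, here is a minimal sketch; the translation task and the two records are made up for illustration, and the file name matches the mydataset.json registered in the next step (place the file under the project's data directory):

import json

# Two hypothetical text-to-text (translation) records in the alpaca-style format shown above.
records = [
    {
        "instruction": "Translate the following sentence into English.",
        "input": "你好,世界",
        "output": "Hello, world",
    },
    {
        "instruction": "Translate the following sentence into English.",
        "input": "今天天气不错",
        "output": "The weather is nice today",
    },
]

# Write the JSON file that will be registered in data/dataset_info.json.
with open("mydataset.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)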
Once your custom dataset is ready, put it under the project's data directory, then add an entry describing it in data/dataset_info.json, for example mydataset below:
{"starcoder_python": {"hf_hub_url": "bigcode/starcoderdata","ms_hub_url": "AI-ModelScope/starcoderdata","columns": {"prompt": "content"},"folder": "python"},"mydataset": {"file_name": "mydataset.json","file_sha1": "535e1a88e1d480f80eca38d50216ea3a5dbedfa9"}
}
Here mydataset is the custom name of the dataset, file_name is the file name of the dataset JSON under the data folder, and file_sha1 is its SHA-1 hash, which can be computed with the following code:
import hashlib

def calculate_sha1(file_path):
    sha1 = hashlib.sha1()
    try:
        with open(file_path, 'rb') as file:
            while True:
                data = file.read(8192)  # Read in chunks to handle large files
                if not data:
                    break
                sha1.update(data)
        return sha1.hexdigest()
    except FileNotFoundError:
        return "File not found."

# Usage example
file_path = 'mydataset.json'  # Replace with your file path
sha1_hash = calculate_sha1(file_path)
print("SHA-1 Hash:", sha1_hash)
Train & Inference & Export
For single-GPU vs. multi-GPU and LoRA vs. full-parameter fine-tuning, see the original project's documentation; it is concise and easy to follow.
The command I used is:
bash examples/full_multi_gpu/single_node.sh
Contents of single_node.sh:
#!/bin/bash
NPROC_PER_NODE=4 # Number of GPUs; must match the number of CUDA_VISIBLE_DEVICES below, otherwise it will error out
NNODES=1
RANK=0
MASTER_ADDR=127.0.0.1
MASTER_PORT=29500

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun \
  --nproc_per_node $NPROC_PER_NODE \
  --nnodes $NNODES \
  --node_rank $RANK \
  --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT \
  src/train.py examples/full_multi_gpu/llama3_full_sft.yaml
Then just edit the sft yaml config file it points to (llama3_full_sft.yaml above).
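For reference, the fields you will usually touch in that yaml are model_name_or_path (the base model), dataset (the name registered in data/dataset_info.json, e.g. mydataset), and output_dir; field names can drift between LLaMA-Factory versions, so treat the example configs shipped under examples/ as the source of truth rather than this list.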
Training, inference, and model export each take just one command; you only need to adjust the parameters in the corresponding yaml config file.
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/lora_single_gpu/llama3_lora_sft.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
CUDA_VISIBLE_DEVICES=0 llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
Ollama Deployment
Of course, the llamafactory-cli chat command above can also be used for inference, but it is not very convenient, which is where Ollama comes in.
Export the llama model
Ollama cannot directly use the model trained with LLaMA-Factory in the previous step; the format has to be converted first.
- If you fine-tuned with LoRA, first run llamafactory-cli export to merge the adapter and export a single model.
- Full-parameter fine-tuning already yields a single model, so nothing extra is needed.
The model conversion steps are as follows:
git clone https://github.com/ollama/ollama.git
cd ollama
# and then fetch its llama.cpp submodule:
git submodule init
git submodule update llm/llama.cpp
# conda
conda create -n ollama python=3.11
conda activate ollama
pip install -r llm/llama.cpp/requirements.txt
# Then build the quantize tool:
make -C llm/llama.cpp quantize
# convert model
python llm/llama.cpp/convert-hf-to-gguf.py path-to-your-trained-model --outtype f16 --outfile converted.bin
If anything is still unclear, see the Importing section of the original Ollama documentation.
Quantization, creating the Ollama model, and Docker deployment
Quantize the model (optional):
llm/llama.cpp/quantize converted.bin quantized.bin q4_0
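Here q4_0 is a 4-bit quantization type: the file shrinks considerably and inference speeds up, at a small quality cost compared with the f16 file. Running the quantize binary without arguments should print the other supported types, though the exact list depends on the llama.cpp revision pinned by the submodule.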
Create a Modelfile for your model:
FROM quantized.bin
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
Use Docker to convert it into a format Ollama can use:
Move quantized.bin and the Modelfile into an ollama_file folder (create one if it doesn't exist):
docker run -itd --gpus=1 -v ollama_file:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
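Note that -v ollama_file:/root/.ollama as written creates a Docker named volume rather than mounting the local folder; if you want the container to see the ollama_file directory you just prepared, you may need an absolute path instead, for example -v $(pwd)/ollama_file:/root/.ollama.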
This will pull the image automatically. Once the container is up, enter it and create the model:
# enter the container
docker exec -it ollama /bin/bash
cd /root/.ollama
# create the model in Ollama's format
ollama create name-to-your-ollama-model -f Modelfile
# verify the result
ollama list
Test it directly inside the container:
ollama run name-to-your-ollama-model
To stop the test session:
/bye
To leave the container:
exit
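Exiting the shell does not stop the container: since it was started with -d, Ollama keeps running in the background and stays reachable on port 11434, which is exactly what the next section relies on.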
For installing Docker, see my other blog post on must-have software and configuration for a fresh Ubuntu install.
Calling the service from a server or over the LAN
You can call it either with curl or with a POST request.
Via curl:
curl http://localhost:11434/api/generate -d '{
  "model": "name-to-your-ollama-model",
  "prompt": "hello",
  "stream": false
}'
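With "stream": false the server returns a single JSON object and the generated text sits in its response field; set it to true to receive a stream of partial chunks instead, which is what the Python example below handles.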
Via a POST request:
import json
import requests

# Generate a response for a given prompt with a provided model. This is a streaming
# endpoint, so the reply arrives as a series of JSON chunks; the final chunk includes
# statistics and additional data from the request.
def generate(model_name, prompt, system=None):
    try:
        url = "http://localhost:11434/api/generate"
        payload = {"model": model_name, "prompt": prompt, "system": system}
        # Remove keys with None values
        payload = {k: v for k, v in payload.items() if v is not None}
        with requests.post(url, json=payload, stream=True) as response:
            response.raise_for_status()
            # Holds the context history of the final chunk
            final_context = None
            # Holds the concatenated response strings
            full_response = ""
            # Iterate over the response line by line and collect the details
            for line in response.iter_lines():
                if line:
                    # Parse each line (JSON chunk) and extract the details
                    chunk = json.loads(line)
                    if not chunk.get("done"):
                        response_piece = chunk.get("response", "")
                        full_response += response_piece
                        # print(response_piece, end="", flush=True)
                    # Check if it's the last chunk (done is true)
                    if chunk.get("done"):
                        final_context = chunk.get("context")
            # Return the full response and the final context
            return full_response, final_context
    except requests.exceptions.RequestException as e:
        print(f"An error occurred: {e}")
        return None, None

if __name__ == '__main__':
    model = 'name-to-your-ollama-model'
    SYS_PROMPT = 'hello'
    USER_PROMPT = "What is your favorite movie?"
    response1, _ = generate(model_name=model, system=SYS_PROMPT, prompt=USER_PROMPT)
    print(response1)
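If you don't need streaming, a simpler sketch is to send "stream": False and read the single JSON reply; this assumes the same endpoint and the placeholder model name used above:

import requests

# Non-streaming variant: one request, one JSON object whose "response" field
# holds the full generated text. "name-to-your-ollama-model" is the same
# placeholder model name used elsewhere in this post.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "name-to-your-ollama-model", "prompt": "hello", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])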
LangChain Deployment
The Ollama deployment above is already fairly usable, but it still lacks some configuration, and this setup is also worth having.
Next up is deployment with LangChain; if you're wondering what LangChain is, see here.
To be updated.