
Quick Hands-On: Fine-Tuning LLaMA3 (SCNet Supercomputing Platform, Domestic Heterogeneous Accelerator DCU)


Preface

Using LLaMA-Factory as an example, this article performs LoRA fine-tuning, inference, and merging of the Llama3-8B-Instruct model on the SCNet Supercomputing Internet Platform, using a heterogeneous AI accelerator card (DCU, 64 GB VRAM, PCIe).

Supercomputing Internet Platform (SCNet)
Heterogeneous AI accelerator card, 64 GB VRAM, PCIe

I. References

GitHub repository: LLaMA-Factory
Code branch used (latest at the time of writing): v0.8.3

II. Important Notes

  1. If you run into package conflicts, you can resolve them with pip install --no-deps -e . .

  2. Test whether PyTorch supports the DCU:

    (llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# python
    Python 3.10.8 (main, Nov  4 2022, 13:48:29) [GCC 11.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import torch
    >>> torch.cuda.is_available()
    True
    
  3. pip packages

    Dependencies missing from the environment can be found in the Hygon developer community (光合社区), or directly under the platform's preinstalled wheel directory /public/software/apps/DeepLearning/whl/dtk-24.04; copy the wheel into your project directory and install it with pip (see the sketch after this list).

  4. Install only the specified package, without its dependencies, to avoid package conflicts.

    # for example
    pip install --no-dependencies modelscope
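
    A hedged sketch of note 3 above: copy a platform-provided wheel into the project directory and install it without pulling dependencies. The wheel filename below is a placeholder, not an actual file name; list the directory first and substitute the file that exists on your platform.

    # list the wheels shipped with the dtk-24.04 image
    ls /public/software/apps/DeepLearning/whl/dtk-24.04/
    # copy one wheel into the project and install it without dependencies
    # (<some-package>.whl is hypothetical -- use a filename shown by `ls`)
    cp /public/software/apps/DeepLearning/whl/dtk-24.04/<some-package>.whl .
    pip install --no-deps ./<some-package>.whl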
    

III. Environment Setup

1. System image

The heterogeneous AI accelerator card is a domestic accelerator (DCU) built on the DTK software stack (the counterpart of NVIDIA's CUDA), so choose an image built for dtk24.04.

This article uses the jupyterlab-pytorch:2.1.0-ubuntu20.04-dtk24.04-py310 image as an example.

2. Software and hardware dependencies

Mandatory        Minimum    Recommended
python           3.8        3.11
torch            1.13.1     2.3.0
transformers     4.41.2     4.41.2
datasets         2.16.0     2.19.2
accelerate       0.30.1     0.30.1
peft             0.11.1     0.11.1
trl              0.8.6      0.9.4

Optional         Minimum    Recommended
CUDA             11.6       12.2
deepspeed        0.10.0     0.14.0
bitsandbytes     0.39.0     0.43.1
vllm             0.4.3      0.4.3
flash-attn       2.3.0      2.5.9
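
To check that your environment matches the table above, one quick option (following the `pip list | grep` pattern used later in this article; package names may differ slightly between builds) is:

pip list | grep -iE "torch|transformers|datasets|accelerate|peft|trl|deepspeed|bitsandbytes|vllm|flash"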

3. Clone the base virtual environment

root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# conda create -n llama_factory_torch --clone base
Source:      /opt/conda
Destination: /opt/conda/envs/llama_factory_torch
The following packages cannot be cloned out of the root environment:
 - https://repo.anaconda.com/pkgs/main/linux-64::conda-23.7.4-py310h06a4308_0
Packages: 44
Files: 53489
Downloading and Extracting Packages
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate llama_factory_torch
#
# To deactivate an active environment, use
#
#     $ conda deactivate

root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# source activate llama_factory_torch

4. Install LLaMA Factory

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# pip install -e ".[torch,metrics]"
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Obtaining file:///public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
...
Checking if build backend supports build_editable ... done
Building wheels for collected packages: llamafactory
  Building editable for llamafactory (pyproject.toml) ... done
  Created wheel for llamafactory: filename=llamafactory-0.8.4.dev0-0.editable-py3-none-any.whl size=20781 sha256=70c0480e2b648516e0eac3d39371d4100cbdaa1f277d87b657bf2adec9e0b2be
  Stored in directory: /tmp/pip-ephem-wheel-cache-uhypmj_8/wheels/e9/b4/89/f13e921e37904ee0c839434aad2d7b2951c2c68e596667c7ef
Successfully built llamafactory
DEPRECATION: lmdeploy 0.1.0-git782048c.abi0.dtk2404.torch2.1. has a non-standard version number. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of lmdeploy or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063
DEPRECATION: mmcv 2.0.1-gitc0ccf15.abi0.dtk2404.torch2.1. has a non-standard version number. pip 24.1 will enforce this behaviour change. A possible replacement is to upgrade to a newer version of mmcv or contact the author to suggest that they release a version with a conforming version number. Discussion can be found at https://github.com/pypa/pip/issues/12063
Installing collected packages: pydub, jieba, urllib3, tomlkit, shtab, semantic-version, scipy, ruff, rouge-chinese, joblib, importlib-resources, ffmpy, docstring-parser, aiofiles, nltk, tyro, sse-starlette, tokenizers, gradio-client, transformers, trl, peft, gradio, llamafactory
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.26.13
    Uninstalling urllib3-1.26.13:
      Successfully uninstalled urllib3-1.26.13
  Attempting uninstall: tokenizers
    Found existing installation: tokenizers 0.15.0
    Uninstalling tokenizers-0.15.0:
      Successfully uninstalled tokenizers-0.15.0
  Attempting uninstall: transformers
    Found existing installation: transformers 4.38.0
    Uninstalling transformers-4.38.0:
      Successfully uninstalled transformers-4.38.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lmdeploy 0.1.0-git782048c.abi0.dtk2404.torch2.1. requires transformers==4.33.2, but you have transformers 4.43.3 which is incompatible.
Successfully installed aiofiles-23.2.1 docstring-parser-0.16 ffmpy-0.4.0 gradio-4.40.0 gradio-client-1.2.0 importlib-resources-6.4.0 jieba-0.42.1 joblib-1.4.2 llamafactory-0.8.4.dev0 nltk-3.8.1 peft-0.12.0 pydub-0.25.1 rouge-chinese-1.0.3 ruff-0.5.5 scipy-1.14.0 semantic-version-2.10.0 shtab-1.7.1 sse-starlette-2.1.3 tokenizers-0.19.1 tomlkit-0.12.0 transformers-4.43.3 trl-0.9.6 tyro-0.8.5 urllib3-2.2.2
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: pip install --upgrade pip
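
After installation, a quick sanity check of the entry point (the `version` subcommand is listed in the CLI help shown later) is:

llamafactory-cli version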

5. Resolve dependency conflicts

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# pip install --no-deps -e .
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Obtaining file:///public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Building wheels for collected packages: llamafactory
  Building editable for llamafactory (pyproject.toml) ... done
  Created wheel for llamafactory: filename=llamafactory-0.8.4.dev0-0.editable-py3-none-any.whl size=20781 sha256=f874a791bc9fdca02075cda0459104b48a57d300a077eca00eee7221cde429c3
  Stored in directory: /tmp/pip-ephem-wheel-cache-7vjiq3f3/wheels/e9/b4/89/f13e921e37904ee0c839434aad2d7b2951c2c68e596667c7ef
Successfully built llamafactory
Installing collected packages: llamafactory
  Attempting uninstall: llamafactory
    Found existing installation: llamafactory 0.8.4.dev0
    Uninstalling llamafactory-0.8.4.dev0:
      Successfully uninstalled llamafactory-0.8.4.dev0
Successfully installed llamafactory-0.8.4.dev0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: pip install --upgrade pip

6. Install vllm==0.4.3

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# pip list | grep llvm
[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: pip install --upgrade pip
(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# pip install --no-dependencies vllm==0.4.3
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting vllm==0.4.3
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/1a/1e/10bcb6566f4fa8b95ff85bddfd1675ff7db33ba861f59bd70aa3b92a46b7/vllm-0.4.3-cp310-cp310-manylinux1_x86_64.whl (131.1 MB)
Installing collected packages: vllm
  Attempting uninstall: vllm
    Found existing installation: vllm 0.3.3+git3380931.abi0.dtk2404.torch2.1
    Uninstalling vllm-0.3.3+git3380931.abi0.dtk2404.torch2.1:
      Successfully uninstalled vllm-0.3.3+git3380931.abi0.dtk2404.torch2.1
Successfully installed vllm-0.4.3
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: pip install --upgrade pip

IV. Key Steps

1. Obtain an Access Token

Obtain an Access Token from Hugging Face; it is used to log in to your Hugging Face account.

Note: grant the token Write permission.


2. Log in to your Hugging Face account

It is recommended to log in to your Hugging Face account with the following commands.

pip install --upgrade huggingface_hub
huggingface-cli login
(llama_fct) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# huggingface-cli login

    (Hugging Face ASCII banner)

To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible):
Add token as git credential? (Y/n) Y
Token is valid (permission: write).
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.
        git config --global credential.helper store
Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful
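
Note that the logs above and in the FAQ show Hugging Face traffic going through the hf-mirror.com mirror. If you need to point `huggingface_hub` at that mirror explicitly in a new shell, setting the endpoint variable is a common, optional way to do it (an assumption about your platform setup, not a required step):

export HF_ENDPOINT=https://hf-mirror.com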

3. llamafactory-cli commands

Use llamafactory-cli help to display the help information.

(llama_fct) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# llamafactory-cli help
No ROCm runtime is found, using ROCM_HOME='/opt/dtk'
/opt/conda/envs/llama_fct/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_hip.so: cannot open shared object file: No such file or directory'
If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[2024-08-01 15:12:24,629] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
----------------------------------------------------------------------
| Usage:                                                             |
|   llamafactory-cli api -h: launch an OpenAI-style API server       |
|   llamafactory-cli chat -h: launch a chat interface in CLI         |
|   llamafactory-cli eval -h: evaluate models                        |
|   llamafactory-cli export -h: merge LoRA adapters and export model |
|   llamafactory-cli train -h: train models                          |
|   llamafactory-cli webchat -h: launch a chat interface in Web UI   |
|   llamafactory-cli webui: launch LlamaBoard                        |
|   llamafactory-cli version: show version info                      |
----------------------------------------------------------------------

4. Quick start

The following three commands run LoRA fine-tuning, inference, and merging of the Llama3-8B-Instruct model, respectively (a sketch of the training configuration they reference follows the commands).

llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
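
For orientation, the key fields of examples/train_lora/llama3_lora_sft.yaml look roughly like the sketch below. The values are illustrative, reconstructed from the training log in section 4.1 (datasets identity and alpaca_en_demo, batch size 1, gradient accumulation 8, 3 epochs, bf16, output directory saves/llama3-8b/lora/sft); always check the file in your own checkout, and note the model path change discussed in the FAQ.

### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct  # use LLM-Research/Meta-Llama-3-8B-Instruct with ModelScope, see FAQ

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all

### dataset
dataset: identity,alpaca_en_demo
template: llama3

### output
output_dir: saves/llama3-8b/lora/sft

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
num_train_epochs: 3.0
bf16: true

### eval
val_size: 0.1
per_device_eval_batch_size: 1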

Resource usage before running:


4.1 LoRA fine-tuning

Model fine-tuning and training run on the DCU.

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
[2024-08-01 19:06:41,134] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
08/01/2024 19:06:44 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.bfloat16
[INFO|tokenization_utils_base.py:2287] 2024-08-01 19:06:45,194 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-08-01 19:06:45,194 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-08-01 19:06:45,194 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-08-01 19:06:45,194 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2533] 2024-08-01 19:06:45,563 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
08/01/2024 19:06:45 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/01/2024 19:06:45 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/01/2024 19:06:45 - INFO - llamafactory.data.loader - Loading dataset identity.json...
Converting format of dataset (num_proc=16): 100%|███████████████████| 91/91 [00:00<00:00, 444.18 examples/s]
08/01/2024 19:06:47 - INFO - llamafactory.data.loader - Loading dataset alpaca_en_demo.json...
Converting format of dataset (num_proc=16): 100%|██████████████| 1000/1000 [00:00<00:00, 4851.17 examples/s]
Running tokenizer on dataset (num_proc=16): 100%|███████████████| 1091/1091 [00:02<00:00, 375.29 examples/s]
training example:
input_ids:
[128000, 128006, 882, 128007, 271, 6151, 128009, 128006, 78191, 128007, 271, 9906, 0, 358, 1097, 5991, 609, 39254, 459, 15592, 18328, 8040, 555, 5991, 3170, 3500, 13, 2650, 649, 358, 7945, 499, 3432, 30, 128009]
inputs:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>hi<|eot_id|><|start_header_id|>assistant<|end_header_id|>Hello! I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?<|eot_id|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 9906, 0, 358, 1097, 5991, 609, 39254, 459, 15592, 18328, 8040, 555, 5991, 3170, 3500, 13, 2650, 649, 358, 7945, 499, 3432, 30, 128009]
labels:
Hello! I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?<|eot_id|>
[INFO|configuration_utils.py:731] 2024-08-01 19:06:53,502 >> loading configuration file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-01 19:06:53,503 >> Model config LlamaConfig {"_name_or_path": "/root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct","architectures": ["LlamaForCausalLM"],"attention_bias": false,"attention_dropout": 0.0,"bos_token_id": 128000,"eos_token_id": 128009,"hidden_act": "silu","hidden_size": 4096,"initializer_range": 0.02,"intermediate_size": 14336,"max_position_embeddings": 8192,"mlp_bias": false,"model_type": "llama","num_attention_heads": 32,"num_hidden_layers": 32,"num_key_value_heads": 8,"pretraining_tp": 1,"rms_norm_eps": 1e-05,"rope_scaling": null,"rope_theta": 500000.0,"tie_word_embeddings": false,"torch_dtype": "bfloat16","transformers_version": "4.43.3","use_cache": true,"vocab_size": 128256
}
[INFO|modeling_utils.py:3631] 2024-08-01 19:06:53,534 >> loading weights file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1572] 2024-08-01 19:06:53,534 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1038] 2024-08-01 19:06:53,536 >> Generate config GenerationConfig {"bos_token_id": 128000,"eos_token_id": 128009
}
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 4/4 [00:08<00:00,  2.04s/it]
[INFO|modeling_utils.py:4463] 2024-08-01 19:07:01,775 >> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:4471] 2024-08-01 19:07:01,775 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2024-08-01 19:07:01,779 >> loading configuration file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/generation_config.json
[INFO|configuration_utils.py:1038] 2024-08-01 19:07:01,780 >> Generate config GenerationConfig {"bos_token_id": 128000,"do_sample": true,"eos_token_id": [128001,128009],"max_length": 4096,"temperature": 0.6,"top_p": 0.9
}
08/01/2024 19:07:01 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/01/2024 19:07:01 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
08/01/2024 19:07:01 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
08/01/2024 19:07:01 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/01/2024 19:07:01 - INFO - llamafactory.model.model_utils.misc - Found linear modules: q_proj,up_proj,v_proj,down_proj,k_proj,o_proj,gate_proj
08/01/2024 19:07:04 - INFO - llamafactory.model.loader - trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:648] 2024-08-01 19:07:04,471 >> Using auto half precision backend
[INFO|trainer.py:2134] 2024-08-01 19:07:04,831 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-08-01 19:07:04,831 >>   Num examples = 981
[INFO|trainer.py:2136] 2024-08-01 19:07:04,831 >>   Num Epochs = 3
[INFO|trainer.py:2137] 2024-08-01 19:07:04,832 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2140] 2024-08-01 19:07:04,832 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2141] 2024-08-01 19:07:04,832 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2142] 2024-08-01 19:07:04,832 >>   Total optimization steps = 366
[INFO|trainer.py:2143] 2024-08-01 19:07:04,836 >>   Number of trainable parameters = 20,971,520
{'loss': 1.5025, 'grad_norm': 1.3309401273727417, 'learning_rate': 2.702702702702703e-05, 'epoch': 0.08}
{'loss': 1.3424, 'grad_norm': 1.8096668720245361, 'learning_rate': 5.405405405405406e-05, 'epoch': 0.16}
{'loss': 1.1286, 'grad_norm': 1.2990491390228271, 'learning_rate': 8.108108108108109e-05, 'epoch': 0.24}
{'loss': 0.9808, 'grad_norm': 1.1075998544692993, 'learning_rate': 9.997948550797227e-05, 'epoch': 0.33}
{'loss': 0.9924, 'grad_norm': 1.8073676824569702, 'learning_rate': 9.961525153583327e-05, 'epoch': 0.41}
{'loss': 1.0052, 'grad_norm': 1.2079122066497803, 'learning_rate': 9.879896064123961e-05, 'epoch': 0.49}
{'loss': 0.9973, 'grad_norm': 1.7361079454421997, 'learning_rate': 9.753805025397779e-05, 'epoch': 0.57}
{'loss': 0.8488, 'grad_norm': 1.1059085130691528, 'learning_rate': 9.584400884284545e-05, 'epoch': 0.65}
{'loss': 0.9893, 'grad_norm': 0.8711654543876648, 'learning_rate': 9.373227124134888e-05, 'epoch': 0.73}
{'loss': 0.9116, 'grad_norm': 1.3793599605560303, 'learning_rate': 9.122207801708802e-05, 'epoch': 0.82}
{'loss': 1.0429, 'grad_norm': 1.3769993782043457, 'learning_rate': 8.833630016614976e-05, 'epoch': 0.9}
{'loss': 0.9323, 'grad_norm': 1.2503643035888672, 'learning_rate': 8.510123072976239e-05, 'epoch': 0.98}
{'loss': 0.9213, 'grad_norm': 2.449227809906006, 'learning_rate': 8.154634523184388e-05, 'epoch': 1.06}
{'loss': 0.8386, 'grad_norm': 1.009852409362793, 'learning_rate': 7.770403312015721e-05, 'epoch': 1.14}
 40%|███████████████████████████▌                                         | 146/366 [10:19<15:11,  4.14s/it]
{'loss': 0.856, 'grad_norm': 0.863474428653717, 'learning_rate': 7.360930265797935e-05, 'epoch': 1.22}
{'loss': 0.838, 'grad_norm': 0.712546169757843, 'learning_rate': 6.929946195508932e-05, 'epoch': 1.3}
{'loss': 0.8268, 'grad_norm': 1.6060960292816162, 'learning_rate': 6.481377904428171e-05, 'epoch': 1.39}
{'loss': 0.7326, 'grad_norm': 0.7863644957542419, 'learning_rate': 6.019312410053286e-05, 'epoch': 1.47}
{'loss': 0.7823, 'grad_norm': 0.8964634537696838, 'learning_rate': 5.547959706265068e-05, 'epoch': 1.55}
{'loss': 0.7599, 'grad_norm': 0.5305138826370239, 'learning_rate': 5.0716144050239375e-05, 'epoch': 1.63}
{'loss': 0.815, 'grad_norm': 0.8153926730155945, 'learning_rate': 4.594616607090028e-05, 'epoch': 1.71}
{'loss': 0.8258, 'grad_norm': 1.3266267776489258, 'learning_rate': 4.121312358283463e-05, 'epoch': 1.79}
{'loss': 0.7446, 'grad_norm': 1.8706341981887817, 'learning_rate': 3.656014051577713e-05, 'epoch': 1.88}
{'loss': 0.7539, 'grad_norm': 1.5148639678955078, 'learning_rate': 3.202961135812437e-05, 'epoch': 1.96}
{'loss': 0.7512, 'grad_norm': 1.3771291971206665, 'learning_rate': 2.7662814890184818e-05, 'epoch': 2.04}
{'loss': 0.7128, 'grad_norm': 1.420331597328186, 'learning_rate': 2.3499538082923606e-05, 'epoch': 2.12}
{'loss': 0.635, 'grad_norm': 0.9235875010490417, 'learning_rate': 1.9577713588953795e-05, 'epoch': 2.2}
{'loss': 0.6628, 'grad_norm': 1.6558737754821777, 'learning_rate': 1.5933074128684332e-05, 'epoch': 2.28}
{'loss': 0.681, 'grad_norm': 0.8138720393180847, 'learning_rate': 1.2598826920598772e-05, 'epoch': 2.36}
{'loss': 0.6707, 'grad_norm': 1.0700312852859497, 'learning_rate': 9.605351122011309e-06, 'epoch': 2.45}
{'loss': 0.6201, 'grad_norm': 1.3334729671478271, 'learning_rate': 6.979921036993042e-06, 'epoch': 2.53}
{'loss': 0.6698, 'grad_norm': 1.440247893333435, 'learning_rate': 4.746457613389904e-06, 'epoch': 2.61}
{'loss': 0.7072, 'grad_norm': 0.9171076416969299, 'learning_rate': 2.925310493105099e-06, 'epoch': 2.69}
{'loss': 0.6871, 'grad_norm': 0.9809044003486633, 'learning_rate': 1.5330726014397668e-06, 'epoch': 2.77}
{'loss': 0.5931, 'grad_norm': 1.7158288955688477, 'learning_rate': 5.824289648152126e-07, 'epoch': 2.85}
{'loss': 0.6827, 'grad_norm': 1.3241132497787476, 'learning_rate': 8.204113433559201e-08, 'epoch': 2.94}
100%|█████████████████████████████████████████████████████████████████████| 366/366 [25:42<00:00,  4.02s/it]
[INFO|trainer.py:3503] 2024-08-01 19:32:47,527 >> Saving model checkpoint to saves/llama3-8b/lora/sft/checkpoint-366
[INFO|configuration_utils.py:731] 2024-08-01 19:32:47,556 >> loading configuration file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-01 19:32:47,557 >> Model config LlamaConfig {"architectures": ["LlamaForCausalLM"],"attention_bias": false,"attention_dropout": 0.0,"bos_token_id": 128000,"eos_token_id": 128009,"hidden_act": "silu","hidden_size": 4096,"initializer_range": 0.02,"intermediate_size": 14336,"max_position_embeddings": 8192,"mlp_bias": false,"model_type": "llama","num_attention_heads": 32,"num_hidden_layers": 32,"num_key_value_heads": 8,"pretraining_tp": 1,"rms_norm_eps": 1e-05,"rope_scaling": null,"rope_theta": 500000.0,"tie_word_embeddings": false,"torch_dtype": "bfloat16","transformers_version": "4.43.3","use_cache": true,"vocab_size": 128256
}
[INFO|tokenization_utils_base.py:2702] 2024-08-01 19:32:47,675 >> tokenizer config file saved in saves/llama3-8b/lora/sft/checkpoint-366/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-01 19:32:47,677 >> Special tokens file saved in saves/llama3-8b/lora/sft/checkpoint-366/special_tokens_map.json
[INFO|trainer.py:2394] 2024-08-01 19:32:48,046 >> Training completed. Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 1543.2099, 'train_samples_per_second': 1.907, 'train_steps_per_second': 0.237, 'train_loss': 0.8416516305318947, 'epoch': 2.98}
100%|█████████████████████████████████████████████████████████████████████| 366/366 [25:43<00:00,  4.22s/it]
[INFO|trainer.py:3503] 2024-08-01 19:32:48,050 >> Saving model checkpoint to saves/llama3-8b/lora/sft
[INFO|configuration_utils.py:731] 2024-08-01 19:32:48,081 >> loading configuration file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-01 19:32:48,082 >> Model config LlamaConfig {"architectures": ["LlamaForCausalLM"],"attention_bias": false,"attention_dropout": 0.0,"bos_token_id": 128000,"eos_token_id": 128009,"hidden_act": "silu","hidden_size": 4096,"initializer_range": 0.02,"intermediate_size": 14336,"max_position_embeddings": 8192,"mlp_bias": false,"model_type": "llama","num_attention_heads": 32,"num_hidden_layers": 32,"num_key_value_heads": 8,"pretraining_tp": 1,"rms_norm_eps": 1e-05,"rope_scaling": null,"rope_theta": 500000.0,"tie_word_embeddings": false,"torch_dtype": "bfloat16","transformers_version": "4.43.3","use_cache": true,"vocab_size": 128256
}
[INFO|tokenization_utils_base.py:2702] 2024-08-01 19:32:48,191 >> tokenizer config file saved in saves/llama3-8b/lora/sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-01 19:32:48,192 >> Special tokens file saved in saves/llama3-8b/lora/sft/special_tokens_map.json
***** train metrics *****
  epoch                    =     2.9847
  total_flos               = 20619353GF
  train_loss               =     0.8417
  train_runtime            = 0:25:43.20
  train_samples_per_second =      1.907
  train_steps_per_second   =      0.237
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
08/01/2024 19:32:48 - WARNING - llamafactory.extras.ploting - No metric eval_loss to plot.
08/01/2024 19:32:48 - WARNING - llamafactory.extras.ploting - No metric eval_accuracy to plot.
[INFO|trainer.py:3819] 2024-08-01 19:32:48,529 >>
***** Running Evaluation *****
[INFO|trainer.py:3821] 2024-08-01 19:32:48,529 >>   Num examples = 110
[INFO|trainer.py:3824] 2024-08-01 19:32:48,529 >>   Batch size = 1
100%|█████████████████████████████████████████████████████████████████████| 110/110 [00:18<00:00,  6.07it/s]
***** eval metrics *****
  epoch                   =     2.9847
  eval_loss               =     0.9957
  eval_runtime            = 0:00:18.23
  eval_samples_per_second =      6.031
  eval_steps_per_second   =      6.031
[INFO|modelcard.py:449] 2024-08-01 19:33:06,773 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}

Output:

root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models# tree -L 6 LLaMA-Factory/saves/
LLaMA-Factory/saves/
`-- llama3-8b
    `-- lora
        `-- sft
            |-- README.md
            |-- adapter_config.json
            |-- adapter_model.safetensors
            |-- all_results.json
            |-- checkpoint-366
            |   |-- README.md
            |   |-- adapter_config.json
            |   |-- adapter_model.safetensors
            |   |-- optimizer.pt
            |   |-- rng_state.pth
            |   |-- scheduler.pt
            |   |-- special_tokens_map.json
            |   |-- tokenizer.json
            |   |-- tokenizer_config.json
            |   |-- trainer_state.json
            |   `-- training_args.bin
            |-- eval_results.json
            |-- special_tokens_map.json
            |-- tokenizer.json
            |-- tokenizer_config.json
            |-- train_results.json
            |-- trainer_log.jsonl
            |-- trainer_state.json
            |-- training_args.bin
            `-- training_loss.png
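
The saved adapter can also be loaded outside LLaMA-Factory for a quick check. This is a minimal sketch (not part of the original steps), assuming the base model path seen in the training log and the adapter directory from the tree above:

# load_lora_adapter.py -- minimal sketch; paths are assumptions taken from the logs above
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "/root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct"  # local cache path from the training log
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "saves/llama3-8b/lora/sft")  # LoRA adapter directory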

Resource usage during training:


4.2 LoRA inference

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
[2024-08-01 21:26:27,270] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[INFO|tokenization_utils_base.py:2287] 2024-08-01 21:26:31,957 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-08-01 21:26:31,958 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-08-01 21:26:31,958 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-08-01 21:26:31,958 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2533] 2024-08-01 21:26:32,341 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
08/01/2024 21:26:32 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/01/2024 21:26:32 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
[INFO|configuration_utils.py:731] 2024-08-01 21:26:32,343 >> loading configuration file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-01 21:26:32,344 >> Model config LlamaConfig {"_name_or_path": "/root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct","architectures": ["LlamaForCausalLM"],"attention_bias": false,"attention_dropout": 0.0,"bos_token_id": 128000,"eos_token_id": 128009,"hidden_act": "silu","hidden_size": 4096,"initializer_range": 0.02,"intermediate_size": 14336,"max_position_embeddings": 8192,"mlp_bias": false,"model_type": "llama","num_attention_heads": 32,"num_hidden_layers": 32,"num_key_value_heads": 8,"pretraining_tp": 1,"rms_norm_eps": 1e-05,"rope_scaling": null,"rope_theta": 500000.0,"tie_word_embeddings": false,"torch_dtype": "bfloat16","transformers_version": "4.43.3","use_cache": true,"vocab_size": 128256
}
08/01/2024 21:26:32 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.
[INFO|modeling_utils.py:3631] 2024-08-01 21:26:32,376 >> loading weights file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1572] 2024-08-01 21:26:32,377 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1038] 2024-08-01 21:26:32,379 >> Generate config GenerationConfig {"bos_token_id": 128000,"eos_token_id": 128009
}
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.93s/it]
[INFO|modeling_utils.py:4463] 2024-08-01 21:26:40,525 >> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:4471] 2024-08-01 21:26:40,525 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2024-08-01 21:26:40,528 >> loading configuration file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/generation_config.json
[INFO|configuration_utils.py:1038] 2024-08-01 21:26:40,528 >> Generate config GenerationConfig {"bos_token_id": 128000,"do_sample": true,"eos_token_id": [128001,128009],"max_length": 4096,"temperature": 0.6,"top_p": 0.9
}
08/01/2024 21:26:40 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
08/01/2024 21:26:46 - INFO - llamafactory.model.adapter - Merged 1 adapter(s).
08/01/2024 21:26:46 - INFO - llamafactory.model.adapter - Loaded adapter(s): saves/llama3-8b/lora/sft
08/01/2024 21:26:46 - INFO - llamafactory.model.loader - all params: 8,030,261,248
Welcome to the CLI application, use `clear` to remove the history, use `exit` to exit the application.

User: clear
History has been removed.

User: exit
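
For reference, the inference config examples/inference/llama3_lora_sft.yaml pairs the base model with the adapter saved in section 4.1. A hedged sketch of its key fields (values reconstructed from the log above; verify against your own checkout):

model_name_or_path: LLM-Research/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft
template: llama3
finetuning_type: lora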

4.3 Model merging

Model merging runs on the CPU.
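
The export config examples/merge_lora/llama3_lora_sft.yaml merges the adapter into the base model and writes the result to models/llama3_lora_sft. A hedged sketch of its key fields (the output directory and 2 GB shard size can be read off the log below; verify the exact keys against your own checkout):

### model
model_name_or_path: LLM-Research/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft
template: llama3
finetuning_type: lora

### export
export_dir: models/llama3_lora_sft
export_size: 2        # shard size in GB, matching the shards in the output below
export_device: cpu    # merging runs on the CPU
export_legacy_format: false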

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
[2024-08-01 21:34:37,394] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[INFO|tokenization_utils_base.py:2287] 2024-08-01 21:34:41,664 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-08-01 21:34:41,664 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-08-01 21:34:41,664 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-08-01 21:34:41,664 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2533] 2024-08-01 21:34:42,030 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
08/01/2024 21:34:42 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/01/2024 21:34:42 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
[INFO|configuration_utils.py:731] 2024-08-01 21:34:42,031 >> loading configuration file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-01 21:34:42,032 >> Model config LlamaConfig {"_name_or_path": "/root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct","architectures": ["LlamaForCausalLM"],"attention_bias": false,"attention_dropout": 0.0,"bos_token_id": 128000,"eos_token_id": 128009,"hidden_act": "silu","hidden_size": 4096,"initializer_range": 0.02,"intermediate_size": 14336,"max_position_embeddings": 8192,"mlp_bias": false,"model_type": "llama","num_attention_heads": 32,"num_hidden_layers": 32,"num_key_value_heads": 8,"pretraining_tp": 1,"rms_norm_eps": 1e-05,"rope_scaling": null,"rope_theta": 500000.0,"tie_word_embeddings": false,"torch_dtype": "bfloat16","transformers_version": "4.43.3","use_cache": true,"vocab_size": 128256
}
08/01/2024 21:34:42 - INFO - llamafactory.model.patcher - Using KV cache for faster generation.
[INFO|modeling_utils.py:3631] 2024-08-01 21:34:42,058 >> loading weights file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1572] 2024-08-01 21:34:42,058 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1038] 2024-08-01 21:34:42,059 >> Generate config GenerationConfig {"bos_token_id": 128000,"eos_token_id": 128009
}
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  3.40it/s]
[INFO|modeling_utils.py:4463] 2024-08-01 21:34:43,324 >> All model checkpoint weights were used when initializing LlamaForCausalLM.
[INFO|modeling_utils.py:4471] 2024-08-01 21:34:43,324 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2024-08-01 21:34:43,327 >> loading configuration file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/generation_config.json
[INFO|configuration_utils.py:1038] 2024-08-01 21:34:43,327 >> Generate config GenerationConfig {"bos_token_id": 128000,"do_sample": true,"eos_token_id": [128001,128009],"max_length": 4096,"temperature": 0.6,"top_p": 0.9
}
08/01/2024 21:34:43 - INFO - llamafactory.model.model_utils.attention - Using vanilla attention implementation.
08/01/2024 21:40:34 - INFO - llamafactory.model.adapter - Merged 1 adapter(s).
08/01/2024 21:40:34 - INFO - llamafactory.model.adapter - Loaded adapter(s): saves/llama3-8b/lora/sft
08/01/2024 21:40:34 - INFO - llamafactory.model.loader - all params: 8,030,261,248
08/01/2024 21:40:34 - INFO - llamafactory.train.tuner - Convert model dtype to: torch.bfloat16.
[INFO|configuration_utils.py:472] 2024-08-01 21:40:34,700 >> Configuration saved in models/llama3_lora_sft/config.json
[INFO|configuration_utils.py:807] 2024-08-01 21:40:34,704 >> Configuration saved in models/llama3_lora_sft/generation_config.json
[INFO|modeling_utils.py:2763] 2024-08-01 21:40:49,039 >> The model is bigger than the maximum size per checkpoint (2GB) and is going to be split in 9 checkpoint shards. You can find where each parameters has been saved in the index located at models/llama3_lora_sft/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2702] 2024-08-01 21:40:49,046 >> tokenizer config file saved in models/llama3_lora_sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2711] 2024-08-01 21:40:49,048 >> Special tokens file saved in models/llama3_lora_sft/special_tokens_map.json

Output:

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models# tree -L 6 LLaMA-Factory/models/llama3_lora_sft/
LLaMA-Factory/models/llama3_lora_sft/
|-- config.json
|-- generation_config.json
|-- model-00001-of-00009.safetensors
|-- model-00002-of-00009.safetensors
|-- model-00003-of-00009.safetensors
|-- model-00004-of-00009.safetensors
|-- model-00005-of-00009.safetensors
|-- model-00006-of-00009.safetensors
|-- model-00007-of-00009.safetensors
|-- model-00008-of-00009.safetensors
|-- model-00009-of-00009.safetensors
|-- model.safetensors.index.json
|-- special_tokens_map.json
|-- tokenizer.json
`-- tokenizer_config.json
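
The merged model is a plain Hugging Face checkpoint, so it can be smoke-tested directly with transformers. A minimal sketch (not part of the original steps; the prompt and generation settings are arbitrary):

# test_merged_model.py -- minimal sketch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "models/llama3_lora_sft"  # merged model directory from the tree above
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "hi"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))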

Resource usage during merging:


V. FAQ

Q:OSError: You are trying to access a gated repo. Make sure to have access to it at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct.

(llama_fct) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
No ROCm runtime is found, using ROCM_HOME='/opt/dtk'
/opt/conda/envs/llama_fct/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_hip.so: cannot open shared object file: No such file or directory'
If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[2024-08-01 15:13:21,242] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
08/01/2024 15:13:24 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cpu, n_gpu: 0, distributed training: False, compute dtype: torch.bfloat16
[INFO|tokenization_auto.py:682] 2024-08-01 15:13:25,152 >> Could not locate the tokenizer configuration file, will try to use the model config instead.
Traceback (most recent call last):File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_statusresponse.raise_for_status()File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_statusraise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://hf-mirror.com/meta-llama/Meta-Llama-3-8B-Instruct/resolve/main/config.jsonThe above exception was the direct cause of the following exception:Traceback (most recent call last):File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/transformers/utils/hub.py", line 402, in cached_fileresolved_file = hf_hub_download(File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_freturn f(*args, **kwargs)File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fnreturn fn(*args, **kwargs)File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1240, in hf_hub_downloadreturn _hf_hub_download_to_cache_dir(File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1347, in _hf_hub_download_to_cache_dir_raise_on_head_call_error(head_call_error, force_download, local_files_only)File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1854, in _raise_on_head_call_errorraise head_call_errorFile "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1751, in _get_metadata_or_catch_errormetadata = get_hf_file_metadata(File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fnreturn fn(*args, **kwargs)File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1673, in get_hf_file_metadatar = _request_wrapper(File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 376, in _request_wrapperresponse = _request_wrapper(File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 400, in _request_wrapperhf_raise_for_status(response)File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 321, in hf_raise_for_statusraise GatedRepoError(message, response) from e
huggingface_hub.utils._errors.GatedRepoError: 401 Client Error. (Request ID: Root=1-66ab3595-53663c2f4d5cf81405b65b9e;080cfa15-3220-4ab1-b123-4a32ba31a03a)Cannot access gated repo for url https://hf-mirror.com/meta-llama/Meta-Llama-3-8B-Instruct/resolve/main/config.json.
Access to model meta-llama/Meta-Llama-3-8B-Instruct is restricted. You must be authenticated to access it.The above exception was the direct cause of the following exception:Traceback (most recent call last):File "/opt/conda/envs/llama_fct/bin/llamafactory-cli", line 8, in <module>sys.exit(main())File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/cli.py", line 111, in mainrun_exp()File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exprun_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 44, in run_sfttokenizer_module = load_tokenizer(model_args)File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/model/loader.py", line 69, in load_tokenizertokenizer = AutoTokenizer.from_pretrained(File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 853, in from_pretrainedconfig = AutoConfig.from_pretrained(File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 972, in from_pretrainedconfig_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/transformers/configuration_utils.py", line 632, in get_config_dictconfig_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/transformers/configuration_utils.py", line 689, in _get_config_dictresolved_config_file = cached_file(File "/opt/conda/envs/llama_fct/lib/python3.10/site-packages/transformers/utils/hub.py", line 420, in cached_fileraise EnvironmentError(
OSError: You are trying to access a gated repo.
Make sure to have access to it at https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct.
401 Client Error. (Request ID: Root=1-66ab3595-53663c2f4d5cf81405b65b9e;080cfa15-3220-4ab1-b123-4a32ba31a03a)Cannot access gated repo for url https://hf-mirror.com/meta-llama/Meta-Llama-3-8B-Instruct/resolve/main/config.json.
Access to model meta-llama/Meta-Llama-3-8B-Instruct is restricted. You must be authenticated to access it.

Cause: by default the model is fetched from Hugging Face; because authorization for this gated Hugging Face model failed, the download failed.

Solution: download the model from the ModelScope community instead.

export USE_MODELSCOPE_HUB=1 # on Windows, use `set USE_MODELSCOPE_HUB=1`

Set model_name_or_path to the model ID to load the corresponding model. You can browse all available models in the ModelScope community, for example LLM-Research/Meta-Llama-3-8B-Instruct.

Modify the llama3_lora_sft.yaml file:

# model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
# change to
model_name_or_path: LLM-Research/Meta-Llama-3-8B-Instruct
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml

Q:OSError: LLM-Research/Meta-Llama-3-8B-Instruct is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
[2024-08-01 21:17:22,212] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_statusresponse.raise_for_status()File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_statusraise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://hf-mirror.com/LLM-Research/Meta-Llama-3-8B-Instruct/resolve/main/tokenizer_config.jsonThe above exception was the direct cause of the following exception:Traceback (most recent call last):File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/utils/hub.py", line 402, in cached_fileresolved_file = hf_hub_download(File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fnreturn fn(*args, **kwargs)File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1221, in hf_hub_downloadreturn _hf_hub_download_to_cache_dir(File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1325, in _hf_hub_download_to_cache_dir_raise_on_head_call_error(head_call_error, force_download, local_files_only)File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1823, in _raise_on_head_call_errorraise head_call_errorFile "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1722, in _get_metadata_or_catch_errormetadata = get_hf_file_metadata(url=url, proxies=proxies, timeout=etag_timeout, headers=headers)File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fnreturn fn(*args, **kwargs)File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1645, in get_hf_file_metadatar = _request_wrapper(File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 372, in _request_wrapperresponse = _request_wrapper(File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 396, in _request_wrapperhf_raise_for_status(response)File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/huggingface_hub/utils/_errors.py", line 352, in hf_raise_for_statusraise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-66ab8ae6-4ed0547e1f86fcb201b723f8;acee559e-0676-48e4-8871-b6eb58e797ca)Repository Not Found for url: https://hf-mirror.com/LLM-Research/Meta-Llama-3-8B-Instruct/resolve/main/tokenizer_config.json.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated.
Invalid username or password.The above exception was the direct cause of the following exception:Traceback (most recent call last):File "/opt/conda/envs/llama_factory_torch/bin/llamafactory-cli", line 8, in <module>sys.exit(main())File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/cli.py", line 81, in mainrun_chat()File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/chat/chat_model.py", line 125, in run_chatchat_model = ChatModel()File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/chat/chat_model.py", line 44, in __init__self.engine: "BaseEngine" = HuggingfaceEngine(model_args, data_args, finetuning_args, generating_args)File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/chat/hf_engine.py", line 53, in __init__tokenizer_module = load_tokenizer(model_args)File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/model/loader.py", line 69, in load_tokenizertokenizer = AutoTokenizer.from_pretrained(File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 833, in from_pretrainedtokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 665, in get_tokenizer_configresolved_config_file = cached_file(File "/opt/conda/envs/llama_factory_torch/lib/python3.10/site-packages/transformers/utils/hub.py", line 425, in cached_fileraise EnvironmentError(
OSError: LLM-Research/Meta-Llama-3-8B-Instruct is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

Cause: LLM-Research/Meta-Llama-3-8B-Instruct is a ModelScope model ID, so it cannot be found on Hugging Face.

Solution: download the model from the ModelScope community.

export USE_MODELSCOPE_HUB=1

Q:ModuleNotFoundError: No module named 'modelscope'

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
[2024-08-01 19:05:15,320] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
08/01/2024 19:05:18 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: False, compute dtype: torch.bfloat16
Traceback (most recent call last):
  File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/extras/misc.py", line 219, in try_download_model_from_ms
    from modelscope import snapshot_download
ModuleNotFoundError: No module named 'modelscope'During handling of the above exception, another exception occurred:Traceback (most recent call last):File "/opt/conda/envs/llama_factory_torch/bin/llamafactory-cli", line 8, in <module>sys.exit(main())File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/cli.py", line 111, in mainrun_exp()File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in                                                                             run_exprun_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line                                                                             44, in run_sfttokenizer_module = load_tokenizer(model_args)File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/model/loader.py", line 67, i                                                                            n load_tokenizerinit_kwargs = _get_init_kwargs(model_args)File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/model/loader.py", line 52, i                                                                            n _get_init_kwargsmodel_args.model_name_or_path = try_download_model_from_ms(model_args)File "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/extras/misc.py", line 224, i                                                                            n try_download_model_from_msraise ImportError("Please install modelscope via `pip install modelscope -U`")
ImportError: Please install modelscope via `pip install modelscope -U`

Cause: the modelscope package is missing.

Solution: install modelscope.

(llama_factory_torch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# pip install --no-dependencies modelscope
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting modelscope
  Using cached https://pypi.tuna.tsinghua.edu.cn/packages/38/37/9fe505ebc67ba5e0345a69d6e8b2ee8630523975b484d221691ef60182bd/modelscope-1.16.1-py3-none-any.whl (5.7 MB)
Installing collected packages: modelscope
Successfully installed modelscope-1.16.1
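
A quick way to confirm that the import now resolves (this exact command is only a suggestion; it mirrors the import from the traceback above):

python -c "from modelscope import snapshot_download; print('modelscope OK')"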

Q:ImportError: /PATH/TO/site-packages/torch/lib/libtorch_hip.so: undefined symbol: ncclCommInitRankConfig

(llama_fct_pytorch) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory# llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
Traceback (most recent call last):File "/opt/conda/envs/llama_fct_pytorch/bin/llamafactory-cli", line 5, in <module>from llamafactory.cli import mainFile "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/__init__.py", line 38, in <module>from .cli import VERSIONFile "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/cli.py", line 21, in <module>from . import launcherFile "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/launcher.py", line 15, in <module>from llamafactory.train.tuner import run_expFile "/public/home/scnlbe5oi5/Downloads/models/LLaMA-Factory/src/llamafactory/train/tuner.py", line 19, in <module>import torchFile "/opt/conda/envs/llama_fct_pytorch/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>from torch._C import *  # noqa: F403
ImportError: /opt/conda/envs/llama_fct_pytorch/lib/python3.10/site-packages/torch/lib/libtorch_hip.so: undefined symbol: ncclCommInitRankConfig
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/llama_fct_pytorch/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>
    from torch._C import *  # noqa: F403
ImportError: /opt/conda/envs/llama_fct_pytorch/lib/python3.10/site-packages/torch/lib/libtorch_hip.so: undefined symbol: ncclCommInitRankConfig

Cause: the current PyTorch build does not support DCU.

For the fix, see the next FAQ entry below.

Q: The installed PyTorch version does not support DCU

(llama_fct) root@notebook-1813389960667746306-scnlbe5oi5-17811:/public/home/scnlbe5oi5/Downloads/models/LLaM
A-Factory# llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
No ROCm runtime is found, using ROCM_HOME='/opt/dtk'
/opt/conda/envs/llama_fct/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'libc10_hip.so: cannot open shared object file: No such file or directory'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[2024-08-01 17:49:08,805] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
08/01/2024 17:49:12 - INFO - llamafactory.hparams.parser - Process rank: 0, device: cpu, n_gpu: 0, distributed training: False, compute dtype: torch.bfloat16
Downloading: 100%|█████████████████████████████████████████████████████████| 654/654 [00:00<00:00, 2.56kB/s]
Downloading: 100%|█████████████████████████████████████████████████████████| 48.0/48.0 [00:00<00:00, 183B/s]
Downloading: 100%|███████████████████████████████████████████████████████████| 187/187 [00:00<00:00, 759B/s]
Downloading: 100%|█████████████████████████████████████████████████████| 7.62k/7.62k [00:00<00:00, 29.9kB/s]
Downloading: 100%|█████████████████████████████████████████████████████| 4.63G/4.63G [01:33<00:00, 53.4MB/s]
Downloading: 100%|█████████████████████████████████████████████████████| 4.66G/4.66G [01:02<00:00, 79.9MB/s]
Downloading: 100%|█████████████████████████████████████████████████████| 4.58G/4.58G [01:00<00:00, 81.7MB/s]
Downloading: 100%|█████████████████████████████████████████████████████| 1.09G/1.09G [00:22<00:00, 51.6MB/s]
Downloading: 100%|█████████████████████████████████████████████████████| 23.4k/23.4k [00:00<00:00, 53.6kB/s]
Downloading: 100%|██████████████████████████████████████████████████████| 36.3k/36.3k [00:00<00:00, 125kB/s]
Downloading: 100%|█████████████████████████████████████████████████████████| 73.0/73.0 [00:00<00:00, 293B/s]
Downloading: 100%|█████████████████████████████████████████████████████| 8.66M/8.66M [00:00<00:00, 13.5MB/s]
Downloading: 100%|█████████████████████████████████████████████████████| 49.8k/49.8k [00:00<00:00, 90.0kB/s]
Downloading: 100%|█████████████████████████████████████████████████████| 4.59k/4.59k [00:00<00:00, 18.7kB/s]
[INFO|tokenization_utils_base.py:2287] 2024-08-01 17:53:53,510 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2287] 2024-08-01 17:53:53,511 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2287] 2024-08-01 17:53:53,511 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2287] 2024-08-01 17:53:53,511 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2533] 2024-08-01 17:53:53,854 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
08/01/2024 17:53:53 - INFO - llamafactory.data.template - Replace eos token: <|eot_id|>
08/01/2024 17:53:53 - INFO - llamafactory.data.template - Add pad token: <|eot_id|>
08/01/2024 17:53:53 - INFO - llamafactory.data.loader - Loading dataset identity.json...
Generating train split: 91 examples [00:00, 10580.81 examples/s]
Converting format of dataset (num_proc=16): 100%|███████████████████| 91/91 [00:00<00:00, 427.78 examples/s]
08/01/2024 17:53:56 - INFO - llamafactory.data.loader - Loading dataset alpaca_en_demo.json...
Generating train split: 1000 examples [00:00, 66788.28 examples/s]
Converting format of dataset (num_proc=16): 100%|██████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:00<00:00, 4688.60 examples/s]
Running tokenizer on dataset (num_proc=16): 100%|███████████████████████████████████████████████████████████████████████████████████████████| 1091/1091 [00:03<00:00, 295.08 examples/s]
training example:
input_ids:
[128000, 128006, 882, 128007, 271, 6151, 128009, 128006, 78191, 128007, 271, 9906, 0, 358, 1097, 5991, 609, 39254, 459, 15592, 18328, 8040, 555, 5991, 3170, 3500, 13, 2650, 649, 358, 7945, 499, 3432, 30, 128009]
inputs:
<|begin_of_text|><|start_header_id|>user<|end_header_id|>hi<|eot_id|><|start_header_id|>assistant<|end_header_id|>Hello! I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?<|eot_id|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 9906, 0, 358, 1097, 5991, 609, 39254, 459, 15592, 18328, 8040, 555, 5991, 3170, 3500, 13, 2650, 649, 358, 7945, 499, 3432, 30, 128009]
labels:
Hello! I am {{name}}, an AI assistant developed by {{author}}. How can I assist you today?<|eot_id|>
[INFO|configuration_utils.py:731] 2024-08-01 17:54:02,106 >> loading configuration file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/config.json
[INFO|configuration_utils.py:800] 2024-08-01 17:54:02,108 >> Model config LlamaConfig {
  "_name_or_path": "/root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128009,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.43.3",
  "use_cache": true,
  "vocab_size": 128256
}

[INFO|modeling_utils.py:3631] 2024-08-01 17:54:02,139 >> loading weights file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:1572] 2024-08-01 17:54:02,140 >> Instantiating LlamaForCausalLM model under default dtype torch.bfloat16.
[INFO|configuration_utils.py:1038] 2024-08-01 17:54:02,142 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "eos_token_id": 128009
}

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00,  2.68it/s]
[INFO|modeling_utils.py:4463] 2024-08-01 17:54:03,708 >> All model checkpoint weights were used when initializing LlamaForCausalLM.

[INFO|modeling_utils.py:4471] 2024-08-01 17:54:03,709 >> All the weights of LlamaForCausalLM were initialized from the model checkpoint at /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use LlamaForCausalLM for predictions without further training.
[INFO|configuration_utils.py:991] 2024-08-01 17:54:03,712 >> loading configuration file /root/.cache/modelscope/hub/LLM-Research/Meta-Llama-3-8B-Instruct/generation_config.json
[INFO|configuration_utils.py:1038] 2024-08-01 17:54:03,713 >> Generate config GenerationConfig {
  "bos_token_id": 128000,
  "do_sample": true,
  "eos_token_id": [
    128001,
    128009
  ],
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}

08/01/2024 17:54:03 - INFO - llamafactory.model.model_utils.checkpointing - Gradient checkpointing enabled.
08/01/2024 17:54:03 - INFO - llamafactory.model.model_utils.attention - Using torch SDPA for faster training and inference.
08/01/2024 17:54:03 - INFO - llamafactory.model.adapter - Upcasting trainable params to float32.
08/01/2024 17:54:03 - INFO - llamafactory.model.adapter - Fine-tuning method: LoRA
08/01/2024 17:54:03 - INFO - llamafactory.model.model_utils.misc - Found linear modules: q_proj,down_proj,o_proj,k_proj,gate_proj,up_proj,v_proj
08/01/2024 17:54:08 - INFO - llamafactory.model.loader - trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO|trainer.py:648] 2024-08-01 17:54:08,091 >> Using cpu_amp half precision backend
[INFO|trainer.py:2134] 2024-08-01 17:54:09,008 >> ***** Running training *****
[INFO|trainer.py:2135] 2024-08-01 17:54:09,008 >>   Num examples = 981
[INFO|trainer.py:2136] 2024-08-01 17:54:09,008 >>   Num Epochs = 3
[INFO|trainer.py:2137] 2024-08-01 17:54:09,008 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:2140] 2024-08-01 17:54:09,008 >>   Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:2141] 2024-08-01 17:54:09,008 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:2142] 2024-08-01 17:54:09,008 >>   Total optimization steps = 366
[INFO|trainer.py:2143] 2024-08-01 17:54:09,012 >>   Number of trainable parameters = 20,971,520
  0%|                                                                                          | 0/366 [00:00<?, ?it/s

Cause: the current PyTorch build does not support DCU, so training falls back to the CPU (note `device: cpu, n_gpu: 0` in the log above), the process hangs at step 0, and the model cannot be fine-tuned.

Solution: search the 光合社区 (Hygon developer community) for a DCU-enabled PyTorch build, download it, and install it. Taking torch-2.1.0+das1.1.git3ac1bdd.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64 as an example, install torch-2.1.0 from that wheel, as sketched below.
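
A minimal sketch of that reinstall, assuming the wheel file named above has already been downloaded into the current directory (the exact filename and download location depend on what the 光合社区 page provides):

```bash
# Remove the non-DCU build first, then install the DTK (ROCm-compatible) wheel.
# --no-dependencies avoids pulling in packages that conflict with the preinstalled stack.
pip uninstall -y torch
pip install --no-dependencies ./torch-2.1.0+das1.1.git3ac1bdd.abi1.dtk2404-cp310-cp310-manylinux_2_31_x86_64.whl

# Quick check that the new build actually targets the DCU:
# torch.version.hip should be non-None and torch.cuda.is_available() should print True.
python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"
```

If the check passes, rerun the `llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml` command; the device should now be reported as a GPU rather than `cpu`.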
