import os
from unittest.mock import patch

import torch
import torch.distributed as dist
from filelock import FileLock

from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline
def lock_print(*values, **print_kwargs):
    """Print atomically across processes.

    Every torchrun worker serializes its output through the shared
    ``print.lock`` file, so lines from concurrent ranks are not interleaved.
    Output is always flushed so it appears promptly in the launcher's log.
    """
    with FileLock('print.lock'):
        print(*values, **print_kwargs, flush=True)
rank = int(os.environ['LOCAL_RANK'])
world_size = int(os.environ['LOCAL_WORLD_SIZE'])

# Remap CUDA_VISIBLE_DEVICES so this rank sees the two GPUs its tp=2
# pipeline will use.  The env var must be set BEFORE torch.cuda is
# initialized and before dist.init_process_group().
assert not torch.cuda.is_initialized()
assert not dist.is_initialized()
if 'CUDA_VISIBLE_DEVICES' in os.environ:
    # BUG FIX: the parsed device list was previously a dead store,
    # unconditionally overwritten by range(world_size); honor the
    # user-provided CUDA_VISIBLE_DEVICES when it is set.
    ids = list(map(int, os.environ['CUDA_VISIBLE_DEVICES'].split(',')))
else:
    ids = list(range(world_size))
# NOTE(review): consecutive ranks overlap here (rank r gets ids[r], ids[r+1],
# rank r+1 gets ids[r+1], ids[r+2]) — confirm this sharing is intended rather
# than the disjoint mapping ids[2*r], ids[2*r+1].
os.environ['CUDA_VISIBLE_DEVICES'] = f'{ids[rank]},{ids[rank+1]}'

dist.init_process_group(backend='nccl')
lock_print(f'rank={dist.get_rank()}, world_size={dist.get_world_size()}')

# Patch the envs consulted by mmengine's logger (it reads LOCAL_RANK and
# CUDA_VISIBLE_DEVICES), so every worker builds its pipeline as if it were
# local rank 0.
lmdeploy_logger_patch = patch.dict(os.environ, {'LOCAL_RANK': '0'})
with lmdeploy_logger_patch:
    lock_print('CUDA_VISIBLE_DEVICES', os.environ['CUDA_VISIBLE_DEVICES'],
               torch.cuda.device_count())
    pipe = pipeline('Qwen/Qwen2.5-7B-Instruct',
                    backend_config=TurbomindEngineConfig(tp=2))
    out = pipe('tell me a joke', gen_config=GenerationConfig(do_sample=True))
    # BUG FIX: dropped the undefined loop variable `i` from the log line.
    lock_print(f'rank={rank}, res={out.text}')

# Release pipeline memory before exercising the collectives below
# (implements the previously empty "release pipeline memory" step).
del pipe, out

# Check that dist still works after the pipeline run.  Since we rewrote
# CUDA_VISIBLE_DEVICES on some ranks, GPUs must be addressed by their
# *local* device id, not the global rank.
for src_rank in range(world_size):
    # NOTE(review): this maps even ranks to local device 0 and odd ranks to
    # local device `rank` — confirm that is the intended placement policy.
    local_device_id = rank if rank % 2 != 0 else rank % 2
    tensor = torch.randint(0, 100, (5,), device=local_device_id)
    # BUG FIX: snapshot the tensor before the broadcast overwrites it,
    # so the `before` value logged below is actually defined.
    before = tensor.clone()
    dist.broadcast(tensor, src=src_rank)
    lock_print(f'src={src_rank}')
    lock_print(f'rank={rank}, before={before}, after={tensor}')

dist.destroy_process_group()
if __name__ == '__main__':
    # Launch example:
    # CUDA_VISIBLE_DEVICES=0,1,4,5,6,7 torchrun --nproc_per_node 6 /home/chenxin/ws3/topk/offline.py
    # BUG FIX: the guard previously had a comment-only body, which is a
    # syntax error.  All of the script's work runs at import time above,
    # so the guard body is intentionally empty.
    pass