AI Multimodal: Janus-Pro-7B Model Inference, Fine-Tuning, and Merging in Practice (Part 2)

Published: 2025-04-23 23:20:44 | Editor: 123 | Views: 2884

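Janus-Pro is DeepSeek's latest open-source multimodal model: a novel autoregressive framework that unifies multimodal understanding and generation. It decouples visual encoding into separate pathways while still processing everything with a single, unified Transformer architecture, which addresses the limitations of earlier approaches. This decoupling not only alleviates the conflict between the visual encoder's roles in understanding and in generation, but also makes the framework more flexible. Janus-Pro surpasses previous unified models and matches or exceeds the performance of task-specific models.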

Fine-tuning with ms-swift

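ms-swift is the training and deployment framework for large language models and multimodal large models provided by the ModelScope community. It currently supports training (pre-training, fine-tuning, human alignment), inference, evaluation, quantization, and deployment for 450+ large language models and 150+ multimodal large models. Model developers can handle all of these needs in one place within the ms-swift framework.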

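The main capabilities of ms-swift currently include:

Model types: supports the full workflow from training to deployment for 450+ text-only large models, 150+ multimodal large models, and All-to-All omni-modal models.

Dataset types: 150+ built-in datasets for pre-training, fine-tuning, human alignment, multimodal tasks, and more, plus support for custom datasets.

Hardware support: CPU, RTX series, T4/V100, A10/A100/H100, Ascend NPU, and others.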

Lightweight training: supports lightweight fine-tuning methods such as LoRA, QLoRA, DoRA, LoRA+, ReFT, RS-LoRA, LLaMAPro, Adapter, GaLore, Q-GaLore, LISA, UnSloth, and Liger-Kernel.

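Distributed training: supports distributed data parallel (DDP), simple model parallelism via device_map, DeepSpeed ZeRO2 and ZeRO3, FSDP, and other distributed training techniques.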

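Quantized training: supports training on models quantized with BNB, AWQ, GPTQ, AQLM, HQQ, and EETQ.

RLHF training: supports human-alignment training methods such as DPO, CPO, SimPO, ORPO, KTO, RM, and PPO for both text-only and multimodal large models.

Multimodal training: supports training models across image, video, and audio modalities, including VQA, captioning, OCR, and grounding tasks.

Web UI training: provides training, inference, evaluation, and quantization through a graphical interface, covering the full large-model workflow.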

Plugins and extensions: supports custom models and datasets, as well as customization of components such as loss, metric, trainer, loss-scale, callback, and optimizer.

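Toolbox capabilities: beyond training, ms-swift also covers the full inference, evaluation, quantization, and deployment workflow for large models and multimodal large models.

Inference acceleration: supports the PyTorch, vLLM, and LmDeploy inference engines and provides an OpenAI-style interface, accelerating the inference, deployment, and evaluation modules.

Model evaluation: uses EvalScope as the evaluation backend and supports evaluating text-only and multimodal models on 100+ evaluation datasets.

Model quantization: supports exporting AWQ, GPTQ, and BNB quantized models; exported models can be served with vLLM/LmDeploy for accelerated inference and can also be trained further.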

Install SWIFT from source:

git clone https://github.com/modelscope/ms-swift.git

cd ms-swift

pip install -e . -i https://mirrors.aliyun.com/pypi/simple

pip install evalscope -i https://mirrors.aliyun.com/pypi/simple

pip install opencompass -i https://mirrors.aliyun.com/pypi/simple
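After installation, a quick check that the packages are visible to the current Python interpreter can save debugging time later. The sketch below uses only the standard library. The distribution names (ms-swift, evalscope, opencompass) are assumed to match what the pip commands above install, and installed_versions is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names, taken from the install commands above.
    report = installed_versions(["ms-swift", "evalscope", "opencompass"])
    for name, version in report.items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any entry reported as not installed points at an install step that should be re-run before continuing.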

Launch the web UI:

WEBUI_SERVER=0.0.0.0 WEBUI_PORT=6006 swift web-ui

Open http://192.168.71.11:6006/ in a browser.

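Prepare the dataset. For where the data comes from, see the earlier post on scraping Taobao e-commerce data with Selenium using a loaded user profile directory: http://py3study.com/Article/details/id/20102.html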

Convert the Windows paths to Linux paths:

import json

# Windows image directory recorded in the scraped dataset
win_prefix = "D:\\SpiderTaobao\\images\\"
# Linux directory that replaces it
linux_prefix = '/home/sam_admin/vllm/images/'

# Open input and output once; mode='w' avoids duplicated records on a re-run
with open('taobao_women_clothing.jsonl', mode='r', encoding='utf-8') as src, \
        open('linux_taobao_women_clothing.jsonl', mode='w', encoding='utf-8') as dst:
    for line in src:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Keep only the first image and rewrite its directory prefix
        record['images'] = [record['images'][0].replace(win_prefix, linux_prefix)]
        dst.write(json.dumps(record, ensure_ascii=False) + '\n')

Place the dataset linux_taobao_women_clothing.jsonl and the images directory under /home/sam_admin/vllm/ms-swift.
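Before starting a training run, it can be worth confirming that every image path in the converted file actually exists on the Linux machine, since a wrong prefix would otherwise only surface once training starts. The following is a minimal sketch that assumes each record stores its image paths in an images list, as in the conversion script above; check_dataset is a hypothetical helper name.

```python
import json
import os


def check_dataset(jsonl_path):
    """Return (record_count, missing_image_paths) for a JSONL dataset file."""
    count = 0
    missing = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            count += 1
            # Collect every referenced image that is not a file on disk.
            for path in record.get("images", []):
                if not os.path.isfile(path):
                    missing.append(path)
    return count, missing
```

Calling check_dataset("linux_taobao_women_clothing.jsonl") and finding a non-empty list of missing paths would suggest that the replacement prefix and the directory the images were copied to do not match.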

The fine-tuning command is as follows:

CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model deepseek-ai/Janus-Pro-7B \
    --dataset AI-ModelScope/LaTeX_OCR:human_handwrite#20000 \
    --train_type lora \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --freeze_vit true \
    --gradient_accumulation_steps 16 \
    --eval_steps 100 \
    --save_steps 100 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4

The command above uses the parameters from the official example, and they are optional. The command run here is:

swift sft --model /home/sam_admin/vllm/deepseek-ai/Janus-Pro-7B --dataset linux_taobao_women_clothing.jsonl --train_type lora --torch_dtype bfloat16 --num_train_epochs 1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --learning_rate 1e-4 --lora_rank 8 --lora_alpha 32 --target_modules all-linear --freeze_vit true
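To make a few of the hyperparameters in the official command more concrete, the arithmetic below works out what they imply. It is a rough sketch under stated assumptions: the LoRA scale uses the standard alpha / rank rule, the step count assumes all 20000 sampled items are used for training (none held out for evaluation), and the 4096 x 4096 layer shape is a hypothetical example rather than a value read from the model.

```python
import math

# Values copied from the official command above.
lora_rank = 8
lora_alpha = 32
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_gpus = 1          # CUDA_VISIBLE_DEVICES=0 exposes a single GPU
num_samples = 20000   # the "#20000" suffix on the dataset name

# Standard LoRA scales the low-rank update B @ A by alpha / rank.
lora_scale = lora_alpha / lora_rank

# Number of samples that contribute to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

# Optimizer steps in one epoch, ASSUMING every sample is used for training.
steps_per_epoch = math.ceil(num_samples / effective_batch_size)


def lora_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one linear layer of shape (d_in, d_out)."""
    return rank * (d_in + d_out)


# Hypothetical 4096 x 4096 projection, for illustration only.
print(lora_scale, effective_batch_size, steps_per_epoch,
      lora_params(4096, 4096, lora_rank))
# prints: 4.0 16 1250 65536
```

For comparison, fully training that same hypothetical layer would update 4096 * 4096 = 16,777,216 weights, which is why a rank-8 adapter is so much cheaper to train.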

After fine-tuning, you can run inference. Adding --merge_lora true also merges the LoRA weights into the base model. Run:

swift infer --adapters /home/sam_admin/vllm/ms-swift/output/Janus-Pro-7B/v4-20250423-205908/checkpoint-1 --stream false --max_batch_size 1 --load_data_args true --max_new_tokens 2048 --merge_lora true
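The --adapters path above contains a versioned run directory and a checkpoint number, both of which change from run to run. The sketch below shows one way to locate the newest checkpoint in a run directory and derive the name of the merged directory. The helper names are hypothetical, and the "-merged" suffix is an assumption based on the checkpoint-1-merged path used by the test script later in this post.

```python
import re
from pathlib import Path


def latest_checkpoint(run_dir):
    """Return the checkpoint-N directory with the largest N, or None."""
    best, best_step = None, -1
    for path in Path(run_dir).iterdir():
        # Only plain "checkpoint-<number>" directories count; merged ones are skipped.
        match = re.fullmatch(r"checkpoint-(\d+)", path.name)
        if path.is_dir() and match and int(match.group(1)) > best_step:
            best, best_step = path, int(match.group(1))
    return best


def merged_dir(checkpoint):
    """Sibling directory named like the checkpoint plus a '-merged' suffix."""
    checkpoint = Path(checkpoint)
    return checkpoint.with_name(checkpoint.name + "-merged")
```

For example, latest_checkpoint("output/Janus-Pro-7B/v4-20250423-205908") would return the checkpoint-1 directory for the run shown here, and passing that to merged_dir gives the path ending in checkpoint-1-merged.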

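The merge writes the merged model to a new directory (checkpoint-1-merged in this run).

Test the merged model on a test image (aac.jpg here).

The code is as follows: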

import torch
from transformers import AutoModelForCausalLM
from janus.models import MultiModalityCausalLM, VLChatProcessor
from janus.utils.io import load_pil_images

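# Merged checkpoint directory written by `swift infer ... --merge_lora true`
model_path = "output/Janus-Pro-7B/v4-20250423-205908/checkpoint-1-merged/"

# Path to the test image
image1 = '/home/sam_admin/vllm/ms-swift/aac.jpg'

# Prompt (in Chinese): "Describe this image and give detailed information"
question = '描述这张图片，给出详细的信息'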
vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True
)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

conversation = [
    {
        "role": "<|User|>",
        "content": f"<image_placeholder>\n{question}",
        "images": [image1],
    },
    {"role": "<|Assistant|>", "content": ""},
]

# load images and prepare for inputs
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation, images=pil_images, force_batchify=True
).to(vl_gpt.device)

# run the image encoder to get the image embeddings
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

# run the language model to generate the response
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)

Running the script prints the formatted prompt followed by the model's description of the image.
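When testing several images or questions, building the conversation list by hand gets repetitive. The helper below is a small sketch that reproduces the single-turn structure used in the script above. The function name is hypothetical, and placing one <image_placeholder> per image is an assumption for the multi-image case, which this post does not exercise.

```python
def build_conversation(question, image_paths):
    """Build a single-turn conversation in the format used by the script above."""
    # One placeholder line per image (assumed convention for multiple images).
    placeholders = "".join("<image_placeholder>\n" for _ in image_paths)
    return [
        {
            "role": "<|User|>",
            "content": f"{placeholders}{question}",
            "images": list(image_paths),
        },
        # Empty assistant turn that the model is expected to complete.
        {"role": "<|Assistant|>", "content": ""},
    ]
```

With a single image, build_conversation(question, [image1]) yields the same conversation as the hand-written one, so it can be passed to load_pil_images and vl_chat_processor unchanged.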
