LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method: it approximates the weight update by inserting a pair of low-rank matrices alongside the original weight matrix, which drastically reduces the number of parameters that have to be trained.
Core Principles
The Low-Rank Decomposition Idea
LoRA rests on the assumption that during fine-tuning, the weight update matrix has a low intrinsic dimension.
Mathematical Formulation
Full fine-tuning update: W_new = W_original + ΔW
LoRA update: W_new = W_original + A × B
where:
- W_original: the pretrained weight matrix (d × k)
- A: low-rank matrix (d × r)
- B: low-rank matrix (r × k)
- r: the rank, far smaller than min(d, k)
Parameter Count Comparison
# Parameters in the original weight matrix
original_params = d * k
# LoRA parameters: d*r + r*k
lora_params = r * (d + k)
# Fraction of the original parameter count that LoRA trains
reduction_ratio = lora_params / original_params  # = r * (d + k) / (d * k)
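As a concrete, hypothetical illustration: for a 4096 × 4096 projection (a size typical of 7B-class models) with rank r = 8, LoRA trains well under 1% of that layer's parameters.
d, k, r = 4096, 4096, 8            # hypothetical layer dimensions and rank
original_params = d * k            # 16,777,216
lora_params = r * (d + k)          # 65,536
print(f"{lora_params / original_params:.4%}")   # 0.3906%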
Technical Implementation
A Basic LoRA Implementation
import torch
import torch.nn as nn

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=4, alpha=1):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        # LoRA matrices A and B
        self.lora_A = nn.Parameter(torch.randn(in_features, rank))
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
        # Scaling factor
        self.scaling = alpha / rank

    def forward(self, x):
        # LoRA forward pass: x @ (A @ B)
        return (x @ self.lora_A @ self.lora_B) * self.scaling

class LoRALinear(nn.Module):
    def __init__(self, original_layer, rank=4, alpha=1):
        super().__init__()
        self.original_layer = original_layer
        self.lora = LoRALayer(
            original_layer.in_features,
            original_layer.out_features,
            rank, alpha
        )
        # Freeze the original layer
        for param in self.original_layer.parameters():
            param.requires_grad = False

    def forward(self, x):
        return self.original_layer(x) + self.lora(x)
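A minimal usage sketch of the two classes above (the layer size and rank are arbitrary example values):
# Wrap a frozen 768 -> 768 linear layer with a rank-8 LoRA adapter
base = nn.Linear(768, 768)
layer = LoRALinear(base, rank=8, alpha=16)

out = layer(torch.randn(2, 768))   # output shape matches the original layer's
print(out.shape)                   # torch.Size([2, 768])

# Only the LoRA matrices remain trainable
print([n for n, p in layer.named_parameters() if p.requires_grad])
# ['lora.lora_A', 'lora.lora_B']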
Implementation with the PEFT Library
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model
model = AutoModelForCausalLM.from_pretrained("chatglm3-6b")

# Configure LoRA
lora_config = LoraConfig(
    r=16,                          # rank
    lora_alpha=32,                 # scaling factor
    target_modules=[               # target modules
        "q_proj", "v_proj", "k_proj", "o_proj"
    ],
    lora_dropout=0.1,              # dropout rate
    bias="none",                   # how biases are handled
    task_type="CAUSAL_LM"          # task type
)

# Apply LoRA
model = get_peft_model(model, lora_config)

# Inspect the trainable parameters
model.print_trainable_parameters()
# Output: trainable params: 4,194,304 || all params: 6,244,558,848 || trainable%: 0.067
Key Parameters Explained
rank (r)
- Role: sets the rank of the low-rank matrices and therefore the adapter's capacity
- Effect: a larger r gives more expressive power but also more trainable parameters
- Suggested values:
  - Simple tasks: r = 4-8
  - Medium tasks: r = 16-32
  - Complex tasks: r = 64-128
lora_alpha
- Role: scaling factor that controls how strongly the LoRA update influences the output
- Computation: effective scaling = alpha / r
- Suggested settings (see the short sketch after this list):
  - Conservative: alpha = r
  - Standard: alpha = 2 * r
  - Aggressive: alpha = 4 * r
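A quick sketch of the effective scaling each of these settings produces (r = 16 is just an example value):
r = 16
for label, alpha in [("conservative", r), ("standard", 2 * r), ("aggressive", 4 * r)]:
    print(label, alpha / r)        # 1.0, 2.0, 4.0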
target_modules
- Role: specifies which modules LoRA is applied to
- Common configurations:
# Minimal configuration (attention query and value projections)
target_modules = ["q_proj", "v_proj"]

# Standard configuration (the full attention block)
target_modules = ["q_proj", "v_proj", "k_proj", "o_proj"]

# Full configuration (attention plus the FFN layers)
target_modules = [
    "q_proj", "v_proj", "k_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj"
]
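The projection names above follow the LLaMA-style naming convention; other architectures name their layers differently. A small sketch (an inspection step assumed here, not part of the original example) for listing a loaded model's linear modules, so that target_modules entries match names that actually exist:
import torch.nn as nn

# Print the names of every linear layer in the loaded model;
# target_modules entries must match (suffixes of) these names.
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        print(name)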
Training Workflow
Complete Training Example
from transformers import Trainer, TrainingArguments
from datasets import Dataset

# Prepare the data
def prepare_data(examples):
    inputs = tokenizer(
        examples["instruction"],
        truncation=True,
        padding=True,
        max_length=512
    )
    inputs["labels"] = inputs["input_ids"].copy()
    return inputs

train_dataset = Dataset.from_list(train_data)
train_dataset = train_dataset.map(prepare_data, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./lora_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    warmup_ratio=0.1,
    logging_steps=10,
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
    save_total_limit=2,
    remove_unused_columns=False,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
)

# Start training
trainer.train()

# Save the LoRA weights
model.save_pretrained("./lora_weights")
Model Merging and Deployment
Merging the LoRA Weights
# Option 1: merge in memory (for inference)
merged_model = model.merge_and_unload()

# Option 2: merge permanently and save
from peft import PeftModel

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained("chatglm3-6b")
# Load the LoRA weights
model = PeftModel.from_pretrained(base_model, "./lora_weights")
# Merge and save
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_model")
Inference Optimization
# Inference with the merged model
def inference_with_merged_model(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = merged_model.generate(
            **inputs,
            max_length=512,
            temperature=0.7,
            do_sample=True
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Inference with multi-LoRA adapter switching
def inference_with_adapter_switching(text, adapter_name):
    # Switch to the requested adapter
    model.set_adapter(adapter_name)
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_length=512)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
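For the adapter-switching path to work, each adapter must first be registered on the PeftModel. A minimal sketch, assuming two hypothetical adapter directories ./lora_weights_summarize and ./lora_weights_translate that were trained separately:
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained("chatglm3-6b")

# Register the first adapter under a name, then attach a second one
model = PeftModel.from_pretrained(base_model, "./lora_weights_summarize",
                                  adapter_name="summarize")
model.load_adapter("./lora_weights_translate", adapter_name="translate")

# Switch between adapters at inference time
print(inference_with_adapter_switching("Summarize: ...", "summarize"))
print(inference_with_adapter_switching("Translate: ...", "translate"))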
Advantages and Limitations
Advantages
- Parameter efficiency: typically only 0.1%-1% of the parameters need to be trained
- Memory friendly: GPU memory usage drops substantially
- Fast training: fewer parameters mean shorter training runs
- Easy to manage: LoRA weight files are small and easy to store and share
- Composability: multiple LoRA adapters can be combined
Limitations
- Limited expressive power: the low-rank assumption may cap model capacity
- Task dependence: results may suffer on tasks that differ greatly from pre-training
- Hyperparameter sensitivity: the choice of r and alpha strongly affects results
- Inference overhead: unmerged adapters add extra computation at inference time
Best Practices
Parameter Selection Strategy
def get_lora_config_by_task(task_type, model_size):
    """Pick a LoRA configuration based on task type and model size."""
    configs = {
        "classification": {
            "small": {"r": 8, "alpha": 16},
            "medium": {"r": 16, "alpha": 32},
            "large": {"r": 32, "alpha": 64}
        },
        "generation": {
            "small": {"r": 16, "alpha": 32},
            "medium": {"r": 32, "alpha": 64},
            "large": {"r": 64, "alpha": 128}
        },
        "complex_reasoning": {
            "small": {"r": 32, "alpha": 64},
            "medium": {"r": 64, "alpha": 128},
            "large": {"r": 128, "alpha": 256}
        }
    }
    return configs.get(task_type, {}).get(model_size, {"r": 16, "alpha": 32})
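A usage sketch that feeds the lookup result into LoraConfig (the task and size labels are the ones defined in the function above):
cfg = get_lora_config_by_task("generation", "medium")   # {"r": 32, "alpha": 64}
lora_config = LoraConfig(
    r=cfg["r"],
    lora_alpha=cfg["alpha"],
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)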
Training Monitoring
def monitor_lora_training(trainer):
    """Build helpers for monitoring a LoRA training run."""
    # Global L2 norm of the gradients on the LoRA parameters
    def log_gradient_norm(model):
        total_norm = 0.0
        for name, param in model.named_parameters():
            if param.grad is not None and "lora" in name:
                param_norm = param.grad.data.norm(2)
                total_norm += param_norm.item() ** 2
        return total_norm ** 0.5

    # How far each LoRA weight has drifted from its initial value
    def log_lora_weight_changes(model, initial_weights):
        changes = {}
        for name, param in model.named_parameters():
            if "lora" in name and name in initial_weights:
                changes[name] = torch.norm(param.data - initial_weights[name]).item()
        return changes

    return log_gradient_norm, log_lora_weight_changes
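A usage sketch with the Trainer from the training example above; the LoRA weights are snapshotted before training so the drift helper has a reference point (the gradient-norm helper is more useful when called during training, e.g. from a logging callback, while gradients are still populated):
# Snapshot the LoRA weights before training starts
initial_weights = {
    name: param.data.clone()
    for name, param in model.named_parameters() if "lora" in name
}

log_gradient_norm, log_lora_weight_changes = monitor_lora_training(trainer)

trainer.train()

# How far each LoRA matrix moved during training
print(log_lora_weight_changes(model, initial_weights))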
Related Concepts
- PEFT (parameter-efficient fine-tuning) - the family of techniques LoRA belongs to
- QLoRA fine-tuning - a quantized variant of LoRA
- Full-parameter fine-tuning - the traditional approach, for comparison
- Fine-tuning hyperparameter optimization - parameter tuning strategies
- Model inference and deployment - deployment-related techniques