Qwen2-Audio-7B-DPO-CodeSwitch
A LoRA adapter for Qwen/Qwen2-Audio-7B-Instruct fine-tuned with DPO (Direct Preference Optimization) on code-switching speech transcription data.
Evaluation Results (MER - Mixed Error Rate, lower is better)
| Benchmark | Baseline | This Model | Improvement |
|---|---|---|---|
| SEAME | 0.6681 | 0.5692 | +14.8% |
| EMILIA | 0.5267 | 0.4766 | +9.5% |
| CS Dialogue | 0.5073 | 0.3631 | +28.4% |
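The Improvement column appears to be the relative MER reduction with respect to the baseline, i.e. `(baseline - model) / baseline`. The short check below recomputes it from the MER values in the table; the function and variable names are illustrative only.

```python
# MER values copied from the results table: (baseline, this model).
results = {
    "SEAME": (0.6681, 0.5692),
    "EMILIA": (0.5267, 0.4766),
    "CS Dialogue": (0.5073, 0.3631),
}


def relative_improvement(baseline: float, model: float) -> float:
    """Relative MER reduction in percent (positive means the model is better)."""
    return (baseline - model) / baseline * 100.0


improvements = {
    name: round(relative_improvement(base, model), 1)
    for name, (base, model) in results.items()
}

for name, pct in improvements.items():
    print(f"{name}: +{pct}%")
```

Rounded to one decimal place, the recomputed values match the percentages reported in the table.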
Benchmark Descriptions
- SEAME: English-Mandarin code-switching conversational speech from Singapore/Malaysia (out-of-distribution test set, 9,764 samples)
- EMILIA: In-distribution evaluation set (1,000 samples)
- CS Dialogue: In-distribution evaluation set (359 samples)
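This card does not spell out how MER is computed. A common convention for English-Mandarin code-switching is to count each Chinese character and each English word as one token, then divide the edit distance by the number of reference tokens. The sketch below follows that convention; the tokenization regex, the lowercasing, and all function names are assumptions for illustration and may differ from the scoring script behind the numbers above.

```python
import re

# Assumed tokenization: one token per CJK character,
# one token per run of Latin letters, digits, or apostrophes.
_TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9']+")


def mixed_tokens(text: str) -> list:
    """Split text into Mandarin characters and lowercased English words."""
    return _TOKEN_RE.findall(text.lower())


def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mixed_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over mixed tokens, normalized by reference length."""
    ref = mixed_tokens(reference)
    hyp = mixed_tokens(hypothesis)
    if not ref:
        raise ValueError("reference contains no tokens")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one inserted word against four reference tokens.
example_mer = mixed_error_rate("我 想 order pizza", "我想 order a pizza")
print(example_mer)
```

Because whitespace between Chinese characters is ignored by this tokenizer, segmentation differences such as `我 想` versus `我想` do not count as errors under this assumed definition.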
Training Configuration
Model Architecture
| Parameter | Value |
|---|---|
| Base Model | Qwen/Qwen2-Audio-7B-Instruct |
| Adapter Type | LoRA (Low-Rank Adaptation) |
| LoRA Rank (r) | 256 |
| LoRA Alpha | 128 |
| LoRA Dropout | 0.05 |
| LoRA Target Modules | All attention (q_proj, k_proj, v_proj, o_proj) + MLP (up_proj, down_proj, gate_proj) |
| Trainable Parameters | ~1.28B (adapter only) |
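To make the rank and alpha settings concrete: with standard LoRA scaling (an assumption here; rank-stabilized scaling would differ), the low-rank update is multiplied by `alpha / r`, and each adapted weight matrix of shape `d_out x d_in` adds `r * (d_in + d_out)` trainable parameters. The sketch below works through both; the 4096 x 4096 projection shape is a hypothetical example, not a value read from the model config.

```python
# Values copied from the Model Architecture table.
lora_settings = {
    "r": 256,
    "lora_alpha": 128,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "down_proj", "gate_proj",
    ],
}


def lora_scaling(r: int, lora_alpha: int) -> float:
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


scaling = lora_scaling(lora_settings["r"], lora_settings["lora_alpha"])
# Hypothetical 4096 x 4096 projection, used only to illustrate the formula.
params_per_matrix = lora_param_count(4096, 4096, lora_settings["r"])

print(f"scaling = {scaling}")
print(f"params for one 4096x4096 matrix = {params_per_matrix:,}")
```

With alpha set to half the rank, the update is scaled by 0.5 under this assumption.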
Training Hyperparameters
| Parameter | Value |
|---|---|
| Training Method | DPO (Direct Preference Optimization) |
| DPO Beta (β) | 0.3 |
| DPO Loss | Sigmoid |
| Learning Rate | 3e-5 |
| LR Scheduler | Cosine |
| Warmup Ratio | 0.1 |
| Batch Size (per device) | 1 |
| Gradient Accumulation Steps | 4 |
| Global Batch Size | 32 (8 GPUs × 1 × 4) |
| Precision | BF16 |
| Max Sequence Length | 8192 |
| Weight Decay | 0.01 |
| Max Gradient Norm | 1.0 |
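For readers unfamiliar with the sigmoid DPO objective, the per-pair loss is `-log(sigmoid(beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))))`, where each term is a sequence log-probability under the policy or the frozen reference model. The sketch below evaluates that formula with the beta from the table and also reproduces the global batch size arithmetic. It is a plain-Python illustration with hypothetical log-probabilities, not the training code used for this adapter.

```python
import math

BETA = 0.3  # DPO beta from the hyperparameter table


def dpo_sigmoid_loss(policy_chosen_logp: float,
                     policy_rejected_logp: float,
                     ref_chosen_logp: float,
                     ref_rejected_logp: float,
                     beta: float = BETA) -> float:
    """Sigmoid DPO loss for one preference pair of sequence log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Global batch size: GPUs x per-device batch size x gradient accumulation steps.
num_gpus, per_device_batch, grad_accum = 8, 1, 4
global_batch_size = num_gpus * per_device_batch * grad_accum

# Hypothetical log-probabilities: the policy prefers the chosen transcript
# more strongly than the reference does, so the loss falls below log(2).
loss_good = dpo_sigmoid_loss(-10.0, -20.0, -12.0, -15.0)
loss_bad = dpo_sigmoid_loss(-20.0, -10.0, -15.0, -12.0)
print(global_batch_size, loss_good, loss_bad)
```

When the policy and the reference agree exactly, the loss equals `log(2)`; a larger beta penalizes deviations from the reference preference margin more sharply.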
Usage
import librosa
import torch
from peft import PeftModel
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

# Load the base model in BF16 and attach the LoRA adapter.
base_model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(base_model, "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch")
model.eval()

# Build a chat-style prompt containing one audio clip and the instruction.
audio_path = "path/to/audio.wav"
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": audio_path},
        {"type": "text", "text": "Please transcribe this speech."},
    ]}
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Resample the audio to the rate expected by the feature extractor.
audios = [librosa.load(audio_path, sr=processor.feature_extractor.sampling_rate)[0]]
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens so that only the generated transcription is decoded.
generated_ids = generated_ids[:, inputs["input_ids"].size(1):]
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
Sample Outputs
Example 1: Language Mixing
|
Text |
| Ground Truth |
german θ· english spanish θ· english ζ―θΎ ε |
| Baseline |
German and English, Spanish and English. |
| This Model |
German θ· English Spanish θ·English ζ―θΎε |
Example 2: Code-Switching Preservation
| | Text |
|---|---|
| Ground Truth | 不能 不能 carry forward 也 不能 换成 金钱 |
| Baseline | 不能不能 carefree, also can't be replaced by money. |
| This Model | 不能不能 carry forward 也不能换成金钱 |
Example 3: Mixed Language Utterance
| | Text |
|---|---|
| Ground Truth | then 我 最近 发现 like more and more people becoming vegetarians |
| Baseline | 因为我最近发现越来越多的人成为素食者 (fully translated) |
| This Model | 因为我最近发现 like more and more people becoming vegetarians |
Files
├── README.md                      # This file
├── adapter_config.json            # LoRA configuration
├── adapter_model.safetensors      # LoRA adapter weights (~1.28 GB)
├── tokenizer files                # Tokenizer assets
└── eval_results/
    ├── baseline_seame.json        # Baseline model results on SEAME
    ├── baseline_emilia.json       # Baseline model results on EMILIA
    ├── baseline_csdialogue.json   # Baseline model results on CS Dialogue
    ├── trained_seame.json         # This model's results on SEAME
    ├── trained_emilia.json        # This model's results on EMILIA
    └── trained_csdialogue.json    # This model's results on CS Dialogue
License
This adapter inherits the license of the base Qwen2-Audio model.