Qwen2-Audio-7B-DPO-CodeSwitch

A LoRA adapter for Qwen/Qwen2-Audio-7B-Instruct fine-tuned with DPO (Direct Preference Optimization) on code-switching speech transcription data.

Evaluation Results (MER - Mixed Error Rate, lower is better)

| Benchmark   | Baseline | This Model | Improvement |
|-------------|----------|------------|-------------|
| SEAME       | 0.6681   | 0.5692     | +14.8%      |
| EMILIA      | 0.5267   | 0.4766     | +9.5%       |
| CS Dialogue | 0.5073   | 0.3631     | +28.4%      |

Benchmark Descriptions

  • SEAME: English-Mandarin code-switching conversational speech from Singapore/Malaysia (out-of-distribution test set, 9,764 samples)
  • EMILIA: In-distribution evaluation set (1,000 samples)
  • CS Dialogue: In-distribution evaluation set (359 samples)
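
The card does not spell out how MER is computed. For Mandarin-English code-switching it is common to score Mandarin at the character level and English at the word level, then compute an edit-distance error rate over the mixed token sequence. The sketch below assumes that convention; text normalization and punctuation handling are omitted, and the function names (tokenize_mixed, mixed_error_rate) are illustrative rather than part of this repository.

import unicodedata

def tokenize_mixed(text):
    """One token per CJK character, one token per whitespace-separated Latin word."""
    tokens = []
    for chunk in text.split():
        buf = ""
        for ch in chunk:
            if unicodedata.name(ch, "").startswith("CJK"):
                if buf:
                    tokens.append(buf)
                    buf = ""
                tokens.append(ch)
            else:
                buf += ch
        if buf:
            tokens.append(buf)
    return tokens

def mixed_error_rate(reference, hypothesis):
    """Levenshtein distance over mixed tokens, normalized by reference length."""
    ref, hyp = tokenize_mixed(reference), tokenize_mixed(hypothesis)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one deleted Mandarin word (two characters) out of eight mixed tokens -> 0.25
print(mixed_error_rate("我 最近 发现 like more people", "我 发现 like more people"))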

Training Configuration

Model Architecture

| Parameter            | Value |
|----------------------|-------|
| Base Model           | Qwen/Qwen2-Audio-7B-Instruct |
| Adapter Type         | LoRA (Low-Rank Adaptation) |
| LoRA Rank (r)        | 256 |
| LoRA Alpha           | 128 |
| LoRA Dropout         | 0.05 |
| LoRA Target Modules  | All attention (q_proj, k_proj, v_proj, o_proj) + MLP (up_proj, down_proj, gate_proj) |
| Trainable Parameters | ~1.28B (adapter only) |
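
For reference, the adapter settings above correspond roughly to the following peft LoraConfig. This is a sketch reconstructed from the table, not the original training script; the authoritative values are in adapter_config.json.

from peft import LoraConfig

lora_config = LoraConfig(
    r=256,                   # LoRA rank
    lora_alpha=128,          # scaling factor (alpha / r = 0.5)
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "up_proj", "down_proj", "gate_proj",      # MLP projections
    ],
    task_type="CAUSAL_LM",   # assumed; check adapter_config.json for the exact task type
)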

Training Hyperparameters

| Parameter                   | Value |
|-----------------------------|-------|
| Training Method             | DPO (Direct Preference Optimization) |
| DPO Beta (β)                | 0.3 |
| DPO Loss                    | Sigmoid |
| Learning Rate               | 3e-5 |
| LR Scheduler                | Cosine |
| Warmup Ratio                | 0.1 |
| Batch Size (per device)     | 1 |
| Gradient Accumulation Steps | 4 |
| Global Batch Size           | 32 (8 GPUs × 1 × 4) |
| Precision                   | BF16 |
| Max Sequence Length         | 8192 |
| Weight Decay                | 0.01 |
| Max Gradient Norm           | 1.0 |
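
As a rough guide, these hyperparameters map onto trl's DPOConfig as sketched below. The preference dataset, audio-aware collator, and distributed launch are omitted, and the output directory is a placeholder; this is not a copy of the original training script.

from trl import DPOConfig

dpo_args = DPOConfig(
    output_dir="qwen2-audio-dpo-codeswitch",  # placeholder
    beta=0.3,
    loss_type="sigmoid",
    learning_rate=3e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,            # 8 GPUs x 1 x 4 = 32 effective batch
    bf16=True,
    max_length=8192,
    weight_decay=0.01,
    max_grad_norm=1.0,
)
# trainer = DPOTrainer(model=peft_model, args=dpo_args,
#                      train_dataset=preference_pairs, processing_class=processor)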

Usage

from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor
from peft import PeftModel
import librosa
import torch

# Load base model
base_model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    trust_remote_code=True
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "myaccountfor/Qwen2-Audio-7B-DPO-CodeSwitch")
model.eval()

# Inference example
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "path/to/audio.wav"},
        {"type": "text", "text": "Please transcribe this speech."}
    ]}
]

text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios = [librosa.load("path/to/audio.wav", sr=processor.feature_extractor.sampling_rate)[0]]

inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens so only the newly generated text is decoded
generated_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
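
If a standalone checkpoint is more convenient for deployment, the LoRA weights can optionally be folded into the base model with peft's merge_and_unload (the save path below is just an example):

# Optional: merge the adapter into the base weights for slightly faster inference
merged_model = model.merge_and_unload()
# merged_model.save_pretrained("qwen2-audio-7b-codeswitch-merged")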

Sample Outputs

Example 1: Language Mixing

Ground Truth: german 跟 english spanish 跟 english 比较 像
Baseline:     German and English, Spanish and English.
This Model:   German 跟 English Spanish 跟English 比较像

Example 2: Code-Switching Preservation

Ground Truth: 不能 不能 carry forward 也 不能 捒成 金钱
Baseline:     不能不能 carefree, also can't be replaced by money.
This Model:   不能不能 carry forward 也不能换成金钱

Example 3: Mixed Language Utterance

Ground Truth: then 我 最近 发现 like more and more people becoming vegetarians
Baseline:     因为我最近发现越来越多的人成为素食者 (fully translated)
This Model:   因为我最近发现 like more and more people becoming vegetarians

Files

├── README.md                      # This file
├── adapter_config.json            # LoRA configuration
├── adapter_model.safetensors      # LoRA adapter weights (~1.28 GB)
├── tokenizer files                # Tokenizer assets
└── eval_results/
    ├── baseline_seame.json        # Baseline model results on SEAME
    ├── baseline_emilia.json       # Baseline model results on EMILIA
    ├── baseline_csdialogue.json   # Baseline model results on CS Dialogue
    ├── trained_seame.json         # This model's results on SEAME
    ├── trained_emilia.json        # This model's results on EMILIA
    └── trained_csdialogue.json    # This model's results on CS Dialogue

License

This adapter inherits the license of the base Qwen2-Audio model.
