# Eva-4B: Financial Evasion Detection Model
Eva-4B is a 4B-parameter model for detecting evasive answers in earnings call Q&A.
## Model Summary

- Model name: Eva-4B
- Task: 3-way classification of Q&A pairs into `direct`, `intermediate`, and `fully_evasive`
- Base model: Qwen/Qwen3-4B-Instruct-2507
- Training method: full-parameter fine-tuning
- Training data: EvasionBench training set (30,000 samples; 10,000 per class)
## Intended Use
Eva-4B is intended for research and tooling around corporate disclosure quality and evasiveness in earnings call Q&A.
## Task Definition
Given an earnings call Question (analyst) and Answer (management), the model predicts one of:
- direct: answers the core question with specific information
- intermediate: provides related information but sidesteps the core question
- fully_evasive: does not address the question (refusal, redirection, non-response)
This taxonomy follows the Rasiah framework referenced in the paper.
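To make the taxonomy concrete, the snippet below shows hypothetical answer snippets for each class in response to a question such as "What is your revenue guidance for next quarter?"; these examples are written for illustration only and are not drawn from EvasionBench.

```python
# Hypothetical answers illustrating the three classes (illustrative only;
# not samples from EvasionBench).
EXAMPLE_ANSWERS = {
    "direct": "We expect revenue of roughly $2.0-2.1 billion next quarter.",
    "intermediate": "Demand trends remain healthy, and we feel good about our product pipeline.",
    "fully_evasive": "We don't provide forward-looking guidance at this time.",
}
```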
## Dataset: EvasionBench (as reported in the paper)

### Sources

- Earnings call transcripts from the S&P Capital IQ database.

### Splits
- Training: 30,000 samples (balanced)
  - direct: 10,000
  - intermediate: 10,000
  - fully_evasive: 10,000
- Test (Human): 1,000 samples (natural distribution)
  - direct: 412 (41.2%)
  - intermediate: 256 (25.6%)
  - fully_evasive: 332 (33.2%)
### Labeling / Construction
The training set is constructed via a multi-model annotation framework:
- Two annotators: Claude Opus 4.5 and Gemini-3-Flash
- Agreement cases (~70–80%) are treated as high-confidence
- Disagreement cases (~20–30%) are resolved by an LLM-as-Judge protocol using Claude Opus 4.5
- Final training mix reported: ~25,000 consensus samples (83.5%) + ~5,000 judge-resolved samples (16.5%)
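The annotation pipeline itself is not part of this repository, but the consensus-plus-judge control flow described above can be sketched as follows. Here `annotate_a`, `annotate_b`, and `judge` are placeholder callables standing in for the two annotator models (Claude Opus 4.5 and Gemini-3-Flash in the paper) and the LLM-as-Judge step; this is an illustrative sketch, not the paper's actual implementation.

```python
from typing import Callable

def label_sample(
    question: str,
    answer: str,
    annotate_a: Callable[[str, str], str],
    annotate_b: Callable[[str, str], str],
    judge: Callable[[str, str, str, str], str],
) -> tuple[str, str]:
    """Return (label, source), where source is 'consensus' or 'judge'."""
    label_a = annotate_a(question, answer)
    label_b = annotate_b(question, answer)
    if label_a == label_b:
        # Agreement (~70-80% of samples): keep as a high-confidence consensus label.
        return label_a, "consensus"
    # Disagreement (~20-30% of samples): let the judge pick between the two candidates.
    return judge(question, answer, label_a, label_b), "judge"
```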
### Human validation (test set)
- A 100-sample subset is double-annotated by two experts.
- Reported inter-annotator agreement: Cohen’s Kappa = 0.835.
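For reference, Cohen's Kappa on such a double-annotated subset can be computed with scikit-learn; the two label lists below are placeholders, not the experts' actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder labels for the two expert annotators on the 100-sample subset.
annotator_1 = ["direct", "intermediate", "fully_evasive", "direct"]
annotator_2 = ["direct", "fully_evasive", "fully_evasive", "direct"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's Kappa: {kappa:.3f}")
```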
## Training Details
- Base model: Qwen3-4B-Instruct-2507
- Fine-tuning: full-parameter fine-tuning
- Framework: MS-Swift
- Hardware: 2× NVIDIA B200 SXM6 (180GB VRAM each)
- Epochs: 2
- Learning rate: 2e-5 (linear warmup; 3% warmup ratio)
- Batch size: 8 per GPU
- Gradient accumulation: 2 (effective batch size 32)
- Precision: bfloat16
- Max sequence length: 2048
- Optimizer: AdamW
- Gradient checkpointing: enabled
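For readers who want to approximate this setup outside MS-Swift, the reported hyperparameters map onto Hugging Face `TrainingArguments` roughly as below. This is an illustrative sketch, not the actual Eva-4B training configuration or script.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters onto transformers'
# TrainingArguments; the actual run used the MS-Swift framework.
training_args = TrainingArguments(
    output_dir="eva-4b-sft",
    num_train_epochs=2,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,                 # 3% linear warmup
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,     # effective batch size 32 on 2 GPUs
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch",
)
# Note: the max sequence length (2048) is applied at tokenization time, not here.
```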
## Performance

### Top-5 models on the 1,000-sample human test set
| Rank | Model | Accuracy | F1-Macro |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 83.9% | 0.838 |
| 2 | Gemini-3-Flash | 83.7% | 0.833 |
| 3 | GLM-4.7 | 82.6% | 0.809 |
| 4 | Eva-4B (Ours) | 81.3% | 0.807 |
| 5 | GPT-5.2 | 80.5% | 0.805 |
Note: by accuracy, Eva-4B ranks second among the open-source models in this comparison, behind GLM-4.7 (82.6%).
### Per-class F1 (Eva-4B)
| Class | F1 |
|---|---|
| direct | 0.851 |
| intermediate | 0.698 |
| fully_evasive | 0.873 |
The paper notes that most errors stem from confusion between the direct and intermediate classes.
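The reported metrics can be recomputed from predictions with scikit-learn; `gold` and `pred` below are placeholder lists, not the actual test-set labels or model outputs.

```python
from sklearn.metrics import accuracy_score, f1_score, classification_report

LABELS = ["direct", "intermediate", "fully_evasive"]

# Placeholder gold labels and predictions; in practice these come from the
# 1,000-sample human test set and the model's outputs.
gold = ["direct", "intermediate", "fully_evasive", "direct"]
pred = ["direct", "direct", "fully_evasive", "direct"]

print("Accuracy:", accuracy_score(gold, pred))
print("F1-Macro:", f1_score(gold, pred, labels=LABELS, average="macro"))
print(classification_report(gold, pred, labels=LABELS, digits=3))
```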
### Ablation (label-source comparison)
The paper compares Eva-4B training labels (multi-model + judge) vs an Opus-only construction:
- Qwen-Opus-Only: 78.9% accuracy
- Eva-4B: 81.3% accuracy (+2.4 percentage points over the Opus-only baseline)
The paper reports the Opus-only baseline achieves lower training loss but worse generalization.
## Quick Start

The prompt below matches `prompts/evasion_rasiah_fine_tuning_minimalist.txt` in this repo.
````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "FutureMa/Eva-4B"

# Load the tokenizer and model; torch_dtype="auto" uses the dtype stored in the model config.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Prompt template matching prompts/evasion_rasiah_fine_tuning_minimalist.txt.
PROMPT_TEMPLATE = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A
Question: {{question}}
Answer: {{answer}}
Response format:
```json
{"reason": "brief explanation under 100 characters", "label": "direct|intermediate|fully_evasive"}
```
Answer in json block content, no other text"""

question = "What are your revenue expectations for next quarter?"
answer = "We remain optimistic about our business trajectory and will continue to focus on executing our strategic priorities."

# Fill the template placeholders.
prompt = (
    PROMPT_TEMPLATE
    .replace("{{question}}", question)
    .replace("{{answer}}", answer)
)

# Build the chat-formatted input and generate.
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.7,
        do_sample=True,
    )

# Decode only the newly generated tokens (strip the prompt).
generated = output_ids[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
print(response)
````
Expected output format:

```json
{"reason": "...", "label": "direct|intermediate|fully_evasive"}
```
## Limitations
- Domain-specific to earnings call Q&A
- English-only evaluation
- Multi-model + judge labeling increases annotation cost (~2.2–2.3× vs single-model)
- Judge position bias risk (no position randomization)
- Potential self-preference concerns (Opus judging its own predictions)
- Subjectivity in the intermediate class (lower agreement)
- Temporal drift (training data spans 2005–2023)
## Ethics
Eva-4B is a research artifact and not financial advice. Outputs should be used as one signal among many and should be reviewed by humans for high-stakes decisions.
## Citation
If you use this model, please cite the accompanying paper:
```bibtex
@article{ma_evasionbench,
  title={EvasionBench: Detecting Evasive Answers in Financial Q\&A via Multi-Model Consensus and LLM-as-Judge},
  author={Ma, Shijian}
}
```
## Author
- Shijian Ma (mas8069@foxmail.com)
Last updated: 2026-01-12