# Eva-4B: Financial Evasion Detection Model
Eva-4B is a 4B-parameter model for detecting evasive answers in earnings call Q&A.
## Model Summary

- Model name: Eva-4B
- Task: 3-way classification of Q&A pairs into `direct`, `intermediate`, and `fully_evasive`
- Base model: Qwen/Qwen3-4B-Instruct-2507
- Training method: full-parameter fine-tuning
- Training data: EvasionBench training set (30,000 samples; 10,000 per class)
## Intended Use
Eva-4B is intended for research and tooling around corporate disclosure quality and evasiveness in earnings call Q&A.
## Task Definition
Given an earnings call Question (analyst) and Answer (management), the model predicts one of:
- direct: answers the core question with specific information
- intermediate: provides related information but sidesteps the core question
- fully_evasive: does not address the question (refusal, redirection, non-response)
This taxonomy follows the Rasiah framework referenced in the paper.
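To make the taxonomy concrete, the snippet below shows hypothetical answer snippets for each class in response to a question such as "What is your revenue guidance for next quarter?"; these examples are written for illustration only and are not drawn from EvasionBench.

```python
# Hypothetical answers illustrating the three classes (illustrative only;
# not samples from EvasionBench).
EXAMPLE_ANSWERS = {
    "direct": "We expect revenue of roughly $2.0-2.1 billion next quarter.",
    "intermediate": "Demand trends remain healthy, and we feel good about our product pipeline.",
    "fully_evasive": "We don't provide forward-looking guidance at this time.",
}
```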
## Dataset: EvasionBench (as reported in the paper)

### Sources

- Earnings call transcripts from the S&P Capital IQ database.

### Splits
- Training: 30,000 samples (balanced)
  - direct: 10,000
  - intermediate: 10,000
  - fully_evasive: 10,000
- Test (Human): 1,000 samples (natural distribution)
  - direct: 412 (41.2%)
  - intermediate: 256 (25.6%)
  - fully_evasive: 332 (33.2%)
### Labeling / Construction
The training set is constructed via a multi-model annotation framework:
- Two annotators: Claude Opus 4.5 and Gemini-3-Flash
- Agreement cases (~70–80%) are treated as high-confidence
- Disagreement cases (~20–30%) are resolved by an LLM-as-Judge protocol using Claude Opus 4.5
- Final training mix reported: ~25,000 consensus samples (83.5%) + ~5,000 judge-resolved samples (16.5%)
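The annotation pipeline itself is not part of this repository, but the consensus-plus-judge control flow described above can be sketched as follows. Here `annotate_a`, `annotate_b`, and `judge` are placeholder callables standing in for the two annotator models (Claude Opus 4.5 and Gemini-3-Flash in the paper) and the LLM-as-Judge step; this is an illustrative sketch, not the paper's actual implementation.

```python
from typing import Callable

def label_sample(
    question: str,
    answer: str,
    annotate_a: Callable[[str, str], str],
    annotate_b: Callable[[str, str], str],
    judge: Callable[[str, str, str, str], str],
) -> tuple[str, str]:
    """Return (label, source), where source is 'consensus' or 'judge'."""
    label_a = annotate_a(question, answer)
    label_b = annotate_b(question, answer)
    if label_a == label_b:
        # Agreement (~70-80% of samples): keep as a high-confidence consensus label.
        return label_a, "consensus"
    # Disagreement (~20-30% of samples): let the judge pick between the two candidates.
    return judge(question, answer, label_a, label_b), "judge"
```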
### Human validation (test set)
- A 100-sample subset is double-annotated by two experts.
- Reported inter-annotator agreement: Cohen’s Kappa = 0.835.
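For reference, Cohen's Kappa on such a double-annotated subset can be computed with scikit-learn; the two label lists below are placeholders, not the experts' actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder labels for the two expert annotators on the 100-sample subset.
annotator_1 = ["direct", "intermediate", "fully_evasive", "direct"]
annotator_2 = ["direct", "fully_evasive", "fully_evasive", "direct"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's Kappa: {kappa:.3f}")
```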
## Training Details
- Base model: Qwen3-4B-Instruct-2507
- Fine-tuning: full-parameter fine-tuning
- Framework: MS-Swift
- Hardware: 2× NVIDIA B200 SXM6 (180GB VRAM each)
- Epochs: 2
- Learning rate: 2e-5 (linear warmup; 3% warmup ratio)
- Batch size: 8 per GPU
- Gradient accumulation: 2 (effective batch size 32)
- Precision: bfloat16
- Max sequence length: 2048
- Optimizer: AdamW
- Gradient checkpointing: enabled
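For readers who want to approximate this setup outside MS-Swift, the reported hyperparameters map onto Hugging Face `TrainingArguments` roughly as below. This is an illustrative sketch, not the actual Eva-4B training configuration or script.

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters onto transformers'
# TrainingArguments; the actual run used the MS-Swift framework.
training_args = TrainingArguments(
    output_dir="eva-4b-sft",
    num_train_epochs=2,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,                 # 3% linear warmup
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,     # effective batch size 32 on 2 GPUs
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch",
)
# Note: the max sequence length (2048) is applied at tokenization time, not here.
```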
## Performance

### Top-5 models on the 1,000-sample human test set
| Rank | Model | Accuracy | F1-Macro |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 83.9% | 0.838 |
| 2 | Gemini-3-Flash | 83.7% | 0.833 |
| 3 | GLM-4.7 | 82.6% | 0.809 |
| 4 | Eva-4B (Ours) | 81.3% | 0.807 |
| 5 | GPT-5.2 | 80.5% | 0.805 |
Note: by accuracy, Eva-4B ranks second among the open-source models in this comparison, behind GLM-4.7 (82.6%).
### Per-class F1 (Eva-4B)
| Class | F1 |
|---|---|
| direct | 0.851 |
| intermediate | 0.698 |
| fully_evasive | 0.873 |
The paper notes that most errors stem from confusion between the direct and intermediate classes.
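The reported metrics can be recomputed from predictions with scikit-learn; `gold` and `pred` below are placeholder lists, not the actual test-set labels or model outputs.

```python
from sklearn.metrics import accuracy_score, f1_score, classification_report

LABELS = ["direct", "intermediate", "fully_evasive"]

# Placeholder gold labels and predictions; in practice these come from the
# 1,000-sample human test set and the model's outputs.
gold = ["direct", "intermediate", "fully_evasive", "direct"]
pred = ["direct", "direct", "fully_evasive", "direct"]

print("Accuracy:", accuracy_score(gold, pred))
print("F1-Macro:", f1_score(gold, pred, labels=LABELS, average="macro"))
print(classification_report(gold, pred, labels=LABELS, digits=3))
```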
### Ablation (label-source comparison)
The paper compares Eva-4B training labels (multi-model + judge) vs an Opus-only construction:
- Qwen-Opus-Only: 78.9% accuracy
- Eva-4B: 81.3% accuracy (+2.4 percentage points over the Opus-only baseline)
The paper reports the Opus-only baseline achieves lower training loss but worse generalization.
## Quick Start

The prompt below matches `prompts/evasion_rasiah_fine_tuning_minimalist.txt` in this repo.
````python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "FutureMa/Eva-4B"

# Load the tokenizer and model; torch_dtype="auto" uses the dtype stored in the model config.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Prompt template matching prompts/evasion_rasiah_fine_tuning_minimalist.txt.
PROMPT_TEMPLATE = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A
Question: {{question}}
Answer: {{answer}}
Response format:
```json
{"reason": "brief explanation under 100 characters", "label": "direct|intermediate|fully_evasive"}
```
Answer in json block content, no other text"""

question = "What are your revenue expectations for next quarter?"
answer = "We remain optimistic about our business trajectory and will continue to focus on executing our strategic priorities."

# Fill the template placeholders.
prompt = (
    PROMPT_TEMPLATE
    .replace("{{question}}", question)
    .replace("{{answer}}", answer)
)

# Build the chat-formatted input and generate.
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.7,
        do_sample=True,
    )

# Decode only the newly generated tokens (strip the prompt).
generated = output_ids[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
print(response)
````
Expected output format:

```json
{"reason": "...", "label": "direct|intermediate|fully_evasive"}
```
## Limitations
- Domain-specific to earnings call Q&A
- English-only evaluation
- Multi-model + judge labeling increases annotation cost (~2.2–2.3× vs single-model)
- Judge position bias risk (no position randomization)
- Potential self-preference concerns (Opus judging its own predictions)
- Subjectivity in the intermediate class (lower agreement)
- Temporal drift (training data spans 2005–2023)
## Ethics
Eva-4B is a research artifact and not financial advice. Outputs should be used as one signal among many and should be reviewed by humans for high-stakes decisions.
## Citation
If you use this model, please cite the accompanying paper:
```bibtex
@article{ma_evasionbench,
  title={EvasionBench: Detecting Evasive Answers in Financial Q\&A via Multi-Model Consensus and LLM-as-Judge},
  author={Ma, Shijian}
}
```
## Author
- Shijian Ma (mas8069@foxmail.com)
Last updated: 2026-01-12