Eva-4B: Financial Evasion Detection Model

Eva-4B is a 4B-parameter model for detecting evasive answers in earnings call Q&A.

Model Summary

  • Model name: Eva-4B
  • Task: 3-way classification of Q&A pairs into:
    • direct
    • intermediate
    • fully_evasive
  • Base model: Qwen/Qwen3-4B-Instruct-2507
  • Training method: full-parameter fine-tuning
  • Training data: EvasionBench training set (30,000 samples; 10,000 per class)

Intended Use

Eva-4B is intended for research and tooling around corporate disclosure quality and evasiveness in earnings call Q&A.

Task Definition

Given an earnings call Question (analyst) and Answer (management), the model predicts one of:

  • direct: answers the core question with specific information
  • intermediate: provides related information but sidesteps the core question
  • fully_evasive: does not address the question (refusal, redirection, non-response)

This taxonomy follows the Rasiah framework referenced in the paper.
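
For downstream tooling, the taxonomy above can be captured as a small constant. This is only an illustrative sketch (the repo does not ship such a module):

```python
# Illustrative only: the three labels and their definitions as stated above.
EVASION_LABELS = {
    "direct": "answers the core question with specific information",
    "intermediate": "provides related information but sidesteps the core question",
    "fully_evasive": "does not address the question (refusal, redirection, non-response)",
}
```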

Dataset: EvasionBench (as reported in the paper)

Sources

  • Earnings call transcripts from the S&P Capital IQ database.

Splits

  • Training: 30,000 samples (balanced)
    • direct: 10,000
    • intermediate: 10,000
    • fully_evasive: 10,000
  • Test (Human): 1,000 samples (natural distribution)
    • direct: 412 (41.2%)
    • intermediate: 256 (25.6%)
    • fully_evasive: 332 (33.2%)

Labeling / Construction

The training set is constructed via a multi-model annotation framework (sketched in code after this list):

  • Two annotators: Claude Opus 4.5 and Gemini-3-Flash
  • Agreement cases (~70–80%) are treated as high-confidence
  • Disagreement cases (~20–30%) are resolved by an LLM-as-Judge protocol using Claude Opus 4.5
  • Final training mix reported: ~25,000 consensus samples (83.5%) + ~5,000 judge-resolved samples (16.5%)
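
A minimal sketch of that consensus-plus-judge flow is below. It is hypothetical: the annotator and judge callables stand in for API calls to Claude Opus 4.5, Gemini-3-Flash, and the Opus judge, and none of this code is part of the repository.

```python
# Hypothetical sketch of the consensus-plus-judge labeling flow described above.
# The annotator/judge callables are placeholders for the actual LLM API calls.
from typing import Callable

Annotator = Callable[[str, str], str]          # (question, answer) -> label
Judge = Callable[[str, str, list[str]], str]   # (question, answer, candidates) -> label

def label_sample(question: str, answer: str,
                 annotator_a: Annotator, annotator_b: Annotator,
                 judge: Judge) -> dict:
    label_a = annotator_a(question, answer)
    label_b = annotator_b(question, answer)

    if label_a == label_b:
        # Agreement (~70-80% of samples): treated as a high-confidence consensus label.
        return {"label": label_a, "source": "consensus"}

    # Disagreement (~20-30% of samples): an LLM-as-Judge (Claude Opus 4.5 in the
    # paper) chooses between the two candidate labels.
    return {"label": judge(question, answer, [label_a, label_b]), "source": "judge"}
```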

Human validation (test set)

  • A 100-sample subset is double-annotated by two experts.
  • Reported inter-annotator agreement: Cohen’s Kappa = 0.835 (computation illustrated below).
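
For reference, Cohen’s Kappa is agreement corrected for chance, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is chance agreement. It can be computed with scikit-learn; the label lists below are made-up toy data, not the actual annotations:

```python
# Illustrative: computing Cohen's Kappa for two annotators' labels.
# The paper reports kappa = 0.835 on the 100-sample double-annotated subset.
from sklearn.metrics import cohen_kappa_score

annotator_1 = ["direct", "intermediate", "fully_evasive", "direct"]
annotator_2 = ["direct", "fully_evasive", "fully_evasive", "direct"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(kappa)  # 0.6 for this toy example
```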

Training Details

  • Base model: Qwen3-4B-Instruct-2507
  • Fine-tuning: full-parameter fine-tuning
  • Framework: MS-Swift
  • Hardware: 2× NVIDIA B200 SXM6 (180GB VRAM each)
  • Epochs: 2
  • Learning rate: 2e-5 (linear warmup; 3% warmup ratio)
  • Batch size: 8 per GPU
  • Gradient accumulation: 2 (effective batch size 32)
  • Precision: bfloat16
  • Max sequence length: 2048
  • Optimizer: AdamW
  • Gradient checkpointing: enabled
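
The run itself was configured through MS-Swift; purely as an illustration, the hyperparameters above map onto Hugging Face `TrainingArguments` roughly as follows. This is not the actual training script, and the output path is a placeholder; the 2048 max sequence length is handled on the data/tokenizer side rather than here.

```python
# Rough TrainingArguments equivalent of the hyperparameters listed above (illustrative only).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="eva-4b-sft",          # placeholder path
    num_train_epochs=2,
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    per_device_train_batch_size=8,    # x 2 GPUs x 2 accumulation steps = 32 effective
    gradient_accumulation_steps=2,
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch",
)
```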

Performance

Top-5 models on the 1,000-sample human test set

| Rank | Model | Accuracy | F1-Macro |
|------|-----------------|----------|----------|
| 1 | Claude Opus 4.5 | 83.9% | 0.838 |
| 2 | Gemini-3-Flash | 83.7% | 0.833 |
| 3 | GLM-4.7 | 82.6% | 0.809 |
| 4 | Eva-4B (Ours) | 81.3% | 0.807 |
| 5 | GPT-5.2 | 80.5% | 0.805 |

Note: by accuracy, Eva-4B ranks 2nd among the open-source models in this table, behind GLM-4.7 (82.6%).

Per-class F1 (Eva-4B)

| Class | F1 |
|---------------|-------|
| direct | 0.851 |
| intermediate | 0.698 |
| fully_evasive | 0.873 |
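
As a quick sanity check, the reported macro-F1 is the unweighted mean of these per-class scores:

```python
# Unweighted mean of the per-class F1 scores reported above.
per_class_f1 = {"direct": 0.851, "intermediate": 0.698, "fully_evasive": 0.873}
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)
print(round(macro_f1, 3))  # 0.807, matching the F1-Macro in the leaderboard table
```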

The paper notes that most errors involve confusion between the direct and intermediate classes.

Ablation (label-source comparison)

The paper compares Eva-4B's training labels (multi-model consensus + judge) against an Opus-only label construction:

  • Qwen-Opus-Only: 78.9% accuracy
  • Eva-4B: 81.3% accuracy (+2.4 percentage points)

The paper reports that the Opus-only baseline achieves a lower training loss but worse generalization.

Quick Start

The prompt below matches prompts/evasion_rasiah_fine_tuning_minimalist.txt in this repo.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "FutureMa/Eva-4B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

PROMPT_TEMPLATE = """You are a financial analyst. Your task is to Detect Evasive Answers in Financial Q&A

Question: {{question}}
Answer: {{answer}}

Response format:
```json
{"reason": "brief explanation under 100 characters", "label": "direct|intermediate|fully_evasive"}
```

Answer in json block content, no other text"""

question = "What are your revenue expectations for next quarter?"
answer = "We remain optimistic about our business trajectory and will continue to focus on executing our strategic priorities."

prompt = (
    PROMPT_TEMPLATE
    .replace("{{question}}", question)
    .replace("{{answer}}", answer)
)

messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=128,
        temperature=0.7,
        do_sample=True,
    )

generated = output_ids[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(generated, skip_special_tokens=True)
print(response)

Expected output format:

{"reason": "...", "label": "direct|intermediate|fully_evasive"}

Limitations

  • Domain-specific to earnings call Q&A
  • English-only evaluation
  • Multi-model + judge labeling increases annotation cost (~2.2–2.3× vs single-model)
  • Judge position bias risk (no position randomization)
  • Potential self-preference concerns (Opus judging its own predictions)
  • Subjectivity in the intermediate class (lower agreement)
  • Temporal drift (training data spans 2005–2023)

Ethics

Eva-4B is a research artifact and not financial advice. Outputs should be used as one signal among many and should be reviewed by humans for high-stakes decisions.

Citation

If you use this model, please cite the accompanying paper:

@article{ma_evasionbench,
  title={EvasionBench: Detecting Evasive Answers in Financial Q\&A via Multi-Model Consensus and LLM-as-Judge},
  author={Ma, Shijian}
}

Author

Shijian Ma

Last updated: 2026-01-12
