We fine-tune the teacher using Cross-Entropy (CE) loss for three tasks: extractive VQA, SER,
and KIE. For VQA, the model predicts start and end positions in a sequence $x = x_1, \dots, x_n$. It outputs probability distributions $p^s(x)$ and $p^e(x)$, where $p^s(x)_i$ and $p^e(x)_i$ are the probabilities of token $i$ being the start or end of the answer span. With $y_s$ and $y_e$ denoting the ground-truth start and end positions, the loss is
\begin{equation}
\mathcal{L}_{\text{CE,QA}} = -\left[\log\left(p^s(x)_{y_s}\right) + \log\left(p^e(x)_{y_e}\right)\right].
\end{equation}
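As a concrete illustration, a minimal PyTorch sketch of this span-prediction loss could look as follows (our own example, not the released implementation; the function and tensor names are assumptions):

import torch.nn.functional as F

def qa_ce_loss(start_logits, end_logits, y_start, y_end):
    # start_logits, end_logits: (batch, n) scores over token positions
    # y_start, y_end: (batch,) ground-truth start / end indices
    # cross_entropy = log-softmax + negative log-likelihood,
    # i.e. -log p^s(x)_{y_s} and -log p^e(x)_{y_e}
    return F.cross_entropy(start_logits, y_start) + F.cross_entropy(end_logits, y_end)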
For SER and KIE, each token is classified into one of $C$ classes. The model outputs $p(x)_i^j$, the probability of token $i$ belonging to class $j$. With $y_i$ denoting the ground-truth class of token $i$, the CE loss is
\begin{equation}
\mathcal{L}_{\text{CE,SER+KIE}} = -\frac{1}{n} \sum_{i=1}^{n} \log\left(p(x)_i^{y_i}\right).
\end{equation}
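A corresponding sketch of the token-classification loss, under the same assumptions (PyTorch, illustrative names):

import torch.nn.functional as F

def token_ce_loss(logits, labels):
    # logits: (n, C) per-token class scores; labels: (n,) ground-truth classes y_i
    # reduction="mean" realizes the 1/n averaging over tokens
    return F.cross_entropy(logits, labels, reduction="mean")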
To distill a student from the teacher, we use our method SlimDoc, which proceeds in two phases. In phase one, we align the student's internal signals with the teacher's using MSE losses. For hidden states:
\begin{equation}
\mathcal{L}_{\text{hidden}} = \frac{1}{M} \sum_{i=1}^{M} \operatorname{MSE}\left(H_{\mathcal{S}}^{i}(x), H_{\mathcal{T}}^{l_i}(x)\right),
\end{equation}
where student layer $i$ is matched with teacher layer $l_i$, and
\begin{equation}
\operatorname{MSE} = \frac{1}{dn} \sum_{j=1}^{n} \sum_{k=1}^{d} \left(H_{\mathcal{S},jk}^{i} - H_{\mathcal{T},jk}^{l_i}\right)^2.
\end{equation}
The embedding loss is
\begin{equation}
\mathcal{L}_{\text{emb}} = \operatorname{MSE}\left(H_{\mathcal{S}}^{0}(x), H_{\mathcal{T}}^{0}(x)\right).
\end{equation}
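A minimal sketch of this hidden-state and embedding alignment, assuming the per-layer hidden states are available as lists of $(n, d)$ tensors and that layer_map holds the teacher layer $l_i$ for each student layer $i$ (illustrative PyTorch, not the released code):

import torch
import torch.nn.functional as F

def hidden_state_loss(student_hidden, teacher_hidden, layer_map):
    # F.mse_loss with the default "mean" reduction averages over all n*d entries,
    # matching the 1/(dn) factor in the per-layer MSE above
    per_layer = [F.mse_loss(student_hidden[i], teacher_hidden[l_i])
                 for i, l_i in enumerate(layer_map)]
    return torch.stack(per_layer).mean()   # 1/M average over the matched layers

def embedding_loss(student_emb, teacher_emb):
    # H_S^0(x) vs. H_T^0(x): outputs of the embedding layers
    return F.mse_loss(student_emb, teacher_emb)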
For the attention matrices (of shape $H \times n \times n$, with $H$ the number of attention heads):
\begin{equation}
\mathcal{L}_{\text{attn}} = \frac{1}{M} \sum_{i=1}^{M} \operatorname{MSE}\left(A_{\mathcal{S}}^{i}(x), A_{\mathcal{T}}^{l_i}(x)\right),
\end{equation}
with
\begin{equation}
\operatorname{MSE} = \frac{1}{Hn^2} \sum_{h=1}^{H} \sum_{j=1}^{n} \sum_{k=1}^{n} \left(A_{\mathcal{S},hjk}^{i} - A_{\mathcal{T},hjk}^{l_i}\right)^2.
\end{equation}
The combined distillation loss is
\begin{equation}
\mathcal{L}_{\text{distill}} = \alpha \cdot \mathcal{L}_{\text{emb}} + \beta \cdot \mathcal{L}_{\text{hidden}} + \gamma \cdot \mathcal{L}_{\text{attn}}.
\end{equation}
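The attention alignment and the combined objective follow the same pattern; again a sketch with assumed names, where alpha, beta, and gamma are the weighting hyperparameters from the equation above:

import torch
import torch.nn.functional as F

def attention_loss(student_attn, teacher_attn, layer_map):
    # each attention tensor has shape (H, n, n); mse_loss averages over all H*n^2 entries
    per_layer = [F.mse_loss(student_attn[i], teacher_attn[l_i])
                 for i, l_i in enumerate(layer_map)]
    return torch.stack(per_layer).mean()

def distill_loss(l_emb, l_hidden, l_attn, alpha, beta, gamma):
    # weighted sum of the embedding, hidden-state, and attention terms
    return alpha * l_emb + beta * l_hidden + gamma * l_attn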
In phase two, we align the model outputs using KL divergence:
\begin{equation}
\mathcal{L}_{\text{KL}} = \frac{1}{nC} \sum_{j=1}^{n} \sum_{k=1}^{C} p_{\mathcal{S}}(x)_{jk} \cdot \log\frac{p_{\mathcal{S}}(x)_{jk}}{p_{\mathcal{T}}(x)_{jk}}.
\end{equation}
For VQA, this is applied separately to start and end outputs.
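A minimal sketch of the phase-two loss, taking the divergence direction exactly as written above, $\mathrm{KL}(p_{\mathcal{S}} \,\|\, p_{\mathcal{T}})$ (illustrative PyTorch names; for VQA it would be called once on the start outputs and once on the end outputs):

import torch.nn.functional as F

def kl_output_loss(student_logits, teacher_logits):
    # logits: (n, C); convert to per-token class distributions
    # teacher_logits are typically detached so no gradient flows into the teacher
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s = log_p_s.exp()
    # mean over all n*C entries realizes the 1/(nC) factor
    return (p_s * (log_p_s - log_p_t)).mean()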