We fine-tune the teacher using Cross-Entropy (CE) loss for three tasks: extractive VQA, SER,
and KIE. For VQA, the model predicts start and end positions in a sequence
$x = x_1, \dots, x_n$. It outputs probability distributions $p^s(x)$ and $p^e(x)$, where $p^s(x)_i$ and $p^e(x)_i$ are the probabilities of token $i$ being the start or end position, respectively. With ground-truth start and end positions $y_s$ and $y_e$, the loss is:
$$\mathcal{L}_{\mathrm{CE,QA}} = -\left[\log\!\left(p^s(x)_{y_s}\right) + \log\!\left(p^e(x)_{y_e}\right)\right].$$
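As a minimal sketch of this span loss, assuming a PyTorch setting with a single (unbatched) sequence of length $n$; the names start_logits, end_logits, y_s, and y_e are hypothetical:

```python
import torch.nn.functional as F

def qa_ce_loss(start_logits, end_logits, y_s, y_e):
    # L_CE,QA = -[log p^s(x)_{y_s} + log p^e(x)_{y_e}]
    log_p_start = F.log_softmax(start_logits, dim=-1)  # log p^s(x), shape [n]
    log_p_end = F.log_softmax(end_logits, dim=-1)      # log p^e(x), shape [n]
    return -(log_p_start[y_s] + log_p_end[y_e])
```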
For SER and KIE, each token is classified into one of $C$ classes. The model outputs $p(x)_i^j$, the probability of token $i$ belonging to class $j$. With gold labels $y_i$, the CE loss is:
$$\mathcal{L}_{\mathrm{CE,SER+KIE}} = -\frac{1}{n}\sum_{i=1}^{n}\log\!\left(p(x)_{i}^{\,y_i}\right).$$
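A sketch of this token-level loss under the same assumptions (PyTorch; hypothetical names logits of shape [n, C] and labels holding the gold indices $y_i$):

```python
import torch.nn.functional as F

def ser_kie_ce_loss(logits, labels):
    # L_CE,SER+KIE = -(1/n) * sum_i log p(x)_{i, y_i}
    log_p = F.log_softmax(logits, dim=-1)                         # [n, C]
    gold_log_p = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)  # log p(x)_{i, y_i}
    return -gold_log_p.mean()
```

This is equivalent to F.cross_entropy(logits, labels) with its default mean reduction.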
To distill a student from the teacher, we use our method SlimDoc, which is split into two phases. In phase one, we align internal signals (embeddings, hidden states, and attention maps) using MSE losses. For the hidden states:
$$\mathcal{L}_{\mathrm{hidden}} = \frac{1}{M}\sum_{i=1}^{M} \mathrm{MSE}\!\left(H_{\mathcal{S}}^{i}(x),\, H_{\mathcal{T}}^{l_i}(x)\right),$$
where $M$ is the number of student layers, $l_i$ denotes the teacher layer aligned with student layer $i$, $d$ is the hidden dimension, and
$$\mathrm{MSE} = \frac{1}{dn}\sum_{j=1}^{n}\sum_{k=1}^{d}\left(H_{\mathcal{S},jk}^{i} - H_{\mathcal{T},jk}^{l_i}\right)^2.$$
The embedding loss is
$$\mathcal{L}_{\mathrm{emb}} = \mathrm{MSE}\!\left(H_{\mathcal{S}}^{0}(x),\, H_{\mathcal{T}}^{0}(x)\right).$$
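The two losses above can be sketched as follows, assuming PyTorch; student_hidden and teacher_hidden are hypothetical lists of [n, d] hidden-state tensors (index 0 holding the embedding outputs), and layer_map is a hypothetical dict mapping each student layer $i$ to its teacher layer $l_i$:

```python
import torch
import torch.nn.functional as F

def phase_one_hidden_losses(student_hidden, teacher_hidden, layer_map):
    # L_emb = MSE(H_S^0(x), H_T^0(x)); F.mse_loss averages over all n*d elements,
    # matching the 1/(dn) normalization.
    l_emb = F.mse_loss(student_hidden[0], teacher_hidden[0])
    # L_hidden = (1/M) * sum_i MSE(H_S^i(x), H_T^{l_i}(x))
    per_layer = [F.mse_loss(student_hidden[i], teacher_hidden[l_i])
                 for i, l_i in layer_map.items()]
    l_hidden = torch.stack(per_layer).mean()
    return l_emb, l_hidden
```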
For the attention maps of shape $H \times n \times n$, where $H$ is the number of attention heads:
$$\mathcal{L}_{\mathrm{attn}} = \frac{1}{M}\sum_{i=1}^{M} \mathrm{MSE}\!\left(A_{\mathcal{S}}^{i}(x),\, A_{\mathcal{T}}^{l_i}(x)\right),$$
with
$$\mathrm{MSE} = \frac{1}{Hn^2}\sum_{h=1}^{H}\sum_{j=1}^{n}\sum_{k=1}^{n}\left(A_{\mathcal{S},hjk}^{i} - A_{\mathcal{T},hjk}^{l_i}\right)^2.$$
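A sketch of this attention term under the same assumptions (PyTorch; student_attn and teacher_attn as hypothetical lists of [H, n, n] attention maps; layer_map as above):

```python
import torch
import torch.nn.functional as F

def phase_one_attn_loss(student_attn, teacher_attn, layer_map):
    # L_attn = (1/M) * sum_i MSE(A_S^i(x), A_T^{l_i}(x)); F.mse_loss averages
    # over all H*n*n elements, matching the 1/(Hn^2) normalization.
    per_layer = [F.mse_loss(student_attn[i], teacher_attn[l_i])
                 for i, l_i in layer_map.items()]
    return torch.stack(per_layer).mean()
```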
The combined distillation loss is
$$\mathcal{L}_{\mathrm{distill}} = \alpha \cdot \mathcal{L}_{\mathrm{emb}} + \beta \cdot \mathcal{L}_{\mathrm{hidden}} + \gamma \cdot \mathcal{L}_{\mathrm{attn}},$$
where $\alpha$, $\beta$, and $\gamma$ weight the individual terms.
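For completeness, a trivial sketch of the weighted combination; the weight names follow the formula above and their values are not specified here:

```python
def distill_loss(l_emb, l_hidden, l_attn, alpha, beta, gamma):
    # L_distill = alpha * L_emb + beta * L_hidden + gamma * L_attn
    return alpha * l_emb + beta * l_hidden + gamma * l_attn
```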
In phase two, we align the model outputs using KL divergence:
$$\mathcal{L}_{\mathrm{KL}} = \frac{1}{nC}\sum_{j=1}^{n}\sum_{k=1}^{C} p_{\mathcal{S}}(x)_{jk} \cdot \log\!\left(\frac{p_{\mathcal{S}}(x)_{jk}}{p_{\mathcal{T}}(x)_{jk}}\right).$$
For VQA, this is applied separately to start and end outputs.
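A sketch of this phase-two loss, assuming PyTorch; student_logits and teacher_logits are hypothetical [n, C] logit tensors, and the division by numel() reproduces the $1/(nC)$ normalization. For VQA, the same function would be called separately on the start and end logits.

```python
import torch.nn.functional as F

def output_kl_loss(student_logits, teacher_logits):
    # L_KL = (1/(nC)) * sum_{j,k} p_S(x)_{jk} * log(p_S(x)_{jk} / p_T(x)_{jk})
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s = log_p_s.exp()
    return (p_s * (log_p_s - log_p_t)).sum() / student_logits.numel()
```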