SlimDoc: Lightweight Distillation of Document Transformer Models

1RheinMain University of Applied Sciences, Wiesbaden, Germany.
2National University of Sciences and Technology, Islamabad, Pakistan.
3Insiders Technologies GmbH, Kaiserslautern, Germany.
International Journal on Document Analysis and Recognition (IJDAR) 2025


We study a transitive distillation procedure to distill small-scale document transformers for specific document types and tasks. Our key focus is SlimDoc (right, blue), a feature-based distillation procedure for document transformers that distills hidden states, attention scores and output logits at different stages of training. To apply SlimDoc in unsupervised settings, we precede it with a label distillation step using labels generated by an LLM (red, left).

Abstract

Deploying state-of-the-art document understanding models remains resource-intensive and impractical in many real-world scenarios, particularly where labeled data is scarce and computational budgets are constrained. To address these challenges, this work proposes a novel approach towards parameter-efficient document understanding models capable of adapting to specific tasks and document types without the need for labeled data. Specifically, we propose an approach coined SlimDoc to distill multimodal document transformer encoder models into smaller student models, using internal signals at different training stages, followed by external signals. Our approach is inspired by TinyBERT and adapted to the domain of document understanding transformers. We demonstrate SlimDoc to outperform both a single-stage distillation and a direct fine-tuning of the student. Experimental results across six document understanding datasets demonstrate our approach's effectiveness: our distilled student models achieve on average 93.0% of the teacher's performance, while the fine-tuned students achieve 87.0% of the teacher's performance. Without requiring any labeled data, we create a compact student which achieves 96.0% of the performance of its supervised-distilled counterpart and 86.2% of the performance of a supervised-fine-tuned teacher model. We demonstrate our distillation approach to pick up on document geometry and to be effective on the two popular document understanding models LiLT and LayoutLMv3. Our implementation and training data are available at https://github.com/marcel-lamott/SlimDoc.

SlimDoc: Our Distillation Approach

Overview of our distillation approach SlimDoc: distillation of the student model with M transformer layers from a teacher model with N layers is split into two phases. In the first phase, the embeddings, attention scores and hidden states of the student model's text flow are aligned with those of the teacher using MSE loss. In the second phase, the student model's logits are aligned with the teacher model's logits using KL divergence loss.

We fine-tune the teacher using Cross-Entropy (CE) loss for three tasks: extractive VQA, SER, and KIE. For VQA, the model predicts start and end positions in a sequence $x = x_1, \dots, x_n$. It outputs probabilities $p^s(x)$ and $p^e(x)$, where $p^s(x)_i$ and $p^e(x)_i$ are the probabilities of token $i$ being the start or end. The loss is: $\mathcal{L}_{CE,QA} = -\left[\log p^s(x)_{y_s} + \log p^e(x)_{y_e}\right]$.
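As a minimal illustration, the sketch below expresses this loss in PyTorch; it is not the released implementation, and the tensor names and shapes are assumptions.

import torch.nn.functional as F

def vqa_ce_loss(start_logits, end_logits, y_start, y_end):
    # start_logits, end_logits: (batch, seq_len); y_start, y_end: (batch,) gold positions
    # F.cross_entropy applies log-softmax internally, so each term equals -log p^s(x)_{y_s}
    # (resp. -log p^e(x)_{y_e}), averaged over the batch.
    return F.cross_entropy(start_logits, y_start) + F.cross_entropy(end_logits, y_end)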

For SER and KIE, each token is classified into one of $C$ classes. The model outputs $p(x)_i^j$, the probability of token $i$ belonging to class $j$. The CE loss is: $\mathcal{L}_{CE,SER+KIE} = -\frac{1}{n}\sum_{i=1}^{n}\log p(x)_i^{y_i}$.
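Analogously, a hedged PyTorch sketch of the token-level CE loss (shapes are assumptions):

import torch.nn.functional as F

def token_ce_loss(logits, labels):
    # logits: (batch, seq_len, C); labels: (batch, seq_len) with class ids in [0, C)
    # Flattening turns every token position into one -log p(x)_i^{y_i} term; the default
    # 'mean' reduction provides the 1/n normalisation (averaged over the batch as well).
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))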

To distill a student from the teacher, we use our method SlimDoc, which is split into two phases. In phase one, we align internal signals using MSE loss. For hidden states: $\mathcal{L}_{hidden} = \frac{1}{M}\sum_{i=1}^{M}\mathrm{MSE}\left(H_\mathcal{S}^{i}(x), H_\mathcal{T}^{l_i}(x)\right)$, with $\mathrm{MSE} = \frac{1}{dn}\sum_{j=1}^{n}\sum_{k=1}^{d}\left(H_{\mathcal{S},jk}^{i} - H_{\mathcal{T},jk}^{l_i}\right)^2$. The embedding loss is $\mathcal{L}_{emb} = \mathrm{MSE}\left(H_\mathcal{S}^{0}(x), H_\mathcal{T}^{0}(x)\right)$.
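The sketch below illustrates this phase-one alignment in PyTorch; it is an assumption-laden illustration rather than the released code. Here, layer_map is a hypothetical list holding, for each student layer i, the (1-based) teacher layer l_i it is aligned with.

import torch.nn.functional as F

def phase_one_hidden_losses(student_hidden, teacher_hidden, layer_map):
    # student_hidden: list of (batch, n, d) tensors, index 0 = embedding output,
    #                 indices 1..M = outputs of the M student layers
    # teacher_hidden: analogous list from the teacher, indices 0..N
    # layer_map:      e.g. [4, 8, 12] for a 3-layer student distilled from a 12-layer teacher
    l_emb = F.mse_loss(student_hidden[0], teacher_hidden[0])
    # mse_loss with 'mean' reduction corresponds to the 1/(dn) normalisation per sample.
    l_hidden = sum(
        F.mse_loss(student_hidden[i + 1], teacher_hidden[l_i])
        for i, l_i in enumerate(layer_map)
    ) / len(layer_map)
    return l_emb, l_hidden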

For attention scores (shape $H \times n \times n$): $\mathcal{L}_{attn} = \frac{1}{M}\sum_{i=1}^{M}\mathrm{MSE}\left(A_\mathcal{S}^{i}(x), A_\mathcal{T}^{l_i}(x)\right)$, with $\mathrm{MSE} = \frac{1}{Hn^2}\sum_{h=1}^{H}\sum_{j=1}^{n}\sum_{k=1}^{n}\left(A_{\mathcal{S},hjk}^{i} - A_{\mathcal{T},hjk}^{l_i}\right)^2$.
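The attention term can be sketched the same way (same assumptions as above; attention lists hold one entry per layer, hence the l_i - 1 index):

import torch.nn.functional as F

def phase_one_attention_loss(student_attn, teacher_attn, layer_map):
    # student_attn: list of (batch, H, n, n) attention-score tensors, one per student layer
    # teacher_attn: analogous list from the teacher; layer_map as above, e.g. [4, 8, 12]
    # mse_loss with 'mean' reduction corresponds to the 1/(H n^2) normalisation per sample.
    return sum(
        F.mse_loss(student_attn[i], teacher_attn[l_i - 1])
        for i, l_i in enumerate(layer_map)
    ) / len(layer_map)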

The combined distillation loss is $\mathcal{L}_{distill} = \alpha \cdot \mathcal{L}_{emb} + \beta \cdot \mathcal{L}_{hidden} + \gamma \cdot \mathcal{L}_{attn}$.

In phase two, we align the model outputs using KL divergence: $\mathcal{L}_{KL} = \frac{1}{nC}\sum_{j=1}^{n}\sum_{k=1}^{C} p_\mathcal{S}(x)_{jk} \cdot \log\frac{p_\mathcal{S}(x)_{jk}}{p_\mathcal{T}(x)_{jk}}$. For VQA, this is applied separately to the start and end outputs.
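A hedged PyTorch sketch of this phase-two objective (the exact normalisation and the absence of temperature scaling are assumptions, not taken from the text):

import torch
import torch.nn.functional as F

def phase_two_kl_loss(student_logits, teacher_logits, eps=1e-8):
    # student_logits, teacher_logits: (batch, n, C) per-token class logits
    p_s = F.softmax(student_logits, dim=-1)
    p_t = F.softmax(teacher_logits, dim=-1)
    # p_S * log(p_S / p_T); taking the mean over all dimensions reproduces the
    # 1/(nC) factor of the formula above, additionally averaged over the batch.
    kl = p_s * (torch.log(p_s + eps) - torch.log(p_t + eps))
    return kl.mean()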

Results

General and vocabulary-specific results. The upper half compares our distilled students (DT) to the teacher models and fine-tuned students (FT), where the best student results are shown in bold. It shows that the distillation procedure provides an advantage over regular fine-tuning: the distilled 4-layer student models achieve on average 93.4% and 92.5% of the performance of the 12-layer teachers for LiLT and LayoutLMv3, respectively, while the fine-tuned students only achieve 87.3% and 86.6%. Further, the data indicates that an explicit distillation of layout signals is not necessary for LiLT, as the LiLT-Layout adaptation is on par with the distillation of LiLT's text flow, which captures layout signals via the BiACM mechanism. Furthermore, omitting layout signals during training leads to a sharp drop in performance (LiLT-NoLayout, LayoutLMv3-NoVisionNoLayout), which indicates that our approach implicitly distills layout signals.

The lower half evaluates different vocabulary sizes for the student models: the full vocabulary with 50k tokens is compared against the small variant with 15k tokens and the tiny variant with 5k tokens. The best student results for the small vocabulary are shown in bold, and the best student results for the tiny vocabulary are shown in bold and italics. Column Comp. shows the relative performance compared to the full-vocabulary students, in percent. The results suggest that reducing the vocabulary size is an efficient approach to decreasing model size, with only a moderate impact on performance: on average, the distilled student with the small vocabulary achieves 97.7% and 98.0% of the performance of the distilled student with the full vocabulary for LiLT and LayoutLMv3, respectively.
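For illustration, one common way to shrink a model's vocabulary is to keep only the most frequent tokens on the target corpus and slice the embedding matrix accordingly; the sketch below is a generic example of this idea (prune_embeddings and keep_ids are hypothetical names, and the paper's exact procedure may differ).

import torch
import torch.nn as nn

def prune_embeddings(old_embedding: nn.Embedding, keep_ids):
    # keep_ids: token ids to retain, e.g. the 15k or 5k most frequent tokens
    keep = torch.as_tensor(keep_ids, dtype=torch.long)
    new_embedding = nn.Embedding(keep.numel(), old_embedding.embedding_dim)
    with torch.no_grad():
        new_embedding.weight.copy_(old_embedding.weight[keep])
    # The tokenizer must remap the retained token ids to the new 0..len(keep_ids)-1 range (not shown).
    return new_embedding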

Single-Phase vs. Two-Phase Distillation

Results of our ablation studies, in which we use a single-phase distillation (DT-1Phase) compared to our usual two-phase setup (DT), and fine-tune a student model initialized with the weights of the pre-trained base model (FT-NewInit) instead of the fine-tuned teacher (FT). The best student results are shown in bold. The results suggest that the two-phase distillation setup provides an advantage over distilling hidden states, attention scores and logits jointly in a single phase, leading to improvements of up to 2.5 pp. Further, the data demonstrates that the initialization of the student model's weights plays a crucial role in the learning process, with the teacher-initialized student model achieving better scores on all datasets.

Supervised vs. Unsupervised Distillation

Comparison of supervised and unsupervised distillation: FT and DT denote fine-tuned and distilled student models, respectively. Marked models were trained using LLM-supplied silver-standard labels in place of gold-standard ground-truth annotations. Column C-ST shows the average performance relative to the supervised-trained teacher. Column C-CS shows the average performance relative to the corresponding supervised-trained model, i.e., teacher, fine-tuned student or distilled student. Row ChatGPT 3.5 denotes the performance of the LLM-supplied answers on the test split. The best student results are shown in bold. The results show that unsupervised training achieves performance comparable to supervised training, with the unsupervised-distilled LiLT student reaching 96.0% of its supervised counterpart's performance.

Impact of Student Layer Selection

Relative improvement of distillation over fine-tuning by selected layers for LiLT (left bars) and LayoutLMv3 (right bars). The x-axis labels denote the set of layers 𝓘ₛ. Averages are calculated over all datasets for the 4-, 3-, 2- and 1-layer students. The results indicate that the middle layers of these models benefit more from distillation than the outer layers. For LiLT, distillation with the layer set 𝓘ₛ = ⟨3,7,11⟩ yields the greatest improvement, while for LayoutLMv3 this is the case for 𝓘ₛ = ⟨2,5,8,11⟩.
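As an illustration of how such a layer set could be used, the hypothetical helper below copies the selected teacher blocks to initialise the student before distillation (the attribute path and the 0- vs. 1-based indexing are assumptions):

import copy

def init_student_layers(teacher_layers, layer_set):
    # teacher_layers: the teacher's N transformer blocks, e.g. model.encoder.layer
    # layer_set:      e.g. (3, 7, 11) for a 3-layer student
    # Returns deep copies of the selected blocks to serve as the student's initial layers.
    return [copy.deepcopy(teacher_layers[l]) for l in layer_set]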

Model Size and Inference Efficiency

Overview of model size and inference times for teacher and student models. Inference time is measured on 100 batches of DocVQA with batch size 16 (column Inf. (ms)). Column D.s. indicates the downscaling factor, i.e., the ratio of the teacher's size to the student's. SV and TV denote the small and tiny vocabulary variants, respectively. Column Best Avg Score shows the best average score across all datasets achieved by an n-layer student. The best student metrics are shown in bold. Even the largest student models with 4 layers achieve considerable reductions in model size and inference time while maintaining performance close to that of the teacher. While both downscaling strategies (vocabulary reduction and layer reduction) produce smaller models, reducing the number of layers below 4 leads to a more significant performance drop, whereas vocabulary reduction more effectively preserves accuracy at comparable model sizes.
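For reference, a minimal sketch of how per-batch inference time could be measured (model and batches are placeholders; this is not the paper's benchmarking code):

import time
import torch

@torch.no_grad()
def mean_inference_time_ms(model, batches):
    # batches: a list of input dicts, e.g. 100 DocVQA batches of size 16
    model.eval()
    start = time.perf_counter()
    for batch in batches:
        model(**batch)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # ensure all GPU work has finished before stopping the clock
    return (time.perf_counter() - start) / len(batches) * 1000.0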

Conclusion

In this work we present SlimDoc, a novel and effective distillation framework for compressing multimodal document transformer models into lightweight, task-specific students without requiring labeled data. By combining feature-based and output-based distillation across two stages, SlimDoc achieves strong performance retention while significantly reducing model size and computational costs. Evaluations across six datasets and multiple transformer backbones confirm that our approach consistently outperforms traditional fine-tuning, maintains spatial layout awareness, and operates effectively even with LLM-generated silver-standard labels. Additionally, we demonstrate that vocabulary pruning and strategic layer selection further enhance efficiency with minimal performance trade-off. SlimDoc paves the way for scalable and label-efficient deployment of document understanding models in real-world, resource-constrained environments.

BibTeX

@article{Lamott_Shakir_Ulges_Weweler_Shafait_2025a,
    title={SlimDoc: Lightweight distillation of document Transformer models}, 
    DOI={10.1007/s10032-025-00542-w}, 
    journal={International Journal on Document Analysis and Recognition (IJDAR)}, 
    author={Lamott, Marcel and Shakir, Muhammad Armaghan and Ulges, Adrian and Weweler, Yves-Noel and Shafait, Faisal}, 
    year={2025}, 
    month={Jun}
}