Domain-Adaptive Dense Retrieval for Brazilian Legal Search

2026-05-05 · Information Retrieval

AI summary

The authors studied different ways to train computer models that help find legal information in Brazilian texts. They tried three approaches: no extra training, training only on legal texts, and training on both legal texts and a question-answering dataset. They found that training only on legal data works best for specialized legal searches, while mixing in general question data helps the model perform better across different types of searches. This mixed training especially improved results on a general Portuguese dataset, showing it makes the model more flexible. Both specialized and mixed models are made publicly available.

Dense retrievers · Legal retrieval · Fine-tuning · Qwen3-Embedding-4B · NDCG@10 · MRR@10 · MAP@10 · JUÁ leaderboard · SQuAD-pt · Out-of-domain generalization
Authors
Jayr Pereira, Roberto Lotufo, Luiz Bonifacio
Abstract
Brazilian legal retrieval is heterogeneous, covering case law, legislation, and question-based search. This makes training dense retrievers a trade-off between stronger domain specialization and broader robustness across types of search. In this paper, we explore this trade-off using three training setups based on Qwen3-Embedding-4B: a base model with no fine-tuning, a version fine-tuned only on legal data, and a mixed setup that combines legal data with the supervised SQuAD-pt dataset. We evaluate these models on five legal datasets from the JUÁ leaderboard, along with the Quati dataset as an extra Portuguese retrieval benchmark to test out-of-domain generalization. The legal-only model performs best on the most specialized legal tasks. The mixed setup keeps strong performance on legal data while offering a better overall balance, improving average NDCG@10 from 0.414 to 0.447, MRR@10 from 0.586 to 0.595, and MAP@10 from 0.270 to 0.308 across all six datasets. The biggest improvement appears on Quati, where the mixed model clearly outperforms the legal-only one. Overall, the results show that legal-only and mixed training lead to different strengths: the first is better for specialization, while the second is more robust across different types of search, especially question-based ones. Both adapted models are available on Hugging Face.
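As a usage illustration, here is a minimal sketch of retrieval with a dense embedding model of this family via the sentence-transformers library. The repository id below is a placeholder, since the abstract does not give the exact Hugging Face names of the adapted checkpoints, and we assume the fine-tuned models keep the standard Qwen3-Embedding interface, which applies an instruction prompt on the query side only.

```python
from sentence_transformers import SentenceTransformer

# Placeholder repo id: substitute the published checkpoint name.
model = SentenceTransformer("your-org/qwen3-embedding-4b-legal-pt")

queries = ["prazo para interposição de recurso especial"]
documents = [
    "Art. 1.003, § 5º, CPC: o prazo para interpor os recursos é de 15 dias.",
    "A petição inicial indicará o juízo a que é dirigida.",
]

# Qwen3-Embedding models encode queries with a query prompt; documents are
# encoded as-is.
query_emb = model.encode(queries, prompt_name="query")
doc_emb = model.encode(documents)

# Similarity scores; rank documents per query by descending score.
scores = model.similarity(query_emb, doc_emb)
print(scores)
```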
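For reference, the cutoff metrics reported above can be computed per query as in the sketch below, assuming binary relevance judgments. Note that MAP normalization conventions differ across evaluation tools; this version divides by min(|relevant|, k).

```python
import math

def ndcg_at_k(ranking, relevant, k=10):
    """NDCG@k with binary relevance: DCG over the top-k hits, normalized
    by the DCG of an ideal ranking that front-loads all relevant docs."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranking[:k]) if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

def mrr_at_k(ranking, relevant, k=10):
    """Reciprocal rank of the first relevant doc in the top k, else 0."""
    for rank, doc_id in enumerate(ranking[:k]):
        if doc_id in relevant:
            return 1.0 / (rank + 1)
    return 0.0

def map_at_k(ranking, relevant, k=10):
    """Average precision over the top k, normalized by min(|relevant|, k)."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking[:k]):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / (rank + 1)
    denom = min(len(relevant), k)
    return precision_sum / denom if denom else 0.0

# Toy example: the only relevant document sits at rank 2.
print(ndcg_at_k(["d3", "d1", "d7"], {"d1"}))  # ~0.63
print(mrr_at_k(["d3", "d1", "d7"], {"d1"}))   # 0.5
print(map_at_k(["d3", "d1", "d7"], {"d1"}))   # 0.5
```

Corpus-level scores, as on the JUÁ leaderboard, are the mean of these per-query values over all queries in a dataset.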