823-OLT @ BUET DL Sprint 4.0: Context-Aware Windowing for ASR and Fine-Tuned Speaker Diarization in Bengali Long Form Audio
2026-02-24 • Sound
AI summary
The authors worked on improving technology that converts long Bengali speech recordings into written text and identifies who is speaking when. They used existing models for speech recognition and speaker separation but adapted them to handle long audio with special techniques that keep context and manage pauses. For identifying different speakers, they fine-tuned a model using a dataset focused on Bengali conversations. Their work aims to make speech technologies more accessible and effective for Bengali, a language that has not had much attention in this area.
Keywords: Bengali language, automatic speech recognition (ASR), speaker diarization, Whisper model, voice activity detection, speech segmentation, fine-tuning, long form speech, context preservation, low resource languages
Authors
Ratnajit Dhar, Arpita Mallik
Abstract
Bengali, despite being one of the most widely spoken languages globally, remains underrepresented in long form speech technology, particularly in systems addressing transcription and speaker attribution. We present frameworks for long form Bengali speech intelligence that address automatic speech recognition using a Whisper Medium-based model and speaker diarization using a fine-tuned segmentation model. The ASR pipeline incorporates vocal separation, voice activity detection, and a gap-aware windowing strategy to construct context-preserving segments for stable decoding. For diarization, a pretrained speaker segmentation model is fine-tuned on the official competition dataset (provided as part of the DL Sprint 4.0 competition organized under BUET CSE Fest) to better capture Bengali conversational patterns. The resulting systems deliver both efficient transcription of long form audio and speaker-aware transcription, providing scalable speech technology solutions for low resource languages.
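The abstract's gap-aware windowing idea can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, thresholds, and segment representation are all assumptions. The idea it demonstrates is the one stated above: merge VAD speech segments into windows that preserve context, starting a new window whenever a long pause occurs or the window would exceed the decoder's comfortable length.

```python
# Hypothetical sketch of a gap-aware windowing strategy (not the authors'
# exact implementation): VAD speech segments are merged into windows up to
# a maximum duration, but a new window is started whenever the silence gap
# between consecutive segments exceeds a threshold, so each window keeps
# coherent conversational context for the ASR decoder.

def gap_aware_windows(segments, max_window=30.0, max_gap=1.0):
    """segments: sorted list of (start, end) speech times in seconds.
    Returns a list of windows, each a list of (start, end) segments."""
    windows, current = [], []
    for start, end in segments:
        if current:
            gap = start - current[-1][1]        # silence since last speech
            span = end - current[0][0]          # window length if we append
            # break the window on a long pause or when it would grow too long
            if gap > max_gap or span > max_window:
                windows.append(current)
                current = []
        current.append((start, end))
    if current:
        windows.append(current)
    return windows

# Example: a 1.5 s pause splits the audio into two context windows
segs = [(0.0, 2.0), (2.3, 5.0), (6.5, 9.0)]
print(gap_aware_windows(segs))
# → [[(0.0, 2.0), (2.3, 5.0)], [(6.5, 9.0)]]
```

Each resulting window can then be decoded as one unit, so short pauses stay inside a segment (preserving context) while genuine breaks in speech become window boundaries.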