dataset January 12, 2025

Optimizing Estonian TV Subtitles with Semisupervised Learning and LLMs

title: Optimizing Estonian TV Subtitles with Semisupervised Learning and LLMs

publish date:

2025-01-09

authors:

Artem Fedorchenko et al.

paper id:

2501.05234v1

download:

2501.05234v1

abstract:

This paper presents an approach for generating high-quality, same-language subtitles for Estonian TV content. We fine-tune the Whisper model on human-generated Estonian subtitles and enhance it with iterative pseudo-labeling and large language model (LLM) based post-editing. Our experiments demonstrate notable subtitle quality improvement through pseudo-labeling with an unlabeled dataset. We find that applying LLM-based editing at test time enhances subtitle accuracy, while its use during training does not yield further gains. This approach holds promise for creating subtitle quality close to human standard and could be extended to real-time applications.
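
The abstract names two techniques, iterative pseudo-labeling and LLM-based post-editing at test time, without giving implementation details. The following is a minimal, generic sketch of how such a loop is commonly structured; it is not the authors' code. The names `train`, `iterative_pseudo_labeling`, `subtitle_with_post_edit`, and `post_edit` are hypothetical, and details such as the number of rounds or any filtering of pseudo-labels are assumptions not stated in the abstract.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[str, str]      # (audio identifier, subtitle text)
Model = Callable[[str], str]   # maps an audio identifier to subtitle text


def iterative_pseudo_labeling(
    train: Callable[[Sequence[Example]], Model],
    labeled: Sequence[Example],
    unlabeled_audio: Sequence[str],
    rounds: int = 2,
) -> Model:
    """Generic self-training loop (hypothetical sketch, not the paper's code)."""
    # Initial model: trained on human-generated subtitles only.
    model = train(labeled)
    for _ in range(rounds):
        # Transcribe the unlabeled audio with the current model.
        pseudo = [(audio, model(audio)) for audio in unlabeled_audio]
        # Retrain on human labels plus the freshly generated pseudo-labels.
        model = train(list(labeled) + pseudo)
    return model


def subtitle_with_post_edit(
    model: Model, post_edit: Callable[[str], str], audio: str
) -> str:
    """Apply an editor (standing in for an LLM) at test time only."""
    return post_edit(model(audio))
```

In this sketch `train` would wrap a fine-tuning routine for a speech model such as Whisper, and `post_edit` would wrap a call to an LLM. Keeping the editor outside the training loop mirrors the abstract's finding that LLM editing helps at test time but adds nothing during training.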

Edited and compiled by: wanghaisheng. Last updated: January 12, 2025