Research paper for CiiT conference

Title: Evaluating STT Models for Macedonian and English: A Comparative Study

Authors: Cvetanka Nechevska and Ivan Kitanovski

Subject: STT models, Whisper, Chirp, Nova3, Vosk, Universal-2.

Language: English

Introduction

Speech-to-text models have become increasingly important in a variety of applications, and their popularity has revolutionized user interaction and the overall user experience. This paper presents a comparison of five speech-to-text (STT) models: Chirp, Whisper, Universal-2, Nova3, and Vosk, with a focus on their performance on Macedonian and English transcription tasks.

The models are evaluated using key metrics including Word Error Rate (WER), Words Per Second (WPS), Punctuation Accuracy (PA), Capitalization Accuracy (CA), and Real-Time Factor (RTF). Among the models tested, Whisper demonstrated the highest transcription accuracy and formatting quality, while Universal-2 exhibited superior speed and real-time performance. Chirp showed relatively strong results for Macedonian transcription, though its formatting capabilities were limited. Vosk, while offline-capable, lagged behind the other models in overall accuracy.
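To make three of these metrics concrete, here is a minimal Python sketch of how WER, RTF, and WPS are commonly computed. It is an illustration under stated assumptions, not the evaluation code used in the paper: the function names are hypothetical, WER is computed on lowercased, whitespace-split words, and WPS is assumed to mean transcribed words divided by processing time. PA and CA are left out because their exact definitions are not given here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: text is lowercased and split on whitespace before comparison,
    so capitalization differences are not counted as word errors.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; values below 1 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def words_per_second(hypothesis: str, processing_seconds: float) -> float:
    """WPS, assumed here to be transcribed words divided by processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return len(hypothesis.split()) / processing_seconds
```

For example, comparing the reference "the cat sat on the mat" with the hypothesis "the cat sit on mat" gives one substitution and one deletion over six reference words, so the WER is 2/6. A model that needs 2 seconds to transcribe 10 seconds of audio has an RTF of 0.2.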

The study aims to inform the selection of STT systems for multilingual and user-friendly web applications, particularly those involving the Macedonian language.

Once the full paper is officially published, I’ll share more details and findings here as well.