Hello Quynh Huynh (NON EA SC ALT),
Thank you for sharing these details and for raising this important feedback about Japanese transcription segmentation. The earlier explanation already covers the core product context and planned improvements well; I’d like to add a few additional insights and practical recommendations that may further help while Microsoft continues to enhance segmentation for character-based languages.
Additional Context on Segmentation Behavior
For character-based languages such as Japanese, segmentation challenges arise primarily because:
- Azure AI Video Indexer’s current models rely partly on punctuation and acoustic pauses for boundary detection.
- In Japanese, 句点 (。) and 読点 (、) are often optional or used inconsistently in conversational speech, so pauses may not always align with clear textual boundaries.
- This leads to longer caption blocks when speech is continuous and punctuation cues are limited.
Microsoft’s speech and video indexing teams are actively exploring acoustic-prosody-based segmentation, which uses intonation, pitch, and silence detection to split captions more naturally even in the absence of punctuation.
Configuration & Optimization Tips
While product-level improvements are underway, the following techniques can help optimize output today:
Explicit Language Tagging: When uploading or processing Japanese media, explicitly set the language parameter to "ja-JP" in the API or portal. This ensures that the correct acoustic and language models are applied and can improve segmentation sensitivity.
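For example, a minimal sketch of an upload call with the language fixed to Japanese is below. The account ID, location, access token, and video URL are placeholders, and the endpoint and parameter names follow the classic Video Indexer upload API, so please verify them against the current API reference:

```python
import requests

# Illustrative placeholders -- substitute your own account details and token.
LOCATION = "<location>"          # e.g. "trial" or your Azure region
ACCOUNT_ID = "<account-id>"
ACCESS_TOKEN = "<access-token>"  # obtained from the Video Indexer auth API
VIDEO_URL = "<public-or-sas-url-to-media>"

# Upload the media and force the Japanese model by setting language=ja-JP
# instead of relying on automatic language detection.
upload_url = f"https://api.videoindexer.ai/{LOCATION}/Accounts/{ACCOUNT_ID}/Videos"
params = {
    "name": "japanese-webinar",
    "language": "ja-JP",
    "videoUrl": VIDEO_URL,
    "accessToken": ACCESS_TOKEN,
}

response = requests.post(upload_url, params=params)
response.raise_for_status()
print(response.json()["id"])     # video ID you can poll for indexing progress
```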
Hybrid Workflow (Speech + Video Indexer): Consider combining Azure Speech to Text (for fine-grained punctuation and sentence boundary control) with Video Indexer (for timeline alignment and captions). You can generate transcripts with custom punctuation prediction models in the Speech service and import them into Video Indexer for synchronized caption generation.
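A rough sketch of the Speech-to-Text side of that workflow with the Python Speech SDK is shown below; the subscription key, region, and file name are placeholders, and the timed segments it collects would then feed your caption-generation step:

```python
import time
import azure.cognitiveservices.speech as speechsdk

# Illustrative placeholders -- replace with your own Speech resource and audio file.
speech_config = speechsdk.SpeechConfig(subscription="<speech-key>", region="<region>")
speech_config.speech_recognition_language = "ja-JP"
audio_config = speechsdk.audio.AudioConfig(filename="japanese-session.wav")

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

segments, done = [], False

def on_recognized(evt):
    # Offset/duration are reported in 100-nanosecond ticks; keep them so the
    # text can be re-aligned with the Video Indexer timeline later.
    segments.append((evt.result.offset, evt.result.duration, evt.result.text))

def on_stopped(evt):
    global done
    done = True

recognizer.recognized.connect(on_recognized)
recognizer.session_stopped.connect(on_stopped)
recognizer.canceled.connect(on_stopped)

recognizer.start_continuous_recognition()
while not done:
    time.sleep(0.5)
recognizer.stop_continuous_recognition()
```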
Post-processing Segmentation Scripts: Beyond VTT editing, some customers automate re-segmentation using time-based heuristics or pause-detection thresholds from the Speech API output (see the sketch after this list). For example:
- Split text every n seconds (e.g., every 6–7 seconds).
- Break lines on speech pauses longer than 400–600ms.
- Re-wrap text using accessibility guidelines (max ~42 characters per line).
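To illustrate, here is a minimal, hypothetical re-segmentation helper that applies these heuristics to word-level timings (for example, from the Speech API's detailed output). The thresholds and the (start_ms, end_ms, text) tuple format are assumptions you would adapt to your own pipeline:

```python
MAX_BLOCK_SECONDS = 7.0   # split text roughly every 6-7 seconds
MAX_PAUSE_MS = 500        # break on speech pauses longer than ~400-600 ms
MAX_LINE_CHARS = 42       # accessibility guideline for caption line length

def resegment(words):
    """Group word-level timings into caption blocks.

    words: list of (start_ms, end_ms, text) tuples in playback order.
    """
    blocks, current = [], []
    for word in words:
        if current:
            pause = word[0] - current[-1][1]               # silence since previous word
            duration = (word[1] - current[0][0]) / 1000.0  # block length so far, seconds
            if pause > MAX_PAUSE_MS or duration > MAX_BLOCK_SECONDS:
                blocks.append(current)
                current = []
        current.append(word)
    if current:
        blocks.append(current)
    return blocks

def wrap_block(block):
    """Join a block's text and wrap it at ~42 characters per line.

    Japanese has no word spaces, so wrapping is done on character count.
    """
    text = "".join(w[2] for w in block)
    return [text[i:i + MAX_LINE_CHARS] for i in range(0, len(text), MAX_LINE_CHARS)]
```

Each resulting block keeps its first start time and last end time, so you can emit standard VTT cues directly from it.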
Model Fine-tuning with Custom Speech: Training a custom speech model with data from your domain can improve punctuation prediction and, consequently, segmentation quality. Custom models better learn where natural speech boundaries occur for your speakers and content style.
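Once a custom model is trained and deployed, you reference it from the Speech SDK via its endpoint ID; a minimal sketch follows, where the key, region, endpoint ID, and file name are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholders: key, region, endpoint ID, and file name are illustrative.
speech_config = speechsdk.SpeechConfig(subscription="<speech-key>", region="<region>")
speech_config.speech_recognition_language = "ja-JP"
speech_config.endpoint_id = "<custom-speech-endpoint-id>"  # your deployed custom model

audio_config = speechsdk.audio.AudioConfig(filename="domain-sample.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

result = recognizer.recognize_once()   # single-utterance call, for brevity
print(result.text)
```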
Batch Processing with Azure Functions: If you regularly process large volumes of Japanese media, you can build a small Azure Function or Logic App (see the sketch after this list) that automatically:
- Retrieves the VTT from Video Indexer
- Applies custom segmentation logic
- Saves the refined captions back to Blob Storage or a CMS
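As a rough illustration, the sketch below uses the Python v2 Azure Functions programming model. The Video Indexer captions endpoint, container name, and connection string are placeholders based on the classic REST API, so please verify them against the current API reference:

```python
import logging
import requests
import azure.functions as func
from azure.storage.blob import BlobServiceClient

app = func.FunctionApp()

# Illustrative placeholders -- adjust to your Video Indexer account and storage.
VI_CAPTIONS_URL = ("https://api.videoindexer.ai/<location>/Accounts/<account-id>"
                   "/Videos/{video_id}/Captions")
BLOB_CONNECTION_STRING = "<storage-connection-string>"

def refine_vtt(vtt_text: str) -> str:
    # Plug in your own re-segmentation logic here (e.g. the resegment/wrap_block
    # helpers sketched earlier); pass-through for brevity.
    return vtt_text

@app.route(route="refine-captions", auth_level=func.AuthLevel.FUNCTION)
def refine_captions(req: func.HttpRequest) -> func.HttpResponse:
    video_id = req.params.get("videoId")
    access_token = req.params.get("accessToken")

    # 1. Retrieve the VTT captions from Video Indexer.
    resp = requests.get(VI_CAPTIONS_URL.format(video_id=video_id),
                        params={"format": "Vtt", "accessToken": access_token})
    resp.raise_for_status()

    # 2. Apply custom segmentation logic.
    refined = refine_vtt(resp.text)

    # 3. Save the refined captions back to Blob Storage.
    blob = BlobServiceClient.from_connection_string(BLOB_CONNECTION_STRING) \
        .get_blob_client(container="captions", blob=f"{video_id}.vtt")
    blob.upload_blob(refined, overwrite=True)

    logging.info("Refined captions stored for video %s", video_id)
    return func.HttpResponse("Captions refined and saved.", status_code=200)
```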
Microsoft’s product teams are aware that CJK segmentation is a key area for accessibility improvement. Beyond punctuation-based splitting, upcoming iterations will leverage multimodal models that combine both audio and semantic cues to ensure captions remain concise, readable, and accessible across all languages.
Please refer to these resources:
- Azure AI Video Indexer Overview
- Get Media Transcription, Translation, and Language Identification Insights
- Automatic Language Detection Model
I hope this helps. Do let me know if you have any further queries.
Thank you!