Hello Quynh Huynh (NON EA SC ALT),
Thank you for sharing these details and for raising this important feedback about Japanese transcription segmentation. The earlier explanation already covers the core product context and planned improvements well; I’d like to add a few additional insights and practical recommendations that may further help while Microsoft continues to enhance segmentation for character-based languages.
Additional Context on Segmentation Behavior
For character-based languages such as Japanese, segmentation challenges arise primarily because:
- Azure AI Video Indexer’s current models rely partly on punctuation and acoustic pauses for boundary detection.
- In Japanese, 句点 (。) and 読点 (、) are often optional or used inconsistently in conversational speech, so pauses may not always align with clear textual boundaries.
- This leads to longer caption blocks when speech is continuous and punctuation cues are limited.
Microsoft’s speech and video indexing teams are actively exploring acoustic-prosody-based segmentation, which uses intonation, pitch, and silence detection to split captions more naturally even in the absence of punctuation.
Configuration & Optimization Tips
While product-level improvements are underway, the following techniques can help optimize output today:
Explicit Language Tagging: When uploading or processing Japanese media, explicitly set the language parameter to "ja-JP" in the API or portal. This ensures that the correct acoustic and language models are applied and can improve segmentation sensitivity.
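For example, a minimal sketch of an upload call with the language fixed to Japanese is below. The account ID, location, access token, and video URL are placeholders, and the endpoint and parameter names follow the classic Video Indexer upload API, so please verify them against the current API reference:

```python
import requests

# Illustrative placeholders -- substitute your own account details and token.
LOCATION = "<location>"          # e.g. "trial" or your Azure region
ACCOUNT_ID = "<account-id>"
ACCESS_TOKEN = "<access-token>"  # obtained from the Video Indexer auth API
VIDEO_URL = "<public-or-sas-url-to-media>"

# Upload the media and force the Japanese model by setting language=ja-JP
# instead of relying on automatic language detection.
upload_url = f"https://api.videoindexer.ai/{LOCATION}/Accounts/{ACCOUNT_ID}/Videos"
params = {
    "name": "japanese-webinar",
    "language": "ja-JP",
    "videoUrl": VIDEO_URL,
    "accessToken": ACCESS_TOKEN,
}

response = requests.post(upload_url, params=params)
response.raise_for_status()
print(response.json()["id"])     # video ID you can poll for indexing progress
```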
Hybrid Workflow (Speech + Video Indexer): Consider combining Azure Speech to Text (for fine-grained punctuation and sentence boundary control) with Video Indexer (for timeline alignment and captions). You can generate transcripts with custom punctuation prediction models in the Speech service and import them into Video Indexer for synchronized caption generation.
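A rough sketch of the Speech-to-Text side of that workflow with the Python Speech SDK is shown below; the subscription key, region, and file name are placeholders, and the timed segments it collects would then feed your caption-generation step:

```python
import time
import azure.cognitiveservices.speech as speechsdk

# Illustrative placeholders -- replace with your own Speech resource and audio file.
speech_config = speechsdk.SpeechConfig(subscription="<speech-key>", region="<region>")
speech_config.speech_recognition_language = "ja-JP"
audio_config = speechsdk.audio.AudioConfig(filename="japanese-session.wav")

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

segments, done = [], False

def on_recognized(evt):
    # Offset/duration are reported in 100-nanosecond ticks; keep them so the
    # text can be re-aligned with the Video Indexer timeline later.
    segments.append((evt.result.offset, evt.result.duration, evt.result.text))

def on_stopped(evt):
    global done
    done = True

recognizer.recognized.connect(on_recognized)
recognizer.session_stopped.connect(on_stopped)
recognizer.canceled.connect(on_stopped)

recognizer.start_continuous_recognition()
while not done:
    time.sleep(0.5)
recognizer.stop_continuous_recognition()
```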
Post-processing Segmentation Scripts: Beyond VTT editing, some customers automate re-segmentation using time-based heuristics or pause-detection thresholds from the Speech API output (see the sketch after this list). For example:
- Split text every n seconds (e.g., every 6–7 seconds).
- Break lines on speech pauses longer than 400–600ms.
- Re-wrap text using accessibility guidelines (max ~42 characters per line).
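To illustrate, here is a minimal, hypothetical re-segmentation helper that applies these heuristics to word-level timings (for example, from the Speech API's detailed output). The thresholds and the (start_ms, end_ms, text) tuple format are assumptions you would adapt to your own pipeline:

```python
MAX_BLOCK_SECONDS = 7.0   # split text roughly every 6-7 seconds
MAX_PAUSE_MS = 500        # break on speech pauses longer than ~400-600 ms
MAX_LINE_CHARS = 42       # accessibility guideline for caption line length

def resegment(words):
    """Group word-level timings into caption blocks.

    words: list of (start_ms, end_ms, text) tuples in playback order.
    """
    blocks, current = [], []
    for word in words:
        if current:
            pause = word[0] - current[-1][1]               # silence since previous word
            duration = (word[1] - current[0][0]) / 1000.0  # block length so far, seconds
            if pause > MAX_PAUSE_MS or duration > MAX_BLOCK_SECONDS:
                blocks.append(current)
                current = []
        current.append(word)
    if current:
        blocks.append(current)
    return blocks

def wrap_block(block):
    """Join a block's text and wrap it at ~42 characters per line.

    Japanese has no word spaces, so wrapping is done on character count.
    """
    text = "".join(w[2] for w in block)
    return [text[i:i + MAX_LINE_CHARS] for i in range(0, len(text), MAX_LINE_CHARS)]
```

Each resulting block keeps its first start time and last end time, so you can emit standard VTT cues directly from it.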
Model Fine-tuning with Custom Speech: Training a custom speech model with data from your domain can improve punctuation prediction and, consequently, segmentation quality. Custom models better learn where natural speech boundaries occur for your speakers and content style.
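Once a custom model is trained and deployed, you reference it from the Speech SDK via its endpoint ID; a minimal sketch follows, where the key, region, endpoint ID, and file name are placeholders:

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholders: key, region, endpoint ID, and file name are illustrative.
speech_config = speechsdk.SpeechConfig(subscription="<speech-key>", region="<region>")
speech_config.speech_recognition_language = "ja-JP"
speech_config.endpoint_id = "<custom-speech-endpoint-id>"  # your deployed custom model

audio_config = speechsdk.audio.AudioConfig(filename="domain-sample.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)

result = recognizer.recognize_once()   # single-utterance call, for brevity
print(result.text)
```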
Batch Processing with Azure Functions: If you regularly process large volumes of Japanese media, you can build a small Azure Function or Logic App (see the sketch after this list) that automatically:
- Retrieves the VTT from Video Indexer
- Applies custom segmentation logic
- Saves the refined captions back to Blob Storage or a CMS
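As a rough illustration, the sketch below uses the Python v2 Azure Functions programming model. The Video Indexer captions endpoint, container name, and connection string are placeholders based on the classic REST API, so please verify them against the current API reference:

```python
import logging
import requests
import azure.functions as func
from azure.storage.blob import BlobServiceClient

app = func.FunctionApp()

# Illustrative placeholders -- adjust to your Video Indexer account and storage.
VI_CAPTIONS_URL = ("https://api.videoindexer.ai/<location>/Accounts/<account-id>"
                   "/Videos/{video_id}/Captions")
BLOB_CONNECTION_STRING = "<storage-connection-string>"

def refine_vtt(vtt_text: str) -> str:
    # Plug in your own re-segmentation logic here (e.g. the resegment/wrap_block
    # helpers sketched earlier); pass-through for brevity.
    return vtt_text

@app.route(route="refine-captions", auth_level=func.AuthLevel.FUNCTION)
def refine_captions(req: func.HttpRequest) -> func.HttpResponse:
    video_id = req.params.get("videoId")
    access_token = req.params.get("accessToken")

    # 1. Retrieve the VTT captions from Video Indexer.
    resp = requests.get(VI_CAPTIONS_URL.format(video_id=video_id),
                        params={"format": "Vtt", "accessToken": access_token})
    resp.raise_for_status()

    # 2. Apply custom segmentation logic.
    refined = refine_vtt(resp.text)

    # 3. Save the refined captions back to Blob Storage.
    blob = BlobServiceClient.from_connection_string(BLOB_CONNECTION_STRING) \
        .get_blob_client(container="captions", blob=f"{video_id}.vtt")
    blob.upload_blob(refined, overwrite=True)

    logging.info("Refined captions stored for video %s", video_id)
    return func.HttpResponse("Captions refined and saved.", status_code=200)
```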
Microsoft’s product teams are aware that CJK segmentation is a key area for accessibility improvement. Beyond punctuation-based splitting, upcoming iterations will leverage multimodal models that combine both audio and semantic cues to ensure captions remain concise, readable, and accessible across all languages.
Please refer to these resources:
- Azure AI Video Indexer Overview
- Get Media Transcription, Translation, and Language Identification Insights
- Automatic Language Detection Model
I hope this helps. Do let me know if you have any further queries.
Thank you!