Hi Vishal Rawat,
Thank you for reaching out, and I completely understand your frustration with this issue. You're not alone. Several developers have encountered the same challenge when trying to switch modalities mid-session.
Switching modalities mid-session does not work reliably in either the OpenAI or the Azure implementation.
When you send a session.update event to switch between ["text", "audio"] and ["text"], the change does not appear to take effect, because the modalities configuration seems to be locked at session initialization. The likely reason is that audio and text processing use fundamentally different pipelines, and the underlying connections are established when the session starts.
Here is a relevant reference that discusses this behavior:
https://community.openai.com/t/realtime-api-updating-modalities/996243/3
Recommended Solution:
Keep modalities: ["text", "audio"] enabled throughout the session and control the audio behavior at the application level. In other words, manage when audio features are actively used through your client-side logic rather than trying to reconfigure the session.
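To illustrate the idea, here is a minimal Python sketch of client-side gating. It is only a sketch under assumptions, not an official implementation: the AudioGate class and the build_session_update helper are hypothetical names of my own, and the audio event type names ("response.audio.delta", "response.audio.done") are assumed from the Realtime API preview, so please verify them against the API version you are using.

```python
import json

# Assumed audio event type names; check them against your API version.
AUDIO_EVENT_TYPES = {"response.audio.delta", "response.audio.done"}


class AudioGate:
    """Hypothetical client-side switch deciding whether audio events reach playback."""

    def __init__(self, audio_enabled=True):
        self.audio_enabled = audio_enabled

    def should_forward(self, event):
        # Audio events pass only while audio is enabled in the app.
        if event.get("type") in AUDIO_EVENT_TYPES:
            return self.audio_enabled
        # Every other event (text deltas, errors, and so on) always passes.
        return True


def build_session_update():
    """Build the one-time session.update payload that keeps both modalities on."""
    return json.dumps({
        "type": "session.update",
        "session": {"modalities": ["text", "audio"]},
    })
```

You would send the build_session_update() payload once at the start of the session, then call should_forward(event) on each incoming server event and toggle gate.audio_enabled whenever the user switches between voice and text mode. The session itself is never reconfigured.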
Feel free to accept this as an answer.
Thank you for reaching out to the Microsoft Q&A portal!