Direct Methods 'Not Found' under load in IoT Hub

Question

Iain White 146

We are performing some load tests to check that our Edge Module is able to process the number of Direct Methods we need it to.

We've been using the azure-iot-hub python module to call the direct methods. The payload is very small and is the same for every call.

Most of the direct methods are handled, but a few get a 'Not Found' response (~10 out of 10000 fail - and they all fail at the same time).

When they work we see:

urllib3.connectionpool - DEBUG - https://<OUR_IOT_HUB>.azure-devices.net:443 "POST /twins/<OUR_EDGE_DEVICE_NAME>/modules/<OUR_MODULE_NAME>/methods?api-version=2021-04-12 HTTP/1.1" 200 27

And when they fail we see:

urllib3.connectionpool - DEBUG - https://<OUR_IOT_HUB>.azure-devices.net:443 "POST /twins/<OUR_EDGE_DEVICE_NAME>/modules/<OUR_MODULE_NAME>/methods?api-version=2021-04-12 HTTP/1.1" 404 348

The module is always deployed so it can't be the module that isn't being found. Network connectivity looks solid and there are no errors in the edge module logs - seems like the failed calls aren't reaching it.

Is there any explanation for why this might be happening? It seems to be at the Azure side. It doesn't look like throttling, because I'd expect a 429 response for that.

Any suggestions for what might be happening here?

Iain White 146 Reputation points

2025-04-02T07:45:38.2866667+00:00

Thanks @VSawhney .

Do you have any thoughts on how a memory failure could result in a 404 response to a Direct Method call?

I agree that adding resilience to the application making the calls is a good approach, but my main focus here is understanding why I’m seeing this specific response.

Rather than suggestions on handling failures client-side, I’m trying to pinpoint why IoT Hub is returning a 404 when the Direct Method exists on the target machine. Any insights would be much appreciated!
VSawhney 320 Reputation points Microsoft External Staff

2025-04-02T10:37:42.6466667+00:00

Hello Iain White,

In my experience, the module may become temporarily unavailable (unhealthy) because of restarts, or because of resource or request constraints in each region. These are transient issues that can be handled with retry logic and by tuning other parameters such as sleep time and TTL, as mentioned above.
Hope this helps. If you need any further assistance, please feel free to reach out to us.

Thank you!
VSawhney 320 Reputation points Microsoft External Staff

2025-04-03T10:30:01.8266667+00:00

Hello Iain White,

I hope you went through the suggestion provided and solved your issue. If you have any further queries, please feel free to reach out to us.

Thank you!
VSawhney 320 Reputation points Microsoft External Staff

2025-04-04T08:49:15.7433333+00:00

Hello Iain White,

Following up again to check whether the suggestion provided solved your issue. If you have any further queries, please feel free to reach out to us.

Thank you!
Iain White 146 Reputation points

2025-04-04T15:15:36.6966667+00:00

@VSawhney No, I haven't solved it. Ideally someone could tell me what causes the IoT Hub to return a 404348. I've found this list of error codes but 404348 isn't listed.

I think this could be a problem at the Azure side because I'm not seeing it hit our edge device. If anyone can suggest ways to prove or disprove this theory please let me know.
LeelaRajeshSayana-MSFT 17,601 Reputation points

2025-04-08T02:40:48.0166667+00:00
@Iain White Can you answer the questions below to help us better understand the behavior you are seeing?

When you say ~10 out of 10000 fail, did the 10 calls fail consecutively?

Is the response always 404 348 for all 10 failed cases? 404 is the status code being returned here and 348 is part of the response payload. I suspect it could be a timeout in seconds.

What is the interval between each call you make to the direct method?

Did you set an explicit responseTimeoutInSeconds when you make the call?

Can you share a snippet of the code showing how you invoke the direct method?

Can you call the ping direct method on the IoT Edge module?

Can you capture additional debug logs using support-bundle and see whether they contain any more information?

If the calls to the direct method resume after a slight delay, it sounds like a transient error on the IoT Edge end. Try increasing the timeout on the direct method call by raising the responseTimeoutInSeconds parameter. Please refer to the example in the following document on how to make this call. Incorporate a retry mechanism and retry the call after a slight delay when an error occurs.
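As a rough illustration of where responseTimeoutInSeconds sits in the call, here is a minimal sketch that builds the URL and JSON body for the REST endpoint seen in the log lines in the question. The function name `build_method_request` and the hub, device, module, and method names in the usage below are hypothetical; the 5 to 300 and 0 to 300 second bounds and the defaults are what the direct methods documentation describes, as far as I know, so check them against the current docs.

```python
import json

API_VERSION = "2021-04-12"  # same api-version as in the question's log lines


def build_method_request(hub, device_id, module_id, method_name, payload,
                         response_timeout=30, connect_timeout=0):
    """Return (url, json_body) for a module direct method call over REST.

    Hypothetical helper for illustration; it does not send anything.
    """
    # Bounds assumed from the direct methods documentation.
    if not 5 <= response_timeout <= 300:
        raise ValueError("responseTimeoutInSeconds must be between 5 and 300")
    if not 0 <= connect_timeout <= 300:
        raise ValueError("connectTimeoutInSeconds must be between 0 and 300")

    url = (f"https://{hub}.azure-devices.net/twins/{device_id}"
           f"/modules/{module_id}/methods?api-version={API_VERSION}")
    body = {
        "methodName": method_name,
        "responseTimeoutInSeconds": response_timeout,
        "connectTimeoutInSeconds": connect_timeout,
        "payload": payload,
    }
    return url, json.dumps(body)


if __name__ == "__main__":
    url, body = build_method_request(
        "example-hub", "example-device", "example-module",
        "exampleMethod", {"value": 1}, response_timeout=60)
    print(url)
    print(body)
```

With the azure-iot-hub Python package the same two values are, I believe, passed as `response_timeout_in_seconds` and `connect_timeout_in_seconds` on `CloudToDeviceMethod`.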
AshokPeddakotla-MSFT 35,951 Reputation points

2025-04-09T03:50:43.0866667+00:00

@Iain White Did you get a chance to look at Leela's suggestions?
Please let us know if you are still blocked.
Iain White 146 Reputation points

2025-04-09T10:10:40.43+00:00

Thanks for the assistance @LeelaRajeshSayana-MSFT .

When failures happen, there tend to be a number of them consecutively. It seems to me like there's some kind of blip in the handling of the Direct Method requests at the Azure side, maybe while a config refresh or something similar happens.

My current theory is that this is an issue at the IoT Hub, because I can't see any failures in my edge module. I see all the successful calls. Judging by the 404 'Not Found' response, the IoT Hub seems to think the module doesn't exist for a short period. This response is coming from the IoT Hub. It isn't coming from my module, because the request isn't reaching my module.

It is always 404 348, but I think after investigation this is the number of bytes in the response and nothing more insightful.
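For anyone checking the same thing: in urllib3's connection pool debug output, the two trailing fields are the HTTP status and the response length, which is consistent with 348 being a byte count. Below is a minimal sketch for pulling both out of such lines. The regex and the function name `parse_urllib3_line` are my own, written against the two log lines in the question, and are not part of urllib3.

```python
import re

# Matches the tail of a urllib3.connectionpool DEBUG line:
#   "<METHOD> <path> <HTTP version>" <status> <length>
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<version>HTTP/[\d.]+)" '
    r'(?P<status>\d{3}) (?P<length>\d+|None)$'
)


def parse_urllib3_line(line):
    """Return a dict with method, path, status and length, or None if no match."""
    match = LOG_PATTERN.search(line.strip())
    if match is None:
        return None
    length = match.group("length")
    return {
        "method": match.group("method"),
        "path": match.group("path"),
        "status": int(match.group("status")),
        # urllib3 logs None when the length is unknown (for example chunked bodies).
        "length": None if length == "None" else int(length),
    }


if __name__ == "__main__":
    failed = ('urllib3.connectionpool - DEBUG - https://<OUR_IOT_HUB>.azure-devices.net:443 '
              '"POST /twins/<OUR_EDGE_DEVICE_NAME>/modules/<OUR_MODULE_NAME>/methods'
              '?api-version=2021-04-12 HTTP/1.1" 404 348')
    print(parse_urllib3_line(failed))
```

Counting parsed lines by status is then a quick way to confirm that every failure in a run is a 404 and not a mix of codes.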

To rule out connectivity issues between our Edge Device and the IoT Hub, I've also tested on an Azure-hosted VM running as an edge device. We see exactly the same sporadic 'Not Found' failures in Direct Method calls when the load tests are pointed at it. We send around 5 Direct Method calls a second.
I didn't set a specific timeout, but since I am getting a 404 I don't expect this to be related to timeouts. If it is, then I should be getting a timeout response code, not a 404, right?

Sorry - I don't understand why you're asking if I am able to call the ping Direct Method. I am able to call my actual Direct Method on the vast majority of occasions. But if it does help, yes I can call the 'ping' direct method on the edgeAgent.

We do have a robust retry mechanism implemented, but I want to find out why this is happening at relatively low load. Our system needs to be able to potentially handle hundreds of Direct Methods a second. I need to be sure it can be handled by our module and also by the IoT Hub.

Here's the class I'm using to invoke the Direct Methods. As you can see it uses the azure-iot-hub python module.
Iain White 146 Reputation points

2025-04-10T10:48:24.03+00:00
A bit more info @LeelaRajeshSayana-MSFT after I've added more logging.

When we get these 'Not Found' responses, the requests are actually reaching the edge module and our edge module is sending a success response, within the same timeframe as all of the other responses (less than a second).

When the issue occurs, it seems we get a few 200 responses but with empty payloads and a response.status of '0'. In the most recent example there were 3 of these. Then we got 19 404s with a response.status of 'Not Found'. This happened within a 2 second window and then it was fine again.

We've been running these load tests a lot now and the issue only seems to happen every hour or so.

Today it happened at 09.17 and 10.27.
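To check whether the bursts really recur on a regular period, the failure timestamps can be grouped into bursts and the gaps between burst starts measured. The sketch below is one way to do that. The function names, the 5 second grouping window, and the seconds values in the sample data are assumptions for illustration; only the two clock times come from this comment.

```python
from datetime import datetime, timedelta


def group_bursts(timestamps, max_gap=timedelta(seconds=5)):
    """Group failure timestamps into bursts separated by more than max_gap."""
    bursts = []
    for ts in sorted(timestamps):
        if bursts and ts - bursts[-1][-1] <= max_gap:
            bursts[-1].append(ts)   # still part of the current burst
        else:
            bursts.append([ts])     # start a new burst
    return bursts


def burst_intervals(bursts):
    """Return the time between the start of each burst and the next."""
    starts = [burst[0] for burst in bursts]
    return [later - earlier for earlier, later in zip(starts, starts[1:])]


if __name__ == "__main__":
    # Illustrative data: the seconds are invented, the clock times are from the post.
    failures = [
        datetime(2025, 4, 10, 9, 17, 0),
        datetime(2025, 4, 10, 9, 17, 1),
        datetime(2025, 4, 10, 10, 27, 0),
        datetime(2025, 4, 10, 10, 27, 1),
    ]
    for gap in burst_intervals(group_bursts(failures)):
        print(f"{gap.total_seconds() / 60:.0f} minutes between bursts")
```

If the gaps cluster around a fixed value across many runs, that points towards something periodic, such as a connection or token refresh, and away from load-dependent behaviour.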

I am still trying to see if this is an issue on our side, but it would be helpful to know the extent of the load testing that has been done already at the Azure IoT Hub side by Microsoft.

Why I think this is an issue at the Azure side -

My module logs show no errors and for every failure I can see in the logs the particular unique command ID being received and responded to.

The Azure side Direct Method logging reports "The operation failed because the requested device isn't online. To learn more, see https://aka.ms/iothub404103", but the device clearly is online, as I can see it handling the requests in the module logs.
LeelaRajeshSayana-MSFT 17,601 Reputation points

2025-04-15T14:29:08.5866667+00:00

@Iain White I have reached out to you through a private message about next steps on this issue. Please respond to the private message with the requested information.

Answer 1

Hello Iain White,

As you mentioned, 10 out of 10000 requests are failing intermittently. It might be a memory failure on the local or server machine.
You may try the following steps:

Retry logic: Tune the number of retries and the delay between concurrent requests.
Reset the TLS state.
Increase the TTL to keep the session connected.
Increase the timeout parameter in your code.
Distribute traffic across multiple devices.

Ref doc: Understand and invoke direct methods from IoT Hub
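The retry step could look something like the sketch below: a generic wrapper with exponential backoff and jitter. The name `call_with_retry`, its parameters, and the default delays are assumptions for illustration. They are not part of the azure-iot-hub package and would need tuning for the actual call rate.

```python
import random
import time


def call_with_retry(operation, is_transient, max_attempts=5,
                    base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call operation(); retry with exponential backoff while errors look transient.

    operation:    zero-argument callable that performs the direct method call.
    is_transient: callable taking the exception and returning True if a retry is worthwhile.
    sleep:        injectable so tests can skip real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on errors that are not transient.
            if attempt == max_attempts or not is_transient(exc):
                raise
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter keeps concurrent callers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

In practice `operation` would wrap the SDK call (for example a lambda around `invoke_device_module_method` on the registry manager), and `is_transient` would inspect the raised error for the 404 seen here. Since the failures in this thread cluster in windows of a couple of seconds, the total backoff should comfortably exceed that window.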

Hope this helps. Please let us know if you need any further assistance.

Thank you!
