Catastrophic API “code 500, Internal error” error rate on all API call verbs
Hello Leslie,
Rather than a long speech, I'm attaching a view of my Firebase console logging the API's errors...
Allow me to ask you the following questions:
- Why do you take so long to respond to a critical problem (see my previous post)?
- Why don't you communicate proactively and transparently?
- When can we expect a return to normal?
In short, I'd like to know what I should tell my users...
Thank you in advance for the detailed answer you can give me.
Comments
Hello,
We performed some tests with the developers over the past few days:
The problem seems to affect "big" requests. For a call to /getmeasure, if you don't use the "limit" or "date_begin" parameter, 1024 values are returned in the JSON response. I tested it, and about 3 out of 4 such requests return a 500 error after a long delay (4 to 6 seconds)
If you reduce the limit to 30, for example, or set date_begin to a few days back, the API responds correctly
And in that case the 3-hour ("3hours") scale also works correctly
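For illustration, a reduced request along those lines would look something like this (a minimal Python sketch: only "limit", "date_begin" and the "3hours" scale come from the tests above; the endpoint URL, the other parameter names and the token/IDs are assumptions and placeholders, not values from this thread):

import time
import requests

API_URL = "https://api.netatmo.com/api/getmeasure"   # assumed endpoint URL
TOKEN = "YOUR_ACCESS_TOKEN"                          # placeholder
DEVICE_ID = "70:ee:50:xx:xx:xx"                      # placeholder station MAC

params = {
    "device_id": DEVICE_ID,       # assumed parameter name
    "scale": "3hours",            # the scale discussed in this thread
    "type": "temperature",        # assumed measurement type
    "limit": 30,                  # reduced from the 1024-value default
    "date_begin": int(time.time()) - 3 * 86400,  # only the last few days
}

resp = requests.get(API_URL, params=params,
                    headers={"Authorization": f"Bearer {TOKEN}"}, timeout=10)
print(resp.status_code, resp.json())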
The developers tried scaling up some services, but unfortunately that didn't seem to improve performance. There is a meeting tomorrow with the infrastructure team to discuss it
Have a good day,
Leslie - Community Manager
Thank you, Leslie, for this feedback.
However, the proposed workaround for the “getmeasure” call verb is not tenable: to retrieve the same amount of measurement history, I will automatically run into the API's request rate limit (see the quick arithmetic below).
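To put numbers on that, a back-of-the-envelope Python sketch (the module count is illustrative, and the actual quota values are not quoted in this thread):

import math

points_per_full_call = 1024   # values returned when limit/date_begin are omitted
reduced_limit = 30            # the suggested workaround
calls_per_history = math.ceil(points_per_full_call / reduced_limit)  # 35 calls

modules = 10                  # illustrative fleet size
print(calls_per_history, calls_per_history * modules)  # 35 per history, 350 per sync

So a single full-history call becomes roughly 35 calls per module, which is exactly what pushes me toward the rate limit.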
Furthermore, as I pointed out in the title of this thread, “code 500 internal error” errors now occur on all types of call verbs: getstationdata, gethomecoachdata, getmeasure, homestatus, etc. I've attached a graph of “code 500, internal error” errors, which clearly shows the deterioration in your API's quality of service since October 15.
Do not hesitate to show this graph to your infrastructure team tomorrow, as it is now vital and urgent that they grasp the scale of the problem. I've been publishing applications using your API since 2017, and I've never seen such a deterioration in stability.
I've reached the point where I'm seriously wondering whether I should drop compatibility with your Netatmo equipment in favor of competing products. Your teams' response to this problem will show whether or not there is a willingness to move this partnership forward...
I wish you a good day and look forward to hearing from you.
Hello,
A little follow-up after the infrastructure team meeting:
They improved the performance of the service at the beginning of the week to try to reduce failures. They have noticed that the number of failures has decreased and is getting closer to what it was before the problems began, but the 500 error rate is still above normal. They will try another optimization of the service soon, in the hope that it resolves the situation
Please note that I'll be off next week, but the teams are aware of the importance of addressing this issue. If there is any news, I've asked my colleagues to post it here
Have a good day,
Leslie - Community Manager
Hello Leslie,
I'm relieved to hear that the matter is being handled. I'll be following the evolution of the code 500 error rate, and I sincerely hope that things will get back to normal. If the 3-hour scale problem I mentioned in my other thread can be fixed at the same time, that would be ideal.
In any case, thank you for your efforts to solve these problems. I sincerely hope that we'll come out of this crisis on top.
I look forward to your feedback on the results of your optimization operations.
I wish you a good day and a happy vacation.
Hello,
Last week (29 October), the infrastructure team configured more powerful machines to process measurement calls. Did you notice an improvement in the service on your side?
Thanks for your feedback and have a good day,
Leslie - Community Manager
Hi Leslie,
I can confirm an improvement in the “500 internal error” rate. I've attached the graph showing a real improvement.
The 3-hour scale for the verb getmeasure still doesn't seem to work.
However, there's still a background level of 500 internal errors that is rising little by little. I've attached a graph since October 29 showing this increase.
Lastly, between 11 a.m. and 1:40 p.m. today (French time, November 6) we saw a very high number of “code 11, internal error” errors.
Can you confirm that these improvements will be sustained over time, especially given the second curve, and can you ask for an investigation into this new code 11 problem I detected between 11 a.m. and 1:40 p.m.?
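For reference, this is roughly how I separate and count these codes for the graphs (a Python sketch: it assumes error responses carry a JSON body with an "error.code" field, which is how the codes quoted in this thread appear on my side, and the in-memory counter stands in for the real Firebase logging):

from collections import Counter

import requests

error_counts = Counter()

def call_and_classify(url, **kwargs):
    # Perform an API call and tally the error code when the call fails.
    resp = requests.get(url, **kwargs)
    if resp.status_code != 200:
        try:
            code = resp.json().get("error", {}).get("code", resp.status_code)
        except ValueError:          # body was not JSON
            code = resp.status_code
        error_counts[code] += 1     # e.g. 500 or 11, as plotted in the graphs
    return resp

# After a batch of calls, error_counts gives the per-code totals for the period.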
Thanks in advance for your feedback.
To follow up on yesterday's alert about the very high “code 11: internal error” rate, I'm attaching the graphs, Leslie. You'll agree that the trend is really not good.
I'd be interested in your explanation and in any actions you plan to take, so that I know what to tell my users.
Thanks in advance for your feedback.
Hello,
The high rate of code 11 errors on the 6th between 12:00 and 14:00 corresponds to a production deployment of fixes for the 500 error problem, so it's expected
As these improvements appear to be necessary for a good quality of service, they will be maintained. And as you can see, the teams are still working on it and will continue to improve the service over time
Have a good day,
Leslie - Community Manager
Hello Leslie,
Thank you for these explanations.
Am I to understand that, after your optimizations, we should also be able to retrieve fully functional measurement histories with a 3-hour scale?
Can we be warned a little in advance of these production deployments, which may result in occasional unavailability of the API?
I'm sure you'll agree that responding to customer complaints about November 6 is a little complicated if all I can say is: “this malfunction is normal; it's a production release of a fix aimed at improving service quality...”.
Good day to you.
Hi again,
Yes, that's the goal. We had about 100 errors/hour on this endpoint before the problem, up to 3,000 during it, and now it stays around 200. The developers told me they are still working to improve the service
It's difficult for me to give advance warning because I don't have a dedicated broadcast channel for application developers (most users are not developers and only have a developer app to link their devices to a third-party service, so these technical communications would be useless to them)
In the present case we wanted to push the fix as soon as possible, but usually these kinds of operations are planned so as not to have much visible impact on the availability of the service
Have a good day,
Leslie - Community Manager