RMS API / Remote access outage

bcraig · December 18, 2024, 4:59pm

Hi,

There seems to be an RMS outage (API, remote WebUI and device CLI) that appears to have started around 11:28 EST. None of our field devices are receiving the commands we send them through the RMS API, and remote WebUI and device CLI requests time out for all our devices.

Could you advise on any current outage and an ETA for resolution?

Thanks,

Estexs · December 18, 2024, 5:13pm

Same issue here.

I already tried to call Teltonika but nobody answered.

An ETA would be good. I have also noticed that all our devices are currently not responding despite showing as online.

It's not the first time. After the major outage during their maintenance, I'm very nervous about this issue happening again…

GPGC24 · December 18, 2024, 5:18pm

Same issue here. WebUI times out for all connected devices

GPGC24 · December 18, 2024, 5:53pm

Anyone remember how long it took to get functionality back after “last time”?

Estexs · December 18, 2024, 5:56pm

@GPGC24
The last big outage was about 20 hours long. I really hope that will NOT happen again and that this is a different situation…

ato · December 18, 2024, 6:52pm

Same for me - can’t connect to any of our RUT200s - get a timeout message

Do these outages happen often?

Updated:
Seems the last one happened on Nov 24th, so not long ago…

## Summary of the Incident
Unexpected complications during planned maintenance led to extended downtime. Server updates caused connection issues with devices, and despite efforts, these could not be resolved within a reasonable timeframe. The system was reverted, though connection problems persisted, resulting in approximately 24 hours of downtime, with devices gradually reconnecting over the next 72 hours.

ato · December 18, 2024, 8:56pm

Back online now for me - yay

estretch · December 18, 2024, 9:00pm

Yup, seems to be working for me again too. Hopefully this isn’t a regular thing for them…

JustasB · December 19, 2024, 11:45am

Hello Everyone,

Current Status
We are pleased to inform you that the RMS system and its services have been fully restored. Our team is actively monitoring the systems to ensure continued stability and performance.

Incident Summary

Issue: On 2024-12-18, between 16:00 and 18:00 UTC, users experienced “Timeout” errors while performing actions within the RMS platform or using the RMS API. Affected operations included firmware upgrades, backup uploads, retrieving monitoring data, and generating new remote access links.

Root Cause: The incident was caused by a memory leak in one of the system applications. This led to an unexpected overload of a key virtual machine, which serves as the foundation for all dependent system components. As a result, services and operational applications froze, preventing the system from processing new actions.
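
For illustration only: a common safeguard against this kind of failure is a periodic check of available memory that triggers a restart before the machine freezes. The sketch below is not the RMS team's mechanism, which the post does not describe. It assumes a Linux host exposing `/proc/meminfo`, and the function names, the 10% threshold and the `restart` callback are all hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: kibibytes}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_is_low(meminfo_text, min_available_fraction=0.10):
    """Return True when MemAvailable / MemTotal is below the threshold."""
    info = parse_meminfo(meminfo_text)
    return info["MemAvailable"] / info["MemTotal"] < min_available_fraction


def check_and_restart(meminfo_text, restart, min_available_fraction=0.10):
    """Call restart() if memory is low; report whether a restart was triggered.

    `restart` is a caller-supplied callback (assumption), e.g. a function
    that asks the service manager to restart the affected services.
    """
    if memory_is_low(meminfo_text, min_available_fraction):
        restart()
        return True
    return False
```

On a real host the text would come from reading `/proc/meminfo`, and the check would run on a timer. Which services to restart is not stated in the post, so the callback is left to the caller.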

Actions Taken and Next Steps

The main virtual machine and associated services were promptly restarted, restoring RMS functionality.
We contacted AWS, as such a VM RAM overload should not have been possible.
Our team is actively monitoring memory usage across the infrastructure to identify and address the root cause of the memory leak.
We have added a temporary measure that restarts the relevant services if a server starts running out of memory.
Impact if this happens: an action may fail on the first attempt and succeed on the second.
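
Because an action may only go through on a second attempt while services restart, API clients may want to retry calls that time out. A minimal Python sketch of that idea, assuming the client surfaces a timeout as an exception: `call_with_retry` and its parameters are illustrative names, not part of the RMS API, and `action` stands for whatever function performs the actual request.

```python
import random
import time


def call_with_retry(action, attempts=3, base_delay=1.0,
                    retry_on=(TimeoutError,), sleep=time.sleep):
    """Run action(); on a retryable error wait and try again.

    Gives up and re-raises after `attempts` tries in total.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise  # out of tries: surface the last failure
            # Exponential backoff plus a little jitter, so many clients
            # do not retry in lockstep while services are restarting.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
```

If the HTTP library in use raises its own timeout type, it can be passed via `retry_on`. Retrying is only safe for actions that can be repeated without side effects, so commands that must not run twice need extra care.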

We sincerely apologize for any inconvenience caused and appreciate your patience and understanding as we work diligently to enhance the resilience of our systems and minimize future disruptions.

system · February 16, 2025, 4:59pm

This topic was automatically closed after 60 days. New replies are no longer allowed.