There seems to be an RMS (API / Remote WebUI and Device CLI) outage currently, which seems to have started around 11:28 EST. None of our field devices are receiving the commands that we send them through the RMS API, and the remote WebUI and Device CLI requests time out for all our devices.
Could you advise of any current outage and the ETA for resolution?
Same for me - can’t connect to any of our RUT200s - we get a timeout message.
Do these outages happen often?
Updated:
Seems the last one happened on Nov 24th, so not long ago…
## Summary of the Incident

Unexpected complications during planned maintenance led to extended downtime. Server updates caused connection issues with devices, and despite efforts, these could not be resolved within a reasonable timeframe. The system was reverted, though connection problems persisted, resulting in approximately 24 hours of downtime, with devices gradually reconnecting over the next 72 hours.
## Current Status
We are pleased to inform you that the RMS system and its services have been fully restored. Our team is actively monitoring the systems to ensure continued stability and performance.
## Incident Summary
Issue: On 2024-12-18, between 16:00 and 18:00 UTC, users experienced “Timeout” errors while performing actions within the RMS platform or using the RMS API. Affected operations included firmware upgrades, backup uploads, retrieving monitoring data, and generating new remote access links.
Root Cause: The incident was caused by a memory leak in one of the system applications. This led to an unexpected overload of a key virtual machine, which serves as the foundation for all dependent system components. As a result, services and operational applications froze, preventing the system from processing new actions.
## Actions Taken and Next Steps
The main virtual machine and associated services were promptly restarted, restoring RMS functionality.
We contacted AWS, as such a VM RAM overload should not have been possible.
Our team is actively monitoring memory usage across the infrastructure to identify and address the root cause of the memory leak.
We have added a temporary measure that restarts the relevant services if a server is running out of memory.
Impact if such a restart occurs: an action may fail on the first attempt but succeed on the second.
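
For anyone scripting against the RMS API, the note about an action succeeding only on the second try suggests wrapping calls in a small retry. Below is a minimal sketch in Python under stated assumptions: `call_with_retry` and `action` are hypothetical names and not part of the RMS API, and the sketch assumes the wrapped call raises Python's built-in `TimeoutError` when a request times out. If your HTTP client raises its own timeout exception, catch that one instead.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying on TimeoutError with exponential backoff.

    `action` is a hypothetical zero-argument callable that performs one
    API request and raises TimeoutError if the request times out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TimeoutError:
            if attempt == attempts:
                # Out of attempts: surface the timeout to the caller.
                raise
            # Wait base_delay, then 2x, then 4x ... before the next try.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_action() -> str:
        # Simulated request: times out once, then succeeds.
        state["calls"] += 1
        if state["calls"] == 1:
            raise TimeoutError("simulated timeout")
        return "done"

    print(call_with_retry(flaky_action, base_delay=0.1))
```

The backoff keeps retries from piling onto a service that is in the middle of restarting. Only retry actions that are safe to repeat, since a timed-out request may still have been processed.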
We sincerely apologize for any inconvenience caused and appreciate your patience and understanding as we work diligently to enhance the resilience of our systems and minimize future disruptions.