RUT950 has partial traffic flow failure regularly

We’ve noticed an issue at a couple of our sites where we’ve deployed a RUT950. We were originally running the v6 firmware series and have since moved to v7, but the problem hasn’t gone away (although the behaviour is slightly different).

The basic issue is:

After a period of time (anywhere from hours to days, but usually ‘a few hours’), though not every single day[1], the router stops routing some traffic, but not all, and the admin interface (WebUI) either behaves differently or stops working altogether.

We route all traffic via an L2TP VPN, so when we want to look at the router we normally connect to its WebUI through that tunnel. This works fine normally (and works fine at our other sites). I’ve plenty of sites where we’ve never seen an issue and the router happily runs for months.

On v6, what we’ve seen at a couple of sites is that after this varying period we lose the ability to connect to the web interface: we can’t get to the WebUI at all, not even the login screen. At the same time, we notice that VoIP phones can no longer make calls. However, general internet access still works from a PC connected to the network, and the VoIP phones are still registered and can still receive calls.

On v7 the experience is similar, except that some of the time the WebUI login screen DOES appear, but the /login page tells you ‘The device is unreachable. Please check the connection and try again’. The rest of the time it just fails entirely, like v6.

[1] To try and combat this, we set the routers to reboot overnight every night. This hasn’t solved the issue, as it still happens, but the number of days on which it occurs has reduced a little.

The only thing we can do is reboot the router when this happens and as soon as we do, everything works just fine again.

That makes further troubleshooting a bit awkward of course: rebooting it means I can’t see what the router was doing, but I can’t log in to it while it’s in that state anyhow.

Has anyone experienced this bizarre behaviour, or does anyone know a way to make it stop? I can’t replicate it readily, so it’s not obvious what is triggering it, but it happens several times a week at a couple of sites and never anywhere else, even though they’re all pretty similar installations.

Incidentally, I can still SMS the router to ask for its status, and there is still internet access: as I say, the PCs can still use the internet and the phones can still receive calls. The L2TP tunnel is still up, and I can still ping the router LAN IP over that tunnel just fine. So it doesn’t seem like it’s just connectivity failing, as nothing suggests that happens at all - it’s that ‘some’ traffic stops flowing, including the WebUI…
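
In case it helps pin down when the failure starts, here is a rough sketch of a probe that could be left running on a machine at the far end of the tunnel. It only compares the two symptoms described above (ping to the router LAN IP versus a TCP connection to the WebUI) and prints a timestamped line for each check. The IP address, the port numbers and the function names are all placeholders or assumptions for illustration, not anything taken from the router's configuration, and the ping flags assume a Linux `ping`.

```python
import socket
import subprocess
import time
from datetime import datetime

ROUTER_LAN_IP = "192.168.1.1"  # placeholder: substitute the real router LAN IP
WEBUI_PORTS = (80, 443)        # assumption: WebUI listens on HTTP and/or HTTPS


def ping_ok(host, timeout_s=2):
    """Return True if one ICMP echo is answered (Linux-style ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def tcp_ok(host, port, timeout_s=3):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def classify(ping, tcp):
    """Map the two probe results onto the symptoms described in this post."""
    if ping and tcp:
        return "ok"
    if ping and not tcp:
        return "partial: ping answers but WebUI port does not"
    if tcp:
        return "unexpected: WebUI port answers but ping does not"
    return "down"


def monitor(host, ports=WEBUI_PORTS, rounds=10, interval_s=60):
    """Run a fixed number of checks and print one timestamped line per check."""
    for _ in range(rounds):
        ping = ping_ok(host)
        tcp = any(tcp_ok(host, port) for port in ports)
        print(f"{datetime.now().isoformat(timespec='seconds')} {classify(ping, tcp)}")
        time.sleep(interval_s)


# Example (not run automatically): monitor(ROUTER_LAN_IP, rounds=1440, interval_s=60)
```

The first line reading "partial" would give a timestamp to compare against the router's uptime or the SMS status reply before rebooting.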

Any ideas folks?

