Telstra Is Restoring Service After A Massive Mobile Network Outage [Updated]

Telstra has confirmed that a mass outage on its 3G and 4G networks is affecting customers nationwide.

Image: TK Kurikawa / Shutterstock.com

Update: Telstra says it has identified and resolved the issue, and is now restoring service to customers.

Customers in Melbourne began reporting an inability to make calls, receive texts or access the internet around 12pm today. The outage then spread to Sydney, followed by multiple other areas across the country.

iTnews sources revealed that "a network core switch restart resulted in significant congestion across the mobile network".

"We are aware of an issue currently affecting mobile voice and data nationally," Telstra said in a statement. "We are working to resolve the issue as quickly as possible and thank customers for their patience."

Telstra has suggested that customers on 4G data with Voice over LTE (VoLTE) enabled have a better chance of being able to use the network.

Gizmodo has contacted Telstra for further comment, and we'll update as details come to light.

[Business Insider]


Comments

    What sort of PR bs is "a network core switch restart"? Is it a switch belonging to the network core? Is it a core switch of the network? And why does a switch need to restart?

      People who do not know networking shouldn't come into the comments and talk about things they know nothing about.

        If they toss techno-babble about then they should be prepared to explain it. Also I wish I was your manager and that was the "explanation" you gave me for an outage.

          Even someone who doesn't understand the specifics (which is about the level of detail Gizmodo would report at, given its average readership's tech literacy) should be able to grasp that it's a network issue.

          Incidentally, if my manager were ignorant about the specifics of what they were managing such that they didn't grasp general industry terms, I'd quit in disgust. This is a pretty basic explanation that someone with even a fairly small amount of network knowledge would understand.

      hahahahahahahahaha.

      it probably couldn't be explained to you without first explaining about 5 other terms.


      The longer version is: a switch in the network core (the centralised system of nodes that manage the broader network, such as SGSN and HLR nodes) was restarted in an attempt to fix an issue it was having (have you tried turning it off and then on again?). It seems from the reports that it wasn't isolated properly and caused an avalanche of traffic to other, similar switches, overloading them (I'm making a conjecture on this particular point based on the information available from news reports). Does that make more sense?

        And I should also clarify that "switch" is referring to the networking term of Layer 2/Layer 3 switch, and not something like a light switch.
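
        Purely as a back-of-the-envelope illustration (a toy Python model with made-up numbers, not Telstra's actual architecture), here's why pulling one core node out of a pool without draining it first can swamp the survivors with reattach traffic:

            # Toy model: one core node in a pool goes offline ungracefully and
            # every subscriber attached to it tries to reattach at once.
            # All numbers are invented purely for illustration.

            NODES = 9                  # pool of core nodes (e.g. MME/SGSN pool)
            SUBSCRIBERS = 9_000_000    # subscribers spread evenly across the pool
            ATTACH_RATE = 5_000        # attach requests each node can process per second

            per_node = SUBSCRIBERS // NODES       # ~1,000,000 subscribers per node

            # The failed node's subscribers are dropped and immediately signal the
            # network to reattach, spread across the surviving nodes.
            survivors = NODES - 1
            burst_per_survivor = per_node / survivors

            seconds_to_clear = burst_per_survivor / ATTACH_RATE
            print(f"Each surviving node cops ~{burst_per_survivor:,.0f} reattach requests")
            print(f"At {ATTACH_RATE:,}/sec that's ~{seconds_to_clear:.0f}s of solid signalling congestion,")
            print("with normal attach and paging traffic queuing behind the storm.")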

        So, they've called IT support and the person on the other end said, "Have you tried turning it off and then on again?" LOL. sounds familiar......I'm kidding, it just sounds so plausible.

    I'd be interested to know if it was an intentional restart or not.

      "oooh... what does this switch do?"

        "PUH-LEEESE! PUH-LEEEESE! Do not flip dah switch! You have no idea wut it..."

        *ZAP*

        ".....does....."


      "I'd be interested to know if it was an intentional restart or not."
      Fairfax is reporting "human error", which would lead one to believe it was intentional.

      Telstra's chief operations officer Kate McKenzie said the outage was caused by "human error". "We apologise right across our customer base. This is an embarrassing human error."

        To add to that, it seems as though a node failed, and the 'human error' was that they didn't redirect people away from the failed node while it was being fixed.

    So one switch needing a restart caused massive congestion on a nationwide scale? What ever happened to redundancy for such critical infrastructure?

      GREED!!
      Hmmmm, 2 of the same thing in case something goes wrong... or... yeah, one should do.

        Working in an industry that uses redundancy, I can tell you it's not always just 2 of everything. Our redundancy requires 2 different versions of everything, plus monitoring equipment to detect faults. You have to use different components because if it's a design fault then the redundant section will suffer the same fault. So instead of double the cost it's more like triple.

        And from other reports, procedures that should have been followed weren't, so the congestion came from traffic not being correctly routed around the outage.

        "The network is configured to manage this, however, in this instance we had issues transferring customers to other nodes which caused congestion on the network for some"

        After admitting human error.
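
        To make the "2 different versions of everything, plus monitoring" idea concrete, here's a toy Python sketch (hypothetical vendor names, nothing specific to any real network):

            # Toy sketch of diverse redundancy: two independently implemented paths
            # plus monitoring that detects a fault and fails over.

            def route_via_vendor_a(packet):
                # Primary path (imagine vendor A's kit); a design fault trips it up.
                if packet.get("corrupt"):
                    raise RuntimeError("vendor A path failed")
                return f"A delivered packet {packet['id']}"

            def route_via_vendor_b(packet):
                # Diverse backup: different hardware and software, so the same
                # design fault is unlikely to be reproduced here.
                return f"B delivered packet {packet['id']}"

            def monitored_route(packet):
                # The "monitoring equipment": detect the fault and fail over.
                try:
                    return route_via_vendor_a(packet)
                except RuntimeError:
                    return route_via_vendor_b(packet)

            print(monitored_route({"id": 1}))                   # A delivered packet 1
            print(monitored_route({"id": 2, "corrupt": True}))  # B delivered packet 2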


    "Network core switch" is a piece of terminology... it doesn't refer to a single switch.

    Ahh the joy of still having a Landline

      The slow ADSL... the dropouts when it rains... thank goodness we got FTTH NBN.

    Telstra have stated: "The outage was triggered when one of these nodes experienced a technical fault and was taken offline to fix. This normally wouldn’t impact services as we have processes in place to make sure any customers currently connected to a node are transferred to another node before it is taken offline. Unfortunately on this occasion the right procedures were not followed and this resulted in customers being disconnected and consequent heavy congestion on other nodes as customers attempted to reconnect to the network."

    The reality is that regardless of whether the users were gracefully migrated before taking the node out of service or the node suffered an outright failure, in a correctly designed network the remaining pool of nodes should be able to accommodate the restoration. Someone has a lot of explaining to do.

    This potential catastrophe is well documented by mobile vendors such as Alcatel-Lucent/Nokia and Ericsson in articles such as "A signalling storm is gathering - Is your network ready?" and "LTE signalling - Preventing attach storms". Furthermore, the 3GPP CT4 working group implemented changes to the Core Network and Terminal Restoration Procedures technical specifications in September 2011. These updated procedures ensured that the subscriber and the mobile device remained reachable even after an MME node failure. Prior to Release 10.5, in the case of an MME/S4-SGSN failure the user's session would immediately drop and the service would terminate.

    Before LTE service can resume, the user must initiate a procedure so that the device can reattach itself to the network (e.g. a service request or tracking area update). This can generate an "attach storm", with potentially thousands of LTE subscribers assigned to the failed node simultaneously signalling the network to reattach.

    Vendors have additional methodologies for managing signalling in their EPC (Evolved Packet Core); for example, Alcatel-Lucent introduced the concept of a Session Restoration Server. Normally, in the case of a node failure, the attached subscriber sessions are considered stale and purged, and it is only when the user reconnects to the network that the contextual data about the user is relearned. With a session restoration server, key attributes are stored so that when a node fails the remaining nodes in the pool can immediately retrieve information about the subscriber, which eliminates the need for the subscriber to reattach to the network.
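
    A bare-bones sketch of that session-restoration idea, with entirely hypothetical class and method names (this is not any vendor's actual API):

        # Sketch: on node failure, a surviving node retrieves the subscriber's
        # context from a shared store instead of waiting for a reattach.

        class SessionRestorationStore:
            """Shared store holding key subscriber attributes (bearer, IP, etc.)."""
            def __init__(self):
                self._sessions = {}            # subscriber ID -> session context

            def save(self, imsi, context):
                self._sessions[imsi] = context

            def restore(self, imsi):
                return self._sessions.get(imsi)

        class CoreNode:
            def __init__(self, name, store):
                self.name = name
                self.store = store
                self.sessions = {}

            def attach(self, imsi, context):
                # Normal attach: keep local state and mirror the key attributes.
                self.sessions[imsi] = context
                self.store.save(imsi, context)

            def adopt(self, imsi):
                # Failover: instead of purging the "stale" session and waiting for
                # the handset to reattach, pull its context from the shared store.
                context = self.store.restore(imsi)
                if context is not None:
                    self.sessions[imsi] = context
                return context

        store = SessionRestorationStore()
        node_a, node_b = CoreNode("node-a", store), CoreNode("node-b", store)
        node_a.attach("505011234567890", {"bearer": "default", "ip": "10.0.0.17"})

        # node-a fails; node-b adopts the subscriber without a network-wide reattach storm.
        print(node_b.adopt("505011234567890"))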

    The bottom line is, you can't blame one guy for bad network design. There are mechanisms available to detect and suppress network storms, loops and so forth, and mechanisms to monitor, measure and minimise the signalling traffic that can lead to this type of collapse. The last line of defence is peer review: surely when you are taking 1 of 9 critical national nodes offline you have someone watching, just for good measure. Are you telling me Telstra does mission-critical changes in the middle of the day with only one set of eyes watching what's going on? Bullcrap. All of that said, even if the network engineer plugs everything in upside down, the network should automatically shut down the offending elements. There should never be a situation where one engineer can tear down the entire national network; that's just crazy talk.
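
    As for the storm detection and suppression mechanisms mentioned above, the simplest form is per-node rate limiting of attach signalling. A generic token-bucket sketch, with rates invented purely for illustration:

        # Generic token-bucket sketch of attach-storm suppression.
        import time

        class AttachRateLimiter:
            def __init__(self, rate_per_sec, burst):
                self.rate = rate_per_sec      # sustained attaches accepted per second
                self.capacity = burst         # short bursts tolerated above the rate
                self.tokens = float(burst)
                self.last = time.monotonic()

            def allow(self):
                now = time.monotonic()
                self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
                self.last = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return True               # accept the attach request
                return False                  # shed it; the handset retries with backoff

        limiter = AttachRateLimiter(rate_per_sec=5000, burst=10000)
        accepted = sum(limiter.allow() for _ in range(1_000_000))
        print(f"Accepted {accepted:,} of 1,000,000 simultaneous attach attempts; "
              "the rest were shed instead of letting the node fall over.")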

      If the boxes that were involved used SIGTRAN, which is widely used in mobile networks, then the problem is down to limitations in the SIGTRAN protocol's SCTP layer when handling the massive signalling message avalanches that can occur when a switch/box fails in an 'ungraceful' way. So it's not necessarily due to insufficiently designed redundancy, but rather a deficiency in the 3GPP spec that calls for the use of SIGTRAN in 3G mobile networks. VoLTE traffic was not impacted, as TCP over IP is used for call establishment and SIGTRAN signalling gateway boxes are not used. Surprising that work on this equipment was done during the day; it suggests the problem may have become catastrophic if no action was taken anyway.

    a security video has been leaked showing the "error"
    https://youtu.be/NITBfc1EOBo
