November 16, 2020

NBN Fault

Cockatoo

A little history

Earlier this year we finally got NBN at our house. When we moved in about 3 years ago, it was supposed to be commissioned (or RFS, ready for service in NBN parlance) in about 6 months. Then it kept moving out and out and out and out. You get the idea.

Needless to say, this ended with me finding the Telstra Wholesale RFS list (now defunct) and a friend of mine building a site to track the changes. At one point my suburb was at the top for the number of movements of the RFS date. I even wrote to my local MP and to the local newspaper.

The ADSL outage

Anyway, leading up to our house going RFS, we had an outage with our ADSL. All of a sudden one weekend, we started to have stability issues. I logged a fault with Telstra and an engineer came out and fixed it on the Monday. I wasn’t at home (we weren’t in lockdown yet), so I didn’t know the cause. They also crimped out the second socket so we only had one in the house.

Going RFS

Fast forward about 1 month and we finally go RFS. (We were 1-2 weeks behind the rest of the suburb, and I wonder if it’s because some internal testing wasn’t working and someone had to flick it over manually. I’ll explain in a moment.) At that same time I actually raised an issue with my RSP (Retail Service Provider, or ISP in old-school terms) even though I wasn’t a full customer yet. Whether this was a coincidence or not, we went RFS the next day.

Going RFS finally triggered the automated shipment of my NCD (NBN Connection Device) for FttC (Fibre to the Curb). This is basically a VDSL2 modem. You could almost call it a media converter at a stretch, but it’s what used to be CPE (Customer Premises Equipment). Side question, why did NBN come up with so many new terms?

Connecting to the NBN finally

Our shipment arrives the next week and the NCD won’t connect. I raise a fault and NBN sends out a technician. This is about 1 day (give or take) before we go into the federal lockdown, so everyone was on edge. This technician just happened to be the same one who was at my house 6 weeks earlier fixing the ADSL. They know exactly why it doesn’t work: they broke it to fix the ADSL. During the transition phase NBN decided to have two service classes for FttC: 32 and 33. Basically, the difference is that an automatic transfer switch is placed in line somehow/somewhere that works out whether you are using ADSL or VDSL (or maybe it picks up the NCD reverse power, who knows) and disables or enables the DSL connection in the right direction. It was this device that caused the ADSL to fail. The technician then bypassed this device and the connection back to the DSLAM in the exchange so that we could only use NBN. This was now working. 100/40 tested. NICE!!!

The Intermittent Outage begins

Then, all of a sudden in mid-August, we started to get dropouts. This manifested itself as the blue DSL light on the NCD going off and then flashing, as if it wasn’t able to connect to the DPU (Distribution Point Unit). This suggested to me at the time that it was an issue with the NCD, the DPU or the copper in between. I raised an issue with my RSP and of course had to do the usual resets, tests, etc., which all passed at the time.

During the course of these events I went from just running smokeping every 5 minutes to my first hop (the default route on your router to the RSP network) to a high-resolution smokeping (every minute), monitoring the line sync speeds and collecting the router logs, because neither the RSP nor NBN seems to have enough visibility of the infrastructure.

Technician One

The first technician turns up two days later, replaces the NCD and states everything is good. I’m also certain that they stated that the DPU was ~185 m away and that they couldn’t see any issues on the TDR (time-domain reflectometer). The interesting thing here is that I never got to see them actually use the TDR.

The issue reappears within a few hours and I escalate to my RSP. But you have to wait 24 hours. Why? NBN thinks you have to wait for things to settle down? (Who comes up with this stuff?)

I manage to convince them the next day to escalate back to NBN.

Technician Two

The next technician arrives a couple of days later (this is now 6 days after the fault first appeared). This person replaces the NCD (again) and re-crimps everything. It’s at this point we work out that my copper takes a very unusual path (which I had feared when I saw the build team working over summer, as I watched them push copper up the street conduit). My copper is delivered overhead from a point on the corner of my house to the pole on the corner of my street and the next, then back down the road to another pole, down that pole (with a join), then back up the road in the underground conduit to the pit under the first pole, where the DPU is. Crazy, you say?

Here’s an image to show you:

Copper Path

This technician then walks out the cable path and reckons it’s about 110 m (the diagram above begs to differ). The FttC specification is 150 m, so 110 m would put it under; my diagram and estimate put it at borderline. They also believe that the next step would be to replace the DPU if this doesn’t fix it.

Sure enough, 30 minutes later, and it’s not working, again.

I escalate back to my RSP and also raise a complaint about the copper lead-in length and path. (This actually gets some traction at NBN, but they want me to pay for it). More on this soon.

Technician Three

NBN again sends out a technician, the third. This is now 9 days after the fault was first lodged. They’ve got no background (sigh), so I have to explain it all. This technician tells me it’s more likely fibre. (I’m skeptical, but at this point in time I’m starting to believe anything. I wish I had stuck to what little knowledge of telecoms I do have.)

They try to change the DPU to what looks to be an old one (pretty sure it was one of the old Netcomms, and they don’t support G.fast apparently). My NCD failed to power it up. (The neighbour’s worked first time.) The technician didn’t think anything of it. At this point I was more convinced that the length of my copper had more to do with the issue. They put the old DPU back in and we are now back where we were.

Apparently, they provide all this information back to NBN, and about an hour later they are back to take a fibre light measurement (-23 dB). Now, I don’t know what’s good or bad here, but apparently this is borderline. They take photos and provide them back to NBN.

This time I don’t have to go back to NBN or my RSP as NBN actually reach out about the fault and book another technician in.

Technician Four

We are now 16 days after the fault was originally lodged. This person gets close to actually resolving the issue, but misses by a whisker. (It’ll become clear with the next technician.) They use their TDR and get a measurement of 135 m from the Telstra demarcation on the side of the house. They think that maybe the length could be an issue too. But they also note that the pair in use is the red/black, not the blue/white. They also notice something weird on the TDR but dismiss it as it wasn’t consistent.

They have a new DPU and swap it in (again, my NCD has trouble powering it by itself). They take photos of the TDR, plus another reading of the fibre light levels (-21.7 dB this time), and provide it all back to NBN. They get information from NBN that a fibre incident will be raised and we should see someone out to fix it in the next few days. (Note: we never see them.)

At this time I’m getting frustrated. My RSP is requiring me to provide all the evidence of when it goes down. I now have several items set up to monitor, and I’m passing everything the technicians tell me back to the RSP (because apparently NBN doesn’t give it to them; they also don’t seem to share it with technicians for context. What’s with the secrecy?). I now ask for escalation within my RSP and an escalation with NBN.

Technician Five

The last technician. This is now 23 days after the fault was originally lodged. And again, this person has little context. (Really, what benefit do people see in starting from nothing?) They start from scratch: pull out the DPU and the NCD, look at the cable, joins, length, etc.

Now I should note, it has been particularly bad this morning for dropouts. It’s a bit windy outside too. They take the TDR inside and we immediately start to see the issue come and go on the screen, at about 40-50 m. This must be past the Telstra demarcation, so the line gets cut there and the TDR is placed on the red/black pair leaving the house. Again they note the odd pair being used, and suggest there has likely been a fault in the past that led to the pair being swapped over. This time the fault shows up at about 21 m. We estimate that must be about the corner of the house.
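As a rough sketch of how a TDR turns an echo into a distance like the ones above: it sends a pulse down the pair, times the reflection, and converts that round-trip time into metres using the speed of the signal in the cable. The velocity factor below (0.66) is my assumption for a typical copper pair, not a figure from the technician’s instrument, and the function names are made up for illustration.

```python
# Speed of light in a vacuum, in metres per second.
SPEED_OF_LIGHT_M_PER_S = 299_792_458

# ASSUMPTION: signals in a copper pair travel at roughly two thirds of the
# speed of light. Real TDRs let the operator set this value per cable type.
ASSUMED_VELOCITY_FACTOR = 0.66


def distance_to_fault_m(round_trip_s, velocity_factor=ASSUMED_VELOCITY_FACTOR):
    """Distance to a reflection, given the echo's round-trip time in seconds."""
    # The pulse travels to the fault and back, hence the division by two.
    return velocity_factor * SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2


def echo_time_s(distance_m, velocity_factor=ASSUMED_VELOCITY_FACTOR):
    """Round-trip time of an echo from a fault distance_m metres away."""
    return 2 * distance_m / (velocity_factor * SPEED_OF_LIGHT_M_PER_S)


if __name__ == "__main__":
    # The distances mentioned in this post, expressed as echo times.
    for metres in (21, 45, 135):
        nanoseconds = echo_time_s(metres) * 1e9
        print(f"{metres:>4} m  ->  echo after about {nanoseconds:.0f} ns")
```

The echoes arrive within a few hundred nanoseconds at these lengths, and a wrong velocity factor shifts every distance proportionally, which may be part of why the length estimates varied between visits.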

They get a small ladder to investigate the cable coming out of the corner of the house, look out along the cable, and about 1.5 m out from the corner of the house there are bits of the black sheath sticking up in the air.

They investigate this further and find that all the copper in that section has been pecked at, at some point. (The suggestion is that this was caused by a cockatoo, a known bug.) The most likely explanation we come to is that, at some point in the past, a cockatoo attacked the cable (for some reason), a fault was raised with Telstra and the technician switched to the red/black pair, which resolved the issue. The red/black pair was on the cusp of failing, and the drastic warm/cold swings and wind that we experienced in August caused the copper to finally break just enough to cause dropouts under certain conditions.

At this time, they replaced the cable, stating they didn’t want to replace all of it, and it made more sense to go down the first pole, straight into the pit. (Sadly I couldn’t convince them to dig a trench to the house).

This reduced the length by about 100 m, to about 60 m in total.

Replacing this piece of cable resolved the issue and no further escalations were required.

Summary

It feels less than acceptable that in this day and age a system as complex as NBN has minimal (at least from where I’m standing) observability. I’m sure there is good visibility of the health of the system from a PoI (Point of Interconnect), or to some degree of the customer fibre network, but that seems to be where it ends. I had to provide evidence that I was experiencing an issue. (My RSP was doing some monitoring, but it seemed to be ad hoc and limited. It was only enabled once I raised the fault, and I assume it has now been disabled again.)

Requiring non-technical customers to provide evidence of faults is just not fair. NBN should be able to gain information from the DPU and the NCD to understand a fault. The Internet, and by extension the NBN in Australia, needs to be treated like a utility.

When I had a possible gas leak outside my house, a plumber was on site within a couple of hours to repair the fault. When I’ve had power outages, we get push notifications to our mobile phones about the outage with estimated times to remediate. If I never called about an NBN outage (at the last leg), NBN would likely never know about it.

What has this caused me to do:

  • Monitoring Latency at high resolution. - I was using smokeping to monitor latency at a 5-minute resolution, but this was sometimes missing the short outages that were happening when the issue first appeared. I now also have smokeping running at 1-minute resolution. I ping the first hop past my router from an IP perspective (obviously there is lots of layer 2 equipment in that path, but I don’t have visibility of it).
  • Monitoring Speed. - I now have a regular speed test running that utilises my RSP’s speed test server.
  • Monitoring FttC Line Sync. - Luckily for me my RSP makes it possible to run DPU Port Status checks. This allows me to get line sync rates. I collect these and graph them regularly.
  • Monitoring Router Logs. - I am now running a syslog server so that logs are pushed from my router (Ubiquiti USG) to a server. (I was also running a daily report on the number of times the link was going down.)
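To illustrate why the resolution matters, here is a minimal sketch that groups failed probes into outage windows and counts them per day. It is not how smokeping stores its data (smokeping sends a burst of pings per step), and it is not one of the utilities I built. The function names and the sample data are made up for illustration, and one probe per minute is assumed.

```python
from datetime import datetime, timedelta


def find_outages(samples):
    """Group consecutive failed probes into (start, end, missed) windows.

    samples is an iterable of (timestamp, ok) tuples sorted by time.
    """
    outages = []
    start = last_fail = None
    missed = 0
    for timestamp, ok in samples:
        if not ok:
            if start is None:  # first failure of a new outage
                start = timestamp
                missed = 0
            last_fail = timestamp
            missed += 1
        elif start is not None:  # a good probe closes the open outage
            outages.append((start, last_fail, missed))
            start = None
    if start is not None:  # still down when the data ends
        outages.append((start, last_fail, missed))
    return outages


def outages_per_day(outages):
    """Count outages by the calendar day they started on."""
    counts = {}
    for start, _end, _missed in outages:
        counts[start.date()] = counts.get(start.date(), 0) + 1
    return counts


# Made-up data: one probe per minute, with two short dropouts.
base = datetime(2020, 8, 20, 9, 0)
pattern = [True, True, False, False, True, True, True, False, True, True, True]
samples = [(base + timedelta(minutes=i), ok) for i, ok in enumerate(pattern)]

if __name__ == "__main__":
    print("1-minute view:", find_outages(samples))
    # Keeping only every fifth probe mimics a 5-minute resolution.
    print("5-minute view:", find_outages(samples[::5]))
    print("per day:", outages_per_day(find_outages(samples)))
```

With this made-up data the 1-minute view finds both dropouts, while the 5-minute view happens to land on good probes every time and reports nothing.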

Here is a link to the utilities I built:

A screen shot showing the line sync and speed test. The ups and downs in the middle are when I was experiencing the issues.

Questions

I’ll leave you with these parting questions:

  • Why can’t/doesn’t the NBN monitor the NCD and DPU and use logs to distinguish what the issue is?
  • Why is the information so secret and not shared between all the parties?
  • Why does no one take ownership of faults?
  • How many people live with poor Internet connections because they aren’t persistent and/or don’t have enough knowledge to know when to push?
  • Why would you run components on the edge of acceptable (the length of copper and the light level readings)?
  • Why was NBN going to charge me for shortening my copper path when the technician did it as part of the fault resolution?

© Greg Cockburn
