Issue #9 – Long-term Maintenance – The FB outage

<1/> A framework for project maintenance

Projects that stay alive for years go through many changes, which is easy to understand given how many new technologies and practices appear every year.

Developers dedicate a lot of time to developing and architecting a project to be scalable, but often forget about the processes and frameworks needed to maintain that standard. Four practices help here:

Retire: have an exit strategy for when things do reach the end. A deprecation, or a tool that better caters to the project's needs, might force you to move. Most of the time we know when that moment has come; the challenge is getting the wheels turning.
Restrict: business requirements and constraints drive most decisions made on a project. Moving to a new technology costs time and resources, so look at how you can at least restrict the use of a legacy tool instead of opting for a rewrite. (If not done carefully, this can leave a lot of legacy code behind.)
Standardize and document: this one comes down to processes and structure, and one can argue that it could be solved with a few of the no-code solutions out there. Teams often undervalue the impact this can have on the long-term sustainability of a project.
Invest: some decisions drive the technical aspects of a project, and many of them can be made based on the points above, but there are times when adopting new technology is worth the investment.
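As a rough illustration of the "Restrict" idea, here is a minimal Python sketch that keeps a legacy helper working while nudging callers towards its replacement. The names `legacy`, `old_format_price` and `format_price` are hypothetical and only serve the example; your project will have its own equivalents.

```python
import functools
import warnings


def legacy(replacement):
    """Mark a function as legacy and point callers at its replacement."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Warn on every call, attributing the warning to the caller's line.
            warnings.warn(
                f"{func.__name__} is legacy; use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorate


# Hypothetical legacy helper that the project wants to phase out.
@legacy(replacement="format_price")
def old_format_price(amount):
    return "$" + str(round(amount, 2))


def format_price(amount):
    """The replacement that new code should call."""
    return f"${amount:,.2f}"
```

Existing callers keep working, but every call surfaces a `DeprecationWarning`, which makes the remaining usage easy to find and shrink over time instead of forcing a rewrite in one go.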

Software is often meant to last as long as 20 years, and even with all the new advancements, that can still be achieved with the right framework. Maintenance management, properly executed, is an essential part of well-functioning, long-lasting software.

<2/> Explaining the FB outage

Facebook experienced a week of outages, losing billions in market value and perhaps some users in the process. In internet time, that was ages ago, but for those interested in the technical side, here is a deeper look at what occurred.

It is no secret that Facebook runs an enormous number of machines across its data centres, which together make up the behemoth of a social network. In a blog post published by the engineering team after the incident, they refer to the network that connects these facilities as the backbone.

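Let us first look at DNS (Domain Name System), popularly known as the address book of the internet: it translates names such as facebook.com into the IP addresses machines use to reach each other. Then there is BGP (Border Gateway Protocol), the protocol used mainly by the routers that connect networks such as the backbone to the rest of the internet. It is responsible for exchanging routing information, that is, which IP address ranges can be reached through which network, with routers in other networks.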

We refer you back to the article, as the team does an excellent job of explaining how its network works and what systems are in place to handle smaller failures. In short, a faulty configuration change pushed onto the backbone during a maintenance run took down its connections. With the data centres cut off, the routes that BGP usually advertises were withdrawn, effectively disconnecting FB from the internet.
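To picture why losing routes is so damaging, here is a toy Python model. It is a deliberate simplification and not how BGP or Facebook's systems are implemented: the `RoutingTable` class, the domain name and the IP ranges (taken from the documentation-only address blocks) are all made up for illustration.

```python
import ipaddress

# Toy "address book": name -> IP address. All values are placeholders.
DNS_RECORDS = {"social.example": "203.0.113.10"}
DNS_SERVER_IP = "198.51.100.53"


class RoutingTable:
    """Prefixes that the rest of the internet currently knows how to reach."""

    def __init__(self):
        self.prefixes = set()

    def announce(self, prefix):
        self.prefixes.add(ipaddress.ip_network(prefix))

    def withdraw(self, prefix):
        self.prefixes.discard(ipaddress.ip_network(prefix))

    def reachable(self, ip):
        address = ipaddress.ip_address(ip)
        return any(address in network for network in self.prefixes)


def resolve(name, routes):
    """Return the IP for name, or None if the DNS server cannot be reached."""
    if not routes.reachable(DNS_SERVER_IP):
        return None  # the address book itself is off the map
    return DNS_RECORDS.get(name)


routes = RoutingTable()
routes.announce("198.51.100.0/24")  # prefix containing the DNS server
routes.announce("203.0.113.0/24")   # prefix containing the application

before = resolve("social.example", routes)

# Simulate the outage: the route to the DNS server is withdrawn.
routes.withdraw("198.51.100.0/24")
after = resolve("social.example", routes)
```

In this sketch `before` holds the application's address, while `after` is `None`: the DNS records never changed, but nobody can reach the server that holds them once its route is gone.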

Incidents like this probably occur all the time on a smaller scale, and Facebook has systems in place to absorb them. Your systems may not be as extensive as Facebook's, but you too can set up services, rules and practices on your network layer that help prevent something similar from happening to you.

<3/> Inside the console

We have already touched on a few networking concepts and briefly mentioned DNS's role in how the internet works. Amazon Route 53 is an AWS service that provides DNS in the cloud, allowing developers to manage traffic between applications running inside their cloud environment and applications hosted externally.
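As one possible example of managing traffic with Route 53, the Python sketch below builds a change batch for a pair of failover records, so that DNS answers switch to a secondary address when a health check on the primary fails. This is a sketch under assumptions, not a recommended setup: the domain, IP addresses and health check ID are placeholders, and the helper names `failover_change_batch` and `apply_change_batch` are ours. Sending the batch assumes boto3 is installed and AWS credentials are configured.

```python
def failover_change_batch(domain, primary_ip, secondary_ip, health_check_id):
    """Build a Route 53 ChangeBatch with a primary and a secondary A record."""

    def record(role, ip, extra=None):
        record_set = {
            "Name": domain,
            "Type": "A",
            "SetIdentifier": f"{domain}-{role.lower()}",
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": 60,
            "ResourceRecords": [{"Value": ip}],
        }
        record_set.update(extra or {})
        return {"Action": "UPSERT", "ResourceRecordSet": record_set}

    return {
        "Comment": "Failover routing sketch",
        "Changes": [
            # Only the primary carries the health check in this sketch.
            record("PRIMARY", primary_ip, {"HealthCheckId": health_check_id}),
            record("SECONDARY", secondary_ip),
        ],
    }


def apply_change_batch(hosted_zone_id, change_batch):
    """Send the batch to Route 53. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without it

    client = boto3.client("route53")
    return client.change_resource_record_sets(
        HostedZoneId=hosted_zone_id, ChangeBatch=change_batch
    )


# All values below are placeholders, not real resources.
batch = failover_change_batch(
    domain="app.example.com.",
    primary_ip="203.0.113.10",
    secondary_ip="198.51.100.20",
    health_check_id="hypothetical-health-check-id",
)
```

Keeping the batch builder separate from the API call means the record layout can be reviewed and tested without touching a live hosted zone; `apply_change_batch` is only called once you are happy with the result.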