Our primary router for the pmtpa cluster had to be rebooted today at 12:00 GMT. A line card had died and needed replacing, and the system required a reboot for the replacement to fully take effect. Once the reboot finished, CentralNotice was adding a lot of overhead and had to be disabled so that our caching cluster could catch up. The resulting load then overloaded the primary database master for S3, and we are in the process of switching database masters to another server.
If all had gone as planned, this would have been a quick 5-minute router reboot and then back online. Unfortunately, things do not always go smoothly, so what should have taken 5 minutes has taken a while. This post will be updated as more details become available.
Update: We have switched database masters successfully and all sites and projects should once again be fully functional as of 14:13 GMT.
Rob Halsell, Operations Engineer

PMTPA Router Reboot – Scheduled Downtime (Resolved)
