Posts: 35 | Thanked: 504 times | Joined on Jan 2013 @ Germany
#3
Hi everyone,

tl;dr: half of the infrastructure is broken, fix expected early next week, film at eleven.

This maintenance didn't go to plan. Here's a short post-mortem:

Timeline:

10:00 - start updates and backups on blade-a
14:30 - backups and updates complete on blade-a, reboot confirmed successful
14:31 - uptime-induced filesystem check on blade-a after 1347 days
15:00 - start of backups on blade-b
17:12 - filesystem check complete, blade-a up and running
17:30 - first systems on blade-a confirmed up and working
18:30 - software upgrade on stage and mail complete
20:15 - backups of blade-b finished and copied onto blade-a backup space
20:16 - start of updates on blade-b
21:00 - updates on blade-b complete, reboot
21:01 - blade-b stuck in boot with corrupt BIOS image in flash
23:30 - all available remote recovery options tried, none working
23:40 - decision to go for Plan B: boot talk.maemo.org on blade-a, redirect everything else to talk.m.o
23:45 - blade-b turned off through IPMI
23:53 - talk.m.o available again
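
For anyone who wants the durations rather than the raw timestamps, here is a small Python sketch that derives them from the timeline above. The event labels are my own shorthand, and it assumes all timestamps fall on the same day, as they do in the list.

```python
from datetime import datetime, timedelta

# Timestamps copied from the timeline above; the labels are shorthand, not official names.
EVENTS = {
    "maintenance start": "10:00",
    "fsck start on blade-a": "14:31",
    "fsck complete on blade-a": "17:12",
    "blade-b stuck in boot": "21:01",
    "talk.m.o available again": "23:53",
}


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two HH:MM stamps on the same day."""
    fmt = "%H:%M"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


fsck_duration = elapsed(EVENTS["fsck start on blade-a"], EVENTS["fsck complete on blade-a"])
plan_b_time = elapsed(EVENTS["blade-b stuck in boot"], EVENTS["talk.m.o available again"])
total_window = elapsed(EVENTS["maintenance start"], EVENTS["talk.m.o available again"])

print("filesystem check on blade-a:", fsck_duration)    # 2:41:00
print("blade-b failure to talk.m.o back:", plan_b_time)  # 2:52:00
print("whole maintenance window:", total_window)         # 13:53:00
```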

Fallbacks in place:

www.maemo.org, wiki.maemo.org, garage.maemo.org are redirected to talk.maemo.org
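
To illustrate what such a fallback does, here is a minimal Python WSGI sketch of a host-based redirect. It is only an illustration and not how the redirect is actually set up on our side: the `http://` scheme, the temporary 302 status, and the names `fallback_app`, `FALLBACK_TARGET` and `REDIRECTED_HOSTS` are all assumptions made for the example.

```python
# Assumed target URL; only the hostname comes from the post, the scheme is a guess.
FALLBACK_TARGET = "http://talk.maemo.org/"

# The three hosts listed above as redirected.
REDIRECTED_HOSTS = {"www.maemo.org", "wiki.maemo.org", "garage.maemo.org"}


def fallback_app(environ, start_response):
    """Redirect requests for the listed hosts to the fallback target."""
    # Strip an optional port and normalise case before matching.
    host = environ.get("HTTP_HOST", "").split(":")[0].lower()
    if host in REDIRECTED_HOSTS:
        # 302 is used here because the fallback is meant to be temporary (assumption).
        start_response("302 Found", [("Location", FALLBACK_TARGET),
                                     ("Content-Length", "0")])
        return [b""]
    body = b"Not found\n"
    start_response("404 Not Found", [("Content-Type", "text/plain"),
                                     ("Content-Length", str(len(body)))])
    return [body]
```

For local experiments this can be served with `wsgiref.simple_server.make_server` from the standard library.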

Next Action Items:

I'll visit the datacenter on Monday after work (around 18:00 CET) to try to recover the BIOS of the broken machine with a physical USB stick.

If this is successful, we'll migrate talk.m.o back to its original host and re-enable www.m.o, wiki.m.o and garage.m.o through DNS once the VMs and the blade are confirmed working.


Best,

xes & falk
__________________
--
We reject kings, presidents and voting.
We believe in rough consensus and running code.
- David Clark
 
