155: The Great AWS Outage of 2017
Where were you during the great outage of 2017?
Good morning. My name is Chris Kalaboukis, and once again we’re coming to you live from deep in the heart of Silicon Valley, California. So if you were on the internet at all last week, you probably experienced the great outage of 2017. That’s right, folks: the great AWS, Amazon Web Services, outage of 2017, where a huge swath of the East Coast servers on Amazon Web Services went down.

And it’s very interesting: this is the second time I have experienced something like this. The first time was back in 2009, when I was running my own startup. We had a small company called invention are, running up in San Francisco, and we had a service that we called tweet us. Tweet us was a service that harvested, let’s say, terms and other information from public Twitter and Facebook feeds and, using algorithms that we came up with, produced a report on the most important or most interesting tweets and information from those services. Basically, you would put in a term, it would spend some time harvesting, then it would run some algorithms against the results and come back with a report saying, hey, here are the top ten most interesting things in sailing, or whatever. We made the assumption at the very beginning that the tweet itself was not very important, but the payload that the tweet carried was more important, so we decided to focus on the payload and do all sorts of great stuff like that.

What’s interesting is that at the time, this was just the very beginning of when these cloud services were starting to become prevalent. I was using Rails, and we focused on using Heroku. Heroku was a service that did web services at the time, and it was a very easy way to do it. I even wrote an article about it, basically saying we went from the days where, if you wanted to run a service, you’d have to actually get a physical server, put it on the Internet, configure it and all that stuff. Nowadays all you do is go to a website and put your credit card information in, and there are masses of virtual servers available at your beck and call. They’re instantly scalable: you start from one and then shoot up to millions, based on how much scale you need. So this was the early days of something like that. We were using Heroku, and Heroku only supported Ruby on Rails. I didn’t know this, but Heroku also ran on Amazon Web Services, and there were tons of companies at the time that were also running on Amazon Web Services at the back end. This was in 2009, and there were plenty of them. I remember it was a day or two before the launch of tweet us. Everything was looking great, everything was working well, and then, boom, we had an outage.
And I was on the phone with Heroku day after day, trying to figure out what was going on and to get things up, and things finally came up. It was really interesting, because all the other partners at my firm were like, well, when is this going to come up? Well, it’s affecting a lot of people. And they couldn’t understand. I mean, in the old days, where you had one server, it was just your server that went down; it didn’t affect anybody else. When something like Amazon Web Services goes down, it affects hundreds, if not thousands, if not millions of websites that are out there. So I went through this experience back in 2009, and let me tell you, it was not a very fun experience for the two or three days that things were down. I think we actually missed our launch date. We had to push things out a little bit, but eventually it came back up.

Now, it doesn’t mean that I don’t like the model. I love the model and the ease of use. As a startup, if you want to create something very quickly, you just go to these websites, boom, ramp it up, and you’re done. And things like S3, things like elastic website: there’s so much that Amazon provides that just makes it so easy for a startup to spin up a server in no time at all and get the thing off the ground, and it can be configured any way you like. The whole cloud virtual infrastructure thing has been an incredible boon to the startup business, to almost any business, in removing that layer of work that needed to be done in order to get things up quickly. I mean, it’s gotten to the point where you can have an idea one day, and a couple of days later you can have a fully finished product out there on the web for people to try and see. It’s just amazing what can be done.

So what happened the other day was almost the exact same thing. Apparently it was human error. Some individual working at Amazon probably had way too much power in his
user role and actually put in a typo. I think he was trying to shut down a small set of servers in order to test them and accidentally shut down the entire East Coast infrastructure. I’m not sure why it took such a long time to bring it back up. I think it was down for six or seven hours or something like that. But because it’s 2017 and there are so many more services running on top of that, it affected a ton of people. And in fact it affected me in a very interesting way, in that we were in the dark.

I’ve taken my house and turned it into a sort of test bed for all of these devices, smart home devices, et cetera. So we have an Amazon Echo, and the Amazon Echo takes care of turning the lights on and off in our home. Right now we’re using smart switches and smart outlets that are hubless: they communicate directly to the Internet, then they communicate to TP-Link, which then communicates to Amazon. So when we asked Alexa to turn our lights on, we were unable to, because the servers were not able to connect to the TP-Link servers to turn them on. Now, I was able to turn on the lights myself using the TP-Link Kasa app, so it was just disconnected from Amazon.

But thinking about it, today is one thing, but what about tomorrow? I mean, these things are becoming so effective and so ingrained, and people love them so much. I could have actually gone in and just unplugged everything, manually turned things on and plugged them back in, so that’s not a big deal today. But at some point in the fairly near future we’re going to have smart homes where all this stuff is integrated, where you won’t even need to be able to control the lights anymore, because you’ll just be saying, Alexa, turn on the kitchen lights; Alexa, turn off the kitchen lights, whatever. You’ll be able to run the entire home by voice command, and it becomes incredibly important to make sure that you have a manual override in place, because if something like this were to occur, you could end up with interesting situations which could cause a problem.

So this is the first time I’ve ever seen an Internet outage in some cloud service somewhere actually affecting you at home, and this is going to become more and more prevalent. So it’s very important that we have the manual overrides in place, that we are able to manually take control. I mean, think about it: what would happen if that server was running an autonomous vehicle? Let’s say we’re in an autonomous vehicle and we’re driving along, and then all of a sudden there’s an outage. Now, obviously the autonomous vehicles need to be built so that if there is some kind of an Internet outage or an AWS
outage or whatever, the autonomous vehicles can continue to move forward and do whatever they want. And when we get to an age of ubiquitous computing and unlimited data, and we can have pretty much everything already on the car, then we don’t have to worry about the disconnection occurring: the disconnection occurs and the car will continue to be able to operate and not get lost or whatever.

But it’s just very interesting that we are putting so much trust in this cloud, in these services, and a lot of companies are depending on these clouds and these services to be working. Because if you think about it, I read that this outage cost, cumulatively over all of the companies that were involved, about $150 million. One hundred fifty million dollars was lost due to this outage.

Can you imagine? And this is the situation that I was in in 2009. You’re running a small startup. You’re just about to do an investor presentation, or you’re just about to launch the thing, or you’re just about to do a huge expansion and get a whole bunch of users, or enough users that you can actually survive to the next point where you can get some more funding, and suddenly this happens. It’s completely out of your hands. It’s a helpless, helpless feeling to not be able to do anything about it.

And we are investing more and more of our infrastructure in things like this. What it basically says to me is that we need to have a more robust infrastructure, and there may be opportunity here for startups or infrastructure companies to say, listen, I know you host on AWS, but AWS recently had a major outage, and maybe there are some ways
we can provide redundant services on a number of different clouds for you in case something like this happens again. So maybe there’s a service out there, or there’s an idea for a service out there, that not only runs you on one set of cloud services like everyone else, but also shares you across any of these other cloud services that are out there.

We’ve become so reliant on these services to be up all the time, but if you think about it, we’re still in the extremely early days of being able to ensure that our networks are resilient, reliable and self-healing. Because if a little bit of human error can shut down the entire East Coast set of servers, can you imagine if someone was actually malevolently attempting to do that? If someone who had the access was actually attempting to shut down those servers, as opposed to doing it by accident, things could have been a heck of a lot worse. Maybe not today, because we probably still have those manual backup systems in place, but what will happen tomorrow, when there’s less of a chance that we will be running the show ourselves?

That’s it for me for today. See you next time, and until then, don’t forget to think future.
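
A rough illustration of the two ideas from this episode, redundant services across several clouds and a manual override when all of them are down, might look something like the sketch below. Everything in it is an assumption made for illustration: the provider names, the `is_up` health check and the `manual_override` callback are hypothetical and do not correspond to any real AWS, Alexa or TP-Link API.

```python
from typing import Callable, Iterable, Optional


def first_healthy(providers: Iterable[str],
                  is_up: Callable[[str], bool]) -> Optional[str]:
    """Return the first provider whose health check passes, or None if all fail."""
    for name in providers:
        try:
            if is_up(name):
                return name
        except Exception:
            # A health check that raises counts as "down"; try the next provider.
            continue
    return None


def turn_on_lights(providers: Iterable[str],
                   is_up: Callable[[str], bool],
                   manual_override: Callable[[], str]) -> str:
    """Send the request to a healthy cloud, else fall back to local manual control."""
    target = first_healthy(providers, is_up)
    if target is None:
        # Every cloud is unreachable: hand control back to the person at home.
        return manual_override()
    return f"lights on via {target}"


if __name__ == "__main__":
    # Hypothetical scenario: the primary cloud is down, the secondary is up.
    status = {"cloud-east": False, "cloud-west": True}
    print(turn_on_lights(["cloud-east", "cloud-west"],
                         lambda name: status[name],
                         lambda: "lights on via wall switch"))
```

The health check and the override are passed in as plain functions so the routing logic can be tried out without any real network calls; a real setup would need actual health probes and a physical fallback such as a wall switch.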