Murphy’s Law

At Jet-Stream we were in the midst of a major platform upgrade last week. Most updates are ‘under the hood’: cleaning up old code, moving to the latest operating systems and applications, and enhancing IPv6 and SSL delivery. Our tech team had been preparing this upgrade for weeks, including extensive testing on our acceptance platform. Last week, something went wrong: Murphy paid a visit to our platform.

We never do upgrades on Friday, as a policy. On Thursday the 8th, we upgraded several intake, core, overflow and edge servers. These upgrades included new OS versions and application upgrades, such as our Wowza transmuxing engines and our own tuned NGINX caching engines. Because we have a redundant platform with active request routing, we can instantly take servers out of our active pool. After doing so, we wait for traffic to drop, then we upgrade the server using automated packages to prevent human errors. We then run an extensive test to make sure all features work, and add the updated server back to our active pool. Everything went smoothly: viewers experienced zero outages, and all VOD and live streams continued to perform all the way through the process of upgrading our server farm.
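For illustration, here is a minimal sketch of that rolling-upgrade loop. The helper names (request_router, run_upgrade_package, smoke_test) and the traffic threshold are assumptions for the sake of the example, not our actual tooling:

```python
# Hypothetical sketch of the rolling-upgrade loop described above.
# The helpers (request_router, run_upgrade_package, smoke_test) are
# illustrative assumptions, not Jet-Stream's actual tooling.
import time

DRAIN_CHECK_INTERVAL = 30   # seconds between traffic checks
DRAIN_THRESHOLD_MBPS = 1    # consider the server drained below this rate

def upgrade_server(server, request_router, run_upgrade_package, smoke_test):
    """Take one server out of the active pool, upgrade it, verify, re-add it."""
    request_router.remove_from_pool(server)        # stop sending new requests

    # Wait for existing sessions and transfers to finish.
    while server.current_traffic_mbps() > DRAIN_THRESHOLD_MBPS:
        time.sleep(DRAIN_CHECK_INTERVAL)

    run_upgrade_package(server)                    # automated packages, no manual steps

    if not smoke_test(server):                     # extensive feature test
        raise RuntimeError(f"{server} failed post-upgrade tests; keeping it out of the pool")

    request_router.add_to_pool(server)             # back into active duty

def rolling_upgrade(servers, request_router, run_upgrade_package, smoke_test):
    # One server at a time, so the redundant platform keeps serving viewers.
    for server in servers:
        upgrade_server(server, request_router, run_upgrade_package, smoke_test)
```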

We also started our CDN management platform upgrade, which is the intelligent heart of our platform. Everything seemed to be working, except for a glitch in the way our new SSL certificates behaved, so we decided to halt the upgrade process, roll back this specific SSL certificate implementation, and finalise the upgrade on Monday. We would work through the weekend on offline preparations. On Friday, Saturday and Sunday, all streams were up and running, as always, at the high performance our customers trust us to deliver.

Early Monday morning, we continued our work on upgrading our CDN management platform. During this process, we received an alert from upstream provider Leaseweb that they had encountered minor network issues on a specific 10Gbps port. Their advice was to switch a link to another port on their router. This action could cause an uplink hiccup of a few seconds but would not affect any stateless HTTP streaming. We were able to route all session-based streaming protocols (Icecast, RTSP and RTMP) through other uplinks, so customers’ and end viewers’ experience would not be affected. Within minutes the action was done and the network issues were over, without any disturbances.

A few minutes later, red blinkenlights happened. Red lights flashing on our monitoring dashboard. Red lights are bad. One of the many tests we run is a permanent platform-wide caching test script, to check the entire caching chain: from origins/intake servers, through core servers, overflows and edges, and even all the 3rd party CDNs integrated in our platform. This test alerts us if content is cached longer than it is supposed to be, or when caches can’t fetch the content from further up the caching chain. Multiple 3rd party CDNs suddenly reported caching issues randomly. Random sucks. The status of individual CDN caching reports randomly went from green to red. Green, red, red, green, green, green, red, red…
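To give an idea, a stripped-down version of such a chain check could look like the sketch below. The tier URLs, test path and max-age value are placeholders; the real script covers far more tiers and cases:

```python
# Minimal illustration of a caching-chain check: request the same test object
# from every tier and flag tiers that serve it stale or cannot fetch it at all.
# The tier URLs, test path and max-age are hypothetical placeholders.
import requests

TEST_PATH = "/healthcheck/cache-probe.ts"
MAX_EXPECTED_AGE = 60  # seconds the object may legitimately stay cached

TIERS = {
    "intake":    "https://intake.example-cdn.net",
    "core":      "https://core.example-cdn.net",
    "edge":      "https://edge.example-cdn.net",
    "3rd-party": "https://thirdparty-cdn.example.com",
}

def check_tier(name, base_url):
    try:
        resp = requests.get(base_url + TEST_PATH, timeout=5)
    except requests.RequestException as exc:
        return f"{name}: RED (cannot fetch upstream: {exc})"
    if resp.status_code != 200:
        return f"{name}: RED (HTTP {resp.status_code})"
    age = int(resp.headers.get("Age", 0))
    if age > MAX_EXPECTED_AGE:
        return f"{name}: RED (cached too long: Age={age}s)"
    return f"{name}: GREEN"

if __name__ == "__main__":
    for tier, url in TIERS.items():
        print(check_tier(tier, url))
```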

The quality of some live streams went down. Only specific streams seemed affected. Streams that had run perfectly for minutes suddenly started to have performance issues and then magically recovered. We could not pinpoint this to a specific customer (FYI, 99% of the performance issues we encounter are related to faulty encoder settings, high CPU load on encoders and the mother of all issues: encoder uplink limitations). Conclusion: this issue should be treated as a system-wide disturbance. That qualification gets the highest priority. So we decided to pause the upgrade process, although we were almost finished, and fully focus on this quality issue. All streams were up, and most of them were running perfectly, but seeing streams randomly underperform after upgrades and network changes is bad and needs full attention.

Since the disturbance started right after the uplink port switch, we first focused on the network. Were other routers and switches affected by the port switch? Full routing tables perhaps? We analysed the network and found no issues. We could do a power cycle of switches and routers, just to be sure, but this would cause connection loss for encoders, and would affect many more customers than the ones affected right now, so we decided to focus on other potential sources.

We started to restart caches serially, minimising the effect on viewers: third party CDNs and edge caches can retrieve content from multiple locations in our platform, so streams are not affected if we restart applications. And yeah, all streams stabilised! The blinkenlights were green. We had not found the root cause, but the issue seemed resolved. We decided to discuss the event and further actions in our standard Tuesday technical meeting, and continued our work on upgrading the CDN management platform. Because of the disturbances we had run into delays, so we would finalise our work on Tuesday.

Tuesday, early morning. The same random disturbances came back, and again nobody had changed anything. Madness. We decided to drop all meetings that day and went into what we call war-room mode with the team. We needed to find the source and fix everything. ASAP. And then our senior ops team member called in sick. Ouch. Hello Murphy. The other team members were feeling the hours and hours of work of the past days. We decided to start with more testing on streams: which streams are affected and what do they have in common? Deduction: HTTP VOD streams: not affected. RTMP streams: not affected. Icecast streams: not affected. HTTP live streams: some affected, some of the time; most not affected. Affected streams randomly recovered and randomly experienced 404 chunk errors.
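Roughly, the kind of probe we used looks like the sketch below: fetch an HLS manifest, then request each listed segment and count the 404s. The stream URL is a placeholder; the real tests ran across many customer streams and protocols:

```python
# Sketch of a probe to narrow the problem down: fetch an HLS manifest and then
# each listed segment, recording random 404s. The stream URL is a placeholder.
import requests
from urllib.parse import urljoin

MANIFEST_URL = "https://edge.example-cdn.net/live/demo-stream/playlist.m3u8"

def probe_hls(manifest_url):
    manifest = requests.get(manifest_url, timeout=5)
    manifest.raise_for_status()
    # Segment URIs are the non-comment lines of the media playlist.
    segments = [line.strip() for line in manifest.text.splitlines()
                if line.strip() and not line.startswith("#")]
    failures = []
    for seg in segments:
        resp = requests.head(urljoin(manifest_url, seg), timeout=5)
        if resp.status_code == 404:
            failures.append(seg)
    return failures

if __name__ == "__main__":
    missing = probe_hls(MANIFEST_URL)
    print(f"{len(missing)} segment(s) returned 404:", missing)
```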

We also made a list of all changes of the past days: the network port, the OS upgrades, Wowza upgrades, NGINX upgrades, IPv6, SSL… Except for the network port, all upgrades had been tested over and over on our acceptance platform. We decided to set up some streams on our acceptance platform, to see if we could replicate the problem. We could not. Everything worked fine.

Then a customer sent in a support ticket with a new problem: they could not download VOD videos from our ingest points. Strange, because we don’t allow downloads from ingest points. We discussed whether we needed to give this issue priority: a team member tested whether uploading worked, and it did, even with the customer’s credentials. The performance issue affected multiple customers, so we decided to prioritise that.

We knew that 3rd party global CDNs have issues relating to IPv6 and SSL. We wondered if it made sense to roll back to IPv4 and not enforce SSL. However, we found that some streams on our own caches were affected too. Those caches are guaranteed to work with IPv6 and SSL, so we ruled out IPv6 and SSL.

We asked ourselves if the increased security of our anti-deep-linking technology could break caching in edges and CDNs. The developers and ops checked the code and the system; however, everything seemed to work fine.

Then another customer sent in a new ticket with yet another problem: they could not connect their encoder. We immediately found the cause: we had paused the upgrade of our CDN management platform and had not finished whitelisting all encoder types. This was fixed instantly. Customer happy. Their high quality HD HTTP streams performed beautifully, even though we still needed to fix the caching performance issue. We also realised that we had not yet turned on TLS for FTP, since we had paused our upgrade process, and enabled this as well. This solved the ticket of the previous customer: they had written ‘download’ where they meant ‘upload’. Most customers use automatic rollover from FTP/TLS to FTP if TLS is not active; this specific customer, however, used enforced FTP/TLS ingest. Problem solved, and their content batch started to flow in.
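For readers unfamiliar with that rollover behaviour, the sketch below shows the idea: a client with rollover tries FTP over TLS and silently falls back to plain FTP, while an enforced-FTPS client fails outright, which matches what this customer saw. Host and credentials are placeholders, and this is an illustration, not any customer’s actual tooling:

```python
# Illustration of FTP/TLS 'rollover': try FTP over TLS first and fall back to
# plain FTP if TLS is not available. A client that *enforces* FTPS raises
# instead of falling back. Host and credentials are placeholders.
from ftplib import FTP, FTP_TLS, error_perm, error_temp

def connect_with_rollover(host, user, password, enforce_tls=False):
    try:
        ftps = FTP_TLS(host, timeout=10)
        ftps.login(user, password)
        ftps.prot_p()                 # encrypt the data channel too
        return ftps
    except (error_perm, error_temp, OSError):
        if enforce_tls:
            raise                     # enforced FTPS: no silent downgrade
        ftp = FTP(host, timeout=10)   # rollover to plain FTP
        ftp.login(user, password)
        return ftp

# Example: connect_with_rollover("ingest.example-cdn.net", "user", "secret",
#                                enforce_tls=True) fails outright while TLS
# is switched off on the server.
```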

Because restarting NGINX had temporarily fixed the caching performance issues, we decided to roll back all our NGINX configurations to last week’s version. To do this smoothly, without affecting end users, we had to do it in stages. At first, the errors were gone. And then they came back. We got more and more frustrated, since it was so hard to pinpoint the source.

The ops team member who had called in sick came into the office to help. He ran more network tests but found nothing. He tested the operating systems and everything was running fine. He did some research on NGINX and found out that the latest version implemented a new caching algorithm. He suspected this could interfere with our anti-thundering-herd (intelligent live HTTP streaming caching) features. We decided to roll back NGINX entirely, not just the configuration but the entire engine, to last week’s state on some machines, to see if it made a difference. It did not. We were running out of options.
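For context, the anti-thundering-herd idea boils down to this: when many viewers ask for the same expired live manifest at once, only one request goes upstream and the rest reuse the result. The sketch below is a generic, conceptual illustration of that pattern, not our implementation and not NGINX’s algorithm:

```python
# Generic illustration of request coalescing (anti-thundering-herd) for live
# manifests: only one fetch per expired key goes upstream; concurrent callers
# wait and reuse the result. Conceptual sketch only.
import threading
import time

class CoalescingCache:
    def __init__(self, fetch_upstream, ttl=2.0):
        self.fetch_upstream = fetch_upstream   # function(key) -> bytes
        self.ttl = ttl                         # live manifests expire quickly
        self._data = {}                        # key -> (value, fetched_at)
        self._locks = {}                       # key -> lock guarding the fetch
        self._meta_lock = threading.Lock()

    def _lock_for(self, key):
        with self._meta_lock:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key):
        entry = self._data.get(key)
        if entry and time.time() - entry[1] < self.ttl:
            return entry[0]                    # fresh: serve from cache
        with self._lock_for(key):              # only one fetch per key at a time
            entry = self._data.get(key)
            if entry and time.time() - entry[1] < self.ttl:
                return entry[0]                # another thread just refreshed it
            value = self.fetch_upstream(key)
            self._data[key] = (value, time.time())
            return value
```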

We had ignored Wowza so far. Our release included a minor Wowza upgrade. It was the lowest on our suspect list because our experience is that we can usually simply replace Wowza without any compatibility issues. We were still analysing the affected streams. And there we found the cause! We saw extremely large headers. Each time content was requested from a Wowza engine again, Wowza doubled the content of the header. VOD manifest files are cached for a longer period, so they are typically requested just a few times. Live manifest files, however, are updated every few seconds and need to be requested every few seconds. A bug apparently made the header double in size with each new request. At some point the header becomes so ridiculously large that caches and video players refuse to retrieve or cache the requested file. The process of growing headers can take hours, even days. That’s why we could not immediately pinpoint it to the upgrades done on Thursday.
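A simple watchdog on response-header size could have flagged this symptom earlier. The sketch below polls a live manifest and warns when the total header size keeps growing between requests; the URL, polling interval and threshold are illustrative assumptions:

```python
# Sketch of a check for the symptom described above: poll a live manifest and
# alert when the total response-header size keeps growing between requests.
# URL, interval and threshold are illustrative, not production values.
import time
import requests

MANIFEST_URL = "https://edge.example-cdn.net/live/demo-stream/playlist.m3u8"
POLL_INTERVAL = 5          # live manifests refresh every few seconds
GROWTH_ALERT_BYTES = 8192  # many caches/players reject very large headers

def header_size(resp):
    return sum(len(k) + len(v) for k, v in resp.headers.items())

def watch_headers(url, samples=10):
    previous = None
    for _ in range(samples):
        size = header_size(requests.get(url, timeout=5))
        if previous is not None and size > previous:
            print(f"WARNING: response headers grew from {previous} to {size} bytes")
        if size > GROWTH_ALERT_BYTES:
            print(f"ALERT: header size {size} bytes exceeds typical limits")
        previous = size
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    watch_headers(MANIFEST_URL)
```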

We commented out the code for these headers, restarted the Wowza farm (sorry for the RTMP connection loss, people) and everything was running fine. Yes!

We further analysed which streams had been affected. Customers with very low viewer numbers were not affected, since their headers did not grow that large. Customers who sent their live streams redundantly to our platform were mostly not affected, because our caches had twice as many sources to retrieve content from. This automatic failover is good, because end users simply don’t encounter issues; however, it made it harder for us to pinpoint the cause.

And then a customer sent in a new ticket: they could not set up new live streams. But their information was scarce and we could not reach them. Which live stream type? Push or pull? RTMP, RTSP, HTTP? We tried to reproduce their problem, but when we set up new live streams in their account, everything seemed to work fine. After we learned the exact steps the customer took to set up their specific live stream workflow, it turned out that in the chaos of the day we had not fully finalised all configurations, so one of the many servers in the pool was missing a script to set up specific publishing points. Sorry. Fixed immediately.

The day was saved. Except for a handful of streams which had random 404s on video segments, and except for a single brief interruption of RTMP streams due to a required application restart, no viewers were affected, and the three reported incidents, which affected only customer workflows, not streams, were resolved.

Every incident is one too many, of course. But I am very proud of our team. One customer complained that we had not fixed their workflow issue immediately: they are used to us helping them instantly. I fully support the team’s deduction and prioritisation choices. In such a situation, we need to prioritise instantly: availability of streams platform-wide always has the highest priority. Then comes performance. The show must go on. Customer workflow is important, but never as important as content availability and performance. System-wide issues always get priority over individual customer issues. The team prioritised very well. After we explained our day to the customer, they fortunately understood. The team took a well-deserved day off.

I am also proud of our technology. If we had run a DNS-based system, like most CDNs do, we would never have been able to smoothly take so many servers out of active pools, update them and reactivate them. Our active request routing platform helped us make changes and test them on the fly, without affecting end users.

Our team’s technical knowledge of HTTP streaming and caching also helped us pinpoint the root of the performance issue faster. HTTP streaming is great technology, and many claim that life has become easier thanks to it (no more proprietary streaming servers and protocols), but the opposite is true as well: HLS, DASH, HDS, Smooth, manifests, segments, caching headers, caching settings: there is so much more to tweak and tune, and since it is so open and configurable, multiple parties in a complex workflow can mess up each other’s systems. Combine the many combinations with hundreds of demanding customers, each with their own specific custom workflow, and you need an expert team to run a scalable streaming company.

Fortunately, we hardly ever experience such incidents. We had planned and tested thoroughly. We lost precious time because at first we suspected a network issue. We had not found the header-doubling bug on our acceptance platform because it only affected live streams in specific configurations at high request volumes.

Even though hardly any streams were affected, we want to learn from each incident. We are now in the process of discussing improvements. Is it possible to further isolate upgrades, so we can pinpoint a cause sooner? Is the bug we found due to our own customisations? Can we add volume testing to our feature testing on our acceptance platforms? Can we improve communication with customers? How can we make sure customers feed us the right information, so we can pinpoint a problem faster?
