Without going into my own professional background, I have *some* familiarity with the systems and approaches they are using, and I'm confident there is an architectural problem with their solution that was breaking the queuing function of the flow, probably due to demand. I wonder if it's something more serious than just "how do we keep the page from crashing": something like duplicate ticket numbers going out, or tickets not being recorded on the back end even though confirmation emails went out (and will be honored). That would be the kind of thing that seriously causes them problems, beyond just managing demand. Again, they COULD just beef up the bandwidth and processing for the site (especially since, as someone else said, it's likely hosted on AWS or something) and open general sales without a queue. Main ticket sales still work, after all...
As someone else mentioned, the way this is designed to work is to segregate those purchasing tickets into two pools: a 'queue' pool that sorts folks into a virtual line, and an 'order' pool whose members have a limited window of time to purchase tickets. Folks at the front of the virtual line are migrated from the queue pool into the order pool at a controlled rate, so that the order flow remains stable, the ordering system can't be overwhelmed, and the user experience improves because only a minimal number of tickets are held up while individuals complete purchases. In principle, this is a good idea, because it is computationally much easier to hold folks in a line than to process hundreds of thousands of competing order attempts all at once.
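To make that concrete, here's a minimal sketch of the two-pool pattern in Python. Every name and number in it is invented for illustration; this is not Disney's actual code, just the general idea of a virtual waiting room:

```python
import collections

# Minimal sketch of the two-pool pattern, assuming a single process.
# All names and numbers below are invented for illustration.

QUEUE_POOL = collections.deque()   # the virtual line (FIFO)
ORDER_POOL = set()                 # sessions currently allowed to purchase
ORDER_POOL_CAPACITY = 500          # assumed max concurrent purchasers
MIGRATION_RATE = 50                # assumed sessions admitted per tick

def join_queue(session_id: str) -> None:
    """New arrivals go to the back of the virtual line."""
    QUEUE_POOL.append(session_id)

def migrate_tick() -> None:
    """Run periodically: admit people from the line into the order pool
    at a controlled rate, never exceeding the order system's capacity."""
    slots = min(MIGRATION_RATE, ORDER_POOL_CAPACITY - len(ORDER_POOL))
    for _ in range(max(0, slots)):
        if not QUEUE_POOL:
            break
        ORDER_POOL.add(QUEUE_POOL.popleft())

def complete_order(session_id: str) -> None:
    """A finished (or expired) purchaser frees a slot for the next person."""
    ORDER_POOL.discard(session_id)
```

The key property is that the order pool can never hold more than its capacity, no matter how many people are waiting in line.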
Because some folks were able to get through to the order pool, we know that ordering more or less functioned as expected. The problem is with the queue.
It appears that, shortly after opening sales, Disney published the message regarding paused orders. However, some folks were still able to eventually enter the order pool and purchase tickets, and as far as I can tell this persisted for most of the day. This suggests that, from the outset, their solution for migrating individuals from the queue pool to the order pool was not working correctly.
I personally had multiple devices, browsers, and connections open from just before 9 AM PT until they shut down the sale, and I was never able to order tickets. One of my party members, in comparison, opened the order page in one shot during their lunch break, with no queuing at all, and got confirmed tickets with almost no effort. Taken together, our experiences are fundamentally at odds with the intended functionality of the system, which is to migrate people into the order flow on a first-come, first-served basis (with those who joined before 9 AM being randomized to 'start' the line). So we can probably be more specific and say there was a problem with either how the queue was being ordered, or how folks were being flowed into and out of the queue pool.
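For what it's worth, the intended "randomize the pre-9 AM crowd, then first-come, first-served" behavior is simple to express, which is part of why the observed outcome is so strange. A hedged sketch (again Python, names invented):

```python
import random

def build_initial_queue(pre_open_sessions: list[str],
                        post_open_sessions: list[str]) -> list[str]:
    """Sessions waiting before 9 AM get randomized positions at the front
    of the line; everyone who arrives after keeps strict arrival order."""
    head = pre_open_sessions[:]      # copy so we don't mutate the input
    random.shuffle(head)             # randomize the pre-open cohort
    return head + post_open_sessions
```

If that's what the system was supposed to do, someone who showed up at lunchtime should have landed behind everyone already in line, not sailed straight through.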
It is very hard when you are talking about events with levels of demand for which there are few other instances. I don't say this to defend Disney; they screwed this up. But it is important to understand that solutions that work at certain levels of scale break down at others, and you often have no way of knowing until you encounter it. It could be that the logic to spin up more instances to distribute the queue failed. It could be that logic that works on a single instance no longer works reliably once enough instances are running. It could be that exceptions occur when the rate of individuals entering the queue exceeds a certain threshold. It is honestly all but impossible to anticipate all the contingencies, and that is why even the best and most prepared companies still fail at this, and can fail often.
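As one concrete (and purely hypothetical) example of the "works on one instance, breaks across many" failure mode: if each web instance keeps its own local queue, global first-come, first-served ordering quietly disappears the moment you shard.

```python
from collections import deque

# Purely hypothetical failure mode: a queue that is correct on one
# instance can misbehave once sharded. If each web instance keeps its
# own local line, a late arrival on a short shard gets admitted ahead
# of an earlier arrival stuck on a long one.

shards = [deque(), deque()]        # one local queue per instance

def join(session_id: str, instance: int) -> None:
    shards[instance].append(session_id)

def admit_one_per_shard() -> list[str]:
    """Each instance independently admits the head of its own line."""
    return [s.popleft() for s in shards if s]

join("early_user", 0)
join("another_early_user", 0)      # second in line on instance 0
join("late_user", 1)               # lands on an empty instance 1

# late_user gets in ahead of another_early_user despite arriving last:
print(admit_one_per_shard())       # ['early_user', 'late_user']
```

A bug like this never shows up when you test on one instance; it only appears under enough load to force scaling out, which is exactly when you can least afford it.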
From what we can observe, the odds are that demand was far higher than Disney expected, to the point where the existing solution started breaking down in a way that was not recoverable. They probably tried for the entire day to find a way to get it working well enough to resume normal operations, but were ultimately not successful; hence the need to buy time and be non-committal about when orders will resume while they run a post-mortem and devise a solution.