
2023 State of The Sim


sviridovt


Hi folks! 

As we approach the end of 2023, we can reflect on the changes that happened to the sim over the past year. Although there were only a handful of visible changes, significant work went into improving stability and optimizing the back end to improve the experience for all players. Early in the year we migrated hosting from AWS to Linode, which has reduced the overall bill. I also used this opportunity to improve the overall infrastructure and reliability.

The bottom line is that when the sim was created 5 years ago, I was not nearly the programmer that I am today (and hopefully in 5 years I will be able to say the same about myself today). I have always credited ASW with being the single biggest learning experience of my programming career. Although I have been programming from an early age, this was my first long-term project, and on more than one occasion my lack of experience with long-term projects led me to take shortcuts that caused issues down the line. I have gotten better over the years, and I thank you all for giving me the opportunity to better myself.

That said, I have spent the past year primarily applying what I have learned from 5 years of keeping the sim up, combined with my recent experience programming professionally for one of the largest tech companies, to improve the sim's stability, drive a better user experience, and create a solid platform before moving on to major feature work.

 

Stability Improvements

Improved Monitoring and Reliability 

One of the things that I did as part of the migration was add better monitoring for errors and latency. Since the early days of the sim we have had analytics on things like user counts, and we have taken a snapshot of the sim state during any crash, which remains useful to this day for debugging specific problems. However, because of how this data was stored, it was difficult to use it to decide what to prioritize and where to spend more time, both in development and in bug fixes. As such, we added server-side monitoring and log indexing (previously logging was done to a file, which was cumbersome to work with as there was limited support for even basics like searching for a specific request). This gave important visibility into which pages were causing the most issues, the latency of each page, and the nature of the issues. It also provided metrics on which pages were getting the most traffic and which features were being used most and least.
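
To give a sense of why indexed logs are easier to work with than a flat file, here is a minimal Python sketch of structured (JSON) request logging. It is not the sim's actual logging setup; the `JsonFormatter` class, the `build_logger` helper, the logger name, and the `path`, `status`, and `latency_ms` fields are all assumptions made for illustration.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object so it can be indexed."""

    # Hypothetical per-request fields, attached via the `extra` argument.
    REQUEST_FIELDS = ("path", "status", "latency_ms")

    def format(self, record):
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in self.REQUEST_FIELDS:
            # Fields that were not supplied are emitted as null.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def build_logger(name="sim.requests"):
    """Return a logger that writes one JSON line per record to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger().info("request finished", extra={"path": "/route-research", "status": 200, "latency_ms": 250})` would then produce a single searchable line per request, which is the kind of record a log index can filter by page or status code.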

Fatals reductions 

One of the main goals to come out of the expanded monitoring was tracking how many fatal requests we were receiving on each page. (A fatal is a 5xx error, namely the page showing an error code; that code identifies a snapshot that I can refer back to when debugging the issue.) When we started tracking this metric, I found that we were getting about 20 fatals per day, but the vast majority of them came from only a handful of pages. As such, much of the first couple of months was spent on the pages that were causing the most issues, which included the scheduler and the flight update page. Most of the issues came down to a lack of input validation, such as receiving a string where a number was expected. Notably, while we were sanitizing input against things like SQL injection attacks, we weren't verifying that the input was valid for the specific context.
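
The difference between sanitizing input and validating it for its context can be sketched roughly as follows. This is an illustrative Python example, not the sim's code: the field names (`ticket_price`, `weekly_frequency`) and their limits are hypothetical.

```python
def parse_positive_int(raw, field, maximum=None):
    """Return raw as a positive int, or raise ValueError naming the field."""
    try:
        value = int(str(raw).strip())
    except ValueError:
        raise ValueError(f"{field}: expected a whole number, got {raw!r}")
    if value <= 0:
        raise ValueError(f"{field}: must be greater than zero")
    if maximum is not None and value > maximum:
        raise ValueError(f"{field}: must be at most {maximum}")
    return value


def validate_flight_form(data):
    """Validate a hypothetical flight-update payload.

    Returns (cleaned, errors) so the caller can answer with a 4xx and a
    helpful message instead of letting a bad value crash the request.
    """
    cleaned, errors = {}, {}
    # Assumed fields and upper bounds, chosen only for illustration.
    for field, maximum in (("ticket_price", 100_000), ("weekly_frequency", 7)):
        try:
            cleaned[field] = parse_positive_int(data.get(field), field, maximum)
        except ValueError as exc:
            errors[field] = str(exc)
    return cleaned, errors
```

With a payload like `{"ticket_price": "250", "weekly_frequency": "abc"}`, the price is accepted and the frequency is reported as an error, rather than the string reaching code that expects a number.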

However, some bigger issues were discovered as well. For example, about halfway through last year I transitioned to a CI pipeline for deployments (automating deployments so that shipping new features or bug fixes requires only the press of a button rather than manual actions on my side, a great improvement in my ability to get things out faster). As part of that change, I moved the beta server from my home server to the production server and used the same instance for beta as for prod. This caused fatals: the Event Manager takes a number of connections, and once the connections required by both the beta and prod instances were combined, we were running out of connections. Discovering this issue prompted me to briefly disable beta and undertake a containerization effort, which completely isolated beta and prod dependencies and greatly improved reliability, creating a firewall between the two stages. We also now take greater care to track database connections to prevent this issue from happening again.
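
The connection problem is essentially a budgeting exercise, which can be sketched in a few lines of Python. Every number and component name below is hypothetical; the sim's real limits and per-component usage are not stated in this post.

```python
def check_connection_budget(max_connections, reserved, stages):
    """Compare the connections each stage needs against a shared limit.

    stages maps a stage name to {component: connections needed}.
    """
    usage = {stage: sum(parts.values()) for stage, parts in stages.items()}
    total = sum(usage.values())
    available = max_connections - reserved
    return {
        "usage": usage,
        "total": total,
        "available": available,
        "over_budget": total > available,
    }


# Hypothetical figures: one shared database serving both stages.
SHARED = {
    "prod": {"web": 40, "event_manager": 30},
    "beta": {"web": 40, "event_manager": 30},
}
report = check_connection_budget(max_connections=100, reserved=3, stages=SHARED)
```

In this made-up scenario each stage fits on its own (70 of 97 usable connections), but the two together need 140, so `report["over_budget"]` is true. Isolating the stages, as the containerization effort did, avoids the two competing for one pool.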

All of this work on stability paid off: following the improvements, we have gone from about 20 fatals per day to about 2, with the fatal rate on every page well under 10%. A great improvement from previous years.

Latency Improvements

After concluding the fatals effort, the next area of focus was improving latency. A number of pages were marked as needing improvement, chief among them the route research page. Apart from being the most requested page in the sim, it also had the highest latency, with a P50 (median request) latency of about 23 seconds and a P95 (95th percentile) latency of over a minute and a half. This was unacceptable. I set a personal goal of reducing latency across the sim to a P50 of 2.5 seconds and a P95 of 5 seconds. To achieve this goal, I adopted the strategy of reducing dependence on server-side rendering in favor of API calls, allowing data to load as it arrives. As the vast majority of the latency issues were due to waiting for data to load from the database, this kind of divide-and-conquer approach led to significant improvements. In addition, pagination was added server-side to limit the data being loaded. When applied to the route research page, where all of the table data (my flights, all flights data) was moved to the API-based approach, this took the page from being one of the worst for latency to being one of the fastest, with a P50 of about 250 ms and little variability between busy/non-busy flights. This approach was then applied to other pages, such as the airline ranking and flights list pages, to improve latency across the board.
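
Server-side pagination of the kind described above can be sketched as follows. This is a self-contained Python illustration using an in-memory SQLite database, not the sim's actual Django code; the `flights` table, its columns, and the helper names are assumptions.

```python
import sqlite3


def build_demo_db(row_count=60):
    """Create an in-memory table of placeholder flights for the demo."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE flights (id INTEGER PRIMARY KEY, origin TEXT, destination TEXT)"
    )
    conn.executemany(
        "INSERT INTO flights (origin, destination) VALUES (?, ?)",
        [("JFK", "LHR")] * row_count,
    )
    return conn


def fetch_flights_page(conn, page, page_size=25):
    """Load one page of flights instead of the whole table."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    offset = (page - 1) * page_size
    # Ask for one extra row so we know whether another page exists
    # without running a separate COUNT query.
    rows = conn.execute(
        "SELECT id, origin, destination FROM flights ORDER BY id LIMIT ? OFFSET ?",
        (page_size + 1, offset),
    ).fetchall()
    return {
        "items": rows[:page_size],
        "page": page,
        "has_next": len(rows) > page_size,
    }
```

An API endpoint built this way returns a small, bounded response per call, so the page shell can render immediately and each table can fill in as its data arrives.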

When I first undertook this effort, I set up alarms on the P50/P95 latency metrics, and at first almost every part of the site was in alarm for one or the other. After all of these changes, we are regularly green on the alarms for specific pages. The work is not done, as we are at 5 seconds for P95 sim-wide (and over that for P99), but we are in a much better place.
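
For anyone unfamiliar with the P50/P95 terminology, the alarm logic can be sketched in Python as below. The targets (2.5 seconds for P50, 5 seconds for P95) are the goals stated above; the nearest-rank percentile method and the function names are assumptions, since the actual monitoring tool computes these for me.

```python
import math

P50_TARGET_SECONDS = 2.5
P95_TARGET_SECONDS = 5.0


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_alarms(samples):
    """Report P50/P95 for a page and whether either exceeds its target."""
    p50 = percentile(samples, 50)
    p95 = percentile(samples, 95)
    return {
        "p50": p50,
        "p95": p95,
        "p50_alarm": p50 > P50_TARGET_SECONDS,
        "p95_alarm": p95 > P95_TARGET_SECONDS,
    }
```

This also shows why both metrics matter: a page where most requests are fast but a few take several seconds can be green on P50 while still being in alarm on P95.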

One of the other changes that came out of this effort was a complete redesign of the new flight/update flight flow called by the scheduler. This page was problematic for latency as well as for fatals and bugs, so I took the decision to redesign it completely. Although the scheduler API remains one of the higher-latency pages (I plan to improve it as part of the planned scheduler redesign), the result was a noticeable improvement, with a significant reduction in bugs in the flow and better latency.

Ongoing Feature Work 

New Dashboard/UI Redesign 

Although initially meant only as an improvement to support the new fuel model (which is otherwise ready to go), this ended up being a much bigger effort, as I decided to take the opportunity to start migrating the broader sim to React rather than Django-based templates. This is an important step toward improving the usability of the website. Further, the new UI is designed to be mobile-first, addressing a major issue that currently exists with the sim. This is all part of a broader effort to move completely off Django-based views in favor of a React site. However, as that would be a significant undertaking that would take over a year, in the meantime I am working on a 'halfway' solution in which we integrate React components into the current Django site and can transition them later. This way, you will get the advantages of the new UI sooner.

As of now, the base work for the UI is done, and I am currently working on integrating the newly created SDK to establish the front end/back end connection. Once that is done, the new dashboard should be ready for release.

New Fuel Model/Maintenance

The fuel model is complete but is pending the dashboard (I do not wish to release the fuel model without a way to see fuel prices etc.). As it stands, the model introduces dynamic fuel prices that vary from day to day. Further, it fixes one of the main issues with the current fuel model, which is that it does not adjust for inflation. At present we do not model inflation in the sim (and have no plans to do so, to avoid having to constantly update flight prices), which means that fuel is unrealistically cheap in earlier years.

Maintenance Update 

One of the major updates planned is to the maintenance system. Part of that update will also be a redesign of the scheduler, which is intended to allow you to move flights between planes. More will be coming on this at a later time.

New Demand Model 

One of my major undertakings for 2024 will be a redesign of the demand system. This will mean the creation of a new service, which I have termed the Atlanta service, written in Rust to take advantage of its more optimized runtime performance (also, I just want to learn Rust), as well as a graph-based database to allow the implementation of connecting flights. The model itself is already worked out (math-wise) and will introduce 7 different types of passengers, each with their own needs and priorities when it comes to choosing flights; price will no longer be king for certain kinds of passengers. This should allow for better diversity among airline carriers, as wealthier passengers in particular will be a lot pickier about which flights they want to take, though they will be willing to pay top dollar to ensure their demands are met.
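
To illustrate the general idea of passenger types weighing flights differently, here is a toy Python sketch. It is emphatically not the actual demand model: the two passenger types, their weights, the scoring formula, and the flight attributes are all invented for illustration (and the real service is planned in Rust, not Python).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassengerType:
    name: str
    price_weight: float    # how strongly price counts against a flight
    comfort_weight: float  # how strongly comfort counts in its favor


@dataclass(frozen=True)
class FlightOption:
    airline: str
    price: float    # ticket price in sim currency
    comfort: float  # hypothetical comfort rating from 0.0 to 1.0


def score(passenger, flight, reference_price):
    """Higher is better; price is scaled against a reference fare."""
    return (passenger.comfort_weight * flight.comfort
            - passenger.price_weight * flight.price / reference_price)


def choose_flight(passenger, options, reference_price):
    """Pick the option this passenger type scores highest."""
    return max(options, key=lambda f: score(passenger, f, reference_price))


# Two made-up passenger types and two made-up flights.
BUDGET = PassengerType("budget", price_weight=1.0, comfort_weight=0.1)
PREMIUM = PassengerType("premium", price_weight=0.2, comfort_weight=1.0)
OPTIONS = [FlightOption("Cheap Air", 100, 0.2), FlightOption("Lux Air", 300, 0.9)]
```

With these invented numbers, `choose_flight(BUDGET, OPTIONS, 100)` picks Cheap Air while `choose_flight(PREMIUM, OPTIONS, 100)` picks Lux Air despite its fare being three times higher, which is the kind of behavior that lets more than one style of airline succeed.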

Closing Thoughts

As we go into the new year and the 5-year anniversary of ASW, I am excited for the future and to get back into feature work now that we have a good platform to build on. While my availability unfortunately remains unpredictable due to my job and other commitments, it always brings me great joy to work on ASW and interact with the community. Thank you for making it possible. I wish you all a happy 2024, and I hope you will join us as we celebrate our 5th anniversary!

 

PS: As a thank you to our Patreons, they got to see this post earlier. Interested in supporting the sim? Become a Patreon here!
