Solving Complexity and Scale Challenges

Published by cyang on

Solving Complexity and Scale Challenges: How We Saved Our Client 1,000+ Annual Hours

In 2020, we built a new performance engineering solution for a client in the video game industry that replaced their home-grown tool.

Their challenge was sustained traffic at an unusual scale: the platform had to support approximately 10,000 logins per second.

A traffic load this size required an unprecedented load test.

Additionally, a load like this was far beyond the capabilities of any commercial or open-source solution on the market.

In response, TPC developed a customized solution that supported both HTTP and RPC protocols, created a custom framework, and then rolled that framework out across their ecosystem and trained their teams on it.

As software systems become more complex over time and as the number of users increases, developers need solutions and strategies to ensure that the system performs well under various load conditions and continues to meet customer expectations.

The Problem

Our new client had developed a home-grown performance solution years earlier but was unsure how to use it correctly and had no consistent process for running it.

They also lacked a dedicated performance engineering (PE) team of experts.

Additionally, the company required an unprecedented load test with tens of thousands of logins per second.

Their existing home-grown solution and processes simply did not meet the performance standards required.

Because there were no commercial or open-source solutions available to evaluate the system at that scale, they reached out to us hoping to achieve significant time savings, simplify test execution, and launch a new internal testing paradigm for the team.

What Is Complexity and Scale of Systems?

Complexity and scale are two of the most significant challenges facing software development teams.

Complexity refers to the number of interconnected systems and the interdependencies among those systems.

Scale refers to the size of the system, the number of users, and the amount of data processed.

Software systems tend to become more complex over time. This is due to multiple factors, such as:

Interactions between multiple systems within or outside the organization via microservices
Multi-cloud provider environments
Hybrid environments
Large computing resources, serverless solutions like AWS Lambda and Azure Functions, queuing systems like Amazon SQS, and NoSQL databases like DynamoDB and Azure Cosmos DB
Autoscaling of resources and implementation of monitoring systems like Azure Application Insights, Amazon CloudWatch, Datadog, or other APM systems

As the number of users increases, consumer expectations remain the same: users still expect a best-in-class customer experience.

Our client’s developers needed the processes, solutions, and strategies to ensure that their system performed well under various load conditions.

Challenge #1: Unprecedented Load Test

The client required simulation of traffic for launches and regular volumes, using 10K logins/sec and 10M concurrent users.

Traffic was mainly RPC.
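The login rate and the concurrency target are linked by Little's Law (concurrency ≈ arrival rate × average time in the system). The sketch below is a back-of-the-envelope check, not part of the client's tooling. It assumes both figures describe the same steady-state window, which the engagement details above do not state, and the function names are illustrative.

```python
# Little's Law: concurrency = arrival_rate * average_time_in_system.
# Assumption (not stated in the article): 10K logins/sec and 10M concurrent
# users describe the same steady-state period.


def implied_session_seconds(concurrent_users: float, logins_per_second: float) -> float:
    """Average session length implied by a concurrency level and a login rate."""
    if logins_per_second <= 0:
        raise ValueError("logins_per_second must be positive")
    return concurrent_users / logins_per_second


def steady_state_concurrency(logins_per_second: float, session_seconds: float) -> float:
    """Concurrency reached once arrivals and departures balance out."""
    return logins_per_second * session_seconds


if __name__ == "__main__":
    seconds = implied_session_seconds(10_000_000, 10_000)
    print(f"Implied average session: {seconds:.0f} s (~{seconds / 60:.1f} min)")
```

Under that assumption, the two targets together imply an average session of roughly 1,000 seconds. A shorter average session would need a higher login rate to hold the same concurrency.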

Two problems emerged:

No available performance testing environment
No off-the-shelf tool that supported the required number of concurrent users over the RPC protocol

Creating a performance testing environment was challenging due to the high cost and size of infrastructure needed.

Integration with other large systems for authentication, microservices, loyalty data, and social features would also be required. There would also be implications involving licensing costs and effort.

We decided not to create a performance testing environment due to the prohibitive cost and effort required to build and scale all systems.

Instead, we chose to use production as the environment.

Although not usually recommended, in this case, it was the right choice. This was one of the more complex environments we’ve seen from a performance standpoint.

For the second problem, the client had already built a custom tool. However, it had limited features and couldn’t cover the complete testing scope.

We ran a multi-point analysis comparing commercial performance testing tools with the client’s custom tool. The results showed that the custom tool was the right choice given their business requirements and internal technical capabilities.

No market tools met concurrent user requirements. Additionally, they needed controls for live testing during peak traffic times. Also, the license costs for market tools were exceedingly high.

The decision was to invest more time and resources in extending the client’s custom tool, as TPC proposed.

We needed the right performance solution to support this exceptional load testing.

Requirements	CustomTool	OctoPerf	NeoLoad	BlazeMeter	CloudTest
Support 10 million concurrent users	Yes	No	No	No	No
Support 50 load generators per 1.25 million Concurrent Users (ability to drive 30k concurrent users per load generator)	Yes	No	No	No	No
Support test level pause, ramp up and ramp down	Yes	Partial	Partial	Partial	Yes
Support request level pause, ramp up and down	Yes	No	No	No	No
One dashboard for all the tests	No	Yes	Yes	Yes	Yes
Run test from multi-cloud environment	Yes	Yes	Yes	Yes	Yes
Off the Shelf	No	Yes	Yes	Yes	Yes
Improved Ease of Use	No	Yes	Yes	Yes	Yes
Improved Ease of Scripting	No	Yes	Yes	Yes	Yes
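To make the ramp, pause, and load-generator rows more concrete, here is a minimal sketch of a test-level load profile. It is not the client's tool: `RampProfile`, `active_seconds`, and `generators_needed` are hypothetical names, and "pause" is assumed to mean that the schedule freezes and the current load is held until the test resumes. The 30,000 users-per-generator default is taken from the table; treat it as a tunable parameter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RampProfile:
    """Hypothetical test-level profile: linear ramp up, hold, linear ramp down."""
    peak_users: int
    ramp_up_s: float
    hold_s: float
    ramp_down_s: float

    def target_users(self, active_s: float) -> int:
        """Target concurrency after `active_s` seconds of un-paused run time."""
        if active_s <= 0:
            return 0
        if active_s < self.ramp_up_s:
            return int(self.peak_users * active_s / self.ramp_up_s)
        if active_s < self.ramp_up_s + self.hold_s:
            return self.peak_users
        end = self.ramp_up_s + self.hold_s + self.ramp_down_s
        if active_s < end:
            return int(self.peak_users * (end - active_s) / self.ramp_down_s)
        return 0


def active_seconds(wall_s: float, pauses: list[tuple[float, float]]) -> float:
    """Wall-clock time minus time spent in (start, end) pause windows.

    Assumes the pause windows do not overlap. While paused, the result stops
    advancing, so the target load stays at the level already reached.
    """
    paused = sum(max(0.0, min(wall_s, end) - start) for start, end in pauses)
    return wall_s - paused


def generators_needed(users: int, users_per_generator: int = 30_000) -> int:
    """Load generators required for `users`, rounded up (integer ceiling)."""
    return -(-users // users_per_generator)


if __name__ == "__main__":
    profile = RampProfile(peak_users=10_000_000, ramp_up_s=1800, hold_s=3600, ramp_down_s=900)
    pauses = [(600.0, 900.0)]  # operator paused the test for five minutes
    for wall in (300, 750, 1200, 4000):
        print(wall, profile.target_users(active_seconds(wall, pauses)))
    print("generators:", generators_needed(profile.peak_users))
```

A request-level control would apply the same idea per request type, with one profile and one pause list per RPC or HTTP call rather than one for the whole test.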

Challenge #2: A Different Protocol Complicated Development

The client’s game traffic ran mainly over RPC rather than the more common HTTP protocol.

This added an extra layer of complexity and disqualified most commercial and open-source solutions that only supported HTTP protocol.

To address this, we built in the ability to handle both HTTP and RPC protocols.

Challenge #3: Inadequate Testing Framework or Training Resources

The customer’s performance tests were too different from their production models. The custom framework we built generated enough load for RPC calls but didn’t simulate gamers’ use of the system.

During analysis, the custom tool’s inability to create real-world gaming models for better testing was identified as a drawback.

Features were recommended to improve the creation of player models and increase the scope of testing.

Our changes enabled the client to run tests with HTTP and RPC in the same game session, closer to what gamers were doing on the platform.
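One way to picture a player model that mixes both protocols in a single session is a weighted list of steps. The sketch below is illustrative only: the step names, their protocol assignments, and the weights are assumptions, not the client's actual calls or traffic mix.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    protocol: str   # "HTTP" or "RPC"
    weight: float   # relative likelihood while the player is in a session


# Hypothetical steps: names, protocols, and weights are illustrative.
LOGIN = Step("login", "HTTP", 0.0)
LOGOUT = Step("logout", "HTTP", 0.0)
IN_SESSION = [
    Step("fetch_store_catalog", "HTTP", 1.0),
    Step("matchmaking_request", "RPC", 3.0),
    Step("update_player_state", "RPC", 5.0),
    Step("post_social_message", "HTTP", 1.0),
]


def build_session(rng: random.Random, actions: int) -> list[Step]:
    """Return login, then `actions` weighted random steps, then logout."""
    weights = [step.weight for step in IN_SESSION]
    middle = rng.choices(IN_SESSION, weights=weights, k=actions)
    return [LOGIN, *middle, LOGOUT]


if __name__ == "__main__":
    session = build_session(random.Random(7), actions=8)
    for step in session:
        print(f"{step.protocol:4} {step.name}")
```

Seeding the random generator keeps a generated session reproducible, which helps when comparing results between test runs.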

Challenge #4: No Qualified Developers to Build The Tool

The task was to enable the customer to run real-world gaming models with tens of millions of concurrent users using a sophisticated and unique performance-testing solution. This required skilled developers with expertise in both performance testing and performance engineering.

At TPC, we understand the challenges of finding the right talent for your specific requirements. This client approached us seeking a custom solution that demanded a team proficient in C# and .NET, coupled with expertise in performance engineering. Recognizing the rarity of such a skill set, we embarked on a mission to build a team tailored to their needs, ensuring the successful realization of their project.

Result: A New Testing Paradigm That Met the Client’s Needs, Saved 1,000+ Hours, and Made Test Executions 50% Easier

As a result of Total Performance Consulting’s work, our client had a new testing paradigm that met the following needs:

Test the system at the appropriate scale
Handle production-like traffic
Work with both HTTP and RPC models
Allow for 10k+ logins per second
Execute tests with a smaller team
Repeat tests at scale

Additionally, test executions became 50% easier, and they were able to save over 1,000 hours in one year alone!

Since working with us, our client has been running performance tests multiple times per week.

At Total Performance Consulting, our commitment lies in providing tailored solutions and assembling teams that are uniquely suited to tackle complex challenges. With our unwavering dedication to quality and the acquisition of specialized talent, we are poised to deliver the exceptional results our clients seek.

If you’d like to consult with us about the complexity and scale challenges that you’re currently facing, schedule a free session and connect with us today.

Frequently Asked Questions

What is perf and how does it help solve complexity and scale challenges?

Perf (short for performance engineering) refers to the process of identifying and addressing bottlenecks and other issues that impact the performance of software systems.

Perf can help solve complexity and scale challenges by providing insights into the performance of the system and identifying areas where improvements can be made to improve scalability and efficiency.

By using tools and techniques to measure and analyze system performance, software engineers can make informed decisions about how to optimize system design and architecture to handle large-scale, complex workloads.

What are the signs of a complex system?

A complex system typically exhibits several signs that indicate it is difficult to manage or understand.

Signs include:

A large number of interconnected components
High levels of interdependence between components
Nonlinear behavior
Emergent behavior
Sensitivity to initial conditions
High levels of uncertainty or unpredictability
Multiple feedback loops
Difficulty in predicting system behavior

How do you solve complexity and scale challenges in software development?

Solving complexity and scale challenges in software development requires a systematic approach.

It involves identifying the root causes of performance issues and implementing targeted solutions to address them.

Common solutions include:

Optimizing system architecture and design
Using tools and techniques to monitor and analyze system performance
Making changes to software code to improve performance
Implementing load balancing and caching mechanisms
Using cloud-based services to scale up and down resources as needed
Automating testing and deployment processes to improve efficiency

What are the best practices for managing complexity and scale issues in software development?

Best practices for managing complexity and scale issues in software development include:

Identifying and defining clear performance metrics
Regularly monitoring and analyzing system performance
Using appropriate solutions and techniques to identify and address performance issues
Prioritizing and implementing targeted solutions to address performance issues
Continuously iterating and refining system design and architecture
Leveraging automation to streamline testing and deployment processes
Implementing robust error handling and logging mechanisms
Collaborating closely with stakeholders to ensure that performance goals are aligned with business objectives
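As an example of a clearly defined performance metric, response-time percentiles are a common choice. The sketch below uses the nearest-rank method, which is one of several percentile definitions. The function names are illustrative and the sample values are made up for demonstration.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Summarize response times with the percentiles most often tracked."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        "max": max(samples_ms),
    }


if __name__ == "__main__":
    # Made-up response times in milliseconds.
    print(latency_summary([12, 15, 11, 240, 18, 14, 13, 16, 19, 17]))
```

Tracking high percentiles alongside the median shows the slow tail of requests that an average would hide.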

What are the benefits of implementing machine complexity and data analysis?

Implementing machine complexity and data analysis can provide several benefits in software development such as improved performance and scalability, better predictability and control over system behavior, and increased efficiency and productivity.

Software engineers use tools and techniques to monitor and analyze system performance. This allows them to quickly identify and address performance issues, optimize system design and architecture, and make informed decisions about how to scale resources up or down.

Additionally, machine complexity and data analysis can help software teams identify patterns and trends in system behavior. This can be used to inform future development efforts and drive continuous improvement.
