Artificial Intelligence (AI) has become a prominent subject recently, especially with the advancements in large language models (LLMs) using transformer architecture. These models, trained on extensive web-sourced data, have amazed many with their coherent responses and performance in standardized tests. Amidst this excitement, there's frequent curiosity about how Aptori's Semantic Reasoning Platform stacks up against LLMs. Our experiments with LLMs have led us to some interesting conclusions.
Limitations and Challenges of Large Language Models
Foundation models, such as OpenAI's GPT or Meta's LLaMA, hold an extensive repository of information, and their ability to retrieve that information through natural language queries is a notable strength. While these premier LLMs are adept at producing coherent text from their vast stored information, they also exhibit limitations in reliability, reasoning, and usability when applied to tasks like software testing.
The reliability of these models can be uncertain, as illustrated below. The same prompt can lead to different responses, and a response may be inaccurate or incorrect, as acknowledged by the models' authors. Further, a request to devise comprehensive steps to accomplish an objective often returns a response that fails to completely fulfill the objective. Finally, the sizes of the prompt, context, and response are limited, so thoughtful engineering is required to devise a system that can handle testing large and complex software APIs.
Some recently available applications of LLMs that aid software development include intelligent code completion in IDEs and code generation for small to medium-sized snippets. In the context of application testing, engineers can employ LLMs to create code that implements test cases. However, this requires the test case to be carefully described through skillful prompt engineering, so the engineer's time investment is a paramount consideration. How much time is actually saved when writing the test code by hand is weighed against the learning curve of discovering the proper prompt language and then scrutinizing the generated code to identify and correct errors? More importantly, even with such an approach, the solution still has two drawbacks. First, the code produced by LLMs is often in the style of a single example-based test. Second, the generated code must be maintained.
Test Generation with LLMs
Let's consider an example of testing that validates whether a client of an API can retrieve sensitive data. A critical requirement of the application logic that handles API requests is properly implementing the authorization policy. Improper access control is a leading vulnerability of APIs that leads to information disclosure. For our example, I used the Stripe API because it is a well-known API that has existed long enough for LLMs to have knowledge about it, and I chose ChatGPT-3.5 as a representative LLM. The example test will verify that an account (tenant) may not access customer details from another account.
When I gave the following prompt to ChatGPT-3.5:
The response included this Python code snippet:
Note a couple of points about this exchange:
- The prompt was loosely worded with an assumption that "check" should mean "verify correct behavior" - that is, "user A cannot retrieve a customer created by user B." To the contrary, ChatGPT interpreted "check" to mean that "user A can retrieve a customer created by user B." This is a fair interpretation on ChatGPT's part, but it underscores the importance of prompt engineering.
- ChatGPT did not understand the distinction between a Stripe API account key and a user account. Variables user_account_id_a and user_account_id_b are not necessary.
- The code does not include the step for account B to create a customer. It is left to the engineer to provide a valid customer ID that was created by account B. This leaves more work for the engineer.
- The use of global variables may cause conflicts when trying to incorporate this code into a larger code base - part of the maintainability problem of generated code.
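Since the generated snippet itself is not reproduced above, here is a hypothetical reconstruction that exhibits the same shortcomings. Only the identifiers user_account_id_a and user_account_id_b come from the notes above; everything else is illustrative and should not be read as ChatGPT's verbatim output:

```python
# Hypothetical reconstruction, NOT the verbatim ChatGPT-3.5 response.
# It shows the problems noted above: global variables, spurious "user
# account" IDs, and no step that creates the customer under account B.

# Globals like these cause conflicts when merged into a larger code base.
user_account_id_a = "acct_user_a"  # unnecessary: Stripe authenticates with
user_account_id_b = "acct_user_b"  # API keys, not user account IDs
api_key_account_a = "sk_test_..."  # placeholder secret key for account A
customer_id_from_b = "cus_..."     # the engineer must supply a customer that
                                   # account B created, which is extra manual work

def check_customer_access():
    """Attempt to retrieve account B's customer using account A's key."""
    import stripe  # requires the stripe package and valid test keys to run
    stripe.api_key = api_key_account_a
    customer = stripe.Customer.retrieve(customer_id_from_b)
    print("Retrieved customer:", customer["id"])
```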
Let's refine the prompt to be more precise about the expected behavior of the retrieve customer operation.
Notice the following details of the response:
- This time ChatGPT produced code that creates the customer using account B and uses the standard unittest package. This highlights how LLMs are not deterministic machines.
- The code uses an assertion to verify that the customer was successfully created before attempting to retrieve the customer. Unfortunately, this may lead to a false test result. If the test fails, one must pay attention to the details of the result to determine whether the failure is due to a mistake in the application logic of the retrieve operation or an error in the test setup.
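To make that fragility concrete, here is a sketch of the same test structure rewritten against a tiny in-memory stand-in for the Stripe API (the FakeStripeAccount class is invented for illustration so the example runs without credentials). The setUp-time assertion means a broken test environment and a real authorization bug surface in the same failure report:

```python
import unittest

class FakeStripeAccount:
    """Minimal stand-in for a per-account API client."""
    _store = {}  # customer_id -> name of the account that created it

    def __init__(self, name):
        self.name = name

    def create_customer(self, customer_id):
        FakeStripeAccount._store[customer_id] = self.name
        return customer_id

    def retrieve_customer(self, customer_id):
        # Only the creating account may retrieve the customer.
        if FakeStripeAccount._store.get(customer_id) != self.name:
            raise PermissionError("cross-account access denied")
        return customer_id

class TestCrossAccountRetrieval(unittest.TestCase):
    def setUp(self):
        self.account_a = FakeStripeAccount("A")
        self.account_b = FakeStripeAccount("B")
        self.customer_id = self.account_b.create_customer("cus_test_1")
        # Setup assertion: a failure here is a test-environment problem,
        # yet it lands in the same report as a real authorization bug.
        self.assertIsNotNone(self.customer_id)

    def test_account_a_cannot_retrieve(self):
        with self.assertRaises(PermissionError):
            self.account_a.retrieve_customer(self.customer_id)
```

With a real client, a failed customer creation in setUp aborts the test before the retrieve operation is ever exercised, which is exactly the ambiguity noted above.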
The presented example underscores several drawbacks in the use of LLMs for software testing. First, the importance of skilled prompt engineering becomes evident, as ambiguous prompts may lead to misinterpretations and unintended code outcomes. Additionally, the challenge of maintainability of generated code is highlighted, especially when integration into a larger code base is necessary. An LLM's limitations in understanding crucial semantic distinctions, such as between API keys and user accounts, can result in inaccuracies and the need for manual corrections. Furthermore, the non-deterministic nature of output is evident, as different prompts with similar intents produce varying code snippets. These drawbacks collectively emphasize the need for careful consideration, validation, and refinement when utilizing LLMs to generate software test code.
Aptori's Semantic Reasoning Platform: A Specialized Approach
Aptori takes a different approach than broad-purpose LLMs that offer a prompt-response interface. Aptori's Semantic Reasoning Platform leverages AI in a more task-specific manner to construct a Semantic Model of the Application’s API. Aptori uses this model to both generate and execute test cases that are effective, extensive, and efficient.
As input, Aptori takes a description of an application's API (e.g., an OpenAPI definition or Postman Collection) from which it constructs a Semantic Model. The Semantic Model enables the reasoning necessary to effectively test each operation in an API, generating test cases that chain together related requests and pass information from the response of one request to the input of another. The Semantic Model also provides the understanding necessary to enumerate all possible input scenarios required to achieve comprehensive test coverage.
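The chaining idea can be illustrated with a simplified sketch (not Aptori's implementation): the output of one operation supplies a required input of the next, as when a create operation returns the ID that a retrieve operation needs.

```python
def run_chain(operations, execute):
    """Run operations in order, wiring each one's outputs into later inputs.

    operations: list of dicts with keys 'name', 'needs', 'produces'.
    execute(name, inputs): performs the request, returns produced values.
    """
    available = {}
    for op in operations:
        inputs = {key: available[key] for key in op["needs"]}
        outputs = execute(op["name"], inputs)
        available.update({key: outputs[key] for key in op["produces"]})
    return available

# Toy API: create a customer, then retrieve it by the ID the create returned.
def fake_execute(name, inputs):
    if name == "createCustomer":
        return {"customer_id": "cus_42"}
    if name == "retrieveCustomer":
        return {"customer": {"id": inputs["customer_id"]}}

chain = [
    {"name": "createCustomer", "needs": [], "produces": ["customer_id"]},
    {"name": "retrieveCustomer", "needs": ["customer_id"], "produces": ["customer"]},
]
state = run_chain(chain, fake_execute)
print(state["customer"]["id"])  # -> cus_42
```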
Property-Based Testing and AI Efficiency
Moreover, the test cases that are generated and executed by Aptori are built upon a property-based testing technique that allows for a variety of valid and invalid input values to be used while ensuring that the behavior of the application conforms to the requirements. For users, this means there is no generated example-based test code to maintain. Instead, users only need to express the expected behaviors (i.e., functional and non-functional business requirements), in the form of configurable application properties and checks, such as the expected authorization policy, detection of sensitive information in responses, or improper handling of invalid inputs including injection attacks.
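The property-based idea can be sketched with a hand-rolled check over a toy in-memory API (all names here are invented for illustration, and this is not Aptori's engine): a single declared property, "only the creating account may retrieve a customer," is exercised across many generated scenarios instead of one hard-coded example.

```python
import itertools
import random

class FakeApi:
    """Toy in-memory API: each customer records the account that created it."""
    def __init__(self):
        self.owners = {}

    def create(self, account, name):
        customer_id = f"cus_{len(self.owners)}"
        self.owners[customer_id] = account
        return customer_id

    def retrieve(self, account, customer_id):
        # Returns True when access is permitted: only the creator may retrieve.
        return self.owners[customer_id] == account

def check_authorization_property(api, accounts, trials=25):
    """Property: an account can retrieve a customer iff it created it."""
    rng = random.Random(0)  # fixed seed keeps the run reproducible
    for creator, retriever in itertools.product(accounts, repeat=2):
        for _ in range(trials):
            customer = api.create(creator, name=f"cust-{rng.randrange(10**6)}")
            assert api.retrieve(retriever, customer) == (creator == retriever)

check_authorization_property(FakeApi(), ["account_a", "account_b"])
print("authorization property holds for every (creator, retriever) pair")
```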
Returning to the example of validating an authorization policy, Aptori methodically formulates and executes test cases covering all conceivable scenarios for a set of given user roles. Thus, Aptori will systematically test when the customer is created by account A and retrieved by account B, as well as when the customer is created by account B and retrieved by account A. The example contained only a single operation (retrieve) on a single resource (customer) for two user accounts. Modern applications have dozens of resource types, each with a handful of operations, and a handful of user roles. In reality, the number of test scenarios quickly grows into the hundreds once you consider additional operations (update, delete, list) on a resource and even one additional user role: 10 resource types x 4 operations per resource x 3 possible user roles that can create the resource x 2 other user roles that can retrieve it = 240 test scenarios.
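The arithmetic from that count, spelled out:

```python
# Scenario count for the example in the text.
resources = 10        # resource types
operations = 4        # operations per resource
creator_roles = 3     # user roles that can create the resource
retriever_roles = 2   # other user roles that can attempt retrieval
print(resources * operations * creator_roles * retriever_roles)  # -> 240
```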
Implications for Developers and the Industry
Aptori unburdens developers from writing and maintaining test code. Our Semantic Reasoning Platform constructs a stateful API call graph and autonomously walks the graph, interrogating the API and uncovering functional defects and business logic vulnerabilities. The platform can traverse the API call graph in multiple ways, from a minimal set of sequences that execute every operation at least once to sequences that repeat operations multiple times for performance and load testing.
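One simple way to derive a covering sequence from a call graph (an illustrative sketch, not Aptori's algorithm) is a topological sort that orders each operation after the operations that produce its inputs; the toy graph below is hypothetical:

```python
from graphlib import TopologicalSorter

# Toy stateful API call graph: operation -> operations whose outputs it needs.
call_graph = {
    "createCustomer": set(),
    "retrieveCustomer": {"createCustomer"},
    "updateCustomer": {"createCustomer"},
    "deleteCustomer": {"updateCustomer", "retrieveCustomer"},
}

# One sequence that executes every operation at least once, with each
# operation's data dependencies satisfied before it runs.
sequence = list(TopologicalSorter(call_graph).static_order())
print(sequence)
```

Repeating or reshuffling the non-dependent steps of such a sequence is how the same graph can also drive performance and load testing.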