Artificial Intelligence (AI) has become a prominent subject recently, especially with the advancements in large language models (LLMs) using transformer architecture. These models, trained on extensive web-sourced data, have amazed many with their coherent responses and performance in standardized tests. Amidst this excitement, there's frequent curiosity about how Aptori's Semantic Reasoning Platform stacks up against LLMs. Our experiments with LLMs have led us to some interesting conclusions.
Limitations and Challenges of Large Language Models
Foundation models, such as OpenAI's GPT or Meta's LLaMA, hold an extensive repository of information, and their ability to retrieve that information through natural language queries is a notable strength. While these premier LLMs are adept at producing coherent text from their vast stored information, they also exhibit limitations in reliability, reasoning, and usability when applied to tasks like software testing.
The reliability of these models can be uncertain, as illustrated below. The same prompt can lead to different responses, and a response may be inaccurate or incorrect, as acknowledged by the models' authors. Further, a request to devise comprehensive steps to accomplish an objective often returns a response that fails to completely fulfill the objective. Finally, the sizes of the prompt, context, and response are limited, so thoughtful engineering is required to devise a system that can handle testing large and complex software APIs.
Some recently available applications of LLMs that aid software development include intelligent code completion in IDEs and code generation for small to medium-sized snippets. In the context of application testing, engineers can employ LLMs to create code that implements test cases. However, this requires the test case to be carefully described through skillful prompt engineering, so the engineer's time investment is a paramount consideration. How much time is actually saved when writing the test code by hand is weighed against the learning curve of discovering the proper prompt language and then scrutinizing the generated code to identify and correct errors? More importantly, even with such an approach, the solution still has two drawbacks. First, the code produced by LLMs is often in the style of a single example-based test. Second, the generated code must be maintained.
Test Generation with LLMs
Let's consider an example of testing that validates whether a client of an API can retrieve sensitive data. A critical requirement of the application logic that handles API requests is properly implementing the authorization policy. Improper access control is a leading vulnerability of APIs that leads to information disclosure. For our example, I used the Stripe API because it is a well-known API that has existed long enough for LLMs to have knowledge about it, and I chose ChatGPT-3.5 as a representative LLM. The example test will verify that an account (tenant) may not access customer details from another account.
When I gave the following prompt to ChatGPT-3.5:
The response included this Python code snippet:
Note a couple of points about this exchange:
- The prompt was loosely worded with an assumption that "check" should mean "verify correct behavior" - that is, "user A cannot retrieve a customer created by user B." To the contrary, ChatGPT interpreted "check" to mean that "user A can retrieve a customer created by user B." This is a fair interpretation on ChatGPT's part, but it underscores the importance of prompt engineering.
- ChatGPT did not understand the distinction between a Stripe API account key and a user account. Variables user_account_id_a and user_account_id_b are not necessary.
- The code does not include the step for account B to create a customer. It is left to the engineer to provide a valid customer ID that was created by account B. This leaves more work for the engineer.
- The use of global variables may cause conflicts when trying to incorporate this code into a larger code base - part of the maintainability problem of generated code.
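Since the generated snippet itself is not reproduced above, here is a hypothetical reconstruction that exhibits the same shortcomings. Only the identifiers user_account_id_a and user_account_id_b come from the notes above; everything else is illustrative and should not be read as ChatGPT's verbatim output:

```python
# Hypothetical reconstruction, NOT the verbatim ChatGPT-3.5 response.
# It shows the problems noted above: global variables, spurious "user
# account" IDs, and no step that creates the customer under account B.

# Globals like these cause conflicts when merged into a larger code base.
user_account_id_a = "acct_user_a"  # unnecessary: Stripe authenticates with
user_account_id_b = "acct_user_b"  # API keys, not user account IDs
api_key_account_a = "sk_test_..."  # placeholder secret key for account A
customer_id_from_b = "cus_..."     # the engineer must supply a customer that
                                   # account B created, which is extra manual work

def check_customer_access():
    """Attempt to retrieve account B's customer using account A's key."""
    import stripe  # requires the stripe package and valid test keys to run
    stripe.api_key = api_key_account_a
    customer = stripe.Customer.retrieve(customer_id_from_b)
    print("Retrieved customer:", customer["id"])
```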
Let's refine the prompt to be more precise about the expected behavior of the retrieve customer operation.
Notice the following details of the response:
- This time ChatGPT produced code that creates the customer using account B and uses the standard unittest package. This highlights how LLMs are not deterministic machines.
- The code uses an assertion to verify that the customer was successfully created before attempting to retrieve the customer. Unfortunately, this may lead to a false test result. If the test fails, one must pay attention to the details of the result to determine whether the failure is due to a mistake in the application logic of the retrieve operation or an error in the test setup.
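To make that fragility concrete, here is a sketch of the same test structure rewritten against a tiny in-memory stand-in for the Stripe API (the FakeStripeAccount class is invented for illustration so the example runs without credentials). The setUp-time assertion means a broken test environment and a real authorization bug surface in the same failure report:

```python
import unittest

class FakeStripeAccount:
    """Minimal stand-in for a per-account API client."""
    _store = {}  # customer_id -> name of the account that created it

    def __init__(self, name):
        self.name = name

    def create_customer(self, customer_id):
        FakeStripeAccount._store[customer_id] = self.name
        return customer_id

    def retrieve_customer(self, customer_id):
        # Only the creating account may retrieve the customer.
        if FakeStripeAccount._store.get(customer_id) != self.name:
            raise PermissionError("cross-account access denied")
        return customer_id

class TestCrossAccountRetrieval(unittest.TestCase):
    def setUp(self):
        self.account_a = FakeStripeAccount("A")
        self.account_b = FakeStripeAccount("B")
        self.customer_id = self.account_b.create_customer("cus_test_1")
        # Setup assertion: a failure here is a test-environment problem,
        # yet it lands in the same report as a real authorization bug.
        self.assertIsNotNone(self.customer_id)

    def test_account_a_cannot_retrieve(self):
        with self.assertRaises(PermissionError):
            self.account_a.retrieve_customer(self.customer_id)
```

With a real client, a failed customer creation in setUp aborts the test before the retrieve operation is ever exercised, which is exactly the ambiguity noted above.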
The presented example underscores several drawbacks in the use of LLMs for software testing. First, the importance of skilled prompt engineering becomes evident, as ambiguous prompts may lead to misinterpretations and unintended code outcomes. Additionally, the challenge of maintainability of generated code is highlighted, especially when integration into a larger code base is necessary. An LLM's limitations in understanding crucial semantic distinctions, such as between API keys and user accounts, can result in inaccuracies and the need for manual corrections. Furthermore, the non-deterministic nature of output is evident, as different prompts with similar intents produce varying code snippets. These drawbacks collectively emphasize the need for careful consideration, validation, and refinement when utilizing LLMs to generate software test code.
Aptori's Semantic Reasoning Platform: A Specialized Approach
Aptori takes a different approach than broad-purpose LLMs that offer a prompt-response interface. Aptori's Semantic Reasoning Platform leverages AI in a more task-specific manner to construct a Semantic Model of the Application’s API. Aptori uses this model to both generate and execute test cases that are effective, extensive, and efficient.
As input, Aptori takes a description of an application's API (e.g., an OpenAPI definition or Postman Collection) from which it constructs a Semantic Model. The Semantic Model enables the reasoning necessary to effectively test each operation in an API, generating test cases that chain together related requests and pass information from the response of one request to the input of another. The Semantic Model also provides the understanding necessary to enumerate all possible input scenarios required to achieve comprehensive test coverage.
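The chaining idea can be illustrated with a simplified sketch (not Aptori's implementation): the output of one operation supplies a required input of the next, as when a create operation returns the ID that a retrieve operation needs.

```python
def run_chain(operations, execute):
    """Run operations in order, wiring each one's outputs into later inputs.

    operations: list of dicts with keys 'name', 'needs', 'produces'.
    execute(name, inputs): performs the request, returns produced values.
    """
    available = {}
    for op in operations:
        inputs = {key: available[key] for key in op["needs"]}
        outputs = execute(op["name"], inputs)
        available.update({key: outputs[key] for key in op["produces"]})
    return available

# Toy API: create a customer, then retrieve it by the ID the create returned.
def fake_execute(name, inputs):
    if name == "createCustomer":
        return {"customer_id": "cus_42"}
    if name == "retrieveCustomer":
        return {"customer": {"id": inputs["customer_id"]}}

chain = [
    {"name": "createCustomer", "needs": [], "produces": ["customer_id"]},
    {"name": "retrieveCustomer", "needs": ["customer_id"], "produces": ["customer"]},
]
state = run_chain(chain, fake_execute)
print(state["customer"]["id"])  # -> cus_42
```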
Property-Based Testing and AI Efficiency
Moreover, the test cases that are generated and executed by Aptori are built upon a property-based testing technique that allows for a variety of valid and invalid input values to be used while ensuring that the behavior of the application conforms to the requirements. For users, this means there is no generated example-based test code to maintain. Instead, users only need to express the expected behaviors (i.e., functional and non-functional business requirements), in the form of configurable application properties and checks, such as the expected authorization policy, detection of sensitive information in responses, or improper handling of invalid inputs including injection attacks.
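The property-based idea can be sketched with a hand-rolled check over a toy in-memory API (all names here are invented for illustration, and this is not Aptori's engine): a single declared property, "only the creating account may retrieve a customer," is exercised across many generated scenarios instead of one hard-coded example.

```python
import itertools
import random

class FakeApi:
    """Toy in-memory API: each customer records the account that created it."""
    def __init__(self):
        self.owners = {}

    def create(self, account, name):
        customer_id = f"cus_{len(self.owners)}"
        self.owners[customer_id] = account
        return customer_id

    def retrieve(self, account, customer_id):
        # Returns True when access is permitted: only the creator may retrieve.
        return self.owners[customer_id] == account

def check_authorization_property(api, accounts, trials=25):
    """Property: an account can retrieve a customer iff it created it."""
    rng = random.Random(0)  # fixed seed keeps the run reproducible
    for creator, retriever in itertools.product(accounts, repeat=2):
        for _ in range(trials):
            customer = api.create(creator, name=f"cust-{rng.randrange(10**6)}")
            assert api.retrieve(retriever, customer) == (creator == retriever)

check_authorization_property(FakeApi(), ["account_a", "account_b"])
print("authorization property holds for every (creator, retriever) pair")
```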
Returning to the example of validating an authorization policy, Aptori methodically formulates and executes test cases covering all conceivable scenarios for a set of given user roles. Thus, Aptori will systematically test when the customer is created by account A and retrieved by account B, as well as when the customer is created by account B and retrieved by account A. The example contained only a single operation (retrieve) on a single resource (customer) for two user accounts. Modern applications have dozens of resource types, each with a handful of operations, and a handful of user roles. In reality, the number of test scenarios quickly grows into the hundreds once you consider additional operations (update, delete, list) on a resource and even one additional user role: 10 resource types x 4 operations per resource x 3 possible user roles that can create the resource x 2 other user roles that can retrieve it = 240 test scenarios.
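The arithmetic from that count, spelled out:

```python
# Scenario count for the example in the text.
resources = 10        # resource types
operations = 4        # operations per resource
creator_roles = 3     # user roles that can create the resource
retriever_roles = 2   # other user roles that can attempt retrieval
print(resources * operations * creator_roles * retriever_roles)  # -> 240
```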
Implications for Developers and the Industry
Aptori unburdens developers from writing and maintaining test code. Our Semantic Reasoning Platform constructs a stateful API call graph and autonomously walks the graph, interrogating the API and uncovering functional defects and business logic vulnerabilities. The platform can traverse the API call graph in multiple ways, from a minimal set of sequences that execute every operation at least once to sequences that repeat operations multiple times for performance and load testing.
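One simple way to derive a covering sequence from a call graph (an illustrative sketch, not Aptori's algorithm) is a topological sort that orders each operation after the operations that produce its inputs; the toy graph below is hypothetical:

```python
from graphlib import TopologicalSorter

# Toy stateful API call graph: operation -> operations whose outputs it needs.
call_graph = {
    "createCustomer": set(),
    "retrieveCustomer": {"createCustomer"},
    "updateCustomer": {"createCustomer"},
    "deleteCustomer": {"updateCustomer", "retrieveCustomer"},
}

# One sequence that executes every operation at least once, with each
# operation's data dependencies satisfied before it runs.
sequence = list(TopologicalSorter(call_graph).static_order())
print(sequence)
```

Repeating or reshuffling the non-dependent steps of such a sequence is how the same graph can also drive performance and load testing.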