March 31, 2023 | 5 min read
How to Reduce Your OpenAI Bill
Kaveh Khorram


Founder & CEO, Usage

Kaveh is the Founder and CEO of Usage - a leading AWS optimization company.
GPT-4 Pricing Summary

The GPT-4 API pricing is based on tokens, with separate rates for prompt tokens (your input) and sampled tokens (the model's output):

8k context length models (e.g., gpt-4 and gpt-4-0314): $0.03/1k prompt tokens and $0.06/1k sampled tokens

32k context length models (e.g., gpt-4-32k and gpt-4-32k-0314): $0.06/1k prompt tokens and $0.12/1k sampled tokens
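To make these rates concrete, here is a minimal cost estimator built from the table above. The function name and structure are our own; the per-1k rates are the ones listed in this post.

```python
# Rates in USD per 1k tokens, taken from the pricing table above.
RATES = {
    "gpt-4":     {"prompt": 0.03, "sampled": 0.06},
    "gpt-4-32k": {"prompt": 0.06, "sampled": 0.12},
}

def estimate_cost(model: str, prompt_tokens: int, sampled_tokens: int) -> float:
    """Return the estimated cost of one request, in USD."""
    rate = RATES[model]
    return (prompt_tokens / 1000) * rate["prompt"] \
         + (sampled_tokens / 1000) * rate["sampled"]
```

For example, a gpt-4 call with 1,000 prompt tokens and 500 sampled tokens costs $0.03 + $0.03 = $0.06. Note that sampled tokens cost twice as much as prompt tokens, which is why capping output length (discussed below) matters so much.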

Mastering prompt engineering techniques is critical for optimizing costs and minimizing token usage when working with the GPT-4 API. In this definitive guide, we'll dive deep into the most advanced and effective prompt engineering techniques that engineers can use to optimize GPT-4 costs while maintaining high-quality outputs.

Dynamic prompt templates and conditional prompting

Create adaptable prompt templates that adjust based on context, user input, or specific conditions, so prompts stay consistent while being tailored to each situation. Conditional prompting means filling or switching template slots based on characteristics of the input rather than writing a new prompt every time.

Example: For a weather application, you could use a template like "{user_location} weather {date_or_time_period}" and adjust the date_or_time_period conditionally based on the user's query.
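The weather example above might look like this as a small sketch. The selection rules are hypothetical; the point is that one template serves many queries, with only the slot values varying.

```python
def weather_prompt(user_location: str, query: str) -> str:
    """Fill the '{user_location} weather {date_or_time_period}' template,
    choosing the period conditionally from the user's query.
    The keyword rules below are illustrative, not exhaustive."""
    q = query.lower()
    if "tomorrow" in q:
        period = "tomorrow"
    elif "week" in q:
        period = "this week"
    else:
        period = "today"
    return f"{user_location} weather {period}"
```

Because the template is short and fixed, prompt-token usage stays predictable regardless of how verbose the user's original question was.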

Context control and window manipulation

Strategically manage the context window: extract or summarize only the key sections of the input so it fits within the model's context limit. Focus on crafting concise prompts that still carry the necessary context, rather than padding the window with everything you have.

Example: Instead of including the entire user's conversation history, only include the most relevant pieces to create a shorter and more focused prompt.
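One simple sketch of this idea, assuming a chat-style history: keep only the most recent messages that fit within a token budget. The word-count tokenizer here is a stand-in; in practice you would count tokens with a real tokenizer such as tiktoken.

```python
def trim_history(messages, budget_tokens, count_tokens=lambda m: len(m.split())):
    """Keep the most recent messages that fit within budget_tokens.
    count_tokens defaults to a crude word count as a stand-in for a
    real tokenizer (e.g., tiktoken)."""
    kept, used = [], 0
    for msg in reversed(messages):  # newest first
        cost = count_tokens(msg)
        if used + cost > budget_tokens:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))  # restore chronological order
```

A relevance-based filter (keeping messages that mention the current topic) is a natural extension of the same pattern.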

Temperature, max tokens, and early stopping

Experiment with temperature settings to control output variability, set the 'max tokens' parameter to limit output length, and implement early stopping mechanisms to save tokens when a satisfactory response is generated.

Example: For a more deterministic output, set the temperature to a lower value (e.g., 0.2), limit the response length to 50 tokens, and stop generating tokens once a coherent answer is obtained.
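As a sketch, here is a helper that assembles those settings into the request parameters for a chat completion call. The keys mirror the OpenAI API parameter names; the helper itself and its defaults are our own convention.

```python
def deterministic_params(prompt: str, max_tokens: int = 50) -> dict:
    """Build conservative request parameters for a chat completion.
    Keys follow the OpenAI API; values here are illustrative defaults."""
    return {
        "model": "gpt-4",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,        # low value -> more deterministic output
        "max_tokens": max_tokens,  # hard cap on sampled (output) tokens
        "stop": ["\n\n"],          # stop early once a complete answer ends
    }
```

Since sampled tokens cost twice as much as prompt tokens, `max_tokens` and `stop` directly bound the expensive side of every request.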

Advanced token management

Analyze and optimize token distribution, truncate tokens in inputs and outputs when necessary, and develop custom tokens or encoding strategies to represent complex or repetitive information more efficiently.

Example: When processing a list of items with a repeating structure, use a custom encoding strategy like "{item_name}: {item_value}" to reduce token usage.
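A minimal sketch of that encoding: render repetitive structured data as compact `name: value` lines instead of full sentences, so each item costs only a few tokens.

```python
def encode_items(items: dict) -> str:
    """Render items as compact '{item_name}: {item_value}' lines,
    one per item, instead of verbose prose descriptions."""
    return "\n".join(f"{name}: {value}" for name, value in items.items())
```

For example, `encode_items({"price": 10, "stock": 3})` produces two short lines, where a sentence-per-item description of the same data would spend several times as many tokens.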

User feedback, A/B testing, and iterative prompt improvement

Collect and analyze user feedback to identify patterns indicating potential prompt improvements. Perform A/B testing by creating multiple prompt variations and comparing their performance. Continually review and refine prompts to maximize efficiency.

Example: Test variations of a summarization prompt, such as "Summarize the following text:" and "Provide a brief summary of the text below:", then analyze user feedback to determine which version performs better.
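The bookkeeping side of that A/B test can be sketched in a few lines. The feedback signal here is a hypothetical binary "accepted" flag; in a real application you would record whatever quality metric you collect.

```python
from collections import defaultdict

class PromptABTest:
    """Minimal A/B bookkeeping for prompt variants.
    Feedback signal (hypothetical): 1 = user accepted the output, 0 = rejected."""

    def __init__(self, variants):
        self.variants = list(variants)
        self.scores = defaultdict(list)

    def record(self, variant: str, accepted: int) -> None:
        self.scores[variant].append(accepted)

    def best(self) -> str:
        """Variant with the highest acceptance rate so far."""
        return max(self.variants,
                   key=lambda v: sum(self.scores[v]) / max(len(self.scores[v]), 1))
```

Once a winner emerges, retire the losing variant; the cheaper-but-equally-accepted prompt compounds into real savings at volume.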

Leveraging user history and context

In multi-turn interactions or when users have a history with your application, leverage this information to optimize your prompts and reduce token usage. Reference previous requests or user-specific context to create shorter and more relevant prompts.

Example: In a customer support chatbot, use previous conversation data to pre-fill user information in the prompt, such as "Help {user_name} with their {product} issue related to {previous_issue}."
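Pre-filling that context might look like the sketch below. The field names are hypothetical; the point is that a few stored facts replace a long pasted conversation history.

```python
def support_prompt(user: dict) -> str:
    """Pre-fill known user context into the prompt so the model does not
    need the full conversation history. Field names are illustrative."""
    return (f"Help {user['name']} with their {user['product']} issue "
            f"related to {user['previous_issue']}.")
```

One short sentence of stored context can stand in for hundreds of tokens of raw transcript.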

Prompt segmentation for multi-stage tasks

For complex tasks that require multiple stages of processing, segment your prompts into smaller, more focused prompts. Guide the model step-by-step, optimizing token usage and costs.

Example: For a task that requires both translation and summarization, first send a prompt to translate the input text, then send another prompt to summarize the translated text, rather than trying to achieve both objectives in a single prompt.
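The translate-then-summarize pipeline can be sketched as two focused calls. `call_model` here is any function that sends a single prompt and returns the completion text; wiring it to the actual API is left out.

```python
def translate_then_summarize(text: str, call_model) -> str:
    """Run two focused prompts in sequence instead of one combined prompt.
    call_model: a function taking a prompt string and returning the
    model's completion text (API plumbing omitted)."""
    translated = call_model(f"Translate the following text to English:\n{text}")
    return call_model(f"Summarize the following text in one sentence:\n{translated}")
```

Each stage gets a short, unambiguous instruction, which tends to need fewer retries and fewer clarifying tokens than a single do-everything prompt.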

Adapting to model capabilities

Adapt your prompt engineering strategies based on the GPT-4 model you are working with to optimize costs and performance. Different models have varying capabilities and token limits, which require tailored approaches.

Example: When working with a smaller model such as 'ada' (from the earlier GPT-3 family), craft more explicit prompts with clear instructions, whereas a larger model like 'davinci' — or GPT-4 itself — can be relied on to understand more implicit tasks.
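One way to operationalize this is a per-model preset table, sketched below. Which model warrants which style is a judgment call on our part, not something the API defines.

```python
# Hypothetical per-model prompt presets: smaller models get explicit
# instructions plus few-shot examples; larger ones get terse prompts.
PRESETS = {
    "ada":     {"style": "explicit", "examples": 3},
    "davinci": {"style": "implicit", "examples": 0},
    "gpt-4":   {"style": "implicit", "examples": 0},
}

def preset_for(model: str) -> dict:
    """Look up a prompt preset, defaulting to a cautious explicit style."""
    return PRESETS.get(model, {"style": "explicit", "examples": 1})
```

Centralizing these choices in one table makes it easy to re-tune prompt verbosity when you swap models.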

Combining prompt engineering with fine-tuning

Enhance prompt engineering by fine-tuning a model on your specific task, where the API supports it. A fine-tuned model performs well with fewer (or no) examples in the prompt, ultimately saving tokens and reducing costs.

Example: Fine-tune a model for sentiment analysis, and then use shorter prompts, like "Sentiment: {text}", instead of providing multiple examples in each prompt.
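The token savings are easy to see side by side. In this sketch, the few-shot examples for the base model are hypothetical; the fine-tuned path needs only the terse "Sentiment: {text}" prompt from the example above.

```python
def sentiment_prompt(text: str, fine_tuned: bool) -> str:
    """Terse prompt for a fine-tuned model vs. a few-shot prompt for a
    base model. The few-shot examples below are illustrative."""
    if fine_tuned:
        return f"Sentiment: {text}"
    examples = ("Text: I love it. Sentiment: positive\n"
                "Text: Terrible service. Sentiment: negative\n")
    return examples + f"Text: {text} Sentiment:"
```

The few-shot overhead is paid on every single request, so at volume a fine-tuned model's shorter prompts can outweigh the one-time training cost.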

By mastering advanced prompt engineering techniques, engineers can develop cost-effective and high-performing applications using the GPT-4 API. Continually reviewing and refining these strategies will ensure that your application remains cost-optimized and provides the best performance possible. With these advanced prompt engineering techniques, you can push the boundaries of GPT-4's capabilities while keeping costs in check.

If you'd like a tool that automatically helps you optimize LLM queries, send us a note at [email protected]

725 5th Ave, New York, NY, 10022
Copyright © 2023 Usage AI.