When we first integrated AI capabilities into one of our client’s applications, we did it using simple synchronous OpenAI API calls. That worked perfectly for features like text summarization and classification: the implementation was straightforward, responses were almost immediate, debugging and testing were simple, and everything worked as expected.
However, when we began expanding into more data-intensive use cases, such as large-scale content classification, we realized that our initial solution would burn through the budget allocated to AI far too fast. What started as a seamless integration, while technically sound, ended up being a financial challenge. So, we needed to come up with a strategy to drastically reduce operational costs.
These were some of the alternatives we evaluated:
- Do more prompt engineering to reduce the number of input tokens
- Use smaller and cheaper models
- Use the same prompts and models via OpenAI’s batch API
Doing more prompt engineering was a dead end. We had already optimized the prompts well enough for an MVP. There might have been some more juice to squeeze from the lemon, but we were past the point of optimal bang for the buck.
We were using GPT-4o-mini, which was at the time the most cost-effective of OpenAI’s models. So, cheaper models would also mean changing providers and redoing a lot of the prompt optimization we had done (every model has its nuances and tendencies one must adjust for). Also, in some cases, the greater cost of the prompt came from input images, which are priced independently of the selected model, to a counterintuitive extent: processing images with GPT-4o-mini is more expensive than processing them with its bigger parent model GPT-4o, but that is for another post.
Using OpenAI’s batch API seemed like an easy win. We could keep using the same models, prompts, and provider, and just get a 50% discount by doing the processing asynchronously. Sounds like a no-brainer, right?
However, there are some challenges no one tells you about.
The untold details of batch processing
If you search online for “OpenAI batch processing”, you’ll find dozens of articles with more or less the same recipe you find in OpenAI’s documentation: upload a JSONL file, use the Batch API to create a batch that processes the file, and once the batch is finished, download and process the results. Simple. What these articles don’t tell you is that the system is simple on OpenAI’s side, and that simplicity means you need to take into account some edge cases and implementation details that we found mentioned nowhere. These are the ones we hit our heads against, and how we solved them.
Polling
OpenAI’s Batch API doesn’t use webhooks. This means you can’t receive notifications when batches are complete; you have to build your own polling mechanism (sketched below) that:
- Periodically checks the batch status
- Downloads and processes results for completed batches
- Handles timeouts, retries, and failures
- Avoids excessive polling that could trigger rate limits
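As a rough illustration, here is a minimal polling loop built on the official openai Python SDK. The interval, the handler callbacks, and the set of batch IDs to track are assumptions for the sketch, not our exact implementation.

```python
import time
from openai import OpenAI

client = OpenAI()

# Statuses after which a batch will never change again.
TERMINAL_STATUSES = {"completed", "expired", "failed", "cancelled"}

def poll_batches(batch_ids, handle_completed, handle_incomplete, interval_seconds=300):
    """Poll active batches until they reach a terminal state.

    `handle_completed` / `handle_incomplete` are placeholders for whatever
    result processing and retry logic your application needs.
    """
    pending = set(batch_ids)
    while pending:
        for batch_id in list(pending):
            batch = client.batches.retrieve(batch_id)
            if batch.status not in TERMINAL_STATUSES:
                continue  # still validating / in_progress / finalizing
            pending.discard(batch_id)
            if batch.status == "completed":
                handle_completed(batch)
            else:
                # expired, failed, or cancelled: some or all requests were not run
                handle_incomplete(batch)
        if pending:
            # Keep the interval generous to avoid excessive polling.
            time.sleep(interval_seconds)
```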
Batch status and individual request status
There are some not-so-straightforward relations between the batch status and the status of the requests that were included in that batch. For example:
- A successful batch can (and probably will) contain failed requests, invalid requests, and malformed or unexpected outputs.
- An expired batch might have some requests processed and some discarded. And, just like in successful batches, the processed ones might still have failed or returned invalid output.
- An invalid batch won’t even reach the processing phase, so none of its requests will be executed, which is similar to the requests left unexecuted in an expired batch. Either way, you have to do something with the items that still need to be processed.
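To make that concrete, here is a hedged sketch of how per-request results can be separated from the batch status, again using the openai Python SDK. The `custom_id` bookkeeping and the `needs_retry` set are assumptions of the sketch, not part of OpenAI’s API.

```python
import json
from openai import OpenAI

client = OpenAI()

def collect_results(batch, expected_custom_ids):
    """Split a finished batch into usable outputs and items to retry.

    Works for completed *and* expired batches: whatever was processed is
    used, everything else goes back into the retry pool.
    """
    results, needs_retry = {}, set(expected_custom_ids)

    # Successful responses (which may still contain malformed model output).
    if batch.output_file_id:
        for line in client.files.content(batch.output_file_id).text.splitlines():
            item = json.loads(line)
            response = item.get("response") or {}
            if response.get("status_code") == 200:
                results[item["custom_id"]] = response["body"]
                needs_retry.discard(item["custom_id"])

    # Requests that errored out are reported in a separate error file.
    if batch.error_file_id:
        for line in client.files.content(batch.error_file_id).text.splitlines():
            item = json.loads(line)
            needs_retry.add(item["custom_id"])  # failed requests get resubmitted

    # Anything never returned (e.g. discarded by an expired batch)
    # is still in needs_retry and must be resubmitted as well.
    return results, needs_retry
```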
Token limits, tier levels, and release schedules
When using OpenAI’s Batch API, there’s a per-model token limit that caps how many tokens you can have enqueued across all your batches for a single model. If you go over that limit, your batches will fail.
For example, if you plan to send a batch with 10,000 requests, where each one is estimated at 1,200 tokens, then you’d be using about 12 million tokens. If your limit is only 5 million tokens per day, that batch will fail, so you should break it up into smaller chunks.
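As an example of how that chunking might look, the following sketch estimates tokens with tiktoken and splits prepared requests into chunks that stay under a given budget. The 5-million-token budget and the request shape are illustrative assumptions, and the estimate is approximate (it does not cover image inputs).

```python
import tiktoken

# o200k_base is the encoding used by the GPT-4o model family.
ENCODING = tiktoken.get_encoding("o200k_base")

def estimate_tokens(request):
    """Rough input-token estimate for one batch request (text only)."""
    text = "".join(m["content"] for m in request["body"]["messages"]
                   if isinstance(m["content"], str))
    return len(ENCODING.encode(text))

def split_into_chunks(requests, token_budget=5_000_000):
    """Group requests into chunks whose estimated tokens fit the budget."""
    chunks, current, used = [], [], 0
    for request in requests:
        cost = estimate_tokens(request)
        if current and used + cost > token_budget:
            chunks.append(current)
            current, used = [], 0
        current.append(request)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```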
Once a batch fails due to hitting token limits, subsequent batches queued for the same model will also fail until the earlier ones are finished. And the bad part about service-tier upgrades is that they are driven by a complex system of automated rules combining usage and time, so it’s not as straightforward as “I’ll pay you more”.
So, you either need to spend tokens before releasing the feature (for example, using the production account for testing and development), do a flagged roll-out to a small set of users, and/or build a system that retries or delays batch dispatching and notifies end users that there’s an operational limit and things might take longer than expected.
Building the batch processing system
Automated batch workflow and polling system
The core of our solution is the polling system. We use it not only for checking statuses but also for driving the entire process. Here’s a brief outline of the workflow:
- Batch Creation: a batch is created for a specific task type (e.g., summarizing descriptions or classifying images).
- Token Validation: before submission, the system checks for available token quota to avoid API-level failures.
- File Preparation: input data is formatted into JSONL files and stored, ready for submission.
- Batch Submission: the polling system uploads the input file and submits the batch to OpenAI (see the sketch after this list).
- Status Monitoring: at configurable intervals, the polling system checks the status of active batches using the API.
- Result Processing: once a batch completes, its output and error files are downloaded, validated, and stored.
- Continuation and Retry Logic: if items failed or were not processed, they are automatically queued into new batches.
- Finalization: once all the results have been processed, the batch files are deleted and the batch is marked as completed.
- Re-enqueueing: each time a batch finishes, another one of the same type is queued; this is how we control how many batches run simultaneously.
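The file preparation and submission steps can be sketched roughly as follows with the openai Python SDK. The request shape follows OpenAI’s Batch API input format; the prompt, model, and item fields here are illustrative assumptions, not our exact implementation.

```python
import json
from openai import OpenAI

client = OpenAI()

def prepare_jsonl(items, path, model="gpt-4o-mini"):
    """Write one Batch API request per item to a JSONL file."""
    with open(path, "w") as f:
        for item in items:
            request = {
                "custom_id": item["id"],  # used later to match results back to items
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [
                        {"role": "system", "content": "Classify the following content."},
                        {"role": "user", "content": item["text"]},
                    ],
                },
            }
            f.write(json.dumps(request) + "\n")

def submit_batch(path):
    """Upload the JSONL file and create the batch."""
    input_file = client.files.create(file=open(path, "rb"), purpose="batch")
    return client.batches.create(
        input_file_id=input_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
```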
Batch and request status tracking
We implemented a dual-level tracking system:
- Batch-level tracking: tracks both OpenAI data, such as the external_id, status, and file IDs, and internal fields like task_type, used to manage different processing workflows.
- Request-level tracking: follows each item through the pipeline, from submission to completion.
This approach gives us granular visibility and control, allowing us to process any type of content (articles, products, or images) while handling failures at the appropriate level and re-queuing any items that, due to the conditions mentioned earlier, were left unprocessed.
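As an illustration of the kind of records involved, here is a minimal sketch of the two tracking levels as Python dataclasses. The field names and statuses are assumptions for the sketch; in practice these map to database tables.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BatchRecord:
    """Batch-level tracking: OpenAI data plus our own workflow fields."""
    external_id: str                  # OpenAI batch id
    task_type: str                    # e.g. "summarization" or "classification"
    status: str = "created"           # created / submitted / completed / expired ...
    input_file_id: Optional[str] = None
    output_file_id: Optional[str] = None
    error_file_id: Optional[str] = None
    estimated_tokens: int = 0

@dataclass
class RequestRecord:
    """Request-level tracking: one item, followed from submission to completion."""
    custom_id: str                    # matches the custom_id in the JSONL line
    batch_external_id: Optional[str] = None
    status: str = "pending"           # pending / submitted / done / needs_retry
    attempts: int = 0
    result: Optional[dict] = None
```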
Token limit management
We have a validation system that:
- Estimates token usage before batch creation: for each request, the system estimates the number of input tokens it will consume by running the input through a tokenizer.
- Tracks tokens across all active batches: you can easily check how many tokens are currently in use by querying the database.
- Holds new batches when quotas are exceeded: we wait until there is enough quota available before creating new batches.
- Provides visibility into current token utilization: allowing you to track and calculate token consumption over a specific period.
- Keeps a safety margin: we configured the system to remain below the tier limits.
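Putting those pieces together, a quota guard might look roughly like this. The `tokens_in_active_batches()` query and the safety margin are assumptions of the sketch; the limit itself comes from your account’s tier.

```python
# Hypothetical quota guard: hold new batches while the enqueued-token
# budget for the model is (nearly) used up.

TIER_TOKEN_LIMIT = 5_000_000      # example per-model enqueued-token limit
SAFETY_MARGIN = 0.9               # stay at 90% of the limit

def tokens_in_active_batches() -> int:
    """Placeholder: sum estimated tokens over batches that are not finished yet."""
    raise NotImplementedError     # e.g. a SELECT SUM(...) over the batch table

def can_submit(estimated_tokens: int) -> bool:
    budget = TIER_TOKEN_LIMIT * SAFETY_MARGIN
    return tokens_in_active_batches() + estimated_tokens <= budget
```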
Handling partial successes and failures
One of the most challenging aspects of batch processing is dealing with incomplete or partially successful batches.
Sometimes a batch expires before completion but still contains valid responses for some items. We handle this by:
- Treating each item independently
- Validating individual outputs regardless of batch status
- Identifying which items need to be retried
- Resubmitting only the failed items in new batches
The entire decision logic is based on explicit conditions derived from the internal state of each item, rather than relying solely on the batch status or the technical outcome of the API request. For example, if an image has been sent for classification but doesn’t receive any category, it will be retried even if both the batch and the item were marked as “finished”.
We validate each result against task-specific criteria:
- Is the output well-formed? (e.g., valid JSON)
- Does it match the expected schema? (e.g., recognized categories or required fields)
- Is it complete or truncated?
Items failing validation are left unchanged and are processed again in new batches.
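A validation step along these lines might look as follows. The expected schema (a category field drawn from a known set) is an illustrative assumption; each task type defines its own criteria.

```python
import json

KNOWN_CATEGORIES = {"electronics", "clothing", "home", "other"}  # example schema

def validate_classification(raw_output: str):
    """Return the parsed result if it passes task-specific checks, else None."""
    # Well-formed? The model must return valid JSON.
    try:
        data = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return None

    # Expected schema? A recognized category must be present.
    if data.get("category") not in KNOWN_CATEGORIES:
        return None

    return data

# Items whose output fails validation keep their previous state and are
# re-queued into a new batch on the next polling cycle.
```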
Conclusion
We’ve spent a fair amount of time building the batch-processing system and addressing all the issues presented above. The solution we came up with lets us choose whether a request is sent synchronously or in a batch, and switching from one option to the other requires a minimal code change. Creating new batch tasks is super straightforward: we only need to subclass and implement 4 or 5 methods. Most of the work goes into prompt engineering and quality testing. We ended up with a fully automated batch-processing system that starts processing data as soon as the polling system discovers it. We’ve also added expense and error monitoring, because once you reach this point, this strategy can lead to recurrent unexpected failures or costs.
Implementing batch processing was more challenging than expected, but it is a very effective way to reduce costs by 50% while keeping the same prompts and models as the synchronous requests and maintaining the output quality. It was worth the investment.