Batch processing large PDF collections via API

Working on a data pipeline that needs to process ~10,000 PDFs (research papers, 5-20 pages each). Looking for the most efficient approach.

Currently doing sequential uploads at ~15 seconds per document = 41+ hours. Not great.

import time
import requests

# pdf_files: list of paths to the PDFs to process
results = []
for pdf_path in pdf_files:
    with open(pdf_path, 'rb') as f:
        resp = requests.post(
            'https://api.qomplement.com/v1/parse',
            files={'file': f},
            headers={'Authorization': 'Bearer <token>'}
        )
    results.append(resp.json())
    time.sleep(1)

Is there a batch endpoint? Max concurrent requests? Async processing option?


I dealt with a similar scale. Use concurrent uploads; 5-8 parallel requests worked fine for me:

from concurrent.futures import ThreadPoolExecutor, as_completed

# process_pdf(path) uploads one PDF and returns the parsed response
results = []
with ThreadPoolExecutor(max_workers=5) as executor:
    futures = {executor.submit(process_pdf, p): p for p in pdf_files}
    for future in as_completed(futures):
        results.append(future.result())

Brought my 10K doc processing down to about 8 hours.
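
For anyone adapting this, here is a rough sketch of the same thread-pool pattern that also records which files failed, plus the back-of-the-envelope timing math. The names `estimate_hours` and `process_all` are made up for illustration, and `process_one` stands in for whatever function does the actual upload (such as `process_pdf` above). The estimate assumes perfect parallel scaling and no rate limiting on the server side, which may not hold in practice.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def estimate_hours(n_docs, secs_per_doc, workers=1):
    """Idealized wall-clock estimate in hours (assumes perfect scaling)."""
    return n_docs * secs_per_doc / workers / 3600


def process_all(paths, process_one, max_workers=5):
    """Run process_one over paths in a thread pool.

    Returns (results, failures): two dicts keyed by path, so one bad
    file does not abort the run and failed files can be retried later.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its path so errors can be attributed.
        futures = {executor.submit(process_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going; record the failure
                failures[path] = exc
    return results, failures
```

With the numbers from this thread, `estimate_hours(10000, 15)` comes out to roughly 41.7 hours and `estimate_hours(10000, 15, workers=5)` to roughly 8.3 hours, which lines up with the figures reported above. Calling `process_all(pdf_files, process_pdf)` would return the parsed results alongside a dict of the paths that raised, which can then be fed back in for a second pass.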
