Using Qomplement + Pandas for data analysis pipeline

Sharing our pipeline setup for extracting data from medical research papers and analyzing it with Pandas.

The flow:

  1. Papers come in as PDFs via Email Inbox
  2. Qomplement extracts structured data (study size, outcomes, methodology, etc.)
  3. Webhook triggers our Python pipeline
  4. Pandas for cleaning, normalization, and analysis
  5. Results go to a Jupyter dashboard
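Step 3's webhook receiver can be sketched with just the stdlib; this is a minimal example, and the `document_id` payload field is an assumption — check the actual webhook body Qomplement sends you:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

processed = []  # stand-in for kicking off the real fetch/normalize pipeline

def run_pipeline(doc_id):
    # placeholder: fetch_extracted_data(doc_id) -> normalize_study_data -> analyze
    processed.append(doc_id)

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        payload = json.loads(self.rfile.read(length))
        # 'document_id' is an assumed field name -- adjust to the real payload
        run_pipeline(payload['document_id'])
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the console quiet

def serve(port=8000):
    HTTPServer(('', port), WebhookHandler).serve_forever()
```

In production you'd also want to verify the webhook signature before trusting the payload.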

import os

import pandas as pd
import requests

API_KEY = os.environ['QOMPLEMENT_API_KEY']  # keep the key out of source control

def fetch_extracted_data(doc_id):
    """Pull the structured extraction for a single document."""
    resp = requests.get(
        f'https://api.qomplement.com/v1/documents/{doc_id}',
        headers={'Authorization': f'Bearer {API_KEY}'},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()['data']

def normalize_study_data(raw):
    """Clean and normalize extracted fields."""
    df = pd.DataFrame([raw])
    # Standardize date formats (papers report dates inconsistently)
    df['publication_date'] = pd.to_datetime(df['publication_date'], format='mixed')
    # Normalize sample sizes: strip thousands separators, pull the first
    # integer, and coerce anything unparseable to NaN instead of crashing
    df['sample_size'] = pd.to_numeric(
        df['sample_size'].str.replace(',', '').str.extract(r'(\d+)')[0],
        errors='coerce',
    )
    return df

# Aggregate results across multiple papers
# (extracted_data: list of dicts returned by fetch_extracted_data)
all_studies = pd.concat(
    [normalize_study_data(d) for d in extracted_data],
    ignore_index=True,
)
print(all_studies.describe())
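As a quick sanity check on the sample-size cleanup above, a made-up record with a messy value comes out as a clean integer:

```python
import pandas as pd

# Made-up extraction output, just to illustrate the cleanup step
raw = {'sample_size': 'n = 1,284 participants'}
df = pd.DataFrame([raw])
df['sample_size'] = pd.to_numeric(
    df['sample_size'].str.replace(',', '').str.extract(r'(\d+)')[0]
)
print(int(df['sample_size'].iloc[0]))  # 1284
```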

The key insight: Qomplement handles the unstructured→structured conversion, then you can use standard data tools for everything else. Way better than trying to do NLP on raw PDFs.

3 Likes

This is very close to what we’re building for clinical trial data. Thanks for sharing! Quick question — how do you handle papers where the template doesn’t match well? Do you have a fallback?

1 Like

Good question. We use three templates for different paper formats (standard research, case study, meta-analysis). We run a quick classification step first, based on the first-page layout, then route to the matching template. If confidence is below 70%, we flag the paper for manual review.
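The routing logic is roughly this; the template names match the formats above, but `classify_first_page` is a stand-in for whatever layout classifier you use:

```python
# Sketch of the template-routing step. classify_first_page is a placeholder
# for your own layout classifier; it returns (template_name, confidence).
CONFIDENCE_THRESHOLD = 0.70  # below this, a human looks at the paper

def route_document(doc_id, classify_first_page):
    template, confidence = classify_first_page(doc_id)
    if confidence < CONFIDENCE_THRESHOLD:
        return ('manual_review', confidence)
    return (template, confidence)
```

Usage with a mocked classifier: `route_document('d1', lambda d: ('case_study', 0.91))` routes to the case-study template, while a 0.55-confidence result lands in manual review.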

2 Likes