Sharing our pipeline setup for extracting data from medical research papers and analyzing it with pandas.
The flow:
- Papers come in as PDFs via Email Inbox
- Qomplement extracts structured data (study size, outcomes, methodology, etc.)
- Webhook triggers our Python pipeline
- Pandas for cleaning, normalization, and analysis
- Results go to a Jupyter dashboard
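The webhook step above can be sketched with a minimal stdlib receiver. The payload shape (a JSON body with a top-level `document_id` field) is an assumption; check the actual webhook schema before relying on it.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_webhook_payload(body: bytes) -> str:
    """Extract the document id from a webhook POST body.

    The {"document_id": ...} shape is an assumption, not the
    documented schema.
    """
    return json.loads(body)["document_id"]

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        doc_id = parse_webhook_payload(self.rfile.read(length))
        # Real pipeline: fetch_extracted_data(doc_id) -> normalize -> store.
        self.send_response(202)
        self.end_headers()
        self.wfile.write(json.dumps({"queued": doc_id}).encode())

if __name__ == "__main__":
    HTTPServer(("", 8000), WebhookHandler).serve_forever()
```

In production you would typically hand the doc id off to a queue rather than doing the fetch inside the request handler.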
```python
import os

import pandas as pd
import requests

# Assumes the API key is supplied via an environment variable.
API_KEY = os.environ["QOMPLEMENT_API_KEY"]

def fetch_extracted_data(doc_id):
    """Pull the structured extraction for a single document."""
    resp = requests.get(
        f'https://api.qomplement.com/v1/documents/{doc_id}',
        headers={'Authorization': f'Bearer {API_KEY}'},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()['data']

def normalize_study_data(raw):
    """Clean and normalize extracted fields."""
    df = pd.DataFrame([raw])
    # Standardize date formats (format='mixed' requires pandas >= 2.0)
    df['publication_date'] = pd.to_datetime(df['publication_date'], format='mixed')
    # Normalize sample sizes to integers ("1,234 patients" -> 1234)
    df['sample_size'] = pd.to_numeric(
        df['sample_size'].str.replace(',', '').str.extract(r'(\d+)')[0]
    )
    return df

# Aggregate results across multiple papers
# (extracted_data: a list of dicts returned by fetch_extracted_data)
all_studies = pd.concat(
    [normalize_study_data(d) for d in extracted_data], ignore_index=True
)
print(all_studies.describe())
```
The key insight: Qomplement handles the unstructured→structured conversion, then you can use standard data tools for everything else. Way better than trying to do NLP on raw PDFs.