Recipe: Building a document classification layer on top of Qomplement

If you process diverse document types, you may want to classify documents automatically before extraction. Here’s how I built a simple classifier that routes each document to the right Qomplement template.

## Approach

Use the raw text extraction from Qomplement as input to a lightweight classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import joblib

# Train on a labeled set of your documents
# (the fitting step itself is not shown here; both objects must be
# fitted before classify_and_process is called)
vectorizer = TfidfVectorizer(max_features=5000)
classifier = MultinomialNB()

# qomplement_extract is your own wrapper around the Qomplement API (not shown)

# After training:
def classify_and_process(pdf_path):
    # Step 1: Quick extraction (no template)
    raw = qomplement_extract(pdf_path, template=None)
    text = raw['full_text']

    # Step 2: Classify
    features = vectorizer.transform([text])
    doc_type = classifier.predict(features)[0]

    # Step 3: Process with the right template
    template_map = {
        'invoice': 'tmpl_invoice_v2',
        'contract': 'tmpl_contract_msa',
        'receipt': 'tmpl_receipt_std',
        'form': 'tmpl_generic_form'
    }

    result = qomplement_extract(pdf_path, template=template_map[doc_type])
    return doc_type, result
```

## Training Data

You need about 50 labeled examples per category to get decent accuracy. We got 94% accuracy with just 200 total training documents across 4 categories.

## Results

This eliminated the manual sorting step in our pipeline. Documents are now routed to the correct template automatically, which improved both the speed and the accuracy of the overall extraction.
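
For anyone filling in the training step, here is a minimal sketch of fitting and saving the two objects. It assumes your labeled data is a list of extracted texts plus a parallel list of labels. The `train_classifier` helper, the toy texts, and the file name are placeholders for illustration, not part of the pipeline described above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def train_classifier(texts, labels, max_features=5000):
    """Fit a TF-IDF vectorizer and a Naive Bayes classifier on labeled texts."""
    vectorizer = TfidfVectorizer(max_features=max_features)
    features = vectorizer.fit_transform(texts)
    classifier = MultinomialNB()
    classifier.fit(features, labels)
    return vectorizer, classifier


# Toy stand-ins for real extracted text (hypothetical data)
texts = [
    "invoice number total amount due payment terms",
    "invoice billed to amount due net 30",
    "agreement between the parties governing law termination",
    "master services agreement parties liability termination",
]
labels = ["invoice", "invoice", "contract", "contract"]

vectorizer, classifier = train_classifier(texts, labels)

# Persist both objects together so inference uses the same vocabulary
model_path = os.path.join(tempfile.gettempdir(), "doc_classifier.joblib")
joblib.dump((vectorizer, classifier), model_path)

# Reload and classify a new piece of text
loaded_vectorizer, loaded_classifier = joblib.load(model_path)
prediction = loaded_classifier.predict(
    loaded_vectorizer.transform(["amount due on this invoice"])
)[0]
```

Saving the vectorizer alongside the classifier matters: the classifier only understands feature vectors built from the vocabulary it was trained on, so a freshly constructed vectorizer would not work at inference time.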

2 Likes

Nice approach. We also use confidence scores as a signal: if the default extraction's confidence is above 90%, we skip the classifier entirely.
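
A rough sketch of that gate, in case it helps. It assumes the extraction result is a dict with a `confidence` value on a 0 to 1 scale, which is a guess at the shape and not a documented Qomplement field. `route_document`, `fake_extract`, and the stub classifier are hypothetical names used only to make the example runnable.

```python
def route_document(pdf_path, extract, classify, template_map, threshold=0.90):
    """Return (doc_type, result); doc_type is None when the classifier is skipped."""
    default_result = extract(pdf_path, template=None)

    # Assumed field: 'confidence' on a 0-1 scale
    if default_result.get("confidence", 0.0) > threshold:
        return None, default_result

    # Low confidence: classify the raw text and re-run with a template
    doc_type = classify(default_result["full_text"])
    return doc_type, extract(pdf_path, template=template_map[doc_type])


# Stub extractor standing in for the real API call (hypothetical)
def fake_extract(pdf_path, template=None):
    confidence = 0.95 if pdf_path == "clean.pdf" else 0.60
    return {
        "full_text": "invoice total due",
        "confidence": confidence,
        "template": template,
    }


template_map = {"invoice": "tmpl_invoice_v2"}

skipped = route_document("clean.pdf", fake_extract, lambda text: "invoice", template_map)
routed = route_document("messy.pdf", fake_extract, lambda text: "invoice", template_map)
```

Here `skipped` keeps the default extraction and never calls the classifier, while `routed` falls through to classification and a second, templated extraction.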

4 Likes