LegalTech / Document AI

Legal Contract Intelligence Pipeline

Raw legal DOCX contracts turned into structured, queryable clause data. 12 clause types tagged automatically. Synced to Firestore and Airtable in real time. Built in January–February 2024, before legal AI became a headline.

The Problem

Lexxa was building a contract intelligence platform for teams that review legal agreements without a lawyer on every deal. The bottleneck: contracts live as raw DOCX files. Dense prose. No structure. No way to query, compare, or flag risk automatically.

They needed a backend pipeline that could take any standard legal contract, extract every clause, classify it by type, score risk, and push structured data into their product database and ops dashboard — automatically, on upload.

What Made It Hard

Legal contracts are not documents. They are nested logic trees written in prose that evolved over 200 years of drafting conventions.

Three specific problems:

One: Structure is implicit. Section 4.2 references Exhibit A which defines terms used in Section 7. The hierarchy is not in the XML. It's in the numbering conventions, the defined-terms glossary, and decades of legal drafting norms. A naive paragraph iterator misses all of it.

Two: Two document types in the wild. Some contracts came pre-annotated with revision marks and inline comments. Others were clean final versions. The same extraction strategy can't handle both. A clean contract needs regex pattern anchors. An annotated contract needs comment-node traversal in the DOCX XML. One strategy applied to both gives garbage output on half the corpus.

Three: Airtable and Firestore have different data contracts. Firestore wants a document tree per contract. Airtable wants a flat row per contract with aggregate columns for human review. The same object couldn't write to both — it needed a transform layer that produced two different shapes from the same Pydantic model.
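
The second problem implies a routing step ahead of extraction. A minimal sketch of one way to do it, assuming a contract counts as annotated when its package contains word/comments.xml or its body carries tracked-change elements (w:ins, w:del). The function names, the strategy labels, and the byte-level check are illustrative assumptions, not the pipeline's actual code:

```python
import io
import zipfile


def has_annotations(docx_bytes: bytes) -> bool:
    """Heuristic: does this DOCX carry comments or tracked changes?"""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        # Word stores inline comments in a separate part of the package.
        if "word/comments.xml" in package.namelist():
            return True
        body = package.read("word/document.xml")
    # Crude byte search for tracked insertions/deletions; a stricter
    # check would parse the XML instead of scanning raw bytes.
    return b"<w:ins " in body or b"<w:del " in body


def choose_strategy(docx_bytes: bytes) -> str:
    """Return a hypothetical strategy label for the extractor to use."""
    return "comment_based" if has_annotations(docx_bytes) else "rule_based"
```

Routing on the package contents rather than on filenames or upload metadata keeps the decision tied to what the extractor will actually see.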

The Architecture

Two extraction strategies, running behind one interface:

Strategy A (Rule-Based): Regex anchors detect section numbers (1., 2.1, 3.2.1), legal boilerplate markers (WHEREAS, NOW THEREFORE, IN WITNESS WHEREOF), and heading hierarchy from DOCX styles. Fast, deterministic, works on any clean final contract.
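
A rough sketch of the regex-anchor idea, assuming the contract has already been reduced to a list of paragraph strings. The patterns and the split_sections helper are hypothetical, and heading detection from DOCX styles is left out:

```python
import re

# Hypothetical patterns for illustration; the production regexes are not shown here.
SECTION_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(\S.*)$")
MARKER_RE = re.compile(r"^\s*(WHEREAS|NOW,? THEREFORE|IN WITNESS WHEREOF)\b")


def split_sections(paragraphs):
    """Group paragraph strings into sections opened by a number or marker."""
    sections = []
    for text in paragraphs:
        numbered = SECTION_RE.match(text)
        marker = MARKER_RE.match(text)
        if numbered:
            number = numbered.group(1)
            sections.append({
                "number": number,
                "depth": number.count(".") + 1,  # "3.2.1" -> depth 3
                "heading": numbered.group(2).strip(),
                "body": [],
            })
        elif marker:
            # Boilerplate markers open an unnumbered section of their own.
            sections.append({
                "number": None,
                "depth": 0,
                "heading": marker.group(1),
                "body": [text],
            })
        elif sections:
            # Plain paragraph: attach it to the most recent section.
            sections[-1]["body"].append(text)
        # Paragraphs before the first anchor are dropped in this sketch.
    return sections
```

Deriving depth from the number itself is what recovers the hierarchy that the XML does not encode. A body paragraph that happens to start with a number would be misread as a section, so a real implementation needs extra guards.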

Strategy B (Comment-Based): Walks the DOCX XML directly via lxml, reads revision marks and inline comments as clause boundary signals. Precise on pre-annotated contracts. Falls back to Strategy A for unannotated sections.
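
A sketch of the comment-node traversal, under two assumptions: the standard library's ElementTree stands in for lxml so the example runs without dependencies, and each comment's text is treated as the label for the span it is anchored to. The comment_spans function is hypothetical. The element names (w:comment, w:commentRangeStart, w:commentRangeEnd, w:t) and the namespace are the standard WordprocessingML ones:

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def comment_spans(document_xml: str, comments_xml: str):
    """Return (comment_text, anchored_text) pairs in document order."""
    # Map comment id -> comment text, read from word/comments.xml.
    comments = {}
    for node in ET.fromstring(comments_xml).iter(f"{{{W}}}comment"):
        text = "".join(t.text or "" for t in node.iter(f"{{{W}}}t"))
        comments[node.get(f"{{{W}}}id")] = text

    open_ranges = {}  # comment id -> text fragments collected so far
    spans = []
    for el in ET.fromstring(document_xml).iter():
        if el.tag == f"{{{W}}}commentRangeStart":
            open_ranges[el.get(f"{{{W}}}id")] = []
        elif el.tag == f"{{{W}}}commentRangeEnd":
            cid = el.get(f"{{{W}}}id")
            fragments = open_ranges.pop(cid, None)
            if fragments is not None:
                spans.append((comments.get(cid, ""), "".join(fragments)))
        elif el.tag == f"{{{W}}}t":
            # A text run belongs to every comment range currently open.
            for fragments in open_ranges.values():
                fragments.append(el.text or "")
    return spans
```

The two XML strings would come from the word/document.xml and word/comments.xml parts of the DOCX package, which is an ordinary zip archive. Text that falls outside every comment range is what a fallback to Strategy A would pick up.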

The classifier runs 12 clause-type patterns against each extracted section: parties, effective date, term, payment, obligations, IP ownership, confidentiality, termination, liability cap, indemnification, governing law, dispute resolution. Risk scoring runs in parallel: HIGH for IP assignment and liability caps, MEDIUM for termination and confidentiality (NDA) provisions.
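
To make the classifier and risk mapping concrete, here is a simplified sketch. The 12 type names and the HIGH/MEDIUM assignments follow the description above. The regexes themselves, the LOW default for every other type, and the function names are assumptions made for illustration:

```python
import re

# Hypothetical keyword patterns, one per clause type.
CLAUSE_PATTERNS = {
    "parties": r"\bby and between\b",
    "effective_date": r"\beffective date\b",
    "term": r"\bterm of this agreement\b",
    "payment": r"\b(fees?|invoice|payment)\b",
    "obligations": r"\bshall (provide|perform|deliver)\b",
    "ip_ownership": r"\b(intellectual property|work product|assigns?)\b",
    "confidentiality": r"\bconfidential",
    "termination": r"\bterminat",
    "liability_cap": r"\b(limitation of liability|aggregate liability)\b",
    "indemnification": r"\bindemnif",
    "governing_law": r"\bgoverned by the laws\b",
    "dispute_resolution": r"\b(arbitration|dispute)\b",
}
COMPILED = {name: re.compile(p, re.IGNORECASE) for name, p in CLAUSE_PATTERNS.items()}

RISK_BY_TYPE = {
    "ip_ownership": "HIGH",
    "liability_cap": "HIGH",
    "termination": "MEDIUM",
    "confidentiality": "MEDIUM",
}
RISK_ORDER = ["LOW", "MEDIUM", "HIGH"]  # LOW as the default is an assumption


def classify(section_text):
    """Return every clause type whose pattern appears in the section."""
    return [name for name, rx in COMPILED.items() if rx.search(section_text)]


def risk_for(clause_types):
    """Return the highest risk level among the matched clause types."""
    levels = [RISK_BY_TYPE.get(t, "LOW") for t in clause_types]
    return max(levels, key=RISK_ORDER.index, default="LOW")
```

A section can match more than one type, so this version returns a list and scores the section by its riskiest match.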

Output: a validated Pydantic ContractSchema object. Two sync targets: Firestore (full document tree, live product data) and Airtable (flat row with risk flags, pending human review).
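
The two-shape transform can be sketched as follows. Standard-library dataclasses stand in for the Pydantic model so the example runs without dependencies. All field names, the Airtable column names, and the "Pending" status value are hypothetical, and no Firestore or Airtable client calls are shown:

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Clause:
    clause_type: str
    text: str
    risk: str  # "HIGH", "MEDIUM", or "LOW"


@dataclass
class Contract:
    contract_id: str
    title: str
    clauses: List[Clause] = field(default_factory=list)


def to_firestore_doc(contract: Contract) -> dict:
    """Nested shape: the full document tree, one entry per clause."""
    return asdict(contract)


def to_airtable_row(contract: Contract) -> dict:
    """Flat shape: one row per contract with aggregate columns."""
    high = [c.clause_type for c in contract.clauses if c.risk == "HIGH"]
    return {
        "Contract ID": contract.contract_id,
        "Title": contract.title,
        "Clause Count": len(contract.clauses),
        "High Risk Clauses": ", ".join(high),
        "Review Status": "Pending",
    }
```

Both functions read from the same object and never from each other's output, so a change to one target's shape does not affect the other.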

The Outcome

A production pipeline that processed Consulting Agreements, NDAs, and SaaS MSAs built on the Cooley GO baseline templates. 12 clause types extracted and tagged per contract. Risk flags surfaced automatically. Firestore live, Airtable populated for ops review.

Built and delivered January–February 2024. At the time, Harvey AI had just opened their first law firm waitlist. Contract AI was not yet a category that engineers were being hired to build. This was early.

The pattern (DOCX XML traversal, dual extraction strategies, typed clause schemas, dual-target sync) is the same architecture that showed up across the "AI contract review" products that launched in 2024 and 2025. We built it before the playbook existed.

Tech Stack

Python, lxml, Pydantic, Firebase Firestore, Airtable, Poetry
