Company Knowledge Base Extractor

Website extraction pipeline that builds structured company profiles with sources, screenshots, and completeness checks.

Status: Smaller experiment
Type: Extraction Pipeline

What it does

This utility maps a company website, selects high-signal pages, and produces a structured knowledge base.

The output can include company basics, product descriptions, pricing tiers, brand colors, screenshots, source URLs, field-level provenance, and a completeness score.

Why split the pipeline

Not every field needs an LLM. The pipeline combines targeted model passes with deterministic parsing for fields such as colors, logos, prices, calls to action, and change hashes.
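As an illustration of the deterministic side, here is a minimal sketch of color normalization and a change hash. The function names, the restriction to hex colors, and the choice of SHA-256 over whitespace-normalized text are assumptions made for this example, not details taken from the project.

```python
import hashlib
import re

# Matches 3- or 6-digit hex colors only (an assumption; rgb()/hsl() are ignored here).
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")


def normalize_hex(color: str) -> str:
    """Expand 3-digit hex to 6 digits and lowercase the result."""
    value = color.lstrip("#").lower()
    if len(value) == 3:
        value = "".join(ch * 2 for ch in value)
    return "#" + value


def extract_colors(css_text: str) -> list[str]:
    """Return unique normalized hex colors in first-seen order."""
    seen: dict[str, None] = {}
    for match in HEX_COLOR.findall(css_text):
        seen.setdefault(normalize_hex(match), None)
    return list(seen)


def change_hash(page_text: str) -> str:
    """Hash whitespace-normalized text so pure reflows do not count as changes."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Neither helper calls a model, so the same input always yields the same output and both can be unit tested cheaply.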

There is also a small evaluation path for comparing generated output with labeled examples.
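One plausible shape for such an evaluation path is a field-by-field comparison against a labeled profile. The sketch below assumes flat dictionaries and exact matching after case and whitespace normalization; the function names and the matching rule are hypothetical.

```python
def _normalize(value: object) -> str:
    """Collapse whitespace and ignore case before comparing."""
    return " ".join(str(value).split()).casefold()


def field_matches(generated: dict, labeled: dict) -> dict[str, bool]:
    """For each labeled field, report whether the generated value matches it."""
    results = {}
    for field, expected in labeled.items():
        actual = generated.get(field)
        results[field] = actual is not None and _normalize(actual) == _normalize(expected)
    return results


def accuracy(results: dict[str, bool]) -> float:
    """Fraction of labeled fields that matched (0.0 when nothing is labeled)."""
    return sum(results.values()) / len(results) if results else 0.0


generated = {"name": "Acme  Corp", "tagline": "Build faster"}
labeled = {"name": "acme corp", "tagline": "Ship faster", "founded": "2019"}
report = field_matches(generated, labeled)
```

Here `report` marks `name` as a match and the other two fields as misses. A real evaluation would likely need fuzzier matching for free-text fields such as product descriptions.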

Implementation details

  • URLs are classified into page types such as home, about, products, pricing, resources, careers, legal, and contact before extraction.
  • The typed output model covers company basics, writing guidance, design assets, competition, positioning, culture, development, legal pages, products, and pricing.
  • Deterministic helpers normalize colors, discover likely logo assets, extract pricing signals, and calculate a completeness score.
  • The final JSON records source URLs and can attach field-level provenance for later review.
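The points above can be sketched roughly as follows. The keyword table, the `other` fallback, the field names, and the equal-weight scoring are all assumptions for illustration; the project's actual classifier and scoring rules are not described here.

```python
from urllib.parse import urlparse

# Hypothetical keyword table keyed by page type; checked in insertion order.
PAGE_TYPE_KEYWORDS = {
    "about": ("about", "company", "team"),
    "products": ("product", "features", "solutions"),
    "pricing": ("pricing", "plans"),
    "resources": ("blog", "resources", "docs"),
    "careers": ("careers", "jobs"),
    "legal": ("privacy", "terms", "legal"),
    "contact": ("contact",),
}


def classify_url(url: str) -> str:
    """Classify a URL by its first path segment; an empty path is the home page."""
    path = urlparse(url).path.strip("/").lower()
    if not path:
        return "home"
    first_segment = path.split("/")[0]
    for page_type, keywords in PAGE_TYPE_KEYWORDS.items():
        if any(keyword in first_segment for keyword in keywords):
            return page_type
    return "other"


def completeness(profile: dict, required: tuple[str, ...]) -> float:
    """Share of required fields that hold a non-empty value (equal weights assumed)."""
    if not required:
        return 0.0
    filled = sum(1 for name in required if profile.get(name) not in (None, "", [], {}))
    return filled / len(required)


# Field-level provenance kept beside the values: field name -> source URLs.
profile = {"name": "Acme", "pricing_tiers": [], "brand_colors": ["#0a1b2c"]}
provenance = {
    "name": ["https://example.com/about"],
    "brand_colors": ["https://example.com/"],
}
score = completeness(profile, ("name", "pricing_tiers", "brand_colors", "products"))
```

Keeping provenance in a parallel mapping leaves the profile values simple while still letting a reviewer trace each field back to the pages it came from.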