Preprocessing Assets

ThothAI relies on retrieval assets that are prepared ahead of time, before normal querying begins.

Main Asset Types

  • schema metadata
  • comments and column descriptions
  • evidence
  • Gold SQL examples
  • vector embeddings
  • LSH-derived similarity artifacts

Why They Matter

These assets shape the context passed to the SQL generation agents.

They help the runtime:

  • map business language to tables and columns
  • recover relevant examples from the vector store
  • reduce the schema for smaller generation models
  • enforce evidence-aware testing before selecting a final query
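
As an illustration of the first and third points, the sketch below reduces a schema by keeping only the columns whose name or description shares a word with the question. It is a minimal sketch under stated assumptions, not ThothAI's implementation: the `reduce_schema` function, the `SCHEMA` layout, and the stopword list are all hypothetical, and plain token overlap stands in for whatever retrieval the real runtime performs.

```python
import re

# Hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "to", "per", "was", "what", "when", "in"}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its word tokens, minus stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def reduce_schema(question: str, schema: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Keep columns whose name or description shares a token with the question."""
    question_tokens = tokenize(question)
    reduced: dict[str, list[str]] = {}
    for table, columns in schema.items():
        kept = [
            column
            for column, description in columns.items()
            if question_tokens & tokenize(f"{column} {description}")
        ]
        if kept:  # drop tables with no matching column
            reduced[table] = kept
    return reduced


# Hypothetical schema: table -> column -> column description.
SCHEMA = {
    "orders": {
        "order_id": "unique order identifier",
        "total_amt": "revenue billed to the customer",
        "created_at": "timestamp when the order was placed",
    },
    "employees": {
        "emp_id": "unique employee identifier",
        "salary": "monthly gross pay",
    },
}

reduced = reduce_schema("What was the revenue per customer last month?", SCHEMA)
print(reduced)
```

The column `total_amt` is kept only because its description mentions "revenue" and "customer". Without column descriptions, the question would match nothing in this schema, which is why the descriptive assets matter for mapping business language to columns.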

Operational Consequence

If preprocessing is skipped or incomplete, the system can still run, but:

  • context retrieval becomes weaker
  • schema reduction becomes noisier
  • SQL evaluation has less grounded evidence
  • the quality of the final selected query drops
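
Because the system keeps running when assets are missing, a preflight check that warns instead of failing can make the degradation visible. The sketch below assumes a hypothetical inventory that maps each asset type to an item count. The asset names mirror the list under Main Asset Types, but the `EXPECTED_ASSETS` tuple, the `preflight` function, and the inventory format are illustrative and not part of ThothAI.

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("preprocessing_preflight")

# Hypothetical asset keys, named after the asset types listed in this document.
EXPECTED_ASSETS = (
    "schema_metadata",
    "column_descriptions",
    "evidence",
    "gold_sql_examples",
    "vector_embeddings",
    "lsh_artifacts",
)


def preflight(inventory: dict[str, int]) -> list[str]:
    """Return the expected assets that are absent or empty, logging a warning for each.

    The check never raises: querying can proceed, only with weaker context.
    """
    missing = [name for name in EXPECTED_ASSETS if inventory.get(name, 0) <= 0]
    for name in missing:
        logger.warning("preprocessing asset missing or empty: %s", name)
    return missing


# Hypothetical inventory: asset name -> number of prepared items.
inventory = {
    "schema_metadata": 12,
    "column_descriptions": 87,
    "evidence": 0,
    "vector_embeddings": 87,
}

missing = preflight(inventory)
print(missing)
```

Returning the list rather than raising an exception matches the behavior described above: the run continues, and the caller decides whether the gaps are acceptable.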