Data Linking

In ThothAI, "data linking" is the combined retrieval process that maps a question to the most relevant schema, evidence, and example SQL before generation.

For the module-by-module developer explanation of validation, retrieval, LSH extraction, vector enrichment, schema reduction, and mschema generation, see Preprocessing And Schema Linking.

Inputs Used For Linking

The current pipeline combines:

  • validated or translated question text
  • extracted keywords
  • evidence retrieved from the vector database
  • prior SQL examples retrieved from the vector database
  • LSH-derived schema matches
  • vector-enriched schema descriptions

Current Linking Flow

The relevant implementation is split across:

  • helpers/main_helpers/main_preprocessing_phases.py
  • model/system_state.py
  • helpers/get_evidences_and_sql_shots.py
  • helpers/main_helpers/main_schema_extraction_from_lsh.py
  • helpers/main_helpers/main_schema_extraction_from_vectordb.py
  • helpers/main_helpers/main_schema_link_strategy.py

What Happens In Practice

  1. The question is validated and optionally translated.
  2. The keyword extraction agent produces the retrieval terms.
  3. state.get_evidence_from_vector_db() retrieves short evidence records.
  4. state.get_sql_from_vector_db() retrieves similar question-SQL-hint examples.
  5. state.extract_schema_via_lsh() recovers likely matching schema elements and example values.
  6. state.extract_schema_from_vectordb() augments the schema with vector-backed descriptions.
  7. decide_schema_link_strategy() determines how aggressive schema reduction should be.
  8. to_mschema() builds the final schema string passed into generation.

Current Schema-Linking Methodology

The current implementation does not handle large schemas by switching to a second, full-context strategy. Instead, it decides whether schema reduction is necessary and, if so, builds a filtered schema that preserves only high-signal and structurally necessary columns.

Decision Layer

decide_schema_link_strategy(state) evaluates:

  • the token cost of the full enriched schema
  • the total number of columns in the authoritative schema
  • the configured threshold for how much of the model context may be consumed before linking
  • the configured threshold for maximum columns before linking

The function returns:

  • WITHOUT_SCHEMA_LINK
  • WITH_SCHEMA_LINK
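The decision can be re-created in miniature as below. The threshold names, default values, and the rough 4-characters-per-token estimate are assumptions made for illustration; the real function reads its thresholds from configuration.

```python
WITHOUT_SCHEMA_LINK = "WITHOUT_SCHEMA_LINK"
WITH_SCHEMA_LINK = "WITH_SCHEMA_LINK"

# Illustrative re-creation of the decision layer. Parameter names and
# defaults are assumptions; ThothAI reads its thresholds from config.
def decide_schema_link_strategy(schema_text, total_columns,
                                context_window=8192,
                                max_context_fraction=0.5,
                                max_columns=100):
    est_tokens = len(schema_text) // 4  # crude ~4-chars-per-token estimate
    token_pressure = est_tokens > context_window * max_context_fraction
    column_pressure = total_columns > max_columns
    if token_pressure or column_pressure:
        return WITH_SCHEMA_LINK
    return WITHOUT_SCHEMA_LINK
```

Note that either pressure alone is enough to trigger linking, matching the two reasons discussed in the next section.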

Why Large Databases Trigger Linking

Large databases become difficult for the runtime for two concrete reasons:

  1. the full schema text consumes too much of the available model context window
  2. the raw column count makes broad prompting too noisy even before hard token exhaustion

That is why the code checks both token pressure and column pressure.

LSH Contribution

extract_schema_via_lsh() contributes:

  • column-level matches
  • example values for matched columns
  • a lexical and value-driven narrowing signal

This pass is critical and request-blocking: if it fails, the request cannot continue.
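The narrowing signal this pass produces can be approximated without real LSH. The sketch below scores stored column values against question keywords with character-trigram Jaccard similarity; actual LSH buckets hash signatures so the lookup scales, but the output, columns matched via their values, is the same in spirit. All names here are illustrative.

```python
def ngrams(s, n=3):
    """Character n-grams of a lowercased string."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Simplified stand-in for the LSH pass: keep a column's example values
# when any keyword is lexically close to one of them. Real LSH avoids
# the pairwise comparison, but yields the same kind of narrowing signal.
def match_values(keywords, column_values, threshold=0.3):
    matches = {}
    for col, values in column_values.items():
        for v in values:
            if any(jaccard(ngrams(kw), ngrams(v)) >= threshold for kw in keywords):
                matches.setdefault(col, []).append(v)
    return matches
```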

Vector DB Contribution

extract_schema_from_vectordb() contributes:

  • semantic descriptions
  • additional enrichment for columns already relevant to the request

This pass is useful but non-blocking: if it fails, the request can continue with reduced enrichment.
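The blocking-versus-non-blocking contract between the two passes can be sketched as follows. The function names are hypothetical; the point is only that LSH failures propagate while vector-enrichment failures degrade gracefully.

```python
import logging

# Sketch of the failure contract described above. `extract_via_lsh`
# failures abort the request; vector enrichment failures only reduce it.
# Both callables are hypothetical stand-ins for the real state methods.
def enrich_schema(extract_via_lsh, extract_from_vectordb):
    lsh_matches = extract_via_lsh()  # critical: exceptions propagate
    try:
        descriptions = extract_from_vectordb()  # non-blocking enrichment
    except Exception as exc:
        logging.warning("vector enrichment skipped: %s", exc)
        descriptions = {}
    return lsh_matches, descriptions
```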

Filtered Schema Construction

When the strategy is WITH_SCHEMA_LINK, the runtime uses create_filtered_schema(state).

The reduced schema keeps a column if:

  • it is a primary key
  • it is a foreign key
  • it appears in schema_with_examples
  • it appears in schema_from_vector_db

This is the core of the current schema-linking approach:

  • never lose the relational backbone needed for joins
  • keep columns with LSH evidence
  • keep columns with semantic vector relevance
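Assuming plain dicts in place of ThothAI's state object, the keep rules can be sketched like this. The real create_filtered_schema(state) works on the state's authoritative schema structures, so the signature below is illustrative only.

```python
# Illustrative version of the four keep rules. The real function reads
# these structures off ThothAI's state object rather than parameters.
def create_filtered_schema(schema, primary_keys, foreign_keys,
                           schema_with_examples, schema_from_vector_db):
    filtered = {}
    for table, columns in schema.items():
        kept = [
            col for col in columns
            if (table, col) in primary_keys                  # relational backbone
            or (table, col) in foreign_keys                  # join paths
            or col in schema_with_examples.get(table, {})    # LSH evidence
            or col in schema_from_vector_db.get(table, {})   # vector relevance
        ]
        if kept:
            filtered[table] = kept
    return filtered
```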

Output Representation

The result is then converted to mschema through to_mschema().

That representation preserves:

  • table descriptions
  • column types
  • descriptions
  • examples
  • foreign key mappings

while remaining much smaller than a raw full-schema dump.
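A renderer in the spirit of mschema might look like the sketch below. This is not the exact mschema format, only a hypothetical layout that keeps the five kinds of information listed above in a compact string.

```python
# Hypothetical mschema-style renderer: compact, but preserves types,
# descriptions, examples, and foreign keys. Not ThothAI's exact format.
def render_mschema(tables, foreign_keys):
    lines = []
    for name, cols in tables.items():
        lines.append(f"# Table: {name}")
        for col, (ctype, desc, examples) in cols.items():
            ex = f", examples: {examples}" if examples else ""
            lines.append(f"({col}:{ctype}, {desc}{ex})")
    for src, dst in foreign_keys:
        lines.append(f"{src} -> {dst}")
    return "\n".join(lines)
```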

Why This Matters

This linking stage is what lets ThothAI:

  • narrow large schemas for smaller models
  • preserve business terminology through evidence
  • inject good prior examples into the prompt context
  • ground SQL testing in retrieved facts rather than prompt-only guesses

Practical Developer Interpretation

When the database is too large for broad-context prompting, the system does not "try harder" with the same full schema. Instead, it transforms retrieval signals into a structurally safe reduced schema and passes that reduced mschema into generation.