DEV Community

Cover image for Parse, Don’t Guess
Eitamos Ring
Eitamos Ring

Posted on

Parse, Don’t Guess

Three days before going open source, I deleted my parser's smartest feature.

333 lines. 6 functions. 12 passing tests. All green, all clever, all gone.

The feature worked. That was the problem.

The feature that knew too much

My PostgreSQL parser (valk-postgres-parser) extracts structure from SQL text: tables, columns, joins, filters. One of its analysis functions did something more ambitious. Give it a query like this:

SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
Enter fullscreen mode Exit fullscreen mode

and it would tell you the foreign key relationship: orders.customer_id is a child pointing at parent customers.id.

No schema. No catalog access. Just the query text. It felt like magic, and users love magic.

Here is how the magic worked:

func isForeignKeyColumn(column, targetTable string) bool {
    // ...
    if strings.HasSuffix(column, "_id") {
        prefix := strings.TrimSuffix(column, "_id")
        if strings.HasPrefix(targetTable, prefix) {
            return true
        }
        // Also check if table contains the prefix (handles prefixed
        // tables like fk_customers). This allows customer_id to match
        // fk_customers, e2e_customers, etc.
        // ...
Enter fullscreen mode Exit fullscreen mode

Naming conventions. customer_id next to a table called customers? Must be a foreign key. Ship it.

The comment that aged badly

The real problem was further down, in the fallback. What happens when neither column matches a naming pattern? The function did this:

// If we can't determine from FK naming conventions, return a default
// relationship based on table order (left table = parent by convention
// in SQL JOINs). This is still a heuristic but is consistent with how
// JOINs are typically written.
return &JoinRelationship{
    ChildTable:  rightTable,
    ParentTable: leftTable,
    // ...
}
Enter fullscreen mode Exit fullscreen mode

Read that again. When the function had no idea, it did not say "I have no idea." It returned an answer anyway, based on which table appeared first in the query.

Every JOIN got an answer. That was the bug. Not a crash, not a parse error. A function that is sometimes right, sometimes wrong, and gives the caller no way to tell which.

Wrong is worse than empty

Think about what downstream code does with a foreign key relationship. In our case it generated test data: parent rows first, then children referencing them. Flip parent and child and you get foreign key violations, or worse, data that inserts fine but means the wrong thing.

An empty result fails loudly. The caller sees nothing came back and handles it. A wrong result fails quietly, three layers up, a week later, in a system that trusted the library.

And the heuristic had plenty of ways to be wrong:

  • customer_id happily matched tables named fk_customers and e2e_customers, because of a prefix check added for one test environment
  • Self-referencing tables (employees.manager_id) confused the direction logic
  • Junction tables, where both joined columns are primary keys, got an arbitrary winner
  • And the table-order fallback was a coin flip dressed up as a convention

Each bug was fixable. Another pattern, another special case, another test. That is exactly how the function grew to be the smartest 333 lines in the codebase. The line count was not the cost. The confidence was.

The replacement: facts or nothing

The fix was not a better heuristic. It was a contract change:

schema := map[string][]analysis.ColumnSchema{
    "customers": {{Name: "id", PGType: "bigint", IsPrimaryKey: true}},
    "orders":    {{Name: "id", IsPrimaryKey: true}, {Name: "customer_id"}},
}

joins, _ := analysis.ExtractJoinRelationshipsWithSchema(query, schema)
// orders.customer_id -> customers.id, because the schema says id is a PK.
Enter fullscreen mode Exit fullscreen mode

You pass schema metadata, you get relationships derived from actual primary keys. You do not pass schema, you get nothing. If the metadata cannot settle a case (both sides are primary keys), the answer is nil, not a guess.

Deleting the no-schema path meant deleting tested, working, green code. The 12 tests I removed were not failing. They were carefully asserting that the guesses came out the way the guesses come out. Tests that lock in behavior nobody should rely on are not coverage. They are a fence around a landmine.

The migration cost turned out to be near zero: every production caller already had schema metadata sitting right there. They had just never been asked for it.

The part I did not expect

Last week a stranger opened an issue on the repo. He was analyzing hundreds of queries and hit a case where a WHERE clause column was not qualified by a table name, so the parser left the table field empty.

He did not ask the parser to guess. He asked for ExtractWhereConditionsWithSchema, by name, with the same schema-map shape the JOIN function uses. The design had taught him what to ask for.

That is the moment you find out an API decision was right: when users start requesting extensions to the constraint instead of exceptions from it.

The rule

You probably know "Parse, don't validate": make illegal states unrepresentable by parsing input into types that carry proof. This is the next clause of the same contract. Parsing tells the truth about what the input is. It must also tell the truth about what it cannot know.

If your library cannot know, it must say so. An empty result is an answer. A guess is a liability with good marketing.

We are currently spending billions of dollars teaching language models to say "I don't know" instead of hallucinating a confident answer. Your API can do it for free, in the type system, today: take the inputs that make the answer knowable, and return nothing when it is not.

The smartest code I ever deleted is the reason people trust what is left.


The parser is open source at github.com/ValkDB/postgresparser, 260 stars and counting. Issues and PRs welcome, especially the ones that ask for facts instead of guesses.

Top comments (0)