hequ.ai / discovery
Methodology

How a coupling earns its place.

The engine does not trust itself. It does not trust any single model. It does not trust a clever algebraic match. Every claim must survive four independent levels of evidence and a four-phase peer review modelled directly on academic practice.

The thesis

The difference between a coincidence and a genuine cross-domain coupling is executable realization. A real coupling admits a physical system in which both equations hold simultaneously, on the same substrate, and produce the same measurable answer. Newton plus Hooke is real because mass–spring oscillators exist in laboratories and the engine can simulate one and compare against a measured period. Newton plus Shannon entropy is not real because no laboratory object obeys both equations at once. Every prior attempt at cross-domain discovery — SINDy, AI Feynman, PySR, SemGen, DARPA SKEMA — argued about symbols and ignored execution. This engine runs experiments.
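
As a minimal sketch of what "executable realization" means here (illustrative code, not the engine's own), one can integrate Newton's second law under Hooke's force law and compare the simulated oscillation period against the analytic 2π√(m/k):

```python
import math

def simulated_period(m=1.0, k=4.0, dt=1e-5):
    """Integrate m*x'' = -k*x (Newton + Hooke on one substrate) and time one oscillation."""
    x, v, t = 1.0, 0.0, 0.0          # released from maximum displacement
    crossings = []
    while len(crossings) < 2:
        v += (-k / m) * x * dt       # Newton: a = F/m, with Hooke's F = -k x
        x_new = x + v * dt           # semi-implicit Euler step
        if x > 0 >= x_new:           # downward zero crossing of position
            crossings.append(t)
        x, t = x_new, t + dt
    return crossings[1] - crossings[0]

# Analytic period for m = 1, k = 4 is 2π√(m/k) = π seconds.
assert abs(simulated_period() - math.pi) < 1e-3
```

Both equations hold simultaneously in the simulated system, and the measurable answer (the period) agrees with the closed-form prediction; that is the property no Newton-plus-Shannon system can exhibit.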

Evidence hierarchy

Every coupling hypothesis accumulates evidence at four levels, each stricter than the last. Confidence is multiplicative — any level failing drags the whole down.
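
One way to read "multiplicative confidence" (an illustrative reduction; the engine's exact weighting is not specified here) is a product of per-level scores, so a single failing level collapses the total:

```python
from math import prod

def combined_confidence(level_scores):
    """Multiply per-level pass scores in [0, 1]; any failing level drags C toward 0."""
    return prod(level_scores)

assert combined_confidence([0.9, 0.9, 0.9, 0.9]) > 0.6   # all strong levels
assert combined_confidence([0.9, 0.9, 0.1, 0.9]) < 0.3   # one failure sinks the whole
```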

L1

Canonical problem

Each equation in the corpus has a real problem with a known reference value. The equation passes L1 only if the problem's pytest unit test executes and the computed answer matches the reference within declared tolerance.
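
A hedged sketch of what such a pytest unit test might look like, using the simple pendulum as a hypothetical canonical problem (the reference value and tolerance below are illustrative, not corpus data):

```python
import math

REFERENCE_VALUE = 2.00607   # s: pendulum period for L = 1 m, g = 9.81 m/s^2 (illustrative)
TOLERANCE = 1e-3            # declared relative tolerance for this problem

def computed_period(L=1.0, g=9.81):
    """Candidate equation: T = 2*pi*sqrt(L/g)."""
    return 2 * math.pi * math.sqrt(L / g)

def test_canonical_pendulum():
    """Collected by pytest; L1 passes only if this assertion holds."""
    assert math.isclose(computed_period(), REFERENCE_VALUE, rel_tol=TOLERANCE)

test_canonical_pendulum()   # also runnable directly
```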

L2

Composite notebook

When two equations are coupled, a real Jupyter notebook solves the composite system numerically and asserts against a board-sourced reference value. The notebook runs in a sandboxed container with no internet, a pinned image SHA, and a content-hashed ledger entry. No hidden human-in-the-loop.

L3

Literature citation

At least two of the three AI reviewers must each produce an independent primary-source citation for the coupling, drawn from an established textbook or peer-reviewed paper. Shared-source or paraphrased citations do not count; DOI-level independence is required.

L4

Live sensor agreement

The strongest level. A real physical sensor (phone IMU, weather station, smart meter, market feed, or quantum computer) produces measurements that match the composite prediction within a declared noise envelope, with Bayesian posterior updates over a rolling window.
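
A minimal sketch of one such update rule, assuming a Beta-Bernoulli posterior over a rolling window of in-envelope/out-of-envelope readings (the engine's actual sensor model is not specified here):

```python
from collections import deque

class RollingAgreement:
    """Beta-Bernoulli posterior on P(sensor agrees) over the last `window` readings."""
    def __init__(self, window=100, alpha=1.0, beta=1.0):
        self.alpha0, self.beta0 = alpha, beta      # uniform prior by default
        self.hits = deque(maxlen=window)           # rolling window of agreements

    def observe(self, prediction, measurement, noise_envelope):
        """Record whether the measurement fell inside the declared noise envelope."""
        self.hits.append(abs(measurement - prediction) <= noise_envelope)

    def posterior_mean(self):
        a = self.alpha0 + sum(self.hits)
        b = self.beta0 + len(self.hits) - sum(self.hits)
        return a / (a + b)

ra = RollingAgreement(window=10)
for m in [9.79, 9.82, 9.80, 9.95]:                 # fake IMU gravity readings
    ra.observe(prediction=9.81, measurement=m, noise_envelope=0.05)
```

Three of the four fake readings land in the envelope, so the posterior mean sits at 4/6 under the uniform prior; each new reading shifts it, and old readings age out of the window.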

Promotion thresholds

Labels are strictly monotonic in rigor. A coupling only reaches the higher tiers by passing more evidence levels and a stricter consensus gate.

Label         Confidence   Required levels       Board gate
CONJECTURAL   C < 0.30     any                   —
EMPIRICAL     C ≥ 0.30     L1 + L2               ≥ 2 of 3 approve
PROVED        C ≥ 0.60     L1 + L2 + L3          unanimous approve
GROUNDED      C ≥ 0.85     L1 + L2 + L3 + L4     unanimous + live sensor

A sidebar class DUALITY applies to mathematical isomorphisms such as Wick rotation or log-price substitution. These cannot be promoted past EMPIRICAL on symbolic evidence alone — they need independent physical sensor agreement from both domains. Mathematical equivalence is not a physical coupling.

The four-phase review protocol

Every coupling, and every architectural decision, is reviewed by a board that mirrors real academic peer review. The protocol runs in four explicit phases and produces a binding editorial decision. The three core reviewers are Claude Opus 4.6, OpenAI GPT-5, and Gemini 2.5 Pro, each assigned a specialist persona matched to the topic under review. Grok-4 (xAI) is enrolled as an optional fourth seat that is included on high-stakes governance rounds — architecture design reviews, scope-frozen discharge rounds, brand board reviews — and omitted on operational lookups such as canonical-problem reference sourcing. Whether a given round ran as a 3-seat or 4-seat board is recorded in its ledger entry.

A

Independent draft

Each reviewer receives the submission in complete isolation. No cross-contamination. Every reply begins with a conflict-of-interest self-declaration (did this reviewer encounter the specific formulation during training?), then a five-dimension rubric score, then a draft verdict with strengths and concerns.

B

Informed vote

The three Phase A drafts are consolidated into a board briefing document showing each reviewer what the others said. Each reviewer then produces a final updated verdict, explicitly responding to peer concerns — agreeing, disagreeing, or updating their position. This is the phase that breaks correlated blind spots.

C

Editor synthesis

A fourth distinct model call reads all the Phase B verdicts and writes the binding editorial decision. The editor weights reasoning quality, not vote count, and can override a naive tally when one reviewer's argument is visibly stronger. Reviewer disagreement on fundamentals is explicitly flagged in the ledger.

D

Bounded rebuttal

If the editor returns MODIFY with specific required changes, the submitter is allowed one revision cycle to address them, and the board runs again on the revision. A maximum of two rebuttal cycles are allowed before the decision becomes final. This mirrors the author-response phase standard at CS conferences.

The rubric

Every reviewer produces a structured five-dimension score on a 1–5 integer scale. For coupling hypotheses the dimensions are:

Rubric totals are computed by the editor as weighted sums. Two reviewers scoring a coupling 5 across the board carry more weight in the editor's decision than a single reviewer scoring 1s without reasoning.
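
A sketch of an editor-side weighted tally (the dimension names and weights below are placeholders, not the engine's actual rubric):

```python
def weighted_rubric_total(scores, weights=None):
    """Weighted sum of a five-dimension 1-5 rubric; the editor may weight dimensions unevenly."""
    weights = weights or {d: 1.0 for d in scores}   # default: equal weight
    return sum(weights[d] * s for d, s in scores.items())

review = {"novelty": 4, "rigor": 5, "evidence": 5, "clarity": 3, "impact": 4}
total = weighted_rubric_total(review, weights={"novelty": 1.0, "rigor": 2.0,
                                               "evidence": 2.0, "clarity": 0.5,
                                               "impact": 1.0})
```

A tally like this is only an input to the editor; per the text, a well-argued low score can still outweigh an unreasoned high one.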

Conflict of interest

At the start of every Phase A reply, each reviewer declares whether it encountered the specific formulation during training. A declared COI does not disqualify the reviewer — all four models share some training on foundational physics literature — but the declaration is logged. If every reviewer seated in the round declares training exposure to an exact formulation, the coupling is flagged with HIGH_COI_CORRELATION_RISK and cannot be promoted past EMPIRICAL without independent experimental evidence at L4.
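
The correlated-exposure rule reduces to a small predicate over the round's declarations (a sketch; the flag name comes from the text, the reviewer keys are illustrative):

```python
def coi_flag(declarations):
    """Flag the round when every seated reviewer declares training exposure."""
    if declarations and all(declarations.values()):
        return "HIGH_COI_CORRELATION_RISK"
    return None                     # logged, but no correlated-risk flag

assert coi_flag({"reviewer_a": True, "reviewer_b": True,
                 "reviewer_c": True}) == "HIGH_COI_CORRELATION_RISK"
assert coi_flag({"reviewer_a": True, "reviewer_b": False,
                 "reviewer_c": True}) is None
```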

Three-way verification of board-sourced references

Reference values for canonical problems come from the board, not from the architect. The board is asked for {formula_symbolic, substitutions, value, citation} and the architect then runs three independent checks locally before accepting the value:

  1. Symbolic equivalence: sp.simplify(board_formula - local_formula) == 0. A stronger guarantee than any single-point numeric check.
  2. Property-based randomized sampling: 50 unit-consistent parameter draws, all of which must satisfy the board's claimed relationship within tolerance.
  3. Cross-CAS high precision: mpmath at 50 decimal places against an independent sympy evaluation. Disagreement beyond 10 decimal places flags the formula as numerically fragile.

All three must pass. A single-point agreement alone is insufficient. This breaks the circularity risk — the architect's local code is an independent check that does not depend on any model's training data.
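
A hedged end-to-end sketch of the three checks, using the mass-spring period formula as a stand-in board response (the formulas and tolerances here are illustrative, not actual board output):

```python
import random
import mpmath
import sympy as sp

m, k = sp.symbols("m k", positive=True)
board_formula = 2 * sp.pi * sp.sqrt(m / k)                # board-claimed formula
local_formula = 2 * sp.pi * (m / k) ** sp.Rational(1, 2)  # architect's local form

# 1. Symbolic equivalence: stronger than any single-point numeric check.
assert sp.simplify(board_formula - local_formula) == 0

# 2. Property-based sampling: 50 random draws must agree within tolerance.
f = sp.lambdify((m, k), board_formula, "math")
g = sp.lambdify((m, k), local_formula, "math")
for _ in range(50):
    mv, kv = random.uniform(0.1, 10.0), random.uniform(0.1, 10.0)
    assert abs(f(mv, kv) - g(mv, kv)) < 1e-9 * abs(f(mv, kv))

# 3. Cross-CAS high precision: mpmath at 50 decimal places vs sympy's evalf.
mpmath.mp.dps = 50
via_mpmath = 2 * mpmath.pi * mpmath.sqrt(mpmath.mpf(2) / 3)
via_sympy = mpmath.mpf(str(board_formula.subs({m: 2, k: 3}).evalf(50)))
assert abs(via_mpmath - via_sympy) < mpmath.mpf(10) ** -40
```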

The append-only ledger

Every hypothesis, every rejection, every board vote, every sensor reading, every composite execution result is recorded in an append-only JSONL file with a SHA-256 hash chain. Nothing is deleted. The ledger is the single source of truth for what the engine has ever tried and what happened. A future audit can re-derive every promotion decision from its raw evidence.
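
The mechanics of such a chain are small enough to sketch (illustrative record shape, not the engine's schema): each record's hash covers the previous record's hash, so editing or deleting any line breaks verification of everything after it.

```python
import hashlib
import json

def append_to_ledger(path, entry, prev_hash):
    """Append one JSONL record whose SHA-256 covers the previous record's hash."""
    body = {"prev": prev_hash, "entry": entry}
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    with open(path, "a") as f:
        f.write(json.dumps({**body, "hash": digest}, sort_keys=True) + "\n")
    return digest

def verify_ledger(path):
    """Re-derive the whole chain from raw evidence; any tampering surfaces here."""
    prev = "GENESIS"
    for line in open(path):
        rec = json.loads(line)
        body = {"prev": rec["prev"], "entry": rec["entry"]}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["prev"] != prev or rec["hash"] != expected:
            return False
        prev = rec["hash"]
    return True
```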

Sandboxed notebook execution

Every auto-generated composite notebook runs inside a Docker container with pinned image SHA, frozen pip requirements with hashes, no network access, read-only root filesystem, an ephemeral 64 MB tmpfs workspace, non-root UID, memory and CPU cgroup limits, and a per-cell execution timeout. The container's SHA and the notebook's content hash are both logged in the ledger, so the execution environment is reproducible and tamper-evident.
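
An invocation with those controls might look like the following sketch. The image name, digest placeholder, paths, and resource numbers are illustrative; only the flag-to-control mapping reflects the stated design.

```shell
# --network none   -> no internet access
# --read-only      -> read-only root filesystem
# --tmpfs ...      -> ephemeral 64 MB workspace
# --user 1000:1000 -> non-root UID
# --memory/--cpus  -> cgroup limits
docker run --rm --network none --read-only \
  --tmpfs /workspace:rw,size=64m \
  --user 1000:1000 --memory 512m --cpus 1.0 \
  notebook-executor@sha256:<pinned-image-digest> \
  jupyter nbconvert --to notebook --execute \
  --ExecutePreprocessor.timeout=30 /workspace/composite.ipynb
```

Pinning by digest rather than tag is what makes the logged SHA meaningful: the same digest always resolves to byte-identical image content.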

Academic analogs

The four-phase protocol is not invented — it mirrors standard peer review. For the record, the closest academic correspondences:

v5 rules added after Round 2 discharge

The board's discharge round produced four additional binding rules beyond the v4 FCS fixes, each closing a specific structural loophole the engine hit in practice:

v5·1

Board is not a CAS

The first Phase 12 canonical run caught a language-model arithmetic error: correct formula, correct citations, imprecise decimal. Never ask the board for a precision-critical numerical value. Query for formula and citations; compute the number locally with sympy + mpmath + python-flint. The board's value_attempt is logged as capability data but never load-bearing.
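
Under this rule, the local computation is the load-bearing step; a sketch under the same illustrative pendulum substitutions used elsewhere on this page:

```python
import mpmath

mpmath.mp.dps = 50                           # 50 decimal places, per v5.1
# Board supplies {formula_symbolic, substitutions, citation}; the decimal is ours.
L, g = mpmath.mpf("1.0"), mpmath.mpf("9.81")
value = 2 * mpmath.pi * mpmath.sqrt(L / g)   # computed locally, never copied from the board
```

The board's own value_attempt would be compared against `value` and logged as capability data, but `value` is what enters the ledger.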

v5·2

All must respond

Non-response from any board member (HTTP error, timeout, parse failure) counts as dissent, not neutral. Rounds cannot approve with partial quorum. Every model call has 5-retry exponential backoff (2→4→8→16→32 s). Persistent non-response escalates to the user with a Decision Matrix of retry / proceed-with-acknowledged-dissent / wait-for-vendor.
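
The retry schedule reduces to a few lines (a sketch with an injectable sleep so the schedule is testable; the escalation path is out of scope here):

```python
import time

def call_with_backoff(call, retries=5, base=2.0, sleep=time.sleep):
    """Retry a board-member call with 2 -> 4 -> 8 -> 16 -> 32 s exponential backoff."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise               # persistent non-response: escalate, count as dissent
            sleep(base * 2 ** attempt)

schedule = [2.0 * 2 ** a for a in range(5)]   # [2, 4, 8, 16, 32] seconds
```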

v5·3

Scope-frozen discharge

After Round 1 produces the Frozen Concern Set, Round 2+ reviewers can only score each FCS item as addressed / partially / not-addressed — they cannot raise new concerns. Process violations are logged and escalated to the user as the only authority that can reopen the FCS. This breaks the infinite-review-drift that would otherwise keep the architecture perpetually in flight.

v5·4

No corner cuts on failure

On any verification failure, the Failure Investigation Protocol (§11h) fires: state audit, assumption audit, literature RAG, board failure review, ledger entry. Tolerance loosening is forbidden. Equation modifications require explicit user approval on a board equation_inadequate verdict. The default hypothesis when theory and experiment disagree is that the experiment is incomplete — not that the theory is wrong.

The Nobel-prize bar

The engine's output standard is external academic peer review, not internal board agreement. Every PROVED or GROUNDED coupling must pass an academic readiness checklist before promotion:

A coupling missing any checklist item stays at EMPIRICAL regardless of board consensus. Internal board agreement is necessary but not sufficient.