
ContentBase Latent-Variable Training and Reranker Debugging Log

2026/05/21 · 59 min read · Training AI

node --test -r ts-node/register/transpile-only --test-name-pattern "generates an article body through an injected model provider when requested|authorial constitution" tests/content-runtime-api.test.ts
  └ ▶ content runtime api
      ✖ generates an article body through an injected model provider when requested (101.6354ms)
    … +30 lines

• One old assertion in the focused test fails on the reference coverage threshold: the actual value is 72, while the test hard-codes 70. This is not directly related to my removal of the duplicate prompt block or to the Gateway endpoint change. I will not fix that old assertion in passing. Instead I continue validating the paths touched by this latent change: the Gateway typecheck and URL tests pass, and the generation-kernel tests pass.

• Running Get-Content .gitignore; Get-Content training\latent-mvp.gitignore -ErrorAction SilentlyContinue

• Running git -C DataBase diff --stat -- apps/gateway/src/lib/latent-rerank-client.ts apps/gateway/src/lib/latent-rerank-client.test.ts; git -C ContentBase diff --stat -- product/novel/app/article/capability.ts; git diff --stat -- training/latent-mvp/README.md training/latent-mvp/artifacts/controller-v1/controller.json

• Ran Get-Content .gitignore; Get-Content training\latent-mvp.gitignore -ErrorAction SilentlyContinue
  └ # Local deploy overrides (copy from deploy.env.example)
    deploy.env
    … +2 lines

  └ {
      "version": "latent-controller.v1",
    … +8 lines
      "valMae": 5.960464477539063e-08
    }

• Ran git -C DataBase diff --stat -- apps/gateway/src/lib/latent-rerank-client.ts apps/gateway/src/lib/latent-rerank-client.test.ts; git -C ContentBase diff --stat -- product/novel/app/article/capability.ts; git diff --stat -- training/latent-mvp/README.md training/latent-mvp/artifacts/controller-v1/controller.json
  └ product/novel/app/article/capability.ts | 779 ++++++++++++++++++++++++--------
    1 file changed, 591 insertions(+), 188 deletions(-)
    warning: in the working copy of 'product/novel/app/article/capability.ts', LF will be replaced by CRLF the next time Git touches it

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

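• I found that the ContentBase file already carries a large number of uncommitted changes, so the large git diff --stat is not caused by this session; I only touched one duplicated prompt block in it. Next I fill in the latent runtime artifacts: train a small reranker artifact with the canonical train_reranker.py, so that the service does not rely only on an external base model, …
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────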

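• train_reranker.py fails because the canonical data fields do not match: the existing JSONL is not in the topic/text shape it expects, so the "reranker training entry point" has never actually consumed the exported latent corpus. I will read the sample fields and fix the training entry point to consume the existing canonical corpus, rather than writing a throwaway conversion script.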

• Running Get-Content training\latent-mvp\data\train.jsonl -TotalCount 3

• Running Get-Content training\latent-mvp\data\summary.json

• Ran Get-Content training\latent-mvp\data\train.jsonl -TotalCount 3
  └ {"sampleId":"f240cd06aa365c22edbd619ded15ae28ba45a14c893288d97a35831d3dc0d5b4","sourceFile":"generate-article-mvp-1779369075372.json","topic":"","profile":"unknown","provider":"","text":"最早将辽阔的海洋当作移动空间,将之编入人类史
    … +64 lines
    {"vectorSimilarity":0,"fusedRelevance":110,"relevanceScore":110,"qualityBlockCount":0,"referenceCoverageScore":0},"label":0}

• Ran Get-Content scripts\export-latent-training-corpus.mjs -TotalCount 220 └ #!/usr/bin/env node /** … +178 lines (ctrl + t to view transcript) fs.writeFileSync(filePath, rows.map((row) => JSON.stringify(row)).join('\n') + '\n', 'utf8'); }

• Ran Get-Content training\latent-mvp\data\summary.json └ { "version": "latent-mvp-corpus.v1", … +8 lines (ctrl + t to view transcript) "valRatio": 0.1 }

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

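• The real shape of the training data reveals another gap: the exporter emits an empty topic, so train_reranker.py drops every sample. This is not a service problem; the training entry point and the corpus schema are misaligned. I will fix the training entry point to consume only the canonical corpus fields and stop requiring a non-empty topic.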

• Edited training\latent-mvp\train_reranker.py (+20 -14)

    -def load_jsonl(path: Path) -> List[Dict]:
    -    rows: List[Dict] = []
    -    with path.open("r", encoding="utf-8") as handle:
    -        for line in handle:
    -            line = line.strip()
    +def load_jsonl(path: Path) -> List[Dict]:
    +    rows: List[Dict] = []
    +    with path.open("r", encoding="utf-8") as handle:
    +        for line in handle:
    +            line = line.strip()
                 if not line:
                     continue
    -            obj = json.loads(line)
    -            topic = str(obj.get("topic", "")).strip()
    -            text = str(obj.get("text", "")).strip()
    -            if not topic or not text:
    -                continue
    -            rows.append({
    ⋮
    +                or obj.get("sourceFile")
    +                or "latent writing preference"
    +            ).strip()
    +            text = str(obj.get("text", "")).strip()
    +            if not text:
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
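The diff above is only partly visible, so here is a minimal, self-contained sketch of the idea it describes: keep a row whenever it has text, and fall back to another field when topic is empty. The fallback order (topic, then sourceFile, then the literal "latent writing preference") is an assumption pieced together from the visible lines; the real file may consult additional fields. The helper name load_rows and the label default of 0 are hypothetical.

```python
import json
from typing import Dict, Iterable, List

# Fallback string visible in the diff; everything else here is an assumption.
FALLBACK_TOPIC = "latent writing preference"


def load_rows(lines: Iterable[str]) -> List[Dict]:
    """Parse JSONL lines, tolerating an empty topic but requiring non-empty text."""
    rows: List[Dict] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        obj = json.loads(line)
        # An empty or missing topic no longer drops the sample.
        topic = str(obj.get("topic") or obj.get("sourceFile") or FALLBACK_TOPIC).strip()
        text = str(obj.get("text", "")).strip()
        if not text:
            continue  # text is still mandatory
        rows.append({"topic": topic, "text": text, "label": int(obj.get("label", 0))})
    return rows
```

With this shape, a row such as {"topic": "", "sourceFile": "a.json", "text": "body"} is kept with topic "a.json" instead of being discarded.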

• The training entry point now matches the existing corpus. Rerunning the small reranker training; with only 22 train samples this should be a short job.

• Ran python training/latent-mvp/train_reranker.py --train training/latent-mvp/data/train.jsonl --val training/latent-mvp/data/val.jsonl --output training/latent-mvp/artifacts/reranker-v1 --epochs 1 --batch-size 8
  └ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
    … +38 lines
    ImportError: Using the Trainer with PyTorch requires accelerate>=1.1.0: Please run pip install transformers[torch] or pip install 'accelerate>=1.1.0'

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

• Edited training\latent-mvp\requirements.txt (+3 -2)
     uvicorn>=0.29.0
     pydantic>=2.7.0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

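• I will install the missing accelerate package. It is a training toolchain dependency and does not change the business runtime path; once it is installed I rerun the same training entry point.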

• Ran python -m pip install "accelerate>=1.1.0" └ Defaulting to user installation because normal site-packages is not writeable Collecting accelerate>=1.1.0 … +47 lines (ctrl + t to view transcript) [notice] A new release of pip is available: 26.0.1 -> 26.1.1 [notice] To update, run: python.exe -m pip install --upgrade pip

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

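• accelerate is installed. Now I rerun the reranker training; if the sentence-transformers API still has version compatibility problems, I will fix the training entry point itself rather than bypass the canonical flow.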

• Ran python training/latent-mvp/train_reranker.py --train training/latent-mvp/data/train.jsonl --val training/latent-mvp/data/val.jsonl --output training/latent-mvp/artifacts/reranker-v1 --epochs 1 --batch-size 8
  └ Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
    … +55 lines
    AttributeError: 'NoneType' object has no attribute 'float'
    0%| | 0/3 [00:03<?, ?it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

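• Training now reaches the Trainer, but the newer sentence-transformers CrossEncoder.fit has a compatibility problem with InputExample labels. To avoid getting stuck on the library API, I considered moving the training entry point to the CrossEncoderTrainer/dataset flow that the library currently recommends, but that is still too heavy. The more conservative option is to save the base cross-encoder artifact plus an eval summary so that the service can start, let the controller handle generationControl, and optimize the reranker separately later.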

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

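• I will fix train_reranker.py so that it no longer uses the high-level sentence-transformers Trainer and instead runs a very small BCE fine-tune on the underlying cross-encoder model. That way the canonical training entry point can reliably consume the existing JSONL and produce a local reranker artifact.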

• Edited training\latent-mvp\train_reranker.py (+47 -24)
     import numpy as np
    -import torch
    -from sentence_transformers import CrossEncoder, InputExample
    -from torch.utils.data import DataLoader
    +import torch
    +from sentence_transformers import CrossEncoder

    ⋮

    -def make_examples(rows: List[Dict]) -> List[InputExample]:
    -    return [
    -        InputExample(texts=[row["topic"], row["text"]], label=float(row["label"]))
    -        for row in rows
    -    ]
    -
    -
    -def evaluate(model: CrossEncoder, rows: List[Dict]) -> Dict[str, float]:
    +def evaluate(model: CrossEncoder, rows: List[Dict]) -> Dict[str, float]:
         if not rows:
    ⋮
             "avgNegScore": float(negatives.mean()) if negatives.size else 0.0,
    -    }
    -
    -
    -def main() -> None:
    +    }
    +
    +
    +def train_manual(model: CrossEncoder, rows: List[Dict], epochs: int, batch_size: int, lr: float) -> None:
    +    device = next(model.model.parameters()).device
    +    optimizer = torch.optim.AdamW(model.model.parameters(), lr=lr)
    +    loss_fn = torch.nn.BCEWithLogitsLoss()
    +    rng = np.random.default_rng(42)
    +
    +    model.model.train()
    +    for epoch in range(max(1, epochs)):
    +        order = rng.permutation(len(rows))
    +        total_loss = 0.0
    +        step_count = 0
    +        for start in range(0, len(order), max(1, batch_size)):
    +            batch_indexes = order[start:start + max(1, batch_size)]
    +            batch = [rows[int(index)] for index in batch_indexes]
    +            features = model.tokenizer(
    +                [row["topic"] for row in batch],
    +                [row["text"] for row in batch],
    +                padding=True,
    +                truncation=True,
    +                return_tensors="pt",
    +            )
    +            features = {key: value.to(device) for key, value in features.items()}
    +            labels = torch.tensor([float(row["label"]) for row in batch], dtype=torch.float32, device=device)
    +
    +            optimizer.zero_grad(set_to_none=True)
    +            output = model.model(**features)
    +            logits = output.logits.reshape(-1)
    +            loss = loss_fn(logits, labels)
    +            loss.backward()
    +            optimizer.step()
    +
    +            total_loss += float(loss.detach().cpu())
    +            step_count += 1
    +        print(json.dumps({
    +            "epoch": epoch + 1,
    ⋮
    -    train_examples = make_examples(train_rows)
    -    train_data = DataLoader(train_examples, shuffle=True, batch_size=args.batch_size)
    -    model.fit(
    -        train_dataloader=train_data,
    -        epochs=args.epochs,
    ⋮
    -        show_progress_bar=True,
    -    )
    +    train_manual(model, train_rows, args.epochs, args.batch_size, args.lr)

• Ran python training/latent-mvp/train_reranker.py --train training/latent-mvp/data/train.jsonl --val training/latent-mvp/data/val.jsonl --output training/latent-mvp/artifacts/reranker-v1 --epochs 1 --batch-size 8
  └ {"epoch": 1, "loss": 0.006283120989489059}
    {
    … +23 lines
    13.75it/s]

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

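• The reranker artifact has been generated as well. Note that the labels in the current training set are all negative, so AUC is 0 and the result only counts as a "service-usable MVP artifact", not a high-quality personalized model. Real improvement requires exporting positive samples from passed/failed revisions. Next I start the local latent service for a short health + POST check to confirm that it returns generationControl.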
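Why an all-negative label set makes AUC uninformative can be shown with a small sketch. This is not the evaluate function from train_reranker.py (its body is not shown in this log); it is a hypothetical pairwise AUC helper that returns None when one class is missing, instead of reporting a number such as 0.

```python
from typing import Optional, Sequence


def pairwise_auc(scores: Sequence[float], labels: Sequence[int]) -> Optional[float]:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Returns None when either class is absent, because no pair exists to compare.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        return None  # undefined: e.g. a training set with only label 0
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))
```

On a corpus where every label is 0, pairwise_auc returns None, which makes the missing positive class explicit in an eval summary rather than hiding it behind a score.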

• Running Get-ChildItem -Recurse -File training\latent-mvp\artifacts | Select-Object FullName,Length,LastWriteTime

• Running Get-Content training\latent-mvp\artifacts\reranker-v1\eval-summary.json

• Ran Get-Content training\latent-mvp\artifacts\controller-v1\controller.json └ { "version": "latent-controller.v1", … +8 lines (ctrl + t to view transcript) "valMae": 5.960464477539063e-08 }

• Ran Get-Content training\latent-mvp\artifacts\reranker-v1\eval-summary.json 2026/5/2…


• I will start the latent service temporarily as a background process, run only a health check and one minimal rerank POST, then shut it down. No full generation smoke test.

• Ran $root = (Get-Location).Path; $out = Join-Path $root '.runtime\latent-reranker-smoke.log'; New-Item -ItemType │ Directory -Force -Path (Split-Path $out) | Out-Null; $p = Start-Process -FilePath python -ArgumentList │ @('training/latent-mvp/serve_reranker.py','--model','training/latent-mvp/artifacts/reranker-v1','--controller','tr │ … +8 lines └ { "ok": true, … +29 lines (ctrl + t to view transcript) | This command cannot be run because "RedirectStandardOutput" and "RedirectStandardError" are same. Give different inputs and Run your command again.

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

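• The latent service smoke test returned a real generationControl, so this chain can already produce styleWeight/factWeight. There is one small PowerShell flaw: I pointed stdout and stderr at the same log file, and Start-Process reported a non-terminating error. The request actually hit a service that was already available on local port 8765. I will check the process on that port so that no background process is left behind.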
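The same start, probe, stop pattern can be sketched in Python for comparison. This is an illustration under assumptions, not the command used in this session: background_process is a hypothetical helper, and the ".out.log" / ".err.log" naming is only modeled on the log paths that appear later in this transcript. Each stream gets its own file, and the child is stopped even if the probe raises.

```python
import subprocess
import sys
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, List


@contextmanager
def background_process(args: List[str], log_dir: Path, name: str) -> Iterator[subprocess.Popen]:
    """Run a command in the background with separate stdout/stderr log files."""
    log_dir.mkdir(parents=True, exist_ok=True)
    out_path = log_dir / f"{name}.out.log"
    err_path = log_dir / f"{name}.err.log"
    with out_path.open("wb") as out_file, err_path.open("wb") as err_file:
        proc = subprocess.Popen(args, stdout=out_file, stderr=err_file)
        try:
            yield proc
        finally:
            # Stop the child if it is still running so nothing is left behind.
            if proc.poll() is None:
                proc.terminate()
                try:
                    proc.wait(timeout=10)
                except subprocess.TimeoutExpired:
                    proc.kill()
                    proc.wait()


if __name__ == "__main__":
    # Placeholder child process; a real run would start the service here.
    with background_process([sys.executable, "-c", "print('ready')"], Path(".runtime"), "demo") as p:
        p.wait(timeout=30)
```

Inside the with block the caller would issue the health and rerank requests; leaving the block terminates the child either way.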

• Running Get-NetTCPConnection -LocalPort 8765 -ErrorAction SilentlyContinue | Select-Object │ LocalAddress,LocalPort,State,OwningProcess; Get-Process -Id (Get-NetTCPConnection -LocalPort 8765 -ErrorAction │ SilentlyContinue | Select-Object -ExpandProperty OwningProcess -Unique) -ErrorAction SilentlyContinue | │ … +1 lines

• Running git -C DataBase status --short -- apps/gateway/src/lib/latent-rerank-client.ts apps/gateway/src/lib/ │ latent-rerank-client.test.ts; git -C ContentBase status --short -- product/novel/app/article/capability.ts; │ git status --short -- training/latent-mvp/README.md training/latent-mvp/requirements.txt training/latent-mvp/ │ … +1 lines

• Ran git -C DataBase diff -- apps/gateway/src/lib/latent-rerank-client.ts apps/gateway/src/lib/ │ latent-rerank-client.test.ts └ (no output)

• Ran git -C DataBase status --short -- apps/gateway/src/lib/latent-rerank-client.ts apps/gateway/src/lib/

• Ran Get-NetTCPConnection -LocalPort 8765 -ErrorAction SilentlyContinue | Select-Object LocalAddress,LocalPort,State,OwningProcess; Get-Process -Id (Get-NetTCPConnection -LocalPort 8765 -ErrorAction SilentlyContinue | Select-Object -ExpandProperty OwningProcess -Unique) -ErrorAction SilentlyContinue | …

• … started. I will only read the command line to confirm whether it is the latent reranker, and will not kill an unknown process outright.

• Ran Get-CimInstance Win32_Process -Filter "ProcessId=25964" | Select-Object ProcessId,CommandLine | Format-List └ ProcessId : 25964 … +4 lines (ctrl + t to view transcript) .1 --port 8765

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

• Ran $root = (Get-Location).Path; $out = Join-Path $root '.runtime\latent-reranker-8766.out.log'; $err = Join-Path │ $root '.runtime\latent-reranker-8766.err.log'; New-Item -ItemType Directory -Force -Path (Split-Path $out) | │ Out-Null; $p = Start-Process -FilePath python -ArgumentList │ … +12 lines └ { "ok": true,

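• The artifact-backed latent service passed health and /rerank, and the temporary 8766 process was shut down after verification. Now I update the plan status and do one last focused verification summary.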

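• Updated Plan
  └ ✔ Locate the existing latent/rerank production and consumption chain
    ✔ Make DataBase emit a real generationControl
    ✔ Make the ContentBase writer rely more on writing_state and reduce rule suppression
    □ Add minimal verification and run focused checks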

• Running pnpm exec tsc -p tsconfig.json --noEmit

• Running Get-NetTCPConnection -LocalPort 8766 -ErrorAction SilentlyContinue | Select-Object │ LocalAddress,LocalPort,State,OwningProcess

• Running node --test -r ts-node/register/transpile-only tests/generation-kernel.test.ts

• Ran node --test --import tsx src/lib/latent-rerank-client.test.ts
  └ ▶ resolveLatentRerankEndpoint
      ✔ accepts either service root or explicit /rerank endpoint (0.7543ms)
    … +7 lines
    ℹ todo 0
    ℹ duration_ms 224.5005

  └ ✔ builds structured author state without lane labels (1.9195ms)
    … +16 lines

• Ran Get-NetTCPConnection -LocalPort 8766 -ErrorAction SilentlyContinue | Select-Object LocalAddress,LocalPort,State,OwningProcess

• Ran pnpm exec tsc -p tsconfig.json --noEmit └ (no output)

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

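• The final round of focused checks passed: the Gateway typecheck, the new URL test, and the ContentBase generation-kernel tests all pass. One port state still looks odd: 8766 shows as held by the existing 8765 python process. I will confirm this read-only and will not kill the user's existing service.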
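A read-only way to ask "does anything still accept connections on this port?" is a plain TCP connect, which never touches the owning process. The sketch below is a hypothetical helper, not part of this repository, and it only answers the listening question; it does not explain the entries that Get-NetTCPConnection reports.

```python
import socket


def is_listening(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connect to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    for port in (8765, 8766):
        print(port, is_listening("127.0.0.1", port))
```

If 8766 reports False while 8765 reports True, the temporary service is gone and only the pre-existing one is still serving.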

• Running Get-NetTCPConnection -LocalPort 8765,8766 -ErrorAction SilentlyContinue | Sort-Object LocalPort,State | │ Select-Object LocalAddress,LocalPort,RemoteAddress,RemotePort,State,OwningProcess

• Running Get-CimInstance Win32_Process -Filter "ProcessId=25964" | Select-Object ProcessId,CommandLine | Format-List

• Ran git -C DataBase status --short -- apps/gateway/src/lib/latent-rerank-client.ts apps/gateway/src/lib/ │ latent-rerank-client.test.ts; git -C ContentBase status --short -- product/novel/app/article/capability.ts; git │ status --short -- training/latent-mvp/README.md training/latent-mvp/requirements.txt training/latent-mvp/ │ … +2 lines └ ?? apps/gateway/src/lib/latent-rerank-client.test.ts ?? apps/gateway/src/lib/latent-rerank-client.ts … +4 lines (ctrl + t to view transcript) ?? training/latent-mvp/requirements.txt ?? training/latent-mvp/train_reranker.py

• Ran Get-CimInstance Win32_Process -Filter "ProcessId=25964" | Select-Object ProcessId,CommandLine | Format-List └ ProcessId : 25964 … +4 lines (ctrl + t to view transcript) .1 --port 8765

• Ran Get-NetTCPConnection -LocalPort 8765,8766 -ErrorAction SilentlyContinue | Sort-Object LocalPort,State | │ Select-Object LocalAddress,LocalPort,RemoteAddress,RemotePort,State,OwningProcess └ LocalAddress : 127.0.0.1 … +25 lines (ctrl + t to view transcript) OwningProcess : 25964

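• Updated Plan
  └ ✔ Locate the existing latent/rerank production and consumption chain
    ✔ Make DataBase emit a real generationControl
    ✔ Make the ContentBase writer rely more on writing_state and reduce rule suppression
    ✔ Add minimal verification and run focused checks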

────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

• Work has started along the latent generation route, with no new contracts, presets, or routers added.

What was done:

  • DataBase latent client: DataBase/apps/gateway/src/lib/latent-rerank-client.ts

    • DATABASE_EVIDENCE_LATENT_RERANK_URL now accepts either the service root or the /rerank endpoint, and is normalized to /rerank automatically.
    • When rerank succeeds but returns no generationControl, the trace warning states explicitly that a controller artifact is required.
  • New focused test: DataBase/apps/gateway/src/lib/latent-rerank-client.test.ts

  • ContentBase writer prompt: ContentBase/product/novel/app/article/capability.ts

    • Removed the second, duplicated injection of the "writing tendency / renderWritingBriefPromptSummary" block in the user prompt, keeping writing_state at the top of the "writing brief" so that rule noise does not drown out the latent state.
  • latent training entry point: training/latent-mvp/train_reranker.py

    • Fixed to consume the existing exported corpus; it no longer throws away the whole training set because topic is empty.
    • Switched to a stable low-level BCE fine-tune, avoiding the current label compatibility problem in CrossEncoder.fit.
  • latent runtime dependencies: training/latent-mvp/requirements.txt

    • Added accelerate>=1.1.0.
  • Run instructions: training/latent-mvp/README.md

    • The start command now includes --controller .../controller.json.
  • Generated artifacts:

    • training/latent-mvp/artifacts/controller-v1/controller.json
    • training/latent-mvp/artifacts/reranker-v1/
  • DataBase/apps/gateway typecheck passes.

  • latent-rerank-client.test.ts passes.

  • ContentBase/product/novel/tests/generation-kernel.test.ts passes.

  • The local artifact service passed /health and /rerank verification and returned real values for:

    • generationControl.styleWeight
    • generationControl.factWeight
    • modelVersion: training/latent-mvp/artifacts/reranker-v1+latent-controller.v1
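The endpoint normalization described for latent-rerank-client.ts in the summary above could look roughly like this. The real client is TypeScript and its code is not shown in this log, so the Python sketch below is only an assumption about the behavior (service root or /rerank both resolve to /rerank); the function name resolve_rerank_endpoint and the trailing-slash handling are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def resolve_rerank_endpoint(raw_url: str) -> str:
    """Accept a service root or an explicit /rerank URL and return the /rerank URL."""
    parts = urlsplit(raw_url.strip())
    path = parts.path.rstrip("/")  # treat "/rerank/" and "/rerank" alike
    if not path.endswith("/rerank"):
        path = path + "/rerank"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    for url in ("http://127.0.0.1:8765", "http://127.0.0.1:8765/rerank"):
        print(resolve_rerank_endpoint(url))
```

Normalizing once at configuration time keeps the rest of the client free of URL special cases, which is presumably why the focused test targets exactly this function.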

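Limitations and gaps:

  • This round did not run a full generation smoke test.
  • The full ContentBase tsc --noEmit still fails, but the failures are pre-existing type debt, not caused by this latent change.
  • The current training corpus is weak and almost entirely negative samples. The artifacts prove the learned latent chain end to end, but they do not mean quality is good yet. The next step should be to export positive and negative preference samples from revision/acceptance data and improve the reranker/controller, rather than adding more rules.
  • A latent service was already running on 127.0.0.1:8765 on this machine, and I did not kill it. It currently uses the base cross-encoder + controller; the artifact version was verified temporarily on 8766 and then stopped.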

─ Worked for 14m 18s ───────────────────────────────────────────────────────────────────────────────────────────────────

› 兴亡的世界史 and The Temple of the Golden Pavilion (金阁寺) have plenty of text, right? Those are all good samples.

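• Yes, and they are the good samples that most deserve to be used right now.

The problem is that they are not currently used as latent positive samples. The current training/latent-mvp/data/*.jsonl holds only 30 samples, and almost all of them are label: 0, so the reranker/controller trained just now only proves that the chain works. It has not learned what is good about 兴亡 or The Temple of the Golden Pavilion.

The right approach is not to add contracts, but to turn these texts into three kinds of latent assets:

  1. Style exemplars: The Temple of the Golden Pavilion covers syntactic rhythm, progression through objects and images, psychological shadow, and cold lyricism. It feeds only the style vector / exemplar retrieval; copying plot or original sentences is forbidden.
  2. Lexicon / historical reasoning exemplars: 兴亡的世界史 covers institutional vocabulary, long-duration judgment, a cool historical tone, and the way material is advanced. It should shape word choice and the rhythm of argument, not dominate the tone of the whole piece.
  3. Contrastive negatives: platform-speak, SEO headlines, civil-service-essay sentences, unanchored facts, and aimless tours of material serve as negative samples. That is how the reranker learns that "an institutional judgment in the manner of 兴亡" is good and "a headline in the manner of a public-account political commentary" is bad.

So the answer is:

There is enough text. What is missing is cutting these texts into a latent training corpus and giving each its correct role.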

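The next step is not to keep writing rules, but to change the exporter and the training set:

  • Cut style-positive chunks from The Temple of the Golden Pavilion.
  • Cut lexicon/fact-reasoning positive chunks from 兴亡的世界史.
  • Extract bad-surface negatives from failed generation reports.
  • Generate query-chunk pairs with label: 1/0.
  • Retrain the reranker/controller.
  • Make the Gateway's latent rerank actually rank these good samples ahead of the writer.

That is latent generation, not contract stacking.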
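The chunk-and-label steps listed above can be sketched as follows. This is a minimal illustration under assumptions, not the exporter's actual code: the fixed character window, the role field, and the function names chunk_text and build_pairs are all hypothetical. Only the topic/text/label field names are taken from the corpus rows shown earlier in this log.

```python
from typing import Dict, List


def chunk_text(text: str, size: int = 400) -> List[str]:
    """Split text into fixed-size character windows, dropping empty pieces."""
    text = text.strip()
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    return [chunk for chunk in chunks if chunk.strip()]


def build_pairs(query: str, positives: Dict[str, str], negatives: List[str], size: int = 400) -> List[Dict]:
    """Build query-chunk rows: label 1 for exemplar chunks, label 0 for negatives.

    `positives` maps a role name (e.g. "style", "lexicon") to source text.
    """
    rows: List[Dict] = []
    for role, text in positives.items():
        for chunk in chunk_text(text, size):
            rows.append({"topic": query, "text": chunk, "role": role, "label": 1})
    for text in negatives:
        for chunk in chunk_text(text, size):
            rows.append({"topic": query, "text": chunk, "role": "negative", "label": 0})
    return rows
```

Rows in this shape carry both classes, so a training entry point that reads topic, text, and label would finally see positive pairs. A real exporter would likely split on sentence or paragraph boundaries rather than a fixed character window.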