At a Glance
Production deploys to green: 4
Reproducible in dev: Never
The actual fix: gnu → musl
Time to resolve: ~5 hours
What Broke
OnlineMihna had just shipped CV autofill: a jobseeker uploads their CV — PDF or DOCX — and the app extracts the text to pre-fill their profile. Two days after it went live, one half of it was quietly broken in production.
Every PDF upload failed. DOCX uploads worked fine. And the cruel part: on a developer's machine, every PDF parsed perfectly. The bug existed only on the live Railway server, so the local checks all said "green" while real users hit an error on every PDF.
Worse, the route returned a generic 500 internal_error with no signal about why. The first job wasn't to fix it — it was to make it say what was wrong. From there it took four separate production deploys in one afternoon to fully resolve, because each fix removed one layer and exposed the next.
Root Cause
The feature itself was fine. The failure was a chain of bundler and runtime assumptions that are all invisible in dev mode and only exist in the production build on the deployment host.
Layer 1 — webpack bundling. pdf-parse@2 pulls in pdfjs-dist, which needs a worker and a native canvas. Next.js 14 bundles server code by default, and from the bundled location pdfjs-dist couldn't resolve its worker — it threw ReferenceError: DOMMatrix is not defined at module load. The fix is experimental.serverComponentsExternalPackages: ["pdf-parse"], which tells Next to leave pdf-parse as a plain runtime require from node_modules. That fixed npm run dev. Production still failed.
Layer 2 — file tracing. Production builds with output: "standalone", where Next's file tracer (@vercel/nft) copies only the dependencies it can statically see into .next/standalone. But pdfjs-dist loads canvas through a guarded try { require("@napi-rs/canvas") } catch — a dynamic require NFT does not follow. So @napi-rs/canvas was never copied, its DOMMatrix polyfill silently failed, and the same ReferenceError returned — but only in the standalone prod build. The fix is experimental.outputFileTracingIncludes, force-copying canvas for the two CV routes.
Layer 3 — the wrong libc (the real one). That include named the glibc binary, @napi-rs/canvas-linux-x64-gnu, on the assumption that Railway ran Debian. It does not. The repo's own Dockerfile pins node:20-alpine for every stage — musl libc, not glibc. @napi-rs/canvas's runtime selector checks isMusl() and on Alpine tries only the -musl binary, with no fallback to -gnu. Worse, the gnu binary that got bundled couldn't even dlopen(): the runner stage never installed glibc-compat, so it failed with Error loading shared library ld-linux-x86-64.so.2: No such file or directory. The fix: swap -gnu → -musl, and pin @napi-rs/canvas-linux-x64-musl as a direct optionalDependency so a multi-stage Docker build can't drop it during optional-dependency traversal.
Layer 4 — the worker dir. One gap remained: pdfjs-dist's own legacy/build directory wasn't guaranteed in the trace either. A final commit added ./node_modules/pdfjs-dist/legacy/build/**/* to the includes.
The through-line: externalizing a package from the webpack bundle is not the same as tracing its dynamic requires into the standalone output, and the binary you trace has to match the libc of the host you actually deploy to. Every one of those four facts was checkable — and none of them showed up in dev.
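The difference between externalizing and tracing is easier to see with both settings side by side. The sketch below is one way the final configuration could be assembled, not the project's actual next.config.js: the helper names buildNextConfig and traceIncludesFor and the Libc type are hypothetical, while the config keys, route paths, package names, and globs are the ones named above.

```typescript
// Sketch only: buildNextConfig, traceIncludesFor and Libc are hypothetical
// helpers, not part of Next.js. The keys are the Next.js 14 options named above.
type Libc = "musl" | "gnu";

const CV_ROUTES: string[] = [
  "/api/jobseeker/cv/parse",
  "/api/jobseeker/cv/upload-existing",
];

function traceIncludesFor(libc: Libc): string[] {
  return [
    // JS wrapper that picks the platform binary at runtime.
    "./node_modules/@napi-rs/canvas/**/*",
    // The native binary itself; its suffix must match the host's libc.
    `./node_modules/@napi-rs/canvas-linux-x64-${libc}/**/*`,
    // pdfjs-dist's legacy build directory (Layer 4).
    "./node_modules/pdfjs-dist/legacy/build/**/*",
  ];
}

function buildNextConfig(libc: Libc) {
  const outputFileTracingIncludes: Record<string, string[]> = {};
  for (const route of CV_ROUTES) {
    outputFileTracingIncludes[route] = traceIncludesFor(libc);
  }
  return {
    output: "standalone" as const,
    experimental: {
      // Layer 1: keep pdf-parse out of the webpack bundle.
      serverComponentsExternalPackages: ["pdf-parse"],
      // Layers 2 to 4: copy what the tracer cannot see into .next/standalone.
      outputFileTracingIncludes,
    },
  };
}

// node:20-alpine is a musl host, so the config is built with "musl".
const nextConfig = buildNextConfig("musl");
```

In a real next.config.js the resulting object would be exported from the file. Deriving both routes' lists from one function keeps the libc suffix in a single place, so a future base-image change is a one-argument edit.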
Timeline
Instrument the failure so it's legible
Wrapped the PDFParse call to rethrow pdf_extract_failed with the original cause, mapped it to 422, and logged { message, cause, stack }. The logged cause, ReferenceError: DOMMatrix is not defined, became the thread for every layer that followed.
Act 1: externalize pdf-parse (PR #198)
Added serverComponentsExternalPackages: ["pdf-parse"]. npm run dev now parsed PDFs cleanly. Production, on a standalone build, still returned 422.
Act 2: force-trace @napi-rs/canvas (PR #199)
Added outputFileTracingIncludes so NFT would copy canvas into the standalone bundle, but listed the -gnu (glibc) binary. Shipped to Railway. Still 422.
Act 3: gnu → musl after reading the OS (PR #200)
Confirmed Alpine/musl via cat /etc/os-release + ldd --version on the live runner, swapped the trace include to -musl, and pinned @napi-rs/canvas-linux-x64-musl@0.1.80 as a direct optional dependency.
Act 4: trace pdfjs-dist's build dir; PDFs parse
A final commit added ./node_modules/pdfjs-dist/legacy/build/**/* to both routes' includes, closing the last tracing gap. Production PDF uploads returned 200 with extracted text; the polyfill and [cv/parse] log lines were gone.
The Fix
There was no single fix. There were four, each shipped to production before the next was even visible. In order:
Step 0 — make the failure legible. Before any real fix, instrument it. The original code let any throw inside PDFParse bubble up as a generic 500 internal_error. Rethrowing as pdf_extract_failed with the original cause (mapped to 422 in the route) — and guarding destroy() in its own try/catch so a throwing cleanup can't mask the rethrow — turned an opaque error into a logged cause. That cause, ReferenceError: DOMMatrix is not defined, is what made every subsequent layer diagnosable.
Before:
    const { PDFParse } = await import("pdf-parse");
    const parser = new PDFParse({ data: buffer });
    try {
      const result = await parser.getText();
      return result.text || "";
    } finally {
      await parser.destroy();
    }
After:
    let parser: {
      getText(): Promise<{ text?: string }>;
      destroy(): Promise<void>;
    } | null = null;
    try {
      const { PDFParse } = await import("pdf-parse");
      parser = new PDFParse({ data: buffer });
      const result = await parser.getText();
      return result.text || "";
    } catch (e) {
      throw new Error("pdf_extract_failed", { cause: e });
    } finally {
      if (parser) {
        try {
          await parser.destroy();
        } catch {
          // reason: destroy() failures must not mask the catch-block's
          // pdf_extract_failed rethrow; otherwise the route falls through
          // to internal_error and we're back to the bug this PR fixes.
        }
      }
    }
Act 1: externalize pdf-parse from the webpack bundle. This tells Next.js to leave it as a runtime require from node_modules so its pdfjs-dist worker resolves. This fixed npm run dev and any non-standalone build. Production, which builds with output: "standalone", still returned 422.
Before:
      ],
    },
    async headers() {
After:
      ],
    },
    experimental: {
      serverComponentsExternalPackages: ["pdf-parse"],
    },
    async headers() {
Act 2 + 3: the punchline. Act 2 added outputFileTracingIncludes to force NFT to copy @napi-rs/canvas into the standalone bundle (PR #199, shipped). But it traced the glibc binary, -gnu, assuming Railway ran Debian. Railway runs Alpine (musl), pinned in the repo's own Dockerfile. This one-word change, -gnu → -musl, is the actual fix, made only after running cat /etc/os-release and ldd --version on the live runner. The wrapper tries only the musl binary on Alpine, with no fallback, and the gnu binary couldn't even dlopen() for lack of a glibc loader.
Before:
    outputFileTracingIncludes: {
      "/api/jobseeker/cv/parse": [
        "./node_modules/@napi-rs/canvas/**/*",
        "./node_modules/@napi-rs/canvas-linux-x64-gnu/**/*",
      ],
      "/api/jobseeker/cv/upload-existing": [
        "./node_modules/@napi-rs/canvas/**/*",
        "./node_modules/@napi-rs/canvas-linux-x64-gnu/**/*",
      ],
    },
After:
    outputFileTracingIncludes: {
      "/api/jobseeker/cv/parse": [
        "./node_modules/@napi-rs/canvas/**/*",
        "./node_modules/@napi-rs/canvas-linux-x64-musl/**/*",
      ],
      "/api/jobseeker/cv/upload-existing": [
        "./node_modules/@napi-rs/canvas/**/*",
        "./node_modules/@napi-rs/canvas-linux-x64-musl/**/*",
      ],
    },
Act 3, defensive half. The musl binary is already a transitive optional dependency of @napi-rs/canvas, but multi-stage Docker builds can drop platform variants during optional-dependency traversal. Pinning it directly as an optionalDependency makes it explicit at the root of the resolution graph: installed on Alpine x64, silently skipped on the developer's Windows/macOS host (no EBADPLATFORM).
Before:
        "typescript": "^5.5.4"
      }
    }
After:
        "typescript": "^5.5.4"
      },
      "optionalDependencies": {
        "@napi-rs/canvas-linux-x64-musl": "0.1.80"
      }
    }
Net change across the whole incident: one defensive try/catch wrap, a handful of lines in next.config.js, and one optionalDependencies pin. The PDF-parsing logic itself was never touched; it was correct from day one. The fourth and final commit landed 14 minutes after the musl PR merged, adding ./node_modules/pdfjs-dist/legacy/build/**/* to both routes' trace includes, which closed the last gap and got production PDF uploads to 200.
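Step 0 says the route maps pdf_extract_failed to 422 and logs { message, cause, stack }, but the route-side code is not shown here. A minimal sketch of what that mapping might look like follows. The helper names toErrorResponse and describeCause and the { error: string } body shape are assumptions for illustration, not the project's actual handler.

```typescript
// Hypothetical helpers: names and response shape are assumptions.
interface ErrorResponse {
  status: number;
  body: { error: string };
  log: { message: string; cause: string | null; stack: string | null };
}

// Render the original cause as "Name: message" so it is searchable in logs.
function describeCause(cause: unknown): string | null {
  if (cause === undefined || cause === null) return null;
  if (cause instanceof Error) return `${cause.name}: ${cause.message}`;
  return String(cause);
}

function toErrorResponse(e: unknown): ErrorResponse {
  const err = e instanceof Error ? e : new Error(String(e));
  // Read `cause` through an intersection type so this also compiles
  // when the ES2022 Error typings are not enabled.
  const cause = (err as Error & { cause?: unknown }).cause;
  const known = err.message === "pdf_extract_failed";
  return {
    // Known extraction failures become 422; anything else stays a 500.
    status: known ? 422 : 500,
    // The client only ever sees a short error code.
    body: { error: known ? "pdf_extract_failed" : "internal_error" },
    // The full detail goes to the server log only.
    log: {
      message: err.message,
      cause: describeCause(cause),
      stack: err.stack || null,
    },
  };
}
```

A route's catch block would log the log field and send status and body to the client. Keeping the cause out of the response body means the diagnostic detail reaches the deploy logs without leaking internals to users.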
Verification
Because the bug could not be reproduced in dev, every verification step had to run against the actual Railway runner — not localhost.
Read the runner's OS instead of assuming it
railway run -- cat /etc/os-release returned Alpine Linux v3.23.4, and ldd --version returned musl libc 1.2.5. This is the single command pair that proved the -gnu binary was the wrong choice, and the same fact was sitting in the committed Dockerfile (FROM node:20-alpine) the entire time.
Prove the gnu binary genuinely couldn't load
On the runner, node -e "require('@napi-rs/canvas-linux-x64-gnu')" failed with Error loading shared library ld-linux-x86-64.so.2: No such file or directory, confirming the glibc loader simply does not exist on the Alpine image, so even a correctly-traced gnu binary would never have worked.
Confirm the musl binary landed and loads
After the swap deployed: ls /app/node_modules/@napi-rs/canvas-linux-x64-musl/ showed the .node binary in the standalone bundle, and require('@napi-rs/canvas') returned the module instead of throwing. The Cannot polyfill DOMMatrix warnings disappeared from the deploy logs.
End-to-end PDF upload in production
Uploading an English and an Arabic PDF through /jobseeker/info?autofill=1 returned 200 with extracted text; DOCX still returned 200 (regression check); and Railway logs no longer contained Cannot polyfill DOMMatrix or the [cv/parse] error line.
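The first verification step can be made mechanical: feed the text printed by the two commands into a small classifier and derive which canvas binary the host needs. This is a sketch under stated assumptions. The function names are hypothetical, the string matching is a heuristic rather than an official detection method, and the exact text a host prints (and whether it arrives on stdout or stderr) may differ.

```typescript
type HostLibc = "musl" | "glibc" | "unknown";

// Classify libc from whatever `ldd --version` printed (heuristic match).
function libcFromLdd(lddOutput: string): HostLibc {
  const text = lddOutput.toLowerCase();
  if (text.indexOf("musl") !== -1) return "musl";
  if (text.indexOf("glibc") !== -1 || text.indexOf("gnu libc") !== -1) {
    return "glibc";
  }
  return "unknown";
}

// Read the ID field from /etc/os-release (KEY=value lines, optional quotes).
function distroId(osRelease: string): string | null {
  for (const line of osRelease.split("\n")) {
    const m = /^ID=("?)([^"\n]*)\1\s*$/.exec(line.trim());
    if (m) return m[2] || null;
  }
  return null;
}

// Map the detected libc to the x64 canvas binary package named in this post.
function expectedCanvasPackage(libc: HostLibc): string | null {
  if (libc === "musl") return "@napi-rs/canvas-linux-x64-musl";
  if (libc === "glibc") return "@napi-rs/canvas-linux-x64-gnu";
  return null; // refuse to guess when the output is unrecognized
}
```

Returning null for unrecognized output is deliberate: the incident came from guessing the OS, so the sketch fails closed instead of defaulting to one libc.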
Lessons Learned
I shipped the wrong native binary because I guessed the production OS instead of reading the Dockerfile I'd already committed.
- Production-only bugs are a category, not bad luck. output: "standalone", NFT tracing, native binaries, and libc variants are each an assumption dev mode never exercises; dev uses real node_modules with no tracing. Expect this class of failure whenever a native dependency meets a standalone build.
- Instrument before you fix. The first commit didn't fix anything; it just rethrew pdf_extract_failed with the original cause and logged it. That single change turned four blind layers into four diagnosable ones; without it, every layer was an opaque 500.
- Verify environment claims from the host or a checked-in source of truth. The gnu→musl deploy was burned entirely on an unverified "Railway = Debian." Both cat /etc/os-release and the repo's own Dockerfile said Alpine.
- Externalizing is not tracing. serverComponentsExternalPackages tells webpack to leave a package alone; outputFileTracingIncludes tells NFT to copy it into the standalone bundle. Native deps loaded via try { require } catch need the second, with the libc variant that matches the host.
- For environment-only bugs, the deploy is the test. We later added a plan-review rule: any claim about the deploy OS, libc, or base image must cite a verification command or a committed file like the Dockerfile.
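The plan-review rule in the last bullet could also be backed by a small automated check that compares the committed Dockerfile with the traced binaries. The sketch below is an illustration under stated assumptions, not something the project is known to run. All function names are hypothetical, and "an image reference containing alpine means musl, anything else means glibc" is a simplifying heuristic.

```typescript
// Resolve the image of the last FROM stage, following `AS` stage aliases.
function finalStageImage(dockerfile: string): string | null {
  const stages: Record<string, string> = {};
  let last: string | null = null;
  for (const raw of dockerfile.split("\n")) {
    const m = /^\s*FROM\s+(?:--platform=\S+\s+)?(\S+)(?:\s+AS\s+(\S+))?/i.exec(raw);
    if (!m) continue;
    const ref = m[1] || "";
    const key = ref.toLowerCase();
    // `FROM base AS runner` inherits the image of the earlier `base` stage.
    const image = Object.prototype.hasOwnProperty.call(stages, key)
      ? stages[key]
      : ref;
    if (m[2]) stages[m[2].toLowerCase()] = image;
    last = image; // the last stage is the one that ships
  }
  return last;
}

// Collect the libc suffixes of every traced @napi-rs/canvas x64 binary.
function tracedCanvasLibcs(includes: string[]): string[] {
  const found: string[] = [];
  for (const glob of includes) {
    const m = /@napi-rs\/canvas-linux-x64-(musl|gnu)\//.exec(glob);
    if (m && m[1]) found.push(m[1]);
  }
  return found;
}

// True only when at least one binary is traced and all match the base image.
function traceMatchesDockerfile(dockerfile: string, includes: string[]): boolean {
  const image = finalStageImage(dockerfile);
  if (image === null) return false;
  // Heuristic (assumption): "alpine" in the image reference implies musl.
  const expected = image.toLowerCase().indexOf("alpine") !== -1 ? "musl" : "gnu";
  const traced = tracedCanvasLibcs(includes);
  return traced.length > 0 && traced.every((libc) => libc === expected);
}
```

Run against this incident's inputs, a Dockerfile whose stages are all node:20-alpine combined with a -gnu include would fail the check, which is the mismatch that cost a deploy.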