All notable changes to PDFOxide are documented here.
- Fix `x86_64-pc-windows-gnu` native-lib build failing the v0.3.31 release. The new `scripts/shrink-staticlib.sh` introduced in v0.3.31 ran `objcopy --strip-debug` on every archive member. The MinGW cross-compile toolchain emits split-debug `.dwo` members that contain only DWARF sections; after stripping those sections the member has no sections left and `objcopy` aborted the whole archive with `'...rcgu.dwo' has no sections`, failing the job that produces the Go Windows-x64 FFI tarball. Fix: drop `.dwo` archive members via `ar d` before invoking `objcopy`. No functional change to Rust, Python, Node, WASM, or C# artifacts — those built and uploaded successfully in v0.3.31; this release exists solely to unblock the Windows-x64 Go install path.
- `extract_text(n)` returned page 0's content for every `n` on PDFs that share one Form XObject across all pages (#346, B1). Certain producers (notably ExpertPdf) emit one big Form XObject containing every page's text and give each page's content stream a different CTM translation to clip into its slice. Two cache/filter bugs stacked: (1) `xobject_spans_cache` keyed spans by `ObjectRef` only and returned CTM-transformed page-0 coordinates to every subsequent page; (2) even once the cache was CTM-gated, the extractor had no awareness of the content-stream `W n` clipping operator, so every page emitted the whole stack at distinct but out-of-bounds Y coordinates. Fix: cache only when the caller CTM is identity, and post-filter extracted spans by the page's MediaBox (with a 2pt bleed tolerance). Added `Matrix::is_identity()`. Regression test at `tests/test_b1_shared_form_xobject_per_page_ctm.rs`. Largest single-fixture improvement in this release — `nougat_005.pdf` TF1 0.254 → 0.901.
- Running-artifact detector stripped the cover-page title when it happened to repeat as the per-page running header (B3). Reports like "Fiscal Year 2010 Appropriations Act" or "University of Oklahoma 2009" appear at the top of every page and are the document title on page 1. The detector classified them as chrome and removed them from page 1 too. Fix: track first-seen page per signature (across all pages, not just body-content pages), skip the artifact marking on that first occurrence. Covers the edge case where the cover page is all-chrome and would otherwise be skipped by the body-content gate. Regression test at `tests/test_b3_first_occurrence_of_running_header_kept.rs`.
- Multi-column reading order via XY-cut (B4). `extract_text` used a row-aware Y-band sort that interleaved left/right columns on newspaper / academic layouts: `LeftCol-row1 RightCol-row1 LeftCol-row2 …`. Added `is_multi_column_page` heuristic (body-span X-center histogram with vertical-overlap confirmation, a 15% chrome-band exclusion so header banners don't trip the detector, 25%-per-side minimum column mass) that routes detected multi-column pages through the existing `XYCutStrategy`. Single-column pages stay on the cheap row-aware path. Regression test synthesises a 2×20 interleaved grid at `tests/test_b4_two_column_reading_order.rs`.
- Stroke+fill labels no longer produce doubled words (B7). Map/poster PDFs render every label twice — once stroked for the outline, once filled — and both passes landed as distinct `TextSpan`s at essentially the same CTM. The downstream merge step concatenated them, producing `"EverestEverest"`, `"CentralCentral"`. New `dedup_stroke_fill_overlap` runs before existing positional dedup: bucket by lowercased text, drop any later span whose bbox overlaps an earlier same-text span by ≥ 70% IoU. Conservative thresholds (≥2-char minimum, using `.chars().count()` not `.len()` so non-ASCII glyphs are handled correctly). Regression test at `tests/test_b7_stroke_fill_dedup.rs`.
- Soft-hyphen line-break rejoin (B8a). Typographic hyphenation — `"scruti-\nneer"` for `"scrutineer"`, `"disinfec-\ntion"` for `"disinfection"` — previously preserved the hyphen and newline. Added `dehyphenate_line_breaks` to the plain-text cleanup pipeline: rewrites `<lowercase>-[ \t]*\n[ \t]*<lowercase>` → concatenation. Conservative on both sides (requires ASCII lowercase before and after) so compound hyphens (`"state-of-the-art"`), proper-noun fragments (`"co-\nWorker"`), and bullet markers stay intact.
- TrueType cmap format 0 parser (B9). Microsoft Office Word/Excel subset fonts (Calibri, Times New Roman) sometimes ship only a format-0 (legacy 1-byte Mac Roman) cmap. Previously these fonts bailed with "Unsupported cmap format: 0" and the font had no glyph→Unicode mapping, which cascaded into text extraction losing content from that font. Added `parse_cmap_format0` — reads the 6-byte header + 256 glyph IDs, maps byte codes 0x00–0x7F as ASCII pass-through and 0x80–0xFF through the full Mac Roman → Unicode table (so byte 0x8A correctly decodes to `ä`). Truncated glyph arrays surface as parse errors rather than silent zero-glyph output.
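The soft-hyphen rejoin rule from B8a (`<lowercase>-[ \t]*\n[ \t]*<lowercase>` becomes a plain concatenation) can be illustrated with a small hand-rolled scanner. This is a minimal sketch of the rule as stated in the entry, not the crate's `dehyphenate_line_breaks`; the function name `dehyphenate_sketch` is hypothetical.

```rust
/// Sketch of the B8a rule: join `<a-z>-` + optional blanks + `\n` +
/// optional blanks + `<a-z>` into one word. Hypothetical helper, not the
/// crate's actual implementation.
pub fn dehyphenate_sketch(input: &str) -> String {
    let chars: Vec<char> = input.chars().collect();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        // Candidate: a hyphen directly preceded by an ASCII lowercase letter.
        if c == '-' && i > 0 && chars[i - 1].is_ascii_lowercase() {
            let mut j = i + 1;
            // Optional spaces/tabs between the hyphen and the newline.
            while j < chars.len() && (chars[j] == ' ' || chars[j] == '\t') {
                j += 1;
            }
            if j < chars.len() && chars[j] == '\n' {
                j += 1;
                // Optional indentation on the continuation line.
                while j < chars.len() && (chars[j] == ' ' || chars[j] == '\t') {
                    j += 1;
                }
                if j < chars.len() && chars[j].is_ascii_lowercase() {
                    // Drop the hyphen, the newline and the blanks around it.
                    i = j;
                    continue;
                }
            }
        }
        out.push(c);
        i += 1;
    }
    out
}
```

Because both neighbours must be ASCII lowercase, `"co-\nWorker"` and in-line compounds such as `"state-of-the-art"` pass through unchanged, and a leading `- ` bullet is never a candidate because nothing lowercase precedes it.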
- `tools/benchmark-harness/` — TF1/SF1 extraction quality measurement crate (#320). New workspace member that computes TF1 (bag-of-words F1 on lowercase alphanumeric tokens) and SF1 (block-weighted structural F1 with LIS ordering penalty) against ground-truth markdown. Methodology mirrors Kreuzberg's harness so numbers are directly comparable. Includes engine adapters for `pdf_oxide` (in-process), `pdftotext` (poppler subprocess), and `pdfium` (gated behind `--features pdfium`); a consensus-baseline mode for corpora without manual ground truth; and a `diff` subcommand with regression gates (default: fail on mean TF1 drop > 0.5pp or per-fixture drop > 5pp). `scripts/fetch-fixtures.sh` clones Kreuzberg's Apache-2.0 fixture set without vendoring PDFs into our repo. Makefile targets: `make benchmark-fetch`, `make benchmark-run`, `make benchmark-compare`. 18 unit tests.

Cumulative impact on the 78-unique-fixture Kreuzberg corpus vs v0.3.30 baseline:
- TF1 mean: 0.919 → 0.930 (+1.1pp)
- TF1 p10 (hard tail): 0.776 → 0.849 (+7.3pp — tied with pdftotext)
- SF1 mean: 0.337 → 0.355 (+1.8pp) — pdf_oxide leads pdftotext by +10.8pp SF1 on this corpus
- Runtime: −42%
- Zero per-fixture TF1 regressions > 0.5pp
Six follow-up work items filed with precise reproducers: #363 (ToUnicode CID-miss on specific MS Office subset fonts), #364 (FlateDecode stream offset bug on MS Reporting Services PDFs), #365 (intra-word TJ space calibration), #366 (CI wiring for the harness), #367 (docling-parse adapter), #368 (markdown adapter for the pdf_oxide engine).
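For readers unfamiliar with the TF1 metric quoted above, the following is a rough sketch of a bag-of-words F1 over lowercase alphanumeric tokens. The tokenizer and the multiset-overlap counting are assumptions made for illustration; the harness may differ in details, and `tf1_sketch` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Split on anything that is not alphanumeric and lowercase each token.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Count how often each token occurs (the "bag").
fn bag(tokens: &[String]) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for t in tokens {
        *counts.entry(t.as_str()).or_insert(0) += 1;
    }
    counts
}

/// Bag-of-words F1 between extracted text and ground truth (assumed formula).
pub fn tf1_sketch(extracted: &str, ground_truth: &str) -> f64 {
    let ext = tokenize(extracted);
    let gt = tokenize(ground_truth);
    if ext.is_empty() && gt.is_empty() {
        return 1.0;
    }
    if ext.is_empty() || gt.is_empty() {
        return 0.0;
    }
    let ext_bag = bag(&ext);
    let gt_bag = bag(&gt);
    // Multiset intersection: each token counts min(extracted, truth) times.
    let overlap: usize = ext_bag
        .iter()
        .map(|(tok, &n)| n.min(gt_bag.get(tok).copied().unwrap_or(0)))
        .sum();
    if overlap == 0 {
        return 0.0;
    }
    let precision = overlap as f64 / ext.len() as f64;
    let recall = overlap as f64 / gt.len() as f64;
    2.0 * precision * recall / (precision + recall)
}
```

Under this definition, extra tokens lower precision and missing tokens lower recall, while token order is ignored entirely; ordering is what the separate SF1 metric penalises.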
- Rendering: `Page index N not found by scanning` on PDFs whose xref mis-flags page objects as free — when a producer emits a corrupted xref table with `f` entries pointing at real objects (common in several large regulatory PDFs in the wild), `load_object` previously resolved every such object to `Null` per §7.3.10. If the page objects were uncompressed the page tree traversal would bottom out in nulls; if they were packed into `/Type /ObjStm` the `get_page_by_scanning` fallback never reached the content at all. Two recovery paths now trigger before the `Null` fallback: (1) if the file body contains an `N G obj` marker for the supposedly-free id, load it from the scanned offset — the same mechanism already used for objects missing from the xref entirely; (2) if not found in the body, perform a one-time raw-pattern scan for every `/Type /ObjStm` in the file, parse each, and cache all contained objects (overwriting the earlier `Null` entries). The `get_page_by_scanning` fallback now unions `xref.all_object_numbers()` with the newly-cached ObjStm ids so pages whose xref slot says `f` but whose content lives inside an object stream are visible to the scanner; its heuristic second pass (page-shaped dicts without `/Type`) now also runs as a complement rather than only when pass 1 finds zero pages. Also unifies the `id <= 10` code path with the general path — previously low-numbered free-flagged objects hit a broken "fall through" branch that still ended up Null.
- Rendering: `Invalid object number in header: j` on PDFs whose xref offsets are off by a handful of bytes — the same SEBI-style PDF also carries a second corruption shape: `in_use=true` xref entries whose byte offsets point ~3 bytes BEFORE the real `N G obj` header (into the previous object's `endobj` tail). The existing `find_object_header_backwards` fallback only triggered when no `obj` keyword was found at all, not when the keyword parsed but the preceding tokens were junk like `j`. `load_uncompressed_object_impl` now catches the parse-as-number failure, re-queries `scan_for_object` for the same id, and if the scan recorded a different offset retries from there. With both fixes live, the report's 253-page PDF goes from 0 → 253 pages renderable (was 0 before the free-flag recovery, 200 with just that fix, 253 with the combined offset recovery). Regression tests in `tests/test_xref_free_flag_corruption.rs` cover both corruption shapes plus the clean-baseline and genuinely-free negative cases.
- Native Rust libraries are no longer committed to `go/lib/` — after landing the 63% staticlib shrink (below), the committed payload would still have been ~130 MB per release, accumulating indefinitely in git history. v0.3.31 instead publishes per-platform `pdf_oxide-go-ffi-<platform>.tar.gz` as GitHub Release assets and ships a small Go installer at `go/cmd/install`. Consumers run it once per machine:

  ```bash
  go get github.com/yfedoseev/pdf_oxide/go
  go run github.com/yfedoseev/pdf_oxide/go/cmd/install@latest
  # Installer prints the CGO_CFLAGS / CGO_LDFLAGS to export
  ```

  The installer downloads the matching asset into `~/.pdf_oxide/v<version>/`, SHA-256 verifies it against the signed `.sha256` published alongside, and either prints the env vars to export or (with `--write-flags=<dir>`) generates a `cgo_flags.go` next to the user's code. Monorepo / source-tree builds use `-tags pdf_oxide_dev` which points CGo at `target/release/libpdf_oxide.a` directly — no installer needed.

  `@latest` just works — the installer reads its own module version from `runtime/debug.ReadBuildInfo()`, so every tagged release auto-matches its FFI assets without a release-time sed step. `go run .../cmd/install@latest` always resolves to the matching `.tar.gz`.

  Why: shipping ~130 MB of binary in git per release was bloating clone time and accumulating to GBs over dozens of releases. This is the approach Kreuzberg (https://github.com/kreuzberg-dev/kreuzberg/blob/main/packages/go/v4/cmd/install/main.go) and similar Rust-in-Go projects use. Repo size per release bump drops to ~0 KB; clone stays fast forever.

  Migration: consumers upgrading from v0.3.30 must run the install command once and export the printed `CGO_*` env vars (or add them to their shell profile / CI env). No code changes to the Go API. `go get` without the install step will fail at link time with `undefined reference to pdf_document_open ...` — the installer fixes this.
- Go release pipeline hardening — three ordering + integrity fixes landed alongside the install-flow switch:
  - SHA-256 gate end-to-end. `package-go-ffi` emits `pdf_oxide-go-ffi-<platform>.tar.gz.sha256` next to each tarball (attached to the GitHub Release). The Go installer downloads both and aborts with a checksum mismatch if they don't match; `--skip-checksum` bypasses for offline/air-gapped installs. The same `.sha256` is verified by `verify-go-install` in CI before the release is even published.
  - `verify-go-install` is now a publish gate. `create-release` depends on `verify-go-install` — that job extracts the freshly-built tarball, matches the sha256, then builds a `FromMarkdown → Save → Open → PageCount` consumer against a local `replace` directive. A broken `.a`, a missing symbol, or a stale `cgo_dev.go` blocks the release instead of leaking through.
  - `go/v<version>` tag is pushed last, not first. A new `tag-go-module` job runs after `create-release` has uploaded the FFI tarballs. Previously the tag was pushed during packaging, creating a window where `go install @latest` could resolve a tag whose FFI assets 404'd. Tag creation is gated on `!contains(version, '-')` so prerelease `-rc.1` tags never reach sum.golang.org.
- Shrink Rust static libs 62.8% before packaging — Rust-produced staticlibs carry 35 MB of `.llvmbc` (LLVM bitcode for cross-crate LTO) + 4 MB of DWARF per platform, none of which CGo's linker or node-gyp ever uses. New `scripts/shrink-staticlib.sh` strips both via `objcopy --remove-section=.llvmbc --remove-section=.llvmcmd --strip-debug` (Linux / Windows-GNU) or `strip -S` (macOS) inside the `build-native-libs` job. Per-platform `libpdf_oxide.a` drops from ~71 MB to ~26 MB. All 85 Go-consumed FFI symbols verified intact post-strip.
- Strip the npm `.node` addon — `node-gyp rebuild` left the addon unstripped (17 MB, `with debug_info, not stripped`). Added post-build `strip --strip-unneeded` (Linux) / `strip -x` (macOS) in the `build-nodejs` job. Combined with the upstream staticlib shrink, the Linux `.node` is expected to drop from 17 MB to ~7 MB.
- Drop sourcemaps from the npm tarball — `js/tsconfig.json` sets `declarationMap: false` + `sourceMap: false` for the published build. File count falls from 211 → 107 (removes 104 `.js.map` / `.d.ts.map` files). `.d.ts.map` was never useful to consumers; `.js.map` is moot without TS sources, which we don't ship.
- Fix crate sdist leak (pulled in 47 unrelated files) — Cargo's `include` uses gitignore-style globs, so the bare `"README.md"` entry was matching every README.md recursively, including 27 `js/node_modules/*/README.md` dependency READMEs and 20 subdirectory READMEs. Anchored all patterns with a leading `/` — sdist file count 308 → 264.
- Tighten NuGet symbol package — `EmbedAllSources` dropped from `true` to `false` in `csharp/PdfOxide/PdfOxide.csproj`. SourceLink + the embedded PDB already serve sources on demand from the git SHA, so embedding every source file into the `.snupkg` was pure bloat. Added defensive `<None Remove="..\..\target\**\*.pdb" />` to prevent native PDBs from landing in `runtimes/` (nuget.org's snupkg validator rejects native PDBs).
Issues reported or features requested by: @Goldziher (#320 benchmark harness), @ddxtanx (#346 sort-order panic, #354 memory leak on page 12), @frederikhors (#325 rendering regression), @Charltsing (#344 CMYK JPEGs), @FireMasterK (#345 page-scan failures), @Jeevaanandh (#353 yanked libflate dep).
- Go: migrate from cdylib to staticlib for self-contained binaries (#334) — `pdf_oxide` now produces `libpdf_oxide.a` alongside the cdylib (new `staticlib` entry in `Cargo.toml`'s `crate-type`), and `go/pdf_oxide.go` links the archive directly via per-platform `#cgo ... LDFLAGS` with the exact system-library list rustc needs. The resulting Go binary is fully self-contained — no `LD_LIBRARY_PATH` / `DYLD_LIBRARY_PATH` / `PATH` configuration required. Windows x64 is produced via a new `x86_64-pc-windows-gnu` cross-compile row in the release matrix; Windows ARM64 temporarily stays on dynamic `pdf_oxide.dll` until `aarch64-pc-windows-gnullvm` stabilises.
- Node.js: ship prebuilt native bindings via platform subpackages (#335) — switched to the napi-rs style prebuilt-binary model: the main `pdf-oxide` package drops the install hook, declares per-platform `pdf_oxide-<triple>` subpackages as `optionalDependencies`, and ships only compiled `lib/` + `README.md`. `binding.gyp` links the `libpdf_oxide.a` / `pdf_oxide.lib` staticlib with per-OS system-library lists, so the resulting `.node` is self-contained. `npm install pdf-oxide` now works out of the box with no TypeScript, Python, C++ toolchain, or native lib on the consumer's machine.
- C#: migrate all 881 P/Invoke declarations from DllImport to LibraryImport for NativeAOT (#333) — `PdfOxide` on NuGet is now NativeAOT-publish-ready and trim-safe. Target frameworks trimmed to `net8.0;net10.0`. `IsAotCompatible=true` and `IsTrimmable=true` flags enabled. The `build-csharp` release job gains a `Verify NativeAOT publish` step that runs `dotnet publish` on a tiny consumer with `PublishAot=true` + `TreatWarningsAsErrors=true` on net10.0. Requested by @Charltsing.
- OCR FFI bridge for Go, C#, and Node.js — added 4 `pub extern "C" fn` declarations to `src/ffi.rs` wrapping `src/ocr::OcrEngine`: `pdf_ocr_engine_create`, `pdf_ocr_engine_free`, `pdf_ocr_page_needs_ocr`, `pdf_ocr_extract_text`. Each has a `#[cfg(feature = "ocr")]` variant with the real implementation and a `#[cfg(not(feature = "ocr"))]` stub returning `ERR_UNSUPPORTED`. Previously only Python had OCR (via direct pyo3); now Go, C#, and Node.js can also use OCR when built with `--features ocr`. Go gains `NewOcrEngine()`, `NeedsOcr()`, `ExtractTextWithOcr()`.
- Node.js binding.cc cleanup — deleted 12 hallucinated C++ class methods that referenced nonexistent FFI functions (ML/analysis ×7, XFA parse/free ×2, rendering extras ×3). Wired 6 rendering/annotation/PDF-A functions to their real Rust FFI names using Go's working code as the reference. Fixed macOS framework linking (`xcode_settings.OTHER_LDFLAGS`) and MSVC C++20 (`/std:c++20`).
- Image extraction: `Invalid RGB image dimensions` error on PDFs with Indexed color space images (#311) — `extract_image_from_xobject` now resolves Indexed palettes via `resolve_indexed_palette` and expands indices through `expand_indexed_to_rgb`, supporting 1/2/4/8 bpc with RGB/Grayscale/CMYK base color spaces. Reported by @Charltsing.
- Encryption: AES-256 (V=5, R=6) PDFs returned empty or garbled text (#313) — three independent fixes: uncompressed-object string decryption, push-button widget `/MK /CA` caption extraction, and Algorithm 2.B termination off-by-one correction.
- Reading order: `ColumnAware` fragmented single-column body text (#314) — added `is_single_column_region` guard, fixed vertical-split partition inversion. Verified on RFC 2616, Berkeley theses, EU GDPR.
- Tables: product data sheet label/value rows rendered far from their section (#315) — replaced with inline-table-insertion scheme that drains tables at their spatial position.
- Reading order: tabular content interleaved by Y jitter (#316) — added `row_aware_span_cmp` with 3pt Y-band quantisation. CJK rowspan-label columns preserved through spatial table detector (#329).
- Text extraction: adjacent Tj/TJ operators concatenated without spaces (#326) — lowered word-separation threshold to match pdfium's heuristic.
- Text extraction: fallback-width inflation on fonts with no `/Widths` array (#328) — added `FontInfo::has_explicit_widths()` and `space_gap` correction for proportional fonts.
- Text extraction: Arabic content in visual order instead of reading order (#330) — added Pass 0 pre-shaped Arabic span reversal.
- Encryption: object cache not invalidated after successful late authenticate() (#323) — drops `object_cache` on the authenticated transition.
- Images: Indexed palette expander hardened against DoS and truncation (#324) — `checked_mul` + 256 MiB guard + truncation rejection.
- Rendering: slow cold-cache start, dropped ligatures, text missing on subset-CID fonts (#325, #331 R1/R2/R4) — fixed multi-character cluster width accumulation, Arabic/Latin ligature expansion, and system fontdb caching. Reported by @frederikhors.
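The Indexed-palette entries above (#311, #324) combine three ideas: unpacking 1/2/4/8-bit indices, overflow-checked size arithmetic with a 256 MiB ceiling, and rejecting truncated input. The sketch below shows how those pieces could fit together for an RGB base color space only. It is an illustration under assumed names (`expand_indexed_rgb_sketch`, `ExpandError`), not the crate's `expand_indexed_to_rgb`.

```rust
/// Upper bound on the expanded buffer (the 256 MiB guard).
const MAX_OUTPUT_BYTES: usize = 256 * 1024 * 1024;

#[derive(Debug, PartialEq)]
pub enum ExpandError {
    UnsupportedBitDepth(u8),
    TooLarge,
    Truncated,
    IndexOutOfRange(usize),
}

/// Expand packed palette indices into RGB bytes.
/// `palette` holds RGB triples; rows of `data` are padded to whole bytes.
pub fn expand_indexed_rgb_sketch(
    data: &[u8],
    palette: &[u8],
    width: usize,
    height: usize,
    bpc: u8,
) -> Result<Vec<u8>, ExpandError> {
    if !matches!(bpc, 1 | 2 | 4 | 8) {
        return Err(ExpandError::UnsupportedBitDepth(bpc));
    }
    let bits = bpc as usize;
    // Overflow-checked output size, then the hard ceiling.
    let out_len = width
        .checked_mul(height)
        .and_then(|px| px.checked_mul(3))
        .ok_or(ExpandError::TooLarge)?;
    if out_len > MAX_OUTPUT_BYTES {
        return Err(ExpandError::TooLarge);
    }
    if out_len == 0 {
        return Ok(Vec::new());
    }
    // Safe from overflow: width and height are both bounded by the guard above.
    let row_bytes = (width * bits + 7) / 8;
    if data.len() < row_bytes * height {
        return Err(ExpandError::Truncated);
    }
    let mask = (1u16 << bpc) - 1;
    let mut out = Vec::with_capacity(out_len);
    for y in 0..height {
        let row = &data[y * row_bytes..(y + 1) * row_bytes];
        for x in 0..width {
            let bit = x * bits;
            // Samples are packed most-significant bit first.
            let shift = 8 - bits - (bit % 8);
            let idx = (((row[bit / 8] as u16) >> shift) & mask) as usize;
            let entry = palette
                .get(idx * 3..idx * 3 + 3)
                .ok_or(ExpandError::IndexOutOfRange(idx))?;
            out.extend_from_slice(entry);
        }
    }
    Ok(out)
}
```

Checking the size before allocating is what keeps a hostile `/Width` or `/Height` from triggering a huge allocation, and checking `data.len()` up front turns a short stream into an error instead of an out-of-bounds read.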
- Go module tag creation moved to end of pipeline — `update-go-native-libs` now depends on ALL build + verify jobs. The Go tag is created only after the full build matrix is green, and publishes are gated on it. This prevents sum.golang.org from permanently caching a broken tag hash on failed runs.
- `verify-go-install` uses local path verification — uses `go mod edit -replace` against the locally-staged checkout instead of `go get @vX.Y.Z`, eliminating sum.golang.org contact entirely during CI.
- Go tag creation guarded against re-push — skips if tag already exists on remote.
- `scripts/regression_harness.py` — new self-contained regression harness. Subcommands: `collect` / `run` / `diff` / `groundtruth` / `show`. 60-PDF curated corpus with `text`, `markdown`, and `html` format support.
Thank you to everyone who reported issues or filed detailed reproducers for this release!
- @Charltsing — Reported the Indexed color space image extraction failure (#311) with a reproduction PDF that exposed a long-standing gap in palette handling, and requested the `DllImport → LibraryImport` migration for NativeAOT-ready C# bindings (#333).
- @Goldziher — Reported four extraction issues (#313, #314, #315, #316) with clear repro snippets that let us localise the AES-256 string-decryption gap, the Algorithm 2.B termination off-by-one, the single-column XYCut fragmentation, the inline-rendering gap on product data sheets, and the row-aware sort gap for tabular content. Also raised the pdfium-parity bar (#320) that drove the corpus-wide quality audit and the regression harness.
- @frederikhors — Reported the rendering-path bugs on the `rendering` feature (#325): cold-cache slowness, dropped ligatures, missing text on subset-CID fonts, and a font-specific vertical flip. Triage of the report surfaced four distinct signatures (#331 R1-R4); the three that we could reproduce ship in this release.
New Language Bindings: JavaScript / TypeScript, Go, and C#
This release ships official bindings for JavaScript/TypeScript, Go, and C#, built on a shared C FFI layer. 100% Rust FFI parity across all three.
- JavaScript / TypeScript bindings (`pdf-oxide` on npm) — N-API native module with `Buffer` / `Uint8Array` input, `openWithPassword()`, worker thread pool, `Symbol.dispose`, rich error hierarchy, and complete API coverage: document editor, forms, rendering, signatures/TSA, compliance, annotations, extraction with bbox. Full TypeScript type definitions included.
- Go bindings (`github.com/yfedoseev/pdf_oxide/go`) — Full API with goroutine-safe `PdfDocument` (`sync.RWMutex`), `io.Reader` support, functional options pattern, `SetLogLevel()`, and ARM64 CGo targets.
- C# / .NET bindings (`PdfOxide` on NuGet) — P/Invoke with `NativeHandle` (`SafeHandle`), `IDisposable`, `ReaderWriterLockSlim` thread safety, `async Task<T>` + `CancellationToken`, fluent builders, LINQ extensions, plugin system. ARM64 NuGet targets (linux-arm64, osx-arm64, win-arm64).
- C FFI layer (`src/ffi.rs`) — 270+ `extern "C"` functions covering the full Rust API surface.
- Shared C header (`include/pdf_oxide_c/pdf_oxide.h`) — Portable header for all FFI consumers.
- `pdf_oxide_set_log_level()` / `pdf_oxide_get_log_level()` — Global log level control exposed to all language bindings.
- Text extraction: SIGABRT on pages with degenerate CTM coordinates (#308) — extracting text from certain rotated dvips-generated pages (e.g., arXiv papers with `Page rot: 90`) caused a 38 petabyte allocation and SIGABRT. Degenerate CTM transforms produced text spans with bounding boxes ~19 quadrillion points wide, which blew up the column detection histogram in `detect_page_columns()`. Per PDF 32000-1:2008 §8.3.2.3, the visible page region is defined by MediaBox/CropBox, not by raw user-space coordinates. Now `detect_page_columns()` uses median-based outlier rejection to exclude degenerate spans from the histogram, with a 10,000pt hard cap as defense-in-depth. Preserves all 1516 characters on the affected page (matching v0.3.19 output). Reported by @ddxtanx.
- Editor: images and XObjects stripped on save (#306) — opening a PDF containing images, making any edit (or none), and saving produced an output with all images removed. The cause was that `write_full_to_writer` only serialized Font resources from the page Resources dictionary, silently dropping XObject (images, form XObjects) and ExtGState entries. Now writes XObject and ExtGState dictionary entries alongside fonts. Also wires up pending image XObject references from `generate_content_stream` into the page Resources dictionary. The `create_pdf_with_images` example was also affected — output contained no images. Reported by @RubberDuckShobe.
- Rendering: garbled text on systems without common fonts (#307) — rendering any PDF with text produced random symbols or black rectangles on Linux systems without Arial/Times New Roman installed (e.g., minimal EndeavourOS). The PDF's non-embedded fonts (ArialMT, Arial-BoldMT, TimesNewRomanPSMT) relied on system font availability, but font parsing failures were silent and the fallback font list was too narrow. Now logs a warning with the font name when parsing fails, added DejaVu Sans, Noto Sans, and FreeSans to the system font fallback chain, and logs an actionable message suggesting which font packages to install (`liberation-fonts`, `dejavu-fonts`, or `noto-fonts`). Reported by @FireMasterK.
- Editor: form field page index always reported as 0 — `get_form_fields()` hardcoded `page_index` to 0 for all fields read from the source document, so fields on page 2+ were incorrectly placed. Now builds a page-ref-to-index map and resolves the actual page from each widget annotation's `/P` entry.
- Text extraction: fix Tf inside q/Q test — the `test_extract_save_restore` unit test was ignored due to malformed PDF syntax (`q 14 Tf` missing font name operand). Fixed to valid syntax and unignored. The save/restore mechanism itself was already correct.
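The #308 fix pairs a median-based outlier filter with a fixed 10,000pt cap. The entry does not state the exact rejection rule, so the sketch below assumes one plausible form: drop any span width above either the hard cap or a multiple of the median width. `MEDIAN_FACTOR` and the function name are assumptions made purely for illustration.

```rust
/// Hard cap from the changelog entry: no span wider than this is considered.
const HARD_CAP_PT: f32 = 10_000.0;
/// Assumed multiplier; the real threshold is not stated in the changelog.
const MEDIAN_FACTOR: f32 = 50.0;

/// Keep only span widths that are finite, non-negative, and not wildly larger
/// than the median width. Hypothetical helper, not `detect_page_columns()`.
pub fn reject_degenerate_widths(widths: &[f32]) -> Vec<f32> {
    let is_sane = |w: &f32| w.is_finite() && *w >= 0.0;
    let mut sorted: Vec<f32> = widths.iter().copied().filter(is_sane).collect();
    if sorted.is_empty() {
        return Vec::new();
    }
    sorted.sort_by(|a, b| a.total_cmp(b));
    let median = sorted[sorted.len() / 2];
    // The cap applies even when the median itself is degenerate.
    let limit = (median.max(1.0) * MEDIAN_FACTOR).min(HARD_CAP_PT);
    widths
        .iter()
        .copied()
        .filter(|w| is_sane(w) && *w <= limit)
        .collect()
}
```

The median is robust to a few enormous values, which is why it is a natural reference here; the fixed cap covers the case where most spans on a page are degenerate and the median itself is meaningless.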
- Remove stale CID font widths TODO — the comment claimed Type0 CID font widths were "not yet implemented", but `parse_cid_widths` and `get_glyph_width` already handled them correctly.
Thank you to everyone who reported issues for this release!
- @ddxtanx — Reported SIGABRT crash on rotated dvips PDFs (#308) with a clear reproduction case and backtrace. Identified it as a regression from #272.
- @RubberDuckShobe — Reported images being stripped on save (#306). Confirmed the issue also affected the `create_pdf_with_images` example.
- @FireMasterK — Reported garbled text rendering on EndeavourOS (#307) and provided the test PDF with non-embedded Arial fonts.
Thread-Safe PdfDocument, Async API, Performance, Community Fixes
None. All changes are backward-compatible.
- Thread-safe `PdfDocument` — Send + Sync (#302) — replaced all 16 `RefCell<T>` with `Mutex<T>` and `Cell<usize>` with `AtomicUsize`. `PdfDocument` can now safely cross thread boundaries. Removes `unsendable` from `PdfDocument`, `FormField`, and `PdfPage` Python classes. Enables `asyncio.to_thread()`, free-threaded Python (cp314t), and thread pool usage without `RuntimeError`. Reported by @FireMasterK (#298).
- `AsyncPdfDocument`, `AsyncPdf`, `AsyncOfficeConverter` (#217) — complete async API with auto-generated method wrappers. All sync methods are available as async. Requested by @j-mendez.
- Free-threaded Python support (#296) — `#[pymodule(gil_used = false)]` declares GIL-free compatibility for cp314t. Requested by @pcen.
- Word/line segmentation thresholds (#249) — `extract_words()` and `extract_text_lines()` accept optional `word_gap_threshold`, `line_gap_threshold`, and `profile` kwargs. New `page_layout_params()` method and `ExtractionProfile` class expose adaptive parameters. Contributed by @tboser.
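The `RefCell<T>` → `Mutex<T>` and `Cell<usize>` → `AtomicUsize` swap described in #302 is what makes a type with interior caches `Send + Sync`. The toy `Document` below is a hypothetical stand-in (it is not `PdfDocument`) that shows the pattern and lets the compiler check the auto traits.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical document with interior caches. With `RefCell`/`Cell` in place
/// of `Mutex`/`AtomicUsize` this struct would not be `Sync`.
pub struct Document {
    page_cache: Mutex<HashMap<usize, String>>,
    cache_hits: AtomicUsize,
}

impl Document {
    pub fn new() -> Self {
        Document {
            page_cache: Mutex::new(HashMap::new()),
            cache_hits: AtomicUsize::new(0),
        }
    }

    /// Return cached page text, "extracting" it on the first request.
    pub fn page_text(&self, index: usize) -> String {
        let mut cache = self.page_cache.lock().unwrap();
        if let Some(text) = cache.get(&index) {
            self.cache_hits.fetch_add(1, Ordering::Relaxed);
            return text.clone();
        }
        let text = format!("page {index}");
        cache.insert(index, text.clone());
        text
    }

    pub fn cache_hits(&self) -> usize {
        self.cache_hits.load(Ordering::Relaxed)
    }
}

/// Compiles only if `T` is both `Send` and `Sync`.
pub fn assert_send_sync<T: Send + Sync>() {}

/// Read page 0 from several threads sharing one document.
pub fn read_from_threads(doc: Arc<Document>, threads: usize) -> Vec<String> {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let doc = Arc::clone(&doc);
            thread::spawn(move || doc.page_text(0))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling `assert_send_sync::<Document>()` is a compile-time check: if any field were switched back to `RefCell` or `Cell`, the program would stop compiling rather than fail at run time.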
- CLI split/merge blank pages (#297) — merge now writes merged page refs; split now filters removed pages from Kids. Reported by @Suleman-Elahi.
- Rendering: skip malformed images (#299, #300) — images with missing `/ColorSpace` or invalid dimensions are skipped with a warning instead of crashing the page render. Also handles malformed images inside Form XObjects. Reported by @FireMasterK.
- Structure tree cycle SIGSEGV (#301) — cyclic `/K` indirect references in malformed tagged PDFs caused stack overflow. A visited-object set now breaks cycles. Contributed by @hoesler.
- `horizontal_strategy: 'lines'` text fallback gate (#290) — setting `horizontal_strategy` to `lines` now correctly suppresses text-based row detection. Each axis is checked independently. Contributed by @hoesler.
- `vertical_strategy` Python parsing (#290) — `vertical_strategy` was never read from the Python `table_settings` dict, always defaulting to `Both`. Contributed by @hoesler.
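The visited-object set mentioned for #301 can be illustrated on a simplified model where each object id maps to the ids of its `/K` children. The map type and function name are hypothetical, and the explicit stack is an extra choice made in this sketch (it also avoids deep recursion), not something the entry states.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified model: object id -> ids referenced from its `/K` entry.
pub type KidsMap = HashMap<u32, Vec<u32>>;

/// Walk the structure tree depth-first, visiting every object at most once
/// so that cyclic `/K` references terminate.
pub fn collect_structure_elements(root: u32, kids: &KidsMap) -> Vec<u32> {
    let mut visited = HashSet::new();
    let mut order = Vec::new();
    let mut stack = vec![root];
    while let Some(id) = stack.pop() {
        // `insert` returns false when the id was already seen: a cycle or a
        // shared child, either way it must not be walked again.
        if !visited.insert(id) {
            continue;
        }
        order.push(id);
        if let Some(children) = kids.get(&id) {
            // Push in reverse so children are visited in document order.
            for &child in children.iter().rev() {
                stack.push(child);
            }
        }
    }
    order
}
```

With `1 → [2, 3]` and `2 → [1]`, the back-reference from object 2 to object 1 is skipped and the walk ends after visiting each object once.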
- Cache structure tree — parsed once and cached; non-tagged PDFs skip parsing via MarkInfo check.
- Cache decompressed page content stream — avoids re-decompression when multiple extractors access the same page.
- Shared XObject stream cache for path extraction — reuses decompressed Form XObject streams already cached by text extraction.
- Cached XObject dictionary for path extraction — avoids re-resolving Resources -> XObject dict chain on every Do operator.
- Byte-level path extraction parser — skips BT/ET text blocks and parses path/state/color operators without Object allocation.
- Allocation-free graphics state for paths — Copy-only state struct eliminates heap allocations on q/Q save/restore.
- Index-based font tracking in prescan — replaces String cloning on every q operator with index into font table.
- Prescan: drop Do positions when Do-dominated — prevents region merging that defeats the prescan optimization.
- Reuse spans in table detection — reuses pre-extracted spans instead of re-parsing the content stream.
- Pre-filter non-table paths — filters to lines/rectangles before the detection pipeline.
- O(1) MCID lookup — HashSet instead of linear search for marked-content identifier matching.
- O(log n) page tree traversal — uses /Count to skip subtrees instead of linear counting.
- Lazy page tree population — defers bulk page tree walk until needed.
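The `/Count`-based traversal in the list above works because every intermediate page-tree node records how many leaf pages sit beneath it, so whole subtrees can be skipped by subtraction. A minimal sketch on a hypothetical in-memory `Node` type (the real page tree is made of PDF dictionaries):

```rust
/// Hypothetical in-memory page tree.
pub enum Node {
    /// Leaf page, identified here by an arbitrary object id.
    Page(u32),
    /// Intermediate node: `count` mirrors the `/Count` entry.
    Pages { count: usize, kids: Vec<Node> },
}

impl Node {
    fn count(&self) -> usize {
        match self {
            Node::Page(_) => 1,
            Node::Pages { count, .. } => *count,
        }
    }
}

/// Find the page at zero-based `index`, descending only into the one child
/// whose subtree contains it.
pub fn find_page(root: &Node, mut index: usize) -> Option<u32> {
    let mut node = root;
    loop {
        match node {
            Node::Page(id) => return if index == 0 { Some(*id) } else { None },
            Node::Pages { kids, .. } => {
                let mut next = None;
                for kid in kids {
                    let n = kid.count();
                    if index < n {
                        next = Some(kid);
                        break;
                    }
                    // Skip the whole subtree without walking it.
                    index -= n;
                }
                // `None` means the index is out of range or `/Count` is wrong.
                node = next?;
            }
        }
    }
}
```

The logarithmic bound only holds for reasonably balanced trees; a flat tree with every page under the root still needs a linear scan of that one `kids` array.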
- Bump `zip` 8.5.0 -> 8.5.1
- Bump `pdfium-render` 0.8.37 -> 0.9.0
- Bump `tokenizers` 0.15.2 -> 0.22.2
Thank you to everyone who reported issues and contributed PRs for this release!
- @hoesler — Structure tree cycle SIGSEGV fix (#301) and table strategy gating fix (#290). Two high-quality PRs with tests and clean code.
- @tboser — Word/line segmentation thresholds feature (#249). Well-designed API with 14 tests and responsive to review feedback.
- @FireMasterK — Reported thread-safety crash (#298), rendering crashes with missing ColorSpace (#299) and invalid image dimensions (#300). Three critical bug reports that drove the Send+Sync refactor.
- @Suleman-Elahi — Reported CLI split/merge blank pages bug (#297) with clear reproduction steps.
- @pcen — Requested free-threaded Python compatibility (#296).
- @j-mendez — Requested async Python API (#217).
Log Level Honored in Python, Multi-Arch Wheels
- Log level now fully respected in Python (#283) — `extract_log_debug!` / `extract_log_trace!` / etc. were printing to stderr directly via `eprintln!`, bypassing the `log` crate and therefore ignoring `pdf_oxide.set_log_level(...)` and Python's `logging.basicConfig(level=...)`. Messages like `[DEBUG] Parsing content stream for text extraction` and `[TRACE] Detected document script: Latin` leaked through at ERROR level. The macros now forward to `log::debug!` / `log::trace!` / etc. and are properly gated by the `log` crate's max level filter. Reported by @marph91 as a follow-up to #280.
- Multi-arch Python wheels (#284) — Added wheels for Linux aarch64 (`manylinux_2_28_aarch64`), Linux musl x86_64 and aarch64 (`musllinux_1_2_*`), and Windows ARM64 (`win_arm64`). Lowered the manylinux glibc floor from `2_34` to `2_28` to cover RHEL 8, Debian 11, Ubuntu 20.04, and Amazon Linux 2023. A source distribution (sdist) is now published for any platform with a Rust toolchain. Reported by @jhhayashi.
Table Extraction Engine — Intersection Pipeline, Text-Edge Detection, Converter Improvements
Major rewrite of the table detection system, implementing the universal Edges → Snap/Merge → Intersections → Cells → Groups pipeline — the gold-standard approach used by Tabula, pdfplumber, and PyMuPDF, now in pure Rust.
- Intersection-based table detection — Finds H×V line crossings, builds cells from 4-corner rectangles, groups into tables via union-find. The gold-standard approach used by Tabula/pdfplumber/PyMuPDF, now in pure Rust.
- Extended grid for non-crossing lines — When H and V lines are in different page regions, creates virtual grid from Cartesian product of all coordinates.
- Column-aware text detection — Segments 2-column layouts via X-projection histogram, runs text-only table detection per column.
- H-rule-bounded text tables — Detects tables bounded by horizontal rules but no vertical lines (common in academic papers).
- Hybrid row detection — Infers row boundaries from text Y-positions when only vertical borders exist (e.g. invoice line items).
- Dotted/dashed line reconstitution — Merges short line segments into continuous edges for row separator detection.
- Section divider splitting — Splits multi-section forms at full-width horizontal dividers.
- Edge coverage filtering — Removes orphan edges that don't participate in any potential grid.
- Configurable V-line split gap — `v_split_gap` field in `TableDetectionConfig` (default 20pt, was hardcoded 4pt).
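The Intersections → Cells steps of the pipeline above can be illustrated with a deliberately simplified sketch: test each horizontal/vertical pair for a crossing, then emit a cell wherever two neighbouring H lines and two neighbouring V lines cross at all four corners. The types, the 1pt tolerance, and the adjacent-lines-only restriction are assumptions for illustration; snapping, merged cells, and the union-find grouping into tables are left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct HLine { pub y: f32, pub x0: f32, pub x1: f32 }

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct VLine { pub x: f32, pub y0: f32, pub y1: f32 }

/// Assumed tolerance (in points) for lines that almost touch.
const TOL: f32 = 1.0;

/// True when the vertical line's x lies within the horizontal line's span
/// and the horizontal line's y lies within the vertical line's span.
fn crosses(h: &HLine, v: &VLine) -> bool {
    v.x >= h.x0 - TOL && v.x <= h.x1 + TOL && h.y >= v.y0 - TOL && h.y <= v.y1 + TOL
}

/// Build cells `(x0, y0, x1, y1)` from neighbouring line pairs whose four
/// corners are all real crossings.
pub fn find_cells(hs: &[HLine], vs: &[VLine]) -> Vec<(f32, f32, f32, f32)> {
    let mut hs = hs.to_vec();
    let mut vs = vs.to_vec();
    hs.sort_by(|a, b| a.y.total_cmp(&b.y));
    vs.sort_by(|a, b| a.x.total_cmp(&b.x));
    let mut cells = Vec::new();
    for hp in hs.windows(2) {
        for vp in vs.windows(2) {
            let all_corners = hp.iter().all(|h| vp.iter().all(|v| crosses(h, v)));
            if all_corners {
                cells.push((vp[0].x, hp[0].y, vp[1].x, hp[1].y));
            }
        }
    }
    cells
}
```

A full 3×3 grid of lines yields four cells in row-major order; a lone horizontal rule yields none, which is the situation the extended-grid and H-rule-bounded strategies listed above are meant to cover.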
- Space-padded column alignment — Clean, readable output replacing ASCII box drawing (`+--+|`). Right-aligns currency/number columns.
- Form numbering artifact stripping — Removes single-digit prefixes from PDF form templates ("1 Apr 11" → "Apr 11").
- Dash/underscore cell stripping — Removes decorative `------` separators from table cells.
- Adjacent value spacing — Inserts space between consecutive currency values in table cells.
- Split decimal merging — Rejoins integer and decimal parts rendered in separate fixed-width boxes.
- Bold span consolidation — Merges adjacent single-character bold spans into a single `**WORD**` in markdown.
- HTML heading hierarchy — Content-aware detection; addresses and box numbers no longer tagged as `<h1>`/`<h2>`.
- Image bloat fix — `include_images` defaults to `false`, dramatically reducing output size.
- Label-value pairing — Same-Y spans from different reading-order groups rendered on the same output line.
- Content ordering — XYCut group_id propagation keeps spatial regions as contiguous blocks.
- Columnar group merging — Detects column-by-column layouts and re-interleaves into rows.
- Orphaned span recovery — Text spans inside rejected table regions are preserved at correct Y-position.
- Key-value pair merging — `Label\n$Value` patterns merged to `Label $Value` in post-processing.
- Encrypted PDF clear error — Returns `Error::EncryptedPdf` with a helpful message instead of silent zero output.
- ObjStm/XRef stream decryption — Object streams are no longer incorrectly decrypted, per ISO 32000-2 Section 7.6.3.
- Stream parser trailing newline — Strips CR/LF before the `endstream` keyword, fixing AES block-size errors on encrypted PDFs.
- Table detection enabled by default — `extract_text()` now uses `extract_tables: true`.
- `to_plain_text()` includes tables — Was silently dropping all detected tables.
- Python `extract_tables()` config — Now uses `default()` (Both strategy) instead of `relaxed()` (Text-only).
- MD table cell dropping — Row padding and centroid drift fix in spatial detector.
- Box label spacing — Inserts space between box number and adjacent currency value.
- Dash cell artifact — `------` cells cleared from table output.
- Orphaned dollar values — Dollar values no longer silently dropped when table detector misses them.
- Digit→currency spacing — Any positive gap between digit/text and `$`/`€`/`£` inserts a space.
- UnionFind struct — Extracted from two duplicated inline implementations (DRY).
- `snap_and_merge()` decomposed — Split into `snap_edges()`, `join_collinear_edges()`, `reconstitute_dotted_lines()` (SRP).
- Shared converter helpers — `span_in_table()` and `has_horizontal_gap()` extracted from 3 duplicated copies to `converters/mod.rs` (DRY).
- `detect_tables_from_intersections()` decomposed — 229-line 6-responsibility function split into `build_grid_from_lines()`, `assign_spans_to_intersection_grid()`, `finalize_intersection_tables()` + 20-line orchestrator (SRP).
- Collinear segment joining — Relaxed coord tolerance from `f32::EPSILON` to `SNAP_TOL` for proper chain joining.
- Python, Rust, and WASM `extract_tables()` all use the same `TableDetectionConfig::default()` (Both strategy) for consistent results across languages.
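
For readers unfamiliar with the Intersections → Cells → Groups stages named in this release, the following is a minimal Python sketch of the general technique, not PDFOxide's Rust implementation. It assumes edges are already snapped and merged, and every name in it (`detect_tables`, `find`, `union`, the tuple layouts) is illustrative rather than part of the library's API.

```python
from itertools import product


def find(parent, i):
    # Union-find lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra


def detect_tables(h_lines, v_lines, tol=1.0):
    """h_lines: (y, x0, x1); v_lines: (x, y0, y1).

    Returns a list of tables, each a list of cells (x0, y0, x1, y1).
    """
    # 1. Intersections: points where a horizontal and a vertical line cross.
    points = set()
    for (y, hx0, hx1), (x, vy0, vy1) in product(h_lines, v_lines):
        if hx0 - tol <= x <= hx1 + tol and vy0 - tol <= y <= vy1 + tol:
            points.add((x, y))
    xs = sorted({p[0] for p in points})
    ys = sorted({p[1] for p in points})

    # 2. Cells: neighbouring grid rectangles whose four corners all exist.
    cells = []
    for (x0, x1), (y0, y1) in product(zip(xs, xs[1:]), zip(ys, ys[1:])):
        if {(x0, y0), (x1, y0), (x0, y1), (x1, y1)} <= points:
            cells.append((x0, y0, x1, y1))

    # 3. Groups: cells that share a corner belong to the same table.
    parent = list(range(len(cells)))
    first_owner = {}
    for i, (x0, y0, x1, y1) in enumerate(cells):
        for corner in ((x0, y0), (x1, y0), (x0, y1), (x1, y1)):
            if corner in first_owner:
                union(parent, first_owner[corner], i)
            else:
                first_owner[corner] = i

    tables = {}
    for i, cell in enumerate(cells):
        tables.setdefault(find(parent, i), []).append(cell)
    return list(tables.values())
```

A 2×2 grid plus a separate single box elsewhere on the page yields two groups, one with four cells and one with a single cell. The sketch deliberately ignores merged cells (missing interior lines), which a production detector has to handle.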
Library logging now follows standard best practices — silent by default across all bindings.
- Python — Rust `log` macros now flow through Python's `logging` module via `pyo3-log`. Configure with the normal API: `import logging; logging.basicConfig(level=logging.WARNING)`. New helpers: `pdf_oxide.set_log_level("warn")` and `pdf_oxide.disable_logging()`. The `setup_logging()` function is kept for backward compatibility (the bridge is initialized automatically on module import).
- WASM — New `setLogLevel(level)` / `disableLogging()` functions. Logs are forwarded to the browser console via `console_log`. Accepts `"off"`, `"error"`, `"warn"`, `"info"`, `"debug"`, `"trace"`.
- Rust — No change; the library continues to use the `log` crate facade without initializing a backend (standard Rust library practice). Applications choose their own logger (`env_logger`, `tracing`, etc.).
🥇 @marph91 — Thank you for reporting the logging flood issue (#280) and the thoughtful proposal. This pushed us to audit the bindings against the logging best practices used by pyo3-log-based projects (cryptography, polars) and ship a clean fix across Python, WASM, and Rust! 🚀
Text Extraction Accuracy, Column-Aware Reading Order, and Community Contributions
- `extract_page_text()` Single-Call DTO (#268) — New `PageText` struct returns spans, characters, and page dimensions from a single extraction pass, eliminating redundant content stream parsing. Available across Rust, Python, and WASM.
- Column-Aware Reading Order (#270) — New `extract_spans_with_reading_order()` method accepts a `ReadingOrder` parameter. `ReadingOrder::ColumnAware` uses XY-Cut spatial partitioning to detect columns and read each column top-to-bottom, fixing garbled text for multi-column PDFs.
- Per-Character Bounding Boxes from Font Metrics (#269) — `TextSpan` now carries per-glyph advance widths captured during extraction. `to_chars()` produces accurate per-character bounding boxes using font metrics instead of uniform width division. Available as `span.char_widths` in Python and `span.charWidths` in WASM (omitted when empty).
- `is_monospace` Flag on TextSpan/TextChar (#271) — Exposes the PDF font descriptor FixedPitch bit, with fallback name heuristic (Courier, Consolas, Mono, Fixed). Eliminates the need for fragile font-name string matching.
- `Pdf::from_bytes()` Constructor (#252) — Opens existing PDFs from in-memory bytes without requiring a file path. Available across Rust, Python (`Pdf.from_bytes(data)`), and WASM (`WasmPdf.fromBytes(data)`).
- Path Operations in Python (#261) — `extract_paths()` now includes an `operations` list with individual path commands (move_to, line_to, curve_to, rectangle, close_path) and their coordinates. WASM `extractPaths()` also aligned.
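
To illustrate the difference between uniform width division and metric-based character boxes (#269), here is a small Python sketch. It is an assumption-laden illustration, not the library's `to_chars()`: the function name `chars_from_span`, its parameters, and the fallback rule are all hypothetical.

```python
def chars_from_span(text, x0, y0, x1, y1, char_widths=None):
    """Split a span's bounding box into per-character boxes.

    char_widths holds one advance width per character. When it is missing
    or does not match the text length, fall back to uniform division.
    """
    if not text:
        return []
    span_width = x1 - x0
    if char_widths and len(char_widths) == len(text) and sum(char_widths) > 0:
        # Rescale the advances so that together they fill the span exactly.
        scale = span_width / sum(char_widths)
        widths = [w * scale for w in char_widths]
    else:
        widths = [span_width / len(text)] * len(text)

    boxes, x = [], x0
    for ch, w in zip(text, widths):
        boxes.append((ch, (x, y0, x + w, y1)))
        x += w
    return boxes
```

With the text `"iW"` in a 12pt-wide span, uniform division places the boundary at x = 6, while advances of 3 and 9 place it at x = 3, which is where a narrow `i` followed by a wide `W` actually ends.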
- Fixed panic on multi-byte UTF-8 in debug log slicing (#251) — Replaced raw byte-offset string slices with char-boundary-safe helpers, preventing panics when extracting text from CJK/emoji PDFs with debug logging enabled.
- Fixed markdown spacing around styled text (#273) — Markdown output no longer merges words across annotation/style span boundaries (e.g., "visitwww.example.comto" → "visit www.example.com to").
- Fixed Form XObject /Matrix application (#266) — Text extraction now correctly applies Form XObject transformation matrices and wraps in implicit q/Q save/restore per PDF spec Section 8.10.1.
- Fixed text matrix advance for rotated text (#266) — Replaced incorrect `total_width / text_matrix.d.abs()` division (divide-by-zero for 90° rotation) with correct `Tm_new = T(tx, 0) × Tm` per ISO 32000-1 Section 9.4.4.
- Fixed prescan CTM loss for deeply nested text (#267) — Replaced backward 4KB scan with forward CTM tracking across the full content stream, capturing outer scaling transforms for text in streams >256KB (e.g., chart axis labels).
- Fixed prescan dropping marked content (BDC/BMC) for tagged PDFs — The forward CTM scan now includes preceding BDC/BMC operators and following EMC operators in region boundaries, preserving MCID, ActualText, and artifact tagging for tagged PDFs in large content streams.
- Fixed deduplication dropping distinct characters (#253) — `deduplicate_overlapping_chars` now checks character identity, not just position. Distinct characters close together (e.g., space followed by 'r' at 1.5pt) are no longer incorrectly removed.
- Fixed text dropped with font-size-as-Tm-scale pattern (#254) — Corrected TD/T* matrix multiplication order per ISO 32000-1 Section 9.4.2. PDFs using `/F1 1 Tf` + scaled `Tm` (common in InDesign, LaTeX) no longer silently lose lines. Also tightened containment filter to require text identity match.
- Fixed markdown merging words in single-word BT/ET blocks (#260) — `to_markdown()` now detects horizontal gaps between consecutive same-line spans and inserts spaces, matching `extract_text()` behavior. Fixes PDFs generated by PDFKit.NET/DocuSign.
- Fixed CLI merge creating blank documents (#262) — `merge_from`/`merge_from_bytes` now properly import page objects with deep recursive copy of all dependent objects (content streams, fonts, images), remapping indirect references.
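
The rotated-text fix (#266) is easier to see with numbers. The sketch below applies `Tm_new = T(tx, 0) × Tm` to a text matrix rotated by 90°, where the `d` component is zero and the old division would have failed. It assumes the usual PDF six-number matrix layout `(a, b, c, d, e, f)`; the helper name `advance_text_matrix` is illustrative, not a library function.

```python
import math


def advance_text_matrix(tm, tx, ty=0.0):
    """Return T(tx, ty) x Tm for a PDF matrix stored as (a, b, c, d, e, f)."""
    a, b, c, d, e, f = tm
    # Pre-multiplying by a translation only changes the e and f components.
    return (a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f)


# Text rotated 90 degrees counter-clockwise, with its origin at (100, 200).
theta = math.radians(90)
tm = (math.cos(theta), math.sin(theta), -math.sin(theta), math.cos(theta), 100.0, 200.0)

# Advancing by 50 units along the text direction moves the origin up the page.
moved = advance_text_matrix(tm, 50.0)
```

Here `tm[3]` is effectively zero, yet the advance is well defined: the origin moves from (100, 200) to approximately (100, 250), because the text direction points up the page.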
- pyo3 0.27.2 → 0.28.2 — Added `skip_from_py_object`/`from_py_object` annotations per new `FromPyObject` opt-in requirement.
- clap 4.5.60 → 4.6.0
- codecov/codecov-action 5 → 6
- WASM JSON field names now use camelCase — `TextSpan`, `TextChar`, `PageText`, `TextBlock`, and `TextLine` serialized fields changed from snake_case to camelCase (e.g., `font_name` → `fontName`, `font_size` → `fontSize`, `is_italic` → `isItalic`, `page_width` → `pageWidth`) when the `wasm` feature is enabled. This aligns with JavaScript naming conventions. Rust JSON serialization via serde is only affected when the `wasm` feature is enabled. Python uses PyO3 getters and is unaffected.
🥇 @Goldziher — Thank you for the comprehensive feature requests (#252, #268, #269, #270, #271) that shaped the text extraction improvements in this release. Your detailed issue reports with code examples and spec references made implementation straightforward! 🚀
🥈 @bsickler — Thank you for the Form XObject matrix fix (#266) and prescan CTM rewrite (#267). These are critical correctness fixes for text extraction in rotated documents and large content streams! 🚀
🥉 @hansmrtn — Thank you for the UTF-8 panic fix (#251). This prevents crashes for any user processing non-ASCII PDFs with debug logging! 🚀
🏅 @jorlow — Thank you for the markdown spacing fix (#273). Clean, well-tested fix for a common user-facing issue! 🚀
🏅 @willywg — Thank you for exposing path operations in Python (#261), giving downstream tools access to individual vector path commands! 🚀
🏅 @titusz — Thank you for reporting the character deduplication (#253) and Tm-scale text dropping (#254) bugs with clear root cause analysis! 🚀
🏅 @oscmejia — Thank you for reporting the markdown word merging issue (#260) with a clear reproduction case! 🚀
🏅 @Inklikdevteam — Thank you for reporting the CLI merge blank pages bug (#262)! 🚀
Rendering Engine Overhaul, Visual Parity, and Expanded API
Major rendering improvements achieving near-perfect visual fidelity across academic papers, government documents, CJK content, presentations, forms, and complex multi-layer PDFs.
- Correct Character Spacing — Fixed proportional width resolution for CID, CFF, and TrueType subset fonts. Documents that previously rendered with monospace-like spacing now display with correct kerning and proportional widths.
- Embedded Font Support — Render directly from embedded CFF and TrueType font programs, producing accurate glyph shapes that match the original document's typography.
- Standard Font Metrics — Built-in width tables for the PDF standard 14 fonts (Times, Helvetica, Courier). Fixes uniform character spacing when explicit widths are absent.
- Improved Font Matching — Better system font fallback for URW, LaTeX, and other common font families. Automatic serif/sans-serif detection for appropriate substitution.
- Fill-and-Stroke Support — Full implementation of combined fill-and-stroke operators (`B`, `B*`, `b`, `b*`), fixing missing border strokes on rectangles and paths.
- Clip Path Support — Proper handling of clip-without-paint patterns, resolving issues where body text was hidden behind unclipped background fills.
- Gradient Shading — Axial (linear) and radial gradient rendering with support for exponential interpolation and stitching functions.
- Negative Rectangle Handling — Correct normalization of rectangles with negative dimensions per the PDF specification.
- Alpha Transparency — Fixed fill and stroke alpha application per PDF specification. Semi-transparent rectangles, images, and paths now blend correctly.
- Graphics State Resolution — Proper indirect reference resolution for extended graphics state parameters, ensuring alpha and blend mode values are applied.
- Isolated Transparency Groups — Support for rendering transparency groups to separate compositing surfaces.
- Stencil Image Masks — Support for 1-bit stencil masks with CCITT Group 4 decompression. Fixes decorative borders, corner ornaments, and masked image elements.
- Page Rotation — Full support for the `/Rotate` attribute (90°, 180°, 270°), correctly rendering landscape slides and rotated documents.
- Separation Color Spaces — Proper tint transform evaluation for Separation and DeviceN colors against their alternate color spaces.
- Fixed process abort on degenerate CTM coordinates — A malformed CTM could place text spans at extreme coordinates, causing allocation abort. Projection functions now safely skip the split instead of crashing.
- FlateDecode flate-bomb protection — All zlib/deflate decompression paths are now capped, preventing a crafted PDF stream from exhausting virtual memory. The cap defaults to 256 MB and can be adjusted via the `PDF_OXIDE_MAX_DECOMPRESS_MB` environment variable or programmatically with `FlateDecoder::with_limit(n)`.
- Fixed Clipping Stack Synchronization — Resolved a critical issue where the clipping stack could get out of sync with the graphics state, leading to incorrect content being hidden.
- Standardized Image Extraction — Refactored the image extraction logic to support document-wide color space resolution.
- Fixed Python Rendering Accessibility (#240) — Resolved an issue where the `render_page` method was unreachable in standard Python builds.
- Python type stubs — Switched from mypy stubgen to Rylai for generating `.pyi` from PyO3 Rust source statically (no compilation). CI and release workflows updated.
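
The flate-bomb cap can be pictured with Python's standard `zlib` module. This is only a sketch of the general idea of bounding decompressed output; it does not reproduce `FlateDecoder::with_limit(n)`, and the names `inflate_capped` and `DecompressionLimitExceeded` are hypothetical.

```python
import zlib


class DecompressionLimitExceeded(Exception):
    """Raised when a stream inflates past the configured cap."""


def inflate_capped(data, limit_bytes=256 * 1024 * 1024, chunk_size=64 * 1024):
    decomp = zlib.decompressobj()
    out = bytearray()
    pending = data
    while not decomp.eof:
        # Ask for at most one byte beyond the remaining budget, so an
        # oversized stream is detected without inflating it fully.
        room = limit_bytes - len(out) + 1
        piece = decomp.decompress(pending, min(chunk_size, room))
        out += piece
        if len(out) > limit_bytes:
            raise DecompressionLimitExceeded(
                f"decompressed output exceeds {limit_bytes} bytes"
            )
        pending = decomp.unconsumed_tail
        if not piece and not pending:
            break  # truncated stream: no input left and nothing produced
    return bytes(out)
```

The key design point is that the limit is enforced incrementally: memory use stays bounded by the cap plus one chunk, however large the stream claims to be.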
New methods on PdfDocument:
- `validate_pdf_a(level)` — PDF/A compliance validation (1a/1b/2a/2b/2u/3a/3b/3u)
- `validate_pdf_ua()` — PDF/UA accessibility validation
- `validate_pdf_x(level)` — PDF/X print compliance
- `extract_pages(pages, output)` — Extract page subset to a new PDF file
- `delete_page(index)` — Remove a page by index
- `move_page(from, to)` — Reorder pages
- `flatten_to_images(dpi)` — Create flattened PDF from rendered pages
- `PdfDocument(path, password=)` — Open encrypted PDFs in one step (#247)
- `PdfDocument.from_bytes(data, password=)` — Same for in-memory PDFs
- `Pdf.merge(paths)` — Merge multiple PDF files into one
New methods on WasmPdfDocument:
- `validatePdfA(level)` — PDF/A compliance validation
- `deletePage(index)` — Remove a page
- `extractPages(pages)` — Extract pages to new PDF bytes
- `save()` — Save modified PDF (alias for `saveToBytes()`)
- `new WasmPdfDocument(data, password?)` — Open encrypted PDFs (#247)
- `WasmPdf.merge(pdfs)` — Merge multiple PDFs from byte arrays
- `rendering::flatten_to_images(doc, dpi)` — Shared implementation for all bindings
- `api::merge_pdfs(paths)` — Merge multiple PDFs (shared across all bindings)
- Rendering Engine Overhaul — Major improvements to the rendering pipeline, achieving high visual parity with industry standards.
- Batteries-Included Python Bindings — The Python distribution now automatically enables page rendering, parallel extraction, digital signatures, and office document conversion by default. (#240)
🥇 @tiennh-h2 — Thank you for reporting the rendering accessibility issue (#240). Your feedback helped us identify that our Python distribution was too minimal, leading to an improved "batteries-included" experience for all Python users! 🚀
🥈 @Suleman-Elahi — Thank you for the suggestion to add flattened PDF creation (#240). This led to the new flatten_to_images() API available across Rust, Python, and WASM! 🚀
🥉 @hoesler — Thank you for the XY-cut projection fix (#274) that prevents allocation abort on degenerate CTM coordinates, and the FlateDecoder configurability improvement (#275)! 🚀
🏅 @Leon-Degel-Koehn — Thank you for fixing the Quick Start Rust documentation (#277)! 🚀
🏅 @XO9A8 — Thank you for improving the PdfDocument::from_bytes documentation (#276)! 🚀
🏅 @monchin — Thank you for replacing manual stub generation with Rylai (#250) and for helping diagnose the password API issue (#247) with a clear workaround and API improvement suggestion! 🚀
🏅 @marph91 — Thank you for reporting the password constructor issue (#247), improving the developer experience for encrypted PDF workflows! 🚀
Stable Recursion and Refined Table Heuristics
- Refined Table Detection — The spatial table detector now requires at least 2 columns to identify a region as a table. This significantly reduces false positives where single-column lists or bullet points were incorrectly wrapped in ASCII boxes.
- Optimized Text Extraction — Refactored the internal extraction pipeline to eliminate redundant work when processing Tagged PDFs. The structure tree and page spans are now extracted once and shared across the detection and rendering phases.
- Resolved `RefCell` already borrowed panic (#237) — Fixed a critical reentrancy issue where recursive Form XObject processing (e.g., extracting images from nested forms) could trigger a runtime panic. Replaced long-lived borrows with scoped, tiered cache access using Rust best practices. (Reported by @marph91)
🥇 @marph91 — Thank you for identifying the complex RefCell borrow conflict in nested image extraction (#237). This report led to a comprehensive safety audit of our interior mutability patterns and a more robust, recursion-safe caching architecture! 🚀
Advanced Visual Table Detection and Automated Python Stubs
- Smart Hybrid Table Extraction (#206) — Introduced a robust, zero-config visual detection engine that handles both bordered and borderless tables.
- Localized Grid Detection: Uses Union-Find clustering to group vector paths into discrete table regions, enabling multiple tables per page.
- Visual Line Analysis: Detects cell boundaries from actual drawing primitives (lines and rectangles), significantly improving accuracy for untagged PDFs.
- Visual Spans: Identifies colspans and rowspans by analyzing the absence of internal grid lines and text-overflow signals.
- Visual Headers: Heuristically identifies hierarchical (multi-row) header rows.
- Professional ASCII Tables: Added high-quality ASCII table formatting for plain text output, featuring automatic multiline text wrapping and balanced column alignment.
- Auto-generated Python type stubs (#220) — Integrated automated `.pyi` stub generation using mypy's stubgen in the CI pipeline, ensuring Python IDEs always have up-to-date type information for the Rust bindings.
- Python `PdfDocument` path-like and context manager (#223) — `PdfDocument` now accepts `pathlib.Path` (or any path-like object) and supports the context manager protocol (`with PdfDocument(path) as doc:`), ensuring scoped usage and automatic resource cleanup.
- Enabled by Default: Table extraction is now active by default in all Markdown, HTML, and Plain Text conversions.
- Robust Geometry: Updated `Rect` primitive to handle negative dimensions and coordinate normalization natively.
- Fixed segfault in nested Form XObject text extraction (#228) — Resolved aliased `&mut` references during recursive XObject processing using interior mutability (`RefCell`/`Cell`).
- Fixed Python Coordinate Scaling: Corrected `erase_region` coordinate mapping in Python bindings to use the standard `[x1, y1, x2, y2]` format.
- Improved ASCII Table Wrapping: Reworked text wrapping to be UTF-8 safe, preventing panics on multi-byte characters.
- Refined Rendering API: Restored backward compatibility for the `render_page` method.
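
As a rough illustration of what handling negative dimensions means for a rectangle primitive, the sketch below normalizes a rectangle so that width and height are non-negative while covering the same area. The `Rect` dataclass here is a stand-in written for this example, with assumed field names; it is not the library's `Rect`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    x: float
    y: float
    width: float
    height: float

    def normalized(self):
        """Return an equivalent rectangle with non-negative width and height."""
        x, w = self.x, self.width
        y, h = self.y, self.height
        if w < 0:
            # A negative width means the origin is the right edge.
            x, w = x + w, -w
        if h < 0:
            # A negative height means the origin is the top edge.
            y, h = y + h, -h
        return Rect(x, y, w, h)
```

For example, a rectangle at (10, 20) with width -4 and height -6 covers the same area as one at (6, 14) with width 4 and height 6.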
🥇 @hoesler — Huge thanks for PR #228! Your fix for the nested XObject aliasing UB is a critical stability improvement that eliminates segfaults in complex PDFs. By correctly employing interior mutability, you've made the core extraction engine significantly more robust and spec-compliant. Outstanding work! 🚀
🥈 @monchin — Thank you for the fantastic initiative on automated stub generation (#220) and the ergonomic improvements for Python (#223)! We've integrated these into the v0.3.16 release, providing consistent, IDE-friendly type hints and modern path-like/context manager support. Outstanding contributions! 🚀
Header & Footer Management, Multi-Column Stability, and Font Fixes
- PDF Header/Footer Management API (#207) — Added a dedicated API for managing page artifacts across Rust, Python, and WASM.
- Add: Ability to insert custom headers and footers with styling and placeholders via `PageTemplate`.
- Remove: Heuristic detection engine to automatically identify and strip repeating artifacts. Includes modular methods: `remove_headers()`, `remove_footers()`, and `remove_artifacts()`. Prioritizes ISO 32000 spec-compliant `/Artifact` tags when available.
- Edit: Ability to mask or erase existing content on a per-page basis via `erase_header()`, `erase_footer()`, and `erase_artifacts()`.
- Page Templates — Introduced `PageTemplate`, `Artifact`, and `ArtifactStyle` classes for reusable page design. Supports dynamic placeholders like `{page}`, `{pages}`, `{title}`, and `{author}`.
- Scoped Extraction Filtering — Updated all extraction methods to respect `erase_regions`, enabling clean text extraction by excluding identified headers and footers.
- Python `PdfDocument.from_bytes()` — Open PDFs directly from in-memory bytes without requiring a file path. (Contributed by @hoesler in #216)
- Future-Proofed Rust API — Implemented `Default` trait for key extraction structs (`TextSpan`, `TextChar`, `TextContent`) to protect users from future field additions.
- Fixed Multi-Column Reading Order (#211) — Refactored `extract_words()` and `extract_text_lines()` to use XY-Cut partitioning. This prevents text from adjacent columns from being interleaved and standardizes top-to-bottom extraction. (Reported by @ankursri494)
- Resolved Font Identity Collisions (#213) — Improved font identity hashing to include `ToUnicode` and `DescendantFonts` references. Fixes garbled text extraction in documents where multiple fonts share the same name but use different character mappings. (Reported by @productdevbook)
- Fixed `Lines` table strategy false positives (#215) — `extract_tables()` with `horizontal_strategy="lines"` now builds the grid purely from vector path geometry and returns empty when no lines are found, preventing spurious tables on plain-text pages. (Contributed by @hoesler)
- Optimized CMap Parsing — Standardized 2-byte consumption for Identity-H fonts and improved robust decoding for Turkish and other extended character sets.
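
XY-Cut partitioning, used for the multi-column fix (#211), can be sketched in a few lines of Python. This is a simplified version of the general algorithm under stated assumptions: boxes are `(x0, y0, x1, y1, text)` tuples with y growing downward, the `min_gap` threshold is arbitrary, and the function names are illustrative. PDFOxide's actual partitioning logic may differ.

```python
def split_at_widest_gap(boxes, lo, hi, min_gap):
    """Split boxes at the widest empty band along one axis, or return None."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    best_gap, best_index = min_gap, None
    reach = ordered[0][hi]  # furthest extent covered so far
    for i in range(1, len(ordered)):
        gap = ordered[i][lo] - reach
        if gap >= best_gap:
            best_gap, best_index = gap, i
        reach = max(reach, ordered[i][hi])
    if best_index is None:
        return None
    return ordered[:best_index], ordered[best_index:]


def xy_cut(boxes, min_gap=10.0):
    """Return box texts in reading order by recursive spatial partitioning."""
    if len(boxes) <= 1:
        return [b[4] for b in boxes]
    # Try a vertical cut first (columns), then a horizontal cut (rows).
    for lo, hi in ((0, 2), (1, 3)):
        parts = split_at_widest_gap(boxes, lo, hi, min_gap)
        if parts:
            first, second = parts
            return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    # No clean cut: fall back to top-to-bottom, then left-to-right.
    return [b[4] for b in sorted(boxes, key=lambda b: (b[1], b[0]))]
```

On a two-column page, a naive sort by y then x interleaves the columns (left line 1, right line 1, left line 2, ...), whereas the recursive cut reads the whole left column before the right one.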
🥇 @hoesler — Huge thanks for PR #216 and #215! Your contribution of from_bytes() for Python unlocks new serverless and in-memory workflows for the entire community. Additionally, your fix for the Lines table strategy significantly improves the precision of our table extraction engine. Outstanding work! 🚀
🥈 @ankursri494 (Ankur Srivastava) — Thank you for identifying the multi-column reading order issue (#211). Your detailed report and sample document were the catalyst for our new XY-Cut partitioning engine, which makes PDFOxide's reading order detection among the best in the ecosystem! 🎯
🥉 @productdevbook — Thanks for reporting the complex font identity collision issue (#213). This report led to a deep dive into PDF font internals and a significantly more robust font hashing system that fixes garbled text for thousands of professional documents! 🔍✨
Parity in API & Bug Fixing (Issue #185, #193, #202)
- High-Level Rendering API (#185, #190) — added `Pdf::render_page()` to Rust, Python, and WASM. Supports rendering any page to `Image` (Png/Jpeg). Restored backward compatibility for Rust by maintaining the 1-argument `render_page` and adding `render_page_with_options`.
- Word and Line Extraction (#185, #189) — added `extract_words()` and `extract_text_lines()` to all bindings. Provides semantic grouping of characters with bounding boxes, font info, and styling (parity with `pdfplumber`).
- Geometric Primitive Extraction (#185, #191) — added `extract_rects()` and `extract_lines()` to identify vector graphics.
- Hybrid Table Detection (#185, #192) — updated `SpatialTableDetector` to use vector lines as hints, significantly improving detection of "bordered" tables.
- API Harmonization — implemented the fluent `.within(page, rect)` pattern across Rust, Python, and WASM for scoped extraction.
- Area Filtering — added optional `region` support to all extraction methods (`extract_text`, `extract_chars`, etc.) in Python and WASM, using backward-compatible signatures.
- Deep Data Access — added `.chars` property to `TextWord` and `TextLine` objects in Python, enabling granular access to individual character metadata.
- CLI Enhancements — added `pdf-oxide render` for image generation and `pdf-oxide paths` for geometric JSON extraction. Integrated `--area` filtering across all extraction commands.
Reported by @MarcRene71 — AttributeError: 'builtins.PdfDocument' object has no attribute 'extract_text_ocr' when using the library without the OCR feature enabled.
- Improved Feature Gating Discovery (#204) — ensured that all optional features (OCR, Office, Rendering) are always visible in the Python API. If a feature is disabled at build time, calling its methods now returns a helpful `RuntimeError` explaining how to enable it (e.g., `pip install pdf_oxide[ocr]`), instead of throwing an `AttributeError`.
- Always-on Type Stubs (#204) — updated `.pyi` files to include all methods regardless of build features, providing full IDE autocompletion support for all capabilities.
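
The feature-gating behaviour described above can be mimicked in pure Python as follows. This is a sketch of the pattern only: the real bindings are generated from Rust, and the `ENABLED_FEATURES` set, the `requires_feature` decorator, the `Document` class, and the `rendering` extra name are assumptions made for this example.

```python
import functools

# Hypothetical stand-in for the features compiled into the current build.
ENABLED_FEATURES = {"rendering"}


def requires_feature(name, extra):
    """Keep a method visible, but fail with guidance when its feature is off."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if name not in ENABLED_FEATURES:
                raise RuntimeError(
                    f"{func.__name__}() requires the '{name}' feature, which is "
                    f"not enabled in this build. "
                    f"Enable it with: pip install pdf_oxide[{extra}]"
                )
            return func(*args, **kwargs)
        return wrapper
    return decorate


class Document:
    @requires_feature("ocr", "ocr")
    def extract_text_ocr(self, page):
        return f"OCR text for page {page}"

    @requires_feature("rendering", "rendering")
    def render_page(self, page):
        return f"image for page {page}"
```

Because the method always exists, `hasattr(doc, "extract_text_ocr")` is true and IDE completion works; only the call itself fails, with a message that tells the user what to install.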
Reported by @cole-dda — repeated calls to extract_texts() and extract_spans() return inconsistent results (empty lists on second/third calls).
- Fixed XObject span cache poisoning (#193) — resolved an issue where `extract_chars()` (low-level API) would incorrectly populate the high-level `xobject_spans_cache` with empty results. Because `extract_chars()` does not collect spans, it was "poisoning" the cache for subsequent `extract_spans()` calls, causing them to return empty data for any content inside Form XObjects.
- Improved extraction mode isolation (#193) — ensured that the text extractor explicitly separates character and span extraction paths. The span result cache is now only accessed and updated when in span extraction mode, and internal span buffers are cleared when entering character mode.
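
The cache-poisoning bug and its fix can be reproduced in miniature. The toy extractor below is an illustration of the mode-isolation rule, not the library's code: `FORMS`, `Extractor`, and its methods are invented for this example.

```python
# Hypothetical Form XObject contents keyed by object reference.
FORMS = {"form1": "Hello"}


class Extractor:
    def __init__(self):
        self.span_cache = {}

    def _parse_form(self, ref, collect_spans):
        chars = list(FORMS[ref])
        # Spans are only assembled when the caller asked for them.
        spans = ["".join(chars)] if collect_spans else []
        return chars, spans

    def extract_chars(self, ref):
        # Character mode never reads or writes the span cache.
        chars, _ = self._parse_form(ref, collect_spans=False)
        return chars

    def extract_spans(self, ref):
        # Only span mode populates the cache, so entries are never empty
        # merely because an earlier call ran in character mode.
        if ref not in self.span_cache:
            _, spans = self._parse_form(ref, collect_spans=True)
            self.span_cache[ref] = spans
        return self.span_cache[ref]
```

In the buggy arrangement, the character path would have stored its empty span list under `"form1"`, and every later `extract_spans("form1")` would have returned `[]`.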
Reported by @vincenzopalazzo — extract_text() returns empty string for encrypted PDFs with CID TrueType Identity-H fonts.
- Support for V=4 Crypt Filters (#202) — fixed a bug in `EncryptDict` where version 4 encryption was hardcoded to AES-128. It now correctly parses the `/CF` dictionary and `/CFM` entry to select between RC4-128 (`/V2`) and AES-128 (`/AESV2`), enabling support for PDFs produced by OpenPDF.
- Encrypted CIDToGIDMap decryption (#202) — fixed a missing decryption step when loading `CIDToGIDMap` streams. Previously, the stream was decompressed but remained encrypted, causing invalid glyph mapping and failed text extraction.
- Enhanced font diagnostic logging (#202) — replaced silent failures with descriptive warnings when ToUnicode CMaps or FontFile2 streams fail to load or decrypt, making it easier to diagnose complex extraction issues.
- Consolidated text decoding and positioning logic (#187) — unified the high-level `extract_text_spans()` and low-level `extract_chars()` paths into a single shared engine to prevent logic drift and ensure consistent character handling.
- Fixed render_page for in-memory PDFs — ensured that PDFs created from bytes or strings can be rendered by automatically initializing a temporary editor if needed.
- Improved Clustering Accuracy — updated character clustering to use gap-based distance instead of center-to-center distance, ensuring accurate word grouping regardless of font size.
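
To show why gap-based distance is less sensitive to font size than center-to-center distance, here is a small Python sketch of word grouping. The threshold of a quarter of the font size is an assumption chosen for the example, and `group_words` is not a library function.

```python
def group_words(chars, gap_ratio=0.25):
    """Group characters on one line into words.

    chars: list of (text, x0, x1, font_size), ordered left to right.
    A new word starts when the empty space between the previous glyph's
    right edge and this glyph's left edge exceeds gap_ratio * font_size.
    """
    words, current, prev_x1 = [], "", None
    for text, x0, x1, size in chars:
        if prev_x1 is not None and x0 - prev_x1 > gap_ratio * size:
            words.append(current)
            current = ""
        current += text
        prev_x1 = x1
    if current:
        words.append(current)
    return words
```

With 48pt glyphs that are each 40 units wide, neighbouring centers are about 41 units apart, which a fixed center-distance threshold tuned for body text would read as a word break. The edge-to-edge gap is only 1 unit, so the gap-based rule keeps the letters together.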
Thank you to @MarcRene71 for identifying the critical API discoverability issue with OCR (#204). Your report led to a more robust "Pythonic" approach to feature gating, ensuring that users always see the full API and receive helpful guidance when features are disabled!
Thank you to @vincenzopalazzo for identifying and fixing the critical issues with encrypted CID fonts and V=4 crypt filters (#202). Your contribution of both the fix and the reproduction fixture was essential for ensuring PDFOxide handles professional PDFs from diverse producers!
Thank you to @ankursri494 (Ankur Srivastava) for the excellent proposal to bridge the gap between PdfPlumber's flexibility and PDFOxide's performance (#185). Your detailed breakdown of word-level and table extraction requirements was the roadmap for this release!
Thank you to @cole-dda for identifying the critical caching bug (#193). The detailed reproduction case was essential for pinpointing the interaction between the low-level character API and the document-level XObject caches.
Character Extraction Quality, Multi-byte Encoding (Issue #186)
Reported by @cole-dda — garbled output when using extract_chars() on PDFs with multi-byte encodings (CJK text, Type0 fonts).
- Multi-byte decoding in show_text — fixed `extract_chars()` to correctly handle 2-byte and variable-width encodings (Identity-H/V, Shift-JIS, etc.). Previously, characters were processed byte-by-byte, causing multi-byte characters to be split and garbled. Now uses the same robust decoding logic as `extract_spans()`.
- Improved character positioning accuracy — replaced the 0.5em fixed-width estimate in `show_text` with actual glyph widths from the font dictionary. This ensures that character bounding boxes (`bbox`) and origins are precisely positioned, matching the actual PDF rendering.
- Accurate character advancement — character spacing (`Tc`) and word spacing (`Tw`) are now correctly scaled by horizontal scaling (`Th`) during character-level extraction, ensuring correct text matrix updates.
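
The horizontal advance that these fixes implement follows the glyph displacement formula in ISO 32000-1 Section 9.4.4: tx = ((w0 − Tj/1000) × Tfs + Tc + Tw) × Th. The Python function below is a direct transcription for illustration; the function and parameter names are this example's own, and word spacing is applied only when the caller flags the glyph as a space.

```python
def glyph_advance(width_1000, font_size, char_spacing=0.0, word_spacing=0.0,
                  horiz_scale_pct=100.0, is_space=False, tj_adjust=0.0):
    """Horizontal displacement after showing one glyph.

    width_1000: glyph width in thousandths of text space (font dictionary).
    tj_adjust:  TJ array adjustment, also in thousandths.
    """
    th = horiz_scale_pct / 100.0
    tw = word_spacing if is_space else 0.0
    # Tc and Tw sit inside the parentheses, so Th scales them as well.
    return ((width_1000 - tj_adjust) / 1000.0 * font_size + char_spacing + tw) * th
```

For a 500-unit glyph at 10pt with `Tc = 1` and 50% horizontal scaling, the advance is (5 + 1) × 0.5 = 3. Leaving `Tc` outside the scaling would give 3.5 instead, and the error accumulates across a line.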
Thank you to @cole-dda for identifying and reporting the character extraction quality issue with an excellent reproduction case (#186). Your report directly led to identifying the divergence between our high-level and low-level extraction paths, making extract_chars() significantly more robust for CJK and other multi-byte documents. We really appreciate your contribution to making PDF Oxide better!
Text Extraction Quality, Determinism, Performance, Markdown Conversion
Reported by @Goldziher — systematic evaluation across 10 PDFs covering word merging, encoding failures, and RTL text.
- CID font width calculation — fixed text-to-user space conversion for CID fonts. Glyph widths were not correctly scaled, causing word boundary detection to merge adjacent words (`destinationmachine` → `destination machine`, `helporganizeas` → `help organize as`).
- Font-change word boundary detection — when PDF font changes mid-line (e.g., regular→italic for product names in LaTeX), we now detect this as a word boundary even if the visual gap is small. Previously, these were merged into single words with mixed formatting.
- Non-Standard CID mapping fallback — implemented a fallback mechanism for CID fonts with broken `/ToUnicode` maps. If mapping fails, we now attempt to use the font's internal `cmap` table directly. Fixed encoding failures in 3 PDFs from the corpus.
- RTL text directionality foundation — added basic support for identifying RTL (Right-to-Left) script spans (Arabic, Hebrew) based on Unicode range. Provides correctly ordered spans for simple RTL layouts.
- Optimized Markdown engine — significantly improved the performance of `to_markdown()` by implementing recursive spatial partitioning (XY-Cut). This ensures that multi-column layouts and complex document structures are converted into accurate, readable Markdown.
- Heading Detection — automated identification of headers (H1-H6) based on font size variance and document-wide frequency analysis.
- List Reconstruction — detects bulleted and numbered lists by analyzing leading character patterns and indentation consistency.
- Zero-copy page tree traversal — refactored internal page navigation to avoid redundant dictionary cloning during deep page tree traversal for multi-page extraction.
- Structure tree caching — Structure tree result cached after first access, avoiding redundant parsing on every `extract_text()` call (major impact on tagged PDFs like PDF32000_2008.pdf).
- BT operator early-out — `extract_spans()`, `extract_spans_with_config()`, and `extract_chars()` skip the full text extraction pipeline for image-only pages that contain no `BT` (Begin Text) operators.
- Larger I/O buffer for big files — `BufReader` capacity increased from 8 KB to 256 KB for files >100 MB, reducing syscall overhead on 1.5 GB newspaper archives.
- Xref reconstruction threshold removed — Eliminated the `xref.len() < 5` heuristic that triggered full-file reconstruction on valid portfolio PDFs with few objects (5-13s → <100ms).
Thank you to @Goldziher for the exhaustive evaluation of PDF extraction quality (#181). Your systematic approach to testing across 10 diverse documents directly resulted in critical fixes for font scaling and encoding fallbacks. The feedback from power users like you is what drives PDF Oxide's quality forward!
Stability, Image Extraction & Error Recovery (Issue #41, #44, #45, #46)
- 100% pass rate on 3,830 PDFs across three independent test suites: veraPDF (2,907), Mozilla pdf.js (897), SafeDocs (26).
- Zero timeouts, zero panics — every PDF completes within 120 seconds.
- p50 = 0.6ms, p90 = 3.0ms, p99 = 33ms — 97.6% of PDFs complete in under 10ms.
- Added `verify_corpus` example binary for reproducible batch verification with CSV output, timeout handling, and per-corpus breakdown.
- Owner password authentication (Algorithm 7 for R≤4, Algorithm 12 for R≥5).
- R≤4: Derives RC4 key from owner password via MD5 hash chain, decrypts `/O` value to recover user password, then validates via user password authentication.
- R≥5: SHA-256 verification with SASLprep normalization and owner validation/key salts per PDF spec §7.6.3.4.
- Both algorithms now fully wired into `EncryptionHandler::authenticate()`.
- R≥5 user password verification with SASLprep - Full AES-256 password verification using SHA-256 with validation and key salts per PDF spec §7.6.4.3.3.
- Public password authentication API - `Pdf::authenticate(password)` and `PdfDocument::authenticate(password)` exposed for user-facing password entry.
- XMP metadata validation - Parses XMP metadata stream and checks for `pdfaid:part` and `pdfaid:conformance` identification entries (clause 6.7.11).
- Color space validation - Scans page content streams for device-dependent color operators (`rg`, `RG`, `k`, `K`, `g`, `G`) without output intent (clause 6.2).
- AFRelationship validation - For PDF/A-3 documents with embedded files, validates each file specification dictionary contains the required `AFRelationship` key (clause 6.8).
- XMP PDF/X identification - Parses XMP metadata for `pdfxid:GTS_PDFXVersion`, validates against declared level (clause 6.7.2).
- Page box relationship validation - Validates TrimBox ⊆ BleedBox ⊆ MediaBox and ArtBox ⊆ MediaBox with 0.01pt tolerance (clause 6.1.1).
- ExtGState transparency detection - Checks `SMask` (not `/None`), `CA`/`ca` < 1.0, and `BM` not `Normal`/`Compatible` in extended graphics state dictionaries (clause 6.3).
- Device-dependent color detection - Flags DeviceRGB/CMYK/Gray color spaces used without output intent (clause 6.2.3).
- ICC profile validation - Validates ICCBased color space profile streams contain required `/N` entry (clause 6.2.3).
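The page box relationship check (TrimBox ⊆ BleedBox ⊆ MediaBox, ArtBox ⊆ MediaBox, 0.01pt tolerance) can be illustrated with a small standalone sketch. The `Rect` type, `check_page_boxes` function, and message strings below are hypothetical names chosen for illustration, not PDFOxide's actual API. The sketch also assumes normalized rectangles (`x0 <= x1`, `y0 <= y1`) and that a missing BleedBox falls back to the MediaBox.

```rust
/// Hypothetical rectangle type; assumes x0 <= x1 and y0 <= y1.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x0: f64,
    pub y0: f64,
    pub x1: f64,
    pub y1: f64,
}

/// Tolerance in points, matching the 0.01pt figure quoted in the changelog.
const TOLERANCE_PT: f64 = 0.01;

impl Rect {
    /// True if `inner` lies within `self`, allowing TOLERANCE_PT of slack per edge.
    pub fn contains(&self, inner: &Rect) -> bool {
        inner.x0 >= self.x0 - TOLERANCE_PT
            && inner.y0 >= self.y0 - TOLERANCE_PT
            && inner.x1 <= self.x1 + TOLERANCE_PT
            && inner.y1 <= self.y1 + TOLERANCE_PT
    }
}

/// Returns one message per violated containment rule (empty = all good).
pub fn check_page_boxes(
    media: Rect,
    bleed: Option<Rect>,
    trim: Option<Rect>,
    art: Option<Rect>,
) -> Vec<&'static str> {
    let mut violations = Vec::new();
    if let Some(b) = bleed {
        if !media.contains(&b) {
            violations.push("BleedBox is not inside MediaBox");
        }
    }
    if let Some(t) = trim {
        // Assumption: when no BleedBox is present, TrimBox is checked against MediaBox.
        match bleed {
            Some(b) if !b.contains(&t) => violations.push("TrimBox is not inside BleedBox"),
            None if !media.contains(&t) => violations.push("TrimBox is not inside MediaBox"),
            _ => {}
        }
    }
    if let Some(a) = art {
        if !media.contains(&a) {
            violations.push("ArtBox is not inside MediaBox");
        }
    }
    violations
}
```

Returning a list of violations rather than a single boolean lets a validator report every failed relationship in one pass.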
- Spec-correct clipping (PDF §8.5.4) - Clip state scoped to `q`/`Q` save/restore via clip stack; new clips intersect with existing clip region; `W`/`W*` no longer consume the current path (deferred to next paint operator); clip mask applied to all painting operations including text and images.
- Glyph advance width calculation - Text position advances per PDF spec §9.4.4: `tx = (w0/1000 × Tfs + Tc + Tw) × Th` with 600-unit default glyph width.
- Form XObject rendering - Parses `/Matrix` transform, uses form's `/Resources` (or inherits from parent), and recursively executes form content stream operators.
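As a worked illustration of the glyph advance formula `tx = (w0/1000 × Tfs + Tc + Tw) × Th`, here is a minimal sketch. The function name and parameter layout are assumptions made for this example, not PDFOxide's internals. It also assumes the caller passes `tw = 0.0` unless the glyph is a single-byte space, and that `th` is given as a fraction (1.0 = 100%).

```rust
/// Fallback glyph width in 1/1000 em units, per the 600-unit default
/// mentioned in the changelog.
const DEFAULT_GLYPH_WIDTH: f64 = 600.0;

/// Horizontal advance for one glyph:
///   tx = (w0 / 1000 * Tfs + Tc + Tw) * Th
///
/// * `w0`  - glyph width in 1/1000 em units, or None if the font has no entry
/// * `tfs` - font size
/// * `tc`  - character spacing
/// * `tw`  - word spacing (assumed 0.0 unless the glyph is a space)
/// * `th`  - horizontal scaling as a fraction (1.0 = 100%)
pub fn glyph_advance(w0: Option<f64>, tfs: f64, tc: f64, tw: f64, th: f64) -> f64 {
    let width = w0.unwrap_or(DEFAULT_GLYPH_WIDTH);
    (width / 1000.0 * tfs + tc + tw) * th
}
```

For example, a 500-unit glyph at 12pt with no extra spacing advances 6 units, while a glyph with no width entry at 10pt and `Tc = 0.5` advances 6.5 units because the 600-unit default applies.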
- Missing objects resolve to Null - Per PDF spec §7.3.10, unresolvable indirect references now return `Null` instead of errors, fixing 16 files across veraPDF/pdf.js corpora.
- Lenient header version parsing - Fixed fast-path bug where valid headers with unusual version strings were rejected.
- Non-standard encryption algorithm matching - V=1,R=3 combinations now handled leniently instead of rejected.
- Non-dictionary Resources - Pages with invalid `/Resources` entries (e.g., Null, Integer) treated as empty resources instead of erroring.
- Null nodes in page tree - Null or non-dictionary child nodes in page tree gracefully skipped during traversal.
- Corrupt content streams - Malformed content streams return empty content instead of propagating parse errors.
- Enhanced page tree scanning - `/Resources`+`/Parent` heuristic and `/Kids` direct resolution added as fallback passes for damaged page trees.
- Bogus /Count bounds checking - Page count validated against PDF spec Annex C.2 limit (8,388,607) and total object count; unreasonable values fall back to tree scanning.
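The `/Count` bounds check can be sketched as a small guard function. The name `trusted_page_count` and its signature are hypothetical. The sketch only shows the idea of rejecting a declared count that exceeds either the Annex C.2 limit quoted in the changelog or the number of objects in the file, so that the caller can fall back to scanning the tree.

```rust
/// Upper bound quoted in the changelog (PDF spec Annex C.2).
const MAX_PAGE_COUNT: i64 = 8_388_607;

/// Returns the declared page count if it is plausible, or None to signal
/// that the caller should fall back to scanning the page tree.
pub fn trusted_page_count(declared: i64, total_objects: usize) -> Option<usize> {
    if declared < 0 || declared > MAX_PAGE_COUNT {
        return None;
    }
    let count = declared as usize;
    // A file cannot contain more pages than it has objects.
    if count > total_objects {
        return None;
    }
    Some(count)
}
```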
- Content stream image extraction - `extract_images()` now processes page content streams to find `Do` operator calls, extracting images referenced via XObjects that were previously missed.
- Nested Form XObject images - Recursive extraction with cycle detection handles images inside Form XObjects.
- Inline images - `BI`...`ID`...`EI` sequences parsed with abbreviation expansion per PDF spec.
- CTM transformations - Image bounding boxes correctly transformed using full 4-corner affine transform (handles rotation, shear, and negative scaling).
- ColorSpace indirect references - Resolved indirect references (e.g., `7 0 R`) in image color space entries before extraction.
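To show why a 4-corner transform matters for image bounding boxes, here is a minimal sketch. The `Matrix` struct and `transformed_bbox` function are illustrative names, not the library's types. Transforming only two opposite corners gives the wrong box under rotation or shear, so all four corners are mapped and the min/max taken.

```rust
/// PDF-style affine matrix [a b c d e f]:
///   x' = a*x + c*y + e
///   y' = b*x + d*y + f
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Matrix {
    pub a: f64,
    pub b: f64,
    pub c: f64,
    pub d: f64,
    pub e: f64,
    pub f: f64,
}

impl Matrix {
    pub fn apply(&self, x: f64, y: f64) -> (f64, f64) {
        (
            self.a * x + self.c * y + self.e,
            self.b * x + self.d * y + self.f,
        )
    }
}

/// Axis-aligned bounding box (min_x, min_y, max_x, max_y) of the rectangle
/// (x0, y0)-(x1, y1) after transforming all four corners by `m`.
pub fn transformed_bbox(m: &Matrix, x0: f64, y0: f64, x1: f64, y1: f64) -> (f64, f64, f64, f64) {
    let corners = [
        m.apply(x0, y0),
        m.apply(x1, y0),
        m.apply(x0, y1),
        m.apply(x1, y1),
    ];
    let mut min_x = f64::INFINITY;
    let mut min_y = f64::INFINITY;
    let mut max_x = f64::NEG_INFINITY;
    let mut max_y = f64::NEG_INFINITY;
    for &(x, y) in corners.iter() {
        min_x = min_x.min(x);
        min_y = min_y.min(y);
        max_x = max_x.max(x);
        max_y = max_y.max(y);
    }
    (min_x, min_y, max_x, max_y)
}
```

Since PDF images are painted into the unit square, an image's page-space box would be `transformed_bbox(&ctm, 0.0, 0.0, 1.0, 1.0)` under this sketch.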
- Multi-line object headers - Parser now handles `1 0\nobj` format used by Google-generated PDFs instead of requiring `1 0 obj` on a single line.
- Extended header search - Header search window extended from 1024 to 8192 bytes to handle PDFs with large binary prefixes.
- Lenient version parsing - Malformed version strings like `%PDF-1.a` or truncated headers no longer cause parse failures in lenient mode.
- Missing Contents entry - Pages without a `/Contents` key now return empty content data instead of erroring.
- Cyclic page tree detection - Page tree traversal tracks visited nodes to prevent stack overflow on malformed circular references.
- Null stream references - Null or invalid stream references handled gracefully instead of panicking.
- Wider page scanning fallback - Page scanning fallback triggers on more error conditions, improving compatibility with damaged PDFs.
- Pages without /Type entry - Page scanning now finds pages missing the `/Type /Page` entry by checking for `/MediaBox` or `/Contents` keys.
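Cycle-safe page tree traversal can be sketched with a visited set. The data model here is a deliberate simplification and an assumption of this example: object numbers are plain `u32` values, `kids` maps each intermediate node to its children, and any node absent from `kids` counts as a leaf page. An explicit stack is used instead of recursion so that deep or circular trees cannot overflow the call stack.

```rust
use std::collections::{HashMap, HashSet};

/// Collects leaf pages in document order, skipping any node that was
/// already visited (which breaks cycles in malformed page trees).
pub fn collect_pages(root: u32, kids: &HashMap<u32, Vec<u32>>) -> Vec<u32> {
    let mut visited = HashSet::new();
    let mut pages = Vec::new();
    let mut stack = vec![root];

    while let Some(node) = stack.pop() {
        // `insert` returns false if the node was seen before.
        if !visited.insert(node) {
            continue;
        }
        match kids.get(&node) {
            // Push children in reverse so the first child is processed first.
            Some(children) => {
                for &child in children.iter().rev() {
                    stack.push(child);
                }
            }
            None => pages.push(node),
        }
    }
    pages
}
```

With a tree where node 1 has kids `[2, 3]` and node 2 has kids `[1, 4]` (a back-reference to the root), the traversal yields pages 4 and 3 and terminates.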
- Short encryption key panic - AES decryption with undersized keys now returns an error instead of panicking.
- Xref stream parsing hardened - Malformed xref streams with invalid entry sizes or out-of-bounds data no longer cause panics.
- Indirect /Encrypt references - `/Encrypt` dictionary values that are indirect references are now resolved before parsing.
- Dictionary-as-Stream fallback - When a stream object is a bare dictionary (no stream data), it is now treated as an empty stream instead of causing a decode error.
- Filter abbreviations - Abbreviated filter names (`AHx`, `A85`, `LZW`, `Fl`, `RL`, `CCF`, `DCT`) and case-insensitive matching now supported.
- Operator limit - Content stream parsing enforces a configurable operator limit (default 1,000,000) to prevent pathological slowdowns on malformed streams.
- Structure tree indirect object references - `ObjectRef` variants in structure tree `/K` entries are now resolved at parse time instead of being silently skipped, ensuring complete structure tree traversal.
- Lexer `R` token disambiguation - `tag(b"R")` no longer matches the `R` prefix of `RG`/`ri`/`re` operators; `1 0 RG` is now correctly parsed as a color operator instead of indirect reference `1 0 R` + orphan `G`.
- Stream whitespace trimming - `trim_leading_stream_whitespace` now only strips CR/LF (0x0D/0x0A), no longer strips NUL bytes (0x00) or spaces from binary stream data (fixes grayscale image extraction and object stream parsing).
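The `R` token disambiguation can be illustrated without nom. The sketch below is plain Rust with a hypothetical function name and is not the lexer's actual code: it accepts `R` as the indirect-reference keyword only when the next byte is end of input, PDF whitespace, or a PDF delimiter, so `RG` is left for the operator parser.

```rust
/// PDF whitespace characters: NUL, TAB, LF, FF, CR, SPACE.
fn is_pdf_whitespace(b: u8) -> bool {
    matches!(b, 0x00 | 0x09 | 0x0A | 0x0C | 0x0D | 0x20)
}

/// PDF delimiter characters.
fn is_pdf_delimiter(b: u8) -> bool {
    matches!(
        b,
        b'(' | b')' | b'<' | b'>' | b'[' | b']' | b'{' | b'}' | b'/' | b'%'
    )
}

/// If `input` starts with a standalone `R` keyword, returns the remaining
/// input after it; returns None when `R` is merely the prefix of a longer
/// token such as `RG`.
pub fn take_ref_keyword(input: &[u8]) -> Option<&[u8]> {
    let rest = input.strip_prefix(b"R")?;
    match rest.first() {
        None => Some(rest),
        Some(&b) if is_pdf_whitespace(b) || is_pdf_delimiter(b) => Some(rest),
        Some(_) => None,
    }
}
```

Checking the byte after the keyword is a common way to enforce token boundaries in hand-written lexers; a combinator-based lexer would express the same lookahead with its own primitives.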
- 8 previously ignored tests un-ignored and fixed:
- `test_extract_raw_grayscale_image_from_xobject` - Fixed stream trimming stripping binary pixel data.
- `test_parse_object_stream_with_whitespace` - Fixed stream trimming affecting object stream offsets.
- `test_parse_object_stream_graceful_failure` - Relaxed assertion for improved parser recovery.
- `test_markdown_reading_order_top_to_bottom` - Fixed test coordinates to use PDF convention (Y increases upward).
- `test_html_layout_multiple_elements` - Fixed assertions for per-character positioning.
- `test_reading_order_graph_based_simple` - Fixed test coordinates to PDF convention.
- `test_reading_order_two_columns` - Fixed test coordinates to PDF convention.
- `test_parse_color_operators` - Fixed lexer R/RG token disambiguation.
- Deleted empty `PdfImage` stub (`src/images.rs`) and its module export - image extraction uses `ImageInfo` from `src/extractors/images.rs`.
- Deleted commented-out `DocumentType::detect()` test block in `src/extractors/gap_statistics.rs`.
- Removed stale TODO comments in `scripts/setup-hooks.sh`, `src/bin/analyze_pdf_features.rs`, `src/document.rs`.
🥇 @SeanPedersen — Huge thanks for reporting multiple issues (#41, #44, #45, #46) that drove the entire stability focus of this release. His real-world testing uncovered a parser bug with Google-generated PDFs, image extraction failures on content stream references, and performance problems — each report triggering deep investigation and significant fixes. The parser robustness, image extraction, and testing infrastructure improvements in v0.3.5 all trace back to Sean's thorough bug reports. 🙏🔍
Parsing Robustness, Character Extraction & XObject Paths
- `parse_header()` function signature - Now includes offset tracking.
- Before: `parse_header(reader) -> Result<(u8, u8)>`
- After: `parse_header(reader, lenient) -> Result<(u8, u8, u64)>`
- Migration: Replace `let (major, minor) = parse_header(&mut reader)?;` with `let (major, minor, _offset) = parse_header(&mut reader, true)?;`
- Note: This is a public API function; consider using `doc.version()` for typical use cases instead.
- Header offset support - PDFs with binary prefixes or BOM headers now open successfully.
- Parse header function now searches first 1024 bytes for `%PDF-` marker (PDF spec compliant).
- Supports UTF-8 BOM, email headers, and other leading binary data.
- `parse_header()` returns byte offset where header was found.
- Lenient mode (default) handles real-world malformed PDFs; strict mode for compliance testing.
- Fixes parsing errors like "expected '%PDF-', found '1b965'".
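The header offset search can be sketched as a scan for the `%PDF-` marker within a bounded window. This is an illustration under stated assumptions, not the real `parse_header()`: the function name `find_header` is hypothetical, it works on an in-memory byte slice rather than a reader, it only accepts single-digit `major.minor` versions, and it omits the strict/lenient distinction.

```rust
/// Searches the first `window` bytes of `data` for the `%PDF-` marker and
/// returns (major, minor, byte_offset) when a `D.D` version follows it.
pub fn find_header(data: &[u8], window: usize) -> Option<(u8, u8, u64)> {
    const MARKER: &[u8] = b"%PDF-";

    // The marker itself must start and end inside the search window.
    let limit = data.len().min(window);
    let offset = data[..limit]
        .windows(MARKER.len())
        .position(|w| w == MARKER)?;

    // Expect exactly "<digit>.<digit>" right after the marker.
    let start = offset + MARKER.len();
    let version = data.get(start..start + 3)?;
    match version {
        [major, b'.', minor] if major.is_ascii_digit() && minor.is_ascii_digit() => {
            Some((major - b'0', minor - b'0', offset as u64))
        }
        _ => None,
    }
}
```

For a file that begins with a UTF-8 BOM followed by `%PDF-1.4`, this returns version 1.4 at offset 3, which is the value a caller would add to every byte offset read from the cross-reference data.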
- `extract_chars()` API - Low-level character-level extraction for layout analysis.
- Returns `Vec<TextChar>` with per-character positioning, font, and styling data.
- Includes transformation matrix, rotation angle, advance width.
- Sorted in reading order (top-to-bottom, left-to-right).
- Overlapping characters (rendered multiple times) deduplicated.
- 30-50% faster than span extraction for character-only use cases.
- Exposed in both Rust and Python APIs.
- Python binding: `doc.extract_chars(page_index)` returns list of `TextChar` objects.
- Form XObject support in path extraction - Now extracts vectors from embedded XObjects.
- `extract_paths()` recursively processes Form XObjects via `Do` operator.
- Image XObjects properly skipped (only Form XObjects extracted).
- Coordinate transformations via `/Matrix` properly applied.
- Graphics state properly isolated (save/restore).
- Duplicate XObject detection prevents infinite loops.
- Nested XObjects (XObject containing XObject) supported.
- Dependencies: Upgraded nom parser library from 7.1 to 8.0.
- Updated all parser combinators to use `.parse()` method.
- No user-facing API changes.
- All parser functionality maintained.
- Performance stable (no regressions detected).
- `parse_header()` signature updated: now returns `(major, minor, offset)` tuple.
- All `parse_header` test cases updated to use new signature.
Form Fields, Multimedia & Python 3.8-3.14
- Parent/Child Field Structures - Create complex form hierarchies like `address.street`, `address.city`.
- `add_parent_field()` - Create container fields without widgets.
- `add_child_field()` - Add child fields to existing parents.
- `add_form_field_hierarchical()` - Auto-create parent hierarchy from dotted names.
- `ParentFieldConfig` for configuring container fields.
- Property inheritance between parent and child fields (FT, V, DV, Ff, DA, Q).
- Edit All Field Properties - Beyond just values.
- `set_form_field_readonly()` / `set_form_field_required()` - Flag manipulation.
- `set_form_field_rect()` - Reposition/resize fields.
- `set_form_field_tooltip()` - Set hover text (TU).
- `set_form_field_max_length()` - Text field length limits.
- `set_form_field_alignment()` - Text alignment (left/center/right).
- `set_form_field_default_value()` - Default values (DV).
- `BorderStyle` and `AppearanceCharacteristics` support.
- Critical Bug Fix - Modified existing fields now persist on save (was only saving new fields).
- Forms Data Format Export - ISO 32000-1:2008 Section 12.7.7.
- `FdfWriter` - Binary FDF export for form data exchange.
- `XfdfWriter` - XML XFDF export for web integration.
- `export_form_data_fdf()` / `export_form_data_xfdf()` on FormExtractor, DocumentEditor, Pdf.
- Hierarchical field representation in exports.
- TextChar Transformation - Per-character positioning metadata (#27).
- `origin` - Font baseline coordinates (x, y).
- `rotation_degrees` - Character rotation angle.
- `matrix` - Full transformation matrix.
- Essential for pdfium-render migration.
- DPI Calculation - Resolution metadata for images.
- `horizontal_dpi` / `vertical_dpi` fields on `ImageContent`.
- `resolution()` - Get (h_dpi, v_dpi) tuple.
- `is_high_resolution()` / `is_low_resolution()` / `is_medium_resolution()` helpers.
- `calculate_dpi()` - Compute from pixel dimensions and bbox.
- Spatial Filtering - Extract text from rectangular regions.
- `RectFilterMode::Intersects` - Any overlap (default).
- `RectFilterMode::FullyContained` - Completely within bounds.
- `RectFilterMode::MinOverlap(f32)` - Minimum overlap fraction.
- `TextSpanSpatial` trait - `intersects_rect()`, `contained_in_rect()`, `overlap_with_rect()`.
- `TextSpanFiltering` trait - `filter_by_rect()`, `extract_text_in_rect()`.
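The three spatial filter modes can be made concrete with a small sketch. The enum variant names mirror the changelog, but the `Rect` layout, the `span_matches` function, and the choice to define the overlap fraction as intersection area divided by the span's own area are assumptions of this example and may differ from the library's definitions.

```rust
/// Hypothetical rectangle: origin (x, y) plus width and height.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub x: f32,
    pub y: f32,
    pub width: f32,
    pub height: f32,
}

/// Variant names follow the changelog; semantics here are illustrative.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RectFilterMode {
    Intersects,
    FullyContained,
    MinOverlap(f32),
}

/// Area of the intersection of `a` and `b` (0.0 when they do not overlap).
fn intersection_area(a: &Rect, b: &Rect) -> f32 {
    let w = (a.x + a.width).min(b.x + b.width) - a.x.max(b.x);
    let h = (a.y + a.height).min(b.y + b.height) - a.y.max(b.y);
    if w > 0.0 && h > 0.0 {
        w * h
    } else {
        0.0
    }
}

/// Decides whether a text span's box passes the filter for `region`.
pub fn span_matches(span: &Rect, region: &Rect, mode: RectFilterMode) -> bool {
    let overlap = intersection_area(span, region);
    match mode {
        RectFilterMode::Intersects => overlap > 0.0,
        RectFilterMode::FullyContained => {
            span.x >= region.x
                && span.y >= region.y
                && span.x + span.width <= region.x + region.width
                && span.y + span.height <= region.y + region.height
        }
        RectFilterMode::MinOverlap(min_fraction) => {
            let area = span.width * span.height;
            area > 0.0 && overlap / area >= min_fraction
        }
    }
}
```

A span that straddles the region boundary half in and half out passes `Intersects` and `MinOverlap(0.5)` but fails `FullyContained`, which is the practical difference when cropping text from table cells or columns.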
- MovieAnnotation - Embedded video content.
- SoundAnnotation - Audio content with playback controls.
- ScreenAnnotation - Media renditions (video/audio players).
- RichMediaAnnotation - Flash/video rich media content.
- ThreeDAnnotation - 3D model embedding.
- U3D and PRC format support.
- `ThreeDView` - Camera angles and lighting.
- `ThreeDAnimation` - Playback controls.
- PathExtractor - Vector graphics extraction.
- Lines, curves, rectangles, complex paths.
- Path transformation and bounding box calculation.
- XfaExtractor - Extract XFA form data.
- XfaParser - Parse XFA XML templates.
- XfaConverter - Convert XFA forms to AcroForm.
- True Python 3.8-3.14 Support - Fixed via `abi3-py38` (was only working on 3.11).
- Modern Tooling - uv, pdm, ruff integration.
- Code Quality - All Python code formatted with ruff.
🥇 @monchin - Massive thanks for revolutionizing our Python ecosystem! Your PR #29 fixed a critical compatibility issue where PDFOxide only worked on Python 3.11 despite claiming 3.8+ support. By switching to abi3-py38, you enabled true cross-version compatibility (Python 3.8-3.14). The introduction of modern tooling (uv, pdm, ruff) brings PDFOxide's Python development to 2026 standards. This work directly enables thousands more Python developers to use PDFOxide. 💪🐍
🥈 @bikallem - Thanks for the thoughtful feature request (#27) comparing PDFOxide to pdfium-render. Your detailed analysis of missing origin coordinates and rotation angles led directly to our TextChar transformation feature. This makes PDFOxide a viable migration path for pdfium-render users. 🎯
Unified API, PDF Creation & Editing
- One API for Extract, Create, and Edit - The new `Pdf` class unifies all PDF operations.
- `Pdf::open("input.pdf")` - Open existing PDF for reading and editing.
- `Pdf::from_markdown(content)` - Create new PDF from Markdown.
- `Pdf::from_html(content)` - Create new PDF from HTML.
- `Pdf::from_text(content)` - Create new PDF from plain text.
- `Pdf::from_image(path)` - Create PDF from image file.
- DOM-like page navigation with `pdf.page(0)` for querying and modifying content.
- Seamless save with `pdf.save("output.pdf")` or `pdf.save_encrypted()`.
- Fluent Builder Pattern - `PdfBuilder` for advanced configuration.
- Example: `PdfBuilder::new().title("My Document").author("Author Name").page_size(PageSize::A4).from_markdown("# Content")?`
- PDF Creation API - Fluent `DocumentBuilder` for programmatic PDF generation.
- `Pdf::create()` / `DocumentBuilder::new()` entry points.
- Page sizing (Letter, A4, custom dimensions).
- Text rendering with Base14 fonts and styling.
- Image embedding (JPEG/PNG) with positioning.
- Table Rendering - `TableRenderer` for styled tables.
- Headers, borders, cell spans, alternating row colors.
- Column width control (fixed, percentage, auto).
- Cell alignment and padding.
- Graphics API - Advanced visual effects.
- Colors (RGB, CMYK, grayscale).
- Linear and radial gradients.
- Tiling patterns with presets.
- Blend modes and transparency (ExtGState).
- Page Templates - Reusable page elements.
- Headers and footers with placeholders.
- Page numbering formats.
- Watermarks (text-based).
- Barcode Generation (requires `barcodes` feature)
- QR codes with configurable size and error correction.
- Code128, EAN-13, UPC-A, Code39, ITF barcodes.
- Customizable colors and dimensions.
- Editor API - DOM-like editing with round-trip preservation.
- `DocumentEditor` for modifying existing PDFs.
- Content addition without breaking existing structure.
- Resource management for fonts and images.
- Annotation Support - Full read/write for all types.
- Text markup: highlights, underlines, strikeouts, squiggly.
- Notes: sticky notes, comments, popups.
- Shapes: rectangles, circles, lines, polygons, polylines.
- Drawing: ink/freehand annotations.
- Stamps: standard and custom stamps.
- Special: file attachments, redactions, carets.
- Form Fields - Interactive form creation.
- Text fields (single/multiline, password, comb).
- Checkboxes with custom appearance.
- Radio button groups.
- Dropdown and list boxes.
- Push buttons with actions.
- Form flattening (convert fields to static content).
- Link Annotations - Navigation support.
- External URLs.
- Internal page navigation.
- Styled link appearance.
- Outline Builder - Bookmark/TOC creation.
- Hierarchical structure.
- Page destinations.
- Styling (bold, italic, colors).
- PDF Layers - Optional Content Groups (OCG).
- Create and manage content layers.
- Layer visibility controls.
- PDF/A Validation - ISO 19005 compliance checking.
- PDF/A-1a, PDF/A-1b (ISO 19005-1).
- PDF/A-2a, PDF/A-2b, PDF/A-2u (ISO 19005-2).
- PDF/A-3a, PDF/A-3b (ISO 19005-3).
- PDF/A Conversion - Convert documents to archival format.
- Automatic font embedding.
- XMP metadata injection.
- ICC color profile conversion.
- PDF/X Validation - ISO 15930 print production compliance.
- PDF/X-1a:2001, PDF/X-1a:2003.
- PDF/X-3:2002, PDF/X-3:2003.
- PDF/X-4, PDF/X-4p.
- PDF/X-5g, PDF/X-5n, PDF/X-5pg.
- PDF/X-6, PDF/X-6n, PDF/X-6p.
- 40+ specific error codes for violations.
- PDF/UA Validation - ISO 14289 accessibility compliance.
- Tagged PDF structure validation.
- Language specification checks.
- Alt text requirements.
- Heading hierarchy validation.
- Table header validation.
- Form field accessibility.
- Reading order verification.
- Encryption on Write - Password-protect PDFs when saving.
- AES-256 (V=5, R=6) - Modern 256-bit encryption (default).
- AES-128 (V=4, R=4) - Modern 128-bit encryption.
- RC4-128 (V=2, R=3) - Legacy 128-bit encryption.
- RC4-40 (V=1, R=2) - Legacy 40-bit encryption.
- `Pdf::save_encrypted()` for simple password protection.
- `Pdf::save_with_encryption()` for full configuration.
- Permission Controls - Granular access restrictions.
- Print, copy, modify, annotate permissions.
- Form fill and accessibility extraction controls.
- Digital Signatures (foundation, requires `signatures` feature)
- ByteRange calculation for signature placeholders.
- PKCS#7/CMS signature structure support.
- X.509 certificate parsing.
- Signature verification framework.
- Page Labels - Custom page numbering.
- Roman numerals, letters, decimal formats.
- Prefix support (e.g., "A-1", "B-2").
- `PageLabelsBuilder` for creation.
- Extract existing labels from documents.
- XMP Metadata - Extensible metadata support.
- Dublin Core properties (title, creator, description).
- PDF properties (producer, keywords).
- Custom namespace support.
- Full read/write capability.
- Embedded Files - File attachments.
- Attach files to PDF documents.
- MIME type and description support.
- Relationship specification (Source, Data, etc.).
- Linearization - Web-optimized PDFs.
- Fast web view support.
- Streaming delivery optimization.
- Text Search - Pattern-based document search.
- Regex pattern support.
- Case-sensitive/insensitive options.
- Position tracking with page/coordinates.
- Whole word matching.
- Page Rendering (requires `rendering` feature)
- Render pages to PNG/JPEG images.
- Configurable DPI and scale.
- Pure Rust via tiny-skia (no external dependencies).
- Debug Visualization (requires `rendering` feature)
- Visualize text bounding boxes.
- Element highlighting for debugging.
- Export annotated page images.
- Office to PDF (requires `office` feature)
- DOCX: Word documents with paragraphs, headings, lists, formatting.
- XLSX: Excel spreadsheets via calamine (sheets, cells, tables).
- PPTX: PowerPoint presentations (slides, titles, text boxes).
- `OfficeConverter` with auto-detection.
- `OfficeConfig` for page size, margins, fonts.
- Python bindings: `OfficeConverter.from_docx()`, `from_xlsx()`, `from_pptx()`.
- `Pdf` class for PDF creation.
- `Color`, `BlendMode`, `ExtGState` for graphics.
- `LinearGradient`, `RadialGradient` for gradients.
- `LineCap`, `LineJoin`, `PatternPresets` for styling.
- `save_encrypted()` method with permission flags.
- `OfficeConverter` class for Office document conversion.
- Description updated to "The Complete PDF Toolkit: extract, create, and edit PDFs".
- Python module docstring updated for v0.3.0 features.
- Branding updated with Extract/Create/Edit pillars.
- Outline action handling - correctly dereference actions indirectly referenced by outline items.
🥇 @jvantuyl - Thanks for the thorough PR #16 fixing outline action dereferencing! Your investigation uncovered that some PDFs embed actions directly while others use indirect references - a subtle PDF spec detail that was breaking bookmark navigation. Your fix included comprehensive tests ensuring this won't regress. 🔍✨
🙏 @mert-kurttutan - Thanks for the honest feedback in issue #15 about README clutter. Your perspective as a new user helped us realize we were overwhelming people with information. The resulting documentation cleanup makes PDFOxide more approachable. 📚
CJK Support & Structure Tree Enhancements
- TagSuspect/MarkInfo support (ISO 32000-1 Section 14.7.1).
- Parse MarkInfo dictionary from document catalog (`marked`, `suspects`, `user_properties`).
- `PdfDocument::mark_info()` method to retrieve MarkInfo.
- Automatic fallback to geometric ordering when structure tree is marked as suspect.
- Word Break /WB structure element (Section 14.8.4.4).
- Support for explicit word boundaries in CJK text.
- `StructType::WB` variant and `is_word_break()` helper.
- Word break markers emitted during structure tree traversal.
- Predefined CMap support for CJK fonts (Section 9.7.5.2).
- Adobe-GB1 (Simplified Chinese) - ~500 common character mappings.
- Adobe-Japan1 (Japanese) - Hiragana, Katakana, Kanji mappings.
- Adobe-CNS1 (Traditional Chinese) - Bopomofo and CJK mappings.
- Adobe-Korea1 (Korean) - Hangul and Hanja mappings.
- Fallback identity mapping for common Unicode ranges.
- Abbreviation expansion /E support (Section 14.9.5).
- Parse `/E` entry from marked content properties.
- `expansion` field on `StructElem` for structure-level abbreviations.
- Object reference resolution utility.
- `PdfDocument::resolve_references()` for recursive reference handling in complex PDF structures.
- Type 0 /W array parsing for CIDFont glyph widths.
- Proper spacing for CJK text using CIDFont width specifications.
- ActualText verification tests - comprehensive test coverage for PDF Spec Section 14.9.4.
- Soft hyphen handling (U+00AD) - now correctly treated as valid continuation hyphen for word reconstruction.
- Enhanced artifact filtering with subtype support.
- `ArtifactType::Pagination` with subtypes: Header, Footer, Watermark, PageNumber.
- `ArtifactType::Layout` and `ArtifactType::Background` classification.
- `OrderedContent.mcid` changed to `Option<u32>` to support word break markers.
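The Type 0 `/W` array parsing listed in this release handles two entry forms defined by the PDF spec: `c [w1 w2 ... wn]` assigns consecutive widths starting at CID `c`, and `c_first c_last w` assigns one width to a CID range. A minimal sketch follows. The `WItem` enum and `parse_w_array` function are hypothetical stand-ins for the library's object model, and the sketch simply stops at the first malformed entry.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a parsed PDF array element.
#[derive(Debug, Clone, PartialEq)]
pub enum WItem {
    Num(f64),
    Array(Vec<f64>),
}

/// Expands a CIDFont /W array into a CID -> width map.
pub fn parse_w_array(items: &[WItem]) -> HashMap<u32, f64> {
    let mut widths = HashMap::new();
    let mut i = 0;
    while i < items.len() {
        match (&items[i], items.get(i + 1), items.get(i + 2)) {
            // Form 1: c [w1 w2 ... wn]
            (WItem::Num(first), Some(WItem::Array(ws)), _) => {
                for (k, w) in ws.iter().enumerate() {
                    widths.insert(*first as u32 + k as u32, *w);
                }
                i += 2;
            }
            // Form 2: c_first c_last w
            (WItem::Num(first), Some(WItem::Num(last)), Some(WItem::Num(w))) => {
                for cid in (*first as u32)..=(*last as u32) {
                    widths.insert(cid, *w);
                }
                i += 3;
            }
            // Malformed or truncated tail: keep what was parsed so far.
            _ => break,
        }
    }
    widths
}
```

Production code would also want to cap the size of a `c_first c_last` range, since a hostile file can declare an enormous span.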
Image Embedding & Export
- Image embedding: Both HTML and Markdown now support embedded base64 images when `embed_images=true` (default).
- HTML: `<img src="data:image/png;base64,...">`
- Markdown: (works in Obsidian, Typora, VS Code, Jupyter).
- Image file export: Set `embed_images=false` + `image_output_dir` to save images as files with relative path references.
- New `embed_images` option in `ConversionOptions` to control embedding behavior.
- `PdfImage::to_base64_data_uri()` method for converting images to data URIs.
- `PdfImage::to_png_bytes()` method for in-memory PNG encoding.
- Python bindings: new `embed_images` parameter for `to_html`, `to_markdown`, and `*_all` methods.
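To show what a base64 data URI for an embedded image looks like, here is a dependency-free sketch. The function names are hypothetical and this is not how `PdfImage::to_base64_data_uri()` is implemented; a real project would more likely use an existing base64 crate. The encoder follows the standard alphabet with `=` padding.

```rust
/// Standard base64 alphabet (RFC 4648).
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Encodes bytes as padded base64.
pub fn base64_encode(data: &[u8]) -> String {
    let mut out = String::with_capacity((data.len() + 2) / 3 * 4);
    for chunk in data.chunks(3) {
        // Pack up to three bytes into a 24-bit group, zero-filling the tail.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        // Emit '=' padding for positions that have no source byte.
        out.push(if chunk.len() > 1 {
            ALPHABET[(n >> 6) as usize & 63] as char
        } else {
            '='
        });
        out.push(if chunk.len() > 2 {
            ALPHABET[n as usize & 63] as char
        } else {
            '='
        });
    }
    out
}

/// Wraps already-encoded PNG bytes in a data URI suitable for an
/// `<img src="...">` attribute or a Markdown image link.
pub fn png_data_uri(png_bytes: &[u8]) -> String {
    format!("data:image/png;base64,{}", base64_encode(png_bytes))
}
```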
CTM Fix & Formula Rendering
- CTM (Current Transformation Matrix) now correctly applied to text positions per PDF Spec ISO 32000-1:2008 Section 9.4.4 (#11).
- Structure tree: `/Alt` (alternate description) parsing for accessibility text on formulas and figures.
- Structure tree: `/Pg` (page reference) resolution - correctly maps structure elements to page numbers.
- `FormulaRenderer` module for extracting formula regions as base64 images from rendered pages.
- `ConversionOptions`: new fields `render_formulas`, `page_images`, `page_dimensions` for formula image embedding.
- Regression tests for CTM transformation.
🐛➡️✅ @mert-kurttutan - Thanks for the detailed bug report (#11) with reproducible sample PDF! Your report exposed a fundamental CTM transformation bug affecting text positioning across the entire library. This fix was critical for production use. 🎉
BT/ET Matrix Reset & Text Processing
- BT/ET matrix reset per PDF spec Section 9.4.1 (PR #10 by @drahnr).
- Geometric spacing detection in markdown converter (#5).
- Verbose extractor logs changed from info to trace (#7).
- docs.rs build failure (excluded tesseract-rs).
- `apply_intelligent_text_processing()` method for ligature expansion, hyphenation reconstruction, and OCR cleanup (#6).
- Removed unused tesseract-rs dependency.
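The BT/ET matrix reset can be illustrated with a tiny text-state model. The names below are hypothetical and the sketch covers only `BT` and `Td`. It shows the rule from PDF spec Section 9.4.1 that both the text matrix and the text line matrix start as identity at the beginning of every text object, so positions from a previous block do not accumulate.

```rust
/// Affine matrix stored as [a, b, c, d, e, f].
pub type Matrix = [f64; 6];

pub const IDENTITY: Matrix = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0];

/// Minimal text state: text matrix (Tm) and text line matrix (Tlm).
#[derive(Debug, Clone, PartialEq)]
pub struct TextState {
    pub tm: Matrix,
    pub tlm: Matrix,
}

impl TextState {
    pub fn new() -> Self {
        TextState { tm: IDENTITY, tlm: IDENTITY }
    }

    /// `BT`: start a text object by resetting both matrices to identity.
    pub fn begin_text(&mut self) {
        self.tm = IDENTITY;
        self.tlm = IDENTITY;
    }

    /// `tx ty Td`: Tlm = [1 0 0 1 tx ty] x Tlm, then Tm = Tlm.
    pub fn move_text(&mut self, tx: f64, ty: f64) {
        self.tlm[4] += tx * self.tlm[0] + ty * self.tlm[2];
        self.tlm[5] += tx * self.tlm[1] + ty * self.tlm[3];
        self.tm = self.tlm;
    }
}
```

Without the reset in `begin_text`, a second block positioned with `50 50 Td` after a first block at `100 700 Td` would land at (150, 750) instead of (50, 50).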
🥇 @drahnr - Huge thanks for PR #10 fixing the BT/ET matrix reset issue! This was a subtle PDF spec compliance bug (Section 9.4.1) where text matrices weren't being reset between text blocks, causing positions to accumulate and become unusable. Your fix restored correct text positioning for all PDFs. 💪📐
🔬 @JanIvarMoldekleiv - Thanks for the detailed bug report (#5) about missing spaces and lost table structure! Your analysis even identified the root cause in the code - the markdown converter wasn't using geometric spacing analysis. This level of investigation made the fix straightforward. 🕵️♂️
🎯 @Borderliner - Thanks for two important catches! Issue #6 revealed that apply_intelligent_text_processing() was documented but not actually available (oops! 😅), and #7 caught our overly verbose INFO-level logging flooding terminals. Both fixed immediately! 🔧
Discoverability Improvements
- Optimized crate keywords for better discoverability.
Encrypted PDF Fixes
- Encrypted stream decoding improvements (#3).
- CI/CD pipeline fixes.
🥇 @threebeanbags - Huge thanks for PRs #2 and #3 fixing encrypted PDF support! 🔐 Your first PR identified that decryption needed to happen before decompression - a critical ordering issue. Your follow-up PR #3 went deeper, fixing encryption handler initialization timing and adding Form XObject encryption support. These fixes made PDFOxide actually work with password-protected PDFs in production. 💪🎉
- Encrypted stream decoding (#2).
- Documentation and doctest fixes.
- Encrypted stream decoding refinements.
- Python 3.13 support.
- GitHub sponsor configuration.
- Cross-platform binary builds (Linux, macOS, Windows).
- Initial release.
- PDF text extraction with spec-compliant Unicode mapping.
- Intelligent reading order detection.
- Python bindings via PyO3.
- Support for encrypted PDFs.
- Form field extraction.
- Image extraction.
💖 @magnus-trent - Thanks for issue #1, our first community feedback! Your message that PDFOxide "unlocked an entire pipeline" you'd been working on for a month validated that we were solving real problems. Early encouragement like this keeps open source projects going. 🚀