From 1c4cd33edabe203650c1985cbd72dfedddc155a5 Mon Sep 17 00:00:00 2001 From: Jamie Lemon Date: Tue, 2 Jun 2026 22:32:15 +0100 Subject: [PATCH] Documentation updates for version 1.28.0 --- docs/about-feature-matrix.rst | 10 +++ docs/about.rst | 4 +- docs/app3.rst | 82 ++++++++++++++++++++++ docs/archive-class.rst | 2 +- docs/converting-files.rst | 120 ++++++++++++++++++++++++++++----- docs/document.rst | 66 ++++++++++++------ docs/how-to-open-a-file.rst | 23 ++++++- docs/images/icons/icon-md.svg | 10 +++ docs/page.rst | 2 +- docs/recipes-stories.rst | 9 +-- docs/supported-files-table.rst | 6 ++ docs/the-basics.rst | 5 +- 12 files changed, 284 insertions(+), 55 deletions(-) create mode 100644 docs/images/icons/icon-md.svg diff --git a/docs/about-feature-matrix.rst b/docs/about-feature-matrix.rst index 6f94130d2..f3efcf63d 100644 --- a/docs/about-feature-matrix.rst +++ b/docs/about-feature-matrix.rst @@ -55,6 +55,10 @@ :width: 0 :height: 0 +.. image:: images/icons/icon-md.svg + :width: 0 + :height: 0 + .. raw:: html @@ -181,6 +185,11 @@ background-size: 40px 40px; } + #feature-matrix .icon.md { + background: url("_images/icon-md.svg") 0 0 transparent no-repeat; + background-size: 40px 40px; + } + @@ -207,6 +216,7 @@ CBZ SVG TXT + MD Image
DOCX diff --git a/docs/about.rst b/docs/about.rst index f5e8b5b24..159922ea3 100644 --- a/docs/about.rst +++ b/docs/about.rst @@ -60,6 +60,8 @@ The following table illustrates how |PyMuPDF| compares with other typical soluti Therefore input files are mostly in a form that's useful for text extraction. + If faithful reproduction of layout is important, then consider using :ref:`PyMuPDF Pro `. + ---- @@ -97,7 +99,7 @@ The following table illustrates what features the products offer: - PyMuPDF Pro - PyMuPDF4LLM * - **Input Documents** - - `PDF`, `XPS`, `EPUB`, `CBZ`, `MOBI`, `FB2`, `SVG`, `TXT`, Images (*standard document types*) + - `PDF`, `XPS`, `EPUB`, `CBZ`, `MOBI`, `FB2`, `SVG`, `TXT`, `MD`, Images (*standard document types*) - *as PyMuPDF* and: `DOC`/`DOCX`, `XLS`/`XLSX`, `PPT`/`PPTX`, `HWP`/`HWPX` - *as PyMuPDF* diff --git a/docs/app3.rst b/docs/app3.rst index 086f21a47..1aeb9f6c3 100644 --- a/docs/app3.rst +++ b/docs/app3.rst @@ -421,6 +421,88 @@ Typical document page sizes are **ISO A4** and **Letter**. A **Letter** page has + +.. _CSS_Support: + +CSS Support +-------------------------------------------- + +For now, only a subset of CSS properties are supported. + +The underlying C library MuPDF supports a subset of HTML4 and CSS2. The primary goal of the HTML/CSS support is to serve as a popular and convenient way to style text — not to faithfully reproduce websites in PDF. + +What Works +~~~~~~~~~~~~~ + +The following list shows the supported properties, grouped by category. + +Box Model & Layout +"""""""""""""""""" + +``margin``, ``margin-top``, ``margin-right``, ``margin-bottom``, ``margin-left``, ``padding``, ``padding-top``, ``padding-right``, ``padding-bottom``, ``padding-left``, ``width``, ``height``, ``display``, ``position``, ``top``, ``right``, ``bottom``, ``left``, ``inset``, ``overflow-wrap``, ``columns`` + +.. note:: + + The properties ``position`` & ``display`` are supported in a very limited way. 
Only the values ``position: relative`` and ``display: block`` are supported. + + +Border +"""""""""""""""""" + +``border``, ``border-top``, ``border-right``, ``border-bottom``, ``border-left``, ``border-color``, ``border-style``, ``border-width``, ``border-spacing``, ``border-collapse``, ``border-top-color``, ``border-right-color``, ``border-bottom-color``, ``border-left-color``, ``border-top-style``, ``border-right-style``, ``border-bottom-style``, ``border-left-style``, ``border-top-width``, ``border-right-width``, ``border-bottom-width``, ``border-left-width`` + +Background +"""""""""""""""""" + +``background``, ``background-color`` + +.. note:: + + Background images are not supported, but the ``background`` property can be used to set a background color for a text block, which is then rendered as a filled rectangle behind the text. + +Font +"""""""""""""""""" + +``font``, ``font-family``, ``font-size``, ``font-style``, ``font-variant``, ``font-weight`` + +Text +"""""""""""""""""" + +``color``, ``letter-spacing``, ``line-height``, ``text-align``, ``text-decoration``, ``text-indent``, ``text-transform``, ``word-spacing``, ``white-space``, ``vertical-align``, ``direction``, ``hyphens`` + +List +"""""""""""""""""" + +``list-style``, ``list-style-image``, ``list-style-position``, ``list-style-type`` + +Page +"""""""""""""""""" + +``page-break-before``, ``page-break-after``, ``orphans``, ``widows`` + +Visibility +"""""""""""""""""""""""""""""""""""" + +``visibility`` + +MuPDF-specific / WebKit extensions +"""""""""""""""""""""""""""""""""""" + +``-mupdf-leading``, ``-webkit-text-fill-color``, ``-webkit-text-stroke-color``, ``-webkit-text-stroke-width`` + +Other +"""""""""""""""""" + +``src`` (for @font-face), ``overflow-wrap`` + + + + +What Doesn't Work +~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Modern CSS (CSS3+): no ``flexbox``, ``grid``, ``custom properties`` (--vars), ``calc()``, ``transitions``, ``animations``, ``position: absolute`` / ``fixed``, ``float``, ``clear`` and so 
on. + .. rubric:: Footnotes .. [#f1] MuPDF supports "deep-copying" objects between PDF documents. To avoid duplicate data in the target, it uses so-called "graftmaps", like a form of scratchpad: for each object to be copied, its :data:`xref` number is looked up in the graftmap. If found, copying is skipped. Otherwise, the new :data:`xref` is recorded and the copy takes place. PyMuPDF makes use of this technique in two places so far: :meth:`Document.insert_pdf` and :meth:`Page.show_pdf_page`. This process is fast and very efficient, because it prevents multiple copies of typically large and frequently referenced data, like images and fonts. However, you may still want to consider using garbage collection (option 4) in any of the following cases: diff --git a/docs/archive-class.rst b/docs/archive-class.rst index ab3c0b7c5..3faa04b7d 100644 --- a/docs/archive-class.rst +++ b/docs/archive-class.rst @@ -10,7 +10,7 @@ Archive This class represents a generalization of file folders and container files like ZIP and TAR archives. Archives allow accessing arbitrary collections of file folders, ZIP / TAR files and single binary data elements as if they all were part of one hierarchical tree of folders. -In PyMuPDF, archives are currently only used by :ref:`Story` objects to specify where to look for fonts, images and other resources. +In PyMuPDF, archives are currently only used by :ref:`Story` objects and as an :ref:`option when opening files ` to specify where to look for fonts, images and other resources. 
================================ =================================================== **Method / Attribute** **Short Description** diff --git a/docs/converting-files.rst b/docs/converting-files.rst index d27da3679..475f57563 100644 --- a/docs/converting-files.rst +++ b/docs/converting-files.rst @@ -11,7 +11,7 @@ Converting Files Files to PDF ~~~~~~~~~~~~~~~~~~ -:ref:`Document types supported by PyMuPDF` can easily be converted to |PDF| by using the :meth:`Document.convert_to_pdf` method. This method returns a buffer of data which can then be utilized by |PyMuPDF| to create a new |PDF|. +:ref:`Document types supported by PyMuPDF ` can easily be converted to |PDF| by using the :meth:`Document.convert_to_pdf` method. This method returns a buffer of data which can then be utilized by |PyMuPDF| to create a new |PDF|. @@ -20,38 +20,97 @@ Files to PDF .. code-block:: python import pymupdf - + + # Convert Markdown to PDF + md_doc = pymupdf.open("example.md") + pdfdata = md_doc.convert_to_pdf() + pdf_doc = pymupdf.open(stream=pdfdata) + pdf_doc.save("example.pdf") + + # Convert XPS to PDF xps = pymupdf.open("input.xps") - pdfbytes = xps.convert_to_pdf() - pdf = pymupdf.open("pdf", pdfbytes) + pdfdata = xps.convert_to_pdf() + pdf = pymupdf.open(stream=pdfdata) pdf.save("output.pdf") +.. _Markdown_to_PDF: +Markdown to PDF +~~~~~~~~~~~~~~~~~ -PDF to SVG -~~~~~~~~~~~~~~~~~~ +As Markdown files are supported input files they can be easily converted to PDF using the :meth:`Document.convert_to_pdf` method. -Technically, as SVG files cannot be multipage, we must export each page as an SVG. +In the simplest case you can just open the Markdown file and call the method to get a PDF representation of the content. -To get an SVG representation of a page use the :meth:`Page.get_svg_image` method. 
-**Example** + +Defining paper size +""""""""""""""""""" + +The default paper size is a 400 x 600 :doc:`rect`. To use a custom paper size, pass the `rect` parameter when opening the file, for example: .. code-block:: python - import pymupdf + md_doc = pymupdf.open("example.md", rect=pymupdf.paper_rect("A4")) # A4 size - doc = pymupdf.open("input.pdf") - page = doc[0] - # Convert page to SVG - svg_content = page.get_svg_image() +Defining CSS +"""""""""""" + +By default, the Markdown content will be converted to PDF using a default CSS stylesheet. However, you can specify your own CSS stylesheet to customize the appearance of the resulting PDF. To do this, define your CSS as a string and apply it with :meth:`Document.apply_css`. + +For example, to make all ``h1`` headers red (the single ``#`` symbol in Markdown), you could do the following: + +.. code-block:: python + + md_doc = pymupdf.open( # open the Markdown document in A4 size + "example.md", + rect=pymupdf.paper_rect("A4") + ) + + css = "h1 {color:red;}" + md_doc.apply_css(css) + + pdf_doc = pymupdf.open(stream=md_doc.convert_to_pdf()) + pdf_doc.ez_save("red-colored-header.pdf") + +.. note:: + + The :ref:`support for CSS ` is currently limited. + + +Defining Fonts +""""""""""""""""" + +Fonts can be defined by using the `archive` parameter to provide a custom :ref:`Archive` containing the font files. + +The font files must exist in the archive passed to the `archive` parameter when opening the Markdown file. The CSS can then refer to these files by their names as stored in the archive. + +For example, assuming you have access to the font files for "Comic Sans", you could use it for all text as follows: + +.. code-block:: python + + # Global CSS instructions to use the "Comic Sans" font for all text. The font files must be provided in the archive. 
+ css = """ + @font-face {font-family: sans-serif; src: url(comic.ttf);} + @font-face {font-family: sans-serif; src: url(comicbd.ttf); font-weight: bold;} + @font-face {font-family: sans-serif; src: url(comicz.ttf); font-weight: bold; font-style: italic;} + @font-face {font-family: sans-serif; src: url(comici.ttf); font-style: italic;} + """ + + archive = pymupdf.Archive("C:/Windows/Fonts") # the fonts are here + archive.add(".") # also search the current folder for resources such as images + + md_file = "sample.md" + md_doc = pymupdf.open( # open the Markdown document + md_file, + archive=archive, # where to look for resources (fonts, images) + rect=pymupdf.paper_rect("A4"), # page dimension ISO A4 + ) + + md_doc.apply_css(css) + - # Save to file - with open("output.svg", "w", encoding="utf-8") as f: - f.write(svg_content) - doc.close() PDF to Markdown @@ -72,6 +131,31 @@ By utlilizing the :doc:`PyMuPDF4LLM API ` we are able to conver pathlib.Path("4llm-output.md").write_bytes(md_text.encode()) +PDF to SVG +~~~~~~~~~~~~~~~~~~ + +Technically, as SVG files cannot be multipage, we must export each page as an SVG. + +To get an SVG representation of a page use the :meth:`Page.get_svg_image` method. + +**Example** + +.. code-block:: python + + import pymupdf + + doc = pymupdf.open("input.pdf") + page = doc[0] + + # Convert page to SVG + svg_content = page.get_svg_image() + + # Save to file + with open("output.svg", "w", encoding="utf-8") as f: + f.write(svg_content) + + doc.close() + PDF to DOCX ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ diff --git a/docs/document.rst b/docs/document.rst index a703f5f13..0ea00e760 100644 --- a/docs/document.rst +++ b/docs/document.rst @@ -10,7 +10,7 @@ Document This class represents a document. It can be constructed from a file or from memory. -There exists the alias *open* for this class, i.e. `pymupdf.Document(...)` and `pymupdf.open(...)` do exactly the same thing. +There is an alias :meth:`open` for this class, i.e. 
`pymupdf.Document(...)` and `pymupdf.open(...)` do exactly the same thing. For details on **embedded files** refer to Appendix 3. @@ -29,6 +29,7 @@ For details on **embedded files** refer to Appendix 3. ======================================= ========================================================== :meth:`Document.add_layer` PDF only: make new optional content configuration :meth:`Document.add_ocg` PDF only: add new optional content group +:meth:`Document.apply_css` Markdown only: apply CSS stylesheet to Markdown content :meth:`Document.authenticate` gain access to an encrypted document :meth:`Document.bake` PDF only: make annotations / fields permanent content :meth:`Document.can_save_incrementally` check if incremental save is possible @@ -169,7 +170,7 @@ For details on **embedded files** refer to Appendix 3. pair: rect; Document pair: fontsize; Document - .. method:: __init__(self, filename=None, stream=None, *, filetype=None, rect=None, width=0, height=0, fontsize=11) + .. method:: __init__(self, filename=None, stream=None, filetype=None, archive=None, rect=None, width=0, height=0, fontsize=11) Create a ``Document`` object. @@ -183,11 +184,13 @@ For details on **embedded files** refer to Appendix 3. :arg str filetype: A string specifying the type of document. This is only ever needed when file content inspection fails. Text types like "txt", "html", "xml" etc. cannot be disambiguated by their content. When such files are provided in memory or being provided with the wrong file extension, this parameter **must** be used. - :arg rect_like rect: a rectangle specifying the desired page size. This parameter is only meaningful for documents with a variable page layout ("reflowable" documents), like e-books or HTML, and ignored otherwise. If specified, it must be a non-empty, finite rectangle with top-left coordinates (0, 0). Together with parameter :data:`fontsize`, each page will be accordingly laid out and hence also determine the number of pages. 
+ :arg Archive archive: An optional :ref:`Archive` object to use as a source for resources like fonts and images. - :arg float width: may used together with ``height`` as an alternative to ``rect`` to specify layout information. + :arg rect_like rect: A rectangle specifying the desired page size. This parameter is only meaningful for documents with a variable page layout ("reflowable" documents), like e-books, MD or HTML, and ignored otherwise. If specified, it must be a non-empty, finite rectangle with top-left coordinates (0, 0). Together with parameter :data:`fontsize`, each page will be accordingly laid out and hence also determine the number of pages. - :arg float height: may used together with ``width`` as an alternative to ``rect`` to specify layout information. + :arg float width: May be used together with ``height`` as an alternative to ``rect`` to specify layout information. + + :arg float height: May be used together with ``width`` as an alternative to ``rect`` to specify layout information. :arg float fontsize: the default :data:`fontsize` for reflowable document types. This parameter is ignored if none of the parameters ``rect`` or ``width`` and ``height`` are specified. Will be used to calculate the page layout. @@ -201,24 +204,29 @@ For details on **embedded files** refer to Appendix 3. In case of problems you can see more detail in the internal messages store: `print(pymupdf.TOOLS.mupdf_warnings())` (which will be emptied by this call, but you can also prevent this -- consult :meth:`Tools.mupdf_warnings`). 
- Overview of possible forms, note: `open` is a synonym of `Document`:: - >>> # from a file - >>> doc = pymupdf.open("some.xps") - >>> # handle wrong extension - >>> doc = pymupdf.open("some.file", filetype="xps") # assert expected type - >>> doc = pymupdf.open("some.file", filetype="txt") # treat as plain text - >>> - >>> # from memory - >>> doc = pymupdf.open(stream=mem_area) # works for any supported type - >>> doc = pymupdf.open(stream=unknown-type, filetype="txt") # treat as plain text - >>> - >>> # new empty PDF - >>> doc = pymupdf.open() - >>> doc = pymupdf.open(None) - >>> doc = pymupdf.open("") + Overview of possible forms, note: :meth:`open` is a synonym of :meth:`Document`:: + + # from a file + doc = pymupdf.open("some.xps") + # handle wrong extension + doc = pymupdf.open("some.file", filetype="xps") # assert expected type + doc = pymupdf.open("some.file", filetype="txt") # treat as plain text + + # from memory + doc = pymupdf.open(stream=mem_area) # works for any supported type + doc = pymupdf.open(stream=unknown_type, filetype="txt") # treat as plain text + + # new empty PDF + doc = pymupdf.open() + doc = pymupdf.open(None) + doc = pymupdf.open("") - .. note:: Raster images with a wrong (but supported) file extension **are no problem**. MuPDF will determine the correct image type when file **content** is actually accessed and will process it without complaint. + .. note:: + + Raster images with a wrong (but supported) file extension **are no problem**. MuPDF will determine the correct image type when file **content** is actually accessed and will process it without complaint. + + See :ref:`supported file types ` for more information. The Document class can be also be used as a **context manager**. Exiting the content manager will close the document automatically. @@ -2030,6 +2038,20 @@ For details on **embedded files** refer to Appendix 3. This is a normal PDF document with no usage restrictions whatsoever. 
If it is not being changed in any way, it can be used together with its journal to undo / redo operations or continue updating. + .. method:: apply_css(css, append=True) + + * New in v1.28.0 + + Apply CSS styles to the document. This is a global operation, which means that the styles will be applied to all pages and all elements of the document. The CSS syntax is the same as for HTML documents, but only a subset of CSS properties is supported. + + :arg str css: a string containing the CSS styles to be applied. + :arg bool append: whether to append the new styles to existing ones (if any) or to replace them. + + .. note:: This method is primarily intended for use with :ref:`Markdown documents `. + + + + .. attribute:: outline Contains the first :ref:`Outline` entry of the document (or `None`). Can be used as a starting point to walk through all outline items. Accessing this property for encrypted, not authenticated documents will raise an *AttributeError*. @@ -2064,7 +2086,7 @@ For details on **embedded files** refer to Appendix 3. .. attribute:: is_reflowable - ``True`` if document has a variable page layout (like e-books or HTML). In this case you can set the desired page dimensions during document creation (open) or via method :meth:`layout`. + ``True`` if document has a variable page layout (like e-books, HTML or Markdown). In this case you can set the desired page dimensions during document creation (open) or via method :meth:`layout`. :type: bool diff --git a/docs/how-to-open-a-file.rst b/docs/how-to-open-a-file.rst index d40fb2cf9..06ddf0039 100644 --- a/docs/how-to-open-a-file.rst +++ b/docs/how-to-open-a-file.rst @@ -69,9 +69,15 @@ To open a file, do the following: doc = pymupdf.open("a.pdf") -.. note:: The above creates a :ref:`Document`. The instruction `doc = pymupdf.Document("a.pdf")` does exactly the same. So, `open` is just a convenient alias and you can find its full API documented in that chapter. +.. note:: The above creates a :ref:`Document`. 
The instruction `doc = pymupdf.Document("a.pdf")` does exactly the same. So, :meth:`open` is just a convenient alias. +To open an empty document, just do: + +.. code-block:: python + + doc = pymupdf.open() + File Recognizer: Opening with :index:`a Wrong File Extension ` """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" @@ -192,5 +198,20 @@ And so on! As you can imagine many text based file formats can be *very simply opened* and *interpreted* by |PyMuPDF|. This can make data analysis and extraction for a wide range of previously unavailable files possible. +---------- + + +.. _Full_Options_for_Opening_a_File: + +Full Options for Opening a File +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The `pymupdf.open` function has a number of parameters to give you full control over how files are opened. For the full API, please see the :ref:`Document` chapter, as `open` is just an alias for the :meth:`Document` constructor. + +.. method:: open(filename=None, stream=None, filetype=None, archive=None, rect=None, width=0, height=0, fontsize=11) + + See the :meth:`Document` constructor for details. + + :return: A document object. .. include:: footer.rst diff --git a/docs/images/icons/icon-md.svg b/docs/images/icons/icon-md.svg new file mode 100644 index 000000000..68df30252 --- /dev/null +++ b/docs/images/icons/icon-md.svg @@ -0,0 +1,10 @@ + + + + + + + + + + diff --git a/docs/page.rst b/docs/page.rst index 57069b81d..4c4c9ef4f 100644 --- a/docs/page.rst +++ b/docs/page.rst @@ -854,7 +854,7 @@ In a nutshell, this is what you can do with PyMuPDF: :arg rect_like rect: rectangle on page to receive the text. :arg str,Story text: the text to be written. Can contain a mixture of plain text and HTML tags with styling instructions. Alternatively, a :ref:`Story` object may be specified (in which case the internal Story generation step will be omitted). A Story must have been generated with all required styling and Archive information. 
- :arg str css: optional string containing additional CSS instructions. This parameter is ignored if ``text`` is a Story. + :arg str css: optional string containing additional CSS instructions. This parameter is ignored if ``text`` is a Story. See :ref:`CSS_Support` for more. :arg float scale_low: if necessary, scale down the content until it fits in the target rectangle. This sets the down scaling limit. Default is 0, no limit. A value of 1 means no down-scaling permitted. A value of e.g. 0.2 means maximum down-scaling by 80%. :arg Archive archive: an Archive object that points to locations where to find images or non-standard fonts. If ``text`` refers to images or non-standard fonts, this parameter is required. This parameter is ignored if ``text`` is a Story. :arg int rotate: one of the values 0, 90, 180, 270. Depending on this, text will be filled: diff --git a/docs/recipes-stories.rst b/docs/recipes-stories.rst index d633b9281..1ca5b1dc6 100644 --- a/docs/recipes-stories.rst +++ b/docs/recipes-stories.rst @@ -549,14 +549,7 @@ in a PDF-specific API.] .. note:: - At the time of writing the HTML engine for Stories is fairly basic and supports a subset of CSS2 attributes. - - Some important CSS support to consider: - - - The only available layout is relative layout. - - `background` is unavailable, use `background-color` instead. - - `float` is unavailable. - + At the time of writing the HTML engine for Stories is fairly basic and supports a subset of CSS2 attributes - see :ref:`CSS_Support` for details. .. 
include:: footer.rst diff --git a/docs/supported-files-table.rst b/docs/supported-files-table.rst index b0dcd7476..bdb4da1da 100644 --- a/docs/supported-files-table.rst +++ b/docs/supported-files-table.rst @@ -101,6 +101,11 @@ background-size: 40px 40px; } + #feature-matrix .icon.md { + background: url("_images/icon-md.svg") 0 0 transparent no-repeat; + background-size: 40px 40px; + } + @@ -122,6 +127,7 @@ CBZ SVG TXT + MD diff --git a/docs/the-basics.rst b/docs/the-basics.rst index 36a9f276c..da60d7c4e 100644 --- a/docs/the-basics.rst +++ b/docs/the-basics.rst @@ -1085,9 +1085,8 @@ Another example could be redacting an area of a page, but not to redact any line Converting PDF Documents ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -We recommend the pdf2docx_ library which uses |PyMuPDF| and the **python-docx** library to provide simple document conversion from |PDF| to **DOCX** format. - - +See :doc:`converting-files` for how to convert |PDF| documents to other formats and vice versa. .. include:: footer.rst + \ No newline at end of file