Take Charge of PDF in GNU Emacs

Display popup annotation in docview


 

All or None

DocView is the default major mode for displaying PDFs in Emacs. It uses Ghostscript to convert each page into an image in a temporary folder (/tmp/docview<nnnn>) and then displays those images. For a large PDF document, converting every page up front is unnecessary in most cases, especially when one is only looking at the document for quick reference.
 
An ideal solution would be to convert pages on demand. For this, however, one needs to know the number of pages in the document. That number can be obtained either from an external tool such as pdfinfo or by parsing the metadata in the document. So here's some elisp code that does the latter.
 
(require 'pdf)
(pdf-get "Root.Pages.Count")
 
or
 
M-x pdf-total-pages
 
;; Metadata about the document
(pdf-get "Root.Metadata")
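
For comparison, the external-tool route can be sketched in a few lines. This is a minimal sketch, assuming pdfinfo (from poppler-utils) is installed and prints a "Pages:" line; the two my-pdfinfo-* helpers are hypothetical names and are not part of DocView or of the pdf package used above.

```elisp
;; Hypothetical helpers for illustration only.

(defun my-pdfinfo-parse-pages (output)
  "Return the page count found in OUTPUT, the text printed by pdfinfo.
Return nil when OUTPUT contains no \"Pages:\" line."
  (when (string-match "^Pages:[ \t]*\\([0-9]+\\)" output)
    (string-to-number (match-string 1 output))))

(defun my-pdfinfo-page-count (file)
  "Run pdfinfo on FILE and return its page count.
Return nil when pdfinfo is not installed or exits with an error."
  (when (executable-find "pdfinfo")
    (with-temp-buffer
      ;; Collect pdfinfo's output in the temporary buffer.
      (when (eql 0 (call-process "pdfinfo" nil t nil
                                 (expand-file-name file)))
        (my-pdfinfo-parse-pages (buffer-string))))))

;; Parsing is kept separate so it can be tried without a PDF at hand:
;; (my-pdfinfo-parse-pages "Title: demo\nPages:          12\n")  ; => 12
```

Spawning a process for every file is slower than reading the document directly, which is one reason to prefer parsing in elisp.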
 
 
So here we are - now that one can parse the metadata, what else can we do?

Form fields and Embedded scripts


M-x pdf-form

PDF supports embedded JavaScript for adding simple interactivity to PDF forms. You can list the scripts embedded in the document, which lets you make an informed decision before answering a viewer's warning dialog about running scripts.
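
As a rough first check that needs no PDF library, one can also scan the raw file for the names that introduce script actions, /JavaScript and /JS. The sketch below is only a heuristic, and its helper names are hypothetical: scripts stored inside compressed object streams will not be found this way, and a match only says that a script-related name is present.

```elisp
;; Hypothetical helpers for illustration only.

(defun my-pdf-count-script-markers (text)
  "Return how often the names /JavaScript or /JS occur in TEXT."
  (let ((case-fold-search nil)          ; PDF names are case-sensitive
        (count 0)
        (start 0))
    ;; The name must end at a non-alphanumeric character or at the end
    ;; of TEXT, so that longer names such as /JSON are not counted.
    (while (string-match
            "/\\(JavaScript\\|JS\\)\\(?:[^A-Za-z0-9]\\|\\'\\)" text start)
      (setq count (1+ count)
            start (match-end 1)))
    count))

(defun my-pdf-file-script-markers (file)
  "Return the number of script-related names in the raw bytes of FILE."
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    (my-pdf-count-script-markers (buffer-string))))

;; (my-pdf-count-script-markers "<< /S /JavaScript /JS (app.alert(1);) >>")
;; => 2
```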

 

There's many a slip...

Were you ever caught off-guard after uploading a filled PDF form to a website and seeing an incomplete form when verifying it in the browser? Don't blame yourself - it's not your fault.

The PDF specification has many versions, interpretations and implementations. Then there are linearized (optimized for the web) and flattened (best for printing, but without interactivity) variants.
 
Initially, the cross-reference (xref) table was placed at the end of the PDF for easy random access to any object in the document. With the advent of the web, Linearized PDF added a cross-reference section for the first page near the beginning of the file, so that a viewer can start rendering before the whole file has been downloaded. However, when a PDF form is saved, new objects are appended at the end of the file together with a new cross-reference section, so the layout at the front of the file no longer describes the whole document. Flattening rewrites the document so that the filled-in values become part of the page content; the cross-reference table (or cross-reference streams in later versions) is regenerated in the process.
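
To see where a file keeps its most recent cross-reference section, one can look for the startxref keyword, which is followed by the byte offset of that section. The following is a minimal sketch with a hypothetical helper name; it works on a string holding the tail of the file and does not validate the offset it finds.

```elisp
;; Hypothetical helper for illustration only.

(defun my-pdf-last-startxref (tail)
  "Return the offset that follows the last startxref keyword in TAIL.
TAIL is a string with the final part of a PDF file.  Return nil when
no startxref keyword is present."
  (let ((offset nil)
        (start 0))
    ;; Keep scanning so that the value from the last match wins.
    (while (string-match "startxref[ \t\r\n]+\\([0-9]+\\)" tail start)
      (setq offset (string-to-number (match-string 1 tail))
            start (match-end 0)))
    offset))

;; A file that was saved once after creation has two such entries:
;; (my-pdf-last-startxref
;;  "startxref\n116\n%%EOF\n11 0 obj\nendobj\nstartxref\n5021\n%%EOF\n")
;; => 5021
```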
 
Hence, not all PDF clients are the same, and their behavior differs. In such cases, your best bet is to use the suggested software throughout, i.e. both parties use the same software for creation and verification. If you must verify with different software and the form looks incorrect, try verifying the flattened version.

Annotations

Use the following examples to extract annotations from the document. The number of Kids levels and their subscripts will differ based on the document structure. If the annotation has rich content, a capable viewer can use it to enhance the display rather than showing plain text.
 
;; Annotation text
(pdf-get "Root.Pages.Kids[0].Kids[0].Kids[0].Annots[1].Contents")
 
;; Annotation Rich content XHTML
(pdf-get "Root.Pages.Kids[0].Kids[0].Kids[0].Annots[1].RC")
 
;; Popup object associated with Text annotation
;; Should have /Open true to show up
(pdf-get "Root.Pages.Kids[0].Kids[0].Kids[0].Annots[1].Popup") 
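
Because the depth of the Kids chain varies between documents, it can help to build these path strings programmatically. The helper below is a hypothetical convenience, not part of the pdf package; it only assembles a string in the same form as the examples above and assumes nothing about what pdf-get returns.

```elisp
;; Hypothetical helper for illustration only.

(defun my-pdf-annot-path (kids annot key)
  "Return a path string for KEY of annotation number ANNOT.
KIDS is a list of zero-based indices, one for each Kids level between
Root.Pages and the page that carries the annotation."
  (concat "Root.Pages"
          (mapconcat (lambda (i) (format ".Kids[%d]" i)) kids "")
          (format ".Annots[%d].%s" annot key)))

;; (my-pdf-annot-path '(0 0 0) 1 "Contents")
;; => "Root.Pages.Kids[0].Kids[0].Kids[0].Annots[1].Contents"
```

The resulting string can then be handed to pdf-get exactly as in the examples above.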
 
 
A text annotation represents a “sticky note” attached to a point in the PDF document. When closed, the annotation appears as an icon; when open, it displays a pop-up window containing the text of the note in a font and size chosen by the viewer application.
 
Use a to toggle the annotation and <tab>/<backtab> to cycle through annotations.
 
Note: If annotations are missing, ensure you're using Ghostscript (GS) for conversion and not mutool.
(setq doc-view-pdf->png-converter-function #'doc-view-pdf->png-converter-ghostscript)
 

Incremental update

Objects (and thereby their appearances) in a PDF document can be altered by adding an updated definition of the object at the end of the file. When you fill in a PDF form, that's how the values are saved in the document. For example, below is a sample of the PDF objects added while saving a text field with the value "form".
 
11 0 obj
<< /Type /Annot /Subtype /Widget /Parent 7 0 R /AP << /N 28 0 R>> /Rect [89 799 238 810] /F 4 /Border [0 0 0.72] /BS 16 0 R /MK << /BC [0.4 0.4 0.4]>> /V (form) /M (D:20221118060522)>>
endobj

28 0 obj
<< /Length 84 /Subtype /Form /Resources << /Font << /Helv 18 0 R>>>> /BBox [0 0 149 11]>> stream
/Tx BMC q 0.72 w 0.4 G 0 0 149 11 re S BT /Helv 6.66 Tf 0 g 1 0 0 1 0 0 Tm 2 3.38 Td (form) Tj ET Q EMC
endstream
endobj

  
Warning: If the PDF form has been updated and saved multiple times, chances are that the document still contains the complete history of edits.
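
One rough way to check for such history is to count the %%EOF markers in the raw file, since each incremental save normally appends a new trailer that ends with one. This is a heuristic sketch with a hypothetical helper name: a count above one suggests, but does not prove, that earlier revisions are still present, because linearized files can contain more than one marker without any updates.

```elisp
;; Hypothetical helper for illustration only.

(defun my-pdf-count-eof-markers (text)
  "Return the number of %%EOF markers in TEXT, the raw contents of a PDF."
  (let ((count 0)
        (start 0))
    (while (string-match "%%EOF" text start)
      (setq count (1+ count)
            start (match-end 0)))
    count))

;; (my-pdf-count-eof-markers "...\n%%EOF\n11 0 obj\n...\n%%EOF\n")
;; => 2
```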
 

Redundant Processes

In DocView, the conversion from PDF to PNG sometimes fails, and those processes are not killed automatically. You'll notice the conversion process number in the mode line. You can use K to kill those hanging processes.
 
