DynaPDF.ExtractText

Extracts the text of the page PageNum.

Component Version macOS Windows Linux Server iOS SDK
DynaPDF 8.0 ✅ Yes ✅ Yes ✅ Yes ✅ Yes ✅ Yes
MBS( "DynaPDF.ExtractText"; PDF; PageNum { ; Flags; AreaLeft; AreaTop; AreaRight; AreaBottom } )

Parameters

Parameter | Description | Example | Flags
PDF | The PDF reference returned from DynaPDF.New. | $pdf |
PageNum | The page number. | 1 |
Flags | The flags for text extraction. Can include Default, SortTextX, SortTextY, SortTextXY, DeleteOverlappingText and/or NoHeuristic. Usually you want SortTextX here. The flag MediaBox limits text extraction to the media box. The flag CropBox uses the crop box as the rectangle, or the media box if no crop box is set. | "SortTextX" | Optional
AreaLeft | The left coordinate of the area. | | Optional
AreaTop | The top coordinate of the area. | | Optional
AreaRight | The right coordinate of the area. | | Optional
AreaBottom | The bottom coordinate of the area. | | Optional

Result

Returns text or error.

Description

Extracts the text of the page PageNum.
The first page is denoted by 1.

Text lines can be sorted in x- and y-direction. The flag DeleteOverlappingText deletes identical text records that are placed at the same position (with a tolerance of 2 units). The records must occur one after the other to be detected.

The optional area parameters (AreaLeft, AreaTop, AreaRight, AreaBottom) restrict text extraction to a rectangle. The rectangle must be defined as the page appears in a PDF viewer: in bottom-up coordinates and with the page orientation taken into account. The page coordinate system is de-rotated before text extraction starts, since this produces better results. The page width and height must be calculated from the crop box if it is set, or from the media box otherwise. Note also that width and height must be exchanged if the orientation is 90, -90, 270, or -270 degrees.

If the function succeeds the return value is the text. If the function fails the return value is an error.

Special case: If this function is called with only two parameters, it redirects to the older function DynaPDF.ExtractDocumentText to stay compatible with existing scripts. If the area parameters are omitted or all zero, no area restriction is applied.

Needs DynaPDF Lite license.
Please use DynaPDF.SetCMapDir to define the CMap folder to handle encodings better.

If you have an open page, we close it automatically for you before doing the import.

See also ExtractText function in DynaPDF manual.

Examples

Extract some text:

Set Variable [ $pdf ; Value: MBS("DynaPDF.New") ]
Set Variable [ $r ; Value: MBS("DynaPDF.OpenPDFFromContainer";$pdf; Test::data) ]
Set Variable [ $r ; Value: MBS("DynaPDF.ImportPDFFile";$pdf) ]
Set Field [ Test::PageText ; MBS("DynaPDF.ExtractText"; $pdf; 1; 0) ]
# Cleanup
Set Variable [ $r ; Value: MBS("DynaPDF.Release"; $pdf) ]
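
Set the CMap folder before extracting (a sketch only: the folder path is a placeholder, and the parameter order of DynaPDF.SetCMapDir is assumed to be the PDF reference followed by the path, so please check that function's reference):

Set Variable [ $pdf ; Value: MBS("DynaPDF.New") ]
# Placeholder path, adjust to where your CMap files are stored
Set Variable [ $r ; Value: MBS("DynaPDF.SetCMapDir"; $pdf; "/Library/Application Support/DynaPDF/CMap") ]
Set Variable [ $r ; Value: MBS("DynaPDF.OpenPDFFromContainer"; $pdf; Test::data) ]
Set Variable [ $r ; Value: MBS("DynaPDF.ImportPDFFile"; $pdf) ]
Set Field [ Test::PageText ; MBS("DynaPDF.ExtractText"; $pdf; 1; "SortTextX") ]
# Cleanup
Set Variable [ $r ; Value: MBS("DynaPDF.Release"; $pdf) ]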

Extract text in area:

Set Field [ Test::PageText ; MBS("DynaPDF.ExtractText"; $pdf; 1; "Default"; 200; 200; 400; 400) ]
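
Extract the upper half of a rotated page (a sketch of the width and height exchange described above; the page size of 612 x 792 units and the orientation of 90 degrees are made-up values that you would normally read from the page, and the sketch assumes that in bottom-up coordinates the top edge has the larger y value):

# Hypothetical page box size and orientation
Set Variable [ $boxWidth ; Value: 612 ]
Set Variable [ $boxHeight ; Value: 792 ]
Set Variable [ $orientation ; Value: 90 ]
# Exchange width and height for 90, -90, 270 or -270 degrees
Set Variable [ $rotated ; Value: Mod ( Abs ( $orientation ) ; 180 ) = 90 ]
Set Variable [ $w ; Value: If ( $rotated ; $boxHeight ; $boxWidth ) ]
Set Variable [ $h ; Value: If ( $rotated ; $boxWidth ; $boxHeight ) ]
# Left = 0, Top = $h, Right = $w, Bottom = $h / 2
Set Field [ Test::PageText ; MBS("DynaPDF.ExtractText"; $pdf; 1; "SortTextX"; 0; $h; $w; $h / 2) ]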

Extract text with cropbox as rectangle:

Set Field [ Extract Text::Text of first page ; MBS( "DynaPDF.ExtractText"; $pdf; 1; "SortTextX CropBox NoHeuristic") ]
Set Field [ Extract Text::Text of second page ; MBS( "DynaPDF.ExtractText"; $pdf; 2; "SortTextX CropBox NoHeuristic") ]
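
Check for an error before storing the text (a sketch, assuming the plugin's general MBS("IsError") function is used to test the last plugin call; the dialog title is just an example):

Set Variable [ $text ; Value: MBS("DynaPDF.ExtractText"; $pdf; 1; "SortTextX") ]
If [ MBS("IsError") ]
    # $text contains the error message in this case
    Show Custom Dialog [ "Text extraction failed" ; $text ]
Else
    Set Field [ Test::PageText ; $text ]
End If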

This function checks for a license.

Created 21st December 2017, last changed 2nd November 2025

