PDF to XML Conversion
FormTrap's XML conversion is often used to convert PDF documents to input for computer systems as XML files. This was outside of the original specification, but is very simple to do, via the following three steps:
- Convert to Text using program pdftotxt.exe (this is a third-party program which FormTrap recommends but assumes NO responsibility for)
- Identify the sender and document type, using either Version 8 document identification or Version 7 rules
- Use the standard Version 8 Text to XML functions to generate the XML file, including rubbish removal and identification of sender. Version 7 Repagination is available as an alternative for rubbish removal
Note that this applies ONLY to PDFs that carry text rather than graphics. You have the option of first converting all-graphic PDFs to searchable PDFs, which inserts the required text using OCR methods. For OCR'd documents, the recognized text is often NOT 100% accurate.
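As a rough check of whether a PDF carries text at all, the output of the text conversion can be tested for visible characters. The sketch below is an illustration only and is not part of FormTrap; the function name has_text_layer and the threshold of 20 characters are assumptions chosen for the example.

```python
def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Return True if the extracted text holds at least min_chars visible characters.

    An all-graphic PDF typically converts to little more than blank lines
    and form feeds, all of which count as whitespace here.
    The threshold is an arbitrary assumption; tune it for your documents.
    """
    visible = [ch for ch in extracted if not ch.isspace()]
    return len(visible) >= min_chars
```

A PDF that fails this check is a candidate for OCR conversion to a searchable PDF before it is sent to the queue.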
Conversion of PDF to Text File
You may drag and drop the PDF to a Spooler Queue that runs the PDF to Text filter.
This must be the first filter program, after which the normal locale to UTF-8 filter (normally Western) converts the text to UTF-8, the normal FormTrap input format.
Check the alignment of the output and, if necessary, vary the pitch (the digits after -fixed) in the command line.
The same program may be run externally via a CMD prompt if required.
Note that although FormTrap uses this program, it is a third-party program and is NOT covered by your FormTrap warranty. Please report any errors to us so that we can review this recommendation.
This is the remainder of the "Target:" line (following program name):
-fixed 5 -layout %1 %2
where path is the full folder name.
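When the program is run externally, the same options can be supplied from a script. The following is a minimal sketch, assuming Python is available on the machine. The options -fixed and -layout and the input and output file arguments are taken from the Target line above; the function names, the executable location you pass in, and the choice of cp1252 as the "Western" source encoding are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def build_command(exe: str, pdf: str, txt: str, pitch: int = 5) -> list[str]:
    """Build the command line: program, -fixed <pitch>, -layout, input, output."""
    return [exe, "-fixed", str(pitch), "-layout", pdf, txt]


def to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode text from the locale encoding to UTF-8.

    cp1252 is an assumption for "Western"; use the encoding of your locale.
    """
    return raw.decode(source_encoding, errors="replace").encode("utf-8")


def convert(exe: str, pdf: str, txt: str, pitch: int = 5) -> None:
    """Run the converter, then rewrite its output file as UTF-8."""
    subprocess.run(build_command(exe, pdf, txt, pitch), check=True)
    out = Path(txt)
    out.write_bytes(to_utf8(out.read_bytes()))
```

Calling convert with a different pitch value is one way to experiment with alignment before settling on the value used in the Spooler filter.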
Identifying the PDF Document
Document identification (Version 8, master record rules, generally literals plus a Page: 1 test) quickly identifies the other party and document type.
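The idea behind identification by literals plus a Page: 1 test can be sketched as follows. This is an illustration of the logic only, not FormTrap's rule engine or file format; the party names, document types and literals in RULES are hypothetical.

```python
import re

# Hypothetical rules: (party, document type) -> literals that must all appear.
RULES = {
    ("Example Supplies Ltd", "Invoice"): ["Example Supplies Ltd", "TAX INVOICE"],
    ("Sample Freight Co", "Statement"): ["Sample Freight Co", "STATEMENT"],
}

# Matches "Page: 1" but not "Page: 10", "Page: 12" and so on.
FIRST_PAGE = re.compile(r"Page:\s*1\b")


def identify(page_text: str, rules: dict = RULES):
    """Return (party, document type) for a first page, or None if nothing matches."""
    if not FIRST_PAGE.search(page_text):
        return None
    for key, literals in rules.items():
        if all(literal in page_text for literal in literals):
            return key
    return None
```

Requiring every literal to match keeps two parties with similar names from being confused, and the Page: 1 test stops continuation pages from being treated as new documents.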
The alternative is to use Version 7 Split Rules to distribute documents to their respective queues for processing.
Do this with Version 7 Rules files, program shortcut SplitDef as shown.
Rules normally comprise the party name and document type. Below are two entries (parties) in a rule file. Include all parties as individual entries in the same rule file. Rules (highlights) are all "equal to" rules.
Upload the modified rules file to the FormTrap Server, set up the new queue to handle this party, and change the Process tab of the queue with the split rule to send this party's documents to the correct queue.
See here for information on the Version 7 Split Rules program.
Removing Rubbish
In Version 8, remove rubbish using Redundant records to remove subsequent page headers, detail headers and inter-page connectors.
Version 7 Repagination may be used as an alternative to remove rubbish from the file ahead of XML conversion.
Repagination is defined using program shortcut RePageDef as shown.
The following elements are normally defined, in this order:
- Header - first page (down to and including detail line headers)
- Trailer - document total (include everything; this avoids removing redundant lines later)
- Detail - page footer (with the Suppressed property ticked)
- Detail - page header for second and subsequent pages (also Suppressed)
- Detail - required lines (define as non-blank), which keeps all non-blank lines
Additional Details may be defined ahead of the required Detail line for specific inter-page connectors such as "... continues" and "Continued ..." (also Suppressed).
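The effect of these definitions on a text file can be sketched as follows. This is a simplified illustration, not the Repagination program itself: it assumes pages are separated by form feeds, treats the Header and Trailer simply as lines that are kept, and uses hypothetical patterns that you would replace with ones matching your own documents. Only the two connector strings come from the text above.

```python
import re

# Inter-page connectors named above; matched as whole lines, ignoring case.
CONNECTORS = [
    re.compile(r"^\s*\.\.\.\s*continues\s*$", re.IGNORECASE),
    re.compile(r"^\s*continued\s*\.\.\.\s*$", re.IGNORECASE),
]


def repaginate(text, page_header, page_footer, connectors=CONNECTORS):
    """Drop suppressed lines and blank lines, keeping everything else.

    page_header patterns are suppressed on second and subsequent pages only;
    page_footer and connector patterns are suppressed on every page.
    """
    kept = []
    for page_no, page in enumerate(text.split("\f")):
        for line in page.splitlines():
            if not line.strip():
                continue  # the required Detail keeps non-blank lines only
            if any(p.search(line) for p in list(connectors) + list(page_footer)):
                continue
            if page_no > 0 and any(p.search(line) for p in page_header):
                continue
            kept.append(line)
    return "\n".join(kept)
```

The order of the tests mirrors the order of the definitions: suppressed elements are checked first, and the catch-all non-blank rule applies only to lines that nothing else has claimed.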
Copy the modified, tested repagination ".rpg" file to the spooler, into the folder "Repagination Rules" within the %fthome folder. If that folder is not present, please create it. The address of the %fthome folder is shown in FormTrap Spooler under Setup, Core Components.
Use FormTrap Spooler's Setup, Filters to define this repagination, using the above rules file. Run this filter via the Filters tab as a Pre-identification filter.
See here for information on the Version 7 Re-Pagination program.
XML Conversion
Run the standard text to XML conversion to generate your XML file, and ensure the XML names follow the standards required by the systems that will receive the file.
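To show what agreed XML names mean in practice, the sketch below builds a small XML document from values already extracted from the text. It is an illustration only and does not represent the output of FormTrap's Text to XML functions; the element names (Invoice, Supplier, Lines, Line and so on) are hypothetical and should be replaced by the names your receiving system specifies.

```python
import xml.etree.ElementTree as ET


def to_xml(root_name: str, header: dict, lines: list) -> str:
    """Build an XML string from header fields and a list of detail lines.

    All element names are supplied by the caller; none are FormTrap names.
    """
    root = ET.Element(root_name)
    for name, value in header.items():
        ET.SubElement(root, name).text = value
    details = ET.SubElement(root, "Lines")
    for line in lines:
        line_el = ET.SubElement(details, "Line")
        for name, value in line.items():
            ET.SubElement(line_el, name).text = value
    return ET.tostring(root, encoding="unicode")
```

Keeping the names in one place, such as the dictionaries passed in here, makes it easier to check them against the receiving system's specification.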