How to Extract Text from a PDF Using Ghostscript (Command Line)

Extract text from PDF files using Ghostscript

This is a re-post of one of my favorite articles, which I originally published on 7/23/2018 on my old Blogger blog.

I would really like to revisit automating the extraction of text from PDF files. There is a lot of untapped value here that many companies could be leveraging but aren’t.

Recently, I received a request from a team member to find a way to:

  1. Extract a large amount of text from a large PDF file.
  2. Parse the extracted text and put specific elements into an Excel file.
  3. Format the Excel file into specific tabs, one for each type of report extracted, and add column headers.
  4. Create validation code that connects to a data warehouse through a web service, using an Ajax call from the Excel macro, to validate the data based on an ID in one of the columns.

Pretty cool, right? I just finished the prototype today, 7/31/2018.

In this article, I’ll cover the first step of this task: using a free tool called Ghostscript to extract text from a PDF file.

What is Ghostscript?

Ghostscript is a high-performance PostScript and PDF interpreter and rendering engine with the most comprehensive set of page description languages (PDLs) on the market today and technology conversion capabilities covering the PDF, PostScript, PCL and XPS languages.

Ghostscript has been under active development for over 20 years. It offers an extremely versatile feature set and can be deployed across a wide range of platforms, modules, and end uses: embedded in hardware, as an engine in document management systems, in cloud solution integrations, and as an engine in leading PDF generators and tools.

How to extract text from a PDF using Ghostscript: Step by Step

Please note that the PDF must contain actual text, not just scanned page images. The txtwrite device used below extracts text that is already in the file; it does not perform OCR.
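A quick way to spot an image-only PDF is to look at what comes out of the extraction: if the text file is empty or nearly so, the PDF probably has no text layer. The sketch below is a rough heuristic under that assumption. The function name and the 20-character threshold are arbitrary choices for illustration, not anything Ghostscript defines.

```python
def looks_image_only(extracted_text, min_chars=20):
    """Guess whether a PDF had no text layer, based on its extracted text.

    Counts non-whitespace characters; fewer than min_chars suggests the
    pages were scanned images. The threshold is an arbitrary assumption.
    """
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars
```

A small threshold above zero is used because an extraction from scanned pages may still contain a few stray characters.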

  1. Download Ghostscript.
  2. Install Ghostscript.
  3. Copy your PDF file to the bin directory where you installed Ghostscript.
  4. Open a command-line window in the bin directory (run it as Administrator if you get an access error).
  5. Run a command of this form: gswin64 -sDEVICE=txtwrite -o[Output File Name] [Input File Name]
  6. For example: gswin64 -sDEVICE=txtwrite -ooutput2.txt test.pdf

Sample Ghostscript Command from the Command Line

C:\Program Files\gs\gs9.23\bin>gswin64 -sDEVICE=txtwrite -ooutput2.txt test.pdf
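For anyone who would rather script this step than type it, here is a minimal Python sketch of the same command. It is not part of the original workflow: the function names are made up for illustration, and it assumes the gswin64 executable can be found from the current directory or the PATH, as in the steps above.

```python
import subprocess


def build_txtwrite_command(pdf_path, output_path, gs_exe="gswin64"):
    """Build the argument list for a Ghostscript text extraction.

    gs_exe defaults to the executable name used in this article; the
    function name and parameters are illustrative assumptions.
    """
    return [
        gs_exe,
        "-sDEVICE=txtwrite",        # select the text output device
        "-o" + str(output_path),    # output file, written as -o<name> like the sample command
        str(pdf_path),              # input PDF
    ]


def extract_text(pdf_path, output_path, gs_exe="gswin64"):
    """Run Ghostscript and raise CalledProcessError if it exits with an error."""
    command = build_txtwrite_command(pdf_path, output_path, gs_exe)
    subprocess.run(command, check=True)
    return output_path
```

Calling extract_text("test.pdf", "output2.txt") would run the same command as the sample above. Passing the arguments as a list rather than one string means file names with spaces do not need extra quoting.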

Automating Ghostscript Using an Excel Macro

Earlier, I discussed and gave an example of how to extract text from a PDF file from the command line using a free software tool called Ghostscript.

Now I want to use an Excel macro to pass the PDF file name to a batch file, which will run Ghostscript on that specific PDF.

A batch file can accept an argument when it is called, so we’ll use that to pass in the file name, which the macro reads from cell A4 of the worksheet.

For this solution to work, you will need:

  1. The Excel macro workbook and the batch file (run-ghostscript.bat) in the same working directory
  2. A copy of the Ghostscript executables and DLL files in that working directory

Excel Macro code:

Sub RunGhostScriptExtract()
    Dim folderPath As String
    Dim shellCommand As String
    Dim ghostCommand As String
    Dim PDFFileName As String

    ' Read the PDF file name from cell A4 of the "Master" worksheet
    Worksheets("Master").Activate
    PDFFileName = Range("A4").Value

    ' The batch file is expected in the same folder as this workbook
    folderPath = Application.ActiveWorkbook.Path
    ghostCommand = "run-ghostscript.bat " & PDFFileName

    ' cmd.exe switches:
    '/S      Modifies the treatment of string after /C or /K
    '/C      Carries out the command specified by string and then terminates
    '/K      Carries out the command specified by string but remains

    ' /k keeps the console window open so any output stays visible.
    ' Note: a folder path or file name containing spaces would need quoting.
    shellCommand = "cmd /k " & folderPath & "\" & ghostCommand
    Call Shell(shellCommand, vbNormalFocus)
End Sub

Batch file code:

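pushd %~dp0
REM Hard-coded test call, kept for reference:
REM start gswin32 -sDEVICE=txtwrite -ooutput.txt test.pdf
REM Extract text from the PDF whose name is passed as the first argument
start gswin32 -sDEVICE=txtwrite -ooutput.txt "%~1"
exit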
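For comparison, the batch file's job (switch to its own folder, take the PDF name as the first argument, hand it to Ghostscript) could be sketched in Python roughly as follows. This is an alternative illustration, not what the macro above calls. plan_extraction is a hypothetical helper, and gswin32 and output.txt are simply carried over from the batch file.

```python
from pathlib import Path


def plan_extraction(argv, script_dir):
    """Turn wrapper arguments into (command, working_directory).

    argv mirrors sys.argv: argv[1] plays the role of %~1 in the batch file,
    and script_dir plays the role of %~dp0.
    """
    if len(argv) != 2:
        raise ValueError("expected exactly one argument: the PDF file name")
    pdf_name = argv[1].strip('"')  # drop surrounding quotes, as %~1 does
    command = ["gswin32", "-sDEVICE=txtwrite", "-ooutput.txt", pdf_name]
    return command, Path(script_dir)
```

A caller could pass the result to subprocess.run(command, cwd=working_directory), where the cwd argument stands in for pushd.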

I hope this helps someone!

~ Cyber Abyss

Author: Rick Cable / AKA Cyber Abyss

A 16-year US Navy veteran with 25+ years of experience in various IT roles in the US Navy, startups, and healthcare. Founder of FinditClassifieds.com (1997 to present) and co-founder of the sports card collector software startup LK2 Software (1999-2002). For the last 7 years, a full-stack developer supporting multiple agile teams and products in a large healthcare organization. Part-time cyber researcher, aspiring hacker, lock picker, and OSINT enthusiast.
