Using Ghostscript to Extract Text from a PDF file (Command Line)

This is a re-post of one of my favorite articles, which I originally posted on 7/23/2018 on my old Blogger blog.

I think I would really like to revisit automating the extraction of text from PDF files. There is a lot of untapped value many companies could be leveraging but aren’t.

Recently, I received a request from a team member to find a way to:

  1. Extract a large amount of text from a large PDF file.
  2. Parse the extracted text and write specific elements into an Excel file.
  3. Format the Excel file into specific tabs, one for each type of report I extract, and add column headers.
  4. Create validation code in an Excel macro that makes an Ajax call to a data warehouse web service and validates the data based on an ID in one of the columns.

Pretty cool, right? I just finished the prototype today, 7/31/2018.

In this article I’ll cover the first step of this task: using a free tool called Ghostscript to extract text from a PDF file.

What is Ghostscript?

Ghostscript is a high-performance PostScript and PDF interpreter and rendering engine. It supports a comprehensive set of page description languages (PDLs) and offers conversion capabilities covering the PDF, PostScript, PCL and XPS languages.
Ghostscript has been under active development for over 20 years and offers an extremely versatile feature set. It can be deployed across a wide range of platforms and end uses: embedded in hardware, as an engine in document management systems, in cloud solution integrations, and as an engine in leading PDF generators and tools.

How to extract text from a PDF using Ghostscript

Please note that the PDF file must contain actual text. An image-only PDF (for example, a scan) has no text for Ghostscript to extract.

Steps:
– Download Ghostscript
– Install Ghostscript
– Copy your PDF file to the bin directory where you installed Ghostscript
– Open a command line window at the bin directory (as Administrator if you get an access error when running)
– Command syntax: gswin64 -sDEVICE=txtwrite -o[Output File Name] [Input File Name]
– Sample Ghostscript command: gswin64 -sDEVICE=txtwrite -ooutput2.txt test.pdf
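
If you want to run the same command from a script rather than typing it by hand, here is a minimal Python sketch. It is only an illustration, not part of the original prototype. The function names (build_txtwrite_command, extract_text) are made up for this example, and it assumes the gswin64 executable is on your PATH (or that you pass its full path) and that the extracted text can be read as UTF-8.

```python
import shutil
import subprocess
from pathlib import Path


def build_txtwrite_command(pdf_path, txt_path, executable="gswin64"):
    """Build the command shown above as an argument list (hypothetical helper)."""
    return [
        executable,
        "-sDEVICE=txtwrite",  # the text extraction device
        f"-o{txt_path}",      # output file name, attached directly to -o
        str(pdf_path),        # input PDF file
    ]


def extract_text(pdf_path, txt_path, executable="gswin64"):
    """Run Ghostscript on one PDF and return the extracted text."""
    # Assumption: the executable is on PATH or was given as a full path.
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"Ghostscript executable not found: {executable}")
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_txtwrite_command(pdf_path, txt_path, executable), check=True)
    # Assumption: the output is UTF-8; bytes that cannot be decoded are replaced.
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    text = extract_text("test.pdf", "output2.txt")
    print(text[:500])  # preview the first 500 characters
```

Passing the arguments as a list avoids any quoting problems with file names that contain spaces.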


Please leave a comment if this article helped you.

How to Use Windows Snipping Tool to Capture Right Click Context Menus

I’m working on a huge documentation project where I’m documenting operational support for a suite of C# MVC portal sites with a lot of back-end SQL administrative functions.

I used to have SnagIt but my company has been cutting back on licenses.

I’m forced to rely on the Windows native screenshot tool, the Windows Snipping Tool.

One of my first big struggles was figuring out how to capture right-click context menus with the Windows Snipping Tool.

In my case, I’m documenting a folder structure and how to commit code to an SVN repository.

  1. Open Snipping Tool, cancel the current snip and leave the tool in standby mode.
  2. Put focus on your window / folder.
  3. Press Shift + F10 to open the right-click context menu.
  4. Press Ctrl + Fn + Print Screen (PrtSc).
  5. The right-click menu should still be open, and Snipping Tool should prompt you to select an area to capture. Select your menu area.

Copying Files to Windows from Linux Using PuTTY’s PSCP Command Line Tool

I want to look at the log files from my cloud-hosted Linux Apache web server on my local Windows PC, and I’m using PuTTY as my Windows SSH/Telnet client.

How do we transfer the Apache web server log files over to my local Windows PC so I can work with them?

Instead of trying to copy the file from inside the PuTTY SSH session on the remote server, we can use PSCP (pscp.exe), a Windows command line tool included with PuTTY that mimics much of what you get with Linux’s built-in scp file transfer utility.

PuTTY’s Command Line File Transfer Tool (PSCP)

Below is a screenshot of me figuring out how to properly transfer an Apache web server log file over to my Windows PC temp folder on the C drive.

Sample Command Line Code

Run this from the Windows command line in your PuTTY install folder.

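I doubled the forward slashes (//) in the Linux path and the backslashes (\\) in the Windows path, and it worked. The doubling is probably not required: pscp should also accept ordinary single-separator paths such as /var/log/apache2/access.log and c:\temp\access.log.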

Below should be entered all on one line.
c:\Program Files\PuTTY>pscp user@remoteserver://var//log//apache2//access.log c:\\temp\\access.log
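
If you pull log files regularly, the same transfer can be scripted. Here is a rough Python sketch, offered only as an illustration. The function names (build_pscp_command, download_file) are made up for this example, and it assumes pscp is on your PATH (or that you pass its full path). It also assumes single-separator paths are fine here, since the arguments go to pscp as a list without passing through a shell.

```python
import subprocess


def build_pscp_command(user, host, remote_path, local_path, executable="pscp"):
    """Build a pscp download command as an argument list (hypothetical helper)."""
    # pscp takes the remote source in the form user@host:path
    source = f"{user}@{host}:{remote_path}"
    return [executable, source, local_path]


def download_file(user, host, remote_path, local_path, executable="pscp"):
    """Copy one remote file to the local machine with pscp."""
    # check=True raises CalledProcessError if pscp exits with an error.
    # pscp may still prompt for a password or host key confirmation.
    subprocess.run(
        build_pscp_command(user, host, remote_path, local_path, executable),
        check=True,
    )


if __name__ == "__main__":
    # Placeholder user, host and paths taken from the sample command above.
    download_file(
        "user",
        "remoteserver",
        "/var/log/apache2/access.log",
        r"c:\temp\access.log",
    )
```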

Re-post: C# Programming – ASCII XMas Tree w/ Code Example

This is a re-post of an article I originally posted 10 years ago, on 11/18/2009. I still see a bit of traffic from it, so it stays on the blog.

I’m guessing the web traffic for this article is coming from programming students who have been assigned an ASCII XMas tree as a coding project.

The original title was: Introduction to C# Programming: ASCII XMas Tree – C# and .Net Sample Project.

ORIGINAL ARTICLE

C#, pronounced C Sharp, is a relatively new programming language from Microsoft that was designed to take full advantage of the new .Net Framework.

Below is a sample project I did for a programming class I’m attending at MJC in Modesto. To test and run the code listed on this site you should download a free copy of Microsoft Visual C# Express 2005. There are four Visual Studio Express versions available for free from Microsoft: Visual Basic, C#, J# and C++.

The XMas tree project was a good idea. It really makes you think. I think this project could have been a little more fun for me if I had a little more time, but my schedule just won’t allow it. I don’t see myself coming back too often to update this code, but if others want to send me some other samples I may post them later. So here it goes...

Screenshot of command line window.

C# ASCII XMas Tree Code Sample

using System;
using System.Collections.Generic;
using System.Text;
namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            // Row widths of the tree, top to bottom; the last entry is the widest row
            int[] myArray = new int[] { 1, 3, 5, 7, 9 };

            // Outer foreach loop: draws one row of the tree per array entry
            foreach (int intLoop in myArray)
            {
                // Left padding: (widest row - current row width) / 2 spaces,
                // which centers each level of the tree
                for (int iSpace = 0; iSpace < ((myArray[4] - intLoop) / 2); iSpace++)
                {
                    System.Console.Write(" ");
                }

                // Middle loop: writes as many asterisks "*" as the current array value
                for (int i = 0; i < intLoop; i++)
                {
                    System.Console.Write("*");
                }

                // Right padding: the same number of spaces as the left padding
                for (int iSpace = 0; iSpace < ((myArray[4] - intLoop) / 2); iSpace++)
                {
                    System.Console.Write(" ");
                }

                // Start a new line after all 3 loops have run
                System.Console.WriteLine("");
            }

            // Base of the tree: 3 rows (myArray[1] is 3), each 3 spaces, 3 pipes, 3 spaces
            for (int iBase = 0; iBase < myArray[1]; iBase++)
            {
                // Left padding for the base
                for (int iSpaces = 0; iSpaces < myArray[1]; iSpaces++)
                {
                    System.Console.Write(" ");
                }

                // The base itself
                for (int iPipes = 0; iPipes < myArray[1]; iPipes++)
                {
                    System.Console.Write("|");
                }

                // Right padding for the base
                for (int iSpaces = 0; iSpaces < myArray[1]; iSpaces++)
                {
                    System.Console.Write(" ");
                }

                // Start a new line after all 3 loops have run
                System.Console.WriteLine("");
            }
        }
    }
}