MLUG

1. Converting a scanned document to a pdf document

Last year I did some consulting for a law firm that required me to submit time sheets with my invoices. In any given invoice period I would undertake work involving multiple clients. Work undertaken for each client was broken down into standard categories: telephone call, email, meeting, etc.

I was working from my own home, and my first inclination was to record everything on a spreadsheet formatted to look like the log. However, this was a bit cumbersome, and I found that it was much simpler to just keep a log on the side of my desk, or in my diary, and pen in entries as necessary.

The law firm filed everything as pdf files, so I had to submit my log forms via email as pdf documents.

To make life easy I wrote the following script to convert the scanned log forms from an image to a pdf. I used xsane set to lineart for scanning and saved as either .jpg or .png, which resulted in an image just about the same width and height as an A4 document. With xsane set to lineart and 300 dpi, the pdf files were around 93.5kB.

#!/bin/bash
############################################################
# /usr/local/bin/con2pdf 
# Usage: con2pdf [input file]
# Converts an image to a pdf.
# requires awk
# requires convert (from ImageMagick)
############################################################
 
# Assign a variable
input_file=$1
 
# Test to see if a variable was provided with the command
test -n "$input_file"
  if [ $? -eq 1 ]; then
    echo -e "\nUsage: con2pdf [input file]\n"
    exit
  fi
 
# Assign another variable using the output of a command
output_file=`echo "$input_file" | awk -F "." '{ print $1 }'`.pdf
 
convert $input_file $output_file  # This line does the actual conversion
 
rm $input_file  # This line removes the image file
 
# end of script

I will now explain how this script works:

Note that with the exception of the first line, any text prefixed with a hash, #, is ignored up until the next new line. Text prefixed with a hash is usually referred to as a comment. Comments can be put on the same line as a command, but only after the command. There are no hard and fast rules about using comments. They are handy for explaining things to other folks as well as to oneself. I normally don't comment a small script as much as this one. Usually I just add some notes at the top and then perhaps add comments to explain, for future reference, why something is done a certain way.

#!/bin/bash

The first line of my script begins with the two characters “ # ” and “ ! ”. Since files are seen by programs as streams of data, a method is required to determine the format of a particular file within the filesystem. Different operating systems have traditionally taken different approaches to this problem.* In the case of Unix, and in our case Linux, “ #! ” tells the kernel to treat the file as an executable script and not a machine code program. “/bin/bash” declares the path to the command interpreter that will be used, in this instance bash.
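As a small illustration of the point about the first two characters, the sketch below writes a throwaway script to a temporary file and reads back only its first two bytes. The helper name first_two_bytes and the temporary file are my own inventions for this example and are not part of con2pdf.

```shell
#!/bin/bash
# first_two_bytes is a hypothetical helper, used only for this illustration.
# head -c 2 prints the first two bytes of the named file.
first_two_bytes() {
    head -c 2 "$1"
}

# Write a tiny script to a temporary file so there is something to inspect.
demo_file=$(mktemp)
printf '#!/bin/bash\necho hello\n' > "$demo_file"

first_two_bytes "$demo_file"   # prints #!
echo                           # finish the line, since head -c adds no newline

rm -f "$demo_file"
```

Those two bytes are what marks the file as a script; everything after them on the first line names the interpreter.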

input_file=$1

This line assigns a value to the variable input_file using the first string of text, i.e. the file name, entered after the command con2pdf. More than one argument can be passed to a script when it is run, and they are numbered $1, $2, etc., but I only want to pass the name of the input file to the script in this instance.
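To see the numbering in action, here is a minimal sketch. It uses a function rather than a separate script file, on the understanding that bash numbers the arguments of a function in the same way as those of a script. The name show_args and the file names are made up for the example.

```shell
#!/bin/bash
# show_args is a hypothetical helper, not part of con2pdf.
# Inside it, $1 and $2 are the first and second arguments
# and $# is the number of arguments supplied.
show_args() {
    echo "count=$# first=$1 second=$2"
}

show_args scan.png notes.txt   # prints: count=2 first=scan.png second=notes.txt
show_args                      # prints: count=0 first= second=
```

The second call shows what con2pdf has to guard against: when nothing is passed, $1 is simply an empty string.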

test -n "$input_file"

This line uses test, a bash builtin command, to check whether the variable is a non-empty string, i.e. whether a file name was passed to the script when the command con2pdf was run. Test exits with a status of 0 (true) if input_file is a non-empty string and 1 (false) if it is empty. The exit status is not printed to stdout, but the shell stores it in the special variable $?, which can then be evaluated using an if statement.
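Because test prints nothing, it can be hard to picture what it is doing. The sketch below makes the exit status visible by echoing it. The helper check_nonempty is hypothetical and exists only for this demonstration.

```shell
#!/bin/bash
# check_nonempty is a hypothetical helper used only for this illustration.
check_nonempty() {
    local status=0
    # test itself prints nothing; when it fails, $? holds its exit
    # status and is copied into status before anything else runs.
    test -n "$1" || status=$?
    echo "$status"
}

check_nonempty "scan.png"   # prints 0 (true: the string is not empty)
check_nonempty ""           # prints 1 (false: the string is empty)
```

Note that $? is overwritten by every command that runs, so it has to be read immediately after the command of interest.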

  if [ $? -eq 1 ]; then
    echo -e "\nUsage: con2pdf [input file]\n"
    exit
  fi

This if statement evaluates $? to see if it is equal to 1.

If $? equals 1 then it will run the bash builtin echo, which prints the text within the double quotes to stdout. Echo is used with the flag -e, which enables interpretation of backslash escapes. In this instance a newline, \n, is inserted before and after the text.

The next command is the bash builtin exit which will be used to exit the script.

All if statements must be closed with fi.
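An if statement can also act on the exit status of a command directly, without going through $? first. The following is a sketch of that alternative and not how con2pdf is written. The function name check_input is made up, and it uses test -z (true when the string is empty), which is the opposite of -n.

```shell
#!/bin/bash
# check_input is a hypothetical helper, not part of con2pdf.
check_input() {
    local input_file=$1
    # if runs the command and branches on its exit status directly
    if test -z "$input_file"; then
        echo "Usage: con2pdf [input file]" >&2   # usage message goes to stderr
        return 1                                 # non-zero status signals failure
    fi
    echo "input file is $input_file"
}

check_input "scan.png"                        # prints: input file is scan.png
check_input "" || echo "exit status was $?"   # prints the usage text, then: exit status was 1
```

The square brackets in the original script are another name for test, so [ -z "$input_file" ] would behave the same way here.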

output_file=`echo "$input_file" | awk -F "." '{ print $1 }'`.pdf

Instead of passing both an input filename and an output (save) filename to the script, the next line is used to assign an output filename to the variable output_file. Variables can be assigned using the output of a command when the command is enclosed in two backticks, `[command]`.

In this line echo is used to print the variable input_file, but instead of printing to stdout its output is sent through a pipe to awk.

Awk, or gawk, is a pattern scanning and text processing program. Here the flag -F is used to declare “.” (full stop) as the field separator. For example, the file name scanned_image.png consists of two fields separated by a full stop. Awk will print the first field, $1 (scanned_image), to stdout.

Note .pdf on the same line, after the second backtick. This appends .pdf to the output of awk, so if the first field was scanned_image, the variable output_file would be scanned_image.pdf.
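One thing to be aware of is that taking only the first field drops everything after the first full stop, which matters if a file name contains more than one. The sketch below compares the awk approach with bash parameter expansion, which is a different technique from the one used in con2pdf. Both function names and the file name log.2011.06.png are made up for the example.

```shell
#!/bin/bash
# Both helpers are hypothetical and only illustrate the comparison.

# Same technique as con2pdf: keep the text before the first full stop.
pdf_name_awk() {
    echo "$(echo "$1" | awk -F "." '{ print $1 }').pdf"
}

# Alternative: ${1%.*} removes the shortest match of ".*" from the end
# of the string, i.e. only the last full stop and what follows it.
pdf_name_expansion() {
    echo "${1%.*}.pdf"
}

pdf_name_awk "scanned_image.png"         # prints: scanned_image.pdf
pdf_name_awk "log.2011.06.png"           # prints: log.pdf
pdf_name_expansion "log.2011.06.png"     # prints: log.2011.06.pdf
```

For file names with a single full stop, such as the ones xsane saved for me, both approaches give the same result.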

You will find that there is often more than one way to do something when scripting. The command cut could also have been used in place of awk.

output_file=`echo "$input_file" | cut -d. -f1`.pdf

Field separators are also referred to as delimiters. In the above line, -d. nominates full stop as the delimiter and -f1 selects field 1 for printing to stdout.
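Changing the number after -f selects a different field. As a quick sketch, with helper names that are my own and not part of con2pdf:

```shell
#!/bin/bash
# name_field and ext_field are hypothetical helpers for this illustration.
name_field() {
    echo "$1" | cut -d. -f1   # field 1: text before the first full stop
}

ext_field() {
    echo "$1" | cut -d. -f2   # field 2: text between the first and second full stop
}

name_field "scanned_image.png"   # prints: scanned_image
ext_field "scanned_image.png"    # prints: png
```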

convert $input_file $output_file
 
rm $input_file

The next two lines need little explanation.

Convert is an ImageMagick utility that converts images from one format to another. The file extension .pdf appended to the variable output_file ensures that the scanned document image will be converted to pdf format.

I did not want to save the document images, so the next line deletes the image file.
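As written, the script deletes the image even if the conversion fails, and the unquoted variables will trip over file names containing spaces. The following is a more cautious sketch of the last two lines, not the script I actually used. The function name convert_and_remove and the CONVERTER variable are made up; CONVERTER exists only so the logic can be tried out with a stand-in command when ImageMagick is not installed.

```shell
#!/bin/bash
# convert_and_remove is a hypothetical variation on the last two lines of con2pdf.
# CONVERTER is an assumption of this sketch: if unset, convert is used.
convert_and_remove() {
    local input_file=$1
    # ${input_file%.*} strips the last full stop and the text after it
    local output_file="${input_file%.*}.pdf"

    # Double quotes keep a file name containing spaces as one argument.
    if "${CONVERTER:-convert}" "$input_file" "$output_file"; then
        rm -- "$input_file"    # delete the image only after a successful conversion
    else
        echo "con2pdf: conversion failed, keeping $input_file" >&2
        return 1
    fi
}

# Example call, commented out so nothing is converted or deleted by accident:
# convert_and_remove "scanned_image.png"
```

The if statement here relies on convert exiting with a non-zero status when it cannot do its job, which is the usual convention for command line programs.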

I almost always have a terminal open, so my scripts are usually intended to be run on the command line. After saving the scanned image into the directory where the relevant pdf records were kept, I would cd into that directory and run the command con2pdf [image name]. In the next section I'll show how to modify con2pdf so that it will have a gui for both selecting the image file and selecting a path and name for the resulting .pdf file.

Cheers!