Documentation:OCR Distribution

From UBC Wiki

Introduction

The OCR Distribution project is a sub-project of the Language Reclamation project. Its goal is to convert large quantities of existing transliterated .txt files, written in English, Boas and U'mista orthographies, into .pdf files, and then combine all the PDF files for distribution and easy reading. This documentation is written for absolute beginners to programming. If you are not a beginner, feel free to clone the repository and use the existing bash scripts.

Getting Started

For this project, you will need:

  • Pandoc - Need to install
  • BasicTeX/TinyTeX - Need to install
  • cpdf (Coherent PDF) - Need to install
  • DejaVu Font - Need to install
  • Bash - Built into device
  • Terminal (on macOS) or Command Line (on Windows) - Built into both systems
  • Text editing software of your choice - Some of the popular ones are: Visual Studio Code, Sublime Text, etc.
  • A Github Account - To access the file in the repository
  • g2p - Need to install if remapping of any file is needed
  • csplit/gcsplit - Need to install if using macOS

Note: Because of the tools this project relies on, it may be easier to work on macOS than on Windows. However, there are many resources online to help troubleshoot any problems encountered while installing the software above.

Pandoc

Pandoc is an open-source universal document converter that runs from the Terminal or Command Line. Pandoc supports file conversion from and to these file formats.

To install Pandoc on any system, download an installer from the Pandoc website.

BasicTeX/TinyTeX

Pandoc uses LaTeX to convert files to PDF. However, because a full LaTeX install is very large on the respective systems (MacTeX on macOS and MiKTeX on Windows), BasicTeX (for macOS) or TinyTeX (for Windows) will suffice.

Note: You must install either BasicTeX or TinyTeX to convert files to PDF.

To install either BasicTeX or TinyTeX, please visit either one of the hyperlinks above.

cpdf (Coherent PDF)

cpdf is used in this project to combine the converted PDF files. It is a tool that lets you modify and process PDF documents from Terminal or Command Line. It is a separate program that is not bundled with macOS or Windows, so it needs to be installed before the scripts can use it; versions are available for both systems.

DejaVu Font

This font needs to be installed because the default font pandoc uses when producing PDFs cannot display some of the unique characters in the Boas and U'mista orthographies. To install the font, please visit this website.

For configuration on Windows after installation, please follow this tutorial.

For configuration on macOS after installation, please follow this tutorial.

Bash

Bash is a command-line shell used extensively on Linux and macOS. A shell is a computer program that lets you control a computer's operating system directly, through either a graphical user interface or a command-line interface. Put simply, Bash is the remote control and Terminal or Command Prompt is the TV. All devices that have Terminal or Command Prompt built in should already have Bash or some other kind of shell.

Note: On newer macOS versions the default shell is zsh; however, bash commands will still work without setting bash as the default shell.

Terminal or Command Line

This is a built-in tool that can be found on all devices: Terminal on macOS and Command Prompt on Windows. Both are command-line interpreters. Put simply, typing commands into Terminal or Command Prompt is another way of communicating with the computer instead of using a graphical interface. It is a text-based interface that lets developers speed up their workflow rather than performing tasks through the interface the computer comes with. This is the primary tool that the project uses.

Text Editing Software

This is an optional tool. You can write the commands in the script file itself using a text editor that is built into your device. Dedicated text editing software simply helps you read and indent your commands more easily. Some popular options are Visual Studio Code and Sublime Text. There are many more, and the choice comes down to preference. Since this project already has scripts written and ready to use, this tool is not mandatory.

Github

To get a copy of the currently existing bash script, you will need to create an account on Github. Github is a tool that allows developers to track versions/iterations of the codebase as well as making development collaborative, meaning that many developers can work on the same project together.

You won't necessarily need to set up a token or an SSH key to clone the repository, but an account is mandatory in order to access the repository.

g2p

Short for "Grapheme-to-Phoneme", g2p is a tool that transliterates text from one orthography to another for a given language. For this project, the mapping file had to be fixed and all the original files reconverted with the fixed mapping. g2p can be installed here if remapping is needed. Instructions for using g2p are also given at the link. If remapping is NOT needed, you do not have to install g2p.

csplit/gcsplit

csplit is one of the commands used in the split_for_remapping_2dg.sh and split_for_remapping_3dg.sh scripts. It splits a file into pieces at the delimiters you specify. On macOS the command has a slightly different name, gcsplit. Windows users may need to open the script and change the command to csplit before running it. gcsplit is not built into macOS, so macOS users will need to install coreutils in order to use it. An easy, recommended way to install coreutils is through Homebrew, with the command: brew install coreutils
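As a rough illustration of the splitting step, the sketch below picks whichever command is installed and splits a small sample file at each delimiter line. The sample file name 099.txt, its contents, and the output prefix are assumptions made for the example, not taken from the project's scripts. The '{*}' repeat argument relies on the GNU version of csplit (which is what gcsplit provides on macOS).

```shell
# Use gcsplit when it is installed (Homebrew coreutils on macOS), otherwise csplit.
if command -v gcsplit >/dev/null 2>&1; then
  SPLIT_CMD=gcsplit
else
  SPLIT_CMD=csplit
fi

# Build a small sample file with two delimiter lines (hypothetical content).
workdir=$(mktemp -d)
printf '%s\n' 'english text' '---------------' 'boas text' \
  '---------------' 'umista text' > "$workdir/099.txt"

# Split before every delimiter line.
#   -s      do not print the size of each piece
#   -f      prefix for the output files (here: 099_)
#   -n 1    use one-digit suffixes, giving 099_0, 099_1, 099_2
#   '{*}'   repeat the pattern as many times as possible (GNU csplit only)
"$SPLIT_CMD" -s -f "$workdir/099_" -n 1 "$workdir/099.txt" \
  '/^---------------$/' '{*}'

ls "$workdir"
```

Each piece after the first begins with the delimiter line itself, because csplit cuts immediately before the matching line.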

Simple Bash Commands

The most commonly used commands for this project are:

  • cd - Short for "change directory"; navigates from folder to folder
  • ls - Lists all the files within a directory
  • rm - Removes the specified file. It takes one argument: the name of the file to remove, including its extension (if any)

To check out what these commands do as well as other available Bash commands, please visit this tutorial.
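A short practice session with cd, ls and rm might look like the sketch below. It works inside a throwaway folder so that nothing important is deleted; the file names are made up for the example.

```shell
# Create a throwaway folder and move into it.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Create two empty files to experiment with.
touch notes.txt draft.txt
ls              # shows draft.txt and notes.txt

# Remove one file by giving its full name, including the extension.
rm draft.txt
ls              # shows only notes.txt
```

Note that rm deletes a file permanently; it does not go to the Trash or Recycle Bin.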

Creating or Editing a Bash Script

Note: This section is solely for knowledge on how to create, edit, or view a bash script. It is not mandatory to perform this step for the project unless you're interested in making your own bash script.

Creating a Bash script

One can follow this tutorial for how to create a bash script. The tutorial also includes information about other bash script syntax.
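For a sense of what a very small bash script looks like, here is a minimal sketch. The script name hello.sh and the sample files are hypothetical and are not part of the project repository.

```shell
# Write a tiny script into a scratch folder.
scratch=$(mktemp -d)
cat > "$scratch/hello.sh" <<'EOF'
#!/bin/bash
# Print a message for every .txt file in the current folder.
for f in *.txt; do
  echo "Found file: $f"
done
EOF

# Make two sample files, then run the script from inside the folder.
touch "$scratch/01.txt" "$scratch/02.txt"
script_output=$(cd "$scratch" && bash hello.sh)
echo "$script_output"
```

The first line of the script (#!/bin/bash) names the shell that should run it, and the for loop is the same construct the project's scripts rely on.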

Editing the Existing Bash script

To get a copy of the bash scripts this project currently uses:

  • Clone the repository on Github
    • OR
  • Download a zip file of the repository on Github

Once the repository is cloned or downloaded:

Note: It is not mandatory to use bash to perform the following steps.

  • Navigate to the folder that contains the copy of the repository
    • The folder will be called langrecscripts
  • Double click on the test_script.sh file
    • At this point, the file will open using a text editor tool
  • Start editing the commands!
Figure: split_for_remapping script, lines 1-38

The Currently Used Bash Scripts

A bash script is basically a list of bash commands in a file that will execute just by running the script. Having and using a bash script for this project significantly accelerates workflow and eliminates repetitive tasks.

There are currently three bash scripts stored in this Github repository: convert_and_combine.sh, split_for_remapping_2dg.sh and split_for_remapping_3dg.sh. To use them, clone the repository or download it as a zip file, then place the script you need in the same folder as the files it should process (or wherever pandoc is configured to run on your device) before executing it.

Note: If you make any changes to the script, please consult the Lab supervisor to see if pushing the changes is necessary.

What Each Script Does

  • split_for_remapping_2dg.sh and split_for_remapping_3dg.sh reconvert the U'mista orthography from the original .txt files. For how the scripts work, please refer to this section.
  • convert_and_combine.sh converts all .txt files within a given directory to .pdf files and combines them into one. For how the script works, please refer to this section.
Figure: split_for_remapping script, lines 38-75

Commands in split_for_remapping.sh (2dg and 3dg)

Note: These two scripts essentially give the same results except one must use 2dg for original files with only 2-digit file names (Ex. 99.txt) and 3dg for only 3-digit file names (Ex. 123.txt).

Line 1: Tells the program loader to run the script with the Bash shell

Lines 3-10: A for loop that renames all original files to have a leading 0

Lines 13-16: A for loop that splits each file into 3 chunks at the delimiter "---------------"

Lines 19-26: A for loop that deletes all files whose names end with 2

Lines 29-35: A for loop that takes all files whose names end with 1 and converts them again from Boas to U'mista orthography using g2p

Lines 38-46: A for loop that appends all the new files made so far to an array

Lines 49-56: A for loop that goes through the files in that array and renames those ending with "umista" to end with 2 instead

Lines 59-66: A for loop that goes through the same array and combines the files that share a file number and end with 0, 1, and 2

Lines 69-74: A for loop that places all the newly processed files into a new directory called "processedfiles"
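Two of the steps above, adding a leading 0 and recombining the pieces, can be sketched as follows. This is not the project's script: the sample files, the piece names (007_0, 007_1, 007_2) and their contents are assumptions chosen to mirror the description, and the g2p conversion step is left out.

```shell
# Work in a scratch folder with two sample two-digit files.
workdir=$(mktemp -d)
cd "$workdir"
printf 'page 7\n' > 07.txt
printf 'page 12\n' > 12.txt

# Rename every two-digit file so it gains a leading 0 (07.txt becomes 007.txt).
for f in [0-9][0-9].txt; do
  mv "$f" "0$f"
done

# Pretend the split and remapping steps already produced three pieces for file 007.
printf 'english\n' > 007_0
printf 'boas\n' > 007_1
printf 'umista\n' > 007_2

# Recombine the pieces that share a file number, in the order 0, 1, 2,
# and write the result into the processedfiles folder.
mkdir -p processedfiles
for first in *_0; do
  number=${first%_0}    # strip the _0 suffix to get the file number
  cat "${number}_0" "${number}_1" "${number}_2" > "processedfiles/${number}.txt"
done

ls processedfiles
```

The ${first%_0} expansion removes the trailing _0 from the name, which is one simple way to recover the shared file number.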

Commands in convert_and_combine_pdf.sh

Figure: convert_and_combine_pdf script

Line 1: Tells the program loader to run the script with the Bash shell

Lines 3-6: A for loop that goes through the files with the .txt extension in the current directory and calls pandoc to convert each one into a .pdf file.

  • The -V geometry=margin=1cm part at line 5 can be changed to set the desired margins on each page of the PDF
  • The -V mainfont="DejaVu Serif" part at line 5 tells pandoc to use the DejaVu font instead of its default font
  • --pdf-engine=xelatex tells pandoc to use XeLaTeX as the LaTeX engine when converting the .txt file into a PDF
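A conversion loop using the three options above might look like the sketch below. The function names pdf_name_for and convert_txt_to_pdf are hypothetical and are not taken from the project's script; the pandoc options are the ones described in the list.

```shell
# Work out the PDF name for a given .txt file (007.txt -> 007.pdf).
pdf_name_for() {
  printf '%s\n' "${1%.txt}.pdf"
}

# Convert every .txt file in the current folder to a PDF with the same base name.
convert_txt_to_pdf() {
  for f in *.txt; do
    [ -e "$f" ] || continue     # nothing to do when there are no .txt files
    pandoc "$f" -o "$(pdf_name_for "$f")" \
      --pdf-engine=xelatex \
      -V geometry=margin=1cm \
      -V mainfont="DejaVu Serif"
  done
}

# Usage, from inside the folder that holds the .txt files:
#   convert_txt_to_pdf
```

Wrapping the loop in a function keeps the example from converting anything until it is called on purpose, which is convenient when trying it out on a few files first.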

Line 8: An empty array to store the new PDF files in

Lines 9-12: A for loop that goes through the files with the .pdf extension in the current directory and adds them to the array made in Line 8

Line 13: Prints the items in the array to check that the PDF files were added

Line 14: Calls cpdf (on macOS) to combine all the PDF files in the array and names the output "combined.pdf"
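The array and merge steps can be sketched roughly as below, assuming bash (arrays are a bash feature) and cpdf's -merge operation. The function names and the choice to skip an existing combined.pdf are assumptions for illustration and may differ from the project's script. The demonstration uses empty placeholder files, so the merge function is defined but not called.

```shell
# Gather the PDF names in the current folder into an array called pdf_files.
collect_pdfs() {
  pdf_files=()                              # start with an empty array
  for f in *.pdf; do
    [ -e "$f" ] || continue                 # no PDFs present
    [ "$f" = "combined.pdf" ] && continue   # skip an earlier merged output
    pdf_files+=("$f")                       # append the file name
  done
}

# Merge everything in the array into combined.pdf using cpdf.
merge_pdfs() {
  cpdf -merge "${pdf_files[@]}" -o combined.pdf
}

# Demonstration with empty placeholder files (not real PDFs).
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch 001.pdf 002.pdf combined.pdf
collect_pdfs
echo "${pdf_files[@]}"                      # print the array to check its contents
```

With real PDF files in the folder, calling collect_pdfs followed by merge_pdfs would produce combined.pdf. Because the file names are sorted alphabetically, the leading zeros added earlier keep the pages in numeric order.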

Executing the Bash Script

Once the script is made, the only thing left to do is to execute the script file.

Note: Due to the large quantities of files that need to be worked with for this project, it is recommended to run the script on smaller quantities of files before going ahead to run the script on a whole directory filled with all the files.
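One way to set up such a trial run is to copy a handful of files into a separate folder and run the script there. The sketch below is only an illustration; the function name copy_sample and the folder names are hypothetical.

```shell
# Copy at most <limit> .txt files from <src_dir> into <dest_dir>.
copy_sample() {
  src_dir=$1
  dest_dir=$2
  limit=$3
  mkdir -p "$dest_dir"
  copied=0
  for f in "$src_dir"/*.txt; do
    [ -e "$f" ] || continue               # no .txt files in the source folder
    [ "$copied" -ge "$limit" ] && break   # stop once enough files are copied
    cp "$f" "$dest_dir"/
    copied=$((copied + 1))
  done
}

# Demonstration: five sample files, of which the first three are copied.
src=$(mktemp -d)
touch "$src/001.txt" "$src/002.txt" "$src/003.txt" "$src/004.txt" "$src/005.txt"
copy_sample "$src" "$src/sample" 3
ls "$src/sample"
```

Copying rather than moving leaves the original files untouched, so a failed trial run cannot damage them.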

In terminal/command line, navigate to the directory that contains the .txt files to be converted and the script file.

Once in the directory, type in the command bash <script_file_name>.sh

At this point, you may get an error saying that you don't have permission to execute the script. This typically happens when the script is run directly with ./<script_file_name>.sh rather than through bash. In that case, give yourself permission to execute the script with this command: chmod u+x <script_file_name>.sh

Once that command has been entered, run the script again, either with bash <script_file_name>.sh or directly with ./<script_file_name>.sh
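The permission step can be tried out safely on a throwaway script, as in this sketch. The script name demo_script.sh is made up for the example.

```shell
# Create a one-line script in a scratch folder.
scratch=$(mktemp -d)
cd "$scratch"
printf '#!/bin/bash\necho "script ran"\n' > demo_script.sh

# Give the file's owner (u) permission to execute (x) it.
chmod u+x demo_script.sh

# Run it directly by path and show what it printed.
run_output=$(./demo_script.sh)
echo "$run_output"
```

Running ls -l before and after the chmod command shows the change: an x appears in the owner's part of the permissions column.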

Note: If you must reconvert any of the orthographies, follow the steps below before converting the txt files into PDFs:

  1. Install g2p and make sure it can be run from anywhere on your device
  2. If the relevant g2p mapping file needs fixes, correct it and update g2p with the fixed mapping (instructions are given in the g2p repository)
  3. Run whichever split_for_remapping script (2dg or 3dg) matches your file names
    1. Note: This script must be in the same folder as the files you wish to reconvert
  4. Put the convert_and_combine_pdf.sh script inside the directory with the newly remapped files (the directory will be called either "processedfiles_2dg" or "processedfiles_3dg")
    1. Note: You must put the convert_and_combine_pdf script in one of those directories; otherwise the script will also try to convert .txt files that are not needed.

If you do not need to reconvert any orthographies from the given txt files, please proceed to run the convert_and_combine_pdf.sh script.

At this point, the script should be running. File conversion may take some time if there are many files to convert. Once the names of all the .pdf files have been printed in the terminal/command line, you can expect the script to finish soon.