Syriac handwriting recognition


Week 01 (May 18 - May 24)

May 18

  1. Explored how to connect to Smith computing resources (network server) through my own computer: Accessing your network files on a Mac.
  2. Read the first half of the project Helena did in Summer 2019: Syriac Project.
  3. Explored available HTR Project: Build a Handwritten Text Recognition System using TensorFlow.
    Didn't finish training due to time constraints; interrupted training after Epoch 7.
    Character error rate: 14.465409%. Word accuracy: 66.330435%.
    Complete training on Epoch 17.
    Character error rate: 3.891709%. Word accuracy: 86.727273%.
    Character error rate not improved
    No more improvement since 5 epochs. Training stopped.
    Mac file path for BasicSyriacData used for training: smb://ad.smith.edu/files/Academic/Research/nhowelab/Summer2020/Data_1/BasicSyriacData
  4. Explored Wiki formatting.
  5. More reading on Line-level HTR:
  6. Explored how to use command-line arguments.

May 19

  1. Figured out three ways of doing line-level HTR if we continue on Helena's track
    1. Using the line-level HTR code published on GitHub.
      • Pros: Code already complete. Already trained by the author, reaching its highest accuracy of 84%.
      • Cons: Hard to preprocess data, as Helena experienced. It was abandoned because of the low accuracy of 0.133333%.
    2. Modified Word-Level HTR to Line-Level HTR.
      • Pros: Code for data processing already given (haven't tried it). The original code was tested and trained by Helena last year. Added more CNN layers today and made the input bigger so that it fits lines (a preprocessing sketch follows at the end of this day's notes).
      • Cons: Accuracy not tested. Not modified completely based on the author's notes. Source for 2D-LSTM or MDLSTM not found. Confused about how to implement Mean.
    3. Using word segmentation and then implementing word-level HTR.
      • Pros: Python code on word segmentation already given. Helena already explored how processing data worked in Matlab.
      • Cons: Didn't recognize lines correctly. It might not be what we are looking for. It might be slow (not compared yet).
  2. More readings on Line-Level HTR, LSTM, and MDLSTM:
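Option 2 above relies on making the model's fixed input larger so whole lines fit. A minimal sketch of that preprocessing idea, assuming a wide target size of 800x64 like the shapes used later in these notes (the project's actual preprocessing may differ):

    import cv2
    import numpy as np

    def preprocess_line(path, target_w=800, target_h=64):
        """Fit a line image into a fixed (target_w x target_h) input without distortion."""
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        h, w = img.shape
        scale = min(target_w / w, target_h / h)            # keep the aspect ratio
        new_w, new_h = max(1, int(w * scale)), max(1, int(h * scale))
        img = cv2.resize(img, (new_w, new_h))
        canvas = np.full((target_h, target_w), 255, dtype=np.uint8)  # white padding
        canvas[:new_h, :new_w] = img
        return canvas / 255.0                              # normalize to [0, 1]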

May 20

  1. Connected to a campus computer:
    • Set up Splashtop
    • Connected to computer in FORD342-01.
  2. Installed TensorFlow GPU.
  3. Uploaded current work to the nhowelab drive.
    • Mac file path: smb://ad.smith.edu/files/Academic/Research/nhowelab/Summer2020
  4. Scheduled meeting with Saraphina.
  5. Scheduled meeting with Helena on weekend.
    • Double checked how Helena changed the original code in her project.
    • Looked more into Helena's files and how images are preprocessed.
  6. Successfully ran ModifiedWordHTR with IAM line data on TensorFlow GPU; however, did not see the expected speed increase. Will check the overnight result tomorrow.
  7. Started to explore the Matlab code. Hard to understand. Decided to learn basic Matlab commands tomorrow.

May 21

  1. Met with Saraphina. Planned to collaborate and explore the project for the rest of the week.
  2. Installed TensorFlow GPU:
    • Error Left Yesterday: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'cudart64_100.dll'; dlerror: cudart64_100.dll not found
    • Code ran fine; however, TensorFlow was shown not to be running on the GPU.
    • Uninstalled CUDA v9.0 and installed CUDA v10.0. Succeeded!
    • Trained ModifiedWordHTR on IAM line data again on TensorFlow GPU. A lot faster!
  3. Set up virtual environment on my own laptop.
  4. Trained WordHTR on IAM line data on my laptop. Tried to compare line recognition using the original HTR with the modified version.
  5. Looked at more deep learning resources Professor Howe sent us:
  6. Determined that it is not necessary to explore 2D-LSTM right now.

May 22

  1. Set up TensorFlow GPU on another computer, FORD342-02.
  2. Did more research on different models:
  3. Increased the input size -- SWordHTR:
    1. Tried to input IAM word data:
      Complete training on Epoch 50.
      Character error rate: 10.718587%. Word accuracy: 72.939130%.
      Character error rate not improved
      No more improvement since 10 epochs. Training stopped.
    2. Tried to input IAM line data:
      Complete training on Epoch 49.
      Character error rate: 44.467460%. Word accuracy: 1.230769%.
      Character error rate not improved
      No more improvement since 10 epochs. Training stopped.
      "looked|out|of|the|windows|at|the|weeping|trees|and|the" -> "Looked|out|of|the|mies|sos|ton|hen|e|e|n|e"
  4. Failed when adding two more layers (see the pooling-arithmetic sketch after this list):
    ValueError: Negative dimension size caused by subtracting 2 from 1 for 'MaxPool_6' (op: 'MaxPool') with input shapes: [?,200,1,512].
  5. Decided to add one more layer:
    1. Tried squeeze (200 1 256) and tested on IAM line data:
      Complete training on Epoch 289.
      Character error rate: 19.780054%. Word accuracy: 2.923077%.
      Character error rate not improved
      No more improvement since 10 epochs. Training stopped.
    2. Tried squeeze (100 1 256) and tested on IAM line data:
      Complete training on Epoch 32.
      Character error rate: 48.342874%. Word accuracy: 0.615385%.
      Character error rate not improved
      No more improvement since 10 epochs. Training stopped.
    3. Tried squeeze (200 1 512) and tested on IAM line data:
      Complete training on Epoch 162.
      Character error rate: 28.073215%. Word accuracy: 1.230769%.
      Character error rate not improved
      No more improvement since 10 epochs. Training stopped.
    4. Tried squeeze (100 1 512) and tested on IAM line data:
      Complete training on Epoch 52.
      Character error rate: 41.676710%. Word accuracy: 0.769231%.
      Character error rate not improved
      No more improvement since 10 epochs. Training stopped.
  6. Found that the current best model is SWordHTR_256200.
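The MaxPool error and the squeeze shapes above come down to simple pooling arithmetic: each (2,2) or (1,2) pooling layer halves one or both spatial dimensions, so once the height has been pooled down to 1, another pool of size 2 fails. A pure-Python sketch (the 800x64 input size and the pooling schedule are assumptions based on these notes):

    def shape_after_pools(width, height, pools):
        """Track the feature-map size through a list of (pool_w, pool_h) layers."""
        for i, (pw, ph) in enumerate(pools, start=1):
            if width < pw or height < ph:
                print(f"layer {i}: cannot pool {width}x{height} by {pw}x{ph} -> negative dimension")
                return None
            width, height = width // pw, height // ph
            print(f"layer {i}: {width}x{height}")
        return width, height

    # Six pools take an 800x64 line image down to 200x1 (the squeeze (200 1 256) case) ...
    shape_after_pools(800, 64, [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2), (1, 2)])
    # ... while a seventh 2x pool on the height reproduces the negative-dimension failure.
    shape_after_pools(800, 64, [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2), (1, 2), (1, 2)])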

May 23 - May 24

  1. Successfully found the full data set Helena used last summer.
  2. Tried Matlab code.
    • Successfully ran half of RawToLine and cropped raw images of transcripts 14485/14526/14610 into lines.
    • Failed to run MiniGenesisWork for cropping Genesis.
    • Tested cropping Genesis in RawToLine; halfway succeeded.
    Errors in imgcut3, BlockToLine, and imgcutmulti.
  3. Looked at the Beth Mardutho library and downloaded Transkribus.
  4. Did more experiments with the top model choices.
    1. SLineHTR with IAMLineData:
      Complete training on Epoch 67.
      Character error rate: 16.929045%. Word accuracy: 6.461538%.
      Character error rate not improved
      No more improvement since 10 epochs. Training stopped.
    2. SLineHTR with MoreSyriacData:
      Complete training on Epoch 80.
      Character error rate: 1.903553%. Word accuracy: 92.363636%.
      Character error rate not improved
      No more improvement since 10 epochs. Training stopped.
    3. WordHTR with IAMLineData:
      Complete training on Epoch 47.
      Character error rate: 65.642402%. Word accuracy: 0.000000%.
      Character error rate not improved
      No more improvement since 5 epochs. Training stopped.
    4. WordHTR with MoreSyriacData:
      Complete training on Epoch 66.
      Character error rate: 0.253807%. Word accuracy: 99.272727%.
      Character error rate not improved
      No more improvement since 10 epochs. Training stopped.
    5. SWordHTR with IAMLineData:
      Complete training on Epoch 289.
      Character error rate: 19.780054%. Word accuracy: 2.923077%.
      Character error rate not improved
      No more improvement since 10 epochs. Training stopped.
    6. SWordHTR with MoreSyriacData
      Complete training on Epoch 67.
      Character error rate: 0.296108%. Word accuracy: 98.909091%.
      Character error rate not improved
      No more improvement since 10 epochs. Training stopped.
  Model      Character accuracy rate on IAMLineData   Character accuracy rate on MoreSyriacData
  SLineHTR   83.1%                                     98.1%
  WordHTR    34.4%                                     99.8%
  SWordHTR   80.3%                                     99.8%

Week 02 (May 25 - May 31)

May 25

  1. Encountered error while running MiniGenesisWork:
    • Error: Output argument "lbl" (and maybe others) are not assigned during calling to "imgcut3"
    • Solved: Compiled the code using mex.
  2. While trying to set up mex, encountered new error:
    • Error: Xcode not installed
    • Solved: Had to update macOS in order to download the latest Xcode.
  3. After installing Xcode and executing mex -setup, encountered a new error:
    • Error: Error using mex, No supported compiler or SDK was found.
    • Tried: Installed the latest Matlab to match the version of Xcode.
    • Solved: Redirected Xcode with sudo xcode-select -s /Applications/Xcode.app.
  4. After setting up mex, encountered a new error:
    • Error: Errors and warnings were found in "imgcut3" due to an update of system memory.
    • Solved: Emailed Professor Howe and updated "imgcut3".
  5. Successfully compiled "imgcut3" and moved the compiled file into /Volumes/nhowelab/Matlab/Utility.
  6. More code needs to be modified and compiled.
  7. Encountered a new error with BlockToLine.

May 26

  1. While trying to compile "imgcutmulti" by running mex '/Volumes/nhowelab/code/Mex/imgcutmulti.cpp' '/Volumes/nhowelab/code/MaxFlow/maxflow-v3.01.src/graph.cpp' '/Volumes/nhowelab/code/MaxFlow/maxflow-v3.01.src/maxflow.cpp' -I'/Volumes/nhowelab/code/MaxFlow/maxflow-v3.01.src', encountered an error:
    • Error: Cannot initialize a variable of type 'const int *' with an rvalue of type 'const mwSize *'
    • Solved: Changed "int" to "mwSize" in "imgcutmulti".
  2. While running MiniGenesisWork, encountered an error:
    • Error: Check for missing argument or incorrect argument data type in call to function 'dct'.
    Error in blockToLines (line 15)
    trmean = dct(rmean);
    • Solved: Installed the Signal Processing Toolbox.
  3. Successfully ran two Matlab scripts to cut images into separate lines.
  4. Tried WordSegmentation in Python.
  5. Tried to remove borders of images. (Found out that it is not necessary after meeting with Professor Howe.)
  6. Decided to move on from image processing.

May 27

  1. Wrote Python code for image processing and saved the results as GenesisLines (a rough sketch of these augmentations follows this day's list).
    • Increase contrast of the line images (in Python)
    • Rotate +1 degree and resize*2
    • Rotate -1 degree and resize*2
    • Rotate +2
    • Rotate -2
    • Erode
    • Add Gaussian white noise
    • Add salt and pepper noise
  2. Did more experiments with the models we have:
    1. SWordHTR with EnhancedSyriacLineData
      Complete training on Epoch 24.
      Character error rate: 95.507296%. Word accuracy: 0.000000%.
      Character error rate not improved
      No more improvement since 20 epochs. Training stopped.
      "JHTADNWNMJ.ENDLIM.EDJA" -> "A"
    2. LineHTR with EnhancedSyriacLineData
      Complete training on Epoch 29.
      Character error rate: 95.128117%. Word accuracy: 0.000000%.
      Character error rate not improved
      No more improvement since 20 epochs. Training stopped.
    3. LineHTR with SyriacLineData
      Complete training on Epoch 62.
      Character error rate: 78.772441%. Word accuracy: 0.000000%.
      Character error rate not improved
      No more improvement since 30 epochs. Training stopped.
    4. SWordHTR with SyriacLineData
      Complete training on Epoch 44.
      Character error rate: 100.000000%. Word accuracy: 0.000000%.
      Character error rate not improved
      No more improvement since 30 epochs. Training stopped
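A minimal sketch of the kinds of augmentations listed under item 1 above (rotation, erosion, Gaussian and salt-and-pepper noise), written with OpenCV and NumPy. The function names, parameter values, and file path are illustrative assumptions, not the project's exact code:

    import cv2
    import numpy as np

    def rotate(img, degrees):
        """Rotate a grayscale line image around its center, filling the border with white."""
        h, w = img.shape
        m = cv2.getRotationMatrix2D((w / 2, h / 2), degrees, 1.0)
        return cv2.warpAffine(img, m, (w, h), borderValue=255)

    def erode(img, ksize=3):
        """Erode the white background so the dark strokes thicken slightly."""
        return cv2.erode(img, np.ones((ksize, ksize), np.uint8))

    def gaussian_noise(img, sigma=10):
        noisy = img.astype(np.float32) + np.random.normal(0, sigma, img.shape)
        return np.clip(noisy, 0, 255).astype(np.uint8)

    def salt_and_pepper(img, amount=0.01):
        out = img.copy()
        mask = np.random.rand(*img.shape)
        out[mask < amount / 2] = 0           # pepper
        out[mask > 1 - amount / 2] = 255     # salt
        return out

    # Example: generate the rotated variants mentioned above (path is hypothetical)
    line = cv2.imread("GenesisLines/line_001.png", cv2.IMREAD_GRAYSCALE)
    variants = {f"rot{d:+d}": rotate(line, d) for d in (1, -1, 2, -2)}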

May 28

  1. Reorganized MoreGenesisData and tried to combine it with GenesisLineData.
    • Want to use this combined data to train the model so that it can recognize lines of arbitrary length.
    • Failed at first, as the model only took in line data after combining.
    • Solved by mixing up folders of word and line images.
    • Wrote Python code that extracts transcripts of word images from the txt file that Helena used.
  2. Searched for new models that could be used:
    1. Went through the steps to use Transkribus.
      • Emailed the Beth Mardutho Library and asked for permission to access their pre-trained Syriac HTR model. (Haven't heard back yet.)
      • Cropped Genesis images in Transkribus.
      • Looked at open source code for Transkribus.
      • Looked at ways to train Transkribus for a new language.
    2. Looked at instructions on how to use and train Tesseract.
      • Downloaded and installed Tesseract.
      • Downloaded and installed MacPorts and Homebrew.

May 29

  1. Decided to look at more possible models that we could use after meeting with Professor Howe.
  2. ICDAR: International Conference on Document Analysis and Recognition
    • Looked at the Deep Learning for Document Analysis, Text Recognition, and Language Modeling tutorial. However, didn't find useful information that we could implement to build new models.
    • Some webpages could not be found.
  3. IJDAR: International Journal on Document Analysis and Recognition
    The outcome of this method seems to be what we want.
    However, it takes touch or pen strokes as input, and I am not sure if we can implement it with our images.
  4. Attention models for end-to-end handwritten paragraph recognition
    • Implementation of MDLSTM could increase accuracy to a large extent.
  5. Handwritten Text Recognition using TensorFlow 2.0
  6. A Scalable Handwritten Text Recognition System
  7. A Computationally Efficient Pipeline Approach to Full Page Offline Handwritten Text Recognition
  8. Offline Handwritten Text Recognition using Convolutional Recurrent Neural Network
    • Similar to the model that we first tried.
  9. Handwriting Recognition Based On Temporal Order Restored By The End-To-End System
    • More like an online-based HTR.
  10. Handwriting Recognition System- A Review

May 30 - May 31

  1. WordHTR with EnhancedSyriacLineData
    Complete training on Epoch 501.
    Character error rate: 22.676089%. Word accuracy: 8.250000%.
    Character error rate not improved
    No more improvement since 30 epochs. Training stopped.
  2. WordHTR with SyriacLineData
    Complete training on Epoch 455.
    Character error rate: 11.625302%. Word accuracy: 25.950000%.
    Character error rate not improved
    No more improvement since 30 epochs. Training stopped.
  3. WordHTR with SyriacCombinedData
    Complete training on Epoch 631.
    Character error rate: 39.334055%. Word accuracy: 4.507463%.
    Character error rate not improved
    No more improvement since 30 epochs. Training stopped.
  4. Found out that the line data that we used is wrong.
  5. Corrected the SyriacLineData.
  6. Added more data into the line data:
    1. Original Lines
    2. Rotate +1 degree and resize*2
    3. Rotate -1 degree and resize*2
    4. Rotate +2
    5. Rotate -2
    6. Rotate +3
    7. Rotate -3
    8. Rotate +4
    9. Rotate -4
    10. Tilt +2
    11. Tilt -2
    12. Tilt with salt and pepper noise +3
    13. Tilt with salt and pepper noise -3
    14. Tilt +4
    15. Tilt -4
    16. Tilt +5
    17. Tilt -5
    18. Erode
    19. Add gaussian white noise
    20. Added salt and pepper noise
    21. Convert the image to grayscale
  7. Wrote a summary about the next step.

Week 03 (June 1 - June 7)

June 1

  1. Went over part of the TensorFlow 2 tutorials.
  2. Decided to explore model introduced in the paper: A Computationally Efficient Pipeline Approach to Full Page Offline Handwritten Text Recognition
  3. Tested Models with NewSyriacLineData.
    1. WordHTR with NewSyriacLineData
      Complete training on Epoch 66.
      Character error rate: 0.013087%. Word accuracy: 99.792453%.
      Character error rate not improved
      No more improvement since 70 epochs. Training stopped.
    2. LineHTR with NewSyriacLineData
      Complete training on Epoch 66.
      Character error rate: 0.013087%. Word accuracy: 99.792453%.
      Character error rate not improved
      No more improvement since 70 epochs. Training stopped.
  4. After the Transkribus team responded, Transkribus failed to open.
  5. Explored code on GitHub: Handwritten Text Recognition (OCR) with MXNet Gluon.

June 2 - June 3

Guide to running Handwritten Text Recognition for Apache MXNet on computers in Ford 342 (Windows)

  1. Git clone and set up following the guide on GitHub:
    1. Click into Handwritten Text Recognition for Apache Mxnet
    2. Windows does not include a Git command. Type in git in Command Prompt to see if git is installed. If not, download and install git here.
    3. Clone the repository by following steps in README.
    4. Install SCLITE:
      • Error: Windows does not include export and make commands.
      • Tried: Set up SCLITE on a Mac and then moved the folder to the Windows computer.
      • Not yet solved.
    5. Install hnswlib:
      • Error: fatal error C1083: Cannot open include file: 'hnswlib/hnswlib.h': No such file or directory
      • Tried: Set up hnswlib on a Mac and then moved the folder to the Windows computer.
      • Solved: Simply run pip install hnswlib.
  2. Set up MXNet GPU:
    1. Check CUDA version on the computer.
      • Experiencing warnings with CUDA 10.0.
      • Solved: Installed CUDA 10.2.
      • To install a new version of CUDA, partially follow the tutorial here.
    2. Download and install the MXNet GPU package that matches the CUDA version.
      • Check here for different versions of MXNet GPU.
      • Error: The pip install command for mxnet works, but it only installs package 1.5.0. So if the plain pip install command is used, we might encounter the error ImportError: cannot import 'replace_file' when we later try to install gluonnlp.
      • Solved: Use pip install to install packages from here.
    3. Test the installation with import mxnet as mx and mx.context.num_gpus() in Python (see the sanity-check sketch after this guide).
  3. Install other packages:
    1. Most packages can be found here.
      • If there is any error with ocr, check the Jupyter Notebook directory.
    2. Install gluonnlp by using pip install gluonnlp:
      • Error: Command errored out with exit status 1: ...
      • Tried: Using Command Prompt (Admin), Using Anaconda.
      • Solved: Installed from source here.
      Download and install Microsoft C++ Build Tools with Visual Studio Installer if needed.
  4. Make necessary changes before running the code:
    1. Download pretrained models by running python get_models.py.
    2. Create credentials.json using credentials.json.example by editing the content and renaming the file.
  5. Errors while running on Mac:
    1. While running on my laptop:
      • Error: The kernel appears to have died. It will restart automatically, while downloading the IAM dataset.
      • Tried: Ran in Terminal; however, another similar error occurred: Killed: 9.
      • Cause: Might be a lack of memory.
      • Solved: Ran on a Mac computer in FORD so that there would be enough memory to download the IAM data.
    2. While running on computer in FORD (Mac):
      • Error: Timeout Error while training.
      • Tried: Set up MXNet GPU on the Mac, but failed because CUDA requires an NVIDIA GPU and the Mac only has an Intel GPU.
      • Solved: Moved to a Windows computer with a supported NVIDIA GPU.
  6. Errors after moving to Windows with GPU.
    1. Download error:
      • Error: IOPub message rate exceeded when downloading IAMDataset
      • Solved: Ran the downloading code on a Mac computer in Ford and moved the dataset folder to the Windows computer.
    2. AttributeError:
      • Error: Can't get attribute 'augment_transform' on <module '__main__' (built-in)>
      Can't get attribute 'transform' on <module '__main__' (built-in)>
      • Solved: Followed the steps here.
      Created and imported files: defsp.py, defsl.py, defsh.py
      Commented out the original augment_transform and transform functions.
    3. GPU number error:
      • Solved: Reduced all GPU counts to 1.
    4. WinError
      • Error: The process cannot access the file because it is being used by another process
      • Solved: Restarted the computer.
  7. Successfully ran:
    1. 0_handwriting_ocr.ipynb (partial)
    2. 1_a_paragraph_segmentation_msers.ipynb
    3. 1_b_paragraph_segmentation_dcnn.ipynb
    4. 2_line_word_segmentation.ipynb
  8. It is also worth looking at the GitHub issues page.
  9. Met with Saraphina.
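A minimal sanity check for step 2.3 above, confirming that the MXNet GPU build can see the card; this is generic MXNet usage, not code from the repository:

    import mxnet as mx

    # Should report at least 1 when the mxnet-cuXX build matches the installed CUDA version
    print("GPUs visible to MXNet:", mx.context.num_gpus())

    # Run a tiny computation on the GPU, falling back to CPU if none is found
    ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()
    a = mx.nd.ones((2, 3), ctx=ctx)
    print((a * 2).asnumpy(), "computed on", ctx)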

June 4

  1. It is recommended to use an instance with 32GB+ RAM and 100GB disk size; a GPU is also recommended. A p3.2xlarge would be the recommended starter instance on AWS for this project.
  2. Got a reply from the Transkribus team, but only manuscripts were found in the Beth Mardutho Library.
  3. Errors from code runs:
    1. 0_handwriting_ocr.ipynb
      • Denoising the text output 24
      MXNetError: [21:15:42] c:\jenkins\workspace\mxnet-tag\mxnet\src\storage\./pooled_storage_manager.h:161: cudaMalloc retry failed: out of memory
      • Quantitative Results 25
      MXNetError: [21:15:42] c:\jenkins\workspace\mxnet-tag\mxnet\src\storage\./pooled_storage_manager.h:161: cudaMalloc retry failed: out of memory
      • Quantitative Results 26
      AssertionError: ../SCTK/bin does not exist
    2. 1_a_paragraph_segmentation_msers.ipynb
      • Successfully run
    3. 1_b_paragraph_segmentation_dcnn.ipynb
      • Successfully run with Batch size 32
    4. 2_line_word_segmentation.ipynb
      • Successfully run
    5. 3_handwriting_recognition.ipynb
      • Training 13
      MXNetError: [19:22:12] c:\jenkins\workspace\mxnet-tag\mxnet\src\operator\./rnn-inl.h:769: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) : cuDNN: CUDNN_STATUS_EXECUTION_FAILED
      MXNetError: [19:34:22] c:\jenkins\workspace\mxnet-tag\mxnet\src\storage\./pooled_storage_manager.h:97: CUDA: unspecified launch failure
      MXNetError: [20:01:04] c:\jenkins\workspace\mxnet-tag\mxnet\src\operator\./rnn-inl.h:1462: Check failed: e == cudaSuccess || e == cudaErrorCudartUnloading: CUDA: unspecified launch failure
      [I 22:20:32.367 NotebookApp] Replaying 3 buffered messages
      [22:21:29] C:\Jenkins\workspace\mxnet-tag\mxnet\src\base.cc:84: Upgrade advisory: this mxnet has been built against cuDNN lib version 7300, which is older than the oldest version tested by CI (7600). Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
      [22:21:33] c:\jenkins\workspace\mxnet-tag\mxnet\src\operator\nn\cudnn\./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
      • Results 14
      MXNetError: [22:36:31] c:\jenkins\workspace\mxnet-tag\mxnet\src\storage\./pooled_storage_manager.h:97: CUDA: unspecified launch failure
      • Writing the transformed test dataset for validating the language denoiser 15
      MXNetError: [22:36:31] c:\jenkins\workspace\mxnet-tag\mxnet\src\storage\./pooled_storage_manager.h:97: CUDA: unspecified launch failure
    6. 4_text_denoising.ipynb
      • TypeError: can't pickle hnswlib.Index objects
      • Tried: Install from source
    7. 5_a_character_error_distance.ipynb
      • Decode noisy forms 4
      MXNetError: [23:37:39] C:\Jenkins\workspace\mxnet-tag\mxnet\3rdparty\dmlc-core\src\io\local_filesys.cc:209: Check failed: allow_null: LocalFileSystem::Open "models/handwriting_line_sl_160_a_512_o_2.params": No such file or directory
    8. 5_b_visual_distance.ipynb
      • Data Generation 9
      BrokenPipeError: [Errno 32] Broken pipe

June 5

  1. Read more about the project here.
  2. Installed the latest version of Mxnet: mxnet-cu102 on FORD 342-01.
    • Successfully got rid of the CUDA warning.
    • But the CUDA launch error and out-of-memory error still occur.
  3. Tried different ways to fix the problem:
    1. Tried: rm -rf ~/.nv/ in command prompt.
      • Kernel died :(
    2. Tried: Reduce batch size.
      • Didn't make a difference.
    3. Tried to run the code in command prompt.
      • Errors still occur.
    4. Tried to increase RAM or virtual memory:
      • Didn't solve the problem.
  4. Decided to try on virtual machines.

June 6 - June 7

  1. Started looking at virtual machines that may help us to run the code.
  2. Tried AWS EC2:
    • Created an account.
    • Went over the setup tutorial here.
    • Found that it is a bit complicated, so made it a low priority.
  3. Tried Google Colab.
    • Made sure Google Colab has all the functionality that we need.
    • Got familiar with Google Colab.
    • Found out that Google Colab Pro has access to more RAM than the FORD Windows computer.
    • Decided to try to run the code on Google Colab.
  4. Running this project on Google Colab.
    1. Created Setup.ipynb, which includes all the necessary setup for Google Colab.
    2. Mounted Google Drive so that Google Colab can access files from Google Drive (a rough setup sketch follows this list).
    3. Found that we have to upload the preprocessed packages (25GB) into Google Drive, because processing that data and saving it to Google Drive took too much RAM.
    4. Note: Every notebook has its own runtime, so the setup has to be repeated.
    5. Google Colab seems to be slower than the FORD computer.
    6. Successfully ran:
    7. Found that deleting the exception code in sclite_helper.py could partially solve the TypeError: exceptions must derive from BaseException error while using SCTK. However, the error still occurs later.
    8. Inspired by running the code on a Mac (no GPU), found that changing the GPU to CPU later in 0_handwriting_ocr.ipynb could solve some errors.
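A rough sketch of what a Setup.ipynb cell on Google Colab looks like for item 4 above: mount Google Drive and point the notebooks at the preprocessed data. The folder name is a placeholder, not the actual layout used:

    # Run inside a Colab notebook cell
    from google.colab import drive
    import os

    drive.mount('/content/drive')   # authorize Colab to read files from Google Drive

    # Placeholder path to the preprocessed dataset uploaded to Drive
    DATASET_DIR = '/content/drive/My Drive/SyriacHTR/dataset'
    print(os.listdir(DATASET_DIR))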

Week 04 (June 8 - June 14)

June 8

(Image: Errors.png)

June 9

  1. Tried Genesis data on 1_b_paragraph_segmentation_dcnn.ipynb.
    • The output is not what we expected.
    • I am also not sure if it is the right choice to detect bounding boxes of lines by training the system.
    • The format of different data sets might vary a lot.
    For example, the IAM dataset only contains useful data in the lower half of the images, but Genesis has three columns.
    • So if the system works well on one sample, it might not work well on others.
  2. Tried to train 1_b_paragraph_segmentation_dcnn.ipynb on Genesis data.
    However, we need to find bounding boxes of blocks and words in the manuscript in order to train the model.
  3. Tried to construct block and word bounding boxes by using MiniGenesis;
    however, only bounding boxes of blocks and lines were found.
  4. Tried to find word bounding boxes on the nhowelab drive.
    Need to explore the Matlab code a bit more to find the code for generating word bounding boxes.

June 10

  1. Tried Genesis data on 1_a_paragraph_segmentation_msers.ipynb
    This is the method that the author abandoned, but I found it working pretty well in terms of predicting the word bounding boxes.
    1. Creating bounding boxes for the page and blocks
      • There are mainly three variables that can be tuned for detecting the page and blocks: iterations, dilation_d, and intersection_threshold (a rough sketch of the dilate-and-merge idea behind these parameters follows this day's notes).
      • By tuning those three variables, the system is able to detect the bounding boxes and crop out the bounded page.
      • However, there are three columns of text in Genesis, and it is hard to create a bounding box for each column.
      • Tried: iterations = 5, dilation_d = 1.1, intersection_threshold = 0.1 for the page detection.
      • Also tried tuning different variables, but the outcome is not promising.
    2. Creating bounding boxes for words
      • The original word detection is overwhelmed with boxes. (Words Detecting 1)
      • After I tuned intersection_threshold = 0.9, the bounding boxes of the words became much cleaner. (Words Detecting 2)
      • However the detection might still miss one or two words.
    3. Creating bounding boxes for lines
      • The system detected words pretty well but is not working well with lines.
      • The bounding boxes might be too small for the line generating method to find the correct line bounding boxes.
      • The sort_bbs_line_by_line method might also need to be tuned.
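A generic OpenCV sketch (assuming OpenCV 4) of the dilate-and-merge idea that parameters like iterations control: binarize the page, dilate the ink so nearby strokes merge into word-sized blobs, and take contour bounding boxes. This is an illustration of the approach, not the notebook's actual code:

    import cv2
    import numpy as np

    def word_boxes(gray, iterations=5, kernel_size=3):
        """Binarize, dilate to merge nearby strokes, and return contour bounding boxes."""
        _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        kernel = np.ones((kernel_size, kernel_size), np.uint8)
        dilated = cv2.dilate(binary, kernel, iterations=iterations)  # more iterations -> larger merged blobs
        contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        return [cv2.boundingRect(c) for c in contours]               # (x, y, w, h) for each blob

    page = cv2.imread("genesis_page.png", cv2.IMREAD_GRAYSCALE)      # hypothetical path
    for x, y, w, h in word_boxes(page):
        cv2.rectangle(page, (x, y), (x + w, y + h), 0, 2)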

June 11

  1. Read more about the line segmentation method.
  2. Got logged out of Splashtop.
    • Contacted Cats to reboot the computer so that I could log back in.
  3. Professor Howe also tried to run handwritten-text-recognition-for-apache-mxnet,
    but similar errors still occur, as described a few days ago.
  4. Tried different line segmentation methods:

June 12

  1. Got new data from Beth Mardutho! There are several things that need to be done.
    1. Explore the new data:
      • The data contains 23 different manuscripts.
      • There is an XML file for every single page of manuscript, just like in the IAM dataset.
      • Some of the data comes with a txt file of transcripts, but some only have transcripts in the XML file.
    2. Tried 1_a_paragraph_segmentation_msers.ipynb with the new data.
      • Word recognition is working well.
      • But line detection is still working poorly.
      • Decided to stick with the Matlab segmentation function.
    3. Could also try 3_handwriting_recognition.ipynb with the new data because there is an XML file for every page.
      • However, the data we got does not have the word bounding boxes that 3_handwriting_recognition.ipynb needs.
      • Explored the code and the XML format and decided to try it later if I have time.
      • Decided to stick with the Matlab code

June 13 - June 14

  1. Tried to segment the new data with the Matlab code.
    • First cropped the text section out from the page.
    Sometimes the code leaves fragments.
    • Cropped the lines from the page.
  2. Tried to align the line text in the transcript with the cropped line images.
    • The alignment process is difficult.
    • Some of the lines do not have transcript.
    • Some of the transcripts do not have matching images.
  3. Was able to make the alignment at last and trained on the data.
    • Processed 6 manuscripts
  4. Trained the data with the LineHTR system:
    Complete training on Epoch 204.
    Character error rate: 2.225549%. Sentence accuracy: 48.000000%.
    Character error rate not improved
    No more improvement since 100 epochs. Training stopped.

Week 05 (June 15 - June 21)

June 15

  1. Wrote a method to process the XML files (a rough sketch follows this list):
    • Able to detect bounding boxes of lines
    • Able to extract text from the XML file
    • Able to crop images
    • Able to save images
  2. Figured out that the first method couldn't be used for all manuscripts.
    • Wrote another method that can do the same things.
    • Found a way to name the cropped images so that the order of the lines won't be messed up.
    • There is also a Matlab method.
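A rough sketch of the XML-processing method described in item 1 above, assuming PAGE-style XML where each text line carries a polygon of points and a transcription. The element and attribute names are assumptions, since the exact schema of the Beth Mardutho files is not recorded here:

    import xml.etree.ElementTree as ET
    import cv2

    def crop_lines(xml_path, image_path, out_prefix):
        """Read line coordinates and text from a PAGE-like XML file and save cropped line images."""
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        root = ET.parse(xml_path).getroot()
        lines = []
        # '{*}' wildcards (Python 3.8+) skip over the XML namespace; tag names are assumed
        for i, line in enumerate(root.iter('{*}TextLine')):
            points = line.find('{*}Coords').get('points')            # e.g. "x1,y1 x2,y2 ..."
            xs, ys = zip(*(map(int, p.split(',')) for p in points.split()))
            crop = img[min(ys):max(ys), min(xs):max(xs)]
            text_el = line.find('{*}TextEquiv/{*}Unicode')
            text = text_el.text if text_el is not None else ''
            # zero-padded index keeps the lines in reading order when sorted by filename
            name = f"{out_prefix}-{i:03d}.png"
            cv2.imwrite(name, crop)
            lines.append((name, text))
        return lines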

June 16

  1. Processed all the data using the new method I wrote.
  2. Modified the IAM file a bit so that it would be easier to use.
  3. Trained data with the lineHTR system.
    Complete training on Epoch 171.
    Character error rate: 0.563341%. Sentence accuracy: 84.352941%.
    Character error rate not improved
    No more improvement since 100 epochs. Training stopped.

June 17

  1. Began to augment all the data using the following:
    • NoiseGW
    • RotateNeg2
    • RotateNeg4
    • RotatePos2
    • RotatePos4
    • TiltNeg2
    • TiltNeg4
    • TiltPos2
    • TiltPos4

June 18

  1. Augmented all the data except the Genesis data, because there is no transcript with spaces for the Genesis data.
  2. Got over 170,000 lines in total.
  3. Tried to transfer the data to the Ford computer, but the compressed file is still too big and has to be divided.
  4. Decided to run the LineHTR tomorrow.

June 19

  1. Figured out the difference between the training, validation, and testing sets.
    • The testing set should be hidden from the system the whole time and should only be used once, after all the training is done.
    • The validation set is used to tune the system, so even though it is not used for training, the system peeks at the validation set in every training round.
  2. Decided not to run the augmented data because the validation and testing sets were not set up properly.
  3. Added a testing method to the LineHTR model.
  4. Separated the folders for loading the validation and testing datasets.
  5. Discussed with Professor Howe how to separate the data into training, validation, and testing sets (a small split sketch follows this list).
    Decided to use 3 manuscripts for testing and 2 manuscripts for training.
  6. Documented some of the progress on the wiki.
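A small sketch of splitting the data by manuscript so that validation and testing manuscripts never appear in training, which is the point of item 1 above. The directory layout is an assumption; the manuscript IDs shown are the ones chosen under June 20 - June 21:

    import os
    import shutil

    VAL_IDS = {'03', '13', '16'}
    TEST_IDS = {'06', '15', '19'}

    def split_by_manuscript(src_dir, dst_dir):
        """Copy each manuscript folder into train/, val/, or test/ so no manuscript leaks across sets."""
        for ms in sorted(os.listdir(src_dir)):
            if ms in VAL_IDS:
                subset = 'val'
            elif ms in TEST_IDS:
                subset = 'test'
            else:
                subset = 'train'
            shutil.copytree(os.path.join(src_dir, ms),
                            os.path.join(dst_dir, subset, ms))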

June 20 - June 21

  1. Documented the rest of the progress on the wiki.
  2. Processed all the data with proper validation, testing, and training sets:
    1. Validating set:
      • 03
      • 13
      • 16
    2. Testing set:
      • 06
      • 15
      • 19
    3. Training set:
      • Others except 10 (Genesis)
    4. Validation Error:
      Complete training on Epoch 226.
      Character error rate: 15.986453%. Word accuracy: 2.352941%.
      Character error rate not improved
      No more improvement since 100 epochs. Training stopped.
    5. Testing Error:
      Character error rate: 10.357037%. Word accuracy: 11.818182%.
      • Question: Why didn't assert len(lineSplit) >= 9
      • The result wasn't what we expected: maybe because I was using three different fonts for validation and testing instead of only one font.
  3. Tried: Only used the last font for validation and testing (tried to reproduce the result where we got 99% accuracy).
    1. Used the whole data set as the training set.
      • Result: Validation character error rate of saved model: 8.137879%
      • Still not the expected error rate.
    2. Used only the data for the last font as the training set.
      • Result: Validation character error rate of saved model: 7.984380%
      • Still not the expected error rate.
    3. Found out that all the validation samples which included symbols like "(" ")" "+" have extremely high error rates.
      • Decided to exclude those symbols from all data by modifying syriacProxies.

Week 06 (June 22 - June 28)

June 23

  1. Modified all data (excluded symbols):
    1. Validation error:
      Complete training on Epoch 221.
      Character error rate: 16.074776%. Word accuracy: 2.823529%.
      Character error rate not improved
      No more improvement since 100 epochs. Training stopped.
    2. Testing error:
      Character error rate: 11.749072%. Word accuracy: 9.818182%.
  2. EastSyriac:
    1. Validation set:
      • 03
    2. Testing set:
      • 06
    3. Training set:
      • 01
      • 02
      • 04
      • 05
      • 06
      • 08
      • 09
    4. Validation error:
      Complete training on Epoch 312.
      Character error rate: 30.990390%. Word accuracy: 1.142857%.
      Character error rate not improved.
      No more improvement since 100 epochs. Training stopped.
    5. Testing error:
      Character error rate: 21.874253%. Word accuracy: 0.000000%.
  3. Estrangela:
  4. Serto:
  5. Still not the error rate we expected.
    • Tried the original division of training and validation sets, where part of a23 and all of a24 formed the validation set and all other data (including Genesis) formed the training set.
    • Training not completed, but got really close to a 0.5% error rate.

Conclusion

Offline Handwritten Text Recognition (HTR) systems transcribe text contained in scanned images into digital text. We want to use such a system to convert scanned Syriac documents into Syriac transcripts. The previous project utilized Harald Scheidl's HTR system [1], which depended on TensorFlow and contained 5 CNN layers, 2 RNN layers, and a CTC decoding layer. This former system was trained with approximately 2000 word images from the Ceriani Veteris Genesis document. However, it only did word-based recognition, and the output was limited to 32 characters. I also found that the training, testing, and validation sets were not correctly separated in the previous approach. The system was tested on what it was trained on, so the reported results are not reliable.


In summer 2020, I worked to enhance the previous system. I increased the CNN from 5 to 7 layers for more accurate recognition; I increased the input size from 128*32 to 800*64 for line-based HTR; and I increased the output size from 32 characters to 100 characters. Some other features were changed accordingly. I also expanded the data set from approximately 500 lines to 17390 lines. After properly separating the training, testing, and validation sets, I got 90 percent accuracy rates for line-based recognition. This accuracy rate varies across different Syriac fonts, so further testing is needed for a more reliable result.
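A rough Keras sketch of the enlarged CNN front end described above: seven convolutional blocks that reduce an 800x64 grayscale line image to a 200-step sequence of 256-dimensional features, which then feeds the RNN and CTC layers. The filter counts, kernel sizes, and pooling schedule are assumptions consistent with these notes, not the project's exact configuration:

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_cnn(input_shape=(800, 64, 1)):
        """7 conv blocks mapping an 800x64 line image to a 200x256 feature sequence."""
        inp = tf.keras.Input(shape=input_shape)
        x = inp
        filters = [32, 64, 128, 128, 256, 256, 256]              # assumed filter counts
        pools = [(2, 2), (2, 2), (1, 2), (1, 2), (1, 2), (1, 2), None]
        for f, p in zip(filters, pools):
            x = layers.Conv2D(f, 3, padding='same', activation='relu')(x)
            if p is not None:
                x = layers.MaxPooling2D(pool_size=p)(x)           # width 800 -> 200, height 64 -> 1
        x = layers.Reshape((200, 256))(x)                         # 200 time steps of 256 features
        return tf.keras.Model(inp, x)                             # RNN + CTC layers would follow

    build_cnn().summary()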


To improve accuracy, I tried different models. Jonathan Chung and Thomas Delteil's HTR [2] is another system that we examined. I worked to set up this system and tried to use it on Syriac data. Even though a full-page HTR is appealing, the data it was originally designed for is formatted differently from our data, so more work is needed. Due to time limitations, this system's features were not fully tested, so it is still worth conducting further research, which would hopefully achieve our initial goal.


[1] Scheidl, H. (2020, August 09). Build a Handwritten Text Recognition System using TensorFlow. Retrieved August 27, 2020, from https://towardsdatascience.com/build-a-handwritten-text-recognition-system-using-tensorflow-2326a3487cd5

[2] Chung, J., & Delteil, T. (2019). A Computationally Efficient Pipeline Approach to Full Page Offline Handwritten Text Recognition. 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). doi:10.1109/icdarw.2019.40078