Syriac Project

From CSclasswiki

Week 01

May 28

  • read some related papers to get a general sense of the field:
  • T. Bluche, H. Ney, J. Louradour and C. Kermorvant, "Framewise and CTC training of Neural Networks for handwriting recognition," 2015 13th International Conference on Document Analysis and Recognition (ICDAR), Tunis, 2015, pp. 81-85.
URL. Shows that CTC training is similar to forward-backward training of hybrid NN/HMM systems and can be extended to more standard HMM topologies;
  • P. P. Sahu et al., "Personalized Hand Writing Recognition Using Continued LSTM Training," 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, 2017, pp. 218-223.
URL. Demonstrates simple but robust continued-training techniques for adapting a pre-trained model, built with a Long Short-Term Memory (LSTM) network and a Connectionist Temporal Classification (CTC) loss function, to a specific user's writing style;
  • W. Hu et al., "Sequence Discriminative Training for Offline Handwriting Recognition by an Interpolated CTC and Lattice-Free MMI Objective Function," 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, 2017, pp. 61-66.
URL. Finds the CTC+LFMMI approach very effective, especially in punctuation-sensitive scenarios such as handwritten receipt recognition;
  • V. Pham, T. Bluche, C. Kermorvant and J. Louradour, "Dropout Improves Recurrent Neural Networks for Handwriting Recognition," 2014 14th International Conference on Frontiers in Handwriting Recognition, Heraklion, 2014, pp. 285-290.
URL. Applies dropout (a recently proposed regularization method for deep architectures) to RNNs and confirms its effectiveness on deep architectures even when the network consists mainly of recurrent and shared connections;
  • Z. Xie, Z. Sun, L. Jin, H. Ni and T. Lyons, "Learning Spatial-Semantic Context with Fully Convolutional Recurrent Network for Online Handwritten Chinese Text Recognition," in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 8, pp. 1903-1917, 1 Aug. 2018.
URL. Proposes a multi-spatial-context fully convolutional recurrent network (MC-FCRN) that exploits multiple spatial contexts from the signature feature maps and generates a prediction sequence while completely avoiding the difficult segmentation problem; also develops an implicit language model that makes predictions from the semantic context within the predicted feature sequence, providing a new perspective for incorporating lexicon constraints and prior knowledge about a language into the recognition procedure.

May 29 - 30

  • Installed Anaconda, OpenCV and TensorFlow;
created a Python 3.6 environment (TensorFlow did not support Python 3.7);
read the Anaconda documentation;
got familiar with Jupyter notebooks.
  • Explored the terminal;
learnt about the PATH variable;
learnt how to switch versions of Python/TensorFlow on a Mac;
learnt to use command-line arguments in a Python program;
Command Line Arguments in Python.
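The command-line-argument pattern learnt here can be sketched as follows (a minimal example; the argument names are illustrative, not taken from the actual scripts):

```python
import argparse

def parse_args(argv=None):
    """Parse command-line arguments; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="demo of command-line arguments")
    parser.add_argument("folder", help="folder to process")
    parser.add_argument("--angle", type=float, default=0.0,
                        help="rotation angle in degrees")
    return parser.parse_args(argv)
```

Running `python demo.py gwords --angle 10` would give `args.folder == "gwords"` and `args.angle == 10.0`.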
  • Tried out a demo NN implementation;
Build a Handwritten Text Recognition System using TensorFlow;
fixed a 'placeholder' attribute error caused by TensorFlow 2.0; solution: reinstall TensorFlow 1.12;
fixed an import error; solution: open IDLE from within the virtual environment.
  • helpful tutorials:
machine learning tutorial by Hung-yi Lee;
read the first part of the lecture notes: introduction to deep learning;
Training and Testing on our Data for Deep Learning.

May 31

  • Set up system in Ford 342:
installed Python and TensorFlow 1.12;
got the IAM dataset for training (opening the large tar file in Windows took a long time).

June 1

  • Installed TensorFlow GPU:
ran into an import error when testing the installation;
issue: it seems to be a version compatibility problem;
started over, following How to Install TensorFlow GPU on Windows - FULL TUTORIAL;
looked through the deep learning SDK documentation link.

June 2

  • went over the installation process once again but got the same error when importing TensorFlow :(

Week 02

June 3

  • Trained on the IAM dataset on the CPU in Ford 342;
I was curious about the training process, so here are some notes:
11:30 | Epoch 01 | Character error rate: 17.056961%. Word accuracy: 61.686957%.
13:37 | Epoch 11 | Character error rate: 12.676747%. Word accuracy: 69.617391%.
15:40 | Epoch 21 | Character error rate: 11.097730%. Word accuracy: 73.060870%.
..........| Epoch 24 | Character error rate: 10.486641%. Word accuracy: 74.295652%.
.......... Character error rate improved, save model
16:29 | Epoch 25 | Character error rate: 10.531246%. Word accuracy: 74.434783%.
.......... Character error rate not improved
16:45 | Epoch 26 | Character error rate: 10.491101%. Word accuracy: 74.313043%.
.......... Character error rate not improved
  • Gained a better understanding of Deep Learning with Neural Networks:
The artificial neural network is a biologically inspired machine learning methodology, intended to mimic the brain (a biological neural network). The idea has been around since the 1940s and has had its ups and downs, most notably in comparison with the Support Vector Machine (SVM). Neural networks were popular until the mid-90s, when it was shown that the SVM, using the then new-to-the-public "kernel trick" (the technique itself was thought up long before it was actually put to use), could handle non-linearly separable datasets. With this, the SVM catapulted to the front, leaving neural nets behind with little of interest until about 2011, when deep neural networks began to take hold and outperform the SVM, using new techniques, huge dataset availability, and much more powerful computers. reference
  • Covered some basics of what TensorFlow is and began using it
read the article "What is TensorFlow;"
went through TensorFlow Basics;
wrote a mini TensorFlow program and ran it;

June 4

  • finished training on the IAM dataset
| Epoch 40 | Character error rate: 10.379589%. Word accuracy: 74.643478%.
Character error rate not improved
No more improvement since 5 epochs. Training stopped.
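For reference, the character error rate reported in these logs is conventionally computed as the edit (Levenshtein) distance between the recognized text and the ground truth, divided by the length of the ground truth; a minimal sketch:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, by dynamic programming."""
    prev = list(range(len(b) + 1))          # distances from a[:0] to every prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                           # distance from a[:i] to the empty prefix
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution (0 if equal)
        prev = cur
    return prev[-1]

def char_error_rate(recognized, truth):
    """Edit distance normalized by the length of the ground truth."""
    return edit_distance(recognized, truth) / len(truth)
```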
  • found out how to convert our own data so that the model can recognize the text; link
  • continued working on the deep learning with NN and TF tutorial series
watched demos in:
Part 4: built the model for Neural Network and set up the computation graph with TensorFlow;
Part 5: set up the training process which is what will be run in the TensorFlow Session;
the basics of RNN: link
  • read about previous work on classwiki
  • understood HTR better by reading:
U.-V. Marti and H. Bunke, "Text line segmentation and word recognition in a system for general writer independent handwriting recognition," Proceedings of the Sixth International Conference on Document Analysis and Recognition, Seattle, WA, USA, 2001, pp. 159-163. click here
  • installed Meltho Fonts

June 5

  • familiarized myself with MATLAB
rotated 2223 grayscale images in H:\SyriacGenesis\gwords\a01\a01-000u by +10 and -10 degrees
  • folder path for rotated copies:

June 6

  • trained the model against Syriac data:
Character error rate: 7.534247%. Word accuracy: 75.000000%.
  • prepared more data for training:
learnt about batch renaming of files in a directory;
wrote a Python program that takes a folder name as an input argument and renames all of its files;
edited word.txt using Python and made it ready for training;
successfully added the modified data (+10/-10) to the training examples and started training.
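A sketch of the batch-renaming step (the real script's naming scheme is not recorded here, so the prefix-and-counter pattern below is an assumption):

```python
import os

def rename_all(folder, prefix):
    """Rename every file in `folder` to prefix-00.ext, prefix-01.ext, ...
    Sorting first makes the numbering deterministic."""
    for i, name in enumerate(sorted(os.listdir(folder))):
        ext = os.path.splitext(name)[1]
        new_name = "%s-%02d%s" % (prefix, i, ext)
        os.rename(os.path.join(folder, name), os.path.join(folder, new_name))
```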
  • path for folder 'a02', 'a03', and new 'word.txt':
  • finished training with modified images at Epoch 75
Character error rate: 4.618938%. Word accuracy: 84.333333%.

June 7

  • made a list of ways to improve recognition accuracy;
  • tried to install TensorFlow GPU on the gpu server:
tried CUDA versions 9.0, 9.1, and 10.0;
tried TensorFlow GPU 1.12, 1.13;
  • prepared more data and trained the model:
a04-000u: enlarged the images to 1.5 times their size and tilted them by +5 degrees
a05-000u: enlarged the images to 1.5 times their size and tilted them by -5 degrees
  • path for folder 'a04-000u', 'a05-000u', and new 'word.txt':

Week 03

June 10

  • updated the model with:
Character error rate: 1.184433%. Word accuracy: 95.818182%.

NRH>>> Awesome! Quick question: is this training error or test set error?

The dataset is split into 95% of the samples used for training and 5% for validation. This is the validation set error rate, which provides an estimate of the test error rate.
(how experts in the field of machine learning define train, test, and validation datasets)
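The 95/5 split can be sketched like this (a generic version; the loader the project uses does its own internal split):

```python
import random

def split_samples(samples, train_frac=0.95, seed=0):
    """Shuffle deterministically, then cut into training and validation lists."""
    rng = random.Random(seed)        # fixed seed keeps the split reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(train_frac * len(shuffled))
    return shuffled[:cut], shuffled[cut:]
```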
  • read the CUDA toolkit documentation at URL
  • successfully trained the model on GPU (finally!!!)
same dataset but different validation set error rate on a different PC:
Character error rate: 0.803723%. Word accuracy: 97.454545%.
  • working on shear transformations to enlarge the dataset
added a shear of -0.25 with size 0.8

June 11

  • set up a second tensorflow-gpu environment:
sadly it wasn't successful :(
  • Image Data Augmentation:
Added salt-and-pepper noise, with a noise density of 0.1, to the images. --> a7
Added Gaussian white noise with mean 0.1 and variance 0.01. --> a6 (replaced the shear transformation)
Character error rate: 0.759878%. Word accuracy: 97.066667%.
PS: At first, the word accuracy dropped by 10% after these two sets of images were added.
One possibility is that the data augmentation strategy adds some bias to the examples that doesn't match the original examples.
I also think data augmentation may be more effective when training from scratch than when fine-tuning.
So I replaced the shearing set with the Gaussian white noise set, because I had only sheared the images in one direction, which may introduce bias.
Then I deleted the files contained in the model/ directory, and trained the model from scratch.
two sources I find helpful:
some possible reasons why training accuracy would decrease over time
Why data augmentation leads to decreased accuracy when finetuning
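The salt-and-pepper augmentation was done with MATLAB, but the idea can be sketched in plain Python; with density 0.1, roughly 10% of the pixels are forced to black or white:

```python
import random

def salt_and_pepper(pixels, density=0.1, seed=0):
    """Return a noisy copy of a 2-D grayscale image (lists of 0-255 values)."""
    rng = random.Random(seed)
    out = []
    for row in pixels:
        new_row = []
        for p in row:
            r = rng.random()
            if r < density / 2:
                new_row.append(0)      # pepper: black pixel
            elif r < density:
                new_row.append(255)    # salt: white pixel
            else:
                new_row.append(p)      # untouched
        out.append(new_row)
    return out
```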

June 12

  • After adding two sets of shear and two sets of noise:
Character error rate: 0.298851%. Word accuracy: 98.900000%.
  • ran Genesiswork.m:
found missing functions and used addpath();
read the comments, ran chunks and tried to see how each one works;
got a general view of the process: 1) load page 2) binarize 3) blocks 4) lines

June 13

  • selected 15 pages from Binarized Manuscripts and found their copies in Raw Manuscripts;
  • working on finding corner coordinates of binarized manuscripts:
tried to apply the functions in Genesiswork.m to the binarized manuscripts;
but the borders weren't cropped properly this way;
switched to ginput to crop manually;
decided to go over some MATLAB basics.

June 14

  • learnt and practiced some MATLAB basics:
variables: int8, int16, int64, char, logical, double, single
some math functions
if statements / switch-case in MATLAB
vectors and matrices
cell arrays

Week 04

June 17

  • gained a better understanding of digital image processing;
read through and took notes on chapter 2 of Digital Image Processing Using MATLAB (Gonzalez);
learnt about matrix indexing and how images are stored as matrices;
a binary image is a logical array of 0s and 1s
  • went back to the code
put all 15 pages into a 1x15 cell array where each cell is a 1x2 cell array that contains 2 blocks.

June 18

  • successfully generated grayscale line images!
spent the whole day coding
became much more comfortable using MATLAB, and found it helpful to draw pictures to see how the data are stored
computed the line corners from the binary images and applied them to the grayscale images
--the biggest challenges were the relationship between the Xs and Ys, the structure of the nested cell arrays, and how to index the nested cell arrays to get the correct pair of coordinates
learnt to use dbstop if error
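The corner computation can be sketched in Python terms (a stand-in for the MATLAB code; images are lists of rows): find the bounding box of the ink in the binary image, then crop the grayscale image with the same indices.

```python
def ink_bounding_box(binary):
    """(top, bottom, left, right) of the smallest box containing every 1-pixel."""
    rows = [i for i, row in enumerate(binary) if any(row)]
    cols = [j for j, col in enumerate(zip(*binary)) if any(col)]
    return rows[0], rows[-1], cols[0], cols[-1]

def crop_like(binary, gray):
    """Crop the grayscale image using corners computed from the binary mask."""
    top, bottom, left, right = ink_bounding_box(binary)
    return [row[left:right + 1] for row in gray[top:bottom + 1]]
```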

June 19

  • figured out how to save all the chopped lines as image files
read the documentation of sprintf
learnt to combine sprintf and imwrite to name and store images in a loop
  • moved on to the next step:
--feed complete text lines instead of word images
the first small step is to generate training examples of text lines (images with tags)
the plan is to make use of wtag and wimg to get lntag and lnimg

June 20

  • trying to get the line word index (lwi)
read the rest of GenesisWork.m closely, loaded variables and added functions to the path
a code chunk a few lines above lwi ran for the whole afternoon; have to come back to it tomorrow

June 21

  • got the line word index;
  • wrote code to create line images and line transcripts;
found mismatches among the words that are stored in different places.

Week 05

June 24

  • assembled the line images and line transcripts:
learnt how to put elements into cells while looping;
corrected the order of the transcripts;
  • prepared line data for NN training:
used both python and matlab to convert our dataset to the IAM format;
named the image files according to the IAM format;
created a txt file that includes the transcript of each image in the correct format.

June 25

  • working on a recognizer that can take in complete text-lines
Line-level Handwritten Text Recognition with TensorFlow
  • prepared the dataset:
the IAM-format txt file does not work;
a labels.json file is required (each key is the path to an image file and each value is the ground-truth label for that image);
used MATLAB and Python to successfully write the json file;
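Writing the labels.json can be sketched as follows (the key/value layout follows the description above; the example path is illustrative):

```python
import json

def write_labels(pairs, out_path):
    """pairs: iterable of (image_path, transcript). Writes {path: label} as json."""
    labels = {img: text for img, text in pairs}
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(labels, f, ensure_ascii=False, indent=2)
    return labels
```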
  • Tried to train the model:
1. FileNotFoundError: [Errno 2] No such file or directory: '/path/to/training/data/labels.json'
fixed the directory error by editing the path in
2. File "", line 115, in validate
TypeError: 'float' object is not iterable
3. the NN kept recognizing nothing, and training stopped after 8 epochs without improvement
-solved by changing "earlyStopping" from 8 epochs to 20 epochs
-it started to recognize characters somewhere around the 15th epoch
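The earlyStopping change just widens the patience window; the stopping rule can be sketched as:

```python
def train_with_patience(cer_per_epoch, patience=20):
    """Return the epoch (1-based) at which training stops, given per-epoch CERs.

    Training stops once `patience` consecutive epochs pass without the
    character error rate improving on the best seen so far."""
    best = float("inf")
    no_improve = 0
    for epoch, cer in enumerate(cer_per_epoch, 1):
        if cer < best:
            best = cer
            no_improve = 0
        else:
            no_improve += 1
        if no_improve >= patience:
            return epoch
    return len(cer_per_epoch)
```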

June 26

  • checked the alignment of the transcripts to the images
manually checked all of the images whose numbers are multiples of 4 (image numbers 16, 20, 24, 28, 32, ...)
found two stretches of misalignment, approximately 18 images in total
  • downloaded IAM line dataset
had trouble extracting the tgz file on the PC but solved it by the end of the day

June 27

  • sanity check: train on IAM lines;
created transcripts in a json file
--had to manually fix all of the quotation marks inside the strings (fixed through files a to g)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
--solved after adding ".png" to every key in the json file
got the training to go.

June 28

  • Loaded 6929 images
Train on 6236 images. Validate on 693 images.
Character error rate: 19.545311%. Address accuracy: 2.000000%.
  • Wrote to the author of LineHTR
  • tried out splitLine.m

Week 06

July 1

  • got binarized line images;
took the full-page binarized images and split them at the same points as the grayscale images;
  • tried the splitLine function on the line images:
loaded the line images into two arrays: gray_line and binary_line;
compared the results and found mismatches.

July 2

  • corrected the data and got the line HTR to run with lines.txt;
modified the data loader:
the original data loader loads a json file; we want it to be able to load an IAM transcript, which is a txt file;
imported os;
added the attribute truncateLabel to DataLoader;
changed the directories of the images according to the IAM format;

July 3

  • trained on the corrected data:
Character error rate: 24.636447%. Address accuracy: 0.133333%.
decided to abandon the line recognizer.
  • continued working on the word recognizer:
binarized the gray line images to get BW images that are perfectly aligned;
applied splitLine to the new lines.

Week 07

July 9

  • tried the new splitLine;
saved gray words as png image files.
  • tried to make the NN infer the text from our own images;
modified the code under 'args.test' in

July 10

  • collected the results of the inferred texts:
the results were not as good as expected;
it turns out that the NN is not very good at recognizing text from manuscripts other than the training examples
  • added more training examples:
a01: RGB images;
a02-a09: variations of a01, RGB images;
added a10: eroded the images using se = strel('cube',2), RGB images;
added a11: histogram matching by intensity, grayscale images;
--a11's images look similar to the grayscale words that we created.
  • trained the network.

July 11

  • applied splitLine directly to RGB images;
  • decided to make the testing data look like the training examples:
histogram matching by color;
size adjustment of the images --scaled all images by 30/86 using imresize and ran the network again;
  • split pages from manuscript #14485 into words.

July 12

  • MATLAB broke down in Ford, so worked in Young;
  • selected some words in good condition from 14485 and applied the recognizer to them;
the results look good enough on the selected testing examples;
  • fixed all remaining issues with splitting lines into words.

Week 08

July 15

  • prepared the testing data:
went over the "pages -> lines -> words" process again:
manuscripts #14485 and #14526;
made sure that:
1) all of the lines can be loaded in order --changed the file names using %03d;
2) all of the words have been correctly resized to the same scale as the training data;
got word images whose locations in the raw pages can be easily pinpointed.
  • folder path:

July 16

  • experimented on scaled line images, but they couldn't go through the network
  • started working on the transcript file
Page 1, block 1, lines 1-7
  • used proxiesToSyriac to translate the labels back to Unicode Syriac
  • converted the Unicode to a string and wrote it to a file

July 17

  • searched for corresponding manuscripts in the Syriac Digital Corpus --no results;
  • successfully created HTML for the first 7 lines;
installed Syriac fonts;
--the webpage looks really nice!!
  • transcription:
Page 1, block 1, lines 8-17;
  • Transkribus:
registered at the website and downloaded Transkribus;
the most important purposes of Transkribus to us are:
i. Transcribe documents for a scholarly edition;
ii. Create training data to feed the Handwritten Text Recognition (HTR) system so it can learn to decipher our own historical documents;
iii. Run HTR on our documents and receive automatically generated transcripts;
  • explored Transkribus and became familiar with how it works, through the guide:
a. learnt to upload documents;
b. segment documents into lines;
c. use Handwritten Text Recognition (HTR) on your documents:
--it is simple to have your documents recognised by the computer; you can start training a model with around 5,000 transcribed words of printed text or 15,000 words of handwritten text; to start the training process, drop the Transkribus team a short email once you have segmented and transcribed a first batch of pages;
--you will then receive permission from them to train your own model; if you need more information on that, check the How to Train a Model guide.

July 18

Ecclesiastical History by Eusebius of Caesarea
description in British Library

Week 09

July 22

  • Continued searching for transcripts in the Syriac Digital Corpus that match our manuscripts;
  • Decided to transcribe the 30 pages of SyriacGenesis that we have.

July 23

  • Prepared the testing dataset
chopped pages into lines and then into words;
noticed that the order of the word images in the directory was not what I wanted, so redid the segmentation process;
--when naming the files, pad the numbers with zeroes up to the maximum number width; the sort order is then maintained in the directory.
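The zero-padding fix can be sketched as follows (a pattern like line-001.png is an assumed naming scheme, not the project's recorded one):

```python
def padded_names(count, stem="line", ext=".png"):
    """Names like line-01.png ... line-12.png, padded to the width of `count`,
    so that lexicographic directory order matches numeric order."""
    width = len(str(count))
    return ["%s-%0*d%s" % (stem, width, i, ext) for i in range(1, count + 1)]
```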
  • Started writing a program that creates line transcripts from word transcripts.

July 24

  • Tried to solve an issue raised during testing
The 30 pages of Syriac Genesis were chopped into 5085 lines and 26839 "words," but they couldn't go through the network and give me a whole list of word transcripts all at once:
random "broken" images fail the NN and throw errors;
removed them from the testing data by running the program repeatedly;
  • finished transcribing words from lines 1-1500;
  • Program that creates line transcripts:
got stuck on how to read in the line number and word number.

July 25

  • Successfully wrote the program that creates line transcripts;
  • finished transcribing words from lines 1500-3000;

July 26

  • finished transcribing all of the words (5085 in total);
skipped over the error images using a try/except statement in Python;
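The try/except skip can be sketched as follows (infer stands in for the actual recognizer call, which is not shown in these notes):

```python
def recognize_all(image_paths, infer):
    """Run `infer` on every image; skip any that raise, recording the failures."""
    results, skipped = {}, []
    for path in image_paths:
        try:
            results[path] = infer(path)
        except Exception:
            skipped.append(path)   # "broken" image: note it and move on
    return results, skipped
```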
  • put "wimg" 2236-3600 into the NN;
our recognizer can deal with binarized images, grayscale images and RGB images.