Syriac handwriting recognition II

Week 01 (May 18 - May 24)

May 18

  • Installed python 3.8, pip, sublime, virtualenv.

May 19

  • Installed TensorFlow 2.
  • Read through Helena's wiki to get a sense of the work she had done.

May 20

  • Connected to Smith VPN.

May 21

  • Met with Winnie.
  • Got onto the Howe drive, familiarized myself with the drive.
  • Tried to run the GitHub code that Helena used for her project.
    • Downloaded zip from [1].
    • When running the code in the terminal, it reported that cv2 was missing, so installed OpenCV using [2]. Also installed Anaconda. This did not fix the problem, but following this post ([3]) and running the command "pip install opencv-python" did.
    • Ran code again; got error "No module named 'editdistance'", so used [4], typed the command "conda install -c conda-forge editdistance", and proceeded to download the corresponding package.
    • Got the error "No module named 'tensorflow'". Installed TensorFlow 1 with command "pip install tensorflow==1.15"
    • Got the error "No saved model found in: ../model/". Looked through the file but did not find a solution to this problem.
  • Familiarized myself with code.
    • Looked at the various files, tried to understand meaning.
    • Read the article [5] to help understand the Deep Learning process and function of files. Looks like is not part of the Deep Learning process, but rather gets information about the input before the Deep Learning takes place.
  • Learned more about Deep Learning.
    • Watched video [6].


  • How to fix problem with running code.
  • How to change data inputs.
  • How to get access to the GPU, how to read IEEE papers.
  • What exactly the RNN does.
  • Why does Helena have 3 folders-- NN, and the two HTRs? Where are Helena's Matlab files?
  • What does the transfer from the hidden layers to the last layer do?
  • What exactly is an activation function (how does it determine which specific neuron to activate)?
  • What is the weighted sum in Deep Learning (how does it relate to weighted channels and biases)?

May 22-25

  • Met with Winnie.
    • Determined that the problem with no saved model was just due to my not unzipping a folder named "model" within the SimpleHTR-master folder.
    • Looked carefully through README file.
    • Code runs now.
      • Returned: "Recognized: "little" / Probability: 0.966255"
  • Looked more closely at Deep Learning videos and articles from yesterday, answered my questions from yesterday on Deep Learning.
  • Set up this wiki page.
  • Looked at the Beth Mardutho website and corresponding libraries [7] in search of manuscript page images and their corresponding transcriptions.
    • Made an account with Transkribus [8], tried to download the application.
    • Received error, "application not responding" whenever I tried to open it. Emailed the team.
  • Continued learning on Deep Learning.
    • Read CNN explanation Part 1, [9].
    • Viewed linked video, [10].


  • What is word beam search decoding?

Week 02 (May 25 - May 31)

May 25

  • Was able to open the Transkribus application (just changed location).
    • Navigated around the application, figured out how to export training data in several languages.
    • Did not see the Syriac training data available.

May 26

  • Determined how to open files from server in Matlab online.
  • Made an account, began downloading Splashtop Business. Wasn't sure what computer to connect to.
    • Ford 342-02 was already in another session so I couldn't connect to it.
  • Looked through Helena's code, paying attention to formats and input of data.
    • Dataset should be converted to IAM format, or you can edit code in DataLoader to match preferred dataset format.
    • Put data in the "data" folder. (More info in README.)
    • Still confused about Helena's code and the different files and functions within them, especially DataLoader.
  • Couldn't find link to playground.

May 27

  • Set up TensorFlow GPU on Ford 342-03 using a comment at bottom of this video [13].
    • The computer was on Python 2.7.16 so I updated it by installing Python 3.8.3.
      • Used this link [14], and selected "Add Python 3.8 to PATH"
      • But when I checked whether it was installed by doing "python --version", it still said 2.7.16. Looking in apps, I saw Python 3.7 had also been installed. Figured they were all installed, but for some reason Terminal just reported the earliest version.
    • Installed conda for Python 3.7 64-bit using [15].
      • Did not add Anaconda to PATH variable (not recommended), but did select to use it automatically when using Python (recommended).
    • In Anaconda Prompt (rather than terminal), typed "conda create -n myenv python=3.8.3" then proceeded.
    • Then typed "conda install -c conda-forge tensorflow-gpu" and proceeded.
    • Not completely sure if it is installed properly, as the test commands ([16], [17]) generated errors in both Anaconda Prompt and Command Prompt (Terminal): "'import' is not recognized as an internal or external command, operable program or batch file", "'activate' is not recognized as an internal or external command, operable program or batch file".
  • Read CNN explanation Part 2, [18].
  • Found out how to obtain the Syriac Transkribus models on the website [19] under "instructions."
    • Wasn't sure which models to request in my email to the Beth Mardutho people (Estrangelo, East Syriac, and/or Serto models).
  • Confused on how to locate the image cut files.


  • When to use Anaconda Prompt vs. Terminal vs. Python on GPU?
    • Why isn't import tf, activate, etc working?
  • Why does it always state the python version is 2.7?
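
Note on the questions above: the "'import' is not recognized" message suggests a Python statement was typed at the shell prompt instead of inside a running Python session. A minimal sketch for checking, from inside Python, which interpreter is actually running and whether it can see a module. The function name describe_python is made up for illustration; only standard-library calls are used.

```python
import importlib.util
import shutil
import sys


def describe_python(module_names=("tensorflow",)):
    """Report which interpreter is running and which modules it can find."""
    report = {
        "running_interpreter": sys.executable,
        "running_version": "%d.%d.%d" % sys.version_info[:3],
        # What the bare commands "python" / "python3" resolve to on PATH
        # (None when no such command is found).
        "python_on_path": shutil.which("python"),
        "python3_on_path": shutil.which("python3"),
    }
    for name in module_names:
        # find_spec returns None when a top-level module cannot be located.
        report["has_" + name] = importlib.util.find_spec(name) is not None
    return report
```

Saving this to a file and adding a loop that prints each key and value of describe_python() would show whether "python" on PATH and the interpreter that has TensorFlow installed are the same program.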

May 28

  • Emailed the Beth Mardutho team requesting Estrangelo, Serto, and East Syriac models.
  • Looked into how to crop images on Matlab.
    • Looked at tutorial for cropping images around face [20]
    • Other potentially helpful links that are more pertinent to text image cropping are [21], [22], and [23].
  • Read CNN explanation Part 3, [24].


  • How to run .cpp and .mexw64 files on Matlab?
  • Should I start from scratch making an image cropping function? What specifically should I crop the images to?
  • Will I break up the transcription on my own, how to label them, etc?

May 29-June 1

  • Received the three Syriac models but unable to open them because not able to log into my Transkribus account.
    • Received error "Login failed: Already connected"

Week 03 (June 2 - 7)

June 2

  • Still receiving Transkribus error, tried troubleshooting a few other ways to no avail.

June 3

  • Met with Winnie
    • Got information on how to get started with image cropping on Matlab as well as how to train the git model on Helena's already-cropped images.
      • Where files are located, etc.
  • Changed the path of new version of Python on the GPU
    • Watched this tutorial and followed along [25] but when I tested it out with cmd it was not successful, still showing Python 2.7.16.
    • Used (the second part of) this tutorial [26] (without undoing what I did in previous tutorial), changing name of python.exe in Python 3.8.3 to python3.exe so computer could differentiate between the python.exe files in the two different versions of python.
    • This was successful; now typing "python" brings up Python 2.7.16 whereas typing "python3" brings up Python 3.8.3.
    • Also typed "python3 -m pip install --upgrade pip" so that in the future downloading python modules won't be a problem.
  • Began training git code with our files.
    • Confused on how to input the Genesis data onto the git code. Looked at this article [27] and the git code's original data [28], as well as an article containing python code to convert data into an IAM-compatible format [29], but still unsure how to proceed.
    • Looked at =, tried to understand it.
    • Ended up using the converted data in Summer2020.
      • Copied them into my separate folder, followed instructions in readme.
      • Ran this and it still outputted the "little" original output, so the new data is not being utilized.
        • Then remembered I need to train it first on new data.
          • Did "python --train" from src, and got many errors.
          • Deleted everything from directory; it is now training though taking a very long time.
          • Took about 5 hours to run; ~25-30 epochs; output was:
            • "Character error rate: 4.653130%. Word accuracy: 84.181818%. / Character error rate not improved / No more improvement since 5 epochs. Training stopped."
    • Decided to train on Ford computer.
      • Not sure of the best way to go about this.
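
Note on the "No more improvement since 5 epochs. Training stopped." message: this reads like early stopping on the validation character error rate. A minimal sketch of that idea, assuming a patience of 5 epochs; the list of error rates stands in for real training and validation, and the function name is hypothetical rather than taken from the git code.

```python
def train_with_early_stopping(cer_per_epoch, patience=5):
    """Return (best_epoch, best_cer, epochs_run) for a sequence of validation CERs.

    cer_per_epoch stands in for "train one epoch, then validate";
    patience=5 mirrors the message in the training log.
    """
    best_cer = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    epochs_run = 0
    for epoch, cer in enumerate(cer_per_epoch):
        epochs_run = epoch + 1
        if cer < best_cer:
            # Strict improvement: remember it and reset the counter.
            best_cer, best_epoch = cer, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # no improvement for `patience` epochs in a row
    return best_epoch, best_cer, epochs_run
```

Under this scheme the model saved at the best epoch is the one to keep, so a run that stops after ~25-30 epochs has not necessarily failed.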


  • Should I create a new User account on the Ford 342-03 computer? Which user should I be using, because I just realized I was under "mtphan." Will I have to reinstall things if I switch?
  • What exactly is IAM? In general how to set up our own data on git code. Still want to look at the python converter file to understand, even though I already have the data.
  • Should the git code files and the IAM data and basically all my work be on the Ford computer or my own (is there a way to run files from my own computer on the GPU)?
  • Is it correct that in order to run it on the GPU I have to put my files on the nhowelab server and then copy them to the GPU? Or is there an easier way?
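
Note on "What exactly is IAM?": as far as I understand it, the IAM word data is a folder of word images plus a words.txt index with one line per image. A rough sketch of reading one such line, assuming the nine-field layout "word_id status graylevel x y w h tag transcription" and the nested "words/a01/a01-000u/..." image layout; both should be checked against the actual files before relying on this.

```python
def parse_iam_words_line(line):
    """Parse one non-comment line of an IAM-style words.txt file.

    Assumed layout (space separated):
    word_id status graylevel x y w h tag transcription
    """
    fields = line.strip().split(" ")
    if len(fields) < 9:
        raise ValueError("expected at least 9 fields, got %d" % len(fields))
    word_id = fields[0]
    parts = word_id.split("-")
    # Assumed folder nesting: words/<form prefix>/<form id>/<word id>.png
    image_path = "words/%s/%s-%s/%s.png" % (parts[0], parts[0], parts[1], word_id)
    return {
        "word_id": word_id,
        "segmentation_ok": fields[1] == "ok",
        "bounding_box": tuple(int(v) for v in fields[3:7]),
        # The transcription may itself contain spaces, so rejoin the tail.
        "text": " ".join(fields[8:]),
        "image_path": image_path,
    }
```

Converting our own data would then mean producing an index file and folder layout of this shape, or changing the loader to read whatever layout we already have.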

June 4

  • Tried to switch accounts on Ford 03; the computer seems to be glitching and is unable to sign out.
  • Tried updating Java and redownloading Transkribus, unable to even open the application at this point. Computer glitches as well now.
  • Began reading up on Tesseract.

June 5-7

  • Still unable to log out of mtphan account on Ford computer.
  • Successfully installed Tesseract (on personal computer) following these directions: [30],[31].
    • Installed Homebrew.
  • Began testing it out with simple images to convert.
    • Works well with printed text.
    • For some reason reads cursive text "little" as empty document.
    • Poor results with handwritten non-cursive text. For example, in somewhat messy handwriting of "the fox jumped over the log" 6 times, the output was:
      • The fox “amped over Le \og / The fox "yucaped Over Tle leg / “The for succes ver tLe \o6) / “a fox "yuerped over tHe sq / “The Poe “paged pve We \o / The Bx juesped ov es saTey
    • To transcribe something, command is "tesseract fileName.png out" and outputs to out.txt file in Home.
    • Type tesseract -h for help menu.
    • Helpful link: [32].
    • Seems like converting a series of images into a .tiff format to run Tesseract on may be viable [33].
    • Attempted adding Syriac language "syr" from [34] with this tutorial [35].
      • Typed "tesseract fileName.png out -l syr" but got error "Can't open syr"
        • Not sure if it's not fully installed or if it's not installed in the right location. Tried changing location but didn't work.
      • Note: Can only extract one language at a time.

Week 04 (June 8 - June 14)

June 9

  • Fixed Tesseract error with syr, tested it out on individual image.
    • Moved syr.traineddata into "4.1.1" -> "share" -> "tessdata" folder where all of the other languages e.g. eng.traineddata are, and still outputs same error, but when I tried it again, it worked for some reason.
      • Outputted gibberish because it was transcribing an English text.
    • Tried testing it on a Syriac word.
      • Got error "Warning: Invalid resolution 0 dpi. Using 70 instead. / Estimating resolution as 1756 / Empty page!! / Estimating resolution as 1756 / Empty page!!" Fixed it by making image smaller.
      • Got error "Empty page!! / Empty page!!" Fixed by using manual page [36] to add on a --psm N to the command, typing "tesseract syriacTest3.png out -l syr --psm 8"
        • --psm N commands specify the type of input, e.g. single word, block of text, etc.
        • This image [37] (shrunken) got transcribed to be ܥܒܢ.
  • Figured out how to get some sort of accuracy gauge returned by Tesseract
    • Confidence value can be outputted as a .tsv alongside the transcription by "tesseract syriacTest3.png out --oem 1 -l syr --psm 8 tsv"
      • Output is a matrix, can be opened in Numbers. It appears that the confidence value is the Level 5, "conf" column value.
      • Confidence value ranges from 0 to 100 with 100 being most confident.
        • Previous image example [38] returned a conf value of 3.
  • Ran tesseract on Genesis documents.
    • Wanted to go through multiple at a time, so installed pytesseract with [39] and wrote a python code to do so.
    • Pillow already installed.
    • Received errors when following the previous video, likely because it was made on Windows, but with this tutorial [40] and a few adjustments it transcribed successfully.
    • Successfully wrote program to transcribe each manuscript of "30Pages" and save it as its own .txt file.
      • Does not seem very accurate with transcription.
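
A sketch of the batch step described above, written with subprocess and the tesseract command line rather than pytesseract (so this is not the program I actually wrote). The flags are the ones used earlier in these notes ("-l syr", "--psm 8", "tsv"); the function names, folder arguments, and "*.png" pattern are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def tesseract_command(image_path, output_base, lang="syr", psm=8, tsv=False):
    """Build the argument list for one Tesseract run (nothing is executed here)."""
    cmd = ["tesseract", str(image_path), str(output_base),
           "-l", lang, "--psm", str(psm)]
    if tsv:
        cmd.append("tsv")  # ask for TSV output, which carries the conf column
    return cmd


def transcribe_folder(folder, out_folder, pattern="*.png", **options):
    """Run Tesseract once per matching image, one output file per image."""
    out_folder = Path(out_folder)
    out_folder.mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(folder).glob(pattern)):
        cmd = tesseract_command(image, out_folder / image.stem, **options)
        # Requires the tesseract binary on PATH and syr.traineddata in tessdata.
        subprocess.run(cmd, check=True)
```

For a full manuscript page a different page segmentation mode than 8 (single word) would be needed, e.g. transcribe_folder("30Pages", "out", psm=1).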


  • How to compare Tesseract's conf values with Tensorflow accuracy rates? Are my interpretations of the conf values even correct?
    • Just compare Tesseract's transcription to "Ground Truth" document.
  • Which data should I test Tesseract on?
  • Should this be on CPU (like it is now)?
  • Should I be putting the output files in the shared server?
  • Where are the "ground truth" transcriptions located?
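
A sketch for reading the conf values out of the .tsv file instead of opening it in Numbers. It assumes the TSV has a header row with "level", "conf", and "text" columns and that level 5 rows are the word rows, which matches what I saw in the output; the function names are made up.

```python
import csv
import io


def word_confidences(tsv_text):
    """Return (text, conf) pairs for the word-level rows (level 5) of the TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    words = []
    for row in reader:
        if row["level"] == "5":
            words.append((row["text"], float(row["conf"])))
    return words


def mean_confidence(tsv_text):
    """Average word confidence on the 0-100 scale, or None if there are no words."""
    confs = [conf for _, conf in word_confidences(tsv_text)]
    return sum(confs) / len(confs) if confs else None
```

Usage would be something like mean_confidence(open("out.tsv", encoding="utf-8").read()). Note that this is Tesseract's own confidence, not an accuracy; accuracy still needs a comparison against the ground truth.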

June 10

  • Tried to run the Matlab file that transliterates Syriac to Latin letters, but was unsuccessful. Don't see code, only cells. Perhaps I am trying to run the wrong document.
  • Compared Tesseract's transcribed document to ground truth.
    • Located ground truth "30Pages" documents, in "Caroline Provine Syriac"
    • Learned about using cosine between documents to compute similarity, in Natural Language Processing. Also learned about Natural language toolkit (NLTK) and Gensim. Using [41], installed Gensim. NLTK was already installed.
    • However, decided to go with Spacy [42]. Installed it in a virtual environment using [43].
      • Spacy has other languages [44], but not Syriac.
    • Wrote a python file that computes similarity between two text files using NLTK method.
      • Unsuccessful with NLTK. Error: "No module named 'nltk'". But I don't think this method will work because it breaks up text according to English words.
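
Since word-based methods assume English words, one alternative is cosine similarity over character n-grams, which needs no language model and no external packages. This is a sketch of that swapped-in idea, not the NLTK or Spacy method; the choice of bigrams (n=2) and of dropping whitespace are my assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Count overlapping character n-grams, ignoring all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def cosine_similarity(text_a, text_b, n=2):
    """Cosine of the angle between the two n-gram count vectors (0.0 to 1.0)."""
    a, b = char_ngrams(text_a, n), char_ngrams(text_b, n)
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # at least one text is too short to form an n-gram
    return dot / (norm_a * norm_b)
```

Because Python strings are Unicode, this should work on Syriac text directly. It ignores character order beyond n characters, so two documents with the same letters in a different arrangement can still score high.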


  • Where is transliteration Matlab file, how to run it?

June 11

  • Tried to run Matlab syriacProxies.m file (in Matlab->Handwriting) but received an error.
  • Looked into editdistance function--seems more complicated than transliteration.


  • Can I make the transliteration and even comparing a part of the transcription process? Seems a lot easier than 3 separate codes. How to get transliteration on python not Matlab.
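
On the edit distance idea: a character error rate can be computed in plain Python on the Unicode strings themselves, which may make the transliteration step unnecessary for comparison purposes. A minimal sketch, assuming the error rate is the Levenshtein distance divided by the ground truth length; this is my own implementation, not the editdistance package used by the git code.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def character_error_rate(recognized, ground_truth):
    """Edit distance normalized by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(recognized, ground_truth) / len(ground_truth)
```

Combining marks count as separate characters here, so vowel points and diacritics would each contribute to the distance. The pure-Python loop is quadratic in the text length, so for whole pages it may be better to compare line by line.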

June 12-14

  • Matlab trial is about to expire.
  • Fixed a few semicolons in syriacProxies.m file.
  • Learned how Matlab functions work and how to run them with tutorials [45] and [46].
  • Not sure how to run syriacProxies though; don't know what inputs/parameters to give it or what str, eqv, and cacc are in "function [txt,acc,eqv] = syriacProxies(str,eqv,cacc)".
  • Had to adjust file to get input in correct format.
    • Switched from multiple .txt to a single .txt file, but not 100% sure it worked, it is difficult to check.
  • Removed numbers and some characters from Tesseract vocabulary.
    • Learned about configs and blacklisting characters from [47]. Decided to blacklist "0123456789!@#$%^&*()~-_`{}[];:"'<>,.?/\|+=".
      • Some characters have special meanings, so I was unable to blacklist the characters "*$^;:"'[]<>,.?/\|+=". Did research but couldn't find relevant info for within the custom config blacklist. Need to figure this out.
    • Blacklisting the more basic characters "0123456789!@#%&()~-_`{}" was successful.
  • Tried to transliterate on Matlab.
    • Must input only the Syriac unicode you wish to run, i.e. syriacProxies([unicode Syriac string]).
    • Re-configured python Tesseract file so that it outputs the text in unicode with [48] by adding text.encode("utf-8").
      • Unsuccessful. I think because encode() works with python byte strings and unicodes, which is not the situation here.
      • Tried a different tactic using [49] with "text = unicode(utf8string, "utf-8")".
        • Error "NameError: name 'unicode' is not defined"
      • Then noticed that Matlab uses ASCII rather than utf-8, so changed code accordingly.
      • Looked at [50], [51], and tried "text=text.decode('ascii')"
        • Error "AttributeError: 'str' object has no attribute 'decode'"


  • How to blacklist the characters with special meanings?
  • How exactly to transfer from normal text string to unicode?
  • Why do we need to bother transferring to latin characters/using the Matlab function at all if we can just look at both in unicode?
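
On blacklisting the characters with special meanings: one possible explanation is that the config string gets split shell-style before it reaches Tesseract, so quotes, backslashes, and similar characters are consumed by the splitting. A sketch that quotes the whole option with shlex.quote and checks that it survives shlex.split. That pytesseract splits the config this way is an assumption on my part, and whether Tesseract itself accepts every character in the value is not verified here.

```python
import shlex


def blacklist_config(characters):
    """Build a '-c tessedit_char_blacklist=...' string that survives shell-style splitting."""
    # Quote the whole "name=value" pair so it stays one argument after splitting.
    return "-c " + shlex.quote("tessedit_char_blacklist=" + characters)


def split_like_a_shell(config):
    """Show the argument list a POSIX shell-style splitter would produce."""
    return shlex.split(config)
```

Usage would be custom_config = blacklist_config("0123456789!@#$%^&*()"), with split_like_a_shell(custom_config) as a quick check that the second argument still contains every character.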

Week 05 (June 15 - June 21)

June 15

  • Scrapped python unicode output. Instead, uploaded Out30Pages.txt to Matlab, and made new Matlab file "fileToUnicode.m"
    • Used [52] and [53] to translate file to unicode with "outInUnicode = double(str)"
    • Successfully ran syriacProxies by "syriacProxies(outInUnicode)".
  • Removed special characters.
    • Just used backslash, for some reason today it worked. Then added a bunch of special characters with "option" and "option+shift".
      • custom_config = r'-c tessedit_char_blacklist=0123456789!@#%&()~-_`{}\$\\;\<\>\^\*:\.,\'\"\?Ω\/≈ç√∫˜µ¬˚∆˙©ƒ∂ßåœ∑´®†¥¨ˆøπ¸˛Ç◊ı˜ÂÒÓÔ˝ÏÎÍÅŒ„´‰ˇÁ¨ˆØ∏¡™£¢∞§¶•ªº⁄€‹›fifl‡°·‚≤≥÷…æ“‘«≠–`¯˘¿ÚÆ”’»±—'
  • Set up Spacy.
  • Re-ran Tesseract transcription, fed it through Matlab unicode translator, then fed it through Matlab syriacProxies.
    • Error: Warning: Unknown symbols encountered / > In syriacProxies (line 95) / ans = " listing a lot of transliterated text.
    • Need to fix this.
  • Then manually uploaded output as text file to folder.
  • Fed TRANSCRIPT62001.txt through Matlab unicode translator, then fed it through Matlab syriacProxies and re-uploaded as new text file.
    • Same error when running syriacProxies as with Tess file.
  • Ran spacy comparison files.
    • Various syntax errors; adjusted.
    • Re-installed Spacy, updated pip. With resource [54], also ran on terminal "python -m spacy validate" and "python -m spacy download en_core_web_sm".
    • In python changed "nlp = spacy.load('en')" to "nlp = spacy.load("en_core_web_sm")".
    • Output was: " ModelsWarning: [W007] The model you're using has no word vectors loaded, so the result of the Doc.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available. / doc1.similarity(doc2)"
    • Installed en_core_web_md.
      • Got error "no module named spacy" so had to install again. Also realized I never printed output.
    • New output was: "0.9999999984162853"
  • Since this output is so high, I'm worried the documents are the same.


  • How to fix line error.
  • How to upload output as file manually in Matlab.
  • Will the line breaks etc. matter in Spacy comparisons, or just the actual letters?
  • Why do I keep having to re-install spacy?
  • Am I comparing the same documents?

June 16

  • The Out30Pages.txt looks quite different from the TRANSCRIPT62001.txt, yet the transliterations are identical, so I think there must be a problem in the Matlab unicode translation or the Latin character transliteration.
    • Can't access Matlab.

June 17

  • Examined code parsing the new XML data.


  • Where is original XML data located?

June 18

  • Continued trying to set up Tufts Matlab.
    • Faced obstacles. This is not feasible.
  • Began setting things up on Ford-04 to use its Matlab, under my account from Moodle. Began installing software, transferring files.


  • How to get onto nhowelab network/shared drive?

June 19-21

  • Installed Python 3.8 under the howelab user of Ford342-04 and changed the path using the same link as before, so it's called python3 and Python 2.7 is python.
  • Installed Anaconda (just on the howelab account).
  • Installed Sublime in "C:\Program Files\Sublime Text 3".
  • Installed Spacy with "conda install spacy" in Anaconda Prompt. Unable to do this in normal command window.
  • Created new Matlab file "fileToUnicode". Received error:
    • Undefined variable "Out30Pages" or class "Out30Pages.txt"
    • Error in fileToUnicode (line1) filename=input('enter file name')
    • Fixed these by for now not using input, just typing the file name directly into the Matlab file and clicking run rather than typing in command window.
  • Edited code in between runs of each file by changing outInUnicode to outInUnicode2. Then transcribed.
    • Document transcriptions look empty.
  • Again created text files for each, and ran the Spacy comparison.
    • Error "no module named Spacy"; fixed this by re-downloading Spacy in Anaconda Prompt with "conda install -c conda-forge spacy" and "python -m spacy download en_core_web_md". But still receiving same error.
      • Tried running in Anaconda Prompt; received error "compareTranscriptions is not recognized as an internal or external command, operable program or batch file".
      • Attempted to fix this with [55]; changed the path variable. Still receiving same error.
      • Simply quitting Command Prompt and reopening it worked. Re-tried previous commands installing spacy.
      • Received errors "An HTTP error occurred when trying to retrieve this URL".
    • Tried "pip3 install spacy".
      • This is successful, but program won't run still. Did "python3 -m spacy download en_core_web_md" and successful download. Still receiving same error though.
  • Even when unicode isn't run, transliteration is still blank.


  • How to get rid of spacy error.
  • What is the difference between typing pip and pip3?
  • Why is syriacProxies output blank?

Week 06 (June 22 - June 28)

June 22

  • Attempts to solve the blank transliteration output problem.
    • Tried converting transliteration output back to unicode with "translit=syriacProxies(outInUnicode4)" and then "uniTranslit=double(translit)", and there was indeed a complete unicode output. So there are characters.
    • But when tried to convert back to characters with "charTranslit=char(uniTranslit)", output is still blank.
    • Copying and pasting the "blank" output to another application such as notepad is unsuccessful.
    • Researched character encoding issues, nonprinting characters. Inconclusive.
    • Decided to just use the unicode translation, and translate it to a string in the python file before I compare with Spacy (on personal computer, not GPU).
      • For the TRANSCRIPT62001 file, it seems that the output in the command window is not the full file. Perhaps the output is too long to appear in command window. This could greatly affect results. Need to find a way to get the whole transliteration.
      • Using [56], edited python file to first translate to a normal string from unicode.
      • Tried .encode("ascii").
        • Got error "TypeError: Argument 'string' has incorrect type (expected str, got bytes)". Seems like .encode is not the right function to use.
      • Tried chr() function from [57].
        • Got error "TypeError: an integer is required (got type str)". So document is being read as string, thus can't convert to normal string because this function thinks it already is one.
      • Tried "tessData = unicode(tessDataUni, "ascii")"; got error "NameError: name 'unicode' is not defined".
        • Apparently (by [58]), Python 3 renamed "unicode" as "str"; the old "str" is now "bytes".
        • Replaced this in the python code, but error "TypeError: decoding str is not supported".
        • Replaced "str(tessDataUni, "ascii")" with "str(tessDataUni) + "ascii"" and same for truthData.
          • Got output "0.9884563720036482".
    • Changed from ascii to UTF-16 since that's what Matlab uses; got output of "0.9884554155399411".
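
A sketch of the string/number conversions attempted above, assuming the Matlab output is a list of numeric character codes (what double(str) prints) and that all characters are in the Basic Multilingual Plane, which holds for the Syriac block. The function names are made up.

```python
def to_code_points(text):
    """Python analogue of Matlab's double(str): one integer per character."""
    return [ord(ch) for ch in text]


def from_code_points(code_points):
    """Inverse of to_code_points; accepts integers or a whitespace-separated string."""
    if isinstance(code_points, str):
        # Numbers pasted from Matlab may look like "1810" or "1.8100e+03".
        code_points = [int(float(token)) for token in code_points.split()]
    return "".join(chr(cp) for cp in code_points)


def utf8_round_trip(text):
    """encode() turns str into bytes; decode() turns bytes back into str."""
    as_bytes = text.encode("utf-8")
    return as_bytes.decode("utf-8")
```

In Python 3 a str is already Unicode, so calling decode() on it or wrapping it in unicode() fails as seen above. Note also that an expression such as str(x) + "ascii" does not convert anything; it just appends the five letters "ascii" to the text, so similarity scores computed after that step still compare the unconverted strings.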

June 23

  • Decided to use a smaller manuscript sample and compare visually to see if Tesseract is even worth pursuing.
    • Located binarized manuscripts in HoweLab -> ImageDirectory.
    • Chose two manuscripts to use (one with multiple columns); reversed colors.
    • Learned how to do multiple columns in Tesseract [59] as psm 1. Second column will appear in text file after the first.
    • Made python file to transcribe them called "". Outputs are "OutText1.png" and "OutTest2.txt".
      • Just looking at it, does not seem very similar; perhaps Tesseract isn't worth pursuing.
  • Edited the Matlab syriacProxies file using information from [60] to make it write output to a file.
  • Explored the Deep Learning playground [61].

June 24-25

  • Decided to possibly continue Winnie's work on the new model with TensorFlow, rather than train Tesseract.
  • Reached out to Winnie, began reading her wiki page and trying to understand.

June 26-28

  • Met with Winnie.
  • Unable to access related paper [62].
  • Read the GitHub page for the new model (OCR with Gluon), [63].
    • Cloned git file.
    • Installed SCLITE for WER evaluation with "git clone".
      • Error "could not create work tree dir 'SCTK': Read-only file system". Changed directory to a specific folder (moved initial git clone to same location as well) and tried again, successfully. Finished installing the rest (received various warnings).
    • Installed hnswlib.
  • Watched [64] to learn more about the approach.
  • Worked on gaining familiarity with the system.
    • Got the pre-trained models with "python"; presumably, they're in English.
      • Error that python file doesn't exist. Fixed by navigating to folder "handwritten-text-recognition-for-apache-mxnet".
      • When running file, error "ModuleNotFoundError: No module named 'mxnet'".
      • With [65], installed mxnet with "conda activate myenv" then "pip3 install mxnet".
      • Still receiving same error. Re-installed mxnet in root directory; file now runs successfully.
    • Tried to get test IAM dataset by registering at [66].
      • Note that if we use this for scientific work, we are required to reference the paper "U. Marti and H. Bunke. The IAM-database: An English Sentence Database for Off-line Handwriting Recognition. Int'l Journal on Document Analysis and Recognition, Volume 5, pages 39 - 46, 2002."
      • Login takes me to this page [67]; however, I am supposed to create a credentials.json file by copying credentials.json.example and editing the username/password. Unable to open credentials.json.example.


  • Should I get pre-trained models? Do I then train those models on our Syriac data, or would it be best to just make a new model for Syriac?
  • When exactly should I activate conda environment(s)? How many...?
  • How to open credentials.json.example file?
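
On the credentials question: a .json file is plain text, so it can also be read and rewritten from Python without opening it in an editor. A sketch, assuming the example file is a JSON object with "username" and "password" keys (I have not confirmed the key names); the function name is hypothetical.

```python
import json
from pathlib import Path


def write_credentials(example_path, target_path, username, password):
    """Copy the example JSON, overwrite the login fields, and save it under a new name."""
    example = json.loads(Path(example_path).read_text(encoding="utf-8"))
    example["username"] = username  # key names are assumed, not confirmed
    example["password"] = password
    Path(target_path).write_text(json.dumps(example, indent=2), encoding="utf-8")
    return example
```

Usage would be write_credentials("credentials.json.example", "credentials.json", "myname", "mypassword"), run from the folder that contains the example file.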

Week 07 (June 29 - July 5)

June 29

  • Created credentials.json: opened credentials.json.example by choosing "Other..." and TextEdit, changed the username and password, and renamed the file.
  • Learned about .ipynb files; decided to open them with Jupyter rather than Sublime. Command "python 1_a_paragraph_segmentation_msers.ipynb" worked, but doesn't seem to do anything.
  • Tried to open Jupyter through Sage, but unable to do so. Log says "OSError: [Errno 1] Operation not permitted".
  • Re-installed Jupyter with "conda install -c conda-forge notebook".
  • Read about the Jupyter basics with [68].
  • Tried to run "1_a_paragraph_segmentation_msers.ipynb" within the Jupyter Notebooks interface (activate by navigating to correct folder and doing "jupyter notebook" in terminal); error "No module found named Mxnet". Tried re-installing Mxnet in "handwritten-text-recognition-for-apache-mxnet" folder.
    • Still receiving same error.

June 30

  • Fixed mxnet error in 1_a_paragraph_segmentation_msers.ipynb.
    • Tried re-installing notebooks with "conda install notebook" after doing "conda activate myenv"
    • Affirmed that I had ipython installed.
    • Tried to install mxnet in /ocr/utils but requirements were already satisfied.
    • Used [69] to re-install mxnet with directions given.
      • First section of 1_a_paragraph_segmentation_msers.ipynb now runs.
  • Installed gluon package [70] with "pip install --upgrade mxnet gluonnlp" in root directory.
  • Next error is "NameError: name 'IAMDataset' is not defined".
    • When code is run again, this disappears.
  • Next error is a long string of code ending in "HTTPError: HTTP Error 401: Unauthorized".
  • Decided to try to run 0_handwriting_ocr.ipynb first.
    • Error "ImportError: cannot import name 'replace_file' from 'mxnet.gluon.utils' (/Applications/anaconda3/lib/python3.7/site-packages/mxnet/gluon/".
      • Do not see a "replace_file" in that location.
      • Tried "pip install --upgrade jupyter_client" as suggested by [71]. Error still appears.
      • Tried "pip install --pre --upgrade mxnet" from [72]. Solved.
    • Next error is "ModuleNotFoundError: No module named 'leven'".
      • Installed leven with "pip install leven" using [73]. Solved.
    • Next error is (in next box) "HTTPError: HTTP Error 401: Unauthorized".
      • Unable to find any relevant sources on this.
  • Input looks to be in IAM/image format. (Which?)
    • Apparently (according to the YouTube presentation [74]), there is already an available script (get item method in particular) that parses an XML file and gets the image and associated output from it.
  • Read more about what Mxnet and Gluon even are from [75].


  • Can you type Terminal commands when Jupyter is running? It looks like no.
  • Still don't know when to use pip vs. pip3.
  • Am I just supposed to be looking at the handwriting recognition part of the model, not the page segmentation or line segmentation parts? Then should I skip right to "3_handwriting_recognition.ipynb"? Are we inserting our previous model for the first two sections, and if so, how to combine them?
  • Can we just go straight to the handwriting recognition python script available [76] and [77]?

July 1

  • Verified that username and password were correct.
  • Decided to skip to the handwriting recognition part of the program to circumvent the problem with the IAM dataset; just look at "3_handwriting_recognition.ipynb", "4_text_denoising", "5_a_character_error_distance", and "5_b_visual_distance.ipynb".
  • Started with "3_handwriting_recognition.ipynb".
    • Error "ModuleNotFoundError: No module named 'mxboard'".
      • Installed mxboard with "pip install mxboard" from [78]. Solved.
    • Error in cell In 4: "NameError: name 'mx' is not defined". When run again, this disappeared.
    • Error in In 17, the same HTTP error.
      • Attempts to edit code to replace IAM dataset with Beth Mardutho data.
        • Tried to connect to Smith VPN to access nhowelab, but error message "Connection Error / Account has expired. (Error:1328) / Contact your network administrator".
          • Attempted to set up the Smith VPN again, but received same error message. Emailed someone from Smith.
        • For now, decided to use Genesis data.
        • Manually cropped first five lines of "BinaryManuscriptsOr. 5021-006.png" into line segments; put it in folder "syriactestdataset" in "dataset". Switched out IAMDataset for syriactestdataset.
        • Error "TypeError: 'module' object is not callable". I know the data is not in the same format as IAMDataset was.
        • Created program "" to make an IAM dataset analog modeled after [79].
          • For now, it does not have Syriac data and translations because unable to access that data.
          • Not sure how to call dataset in Jupyter program. Tried replacing IAMDataset with new folder SYRIACDataset but still receiving module object is not callable error.
  • Attempts to locate IAMDataset.
    • Tried running; received error "ImportError: attempted relative import with no known parent package".


  • How to locate IAMDataset to mimic it.
  • How to get back on Smith server.
  • Concerned that this will lead to more and more problems down the line.
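
A minimal sketch of the relative-import error above, assuming it comes from running a file that contains "from .something import ..." directly rather than as part of its package. The package and file names (demo_pkg, helper, loader) are hypothetical and are created in a temporary folder only for this illustration.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical layout, built in a temp dir purely for illustration:
#   demo_pkg/__init__.py
#   demo_pkg/helper.py   defines expand()
#   demo_pkg/loader.py   does "from .helper import expand"
root = tempfile.mkdtemp()
pkg = os.path.join(root, "demo_pkg")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "helper.py"), "w") as f:
    f.write("def expand():\n    return 'ok'\n")
with open(os.path.join(pkg, "loader.py"), "w") as f:
    f.write("from .helper import expand\nprint(expand())\n")

# Case 1: run loader.py directly as a script. It has no parent package,
# so the relative import cannot be resolved.
direct = subprocess.run(
    [sys.executable, os.path.join(pkg, "loader.py")],
    capture_output=True, text=True,
)

# Case 2: run the same file as a module of the package, starting from
# the folder that contains demo_pkg.
as_module = subprocess.run(
    [sys.executable, "-m", "demo_pkg.loader"],
    cwd=root, capture_output=True, text=True,
)

print("direct run failed:", direct.returncode != 0)
print("module run output:", as_module.stdout.strip())
```

If this is what is happening here, the two usual ways out are to import the file through its package, or to drop the leading "." and make sure the folder holding the helper file is on the import path.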

July 2

  • VPN is reactivated.
  • Attempts to use IAM.
    • Tried running test_iam_dataset.ipynb
      • Error "ImportError: attempted relative import with no known parent package".
        • This error is within the file, so looked into the problematic line "from .expand_bounding_box import expand_bounding_box" of that file. Thought this might depend on the previous scripts, so decided to go back to "0_handwriting_ocr".
      • Noticed and corrected typo error in credentials.json; ran "0_handwriting_ocr".
        • Error "ModuleNotFoundError: No module named 'sacremoses'". Installed sacremoses from [80] with "pip install sacremoses".
        • Took a couple of hours to run.
          • Error "OSError: [Errno 28] No space left on device".
  • Switched over to Mac computer frd241-c011971-07 via Splashtop.
    • Set up everything on that computer just like previously. Jupyter and pip are also not installed.
      • It should already have pip because has python 3.7.3, but apparently does not. Unable to update python because no administrator permission. So got pip with "curl -o" and then "python".
      • Error downloading hnswlib:
        • The following error occurred while trying to add or remove files in the

installation directory:

   [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/test-easy-install-17716.pth'

The installation directory you specified (via --install-dir, --prefix, or the distutils default setting) was:


Perhaps your account does not have write access to this directory? If the installation directory is a system-owned directory, you may need to sign in as the administrator or "root" account. If you do not have administrative access to this machine, you may wish to choose a different installation directory, preferably one that is listed in your PYTHONPATH environment variable.

For information on other options, you may wish to consult the documentation at:

Please make the appropriate changes for your system and try again.

frd241-c011971-07:python_bindings howelab$ python3 install
running install
error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the installation directory:

   [Errno 13] Permission denied: '/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/test-easy-install-17719.write-test'

The installation directory you specified (via --install-dir, --prefix, or the distutils default setting) was:


Perhaps your account does not have write access to this directory? If the installation directory is a system-owned directory, you may need to sign in as the administrator or "root" account. If you do not have administrative access to this machine, you may wish to choose a different installation directory, preferably one that is listed in your PYTHONPATH environment variable.

For information on other options, you may wish to consult the documentation at:

Please make the appropriate changes for your system and try again.

frd241-c011971-07:python_bindings howelab$ python install
running install
Checking .pth file support in /Library/Python/2.7/site-packages/
error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the installation directory:

   [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/test-easy-install-17722.pth'

The installation directory you specified (via --install-dir, --prefix, or the distutils default setting) was:


Perhaps your account does not have write access to this directory? If the installation directory is a system-owned directory, you may need to sign in as the administrator or "root" account. If you do not have administrative access to this machine, you may wish to choose a different installation directory, preferably one that is listed in your PYTHONPATH environment variable.

For information on other options, you may wish to consult the documentation at:

Please make the appropriate changes for your system and try again.

frd241-c011971-07:python_bindings howelab$ ls
hnswlib tests Makefile requirements.txt bindings.cpp
frd241-c011971-07:python_bindings howelab$ cd hnswlib
frd241-c011971-07:hnswlib howelab$ cd ..
frd241-c011971-07:python_bindings howelab$ ls
hnswlib tests Makefile requirements.txt bindings.cpp
frd241-c011971-07:python_bindings howelab$ cd ..
frd241-c011971-07:hnswlib howelab$ ls
python_bindings CMakeLists.txt examples sift_1b.cpp LICENSE hnswlib sift_test.cpp main.cpp
frd241-c011971-07:hnswlib howelab$ cd python_bindings
frd241-c011971-07:python_bindings howelab$ python install
running install
Checking .pth file support in /Library/Python/2.7/site-packages/
error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the installation directory:

   [Errno 13] Permission denied: '/Library/Python/2.7/site-packages/test-easy-install-17731.pth'

The installation directory you specified (via --install-dir, --prefix, or the distutils default setting) was:


Perhaps your account does not have write access to this directory? If the installation directory is a system-owned directory, you may need to sign in as the administrator or "root" account. If you do not have administrative access to this machine, you may wish to choose a different installation directory, preferably one that is listed in your PYTHONPATH environment variable.

For information on other options, you may wish to consult the documentation at:

Please make the appropriate changes for your system and try again.

      • Moved forward for now. Installed Jupyter with "pip3 install notebook"; no conda installed. Lots of errors about permission, but moved on for now.
      • When running, error that we didn't have mxnet. Installed mxnet with "pip3 install mxnet" and "pip3 install mxnet-mkl". Same permission errors, said could not install packages.
        • Still receiving mxnet error.
  • Moved everything to the shared drive.


  • Why does "pip3" command work on GPU, and not "pip"?
  • The administrator thing... is there a different computer I should be using?
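
One possible explanation for the pip/pip3 difference, offered as an assumption rather than a diagnosis: the permission errors above mention both a Python 2.7 and a Python 3.7 site-packages folder, so the two commands may be tied to different interpreters. The sketch below only reports which interpreter is running and where per-user packages would go. Running "python3 -m pip install --user <package>" is a common way to install into that per-user folder without administrator rights; whether that works on the lab machines is untested here.

```python
import site
import sys

# Path of the interpreter that is running this code.
interpreter = sys.executable
# Major.minor version of that interpreter, e.g. "3.7".
version = "%d.%d" % sys.version_info[:2]
# Per-user site-packages folder, the target of "pip install --user".
user_site = site.getusersitepackages()

print("interpreter:", interpreter)
print("version:", version)
print("user site-packages:", user_site)
```

Running this once with "python" and once with "python3" shows whether the two commands point at different installations.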

July 3-5

  • Attempts to set up a new GPU.
    • Tried a Ford 243 computer, but similar to 241, Howelab does not have administrative permission.
    • For now, proceeded by setting things up on my own computer, but there is just not nearly enough space.
  • Examined GitHub Issues page [81] for the gluon program to try to skip the IAM implementation and go back to trying to use a Syriac dataset.
  • Decided to just mimic the data format and change the code.
    • Decided to just use Estrangelo data for now as I figure out how to prepare it.
    • Replaced with an edited Took iamdataset folder out of dataset folder. Replaced it with newSyriacDataset. Replaced IAMDataset with NEWSYRIACDataset.
      • newSyriacDataset has the five Estrangelo data folders, re-organized into the way the IAM was organized.
      • Edited the to change the part where it imported the data from and went through the folders.
        • Need to edit __init__ function and/or change newSyriacDataset organization, but unsure exactly how to change urls leading to internet sites into file imports and what the code is doing.

Week 08 (July 6-12)

July 6

  • Work on
    • Deleted credentials parameter.
    • Deleted a lot of the downloading things because I can just put the file in the same directory.
    • The whole goal of this python file is to make a local data folder of iam. Since the syriac data will already be downloaded, all this code needs to do is process it and then put it in the newSyriacDataset folder. Or I could just not deal with the code at all and just manually format the data.
      • Changed name of newSyriacDataset folder to preprocessedSyriacDataset, and put it in same directory as
    • Went through the code looking into what each section did and summarizing it.
    • Not sure how to begin editing code because of so much mismatch between inputs.


  • What is parse method and is it needed?
    • Parse method is either forms, forms_bounding box, lines, or words. Not sure why this matters. Maybe this actually refers to the input of the data since the IAM input data to be processed is divided into forms, lines, sentences, and words. Do I need to divide the Syriac data (currently just in forms) into sentences, words, and lines, or is it okay just to have forms?

July 7

  • The frd241-c011971-07 Mac now has editing privileges, so began setting it up.
    • Mostly successful setup from GitHub page (combination of pip3 and pip according to various errors), but still receiving same permission error with "python install". However, when ran this again, it was successful.
      • pip3 install notebook doesn't work, so installed conda.
        • "conda" command still is unrecognized, but was able to access Jupyter notebook through Anaconda Navigator.
    • Made Howe Lab account on the FKI IAM website with same info as on the Ford computers.
    • Tried to run ""; error "No module named mxnet".
      • Installed mxnet with "pip install mxnet" in OCRWithGluon folder. Solved.
  • Ran 0_handwriting_ocr in Jupyter.
    • Error "no module named 'cv2'".
      • Did "pip install opencv-python". Solved.
    • Error "no module named 'gluonnlp'".
      • Did "pip install --upgrade mxnet gluonnlp". Solved.
    • Installed leven, sacremoses, mxboard.
    • "iamdataset" in "dataset" folder downloading.
      • Stopped partway through, error "IOPub message rate exceeded. The notebook server will temporarily stop sending output to the client in order to avoid crashing it. To change this limit, set the config variable `--NotebookApp.iopub_msg_rate_limit`. Current values: NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec), NotebookApp.rate_limit_window=3.0 (secs)".
    • Updated Jupyter and ran again. Have to delete everything from iam dataset folder in order to do so.
      • Still stops partway through--the browser just whites out, and terminal message "File "/Users/howelab/opt/anaconda3/lib/python3.7/site-packages/tornado/", line 1106, in wrapper / raise WebSocketClosedError() / tornado.websocket.WebSocketClosedError".
      • Followed instructions [84] to change iopub data rate limit. Unsuccessful because cannot locate file.
      • By [85], tried just opening Jupyter with "jupyter notebook --NotebookApp.iopub_data_rate_limit=1e10".

July 8

  • Decided to download IAM database by hand and edit to locate the dataset rather than download it, and then do the processing.
    • Some of it is already on Howe Lab, but not all that's needed for the model, so decided to download it afresh.
    • Looked through the code and decided to download formsA-D.tgz, formsE-H.tgz, formsI-Z.tgz, lines.tgz, words.tgz, and xml.tgz.
    • Seems like the data in "iamdataset" folder is both unprocessed data as well as code outputs.
    • Deleted formsA-D etc. folders, instead combining them into one "forms" folder as seen in the partially-downloaded iamdataset.
    • Deleted parts of code where data was downloaded and extracted.
    • Attempted to run test_iam_dataset.ipynb.
      • Error: File "/Users/howelab/opt/anaconda3/lib/python3.7/site-packages/IPython/core/", line 3331, in run_code
   exec(code_obj, self.user_global_ns, self.user_ns)
 File "<ipython-input-1-4d01ae31b9b6>", line 8, in <module>
   from iam_dataset import IAMDataset
 File "/Users/howelab/OCRWithGluon/handwritten-text-recognition-for-apache-mxnet/ocr/utils/", line 301
   def _pre_process_image(self, img_in):

IndentationError: unexpected indent

        • Corrected a few indents in; solved.
      • Error: "ImportError: attempted relative import with no known parent package" on lines "from iam_dataset import IAMDataset" and "from .expand_bounding_box import expand_bounding_box". This is the same error as pre-manual-download.
        • Located line "from expand_bounding_box import expand_bounding_box" in; deleted the period. Solved.
      • Error: "AttributeError: 'IAMDataset' object has no attribute '_download_subject_list'".
        • Deleted a reference of a _download_subjects_list() function I had erased. Error persists. Restarted jupyter; solved.
      • Error: "TypeError: object of type 'NoneType' has no len()".


  • What is a subject list? Do I need to download and extract it or have I already done that?
  • What might len() do?
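
A small illustration of the last error above: len() returns the number of items in an object, and calling it on None raises exactly this TypeError. The file names and the safe_len helper are hypothetical, used only to show the behavior.

```python
def safe_len(obj):
    """Return len(obj), or None when obj has no length (for example, obj is None)."""
    try:
        return len(obj)
    except TypeError:
        return None


items = ["a01-000u.png", "a01-003.png"]  # hypothetical file names
missing = None                           # stands in for data that never loaded

print(len(items))          # number of items in the list
print(safe_len(missing))   # None, instead of raising

try:
    len(missing)
except TypeError as err:
    message = str(err)
    print(message)
```

So the error most likely means that something the dataset code expected to be a list or table was never filled in, not that len() itself is at fault.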

July 10-12

  • More attempts to run test_iam_dataset.
    • Apparently (according to [86]), the len() function returns the number of items in an object.
      • I guess the object "args[0]" returns 'none'. Maybe it's a problem with when ArrayDataset was defined.
      • Deleted "credentials" parameter.
      • Printed out args[0] and it was "IAMDataset". So perhaps the issue is that IAMDataset is empty since it's not properly opening up the files downloaded to the computer.
    • Working on editing _get_data from to properly be able to access the data.
      • Un-commented out the statements assigning values to variables like image_data that followed the data downloads. Lots of errors. Un-commented out more lines.
    • New error: "AttributeError: 'IAMDataset' object has no attribute '_download_subject_list'"
      • Un-commented _download_subject_list function.
    • Error: "AttributeError: 'IAMDataset' object has no attribute '_download'".
      • Uncommented out this function.
    • Error: "IndexError: single positional indexer is out-of-bounds". At this point, I think it's just trying to download the data as usual.


  • Should I be downloading the subject list manually or through the code? What would be the analog for the Syriac data?
  • I commented out the lines, but I don't know how to have the program recognize where the new data is located. I hoped this would happen automatically, but is there a better way?

Week 09 (July 28-August 2)

July 28-August 1

  • Made new Jupyter python 3 notebook with split up in many cells to see where the problems were.
    • First error: "ModuleNotFoundError: No module named 'expand_bounding_box'"
      • Added a "." to form ".expand_bounding_box".
    • Error: "ImportError: attempted relative import with no known parent package".
      • Tried moving the new Jupyter file to the same location as was.
        • Splashtop is lagging very heavily and unable to perform basic actions. So tried updating Splashtop. Error persists.
        • Deleted the "." and error is gone.
      • Next error under IAMDataset class: "NameError: name '__file__' is not defined".
        • Using [87], added to code but ended up with error: "IndentationError: unindent does not match any outer indentation level". Tried changing indentations but error persists.
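
On the "__file__ is not defined" error: __file__ is set when Python runs a .py file, but not inside a Jupyter notebook cell. A minimal sketch of one workaround, assuming it is acceptable to fall back to the current working directory; the helper name module_dir and the "dataset"/"iamdataset" folder names are assumptions for illustration.

```python
import os


def module_dir(namespace):
    """Folder of the running file if __file__ is set, otherwise the current working directory."""
    file_path = namespace.get("__file__")
    if file_path is None:
        # Typical case inside a notebook cell.
        return os.getcwd()
    return os.path.dirname(os.path.abspath(file_path))


here = module_dir(globals())
# Build the data path from that folder instead of relying on __file__ directly.
data_dir = os.path.join(here, "dataset", "iamdataset")
print(data_dir)
```

The "unindent does not match" error that followed usually points to mixed tabs and spaces in the pasted lines, though that is a guess for this case.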

August 2

  • Unable to connect to computer with Splashtop.
    • Tried "waking up" computer, with no success. Apparently the machines were unplugged.

Week 10 (August 3-9)

August 3

  • Still unable to use computer, in the process of granting software installation privileges on another computer.

August 4

  • Starting over with a new computer, "frd243-c010676-07".
    • Following same setup process as before. Repeated problems with Splashtop connections and installation privileges.
    • Power outage for most of the day.

August 5

  • Computer is still offline, probably from power outage.
  • Able to connect to the original computer, frd241-c011971-07, for some reason.
  • Changed line breaks and indents which were off for some reason.
  • Inserted a file reading in the _get_data function to replace the data downloads. Used [88] to read all contents of iamdataset folder.
    • Wrote: directory = r'/Users/howelab/ocrwithgluon/handwritten-text-recognition-for-apache-mxnet/dataset/iamdataset/'
       for someFileName in os.listdir(directory):
           if someFileName.endswith(".png") or someFileName.endswith(".zip") or someFielName.endswith(".plk") or someFileName.endswith(".tar"):
               f = open(someFileName, "r")
      • Not sure if this method goes through all the layers of the directory though. Also not sure if the .zip, .plk, and .tar files need to be "read".
  • Errors in fourth section of Jupyter code (beginning of IAMDataset class):
    • "NameError: name 'filename' is not defined"
      • Fix: Switched "filename" to "__file__".
    • Various syntax errors fixed by correcting typos.
    • "NameError: name '__file__' is not defined"
      • Tried (using [89]): changing line from find_data_file_ from "datadir = os.path.dirname(__file__)" to "datadir = os.path.dirname('/Users/howelab/OCRWithGluon/handwritten-text-recognition-for-apache-mxnet/ocr/utils/dataset/iamdataset/')"
        • Not sure if I should just use /Users/howelab/OCRWithGluon/handwritten-text-recognition-for-apache-mxnet/ or say /home/howelab/OCRWithGluon/handwritten-text-recognition-for-apache-mxnet/ocr/utils/dataset/iamdataset/. Unsuccessful.
      • Tried declaring variable __file__ and setting it equal to that path. Unsuccessful. Probably because find_data_file is not run before __init__.
      • Tried declaring __file__ as global variable, in first block of code. Successful.
  • Fifth block of code errors:
    • Various indentation errors.
  • Sixth (final) block of code errors:
    • Various indentation errors.
      • So at least the version in Jupyter runs smoothly. Copied and pasted the Jupyter version "Untitled.ipynb" into the actual file.
  • Tested with python in terminal, and no errors appeared.
  • Tried to run. Errors begin in ln[3].
    • Error: "FileExistsError: [Errno 17] File exists: '/Users/howelab/OCRWithGluon/handwritten-text-recognition-for-apache-mxnet/ocr/utils/dataset/iamdataset/../../dataset/iamdataset'".
      • Changed the __file__ to being equal to "/home/howelab/OCRWithGluon/handwritten-text-recognition-for-apache-mxnet/ocr/utils/".
    • Various typos.
    • Error: "FileNotFoundError: [Errno 2] No such file or directory: 'words.tar'".
      • Changed the __file__ to being equal to "/home/howelab/OCRWithGluon/handwritten-text-recognition-for-apache-mxnet/". Unsuccessful.
      • Changed the file reading block of code from before to
       for someFileName in os.listdir(directory):
           if someFileName.endswith(".png"):
               f = open(someFileName, "r")
    • Error: "AttributeError: 'IAMDataset' object has no attribute '_credentials'".
      • Commented out all of _download function except filename variable definition.
    • Error: "FileNotFoundError: [Errno 2] No such file or directory: '/Users/howelab/OCRWithGluon/handwritten-text-recognition-for-apache-mxnet/../../dataset/iamdataset/'"
      • Added back into file reading block of code the "or someFileName.endswith(".zip")" but now got error "FileNotFoundError: [Errno 2] No such file or directory: ''".
      • This file is in fact in the iamdataset folder...
  • Backed up to shared drive, although issues remain.
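
On the question of whether the file-reading loop goes through all the layers of the directory: os.listdir only lists the top level and returns bare names, not full paths. A sketch of a recursive alternative with os.walk, run on a throwaway folder; the function name find_files and the demo file names are hypothetical.

```python
import os
import tempfile


def find_files(directory, extensions=(".png",)):
    """Return full paths of all files under directory, at any depth, with the given extensions."""
    matches = []
    for current_dir, _subdirs, file_names in os.walk(directory):
        for name in file_names:
            if name.endswith(extensions):
                # Join with the folder being visited so the path can be opened later.
                matches.append(os.path.join(current_dir, name))
    return sorted(matches)


# Demo on a temporary tree with one nested image and one unrelated file.
demo_root = tempfile.mkdtemp()
nested = os.path.join(demo_root, "lines", "a01")
os.makedirs(nested)
for name in ("a01-000u-00.png", "notes.txt"):
    with open(os.path.join(nested, name), "wb") as handle:
        handle.write(b"")

found = find_files(demo_root)
print(found)
```

Because open() on a bare name looks in the current working directory, this may also be why errors such as "No such file or directory: 'words.tar'" appear even though the file is in the iamdataset folder.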

August 6

  • More work troubleshooting with test_iam_dataset.
    • Moved all .zip, .plk, and .tar files out of "iamdataset" folder in attempt to fix error "FileNotFoundError: [Errno 2] No such file or directory: '/Users/howelab/OCRWithGluon/handwritten-text-recognition-for-apache-mxnet/../../dataset/iamdataset/'". Error persists.
      • Replaced contents of _download_subject_list function in with:
       theDirectory = r'/Users/howelab/ocrwithgluon/handwritten-text-recognition-for-apache-mxnet/dataset/iamdataset/largeWriterIndependentTextLineRecognitionTask/'
       for someFileName in os.listdir(theDirectory):
           f = open(someFileName, "r")
        • Error: "FileNotFoundError: [Errno 2] No such file or directory: '/Users/howelab/ocrwithgluon/handwritten-text-recognition-for-apache-mxnet/dataset/iamdataset/largeWriterIndependentTextLineRecognitionTask/'".
        • Changed code to "theDirectory = r'/Users/howelab/ocrwithgluon/handwritten-text-recognition-for-apache-mxnet/dataset/iamdataset/subject/'".
    • Error: "FileNotFoundError: [Errno 2] No such file or directory: 'validationset1.txt'".

August 7-9

  • Looked into error from last time with little success because validationset1.txt is indeed in the subject folder.

Week 11 (August 10-16)

August 10

  • More troubleshooting to run test_iam_dataset.
    • Tried deleting the "r" in the line "theDirectory = '/Users/howelab/ocrwithgluon/handwritten-text-recognition-for-apache-mxnet/dataset/iamdataset/largeWriterIndependentTextLineRecognitionTask/'". Unsuccessful.
    • Tried writing out full file path in the open command. Error that the directory didn't exist.
    • Rewrote "ocrwithgluon" as "OCRWithGluon", deleted "/" after final file name. Error: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 23: invalid start byte".
    • Added back the final "/". Error that the directory didn't exist.
    • Added line "os.chdir("/users/howelab/")" before the other block of code. Error persists.
    • Substituted "/Users/howelab/" with "/home/". Error persists.
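
The log does not say which file triggered the UnicodeDecodeError, but a common cause is opening a binary file (an image or an archive, for example) in text mode "r". A sketch that reproduces the error on a small made-up file and shows that binary mode "rb" reads it without decoding; the file name sample.bin is hypothetical.

```python
import os
import tempfile

# Write a few bytes that are not valid UTF-8 text (this is the PNG file signature).
path = os.path.join(tempfile.mkdtemp(), "sample.bin")
with open(path, "wb") as f:
    f.write(b"\x89PNG\r\n\x1a\n")

# Text mode tries to decode the bytes as UTF-8 and fails.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
    text_mode_ok = True
except UnicodeDecodeError:
    text_mode_ok = False

# Binary mode returns the raw bytes unchanged.
with open(path, "rb") as f:
    header = f.read()

print("text mode worked:", text_mode_ok)
print("first bytes:", header[:4])
```

Reaching a decode error at all suggests that the "OCRWithGluon" spelling without the final "/" was the version of the path that actually pointed at a real file.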

August 11

  • Unable to connect to Splashtop.

August 12

  • Splashtop computer is back.
  • Continuing trying to fix same error as yesterday.
    • Tried printing someFileName, no output, so moved it to Jupyter notebook. Error in "'filename' not defined". Updated with latest version, and error "'__file__' not defined"
      • Fix: updated the rest of the code. Runs successfully but no output.
      • Edited original to make it write someFileName to a text file:
       listString = ""
       directory = r'/Users/howelab/ocrwithgluon/handwritten-text-recognition-for-apache-mxnet/dataset/iamdataset/'
       for someFileName in os.listdir(directory):
           listString.append(someFileName + ", ")
           if someFileName.endswith(".png"):
               f = open(someFileName, "r")
       file = open('outputTest.txt', 'w')
      • Still no output.
      • Still suspect the problem lies in this section of the code though, and that the loop is infinite.
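
Two things in the snippet above could explain the missing output without an infinite loop: Python strings have no append method, and the output file is opened but nothing is written to it. A for loop over os.listdir always ends, since it walks a fixed list of names. A corrected sketch, assuming the goal is just to record the names; write_listing and the demo file names are hypothetical.

```python
import os
import tempfile


def write_listing(directory, output_path):
    """Write the names found in directory to output_path, comma-separated, and return them."""
    names = sorted(os.listdir(directory))
    # "with" closes the file on exit, which also flushes the text to disk.
    with open(output_path, "w") as out:
        out.write(", ".join(names))
    return names


# Demo on a temporary folder with two empty files.
demo_dir = tempfile.mkdtemp()
for name in ("b.png", "a.png"):
    open(os.path.join(demo_dir, name), "w").close()

output_file = os.path.join(tempfile.mkdtemp(), "outputTest.txt")
listed = write_listing(demo_dir, output_file)

with open(output_file) as check:
    written = check.read()
print(written)
print(hasattr(str, "append"))
```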

August 13

  • More troubleshooting to get test_iam_dataset to run.
    • Corrected the loop in _get_data using [90].
      • FileNotFoundError: "[Errno 2] No such file or directory: 'e04-127-04-05.png'"
        • This is confusing because the file e04-127-04-05.png is indeed located within a subdirectory of iamdataset, and also because I don’t get why it would say the file doesn’t exist since the file must have been iterated over in order for the name to get mentioned in the error message.
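
A possible explanation, stated as an assumption: the loop found the file inside a subdirectory, but open() was given only its bare name, which Python looks up in the current working directory. The sketch below reproduces that with pathlib on a temporary folder; the file name e04-demo.png is made up.

```python
import tempfile
from pathlib import Path

# Temporary tree with one file two levels down.
root = Path(tempfile.mkdtemp())
sub = root / "lines" / "e04"
sub.mkdir(parents=True)
(sub / "e04-demo.png").write_bytes(b"not a real image")

# rglob searches every level below root and yields full paths.
png_path = next(root.rglob("*.png"))

# Opening by bare name fails: the name is known, but the folder is not.
try:
    open(png_path.name, "rb").close()
    bare_name_found = True
except FileNotFoundError:
    bare_name_found = False

# Opening by full path works.
with open(png_path, "rb") as handle:
    payload = handle.read()

print("bare name found:", bare_name_found)
print("full path:", png_path)
```

Printing the full path next to the bare name inside the loop would confirm whether this is what is happening.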

August 14-16

  • More troubleshooting of test_iam_dataset.
    • Tried changing "if filepath.endswith(".png"):" to "if filename.endswith(".png"):". Error persists.
  • Ford computer "frd241-c011971-07" no longer shows up as an option to connect to in Splashtop.
  • Plan to implement second solution from [91] to see the values of "filepath" and "filename" and thus continue troubleshooting error.

To Do Next

  1. Back up files in OCRWithGluon folder from Mac computer frd241-c011971-07 to the Howe Lab drive.
  2. Troubleshoot errors in test_iam_dataset.ipynb by adjusting code in to get Mac to successfully use IAM dataset files already on computer rather than downloading from a URL.
  3. Replace IAM dataset files with Estrangelo Beth Mardutho Syriac training dataset, and either change accordingly to accommodate this new data format/organization, or put Syriac dataset in same format/organization as IAM dataset.
  4. Run Apache Mxnet OCR With Gluon program on that Syriac dataset and measure accuracy; compare to accuracy rates of other models.
  5. Sort out details of the new model through making sure settings are optimized for multiple columns in training data, etc.
  6. Add Serto and East Syriac scripts to model.