Parzival Spring 2018


By: Cathy Lee, Fanghui He, Ha Cao, Qiaqia Ji

Project goal

To create ground truth images for each of the Parzival pages. That is, we will give each pixel of the page image a label according to what it belongs to: a particular character, background, or noise.

Progress made by last year's team

We looked at the wiki documentation and code written by the previous team in Spring 2017. They wrote code that allows the user to simultaneously open a line image file and click on a matching line in the page image file, and subsequently record the coordinates of the line in the page image file. The line coordinates of each page image are saved in separate .mat files in the Parzival > Files directory. There were missing line images from some pages that were not recorded; the pages with missing lines are listed under our meeting documentation for February 18.

Parzival file hierarchy

Files and directories located in H:/Parzival/.

  • Files/
    • Contains datasets of the line coordinates in most pages (some are missing)
  • parzivaldb-v1.0/
    • data/
      • line_images_normalized/
        • Normalized line images
      • page_images/
        • binarizeImage.m (written by us, discussed in the meeting notes for March 21)
        • Images of each page of Parzival (with some pages missing)
      • raw_word_images/
        • Test output of autocut.m
      • word_images_normalized
        • Normalized word images
    • ground_truth
      • transcription.txt includes the letters in the normalized line images
      • word_letters includes any word in the normalized word images
    • Text files in folders ground_truth/, sets1/, and sets2/
    • Line Coordinates/ folder contains the same coordinates as Files/, but has more pages missing
  • cut.m, cutLine_copy.m (written by us)
  • roughAlignBitmaps.m, projectToPage.m, processParzival.m, discussed in the meeting notes for March 28

February 18, 2018

The first problem that we encountered was that we were unable to open .mat files (initially listed as Microsoft Access Table Shortcut files). We quickly found that this was because the default program for opening .mat files was incorrectly listed as Microsoft Access. We changed the default software to MATLAB instead, and were able to view the MATLAB data in the .mat files.

We went through the coordinate files created by the Spring 2017 team and found the pages with missing lines (this was determined by missing y-coordinates in the MATLAB data).

  • Pages with missing line coordinates: 6a (missing first 8 lines), 10b (only 7 lines), 39b (only 11 lines), 44a (only 14 lines), 124b (missing first 10 lines)
  • Pages missing entirely: 7b, 33a, 44b, 124a, 127a (for lines 1-45, the line images are normalised, but the line coordinates are missing), 127b, 143a & b, 144a & b, 262a & b ~ 268a & b (Note: The line coordinates for 269a & b ~ 271a & b, 270a are present in Parzival > Files, but not in Parzival > parzivaldb-v1.0 > Line Coordinates. We thought that Files might be an updated version of Line Coordinates with fewer missing files.)
  • One of the line coordinate files was named wrongly - page 6 actually refers to page 6b. We have not renamed the file yet, as we wanted to check in with Professor Howe before doing so.

Note: We did rename the file, with Professor Howe's permission, on February 25.

February 21, 2018

Today, we looked at the files suggested by Professor Howe during the first meeting.

DemoSkc.m

This file is the demo of a ball-spring model (or part-structured model). After further clarification with Professor Howe, we now know that DemoSkc.m is supposed to show us how the ball-spring models are built, and how they match against other words. We tried to run this, but it could not compile, due to a failure to load the file GW20WordDescription. After guidance from Professor Howe, we learnt that GW20WordDescription must either be in the current working directory or the path to be successfully loaded. When we fixed this, however, we were still unable to compile the file, as the code made several references to a variable called bimg that was not created in DemoSkc.m. We searched for bimg in the drive, but could only find a file called bimgEntropyCodeLen.m that also passes in bimg as a parameter.

Note: we managed to find the variable bimg when we met again on February 25.

Note #2: Professor Howe told us that DemoSkc.m should actually be named DemoPSM.m; PSM stands for part-structured model.

autoPSM.m

This file creates character models. When we tried to run this file, an error occurred: Not enough input arguments. This error occurred in line 10: epix = [bimg(1,:),bimg(end,:),bimg(2:end-1,1)',bimg(2:end-1,end)']; We guessed that extra arguments might have to be provided ourselves, but we did not know how.

NRH note: yes, you need to call the function with a binary image of the letter that you want to turn into a model.
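To see why the input must be a binary image, it may help to run the expression from line 10 on its own. The sketch below uses a made-up 5x5 image (hypothetical, not from the Parzival data) and shows that the expression collects the border pixels of bimg. We have not confirmed autoPSM's full signature, so the call itself is left commented out as an assumption.

```matlab
% Hypothetical 5x5 binary "letter" image (true = ink).
bimg = false(5, 5);
bimg(2:4, 3) = true;   % a short vertical stroke in the interior
bimg(1, 3)   = true;   % one ink pixel touching the top border

% Same expression as line 10 of autoPSM.m: top row, bottom row,
% then the left and right columns without their corner pixels.
epix = [bimg(1,:), bimg(end,:), bimg(2:end-1,1)', bimg(2:end-1,end)'];

numBorderPixels = numel(epix);   % 5 + 5 + 3 + 3 = 16
numBorderInk    = nnz(epix);     % 1: only the pixel at (1,3)

% model = autoPSM(bimg);   % assumed call form; not verified
```

If bimg is never passed in, the first use of it (line 10) is where MATLAB reports Not enough input arguments.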

FitPSMWordModel

This file fits models to words on a page. When we tried to run this file, an error occurred: Undefined function or variable 'setupParameters'. This error was in the line p = setupParameters(p,varargin{:})

NRH note: setupParameters is defined in the MATLAB\Utility folder. Make sure you have added this to your path.
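A minimal sketch of the path fix, assuming the Utility folder is at H:/Matlab/Utility (the location recorded in our summary; adjust it if the drive is mapped differently). It only adds the folder when MATLAB cannot already find setupParameters.

```matlab
% Assumed location of the Utility folder; adjust as needed.
utilityDir = 'H:/Matlab/Utility';

% which() returns '' when MATLAB cannot resolve the name.
if isempty(which('setupParameters'))
    if exist(utilityDir, 'dir') == 7
        addpath(utilityDir);     % make the helper functions visible
    else
        warning('Utility folder not found: %s', utilityDir);
    end
end

% Report where the function was found (empty if still missing).
location = which('setupParameters');
disp(location);
```

The same check should work for any other helper that triggers an Undefined function or variable error, such as unitize.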

PageTruth

This file puts word boxes onto the entire page. When we tried to run this file, an error occurred: Not enough input arguments. This error occurred in line 13: if iscell(page)

NRH note: part of the work you should be doing is to look at how these functions are used in the example files, and figure out what kind of arguments they need. If you are having trouble you can always ask me.

FixErrors

This file (located in the PageTruth folder) revises and corrects the program. When we tried to run this file, an error occurred: Not enough input arguments. This error occurred in line 5: oGtp = gtp;

February 25, 2018

Professor Howe said during the meeting on February 23rd that a specific file would be able to help us confirm the rectangles around each word in Parzival were correct. He thought this file might be named ImgRect, but was not sure. Today, we tried to find the ImgRect file; however, searching ImgRect in the HoweLab data drive yielded no results. We thought that it might be the recoverGW20Rects.m file. We loaded data for the missing variables nword, wpage, and page by loading the MATLAB data files GW20WordImages.mat, GW20Aug2012Seq.mat, and GW20Pages.mat respectively. For nword, we typed nword = wimg(n) in the command line for the file to properly compile (n should be replaced with any number from 1 to 4857).

Later, we asked Professor Howe about the file, and received the following response:

"In the first section it creates a data file called GW20WordRectangles that contains the word rectangle data for GW20. (I don't remember why I had to do this, because the word rectangles should have been given with the GW20 data set.) You will note that it is not set up as a function; rather, it records a sequence of executed commands. You can run it several ways; one is by pasting the code at the command line. Note that you only need to do this once -- as soon as GW20WordRectangles is created, you can just load the data from it."

When we tried running the program, its output was consecutive numbers from 1 to 4857. As the program processes each word, it prints out the word's number, so that the user can monitor the program's progress. When it reaches 4857, it is done. At that point, the variable wr has been completely computed.

With Professor Howe's permission, we renamed Page6_Coordinates.mat to Page6b_Coordinates.mat in both the Files and Line Coordinates folders.

Update on the bimg variable: We found the bimg variable in GW20Aug2012Seq.mat, GW20BinaryWordImagesV3.mat, GW20BinaryWordImagesV4.mat for the George Washington project, and BinaryWordImages.mat for the Parzival project.
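Searching the drive by file name did not find bimg because it is a variable stored inside .mat files. One possible way to search by variable name instead is sketched below. It builds two small throwaway .mat files in a temporary folder so that it runs on its own; the file names hasBimg.mat and noBimg.mat are made up for illustration, and on the real drive searchDir would point at the data folder.

```matlab
% Create two small .mat files in a temporary folder so the sketch is
% self-contained.
searchDir = tempname;
mkdir(searchDir);
bimg = {true(2)};  wid = 1;
save(fullfile(searchDir, 'hasBimg.mat'), 'bimg', 'wid');
page = zeros(3);
save(fullfile(searchDir, 'noBimg.mat'), 'page');

target = 'bimg';                        % variable name to look for
files  = dir(fullfile(searchDir, '*.mat'));
found  = {};
for k = 1:numel(files)
    f    = fullfile(searchDir, files(k).name);
    info = whos('-file', f);            % lists variables without loading them
    if any(strcmp({info.name}, target))
        found{end+1} = files(k).name;   %#ok<AGROW>
    end
end
disp(found);
```

Because whos('-file', ...) reads only the variable listing, this stays fast even when the .mat files themselves are large.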

March 21, 2018

We created a program in Parzival/parzivaldb-v1.0/data called binarizePages.m that binarizes the raw page images and uses the available line coordinates to slice raw lines from each binarized page. It loops over all the page images in the folder, binarizes each one, and saves the binarized page images in a vector called `binarizedimg`. Although this completes the binarization, we are wondering whether we should find a more sophisticated way to binarize the images, because we used the built-in MATLAB `im2bw` function, which simply thresholds each image at a single global level to produce a black-and-white image. We are also having some difficulty slicing the lines out of the page images with the coordinates; we are able to cut some lines, but we are not sure whether we are doing it the right way.
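As a rough comparison of binarization options, the sketch below applies a fixed threshold, Otsu's threshold (graythresh), and locally adaptive thresholding (imbinarize, available from R2016a) to a synthetic image. It requires the Image Processing Toolbox. The synthetic "page" is an assumption standing in for a real page image, and we have not evaluated which method suits the Parzival pages best.

```matlab
% Synthetic grayscale "page": a background that brightens from left to
% right, with a darker horizontal band standing in for ink.
[X, ~] = meshgrid(linspace(0.4, 0.9, 200), 1:100);
gray = X;
gray(40:60, 20:180) = gray(40:60, 20:180) - 0.35;

bwFixed = im2bw(gray, 0.5);                  % fixed global threshold
level   = graythresh(gray);                  % Otsu's threshold, in [0, 1]
bwOtsu  = im2bw(gray, level);
bwLocal = imbinarize(gray, 'adaptive', ...   % locally adaptive threshold
                     'ForegroundPolarity', 'dark');

% Ink is dark here, so invert to get white-on-black text if needed.
inkOtsu = ~bwOtsu;
```

A fixed threshold tends to struggle when the background brightness varies across the page, which is the situation the adaptive option is designed for.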

March 23, 2018

We renamed all the files in parzivaldb-v1.0/Line Coordinates, which had names of the form Page6a_Coordinates, to match the names of their page images, so that we can find the files more quickly while running binarizePages.m.

We also updated binarizePages.m to check whether we have the coordinates file for the pimg (page image). If not, the program will print an error message.
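The check can be sketched as follows. This is not the code in binarizePages.m; the folder layout and the page names (d-006a, d-007b) are made up so that the example runs on its own, and it relies only on the coordinate file sharing its base name with the page image.

```matlab
% Hypothetical folders and file names; substitute the real ones.
pageDir  = tempname;  mkdir(pageDir);
coordDir = tempname;  mkdir(coordDir);
fclose(fopen(fullfile(pageDir, 'd-006a.png'), 'w'));   % empty stand-in files
fclose(fopen(fullfile(pageDir, 'd-007b.png'), 'w'));
coords = [1 2 3 4];
save(fullfile(coordDir, 'd-006a.mat'), 'coords');

pages   = dir(fullfile(pageDir, '*.png'));
missing = {};
for k = 1:numel(pages)
    [~, base] = fileparts(pages(k).name);          % e.g. 'd-006a'
    coordFile = fullfile(coordDir, [base '.mat']);
    if exist(coordFile, 'file') ~= 2
        fprintf('No line coordinates for page %s\n', base);
        missing{end+1} = base;                     %#ok<AGROW>
        continue;                                  % skip this page
    end
    % ... load(coordFile) and slice the lines here ...
end
```

Collecting the skipped pages in missing gives a list that can be compared against the missing pages recorded on February 18.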

March 28, 2018

Professor Howe sent us 3 files:
roughAlignBitmaps.m: to figure out offsets between two lines
projectToPage.m: to create ground truth images after you have all the words finished
processParzival.m: lines 236-250 show creation of segmented line images

He also updated the Parzival folder in the HoweLab drive with two files, ParzivalSegmentedBlocks.mat and ParzivalSegmentedLines.mat, which contain the data that should be loaded into MATLAB, along with BinaryWordImages.mat, which loads in the bimg variable. He told us to apply roughAlignBitmaps.m to these MATLAB data files and the deslanted lines (although there may be some lines missing), and that we might have to invert one from black on white to white on black.

On our first attempt to run these files, we got the error Not enough input arguments. We then attempted to fill in the input arguments via the command line: roughAlignBitmaps(lines{1, 1}{1, 1}{1, 1}, lines{1, 1}{1, 1}{1, 1}). The function's parameters are bimg1, bimg2, and sigma. We didn't know what sigma was, but we noticed that a default value is assigned to it if it is left empty, so we passed in only the first two bimg arguments. Our next error was The unitize function is not defined. We asked Professor Howe about this, and he told us that unitize.m was located in the Utilities folder in the HoweLab drive. After this, the program compiled successfully, and the output was a matrix. When we asked Professor Howe about this, he said:

"There are three outputs. The first two are transformed images: bimg1to2 translates bimg1 into the best possible alignment with bimg2, and the second is the reverse. scr measures the quality of the match; lower is better. It looks like I may not quite have finished writing this. I would expect it to also return information about the transformation that maps one line image on top of the other one, but as currently written this information is computed inside the function but never returned. It probably should be rewritten to pass back that information as additional return values. If I'm reading things right, the crucial information are the computed values of p1 and p2. But it's not as simple as an offset; there's some stretching possibly involved."

March 30, 2018

Today, we tried again to better understand roughAlignBitmaps.m so that we could write our own function. Professor Howe provided us with a transcript to show us how to properly run it:

cd Handwriting\
load S:\Labs\howelab\Parzival\ParzivalBinaryLineImages.mat
subplot(2,1,1);
subplot(2,1,2);
imshow(lines{1}{1}{9})
imshow(limg{1})
load S:\Labs\howelab\Parzival\ParzivalSegmentedLines.mat
imshow(lines{1}{1}{9})
[a,b,c] = roughAlignBitmaps(lines{1}{1}{9}, limg{1});
figure
subplot(2,1,2); subplot(2,1,1);
(Click on the plot you want the image to show up on before you pass the image into the command line)
imshow(lines{1}{1}{9})
(Click on the plot you want the next image to show up on)
imshow(b)
Create a new figure:
figure
subplot(2,1,2); subplot(2,1,1);
(Click on the plot you want the next image to show up on)
imshow(limg{1})
(Click on the plot you want the next image to show up on)
imshow(a)
c

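In the transcript, the third output of roughAlignBitmaps.m (c) is a single floating-point number. Per Professor Howe's description on March 28, this is the match score scr (lower is better) rather than the offset itself; the first two outputs (a and b) are the transformed images. The images can also be plotted one above the other, using the commands from the transcript above.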
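Clicking on a subplot only makes it the current axes, so calling subplot(2,1,k) immediately before imshow should have the same effect and lets the display be scripted. A minimal sketch with stand-in images (random data, not Parzival lines); imshow requires the Image Processing Toolbox.

```matlab
% Stand-in images; in the transcript these would be lines{1}{1}{9} and b.
imgTop    = rand(20, 200) > 0.5;
imgBottom = ~imgTop;

figure;
subplot(2, 1, 1);      % make the top axes current
imshow(imgTop);
title('first image');

subplot(2, 1, 2);      % make the bottom axes current
imshow(imgBottom);
title('second image');
```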

During our attempt to further understand roughAlignBitmaps.m, we went through it a few times, extensively documenting what each part of the program does, as well as what every variable abbreviation stands for. These comments can be found in roughAlignBitmaps_copy.m, located in the same directory as the original file.

April 11th, 2018

We started work on a program, cut.m, located in the main Parzival directory (HoweLab/Parzival/), that would take the word split information from the deslanted image and use it to split the raw line image in the equivalent place.

April 13th, 2018

We finished work on cut.m. It takes in binarised word images (wordbimg) and the txg2 from roughAlignBitmaps.m. Currently, the program works on page 6 only; we are working on making it more generalised, so that the user can run it with minimal manual effort (i.e. so that the user doesn't have to change txg2 and the wordbimgs every time they want to run it). Otherwise, going through all the pages in the Parzival data would require a lot of manual work.

The call to the cut.m function is [a, b] = cut(lineRimgs, limg).

April 20th, 2018

We split into 2 groups; our objective was to extend cut.m to make it more generalised (as mentioned above). Two copies of the file were made: autocut.m and cut_copy.m. In the original file, there is a line of code, wordbimgs = dir('d-006a-001*.png'). We split this file name into parts in order to make the program more generalised: pattern = strcat('d-', pageNo, aOrB, '-', lineNo, '*.png');. pageNo is the page number (e.g. '006'), aOrB is the letter after the page number ('a' or 'b'), and lineNo is the line number (e.g. '001'). Although this makes the program more generalised because you can enter a different txg2, it also requires a lot of manual effort on the user's part, as they would have to keep entering all the different variables.
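One way to cut down on the manual entry is to build the pattern inside a loop. The sketch below first reproduces the strcat call, then shows an equivalent sprintf form that zero-pads numbers; the loop bounds (lines 1 to 3 of page 6a) are chosen only for illustration.

```matlab
pageNo = '006';   % page number, zero-padded to three digits
aOrB   = 'a';     % side of the page
lineNo = '001';   % line number, zero-padded to three digits

pattern = strcat('d-', pageNo, aOrB, '-', lineNo, '*.png');
% pattern is 'd-006a-001*.png'

% Equivalent, building the padded strings from numbers:
pattern2 = sprintf('d-%03d%s-%03d*.png', 6, 'a', 1);

% Looping avoids retyping the variables for every line.
patterns = cell(1, 3);
for ln = 1:3
    patterns{ln} = sprintf('d-%03d%s-%03d*.png', 6, 'a', ln);
    % wordbimgs = dir(patterns{ln});   % as in cut.m
end
```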

April 27th, 2018

We continued work on the functions, mainly trying to get autocut.m and cut_copy.m to compile, as we were getting errors. In cut_copy.m, we passed in wordbimgs so we do not have to hard code the file name. We were having an error where MATLAB was printing out txg2 even though we had no code that explicitly displayed / printed txg2. It turned out we had a missing semicolon for the statement where we were assigning something to txg2.

May 6th, 2018

Call to the cut_copy.m function: [~, segment] = cut_copy(lines, limg)

Showing specific images:
imshow(segment{1, 1}{1, 2}{1, 1})
imshow(segment{1, 3}{1, 2}{1, 1})
imshow(segment{1, 4}{1, 2}{1, 1})
imshow(segment{1, 5}{1, 2}{1, 1})
imshow(segment{1, 6}{1, 2}{1, 1})

We changed our code in order to include the fixed offset when we cut the words, using the values in the cells of txg2. Using this fixed offset, we moved the start and end point of the cut to the right.

This approach seems to work well for some raw binarized lines, but not so well for others. We tried constants 7, 10, and 20 on the first few lines (we manually tested lines 1-6). 10 seems to work for line #1 and #3 but not for the rest. 7 works in general except for line #1, but is not the best fit. 20 did not work. Some lines needed more offset to the right while others needed more offset to the left. Furthermore, it wasn't just a matter of direction: regardless of whether they needed offset to the left or right, they needed different offset amounts, which is why we tried different constants. We thought about attempting to use different constants for each line, but thought we would have to hard-code a different constant for each line (which would be difficult as we did not know how to determine a suitable offset tailored to each line). Professor Howe advised us to pick the offset that seemed to work the most widely and consistently, and produce a set of word images using that. He also said we could figure out a way to adjust the word boundaries so that small pieces are not cut off.

We then found out that the combined width of the word images is not equal to the width of the raw binarized line, which might be part of the reason why our previous code did not work. Therefore, when we next meet, we are going to try cutting the line right where the first word starts (the first white dot of the first word), rather than cutting from the beginning of the line. The difference between the beginning of the line (column 0) and where the first word starts is the offset, so we will move the cutting position for every word by that offset. This way, every line will have its own offset. Professor Howe approved of this, and also said that another strategy that might work would be to superimpose the labeled, transformed old line onto the new one, and let the pixels vote for each connected component.
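The per-line offset idea can be sketched as follows. This is not the code in cutLine_copy.m: the line image and the word boundaries in wordCols are made up, and the sketch assumes white ink (true) on a black background with at least one ink pixel in the line.

```matlab
% Stand-in binary line image (true = white ink on black background).
limg = false(10, 50);
limg(3:8, 13:20) = true;    % first "word"
limg(3:8, 26:40) = true;    % second "word"

% Hypothetical word boundaries [startCol endCol], measured from the
% first ink column of the deslanted line (1-based).
wordCols = [1 8; 14 28];

inkCols = find(any(limg, 1));    % columns containing any ink
offset  = inkCols(1) - 1;        % blank columns before the first word

words = cell(size(wordCols, 1), 1);
for k = 1:size(wordCols, 1)
    c1 = wordCols(k, 1) + offset;
    c2 = min(wordCols(k, 2) + offset, size(limg, 2));   % stay inside the line
    words{k} = limg(:, c1:c2);
end
```

Because the offset is measured from each line's own first ink column, no constant needs to be hard-coded per line. Noise pixels to the left of the first word would throw the offset off, so a real version would probably need to ignore very small components.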

May 10, 2018

We worked collectively on cut_copy.m instead of separately on cut_copy.m and autocut.m. We renamed cut_copy.m to cutLine_copy.m to make the function name more informative.

We tried to find a better way to cut the words without having to set a fixed constant that only works for some words and not others. We had a persistent error: Subscript indices must either be real positive integers or logicals. The error was in the line of code where we used round(), which initially did not make sense because round() should have made the index an integer. We realised that the error might be because the float being rounded is being rounded down to 0. Our solution was to use ceil() to round up instead of just rounding to the nearest integer. However, we still got the same error - so we added 1 to (k-1) to avoid the index becoming 0. Currently, it works until d-006a-005*.jpg, before crashing with the error Index exceeds matrix dimensions.
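Both errors come from a column index falling outside the range 1 to the image width. One possible guard, sketched below with made-up cut positions, is to clamp every rounded position into that range before indexing. We have not confirmed that this resolves the remaining crash in cutLine_copy.m.

```matlab
limg  = true(10, 40);           % stand-in line image, 40 columns wide
nCols = size(limg, 2);

rawCuts = [0.3, 12.6, 41.8];    % made-up fractional cut positions

% round(0.3) is 0 and round(41.8) is 42; both are invalid column indices.
cuts = min(max(round(rawCuts), 1), nCols);   % clamp into 1..nCols

firstWord = limg(:, cuts(1):cuts(2));
```

Note that ceil() alone does not help when the value is exactly 0 or negative, which may be why switching from round() to ceil() still produced the first error.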

Summary for next year's team

The Parzival directory documentation for the HoweLab drive might be helpful.

Useful variable locations

Located in main Parzival directory (H:/Parzival/):
ParzivalBinaryLineImages.mat: limg (binary line image), lid (line ID)
ParzivalBinaryPageImages.mat: bpage (binary page image)
ParzivalBinaryWordImages.mat: bimg (binary word image), wid (word id)
ParzivalPageImages.mat: page (page image)
ParzivalSegmentedBlocks.mat: blocks, bpts
ParzivalSegmentedLines.mat: lines (segmented lines), lpts
txg2.mat: a (transformed x grid from roughAlignBitmaps.m)

Located in H:/Matlab/Utility/:
setupParameters.m (needed to run FitPSMWordModel.m)

Code written

All code written is located in the main Parzival directory in HoweLab (HoweLab/Parzival/).

We wrote binarizePages.m, which binarizes the Parzival page images and slices raw lines from the binarized pages, and cut.m, which cuts words from raw binarized lines. We attempted to extend cut.m in cutLine_copy.m (renamed from cut_copy.m), where we tried to generalize the word slicing by cutting the line where the first word starts (the first white dot of the first word), rather than cutting with a fixed offset. Using a fixed offset worked for some lines, but was not accurate enough overall, since different binarized lines required different offsets.