Syriac dating


For Easy Access

Losses

  • Linear distance/angle: Self-explanatory.
  • Log distance with l2_normalized vectors: Same: -log(1 - distance/2). Different: -log(distance/2). Both have a minimum of 0 (distance in range 0-2).
  • Inverse distance with l2_normalized vectors: Same: -1/(distance - 2) - 0.5. Different: 1/distance - 0.5. Both have a minimum of 0 (distance in range 0-2).
  • Log distance with cosine: Same: -log((1 + cos)/2). Different: -log((1 - cos)/2). Both have a minimum of 0: the same-pair minimum is at cos = 1, the different-pair minimum at cos = -1.
  • Inverse distance with cosine: Same: 1/(1 + cos) - 0.5. Different: 1/(1 - cos) - 0.5. Both have a minimum of 0 (see the sketch after this list).
  • Scaled: loss divided by the length of latent 1. No defined minimum.
  • Range: loss divided by the range of latent 1. No defined minimum.
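
A minimal sketch of the two cosine-based losses above, assuming TensorFlow and hypothetical L2-normalized latent batches z1, z2 plus a same/different indicator; it illustrates the formulas, not the project's actual loss code.

    import tensorflow as tf

    def cosine_pair_loss(z1, z2, same, kind="log"):
        """Pair loss on L2-normalized latents.
        same: 1.0 for same-manuscript pairs, 0.0 for different-manuscript pairs."""
        cos = tf.reduce_sum(z1 * z2, axis=-1)                     # cosine similarity in [-1, 1]
        if kind == "log":
            same_loss = -tf.math.log((1.0 + cos) / 2.0 + 1e-8)    # minimum 0 at cos = 1
            diff_loss = -tf.math.log((1.0 - cos) / 2.0 + 1e-8)    # minimum 0 at cos = -1
        else:  # "inverse"
            same_loss = 1.0 / (1.0 + cos + 1e-8) - 0.5            # minimum 0 at cos = 1
            diff_loss = 1.0 / (1.0 - cos + 1e-8) - 0.5            # minimum 0 at cos = -1
        return tf.reduce_mean(same * same_loss + (1.0 - same) * diff_loss)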

DatedManu mod

  • BL. Add. 14425\n(A and B) to BL. Add. 14425A and 14425B
  • Vat. Syr. dd/d to Vat. Syr. 0dd/0d
  • Par. Syr. dd/d to Par. Syr. 0dd/0d
  • filename: Add_12134 to BL. Add. 12134
  • BL. Add. 12156 to BL. Add. 12156A
  • BL. Add.\n7157 to BL. Add. 07157
  • BL. Add.17127 to BL. Add. 17127
  • BL. Add. 14668 A and BL. Add. 14668 B to BL. Add. 14668A/14668B
  • BL. Add. 12135 (interface 12, 135 B) to 12, 135B
  • BL. Or. 8731 ('734') to BL. Or. 8731 LI
  • BL. Add. 12171 A/B- all folios before f65 to BL. Add. 12171A/B
  • Missing: ['BL. Add. 14512', 'BL. Add. 14548']
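
A minimal sketch of this renaming pass, assuming the names arrive as plain strings; the mapping below is an illustrative subset of the rules above, and the zero-padding regex is an assumption, not the actual cleaning script.

    import re

    # Illustrative subset of the one-off renames (left side is the raw name,
    # including the literal line breaks that appear in the source data).
    RENAMES = {
        "BL. Add. 14425\n(A and B)": "BL. Add. 14425A and 14425B",
        "Add_12134": "BL. Add. 12134",          # from the filename
        "BL. Add. 12156": "BL. Add. 12156A",
        "BL. Add.\n7157": "BL. Add. 07157",
        "BL. Add.17127": "BL. Add. 17127",
    }

    def normalize_name(name):
        if name in RENAMES:
            return RENAMES[name]
        # Vat. Syr. / Par. Syr. dd/d -> 0dd/0d: prepend a zero to 1-2 digit numbers.
        return re.sub(r"^((?:Vat|Par)\. Syr\. )(\d{1,2})$",
                      lambda m: m.group(1) + "0" + m.group(2), name)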

Week 01 (May 18 - May 24)

May 19

May 20

May 21

May 22 - May 24

  • Install CUDA 10.0 and cuDNN for TensorFlow 1.15.0.
  • Run ALI_Encoder and decide that my laptop still runs too slowly even on GPU (avg 5 s/it):
    • Attempted to install CUDA on the Ford computer, but it failed.
    • Switched to using Google Colab. It will take a while to transfer all the code to Google Colab.
    • Main problem: training steps too large.
  • Have some problems with the ALI_Encoder training file.
  • Read
  • Finish a tutorial for tensorflow 1.15.0
  • Read through Minyue's ALI_Encoder code and try to figure out the code.
  • Run Minyue's SVM files. Try to figure out the difference between multiple SVM files in multiple folders.
  • Combine some SVM files with the same content into one. Add a pandas table for easy sorting and visualization.
  • The paper mentioned uneven labels: add precision, recall, F1, and a confusion matrix (see the sketch after this list). Hopefully that will allow training on more labels.
  • Was not able to reproduce her results.
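
A minimal sketch of those extra metrics, assuming scikit-learn and hypothetical y_true / y_pred arrays from the SVM; it is not Minyue's original evaluation code.

    import pandas as pd
    from sklearn.metrics import classification_report, confusion_matrix

    def report(y_true, y_pred, labels):
        # Per-class precision, recall, and F1 - more informative than plain
        # accuracy when the label counts are uneven.
        print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
        # Confusion matrix as a pandas table for easy sorting and visualization.
        return pd.DataFrame(confusion_matrix(y_true, y_pred, labels=labels),
                            index=labels, columns=labels)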

Week 02 (May 25 - May 31)

May 25

  • Explore tsne.py. Wrap it in a tqdm progress bar because I'm impatient and I need immediate response. Graphs produced are different but still very similar in structure.
  • Explore and create a summary of how the data flows. Some questions remained.
  • Try to figure out why I got different numbers from the papers. I think it might be because of the 10-most-common-letter-labels thing. Even without it, though, I still don't get the same numbers as the ones recorded in [wiki page].
  • Write a script for fast switching between CUDA versions, because I have gotten away with not installing Anaconda for so long. Also, Windows should definitely get rid of spaces in folder and file names.
  • Copy essential data to start working on translating the GAN from TensorFlow v1 to v2.
  • Create overview.ipynb to explore the data. Probably not necessary, but I do quite like looking at graphs.
  • Create requirements.txt and README.md

May 26

May 27 - May 31

  • Reimplemented Minyue's GAN and VAE in Tensorflow v2.
  • Okay, so I didn't record anything I read here because I didn't want to disrupt the workflow to go write things down. Basically, while reading Minyue's code, whenever I encountered something I didn't know I would go and read about it. Additionally, if there was something I wasn't sure about in the TensorFlow documentation, I would go to TensorFlow's source code and try to figure it out.
  • I tried to implement the GAN in a way that would make subsequent model implementations faster. At first I tried a subclassed model but then switched to the Keras functional API, because it's a lot easier to read, save, and reuse the architecture (see the sketch after this list).
  • Ran some tests on the GAN and VAE v2 models alongside the v1 models to make sure they output similar-looking images.
  • I also rewrote some functions to be more readable/efficient (specifically the scripts for relabelling, separating the train and test sets, and getSameManu).
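
A minimal sketch of the functional-API choice, with made-up layer sizes; the point is that the architecture is an explicit graph of layers, so it is easy to read, save, and reuse, unlike a subclassed model.

    import tensorflow as tf

    def build_encoder(latent_dim=64, image_shape=(28, 28, 1)):
        # Keras functional API: inputs and outputs are wired explicitly, and the
        # whole architecture can be saved, plotted, and reused as-is.
        x_in = tf.keras.Input(shape=image_shape)
        x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(x_in)
        x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
        x = tf.keras.layers.Flatten()(x)
        z = tf.keras.layers.Dense(latent_dim)(x)
        return tf.keras.Model(x_in, z, name="encoder")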

Week 03 (June 01 - June 07)

June 01

  • Implement the rest (GANEncoder, ALI, ALIEncoder) in Tensorflow v2. Ran some tests to make sure v2 models are equivalent to v1 models.
    • v2 is slightly faster than v1.
  • Reread the VAE paper to better understand the math of VAE.
  • Install Git because I accidentally delete stuff a lot.
  • Problems encountered:
    • I have no idea if these models were pretrained or not (there was a variable called pretrain_step and a commented-out pretrain block). For now, taking out the pretrain steps still yields acceptable results.
    • The final layer of some encoders has batch norm.
    • I don't get the loss of fake-fake pairs.
    • The SVM test would be biased if letters from the same manuscript appear in both the test set and the training set (see the split sketch after this list). Plus, it doesn't quite make sense if we are going to run the model on a new dataset, since it will almost certainly have different manuscript codes => this looks more like a clustering (supervised clustering?) problem.
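
A minimal sketch of a manuscript-aware split for the SVM test, assuming scikit-learn; X, y, and manuscripts are hypothetical arrays, with manuscripts holding the manuscript code of each letter.

    from sklearn.model_selection import GroupShuffleSplit

    def split_by_manuscript(X, y, manuscripts, test_size=0.2, seed=0):
        # All letters from one manuscript land entirely in train or entirely in
        # test, so the SVM cannot lean on manuscripts it has already seen.
        splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
        train_idx, test_idx = next(splitter.split(X, y, groups=manuscripts))
        return X[train_idx], X[test_idx], y[train_idx], y[test_idx]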

June 02

  • Run the models' outputs through TSNE. Found that they might correlate too much with the letter label. Possible reason: too much concatenation with the y layer in the encoder model?
  • Modified encoder layers a little bit to see if we achieve better performance:
    • Remove some of the concat label layers. Did not test with TSNE because the generated images are still very noisy.
  • Drop the batch normalization layer at the output of ALI, ALIEncoder, and GANEncoder. GAN and VAE don't have this layer at the end, and moreover I don't get the purpose of having a batch norm layer at the output. Did not see a major difference in the output.
  • Modified the loss function of the VAE and it gave slightly better results.
  • Modified ALI to match the ALI paper a little better. Specifically, the paper mentions using the reparameterization trick so that gradients from the discriminator propagate to the generator network (see the sketch after this list). It achieves slightly better results (less noisy) but still not very good.
  • Read:
  • Problems:
    • Installing packages in a virtualenv on the Ford computer keeps running into connection problems? Resorted to installing Anaconda :(
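
A minimal sketch of the reparameterization trick mentioned above, assuming the encoder outputs a mean and a log-variance; sampling becomes a deterministic function of external noise, so discriminator gradients can flow back through the sample.

    import tensorflow as tf

    def reparameterize(mu, logvar):
        # z = mu + sigma * eps with eps ~ N(0, I). All the randomness sits in
        # eps, so gradients w.r.t. mu and logvar pass straight through z.
        eps = tf.random.normal(tf.shape(mu))
        return mu + tf.exp(0.5 * logvar) * eps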

June 03

  • Tried training the discriminator and generator an equal number of times => conclusion: it's better to alternate, training the discriminator for k steps and then the generator for 1 step (see the sketch after this list). The reverse (generator for k steps and discriminator for 1 step) does not work.
  • Modified the loss function of the cGAN + AutoEncoder model to include L1. Got sharper, less blurry results.
  • Modified the training loop to maximize the negative loss function as suggested in the original GAN paper, but it didn't work :( I don't see why.
  • Read:
  • Cool StyleGAN things that I may or may not have wasted like 30 minutes on
  • Problems encountered:
    • getSameManu returns a letter from the same manuscript but does not guarantee that the letter will be different. Some manuscripts only contain 2 letters, so the possibility of duplicates is high.
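
A minimal sketch of the k-discriminator-steps-per-generator-step schedule; train_discriminator_step, train_generator_step, and batches are hypothetical (a per-batch update function each, and a batch iterator).

    def train(batches, train_discriminator_step, train_generator_step,
              k=3, steps=1000):
        # Alternate: k discriminator updates, then a single generator update.
        for _ in range(steps):
            for _ in range(k):
                train_discriminator_step(next(batches))
            train_generator_step(next(batches))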

June 04

  • FORD342-10 has died and I can't wake it up. RIP. Now I have to set up a different computer.
  • Read
  • Implement a conditional adversarial network with a U-Net as described in the above paper.
  • Modified the model to suit the feature-extraction task. It takes an image and a set of labels and tries to generate a new image based on the original image's style.
  • Experiments:
    • Encourage the encoder to encode images from the same manuscript next to each other with an L2 cost (see the sketch after this list). I have a feeling this is going to result either in overfitting or in the encoder mapping all points onto one single spot, but we will see.
    • Modified the training loop to display both images from the dataset and images outside of the dataset to see if we are overfitting. Unfortunately, I don't have a good way to plot latent vectors without adding crazy overhead, so the above problem remains.
    • The encoder's gradients cannot affect the decoder's weights. I did this because otherwise it resulted in stupidly crazy images. If the encoder overfits, maybe add this back now that I have modified the discriminator input to be more flexible.
    • Omitting label inputs entirely resulted in the generator cheating the system by generating the same image every time. I don't know why I didn't think of that; the model is already smarter than me.
    • Adding labels, however, slows the model down from ~3.5 it/s to ~2.5 it/s. I think either I'm feeding it too many inputs or the model is too deep. Update: after changing the discriminator structure to C64-C128, the model runs at ~5.5 it/s now; conclusion: reducing the number of parameters really helps.
    • Ran the model with lambda=10 but the output was blurry. Ran it with lambda=100 instead. Still blurry.
  • Thought as I watch the tqdm progress bar inch slowly toward completion: I should have shuffled the seed labels after every plot.
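
A minimal sketch of the same-manuscript L2 cost from the experiments above, assuming z_a and z_b are encoder outputs for two letters drawn from the same manuscript.

    import tensorflow as tf

    def same_manuscript_l2(z_a, z_b):
        # Pull latents of letters from the same manuscript towards each other.
        # Risk noted above: the encoder could collapse everything onto one spot.
        return tf.reduce_mean(tf.reduce_sum(tf.square(z_a - z_b), axis=-1))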

June 05 - June 07

  • Read
  • Experiments:
    • Let the encoder see the true label: all that happened is that the TSNE becomes more biased towards the letter label.
    • Let the encoder affect the weights of the decoder. With the encoder seeing the true label: perhaps 100 epochs is way too long, since the letters look fine at the beginning but grow more unintelligible as time goes on (TensorFlow crashed halfway). With the encoder not seeing the true label: okay results, I guess.
    • Use the last layer of the encoder as the default latent vector. This means we will have 512-dimensional latent vectors.
    • Use a 34x34 PatchGAN. Image-generation-wise, the 34x34 PatchGAN is probably better.
  • Tried to implement CycleGAN but realized halfway through that it is not a suitable model for the task. But I thought the idea of a cycle-consistency loss was interesting, so I implemented a modified version of it in the current model (see the sketch after this list).
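
A minimal sketch of the cycle-consistency idea, assuming hypothetical encoder and decoder models: the image is encoded, decoded, and re-encoded, and the round trip is penalized with an L1 cost. The actual modified version in the model may differ.

    import tensorflow as tf

    def cycle_loss(encoder, decoder, images):
        # image -> latent -> reconstructed image -> latent again; penalize the
        # drift between the two latents.
        z = encoder(images, training=True)
        reconstructed = decoder(z, training=True)
        z_cycled = encoder(reconstructed, training=True)
        return tf.reduce_mean(tf.abs(z - z_cycled))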

Week 04 (June 08 - June 14)

June 08

  • Read
  • Fix the bug in the cyclic loss that was causing it to return the exact same input.
  • Start tracking the encoder loss on models that don't have an encoder loss.
  • Set up model plotting and generate model plots for all previous models (see the sketch after this list).
  • Wrote a summary of all current problems.
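
A minimal sketch of the model-plotting setup, assuming Keras models; tf.keras.utils.plot_model needs pydot and graphviz installed.

    import tensorflow as tf

    def save_model_plot(model, path):
        # Writes a PNG diagram of the layer graph, including tensor shapes.
        tf.keras.utils.plot_model(model, to_file=path, show_shapes=True)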

June 09

  • Read:
    • RMS, to see how to normalize RMS.
    • The metrics source code, to see how they implemented metrics, but in the end I gave up and just used the normal function.
  • Figured out why Pix2Pix was giving me blurry images: there was some problem with the normalizing and the activation of the last layer. Fixed the thing and now they look better.
  • Normalize the encoder loss to scale things (see the sketch after this list). Rerun previous experiments with the new encoder loss.
  • Experiments:
    • Let the decoder see the letter label so hopefully the encoder won't need to learn it.
    • Try smaller patchGAN (16x16 and pixel) and the TSNE plot looks a lot better. Probably because I dumbed down the discriminator?
    • Run 16x16 on other experiments.
    • Cyclic on other experiments.
  • Set up an SVM for classifying images into manuscripts. It did surprisingly okay, but it was classifying test examples into manuscripts it had already seen, so definitely not what we want.
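
A minimal sketch of one way to normalize the encoder loss: RMS error divided by the range of the reference values, so the scale of the latents stops mattering. This matches the rms/(max - min) form used later in these notes, but the exact normalization in the code may differ.

    import tensorflow as tf

    def normalized_rms(y_true, y_pred):
        # Root-mean-square error scaled by the range of the reference values.
        rms = tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))
        value_range = tf.reduce_max(y_true) - tf.reduce_min(y_true)
        return rms / (value_range + 1e-8)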

June 10

  • Read:
  • Implement an NPFullPrint context switch to fully print a NumPy array (see the sketch after this list). Reason: for debugging.
  • Try more complicated extra layers for encoder.
  • Decide to abandon cyclic loss because I probably will see that example again anyway.
  • Fix a bug related to last layer activation and loss functions that was causing blurry images. Images look good now.
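
A minimal sketch of an NPFullPrint-style context switch; it assumes the real thing is just a thin wrapper around NumPy's print options (np.printoptions is itself a context manager). Used like: with np_full_print(): print(latents).

    import sys
    import numpy as np
    from contextlib import contextmanager

    @contextmanager
    def np_full_print():
        # Temporarily print whole arrays instead of the truncated '...' form.
        with np.printoptions(threshold=sys.maxsize, linewidth=200):
            yield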

June 11

  • Read:
  • Realized that the previous TSNE plot was wrong because I forgot to normalize the test set. After normalizing it, it looks slightly better.
  • Add a classifier to the current generator-discriminator model, simplify some of the layers so that we don't have so many trainable variables anymore, and run all previous experiments.
  • Run all previous experiments on the classifier. Try to figure out why the loss practically doesn't decrease. I think I eventually found out why and fixed it, but I've already forgotten why.
  • Set up model plotting on the other two computers.

June 12 - June 14

  • Realized that I have been using the wrong loss function (binary cross-entropy instead of categorical cross-entropy; see the sketch after this list). Fixed that and reran all of the experiments.
  • Implement a model with a classifier (classify labels), a separator/discriminator (decide if two latents are from the same manuscript), and an encoder. Two types of encoders: one with ALIEncoder's structure and one with the current model's encoder structure.
  • Put some reparameterization in.
  • Run experiments on all variations.
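
A minimal sketch of the loss mix-up, assuming Keras losses: binary cross-entropy scores each output unit as an independent yes/no, while categorical cross-entropy is the right fit for one-hot multi-class labels.

    import tensorflow as tf

    # Wrong for one-hot multi-class targets: treats every class independently.
    wrong_loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)

    # Right for one-hot multi-class targets (use SparseCategoricalCrossentropy
    # when the labels are integer class indices instead).
    right_loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)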

Week 05 (June 15 - June 21)

June 15

  • Implement random alphabet-set selection. Ran into what I thought was a bug but was actually just me using the wrong variable names. Problem: it runs quite slowly, and it's 100% because of the alphabet-set selection.
  • Implement the Stacked model with a linear RMS loss. We don't have an adversarial part anymore.
  • Implement a simpler separator model; maybe too many complicated layers confused it.
  • One current model (Stacked/first) looks pretty good when run through TSNE, but it's running the fewest training steps. Either it is just learning the identity or all the other models are overfitting.
  • Modify the training loop because we don't have epochs anymore (we could, I just don't think it's a good idea).
  • Experiments on Stacked: original, larger_batch, larger_batch with reduced training steps.

June 16

  • Read (and follow):
  • Modify the random batch selection to use a pandas GroupBy object (see the sketch after this list). It is significantly faster if I just agg a random function across all (manu, label) groups and then select the groups that I need, but the drawback is that I can't now have the same manuscript more than once in a batch. Nevertheless, I'm sticking with it for now.
  • Something is wrong with my plotting of TSNE.
  • Experiments on Stacked with the simpler encoder: original, larger_batch, larger_batch with reduced training steps.
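
A minimal sketch of the GroupBy-based selection, assuming a DataFrame df with 'manu' and 'label' columns; DataFrameGroupBy.sample (pandas >= 1.1) is used here instead of agg, but the effect is the same: one random row per group in a single pass.

    import pandas as pd

    def one_random_letter_per_group(df, seed=None):
        # One random row per (manu, label) group, picked in a single pass.
        # Much faster than looping over groups, but each group contributes at
        # most one letter, so a (manu, label) pair cannot repeat in a batch.
        return df.groupby(["manu", "label"]).sample(n=1, random_state=seed)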

June 17

  • Implement random image shifting. As expected, adding shifting really confused Stacked/first.
  • Modify the current get-different-manu function so that it can do element-wise difference.
  • Modify the random batch selection to allow the same manuscript to appear more than once per batch. This really slows it down (could be mainly because of the shifting) but it is still faster than the original approach (~1.7 s for 100 elements vs. ~40 s).
  • Figure out why the TSNE scatter plot only shows 6 colors and fix it. Read about colormaps in pyplot. Found an acceptable solution. It is possible to create a custom colormap, but I bet that with ~40 manuscripts some colors are going to be indistinguishable even if manually chosen.
  • Run experiments on StackedAlphabet with shifting and different batch size.
  • Implement EncoderOnly, which doesn't take too long. Run an experiment with the distance loss. Normalize the loss (rms/(max_latent - min_latent)) so that the model doesn't reduce the loss just by assigning small numbers to the latent vector.

June 18

  • Modify get_diff_manu with unnecessarily fancy arithmetic so that it doesn't have to go through multiple loops to do element-wise difference. It probably won't speed things up by a lot, but every bit helps. Consequence: I have to implement something like a label encoder for the manu_train data because of the new algorithm.
  • Run the RMS and log loss experiments.

June 19 - June 21

  • Read:
  • Losses:
    • L2-normalize and add a cosine loss. This hopefully bypasses the normalizing problem.
    • Try a log cosine loss. Forgot that log of a negative number is undefined. Reran with the cosine loss in the range [0, 2] instead of [-1, 1].
    • Inverse loss. I could try this on RMS if I knew the scale.
    • Note: reducing different_manuscript_loss was not a problem at all for cosine similarity. (It was for RMS.)
  • KNN:
    • Implement a cosine similarity metric because KNN doesn't offer one. But it turns out that KNN doesn't offer one because cosine similarity doesn't work well with KNN, since it's not a proper distance metric (see the sketch after this list). The cosine loss on a sample test set indicates that the model is doing pretty well, but KNN doesn't think so, either because I messed up the function or because KNN doesn't like angles. Meanwhile, the other models are doing okay.
    • Put everything in a table for easy visualization. With no shifting it’s possible to achieve 100% accuracy, which I found super suspicious.
    • Visualize everything with a confusion matrix.
  • Update the long-neglected wiki page. Clean up all the -Copy files that I created to run the notebook on multiple machines at the same time.
  • Read the image and label data into npy. There is a discrepancy between the number of images and the length of the text file. Also, I don't understand the rest of the info.
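
A minimal sketch of a cosine-based KNN check, assuming scikit-learn: KNeighborsClassifier accepts metric='cosine' in brute-force mode (the tree-based modes reject it because cosine is not a true metric), which sidesteps a hand-written similarity function.

    from sklearn.neighbors import KNeighborsClassifier

    def cosine_knn_predict(latents_train, manus_train, latents_test, k=5):
        # Brute-force neighbor search with cosine distance over the latents.
        knn = KNeighborsClassifier(n_neighbors=k, metric="cosine", algorithm="brute")
        knn.fit(latents_train, manus_train)
        return knn.predict(latents_test)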

Week 6 (June 22 - June 28)

June 22

  • Take time to pick the loss functions again, more carefully this time, in order to monitor the minimum loss.
  • The cosine loss is doing fine after this modification.
  • Experiments: Batch size, cycling training
  • Add random zooming.

June 23

  • Fix the loss affecting the learning rate of the models. Rerun all linear models + experiments on said models.
  • Zooming (~10%) and shifting help identify an untrained random model.
  • Read MainSamp.zip - the images and the text labels file are not equivalent, but I put a few functions in to create a table anyway once we get the data.
  • MainSampInfo.csv - contains all possible info about the letters.
  • Note: the new data has 0.0 = empty, 1.0 = label. This is the other way around in reg3.

June 24

  • Implement something that can generate the probability of manuscripts being related. Test it on dated_data.npz.
  • Plot data and generate tables and stuff (all analysis is in the Query.ipynb file, albeit it might be a bit messy right now).
  • Plot data to check if there is any relationship between manuscript size (minimum and average # of labels) and being correctly classified.

June 25

  • Read MainSamp.zip with info extracted from filenames.
  • Manually rename dated manuscript list to match dataset. There are two missing manuscripts. Generate dated_data.npz.
  • Implement something that can generate the probability of manuscripts being related, but with a second choice. Add a probabilistic table. Test both on dated_data.npz.
  • Plot data and generate tables and stuff (all analysis is in Query.ipynb file)

June 26 - June 28

Week 7 (June 29 - July 5)

June 29

June 30 - July 1

  • Reference: List similarity, list cosine similarity, string similarity, difflib.
  • Create a k-nearest neighbors function that ignores the true manuscript. Generate new probability tables from this function.
  • Create a function that bubble sorts one list (source) into the order of another list (destination); see the sketch after this list. Attempted a two-way bubble sort, and that does reduce the number of steps a little bit. Try to figure out why the worst case isn't the reverse-ordered list.
  • Attempted quicksort and merge sort versions, but swapping only adjacent elements defeats their purpose anyway.
  • Order score on the 1D condensed distance vector.
  • PCA on the full set of latent vectors (instead of just their centroids) and distplot them.
  • Update README.md
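
A minimal sketch of the source-to-destination bubble sort, counting the adjacent swaps needed to turn one ordering into the other (essentially a Kendall-tau-style distance); both lists are assumed to contain the same items, and the two-way variant is not shown.

    def bubble_sort_distance(source, destination):
        # Number of adjacent swaps needed to reorder `source` into `destination`.
        rank = {item: i for i, item in enumerate(destination)}
        order = [rank[item] for item in source]
        swaps = 0
        changed = True
        while changed:
            changed = False
            for i in range(len(order) - 1):
                if order[i] > order[i + 1]:
                    order[i], order[i + 1] = order[i + 1], order[i]
                    swaps += 1
                    changed = True
        return swaps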

July 2

  • Update the wiki page.
  • Redo the list similarity calculation. Concern: we probably care more about the relationships between the manuscripts than about the exact order?
  • Generate completely new training and test sets. None of the training examples include a date.
  • Add random blank examples. Train with the cosine loss on the new dataset.
  • Note: the distance-preserving score drops really fast if we let n_components=2D, but the distance correlation score isn't changing. This is very weird.

A compilation of resources Minyue used

Cited in paper for context

Cited in paper and directly related

Other reading

Tutorials

Other