Deep Learning Match Making with Recurrent Networks

Hello World,

So I can’t disclose the problem statement, but what I can disclose is that the solution was a “Siamese LSTM”.  Someone recommended I read Yann LeCun’s paper on “Siamese networks”, in which a pair of weight-sharing convolutional neural networks was used for face verification.  What I did was adapt that concept to a forward/reverse-sweep recurrent (LSTM) network that matches things together.

The primary objective of my network was not to verify faces, but rather to determine whether various things, occurring in a specific order, match or not.  Deciding whether things match over time is a complex time-series problem.  Here is the dummy/sanitized version, which you can modify to achieve your own goals, should they involve matching various random intermittent sequences.  A good analogous dummy problem would be deciphering six people simultaneously holding six different conversations and separating those conversations out into their own individual threads.

A key requirement here is that there are infinite conversations folks can have, and therefore you cannot simply classify or categorize a conversation.

The Label File

Here is some dummy data you can use to verify that it computes; the file is in CNTK Text Format (CTF), which is what the reader in the code expects.  The idea is that each sequence either is a match (same) or is not.  The |x rows are multi-hot encoded feature vectors I made up that either match or don’t over the sequence.

0 |x 0 0 0 1 |y 0
0 |x 0 0 1 1
1 |x 1 0 0 1 |y 1
1 |x 1 0 0 1
1 |x 1 0 0 1
1 |x 1 0 0 1
2 |x 0 1 1 0 |y 0
2 |x 0 1 1 0
2 |x 1 1 0 0
3 |x 1 1 0 0 |y 1
3 |x 1 1 0 0
3 |x 1 1 0 0
4 |x 0 1 1 0 |y 1
4 |x 0 1 1 0
4 |x 0 1 1 0
5 |x 0 1 1 0 |y 1
5 |x 0 1 1 0
6 |x 1 0 1 0 |y 0
6 |x 1 1 1 0
6 |x 1 1 0 0
7 |x 1 0 1 0 |y 0
7 |x 1 1 1 0
7 |x 1 1 0 0
7 |x 1 1 0 1

So a |y 0 is a non-match and a |y 1 is a match.  The first number is the sequence id, |x holds the features, and |y holds the label.
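
If you want to reproduce this exactly, here is a small helper (my own addition, not part of the original pipeline) that dumps the listing above into a file that the reader in the code can load:

def write_sample_ctf(path):
    # Hypothetical helper: write the CTF listing above to disk verbatim.
    rows = """0 |x 0 0 0 1 |y 0
0 |x 0 0 1 1
1 |x 1 0 0 1 |y 1
1 |x 1 0 0 1
1 |x 1 0 0 1
1 |x 1 0 0 1"""  # ...paste the remaining rows from the listing above
    with open(path, 'w') as f:
        f.write(rows + '\n')

write_sample_ctf('sample.ctf')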

So the Code

I’m not going to over-explain this; I’ll answer questions in the comments, so just ask.  I’m going to break it into two segments: “The Model” and “All the Code”.

The Model

So the idea here is that we first apply a dense feature-map layer across all the data in the set; then we do a forward recurrent (LSTM) pass over that feature map and a reverse pass over the same feature map.  We splice those output vectors together and generate a new feature map.  We then take that feature map and do a Fold pass forwards and a Fold pass in reverse (a Fold generates a single output vector per sequence), splice the two results together, and put a dense sigmoid classifier on top for the yes-or-no answer.  PLEASE ASK QUESTIONS IF YOU ARE CONFUSED.  Here is the code…

def create_model(x):
    with C.layers.default_options(init = C.layers.glorot_uniform(), activation = C.relu):
        h = x
        # dense feature map applied at every step of the sequence
        h = C.layers.Dense(hidden_dim, name='feature_map_1')(h)
        # forward and reverse recurrent sweeps over that feature map
        h_r_0 = C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=False)(h)
        h_r_1 = C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=True)(h)
        h = C.splice(h_r_0, h_r_1, axis=0)  # splice the two sweeps into a new feature map
        # Fold keeps only the final state: one vector per direction per sequence
        h_f_0 = C.layers.Fold(C.layers.LSTM(hidden_dim), go_backwards=False)(h)
        h_f_1 = C.layers.Fold(C.layers.LSTM(hidden_dim), go_backwards=True)(h)
        h = C.splice(h_f_0, h_f_1, axis=0)
        # single sigmoid unit: match (1) or non-match (0)
        p = C.layers.Dense(num_labels, name='classify', activation=C.sigmoid)(h)
        return p
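
Before wiring this into training, here is a quick standalone sanity check (my own sketch; the dimensions match the globals defined in the full listing below):

# Shape check: the model should emit one sigmoid unit per sequence.
import cntk as C
input_dim, hidden_dim, num_labels = 4, 6, 1
x = C.sequence.input_variable(input_dim)
z = create_model(x)
print(z.shape)  # expect (1,) -- one match/non-match score per sequence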

OK, so that does the magic sauce, but really the magic sauce is around all the other supporting code; so here it is in its entirety.

ALL CODE

import cntk as C

# model dimensions
input_dim  = 4
num_labels = 1
hidden_dim = 6

label_file = 'C:/projects/CUSTOMERS/PATHOFSTUFFS/sample.ctf'

# Create the containers for input feature (x) and the label (y)
x = C.sequence.input_variable(input_dim)
y = C.input_variable(num_labels)

def create_model(x):
    with C.layers.default_options(init = C.layers.glorot_uniform(), activation = C.relu):
        h = x
        # dense feature map applied at every step of the sequence
        h = C.layers.Dense(hidden_dim, name='feature_map_1')(h)
        # forward and reverse recurrent sweeps over that feature map
        h_r_0 = C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=False)(h)
        h_r_1 = C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=True)(h)
        h = C.splice(h_r_0, h_r_1, axis=0)  # splice the two sweeps into a new feature map
        # Fold keeps only the final state: one vector per direction per sequence
        h_f_0 = C.layers.Fold(C.layers.LSTM(hidden_dim), go_backwards=False)(h)
        h_f_1 = C.layers.Fold(C.layers.LSTM(hidden_dim), go_backwards=True)(h)
        h = C.splice(h_f_0, h_f_1, axis=0)
        # single sigmoid unit: match (1) or non-match (0)
        p = C.layers.Dense(num_labels, name='classify', activation=C.sigmoid)(h)
        return p

def create_reader(path, is_training):
    # map the |x and |y fields of the CTF file onto named streams
    return C.io.MinibatchSource(C.io.CTFDeserializer(path, C.io.StreamDefs(
         x_sequence = C.io.StreamDef(field='x', shape=input_dim, is_sparse=False),
         y = C.io.StreamDef(field='y', shape=num_labels, is_sparse=False)
     )), randomize=is_training, max_sweeps = C.io.INFINITELY_REPEAT if is_training else 1)

def create_criterion_function(model, labels):
    ce   = C.binary_cross_entropy(model, labels)        # loss
    metric = C.equal(C.greater(model, 0.5), labels)     # accuracy: 1 when the thresholded prediction equals the label
    return C.combine([ce, metric]) # (features, labels) -> (loss, metric)

def train(reader, model, max_epochs=10):
    # Map the data streams to the input and labels.
    # this is where we can pull in our label pairs
    input_map = {
        x  : reader.streams.x_sequence,
        y  : reader.streams.y        
    }
    
    # Instantiate the loss and error function
    loss, label_error = create_criterion_function(model, y).outputs

    # training config
    epoch_size = 10000
    minibatch_size = 1000
    
    # LR schedule over epochs 
    # In CNTK, an epoch is how often we get out of the minibatch loop to
    # do other stuff (e.g. checkpointing, adjust learning rate, etc.)
    lr_per_sample = [3e-4]*4+[1.5e-5]
    lr_per_minibatch = [lr * minibatch_size for lr in lr_per_sample]
    lr_schedule = C.learning_rate_schedule(lr_per_minibatch, C.UnitType.minibatch, epoch_size)
    
    # Momentum schedule
    momentum_as_time_constant = C.momentum_as_time_constant_schedule(700)
    
    # We use the Adam optimizer, which works well here.
    # Feel free to try other optimizers from
    # https://www.cntk.ai/pythondocs/cntk.learner.html#module-cntk.learner
    learner = C.adam(parameters=model.parameters,
                     lr=lr_schedule,
                     momentum=momentum_as_time_constant,
                     gradient_clipping_threshold_per_sample=15, 
                     gradient_clipping_with_truncation=True)

    # Setup the progress updater
    progress_printer = C.logging.ProgressPrinter(50, first=1, tag='Training', num_epochs=max_epochs)
    
    # Uncomment below for more detailed logging
    #progress_printer = C.logging.ProgressPrinter(freq=100, first=10, tag='Training', num_epochs=max_epochs)

    # Instantiate the trainer
    trainer = C.Trainer(model, (loss, label_error), learner, progress_printer)

    # process minibatches and perform model training
    C.logging.log_number_of_parameters(model)

    t = 0
    for epoch in range(max_epochs):         # loop over epochs
        epoch_end = (epoch+1) * epoch_size
        while t < epoch_end:                # loop over minibatches on the epoch
            data = reader.next_minibatch(minibatch_size, input_map = input_map)
            trainer.train_minibatch(data)               # update model with it
            t += data[y].num_samples                    # samples so far
        trainer.summarize_training_progress()

def do_train():
    global z
    z = create_model(x)
    reader = create_reader(label_file, is_training=True)
    train(reader, z, max_epochs=10)   # train() builds its own criterion internally
do_train()
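
One line I’d add for real use (not part of the original run): persist the trained network so you can score with it later.

z.save('matcher.model')              # serialize the trained CNTK function
# later / elsewhere:
# z = C.load_model('matcher.model')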

Alright, so how does this thing actually perform against the fabricated data set?  Perfectly.

The Output Logs

Training 1579 parameters in 16 parameter tensors.
Learning rate per minibatch: 0.3
 Minibatch[   1-   1]: loss = 0.692847 * 334, metric = 62.57% * 334;
Finished Epoch[1 of 10]: [Training] loss = 0.690255 * 10325, metric = 58.48% * 10325 1.226s (8421.7 samples/s);
 Minibatch[   1-   1]: loss = 0.686975 * 333, metric = 62.16% * 333;
Finished Epoch[2 of 10]: [Training] loss = 0.680637 * 9992, metric = 62.50% * 9992 0.492s (20308.9 samples/s);
 Minibatch[   1-   1]: loss = 0.669446 * 333, metric = 61.86% * 333;
Finished Epoch[3 of 10]: [Training] loss = 0.614048 * 9987, metric = 62.49% * 9987 0.476s (20981.1 samples/s);
 Minibatch[   1-   1]: loss = 0.485046 * 333, metric = 62.16% * 333;
Finished Epoch[4 of 10]: [Training] loss = 0.374500 * 9989, metric = 82.91% * 9989 0.496s (20139.1 samples/s);
Learning rate per minibatch: 0.015000000000000001
 Minibatch[   1-   1]: loss = 0.265994 * 333, metric = 100.00% * 333;
Finished Epoch[5 of 10]: [Training] loss = 0.260414 * 9987, metric = 100.00% * 9987 0.516s (19354.7 samples/s);
 Minibatch[   1-   1]: loss = 0.256089 * 333, metric = 100.00% * 333;
Finished Epoch[6 of 10]: [Training] loss = 0.248763 * 9989, metric = 100.00% * 9989 0.493s (20261.7 samples/s);
 Minibatch[   1-   1]: loss = 0.241517 * 332, metric = 100.00% * 332;
Finished Epoch[7 of 10]: [Training] loss = 0.238082 * 9988, metric = 100.00% * 9988 0.505s (19778.2 samples/s);
 Minibatch[   1-   1]: loss = 0.231504 * 332, metric = 100.00% * 332;
Finished Epoch[8 of 10]: [Training] loss = 0.227827 * 9988, metric = 100.00% * 9988 0.479s (20851.8 samples/s);
 Minibatch[   1-   1]: loss = 0.223481 * 333, metric = 100.00% * 333;
Finished Epoch[9 of 10]: [Training] loss = 0.217725 * 9987, metric = 100.00% * 9987 0.498s (20054.2 samples/s);
 Minibatch[   1-   1]: loss = 0.210213 * 332, metric = 100.00% * 332;
Finished Epoch[10 of 10]: [Training] loss = 0.207697 * 9989, metric = 100.00% * 9989 0.503s (19858.8 samples/s);

Freaking nice.  Do realize the metric here is accuracy (the fraction of correct predictions), not an error rate or loss, so 100% is exactly what we want.  Great freaking stuff!
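
To actually score a new sequence with the trained model, something like this works (a minimal sketch; the two-timestep sequence here is made up):

import numpy as np

# one candidate sequence: two timesteps of the 4-dim multi-hot features
seq = np.array([[0, 1, 1, 0],
                [0, 1, 1, 0]], dtype=np.float32)
p = z.eval({x: [seq]})[0, 0]   # sigmoid output in [0, 1]
print('match' if p > 0.5 else 'non-match', p)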

Summary

So this example uses some dummy data, but replace the data, up the number of neurons, increase the number of epochs, and reduce the minibatch size (to account for longer input sequences), and in reality you get really good results, similar to this but not exactly.  Real data versus synthesized data is always different; reality versus perfection, right?  Anyway, it works on real data as well.
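
For concreteness, the knobs that paragraph is talking about are just these settings from the listing above (the values below are illustrative, not the ones I used on the real data):

input_dim      = 40    # width of your real multi-hot feature vectors
hidden_dim     = 64    # more neurons
minibatch_size = 200   # smaller minibatches for longer sequences
max_epochs     = 100   # more epochs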

Final Notes

I wish I had enough time to really track this and write it up properly, but post questions and I’ll do my best to answer them.  This approach to the problem showed significant promise.

 
