This paper would not have been possible without the guiding hand of Mr. Michael Stancik. His advice was invaluable when starting this research paper and his commitment to my success despite the challenges that I faced was inspiring.

Problem or Need

The goal of this research project was to create a machine learning algorithm capable of learning basic visual features from a video, which is a chronological sequence of images. The features that the algorithm learns should be similar to those found in the primary visual cortex of the mammalian brain. In addition, the learning process of the algorithm should model the way the mammalian brain learns more accurately than current algorithms do, primarily by learning in a dynamic fashion on only a few frames of information at a time. The majority of machine learning algorithms used for tasks involving visual input, such as images, rely on precompiled static datasets to learn their features. While this approach is useful in applications like handwriting recognition and object classification, it is often computationally intensive, and storing all of the images used for learning takes large amounts of memory. Furthermore, the learning method of these algorithms does not accurately model the learning process of the mammalian brain, one of the greatest learning machines that we know of. The mammalian brain learns from a constant stream of visual data: as soon as light reaches the receptors in our eyes, that raw data is immediately processed by the primary visual cortex and sent down neural pathways to various parts of the brain. Creating a machine learning algorithm that learns on a stream of data, like the mammalian brain, would significantly reduce the amount of temporary storage needed to process the data, and the algorithm would be able to adapt to new data. The ability to adapt would be essential for applications where the initial conditions are unknown or uncertain and learning must be done on the fly, such as robots acting as first responders to natural disasters.
This design investigation will attempt to create an algorithm that learns on a dynamic dataset yet learns features similar to those produced by algorithms that train on static datasets.

Background Research

The Mammalian Brain, Learning, and the Visual Cortex

The mammalian brain is made up of many neurons, which can be thought of as basic input/output devices and are the building blocks of the brain. The brain is a plastic organ, meaning that physical changes occur in the brain as a direct result of learning. For example, a piece of a memory might be formed by changing the strength of the connections between neurons or by connecting two neurons together for the first time. The exact process of learning is still unknown, but several theories exist. In the late 1950s, two scientists, David Hubel and Torsten Wiesel, recorded the activity of neurons in a cat's primary visual cortex while showing the cat videos of black lines oriented at several angles and moving in different directions. They found that specific neurons were highly responsive when a line with a specific orientation and motion was on the screen. From this they concluded that there are orientation-selective neurons in the primary visual cortex of the brain, and that we decompose the world around us into lines and edges. Other experiments have shown that the primary visual cortex of the brain undergoes dramatic change at early ages: when a cat was presented with only limited visual angles at an early age, it was only able to identify those limited angles at a later age, making it effectively blind to all other angles (Segev, 2015).

Machine Learning

Machine learning is a subdomain of artificial intelligence that is concerned with creating algorithms that learn functions from data. Some computational problems are very difficult to program manually and would be quite inaccurate if they were programmed in that manner. Problems like these, such as object recognition and speech recognition, are best solved using machine learning algorithms (Schapire, 2008). The most common machine learning algorithms are created to learn an unknown function f : X → Y, where X is a set of inputs and Y is the set of corresponding outputs (Mitchell, 2006). In a task such as speech recognition, each element of X would consist of a small segment of recorded audio and each element of Y would contain the corresponding syllable spoken in that audio segment. The goal of this speech recognition system would be to learn the function f that maps each element of X to its corresponding element of Y as accurately as possible. In machine learning, it is convenient to think of X and Y as vectors, with the i-th element of each vector denoted xi and yi respectively. This is because most machine learning algorithms utilize fast matrix multiplication and linear algebra libraries to carry out their calculations. Other machine learning algorithms, called clustering algorithms, receive only the inputs X without any labels Y and attempt to find some structure in the data by grouping related inputs (Mitchell, 2006).

There are many other types of machine learning algorithms that are capable of solving some very challenging problems such as handwriting recognition, spam filtering, medical diagnosis, fraud detection, and much more (Schapire, 2008). The reason machine learning is often used for tasks like these is that people are very good at labeling things but often have difficulty describing precisely how they knew to give a certain object a certain label (Schapire, 2008). For example, if a person were shown a picture of a cat, they would be able to say that the picture contained a cat. However, when asked precisely how they knew it was a cat, what went on in their brain that let them know it was a cat, they would be unable to answer. As a result, when faced with a problem where the algorithm necessary to solve it is very complex, it is often necessary to have the computer come up with the algorithm while the person labels the necessary data (Mitchell, 2006).

Neural Networks

A neural network, sometimes known as an artificial neural network (ANN), is a type of machine learning algorithm inspired by the way neurons are connected in the brain (Stergiou, 1997). Several neurons are connected to another neuron by connecting their axons to the dendrites in the dendritic tree of the given neuron. The axons and dendrites are connected by synapses, which transmit the electrical information from the axon to the dendrite. The electrical signals from all of the neurons connected to a neuron accumulate in the neuron’s cell body until a certain threshold is reached, which is when that neuron fires an electrical signal down its axon and into other neurons through their dendrites (Segev, 2015).

An ANN behaves in a similar way, but on a much smaller scale than an actual brain (Burger, 2004). At the core of an ANN is an artificial neuron which takes several inputs from other neurons, performs a calculation on those inputs, and then outputs a value based on that calculation (Nielsen). Traditionally, the inputs of an artificial neuron are represented as a vector X and the i-th element in vector X is labeled as xi.

Additionally, there is a weight associated with each input element which represents the importance of that input to the output. The weights of the inputs are also represented as a vector, W, with wi corresponding to the i-th element of W. The most common artificial neuron is called the sigmoid neuron, which passes the weighted sum of its inputs through a sigmoid function to produce an output between 0 and 1 (Nielsen, 2015). The formal equation for this is σ(Σi wi*xi), where the sigmoid function is σ(z) = 1/(1 + e^(-z)). The sigmoid function is often chosen because its value is always between 0 and 1 and because it has a convenient derivative, which is essential for learning. The sigmoid function is the activation function of the artificial neuron and is one of many that can be used. It is also common for the sigmoid neuron to include a bias value, typically denoted b, which represents how easy it is for the neuron to fire. With this in mind, the equation for the output of a sigmoid neuron becomes σ(Σi wi*xi + b) (Nielsen, 2015). This structure is similar to how a neuron in the brain functions, where the output depends on the sum of the inputs.
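
The project itself was written in Octave, but the sigmoid neuron's computation can be sketched in a few lines of Python/NumPy (all variable names and values here are illustrative, not taken from the paper's code):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z)): squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_neuron(x, w, b):
    # Weighted sum of the inputs plus the bias, passed through the activation
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, 0.2, 0.8])   # inputs from other neurons
w = np.array([0.4, -0.3, 0.1])  # one weight per input
b = -0.1                        # bias: how easily the neuron fires
out = sigmoid_neuron(x, w, b)   # a value strictly between 0 and 1
```

With a zero bias and an all-zero weighted sum, the output is exactly σ(0) = 0.5, which illustrates why the bias controls how easily the neuron fires.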

When multiple sigmoid neurons are connected together they form an ANN. In most ANNs, these connected neurons are organized into layers. The first layer of an ANN is the input layer, which consists of neurons that represent the input to the ANN and simply pass their information on to the next layer. Next come the hidden layers, of which there can be more than one. Each neuron in each hidden layer receives weighted inputs from the previous layer, sums those weighted inputs, applies the sigmoid function, and passes the output of the sigmoid function along to the next layer. The final layer is the output layer; in tasks like image classification, the values of the output neurons are treated as the probability that the given image belongs to a certain category (Burger, 2004).

A simple example of an ANN is a neural network used to classify handwritten digits. This network takes in a grayscale image 28 by 28 pixels in dimension and outputs ten probabilities representing how strongly the network believes the given image is an image of each of the digits from zero to nine. This network therefore contains 784 (28*28) input neurons, 15 hidden neurons, and 10 output neurons. The input neurons correspond to the intensity of each pixel, with a value of 0 representing white and 1 representing black. The output neurons output numbers between 0 and 1. If the 1st output neuron outputs a 1, the network is very sure that the image is a 0 (because the 1st output neuron corresponds to the digit 0) (Nielsen, 2015).

The way we would train this network to recognize digits is by giving it many training examples, each consisting of a 784-dimensional vector representing the intensity of each pixel and a 10-dimensional vector representing the probability that the image is each of the ten digits. For example, if a training example were an image of a 3, then the fourth element of the 10-dimensional vector would be 1 and all other elements would be 0, meaning the probability that the image is a 3 is 100% and the probability that it is any other digit is 0%. The network is then trained using backpropagation, which calculates the error of each neuron (excluding the input neurons) and then updates the weights of the connections going into each neuron according to the partial derivative of the error with respect to those weights. The typical way to calculate the error for a single training example is with the squared error function, (1/2)*||x − y||^2, where x is the vector that the network currently outputs for the training example and y is the desired output vector for that example. Because both x and y are vectors, we sum the squares of the elements of x − y to get a single number. When x = y, the error is 0, so we are essentially trying to minimize the error (Nielsen, 2015).
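
As a concrete check of the squared error function, here is a short Python/NumPy sketch (the digit probabilities below are invented for illustration):

```python
import numpy as np

def squared_error(output, target):
    # E = (1/2) * ||output - target||^2, summed over the vector's elements
    return 0.5 * np.sum((output - target) ** 2)

# Hypothetical example: the desired output for an image of a "3" is a
# one-hot vector; the network currently spreads some probability around.
target = np.zeros(10); target[3] = 1.0        # desired output vector y
output = np.full(10, 0.05); output[3] = 0.55  # current network output x
err = squared_error(output, target)
# A perfect output (x = y) would give an error of exactly 0.
```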

A good way to think about how a neural network learns is as an optimization problem in which we are trying to minimize the error function (1/2)*||x − y||^2 in terms of x, which depends upon the weights in the network. If the network consisted of only two weights v1 and v2 and we let the value of the error function be C, then we can model C in terms of v1 and v2 like so:

The function is shaped like a bowl because the error is quadratic in its input, making it a three-dimensional paraboloid in this case. The goal is to minimize this function, to get C as close to zero as possible. If we were to start at some random point on that surface, the way to get to the bottom would be to go down the slope of the curve until the only way left to go was up. This is exactly what gradient descent, driven by backpropagation, does: given a starting position, backpropagation finds the partial derivative of the error with respect to each of the weights, which is the slope of the curve at that point, and then a step is taken down the curve by subtracting each partial derivative from its corresponding weight. By iteratively calculating the derivative and taking a small step down the slope, we soon arrive at the bottom of the curve (Nielsen, 2015).
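
The descent-down-the-bowl picture can be sketched directly on a toy two-weight error surface (Python/NumPy; this stands in for the real network's error function):

```python
import numpy as np

def C(v):
    # A simple bowl-shaped error surface over two weights v1, v2
    return v[0] ** 2 + v[1] ** 2

def grad_C(v):
    # Partial derivatives of C with respect to each weight (the slope)
    return np.array([2.0 * v[0], 2.0 * v[1]])

v = np.array([3.0, -2.0])  # a random starting point on the surface
alpha = 0.1                # size of each step down the slope
for _ in range(100):
    v = v - alpha * grad_C(v)  # step in the direction opposite the slope
# v is now extremely close to the bottom of the bowl at (0, 0)
```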

Sparse Autoencoder

An autoencoder is an ANN that learns features from unlabeled data. For images, an autoencoder tries to find a compressed representation of an image by finding an approximation to the identity function, f(x) ≈ x. It does this by setting both the input and the output of the ANN to the same image and then limiting the number of hidden neurons. By limiting the number of hidden neurons, the network is forced to find structure in the data. The weights that go into the hidden layer are meant to find a compressed version of the input data, and the weights that exit the hidden layer are meant to reconstruct that compressed data into the original data. For the majority of algorithms, the compression aspect of the autoencoder is the primary focus (Ng, 2013).

The basic autoencoder can be extended to a sparse autoencoder by adding an additional term to the error function called the sparsity term. The sparsity term is introduced so that even when the size of the hidden layer is larger than the input layer, the autoencoder can still find some structure in the data. If the hidden layer were larger than the input layer, one would expect the autoencoder to simply learn the identity function, f(x) = x, by setting a few weights equal to one and all the rest equal to zero, creating a perfect reconstruction of the data (error = 0). This problem can be solved with sparsity. Sparsity means that we want the average value that a neuron outputs to be close to zero. This ensures that the neuron outputs a value near one only rarely, meaning that it has to find a more compressed form of the data than the one represented by the identity function. The average activation of an individual neuron is (1/m)*Σ(i=1..m) a(xi), where m is the number of training examples and a(xi) represents the output of the neuron when given the i-th training example. The sparsity term that gets added to the error function is Σ(j=1..h) [p*log(p/sj) + (1−p)*log((1−p)/(1−sj))], where p is the sparsity parameter, or the desired average output of a neuron, h is the number of hidden neurons, and sj is the average output of the j-th neuron. This function was chosen because it has a shape similar to a quadratic function and because it produces a convenient derivative, −p/sj + (1−p)/(1−sj), for the weights going into the j-th neuron (Ng, 2013).
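
A minimal Python/NumPy sketch of the sparsity term and its derivative (the activation values below are invented for illustration; the project's implementation was in Octave):

```python
import numpy as np

def sparsity_penalty(s, p):
    # Sum over hidden neurons of p*log(p/s_j) + (1-p)*log((1-p)/(1-s_j)),
    # which is zero when every average activation s_j equals p.
    return np.sum(p * np.log(p / s) + (1 - p) * np.log((1 - p) / (1 - s)))

def sparsity_grad(s, p):
    # Derivative term added during backpropagation for each hidden neuron:
    #   -p/s_j + (1-p)/(1-s_j)
    return -p / s + (1 - p) / (1 - s)

p = 0.035                        # desired average activation
s = np.array([0.035, 0.2, 0.5])  # measured average activations of 3 neurons
penalty = sparsity_penalty(s, p)
# The penalty grows as activations drift away from p, pushing the
# autoencoder toward neurons that fire rarely.
```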

Design Plan

For this project, I used the Octave programming language, a free, open-source, high-level language with built-in libraries for fast linear algebra computations. Octave was also chosen for its readability and its support for rapid prototyping of complex algorithms requiring many matrix calculations. All of the tests were performed on a late-2011 MacBook Pro with a 2.2 GHz Intel Core i7 processor and 4 GB of 1333 MHz DDR3 RAM.

The first step of creating the algorithm was implementing a sparse autoencoder. A sparse autoencoder was chosen because it has been successfully used many times to extract visual features of varying abstraction from large sets of natural images. An example of a successful sparse autoencoder is Google's deep learning algorithm, which learned to recognize images of cats and people by observing 10 million unlabeled images sampled from YouTube videos. The algorithm used in that project was a modified sparse autoencoder that was trained over several days on very fast computers (Le et al., 2012).

The first step in implementing the sparse autoencoder was writing the code to create initial values for all of the weights in the autoencoder. This was done with a function called initializeParameters, which initialized the weight matrices of each layer to very small random numbers. This guaranteed that the neurons would initially try to learn features different from one another and that they would not immediately be stuck on one of the two plateaus of the sigmoid function, which would increase learning time. The second step was to write a function that would calculate a feedforward pass through the network for all of the images in a given dataset in order to compute the error, or cost, of the network with the current weights. This allowed a periodic progress check to make sure that the algorithm was actually converging to a solution. This was implemented with the following code, where W1, W2, b1, and b2 stand for the first and second weight matrices and the first and second bias vectors respectively, lambda is the regularization parameter, m is the number of images in the dataset, beta is the weight of the sparsity term, and sparsityParam is the sparsity parameter:
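
The original Octave listing is not reproduced here; as a hedged sketch only, a feedforward cost function of this shape might look as follows in Python/NumPy (variable names are my own, and the sparsity term is omitted for brevity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_cost(W1, b1, W2, b2, data, lam):
    # data: n-by-m matrix with one unrolled image per column
    m = data.shape[1]
    a2 = sigmoid(W1 @ data + b1[:, None])  # hidden-layer activations
    a3 = sigmoid(W2 @ a2 + b2[:, None])    # reconstruction of the input
    # Average squared reconstruction error plus weight regularization
    recon = (0.5 / m) * np.sum((a3 - data) ** 2)
    reg = (lam / 2.0) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return recon + reg

# Tiny hypothetical network: 4 inputs, 2 hidden neurons, 5 fake "images"
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(2, 4)); b1 = np.zeros(2)
W2 = rng.normal(scale=0.01, size=(4, 2)); b2 = np.zeros(4)
data = rng.random((4, 5))
cost = autoencoder_cost(W1, b1, W2, b2, data, lam=0.003)
```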

In the code, the image dataset is represented by an n by m matrix where the i-th column contains the vector representing the i-th image in an unrolled format (think of it as taking an image and reordering all of the pixels in a long, one pixel wide line). The next step was to implement the backpropagation algorithm, which calculates the derivatives of the error with respect to each weight matrix. It then combines all of the derivative matrices into one long vector which it outputs to be used by the optimization function.

The next step involved writing the code to test the sparse autoencoder. This was done using minFunc, a library containing very fast and efficient optimization algorithms (Schmidt, 2013). The optimization algorithm chosen from the library was the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (LBFGS) algorithm. LBFGS accepts a function that outputs the current error of the problem as well as the partial derivatives of all of the variables, and uses these to approximate the second-order partial derivatives of the variables. It does this in order to arrive at a solution much more quickly than a simple gradient descent algorithm, which takes constant-size steps and converges very slowly. This section of code was used to get a baseline for the learned visual features by training on a static dataset of thousands of images all at once. This is the traditional way the sparse autoencoder is used, and the results from this algorithm were compared to the results of the dynamically learning algorithm in order to determine the latter's success. The code used to call the LBFGS function is:
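
The project's actual call used Octave's minFunc; purely as an illustrative stand-in, SciPy's L-BFGS-B optimizer accepts the same pairing of a cost function with its gradient (the toy cost below is not the autoencoder's):

```python
import numpy as np
from scipy.optimize import minimize

def cost_and_grad(theta):
    # Stand-in for the autoencoder cost function: the optimizer only
    # needs the current error and the partial derivatives (gradient).
    cost = np.sum((theta - 3.0) ** 2)
    grad = 2.0 * (theta - 3.0)
    return cost, grad

theta0 = np.zeros(5)  # initial unrolled weight vector
result = minimize(cost_and_grad, theta0, jac=True, method='L-BFGS-B',
                  options={'maxiter': 400})
# result.x holds the optimized weights (all close to 3.0 in this toy case)
```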

Next, code was written to visualize the learned feature of each neuron and to normalize the pixel intensities of the image data from the traditional 0-255 range to 0.1-0.9. Finally, code was written to train the sparse autoencoder on a sequence of images, meaning that the sparse autoencoder would only get to learn on a few images at a time instead of thousands at once.

Results and Discussion

The initial design of the algorithm used colored image patches from the CIFAR-10 image dataset. The CIFAR-10 dataset consists of 60,000 32 by 32 pixel natural images, each labeled as belonging to one of ten categories. The labels were not used during this experiment, as the goal was to learn from unlabeled data. To keep the memory footprint manageable, only the first 10,000 images in the CIFAR-10 dataset were ever used throughout the experiment. The program first extracts 8 by 8 image patches from random positions in random images of the dataset, making sure that each image is used at most once. It then unrolls each 8 by 8 RGB image patch into a 192-dimensional (8*8*3) vector and appends it to an image patch matrix in which the i-th column represents the i-th unrolled image patch. Next, it runs zero-phase component analysis (ZCA) whitening on the image matrix to normalize the pixel data toward a Gaussian distribution and give the data a mean of zero. Finally, it initializes the weight matrices and starts learning using the specified algorithm.
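
A compact sketch of ZCA whitening in Python/NumPy (the epsilon value and random patch data here are assumptions for illustration, not the project's settings):

```python
import numpy as np

def zca_whiten(X, eps=0.1):
    # X: n-by-m matrix with one unrolled image patch per column
    X = X - X.mean(axis=1, keepdims=True)  # zero-mean each pixel row
    sigma = X @ X.T / X.shape[1]           # covariance matrix of the pixels
    U, S, _ = np.linalg.svd(sigma)
    # Rotate into the eigenbasis, rescale each component, rotate back
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T
    return W @ X

rng = np.random.default_rng(1)
patches = rng.random((192, 100))  # 100 unrolled 8x8x3 patches (fake data)
white = zca_whiten(patches)
```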

Several parameters could be modified throughout the experiment: the total number of images sampled from, the size of the square patches in pixels, the number of hidden neurons, the sparsity parameter, the weight of the sparsity term (beta), the regularization parameter (lambda), and the number of iterations the algorithm should perform to optimize the autoencoder. In Andrew Ng's lecture notes on the sparse autoencoder applied to RGB images, he set the sparsity parameter to 0.035, lambda to 0.003, and beta to 5, so those were the values chosen for those parameters in this experiment (Ng, 2013). The initial run trained the autoencoder on the full set of images at once using the LBFGS algorithm, with the number of images set to 3000, the image patch size set to 8 pixels, the number of hidden units set to 225, and the number of iterations set to 600. These values are all very similar to those used by Andrew Ng in his lecture notes (Ng, 2013). This first run established a baseline for the kind of features that would qualify a learning algorithm as meeting the design criteria, and it was used to check the performance of later algorithms that learned on the data in batches or in an online fashion. The feature that each of the autoencoder's neurons learned to detect after the first run is shown below.

The neurons highlighted in red boxes are some that have learned valuable edge detectors. An edge detector can be seen wherever there is a sharp shift between two colors or intensities in a neuron's learned feature. It is also apparent that this autoencoder has learned many edge detectors at different angles and of different shapes, indicating that it learned a good representation of all of the images fed into it. These features are also similar to the feature detectors found in a cat's primary visual cortex. The criterion for determining the success of later algorithms was that an algorithm producing many different feature detectors similar to the edge detectors learned by the baseline would be considered successful.

Now that a baseline had been set, an algorithm was created to train the autoencoder on only a few images at a time, known as a batch of images, for a small number of LBFGS iterations, and then to repeat this process for the next batch of images. This introduced a new variable parameter, the batch size, which represented the number of image patches trained on in each batch. Unfortunately, it turned out that LBFGS was simply too aggressive and only optimized the autoencoder on the features represented in the first or second batch. When the number of hidden units or the number of iterations was decreased to try to force a more general representation of the data, the algorithm still optimized itself too quickly. When the batch size was decreased, the algorithm didn't learn anything useful at all. Below are visualizations of the final autoencoder for three of the trials during this phase of the experiment. Despite several variations to the batch size and the number of iterations, the autoencoders clearly didn't learn any useful features. In the case of trial 7, the autoencoder did learn two or three features, but it didn't learn a generalization of all of the data; it only generalized to the first few patches it was trained on.

Due to the inability to control the LBFGS algorithm, a gradient descent algorithm was programmed: a basic optimization algorithm that calculates the derivatives of the weights and updates each weight by subtracting its partial derivative multiplied by a learning rate, called alpha. Alpha was initially set to 0.01, a value small enough to ensure that the algorithm didn't oscillate around the global minimum. However, this turned out to be too small, and the algorithm was not able to learn any useful features by the time it had finished learning on all batches. Thus, alpha was increased to 0.1. Although this did diversify the learned features, they did not resemble the edge detectors seen in the baseline. Some of the trial visualizations for the gradient descent algorithm are shown below.

In an effort to simplify the process and decrease the runtime needed to optimize the autoencoder, the color patches were turned into grayscale patches, meaning that the image patch vectors would only be 64 (8*8) dimensional vectors instead of 192 (8*8*3). Before testing any batch algorithms, a baseline test was run again on the grayscale image patches. The algorithm used was LBFGS and all of the patches were trained on at the same time. For this trial, the number of images was 6000, the patch size was increased to 16 to make the visualizations clearer, the number of hidden units was set to 225, and the number of iterations was set to 800. The sparsity parameter, lambda, and beta, were set to 0.01, 0.0001, and 3 respectively, which are the values that Andrew Ng used for training autoencoders on grayscale images in his lecture notes (Ng, 2013). Additionally, the ZCA whitening step was removed in favor of simply normalizing the pixel values to zero mean and then mapping each pixel value to a value between 0.1 and 0.9. The resulting visualization is shown below. The types of edges that have been learned are also much clearer in this visualization, due to the increase in the patch size.

After the baseline was set, a modification was made to the gradient descent algorithm based on the current results. It was noted that with a small alpha, while training on the color patches, the autoencoder learned too slowly and didn't learn any useful features; however, when alpha was set too high, it specialized too quickly to the first few image patches it was trained on. The new gradient descent algorithm worked in three stages. It started with a small base alpha and a small base number of iterations, and ran gradient descent for the base number of iterations at the base learning rate. This got the autoencoder into a position to generalize. Then it increased the number of iterations and alpha by a factor of 10 and ran gradient descent again, retaining the state of the autoencoder from the previous run and building off of it. This allowed it to learn at a faster rate as it approached the solution, since learning typically slows down with a constant learning rate as the autoencoder is optimized. It then multiplied the number of iterations and alpha by 10 once more, ran gradient descent, and finished. The results were better than before: the algorithm starts to learn different types of edges, but there is still a lot of noise and the edges aren't very clear. Additionally, the autoencoder doesn't seem to generalize well and only picks up on a few different types of angles.
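
The three-stage schedule can be sketched on a toy error surface (Python/NumPy; the base values below are illustrative, not the project's actual parameters):

```python
import numpy as np

def grad_step(w, alpha):
    # One gradient descent step on the bowl C = ||w||^2 (gradient is 2w)
    return w - alpha * 2.0 * w

w = np.array([2.0, -1.5])  # weights carried across all three stages
alpha, iters = 0.001, 10   # small base learning rate and iteration count
for stage in range(3):
    for _ in range(iters):
        w = grad_step(w, alpha)
    alpha *= 10            # 10x more aggressive each stage...
    iters *= 10            # ...with 10x more iterations, building on the
                           # state left by the previous stage
```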

It appeared that this algorithm needed to be more aggressive to get rid of the noise, and it needed some way of generalizing more. The answer to the first problem was to use a separate learning rate for each weight and to update each learning rate based on that weight's performance. For example, if at iteration 10 a certain weight has a positive derivative, and then at iteration 11 it has a negative derivative, that means the weight overshot the minimum and its learning rate needs to be turned down very quickly. However, if at iteration 11 the derivative is still positive, then we are still making our way towards the minimum and can increase the learning rate a little to speed things up. This is sometimes called an adaptive learning rate (ALR). The second problem, generalization, was solved with a realization. For the majority of the trials the batch size had been set to very low numbers, because it had been reasoned that the brain can only store a few frames' worth of visual data in the primary visual cortex before the data moves on to be processed by other parts of the brain. This would mean that the primary visual cortex learns feature detectors by analyzing only a few frames of data at a time. The goal for some time had been to get the batch size under 5 and still learn many different features.
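
A minimal sketch of the per-weight rate update described above, run on a toy bowl-shaped error in Python/NumPy (the increase and decrease factors are my assumptions; the paper does not state its exact values):

```python
import numpy as np

def alr_update(w, grad, prev_grad, rates, up=1.2, down=0.5):
    # If a weight's gradient keeps its sign, nudge that weight's rate up;
    # if the sign flips, the weight overshot, so cut its rate sharply.
    same_sign = grad * prev_grad > 0
    rates = np.where(same_sign, rates * up, rates * down)
    return w - rates * grad, rates

w = np.array([5.0, -4.0])     # two weights with separate learning rates
rates = np.full(2, 0.01)
prev = np.zeros(2)
for _ in range(200):
    grad = 2.0 * w            # gradient of the toy error C = ||w||^2
    w, rates = alr_update(w, grad, prev, rates)
    prev = grad
# Each weight's rate grows while it descends steadily and shrinks when
# it overshoots, so both weights settle near the minimum at zero.
```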

It was realized that each image in the batch represented only a single patch from a single image, and that a single 32 by 32 pixel image contains 16 completely non-overlapping 8 by 8 patches. However, 16 patches would still not be enough, so a sliding window was used to slide the 8 by 8 patch window across the 32 by 32 image by a constant step size less than 8, moving down the image by the step size at the end of each row. With a step size of s, a 32 by 32 image with an 8 by 8 patch window yields ((32-8)/s+1)*((32-8)/s+1) patches. When s=4, this gives 49 patches, a large increase from 16.
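
The patch-count formula can be verified with a short sliding-window sketch (Python/NumPy; the project's actual extraction code was in Octave, so the names here are illustrative):

```python
import numpy as np

def extract_patches(image, patch=8, step=4):
    # Slide a patch-by-patch window across the image with the given step,
    # collecting one unrolled patch per window position.
    h, w = image.shape
    patches = []
    for r in range(0, h - patch + 1, step):
        for c in range(0, w - patch + 1, step):
            patches.append(image[r:r + patch, c:c + patch].ravel())
    return np.array(patches).T  # one unrolled patch per column

img = np.arange(32 * 32, dtype=float).reshape(32, 32)
P = extract_patches(img, patch=8, step=4)
# ((32-8)/4 + 1)^2 = 49 patches of 64 pixels each, matching the formula
```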

Additionally, if the current image contains many different angles and objects, this method will be able to capture all of those features and be able to build a more complete generalization for the entire dataset. By combining the sliding window and the separate ALRs, the algorithm was substantially improved and was able to learn important general features from the dataset. The algorithm also introduced a few new parameters into the algorithm. These included a slide step parameter which represented the step size of the sliding window, the initial alpha and the convergence alpha for the separate learning rates, the number of sequential frames of image data to train on at once (the image set size), and the number of iterations to perform gradient descent for the current batch.

Once the success of the algorithm was determined on the CIFAR-10 dataset, a new dataset was introduced. Because the primary visual cortex learns from a continuous stream of input with some correlation between successive images, not from random images like those in the CIFAR-10 dataset, it was decided to capture short videos of certain environments and extract the image frames from each one. The first video was of a home interior, 161 seconds long and recorded at 30 frames per second. This yielded about 4830 (161*30) images, and each image was scaled down to 100 by 56 pixels to keep the image sizes manageable and learning relatively fast. The autoencoder was trained on this sequence of images in sequential order (as opposed to the random order of previous algorithms) with an image patch size of 12. The state of the autoencoder after learning on the last set of frames is shown below.

This algorithm learned many more general features than the previous algorithms and learned more defined edges. However, there was still a bit of noise, and the features were not as defined as desired. So the number of iterations was increased to decrease the noise, and the slide step was decreased to 4 to increase the total number of patches per image and produce more general results. The results of these modifications are shown below.

Finally, another video was captured outdoors by walking down a suburban lane for 270 seconds at 30 frames per second to produce approximately 8100 images (these images were also scaled down to 100 by 56 pixels). Additionally, while training the algorithm, the state of the network after training on every image set was recorded to allow viewing of intermediate results. Below are some of the intermediate results.


After testing and modifying the learning algorithm for the sparse autoencoder several times, the algorithm that best fulfilled the purpose of the investigation was a gradient descent algorithm with an adaptive learning rate for each individual weight. Additionally, a sliding window was used to extract enough image patches from each image for the algorithm to learn important features. This algorithm was considered the most successful because it produced the most generalized feature detectors, similar to those seen in the baseline algorithm. Neurons with these kinds of feature detectors have also been found in the primary visual cortex of the mammalian brain. Finally, this algorithm learned these feature detectors while looking at only three frames of video at any given time, meaning that its method of learning is much closer to the way the brain actually learns than that of traditional autoencoders.
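A per-weight adaptive learning rate can be sketched with a sign-agreement rule (a delta-bar-delta/Rprop-style update; the paper's exact rule and constants are not reproduced here, so treat the multipliers and bounds below as illustrative): each weight's rate grows while its gradient keeps the same sign and shrinks when the sign flips, which lets the update be aggressive early and gentle near convergence.

```python
import numpy as np

def alr_step(W, grad, rates, prev_grad,
             up=1.2, down=0.5, alpha_min=1e-4, alpha_max=0.5):
    """One gradient-descent step with a separate learning rate per weight.

    Rates grow where the gradient sign is stable (safe to be aggressive)
    and shrink where it flips (the weight is oscillating near a minimum).
    """
    agree = np.sign(grad) == np.sign(prev_grad)
    rates = np.clip(np.where(agree, rates * up, rates * down),
                    alpha_min, alpha_max)
    return W - rates * grad, rates, grad.copy()

# Toy use: minimize f(W) = 0.5 * ||W||^2, whose gradient is simply W.
W = np.array([5.0, -3.0])
rates = np.full_like(W, 0.1)   # the "initial alpha"
prev = np.zeros_like(W)
for _ in range(100):
    W, rates, prev = alr_step(W, W, rates, prev)
```

Because each weight adapts independently, weights that are far from their optimum keep large steps while nearly converged weights settle down, which is the behavior the varying "aggressiveness" described above requires.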

While the preliminary algorithm was unsuccessful and did not meet any aspect of the design criteria, it and the other unsuccessful algorithms provided valuable insight into the inner workings of the algorithm as it was learning. This cannot be overlooked, as these insights ultimately inspired the final design. The final algorithm was able to outperform the previous ones because it could look at more sections of an image at once and because its learning rates could vary their aggressiveness. The absence of the latter was a key downfall of some of the previous algorithms: they were built with either too much or too little aggressiveness in mind, causing them either to converge to false optima too quickly or to fail to converge at all within a reasonable time span. The final algorithm was also able to learn from video sequences, where consecutive images are typically very similar but the dominant features can change from one second to the next. The results from the video of a home interior show that it has some ability to remember visual features from several seconds in the past. However, the intermediate results from the video of a suburban lane show that it can very quickly overwrite dominant features from previous times with the dominant features of the present. While this is one of several aspects of the algorithm that must be improved before it can be considered on par with learning in the mammalian brain, it is a large step in the right direction.


Burger, J. T. (2004, June 14). A basic introduction to neural networks. Retrieved from

Communication. (2011, September 27). Retrieved from

Le, Q. V., Ranzato, M., Monga, R., Devin, M., Chen, K., Corrado, G. S., Dean, J., & Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning [PDF document]. The International Conference on Machine Learning. Edinburgh, Scotland. Retrieved from

Mitchell, T. M. (2006, July). The discipline of machine learning [PDF document]. Retrieved from

Ng, A., Ngiam, J., Foo, C. Y., Mai, Y., & Suen, C. (2013, April 7). UFLDL tutorial. Retrieved from

Nielsen, M. A. (2015). Neural networks and deep learning. Retrieved from

Schapire, R. (2008, February 4). Theoretical machine learning. Lecture presented in Computer Science 511 at Princeton University, Princeton. Retrieved from

Schmidt, M. (2013). minFunc [Computer software]. Retrieved from

Segev, I. (2015, October 14). Neurons as plastic/dynamic devices. Lecture presented in Synapses, Neurons and Brains at the Hebrew University of Jerusalem, Jerusalem. Retrieved from

Stergiou, C., & Siganos, D. (1997, May 12). Neural networks. Retrieved from

“Boulevard of Broken Dreams” by Green Day.

This song is about loneliness and being an outcast from society, being the only one out and having nobody around. This loneliness and alienation very much characterize Raskolnikov, who has been unable to connect well with people throughout the novel. His insistence on pushing people away, such as when he first meets Razumikhin, and even his initial rejection of his sister over her then-planned marriage, means that he has very few emotional attachments and is often lonely and dispirited. This loneliness is shown in the first verse of the song: "I walk a lonely road, the only one that I have ever known. Don't know where it goes, but it's home to me and I walk alone." The narrator conveys that he is often alone, as this lonely road is the only one he knows and because he calls it home. The phrase "don't know where it goes" signifies that he is lost in his world; he doesn't really have a purpose, which is similar to Raskolnikov's situation. Raskolnikov doesn't seem to have any big plans for his future. He simply prefers to live life on a day-to-day basis without any concrete goals. Another line in the song also characterizes a major part of Raskolnikov. The line "I'm walking down the line that divides me somewhere in my mind" suggests a duality in the character of the narrator. He has a division in his mind, perhaps a second character, and he walks on the line that divides the two, meaning he likely often and suddenly shifts between them. We have definitely seen this duality in Raskolnikov, for example when he finds the girl on the bench and one moment is determined to save her from a predator and the next instant couldn't care less about her fate. Raskolnikov also often switches between his desire to reveal that he is the murderer and his hope that his actions will remain a secret. Raskolnikov's loneliness and duality have caused him great suffering and have driven him to madness.
This song's heavy, sad guitar riffs reinforce the sadness of the narrator's situation and hint at great suffering, a major theme in Crime and Punishment.

“Until You Were Gone” by The Chainsmokers and Tritonal, featuring Emily Warren.

This song very much reflects the latter half of the Marmeladov subplot, from the point where Marmeladov dies to the time when Katerina dies. The song is about not realizing how good someone was until he or she is no longer there. In other words, it is about regret and how you feel after you have lost someone close to you. It closely portrays how Katerina's life progressed after Marmeladov's death. The line "ever since I left you … it keeps getting worse" perfectly sums up Katerina's downfall. One way to look at Marmeladov's death is that Katerina left Marmeladov; she gave up on him and refused to forgive him. Even though she fussed over him in his final moments, she kept saying that she would be better off without him and that he was only a burden. Even though in the song the person that the narrator left is still alive, she says that "the break was binding," meaning that there is no chance of getting back together. The phrase "it keeps getting worse" tells the last part of Katerina's story, where she spends nearly all the money given to her on Marmeladov's memorial meal and is cast out of her home. She then frightens her own children as she forces them to dance and beg, showing how low she has sunk. Finally, the line "I'm burning on the inside and the truth is that I didn't know how good you were until you were gone" tells the unspoken part of Katerina's story. Even with Marmeladov's drinking problem and instability, Katerina seemed to be doing alright. She was rational and caring towards the children. Without Marmeladov, however, she enters a hopeless state of mind. She gets temporary respite from Raskolnikov's donation, but without Marmeladov she was destined to fall. Even though Katerina thought Marmeladov was a burden to her, she cannot seem to function without him, just like the song's narrator, who is pained because she left a person she cannot function without.

“Battle Scars” by Lupe Fiasco and Guy Sebastian.

The theme of suffering is very prevalent in Crime and Punishment, as Raskolnikov is constantly tormented by his actions and by other people. The song "Battle Scars" is about the scars you carry from all the things that have been done to you. These are mental scars, however, not physical ones. The song talks about how traumatic events can have lingering effects that you never forget; they become your scars. Raskolnikov's murder has left a large mental scar on him that we are constantly reminded of. Even though the song is mostly about the scars from a relationship, the effects of these scars are similar to Raskolnikov's feelings after committing the murder. For example, one of the lines from the song talks about the hurt that you feel from "The enemy within and all the fires from your friends." The "enemy within" is your own insecurity and madness, which Raskolnikov displays many times. There have been several instances where Raskolnikov's inner dialogue has gotten him worried or made him do irrational things. In general, after committing the murder, Raskolnikov is very jumpy and frightened whenever someone brings up anything to do with it. The people around Raskolnikov, his friends, seem only to cause him more distress and suffering. Towards the end of the song, the narrator says, "And I'm at the point of breaking, and it's impossible to shake it," meaning that all the emotional pain of his scars has pushed him to the tipping point and nothing can stop him from going over. This is similar to the climax of Crime and Punishment, when Porfiry plainly tells Raskolnikov that he knows Raskolnikov is the murderer and will soon arrest him. Raskolnikov has no response to this and seems to eventually accept the inevitability of his situation.
He even tells Svidrigailov that he is not intimidated by Svidrigailov's threat to go to the police and reveal his secret, because he knows that he has already lost. The overarching theme of this song is that people carry hidden emotional scars that are not always apparent, and that we need to be aware that they exist.

Natural disasters are an inescapable part of this world. Tornadoes, hurricanes, and earthquakes are a few of the countless natural disasters that affect the inhabitants of the earth. Earthquakes in particular cause many deaths and much property damage throughout the United States. Hardly a month goes by without news of at least one major earthquake happening somewhere in the world. Accounts of many deaths and hundreds of people injured and homeless usually accompany these stories of mass destruction. The larger earthquakes sometimes stimulate national efforts to aid those affected by these massive tremors. The sad news is that these earthquakes will keep coming and will continue to devastate every corner of the globe.

Earthquakes have been a part of human history since the beginning of mankind and have sometimes been a major factor in the fall of ancient civilizations ("Earthquake," Environmental Encyclopedia 2). Man's quest to understand earthquakes began with the ancient Greeks, who believed that they were caused by underground winds. Over the past couple thousand years we have increased our knowledge exponentially, especially with the invention of the modern seismograph in the 1900s ("Earthquake," Encyclopedia 1). Nevertheless, every year earthquakes devastate many countries around the world, causing many deaths and costing the affected governments millions and sometimes billions of dollars in damage. One of the most famous earthquakes to hit the United States was the 1906 San Francisco earthquake. This devastating earthquake hit on April 18, 1906 and caused about 3000 deaths, thousands of injuries, and about half a billion dollars of damage ("Earthquake," Environmental Encyclopedia 2). A more recent earthquake to hit California was the 2010 Baja California earthquake. It was the strongest earthquake to hit California after several years of relatively small seismic activity, and it was felt by many people living in the southwestern United States. It was so powerful that it actually moved the Mexican city of Calexico two and a half feet to the south (3). The most powerful earthquake ever recorded measured 9.5 on the Richter scale. It struck Chile on May 22, 1960 and caused 2000 deaths and more than half a billion dollars in damage. The resulting tsunami claimed lives all along the western American coastline and caused damage as far away as Japan (3). In the past decade, there have also been indications that human activities affect seismic activity.

Recent studies highlight the reality of man-made, or induced, earthquakes. A study by the US Geological Survey (USGS) found that the number of yearly earthquakes of magnitude 3 or greater in the United States has tripled since the turn of the century. Most of these earthquakes were linked to an increased amount of wastewater injection into the earth. Even though these earthquakes were small and barely felt, it is possible that wastewater injection has caused much more powerful earthquakes that have resulted in some damage to homes ("Induced Earthquakes" 1). There is also some evidence to suggest that the severity of earthquakes caused by wastewater injection increases with the amount of water pumped and the speed of pumping. The USGS does, however, say that these findings require more research to produce more convincing conclusions (2).

There are several proposed and significantly researched techniques that could be implemented in a short period of time to reduce the damage caused by earthquakes, the first of which is early warning systems. Unlike earthquake prediction, this method detects earthquakes that have already happened and warns people who haven't yet felt them that heavy shaking is imminent. Early versions of these systems are already operational in places like Caltech, where in March 2013 the system reported an incoming magnitude 5.2 earthquake over half a minute before the shaking arrived (Nagourney 1). These systems work in one of two ways. When a tectonic plate snaps back into place, it sends out two kinds of waves: primary and secondary waves. Primary waves are the faster of the two and cause barely any damage; secondary waves are responsible for all of the shaking felt during an earthquake. An early warning system can detect the faster primary waves and send out alerts that the secondary waves are on their way. Alternatively, sensors placed close to the center of an earthquake can broadcast alert signals that travel faster than the secondary waves (Normile 1). Such systems are already in place in parts of Mexico and Japan. In Mexico, the system is intended to give the capital city over a minute of warning if a large earthquake strikes along the Pacific coast. In Japan, it is currently used to stop bullet trains and warn people on the highways to slow down when a large earthquake is imminent (2). It is estimated that a modern system for California would cost around $100 million to develop and set up (Matheny 1). Cost is currently a big problem in California, and public awareness is necessary to raise the required amount (Nagourney 1).
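The warning window such a system provides follows directly from the two wave speeds. As a back-of-the-envelope illustration (the ~6.5 km/s and ~3.5 km/s figures are typical crustal P- and S-wave speeds assumed here, not values from the cited articles):

```python
def warning_time_s(distance_km, vp_kms=6.5, vs_kms=3.5):
    """Seconds between P-wave arrival (detection) and S-wave arrival
    at a site `distance_km` from the earthquake's origin."""
    return distance_km / vs_kms - distance_km / vp_kms

# A city 200 km from the origin would get roughly 26 seconds of warning;
# a city directly above the fault would get essentially none.
print(round(warning_time_s(200)))  # prints 26
```

This is also why the caveat discussed later matters: the warning time shrinks linearly with distance, so a quake directly beneath a city leaves almost no time to react.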

A cheaper solution for the government would be to get earthquake insurance to people who live in earthquake-prone areas. If more people had earthquake insurance, the government would have less to worry about in terms of disaster relief money. A proposed solution in the Philippines is the Earthquake Protection Insurance Co. (EPIC), which would offer more people earthquake insurance without seriously affecting the insurance industry. The Philippines is especially vulnerable to financial trouble after a natural disaster. It is estimated that flood damage costs the Philippines about $2.57 annually, and that a large earthquake could take about $114 billion to recover from, or about one-third of the Philippine GDP. That is a significant amount of money that the Philippine government would have a hard time recovering from. However, for EPIC to successfully reduce the financial burden of an earthquake, it would have to be mandatory so that everyone is covered (Torres 1). A similar strategy is being implemented for a different kind of natural disaster: floods. In Connecticut, the Federal Emergency Management Agency has released new flood mapping which puts many people living along the coast in high-risk flood zones (Spiegel 1). This new mapping, in combination with the National Flood Insurance Program, would significantly drive up the price of homes in flood zones. One person reported that if he had to pay $5000 for insurance, as one of his friends currently does, he would rather move than stay (2). If this same strategy were implemented for earthquakes, the cost of living in earthquake-prone areas would most likely prompt many people to move away, lessening the loss of life. Additionally, insurance on earthquake-damaged homes would relieve the government of having to pay for disaster relief and help keep the affected nation's infrastructure stable.

A third way to reduce earthquake damage is to construct buildings that will not break and topple during earthquakes. Japan is considered to have some of the toughest buildings on the planet, implementing the latest technology and strict codes to keep even skyscrapers standing during powerful earthquakes. Unlike those of the United States, Japanese building codes are much more specific about how a building must be engineered. After the 1995 Kobe earthquake, Japan spent large amounts of money researching and developing the best ways to make buildings more resistant to the intense ground shaking that occurs during earthquakes. Another method Japan employs is strengthening older buildings that were not built to the current codes and are most likely to be destroyed during major earthquakes (Glanz and Onishi 1). Assessment of the damage from the Kobe earthquake found that the majority of the buildings that fell had been built before the major changes to the building codes in 1971 and 1981 (Normile and Kerr 1-2). Many current earthquake-resistant buildings include modern devices to reduce the swaying of the building and the force put on it by the earthquake's energy. One such device used in Japanese buildings is the isolation pad. These rubber pads are placed at the bottom of a building's structure and cause the building to experience less shaking during an earthquake. Some buildings also use energy dissipation units, which are built into the skeleton of a building; these expand and contract to counteract the shaking caused by an earthquake and decrease the overall swaying of the building (Glanz and Onishi 1). After an earthquake hit Chile in early 2010, the Earthquake Engineering Research Institute of California sent many people to gather data on how Chile's buildings performed in the earthquake (Nelsen 1).
Chile shares similar building codes and materials with the United States, which gave researchers an opportunity to study how California's buildings would have fared in a similar earthquake. They found that the current building codes got many things right but also had some faults and room for improvement (2). Of the three proposed solutions, strong building codes are the most effective method for reducing the damage of earthquakes.

The flaws of the other two solutions outweigh the flaws of the building codes solution. Building codes protect people against earthquakes no matter where they hit. Early warning systems only work if the origin of the earthquake is far enough away; if a fault line runs right under a city and is the source of a large earthquake, an early warning system won't do the city much good. Stronger buildings would be the critical factor in reducing damage and saving lives, although if the earthquake were far enough away, anyone would of course want a minute to prepare for the massive shaking about to happen. Earthquake insurance, meanwhile, would only reduce property damage if high prices drove people out of earthquake-prone areas, and would only reduce financial damage if a large earthquake hit a highly populated area. This would matter more in countries with weaker economies than the United States, though it could still relieve some financial burdens for the United States after a large earthquake. Strong building codes would be the best way to save the government millions of dollars in property damage, because stronger buildings simply mean less damage. There are some drawbacks to building codes. For one, it is impossible to say with absolute certainty that a building will not collapse during any earthquake; a future earthquake could exceed the magnitude that buildings were designed for. Also, older buildings not built to the current codes still need to be reinforced to allow them to stay standing during powerful earthquakes.

Despite the drawbacks, implemented building codes have shown some of the best results in saving structures and lives during earthquakes. Compared to the other two solutions, building codes have proven themselves the most reliable and have shown their worth over and over again. Imagine living in a big city where all of the skyscrapers collapsed; imagine all of the lives that would be lost. The building codes currently implemented in earthquake hot spots around the world have allowed many skyscrapers to survive countless earthquakes and have saved many lives. There has also been a visible contrast between countries that strictly enforce their building codes and those that don't. In China, the consequences of not constructing buildings to code became very evident when the Sichuan earthquake struck in 2008; many people were killed who would have survived if the buildings had been built correctly (Glanz 2). Another smaller benefit of having a home or office built to the highest standard is the ability to attract buyers or workers with a guarantee that they will be safer there during an earthquake than in places built to a lower standard.

There is a strong need to act now to make sure we can be as safe as possible during future earthquakes. If this problem reaches a crisis point, there is no telling what kind of damage will result. It is uncertain whether the world would recover if earthquakes started occurring all around the earth at high frequency. Earthquakes have wiped out civilizations in the past, and there is no doubt that very powerful earthquakes could cause the modern world to suffer the same fate. We have almost certainly not seen the last of this horrible natural disaster. Earthquakes will almost certainly be a part of the future of the United States, causing deaths and destruction. As a nation, we need to be prepared for the worst.

This website is an attempt to recreate a MIDI controller using the keyboard. It uses the Howler.js audio library for quick and easy audio playback across multiple browsers. It uses the Zip.js library to parse .zip files.