This is fantastic. I wonder if we've passed a tipping point where some of these fundamental machine learning algorithms will be taught the same way we learned the C++ STL back in the day. Really, something like a spam filter should look like a shell script, where you pipe things around and use off-the-shelf tools, rather than spending inordinate amounts of time implementing building blocks (imagine having to write grep before you could even start).
Just out of curiosity, does anyone have a link on NN training parallelization? I’m just hitting paywalls. I’m curious if it’s simple or if there are dependencies between neurons that make it hard. As I see it, the next big hurdles after ubiquitization are distributed training (possibly with GPUs) and empirical data on the best tuning parameters for various problem spaces (which will come largely for free when training happens 5 or 10 orders of magnitude faster than today). Eventually we’ll have an answer as to whether AGI can be composed of basic building blocks or if we’re still missing some fundamental insight.
For example, genetic algorithms are fairly easy to parallelize because by their nature they use large, independent populations. Around the time I was graduating college in the late 90s, there was great interest in using them to find the initial weights of NNs to avoid local minima, but I haven't heard much about it since. Have there been any breakthroughs?
> the same way we learned the C++ STL back in the day
My school forbade you from using the STL. You had to build all of your data structures in freshman/sophomore years and then use those same data structures you built in your sophomore/junior/senior projects. It was really crazy, because if you never got your lists, trees, graphs, and hashes working you basically wouldn't make it through the CS program. It sucked at the time, but I'm grateful now.
Wow, that sounds very bad to me. It's important to learn how data structures work, but I believe it's more important to learn how to find and re-use existing code/libraries in your work. This is something schools seem to be avoiding and even discouraging (mine didn't take it as far as your school, but they never once covered the standard libraries).
I've tutored students in Java classes who would write a bubble sort every time they needed to sort a List, instead of using Collections.sort, simply because they didn't know Collections.sort existed! Or their own ugly code of looping, indexOf and substring to get values from a CSV instead of using string.split(",") or a premade csv parser (which I know that professor would have been fine with, he would have even encouraged it).
Unless the point of the project is to learn how to make one from scratch, you shouldn't be wasting your time writing an alphabetical sorting function when you could be solving novel problems.
It might, but today I run into tons of developers who don't understand the fundamentals. I worked with a guy whom I affectionately named "Stack Man." He basically used a stack for any list implementation. He didn't understand the difference between a list, a queue, and a stack.
You're teaching your students Java and telling them "Just use Collections.sort" and that's fine, but do you understand when Collections.sort is bad? If you don't understand the underlying algorithm of how a merge sort works, you might be telling them to use something that could actually be worse than a bubble sort in certain cases.
Also Java wasn't taught to CS students at our school, that was only for BIT (Business Information Technology). Teachers wanted you to understand memory and pointers.
>You're teaching your students Java and telling them "Just use Collections.sort" and that's fine, but do you understand when Collections.sort is bad? If you don't understand the underlying algorithm of how a merge sort works, you might be telling them to use something that could actually be worse than a bubble sort in certain cases.
When your assignment has to be uploaded in 3 hours and your dataset is 120 items, Collections.sort is always better than writing your own sort.
I never said I taught them only to use Collections.sort, as that would be just as bad as never teaching them the standard libraries and only teaching data-structures. You need a healthy balance of both fields of knowledge. Otherwise you end up spitting out either programmers who can use library functions without understanding what's actually going on, or programmers who know all about the underlying methods and logic, but can't write code efficiently and quickly enough to actually help a team.
That's not to say you cannot write code efficiently or quickly. You likely did what most good programmers do and learned on your own time as well. But students who don't do this will be lacking huge amounts of practical knowledge.
<i>I worked with a guy whom I affectionately named "Stack Man." He basically used a stack for any list implementation. He didn't understand the difference between a list, a queue, and a stack.</i>
That's brilliant, although I'd imagine it could begin to annoy. Slightly obtuse, but I'd have had to get a snippet of this to play every time he checked in a new stack...
>Wow, that sounds very bad to me. It's important to learn how data structures work, but I believe it's more important to learn how to find and re-use existing code/libraries in your work.
If you're trained as a salaried code churner to solve some "pragmatic enterprise problem", yes.
If you're trained as a computer scientist, you have to know how to do things properly from first principles, and also how to invent and code similar new things, custom tailored to new domains, yourself with ease.
You don't get progress in computer science from people just learning to use existing libraries and data structures. At best you get some innovative apps, but not people able to build the substructure to create the innovative apps of tomorrow.
Not to mention the sorry state of most existing data structure libraries and core APIs, from the STL to the Java SDK. A lot of that stuff needs to be burned down and rebuilt to 2014 standards.
We gave a talk on this at Hadoop Summit recently [1]. This is also what I've implemented in deeplearning4j [2].
The approach, called iterative reduce, is basically training on mini batches in parallel and averaging the outcomes. The trick is to run multiple iterations over each mini batch, averaging the parameters at runtime. It's a surprisingly simple technique that works really well.
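For anyone who wants the shape of it without watching the talk, here's a toy single-process sketch of that parameter-averaging loop (not the actual deeplearning4j/YARN code; trainLocally and the flat weight arrays are just stand-ins):

    // Toy iterative reduce / parameter averaging, with weights as flat arrays.
    // trainLocally() stands in for a few epochs of SGD on one worker's data shard.
    function averageWeights(workerWeights) {
      return workerWeights[0].map(function (_, j) {
        var sum = 0;
        for (var i = 0; i < workerWeights.length; i++) sum += workerWeights[i][j];
        return sum / workerWeights.length;
      });
    }

    function iterativeReduce(initialWeights, shards, trainLocally, rounds) {
      var current = initialWeights.slice();
      for (var r = 0; r < rounds; r++) {
        var updated = shards.map(function (shard) {
          return trainLocally(current.slice(), shard);  // in practice: run in parallel across the cluster
        });
        current = averageWeights(updated);              // the "reduce" step: average the parameters
      }
      return current;
    }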
Unfortunately, it's not as async as Sandblaster L-BFGS by Dean and co. [3].
Andrew himself has told me that the gradient was unstable with this approach though.
Right now, distributed training of neural nets is still like the wild west. I can see a few approaches coming up over time though.
You may be looking for stochastic gradient descent (SGD). Training neural nets is very parallelizable using such techniques, and can even be done asynchronously (avoiding having to regularly sync different nodes in the cluster) using systems like parameter server.
Geoff Hinton popularized pre-training nets with unsupervised latent-feature-learning autoencoders in the mid-2000s, making nets much faster to train. More recently, GPUs have been used because they perform matrix/vector multiplication much faster. Clusters of hundreds or thousands of nodes are used to run SGD at high speed with the above techniques, and pretraining with autoencoders isn't really necessary anymore.
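To make the "parallelizable" part concrete: each SGD update needs only one example (or a small mini batch) plus the current parameters, which is why workers can chew on different slices of the data and just exchange parameters or gradients. A single-machine sketch for a one-neuron logistic model (the data layout and names are mine, not from any particular paper):

    function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }

    // One epoch of plain SGD on a logistic "neuron": w <- w - lr * gradient,
    // with the gradient computed from a single {x, y} example at a time.
    function sgdEpoch(w, b, data, lr) {
      data.forEach(function (ex) {               // ex = { x: [numbers], y: 0 or 1 }
        var z = b;
        for (var j = 0; j < ex.x.length; j++) z += w[j] * ex.x[j];
        var err = sigmoid(z) - ex.y;             // dLoss/dz for log loss
        for (var k = 0; k < ex.x.length; k++) w[k] -= lr * err * ex.x[k];
        b -= lr * err;
      });
      return { w: w, b: b };
    }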
Bayesian optimization is one technique that addresses your concern about parameter choosing. Essentially, Bayesian optimization uses Bayesian reasoning to plan experiments that seek to discover the most information about the loss function of the network, letting you efficiently choose the hyperparameters of your model (whatever it is).
Neural nets have definitely seen breakthroughs since the 90s; the last few years alone have been amazing. Check out "Google Brain", the "ImageNet neural net", "Never-Ending Language Learning (NELL)", Geoff Hinton (reinvention of NNs), Tom Mitchell (online learning, proto-AI), Yann LeCun (convnets), and Nando de Freitas (Bayesian optimization). Lots of good stuff out there!
I've been playing around with neural nets and find that swarm optimization is particularly effective; you can train high-dimensional neural net weight spaces with fewer evaluations per iteration. I think this works because there are a lot of arbitrary symmetric choices embedded in the neural net. For example, the ordering of hidden layer neurons is essentially arbitrary.
Also, I can't use backpropagation for what I'm trying to do.
Thanks for the suggestions. I had heard of Yann LeCun; I'm trying to do image recognition in a "different" way than CNNs. I'm going to read up on the other people on the list.
I prefer using PSO over feed-forward gradient descent because, empirically for my system, a swarm of ~100 particles gives comparable results to standard gradient descent on a 400-or-so neuron neural net, though it's ~4x slower.
I can't use backpropagation because some of the output nodes are used to apply a transformation on the input followed by reapplication of the (now transformed) input through the neural net.
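For anyone curious, here's a bare-bones sketch of what I mean by PSO over the weights: flatten everything into one vector and let a swarm search that space. It only ever calls the loss function, never a gradient, which is why it still applies when backprop doesn't. The constants are just the usual illustrative defaults.

    // Minimal particle swarm optimization over a flat parameter vector.
    // `loss` maps a weight array to a number; lower is better.
    function pso(loss, dim, particles, iters) {
      var w = 0.7, c1 = 1.5, c2 = 1.5;                      // inertia + pull strengths
      var rand = function () { return Math.random() * 2 - 1; };
      var swarm = [], gBest = null, gBestLoss = Infinity;
      for (var i = 0; i < particles; i++) {
        var pos = [], vel = [];
        for (var d = 0; d < dim; d++) { pos.push(rand()); vel.push(0); }
        var l = loss(pos);
        swarm.push({ pos: pos, vel: vel, best: pos.slice(), bestLoss: l });
        if (l < gBestLoss) { gBestLoss = l; gBest = pos.slice(); }
      }
      for (var it = 0; it < iters; it++) {
        swarm.forEach(function (p) {
          for (var d = 0; d < dim; d++) {
            p.vel[d] = w * p.vel[d]
                     + c1 * Math.random() * (p.best[d] - p.pos[d])   // pull toward own best
                     + c2 * Math.random() * (gBest[d] - p.pos[d]);   // pull toward swarm best
            p.pos[d] += p.vel[d];
          }
          var l = loss(p.pos);
          if (l < p.bestLoss) { p.bestLoss = l; p.best = p.pos.slice(); }
          if (l < gBestLoss)  { gBestLoss = l;  gBest = p.pos.slice(); }
        });
      }
      return { weights: gBest, loss: gBestLoss };
    }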
That sounds similar to a recurrent neural net? Backprop definitely works on those with some modifications. I just asked some guys at CMU/Google; they suggested looking more at SGD or some other MCMC methods that may have included the word "expert", or maybe something similar to simulated annealing or Gibbs sampling… which I can't recall exactly. I know nothing about them and I'm totally unqualified to say anything, but maybe (hopefully!) some of those words will lead to something useful. The guys I chatted with don't study PSO, so they may not know enough to be helpful, but they were a bit down on it in general.
The guys at h2o.ai have been parallelizing machine learning algos for their open source project. I guess their major insight is to ship around different steps of the algo to different nodes rather than ship around data. And they avoid the sort step in hadoop too.
I like this one because it's easier to get an idea of the layout of the neural network itself, and it gives more control.
Off the topic of neural networks in javascript but on the topic of neural networks in general, the readme links a cool paper on applying convolutional neural networks to old games.
Let me preface this by saying I don't know the first thing about how neural networks actually work, but can someone help me out and explain why the following isn't working?
I added an extra input to the training data for the XOR example:
Each training pattern should have an input and an output, both of which can be either an array of numbers from 0 to 1 or a hash of numbers from 0 to 1.
Not sure about the implementation: is .train() auto-scaling the network size? I'm guessing there aren't enough neurons to do 3-bit XOR logic. Try adding more hidden layers!
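Something like this, maybe. I'm going from memory on the option name, so double-check the README; I believe the constructor takes a hiddenLayers array, and that the extra input makes this a 3-bit parity problem:

    var net = new brain.NeuralNetwork({ hiddenLayers: [8] });  // option name is an assumption
    net.train([
      { input: [0, 0, 0], output: [0] },
      { input: [0, 0, 1], output: [1] },
      { input: [0, 1, 0], output: [1] },
      { input: [0, 1, 1], output: [0] },
      { input: [1, 0, 0], output: [1] },
      { input: [1, 0, 1], output: [0] },
      { input: [1, 1, 0], output: [0] },
      { input: [1, 1, 1], output: [1] }
    ]);
    net.run([1, 1, 0]);   // should come out near 0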
Also keep in mind that there need to be neurons to convert a value into binary and back, which is going to take a lot of neurons on its own. You could try doing the decimal -> binary conversion yourself before sending values to the network (rough sketch below).
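For illustration only (the helper names are mine, not from the library): encode a small non-negative integer as a fixed-width array of 0/1 inputs, and decode a rounded multi-bit output the same way.

    // Encode n as `width` bits, most significant first, e.g. toBits(5, 3) -> [1, 0, 1].
    function toBits(n, width) {
      var bits = [];
      for (var i = width - 1; i >= 0; i--) bits.push((n >> i) & 1);
      return bits;
    }

    // Decode a (possibly fuzzy) bit array back to an integer, rounding each output.
    function fromBits(bits) {
      return bits.reduce(function (acc, b) { return (acc << 1) | Math.round(b); }, 0);
    }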
Outputs of vanilla neural nets are in the range (0, 1). Usually the last step of the net is to transform the final activation with the logistic sigmoid function, which takes any real number and squashes it into that range.
If you want to create a neural net that can, say, identify handwritten digits and output a result 0..9, it is usually better to build a net with 10 outputs, each corresponding to one digit, rather than one with a single output whose value you then bin, say (0, 0.1) = 0, [0.1, 0.2) = 1, etc.
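To make the contrast concrete, here's roughly what the decoding step looks like for the two designs (net and pixels are placeholders, not any particular library's API):

    // (a) one output in (0, 1), decoded by binning -- works, but brittle
    var single = net.run(pixels)[0];                        // e.g. 0.37
    var digitFromBinning = Math.min(9, Math.floor(single * 10));

    // (b) ten outputs, one per digit -- take the index of the largest score
    var scores = net.run(pixels);                           // e.g. [0.01, 0.02, 0.91, ...]
    var digitFromArgmax = scores.indexOf(Math.max.apply(null, scores));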
Do you mean for this particular implementation or in general? If the latter, here's an animation of input-to-hidden weights during the training of an autoencoder on MNIST data:
Yikes. Sorry about that. I don't remember the last time I used a photo upload service, so I just used the first one that came up. I didn't notice the NSFW stuff due to adblock probably. Anyway, here's a safer link for anyone else:
Nice! Although training 100k+ samples may crash or be prohibitively slow. Maybe enabling Decaf for training on the GPU in the backend would be a killer feature!
The reason why it's used is because it's a very simple example of a non-linearly separable function.
You can interpret boolean functions as classification problems: given some input(s) x, classify the output as either 1 (class A) or 0 (class B).
The truth table of an XOR function looks like this:
x1 | x2 | XOR | Class
---------------------
 0 |  0 |  0  |   B
 0 |  1 |  1  |   A
 1 |  0 |  1  |   A
 1 |  1 |  0  |   B
If you were to plot the XOR function in R2, it would look something like this:
x2 ^
   |
   |
   A     B
   |
   |
   B-----A----->
             x1
To be non-linearly separable means that there is no straight line that you can draw in the above plot that will split the classes. Thus, we need a non-linear classifier, e.g. a neural net.
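To see that a single hidden layer of non-linear units is enough, here's a tiny hand-wired 2-2-1 network that computes XOR (weights picked by hand, no training involved):

    function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }

    function xorNet(x1, x2) {
      var h1 = sigmoid(20 * x1 + 20 * x2 - 10);   // roughly "x1 OR x2"
      var h2 = sigmoid(20 * x1 + 20 * x2 - 30);   // roughly "x1 AND x2"
      return sigmoid(20 * h1 - 20 * h2 - 10);     // "OR but not AND" = XOR
    }

    [[0, 0], [0, 1], [1, 0], [1, 1]].map(function (p) {
      return +xorNet(p[0], p[1]).toFixed(3);      // ~0, ~1, ~1, ~0
    });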
It's a bit hard to elaborate. You have to remember that a classification problem may be generalized to more than just two dimensions; in fact it can be generalized to however many dimensions are necessary to describe the data. In the general multidimensional case, linearly separable relates to the number and type of hyperplanes (subspaces) necessary to correctly classify the data.
In this simple 2-D case the separating subspace is a line, so linearly separable means the two classes can be split by a single straight line (which, as shown above, XOR cannot be). In the 3-dimensional case, linearly separable refers to classification using a single 2-dimensional plane. For 4-dimensional data you need a single 3-dimensional hyperplane, and so on.
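If it helps to see it in symbols (this is just the standard textbook definition, nothing specific to this thread): writing the two classes as labels y_i = +1 or -1, a dataset {(x_i, y_i)} with x_i in R^d is linearly separable exactly when some hyperplane puts every point on its correct side, i.e.

    \exists\, w \in \mathbb{R}^d,\ b \in \mathbb{R} \quad \text{such that} \quad y_i\,(w \cdot x_i + b) > 0 \ \ \text{for all } i.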