Coursera deep learning notes
2022-07-19 07:24:00 【Alex Tech Bolg】
Setting up your Machine Learning Application
Train/Dev/Test sets
- The train/dev split doesn't have to be 70%/30%; the dev set just needs to be big enough to evaluate models reliably.
- Make sure the dev and test sets come from the same distribution. (Deep learning models need lots of training data, so the training set is sometimes grown with web-crawled data from a different distribution; as long as dev and test stay matched, this rule of thumb keeps progress on the algorithm fast.)
- If you don't need an unbiased estimate of final performance, it's fine to have only a train and a dev set.
Bias / Variance
- Optimal (Bayes) error: the lowest error any model could achieve on the problem.
- Compare the train-set and dev-set errors against the optimal (Bayes) error to diagnose high bias (train error far above Bayes error), high variance (dev error far above train error), both, or neither. For example, with Bayes error near 0%, train 15% / dev 16% suggests high bias, while train 1% / dev 11% suggests high variance; a rough decision rule is sketched below.
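As a hedged illustration (the function, threshold, and numbers are my own, not from the course), the gap comparison above can be written as a rough decision rule:

```python
def diagnose(train_err, dev_err, bayes_err=0.0, tol=0.02):
    """Rough bias/variance diagnosis from error rates.

    tol is an illustrative threshold for what counts as a 'big' gap.
    """
    high_bias = (train_err - bayes_err) > tol    # underfitting the train set
    high_variance = (dev_err - train_err) > tol  # failing to generalize to dev
    return high_bias, high_variance

# diagnose(0.15, 0.16) -> (True, False)   # high bias
# diagnose(0.01, 0.11) -> (False, True)   # high variance
# diagnose(0.15, 0.30) -> (True, True)    # both
```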
Basic Recipe for Machine Learning
- Lecture flowchart: if the model has high bias, try a bigger network or training longer; if it has high variance, try getting more data or adding regularization; iterate until both are acceptable.
Regularizing your Neural Network
Regularization
- L2 regularization (Frobenius norm), also called weight decay: add (λ / 2m) · Σ_l ‖W[l]‖_F² to the cost (see the sketch after this list).
- L1 regularization: penalizes Σ|w| instead, which tends to make the weights sparse.
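A minimal numpy sketch of the L2 penalty and its gradient (function and variable names are mine, not from the notes); the (λ/m)·W term added to each dW is what makes this "weight decay":

```python
import numpy as np

def l2_penalty_and_grads(weights, grads, m, lambd):
    """Add the L2 (Frobenius-norm) term to the cost and to each gradient.

    weights: list of weight matrices W[l]; grads: matching list of dW[l];
    m: number of training examples; lambd: regularization strength.
    """
    # Cost term: (lambda / 2m) * sum over layers of ||W[l]||_F^2
    penalty = (lambd / (2 * m)) * sum(np.sum(W ** 2) for W in weights)
    # Each gradient picks up (lambda / m) * W[l] -- the "weight decay" term
    reg_grads = [dW + (lambd / m) * W for W, dW in zip(weights, grads)]
    return penalty, reg_grads
```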

Why Regularization Reduces Overfitting?
- First intuition: if λ is set to a very large value, many weights w are driven toward 0, so the model goes from a complex neural network to a much simpler one; in the extreme case it behaves almost like a linear model.
- Second intuition: take tanh as an example. Increasing λ shrinks w, so z is confined to a small range around 0, where tanh is approximately linear; a network of nearly linear units can only represent nearly linear functions, which limits overfitting.
Dropout Regularization
Implement dropout (inverted dropout)

- Inverted dropout divides the kept activations by keep_prob so that E[z] is unchanged, which means no extra scaling is needed at prediction time (see the sketch after this list).
- In each training iteration, a different random set of nodes is dropped (set to 0).
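A minimal sketch of inverted dropout for one layer's activations (names and the keep_prob value are illustrative, not from the notes):

```python
import numpy as np

def dropout_forward(a, keep_prob=0.8, training=True):
    """Inverted dropout applied to the activations a of one layer."""
    if not training:
        return a                              # no dropout at test time
    d = np.random.rand(*a.shape) < keep_prob  # keep each unit with prob keep_prob
    a = a * d                                 # zero out the dropped units
    a = a / keep_prob                         # rescale so E[z] is unchanged
    return a
```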
Making predictions at test time
- Use no dropout at test time; applying dropout during prediction would only add noise to the predictions.
Understanding Dropout

- Intuition: a unit (the purple node in the lecture slide) cannot rely on any single input, because each input can be randomly eliminated, so dropout spreads out the weights. This has the effect of shrinking the weights, similar to L2 regularization.
- Downside: the cost function J is no longer well defined, because dropout is random at every iteration, so the loss curve can't be used to check that training is working. Common practice: first turn dropout off and verify that the loss curve decreases cleanly, then turn dropout back on and compare the final results.
Other Regularization Methods
Data augmentation (e.g., flips, random crops, and small distortions of existing images as cheap extra training data)
Early stopping (stop training when the dev-set error starts to rise)
Setting Up your Optimization Problem
Normalizing Inputs
- Normalize the training data, then use the same μ and σ to normalize the test data, so that train and test data go through the same transformation (see the sketch below).
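A minimal sketch (names are mine, not from the notes): fit μ and σ on the training set only, then apply that one transformation to every split:

```python
import numpy as np

def fit_normalizer(X_train):
    """Compute per-feature mean and std on the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8   # guard against zero variance
    return mu, sigma

def normalize(X, mu, sigma):
    """Apply the training-set mu/sigma to any split (train, dev, or test)."""
    return (X - mu) / sigma

# mu, sigma = fit_normalizer(X_train)
# X_train = normalize(X_train, mu, sigma)
# X_test  = normalize(X_test,  mu, sigma)   # same transformation as train
```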
Why normalize inputs?
- Rough intuition: the cost function is more "round" and easier to optimize when the features are on similar scales.
- Normalizing matters most when feature scales differ dramatically; if the features already come in on similar scales this step is less important, though performing it almost never does any harm.
Vanishing / Exploding Gradients
- If activations or gradients grow or shrink exponentially as a function of L (the number of layers), their values become extremely large or extremely small, which makes training very difficult (see the toy demo below).
- The exponentially small case is especially bad: gradient descent takes tiny steps and needs a very long time to learn anything.
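A toy numpy demo of the effect (purely illustrative, not from the notes): a deep "network" whose layers all multiply by the same factor makes activations explode for a factor above 1 and vanish for a factor below 1:

```python
import numpy as np

L, n = 50, 4                 # 50 layers, 4 units per layer
x = np.ones(n)

for scale in (1.5, 0.5):
    W = scale * np.eye(n)    # every layer multiplies activations by `scale`
    a = x.copy()
    for _ in range(L):
        a = W @ a            # linear layers only, to isolate the effect
    print(scale, a[0])       # 1.5**50 ~ 6e8 (explodes), 0.5**50 ~ 9e-16 (vanishes)
```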
Weight Initialization for Deep Networks
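A hedged sketch of the variance-scaled initialization this lecture covers (He initialization for ReLU, Xavier for tanh), which keeps z[l] from growing or shrinking with depth; layer sizes and names below are illustrative:

```python
import numpy as np

def init_weights(layer_dims, activation="relu"):
    """Scale initial weight variance by fan-in to keep z[l] well behaved.

    He init (ReLU): Var = 2 / n_prev;  Xavier init (tanh): Var = 1 / n_prev.
    """
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        var = (2.0 if activation == "relu" else 1.0) / n_prev
        params[f"W{l}"] = np.random.randn(n_curr, n_prev) * np.sqrt(var)
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

# params = init_weights([784, 128, 64, 10])  # illustrative layer sizes
```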