Artificial Intelligence Nanodegree Term 2
Table of Contents
Chapter 1 - Deep Learning
Links
-
Forum https://discussions.udacity.com
-
Other Solutions
- https://github.com/lemuelbarango/dog-breed-classifier
- https://github.com/binojohnthomas/AIND-RNN
- https://github.com/morganel?tab=repositories
- cmiller112000
- angelmtenor
- https://github.com/angelmtenor/deep-learning
Graduated
- Email: ndgrad-support@udacity.com, alumni-support@udacity.com
Concentrations
- About
- introductory project that utilizes a commercially-available API to build a complete solution, and,
- capstone project where you will try to solve one challenging problem in the chosen area
-
Concentrations
- NLP - label words in sentence with Part-of-Speech (POS) tags as named entities
- Grammars https://classroom.udacity.com/courses/cs101/lessons/48299949/concepts/487192400923
- VUI
- Links
- https://developer.amazon.com/alexa-skills-kit/alexa-skills-developer-training
- Applications
- Apple Siri
- Microsoft Cortana
- Google Home
- Amazon Alexa on Echo
- Device
- Amazon Echo Dot V2 http://www.smarthome.com.au/z-wave/z-wave-accessories.html
- Links
- NLP - label words in sentence with Part-of-Speech (POS) tags as named entities
- Previews
- Voice Preview https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/38e74312-3173-4456-919d-bcb00a82bfb5/lessons/dc1efdfd-e07f-4a5c-ab35-dbb274a25c88/concepts/last-viewed
- NLP Preview https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/ac7813e7-2907-44e4-a9a5-388bcc4edd38/lessons/5f3de1f2-df97-46c4-a2ba-82418c66f9e5/concepts/last-viewed
- Computer Vision Preview https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/0917fcad-9a95-401c-9c92-8cec8f6dc09e/lessons/260fb8ce-eb1d-4ea2-864b-d8ed31b7082f/concepts/last-viewed
TODO
- AWS Training Free
- https://www.aws.training
- https://www.awseducate.com/microsite/Training
Instructors
- Luis Serrano - Course Developer / Instructor (did ML for Google Youtube recommendations)
- Alexis Cook - Applied maths / biologist, uses Deep Learning @alexis_b_cook
- Arpan Chakraborty - Builds Computer Vision and Machine Learning courses
- San Camacho - Expert in Computer Vision (applied in medical tech to self-driving car navigation)
- Dana Sheehan - Elec Engineer, MSc with interest in AI
- Jeremy Watt - ML engineer educator and uni textbook author, likes NLP and Computer Vision, and maths optimisation, wrote “Machine Learning Refined”
- Raisa Honey - Deep Learning researcher with teaching experience in ML, wrote “Machine Learning Refined”
Deep Learning
- Applications
- Defeating humans in games
- Detecting spam emails
- Forecasting stock prices
- Recognising images in pictures
- Diagnosing illnesses
- Self driving cars
- Components
- Neural Networks
- Dfn: Mimics the way the brain operates, with neurons that fire off information
- Way to visualise Neural Networks: given data comprising groups of Red and Blue points, a Neural Network finds the best line that separates the groups; for complex data, the boundary separating the points will be more complex
- Types:
- Deep Neural Networks - many Nodes, Edges, Layers
- Neural Networks
- Classification Problems
- Example 1: Predict whether student is admitted into university by analysing
n-Columns of known admissions data samples from prior student applicants
and Plot data on graph in n-Dimensions. Create an equation/Model
that generates a “Line” (if 2 columns, 2D), “Plane” (if 3 Columns, 3D),
or “Hyper Plane” (if n Columns, n-D) that is a Linear Demarcation Boundary
between all known samples points that were “Accepted” and “Rejected”
Problem
- Given students' Test and Grade results and an Admissions officer
- Given known admission samples, predict whether another student is admitted
- Known (previous data):
    Accepted - Student 1 - Test 9/10, Grade 8/10
    Rejected - Student 2 - Test 3/10, Grade 4/10
- Unknown:
    ?        - Student 3 - Test 7/10, Grade 6/10
Solution
- Determine how many Columns we want to plot, which determines how many Dimensions the plot will be in (including "Higher Dimensions") i.e.
    2x Columns - Plot in 2D with demarcation boundary "Line"
    3x Columns - Plot in 3D with demarcation boundary "Plane"
    nx Columns - Plot in nD with demarcation boundary "Hyper Plane"
- Plot all Known results i.e. if 2D then Test results on X1 (X-axis), Grade results on X2 (Y-axis)
- Create a Demarcation Boundary representing our Model between where students are likely accepted/rejected.
- The Linear Equation represents the demarcation of the Model. Substitute the Unknown student's Test and Grade results into the equation: if the score < 0 then Reject, if >= 0 then Accept. i.e.
    W1 * X1 + W2 * X2 + b * 1 = 0   # dot product of Weight and Input Vectors, plus Bias
    Wx + b = 0                      # abbreviated in Vector notation
        where W = (W1, W2)          # Vector W - Weights
        where x = (X1, X2)          # Vector x - Inputs
        where b                     # Bias
        where Y = 0 or 1            # Label that we're trying to predict
                                    # for given coordinates (X1, X2)
                                    # Y == 0 if Rejected (under demarcation line of Model)
                                    # Y == 1 if Accepted (above demarcation line of Model)
    Note: Points are of the form (X1, X2, Y)
    ^y = 1 if Score of Wx + b >= 0  # y Hat is what the algorithm Predicts
         0 if Score of Wx + b <  0  # the labels will be
    Note: Points on the Demarcation Boundary Line have a Score of 0 when their coordinates are substituted into the equation.
    Note: The Goal of the algorithm is to have ^y (prediction) resemble Y (actual) as closely as possible (i.e. finding the demarcation boundary line that keeps the previous Y == 1 above it, and Y == 0 below it)
- If we have >= 3x Data Columns (i.e. Test Result, Grade Result, Class Rank) we Fit the data by using >= 3x Dimensions/axes (i.e. X1, X2, X3) i.e.
    X1 (X) - Test
    X2 (Y) - Grades
    X3 (Z) - Class Rank
  - Plot each sample as a point on the graph
  - The Demarcation Boundary is plotted as a 3D "Plane" (possibly on an angle)
      W1 * X1 + W2 * X2 + W3 * X3 + b * 1 = 0   # dot product of the three-entry Vectors, plus Bias
      Wx + b = 0                                # abbreviated in Vector notation
      ^y = 1 if Wx + b >= 0                     # y Hat is what the algorithm Predicts
           0 if Wx + b <  0                     # the labels will be
  - Colour each sample depending on the Region of the sample (from the 2x Regions available) (i.e. whether above or below the "Plane")
- If we have 'n' Data Columns we Fit the data in 'n'-dimensional space, where each sample Point contains coordinates (i.e. (X1, X2, ..., Xn) ), and where the Label is Y, then:
  - Demarcation Boundary: an (n - 1) Dimensional "Hyper Plane" (i.e. a High Dimensional Equivalent of a "Line" in 2D or a "Plane" in 3D)
  - Equation
      W1 * X1 + W2 * X2 + ... + Wn * Xn + b * 1 = 0  # each Vector has 'n' entries, one for each Column of the Data set
      Wx + b = 0                                     # abbreviated in Vector notation
      ^y = 1 if Wx + b >= 0                          # y Hat is what the algorithm Predicts
           0 if Wx + b <  0                          # the labels will be
- Solve the Unknown prediction by plotting it on the most likely demarcation side. Whilst the Model will make some mistakes, we can assume this prediction is correct with some confidence
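As a quick illustration of the prediction rule above, here is a minimal Python/NumPy sketch; the example weights and bias below are assumptions for illustration, not values from the course:
```
import numpy as np

def predict(W, x, b):
    """Return y hat: 1 (Accept) if Wx + b >= 0, else 0 (Reject)."""
    score = np.dot(W, x) + b
    return 1 if score >= 0 else 0

# Hypothetical weights and bias for the 2-column (Test, Grade) example
W = np.array([1.0, 1.0])
b = -10.0

print(predict(W, np.array([9, 8]), b))  # Student 1 -> 1 (Accepted)
print(predict(W, np.array([3, 4]), b))  # Student 2 -> 0 (Rejected)
print(predict(W, np.array([7, 6]), b))  # Student 3 -> 1 (predicted Accept)
```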
- Perceptrons
- Dfn:
- Perceptrons are why these are called Neural Networks, since they resemble the Neurons in the Brain. A Perceptron calculates an equation in Node 1 based on the Input Node values. Similarly, a Brain Neuron takes inputs from its "Dendrites" (nervous impulses), processes them, and decides whether to output a nervous impulse via the Axon. Neural Networks are created by concatenating multiple (Multi-Layered) Perceptrons, mimicking the way the Brain connects Neurons, by making successive outputs the inputs to the next.
-
Visualise
INPUT NODES           NODE 1 - SUMMATION -                NODE 2 - STEP -
                      CALC LINEAR EQN ON INPUTS           APPLIES STEP
                      AND THE WEIGHTS                     FN TO RESULT
===========           ==================================  =====================
                      LINEAR FUNCTION / PLOT              STEP FUNCTION
| X1 | --- W1 --->
| X2 | --- W2 --->    | Wx + b = (n ∑ i=1) WiXi + b | ---> | Wx + b >= 0 ??? | ---> YES: 1
...                   |                             |      | Wx + b <  0 ??? |      NO:  0
| Xn | --- Wn --->
| 1  | --- b  --->
-
NOTE: In the future we’ll use different Step Functions
-
Building block of Neural Networks where we encode the Equation (that defines our Model) into a small graph
- Build the Perceptron by:
- Create Model Plot Node containing our Plot inside (showing our Boundary Line and Data Points)
- Create Input Nodes (i.e. for each Sample value for all Columns) to the Model Node
-
Perceptron will plot the Sample values at a point on the Model Node Plot and checks if point is in Positive Region (returns Yes) or Negative Region (returns No)
i.e. Given the equation:
    0 = W1 * X1 + W2 * X2 + b * 1          # Linear Equation Boundary
Substitute:
    Score = 2 * Test + 1 * Grade - 18 * 1  # Linear Equation with the Weights
                                           # and Input Types substituted
i.e. if we have the following:
    Weight 1 (Test) : 2      Test result : 7
    Weight 2 (Grade): 1      Grade result: 6
    Bias unit       : -18
we then plot the following on the Perceptron:
    - Point (7,6) on the plot
    - Edge 2 between Input Node (Test) and the Model Plot Node
    - Edge 1 between Input Node (Grade) and the Model Plot Node
    - Bias -18: label this value over the Model Plot Node
Outcome:
    Now when we see a Perceptron having Nodes with these labels, we can think of the Linear Equation the nodes generate i.e.
        | TEST = 7   | ---- 2 ----> | -18 |
        | GRADES = 6 | ---- 1 ---->
Alternative:
    Alternatively we can include the Bias as an Input Node (i.e. think of it in the equation as b * 1) and have b labelling an Edge coming from a 1.
    Then the Model Plot Node multiplies the values from the incoming Nodes by the values on the corresponding Edges i.e.
        | TEST = 7   | ---- 2   ----> | SCORE = 2 * 7 + 1 * 6 - 18 * 1 = 2 |
        | GRADES = 6 | ---- 1   ---->
        | BIAS = 1   | ---- -18 --->
Finally it checks if SCORE is >= 0 or < 0 and returns:
    1 (i.e. YES) for SCORE >= 0, signalling the student is accepted
    0 (i.e. NO)  for SCORE <  0
i.e. in general
    | X1 | --- W1 --->
    | X2 | --- W2 --->   | Wx + b = (n ∑ i=1) WiXi + b | ---> | Wx + b >= 0 ??? | ---> YES: 1
    ...                                                       | Wx + b <  0 ??? |      NO:  0
    | Xn | --- Wn --->
    | 1  | --- b  --->
Note: we are implicitly using the "Step Function" (i.e. returns 1 if the Input is Positive, and 0 if the Input is Negative)
- Use the “Weights” as Labels in the Plot, since they define the Linear Equation itself
- Example: Perceptrons as Logic Operators
-
Create Perceptrons for logic operators including AND, OR, NOT, and XOR. https://www.youtube.com/watch?v=45K5N0P9wJk
- AND Perceptron
i.e. Plots the Boundary line from substituting the Weights + Bias into the equation. Then plots each point and returns 1 if the point is in the Positive region, and 0 if the point is in the Negative region (i.e. below the diagonal Boundary line that represents the equation and crosses just below (1, 1))

    | INPUT1 | -------->
                          | AND | -------> OUTPUT
    | INPUT2 | -------->

    Truth Table                   Perceptron Table
    ========================      ========================
    INPUT1   INPUT2   OUTPUT      INPUT1   INPUT2   OUTPUT
    ------------------------      ------------------------
    T        T        T           1        1        1
    T        F        F           1        0        0
    F        T        F           0        1        0
    F        F        F           0        0        0
- Code Implementation
```
import pandas as pd

# TODO: Set weight1, weight2, and bias
weight1 = 0.0
weight2 = 0.0
bias = 0.0

# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', ' Input 2', ' Linear Combination', ' Activation Output', ' Is Correct'])
if not num_wrong:
    print('Nice! You got it all correct.\n')
else:
    print('You got {} wrong. Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))
```
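One possible answer to the TODO above (an assumption, not the only valid answer; any weights and bias whose line puts only (1, 1) on the positive side will satisfy the AND table):
```
# One possible solution (assumed):
weight1 = 1.0
weight2 = 1.0
bias = -1.5
# Check: 0+0-1.5 < 0, 0+1-1.5 < 0, 1+0-1.5 < 0, 1+1-1.5 >= 0  -> matches AND
```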
- OR Perceptron
- Similar to AND Perceptron
- The gradient of the Demarcation Boundary is the same as the AND Perceptron, but the line is shifted down, achieved by using different Weights and/or Bias
-
XOR Perceptron
Perceptron Table
========================
INPUT1   INPUT2   OUTPUT
------------------------
1        1        0
1        0        1
0        1        1
0        0        0
-
- Tricks
- Trick for Perceptron Algorithm to Split Data Points and Adjust Linear Equation
- STEP 1:
- Choose random linear equation that defines a line with a Positive area and Negative area (each side of line)
- STEP 2:
- Points indicate if they are Correctly or Incorrectly
Classified (i.e. whether on correct side or not)
so we may improve the line
- Misclassified point indicates to line to come closer to it
- STEP 3:
- Adjust the linear equation (and associated line)
based on feedback from points instructing line how to move
i.e. given a Linear Equation and its relative regions:

    Positive Region    3 * x1 + 4 * x2 - 10 > 0
    LINE               3 * x1 + 4 * x2 - 10 = 0
    Negative Region    3 * x1 + 4 * x2 - 10 < 0

STEP 1: Misclassified Point in the POSITIVE Region

    Given a Misclassified Point incorrectly located in the POSITIVE Region:
        POINT (4, 5)
    move the Line closer to the Point by using the 4 and 5 to modify the Linear Equation. Given the following:
        1) Parameters of the Line:  3, 4, -10
        2) Point and Bias Unit:     4, 5, 1
    Avoid moving toward this NEW LINE drastically, as it may result in accidentally Misclassifying all the other Points. Instead we want to make SMALL steps toward the point, by using the LEARNING RATE (a small number).
        LEARNING RATE: 0.1
    First multiply 2) by the LEARNING RATE to get 3):
        4 * 0.1    5 * 0.1    1 * 0.1
        0.4        0.5        0.1
    then Subtract 3) from 1) to get the NEW LINE Parameters:
          3      4      -10
        - 0.4    0.5      0.1
        ----------------------
          2.6    3.5    -10.1
    this gives us a NEW LINE equation, which has moved closer to the Point:
        2.6 * x1 + 3.5 * x2 - 10.1 = 0

STEP 2: Misclassified Point in the NEGATIVE Region

    Repeat similarly to STEP 1, but instead of Subtracting, we use Addition:
        1) Parameters of the Line:  3, 4, -10
        2) Point and Bias Unit:     1, 1, 1
        LEARNING RATE: 0.1
        1 * 0.1    1 * 0.1    1 * 0.1
        0.1        0.1        0.1

          3      4      -10
        + 0.1    0.1      0.1
        ----------------------
          3.1    4.1     -9.9

        3.1 * x1 + 4.1 * x2 - 9.9 = 0
-
Perceptron STEP Algorithm - Linear Data
-
PSEUDOCODE
1. Random Weights
   Start with random weights: w1, ..., wn, b
   Apply to the Line Equation: Wx + b = 0 (separates the Positive and Negative areas)
2. For every misclassified point (x1, ..., xn), repeat 2.1 and 2.2 until there are no more misclassified points
   2.1 If `prediction = 0` (i.e. a Positive-Labelled Point misclassified in the Negative area)
       Then: Update the weights as follows (adding):
           for i = 1 ... n
               change wi to wi + α * xi
                   where α is the Learning Rate (i.e. 0.1)
           change the Bias Unit b to (b + α)
           (to move the Line closer to the Misclassified Point)
   2.2 If `prediction = 1` (i.e. a Negative-Labelled Point misclassified in the Positive area)
       Then: Update the weights as follows (subtracting):
           for i = 1 ... n
               change wi to wi - α * xi
                   where α is the Learning Rate (i.e. 0.1)
           change the Bias Unit b to (b - α)
           (to move the Line closer to the Misclassified Point)
-
CODE
-
Implement the Perceptron STEP Algorithm to separate data in a CSV file
Perceptron steps:
    Given:
        Point Coords (p, q)
        Label y
        Prediction Equation ^y = step(w1 * x1 + w2 * x2 + b)
    Then:
        If the Point is Correctly classified:
            do nothing
        If the Point is Incorrectly classified (i.e. Positive classification but has a Negative Label):
            SUBTRACT (α * p, α * q, α) from (w1, w2, b)
        If the Point is Incorrectly classified (i.e. Negative classification but has a Positive Label):
            ADD (α * p, α * q, α) to (w1, w2, b)
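A minimal Python/NumPy sketch of these steps (the function and variable names are my own, not necessarily those of the course notebook):
```
import numpy as np

def step(t):
    # Step activation: 1 if the score is non-negative, else 0
    return 1 if t >= 0 else 0

def prediction(x, W, b):
    # y hat = step(Wx + b)
    return step(np.dot(W, x) + b)

def perceptron_step(X, y, W, b, learn_rate=0.01):
    """One pass over all points, nudging the line toward each misclassified point."""
    for i in range(len(X)):
        y_hat = prediction(X[i], W, b)
        if y[i] - y_hat == 1:       # positive point classified as negative -> add
            W = W + learn_rate * X[i]
            b = b + learn_rate
        elif y[i] - y_hat == -1:    # negative point classified as positive -> subtract
            W = W - learn_rate * X[i]
            b = b - learn_rate
    return W, b
```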
- Graph the output of the Perceptron Algorithm by clicking on
test run
- Draws Dotted Lines, showing how algorithm approaches Best solution (Black Solid Line)
- Note:
- Modifying Perceptron Algorithm Parameters including:
- epochs
- Learning Rate
- Randomising initial parameters
-
Perceptron STEP Algorithm - Non-Linear Data
- Example - Error Function and Gradient Descent
LINEAR
========================
Given the following data point:
    * Test: 9/10
    * Grades: 1/10
Using the Boundary Line, the student will be accepted, since the point is on the Positive side of the line.

NON-LINEAR (ALTERNATIVE):
========================
If we only want to accept candidates based on CUSTOM criteria (i.e. must have Test >= 5 and Grades >= 5), then we need to Label the data points differently, and the Positive and Negative regions cannot be separated by just a Boundary "Line". Instead we need to separate with a "Circle", "Curve" or "Multiple Lines".

Redefine the Perceptron Algorithm that we created for Boundary "Lines" so that it generalises to other types of "Curves".

An "Error Function" is used with an Error Metric (distance) to tell us and the computer how badly it is doing:
    - show us how far we are from the ideal solution so we can repeatedly take steps to decrease the error and eventually solve the problem:
        - check in what directions we can take subsequent steps to get closer to the solution
        - pick the direction that takes us the farthest (decreases the error distance the most)

The "Gradient Descent" Method is then used, with issues to overcome:
    Issues:
        - Local Error Minimum (getting stuck) - which often still gives a good solution to the problem
- Example - Goal Split Data
Discrete Error Fn vs Continuous Error Fn (a Continuous, Differentiable Error Fn that uses high Penalty weights for misclassified points allows solving the problem with Gradient Descent)
```
- Given data points plotted.
-
Given a Boundary Line between Positive and Negative region
- Goal is to inform computer how far it is from
perfect solution.
- Count qty of errors (i.e. data points misclassified on wrong side of line)
- Decrease the qty errors
- Check in which directions we can move/rotate the Boundary Line to reduce the errors
- PROBLEM:
- DISCRETE ERROR FN - This algorithm uses Calculus to take small steps (by taking derivatives), but with small steps each step may not reduce the number of errors (similar to being on top of a pyramid of steps that jump from 2 to 1 to 0 errors: taking small steps in any direction is confusing, since all nearby positions are on the same level above the ground)
- ALTERNATIVE
- CONTINUOUS ERROR FN - Allows use of small steps to indicate which direction will decrease the error the most (since small variations in position translate to small variations in error) The Error Function should also be Differentiable
- Build a Continuous Error Fn.
- Given plotted points (with say 2 of 6 misclassified) with respect to the Boundary Line
- Error Function will assign a Large Penalty
to the misclassified data points
(and Small Penalty to correctly classified data points)
where on the plot we represent the Size of the
point as the Penalty
- Misclassified Penalty - approx. the distance from Boundary Line.
- Correctly classified Penalty - close to 0
- Total Error is obtained by adding the Penalties (errors) of all the
corresponding points (both correctly classified and misclassified points).
- TOTAL ERROR = PT1_ERR + PT2_ERR + … + PTN_ERR
-
Find out what small changes to Boundary Line Parameters that will make small changes to Error Function to make the TOTAL ERROR decrease, which we will see since the misclassified points will now have smaller Penalties (i.e. causing error Penalties to change), and then take a small step in that direction (each step will correctly classify a misclassified point). Repeat until reduce the TOTAL ERROR to its minimum possible value with all points correctly classified.
- IMPORTANT NOTE: Since we can build a Continuous Error Function with this Penalty property, we can now use Gradient Descent to solve the problem
```
- Predictions
```
- Predictions are the Output from the Algorithm.
- Probability is a Function of the distance from
the Boundary Line.
- DISCRETE Answer i.e. Yes or No
  i.e. Labels the data points:
    - Positive side of Boundary Line with 1
    - Negative side of Boundary Line with 0
- CONTINUOUS Answer (a probability between 0 and 1) i.e. 63.8%
  i.e. Labels the data points:
    - Positive side of Boundary Line with values >= 50% (the further from the Boundary Line, the higher the %)
    - Negative side of Boundary Line with values < 50% (the further from the Boundary Line, the lower the %)
```
- Switch from Discrete to Continuous predictions by changing the
Activation Function from Step to Sigmoid:
- Compares Step and Sigmoid Perceptrons https://youtu.be/D5WNzbr6P78?t=3m20s
- Step Function (returns Yes or No)
Step(Wx + b)
(with a Boundary Line)
- where y == 1 if x >= 0
- where y == 0 if x < 0
- Sigmoid Function (returns Probability of Yes)
(with an entire Probability Space)
(where for each data point in plane we are given probability that the
Label of the point is 1 for Blue points and 0 for Red points)
- Function that for large Positive numbers returns
values close to 1 and for large Negative numbers returns
value close to 0, and for numbers close to 0 returns
value of 0.5
https://www.youtube.com/watch?v=D5WNzbr6P78
σ(x) = 1 / (1 + e^-x)
- i.e. at say Point (0.5, 0.5) the probability P(Blue) == 50%, P(Red) == 50% and at say Point (0.6, 0.4) the probability P(Blue) == 40%, P(Red) == 60%, etc
- Generate the Probability Space:
    - Combine together the:
        - Linear Function
            Wx + b
          applied to the different lines on the plane that represent points where
            Wx + b = -n, ..., Wx + b = -4, ..., Wx + b = -2, ..., Wx + b = 2, ..., Wx + b = n
        - Sigmoid Function
            σ(Wx + b)   where σ(x) = 1 / (1 + e^-x)
          applied to each of those potential lines on the plane. For each line we obtain a number between 0 and 1, which represents the Probability of the points on that line being Blue (i.e. P(Blue)), where:
            P(Blue) = prediction of model ^y = σ(Wx + b) = value between 0 and 1 for each line
        - when on the Boundary Line Wx + b = 0, then Sigmoid σ(Wx + b) = 0.5
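A minimal sketch of this continuous prediction ^y = σ(Wx + b) in plain NumPy; the weights and bias below are illustrative assumptions:
```
import numpy as np

def sigmoid(x):
    # σ(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def probability_blue(W, x, b):
    # Continuous prediction: y hat = σ(Wx + b)
    return sigmoid(np.dot(W, x) + b)

W = np.array([1.0, 1.0])   # assumed weights
b = 0.0                    # assumed bias

print(probability_blue(W, np.array([0.0, 0.0]), b))    # on the boundary -> 0.5
print(probability_blue(W, np.array([2.0, 2.0]), b))    # far on positive side -> close to 1
print(probability_blue(W, np.array([-2.0, -2.0]), b))  # far on negative side -> close to 0
```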
- Multi-Class Classification (3+ classes) and Softmax
- i.e. classify as Blue, Red, Green (instead of just Blue, Red)
e.g. Bi-Classification Problem: Find the probability of getting a gift or not.
    P(gift) = 0.8
    P(no gift) = 0.2
- The Model takes Inputs (i.e. been good all year, is it your b'day)
- The Model uses the Inputs to calculate a Linear Model, which is the Score
    Score(gift) = Linear Function
  So, the probability of getting a gift or not is simply the Sigmoid Function applied to the Score:
    P(gift) = σ(Score)
Multi-Classification Problem: Find the probability of which animal we just saw, from duck, beaver, and walrus. i.e. we want the model to return the following (where combined they add to 1):
    P(duck) == 0.67, P(beaver) == 0.24, and P(walrus) == 0.09
Given a Linear Model based on the Inputs:
    Beak?        Boolean
    Teeth Qty?   Int
    Feathers?    Boolean
    Hair?        Boolean
Calculate the Linear Function based on the Inputs, which outputs Scores:
    Score(duck)   = 2 = e^2 / (e^2 + e^1 + e^0) = P(duck)   = 0.67
    Score(beaver) = 1 = e^1 / (e^2 + e^1 + e^0) = P(beaver) = 0.24
    Score(walrus) = 0 = e^0 / (e^2 + e^1 + e^0) = P(walrus) = 0.09
Convert Scores into Probabilities by using the **Softmax Function**, which applies the Exponential Function (e^x returns a positive number for every input) to each Score to ensure all Scores (outputs of the Linear Function) become positive numbers, and satisfies:
    - The combined probabilities must add to 1
    - Higher Scores should correspond to a higher proportion of the probability
Note: We cannot just divide each score by the sum of all scores to get each percentage, since it is possible to get Negative Scores (we're using a Linear Function that may give negative values), and the denominator may become 0. Instead we need to convert all scores into positive scores using a function.

i.e.
    Given the quantity of classes: N
    Given a Linear Model whose Linear Function outputs Scores for each of the N classes: Z1, ..., Zn
    Convert the Scores into Probabilities by saying the Probability that the object is in class i is:
        P(class i) = e^Zi / (e^Z1 + ... + e^Zn)
- Softmax Function
-
i.e. equivalent to Sigmoid Function but for when problem has 3+ classes
- 2 classes - apply Sigmoid Function to Scores to get Probabilities
- 3+ classes - apply Softmax Function to Scores to get Probabilities
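A minimal softmax sketch that reproduces the duck/beaver/walrus example above (Scores 2, 1, 0):
```
import numpy as np

def softmax(scores):
    # P(class i) = e^Zi / (e^Z1 + ... + e^Zn)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

print(softmax(np.array([2.0, 1.0, 0.0])))  # -> approx [0.67, 0.24, 0.09]
```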
-
-
One-Hot Encoding
- Convert inputs to numbers
Gift?
-----------------------
True  -> 1
False -> 0
- With multiple classes:
Animal       Value
-----------------------
Duck      ->   ?
Beaver    ->   ?
Duck      ->   ?
Walrus    ->   ?
Beaver    ->   ?

- Cannot use 1, 2, 3, etc, since that assumes dependencies between classes that we cannot have
- Instead use "One-Hot Encoding" by creating a Variable for each Class (no unnecessary dependencies)
  i.e. IF the Input is Duck THEN Duck is 1 AND Beaver is 0 AND Walrus is 0

    Animal      Duck?    Beaver?    Walrus?
    ---------------------------------------------
    Duck        1        0          0
    Beaver      0        1          0
    Duck        1        0          0
    Walrus      0        0          1
    Beaver      0        1          0
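In practice this one-hot encoding can be produced with pandas; a minimal sketch (the column name 'Animal' is an assumption):
```
import pandas as pd

df = pd.DataFrame({'Animal': ['Duck', 'Beaver', 'Duck', 'Walrus', 'Beaver']})
one_hot = pd.get_dummies(df['Animal'])  # one column per class, 1/0 entries
print(one_hot)
```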
-
Maximum Likelihood (Probability) (and Cross Entropy)
- Avoid taking the PRODUCT
- Instead take the SUM by using LOGS since
Logarithms have identity whereby
log(ab) = log(a) + log(b)
- Use Probability to evaluate and improve our Models
-
Pick the Model that gives Existing Labels the Highest Probability, so by Maximising Probability we pick the Best possible Model
-
Minimise Cross Entropy == Maximise Probability
- Cross Entropy (Error Function) informs us if the Model
is Good or bad since it returns the Errors at each point
(by taking sum of Negatives of the Logarithm of each Probability)
- Good Model == Low Cross Entropy == Likely that events happened with given probabilities
-
Bad Model == High Cross Entropy == Unlikely that events happened with given probabilities
- Negatives of the Logarithms == Errors at each Point
- where “Correctly Classified Points” have Small Errors
- where “Misclassified Points” have Large Errors, such that Cross Entropy informs us if Model is Good or Bad
Given 2x Models with just one point each
    where Model #1 output Probability is 80% (of "win")
    where Model #2 output Probability is 55% (of "lose")
The Best Model has the Higher Probability when we Actually "win"
The Best Model has the Lower Probability when we Actually "lose"
Given 2x Models with four points each
Find out which Model is Good and which is Bad by:
- Calculating the Probability of each point being the colour it is, according to the Model
- Multiplying the Probabilities of all the points to get the Model Arrangement Probability
    Model #1 = P(p1_blue is blue) * P(p2_blue is blue) * P(p3_red is red) * P(p4_red is red)
             = 0.6 * 0.2 * 0.1 * 0.7
             = 0.0084
    Model #2 = P(p1_blue is blue) * P(p2_blue is blue) * P(p3_red is red) * P(p4_red is red)
             = 0.7 * 0.9 * 0.8 * 0.6
             = 0.3024
- Take Logs and Sum using the Logarithmic identity
    Model #1 = - log(0.6) - log(0.2) - log(0.1) - log(0.7)
             =   0.51 + 1.61 + 2.3 + 0.36
             =   4.8
    Model #2 = - log(0.7) - log(0.9) - log(0.8) - log(0.6)
             =   0.36 + 0.1 + 0.22 + 0.51
             =   1.2
- Take the natural Logarithm (base e, instead of base 10) by convention. Note that taking the Log of a value between 0 and 1 is always a Negative number, since Log(1) = 0; so if we take the Negative of the Logarithm of each Probability, it returns Positive numbers.
- The Best Model has the Highest Model Arrangement Probability
    - Good Model - Low Cross Entropy
    - Bad Model - High Cross Entropy

Note: Given the calculated probabilities for two models, if we take the sum of the negative logarithm of each, and pair each logarithm with the point it came from, we get a value for each point. The Misclassified Points have Large values, whereas Correctly Classified Points have Small values, since a Correctly Classified Point has a Probability close to 1, whose Logarithm is a small value. So we can think of the "Negatives of the Logarithms as Errors at each Point", where "Correctly Classified Points" have Small Errors and "Misclassified Points" have Large Errors, such that Cross Entropy informs us whether the Model is Good or Bad.

Calculate the Model for the Error Function: https://www.youtube.com/watch?v=nV1W7oQOlkU

Case #1 (Point is "blue")
    If y == 1 (the point is "blue"/Accept to begin with)
        P(blue) = ^y   (prediction y hat, where a "blue" point in the "blue" area has a higher probability than a point in the "red" area)
        Error = -ln(^y)
Case #2 (Point is "red")
    If y == 0 (the point is "red"/Reject to begin with)
        P(red) = 1 - P(blue) = 1 - ^y   (prediction y hat, where a "red" point in the "red" area has a higher probability than a point in the "blue" area)
        Error = -ln(1 - ^y)
Summarise both Cases:
    Error = - (1 - y) * ln(1 - ^y) - y * ln(^y)
    (i.e. if y == 1 then the Error is -ln(^y) )
    (i.e. if y == 0 then the Error is -ln(1 - ^y) )

So, ######## the Error Function for Binary Classification Problems ######## is:

    Error Function = Average of the Error Functions of all the Points
                   = - (1/m) * ( (m ∑ j=1) (1 - yj) * ln(1 - ^yj) + yj * ln(^yj) )
                   = (1/4) * 4.8
                   = 1.2          (for Model #1 above)

    Error Function = E(W, b)
                   = - (1/m) * ( (m ∑ j=1) (1 - yj) * ln(1 - σ(Wxj + b)) + yj * ln(σ(Wxj + b)) )

    where ^y (the prediction of the model) is given by the Sigmoid of the Linear Function (Wx + b), so the formula is in terms of W and b
    where yj is the Label

Note: By convention we take the Average, not the Sum
Goal: Minimise the Error Function
Given 2x Models with n points each
- Repeat the above
PROBLEM:
- We would be multiplying n values between 0 and 1 together, which results in a very small number (BAD).
- If one of the values changes, the product output would change drastically
SOLUTION:
- AVOID USING A FUNCTION THAT TAKES THE "PRODUCT"
- USE A FUNCTION THAT TAKES THE "SUM" INSTEAD, BY TAKING "LOGS", WHERE
    log(ab) = log(a) + log(b)
    Model #1 - log(n1) + log(n2) + ... + log(n) = ?
    Model #2 - log(n1) + log(n2) + ... + log(n) = ?
- Best Model has Higher Probability and classifies the Most points Correctly https://www.youtube.com/watch?time_continue=188&v=6nUUeQ9AeUA
- Minimising Error Function results in Best solution
-
Maximum Probability == Minimum Error Function
-
Cross Entropy Example
Given 3x doors with the probability of a Gift behind them

                 Green door    Red door    Blue door
    ---------------------------------------------
    P(gift)      0.8 @         0.7 @       0.1
    P(no gift)   0.2           0.3         0.9 @

Problem:
- Find the scenario with the highest probability, assuming independent events.
Solution:
- Take the Largest probability from each column (indicated with the @ symbol)
- The whole arrangement probability is the Product of the 3x Largest probabilities
    P = 0.8 * 0.7 * 0.9 = 0.504 = 50 %

Now, let's look at all possible scenarios. Since there are 3 doors, each with 2 possibilities, we have 2^3 scenarios (G = gift behind that door, N = no gift, using the complement probability):

    Green door   Red door   Blue door   Probability   Cross Entropy (i.e. -ln(Probability))
    ----------------------------------------------------------------------------------------
    G 0.8        G 0.7      G 0.1       0.056         2.88
    G 0.8        G 0.7      N 0.9       0.504         0.69
    G 0.8        N 0.3      G 0.1       0.024         3.73
    N 0.2        G 0.7      G 0.1       0.014         4.27
    G 0.8        N 0.3      N 0.9       0.216         1.53
    N 0.2        G 0.7      N 0.9       0.126         2.07
    N 0.2        N 0.3      G 0.1       0.006         5.12
    N 0.2        N 0.3      N 0.9       0.054         2.92

Step 1: Obtain the Probability of each arrangement by multiplying the 3x independent probabilities. The Total sum of all arrangement probabilities adds to 1.
Step 2: Calculate the Cross Entropy, which is the negative of the logarithm of the probability, such that Events with High Probability have Low Cross Entropy (and Events with Low Probability have High Cross Entropy)
- Derive Cross Entropy formula
    Green door   Red door   Blue door   Cross Entropy (i.e. -ln(Probability))
    --------------------------------------------------------------------------
    G 0.8        G 0.7      N 0.9       -ln(0.8) - ln(0.7) - ln(0.9)
    (p1)         (p2)       (1 - p3)
    y1 = 1       y2 = 1     y3 = 0

    where p1 == 0.8 (prob. of a gift behind the Green door)
    where p2 == 0.7 (prob. of a gift behind the Red door)
    where p3 == 0.1 (prob. of a gift behind the Blue door)
    where yj = 1 if there is a present behind door j, else 0

    Cross-Entropy = - (m ∑ j=1) [ yj * ln(pj) + (1 - yj) * ln(1 - pj) ]

    i.e.
    CE[(1,1,0), (0.8,0.7,0.1)] = 0.69
        low, since the vector (1,1,0) is similar to (0.8,0.7,0.1), meaning that the arrangement of gifts (1,1,0) is likely to happen based on the probabilities given (0.8,0.7,0.1)
    CE[(0,0,1), (0.8,0.7,0.1)] = 5.12
        high, since the arrangement of gifts given by (0,0,1) is very unlikely given the probabilities in the second set of numbers (0.8,0.7,0.1)
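A minimal Python sketch of this cross-entropy formula, checked against the two examples above:
```
import numpy as np

def cross_entropy(Y, P):
    # CE = - sum( y * ln(p) + (1 - y) * ln(1 - p) )
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

print(cross_entropy([1, 1, 0], [0.8, 0.7, 0.1]))  # ~0.69 (likely arrangement)
print(cross_entropy([0, 0, 1], [0.8, 0.7, 0.1]))  # ~5.12 (unlikely arrangement)
```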
-
Multi-Class Cross Entropy
Given 3x doors with the probability of a Duck, Beaver, or Walrus behind each

    Animal       Door 1    Door 2    Door 3
    ---------------------------------------------
    P(duck)      0.7       0.3       0.1
    P(beaver)    0.2       0.4       0.5
    P(walrus)    0.1       0.3       0.4

    Note: the numbers in each column must add to 1

Given Scenario #1:
    Door 1    Door 2    Door 3
    ------------------------------
    Duck      Walrus    Walrus

    P  = 0.7 * 0.3 * 0.4 = 0.084
    CE = -ln(0.7) + -ln(0.3) + -ln(0.4) = 2.48

    The Probability of Scenario #1 is the product of the probabilities of each independent event

Given all Scenarios:
    Animal       Door 1    Door 2    Door 3
    ---------------------------------------------
    P(duck)      p11       p12       p13
    P(beaver)    p21       p22       p23
    P(walrus)    p31       p32       p33

    y1j = 1 if a Duck is behind Door j
    y2j = 1 if a Beaver is behind Door j
    y3j = 1 if a Walrus is behind Door j
    otherwise y1j, y2j, y3j are 0

    Cross-Entropy = - (n ∑ i=1) (m ∑ j=1) yij * ln(pij)

    ######## Error Function for Multi-Classification Problems ########
-
Logistic Regression Algorithm
- About
```
- Obtain Input data
- Random Model selected
- Error calculated
- Error minimised to obtain a better model
```
-
Error Function Calculation
https://www.youtube.com/watch?v=nV1W7oQOlkU
- Recall using Cross-Entropy to calculate the Best Model (when given 2x models) https://www.youtube.com/watch?v=njq6bYrPqSU
-
Gradient Descent
- Dfn: Looks for Direction we will descent the most (reduce the Error the most) and takes a Step in that Direction
https://www.youtube.com/watch?v=26_dnS0r2jc
-
The Error Function is E(W, b), a function of the Weights and Bias
- Graphed as a surface ("Curved Plane"):
    x-axis: w1 (Input to the function)
    y-axis: Error E (height)
    z-axis: w2 (Input to the function)
- Imagine you are at a Point on the Curved Plane E. The Gradient ∇E is the Vector formed by the Partial Derivatives of the Error Function E with respect to the Weights w1 and w2 and the Bias b (i.e. ∂E/∂w1, ∂E/∂w2, ∂E/∂b), which gives us the Direction to move in to increase the Error Function the Most. So if we take the Negative of the Gradient, -∇E, then we can Decrease the Error Function the Most, which is what we'll do! We take a Step in the Direction given by the Negative of the Gradient at a Point on the Curved Plane representing the Error Function, which takes us to a Lower Point on the Curved Plane. Repeat this until we get to the Lowest Point on the Curved Plane.
-
Calculate the Gradient
1. Initial "Bad" Prediction (since we are high up on Curved Plane's "Error" axis): ^y = σ(Wx + b) <--- BAD ^y = σ(w1 * x1 + ... + wn * xn + b) Error Function of Initial "Prediction" is: E = W(x + b) "Gradient" of Error Function is: ∇E = (∂E/∂w1, ... , ∂E/∂wn, ∂E/∂b) "Learning Rate" Alpha set to low value to avoid making dramatic changes by using Small α = 0.1 "Take a step" in Negative direction of the Gradient multiplied by Alpha Note: where "Taking a step" is same as Updating the Weights and Bias as follows (i.e. ∂E/∂wi means Partial Derivative of the Error with respect to wi) wi' <-- wi - α * ∂E/∂wi b' <-- b - α * ∂E/∂b This will take us to a Point E(W', b') with Better "Prediction" that has a Lower Error Function, with Weights W' and Bias b': ^y = σ(W'x + b') <--- BETTER
-
Gradient Calculation https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/6124bd95-dec2-44f9-bf3b-498ea57699c7/lessons/47f6c25c-7749-4a02-b807-7a5b37f362e8/concepts/0d92455b-2fa0-4eb8-ae5d-07c7834b8a56
Gradient == Scalar * Point Coords
    where Scalar == Label - Prediction
    where the Gradient is Small when the Point is "Well Classified" (the Label is close to the Prediction), so we change our coordinates (the Weights) a Little
    whereas the Gradient is Large when the Point is "Poorly Classified" (the Label is far from the Prediction), so we change our coordinates (the Weights) a Lot

Note: Similar to the Perceptron Algorithm
-
Gradient Descent Algorithm Pseudocode https://www.youtube.com/watch?v=I-l32oR5iMM https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/6124bd95-dec2-44f9-bf3b-498ea57699c7/lessons/47f6c25c-7749-4a02-b807-7a5b37f362e8/concepts/ca6eff40-a3e2-4d53-85f4-d2454b538d87 Lesson 2 Part 25 https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/6124bd95-dec2-44f9-bf3b-498ea57699c7/lessons/47f6c25c-7749-4a02-b807-7a5b37f362e8/concepts/5e9bd75b-a419-45d4-8a2b-88ba847cc814
Step 1: Start with random Weights, giving us a Line:
            w1, ... , wn, b
            ^y = σ(Wx + b)
Step 2: Calculate the Error for every plotted Point, where:
            the Error is Large for Misclassified Points
            the Error is Small for Correctly Classified Points
Step 3: For every Point with Coordinates x1, ... , xn:
            For i = 1 ... n
                Update wi by Subtracting the Learning Rate α times the Partial Derivative of the Error Function with respect to wi
                    wi' <--- wi - α * ∂E/∂wi
                             (which works out to wi + α * (y - ^y) * xi)
                Update b similarly
                    b'  <--- b - α * ∂E/∂b
                             (which works out to b + α * (y - ^y))
        This gives us new Weights and Bias
Step 4: Update the Weights and Bias to give us a New Line:
            w1', ... , wn', b'
            ^y' = σ(W'x + b')
Step 5: Repeat the process for a set number of Epochs, or until the Error is Small
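A minimal NumPy sketch of the Step 3 update for a single point, assuming the sigmoid prediction and cross-entropy error above:
```
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_weights(x, y, W, b, learn_rate=0.1):
    """One gradient-descent update for a single point (x, y)."""
    y_hat = sigmoid(np.dot(W, x) + b)
    # -dE/dwi = (y - y_hat) * xi   and   -dE/db = (y - y_hat)
    W = W + learn_rate * (y - y_hat) * x
    b = b + learn_rate * (y - y_hat)
    return W, b
```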
-
Perceptron Algorithm vs Gradient Descent Algorithm
-
Gradient Descent
    - ^y may take ANY value between 0 and 1
    - Update rule: wi' <--- wi + α * (y - ^y) * xi
    - If the Point is Incorrectly Classified:
        - Tells the Line to come closer (on each step)
    - If the Point is Correctly Classified:
        - CHANGES THE WEIGHTS anyway (i.e. the Point tells the Line to go farther away, since if the Point is already in the Correct region, we still want the Prediction to get closer and closer to 1, reducing the Error further)
-
Perceptron Algorithm
    - ^y may ONLY take the value 0 or 1, for Incorrectly or Correctly classified
    - If the Point is Correctly Classified:
        - DO NOTHING
-
-
Common
- Both Gradient Descent and Perceptron Algorithm
- Misclassified Point tells Line to come closer (trying to get Point on the correct side)
-
-
Non-Linear Data (Neural Networks)
-
Linear - Data sets separable by a Line
-
Non-Linear - Complex Data Sets with highly Non-Linear Boundaries (not separable by a Line, since the boundary is a Non-Linear Equation), so Neural Networks are used instead
- Create Probability Function where:
- Points in Blue region more likely to be Blue,
- Points in Red region are more likely to be Red
- Points on LINE are equally likely to be Blue or Red
-
-
Neural Networks (aka Multi-Layer Perceptrons, Neural Network Architecture)
https://www.youtube.com/watch?v=Boy3zHVrWB4
-
Combine Two Perceptrons into Third (more complex one)
Use the Probability Function (Sigmoid) for every Point on the Plane (so the resulting probability is always between 0 and 1)

Step 1:
    - A = Calc the probability for Point K on Model #1
    - B = Calc the probability for Point K on Model #2
Step 2: (neural network, similar to perceptrons)
    - C = W1 * A + W2 * B - BIAS
        where W1 is Weight 1
        where W2 is Weight 2
Step 3: (i.e. create the resulting curved, non-linear model between both linear models)
    - Sigmoid(C)
-
Combine Two "Linear" Models, superimposed, into a "Non-Linear" Model
-
i.e.
Linear Line + Linear Line = Non-Linear Line (i.e. Curve)
-
- Note:
- Linear Model is a Probability Space that gives Probability of Point being Blue (in a Region)
-
Combining the Probability of a Point across the Two Perceptrons' (P1, P2) Linear Model Probability Spaces, i.e.
    P1 Point probability (of Blue): 0.7
    P2 Point probability (of Blue): 0.8
-
Apply the Sigmoid Function to the Sum of the Point Probability across Two Perceptrons
Sigmoid ( 0.7 + 0.8 ) = Resulting Probability of Blue
-
- Weight the Sum
- i.e. if want Model #1 to have more say in resulting probability
-
Neural Network, different Bias Diagrams
- https://www.youtube.com/watch?v=au-Wxkr_skM
- Weights on Left describe Equations of the Linear Models (i.e. Cx + Dy + E = 0)
- Weights on Right describe Linear Combination of the Two Models to obtain resulting Curved Non-Linear Model (Non-Linear Boundary defined by Neural Network)
-
Neural Network Multiple Layers Architecture
-
https://www.youtube.com/watch?v=pg99FkXYK0M
- Layers
- Input Layer
-
Inputs (i.e. x1, x2)
-
Note: when x1, x2, x3 then we’re in 3D space, where Hidden Layers are Planes (instead of Lines) and Output Layer is Non-Linear Region in free space
-
Note: when N Input Layers, we are in N-Dimensional space
-
- Hidden Layer
-
Set of Linear Models created using the first Input Layer (when only 2 Nodes i.e. Red/Blue)
-
Note: when Hidden Layer has 3+ Nodes, then we have a Multi-Class Classification Model and Output Layer is produced for each of the Classes (i.e. Cat, Dog, Bird)
-
Note: when Multiple Hidden Layers we have a Deep Neural Network where Linear Models combine to create Non-Linear Models, and these resulting Non-Linear Models combine to create more Non-Linear Models
Deep Neural Network splits the N-Dimensional space in the Output Layer to have a highly Non-Linear Boundary (i.e. wiggly line)
-
- Output Layer
- Creates a Non-Linear Model from Combining N Linear Models from Hidden Layer (i.e. combine 3x Linear Models to create Triangular Boundary in Output Layer)
-
-
Multi-Class Classification vs Binary Classification (with Deep Neural Networks)
https://www.youtube.com/watch?v=uNTtvxwfox0
-
If have multiple classes and want Model to predict if we have a Duck, Beaver, or Walrus
- APPROACH #1
- Create 3x Neural Networks, one for each Class
Softmax( NN (Duck) + NN (Beaver) + NN (Walrus) )
- APPROACH #2 (BETTER)
-
Create 1x Neural Network, with more Nodes in the Output Layer (where each Output Layer Node gives the probab. that the image is each of the animals)
-
Take the Scores and apply the Softmax Function to obtain Well-defined probabilities
-
-
-
Feedforward (Process of Neural Network)
https://www.youtube.com/watch?v=Ioe3bwMgjAM
-
Feedforward Process used by Neural Networks to turn Input into Output:
-
Neural Network (i.e. Perceptron is simplest NN)
Inputs
    Input Data Point  x = (x1, x2)
    Label             y = 1 (means the Point is Blue)
Linear Equation of the Perceptron Boundary Line in 2D space
    w1 * x1 + w2 * x2 + b = 0
        where w1, w2 are the Weights (on the Edges)
        where b is the Bias (on the Node)
    w1 and w2 are drawn as connecting lines between the Inputs and the Linear Model, with a greater Thickness of connecting line for higher Weight values (see 0:56 of the video)
The Perceptron then Plots the Point (x1, x2) and outputs the Probability that the point is Blue (i.e. if a Blue Point is in the Red Area then the Output Probability is a Small number, since the Point is Not likely to be Blue)
-
Use Matrix Multiplication of Weights for Non-Linear mapping of Complex Neural Networks including Multi-Layer (see 2:42, 3:50) https://www.youtube.com/watch?v=Ioe3bwMgjAM
i.e. the output prediction from the input Vector:
    ^y = σ ∘ W(3) ∘ σ ∘ W(2) ∘ σ ∘ W(1)(x)
    where ∘ is the Composition of each Weights Matrix with the Sigmoid Function
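A minimal NumPy feedforward sketch in this composed form for one hidden layer; the layer sizes and random weights are illustrative assumptions:
```
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward(x, W1, b1, W2, b2):
    # Hidden layer: sigmoid applied to the first linear model W(1)x + b(1)
    hidden = sigmoid(np.matmul(W1, x) + b1)
    # Output layer: sigmoid applied to the linear combination of the hidden outputs
    return sigmoid(np.matmul(W2, hidden) + b2)

# Assumed shapes: 2 inputs -> 3 hidden nodes -> 1 output
W1 = np.random.randn(3, 2)
b1 = np.zeros(3)
W2 = np.random.randn(1, 3)
b2 = np.zeros(1)

print(feedforward(np.array([0.4, 0.6]), W1, b1, W2, b2))  # probability between 0 and 1
```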
-
Error Function
https://www.youtube.com/watch?v=VAX9Di9cjzE
-
Training Neural Networks
-
Backpropagation Method
- Dfn / Steps:
- Feedforward operation https://www.youtube.com/watch?v=1SmY3TZTyUk
- Compare Model Output with
Desired Model Output
- i.e. Bad Neural Network - if predicts Point is Red when it’s actually Blue
- Calculate Error
- Backpropagation
(run the Feedforward backwards) to spread
Error to each of the Weights
-
Backpropagation after Feedforward (3:08) to Train Neural Networks https://www.youtube.com/watch?v=1SmY3TZTyUk
-
Ask Point what it wants to do to be “Better” classified (i.e. if Misclassified Point will ask Boundary Line of Blue Region to come closer to the Point)
-
Listen to the “Better” Model in the Hidden Layer MORE (by INCREASING the Weight (Thickness) of connecting line between Hidden Layer and Output Layer) than we Listen to the “Bad” Model in the Hidden Layer (DECREASE Weight of its connecting line), so that Final Model looks more like the “Better” Model in the Hidden Layer
-
THEN, go Back to the Linear Models in the Hidden Layer and for each Linear Model Ask Points what it wants to do to be “Better” classified (i.e. if Misclassified Point will ask Boundary Line of Blue Region to come closer to the Point, otherwise if Correctly Classified will ask Boundary Line to move away from Point)
- This will update the Weights (thickness) of connecting line between Input Layer and Hidden Layer
-
Now we have “Better” Predictions of all the Models in the Hidden Layer, and for the Model in the Output Layer
-
-
- Update Weights with output of
Backpropagation for “Better” Model
- Point says what wants Model to do:
- Move Boundary Line closer to Point
(if Misclassified),
- Boundary Line moves closer by updating its Weights (thereby defining a new line)
- Move Boundary Line away from Point (if Correctly classified)
- Calculate the Error Function E(W) and step in the direction of the Negative of the Gradient -∇E, to a new Model W' with New Model Error E(W') and a "Better" Prediction
    - Note: the Error reduces from E(W) to E(W') after taking the Gradient Descent step down the Negative of the Gradient -∇E, such that the new Boundary Line is closer to the Point
- Repeat until a "Good" Model, successively minimising the error
- Dfn / Steps:
- Backpropagation (DETAILED MATHS covered by Keras)
-
Multi-Layer Perceptron, Multi-Layered Perceptron, Feedforward https://www.youtube.com/watch?v=tVuZDbUrzzI https://www.youtube.com/watch?v=YAhIBOnbt54 https://www.youtube.com/watch?v=0EoRxu3EeGM
-
Chain Rule https://www.youtube.com/watch?v=0EoRxu3EeGM
Given x, A = f(x), B = g ∘ f(x)
The Partial Derivatives are given by:
    ∂B/∂x = (∂B/∂A) * (∂A/∂x)
(when composing functions, the derivatives multiply)
- Feedforwarding - composing various Functions
- Backpropagation - the reverse of Feedforward, calculating the Derivative of the complex Composition (the Error Function) with respect to each of the Weights in each of the Layers using the Chain Rule
-
- Neural Networks in Keras
- Packages for Neural Networks, activation function, gradient descent
- Keras https://keras.io/
- TensorFlow https://www.tensorflow.org/
- Caffe http://caffe.berkeleyvision.org/
- Theano http://deeplearning.net/software/theano/
- Scikit-Learn http://scikit-learn.org/stable/
- Keras
- Project - Build a Fully Connected
(Multi-layer) Feedforward Neural Network
to solve the XOR problem
- Steps:
- Load the data
- Define the network
- Train the network
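A minimal Keras sketch of those three steps for XOR; the particular architecture (8 tanh hidden units, sigmoid output, adam optimiser) is an assumption, not necessarily the course notebook's solution:
```
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# 1. Load the data (the XOR truth table)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

# 2. Define the network: 2 inputs -> hidden layer -> 1 sigmoid output
model = Sequential()
model.add(Dense(8, input_dim=2, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# 3. Train the network
model.fit(X, y, epochs=1000, verbose=0)
print(model.predict(X).round())  # expect approximately [[0], [1], [1], [0]]
```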
- Project - Student Admissions
- https://github.com/ltfschoen/aind2-dl
- Packages for Neural Networks, activation function, gradient descent
- Training Model Optimisation
- Failure points
- Poor architecture chosen
- Noisy data
- Model takes too long to run
- Batch vs Stochastic Gradient Descent
- Recap:
- Gradient Descent Algorithm:
-
Reduce height (error function) by taking steps (aka Epochs) following the negative of the gradient of the height
-
Epoch https://www.youtube.com/watch?v=2p58rVgqsgo
- Each step taken to reduce the error function in the gradient descent algorithm
-
At each Epoch, we take all the input data, and run it through the entire Neural Network, then find predictions, calculate the error (i.e. how far from actual labels), then back propagate the error in order to update the weights in the Neural Network, to improve the boundary for predicting all our data
-
Since we perform the above steps at each Epoch for all the data points, these are huge computational steps using lots of memory. Since we don’t need to plugin all our data at every Epoch, when data is well distributed, a small subset of the data would at least give a good idea what the gradient would be, and much quicker!
- Split the data into several Batches (i.e. given 24 points, split into 4x batches of 6 points each)
- Run Batch #1 points through Neural Network
- Calculate error and gradient, and back propagate to update with better weights that define a better boundary region
-
Repeat for subsequent Batches
- Keras
model.fit(X_train, y_train, epochs=1000, batch_size=100, verbose=0)
-
-
- Learning Rate Decay
- High Learning Rate + Large Steps may result in missing the bottom and coming up again
- Low Learning Rate + Small Steps gives a higher chance of arriving at the local minimum
- If Model is not working, then Lower the Learning Rate
- Ideally the Learning Rate would Decrease as the Model gets closer to the solution
- Note:
- If “steep”, take long steps
- If “plain”, take small steps
- Training and Testing sets to Find “Better” Model
-
Given blue and red points from a data set that are plotted on two classification models, each with a boundary that separates the blue points from the red points. Instead of choosing the "better" model based on a combination of observations (i.e. boundary curve complexity vs fewer mistakes in classifying points), we use Training and Testing sets.
- STEP 1
- Differentiate Training and Testing sets
- Training Set are Solid Coloured Points
- Testing Set are Hollow Coloured Points (white inside)
- Differentiate Training and Testing sets
- STEP 2
- Train each model with the Training Set (without looking at the Testing Set) to obtain updated boundary
- STEP 3
- Evaluate results by reintroducing the Testing Set to see how we did
- Check how many mistakes each model made with the Testing Set now reintroduced
- Note: Follow our intuition, when comparing which is better between a simpler model and a complex model, always go for the simpler model
-
-
Overfitting vs Underfitting (Types of Errors)
https://youtu.be/SVqEgaT1lXU?t=4m15s
- Overfitting (Too Specific) (aka Error due to Variance)
-
Dfn: Overcomplicating the problem and using a solution that is too excessive (i.e. kill fly with bazooka), may lead to bad solutions and extra complexity (when simpler solution possible)
- Example 1
- Given some data to be classified (i.e. different objects and animals)
-
If the groups we choose are too granular (i.e. "Dogs that are orange or gray" vs "Anything that is orange or gray except dogs") it is Too Complex
- Identify Issues:
- Introduce a Testing Set to see if introduced data is classified correctly or not
- It may fit the data well, but may not generalise correctly
- Example 2
- i.e. memorising textbook instead of studying word by word, so may be able to regurgitate, but not be able to generalise properly to questions
-
- Good Model Fit
-
Dfn: Generalises better than Overfitting. Err on the side of Overly Complex models and apply certain Techniques to prevent Overfitting (like using a belt to fit into pants that are too big). This is how we find the right Architecture for a Neural Network
-
Example 2
- i.e. like study well and good result in exam
-
-
Underfitting (Too Simple) (aka Error due to High Bias)
-
Dfn: Oversimplifying the problem and using a solution that is too simple to do the job (i.e. trying to kill Giant with fly swatter)
- Example 1
- Given some data to be classified (i.e. different objects and animals)
-
If the groups we choose are too abstract (i.e. Animals vs No Animals) it is Too Simple
- Identify Issues:
- Misclassify objects
- Example 2
- i.e. not studying enough and failing
-
- Early Stopping Algorithm
-
Given an Overly complex Neural Network architecture, more complex than what we need (according to the Model Complexity Graph)
- After Training Evaluate each Model by introducing a Testing Set
- Introduce a set of Training Set points to each model plot
- Plot the:
- Error vs Epoch for the Training Set
- Error vs Epoch for the Testing Set
- Training and Testing results
- Epoch 1
- Training
- Use random weights
- Underfitting so makes many mistakes
- Testing
- Badly misclassifies both Training and Testing Sets
so both:
- Large Training Error
- Large Testing Error
- Badly misclassifies both Training and Testing Sets
so both:
- Training
- Epochs 20
- Training
- Good model
- Testing
- Small Training Error
- Small Testing Error
- Training
- Epochs 100
- Training
- Fits data better
- Overfit data
- Training error decreases with higher Epochs but Testing error increases (due to misclassification)
- Testing
- Tiny Training Error
- Medium Testing Error
- Training
- Epochs 600
- Training
- Fits the "training data" well but Generalises badly
- Heavily Overfits
- Problem If introduce say a blue point in blue region it may be misclassified as red unless its very close to existing blue point
- Testing
- Tiny Training Error
- Large Testing Error
- Training
- Epoch 1
- Goldilocks Spot and Model Complexity Graph ( Error vs Model Complexity(qty epochs) )
-
https://youtu.be/NnS0FJyVcDQ?t=2m35s
-
Use Model Complexity Graph to identify ideal Model Complexity of ideal quantity of Epochs to use
-
Approach:
- Use Gradient Descent until the Testing Error stops decreasing and instead starts increasing, and then Stop. This is the Early Stopping technique used when Training the Neural Network
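In Keras, Early Stopping is available as a callback; a minimal sketch assuming a compiled `model` and `X_train`/`y_train` as in the earlier examples (the patience value and validation split are assumptions):
```
from keras.callbacks import EarlyStopping

# Stop training once the validation (testing) error stops improving
early_stop = EarlyStopping(monitor='val_loss', patience=5)

model.fit(X_train, y_train,
          epochs=1000,
          validation_split=0.2,   # hold out part of the data as a testing set
          callbacks=[early_stop],
          verbose=0)
```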
-
-
-
Regularisation
https://youtu.be/aX_m9iyK3Ac?t=2m50s
Given two points P1 (-1,-1) and P2 (1,1), which of these equations gives the smaller error?
    Solution 1:   x1 +   x2 = 0
    Solution 2: 10x1 + 10x2 = 0   (a scalar multiple of Solution 1)
Both give the same line. The Prediction is the sigmoid of the linear function.

Using Solution 1 and substituting the Point Coordinates:
    substitute P2: ^y = σ(w1 * x1 + w2 * x2 + b) = σ(1 + 1)    = 0.88
    substitute P1: ^y = σ(w1 * x1 + w2 * x2 + b) = σ(-1 - 1)   = 0.12
Using Solution 2 and substituting the Point Coordinates:
    substitute P2: ^y = σ(w1 * x1 + w2 * x2 + b) = σ(10 + 10)  = 0.9999999979
        (great prediction, less error, but subtly Overfitting)
    substitute P1: ^y = σ(w1 * x1 + w2 * x2 + b) = σ(-10 - 10) = 0.0000000021
        (great prediction, less error, but subtly Overfitting)

Solution 2 is "Too Certain" and it will be too hard to tune the model to correct any misclassification errors, if any
- Gradient Descent is best used on Models that are NOT "Too Certain" (i.e. with lower scalar weights applied to x1, x2, etc, the sigmoid is not too steep; a too-steep sigmoid has derivatives close to 0 almost everywhere and very large only at the middle of the curve, which makes Gradient Descent difficult)
Since we now know that LARGE COEFFICIENTS ----> OVERFITTING
we need to tweak the Error Function to Punish High Coefficients (penalise Large Weights w1, ..., wn) by taking the Old Error Function and adding a term that is Big when the Weights are big, by either:

Note: the Lambda constant λ controls how much to Punish the Coefficients

* "L1 Regularisation" - Add the sum of the absolute values of the Weights, times the Lambda constant

    Error Function = Sum of all the Error Functions of all the Points
                   = - (1/m) * ( (m ∑ j=1) (1 - yj) * ln(1 - ^yj) + yj * ln(^yj) ) + λ (|w1| + ... + |wn|)

    Usage:
        * Usually results in sparse vectors where small weights tend toward 0, so use L1 to reduce the number of weights and end up with a small set. i.e. Sparsity: (1, 0, 0, 1, 0)
        * Also good for Feature Selection, by only selecting the most important features and turning the rest into 0's

* "L2 Regularisation" - Add the sum of the squares of the Weights, times the Lambda constant

    Error Function = Sum of all the Error Functions of all the Points
                   = - (1/m) * ( (m ∑ j=1) (1 - yj) * ln(1 - ^yj) + yj * ln(^yj) ) + λ (w1^2 + ... + wn^2)

    Usage:
        * Does not favour sparse vectors since it tries to keep all weights homogeneously small i.e. Sparsity: (0.5, 0.3, -0.2, 0.4, 0.1)
        * Better for Training Models

* Example
    Given the vector (1, 0) then:
        λ (|w1| + ... + |wn|)  = λ (1)
        λ (w1^2 + ... + wn^2)  = λ (1)
    Given the vector (0.5, 0.5) then:
        λ (|w1| + ... + |wn|)  = λ (1)
        λ (w1^2 + ... + wn^2)  = λ (0.25 + 0.25) = λ (0.5)
    So "L2 Regularisation" prefers the vector (0.5, 0.5) over the vector (1, 0), since (0.5, 0.5) produces a smaller sum of squares and in turn a "Smaller Error Function"
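In Keras, L1/L2 penalties can be attached to a layer via `kernel_regularizer`; a minimal sketch (the λ value of 0.01 and the layer sizes are assumptions):
```
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l1, l2

model = Sequential()
# L2-penalised hidden layer: adds lambda * sum(w^2) to the loss
model.add(Dense(16, input_dim=2, activation='relu', kernel_regularizer=l2(0.01)))
# L1-penalised output layer: adds lambda * sum(|w|) to the loss
model.add(Dense(1, activation='sigmoid', kernel_regularizer=l1(0.01)))
model.compile(loss='binary_crossentropy', optimizer='adam')
```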
-
BertrAIND Russell Quote
“The whole problem with AI is that bad models are so certain of themselves, and good models so full of doubts”
-
Dropout Method (used when Training Neural Networks)
-
Dfn: Sometimes one part of the Neural Network has Large Weights and dominates the Training, whilst other parts don't train as much, so as mitigation we turn the dominant part off sometimes
-
Approach:
-
Randomly turn off some Nodes in the Neural Network as we go through the Epochs (i.e. feed forward and backpropagation passes) to force the remaining Nodes to pick up the slack and get involved in the training (so no single Node dominates)
-
A Parameter is passed to the Algorithm specifying the Probability that each Node gets dropped at a particular Epoch, i.e. if P = 0.2 then for each Epoch each Node gets turned off with probability 20%
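- A minimal Keras sketch of the Dropout technique (hedged: the 0.2 drop probability and layer sizes are illustrative):
  from keras.layers import Dense, Dropout
  from keras.models import Sequential

  model = Sequential()
  model.add(Dense(128, activation='relu', input_shape=(784,)))
  # During training, each Node in the previous layer is turned off with probability 0.2 per pass
  model.add(Dropout(0.2))
  model.add(Dense(10, activation='softmax'))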
-
-
-
Vanishing Gradient Issue
-
Problem
https://www.youtube.com/watch?v=W_JJm_5syFw
- Calculating the Derivative
σ(x) = 1 / (1 + e^-x)
of the Sigmoid function at a point far to the left or far to the right results in a value very close to 0. But the Derivative is what tells us the Direction to Move. This issue is worse in a Multi-Layer Perceptron since the Derivative of the Error Function with respect to a Weight equals the Product of all the Derivatives calculated at the Nodes in the corresponding path to the output. All those Derivatives are Derivatives of a Sigmoid Function (so they are small), and so their Product is tiny, which makes Training difficult: Gradient Descent gives us very tiny changes to make to the Weights, so we take very small Steps and it takes forever to reduce the Error
- Calculating the Derivative
-
Solution 1: Change the Activation Function
-
Tanh - Hyperbolic Tangent Function
-
https://www.youtube.com/watch?v=VzGOR5SlFSw
tanh(x) = (e^x - e^-x) / (e^x + e^-x)
- Similar to Sigmoid but range is -1 to 1 so Derivatives are Larger, which leads to great advances in Neural Networks
-
-
ReLU Rectified Linear Unit
*
relu(x) = { x if x >= 0
          { 0 if x < 0
- Simple piecewise-linear function:
  - if Positive, return the same value (the Derivative is 1)
  - if Negative, return zero (the Derivative is 0)
-
Used instead of Sigmoid since improves Training significantly without sacrificing much accuracy since the Derivatives will not be as small and allows us to perform Gradient Descent
-
Note that the Final Activation Function in a Multi-Layer Perceptron used for classification must output values between 0 and 1 (i.e. a Sigmoid)
- If the Final Activation Function is a ReLU then we may end up with Regression Models that Predict a value (used in Recurrent Neural Networks)
- Simple function between -1 and 1:
-
-
-
-
-
Gradient Descent
-
Problem
- Getting stuck in Local Minima during Gradient Descent
-
Solutions
- Random Restart (used by Gradient Descent Algorithms)
- Prevent
- Start from a few Random places and use Gradient Descent from all of them to increase the Probability we’ll reach a good Global Minimum
-
Momentum (used by Gradient Descent Algorithms)
https://www.youtube.com/watch?v=r-rYz_PEWC8
-
Move with Momentum so that when we reach a Local Minimum we power through it and over the next hump; otherwise, at a Local Minimum, the Gradient is too small to Step out of it and over the next hump.
-
Make the next Step in the Local Minima be the Average of the Previous Steps and where the most recent previous step matters more (and is given a higher Weight) than older steps.
https://youtu.be/r-rYz_PEWC8?t=1m36s
- Beta β (Momentum) is a constant between 0 and 1 that attaches to the Steps
STEP(n) --> STEP(n) + β * STEP(n-1) + β^2 * STEP(n-2) + ...
-
- Random Restart (used by Gradient Descent Algorithms)
-
-
Optimisers in Keras (used as Arguments when compiling Keras Models to optimise their performance)
- Links
- https://keras.io/optimizers/
- http://sebastianruder.com/optimizing-gradient-descent/index.html#rmsprop
-
SGD This is Stochastic Gradient Descent. It uses the following parameters:
- Learning rate
- Momentum (takes weighted average of the previous steps, in order to get a bit of momentum and go over bumps, as a way to not get stuck in local minima).
- Nesterov Momentum (this slows down the gradient when it's close to the solution).
- Adam
- Adam (Adaptive Moment Estimation) uses a more complicated exponential decay that consists of not just considering the average (first moment), but also the variance (second moment) of the previous steps.
- RMSProp
- RMSProp (RMS stands for Root Mean Squared Error) decreases the learning rate by dividing it by an exponentially decaying average of squared gradients.
- Links
-
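- A minimal sketch of choosing one of these optimisers when compiling a Keras model (hedged: the hyperparameter values are illustrative and `model` is assumed to be an already-defined network):
  from keras.optimizers import SGD
  # Stochastic Gradient Descent with momentum and Nesterov momentum enabled
  sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)
  model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
  # Alternatively, pass a built-in optimiser by name to use its defaults, e.g. 'rmsprop' or 'adam'
  model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])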
Keras Lab
- Goal: Build Neural Network to analyse real data that consists of thousands of movie reviews from IMDB, and the challenge is to Predict the Sentiment Analysis of a review
- Convert inputs to numbers
- Example - Error Function and Gradient Descent
- Dfn:
Chapter 2 - Convolutional Neural Networks (CNN)
- About
- State of the art
- Great Summary https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
- CNN 2D visual http://scs.ryerson.ca/~aharley/vis/conv/flat.html
- CNN 3D visual http://scs.ryerson.ca/~aharley/vis/conv/
- JS source code http://scs.ryerson.ca/~aharley/vis/
- COURSES
- http://cs231n.stanford.edu/
- Applications
- Voice User Interfaces
- Applications
- WaveNet model (by Google)
- Convert text to speech
- Trained sufficiently it sounds like you
- Link About https://deepmind.com/blog/wavenet-generative-model-raw-audio/
- Link Paper https://arxiv.org/pdf/1609.03499.pdf
- Link Singing http://www.creativeai.net/posts/W2C3baXvf2yJSLbY6/a-neural-parametric-singing-synthesizer
- WaveNet model (by Google)
- Applications
- Natural Language Processing (NLP)
- General
- Recurrent Neural Networks (RNNs) used more than CNNs
- Applications
- CNNs used to extract info from sentences for:
- Sentiment Analysis
- CNN for NLP with Text Classification using TensorFlow,
baseline for Sentiment Analysis tasks, etc + Code
- http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
- http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
- CNNs at Facebook for language translation 9x faster than RNN
- https://code.facebook.com/posts/1978007565818999/a-novel-approach-to-neural-machine-translation/
- (hierarchical/parallel process words of sentence)
- CNNs to play Atari Games with DeepMind
- About https://deepmind.com/research/dqn/
- Code https://sites.google.com/a/deepmind.com/dqn/
- Deep Reinforcement Learning using Policy Gradients code beginner http://karpathy.github.io/2016/05/31/rl/
- CNNs used to extract info from sentences for:
- General
- Computer Vision
- General
- Given set of Images, CNN assigns corresponding Label it believes Summarises the Image Content
- Applications
- CNNs used to Teach AI Agents to play Video Games like
paddle game
- CNN has no prior knowledge of what a ball is or knowing precisely what the controls do.
- Only provided the screen, the score, and the controls given to a user
- CNN extracts crucial info allowing them to develop a useful strategy
- QuickDraw
- Pictionary playing, guesses what you’re drawing
based on what you draw
- https://quickdraw.withgoogle.com/#
- Auto-suggestions for sketching https://www.autodraw.com/
- DeepMind
- Go board game (a complex ancient Chinese game) where the AI agent beat a human professional
- Drones flying unfamiliar territory
- Deliver medical supplies to remote areas
- CNNs give drone ability to see or determine what’s happening in streaming video data
- Decoding Images of Text
- Digitise historical books
- Digitise hand-written notes
- Improve Algorithm to handle letters, numbers, punctuation
- Decode road signs for Self-Driving Cars
- Google Street Maps
- Trained algorithm to create better street maps of the world that reads house number signs from street view images
- Misc
- AI Experiments https://aiexperiments.withgoogle.com/
- AlphaGo https://deepmind.com/research/alphago/
- https://www.technologyreview.com/s/604273/finding-solace-in-defeat-by-artificial-intelligence/?set=604287
- CNNs powering Drones Indoors
- https://www.youtube.com/watch?v=AMDiR61f86Y
- Outdoor navigation with GPS
- http://www.droneomega.com/gps-drone-navigation-works/
- CNN powered Drone Outdoors Autonomous
- https://www.youtube.com/watch?v=wSFYOw4VIYY
- Classify Traffic sign recognition system
- https://github.com/udacity/CarND-Traffic-Sign-Classifier-Project
- Classify Street Signs
- https://github.com/udacity/machine-learning/tree/master/projects/digit_recognition
- CNN to produce a self-driving AI to play Grand Theft Auto V
- https://pythonprogramming.net/game-frames-open-cv-python-plays-gta-v/
- CNNs to convert famous paintings into 3D for Vision Impaired using
CNN to predict Depth from a Single Image https://www.cs.nyu.edu/~deigen/depth/
- http://www.businessinsider.com/3d-printed-works-of-art-for-the-blind-2016-1/?r=AU&IR=T
- CNN to localise Breast Cancer
- https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html
- CNN to save endangered species using HotSpotter http://cs.rpi.edu/hotspotter/
- https://blogs.nvidia.com/blog/2016/11/04/saving-endangered-species/?adbsc=social_20170303_70517416
- CNN to change gender or make you smile in photo
- https://www.digitaltrends.com/photography/faceapp-neural-net-image-editing/
- General
- Voice User Interfaces
-
Deep Learning Recognition
- Interpret hand-written numerical digits
- Design Image Classification Algorithm
- Goal
- Takes pictures of hand-written numbers and identifies the numbers shown in the images
- Tools
- MNIST Database - contains 70k greyscale hand-written
images of digits 0 to 9
- Figure that shows datasets referenced
over time in NIPS papers
- https://www.kaggle.com/benhamner/popular-datasets-over-time/code
- https://nips.cc/
- My Code and Detailed Documentation explaining MLP Steps
- https://github.com/ltfschoen/aind2-cnn
- Links
- TODO - Drop Out Layer technique for Overfitting https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
- TODO - Keras Flatten Layer https://keras.io/layers/core/#flatten
- TODO - Activation Functions http://cs231n.github.io/neural-networks-1/#actfun
- TODO - Fully Connected “Dense” Layer https://keras.io/layers/core/
- TODO - Initialisers https://keras.io/initializers/
- TODO - Loss/Error Functions https://keras.io/losses/
- TODO - Early Stopping and ModelCheckpoint https://keras.io/callbacks/#modelcheckpoint
- TODO - http://machinelearningmastery.com/check-point-deep-learning-models-keras/
- TODO - Performance of other classifiers http://yann.lecun.com/exdb/mnist/
- Black Box of AI Unknowns https://www.technologyreview.com/s/604087/the-dark-secret-at-the-heart-of-ai/
- Goal
- Design Image Classification Algorithm
-
Interpret sophisticated images with complex patterns by replacing Fully Connected Layers (aka Dense Layers) with Locally Connected Layers (aka Convolutional Layers)
- https://www.youtube.com/watch?v=z9wiDg0w-Dc
- https://www.youtube.com/watch?v=h5R_JvdUrUI
-
Colour images https://www.youtube.com/watch?v=RnM1D-XI–8
-
Issues:
- MLPs use many Parameters
-
Using only Densely Connected Layers (aka Fully Connected Layers): e.g. a 28x28px image required ~0.5M parameters, so moderately sized images require high computational complexity
-
Only accepting Vectors as Input means throwing away the 2D info (i.e. spatial info of where pixels are located in reference to each other) contained in the image when we Flatten its Matrix into a Vector
-
- MLPs use many Parameters
-
Solution
- Use CNNs instead of MLPs as they process images without losing 2D info, since they use "Locally Connected Layers" (aka Sparsely Connected / Convolutional Layers) where Connections between Layers are informed by the 2D structure of the image Matrix, and they accept 2D Matrices as Input.
  - Instead of connecting every Hidden Node to every pixel in the original image, break the original image into 4x Regions, such that each Hidden Node in the "Locally/Convolutionally Connected Layer" connects to only the pixels in one of the 4x Regions (i.e. sees only 1/4 of the original image) to find patterns.
  - Each Hidden Node still reports to the Output Layer, which combines the findings for the discovered patterns that were learnt separately in each Region.
  - It uses far fewer Parameters than a Densely Connected Layer, is less prone to Overfitting, and understands how to tease out patterns in image data.
  - Expanding the quantity of Nodes with multiple Collections in the Hidden Layer, where each Collection contains Nodes responsible for analysing different Regions of the original image, allows discovering more Complex Patterns in the data.
  - Each of the Hidden Nodes within a Collection shares a common group of Weights (parameters), which is the motivation behind Convolutional Layers (i.e. for images where a pattern may appear in any region of the image)
-
Convolutional Layers
- Steps
- Given an image 5x5
- Select a Width (W) and Height (H) that defines a Convolution Window (i.e. 3x3 Window)
-
Slide the image horizontally and vertically over Regions of the image pixels. At each position the Window specifies a small piece within the image and defines a Collection of pixels to which we connect a single Hidden Node, and we call this Hidden Layer a Convolutional Layer
- Each Regional Collection of Input Nodes influences the value of a node in the Convolutional Layer by multiplying the Input Nodes by their corresponding Weights and summing the result. Assume the Bias is 0
-
Always add a ReLU Activation Function to the Convolutional Layers (leaves Positive values alone and makes all Negative values 0). See https://www.youtube.com/watch?v=h5R_JvdUrUI
-
Filters (i.e. 3x3) representing the Weights in a grid, whose size matches the size of the Convolutional Window
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Node value = ReLU ( SUM ( Region Input Nodes * Weights Grid ) ), where Weights Grid == Filter
-
Note: We try to visualise the Filter so we understand what kind of Pattern the Filter is trying to detect. Each Filter detects only a single Pattern.
- Detection of Multiple Patterns requires the use of
Multiple Filters
- i.e. if want part of a dog image that contains Teeth, Whiskers, then we’d have Two Filters to detect Patterns of each https://www.youtube.com/watch?v=RnM1D-XI–8
-
Weights are not known in advance. Instead the Weights are learnt by the Neural Network as being the Weights that minimise the Loss Function
- Filter Image Kernels for Feature Extraction. It allows you to create your own filter. You can then use your webcam as input to a convolutional layer and visualize the corresponding activation map
- http://setosa.io/ev/image-kernels/
-
- Steps
- Greyscale vs Colour Images
-
Greyscale images interpreted as 2D array (Width, Height)
- Colour images interpreted as 3D array (Width, Height, Depth)
- RGB images have Depth of 3 (i.e. a stack of 3x 2D Matrices, one for each of the Red, Green, and Blue Channels of the image)
- Convolutions over Colour image (3D) are performed using the 3D Filter (stack of 3x 2D matrices), that has a value for each Colour Channel for all pixels in the image array.
-
-
Feature Maps http://iamaaditya.github.io/2016/03/one-by-one-convolution/
- Given a Coloured image with 3x Filters.
- Each Feature Map in Convolutional Layer produced by performing
Region Input Nodes Matrix * Weights Matrix (aka Filter)
may be thought of as Image Channels that may be stacked into a 3D array. This Stack of Image Channels may be provided as Input to another Second Convolutional Layer to discover Patterns within the Patterns that we discovered in the First Convolutional Layer, and so forth.
- Weights and Biases in CNNs
- Both Dense Layers in MLPs and Convolutional Layers in CNNs have Weights and Biases that are initially randomly generated
- In CNNs where the Weights take the form of Convolutional Filters these filters are randomly generated, and the Patterns they are initially designed to detect are also initially randomly generated
-
Loss Functions and Training CNNs
- CNN always have a Loss Function
- Multi-Class Categorisation uses
categorical cross-entropy loss ('categorical_crossentropy' in Keras)
- Multi-Class Categorisation uses
-
Train the Models through Backpropagation and update the Filters (aka Weights matrices) with values that minimise the Loss Function at each Epoch (iteration)
-
The CNN determines the kinds of Patterns it needs to detect based on the Loss Function, we do not tell the CNN the values of the Filters or the kinds of Patterns to detect, the CNN determines this from the dataset.
- Visualising the Patterns we will see that if the dataset contains say Dog images then the CNN is able to on its own learn Filters that appear like Dogs.
- CNN always have a Loss Function
-
Stride and Padding Hyperparameters
-
https://www.youtube.com/watch?v=Qt5SQNcQfgo
- Stride
- amount the Filter slides over the images.
- Padding
- padding around the image with 0’s to plan ahead for the possibility that the Filter may extend part-way over the side of an image if there is an odd difference between dimensions of image and dimensions of the Filter. Padding gives the Filter more space to move so we get contributions from all Regions in the image when populating the Convolutional Layer
padding = 'valid' means we are OK with potentially losing some Nodes in the Convolutional Layer and do not want padding
padding = 'same' means we want padding as we do not want to lose any Nodes in the Convolutional Layer
-
-
Convolutional Layers in Keras
-
Docs https://keras.io/layers/convolutional/
- Create a Convolutional Layer in Keras
from keras.layers import Conv2D
Conv2D(filters, kernel_size, strides, padding, activation='relu', input_shape)
- Arguments
- filters - number of filters.
- kernel_size - Number specifying both height and width of (square) convolution window.
- Optional arguments
- strides - stride of convolution. default strides is set to 1.
- padding - ‘valid’ or ‘same’. default is ‘valid’ (i.e. no padding)
- activation - typically ‘relu’. default is no activation applied. Strongly encouraged to add a ReLU activation function to every Convolutional Layer in your networks.
- NOTE:
- It is possible to represent both kernel_size and strides as either a number or a tuple.
- When using a Convolutional Layer as the first layer (appearing after the input layer) in a model, you must provide an additional input_shape argument:
  - input_shape - Tuple specifying the Height, Width, and Depth (in that order) of the input.
  - NOTE: Do not include the input_shape argument if the Convolutional Layer is not the first layer in your network.
- Example #1
- constructing a CNN, input layer accepts grayscale images that are 200 by 200 pixels
(corresponding to a 3D array with height 200, width 200, and depth 1).
I want next layer to be a convolutional layer with 16 filters, each with a width and height of 2
(kernel size).
When performing the convolution, I’d like the filter to jump two pixels at a time (strides).
I also don’t want the filter to extend outside of the image boundaries; in other words,
I don’t want to pad the image with zeros.
To construct this convolutional layer, I would use the following line of code:
Conv2D(filters=16, kernel_size=2, strides=2, activation='relu', input_shape=(200, 200, 1))
- Example #2
-
want next layer in my CNN to be a convolutional layer that takes the layer constructed in Example 1 as input. Say I’d like my new layer to have 32 filters, each with a Height and Width of 3. When performing the convolution, I’d like the filter to jump 1 pixel at a time. I want the convolutional layer to see all regions of the previous layer, and so I don’t mind if the filter hangs over the edge of the previous layer when it’s performing the convolution. Then, to construct this convolutional layer, I would use the following line of code:
Conv2D(filters=32, kernel_size=3, padding='same', activation='relu')
-
- Example #3
-
If you look up code online, it is also common to see convolutional layers in Keras in this format:
Conv2D(64, (2,2), activation='relu')
-
In this case, there are 64 filters, each with a size of 2x2, and the layer has a ReLU activation function. The other arguments in the layer use the default values, so the convolution uses a stride of 1, and the padding has been set to ‘valid’.
-
- Arguments
- Dimensionality: see conv-dims.py of https://github.com/ltfschoen/aind2-cnn
- Formula for Shape of a Convolutional Layer
- Formula for Number of Parameters in Convolutional Layer
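- For reference, the standard formulas (with K filters, an F x F kernel, stride S, and an input of depth D_in), sketched below as a small Python helper (the function name is illustrative, not part of Keras):
  import math

  def conv2d_output(h_in, w_in, d_in, k_filters, f_kernel, s_stride, padding):
      # Number of Parameters: one F x F x D_in weight block per Filter, plus one bias per Filter
      n_params = k_filters * f_kernel * f_kernel * d_in + k_filters
      if padding == 'same':
          h_out, w_out = math.ceil(h_in / s_stride), math.ceil(w_in / s_stride)
      else:  # 'valid' (no padding)
          h_out, w_out = math.ceil((h_in - f_kernel + 1) / s_stride), math.ceil((w_in - f_kernel + 1) / s_stride)
      return (h_out, w_out, k_filters), n_params

  # Matches Example #1 above: Conv2D(filters=16, kernel_size=2, strides=2, input_shape=(200, 200, 1))
  print(conv2d_output(200, 200, 1, 16, 2, 2, 'valid'))   # ((100, 100, 16), 80)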
- Pooling Layers
-
https://www.youtube.com/watch?v=OkkIZNs7Cyc
-
Convolutional Networks Summary - http://cs231n.github.io/convolutional-networks/
-
docs https://keras.io/layers/pooling/
- Dfn:
- Pooling Layers take Convolutional Layers as Inputs
- Convolutional Layers are a stack of Feature Maps (one for each Filter)
- Complex datasets with many object Categories require many Filters each responsible for finding a Pattern in the image
- Dimensionality of Convolutional Layers may get quite large since more Filters means a larger stack.
- Higher Dimensionality means use of more Parameters which may lead to Overfitting, so we need a method to reduce the Dimensionality, which is the role of Pooling Layers within a Convolutional Neural Network
- Types of Pooling Layers
- Max Pooling Layers
- take from the Convolutional Layer a stack of Feature Maps as Input
- Define a
- Window Size: i.e. 2x2
- Stride: i.e. 2
- Construct the Max Pooling Layer by working with each Feature Map separately
- Choose the Maximum value within the Window for each Stride.
-
The outcome will be that each Feature Map will be reduced in Width and Height (lower Dimensionality)
- Example
-
Say I’m constructing a CNN, and I’d like to reduce the dimensionality of a convolutional layer by following it with a max pooling layer. Say the convolutional layer has size (100, 100, 15), and I’d like the max pooling layer to have size (50, 50, 15). I can do this by using a 2x2 window in my max pooling layer, with a stride of 2, which could be constructed in the following line of code:
MaxPooling2D(pool_size=2, strides=2)
If you’d instead like to use a stride of 1, but still keep the size of the window at 2x2, then you’d use:
MaxPooling2D(pool_size=2, strides=1)
-
- Global Average Pooling Layer
- Extreme type of Dimensionality Reduction
- takes stack of Feature Maps and computes the average value of the Nodes for each Feature Map in the stack (reduces each Feature Map to a single value) such that the Global Average Pooling Layer converts a 3D array into a Vector
- DO NOT specify Window size or Stride
- Max Pooling Layers
-
- Designing Boilerplate CNN Architecture by arranging layers (for image classification)
- MY CODE AND NOTES
- https://github.com/ltfschoen/aind2-cnn/tree/master/cifar10-classification
- Steps
- https://www.youtube.com/watch?v=kI_RQoYsgQw
- Input
- CNN must accept image array input
- CNNs require a fixed-size input of the provided real-world images
- Select image size
- Resize all images to that same size (i.e. a square 32x32px) with dimensions divisible by 2
- All images are interpreted by computer as 3D array. Width and Height
always higher than Depth
- Coloured RGB image shape = (10, 10, 3)
- where Depth caters for RGB Channels
- Greyscale image shape = (10, 10, 1)
- Coloured RGB image shape = (10, 10, 3)
- Architecture (to allow the model to train “better” and classify objects more accurately)
- Sequence of Convolutional Layers
- Purpose of Sequence of Convolutional Layers is to discover hierarchies of spatial Patterns in the image
- Specify various Hyperparameters as Input to each Convolutional Layer
- Config so Convolutional Layer has same Width and Height as Previous Layer
kernel_size - between 2 and 5 (i.e. 2x2 to 5x5)
filters - controls the Depth of the Convolutional Layer since the Convolutional Layer has one Activation Map for each Filter
strides - 1 (default)
padding - 'same' (gives better results)
activation - 'relu' (for all Convolutional Layers)
- Increase number of Filters with each higher Convolutional Layer index in the Sequence
- i.e. 1st Convolutional Layer - 16 filters
- i.e. 2nd “ “ - 32 “
- i.e. 3rd “ “ - 64 “
- First Convolutional Layer requires the following additional parameter to be provided:
input_shape
(i.e. (32, 32, 3) means 32x32 pixel colour images in dataset)
- Note:
-
This configuration gradually increases the Depth of the provided Input array (32x32x3) without modifying the Width and Height at each Convolutional Layer i.e. when run the function we’ll see the Depth increasing
- WITHOUT MAX POOLING LAYERS
Input - (None, 32, 32, 3)
conv2d_1 - (None, 32, 32, 16)
conv2d_2 - (None, 32, 32, 32)
conv2d_3 - (None, 32, 32, 64)
- WITH MAX POOLING LAYERS (also decreases Width and Height)
Input - (None, 32, 32, 3)
conv2d_1 - (None, 32, 32, 16)
max_pooling2d_1 - (None, 16, 16, 16)
conv2d_2 - (None, 16, 16, 32)
max_pooling2d_2 - (None, 8, 8, 32)
conv2d_3 - (None, 8, 8, 64)
max_pooling2d_3 - (None, 4, 4, 64)
-
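- A hedged Keras sketch of the configuration above (filter counts 16/32/64, kernel_size=2, padding='same', 2x2 max pooling) which reproduces the "with max pooling" shapes listed above; kernel_size=2 is just one choice from the suggested 2-5 range:
  from keras.models import Sequential
  from keras.layers import Conv2D, MaxPooling2D

  model = Sequential()
  # Depth grows 3 -> 16 -> 32 -> 64 while each Max Pooling Layer halves Width and Height
  model.add(Conv2D(16, 2, padding='same', activation='relu', input_shape=(32, 32, 3)))
  model.add(MaxPooling2D(pool_size=2))
  model.add(Conv2D(32, 2, padding='same', activation='relu'))
  model.add(MaxPooling2D(pool_size=2))
  model.add(Conv2D(64, 2, padding='same', activation='relu'))
  model.add(MaxPooling2D(pool_size=2))
  model.summary()   # final feature map shape: (None, 4, 4, 64)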
- Config so Convolutional Layer has same Width and Height as Previous Layer
- Accepts the Input array and gradually converts its shape until its Depth is much greater than its Width and Height
- Eventually it transforms into Vector representation where there is not more spatial info to discover (i.e. whiskers, teeth, etc) in the image and then feed the Vector into Fully Connected Layer(s) (i.e. Dense) to determine what object is contained in the image. i.e. if the Last Max Pooling Layer discovers spatial information that wheels are present in that part of the image, then the Fully Connected Layer transforms that information to predict that a car is likely present in the image with higher probability (i.e. prob(car) = 0.99, prob(dog) = 0.01). This information is NOT pre-specified by us, instead it is learnt by the model during training through backpropagation
- Convolutional Layers increase the Depth of the Input array as it passes through the network
- Max Pooling Layers decrease the Width and Height of the Input array
(spatial dimensions)
- Follow every or every second Convolutional Layer in the Sequence
- Config so Spatial Dimensions become Half what they were in the
Previous Layer
pool_size
- 2stride
- 2padding
- default
- Sequence of Convolutional Layers
- MY CODE AND NOTES
-
- TODO -
- http://cs231n.github.io/convolutional-networks/
- Interpret hand-written numerical digits
- AWS GPU Instances
- https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/53b2a19e-4e29-4ae7-aaf2-33d195dbdeba/lessons/2df3b94c-4f09-476a-8397-e8841b147f84/concepts/ced1fa22-5723-4212-b73f-08f7f6613cae
- About:
- GPU-enabled server instance on AWS to train larger NN architectures
- Alternative for fast training of NNs (vs a local CPU) when the machine's built-in GPU is not a fast Nvidia GPU
- Setup:
- Login to AWS account https://aws.amazon.com/
- Using Root https://www.amazon.com/ap/signin?openid.assoc_handle=aws&openid.return_to=https%3A%2F%2Fsignin.aws.amazon.com%2Foauth%3Fresponse_type%3Dcode%26client_id%3Darn%253Aaws%253Aiam%253A%253A015428540659%253Auser%252Fhomepage%26redirect_uri%3Dhttps%253A%252F%252Fconsole.aws.amazon.com%252Fconsole%252Fhome%253Fstate%253DhashArgs%252523%2526isauthcode%253Dtrue%26noAuthCookie%3Dtrue&openid.mode=checkid_setup&openid.ns=http://specs.openid.net/auth/2.0&openid.identity=http://specs.openid.net/auth/2.0/identifier_select&openid.claimed_id=http://specs.openid.net/auth/2.0/identifier_select&openid.pape.preferred_auth_policies=MultifactorPhysical&openid.pape.max_auth_age=0&openid.ns.pape=http://specs.openid.net/extensions/pape/1.0&server=/ap/signin&forceMobileApp=&forceMobileLayout=&pageId=aws.ssop&ie=UTF8
- View EC2 Service Limits https://console.aws.amazon.com/ec2/v2/home?#Limits
- Find EC2 Instance Limit for p2.xlarge instance type (virtual server with GPU)
- Submit a Limit Increase Request to increase p2.xlarge to 1 with use case: “I would like to use GPU instances for deep learning.”
- Login to AWS account https://aws.amazon.com/
- Launch Instance
- Visit EC2 Management Console https://console.aws.amazon.com/ec2/v2/home
- Click “Launch Instance” Button
- Select Amazon Machine Image (AMI) with OS for instance and
config and pre-installed software
- Open Existing AMI
- Go to “Community AMIs”
- Search for “udacity-aind2” AMI
- Click “Select”
- Open Existing AMI
- Choose Instance Type (i.e. hardware the AMI runs on)
- Filter to only list “GPU compute”
- Select “p2.xlarge”
- Click “Review and Launch”
- Configure Security Group
(allow special config to allow running and accessing
Jupyter Notebook from AWS)
- Allow access the Jupyter notebook to port 8888 by configuring the AWS Security Group
- Click “Edit Security groups”
- On the “Configure Security Group” page:
- Select “Create a new security group”
- Set the “Security group name” to “Jupyter”
- Set the “Description” to “Jupyter”
- Click “Add Rule”
- Set a “Custom TCP Rule”
- Set the “Port Range” to “8888”
- Select “Anywhere” as the “Source”
- Click “Review and Launch” (again)
- Note:
- EC2 Pricing https://aws.amazon.com/ec2/pricing/on-demand/
- WARNING: AWS CHARGES FEE FROM THIS POINT ONWARDS
- EC2 Instances cost $
- Shutdown instances when not using them
- Storage of EC2 Instances cost $
- Delete EC2 instances when not using them
- ACTIONS > INSTANCE STATE > STOP/TERMINATE
- Note:
- Set AWS Billing Alarms
- EC2 Instances cost $
- Click “Launch” button to launch the GPU instance
- Click “Proceed without a key pair”
- Click “Launch Instances” button
- Click “View Instances” button
- Go to the EC2 Management Console to watch the instance booting
- When it says “2/2 checks passed” then the instance is ready to be logged into
- An “IPv4 Public IP” address (in the format of “X.X.X.X”) appears on the EC2 Dashboard
- Open Terminal and SSH to that address
as user “aind2”:
ssh aind2@X.X.X.X
- Authenticate with the password: aind2
- Success: Now we have a GPU-enabled server on which to train your Neural Networks
- Visit EC2 Management Console https://console.aws.amazon.com/ec2/v2/home
- Launch Jupyter Notebook using EC2 Instance
- On the EC2 Instance
- Clone the aind2-cnn repo:
git clone https://github.com/udacity/aind2-cnn.git
source activate aind2
cd aind2-cnn
jupyter notebook --ip=0.0.0.0 --no-browser
- Find the output window line that looks like:
- Copy/paste this URL into your browser when you connect for the first time to login with a token: http://0.0.0.0:8888/?token=3156e…
- Copy and paste the complete URL into the address bar of a web browser (Firefox, Safari, Chrome, etc). Before navigating to the URL, replace 0.0.0.0 in the URL with the “IPv4 Public IP” address from the EC2 Dashboard. Press Enter.
- Note that the browser should display the folders contained in the aind2-cnn repo
- Verify that the EC2 Instance can run a
Jupyter Notebook by using https://github.com/udacity/aind2-cnn/blob/master/cifar10-classification/cifar10_mlp.ipynb
- Click “cifar10-classification”
- Click “cifar10_mlp.ipynb”
- Run All Cells in notebook
- Shutdown and Delete EC2 instance
- Example CNN in Keras using AWS EC2 with cifar10-classification
- Video https://www.youtube.com/watch?v=faFvmGDwXX0
- Link
- Cheatsheet for CNNs in Keras
- https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Keras_Cheat_Sheet_Python.pdf
- CIFAR-10 Winning Architecture
- http://blog.kaggle.com/2015/01/02/cifar-10-competition-winners-interviews-with-dr-ben-graham-phil-culliton-zygmunt-zajac/
- Cheatsheet for CNNs in Keras
- Note: Previously we set the validation_split argument in model.fit to 0.2. This removed the final 20% of the training data, which was instead used as validation data. Alternatively, instead of having Keras split off the validation set for us, we may opt to hard-code the split ourselves (a hedged sketch follows below)
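- A minimal sketch of both options (hedged: model, x_train, y_train, the batch size and epoch count are assumed/illustrative):
  # Option 1: let Keras hold out the final 20% of the training data as validation data
  model.fit(x_train, y_train, batch_size=32, epochs=20, validation_split=0.2)

  # Option 2: hard-code the split ourselves and pass it explicitly
  split = int(0.8 * len(x_train))
  model.fit(x_train[:split], y_train[:split], batch_size=32, epochs=20,
            validation_data=(x_train[split:], y_train[split:]))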
- Mini Project: CNNs in Keras
- Modify architecture of the neural network in cifar10_cnn.ipynb.
- Specify new CNN architecture in Step 5: Define the Model Architecture in the notebook.
- Example https://github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py
- Train the new model. Check the accuracy on the test dataset, and report the percentage in the text box below.
- Try different optimiser
-
Example Image Augmentation in Keras
- Links
- Video https://www.youtube.com/watch?v=odStujZq3GY
- Keras
ImageDataGenerator
https://keras.io/preprocessing/image/
- Visualise Augmentations of MNIST dataset http://machinelearningmastery.com/image-augmentation-deep-learning-keras/
- Augmentation to boost performance on Kaggle dataset using less data https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
- Image Augmentation
- Focus algorithm on learning an Invariant Representation
of the image that just checks if object is present in image or not
without dwelling on irrelevant info. Do not want the model to change
its prediction based on any of the following different attributes of the
object, which are not relevant
- Size Scale Invariance
- Angle Rotation Invariance
- Position Translation Invariance
- Technique to increase Invariance of images that Expands the dataset
by Data Augmenting (Data Augmentation also avoids Overfitting and is
better at generalising since model sees many new images)
- Add images to dataset at random Rotations to increase Rotation Invariance
- etc
- MY CODE AND NOTES
- https://github.com/ltfschoen/aind2-cnn/tree/master/cifar10-augmentation
- Note: Some CNNs have some built-in Translation Invariance (i.e. in Max Pooling where we take max value contained in Window, can occur at any pixel in that Window)
-
Jupyter Notebook aind2-cnn/cifar10-augmentation/cifar10_augmentation.ipynb
- Note on
steps_per_epoch
- Recall that
fit_generator
takes many parameters, includingsteps_per_epoch = x_train.shape[0] / batch_size
wherex_train.shape[0]
corresponds to number of unique samples in the training dataset x_train. By settingsteps_per_epoch
to this value, we ensure that the model seesx_train.shape[0]
augmented images in each epoch.
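- A hedged sketch of the augmentation set-up described above (the rotation/shift/flip settings and batch size are illustrative; model, x_train and y_train are assumed to exist):
  from keras.preprocessing.image import ImageDataGenerator

  batch_size = 32
  # Randomly rotate, shift and horizontally flip training images to increase Invariance
  datagen = ImageDataGenerator(rotation_range=10, width_shift_range=0.1,
                               height_shift_range=0.1, horizontal_flip=True)
  datagen.fit(x_train)

  # steps_per_epoch ensures the model sees x_train.shape[0] augmented images per epoch
  model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                      steps_per_epoch=x_train.shape[0] // batch_size, epochs=20)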
- Recall that
- Links
- Groundbreaking CNN Architectures for Object Classification Tasks
- Top CNN Models in Keras
- https://keras.io/applications/
- Benchmarks of Top CNN Models architectures in Keras https://github.com/jcjohnson/cnn-benchmarks
- TODO - ImageNet Paper http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
- 10M images drawn from 1000 different image categories
- In 2012, the AlexNet Architecture won the ImageNet Large Scale Visual Recognition Competition, trained using the best available GPUs.
- Pioneered the use of the ReLU Activation Function and Dropout technique to avoid Overfitting
- In 2014, VGG / VGGNet Architecture (Visual Geometry Group)
- VGG16 has 16 total layers, VGG19 has 19 total layers
- Each have long sequence of 3x3 convolutions broken up by 2x2 pooling layers, and finished with 3x fully connected “dense” layers
- Pioneered use of small 3x3 convolution windows (instead of AlexNet's 11x11 windows)
- In 2015, ResNet Architecture (similar to VGG, by Microsoft)
- One version is a CNN with 152 layers (previously, above a certain number of layers performance declined due to the Vanishing Gradients Problem that arises when training the CNN through backpropagation: the gradient signal must be pushed through the entire network, and in a deeper network it is more likely the signal gets weakened before arriving at its destination)
- ResNet adds skip connections to the deep CNN so the gradient has a shorter route to travel, achieving superhuman performance in image classification
- TODO - VGGNet Paper https://arxiv.org/pdf/1409.1556.pdf
- TODO - ResNet Paper https://arxiv.org/pdf/1512.03385v1.pdf
- Treatment of Vanishing Gradients Problem http://neuralnetworksanddeeplearning.com/chap5.html
- ImageNet Large Scale Visual Recognition Competition (ILSVRC) http://www.image-net.org/challenges/LSVRC/
- Top CNN Models in Keras
- Visualising CNNs to understand them
- Visualising Activation Maps in Convolutional Layers to dig deeper in how CNN works
- Example: Pass Webcam through CNN in real-time
- Take Filters from Convolutional Layers and constructing images that Maximise their Activations
- Steps:
- Start with image containing Random Noise
- Gradually amend the pixels at each step, changing them to values that activate the filter more highly
- 1st Convolutional Layer
- Random Noise
- Filter that detects specific Colours or Edges
- Output image
- 2nd Convolutional Layer
- Filter that detects Circles or Stripes
- Output image
- 5th Convolutional Layer
- Filter detects Complex Patterns
- Output image
- Example:
- Deep Dreams
- Starting image (i.e. a Tree) is applied a Filter (i.e. a “Statue”) that transforms it into a Hybrid
- Deep Dreams
- Steps:
- Links
- Course on Visualising CNNs http://cs231n.github.io/understanding-cnn/
- Openframeworks.cc for visualising CNNs in real-time http://openframeworks.cc/
- Demo https://aiexperiments.withgoogle.com/what-neural-nets-see
- WaveNets to generate audio
- DeepVis Toolbox https://github.com/yosinski/deep-visualization-toolbox
- http://ml4a.github.io/
- Visualisation Tool https://www.youtube.com/watch?v=AgkfIQ4IGaM&t=78s
- Creating CNN Visualisations https://www.youtube.com/watch?v=ghEmQSxT6tw&t=5s
- Clarifai.com https://www.clarifai.com/
- Picasso Visualiser for CNNs in Keras and TensorFlow https://medium.com/merantix/picasso-a-free-open-source-visualizer-for-cnns-d8ed3a35cfc5
- Visualizing how CNNs see the world. Introduction to Deep Dreams,
along with code for writing your own deep dreams in Keras
- https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html
- Music Video uses DeepDreams 3:15-3:40
- https://www.youtube.com/watch?v=XatXy6ZhKZw
- Create DeepDreams without code https://deepdreamgenerator.com/
- Interpretability of CNNs
- Dangers from using deep learning models (that are not yet interpretable) in real-world applications https://blog.openai.com/adversarial-example-research/
- https://arxiv.org/abs/1611.03530
- Visualising Activation Maps in Convolutional Layers to dig deeper in how CNN works
- How a CNN works in Action
- Given an image trained on ImageNet http://www.matthewzeiler.com/pubs/arxive2013/eccv2014.pdf
- Each image represents a pattern that causes the neurons in the first layer to activate (i.e. they are patterns that the first layer recognizes, such as a -45 degree line).
- First Layer of our CNN clearly picks out very simple shapes and patterns like lines and blobs
- Second Layer picks up more complex ideas like circles and stripes. A grayscale grid may be used to represent how the layer of the CNN activates (or “what it sees”) based on the corresponding images from the grid on the right. Second layer of the CNN captures complex ideas like circles, stripes, and rectangles. CNN learns to do this on its own. There is no special instruction for the CNN to focus on more complex objects in deeper layers. That’s just how it normally works out when you feed training data into a CNN.
- Third Layer picks out complex combinations of features from the second layer. These include things like grids, honeycombs, wheels, and even faces
- Final Layer picks out the highest order ideas that we care about for classification, like dog faces, bird faces, and bicycles
- Transfer Learning
- Links
-
Slides https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/53b2a19e-4e29-4ae7-aaf2-33d195dbdeba/lessons/2df3b94c-4f09-476a-8397-e8841b147f84/concepts/8c202ff3-aab5-46c3-8ed1-0154fa7b566b
- Systematically analyzes the transferability of features learned in pre-trained CNNs
- https://arxiv.org/pdf/1411.1792.pdf
- Cancer detecting CNN https://www.nature.com/articles/nature21056.epdf
-
- Steps Summary
- Deep Learning Professionals Design the Architecture of CNN
- Setting Hyperparameters (Filter Window size, stride, padding)
- Choose Loss Function and Optimiser
- Start Model Training and wait
- Note: State-of-art CNNs architectures available in Keras are result of experimenting with numerous architectures and extensive Hyperparameter tuning and are trained on large ImageNet database that took weeks to train on latest GPUs
- Deep Learning Professionals Design the Architecture of CNN
- About
-
Adapt the State-of-the-art CNN Architectures that have learnt so much about how to find Patterns in image datasets toward our own Classification tasks (instead of creating CNN from scratch, take the learnt understanding and pass it on to a new Deep Learning Model) using the Transfer Learning Technique
- Transfer Learning Approach
- Removing the Final Classification Layers of a State-of-the-art CNN Inception Architecture (i.e. Conv, Pool, Dense) that are specifically pre-trained on the ImageNet data set (containing animals, fruits, etc), but retain the early layers (detecting colors, shapes, general features), and replace with a new Dense layer and only train that
- Update new Dense layer by:
- randomly initialising the Weights in the new Dense layer
- Update all other existing layers by:
- initialise Weights using pre-trained Weights
- During re-training of the entire neural network the parameters were further fine-tuned / optimised to fit to the custom database (i.e. of skin lesions)
- Example
- Diagnosing skin cancer
-
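- A hedged Keras sketch of this Transfer Learning approach using one of the pre-trained keras.applications models (ResNet50 here; the new Dense head, its 133-class output and the frozen-base choice are illustrative, not the course's exact architecture):
  from keras.applications.resnet50 import ResNet50
  from keras.layers import Dense
  from keras.models import Sequential

  # Pre-trained ImageNet weights with the final classification layers removed (include_top=False)
  # and a Global Average Pooling layer appended (pooling='avg')
  base_model = ResNet50(weights='imagenet', include_top=False, pooling='avg', input_shape=(224, 224, 3))
  for layer in base_model.layers:
      layer.trainable = False          # freeze the early layers (colours, shapes, general features)

  model = Sequential()
  model.add(base_model)
  model.add(Dense(133, activation='softmax'))   # new, randomly initialised Dense layer (e.g. 133 dog breeds)
  model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])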
- Different Approach for different Cases
- Case 1: Small Data Set, Similar Data
- Case 2: Small Data Set, Different Data
- Case 3: Large Data Set, Similar Data
-
Case 4: Large Data Set, Different Data
- See CNN Part 27. Transfer Learning Slide for details
- Notes:
- Small dataset of images - 1,000
- Overfitting is concern when using transfer learning with a small data set.
-
Large dataset of images - 1,000,000
- Global Average Pooling (GAP) layers are not only used for Object Classification (what is in the image) but also for Object Localisation (where the object is in the image), as shown in the 2016 CVPR paper linked below
- Small dataset of images - 1,000
- Example
-
Video - https://www.youtube.com/watch?v=HsIAznMM1LA
-
Code
transfer-learning/transfer_learning.ipynb
Note: After training the CNN to identify dog breeds, use the model in an end-to-end pipeline for incorporation into an app
-
- Links
- Global Average Pooling (GAP) Layers for Object Localisation using ResNet http://cnnlocalization.csail.mit.edu/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf
- Code that uses CNN for object localisation https://github.com/alexisbcook/ResNetCAM-keras
- Video on CNN object localisation https://www.youtube.com/watch?v=fZvOy0VXWAI
- Visualization techniques to better understand bottleneck features https://github.com/alexisbcook/keras_transfer_cifar10
- http://machinelearningmastery.com/object-recognition-convolutional-neural-networks-keras-deep-learning-library/
- https://arxiv.org/pdf/1611.10012.pdf
- http://www.pyimagesearch.com/2017/03/20/imagenet-vggnet-resnet-inception-xception-keras/
- https://pythonprogramming.net/haar-cascade-face-eye-detection-python-opencv-tutorial/
- https://www.packtpub.com/books/content/tracking-faces-haar-cascades
- Links
- Recurrent Neural Networks (RNN)
- About
- Supervised Learning type
-
Learning between Input/Output pairs
- Unstructured Input in Vanilla Supervised Learning Models
(i.e. Feedforward Networks)
- Issue as cannot exploit any particular structure
- Do not make any assumptions about how Input Dataset is Structured
- Example
- Given row of data from medical dataset with
single data point (one input-output pair) as
Unstructured Inputs.
Unstructured Input features in any order used to predict
whether a given disease exists by feeding them
into a Vanilla Supervised Learner designed to assume and process
Unstructured Input data (ignore structure)
- Input
- Categories: age: 43, gender: male, height: 5.9, weight: 188
- Output: disease: No
- Input
- Example
- Structured Input
- Many data types do have Input Structure
-
Examples
-
Images - input pixels of image patch are related spatially (pixels near one another are similar). If pass to a Vanilla Supervised Learner that’s designed to assume and process Unstructured Input then it won’t care about the spatial correlations with same level of performance, but if we instead use a Convolutional Neural Network (CNN) then if we pass Structured Input it will leverage it (i.e. the spatial correlation b/w pixels)
-
Video
-
Text - where there’s a natural Order to Words and Chars in a Sentence. Trying to predict the next word given Input Words a Vanilla Supervised Learner won’t care (is indifferent) what Order the Input Words are fed in
-
Financial Time Series - where naturally Ordered Structure of past Input History events over time considered and used to predict future value. Vanilla Supervised Learner would be indifferent to Order when we train it, but we should instead exploit the Ordered/Sequential Structure if its available by using Recurrent Neural Networks (RNN)
-
-
- Many data types do have Input Structure
-
- Supervised Learning type
- Usage:
- CNN - Images, Video
- Example
- Linear Relationship (line) splits Feature Space of Input image with faces plotted on one side or non-face on other side
- Example
- RNN -
- Sequential Data - speech recognition (time-series text-to-audio), text generation, stock price prediction
- Example
- Regressor Line indicates a trend where first weekend revenue predicts movie popularity
- CNN - Images, Video
- Background
- Supervised Learning Problems
- Ordered Sequences Problems
- Financial Time Series
- Example
- Ordered Sequence based on historical price Input of Apple stock price over time
- Example
- Natural Language Processing (NLP)
- Example
- Text generation
- Given small sequence of text, try to Auto-complete
using Supervised Learning where we’ve well trained
a network model on a large text corpus to generate
new sentences
to understand complex relationships b/w Words and Characters from
the English language
- Links
- Academic RNN text generator http://www.cs.toronto.edu/~ilya/rnn.html
- Twitter bots that tweet automatically generated text http://tweet-generator-alex.herokuapp.com/
- NaNoGenMo annual contest to automatically produce a 50,000+ word novel https://github.com/NaNoGenMo/2016
- Robot Shakespeare, a text generator that automatically produces Shakespeare-esque sentences https://github.com/genekogan/RobotShakespeare
- NLTK http://www.nltk.org/
- Input - Ordered Sequence of Words or Chars (Training Text Corpus)
- Output - Ordered Sequence of Chars
- Links
- Machine Translation
- Automatic translation of one language into another
- Input - Ordered Sequence of Words (language X)
- Output - Ordered Sequence of Words (language Y)
- Automatic translation of one language into another
- Text generation
- Example
- Speech Recognition
- Example
- Speech Recognition
- Input - Ordered Sequence of Raw Audio Signal
- Output - Ordered Sequence of Words (text-based)
- Speech Recognition
- Example
- Financial Time Series
- Ordered Sequences Problems
- Supervised Learning Problems
- Modelling Ordered Sequence (Sequential) Data Recursively (used in RNN framework)
-
Use previous values of time-series to predict future values
-
Notation for Generic Ordered Sequence of Values (Ordered by Index)
(S1, S2, S3, ..., SP)
- Generic Ordered Sequence has Big P values
- S1 comes before S2, S2 comes before S3, etc
-
Indices may be of any interpretation, or even Timestamps (indexing when a certain value in a sequence occurred)
- Example
- Elements 1 to 5
S1 = my, S2 = dog, S3 = is, S4 = the, S5 = best
- Elements 1 to 5
- Example: Stock Price History (or time-series generally)
- Ordered Sequence from Left to Right
- S1 at Time 1, S2 and Time 2, etc
- Model Ordered Sequence Structure (often product of real underlying process)
- Example:
- Predict temperature. Input is temperature over time (temperature-based time-series) dependent on factors (i.e. Sun)
-
Predict stock price for investor. Input is price history of given stock that’s dependent on factors (Known and Unknown) i.e. success of product line, vitality of overall economy, CEO and board of directors actions, etc.
- Model Ordered Sequence Recursively
(in lieu of Knowing the underlying process)
- i.e. use past values of sequence to predict future values
- i.e. model future values of sequence mathematically in terms of its predecessors
- SEED is the original value in a recursive sequence (recursive sequences always start with seed value(s))
-
ORDER number of most recent element values used as Inputs each time it recurses to produce future ones i.e. number of previous elements a recursive sequence requires in order to predict future elements
- Mathematically Recursive Sequences Examples
- Example:
- Odd Numbers:
- 1,3,5,7,…
-
S1==1, S2==3
- Generate Ordered Sequence Recursively
S1 = 1 (1x SEED value)
S2 = 2 + S1 = 3
S3 = 2 + S2 = 5 (just add 2 to the previous sequence value)
S4 = 2 + S3 = 7
...
Note: ORDER == 1 - since it uses the 1x most recent value each time it recurses
- Unfolded Views of Recursive Sequence
-
https://www.youtube.com/watch?v=OS9yQCTzCkg
- Find Recursive Equation to Generate Values
Function-based Notation of Generic Recursive Sequence
*** f(s) = 2 + s ***
i.e.
f(S1) = 2 + S1
f(S2) = 2 + S2
- Graphical Notation
S1 --(f)--> S2 --(f)--> S3 ... ST-1 --(f)--> ST ...
-
- Folded View of Recursive Sequence
- Single line represent all our recursive levels
S1 = 1
St = f(St-1), t = 2, 3, 4, ...
- Graphical Model Analogue
(feeds output into itself repeatedly)
(diagram: S1 --f--> S2 --f--> ... --f--> ST, with f applied repeatedly, feeding its output back into itself)
- Graph the Step (x-axis) vs Values (y-axis)
- Recursive Sequence - Every value in sequence can be defined in terms of its predecessors (except the first value) i.e. where future elements are based mathematically on previous values
- Fibonacci sequence:
-
1,1,2,3,5,8,13,21
-
Note: This recursive sequence generates the Golden Ratio and creates a Spiralling Effect when represented Geometrically
- Generate Ordered Sequence recursively
(this time with 2x SEED values)
S1 = 1, S2 = 1 (2x SEED values)
S3 = S2 + S1 = 2 (just sum the previous two elements)
S4 = S3 + S2 = 3
S5 = S4 + S3 = 5
S6 = S5 + S4 = 8
...
Note: ORDER == 2 - since it uses the 2x most recent values each time it recurses
-
Unfolded Views of Recursive Sequence
- Find Recursive Equation to Generate Values
Function-based Notation of Generic Recursive Sequence
*** f(St-2, St-1) = St-2 + St-1 ***
S1 = 1, S2 = 1
S3 = f(S1, S2)
S4 = f(S2, S3)
S5 = f(S3, S4)
...
- Find Recursive Equation to Generate Values
- Folded View of Recursive Sequence
- Single line represent all our recursive levels
S1 = 1, S2 = 1
St = f(St-2, St-1), t = 3, 4, 5, ...
-
- Odd Numbers:
- Links: Fibonacci Sequence https://en.wikipedia.org/wiki/Fibonacci_number
- Example:
-
Example 2: Rayleigh - Reverse Steps
-
Define a few SEED values and Recursive Equation then Generate a Recursive Sequence using the Recursive Equation
1) Seed the Recursive Sequence with 2x SEED values: S1 = 1, S2 = 0.5
S3 = 0.4 * MAX(0, 1 + S1 - 0.1 * S2) (uses S1 and S2 as Inputs)
S4 = 0.4 * MAX(0, 1 + S2 - 0.1 * S3) (uses S2 and S3 as Inputs)
...
Note: ORDER == 2 - since it uses the 2x most recent values each time it recurses (i.e. 0.4 times a linear combo of the two previous values passed through the Rayleigh function, which takes an Input and Outputs the Max of that Input and 0)
- Folded View of Recursive Sequence
- Single line represent all our recursive levels
S1 = 1, S2 = 0.5
St = f(St-2, St-1), t = 3, 4, 5, ...
f(St-2, St-1) = 0.4 * MAX(0, St-2 - 0.1 * St-1)
- Graph of the oscillation at 4:20 https://www.youtube.com/watch?v=KN_dRCy3rtw
- Folded View of Recursive Sequence
-
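- A minimal Python sketch generating the three recursive sequences above from their SEED values and recursive formulas (the Rayleigh update follows the folded-view formula shown above; note the notes are inconsistent about whether a "1 +" term belongs inside the MAX):
  def generate(seeds, update, n):
      # ORDER == number of seed values: the update rule always consumes that many most recent values
      seq = list(seeds)
      while len(seq) < n:
          seq.append(update(*seq[-len(seeds):]))
      return seq

  print(generate([1], lambda s: 2 + s, 8))                               # odd numbers: 1, 3, 5, 7, ...
  print(generate([1, 1], lambda a, b: a + b, 8))                         # Fibonacci: 1, 1, 2, 3, 5, ...
  print(generate([1, 0.5], lambda a, b: 0.4 * max(0, a - 0.1 * b), 8))   # Rayleigh-style, ORDER == 2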
- Example:
- Basic Graphical Model Representation using Maths to understand RNNs
- See earlier examples of Odd
- Unfolded Views of Recursive Sequence
-
Folded View of Recursive Sequence
- Summary https://youtu.be/OS9yQCTzCkg?t=4m36s
- See earlier examples of Odd
-
- Expressing Recursive Sequences
- Functionality
- Graphically (graphical models)
- Programatically (in code)
- Views of a Recursive Sequence
- Unfolded
- Folded
- Drive a Hidden Recursive Sequence using any Driver (Input Sequence)
- About
- Generating methods for Markov Chains
- Dynamic systems
- RNNs
- Example
- Create model of savings account balance at end of each month
(month-to-month basis)
Denote the following:
  h1 = initial savings balance at end of 1st month
  ht = savings balance at end of month t
  st = income (or loss) at end of month t (i.e. the drivers/influencers of the savings balance each month)
  i.e. s1 is the income (or loss) at end of the 1st month, etc
Example Model for the monthly savings level:
  h1 = 0 (initial savings level during month 1, the seed value)
  h2 = h1 + s1 (add 1st month's savings to 1st month's income (or loss))
  h3 = h2 + s2 (add previous month's balance to previous month's income (or loss))
  ... repeat for every month (time period)
Folded view of the month-to-month savings balance summarises these recursive updates:
  h1 = 0
  ht = ht-1 + st-1, t = 2, 3, 4, ...
  where future values of the sequence h are fully dependent on the previous values
Note:
  - h is always a recursive sequence
  - s may be recursive OR random
Simulation of how monthly income (or loss) drives the savings balance:
  - Monthly income (or loss) simulated as a random variable with value -1, 0, or 1 (with equal probability)
  - Initialise the savings account balance at 0
  - Simulate 23x months' worth of income (or loss)
  - Use the update formula to generate the monthly savings account balance
  - Refer to the Simulation on the graph: https://youtu.be/JQ2Nzzxx5oQ?t=4m10s
- Example
-
Create model of real stock price sequence end of each month (month-to-month basis)
-
Driver (i.e. Input Sequence) - sequence driving things along
-
Hidden Sequence - sequence being driven (since we do not actually receive its values as data, instead we generate them recursively using the Driver)
Folded view of the month-to-month real stock price summarises these recursive updates:
  h1 = 0 (could be set to any value)
  ht = tanh(ht-1 + st-1), t = 2, 3, 4, ...
Note:
  - s (Driver) is the sequence of real stock prices and may be recursive OR random
  - s is used to drive a recursive sequence where each new element after the seed is created by adding the previous value of h to the driver sequence s, and then taking tanh of the result
Simulation:
  - Refer to the Simulation on the graph (Value vs Step): https://youtu.be/JQ2Nzzxx5oQ?t=4m38s
  - h (Hidden Sequence) is layered on top of the driver sequence s in the graph and both appear similar, since this driver sequence s is more Structured than the driver used in the previous example (where we modelled the savings account balance)
Function Notation representing the Recursive Update:
  h1 = 0
  ht = f(ht-1, st-1), t = 2, 3, 4, ...
Graphical Model representing this Generic Hidden Sequence: https://youtu.be/JQ2Nzzxx5oQ?t=6m02s
-
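- A small Python sketch of a Hidden Sequence ht = tanh(ht-1 + st-1) driven by a random Driver, as described above (23 steps and the -1/0/1 driver values follow the savings example; a real stock price series would simply replace s):
  import math
  import random

  steps = 23
  # Driver sequence s: random income (or loss) of -1, 0 or 1 each month (equal probability)
  s = [random.choice([-1, 0, 1]) for _ in range(steps)]

  # Hidden sequence h, generated recursively from the Driver
  h = [0.0]                                      # h1 = 0 (seed)
  for t in range(1, steps):
      h.append(math.tanh(h[t - 1] + s[t - 1]))   # ht = tanh(ht-1 + st-1)
  print(list(zip(s, h)))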
- About
-
How to Inject the assumption of Recursivity directly into a Supervised Learner (using Feedforward Networks)
- Definitions
- Recursivity - Modelling the Structure of Recursive Ordered Sequences
- Recursive Sequence - Able to generate new values in a sequence by combining old values using a specific formula
- Lazy Way
-
Adjust Vanilla Supervised Learners to deal with Ordered Sequence data
-
Goal
-
Given an Input Sequence (dataset Driver) we want to Model it as Recursive using a formula that approximately generates values of that Sequence, given previous values, then use that formula to make predictions about future values in the Sequence (whether the Sequence itself is truly Recursive or not) (i.e. perform Supervised Learning with Ordered Sequences)
-
Note:
- Given an Input Sequence we Model it as Recursive to make meaningful predictions
- Inject the Structural assumption into a Supervised Learner
- Options to approach
- Feedforward Networks (simple approach) (reverse engineer the notion of Recursivity and inject it as a parameterised Model into a Supervised Learner)
- RNNs (complex approach)
- Options to approach
-
-
- Definitions
-
Injecting Recursivity into a Supervised Learner
-
Example using Random Guessing:
- Given a Random Sequence (Original Sequence) of numbers that
we suppose is recursive
1,3,5,7,9,11,13,15
- Assume the sequence is of ORDER == 1, then SEED value is just 1st value in series
- Steps to find the formula for the recursive update of the sequence
- Pick a function
Pick a function, here g(s) = 2 + s, where:
  s1 = 1
  s2 = 2 + s1
  s3 = 2 + s2
  ...
  s8 = 2 + s7
- Use the function with a Seed value to see if it generates something close to sequence we have
- Inject first Seed value into the function as parameter, and
inject successive hat values into next value to
generate a new sequence using the function
s1 = 1
^s2 = g(s1)
^s3 = g(^s2) = 2 + ^s2
...
^s8 = g(^s7) = 2 + ^s7
- New sequence generated with the function
s1, ^s2, ^s3, ..., ^s8
- Compare new sequence with original sequence to check
how close our alignment is
between recursive function and recursive formula
s1, ^s2, ^s3, ..., ^s8 VS s1, s2, s3, ..., s8
- Pick a function
- Sample implementation of finding steps
- Pick Random Function
g(s) = 1 - 0.5 * s
- Generate an 8-value sequence starting with Seed value s1 and check how close you get to the original sequence
- Plot the comparison between
- Random Sequence (Original Sequence)
- Random Function (Proposed Solution Recursive Parameterised Sequence Function)
- Try a different Random Function (Proposed Solution Recursive Parameterised Sequence Function)
- Repeat
- Pick Random Function
- Note: cannot rely on just guessing to determine the Proposed Solution Recursive Parameterised Sequence Function; we need to instead Learn such a Function using the Original Sequence itself
- Given a Random Sequence (Original Sequence) of numbers that
we suppose is recursive
- Example 1 (Simple):
Using Recursive Parameterised Sequence Function as our
Recursive Approximator that has Weights we may tune to the given Sequence
- Summary of steps
- Inject Recursivity into a Supervised Learner in order to
Model an Ordered Sequence Recursively
- Proposing Recursive Parameterised Formula (Network Architecture)
- Windowing the Sequence to produce Regression Input-Output Pairs
- Parameter Tuning using the Input-Output Pairs
- Sequence Generation using the Trained Network as a Regressor
- Inject Recursivity into a Supervised Learner in order to
Model an Ordered Sequence Recursively
- Given a Random Sequence (Original Sequence) of numbers that
we suppose is recursive,
- Note: Goal is to make a recursive approximation of it by first making a guess about the architecture of its recursive formula and then tuning the parameters of that architecture optimally using the sequence itself (ideally try multiple architectures to find the best one)
1,3,5,7,9,11,13,15
-
Pick a Simple Linear Parameterised Function as the Recursive Approximator (a simple Feedforward Network) with 2x Weights w0 and w1 that we Learn by Fitting g(s) = w0 + w1 * s
-
Model each element of Sequence past the Seed as a Linear Combination of its previous element
s1 = 1
s2 = w0 + w1 * s1
s3 = w0 + w1 * s2
...
s8 = w0 + w1 * s7
- Weights (i.e. w0 and w1) need to be learnt
- Equalities (Levels of Recursion) (i.e. s2 to s1, s3 to s2, etc) must be determined, since if they hold for some values of the Weights w0 and w1 then we have a Recursive Formula that will generate our sequence
- Weights (i.e.
- Find the best Weights to make Equalities hold as best possible
- Learn by Forming and then Minimising a
Least Squares Cost Function (breaking recursion into levels)
- Ignore the top Level (i.e. s1 = 1)
-
At each Level of Recursion, we take the difference between both sides of the Equality and square the result
s2 = w0 + w1 * s1   ===>   (s2 - (w0 + w1 * s1))^2
s3 = w0 + w1 * s2   ===>   (s3 - (w0 + w1 * s2))^2
...
s8 = w0 + w1 * s7   ===>   (s8 - (w0 + w1 * s7))^2
- Then Sum up the differences of each level of recursion
to give us a Least Squares Cost Function
∑ (t = 2 to 8) (st - (w0 + w1 * st-1))^2
- Minimise the Least Squares Cost Function over the
Weights to give us their Optimal values, which gives
the Best Recursive Formula of the Form:
g(s) = w0 + w1 * s
-
Note: Resolving the formula uses Regression, where our Input-Output Pairs consist of consecutive elements of the Sequence
- Note: Recursive Approximator
g(s)
is a simple Feedforward Network (Linear Function)
- Learn by Forming and then Minimising a
Least Squares Cost Function (breaking recursion into levels)
-
After Training, Tune the Network to generate new values in the Sequence
- Process Training Sequence (aka Windowing)
- Fit an Order-One Recursive Formula to the Sequence of numbers
-
Extract the Set of Regression Input-Output Pairs from the Sequence to form and Minimise the Least Squares Cost Function
- Input-Output Pair 1 (Elements 1 and 2 are first two elements of Sequence)
- Input s1, Output s2
- Input-Output Pair 2
- Input s2, Output s3
-
etc (Slide the Input Window one unit to the right on the Graph that shows Input-Outputs over time)
- Note: If the Sum is ∑ (t = 2 to 8) there will be 7x pairs (one per term of the sum from t = 2 to the upper limit P == 8)
- Input-Output Pair 1 (Elements 1 and 2 are first two elements of Sequence)
-
Add Input-Output Pairs to Summands of Least Squares Loss
Input    Output    Summand
s1       s2        (s2 - (w0 + w1 * s1))^2
s2       s3        (s3 - (w0 + w1 * s2))^2
...
sP-1     sP        (sP - (w0 + w1 * sP-1))^2
- Inputs are calculated given the first 7x members of
the Sequence in the example
Input      Output
[[1]       [[3]
 [3]        [5]
 [5]        [7]
 ...        ...
 [13]]      [15]]
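A minimal sketch of this windowing step (the helper name `window_transform_series` is an assumption, not necessarily the project's function; window size 1 for the order-one model):
```python
import numpy as np

def window_transform_series(series, window_size):
    """Slide a window over the series to form regression input/output pairs."""
    X, y = [], []
    for t in range(len(series) - window_size):
        X.append(series[t:t + window_size])   # inputs: window of previous values
        y.append(series[t + window_size])     # output: the next value
    return np.asarray(X), np.asarray(y).reshape(-1, 1)

series = np.array([1, 3, 5, 7, 9, 11, 13, 15])
x, y = window_transform_series(series, window_size=1)  # 7 input/output pairs
```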
-
Fitting with Keras by constructing a Model to reflect the derivations of the Least Squares Loss
- Build Feedforward Network (FFN) with One Linear Layer
to perform regression on our input/output data
and Output Loss using Mean Squared Error. then
Fit the model
# Build FFN to perform regression on input/output data
model = Sequential()
layer = Dense(1, input_dim=1, activation='linear')
model.add(layer)
model.compile(loss='mean_squared_error', optimizer='adam')

# Fit the Model with Batch size and Epochs qty
model.fit(x, y, epochs=3000, batch_size=3,
          callbacks=callbacks_list, verbose=0)
- Build Feedforward Network (FFN) with One Linear Layer
to perform regression on our input/output data
and Output Loss using Mean Squared Error. then
Fit the model
- After Training, substitute each Input value into the FFN (the Linear Combination g(st-1) = w0 + w1 * st-1) to make a set of predictions on the Training Set
- Set of Predictions built
Input      Output     Predictions g(st-1)
[[1]       [[3]       [[  2.999 ]
 [3]        [5]        [  4.999 ]
 [5]        [7]        [  6.999 ]
 ...        ...        ...
 [13]]      [15]]      [ 15.000 ]]
-
Compare the Predictions to the Output to check how close they are (i.e. the aim is to achieve a fair approximation of the true recursive function, i.e. f(s) = 2 + s in the case of the Odd Sequence example)
- Print the Learned Weights (to see how similar they are to the original model and true function and its associated Coefficients that we're aiming for)
  model.get_weights()
  [array([[ 1.000001 ]], dtype=float32), array([ 1.99999 ], dtype=float32)]
  g(s) = w0 + w1 * s
  g(s) = 1.99999 + 1.000001 * s
- Notes (about Trained Network):
- Used a very simple Feedforward Network (FFN) to fully train network/weights
- It may also be used like any other classical trained predictor, by splitting the Original Sequence into a Training and Testing set and tuning the Weights to minimise the Testing Error (rather than minimising the Training set Error)
- Use of Trained Network as a Generative Model to produce new unseen elements of the sequence
- Generating Next Points in the Sequence using the Full Trained Network for the Generic Sequence
- https://www.youtube.com/watch?v=6LgdU4avFSk
- Simple Network Model
g(s) = w0 + w1 * s
- Substitute the last element of the Sequence sp into the Simple Network Model to give a Generated Output, a new point Generated using the Trained Network (may or may not be close to the true future values of the sequence)
- Repeatedly move/slide the Window forward to the Next Input (i.e. plug ^sp+1 into the network to give output ^sp+2)
- Generate Outputs
  Input      Output
  sp         ^sp+1 = g(sp)        (generated 1st point)
  ^sp+1      ^sp+2 = g(^sp+1)     (generated 2nd point)
  ...
- Applying the above to the Odd Sequence Example to generate points:
  Input            Output
  [[ 15      ]     [[ 17.0007 ]
   [ 17.0007 ]      [ 19.0009 ]
   ...              ...
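A minimal sketch of this generation loop (assumes the trained Keras `model` and the `series` array from the earlier windowing sketch; names are illustrative):
```python
import numpy as np

num_new_points = 4
window = list(series[-1:])          # start from the last element of the training sequence

generated = []
for _ in range(num_new_points):
    x_next = np.array(window[-1:]).reshape(1, 1)      # shape (1, 1) for the order-one linear model
    s_hat = float(model.predict(x_next, verbose=0))   # ^s_{p+1} = g(s_p)
    generated.append(s_hat)
    window.append(s_hat)                              # slide the window onto the generated value

print(generated)   # e.g. approximately [17.0, 19.0, 21.0, 23.0] for the odd-number sequence
```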
-
Graphical Model View
-
of the simple Linear Network showing how each element of the Sequence is related to the shared Weights w0 and w1
-
https://youtu.be/I72EOcAroFk?t=2m44s
-
- Summary of steps
- Example 2 - ReLU (Complex), where we inject recursivity into a supervised learner:
- Instead of creating the Sequence ourselves, we are given the Sequence of Values and aim to Model it using a Recursive Formula
- Suppose we have the first 50 values of the Sequence of Values
- Assume we do not know what precise Recursive Formula generates the data
(even though we really know it's
  S1 = 1, S2 = 0.5
  S3 = 0.4 * MAX(0, 1 + S1 - 0.1 * S2)        (uses S1 and S2 Inputs)
  S4 = 0.4 * MAX(0, 1 + S2 - 0.1 * S3)        (uses S2 and S3 Inputs)
  ...
  S50 = 0.4 * MAX(0, 1 + S48 - 0.1 * S49)     (uses S48 and S49 Inputs)
  Note: ORDER == 2, since it uses the 2x most recent values each time it recurses (i.e. 0.4 times the linear combo of the two previous values pushed through a ReLU function that takes an Input and Outputs the Max of that Input and 0))
- We Propose to Fit a simple Parameterised ReLU Architecture as an attempt to see how close it gets to a Recursive Formula
- Model each Non-Seed value of the Sequence as follows
(a linear combination of the two prior elements pushed through a ReLU function):
  S1 = 1, S2 = 0.5
  S3 = W0 + W1 * MAX(0, W2 + W3 * S1 + W4 * S2)
  S4 = W0 + W1 * MAX(0, W2 + W3 * S2 + W4 * S3)
  ...
  S50 = W0 + W1 * MAX(0, W2 + W3 * S48 + W4 * S49)
- Tune the Weights (W0 to W4) as we did previously, by Squaring the Difference between both sides at each Level of the Recursion, giving a number of Squared Error Terms
  (S3 - (W0 + W1 * MAX(0, W2 + W3 * S1 + W4 * S2)))^2
  (S4 - (W0 + W1 * MAX(0, W2 + W3 * S2 + W4 * S3)))^2
  ...
  (S50 - (W0 + W1 * MAX(0, W2 + W3 * S48 + W4 * S49)))^2
- Sum up the Squared Error Terms giving the Least Squares Loss Function
(since this is a Regression problem our Regressor is a
Two-Layered Feedforward Network with 1x RELU and 1x Linear
that may be viewed with a
Recursive Formula or as a Graphical Model)
∑ (t = 3 to 50) (St - (W0 + W1 * MAX(0, W2 + W3 * St-2 + W4 * St-1)))^2
- Graphical Model of Feedforward Architecture (showing how Weights are
shared b/w Regression Points)
-
https://youtu.be/ZFWOCob2gZ8?t=1m43s
-
g(St-2, St-1) = W0 + W1 * MAX(0, W2 + W3 * St-2 + W4 * St-1)
-
- Minimise the Least Squares Loss Function by transforming the
Series into a Set of Input-Output Pairs when Processing the Sequence
- Window with Input Size == 2
- Since we must substitute the Last 2 Entries of the Sequence to Predict the Next one (each input takes the past 2x elements of the sequence to predict the next one)
  Window of Length 2
  Input-Output Pair 1:  Input 1, Input 2             ->  Output 1
  Input-Output Pair 2:  Input 2, Input 3 (Output 1)  ->  Output 2
  ...
- Code to Minimise the Least Squares Loss
# Create model with Two Layers and a Least Squares Loss Function
# Minimise and recover its optimal Weights using
# Stochastic Gradient Descent
model = Sequential()
model.add(Dense(1, input_dim=2, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')

# Fit the model
model.fit(x, y, epochs=1000, batch_size=20, callbacks=callbacks_list, verbose=0)
- Preview the resulting Fit with the Training Set (after Training) of the 50 elements in the Sequence (i.e. Graph with Step x-axis, Value y-axis)
- Plot the Input-Output Pairs (Input x-axis, Output y-axis) that were
formed when using FNN-based recursive approximation method
based on the Original Sequence
- View the dependence between Inputs and Outputs. If there is dependence then it's NOT IID
- FLAWS with Feedforward Neural Networks (FNN)
- https://www.youtube.com/watch?v=IXtAGSJOpDQ
- If a Sequence consists of consecutive Independent and Identically Distributed (IID) pairs, then a change to the values of one pair of elements should not have any effect on the following values
- Pure Recursivity is the exact opposite of IID: every value depends fundamentally on those before it. The FNN approach is geared toward trying to learn dependency in the form of recursivity, but when we tune our model we end up doing the opposite and provoke Independence instead (the opposite of what we want)
- Generate new values using a Regressor beginning at the End of the Training Set and Preview them by Overlaying them on the next Actual Sequence
- Check if Generated Fit matches Original (to indicate that Learnt Model is right)
- Check Weights of the Model in Keras.
May find that generated Weights are different than the Original, but
that’s ok, since our aim was to find a Recursive Formula that explains the
behaviours of our Sequence, which we’ve done, and there may be
lots of Recursive Formulas that could be used to generate the Sequence
(more than 1x correct way to model this ReLU sequence)
  w0 = 1.886
  w1 = 1.309
  w2 = 0.305
  w3 = 0.305
  w4 = -0.641
- Window with Input Size == 2
- Interesting Notes:
-
Different Architectures may be used to approximate a given Ordered Sequence of Values that we wish to Model Recursively (i.e. many different Architectures can model a Sequence created by one particular Recursive Formula)
- Example:
Given a Sequence created as the Output of the ReLU Network, a tanh Network
  g(St-2, St-1) = W0 + W1 * tanh(W2 + W3 * St-2 + W4 * St-1)
can Fit the Sequence well, following the same procedure used before for Training on the first 50 elements.
Graph with Training Fit: https://youtu.be/Xf1oAaTd42w?t=38s
It can then Generate values that closely mirror the remainder of the true Sequence.
Graph with Generative Fit: https://youtu.be/Xf1oAaTd42w?t=43s
- Example:
-
Adding Noise (Gaussian Noise) when Generating the Sequence so it is almost Recursive (except for the noise)
S1 = 1, S2 = 0.5
S3 = 0.4 * MAX(0, 1 + S1 - 0.1 * S2) + ɛ1   (uses S1 and S2 Inputs with Noise)
S4 = 0.4 * MAX(0, 1 + S2 - 0.1 * S3) + ɛ2   (uses S2 and S3 Inputs with Noise)
-
Follow previous steps to Fit the ReLU-based Regressor to the Training Set of this Noisy Sequence, producing a Fit that performs well as Training Fit
- Graph of Training Fit
- https://youtu.be/Xf1oAaTd42w?t=1m21s
- Graph of Training Fit
-
Use the Tuned Regressor to Generate values as before as the Generative Fit
- Graph of Generative Fit
- https://youtu.be/Xf1oAaTd42w?t=2m24s
- Graph of Generative Fit
-
Note: Both the Training Fit and the Generated Fit points come from same Tuned Recursive Formula. So we should consider the Training Fit and the Generated Fit points as a SINGLE Recursive Sequence that we’re using to approximate our True Sequence as closely as possible
- Graph Recursive Sequence (combination of both Training Fit + Generative Fit)
- https://youtu.be/Xf1oAaTd42w?t=2m49s
- Graph Recursive Sequence (combination of both Training Fit + Generative Fit)
-
Note: The Original Sequence shown in black was NOT Recursive but the Recursive Sequence shown in green is by design (since was created using the Recursive Formula used to find an approximation to truth)
-
-
- Example 3 - Real Financial Time-Series Dataset
- Previously we transformed pursuit of an approximation of a Sequence into a Regression problem
- Now we’ll apply the approach to a real dataset
- Given
Given a historical stock price dataset graphed as Value VS Step: https://www.youtube.com/watch?v=UfOUisfQPZc?t=18s
Use:
- Order 5
- Linear Network Architecture
- Window size = 5
Build the architecture using Keras
  # Create model
  model = Sequential()
  model.add(Dense(1, input_dim=window_size, activation='linear'))
  model.compile(loss='mean_squared_error', optimizer='adam')
Train on the first 100 elements of the Sequence (i.e. Steps 0 to 100)
Note: We do not know of a True Recursive form of the Sequence (or if one even exists)
Aim: Resolve a formula that explains the behaviour of this Ordered Sequence Recursively (i.e. approximates it with a truly Recursive Sequence)
Training the Model allows visualising the Fit on the first 100 points
- Graph "Training Fit": https://www.youtube.com/watch?v=UfOUisfQPZc?t=1m05s
Generate, say, 40 new points using the Tuned Linear Regressor
- Graph "Generative Fit": https://www.youtube.com/watch?v=UfOUisfQPZc?t=1m07s
Check how the Generative Fit compares to the Original
- Note: It may not be a strong fit to the True Sequence since the underlying dataset is more complex than the architecture we're using for the Recursive Approximation
- Regardless of the architecture, precisely predicting the stock price MANY time periods into the future using historical price alone is impossible
- SHORT time periods into the future may be predictable based on historical price alone
If the task is to predict the stock price over SHORT periods into the future then:
- We DO NOT need the Regressor as a Generative Model
- We only need the Regressor as a Training / Testing instrument
- https://www.youtube.com/watch?v=UfOUisfQPZc?t=1m46s
i.e. Given the Trained Regressor with Window size == 5, use it to perform Predictions that are 1 period in the future on the financial time-series (see the sketch below):
- Slice the financial time-series into 2x Parts:
  - Training (already done)
  - Testing
- Test the efficiency of the predictor by Windowing the last 5x elements of the Training Set and using the Predictor to estimate the next value
- Repeat for the next unit by moving the Window forward 1x unit (using 4x units from the Tail of the Training Set and the 1x First unit of the Testing Set), and use the Predictor to estimate the next value
  - https://www.youtube.com/watch?v=UfOUisfQPZc?t=2m35s
- Repeat the above until we have all our Predictions, then overlay the Test Set Predictions on the True Sequence for visual comparison
  - https://www.youtube.com/watch?v=UfOUisfQPZc?t=3m12s
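A minimal sketch of the one-step-ahead Test Set prediction loop described above (assumes the trained Keras `model`, the full `series` array, and `window_size = 5`; names and the 100-element split are illustrative):
```python
import numpy as np

window_size = 5
split = 100                      # first 100 elements were used for Training
history = list(series[:split])   # start the window at the tail of the Training Set

test_predictions = []
for t in range(split, len(series)):
    x_window = np.array(history[-window_size:]).reshape(1, window_size)
    pred = float(model.predict(x_window, verbose=0))   # 1-period-ahead prediction
    test_predictions.append(pred)
    history.append(series[t])    # slide forward using the TRUE value, not the prediction
```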
- Recap
- Goal: Model Ordered Sequences Recursively using a Feedforward Neural Network (FNN) approach
- Approach: Resolve a Recursive Formula
- Using the Recursive Formula to construct Recursive Approximation
- Recursive Approximation used to:
- Choose Architecture (Order, Functionality)
- Break Recursion into Levels
- Windowing the Sequence (producing Input/Output Pairs)
- Minimising the Loss to Tune Parameters of this Architecture
- If Sequences are Continuous values
- Then use “Least Squares Loss”
- If Sequence is Discrete values (i.e. text-data)
- Then use “Logistic / Softmax Loss”
- If Sequences are Continuous values
- Using Tuned Regressor as Generative Model (if possible)
- Recursive Formula
- When properly Tuned, gives a 100% Recursive Sequence that approximates the True Sequence (the Original, which may or may not be recursive)
- Noted that Generative Models are not appropriate for some applications (i.e. financial time-series) where traditional Train/Test should be used
- Recurrent Neural Network (RNN) Framework Fundamentals
- Previously
- Using the FNN approach we model recursivity correctly, but when we tune the parameters we completely lose the dependence of further levels on earlier levels (they become effectively IID), which is fundamental to recursivity
-
Derivation of an RNN (which improves on the FNN without losing dependence, offering a better, more structured recursion that stresses the Dependence between levels explicitly). Known as the SimpleRNN model in Keras
-
RNNs came from the desire to enforce greater Dependency of further levels on earlier levels, where each level ingests its predecessor (Hidden States are driven by the Input Sequence); this addresses the failing of the FNN approach, which did not enforce this Dependency
-
Goal: Avoid further levels becoming Independent of earlier levels. So we must enforce more Dependency between levels by enhancing our recursion.
- Step 1:
- Re-write the earlier steps of the recursion; we want the LHS and RHS of each equality to hold as well as possible (i.e. approximately hold)
-
Add an Auxiliary Variable (aka Hidden State) (i.e. h1 to h4) at each line, since while we recurse on h we observe s, and s Drives h. They help organise the derivation and remind us that the RHSs of the levels, taken together, actually define a Sequence based on their Input and Parameter sides
  s1 = h1 = α
  s2 ~= h2 = g(s1)
  s3 ~= h3 = g(s2)
  s4 ~= h4 = g(s3)
- Step 2
-
Remove the LHS of each level, since we want to approximate the True Sequence and get it out of our field of view
  h1 = α
  h2 = g(s1)
  h3 = g(s2)
  h4 = g(s3)
-
Now our aim is to Tune the function so the Hidden values approximate the values s2, s3, and s4 that we just removed
-
- Step 3
- Adjust our Recursion to enforce Dependency between the levels
(avoid Independence across consecutive levels, as was the failing of FNNs)
by Forcing consecutive-level Dependency
(i.e. so each level after the Seed is functionally Dependent on
the preceding level, i.e. 4th level functionally Dependent on the 3rd, etc)
- Force Dependency by making Architecture ingest the previous level (plugging in 3rd line into 4th line of the Architecture). i.e. for 4th level using any parameterised function of 2x Inputs: s3 (as usual) and h3 (for Dependency)
h1 = α
h2 = g(s1) = f(h1, s1)
h3 = g(s2) = f(h2, s2)
h4 = g(s3) = f(h3, s3)
...
ht = g(st-1) = f(ht-1, st-1)
- i.e.
f(h3, s3) = tanh(w0 + w1 * h3 + w2 * s3)
- Adjust our Recursion to enforce Dependency between the levels
(avoid Independence across consecutive levels, as was the failing of FNNs)
by Forcing consecutive-level Dependency
(i.e. so each level after the Seed is functionally Dependent on
the preceding level, i.e. 4th level functionally Dependent on the 3rd, etc)
- Step 4
- https://youtu.be/Y3-YuSbhbQM?t=6m15s
- Roll-up the recursion, showing the RNN in the form of a Hidden Sequence that is Driven by an Input
(i.e. taking a Sequence s and recursively Driving a Sequence h). This was covered previously
  h1 = α
  ht = f(ht-1, st-1), t >= 2
- h is Hidden since it was not directly observed but was instead Driven using an Input Sequence s
- Previously using the FNN-model we turned the definition of recursivity on its head and used it to develop a recursive approximation to our Input.
- Now, with RNNs we've turned the Hidden Sequence concept on its head, by taking the Hidden Sequence model and fitting it to our Input (by tuning f to approximate the Driver s as well as possible) and using it to develop a recursive approximation to our Sequence
- Plot Hidden h vs Driver s
- Step 1:
-
- Formulate Least Squares Error/Loss using RNNs
- Given
s1 = h1 = α
s2 ~= h2 = g(s1) = f(h1, s1)
s3 ~= h3 = g(s2) = f(h2, s2)
s4 ~= h4 = g(s3) = f(h3, s3)
...
st ~= ht = g(st-1) = f(ht-1, st-1)
- Remove Hidden State variables introduced during derivation
s2 ~= f(h1, s1)
s3 ~= f(h2, s2)   i.e. = f(f(h1, s1), s2)
s4 ~= f(h3, s3)   i.e. = f(f(h2, s2), s3) = f(f(f(h1, s1), s2), s3)
...
st ~= f(ht-1, st-1)   where h2 = f(h1, s1)
- RNN level dependent on ALL previous levels (complete history
of sequence values that precede it)
(whereas shallow FNN is only dependent on immediate previous level)
using Hidden State of the previous level
- i.e. s3 dependent on s2 and s1
- i.e. s4 dependent on s3, s2 and s1
- i.e.
- RNN level dependent on ALL previous levels (complete history
of sequence values that precede it)
(whereas shallow FNN is only dependent on immediate previous level)
using Hidden State of the previous level
- Make these approximate equalities hold as tight as possible
by Squaring the Error at each level and add them up
(s2 - f(h1, s1))^2
(s3 - f(h2, s2))^2
(s4 - f(h3, s3))^2
...
(st - f(ht-1, st-1))^2
- Minimise the Sum over the first P elements of the Sequence to get a Least Squares Error/Loss
  ∑ (t = 2 to P) (st - f(ht-1, st-1))^2
-
Note 1: Broke into levels that are each explicitly Dependent on each other (unlike with the FNN approach where we lost it)
-
Note 2: When using Architectures with Bounded Output, i.e. f(h, s) = tanh(w0 + w1 * h + w2 * s), often used in RNNs, it is good to Minimise the Difference between each Sequence element and a Linear Combination of the corresponding Hidden State (to ensure values > 1 may be reached) by either:
- ADJUSTING the Least Squares Error/Loss Function as follows:
  ∑ (t = 2 to P) (st - (b + w * f(ht-1, st-1)))^2
- Alternatively bake-in the Linear Combination into the Recursion directly at each level
- ADJUSTING the
Least Squares Error/Loss Function as follows:
-
- Given
-
Apply the RNN Framework in Keras
- Example 1: ReLU-generated sequence
-
https://www.youtube.com/watch?v=F5PVwVrEVHY
-
Note: RNN Regressor is our generator
-
Fit RNN Architecture to the first 50 elements of the ReLU generated Sequence we saw earlier (shown ‘blue’). Use as a Sequence Generator as well (as was done with FNN-approach)
model = Sequential()
model.add(SimpleRNN(3, input_shape=(2, 1), activation='relu'))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer=optimizer)
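For the SimpleRNN layer above, the windowed inputs need a third (feature) dimension, i.e. shape (num_samples, window_size, 1). A minimal sketch of the reshape, reusing the assumed `window_transform_series` helper sketched earlier (fit settings are illustrative):
```python
import numpy as np

# order-2 windowing to match input_shape=(2, 1)
X, y = window_transform_series(series, window_size=2)
X_rnn = X.reshape(X.shape[0], 2, 1)   # (samples, time steps, features) expected by SimpleRNN

model.fit(X_rnn, y, epochs=1000, batch_size=10, verbose=0)   # illustrative settings
```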
- Plot the Original Sequence, Training Fit, and Generative Fit
-
- Example 2: Apply RNN to fitting a Financial time-series dataset
-
https://www.youtube.com/watch?v=F5PVwVrEVHY
- Fit first 2/3 of the dataset using below code snippet:
model = Sequential()
model.add(SimpleRNN(1, input_shape=(5, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer=optimizer)
-
Plot Original Sequence, Training Fit, Testing / Generative Fit
- Note:
- Recursive models (RNNs) typically used for short-term
predictions on financial time-series datasets.
- Windowing longer sequences is required as a practical matter, not an architectural one
- With financial time-series its more appropriate to use the Regressor as a more traditional Training/Testing tool rather than just a Pure Generator given the complexity of the phenomenon
- Recursive models (RNNs) typically used for short-term
predictions on financial time-series datasets.
-
- Example 1: ReLU-generated sequence
- Previously
-
Recurrent Neural Network (RNN) Framework - Characteristics
-
Related to sequences and lists
-
Memory (loops allow info to persist) and Dependency
-
https://www.youtube.com/watch?v=0B8O2eNv2DY
-
Compare RNN and FNN
-
RNN more expressive and data-driven than FNN (by explicit modelling of dependencies between consecutive levels of the recursion)
-
An RNN level is dependent on ALL previous levels (the complete history of sequence values that precede it) via the Hidden State of the previous level, so it has MEMORY (whereas a shallow FNN is only dependent on the immediately previous level)
-
RNNs have more Memory since each Hidden State contains a complete History of the Input Sequence up to that point
-
-
-
-
Recurrent Neural Network (RNN) Framework - Graphical Models
-
https://www.youtube.com/watch?v=LON9wniFUiE
-
Used Graphical Models to view
-
Unfolded View of recursions
h1 = α
h2 = f(h1, s1)
h3 = f(h2, s2)
h4 = f(h3, s3)
...
ht = f(ht-1, st-1)
-
Unfolded Graphical View of a Purely recursive sequence that is a Hidden Sequence h driven by s
-
Adding a Prediction ^st means the Graphical Model must denote a Supervised Learner, showing where information flows when using Gradient Descent
-
-
Folded View (compact) of recursions
h1 = α
ht = f(ht-1, st-1), t >= 2
- Folded Graphical View of the model of a driven Hidden Sequence, but when adding ^st we know the model is being used as a Predictor
-
-
- Recurrent Neural Network (RNN) Framework - Training - Technical issues
- Technical issues
- Optimisation
- Vanishing Gradient problem (affects FNNs and RNNs)
-
Mitigation
-
Regularising Activation Functions or different level Architectures such as Long Short Term Memory (LSTM)
-
Variations of Stochastic Gradient Descent (SGD) (modifications to avoid the issue)
-
Basic concerns about depth (since each State in an RNN adds a Hidden Layer to the corresponding unrolled network) may be mitigated by Windowing: cutting Longer Sequences into Shorter Sequences and treating them as a Batch
-
-
- Vanishing Gradient problem (affects FNNs and RNNs)
-
Memory life
-
Data requirements
- Deep Networks (function approximators like RNNs) for high performance require large datasets for their expressive power to show cutting edge results
i.e. text-generation with ‘000s of datapoints to play with
- Optimisation
- Technical issues
- RNN Summary
-
https://www.youtube.com/watch?v=EFrAo74C8Ow
-
Notes: Activation Functions
- Sigmoid Layer
- decides what parts of the cell state we’re going to output (i.e. forget is 0, keep is 1)
- Tanh Layer
- put the cell state through tanh (to push the values to be between −1 and 1) as candidates
- ReLU
- outputs the maximum of 0 and the input value
- Sigmoid Layer
-
- Links
- RNNs from Deep Learning https://github.com/angelmtenor/deep-learning/blob/master/intro-to-rnns/Anna_KaRNNa.ipynb
-
Long Short Term Memory Networks (LSTMs)
- Slides
- Architecture of LSTMS
- https://www.youtube.com/watch?v=70MgF-IwAr8
- https://www.youtube.com/watch?v=gjb68a4XsqE
- Architecture of RNN
- https://www.youtube.com/watch?v=ycwthhdx8ws
- Architecture of LSTMS
- Definition:
-
Useful when neural network needs to switch between remembering recent things, and things from long time ago
-
AWESOME Links about LSTMs
-
WOW - http://colah.github.io/posts/2015-08-Understanding-LSTMs/
-
WOW - http://blog.echen.me/2017/05/30/exploring-lstms/
-
Augmented RNNs & Attention http://distill.pub/2016/augmented-rnns/
-
Other RNN/LSTM with Python Code character-level language models http://karpathy.github.io/2015/05/21/rnn-effectiveness/
-
Translation https://arxiv.org/pdf/1508.07909.pdf
-
- Issue with RNNs
- RNNs mostly store Short-Term Memory
- Key past information we’re trying to predict with may be in an earlier RNN (in sequence of RNNs) but input information is squished repeatedly by Sigmoid functions (that output numbers between 0 and 1 describing how much each component should be let through) and lost, so we need Long Term Memory
- Training a network (of multiple RNNs) using Backpropagation (recursive application of the chain rule from calculus) all the way back to much earlier RNN may lead to problems like the Vanishing Gradient
-
Solution - LSTM Networks
-
Input to LSTM has both Long-Term Memory and Short-Term Memory that both get merged at each stage with current Event (what we just saw) (protects old information better)
-
Goal of LSTM Node Architecture of Gates:
- Create Prediction
- of what the image is (i.e. long-term memory may be required to know this) by combining Long-Term Memory, Short-Term Memory, and Event
- Update with New Long Term Memory for next stage
- by merging Long-Term Memory and Short-Term Memory and Event
- Update with New Short-Term Memory for next stage
- by merging Long-Term Memory and Short-Term Memory and Event
- Create Prediction
-
Output is a prediction of what the Input is, and also forms part of the Input for the next iteration of the Neural Network
-
-
Architecture of LSTMs
-
Gates in LSTM Node Architecture:
-
Summary of All Gates using an arbitrary Architecture that is known to work
-
https://www.youtube.com/watch?v=IF8FlKW-Zo0
-
TODO - Invent new Architectures that actually WORK as this spaces is under development
-
-
Forget Gate
-
https://www.youtube.com/watch?v=iWxpfxLUPSU
-
Input is Long-Term Memory (LTM), which is multiplied by a Forget Factor ft (to forget everything no longer considered useful)
- ft is calculated with a small 1-Layer Neural Network that combines the inputs STM and E with a Linear Function:
  ft = σ(Wf[STMt-1, Et] + bf)
-
-
Learn Gate
-
Slide https://www.youtube.com/watch?v=aVHVI7ovbHY
- Combines the inputs Short-Term Memory (STM) and Event (E) by joining the Vectors (STM and E), multiplying by a matrix of Weights (W), adding a Bias (b), and squishing the result with a tanh Activation Function to create New Info (N):
  Nt = tanh(Wn[STMt-1, Et] + bn)
- Forgets/Ignores any unnecessary info by multiplying N element-wise by an Ignore Factor Vector it, so the Learn Gate output is Nt * it
- Note: it is calculated by building a small Neural Network that accepts Inputs of Short-Term Memory and Event, passing them through a Linear Function (with a new Weights matrix and a new Bias) and squishing with a Sigmoid Function to keep values between 0 and 1:
  it = σ(Wi[STMt-1, Et] + bi)
-
-
Remember Gate
-
https://www.youtube.com/watch?v=0qlm86HaXuU
- Accepts combination of
- Forget Gate (input is LTM)
- Learn Gate (input is combined STM and E)
- Outputs a New Long Term Memory
LTMt = LTMt-1 * ft + Nt * it
-
-
Use Gate
-
https://www.youtube.com/watch?v=2kDufi6FDjU
-
Decides what information to use from what we previously knew plus what we now know, and uses it to make a Prediction
- Accepts combination of
- Forget Gate (input is LTM)
- Learn Gate (input is combined STM and E)
- Outputs a New Short Term Memory (which is the Prediction)
- Input to the FORGET GATE is LTMt-1
- Output of the FORGET GATE is passed through a small Neural Network #1 that uses the tanh Activation Function:
  Ut = tanh(Wu * (LTMt-1 * ft) + bu)
- Inputs STM and E are applied to another small Neural Network #2 using the Sigmoid Activation Function:
  Vt = σ(Wv[STMt-1, Et] + bv)
- The Final Output multiplies the Outputs of small Neural Network #1 and small Neural Network #2 together:
  STMt = Ut * Vt
-
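A minimal numpy sketch of the Forget / Learn / Remember / Use gate equations above (vector shapes and weight names are illustrative, not from the course code):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(LTM_prev, STM_prev, E, W_f, b_f, W_i, b_i, W_n, b_n, W_u, b_u, W_v, b_v):
    """One LSTM step following the gate equations in the notes above."""
    x = np.concatenate([STM_prev, E])            # joined [STM_{t-1}, E_t] vector

    f_t = sigmoid(W_f @ x + b_f)                 # Forget factor
    i_t = sigmoid(W_i @ x + b_i)                 # Ignore factor
    N_t = np.tanh(W_n @ x + b_n)                 # New info (Learn gate)

    LTM_t = LTM_prev * f_t + N_t * i_t           # Remember gate: new Long-Term Memory

    U_t = np.tanh(W_u @ (LTM_prev * f_t) + b_u)  # Use gate, part 1 (from the Forget gate output)
    V_t = sigmoid(W_v @ x + b_v)                 # Use gate, part 2 (from STM and E)
    STM_t = U_t * V_t                            # new Short-Term Memory (the prediction)

    return LTM_t, STM_t
```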
-
-
-
- Slides
-
Other Architectures that work
-
Slide https://www.youtube.com/watch?v=MsxFDuYlTuQ
-
Gated Recurrent Unit (GRU)
-
LSTM with Peephole Connections
- Previously
- Forget Factor calculated as combo of STM and E (but LTM was not included in decision)
- Now
- Connect the LTM into the Neural Network that calculates the Forget Factor (makes decisions inside the LSTM), where mathematically the Input to Sigmoid is larger since we’re concatenating it with LTM matrix
- Previously
-
-
-
RNN Project
- Time Series Prediction and Text Generation.
- Goal: Use RNNs and LSTMs for two major purposes:
- Predict stock prices.
- Generate Sherlock Holmes text.
- Goal: Use RNNs and LSTMs for two major purposes:
- Time Series Prediction and Text Generation.
- About
Natural Language Processing (NLP)
-
NLP Pipeline
- My Code Examples of it all:
- https://github.com/ltfschoen/AIND-NLP
- Stages
- Text Processing
- Raw text input
- Pandas working with text
- https://pandas.pydata.org/pandas-docs/stable/text.html
- Sources
- Website textual content
- Raw HTML markup
- PDFs
- Word docs
- Speech recognition system
- Book scanned with OCR
- Website textual content
- Pandas working with text
- Build text processing functions
- Clean
- Python Regular Expressions
- https://docs.python.org/3/library/re.html
- Remove HTML tags
- Parse HTML using BeautifulSoup to
extract text without tags
(since Regular Expressions not suitable)
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Use Beautiful Soup to walk the DOM tree
- Parse HTML using BeautifulSoup to
extract text without tags
(since Regular Expressions not suitable)
- Remove non-relevant data
- Remove source-specific markers
- Retain only plain text
to reduce complexity of procedures
(i.e. and, the, are, of)
- Use NLTK Stopwords
- Python Regular Expressions
- Normalise
- Lowercase (so each word represented by unique token)
- Remove punctuation
- for Document Classification and Clustering where low-level details don’t matter much
- Replace with Space so words don’t concatenate
- Remove extra spaces
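A minimal sketch of the Clean + Normalise steps above (the HTML string is illustrative; assumes the `bs4` package is installed):
```python
import re
from bs4 import BeautifulSoup

raw_html = "<p>The first Udacity review is <b>great</b>!</p>"   # illustrative input

# Clean: strip HTML tags by walking the parse tree rather than using Regular Expressions
text = BeautifulSoup(raw_html, "html.parser").get_text()

# Normalise: lowercase, replace punctuation with spaces, collapse extra spaces
text = text.lower()
text = re.sub(r"[^a-z0-9]", " ", text)
text = re.sub(r"\s+", " ", text).strip()

print(text)   # "the first udacity review is great"
```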
- Tokenise (aka Symbol)
- http://www.nltk.org/api/nltk.tokenize.html
- Split text into Tokens (Sequence of Words)
- Remove common words that don’t offer meaning to reduce complexity and still may be inferred (aka NLTK Stop Words)
- NLTK
- Smart way of tokenising
- Even includes one for parsing Twitter handles, hashtags, emoticons, etc
- Process Words
- Identify different Parts of Speech using NLTK pos_tag:
- Parts of Speech (Nouns, Verbs, Named Entities)
- Understand words in sentence to better understand what’s being said
- Identify Relationships between Words
- Identify Cross-References between Words
- Named Entities are Noun phrases that refer to specific Object,
Person, or Place. Use
ne_chunk
to label Named Entities in text- Usage: Index and search for news articles of companies that are of interest
- Parts of Speech (Nouns, Verbs, Named Entities)
- Convert Words into Canonical forms (to further simplify and normalise
different variations of words) using:
- Stemming
- Dfn: Reducing a word to its ‘stem’ (aka root form) to reduce complexity
whilst retaining essence of meaning of Words
- i.e. branch is the root of branches, branching, branched, since all convey the same thing.
- i.e.
- Note: Important that all Words are reduced to the SAME STEM since
captures same common idea (ok spelling mistake in root)
- i.e. cach is ok as the root of caches, caching
- i.e.
- NLTK Stemming Types include:
- Porter
- Snowball
- Language-specific Stemmers
- Dfn: Reducing a word to its ‘stem’ (aka root form) to reduce complexity
whilst retaining essence of meaning of Words
- Lemmatisation
- Dfn: Reduces words to normalised root form, but uses a Dictionary to
map different variances of a Word back to its root to overcome
non-trivial inflections
- i.e. be is the root of is, were, was
- i.e. one is the root of the plural ones
- i.e.
- NLTK Lemmatisers
- WordNet database (Default)
- Usage: Initialise instance of WordNet Lemmatiser and pass
individual words to the
lemmatize
method - Note:
- Lemmatisers may be more Memory intensive than Stemming since stores in Dict
- Lemmatisers' final form is a meaningful root word (i.e. cache, not cach as would be produced by a Stemmer)
- A Lemmatiser must make assumptions about the Part of Speech (PoS)
for each word it’s trying to transform. i.e. WordNet Lemmatiser
defaults to Nouns, which may be overridden by specifying the
parameter
pos='v'
for Verbs - Chained procedures are often used
- Dfn: Reduces words to normalised root form, but uses a Dictionary to
map different variances of a Word back to its root to overcome
non-trivial inflections
- Stemming
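A minimal NLTK sketch of the tokenise / stop-word / stem / lemmatise steps above (the input string is illustrative; assumes the punkt, stopwords, and wordnet corpora have been downloaded):
```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')  # one-time

text = "the quick brown foxes were jumping over the branches"   # illustrative, already cleaned/normalised

tokens = word_tokenize(text)                                          # split into word Tokens
tokens = [t for t in tokens if t not in stopwords.words('english')]   # drop Stop Words

stems = [PorterStemmer().stem(t) for t in tokens]                     # e.g. 'branches' -> 'branch'
lemmas = [WordNetLemmatizer().lemmatize(t, pos='v') for t in tokens]  # e.g. 'were' -> 'be' with pos='v'
```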
- Identify Different using NLTK
- Transform ready for next stage
- Raw text input
- Feature Extraction
- Build feature extractors
- Text
- Considerations
- Text data is represented on modern computers using an Encoding (i.e. ASCII, Unicode) that maps each character to a number, stored and transmitted as Binary, which has an implicit ordering (i.e. 65 < 67)
- Incorrect to assume that A < B < C since may mislead our NLP algorithms
- Words carry meaning of concern, NOT individual Characters
- Computers DO NOT have standard representation for Words (since just Sequences of ASCII or Unicode values without Meaning or Relationships captured between Words)
- Goal
- Generate a representation for Text Data
(similar to Pixels used for images) that
we may use as Features for Modelling
- Depends on Model we’re using
- Graph-based Model to extract
insights
- Words represented as symbolic Nodes with Relationships between them (like Coordinates)
- Statistical Models
- Numerical representation required to represent Words
- Graph-based Model to extract
insights
- Depends on Task trying to accomplish
- Document-level task
- Spam Detection
-
Sentiment Analysis
- Per-Document Representation
- Bag-of-Words (BoW)
- Dfn: Treats each Document (unit of text to analyse)
as an Unordered Collection (Bag of Words)
- Example: Document types
- Compare essays prepared by students for plagiarism, then each essay would be a Document
- Analysing the sentiment conveyed by tweets then each tweet would be a Document
- Example: Document types
-
Issue: https://www.youtube.com/watch?v=LYYWIrWbBq4
-
BoW Treats each Word as being equally important, but intuitively we know some Words occur more frequently in a Corpus.
- Solution:
-
TF-IDF assigns Weights to words that signifies their relevance in documents
-
About TF-IDF approach:
- Count the Document Frequency of each Term (the number of Documents in which that Term, aka Column of the Document-Term Matrix, occurs)
- Divide the Term Frequencies by the Document Frequency associated with that Term (giving a Metric that’s proportional to the frequency of occurrence of a term in a document, but inversely proportional to number of documents it appears in).
- Highlights words that are more Unique to a document (with higher value) and are better for characterising it
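A minimal scikit-learn sketch of the TF-IDF weighting described above (the corpus strings are illustrative; note that scikit-learn's TfidfVectorizer uses a smoothed variant of the formula):
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the first document about dogs",
    "the second document about cats",
    "dogs and cats together",
]   # each string is one Document

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)        # Document-Term Matrix with TF-IDF weights

print(vectorizer.get_feature_names_out())       # the Terms (Columns)
print(tfidf.toarray())                          # higher weight = more unique to that Document
```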
-
- Usage of BoW or TF-IDF Representation
- Document Classification Task
- Spam Detection
- Use TF-IDF Vectors as Features as well as labels “Spam” and “Not Spam” to setup a Supervised Learning Problem
- Spam Detection
- Document Classification Task
-
- Steps
- To obtain a Bag of Words from raw text we apply the
Text Processing steps
(cleaning, normalising, splitting into words,
stemming, lemmatisation, etc), then:
- INEFFICIENT: treating the
resulting Tokens as an Unordered Collection
(aka a Set), where multiple occurrences are not counted,
i.e. "Little house on little Prairie" --> { "littl", "hous", "prairi" }
- EFFICIENT: Document-Term Matrix
(illustrates relationship between Documents is Rows and
Words/Terms in Columns where each element is a
Term Frequency i.e. how frequently term occurs in a document)
https://www.youtube.com/watch?v=A7M1z8yLl0w
Convert each Document
(where a set of Documents is a Corpus)
into a Vector of numbers that represents how many times
a Word occurs in a Document
- Collect all Unique Words present in Corpus (C) to form Vocabulary (V), arrange them in some order as Vector element positions OR Table Columns, then assume each Document is a Row, then count the number of occurrences of each Word in each
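A minimal sketch of building a Document-Term Matrix of raw Term Frequencies (the corpus strings are illustrative):
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "little house on the little prairie",
    "the house of the rising sun",
]   # each Document becomes a Row

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)      # rows = Documents, columns = Terms (the Vocabulary)

print(vectorizer.get_feature_names_out())   # the Vocabulary V collected from the Corpus C
print(dtm.toarray())                        # each element is a Term Frequency
```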
- INEFFICIENT to then treat the
resulting Tokens as an Unordered Collection
(aka a Set), and multiple occurrences not included,
- Usage:
- Compare two documents based on how many words they
have in common (or how similar Term Frequencies are).
-
BAD - Mathematically performed by calculating the Dot Product between two row vectors, which equals the sum of the products of the corresponding elements (the greater the Dot Product, the more similar the two vectors are). The flaw of the Dot Product is that it only compares the values that overlap and is not affected by other values that aren't in common, so different pairs may get the same Dot Product as identical pairs.
-
EFFICIENT - Cosine Similarity https://youtu.be/A7M1z8yLl0w?t=3m15s, where we divide the Dot Product of two Vectors by the product of their magnitudes (or Euclidean norms); when Vectors are thought of as arrows in n-dimensional space, this equals the Cosine of the angle Theta between the two vectors.
- Most Similar - Identical Vectors - Cosine output: 1
- No Similarity - Orthogonal Vectors - Cosine output: 0
- Opposite - Exactly Opposite Vectors - Cosine output: -1
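A minimal numpy sketch of the Cosine Similarity calculation above (the vectors are illustrative Term Frequency rows):
```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms (magnitudes)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = np.array([2, 1, 0, 1])   # Term Frequencies for Document 1
doc2 = np.array([1, 1, 1, 0])   # Term Frequencies for Document 2

print(cosine_similarity(doc1, doc1))   # 1.0  (identical vectors)
print(cosine_similarity(doc1, doc2))   # between -1 and 1
```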
-
- Compare two documents based on how many words they
have in common (or how similar Term Frequencies are).
- To obtain a Bag of Words from raw text we apply the
Text Processing steps
(cleaning, normalising, splitting into words,
stemming, lemmatisation, etc), then:
- Treat each document like a BoW
allows computing simple statistics that
characterise it, where the Statistics may be
improved by assigning appropriate Weights
to Words using a TFIDF Scheme
(Term Frequency–Inverse Document Frequency) for more
accurate comparison between documents.
- i.e. certain apps need Numerical representations of individual Words, using the Word Embeddings method
- One-Hot Encoding
  - https://www.youtube.com/watch?v=a0j1CDXFYZI
  - Background:
    - Previously we characterised an entire Document (Collection of Words) as one Unit, where inferences made are at Document-level
      - Document Topics
      - Document Similarity
      - Document Sentiment
  - Purpose:
    - Deeper Analysis of Text requires a Numerical representation for each Word (treat each word like a Class, and assign each a Vector that has 1 in the position matching the word and 0 elsewhere)
    - Similar to BoW, but we keep a Word in each Bag and build a Vector for each
  - Issues
    - Doesn't work when we have a Large Vocabulary to deal with, since the size of the Word representation grows with the qty of Words, so need to use:
      - **Word Embeddings** as a way to **Control the Size of the Word Representation by limiting it to a fixed-sized Vector** (i.e. find an embedding for each Word in Vector space that exhibits desired properties)
        - i.e.
          - if two words are similar in meaning they should be closer together than those that aren't
          - if two pairs of words have a similar difference in their meaning they should be separated by a similar distance in vector space
- **Word2Vec** (Word Embedding type)
  - Dfn: Popular **Word Embedding** used in practice, transforming Words into Vectors
  - Approaches:
    - https://www.youtube.com/watch?v=7jjappzGRe0
    - **Continuous Skip-Gram Model**
      - About:
        - Model that, given a Middle Word, predicts the Neighbouring Words in a Sentence; such a Model is likely to capture the contextual meaning of Words
      - Steps
        - Pick any Word in a Sentence
        - Convert the Word into a One-Hot Encoded Vector
        - Feed the One-Hot Encoded Vector into a Neural Network (or other probabilistic model) that's designed to Predict the surrounding words (its context)
        - The Loss Function optimises the Weights or parameters of the Model; repeat until it learns to predict context words as best possible
        - Note: Take an intermediate representation such as a Hidden Layer in the Neural Network, then the Outputs of that layer for a given word become the corresponding "Word Vector"
    - **Continuous Bag of Words Model**
      - About
        - Model that, given some Neighbouring Words in a Sentence, predicts the Word in the Sentence; likewise likely to capture the contextual meaning of Words
  - Robust representation of words since the Meaning of each Word is distributed throughout the Vector
  - Vector size is independent of vocabulary (the size of the word vector is up to us, trading performance vs complexity, but the **Vector size remains Constant no matter how many Words we Train on**). Note that this differs from Bag of Words, where Vector size grows with the number of unique words.
  - Pre-Train a Large number of Word Vectors, so they can be used Efficiently without having to transform repeatedly, since they are Trained once and Stored in a Lookup Table
  - Ready for use in Deep Learning Architectures
    - i.e. may be used as an Input Vector for RNNs
    - Possible to use RNNs to learn even **better Word Embeddings**
  - **Optimisations** possible to further reduce Model and Training Complexity, such as
    - Representing Output Words using:
      - **Hierarchical Softmax**
    - Computing Loss using:
      - **Sparse Cross Entropy**
- **GloVe (Global Vectors for Word Representation)** (Word Embedding type)
  - https://www.youtube.com/watch?v=KK3PMIiIn8o
  - About
    - TODO - https://nlp.stanford.edu/pubs/glove.pdf
    - Directly Optimise the Vector representation of each Word using **Co-Occurrence Probability Statistics** (with Context and Target Word occurrences) to capture Similarities and Differences between Words
    - Differs from Word2Vec (which sets up an auxiliary word prediction task)
    - Note: Use the Log of the values since the values of the Co-Occurrence Probabilities are small
    - Note: A Refinement over using Raw Probabilities is to Optimise for the Ratio of Probabilities
- **Distributional Hypothesis**
  - https://www.youtube.com/watch?v=gj8u1KG0H2w
  - Words occurring in the same Context tend to have similar Meanings
  - When a large context of Sentences is used to Learn a Word Embedding, Sentences with common context Words yield Vectors that are closer together
  - Add another **Dimension** in the Word Vectors to capture **Differences** and **Similarities** where Word meanings vary, making the Word Vector more expressive
- **Example: Neural Network Architecture for NLP task of Predicting a Word** (see the sketch below)
  - Add a Word Embedding Layer and apply **Transfer Learning**
    - Narrow scope model (i.e. medical terminology)
    - Broad scope model
  - RNN Layer Example https://youtu.be/gj8u1KG0H2w?t=3m36s
    - Use a Pre-Trained Word Embedding as a Lookup (i.e. Word2Vec or GloVe)
    - Then only need to Learn/Train the later Recurrent Layers specific to our task
      ```
      - One-hot encoded word
      - Word Embedding - Lookup (Word2Vec or GloVe)
      - Word Vector
      - Learn
        - Recurrent Layers
        - Dense Layers
      - One-hot encoded output
      ```
- **t-SNE (t-Distributed Stochastic Neighbor Embedding)**
  - Dfn:
    - Dimensionality reduction technique that maps high-dimensional vectors to a lower-dimensional space; useful for **Visualising Word Embeddings** since it preserves the linear Substructures and Relationships learnt by the Word Embedding Model
    - Clusters groups of Words or Images according to associated Class Labels
    - Tool for better understanding the representation that a network learns and for identifying bugs
    - Similar to Principal Component Analysis (PCA) but adds an extra property when performing the transformation, whereby it tries to maintain relative distances b/w objects so
      - Similar objects stay close together
      - Dissimilar objects stay apart
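A minimal Keras sketch of the Embedding-then-Recurrent-layers architecture above (vocab size, embedding dimension, sequence length, and layer sizes are illustrative; a pre-trained Word2Vec/GloVe matrix could be supplied to the Embedding layer via its `weights` argument):
```python
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

vocab_size = 10000      # illustrative vocabulary size
embedding_dim = 100     # illustrative fixed-size word vector length
sequence_length = 20    # illustrative number of words per input sequence

model = Sequential()
# Embedding layer acts as the Word Vector lookup; set trainable=False to reuse pre-trained embeddings
model.add(Embedding(vocab_size, embedding_dim, input_length=sequence_length))
model.add(SimpleRNN(64))                             # Recurrent layer learnt for our specific task
model.add(Dense(vocab_size, activation='softmax'))   # one-hot style output over the vocabulary
model.compile(loss='categorical_crossentropy', optimizer='adam')
```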
- Dfn: Treats each Document (unit of text to analyse)
as an Unordered Collection (Bag of Words)
- Doc2Vec
- Bag-of-Words (BoW)
- Individual Words and Phrases for
Text Generation and Machine Translation
- Word-level Representation
(i.e. fox -> 0.4,0.7,0.1,0.5
dog -> 0.4,0.5,0.2,0.6)
- Word2Vec
- GloVe
- Word-level Representation
(i.e. fox -> 0.4,0.7,0.1,0.5
dog -> 0.4,0.5,0.2,0.6)
- Document-level task
- Depends on Model we’re using
- Generate a representation for Text Data
(similar to Pixels used for images) that
we may use as Features for Modelling
- Considerations
- Images
- Images stored in computer memory, where each
pixel contains the relative intensity of light
at a location in the image.
- Colour images have 1x value per Primary colour Red, Blue, Green, that carry relevant info, so Two Pixels with similar values are perceived similar, so is OK to use Pixel values in Numerical Model (after Edge Detection and Filtering)
- Images stored in computer memory, where each
pixel contains the relative intensity of light
at a location in the image.
- Text
- Extract/produce relevant feature representations
that are:
- appropriate for model type planning to use
- appropriate for the NLP task we're trying to accomplish
- Build feature extractors
- Modelling
- Dfn and Usage:
- Observations in a form that allows us to understand them better and predict new unseen occurrences
- Build models that achieve various NLP tasks:
- Classification Models
- Sentiment Analysis
- Spam Detection
- Topic Modelling
- Grouping Related Documents
- Ranking
- Improving Search Relevance
- Machine Translation Systems
- Converting Text between languages
- Others
- Extending and adapting techniques to design an appropriate solution
- Classification Models
- Steps
- Build models for NLP tasks
- Design Baseline Model
- Statistical model
- Machine Learning model
- Fit Model Parameters to Training data using an Optimisation procedure (Known data)
- Use to make Predictions on Unseen data
- Considerations
- Numerical Features allow use of any
Machine Learning Model
- Support Vector Machines
- Decision Trees
- Neural Networks
- Custom Models (combining multiple for improved performance)
- Numerical Features allow use of any
Machine Learning Model
- Utilising
- Deploy as Web/mobile app
- Integrate with other services
- Dfn and Usage:
- Iterate
- Rethink features that are required and in turn our text processing routines
- Text Processing
- Considerations
- Dependencies between steps
- Design decisions
- Choose existing libraries and tools
- Non-linear workflow of iterating repeatedly
- Project: Machine Translation
- Different Methods:
- Rule-Based Machine Translation (RBMT) - Classical
- https://en.wikipedia.org/wiki/Rule-based_machine_translation
- Based on Linguistic Info about Source and Target languages retrieved from Multi-lingual Dictionaries and Grammars that cover Semantic, Morphological, and Syntactic regularities of each language
- Given Input Sentences (Language A), the RBMT System generates Output Sentences (Language B) based on Semantic, Morphological, and Syntactic Analysis of both the Source and Target Languages in the translation task
- Statistical Machine Translation
- https://en.wikipedia.org/wiki/Statistical_machine_translation
- Translations generated based on Statistical Models whose parameters are derived from the analysis of a Bilingual Text Corpus
- Example-based Machine Translation
- https://en.wikipedia.org/wiki/Example-based_machine_translation
- Bilingual Corpus with parallel texts, with translation by analogy (a case-based reasoning approach)
- Rule-Based Machine Translation (RBMT) - Classical
- Problems
- Still unsolved, just many papers
- Solutions
- Neural Networks large leap forward
- Task
- Build Deep Neural Network that functions as part of end-to-end Machine Translation Pipeline that accepts English text Input and returns French translation Output
- Different Methods:
- My Code Examples of it all: