Artificial Intelligence Nanodegree Term 2
Table of Contents
Chapter 1 - Deep Learning
Links
-
Forum https://discussions.udacity.com
-
Other Solutions
- https://github.com/lemuelbarango/dog-breed-classifier
- https://github.com/binojohnthomas/AIND-RNN
- https://github.com/morganel?tab=repositories
- cmiller112000
- angelmtenor
- https://github.com/angelmtenor/deep-learning
Graduated
- Email: ndgrad-support@udacity.com, alumni-support@udacity.com
Concentrations
- About
- introductory project that utilizes a commercially-available API to build a complete solution, and,
- capstone project where you will try to solve one challenging problem in the chosen area
-
Concentrations
- NLP - label words in sentence with Part-of-Speech (POS) tags as named entities
- Grammars https://classroom.udacity.com/courses/cs101/lessons/48299949/concepts/487192400923
- VUI
- Links
- https://developer.amazon.com/alexa-skills-kit/alexa-skills-developer-training
- Applications
- Apple Siri
- Microsoft Cortana
- Google Home
- Amazon Alexa on Echo
- Device
- Amazon Echo Dot V2 http://www.smarthome.com.au/z-wave/z-wave-accessories.html
- Links
- NLP - label words in sentence with Part-of-Speech (POS) tags as named entities
- Previews
- Voice Preview https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/38e74312-3173-4456-919d-bcb00a82bfb5/lessons/dc1efdfd-e07f-4a5c-ab35-dbb274a25c88/concepts/last-viewed
- NLP Preview https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/ac7813e7-2907-44e4-a9a5-388bcc4edd38/lessons/5f3de1f2-df97-46c4-a2ba-82418c66f9e5/concepts/last-viewed
- Computer Vision Preview https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/0917fcad-9a95-401c-9c92-8cec8f6dc09e/lessons/260fb8ce-eb1d-4ea2-864b-d8ed31b7082f/concepts/last-viewed
TODO
- AWS Training Free
- https://www.aws.training
- https://www.awseducate.com/microsite/Training
Instructors
- Luis Serrano - Course Developer / Instructor (did ML for Google Youtube recommendations)
- Alexis Cook - Applied maths / biologist, uses Deep Learning @alexis_b_cook
- Arpan Chakraborty - Builds Computer Vision and Machine Learning courses
- San Camacho - Expert in Computer Vision (applied in medical tech to self-driving car navigation)
- Dana Sheehan - Elec Engineer, MSc with interest in AI
- Jeremy Watt - ML engineer educator and uni textbook author, likes NLP and Computer Vision, and maths optimisation, wrote “Machine Learning Refined”
- Raisa Honey - Deep Learning researcher with teaching experience in ML, wrote “Machine Learning Refined”
Deep Learning
- Applications
- Defeating humans in games
- Detecting spam emails
- Forecasting stock prices
- Recognising images in pictures
- Diagnosing illnesses
- Self driving cars
- Components
- Neural Networks
- Dfn: Mimics the way the brain operates, with neurons that fire off information
- Way to visualise Neural Networks: given data comprising groups of Red and Blue points, a Neural Network finds the best line that separates the groups; for complex data, the boundary separating the points will be more complex
- Types:
- Deep Neural Networks - many Nodes, Edges, Layers
- Neural Networks
- Classification Problems
- Example 1: Predict whether student is admitted into university by analysing
n-Columns of known admissions data samples from prior student applicants
and Plot data on graph in n-Dimensions. Create an equation/Model
that generates a “Line” (if 2 columns, 2D), “Plane” (if 3 Columns, 3D),
or “Hyper Plane” (if n Columns, n-D) that is a Linear Demarcation Boundary
between all known samples points that were “Accepted” and “Rejected”
Problem
- Given students' Test and Grade results and an Admissions officer
- Given known admission samples, predict whether another student is admitted
- Known (previous data):
    Accepted - Student 1 - Test 9/10, Grade 8/10
    Rejected - Student 2 - Test 3/10, Grade 4/10
- Unknown:
    ?        - Student 3 - Test 7/10, Grade 6/10
Solution
- Determine how many Columns we want to plot, which determines how many Dimensions the plot will be in (including "Higher Dimensions") i.e.
    2x Columns - Plot in 2D with demarcation boundary "Line"
    3x Columns - Plot in 3D with demarcation boundary "Plane"
    nx Columns - Plot in nD with demarcation boundary "Hyper Plane"
- Plot all Known results i.e. if 2D then Test results on X1 (X-axis), Grade results on X2 (Y-axis)
- Create a Demarcation Boundary representing our Model between where students are likely accepted/rejected.
- The Linear Equation represents the demarcation of the Model. Substitute the Unknown student's Test and Grade results into the equation: if the score < 0 then Reject, if >= 0 then Accept. i.e.
    W1 * X1 + W2 * X2 + b * 1 = 0   # dot product of Weight and Input Vectors, plus Bias
    Wx + b = 0                      # abbreviated in Vector notation
        where W = (W1, W2)          # Vector W - Weights
        where x = (X1, X2)          # Vector x - Inputs
        where b                     # Bias
        where Y = 0 or 1            # Label that we're trying to predict
                                    # for given coordinates (X1, X2)
                                    # Y == 0 if Rejected (under demarcation line of Model)
                                    # Y == 1 if Accepted (above demarcation line of Model)
    Note: Points are of the form (X1, X2, Y)
    ^y = 1 if Score of Wx + b >= 0  # y Hat is what the algorithm Predicts
         0 if Score of Wx + b <  0  # the labels will be
    Note: Points on the Demarcation Boundary Line have a Score of 0 when their coordinates are substituted into the equation.
    Note: The Goal of the algorithm is to have ^y (prediction) resemble Y (actual) as closely as possible (i.e. finding the demarcation boundary line that keeps the previous Y == 1 above it, and Y == 0 below it)
- If we have >= 3x Data Columns (i.e. Test Result, Grade Result, Class Rank) we Fit the data by using >= 3x Dimensions/axes (i.e. X1, X2, X3) i.e.
    X1 (X) - Test
    X2 (Y) - Grades
    X3 (Z) - Class Rank
  - Plot each sample as a point on the graph
  - The Demarcation Boundary is plotted as a 3D "Plane" (possibly on an angle)
      W1 * X1 + W2 * X2 + W3 * X3 + b * 1 = 0   # dot product of the three-entry Vectors, plus Bias
      Wx + b = 0                                # abbreviated in Vector notation
      ^y = 1 if Wx + b >= 0                     # y Hat is what the algorithm Predicts
           0 if Wx + b <  0                     # the labels will be
  - Colour each sample depending on the Region of the sample (from the 2x Regions available) (i.e. whether above or below the "Plane")
- If we have 'n' Data Columns we Fit the data in 'n'-dimensional space, where each sample Point contains coordinates (i.e. (X1, X2, ..., Xn) ), and where the Label is Y, then:
  - Demarcation Boundary: an (n - 1) Dimensional "Hyper Plane" (i.e. a High Dimensional Equivalent of a "Line" in 2D or a "Plane" in 3D)
  - Equation
      W1 * X1 + W2 * X2 + ... + Wn * Xn + b * 1 = 0  # each Vector has 'n' entries, one for each Column of the Data set
      Wx + b = 0                                     # abbreviated in Vector notation
      ^y = 1 if Wx + b >= 0                          # y Hat is what the algorithm Predicts
           0 if Wx + b <  0                          # the labels will be
- Solve the Unknown prediction by plotting it on the most likely demarcation side. Whilst the Model will make some mistakes, we can assume this prediction is correct with some confidence
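As a quick illustration of the prediction rule above, here is a minimal Python/NumPy sketch; the example weights and bias below are assumptions for illustration, not values from the course:
```
import numpy as np

def predict(W, x, b):
    """Return y hat: 1 (Accept) if Wx + b >= 0, else 0 (Reject)."""
    score = np.dot(W, x) + b
    return 1 if score >= 0 else 0

# Hypothetical weights and bias for the 2-column (Test, Grade) example
W = np.array([1.0, 1.0])
b = -10.0

print(predict(W, np.array([9, 8]), b))  # Student 1 -> 1 (Accepted)
print(predict(W, np.array([3, 4]), b))  # Student 2 -> 0 (Rejected)
print(predict(W, np.array([7, 6]), b))  # Student 3 -> 1 (predicted Accept)
```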
- Perceptrons
- Dfn:
- Perceptrons are why these are called Neural Networks, since they resemble the Neurons in the Brain. A Perceptron calculates an equation in Node 1 based on the Input Node values. Similarly, a Brain Neuron takes inputs from its "Dendrites" (nervous impulses), processes them, and decides whether to output a nervous impulse via the Axon. Neural Networks are created by concatenating multiple (Multi-Layered) Perceptrons, mimicking the way the Brain connects Neurons, by making successive outputs the inputs to the next.
-
Visualise
INPUT NODES           NODE 1 - SUMMATION -                NODE 2 - STEP -
                      CALC LINEAR EQN ON INPUTS           APPLIES STEP
                      AND THE WEIGHTS                     FN TO RESULT
===========           ==================================  =====================
                      LINEAR FUNCTION / PLOT              STEP FUNCTION
| X1 | --- W1 --->
| X2 | --- W2 --->    | Wx + b = (n ∑ i=1) WiXi + b | ---> | Wx + b >= 0 ??? | ---> YES: 1
...                   |                             |      | Wx + b <  0 ??? |      NO:  0
| Xn | --- Wn --->
| 1  | --- b  --->
-
NOTE: In the future we’ll use different Step Functions
-
Building block of Neural Networks where we encode the Equation (that defines our Model) into a small graph
- Build the Perceptron by:
- Create Model Plot Node containing our Plot inside (showing our Boundary Line and Data Points)
- Create Input Nodes (i.e. for each Sample value for all Columns) to the Model Node
-
Perceptron will plot the Sample values at a point on the Model Node Plot and checks if point is in Positive Region (returns Yes) or Negative Region (returns No)
i.e. Given the equation:
    0 = W1 * X1 + W2 * X2 + b * 1          # Linear Equation Boundary
Substitute:
    Score = 2 * Test + 1 * Grade - 18 * 1  # Linear Equation with the Weights
                                           # and Input Types substituted
i.e. if we have the following:
    Weight 1 (Test) : 2      Test result : 7
    Weight 2 (Grade): 1      Grade result: 6
    Bias unit       : -18
we then plot the following on the Perceptron:
    - Point (7,6) on the plot
    - Edge 2 between Input Node (Test) and the Model Plot Node
    - Edge 1 between Input Node (Grade) and the Model Plot Node
    - Bias -18: label this value over the Model Plot Node
Outcome:
    Now when we see a Perceptron having Nodes with these labels, we can think of the Linear Equation the nodes generate i.e.
        | TEST = 7   | ---- 2 ----> | -18 |
        | GRADES = 6 | ---- 1 ---->
Alternative:
    Alternatively we can include the Bias as an Input Node (i.e. think of it in the equation as b * 1) and have b labelling an Edge coming from a 1.
    Then the Model Plot Node multiplies the values from the incoming Nodes by the values on the corresponding Edges i.e.
        | TEST = 7   | ---- 2   ----> | SCORE = 2 * 7 + 1 * 6 - 18 * 1 = 2 |
        | GRADES = 6 | ---- 1   ---->
        | BIAS = 1   | ---- -18 --->
Finally it checks if SCORE is >= 0 or < 0 and returns:
    1 (i.e. YES) for SCORE >= 0, signalling the student is accepted
    0 (i.e. NO)  for SCORE <  0
i.e. in general
    | X1 | --- W1 --->
    | X2 | --- W2 --->   | Wx + b = (n ∑ i=1) WiXi + b | ---> | Wx + b >= 0 ??? | ---> YES: 1
    ...                                                       | Wx + b <  0 ??? |      NO:  0
    | Xn | --- Wn --->
    | 1  | --- b  --->
Note: we are implicitly using the "Step Function" (i.e. returns 1 if the Input is Positive, and 0 if the Input is Negative)
- Use the “Weights” as Labels in the Plot, since they define the Linear Equation itself
- Example: Perceptrons as Logic Operators
-
Create Perceptrons for logic operators including AND, OR, NOT, and XOR. https://www.youtube.com/watch?v=45K5N0P9wJk
- AND Perceptron
i.e. Plots the Boundary line from substituting the Weights + Bias into the equation. Then plots each point and returns 1 if the point is in the Positive region, and 0 if the point is in the Negative region (i.e. below the diagonal Boundary line that represents the equation and crosses just below (1, 1))

    | INPUT1 | -------->
                          | AND | -------> OUTPUT
    | INPUT2 | -------->

    Truth Table                   Perceptron Table
    ========================      ========================
    INPUT1   INPUT2   OUTPUT      INPUT1   INPUT2   OUTPUT
    ------------------------      ------------------------
    T        T        T           1        1        1
    T        F        F           1        0        0
    F        T        F           0        1        0
    F        F        F           0        0        0
- Code Implementation
```
import pandas as pd

# TODO: Set weight1, weight2, and bias
weight1 = 0.0
weight2 = 0.0
bias = 0.0

# DON'T CHANGE ANYTHING BELOW
# Inputs and outputs
test_inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
correct_outputs = [False, False, False, True]
outputs = []

# Generate and check output
for test_input, correct_output in zip(test_inputs, correct_outputs):
    linear_combination = weight1 * test_input[0] + weight2 * test_input[1] + bias
    output = int(linear_combination >= 0)
    is_correct_string = 'Yes' if output == correct_output else 'No'
    outputs.append([test_input[0], test_input[1], linear_combination, output, is_correct_string])

# Print output
num_wrong = len([output[4] for output in outputs if output[4] == 'No'])
output_frame = pd.DataFrame(outputs, columns=['Input 1', ' Input 2', ' Linear Combination', ' Activation Output', ' Is Correct'])
if not num_wrong:
    print('Nice! You got it all correct.\n')
else:
    print('You got {} wrong. Keep trying!\n'.format(num_wrong))
print(output_frame.to_string(index=False))
```
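One possible answer to the TODO above (an assumption, not the only valid answer; any weights and bias whose line puts only (1, 1) on the positive side will satisfy the AND table):
```
# One possible solution (assumed):
weight1 = 1.0
weight2 = 1.0
bias = -1.5
# Check: 0+0-1.5 < 0, 0+1-1.5 < 0, 1+0-1.5 < 0, 1+1-1.5 >= 0  -> matches AND
```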
- OR Perceptron
- Similar to AND Perceptron
- The gradient of the Demarcation Boundary is the same as the AND Perceptron, but the line is shifted down, achieved by using different Weights and/or Bias
-
XOR Perceptron
Perceptron Table
========================
INPUT1   INPUT2   OUTPUT
------------------------
1        1        0
1        0        1
0        1        1
0        0        0
-
- Tricks
- Trick for Perceptron Algorithm to Split Data Points and Adjust Linear Equation
- STEP 1:
- Choose random linear equation that defines a line with a Positive area and Negative area (each side of line)
- STEP 2:
- Points indicate if they are Correctly or Incorrectly
Classified (i.e. whether on correct side or not)
so we may improve the line
- Misclassified point indicates to line to come closer to it
- STEP 3:
- Adjust the linear equation (and associated line)
based on feedback from points instructing line how to move
i.e. given a Linear Equation and its relative regions:

    Positive Region    3 * x1 + 4 * x2 - 10 > 0
    LINE               3 * x1 + 4 * x2 - 10 = 0
    Negative Region    3 * x1 + 4 * x2 - 10 < 0

STEP 1: Misclassified Point in the POSITIVE Region

    Given a Misclassified Point incorrectly located in the POSITIVE Region:
        POINT (4, 5)
    move the Line closer to the Point by using the 4 and 5 to modify the Linear Equation. Given the following:
        1) Parameters of the Line:  3, 4, -10
        2) Point and Bias Unit:     4, 5, 1
    Avoid moving toward this NEW LINE drastically, as it may result in accidentally Misclassifying all the other Points. Instead we want to make SMALL steps toward the point, by using the LEARNING RATE (a small number).
        LEARNING RATE: 0.1
    First multiply 2) by the LEARNING RATE to get 3):
        4 * 0.1    5 * 0.1    1 * 0.1
        0.4        0.5        0.1
    then Subtract 3) from 1) to get the NEW LINE Parameters:
          3      4      -10
        - 0.4    0.5      0.1
        ----------------------
          2.6    3.5    -10.1
    this gives us a NEW LINE equation, which has moved closer to the Point:
        2.6 * x1 + 3.5 * x2 - 10.1 = 0

STEP 2: Misclassified Point in the NEGATIVE Region

    Repeat similarly to STEP 1, but instead of Subtracting, we use Addition:
        1) Parameters of the Line:  3, 4, -10
        2) Point and Bias Unit:     1, 1, 1
        LEARNING RATE: 0.1
        1 * 0.1    1 * 0.1    1 * 0.1
        0.1        0.1        0.1

          3      4      -10
        + 0.1    0.1      0.1
        ----------------------
          3.1    4.1     -9.9

        3.1 * x1 + 4.1 * x2 - 9.9 = 0
-
Perceptron STEP Algorithm - Linear Data
-
PSEUDOCODE
1. Random Weights
   Start with random weights: w1, ..., wn, b
   Apply to the Line Equation: Wx + b = 0 (separates the Positive and Negative areas)
2. For every misclassified point (x1, ..., xn), repeat 2.1 and 2.2 until there are no more misclassified points
   2.1 If `prediction = 0` (i.e. a Positive-Labelled Point misclassified in the Negative area)
       Then: Update the weights as follows (adding):
           for i = 1 ... n
               change wi to wi + α * xi
                   where α is the Learning Rate (i.e. 0.1)
           change the Bias Unit b to (b + α)
           (to move the Line closer to the Misclassified Point)
   2.2 If `prediction = 1` (i.e. a Negative-Labelled Point misclassified in the Positive area)
       Then: Update the weights as follows (subtracting):
           for i = 1 ... n
               change wi to wi - α * xi
                   where α is the Learning Rate (i.e. 0.1)
           change the Bias Unit b to (b - α)
           (to move the Line closer to the Misclassified Point)
-
CODE
-
Implement the Perceptron STEP Algorithm to separate data in a CSV file
Perceptron steps:
    Given:
        Point Coords (p, q)
        Label y
        Prediction Equation ^y = step(w1 * x1 + w2 * x2 + b)
    Then:
        If the Point is Correctly classified:
            do nothing
        If the Point is Incorrectly classified (i.e. Positive classification but has a Negative Label):
            SUBTRACT (α * p, α * q, α) from (w1, w2, b)
        If the Point is Incorrectly classified (i.e. Negative classification but has a Positive Label):
            ADD (α * p, α * q, α) to (w1, w2, b)
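A minimal Python/NumPy sketch of these steps (the function and variable names are my own, not necessarily those of the course notebook):
```
import numpy as np

def step(t):
    # Step activation: 1 if the score is non-negative, else 0
    return 1 if t >= 0 else 0

def prediction(x, W, b):
    # y hat = step(Wx + b)
    return step(np.dot(W, x) + b)

def perceptron_step(X, y, W, b, learn_rate=0.01):
    """One pass over all points, nudging the line toward each misclassified point."""
    for i in range(len(X)):
        y_hat = prediction(X[i], W, b)
        if y[i] - y_hat == 1:       # positive point classified as negative -> add
            W = W + learn_rate * X[i]
            b = b + learn_rate
        elif y[i] - y_hat == -1:    # negative point classified as positive -> subtract
            W = W - learn_rate * X[i]
            b = b - learn_rate
    return W, b
```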
- Graph the output of the Perceptron Algorithm by clicking on
test run
- Draws Dotted Lines, showing how algorithm approaches Best solution (Black Solid Line)
- Note:
- Modifying Perceptron Algorithm Parameters including:
- epochs
- Learning Rate
- Randomising initial parameters
-
Perceptron STEP Algorithm - Non-Linear Data
- Example - Error Function and Gradient Descent
LINEAR
========================
Given the following data point:
    * Test: 9/10
    * Grades: 1/10
Using the Boundary Line, the student will be accepted, since the point is on the Positive side of the line.

NON-LINEAR (ALTERNATIVE):
========================
If we only want to accept candidates based on CUSTOM criteria (i.e. must have Test >= 5 and Grades >= 5), then we need to Label the data points differently, and the Positive and Negative regions cannot be separated by just a Boundary "Line". Instead we need to separate with a "Circle", "Curve" or "Multiple Lines".

Redefine the Perceptron Algorithm that we created for Boundary "Lines" so that it generalises to other types of "Curves".

An "Error Function" is used with an Error Metric (distance) to tell us and the computer how badly it is doing:
    - show us how far we are from the ideal solution so we can repeatedly take steps to decrease the error and eventually solve the problem:
        - check in what directions we can take subsequent steps to get closer to the solution
        - pick the direction that takes us the farthest (decreases the error distance the most)

The "Gradient Descent" Method is then used, with issues to overcome:
    Issues:
        - Local Error Minimum (getting stuck) - which often still gives a good solution to the problem
- Example - Goal Split Data
Discrete Error Fn vs Continuous Error Fn (a Continuous, Differentiable Error Fn that uses high Penalty weights for misclassified points allows solving the problem with Gradient Descent)
```
- Given data points plotted.
-
Given a Boundary Line between Positive and Negative region
- Goal is to inform computer how far it is from
perfect solution.
- Count qty of errors (i.e. data points misclassified on wrong side of line)
- Decrease the qty errors
- Check in which directions we can move/rotate the Boundary Line to reduce the errors
- PROBLEM:
- DISCRETE ERROR FN - This algorithm uses Calculus to take small steps (by taking derivatives), but with small steps each step may not reduce the number of errors (similar to being on top of a pyramid of steps that jump from 2 to 1 to 0 errors: taking small steps in any direction is confusing, since all nearby positions are on the same level above the ground)
- ALTERNATIVE
- CONTINUOUS ERROR FN - Allows use of small steps to indicate which direction will decrease the error the most (since small variations in position translate to small variations in error) The Error Function should also be Differentiable
- Build a Continuous Error Fn.
- Given plotted points (with say 2 of 6 misclassified) with respect to the Boundary Line
- Error Function will assign a Large Penalty
to the misclassified data points
(and Small Penalty to correctly classified data points)
where on the plot we represent the Size of the
point as the Penalty
- Misclassified Penalty - approx. the distance from Boundary Line.
- Correctly classified Penalty - close to 0
- Total Error is obtained by adding the Penalties (errors) of all the
corresponding points (both correctly classified and misclassified points).
- TOTAL ERROR = PT1_ERR + PT2_ERR + … + PTN_ERR
-
Find out what small changes to Boundary Line Parameters that will make small changes to Error Function to make the TOTAL ERROR decrease, which we will see since the misclassified points will now have smaller Penalties (i.e. causing error Penalties to change), and then take a small step in that direction (each step will correctly classify a misclassified point). Repeat until reduce the TOTAL ERROR to its minimum possible value with all points correctly classified.
- IMPORTANT NOTE: Since we can build a Continuous Error Function with this Penalty property, we can now use Gradient Descent to solve the problem
```
- Predictions
```
- Predictions are the Output from the Algorithm.
- Probability is a Function of the distance from
the Boundary Line.
- DISCRETE Answer i.e. Yes or No
  i.e. Labels the data points:
    - Positive side of Boundary Line with 1
    - Negative side of Boundary Line with 0
- CONTINUOUS Answer (a probability between 0 and 1) i.e. 63.8%
  i.e. Labels the data points:
    - Positive side of Boundary Line with values >= 50% (the further from the Boundary Line, the higher the %)
    - Negative side of Boundary Line with values < 50% (the further from the Boundary Line, the lower the %)
```
- Switch from Discrete to Continuous predictions by changing the
Activation Function from Step to Sigmoid:
- Compares Step and Sigmoid Perceptrons https://youtu.be/D5WNzbr6P78?t=3m20s
- Step Function (returns Yes or No)
Step(Wx + b)
(with a Boundary Line)
- where y == 1 if x >= 0
- where y == 0 if x < 0
- Sigmoid Function (returns Probability of Yes)
(with an entire Probability Space)
(where for each data point in plane we are given probability that the
Label of the point is 1 for Blue points and 0 for Red points)
- Function that for large Positive numbers returns
values close to 1 and for large Negative numbers returns
value close to 0, and for numbers close to 0 returns
value of 0.5
https://www.youtube.com/watch?v=D5WNzbr6P78
σ(x) = 1 / (1 + e^-x)
- i.e. at say Point (0.5, 0.5) the probability P(Blue) == 50%, P(Red) == 50% and at say Point (0.6, 0.4) the probability P(Blue) == 40%, P(Red) == 60%, etc
- Generate the Probability Space:
    - Combine together the:
        - Linear Function
            Wx + b
          applied to the different lines on the plane that represent points where
            Wx + b = -n, ..., Wx + b = -4, ..., Wx + b = -2, ..., Wx + b = 2, ..., Wx + b = n
        - Sigmoid Function
            σ(Wx + b)   where σ(x) = 1 / (1 + e^-x)
          applied to each of those potential lines on the plane. For each line we obtain a number between 0 and 1, which represents the Probability of the points on that line being Blue (i.e. P(Blue)), where:
            P(Blue) = prediction of model ^y = σ(Wx + b) = value between 0 and 1 for each line
        - when on the Boundary Line Wx + b = 0, then Sigmoid σ(Wx + b) = 0.5
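A minimal sketch of this continuous prediction ^y = σ(Wx + b) in plain NumPy; the weights and bias below are illustrative assumptions:
```
import numpy as np

def sigmoid(x):
    # σ(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def probability_blue(W, x, b):
    # Continuous prediction: y hat = σ(Wx + b)
    return sigmoid(np.dot(W, x) + b)

W = np.array([1.0, 1.0])   # assumed weights
b = 0.0                    # assumed bias

print(probability_blue(W, np.array([0.0, 0.0]), b))    # on the boundary -> 0.5
print(probability_blue(W, np.array([2.0, 2.0]), b))    # far on positive side -> close to 1
print(probability_blue(W, np.array([-2.0, -2.0]), b))  # far on negative side -> close to 0
```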
- Multi-Class Classification (3+ classes) and Softmax
- i.e. classify as Blue, Red, Green (instead of just Blue, Red)
e.g. Bi-Classification Problem: Find the probability of getting a gift or not.
    P(gift) = 0.8
    P(no gift) = 0.2
- The Model takes Inputs (i.e. been good all year, is it your b'day)
- The Model uses the Inputs to calculate a Linear Model, which is the Score
    Score(gift) = Linear Function
  So, the probability of getting a gift or not is simply the Sigmoid Function applied to the Score:
    P(gift) = σ(Score)
Multi-Classification Problem: Find the probability of which animal we just saw, from duck, beaver, and walrus. i.e. we want the model to return the following (where combined they add to 1):
    P(duck) == 0.67, P(beaver) == 0.24, and P(walrus) == 0.09
Given a Linear Model based on the Inputs:
    Beak?        Boolean
    Teeth Qty?   Int
    Feathers?    Boolean
    Hair?        Boolean
Calculate the Linear Function based on the Inputs, which outputs Scores:
    Score(duck)   = 2 = e^2 / (e^2 + e^1 + e^0) = P(duck)   = 0.67
    Score(beaver) = 1 = e^1 / (e^2 + e^1 + e^0) = P(beaver) = 0.24
    Score(walrus) = 0 = e^0 / (e^2 + e^1 + e^0) = P(walrus) = 0.09
Convert Scores into Probabilities by using the **Softmax Function**, which applies the Exponential Function (e^x returns a positive number for every input) to each Score to ensure all Scores (outputs of the Linear Function) become positive numbers, and satisfies:
    - The combined probabilities must add to 1
    - Higher Scores should correspond to a higher proportion of the probability
Note: We cannot just divide each score by the sum of all scores to get each percentage, since it is possible to get Negative Scores (we're using a Linear Function that may give negative values), and the denominator may become 0. Instead we need to convert all scores into positive scores using a function.

i.e.
    Given the quantity of classes: N
    Given a Linear Model whose Linear Function outputs Scores for each of the N classes: Z1, ..., Zn
    Convert the Scores into Probabilities by saying the Probability that the object is in class i is:
        P(class i) = e^Zi / (e^Z1 + ... + e^Zn)
- Softmax Function
-
i.e. equivalent to Sigmoid Function but for when problem has 3+ classes
- 2 classes - apply Sigmoid Function to Scores to get Probabilities
- 3+ classes - apply Softmax Function to Scores to get Probabilities
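A minimal softmax sketch that reproduces the duck/beaver/walrus example above (Scores 2, 1, 0):
```
import numpy as np

def softmax(scores):
    # P(class i) = e^Zi / (e^Z1 + ... + e^Zn)
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

print(softmax(np.array([2.0, 1.0, 0.0])))  # -> approx [0.67, 0.24, 0.09]
```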
-
-
One-Hot Encoding
- Convert inputs to numbers
Gift?
-----------------------
True  -> 1
False -> 0
- With multiple classes:
Animal       Value
-----------------------
Duck      ->   ?
Beaver    ->   ?
Duck      ->   ?
Walrus    ->   ?
Beaver    ->   ?

- Cannot use 1, 2, 3, etc, since that assumes dependencies between classes that we cannot have
- Instead use "One-Hot Encoding" by creating a Variable for each Class (no unnecessary dependencies)
  i.e. IF the Input is Duck THEN Duck is 1 AND Beaver is 0 AND Walrus is 0

    Animal      Duck?    Beaver?    Walrus?
    ---------------------------------------------
    Duck        1        0          0
    Beaver      0        1          0
    Duck        1        0          0
    Walrus      0        0          1
    Beaver      0        1          0
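In practice this one-hot encoding can be produced with pandas; a minimal sketch (the column name 'Animal' is an assumption):
```
import pandas as pd

df = pd.DataFrame({'Animal': ['Duck', 'Beaver', 'Duck', 'Walrus', 'Beaver']})
one_hot = pd.get_dummies(df['Animal'])  # one column per class, 1/0 entries
print(one_hot)
```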
-
Maximum Likelihood (Probability) (and Cross Entropy)
- Avoid taking the PRODUCT
- Instead take the SUM by using LOGS since
Logarithms have identity whereby
log(ab) = log(a) + log(b)
- Use Probability to evaluate and improve our Models
-
Pick the Model that gives Existing Labels the Highest Probability, so by Maximising Probability we pick the Best possible Model
-
Minimise Cross Entropy == Maximise Probability
- Cross Entropy (Error Function) informs us if the Model
is Good or bad since it returns the Errors at each point
(by taking sum of Negatives of the Logarithm of each Probability)
- Good Model == Low Cross Entropy == Likely that events happened with given probabilities
-
Bad Model == High Cross Entropy == Unlikely that events happened with given probabilities
- Negatives of the Logarithms == Errors at each Point
- where “Correctly Classified Points” have Small Errors
- where “Misclassified Points” have Large Errors, such that Cross Entropy informs us if Model is Good or Bad
Given 2x Models with just one point each
    where Model #1 output Probability is 80% (of "win")
    where Model #2 output Probability is 55% (of "lose")
The Best Model has the Higher Probability when we Actually "win"
The Best Model has the Lower Probability when we Actually "lose"
Given 2x Models with four points each
Find out which Model is Good and which is Bad by:
- Calculating the Probability of each point being the colour it is, according to the Model
- Multiplying the Probabilities of all the points to get the Model Arrangement Probability
    Model #1 = P(p1_blue is blue) * P(p2_blue is blue) * P(p3_red is red) * P(p4_red is red)
             = 0.6 * 0.2 * 0.1 * 0.7
             = 0.0084
    Model #2 = P(p1_blue is blue) * P(p2_blue is blue) * P(p3_red is red) * P(p4_red is red)
             = 0.7 * 0.9 * 0.8 * 0.6
             = 0.3024
- Take Logs and Sum using the Logarithmic identity
    Model #1 = - log(0.6) - log(0.2) - log(0.1) - log(0.7)
             =   0.51 + 1.61 + 2.3 + 0.36
             =   4.8
    Model #2 = - log(0.7) - log(0.9) - log(0.8) - log(0.6)
             =   0.36 + 0.1 + 0.22 + 0.51
             =   1.2
- Take the natural Logarithm (base e, instead of base 10) by convention. Note that taking the Log of a value between 0 and 1 is always a Negative number, since Log(1) = 0; so if we take the Negative of the Logarithm of each Probability, it returns Positive numbers.
- The Best Model has the Highest Model Arrangement Probability
    - Good Model - Low Cross Entropy
    - Bad Model - High Cross Entropy

Note: Given the calculated probabilities for two models, if we take the sum of the negative logarithm of each, and pair each logarithm with the point it came from, we get a value for each point. The Misclassified Points have Large values, whereas Correctly Classified Points have Small values, since a Correctly Classified Point has a Probability close to 1, whose Logarithm is a small value. So we can think of the "Negatives of the Logarithms as Errors at each Point", where "Correctly Classified Points" have Small Errors and "Misclassified Points" have Large Errors, such that Cross Entropy informs us whether the Model is Good or Bad.

Calculate the Model for the Error Function: https://www.youtube.com/watch?v=nV1W7oQOlkU

Case #1 (Point is "blue")
    If y == 1 (the point is "blue"/Accept to begin with)
        P(blue) = ^y   (prediction y hat, where a "blue" point in the "blue" area has a higher probability than a point in the "red" area)
        Error = -ln(^y)
Case #2 (Point is "red")
    If y == 0 (the point is "red"/Reject to begin with)
        P(red) = 1 - P(blue) = 1 - ^y   (prediction y hat, where a "red" point in the "red" area has a higher probability than a point in the "blue" area)
        Error = -ln(1 - ^y)
Summarise both Cases:
    Error = - (1 - y) * ln(1 - ^y) - y * ln(^y)
    (i.e. if y == 1 then the Error is -ln(^y) )
    (i.e. if y == 0 then the Error is -ln(1 - ^y) )

So, ######## the Error Function for Binary Classification Problems ######## is:

    Error Function = Average of the Error Functions of all the Points
                   = - (1/m) * ( (m ∑ j=1) (1 - yj) * ln(1 - ^yj) + yj * ln(^yj) )
                   = (1/4) * 4.8
                   = 1.2          (for Model #1 above)

    Error Function = E(W, b)
                   = - (1/m) * ( (m ∑ j=1) (1 - yj) * ln(1 - σ(Wxj + b)) + yj * ln(σ(Wxj + b)) )

    where ^y (the prediction of the model) is given by the Sigmoid of the Linear Function (Wx + b), so the formula is in terms of W and b
    where yj is the Label

Note: By convention we take the Average, not the Sum
Goal: Minimise the Error Function
Given 2x Models with n points each
- Repeat the above
PROBLEM:
- We would be multiplying n values between 0 and 1 together, which results in a very small number (BAD).
- If one of the values changes, the product output would change drastically
SOLUTION:
- AVOID USING A FUNCTION THAT TAKES THE "PRODUCT"
- USE A FUNCTION THAT TAKES THE "SUM" INSTEAD, BY TAKING "LOGS", WHERE
    log(ab) = log(a) + log(b)
    Model #1 - log(n1) + log(n2) + ... + log(n) = ?
    Model #2 - log(n1) + log(n2) + ... + log(n) = ?
- Best Model has Higher Probability and classifies the Most points Correctly https://www.youtube.com/watch?time_continue=188&v=6nUUeQ9AeUA
- Minimising Error Function results in Best solution
-
Maximum Probability == Minimum Error Function
-
Cross Entropy Example
Given 3x doors with the probability of a Gift behind them

                 Green door    Red door    Blue door
    ---------------------------------------------
    P(gift)      0.8 @         0.7 @       0.1
    P(no gift)   0.2           0.3         0.9 @

Problem:
- Find the scenario with the highest probability, assuming independent events.
Solution:
- Take the Largest probability from each column (indicated with the @ symbol)
- The whole arrangement probability is the Product of the 3x Largest probabilities
    P = 0.8 * 0.7 * 0.9 = 0.504 = 50 %

Now, let's look at all possible scenarios. Since there are 3 doors, each with 2 possibilities, we have 2^3 scenarios (G = gift behind that door, N = no gift, using the complement probability):

    Green door   Red door   Blue door   Probability   Cross Entropy (i.e. -ln(Probability))
    ----------------------------------------------------------------------------------------
    G 0.8        G 0.7      G 0.1       0.056         2.88
    G 0.8        G 0.7      N 0.9       0.504         0.69
    G 0.8        N 0.3      G 0.1       0.024         3.73
    N 0.2        G 0.7      G 0.1       0.014         4.27
    G 0.8        N 0.3      N 0.9       0.216         1.53
    N 0.2        G 0.7      N 0.9       0.126         2.07
    N 0.2        N 0.3      G 0.1       0.006         5.12
    N 0.2        N 0.3      N 0.9       0.054         2.92

Step 1: Obtain the Probability of each arrangement by multiplying the 3x independent probabilities. The Total sum of all arrangement probabilities adds to 1.
Step 2: Calculate the Cross Entropy, which is the negative of the logarithm of the probability, such that Events with High Probability have Low Cross Entropy (and Events with Low Probability have High Cross Entropy)
- Derive Cross Entropy formula
    Green door   Red door   Blue door   Cross Entropy (i.e. -ln(Probability))
    --------------------------------------------------------------------------
    G 0.8        G 0.7      N 0.9       -ln(0.8) - ln(0.7) - ln(0.9)
    (p1)         (p2)       (1 - p3)
    y1 = 1       y2 = 1     y3 = 0

    where p1 == 0.8 (prob. of a gift behind the Green door)
    where p2 == 0.7 (prob. of a gift behind the Red door)
    where p3 == 0.1 (prob. of a gift behind the Blue door)
    where yj = 1 if there is a present behind door j, else 0

    Cross-Entropy = - (m ∑ j=1) [ yj * ln(pj) + (1 - yj) * ln(1 - pj) ]

    i.e.
    CE[(1,1,0), (0.8,0.7,0.1)] = 0.69
        low, since the vector (1,1,0) is similar to (0.8,0.7,0.1), meaning that the arrangement of gifts (1,1,0) is likely to happen based on the probabilities given (0.8,0.7,0.1)
    CE[(0,0,1), (0.8,0.7,0.1)] = 5.12
        high, since the arrangement of gifts given by (0,0,1) is very unlikely given the probabilities in the second set of numbers (0.8,0.7,0.1)
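A minimal Python sketch of this cross-entropy formula, checked against the two examples above:
```
import numpy as np

def cross_entropy(Y, P):
    # CE = - sum( y * ln(p) + (1 - y) * ln(1 - p) )
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

print(cross_entropy([1, 1, 0], [0.8, 0.7, 0.1]))  # ~0.69 (likely arrangement)
print(cross_entropy([0, 0, 1], [0.8, 0.7, 0.1]))  # ~5.12 (unlikely arrangement)
```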
-
Multi-Class Cross Entropy
Given 3x doors with the probability of a Duck, Beaver, or Walrus behind each

    Animal       Door 1    Door 2    Door 3
    ---------------------------------------------
    P(duck)      0.7       0.3       0.1
    P(beaver)    0.2       0.4       0.5
    P(walrus)    0.1       0.3       0.4

    Note: the numbers in each column must add to 1

Given Scenario #1:
    Door 1    Door 2    Door 3
    ------------------------------
    Duck      Walrus    Walrus

    P  = 0.7 * 0.3 * 0.4 = 0.084
    CE = -ln(0.7) + -ln(0.3) + -ln(0.4) = 2.48

    The Probability of Scenario #1 is the product of the probabilities of each independent event

Given all Scenarios:
    Animal       Door 1    Door 2    Door 3
    ---------------------------------------------
    P(duck)      p11       p12       p13
    P(beaver)    p21       p22       p23
    P(walrus)    p31       p32       p33

    y1j = 1 if a Duck is behind Door j
    y2j = 1 if a Beaver is behind Door j
    y3j = 1 if a Walrus is behind Door j
    otherwise y1j, y2j, y3j are 0

    Cross-Entropy = - (n ∑ i=1) (m ∑ j=1) yij * ln(pij)

    ######## Error Function for Multi-Classification Problems ########
-
Logistic Regression Algorithm
- About
```
- Obtain Input data
- Random Model selected
- Error calculated
- Error minimised to obtain a better model
```
-
Error Function Calculation
https://www.youtube.com/watch?v=nV1W7oQOlkU
- Recall using Cross-Entropy to calculate the Best Model (when given 2x models) https://www.youtube.com/watch?v=njq6bYrPqSU
-
Gradient Descent
- Dfn: Looks for Direction we will descent the most (reduce the Error the most) and takes a Step in that Direction
https://www.youtube.com/watch?v=26_dnS0r2jc
-
The Error Function is E(W, b), a function of the Weights and Bias
- Graphed as a surface ("Curved Plane"):
    x-axis: w1 (Input to the function)
    y-axis: Error E (height)
    z-axis: w2 (Input to the function)
- Imagine you are at a Point on the Curved Plane E. The Gradient ∇E is the Vector formed by the Partial Derivatives of the Error Function E with respect to the Weights w1 and w2 and the Bias b (i.e. ∂E/∂w1, ∂E/∂w2, ∂E/∂b), which gives us the Direction to move in to increase the Error Function the Most. So if we take the Negative of the Gradient, -∇E, then we can Decrease the Error Function the Most, which is what we'll do! We take a Step in the Direction given by the Negative of the Gradient at a Point on the Curved Plane representing the Error Function, which takes us to a Lower Point on the Curved Plane. Repeat this until we get to the Lowest Point on the Curved Plane.
-
Calculate the Gradient
1. Initial "Bad" Prediction (since we are high up on Curved Plane's "Error" axis): ^y = σ(Wx + b) <--- BAD ^y = σ(w1 * x1 + ... + wn * xn + b) Error Function of Initial "Prediction" is: E = W(x + b) "Gradient" of Error Function is: ∇E = (∂E/∂w1, ... , ∂E/∂wn, ∂E/∂b) "Learning Rate" Alpha set to low value to avoid making dramatic changes by using Small α = 0.1 "Take a step" in Negative direction of the Gradient multiplied by Alpha Note: where "Taking a step" is same as Updating the Weights and Bias as follows (i.e. ∂E/∂wi means Partial Derivative of the Error with respect to wi) wi' <-- wi - α * ∂E/∂wi b' <-- b - α * ∂E/∂b This will take us to a Point E(W', b') with Better "Prediction" that has a Lower Error Function, with Weights W' and Bias b': ^y = σ(W'x + b') <--- BETTER
-
Gradient Calculation https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/6124bd95-dec2-44f9-bf3b-498ea57699c7/lessons/47f6c25c-7749-4a02-b807-7a5b37f362e8/concepts/0d92455b-2fa0-4eb8-ae5d-07c7834b8a56
Gradient == Scalar * Point Coords
    where Scalar == Label - Prediction
    where the Gradient is Small when the Point is "Well Classified" (the Label is close to the Prediction), so we change our coordinates (the Weights) a Little
    whereas the Gradient is Large when the Point is "Poorly Classified" (the Label is far from the Prediction), so we change our coordinates (the Weights) a Lot

Note: Similar to the Perceptron Algorithm
-
Gradient Descent Algorithm Pseudocode https://www.youtube.com/watch?v=I-l32oR5iMM https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/6124bd95-dec2-44f9-bf3b-498ea57699c7/lessons/47f6c25c-7749-4a02-b807-7a5b37f362e8/concepts/ca6eff40-a3e2-4d53-85f4-d2454b538d87 Lesson 2 Part 25 https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/6124bd95-dec2-44f9-bf3b-498ea57699c7/lessons/47f6c25c-7749-4a02-b807-7a5b37f362e8/concepts/5e9bd75b-a419-45d4-8a2b-88ba847cc814
Step 1: Start with random Weights, giving us a Line:
            w1, ... , wn, b
            ^y = σ(Wx + b)
Step 2: Calculate the Error for every plotted Point, where:
            the Error is Large for Misclassified Points
            the Error is Small for Correctly Classified Points
Step 3: For every Point with Coordinates x1, ... , xn:
            For i = 1 ... n
                Update wi by Subtracting the Learning Rate α times the Partial Derivative of the Error Function with respect to wi
                    wi' <--- wi - α * ∂E/∂wi
                             (which works out to wi + α * (y - ^y) * xi)
                Update b similarly
                    b'  <--- b - α * ∂E/∂b
                             (which works out to b + α * (y - ^y))
        This gives us new Weights and Bias
Step 4: Update the Weights and Bias to give us a New Line:
            w1', ... , wn', b'
            ^y' = σ(W'x + b')
Step 5: Repeat the process for a set number of Epochs, or until the Error is Small
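A minimal NumPy sketch of the Step 3 update for a single point, assuming the sigmoid prediction and cross-entropy error above:
```
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_weights(x, y, W, b, learn_rate=0.1):
    """One gradient-descent update for a single point (x, y)."""
    y_hat = sigmoid(np.dot(W, x) + b)
    # -dE/dwi = (y - y_hat) * xi   and   -dE/db = (y - y_hat)
    W = W + learn_rate * (y - y_hat) * x
    b = b + learn_rate * (y - y_hat)
    return W, b
```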
-
Perceptron Algorithm vs Gradient Descent Algorithm
-
Gradient Descent
    - ^y may take ANY value between 0 and 1
    - Update rule: wi' <--- wi + α * (y - ^y) * xi
    - If the Point is Incorrectly Classified:
        - Tells the Line to come closer (on each step)
    - If the Point is Correctly Classified:
        - CHANGES THE WEIGHTS anyway (i.e. the Point tells the Line to go farther away, since if the Point is already in the Correct region, we still want the Prediction to get closer and closer to 1, reducing the Error further)
-
Perceptron Algorithm
    - ^y may ONLY take the value 0 or 1, for Incorrectly or Correctly classified
    - If the Point is Correctly Classified:
        - DO NOTHING
-
-
Common
- Both Gradient Descent and Perceptron Algorithm
- Misclassified Point tells Line to come closer (trying to get Point on the correct side)
-
-
Non-Linear Data (Neural Networks)
-
Linear - Data sets separable by a Line
-
Non-Linear - Complex Data Sets with highly Non-Linear Boundaries (not separable by a Line, since the boundary is a Non-Linear Equation), so Neural Networks are used instead
- Create Probability Function where:
- Points in Blue region more likely to be Blue,
- Points in Red region are more likely to be Red
- Points on LINE are equally likely to be Blue or Red
-
-
Neural Networks (aka Multi-Layer Perceptrons, Neural Network Architecture)
https://www.youtube.com/watch?v=Boy3zHVrWB4
-
Combine Two Perceptrons into Third (more complex one)
Use the Probability Function (Sigmoid) for every Point on the Plane (so the resulting probability is always between 0 and 1)

Step 1:
    - A = Calc the probability for Point K on Model #1
    - B = Calc the probability for Point K on Model #2
Step 2: (neural network, similar to perceptrons)
    - C = W1 * A + W2 * B - BIAS
        where W1 is Weight 1
        where W2 is Weight 2
Step 3: (i.e. create the resulting curved, non-linear model between both linear models)
    - Sigmoid(C)
-
Combine Two "Linear" Models, superimposed, into a "Non-Linear" Model
-
i.e.
Linear Line + Linear Line = Non-Linear Line (i.e. Curve)
-
- Note:
- Linear Model is a Probability Space that gives Probability of Point being Blue (in a Region)
-
Combining the Probability of a Point across the Two Perceptrons' (P1, P2) Linear Model Probability Spaces, i.e.
    P1 Point probability (of Blue): 0.7
    P2 Point probability (of Blue): 0.8
-
Apply the Sigmoid Function to the Sum of the Point Probability across Two Perceptrons
Sigmoid ( 0.7 + 0.8 ) = Resulting Probability of Blue
-
- Weight the Sum
- i.e. if want Model #1 to have more say in resulting probability
-
Neural Network, different Bias Diagrams
- https://www.youtube.com/watch?v=au-Wxkr_skM
- Weights on Left describe Equations of the Linear Models (i.e. Cx + Dy + E = 0)
- Weights on Right describe Linear Combination of the Two Models to obtain resulting Curved Non-Linear Model (Non-Linear Boundary defined by Neural Network)
-
Neural Network Multiple Layers Architecture
-
https://www.youtube.com/watch?v=pg99FkXYK0M
- Layers
- Input Layer
-
Inputs (i.e. x1, x2)
-
Note: when x1, x2, x3 then we’re in 3D space, where Hidden Layers are Planes (instead of Lines) and Output Layer is Non-Linear Region in free space
-
Note: when N Input Layers, we are in N-Dimensional space
-
- Hidden Layer
-
Set of Linear Models created using the first Input Layer (when only 2 Nodes i.e. Red/Blue)
-
Note: when Hidden Layer has 3+ Nodes, then we have a Multi-Class Classification Model and Output Layer is produced for each of the Classes (i.e. Cat, Dog, Bird)
-
Note: when Multiple Hidden Layers we have a Deep Neural Network where Linear Models combine to create Non-Linear Models, and these resulting Non-Linear Models combine to create more Non-Linear Models
Deep Neural Network splits the N-Dimensional space in the Output Layer to have a highly Non-Linear Boundary (i.e. wiggly line)
-
- Output Layer
- Creates a Non-Linear Model from Combining N Linear Models from Hidden Layer (i.e. combine 3x Linear Models to create Triangular Boundary in Output Layer)
-
-
Multi-Class Classification vs Binary Classification (with Deep Neural Networks)
https://www.youtube.com/watch?v=uNTtvxwfox0
-
If have multiple classes and want Model to predict if we have a Duck, Beaver, or Walrus
- APPROACH #1
- Create 3x Neural Networks, one for each Class
Softmax( NN (Duck) + NN (Beaver) + NN (Walrus) )
- APPROACH #2 (BETTER)
-
Create 1x Neural Network, with more Nodes in the Output Layer (where each Output Layer Node gives the probab. that the image is each of the animals)
-
Take the Scores and apply the Softmax Function to obtain Well-defined probabilities
-
-
-
Feedforward (Process of Neural Network)
https://www.youtube.com/watch?v=Ioe3bwMgjAM
-
Feedforward Process used by Neural Networks to turn Input into Output:
-
Neural Network (i.e. Perceptron is simplest NN)
Inputs
    Input Data Point  x = (x1, x2)
    Label             y = 1 (means the Point is Blue)
Linear Equation of the Perceptron Boundary Line in 2D space
    w1 * x1 + w2 * x2 + b = 0
        where w1, w2 are the Weights (on the Edges)
        where b is the Bias (on the Node)
    w1 and w2 are drawn as connecting lines between the Inputs and the Linear Model, with a greater Thickness of connecting line for higher Weight values (see 0:56 of the video)
The Perceptron then Plots the Point (x1, x2) and outputs the Probability that the point is Blue (i.e. if a Blue Point is in the Red Area then the Output Probability is a Small number, since the Point is Not likely to be Blue)
-
Use Matrix Multiplication of Weights for Non-Linear mapping of Complex Neural Networks including Multi-Layer (see 2:42, 3:50) https://www.youtube.com/watch?v=Ioe3bwMgjAM
i.e. the output prediction from the input Vector:
    ^y = σ ∘ W(3) ∘ σ ∘ W(2) ∘ σ ∘ W(1)(x)
    where ∘ is the Composition of each Weights Matrix with the Sigmoid Function
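A minimal NumPy feedforward sketch in this composed form for one hidden layer; the layer sizes and random weights are illustrative assumptions:
```
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feedforward(x, W1, b1, W2, b2):
    # Hidden layer: sigmoid applied to the first linear model W(1)x + b(1)
    hidden = sigmoid(np.matmul(W1, x) + b1)
    # Output layer: sigmoid applied to the linear combination of the hidden outputs
    return sigmoid(np.matmul(W2, hidden) + b2)

# Assumed shapes: 2 inputs -> 3 hidden nodes -> 1 output
W1 = np.random.randn(3, 2)
b1 = np.zeros(3)
W2 = np.random.randn(1, 3)
b2 = np.zeros(1)

print(feedforward(np.array([0.4, 0.6]), W1, b1, W2, b2))  # probability between 0 and 1
```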
-
Error Function
https://www.youtube.com/watch?v=VAX9Di9cjzE
-
Training Neural Networks
-
Backpropagation Method
- Dfn / Steps:
- Feedforward operation https://www.youtube.com/watch?v=1SmY3TZTyUk
- Compare Model Output with
Desired Model Output
- i.e. Bad Neural Network - if predicts Point is Red when it’s actually Blue
- Calculate Error
- Backpropagation
(run the Feedforward backwards) to spread
Error to each of the Weights
-
Backpropagation after Feedforward (3:08) to Train Neural Networks https://www.youtube.com/watch?v=1SmY3TZTyUk
-
Ask Point what it wants to do to be “Better” classified (i.e. if Misclassified Point will ask Boundary Line of Blue Region to come closer to the Point)
-
Listen to the “Better” Model in the Hidden Layer MORE (by INCREASING the Weight (Thickness) of connecting line between Hidden Layer and Output Layer) than we Listen to the “Bad” Model in the Hidden Layer (DECREASE Weight of its connecting line), so that Final Model looks more like the “Better” Model in the Hidden Layer
-
THEN, go Back to the Linear Models in the Hidden Layer and for each Linear Model Ask Points what it wants to do to be “Better” classified (i.e. if Misclassified Point will ask Boundary Line of Blue Region to come closer to the Point, otherwise if Correctly Classified will ask Boundary Line to move away from Point)
- This will update the Weights (thickness) of connecting line between Input Layer and Hidden Layer
-
Now we have “Better” Predictions of all the Models in the Hidden Layer, and for the Model in the Output Layer
-
-
- Update Weights with output of
Backpropagation for “Better” Model
- Point says what wants Model to do:
- Move Boundary Line closer to Point
(if Misclassified),
- Boundary Line moves closer by updating its Weights (thereby defining a new line)
- Move Boundary Line away from Point (if Correctly classified)
- Calculate the Error Function E(W) and step in the direction of the Negative of the Gradient -∇E, to a new Model W' with New Model Error E(W') and a "Better" Prediction
    - Note: the Error reduces from E(W) to E(W') after taking the Gradient Descent step down the Negative of the Gradient -∇E, such that the new Boundary Line is closer to the Point
- Repeat until a "Good" Model, successively minimising the error
- Dfn / Steps:
- Backpropagation (DETAILED MATHS covered by Keras)
-
Multi-Layer Perceptron, Multi-Layered Perceptron, Feedforward https://www.youtube.com/watch?v=tVuZDbUrzzI https://www.youtube.com/watch?v=YAhIBOnbt54 https://www.youtube.com/watch?v=0EoRxu3EeGM
-
Chain Rule https://www.youtube.com/watch?v=0EoRxu3EeGM
Given x, A = f(x), B = g ∘ f(x)
The Partial Derivatives are given by:
    ∂B/∂x = (∂B/∂A) * (∂A/∂x)
(when composing functions, the derivatives multiply)
- Feedforwarding - composing various Functions
- Backpropagation - the reverse of Feedforward, calculating the Derivative of the complex Composition (the Error Function) with respect to each of the Weights in each of the Layers using the Chain Rule
-
- Neural Networks in Keras
- Packages for Neural Networks, activation function, gradient descent
- Keras https://keras.io/
- TensorFlow https://www.tensorflow.org/
- Caffe http://caffe.berkeleyvision.org/
- Theano http://deeplearning.net/software/theano/
- Scikit-Learn http://scikit-learn.org/stable/
- Keras
- Project - Build a Fully Connected
(Multi-layer) Feedforward Neural Network
to solve the XOR problem
- Steps:
- Load the data
- Define the network
- Train the network
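A minimal Keras sketch of those three steps for XOR; the particular architecture (8 tanh hidden units, sigmoid output, adam optimiser) is an assumption, not necessarily the course notebook's solution:
```
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# 1. Load the data (the XOR truth table)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y = np.array([[0], [1], [1], [0]], dtype=np.float32)

# 2. Define the network: 2 inputs -> hidden layer -> 1 sigmoid output
model = Sequential()
model.add(Dense(8, input_dim=2, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# 3. Train the network
model.fit(X, y, epochs=1000, verbose=0)
print(model.predict(X).round())  # expect approximately [[0], [1], [1], [0]]
```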
- Project - Student Admissions
- https://github.com/ltfschoen/aind2-dl
- Packages for Neural Networks, activation function, gradient descent
- Training Model Optimisation
- Failure points
- Poor architecture chosen
- Noisy data
- Model takes too long to run
- Batch vs Stochastic Gradient Descent
- Recap:
- Gradient Descent Algorithm:
-
Reduce height (error function) by taking steps (aka Epochs) following the negative of the gradient of the height
-
Epoch https://www.youtube.com/watch?v=2p58rVgqsgo
- Each step taken to reduce the error function in the gradient descent algorithm
-
At each Epoch, we take all the input data, and run it through the entire Neural Network, then find predictions, calculate the error (i.e. how far from actual labels), then back propagate the error in order to update the weights in the Neural Network, to improve the boundary for predicting all our data
-
Since we perform the above steps at each Epoch for all the data points, these are huge computational steps using lots of memory. Since we don’t need to plugin all our data at every Epoch, when data is well distributed, a small subset of the data would at least give a good idea what the gradient would be, and much quicker!
- Split the data into several Batches (i.e. given 24 points, split into 4x batches of 6 points each)
- Run Batch #1 points through Neural Network
- Calculate error and gradient, and back propagate to update with better weights that define a better boundary region
-
Repeat for subsequent Batches
- Keras
model.fit(X_train, y_train, epochs=1000, batch_size=100, verbose=0)
-
-
- Learning Rate Decay
- High Learning Rate + Large Steps may result in missing the bottom and coming up again
- Low Learning Rate + Small Steps gives a higher chance of arriving at the local minimum
- If Model is not working, then Lower the Learning Rate
- Ideally the Learning Rate would Decrease as the Model gets closer to the solution
- Note:
- If “steep”, take long steps
- If “plain”, take small steps
- Training and Testing sets to Find “Better” Model
-
Given blue and red points from a data set that are plotted on two classification models, each with a boundary that separates the blue points from the red points. Instead of choosing the "better" model based on a combination of observations (i.e. boundary curve complexity vs fewer mistakes in classifying points), we use Training and Testing sets.
- STEP 1
- Differentiate Training and Testing sets
- Training Set are Solid Coloured Points
- Testing Set are Hollow Coloured Points (white inside)
- Differentiate Training and Testing sets
- STEP 2
- Train each model with the Training Set (without looking at the Testing Set) to obtain updated boundary
- STEP 3
- Evaluate results by reintroducing the Testing Set to see how we did
- Check how many mistakes each model made with the Testing Set now reintroduced
- Note: Follow our intuition, when comparing which is better between a simpler model and a complex model, always go for the simpler model
-
-
Overfitting vs Underfitting (Types of Errors)
https://youtu.be/SVqEgaT1lXU?t=4m15s
- Overfitting (Too Specific) (aka Error due to Variance)
-
Dfn: Overcomplicating the problem and using a solution that is too excessive (i.e. kill fly with bazooka), may lead to bad solutions and extra complexity (when simpler solution possible)
- Example 1
- Given some data to be classified (i.e. different objects and animals)
-
If the groups we choose are too granular (i.e. "Dogs that are orange or gray" vs "Anything that is orange or gray except dogs") it is Too Complex
- Identify Issues:
- Introduce a Testing Set to see if introduced data is classified correctly or not
- It may fit the data well, but may not generalise correctly
- Example 2
- i.e. memorising textbook instead of studying word by word, so may be able to regurgitate, but not be able to generalise properly to questions
-
- Good Model Fit
-
Dfn: Generalises better than Overfitting. Err on the side of Overly Complex models and apply certain Techniques to prevent Overfitting (like using a belt to fit into pants that are too big). This is how we find the right Architecture for a Neural Network
-
Example 2
- i.e. like study well and good result in exam
-
-
Underfitting (Too Simple) (aka Error due to High Bias)
-
Dfn: Oversimplifying the problem and using a solution that is too simple to do the job (i.e. trying to kill Giant with fly swatter)
- Example 1
- Given some data to be classified (i.e. different objects and animals)
-
If the groups we choose are too abstract (i.e. Animals vs No Animals) it is Too Simple
- Identify Issues:
- Misclassify objects
- Example 2
- i.e. not studying enough and failing
-
- Early Stopping Algorithm
-
Given an Overly complex Neural Network architecture, more complex than what we need (according to the Model Complexity Graph)
- After Training Evaluate each Model by introducing a Testing Set
- Introduce a set of Training Set points to each model plot
- Plot the:
- Error vs Epoch for the Training Set
- Error vs Epoch for the Testing Set
- Training and Testing results
- Epoch 1
- Training
- Use random weights
- Underfitting so makes many mistakes
- Testing
- Badly misclassifies both Training and Testing Sets
so both:
- Large Training Error
- Large Testing Error
- Badly misclassifies both Training and Testing Sets
so both:
- Training
- Epochs 20
- Training
- Good model
- Testing
- Small Training Error
- Small Testing Error
- Training
- Epochs 100
- Training
- Fits data better
- Overfit data
- Training error decreases with higher Epochs but Testing error increases (due to misclassification)
- Testing
- Tiny Training Error
- Medium Testing Error
- Training
- Epochs 600
- Training
- Fits the "training data" well but Generalises badly
- Heavily Overfits
- Problem If introduce say a blue point in blue region it may be misclassified as red unless its very close to existing blue point
- Testing
- Tiny Training Error
- Large Testing Error
- Training
- Epoch 1
- Goldilocks Spot and Model Complexity Graph ( Error vs Model Complexity(qty epochs) )
-
https://youtu.be/NnS0FJyVcDQ?t=2m35s
-
Use Model Complexity Graph to identify ideal Model Complexity of ideal quantity of Epochs to use
-
Approach:
- Use Gradient Descent until the Testing Error stops decreasing and instead starts increasing, and then Stop. This is the Early Stopping technique used when Training the Neural Network
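In Keras, Early Stopping is available as a callback; a minimal sketch assuming a compiled `model` and `X_train`/`y_train` as in the earlier examples (the patience value and validation split are assumptions):
```
from keras.callbacks import EarlyStopping

# Stop training once the validation (testing) error stops improving
early_stop = EarlyStopping(monitor='val_loss', patience=5)

model.fit(X_train, y_train,
          epochs=1000,
          validation_split=0.2,   # hold out part of the data as a testing set
          callbacks=[early_stop],
          verbose=0)
```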
-
-
-
Regularisation
https://youtu.be/aX_m9iyK3Ac?t=2m50s
Given two points P1 (-1,-1) and P2 (1,1), which of these equations gives the smaller error?
    Solution 1:   x1 +   x2 = 0
    Solution 2: 10x1 + 10x2 = 0   (a scalar multiple of Solution 1)
Both give the same line. The Prediction is the sigmoid of the linear function.

Using Solution 1 and substituting the Point Coordinates:
    substitute P2: ^y = σ(w1 * x1 + w2 * x2 + b) = σ(1 + 1)    = 0.88
    substitute P1: ^y = σ(w1 * x1 + w2 * x2 + b) = σ(-1 - 1)   = 0.12
Using Solution 2 and substituting the Point Coordinates:
    substitute P2: ^y = σ(w1 * x1 + w2 * x2 + b) = σ(10 + 10)  = 0.9999999979
        (great prediction, less error, but subtly Overfitting)
    substitute P1: ^y = σ(w1 * x1 + w2 * x2 + b) = σ(-10 - 10) = 0.0000000021
        (great prediction, less error, but subtly Overfitting)

Solution 2 is "Too Certain" and it will be too hard to tune the model to correct any misclassification errors, if any
- Gradient Descent is best used on Models that are NOT "Too Certain" (i.e. with lower scalar weights applied to x1, x2, etc, the sigmoid is not too steep; a too-steep sigmoid has derivatives close to 0 almost everywhere and very large only at the middle of the curve, which makes Gradient Descent difficult)
Since we now know that LARGE COEFFICIENTS ----> OVERFITTING
we need to tweak the Error Function to Punish High Coefficients (penalise Large Weights w1, ..., wn) by taking the Old Error Function and adding a term that is Big when the Weights are big, by either:

Note: the Lambda constant λ controls how much to Punish the Coefficients

* "L1 Regularisation" - Add the sum of the absolute values of the Weights, times the Lambda constant

    Error Function = Sum of all the Error Functions of all the Points
                   = - (1/m) * ( (m ∑ j=1) (1 - yj) * ln(1 - ^yj) + yj * ln(^yj) ) + λ (|w1| + ... + |wn|)

    Usage:
        * Usually results in sparse vectors where small weights tend toward 0, so use L1 to reduce the number of weights and end up with a small set. i.e. Sparsity: (1, 0, 0, 1, 0)
        * Also good for Feature Selection, by only selecting the most important features and turning the rest into 0's

* "L2 Regularisation" - Add the sum of the squares of the Weights, times the Lambda constant

    Error Function = Sum of all the Error Functions of all the Points
                   = - (1/m) * ( (m ∑ j=1) (1 - yj) * ln(1 - ^yj) + yj * ln(^yj) ) + λ (w1^2 + ... + wn^2)

    Usage:
        * Does not favour sparse vectors since it tries to keep all weights homogeneously small i.e. Sparsity: (0.5, 0.3, -0.2, 0.4, 0.1)
        * Better for Training Models

* Example
    Given the vector (1, 0) then:
        λ (|w1| + ... + |wn|)  = λ (1)
        λ (w1^2 + ... + wn^2)  = λ (1)
    Given the vector (0.5, 0.5) then:
        λ (|w1| + ... + |wn|)  = λ (1)
        λ (w1^2 + ... + wn^2)  = λ (0.25 + 0.25) = λ (0.5)
    So "L2 Regularisation" prefers the vector (0.5, 0.5) over the vector (1, 0), since (0.5, 0.5) produces a smaller sum of squares and in turn a "Smaller Error Function"
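In Keras, L1/L2 penalties can be attached to a layer via `kernel_regularizer`; a minimal sketch (the λ value of 0.01 and the layer sizes are assumptions):
```
from keras.models import Sequential
from keras.layers import Dense
from keras.regularizers import l1, l2

model = Sequential()
# L2-penalised hidden layer: adds lambda * sum(w^2) to the loss
model.add(Dense(16, input_dim=2, activation='relu', kernel_regularizer=l2(0.01)))
# L1-penalised output layer: adds lambda * sum(|w|) to the loss
model.add(Dense(1, activation='sigmoid', kernel_regularizer=l1(0.01)))
model.compile(loss='binary_crossentropy', optimizer='adam')
```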
-
BertrAIND Russell Quote
“The whole problem with AI is that bad models are so certain of themselves, and good models so full of doubts”
-
Dropout Method (used when Training Neural Networks)
-
Dfn: Sometimes one part of the Neural Network has Large Weights and dominates the Training, whilst other parts don't train as much, so as mitigation we turn the dominant part off sometimes
-
Approach:
-
Randomly turn off some Nodes in the Neural Network as we go through the Epochs (i.e. feed forward and backpropagation passes) to force the remaining Nodes to pick up the slack and get involved in the training (so no single Node dominates)
-
A Parameter is passed to the Algorithm specifying the Probability that each Node gets dropped at a particular Epoch, i.e. if P = 0.2 then for each Epoch each Node gets turned off with probability 20%
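- A minimal Keras sketch of the Dropout technique (hedged: the 0.2 drop probability and layer sizes are illustrative):
  from keras.layers import Dense, Dropout
  from keras.models import Sequential

  model = Sequential()
  model.add(Dense(128, activation='relu', input_shape=(784,)))
  # During training, each Node in the previous layer is turned off with probability 0.2 per pass
  model.add(Dropout(0.2))
  model.add(Dense(10, activation='softmax'))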
-
-
-
Vanishing Gradient Issue
-
Problem
https://www.youtube.com/watch?v=W_JJm_5syFw
- Calculating the Derivative
σ(x) = 1 / (1 + e^-x)
of the Sigmoid function at a point far to the left or far to the right results in a value very close to 0. But the Derivative is what tells us the Direction to Move. This issue is worse in a Multi-Layer Perceptron since the Derivative of the Error Function with respect to a Weight equals the Product of all the Derivatives calculated at the Nodes in the corresponding path to the output. All those Derivatives are Derivatives of a Sigmoid Function (so they are small), and so their Product is tiny, which makes Training difficult: Gradient Descent gives us very tiny changes to make to the Weights, so we take very small Steps and it takes forever to reduce the Error
- Calculating the Derivative
-
Solution 1: Change the Activation Function
-
Tanh - Hyperbolic Tangent Function
-
https://www.youtube.com/watch?v=VzGOR5SlFSw
tanh(x) = (e^x - e^-x) / (e^x + e^-x)
- Similar to Sigmoid but range is -1 to 1 so Derivatives are Larger, which leads to great advances in Neural Networks
-
-
ReLU Rectified Linear Unit
*
relu(x) = { x if x >= 0
          { 0 if x < 0
- Simple piecewise-linear function:
  - if Positive, return the same value (the Derivative is 1)
  - if Negative, return zero (the Derivative is 0)
-
Used instead of Sigmoid since improves Training significantly without sacrificing much accuracy since the Derivatives will not be as small and allows us to perform Gradient Descent
-
Note that the Final Activation Function in a Multi-Layer Perceptron used for classification must output values between 0 and 1 (i.e. a Sigmoid)
- If the Final Activation Function is a ReLU then we may end up with Regression Models that Predict a value (used in Recurrent Neural Networks)
- Simple function between -1 and 1:
-
-
-
-
-
Gradient Descent
-
Problem
- Getting stuck in Local Minima during Gradient Descent
-
Solutions
- Random Restart (used by Gradient Descent Algorithms)
- Prevent
- Start from a few Random places and use Gradient Descent from all of them to increase the Probability we’ll reach a good Global Minimum
-
Momentum (used by Gradient Descent Algorithms)
https://www.youtube.com/watch?v=r-rYz_PEWC8
-
Move with Momentum so that when we reach a Local Minimum we power through it and over the next hump; otherwise, at a Local Minimum, the Gradient is too small to Step out of it and over the next hump.
-
Make the next Step in the Local Minima be the Average of the Previous Steps and where the most recent previous step matters more (and is given a higher Weight) than older steps.
https://youtu.be/r-rYz_PEWC8?t=1m36s
- Beta β (Momentum) is a constant between 0 and 1 that attaches to the Steps
STEP(n) --> STEP(n) + β * STEP(n-1) + β^2 * STEP(n-2) + ...
-
- Random Restart (used by Gradient Descent Algorithms)
-
-
Optimisers in Keras (used as Arguments when compiling Keras Models to optimise their performance)
- Links
- https://keras.io/optimizers/
- http://sebastianruder.com/optimizing-gradient-descent/index.html#rmsprop
-
SGD This is Stochastic Gradient Descent. It uses the following parameters:
- Learning rate
- Momentum (takes weighted average of the previous steps, in order to get a bit of momentum and go over bumps, as a way to not get stuck in local minima).
- Nesterov Momentum (this slows down the gradient when it's close to the solution).
- Adam
- Adam (Adaptive Moment Estimation) uses a more complicated exponential decay that consists of not just considering the average (first moment), but also the variance (second moment) of the previous steps.
- RMSProp
- RMSProp (RMS stands for Root Mean Squared Error) decreases the learning rate by dividing it by an exponentially decaying average of squared gradients.
- Links
-
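- A minimal sketch of choosing one of these optimisers when compiling a Keras model (hedged: the hyperparameter values are illustrative and `model` is assumed to be an already-defined network):
  from keras.optimizers import SGD
  # Stochastic Gradient Descent with momentum and Nesterov momentum enabled
  sgd = SGD(lr=0.01, momentum=0.9, nesterov=True)
  model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
  # Alternatively, pass a built-in optimiser by name to use its defaults, e.g. 'rmsprop' or 'adam'
  model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])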
Keras Lab
- Goal: Build Neural Network to analyse real data that consists of thousands of movie reviews from IMDB, and the challenge is to Predict the Sentiment Analysis of a review
- Convert inputs to numbers
- Example - Error Function and Gradient Descent
- Dfn:
Chapter 2 - Convolutional Neural Networks (CNN)
- About
- State of the art
- Great Summary https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
- CNN 2D visual http://scs.ryerson.ca/~aharley/vis/conv/flat.html
- CNN 3D visual http://scs.ryerson.ca/~aharley/vis/conv/
- JS source code http://scs.ryerson.ca/~aharley/vis/
- COURSES
- http://cs231n.stanford.edu/
- Applications
- Voice User Interfaces
- Applications
- WaveNet model (by Google)
- Convert text to speech
- Trained sufficiently it sounds like you
- Link About https://deepmind.com/blog/wavenet-generative-model-raw-audio/
- Link Paper https://arxiv.org/pdf/1609.03499.pdf
- Link Singing http://www.creativeai.net/posts/W2C3baXvf2yJSLbY6/a-neural-parametric-singing-synthesizer
- WaveNet model (by Google)
- Applications
- Natural Language Processing (NLP)
- General
- Recurrent Neural Networks (RNNs) used more than CNNs
- Applications
- CNNs used to extract info from sentences for:
- Sentiment Analysis
- CNN for NLP with Text Classification using TensorFlow,
baseline for Sentiment Analysis tasks, etc + Code
- http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
- http://www.wildml.com/2015/11/understanding-convolutional-neural-networks-for-nlp/
- CNNs at Facebook for language translation 9x faster than RNN
- https://code.facebook.com/posts/1978007565818999/a-novel-approach-to-neural-machine-translation/
- (hierarchical/parallel process words of sentence)
- CNNs to play Atari Games with DeepMind
- About https://deepmind.com/research/dqn/
- Code https://sites.google.com/a/deepmind.com/dqn/
- Deep Reinforcement Learning using Policy Gradients code beginner http://karpathy.github.io/2016/05/31/rl/
- CNNs used to extract info from sentences for:
- General
- Computer Vision
- General
- Given set of Images, CNN assigns corresponding Label it believes Summarises the Image Content
- Applications
- CNNs used to Teach AI Agents to play Video Games like
paddle game
- CNN has no prior knowledge of what a ball is or knowing precisely what the controls do.
- Only provided the screen, the score, and the controls given to a user
- CNN extracts crucial info allowing them to develop a useful strategy
- QuickDraw
- Pictionary playing, guesses what you’re drawing
based on what you draw
- https://quickdraw.withgoogle.com/#
- Auto-suggestions for sketching https://www.autodraw.com/
- DeepMind
- Go board game (a complex ancient Chinese game) where the AI agent beat a human professional
- Drones flying unfamiliar territory
- Deliver medical supplies to remote areas
- CNNs give drone ability to see or determine what’s happening in streaming video data
- Decoding Images of Text
- Digitise historical books
- Digitise hand-written notes
- Improve Algorithm to handle letters, numbers, punctuation
- Decode road signs for Self-Driving Cars
- Google Street Maps
- Trained algorithm to create better street maps of the world that reads house number signs from street view images
- Misc
- AI Experiments https://aiexperiments.withgoogle.com/
- AlphaGo https://deepmind.com/research/alphago/
- https://www.technologyreview.com/s/604273/finding-solace-in-defeat-by-artificial-intelligence/?set=604287
- CNNs powering Drones Indoors
- https://www.youtube.com/watch?v=AMDiR61f86Y
- Outdoor navigation with GPS
- http://www.droneomega.com/gps-drone-navigation-works/
- CNN powered Drone Outdoors Autonomous
- https://www.youtube.com/watch?v=wSFYOw4VIYY
- Classify Traffic sign recognition system
- https://github.com/udacity/CarND-Traffic-Sign-Classifier-Project
- Classify Street Signs
- https://github.com/udacity/machine-learning/tree/master/projects/digit_recognition
- CNN to produce a self-driving AI to play Grand Theft Auto V
- https://pythonprogramming.net/game-frames-open-cv-python-plays-gta-v/
- CNNs to convert famous paintings into 3D for Vision Impaired using
CNN to predict Depth from a Single Image https://www.cs.nyu.edu/~deigen/depth/
- http://www.businessinsider.com/3d-printed-works-of-art-for-the-blind-2016-1/?r=AU&IR=T
- CNN to localise Breast Cancer
- https://research.googleblog.com/2017/03/assisting-pathologists-in-detecting.html
- CNN to save endangered species using HotSpotter http://cs.rpi.edu/hotspotter/
- https://blogs.nvidia.com/blog/2016/11/04/saving-endangered-species/?adbsc=social_20170303_70517416
- CNN to change gender or make you smile in photo
- https://www.digitaltrends.com/photography/faceapp-neural-net-image-editing/
- General
- Voice User Interfaces
-
Deep Learning Recognition
- Interpret hand-written numerical digits
- Design Image Classification Algorithm
- Goal
- Takes pictures of hand-written numbers and identifies the numbers shown in the images
- Tools
- MNIST Database - contains 70k greyscale hand-written
images of digits 0 to 9
- Figure that shows datasets referenced
over time in NIPS papers
- https://www.kaggle.com/benhamner/popular-datasets-over-time/code
- https://nips.cc/
- My Code and Detailed Documentation explaining MLP Steps
- https://github.com/ltfschoen/aind2-cnn
- Links
- TODO - Drop Out Layer technique for Overfitting https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
- TODO - Keras Flatten Layer https://keras.io/layers/core/#flatten
- TODO - Activation Functions http://cs231n.github.io/neural-networks-1/#actfun
- TODO - Fully Connected “Dense” Layer https://keras.io/layers/core/
- TODO - Initialisers https://keras.io/initializers/
- TODO - Loss/Error Functions https://keras.io/losses/
- TODO - Early Stopping and ModelCheckpoint https://keras.io/callbacks/#modelcheckpoint
- TODO - http://machinelearningmastery.com/check-point-deep-learning-models-keras/
- TODO - Performance of other classifiers http://yann.lecun.com/exdb/mnist/
- Black Box of AI Unknowns https://www.technologyreview.com/s/604087/the-dark-secret-at-the-heart-of-ai/
- Goal
- Design Image Classification Algorithm
-
Interpret sophisticated images with complex patterns by replacing Fully Connected Layers (aka Dense Layers) with Locally Connected Layers (aka Convolutional Layers)
- https://www.youtube.com/watch?v=z9wiDg0w-Dc
- https://www.youtube.com/watch?v=h5R_JvdUrUI
-
Colour images https://www.youtube.com/watch?v=RnM1D-XI–8
-
Issues:
- MLPs use many Parameters
-
Using only Densely Connected Layers (aka Fully Connected Layers): e.g. a 28x28px image required ~0.5M parameters, so moderately sized images require high computational complexity
-
Only accepting Vectors as Input means throwing away the 2D info (i.e. spatial info of where pixels are located in reference to each other) contained in the image when we Flatten its Matrix into a Vector
-
- MLPs use many Parameters
-
Solution
- Use CNNs instead of MLPs as they process images without losing 2D info, since they use "Locally Connected Layers" (aka Sparsely Connected / Convolutional Layers) where Connections between Layers are informed by the 2D structure of the image Matrix, and they accept 2D Matrices as Input.
  - Instead of connecting every Hidden Node to every pixel in the original image, break the original image into 4x Regions, such that each Hidden Node in the "Locally/Convolutionally Connected Layer" connects to only the pixels in one of the 4x Regions (i.e. sees only 1/4 of the original image) to find patterns.
  - Each Hidden Node still reports to the Output Layer, which combines the findings for the discovered patterns that were learnt separately in each Region.
  - It uses far fewer Parameters than a Densely Connected Layer, is less prone to Overfitting, and understands how to tease out patterns in image data.
  - Expanding the quantity of Nodes with multiple Collections in the Hidden Layer, where each Collection contains Nodes responsible for analysing different Regions of the original image, allows discovering more Complex Patterns in the data.
  - Each of the Hidden Nodes within a Collection shares a common group of Weights (parameters), which is the motivation behind Convolutional Layers (i.e. for images where a pattern may appear in any region of the image)
-
Convolutional Layers
- Steps
- Given an image 5x5
- Select a Width (W) and Height (H) that defines a Convolution Window (i.e. 3x3 Window)
-
Slide the image horizontally and vertically over Regions of the image pixels. At each position the Window specifies a small piece within the image and defines a Collection of pixels to which we connect a single Hidden Node, and we call this Hidden Layer a Convolutional Layer
- Each Regional Collection of Input Nodes influences the value of a node in the Convolutional Layer by multiplying the Input Nodes by their corresponding Weights and summing the result. Assume the Bias is 0
-
Always add a ReLU Activation Function to the Convolutional Layers (leaves Positive values alone and makes all Negative values 0). See https://www.youtube.com/watch?v=h5R_JvdUrUI
-
Filters (i.e. 3x3) representing the Weights in a grid, whose size matches the size of the Convolutional Window
http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution
Node value = ReLU ( SUM ( Region Input Nodes * Weights Grid ) ), where Weights Grid == Filter
-
Note: We try to visualise the Filter so we understand what kind of Pattern the Filter is trying to detect. Each Filter detects only a single Pattern.
- Detection of Multiple Patterns requires the use of
Multiple Filters
- i.e. if want part of a dog image that contains Teeth, Whiskers, then we’d have Two Filters to detect Patterns of each https://www.youtube.com/watch?v=RnM1D-XI–8
-
Weights are not known in advance. Instead the Weights are learnt by the Neural Network as being the Weights that minimise the Loss Function
- Filter Image Kernels for Feature Extraction. It allows you to create your own filter. You can then use your webcam as input to a convolutional layer and visualize the corresponding activation map
- http://setosa.io/ev/image-kernels/
-
- Steps
- Greyscale vs Colour Images
-
Greyscale images interpreted as 2D array (Width, Height)
- Colour images interpreted as 3D array (Width, Height, Depth)
- RGB images have Depth of 3 (i.e. a stack of 3x 2D Matrices, one for each of the Red, Green, and Blue Channels of the image)
- Convolutions over Colour image (3D) are performed using the 3D Filter (stack of 3x 2D matrices), that has a value for each Colour Channel for all pixels in the image array.
-
-
Feature Maps http://iamaaditya.github.io/2016/03/one-by-one-convolution/
- Given a Coloured image with 3x Filters.
- Each Feature Map in Convolutional Layer produced by performing
Region Input Nodes Matrix * Weights Matrix (aka Filter)
may be thought of as Image Channels that may be stacked into a 3D array. This Stack of Image Channels may be provided as Input to another Second Convolutional Layer to discover Patterns within the Patterns that we discovered in the First Convolutional Layer, and so forth.
- Weights and Biases in CNNs
- Both Dense Layers in MLPs and Convolutional Layers in CNNs have Weights and Biases that are initially randomly generated
- In CNNs where the Weights take the form of Convolutional Filters these filters are randomly generated, and the Patterns they are initially designed to detect are also initially randomly generated
-
Loss Functions and Training CNNs
- CNN always have a Loss Function
- Multi-Class Categorisation uses
categorical cross-entropy loss ('categorical_crossentropy' in Keras)
- Multi-Class Categorisation uses
-
Train the Models through Backpropagation and update the Filters (aka Weights matrices) with values that minimise the Loss Function at each Epoch (iteration)
-
The CNN determines the kinds of Patterns it needs to detect based on the Loss Function, we do not tell the CNN the values of the Filters or the kinds of Patterns to detect, the CNN determines this from the dataset.
- Visualising the Patterns we will see that if the dataset contains say Dog images then the CNN is able to on its own learn Filters that appear like Dogs.
- CNN always have a Loss Function
-
Stride and Padding Hyperparameters
-
https://www.youtube.com/watch?v=Qt5SQNcQfgo
- Stride
- amount the Filter slides over the images.
- Padding
- padding around the image with 0’s to plan ahead for the possibility that the Filter may extend part-way over the side of an image if there is an odd difference between dimensions of image and dimensions of the Filter. Padding gives the Filter more space to move so we get contributions from all Regions in the image when populating the Convolutional Layer
padding = 'valid' means we are OK with potentially losing some Nodes in the Convolutional Layer and do not want padding
padding = 'same' means we want padding as we do not want to lose any Nodes in the Convolutional Layer
-
-
Convolutional Layers in Keras
-
Docs https://keras.io/layers/convolutional/
- Create a Convolutional Layer in Keras
from keras.layers import Conv2D
Conv2D(filters, kernel_size, strides, padding, activation='relu', input_shape)
- Arguments
- filters - number of filters.
- kernel_size - Number specifying both height and width of (square) convolution window.
- Optional arguments
- strides - stride of convolution. default strides is set to 1.
- padding - ‘valid’ or ‘same’. default is ‘valid’ (i.e. no padding)
- activation - typically ‘relu’. default is no activation applied. Strongly encouraged to add a ReLU activation function to every Convolutional Layer in your networks.
- NOTE:
- It is possible to represent both kernel_size and strides as either a number or a tuple.
- When using a Convolutional Layer as the first layer (appearing after the input layer) in a model, you must provide an additional input_shape argument:
  - input_shape - Tuple specifying the Height, Width, and Depth (in that order) of the input.
  - NOTE: Do not include the input_shape argument if the Convolutional Layer is not the first layer in your network.
- Example #1
- constructing a CNN, input layer accepts grayscale images that are 200 by 200 pixels
(corresponding to a 3D array with height 200, width 200, and depth 1).
I want next layer to be a convolutional layer with 16 filters, each with a width and height of 2
(kernel size).
When performing the convolution, I’d like the filter to jump two pixels at a time (strides).
I also don’t want the filter to extend outside of the image boundaries; in other words,
I don’t want to pad the image with zeros.
To construct this convolutional layer, I would use the following line of code:
Conv2D(filters=16, kernel_size=2, strides=2, activation='relu', input_shape=(200, 200, 1))
- Example #2
-
want next layer in my CNN to be a convolutional layer that takes the layer constructed in Example 1 as input. Say I’d like my new layer to have 32 filters, each with a Height and Width of 3. When performing the convolution, I’d like the filter to jump 1 pixel at a time. I want the convolutional layer to see all regions of the previous layer, and so I don’t mind if the filter hangs over the edge of the previous layer when it’s performing the convolution. Then, to construct this convolutional layer, I would use the following line of code:
Conv2D(filters=32, kernel_size=3, padding='same', activation='relu')
-
- Example #3
-
If you look up code online, it is also common to see convolutional layers in Keras in this format:
Conv2D(64, (2,2), activation='relu')
-
In this case, there are 64 filters, each with a size of 2x2, and the layer has a ReLU activation function. The other arguments in the layer use the default values, so the convolution uses a stride of 1, and the padding has been set to ‘valid’.
-
- Arguments
- Dimensionality: see conv-dims.py of https://github.com/ltfschoen/aind2-cnn
- Formula for Shape of a Convolutional Layer
- Formula for Number of Parameters in Convolutional Layer
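- For reference, the standard formulas (with K filters, an F x F kernel, stride S, and an input of depth D_in), sketched below as a small Python helper (the function name is illustrative, not part of Keras):
  import math

  def conv2d_output(h_in, w_in, d_in, k_filters, f_kernel, s_stride, padding):
      # Number of Parameters: one F x F x D_in weight block per Filter, plus one bias per Filter
      n_params = k_filters * f_kernel * f_kernel * d_in + k_filters
      if padding == 'same':
          h_out, w_out = math.ceil(h_in / s_stride), math.ceil(w_in / s_stride)
      else:  # 'valid' (no padding)
          h_out, w_out = math.ceil((h_in - f_kernel + 1) / s_stride), math.ceil((w_in - f_kernel + 1) / s_stride)
      return (h_out, w_out, k_filters), n_params

  # Matches Example #1 above: Conv2D(filters=16, kernel_size=2, strides=2, input_shape=(200, 200, 1))
  print(conv2d_output(200, 200, 1, 16, 2, 2, 'valid'))   # ((100, 100, 16), 80)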
- Pooling Layers
-
https://www.youtube.com/watch?v=OkkIZNs7Cyc
-
Convolutional Networks Summary - http://cs231n.github.io/convolutional-networks/
-
docs https://keras.io/layers/pooling/
- Dfn:
- Pooling Layers take Convolutional Layers as Inputs
- Convolutional Layers are a stack of Feature Maps (one for each Filter)
- Complex datasets with many object Categories require many Filters each responsible for finding a Pattern in the image
- Dimensionality of Convolutional Layers may get quite large since more Filters means a larger stack.
- Higher Dimensionality means use of more Parameters which may lead to Overfitting, so we need a method to reduce the Dimensionality, which is the role of Pooling Layers within a Convolutional Neural Network
- Types of Pooling Layers
- Max Pooling Layers
- take from the Convolutional Layer a stack of Feature Maps as Input
- Define a
- Window Size: i.e. 2x2
- Stride: i.e. 2
- Construct the Max Pooling Layer by working with each Feature Map separately
- Choose the Maximum value within the Window for each Stride.
-
The outcome will be that each Feature Map will be reduced in Width and Height (lower Dimensionality)
- Example
-
Say I’m constructing a CNN, and I’d like to reduce the dimensionality of a convolutional layer by following it with a max pooling layer. Say the convolutional layer has size (100, 100, 15), and I’d like the max pooling layer to have size (50, 50, 15). I can do this by using a 2x2 window in my max pooling layer, with a stride of 2, which could be constructed in the following line of code:
MaxPooling2D(pool_size=2, strides=2)
If you’d instead like to use a stride of 1, but still keep the size of the window at 2x2, then you’d use:
MaxPooling2D(pool_size=2, strides=1)
-
- Global Average Pooling Layer
- Extreme type of Dimensionality Reduction
- takes stack of Feature Maps and computes the average value of the Nodes for each Feature Map in the stack (reduces each Feature Map to a single value) such that the Global Average Pooling Layer converts a 3D array into a Vector
- DO NOT specify Window size or Stride
- Max Pooling Layers
-
- Designing Boilerplate CNN Architecture by arranging layers (for image classification)
- MY CODE AND NOTES
- https://github.com/ltfschoen/aind2-cnn/tree/master/cifar10-classification
- Steps
- https://www.youtube.com/watch?v=kI_RQoYsgQw
- Input
- CNN must accept image array input
- CNNs require a fixed-size input of the provided real-world images
- Select image size
- Resize all images to that same size (i.e. a square 32x32px) with dimensions divisible by 2
- All images are interpreted by computer as 3D array. Width and Height
always higher than Depth
- Coloured RGB image shape = (10, 10, 3)
- where Depth caters for RGB Channels
- Greyscale image shape = (10, 10, 1)
- Coloured RGB image shape = (10, 10, 3)
- Architecture (to allow the model to train “better” and classify objects more accurately)
- Sequence of Convolutional Layers
- Purpose of Sequence of Convolutional Layers is to discover hierarchies of spatial Patterns in the image
- Specify various Hyperparameters as Input to each Convolutional Layer
- Config so Convolutional Layer has same Width and Height as Previous Layer
kernel_size - between 2 and 5 (i.e. 2x2 to 5x5)
filters - controls the Depth of the Convolutional Layer since the Convolutional Layer has one Activation Map for each Filter
strides - 1 (default)
padding - 'same' (gives better results)
activation - 'relu' (for all Convolutional Layers)
- Increase number of Filters with each higher Convolutional Layer index in the Sequence
- i.e. 1st Convolutional Layer - 16 filters
- i.e. 2nd “ “ - 32 “
- i.e. 3rd “ “ - 64 “
- First Convolutional Layer requires the following additional parameter to be provided:
input_shape
(i.e. (32, 32, 3) means 32x32 pixel colour images in dataset)
- Note:
-
This configuration gradually increases the Depth of the provided Input array (32x32x3) without modifying the Width and Height at each Convolutional Layer i.e. when run the function we’ll see the Depth increasing
- WITHOUT MAX POOLING LAYERS
Input - (None, 32, 32, 3)
conv2d_1 - (None, 32, 32, 16)
conv2d_2 - (None, 32, 32, 32)
conv2d_3 - (None, 32, 32, 64)
- WITH MAX POOLING LAYERS (also decreases Width and Height)
Input - (None, 32, 32, 3)
conv2d_1 - (None, 32, 32, 16)
max_pooling2d_1 - (None, 16, 16, 16)
conv2d_2 - (None, 16, 16, 32)
max_pooling2d_2 - (None, 8, 8, 32)
conv2d_3 - (None, 8, 8, 64)
max_pooling2d_3 - (None, 4, 4, 64)
-
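- A hedged Keras sketch of the configuration above (filter counts 16/32/64, kernel_size=2, padding='same', 2x2 max pooling) which reproduces the "with max pooling" shapes listed above; kernel_size=2 is just one choice from the suggested 2-5 range:
  from keras.models import Sequential
  from keras.layers import Conv2D, MaxPooling2D

  model = Sequential()
  # Depth grows 3 -> 16 -> 32 -> 64 while each Max Pooling Layer halves Width and Height
  model.add(Conv2D(16, 2, padding='same', activation='relu', input_shape=(32, 32, 3)))
  model.add(MaxPooling2D(pool_size=2))
  model.add(Conv2D(32, 2, padding='same', activation='relu'))
  model.add(MaxPooling2D(pool_size=2))
  model.add(Conv2D(64, 2, padding='same', activation='relu'))
  model.add(MaxPooling2D(pool_size=2))
  model.summary()   # final feature map shape: (None, 4, 4, 64)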
- Config so Convolutional Layer has same Width and Height as Previous Layer
- Accepts the Input array and gradually converts its shape until its Depth is much greater than its Width and Height
- Eventually it transforms into Vector representation where there is not more spatial info to discover (i.e. whiskers, teeth, etc) in the image and then feed the Vector into Fully Connected Layer(s) (i.e. Dense) to determine what object is contained in the image. i.e. if the Last Max Pooling Layer discovers spatial information that wheels are present in that part of the image, then the Fully Connected Layer transforms that information to predict that a car is likely present in the image with higher probability (i.e. prob(car) = 0.99, prob(dog) = 0.01). This information is NOT pre-specified by us, instead it is learnt by the model during training through backpropagation
- Convolutional Layers increase the Depth of the Input array as it passes through the network
- Max Pooling Layers decrease the Width and Height of the Input array
(spatial dimensions)
- Follow every or every second Convolutional Layer in the Sequence
- Config so Spatial Dimensions become Half what they were in the
Previous Layer
pool_size
- 2stride
- 2padding
- default
- Sequence of Convolutional Layers
- MY CODE AND NOTES
-
- TODO -
- http://cs231n.github.io/convolutional-networks/
- Interpret hand-written numerical digits
- AWS GPU Instances
- https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/53b2a19e-4e29-4ae7-aaf2-33d195dbdeba/lessons/2df3b94c-4f09-476a-8397-e8841b147f84/concepts/ced1fa22-5723-4212-b73f-08f7f6613cae
- About:
- GPU-enabled server instance on AWS to train larger NN architectures
- Alternative for fast training of NNs (vs a local CPU) when the machine's built-in GPU is not a fast Nvidia GPU
- Setup:
- Login to AWS account https://aws.amazon.com/
- Using Root https://www.amazon.com/ap/signin?openid.assoc_handle=aws&openid.return_to=https%3A%2F%2Fsignin.aws.amazon.com%2Foauth%3Fresponse_type%3Dcode%26client_id%3Darn%253Aaws%253Aiam%253A%253A015428540659%253Auser%252Fhomepage%26redirect_uri%3Dhttps%253A%252F%252Fconsole.aws.amazon.com%252Fconsole%252Fhome%253Fstate%253DhashArgs%252523%2526isauthcode%253Dtrue%26noAuthCookie%3Dtrue&openid.mode=checkid_setup&openid.ns=http://specs.openid.net/auth/2.0&openid.identity=http://specs.openid.net/auth/2.0/identifier_select&openid.claimed_id=http://specs.openid.net/auth/2.0/identifier_select&openid.pape.preferred_auth_policies=MultifactorPhysical&openid.pape.max_auth_age=0&openid.ns.pape=http://specs.openid.net/extensions/pape/1.0&server=/ap/signin&forceMobileApp=&forceMobileLayout=&pageId=aws.ssop&ie=UTF8
- View EC2 Service Limits https://console.aws.amazon.com/ec2/v2/home?#Limits
- Find EC2 Instance Limit for p2.xlarge instance type (virtual server with GPU)
- Submit a Limit Increase Request to increase p2.xlarge to 1 with use case: “I would like to use GPU instances for deep learning.”
- Login to AWS account https://aws.amazon.com/
- Launch Instance
- Visit EC2 Management Console https://console.aws.amazon.com/ec2/v2/home
- Click “Launch Instance” Button
- Select Amazon Machine Image (AMI) with OS for instance and
config and pre-installed software
- Open Existing AMI
- Go to “Community AMIs”
- Search for “udacity-aind2” AMI
- Click “Select”
- Open Existing AMI
- Choose Instance Type (i.e. hardware the AMI runs on)
- Filter to only list “GPU compute”
- Select “p2.xlarge”
- Click “Review and Launch”
- Configure Security Group
(allow special config to allow running and accessing
Jupyter Notebook from AWS)
- Allow access the Jupyter notebook to port 8888 by configuring the AWS Security Group
- Click “Edit Security groups”
- On the “Configure Security Group” page:
- Select “Create a new security group”
- Set the “Security group name” to “Jupyter”
- Set the “Description” to “Jupyter”
- Click “Add Rule”
- Set a “Custom TCP Rule”
- Set the “Port Range” to “8888”
- Select “Anywhere” as the “Source”
- Click “Review and Launch” (again)
- Note:
- EC2 Pricing https://aws.amazon.com/ec2/pricing/on-demand/
- WARNING: AWS CHARGES FEE FROM THIS POINT ONWARDS
- EC2 Instances cost $
- Shutdown instances when not using them
- Storage of EC2 Instances cost $
- Delete EC2 instances when not using them
- ACTIONS > INSTANCE STATE > STOP/TERMINATE
- Note:
- Set AWS Billing Alarms
- EC2 Instances cost $
- Click “Launch” button to launch the GPU instance
- Click “Proceed without a key pair”
- Click “Launch Instances” button
- Click “View Instances” button
- Go to the EC2 Management Console to watch the instance booting
- When it says “2/2 checks passed” then the instance is ready to be logged into
- An “IPv4 Public IP” address (in the format of “X.X.X.X”) appears on the EC2 Dashboard
- Open Terminal and SSH to that address
as user “aind2”:
ssh aind2@X.X.X.X
- Authenticate with the password: aind2
- Success: Now we have a GPU-enabled server on which to train your Neural Networks
- Visit EC2 Management Console https://console.aws.amazon.com/ec2/v2/home
- Launch Jupyter Notebook using EC2 Instance
- On the EC2 Instance
- Clone the aind2-cnn repo:
git clone https://github.com/udacity/aind2-cnn.git
source activate aind2
cd aind2-cnn
jupyter notebook --ip=0.0.0.0 --no-browser
- Find the output window line that looks like:
- Copy/paste this URL into your browser when you connect for the first time to login with a token: http://0.0.0.0:8888/?token=3156e…
- Copy and paste the complete URL into the address bar of a web browser (Firefox, Safari, Chrome, etc). Before navigating to the URL, replace 0.0.0.0 in the URL with the “IPv4 Public IP” address from the EC2 Dashboard. Press Enter.
- Note that the browser should display the folders contained in the aind2-cnn repo
- Verify that the EC2 Instance can run a
Jupyter Notebook by using https://github.com/udacity/aind2-cnn/blob/master/cifar10-classification/cifar10_mlp.ipynb
- Click “cifar10-classification”
- Click “cifar10_mlp.ipynb”
- Run All Cells in notebook
- Shutdown and Delete EC2 instance
- Example CNN in Keras using AWS EC2 with cifar10-classification
- Video https://www.youtube.com/watch?v=faFvmGDwXX0
- Link
- Cheatsheet for CNNs in Keras
- https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Keras_Cheat_Sheet_Python.pdf
- CIFAR-10 Winning Architecture
- http://blog.kaggle.com/2015/01/02/cifar-10-competition-winners-interviews-with-dr-ben-graham-phil-culliton-zygmunt-zajac/
- Cheatsheet for CNNs in Keras
- Note: Previously we set the validation_split argument in model.fit to 0.2. This removed the final 20% of the training data, which was instead used as validation data. Alternatively, instead of having Keras split off the validation set for us, we may opt to hard-code the split ourselves (a hedged sketch follows below)
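- A minimal sketch of both options (hedged: model, x_train, y_train, the batch size and epoch count are assumed/illustrative):
  # Option 1: let Keras hold out the final 20% of the training data as validation data
  model.fit(x_train, y_train, batch_size=32, epochs=20, validation_split=0.2)

  # Option 2: hard-code the split ourselves and pass it explicitly
  split = int(0.8 * len(x_train))
  model.fit(x_train[:split], y_train[:split], batch_size=32, epochs=20,
            validation_data=(x_train[split:], y_train[split:]))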
- Mini Project: CNNs in Keras
- Modify architecture of the neural network in cifar10_cnn.ipynb.
- Specify new CNN architecture in Step 5: Define the Model Architecture in the notebook.
- Example https://github.com/fchollet/keras/blob/master/examples/cifar10_cnn.py
- Train the new model. Check the accuracy on the test dataset, and report the percentage in the text box below.
- Try different optimiser
-
Example Image Augmentation in Keras
- Links
- Video https://www.youtube.com/watch?v=odStujZq3GY
- Keras
ImageDataGenerator
https://keras.io/preprocessing/image/
- Visualise Augmentations of MNIST dataset http://machinelearningmastery.com/image-augmentation-deep-learning-keras/
- Augmentation to boost performance on Kaggle dataset using less data https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
- Image Augmentation
- Focus algorithm on learning an Invariant Representation
of the image that just checks if object is present in image or not
without dwelling on irrelevant info. Do not want the model to change
its prediction based on any of the following different attributes of the
object, which are not relevant
- Size Scale Invariance
- Angle Rotation Invariance
- Position Translation Invariance
- Technique to increase Invariance of images that Expands the dataset
by Data Augmenting (Data Augmentation also avoids Overfitting and is
better at generalising since model sees many new images)
- Add images to dataset at random Rotations to increase Rotation Invariance
- etc
- MY CODE AND NOTES
- https://github.com/ltfschoen/aind2-cnn/tree/master/cifar10-augmentation
- Note: Some CNNs have some built-in Translation Invariance (i.e. in Max Pooling where we take max value contained in Window, can occur at any pixel in that Window)
-
Jupyter Notebook aind2-cnn/cifar10-augmentation/cifar10_augmentation.ipynb
- Note on
steps_per_epoch
- Recall that
fit_generator
takes many parameters, includingsteps_per_epoch = x_train.shape[0] / batch_size
wherex_train.shape[0]
corresponds to number of unique samples in the training dataset x_train. By settingsteps_per_epoch
to this value, we ensure that the model seesx_train.shape[0]
augmented images in each epoch.
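- A hedged sketch of the augmentation set-up described above (the rotation/shift/flip settings and batch size are illustrative; model, x_train and y_train are assumed to exist):
  from keras.preprocessing.image import ImageDataGenerator

  batch_size = 32
  # Randomly rotate, shift and horizontally flip training images to increase Invariance
  datagen = ImageDataGenerator(rotation_range=10, width_shift_range=0.1,
                               height_shift_range=0.1, horizontal_flip=True)
  datagen.fit(x_train)

  # steps_per_epoch ensures the model sees x_train.shape[0] augmented images per epoch
  model.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),
                      steps_per_epoch=x_train.shape[0] // batch_size, epochs=20)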
- Recall that
- Links
- Groundbreaking CNN Architectures for Object Classification Tasks
- Top CNN Models in Keras
- https://keras.io/applications/
- Benchmarks of Top CNN Models architectures in Keras https://github.com/jcjohnson/cnn-benchmarks
- TODO - ImageNet Paper http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
- 10M images drawn from 1000 different image categories
- In 2012, the AlexNet Architecture won the ImageNet Large Scale Visual Recognition Competition, trained using the best available GPUs.
- Pioneered the use of the ReLU Activation Function and Dropout technique to avoid Overfitting
- In 2014, VGG / VGGNet Architecture (Visual Geometry Group)
- VGG16 has 16 total layers, VGG19 has 19 total layers
- Each have long sequence of 3x3 convolutions broken up by 2x2 pooling layers, and finished with 3x fully connected “dense” layers
- Pioneered use of small 3x3 convolution windows (instead of AlexNet's 11x11 windows)
- In 2015, ResNet Architecture (similar to VGG, by Microsoft)
- One version is a CNN with 152 layers (previously, above a certain number of layers performance declined due to the Vanishing Gradients Problem that arises when training the CNN through backpropagation: the gradient signal must be pushed through the entire network, and in a deeper network it is more likely the signal gets weakened before arriving at its destination)
- ResNet adds skip connections to the deep CNN so the gradient has a shorter route to travel, achieving superhuman performance in image classification
- TODO - VGGNet Paper https://arxiv.org/pdf/1409.1556.pdf
- TODO - ResNet Paper https://arxiv.org/pdf/1512.03385v1.pdf
- Treatment of Vanishing Gradients Problem http://neuralnetworksanddeeplearning.com/chap5.html
- ImageNet Large Scale Visual Recognition Competition (ILSVRC) http://www.image-net.org/challenges/LSVRC/
- Top CNN Models in Keras
- Visualising CNNs to understand them
- Visualising Activation Maps in Convolutional Layers to dig deeper in how CNN works
- Example: Pass Webcam through CNN in real-time
- Take Filters from Convolutional Layers and constructing images that Maximise their Activations
- Steps:
- Start with image containing Random Noise
- Gradually amend the pixels at each step, changing them to values that activate the filter more highly
- 1st Convolutional Layer
- Random Noise
- Filter that detects specific Colours or Edges
- Output image
- 2nd Convolutional Layer
- Filter that detects Circles or Stripes
- Output image
- 5th Convolutional Layer
- Filter detects Complex Patterns
- Output image
- Example:
- Deep Dreams
- Starting image (i.e. a Tree) is applied a Filter (i.e. a “Statue”) that transforms it into a Hybrid
- Deep Dreams
- Steps:
- Links
- Course on Visualising CNNs http://cs231n.github.io/understanding-cnn/
- Openframeworks.cc for visualising CNNs in real-time http://openframeworks.cc/
- Demo https://aiexperiments.withgoogle.com/what-neural-nets-see
- WaveNets to generate audio
- DeepVis Toolbox https://github.com/yosinski/deep-visualization-toolbox
- http://ml4a.github.io/
- Visualisation Tool https://www.youtube.com/watch?v=AgkfIQ4IGaM&t=78s
- Creating CNN Visualisations https://www.youtube.com/watch?v=ghEmQSxT6tw&t=5s
- Clarifai.com https://www.clarifai.com/
- Picasso Visualiser for CNNs in Keras and TensorFlow https://medium.com/merantix/picasso-a-free-open-source-visualizer-for-cnns-d8ed3a35cfc5
- Visualizing how CNNs see the world. Introduction to Deep Dreams,
along with code for writing your own deep dreams in Keras
- https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html
- Music Video uses DeepDreams 3:15-3:40
- https://www.youtube.com/watch?v=XatXy6ZhKZw
- Create DeepDreams without code https://deepdreamgenerator.com/
- Interpretability of CNNs
- Dangers from using deep learning models (that are not yet interpretable) in real-world applications https://blog.openai.com/adversarial-example-research/
- https://arxiv.org/abs/1611.03530
- Visualising Activation Maps in Convolutional Layers to dig deeper in how CNN works
- How a CNN works in Action
- Given an image trained on ImageNet http://www.matthewzeiler.com/pubs/arxive2013/eccv2014.pdf
- Each image represents a pattern that causes the neurons in the first layer to activate (i.e. they are patterns that the first layer recognizes, such as a -45 degree line).
- First Layer of our CNN clearly picks out very simple shapes and patterns like lines and blobs
- Second Layer picks up more complex ideas like circles and stripes. A grayscale grid may be used to represent how the layer of the CNN activates (or “what it sees”) based on the corresponding images from the grid on the right. Second layer of the CNN captures complex ideas like circles, stripes, and rectangles. CNN learns to do this on its own. There is no special instruction for the CNN to focus on more complex objects in deeper layers. That’s just how it normally works out when you feed training data into a CNN.
- Third Layer picks out complex combinations of features from the second layer. These include things like grids, honeycombs, wheels, and even faces
- Final Layer picks out the highest order ideas that we care about for classification, like dog faces, bird faces, and bicycles
- Transfer Learning
- Links
-
Slides https://classroom.udacity.com/nanodegrees/nd889/parts/16cf5df5-73f0-4afa-93a9-de5974257236/modules/53b2a19e-4e29-4ae7-aaf2-33d195dbdeba/lessons/2df3b94c-4f09-476a-8397-e8841b147f84/concepts/8c202ff3-aab5-46c3-8ed1-0154fa7b566b
- Systematically analyzes the transferability of features learned in pre-trained CNNs
- https://arxiv.org/pdf/1411.1792.pdf
- Cancer detecting CNN https://www.nature.com/articles/nature21056.epdf
-
- Steps Summary
- Deep Learning Professionals Design the Architecture of CNN
- Setting Hyperparameters (Filter Window size, stride, padding)
- Choose Loss Function and Optimiser
- Start Model Training and wait
- Note: State-of-art CNNs architectures available in Keras are result of experimenting with numerous architectures and extensive Hyperparameter tuning and are trained on large ImageNet database that took weeks to train on latest GPUs
- Deep Learning Professionals Design the Architecture of CNN
- About
-
Adapt the State-of-the-art CNN Architectures that have learnt so much about how to find Patterns in image datasets toward our own Classification tasks (instead of creating CNN from scratch, take the learnt understanding and pass it on to a new Deep Learning Model) using the Transfer Learning Technique
- Transfer Learning Approach
- Removing the Final Classification Layers of a State-of-the-art CNN Inception Architecture (i.e. Conv, Pool, Dense) that are specifically pre-trained on the ImageNet data set (containing animals, fruits, etc), but retain the early layers (detecting colors, shapes, general features), and replace with a new Dense layer and only train that
- Update new Dense layer by:
- randomly initialising the Weights in the new Dense layer
- Update all other existing layers by:
- initialise Weights using pre-trained Weights
- During re-training of the entire neural network the parameters were further fine-tuned / optimised to fit to the custom database (i.e. of skin lesions)
- Example
- Diagnosing skin cancer
-
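- A hedged Keras sketch of this Transfer Learning approach using one of the pre-trained keras.applications models (ResNet50 here; the new Dense head, its 133-class output and the frozen-base choice are illustrative, not the course's exact architecture):
  from keras.applications.resnet50 import ResNet50
  from keras.layers import Dense
  from keras.models import Sequential

  # Pre-trained ImageNet weights with the final classification layers removed (include_top=False)
  # and a Global Average Pooling layer appended (pooling='avg')
  base_model = ResNet50(weights='imagenet', include_top=False, pooling='avg', input_shape=(224, 224, 3))
  for layer in base_model.layers:
      layer.trainable = False          # freeze the early layers (colours, shapes, general features)

  model = Sequential()
  model.add(base_model)
  model.add(Dense(133, activation='softmax'))   # new, randomly initialised Dense layer (e.g. 133 dog breeds)
  model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])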
- Different Approach for different Cases
- Case 1: Small Data Set, Similar Data
- Case 2: Small Data Set, Different Data
- Case 3: Large Data Set, Similar Data
-
Case 4: Large Data Set, Different Data
- See CNN Part 27. Transfer Learning Slide for details
- Notes:
- Small dataset of images - 1,000
- Overfitting is concern when using transfer learning with a small data set.
-
Large dataset of images - 1,000,000
- Global Average Pooling (GAP) layers are not only used for Object Classification (what is in the image) but also for Object Localisation (where the object is in the image), as shown in the 2016 CVPR paper linked below
- Small dataset of images - 1,000
- Example
-
Video - https://www.youtube.com/watch?v=HsIAznMM1LA
-
Code
transfer-learning/transfer_learning.ipynb
Note: After training the CNN to identify dog breeds, use the model in an end-to-end pipeline for incorporation into an app
-
- Links
- Global Average Pooling (GAP) Layers for Object Localisation using ResNet http://cnnlocalization.csail.mit.edu/Zhou_Learning_Deep_Features_CVPR_2016_paper.pdf
- Code that uses CNN for object localisation https://github.com/alexisbcook/ResNetCAM-keras
- Video on CNN object localisation https://www.youtube.com/watch?v=fZvOy0VXWAI
- Visualization techniques to better understand bottleneck features https://github.com/alexisbcook/keras_transfer_cifar10
- http://machinelearningmastery.com/object-recognition-convolutional-neural-networks-keras-deep-learning-library/
- https://arxiv.org/pdf/1611.10012.pdf
- http://www.pyimagesearch.com/2017/03/20/imagenet-vggnet-resnet-inception-xception-keras/
- https://pythonprogramming.net/haar-cascade-face-eye-detection-python-opencv-tutorial/
- https://www.packtpub.com/books/content/tracking-faces-haar-cascades
- Links
- Recurrent Neural Networks (RNN)
- About
- Supervised Learning type
-
Learning between Input/Output pairs
- Unstructured Input in Vanilla Supervised Learning Models
(i.e. Feedforward Networks)
- Issue as cannot exploit any particular structure
- Do not make any assumptions about how Input Dataset is Structured
- Example
- Given row of data from medical dataset with
single data point (one input-output pair) as
Unstructured Inputs.
Unstructured Input features in any order used to predict
whether a given disease exists by feeding them
into a Vanilla Supervised Learner designed to assume and process
Unstructured Input data (ignore structure)
- Input
- Categories: age: 43, gender: male, height: 5.9, weight: 188
- Output: disease: No
- Input
- Example
- Structured Input
- Many data types do have Input Structure
-
Examples
-
Images - input pixels of image patch are related spatially (pixels near one another are similar). If pass to a Vanilla Supervised Learner that’s designed to assume and process Unstructured Input then it won’t care about the spatial correlations with same level of performance, but if we instead use a Convolutional Neural Network (CNN) then if we pass Structured Input it will leverage it (i.e. the spatial correlation b/w pixels)
-
Video
-
Text - where there’s a natural Order to Words and Chars in a Sentence. Trying to predict the next word given Input Words a Vanilla Supervised Learner won’t care (is indifferent) what Order the Input Words are fed in
-
Financial Time Series - where naturally Ordered Structure of past Input History events over time considered and used to predict future value. Vanilla Supervised Learner would be indifferent to Order when we train it, but we should instead exploit the Ordered/Sequential Structure if its available by using Recurrent Neural Networks (RNN)
-
-
- Many data types do have Input Structure
-
- Supervised Learning type
- Usage:
- CNN - Images, Video
- Example
- Linear Relationship (line) splits Feature Space of Input image with faces plotted on one side or non-face on other side
- Example
- RNN -
- Sequential Data - speech recognition (time-series text-to-audio), text generation, stock price prediction
- Example
- Regressor Line indicates a trend where first weekend revenue predicts movie popularity
- CNN - Images, Video
- Background
- Supervised Learning Problems
- Ordered Sequences Problems
- Financial Time Series
- Example
- Ordered Sequence based on historical price Input of Apple stock price over time
- Example
- Natural Language Processing (NLP)
- Example
- Text generation
- Given small sequence of text, try to Auto-complete
using Supervised Learning where we’ve well trained
a network model on a large text corpus to generate
new sentences
to understand complex relationships b/w Words and Characters from
the English language
- Links
- Academic RNN text generator http://www.cs.toronto.edu/~ilya/rnn.html
- Twitter bots that tweet automatically generated text http://tweet-generator-alex.herokuapp.com/
- NaNoGenMo annual contest to automatically produce a 50,000+ word novel https://github.com/NaNoGenMo/2016
- Robot Shakespeare, a text generator that automatically produces Shakespeare-esque sentences https://github.com/genekogan/RobotShakespeare
- NLTK http://www.nltk.org/
- Input - Ordered Sequence of Words or Chars (Training Text Corpus)
- Output - Ordered Sequence of Chars
- Links
- Machine Translation
- Automatic translation of one language into another
- Input - Ordered Sequence of Words (language X)
- Output - Ordered Sequence of Words (language Y)
- Automatic translation of one language into another
- Text generation
- Example
- Speech Recognition
- Example
- Speech Recognition
- Input - Ordered Sequence of Raw Audio Signal
- Output - Ordered Sequence of Words (text-based)
- Speech Recognition
- Example
- Financial Time Series
- Ordered Sequences Problems
- Supervised Learning Problems
- Modelling Ordered Sequence (Sequential) Data Recursively (used in RNN framework)
-
Use previous values of time-series to predict future values
-
Notation for Generic Ordered Sequence of Values (Ordered by Index)
(S1, S2, S3, ..., SP)
- Generic Ordered Sequence has Big P values
- S1 comes before S2, S2 comes before S3, etc
-
Indices may be of any interpretation, or even Timestamps (indexing when a certain value in a sequence occurred)
- Example
- Elements 1 to 5
S1 = my, S2 = dog, S3 = is, S4 = the, S5 = best
- Elements 1 to 5
- Example: Stock Price History (or time-series generally)
- Ordered Sequence from Left to Right
- S1 at Time 1, S2 and Time 2, etc
- Model Ordered Sequence Structure (often product of real underlying process)
- Example:
- Predict temperature. Input is temperature over time (temperature-based time-series) dependent on factors (i.e. Sun)
-
Predict stock price for investor. Input is price history of given stock that’s dependent on factors (Known and Unknown) i.e. success of product line, vitality of overall economy, CEO and board of directors actions, etc.
- Model Ordered Sequence Recursively
(in lieu of Knowing the underlying process)
- i.e. use past values of sequence to predict future values
- i.e. model future values of sequence mathematically in terms of its predecessors
- SEED is the original value in a recursive sequence (recursive sequences always start with seed value(s))
-
ORDER number of most recent element values used as Inputs each time it recurses to produce future ones i.e. number of previous elements a recursive sequence requires in order to predict future elements
- Mathematically Recursive Sequences Examples
- Example:
- Odd Numbers:
- 1,3,5,7,…
-
S1==1, S2==3
- Generate Ordered Sequence Recursively
S1 = 1 (1x SEED value)
S2 = 2 + S1 = 3
S3 = 2 + S2 = 5 (just add 2 to the previous sequence value)
S4 = 2 + S3 = 7
...
Note: ORDER == 1 - since it uses the 1x most recent value each time it recurses
- Unfolded Views of Recursive Sequence
-
https://www.youtube.com/watch?v=OS9yQCTzCkg
- Find Recursive Equation to Generate Values
Function-based Notation of Generic Recursive Sequence
*** f(s) = 2 + s ***
i.e.
f(S1) = 2 + S1
f(S2) = 2 + S2
- Graphical Notation
S1 --(f)--> S2 --(f)--> S3 ... ST-1 --(f)--> ST ...
-
- Folded View of Recursive Sequence
- Single line represent all our recursive levels
S1 = 1
St = f(St-1), t = 2, 3, 4, ...
- Graphical Model Analogue
(feeds output into itself repeatedly)
(diagram: S1 --f--> S2 --f--> ... --f--> ST, with f applied repeatedly, feeding its output back into itself)
- Graph the Step (x-axis) vs Values (y-axis)
- Recursive Sequence - Every value in sequence can be defined in terms of its predecessors (except the first value) i.e. where future elements are based mathematically on previous values
- Fibonacci sequence:
-
1,1,2,3,5,8,13,21
-
Note: This recursive sequence generates the Golden Ratio and creates a Spiralling Effect when represented Geometrically
- Generate Ordered Sequence recursively
(this time with 2x SEED values)
S1 = 1, S2 = 1 (2x SEED values)
S3 = S2 + S1 = 2 (just sum the previous two elements)
S4 = S3 + S2 = 3
S5 = S4 + S3 = 5
S6 = S5 + S4 = 8
...
Note: ORDER == 2 - since it uses the 2x most recent values each time it recurses
-
Unfolded Views of Recursive Sequence
- Find Recursive Equation to Generate Values
Function-based Notation of Generic Recursive Sequence
*** f(St-2, St-1) = St-2 + St-1 ***
S1 = 1, S2 = 1
S3 = f(S1, S2)
S4 = f(S2, S3)
S5 = f(S3, S4)
...
- Find Recursive Equation to Generate Values
- Folded View of Recursive Sequence
- Single line represent all our recursive levels
S1 = 1, S2 = 1
St = f(St-2, St-1), t = 3, 4, 5, ...
-
- Odd Numbers:
- Links: Fibonacci Sequence https://en.wikipedia.org/wiki/Fibonacci_number
- Example:
-
Example 2: Rayleigh - Reverse Steps
-
Define a few SEED values and Recursive Equation then Generate a Recursive Sequence using the Recursive Equation
1) Seed the Recursive Sequence with 2x SEED values: S1 = 1, S2 = 0.5
S3 = 0.4 * MAX(0, 1 + S1 - 0.1 * S2) (uses S1 and S2 as Inputs)
S4 = 0.4 * MAX(0, 1 + S2 - 0.1 * S3) (uses S2 and S3 as Inputs)
...
Note: ORDER == 2 - since it uses the 2x most recent values each time it recurses (i.e. 0.4 times a linear combo of the two previous values passed through the Rayleigh function, which takes an Input and Outputs the Max of that Input and 0)
- Folded View of Recursive Sequence
- Single line represent all our recursive levels
S1 = 1, S2 = 0.5
St = f(St-2, St-1), t = 3, 4, 5, ...
f(St-2, St-1) = 0.4 * MAX(0, St-2 - 0.1 * St-1)
- Graph of the oscillation at 4:20 https://www.youtube.com/watch?v=KN_dRCy3rtw
- Folded View of Recursive Sequence
-
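- A minimal Python sketch generating the three recursive sequences above from their SEED values and recursive formulas (the Rayleigh update follows the folded-view formula shown above; note the notes are inconsistent about whether a "1 +" term belongs inside the MAX):
  def generate(seeds, update, n):
      # ORDER == number of seed values: the update rule always consumes that many most recent values
      seq = list(seeds)
      while len(seq) < n:
          seq.append(update(*seq[-len(seeds):]))
      return seq

  print(generate([1], lambda s: 2 + s, 8))                               # odd numbers: 1, 3, 5, 7, ...
  print(generate([1, 1], lambda a, b: a + b, 8))                         # Fibonacci: 1, 1, 2, 3, 5, ...
  print(generate([1, 0.5], lambda a, b: 0.4 * max(0, a - 0.1 * b), 8))   # Rayleigh-style, ORDER == 2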
- Example:
- Basic Graphical Model Representation using Maths to understand RNNs
- See earlier examples of Odd
- Unfolded Views of Recursive Sequence
-
Folded View of Recursive Sequence
- Summary https://youtu.be/OS9yQCTzCkg?t=4m36s
- See earlier examples of Odd
-
- Expressing Recursive Sequences
- Functionality
- Graphically (graphical models)
- Programatically (in code)
- Views of a Recursive Sequence
- Unfolded
- Folded
- Drive a Hidden Recursive Sequence using any Driver (Input Sequence)
- About
- Generating methods for Markov Chains
- Dynamic systems
- RNNs
- Example
- Create model of savings account balance at end of each month
(month-to-month basis)
Denote the following:
  h1 = initial savings balance at end of 1st month
  ht = savings balance at end of month t
  st = income (or loss) at end of month t (i.e. the drivers/influencers of the savings balance each month)
  i.e. s1 is the income (or loss) at end of the 1st month, etc
Example Model for the monthly savings level:
  h1 = 0 (initial savings level during month 1, the seed value)
  h2 = h1 + s1 (add 1st month's savings to 1st month's income (or loss))
  h3 = h2 + s2 (add previous month's balance to previous month's income (or loss))
  ... repeat for every month (time period)
Folded view of the month-to-month savings balance summarises these recursive updates:
  h1 = 0
  ht = ht-1 + st-1, t = 2, 3, 4, ...
  where future values of the sequence h are fully dependent on the previous values
Note:
  - h is always a recursive sequence
  - s may be recursive OR random
Simulation of how monthly income (or loss) drives the savings balance:
  - Monthly income (or loss) simulated as a random variable with value -1, 0, or 1 (with equal probability)
  - Initialise the savings account balance at 0
  - Simulate 23x months' worth of income (or loss)
  - Use the update formula to generate the monthly savings account balance
  - Refer to the Simulation on the graph: https://youtu.be/JQ2Nzzxx5oQ?t=4m10s
- Example
-
Create model of real stock price sequence end of each month (month-to-month basis)
-
Driver (i.e. Input Sequence) - sequence driving things along
-
Hidden Sequence - sequence being driven (since we do not actually receive its values as data, instead we generate them recursively using the Driver)
Folded view of the month-to-month real stock price summarises these recursive updates:
  h1 = 0 (could be set to any value)
  ht = tanh(ht-1 + st-1), t = 2, 3, 4, ...
Note:
  - s (Driver) is the sequence of real stock prices and may be recursive OR random
  - s is used to drive a recursive sequence where each new element after the seed is created by adding the previous value of h to the driver sequence s, and then taking tanh of the result
Simulation:
  - Refer to the Simulation on the graph (Value vs Step): https://youtu.be/JQ2Nzzxx5oQ?t=4m38s
  - h (Hidden Sequence) is layered on top of the driver sequence s in the graph and both appear similar, since this driver sequence s is more Structured than the driver used in the previous example (where we modelled the savings account balance)
Function Notation representing the Recursive Update:
  h1 = 0
  ht = f(ht-1, st-1), t = 2, 3, 4, ...
Graphical Model representing this Generic Hidden Sequence: https://youtu.be/JQ2Nzzxx5oQ?t=6m02s
-
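- A small Python sketch of a Hidden Sequence ht = tanh(ht-1 + st-1) driven by a random Driver, as described above (23 steps and the -1/0/1 driver values follow the savings example; a real stock price series would simply replace s):
  import math
  import random

  steps = 23
  # Driver sequence s: random income (or loss) of -1, 0 or 1 each month (equal probability)
  s = [random.choice([-1, 0, 1]) for _ in range(steps)]

  # Hidden sequence h, generated recursively from the Driver
  h = [0.0]                                      # h1 = 0 (seed)
  for t in range(1, steps):
      h.append(math.tanh(h[t - 1] + s[t - 1]))   # ht = tanh(ht-1 + st-1)
  print(list(zip(s, h)))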
- About
-
How to Inject the assumption of Recursivity directly into a Supervised Learner (using Feedforward Networks)
- Definitions
- Recursivity - Modelling the Structure of Recursive Ordered Sequences
- Recursive Sequence - Able to generate new values in a sequence by combining old values using a specific formula
- Lazy Way
-
Adjust Vanilla Supervised Learners to deal with Ordered Sequence data
-
Goal
-
Given an Input Sequence (dataset Driver) we want to Model it as Recursive using a formula that approximately generates values of that Sequence, given previous values, then use that formula to make predictions about future values in the Sequence (whether the Sequence itself is truly Recursive or not) (i.e. perform Supervised Learning with Ordered Sequences)
-
Note:
- Given an Input Sequence we Model it as Recursive to make meaningful predictions
- Inject the Structural assumption into a Supervised Learner
- Options to approach
- Feedforward Networks (simple approach) (reverse engineer the notion of Recursivity and inject it as a parameterised Model into a Supervised Learner)
- RNNs (complex approach)
- Options to approach
-
-
- Definitions
-
Injecting Recursivity into a Supervised Learner
-
Example using Random Guessing:
- Given a Random Sequence (Original Sequence) of numbers that
we suppose is recursive
1,3,5,7,9,11,13,15
- Assume the sequence is of ORDER == 1, then SEED value is just 1st value in series
- Steps to find the formula for the recursive update of the sequence
- Pick a function
Pick a function, here g(s) = 2 + s, where:
  s1 = 1
  s2 = 2 + s1
  s3 = 2 + s2
  ...
  s8 = 2 + s7
- Use the function with a Seed value to see if it generates something close to sequence we have
- Inject first Seed value into the function as parameter, and
inject successive hat values into next value to
generate a new sequence using the function
s1 = 1
^s2 = g(s1)
^s3 = g(^s2) = 2 + ^s2
...
^s8 = g(^s7) = 2 + ^s7
- New sequence generated with the function
s1, ^s2, ^s3, ..., ^s8
- Compare new sequence with original sequence to check
how close our alignment is
between recursive function and recursive formula
s1, ^s2, ^s3, ..., ^s8 VS s1, s2, s3, ..., s8
- Pick a function
- Sample implementation of finding steps
- Pick Random Function
g(s) = 1 - 0.5 * s
- Generate an 8-value sequence starting with Seed value s1 and check how close you get to the original sequence
- Plot the comparison between
- Random Sequence (Original Sequence)
- Random Function (Proposed Solution Recursive Parameterised Sequence Function)
- Try a different Random Function (Proposed Solution Recursive Parameterised Sequence Function)
- Repeat
- Pick Random Function
- Note: cannot rely on just guessing to determine the Proposed Solution Recursive Parameterised Sequence Function; we need to instead Learn such a Function using the Original Sequence itself
- Given a Random Sequence (Original Sequence) of numbers that
we suppose is recursive
- Example 1 (Simple):
Using Recursive Parameterised Sequence Function as our
Recursive Approximator that has Weights we may tune to the given Sequence
- Summary of steps
- Inject Recursivity into a Supervised Learner in order to
Model an Ordered Sequence Recursively
- Proposing Recursive Parameterised Formula (Network Architecture)
- Windowing the Sequence to produce Regression Input-Output Pairs
- Parameter Tuning using the Input-Output Pairs
- Sequence Generation using the Trained Network as a Regressor
- Inject Recursivity into a Supervised Learner in order to
Model an Ordered Sequence Recursively
- Given a Random Sequence (Original Sequence) of numbers that
we suppose is recursive,
- Note: Goal is to make a recursive approximation of it by first making a guess about the architecture of its recursive formula and then tuning the parameters of that architecture optimally using the sequence itself (ideally try multiple architectures to find the best one)
1,3,5,7,9,11,13,15
-
Pick a Simple Linear Parameterised Function as the Recursive Approximator (a simple Feedforward Network) with 2x Weights w0 and w1 that we Learn by Fitting g(s) = w0 + w1 * s
-
Model each element of Sequence past the Seed as a Linear Combination of its previous element
s1 = 1
s2 = w0 + w1 * s1
s3 = w0 + w1 * s2
...
s8 = w0 + w1 * s7
- Weights (i.e. w0 and w1) need to be learnt
- Equalities (Levels of Recursion) (i.e. s2 to s1, s3 to s2, etc) must be determined, since if they hold for some values of the Weights w0 and w1 then we have a Recursive Formula that will generate our sequence
- Weights (i.e.
- Find the best Weights to make Equalities hold as best possible
- Learn by Forming and then Minimising a
Least Squares Cost Function (breaking recursion into levels)
- Ignore the top Level (i.e. s1 = 1)
-
At each Level of Recursion, we take the difference between both sides of the Equality and square the result
s2 = w0 + w1 * s1   ===>   (s2 - (w0 + w1 * s1))^2
s3 = w0 + w1 * s2   ===>   (s3 - (w0 + w1 * s2))^2
...
s8 = w0 + w1 * s7   ===>   (s8 - (w0 + w1 * s7))^2
- Then Sum up the differences of each level of recursion
to give us a Least Squares Cost Function
∑ (t = 2 to 8) (st - (w0 + w1 * st-1))^2
- Minimise the Least Squares Cost Function over the
Weights to give us their Optimal values, which gives
the Best Recursive Formula of the Form:
g(s) = w0 + w1 * s
-
Note: Resolving the formula uses Regression, where our Input-Output Pairs consist of consecutive elements of the Sequence
- Note: Recursive Approximator
g(s)
is a simple Feedforward Network (Linear Function)
- Learn by Forming and then Minimising a
Least Squares Cost Function (breaking recursion into levels)
-
After Training, Tune the Network to generate new values in the Sequence
- Process Training Sequence (aka Windowing)
- Fit an Order-One Recursive Formula to the Sequence of numbers
-
Extract the Set of Regression Input-Output Pairs from the Sequence to form and Minimise the Least Squares Cost Function
- Input-Output Pair 1 (Elements 1 and 2 are first two elements of Sequence)
- Input s1, Output s2
- Input-Output Pair 2
- Input s2, Output s3
-
etc (Slide the Input Window one unit to the right on the Graph that shows Input-Outputs over time)
- Note: If the Sum is ∑ (t = 2 to 8) there will be 7x pairs (one per term of the sum from t = 2 to the upper limit P == 8)
- Input-Output Pair 1 (Elements 1 and 2 are first two elements of Sequence)
-
Add Input-Output Pairs to Summands of Least Squares Loss
Input    Output    Summand
s1       s2        (s2 - (w0 + w1 * s1))^2
s2       s3        (s3 - (w0 + w1 * s2))^2
...
sP-1     sP        (sP - (w0 + w1 * sP-1))^2
- Inputs are calculated given the first 7x members of
the Sequence in the example
Input      Output
[[1]       [[3]
 [3]        [5]
 [5]        [7]
 ...        ...
 [13]]      [15]]
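A minimal sketch of this windowing step (the helper name `window_transform_series` is an assumption, not necessarily the project's function; window size 1 for the order-one model):
```python
import numpy as np

def window_transform_series(series, window_size):
    """Slide a window over the series to form regression input/output pairs."""
    X, y = [], []
    for t in range(len(series) - window_size):
        X.append(series[t:t + window_size])   # inputs: window of previous values
        y.append(series[t + window_size])     # output: the next value
    return np.asarray(X), np.asarray(y).reshape(-1, 1)

series = np.array([1, 3, 5, 7, 9, 11, 13, 15])
x, y = window_transform_series(series, window_size=1)  # 7 input/output pairs
```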
-
Fitting with Keras by constructing a Model to reflect the derivations of the Least Squares Loss
- Build Feedforward Network (FFN) with One Linear Layer
to perform regression on our input/output data
and Output Loss using Mean Squared Error. then
Fit the model
# Build FFN to perform regression on input/output data
model = Sequential()
layer = Dense(1, input_dim=1, activation='linear')
model.add(layer)
model.compile(loss='mean_squared_error', optimizer='adam')

# Fit the Model with Batch size and Epochs qty
model.fit(x, y, epochs=3000, batch_size=3,
          callbacks=callbacks_list, verbose=0)
- Build Feedforward Network (FFN) with One Linear Layer
to perform regression on our input/output data
and Output Loss using Mean Squared Error. then
Fit the model
- After Training, substitute each Input value into the FFN (the Linear Combination g(st-1) = w0 + w1 * st-1) to make a set of predictions on the Training Set
- Set of Predictions built
Input      Output     Predictions g(st-1)
[[1]       [[3]       [[  2.999 ]
 [3]        [5]        [  4.999 ]
 [5]        [7]        [  6.999 ]
 ...        ...        ...
 [13]]      [15]]      [ 15.000 ]]
-
Compare the Predictions to the Output to check how close they are (i.e. the aim is to achieve a fair approximation of the true recursive function, i.e. f(s) = 2 + s in the case of the Odd Sequence example)
- Print the Learned Weights (to see how similar they are to the original model and true function and its associated Coefficients that we're aiming for)
  model.get_weights()
  [array([[ 1.000001 ]], dtype=float32), array([ 1.99999 ], dtype=float32)]
  g(s) = w0 + w1 * s
  g(s) = 1.99999 + 1.000001 * s
- Notes (about Trained Network):
- Used a very simple Feedforward Network (FFN) to fully train network/weights
- It may also be used like any other classical trained predictor, by splitting the Original Sequence into a Training and Testing set and tuning the Weights to minimise the Testing Error (rather than minimising the Training set Error)
- Use of Trained Network as a Generative Model to produce new unseen elements of the sequence
- Generating Next Points in the Sequence using the Full Trained Network for the Generic Sequence
- https://www.youtube.com/watch?v=6LgdU4avFSk
- Simple Network Model
g(s) = w0 + w1 * s
- Substitute the last element of the Sequence sp into the Simple Network Model to give a Generated Output, a new point Generated using the Trained Network (may or may not be close to the true future values of the sequence)
- Repeatedly move/slide the Window forward to the Next Input (i.e. plug ^sp+1 into the network to give output ^sp+2)
- Generate Outputs
  Input      Output
  sp         ^sp+1 = g(sp)        (generated 1st point)
  ^sp+1      ^sp+2 = g(^sp+1)     (generated 2nd point)
  ...
- Applying the above to the Odd Sequence Example to generate points:
  Input            Output
  [[ 15      ]     [[ 17.0007 ]
   [ 17.0007 ]      [ 19.0009 ]
   ...              ...
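A minimal sketch of this generation loop (assumes the trained Keras `model` and the `series` array from the earlier windowing sketch; names are illustrative):
```python
import numpy as np

num_new_points = 4
window = list(series[-1:])          # start from the last element of the training sequence

generated = []
for _ in range(num_new_points):
    x_next = np.array(window[-1:]).reshape(1, 1)      # shape (1, 1) for the order-one linear model
    s_hat = float(model.predict(x_next, verbose=0))   # ^s_{p+1} = g(s_p)
    generated.append(s_hat)
    window.append(s_hat)                              # slide the window onto the generated value

print(generated)   # e.g. approximately [17.0, 19.0, 21.0, 23.0] for the odd-number sequence
```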
-
Graphical Model View
-
of the simple Linear Network showing how each element of the Sequence is related to the shared Weights w0 and w1
-
https://youtu.be/I72EOcAroFk?t=2m44s
-
- Summary of steps
- Example 2 - ReLU (Complex), where we inject recursivity into a supervised learner:
- Instead of creating the Sequence ourselves, we are given the Sequence of Values and aim to Model it using a Recursive Formula
- Suppose we have the first 50 values of the Sequence of Values
- Assume we do not know what precise Recursive Formula generates the data
(even though we really know it's
  S1 = 1, S2 = 0.5
  S3 = 0.4 * MAX(0, 1 + S1 - 0.1 * S2)        (uses S1 and S2 Inputs)
  S4 = 0.4 * MAX(0, 1 + S2 - 0.1 * S3)        (uses S2 and S3 Inputs)
  ...
  S50 = 0.4 * MAX(0, 1 + S48 - 0.1 * S49)     (uses S48 and S49 Inputs)
  Note: ORDER == 2, since it uses the 2x most recent values each time it recurses (i.e. 0.4 times the linear combo of the two previous values pushed through a ReLU function that takes an Input and Outputs the Max of that Input and 0))
- We Propose to Fit a simple Parameterised ReLU Architecture as an attempt to see how close it gets to a Recursive Formula
- Model each Non-Seed value of the Sequence as follows
(a linear combination of the two prior elements pushed through a ReLU function):
  S1 = 1, S2 = 0.5
  S3 = W0 + W1 * MAX(0, W2 + W3 * S1 + W4 * S2)
  S4 = W0 + W1 * MAX(0, W2 + W3 * S2 + W4 * S3)
  ...
  S50 = W0 + W1 * MAX(0, W2 + W3 * S48 + W4 * S49)
- Tune the Weights (W0 to W4) as we did previously, by Squaring the Difference between both sides at each Level of the Recursion, giving a number of Squared Error Terms
  (S3 - (W0 + W1 * MAX(0, W2 + W3 * S1 + W4 * S2)))^2
  (S4 - (W0 + W1 * MAX(0, W2 + W3 * S2 + W4 * S3)))^2
  ...
  (S50 - (W0 + W1 * MAX(0, W2 + W3 * S48 + W4 * S49)))^2
- Sum up the Squared Error Terms giving the Least Squares Loss Function
(since this is a Regression problem our Regressor is a
Two-Layered Feedforward Network with 1x RELU and 1x Linear
that may be viewed with a
Recursive Formula or as a Graphical Model)
∑ (t = 3 to 50) (St - (W0 + W1 * MAX(0, W2 + W3 * St-2 + W4 * St-1)))^2
- Graphical Model of Feedforward Architecture (showing how Weights are
shared b/w Regression Points)
-
https://youtu.be/ZFWOCob2gZ8?t=1m43s
-
g(St-2, St-1) = W0 + W1 * MAX(0, W2 + W3 * St-2 + W4 * St-1)
-
- Minimise the Least Squares Loss Function by transforming the
Series into a Set of Input-Output Pairs when Processing the Sequence
- Window with Input Size == 2
- Since we must substitute the Last 2 Entries of the Sequence to Predict the Next one (each input takes the past 2x elements of the sequence to predict the next one)
  Window of Length 2
  Input-Output Pair 1:  Input 1, Input 2             ->  Output 1
  Input-Output Pair 2:  Input 2, Input 3 (Output 1)  ->  Output 2
  ...
- Code to Minimise the Least Squares Loss
# Create model with Two Layers and a Least Squares Loss Function
# Minimise and recover its optimal Weights using
# Stochastic Gradient Descent
model = Sequential()
model.add(Dense(1, input_dim=2, activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='adam')

# Fit the model
model.fit(x, y, epochs=1000, batch_size=20, callbacks=callbacks_list, verbose=0)
- Preview the resulting Fit with the Training Set (after Training) of the 50 elements in the Sequence (i.e. Graph with Step x-axis, Value y-axis)
- Plot the Input-Output Pairs (Input x-axis, Output y-axis) that were
formed when using FNN-based recursive approximation method
based on the Original Sequence
- View the dependence between Inputs and Outputs. If there is dependence then it's NOT IID
- FLAWS with Feedforward Neural Networks (FNN)
- https://www.youtube.com/watch?v=IXtAGSJOpDQ
- If a Sequence consists of consecutive Independent and Identically Distributed (IID) pairs, then a change to the values of one pair of elements should not have any effect on the following values
- Pure Recursivity is the exact opposite of IID: every value depends fundamentally on those before it. The FNN approach is geared toward trying to learn dependency in the form of recursivity, but when we tune our model we end up doing the opposite and provoke Independence instead (the opposite of what we want)
- Generate new values using a Regressor beginning at the End of the Training Set and Preview them by Overlaying them on the next Actual Sequence
- Check if Generated Fit matches Original (to indicate that Learnt Model is right)
- Check Weights of the Model in Keras.
May find that generated Weights are different than the Original, but
that’s ok, since our aim was to find a Recursive Formula that explains the
behaviours of our Sequence, which we’ve done, and there may be
lots of Recursive Formulas that could be used to generate the Sequence
(more than 1x correct way to model this ReLU sequence)
  w0 = 1.886
  w1 = 1.309
  w2 = 0.305
  w3 = 0.305
  w4 = -0.641
- Window with Input Size == 2
- Interesting Notes:
-
Different Architectures may be used to approximate a given Ordered Sequence of Values that we wish to Model Recursively (i.e. many different Architectures can model a Sequence created by one particular Recursive Formula)
- Example:
Given a Sequence created as the Output of the ReLU Network, a tanh Network
  g(St-2, St-1) = W0 + W1 * tanh(W2 + W3 * St-2 + W4 * St-1)
can Fit the Sequence well, following the same procedure used before for Training on the first 50 elements.
Graph with Training Fit: https://youtu.be/Xf1oAaTd42w?t=38s
It can then Generate values that closely mirror the remainder of the true Sequence.
Graph with Generative Fit: https://youtu.be/Xf1oAaTd42w?t=43s
- Example:
-
Adding Noise (Gaussian Noise) when Generating the Sequence so it is almost Recursive (except for the noise)
S1 = 1, S2 = 0.5
S3 = 0.4 * MAX(0, 1 + S1 - 0.1 * S2) + ɛ1   (uses S1 and S2 Inputs with Noise)
S4 = 0.4 * MAX(0, 1 + S2 - 0.1 * S3) + ɛ2   (uses S2 and S3 Inputs with Noise)
-
Follow previous steps to Fit the ReLU-based Regressor to the Training Set of this Noisy Sequence, producing a Fit that performs well as Training Fit
- Graph of Training Fit
- https://youtu.be/Xf1oAaTd42w?t=1m21s
- Graph of Training Fit
-
Use the Tuned Regressor to Generate values as before as the Generative Fit
- Graph of Generative Fit
- https://youtu.be/Xf1oAaTd42w?t=2m24s
- Graph of Generative Fit
-
Note: Both the Training Fit and the Generated Fit points come from same Tuned Recursive Formula. So we should consider the Training Fit and the Generated Fit points as a SINGLE Recursive Sequence that we’re using to approximate our True Sequence as closely as possible
- Graph Recursive Sequence (combination of both Training Fit + Generative Fit)
- https://youtu.be/Xf1oAaTd42w?t=2m49s
- Graph Recursive Sequence (combination of both Training Fit + Generative Fit)
-
Note: The Original Sequence shown in black was NOT Recursive but the Recursive Sequence shown in green is by design (since was created using the Recursive Formula used to find an approximation to truth)
-
-
- Example 3 - Real Financial Time-Series Dataset
- Previously we transformed pursuit of an approximation of a Sequence into a Regression problem
- Now we’ll apply the approach to a real dataset
- Given
Given a historical stock price dataset graphed as Value VS Step: https://www.youtube.com/watch?v=UfOUisfQPZc?t=18s
Use:
- Order 5
- Linear Network Architecture
- Window size = 5
Build the architecture using Keras
  # Create model
  model = Sequential()
  model.add(Dense(1, input_dim=window_size, activation='linear'))
  model.compile(loss='mean_squared_error', optimizer='adam')
Train on the first 100 elements of the Sequence (i.e. Steps 0 to 100)
Note: We do not know of a True Recursive form of the Sequence (or if one even exists)
Aim: Resolve a formula that explains the behaviour of this Ordered Sequence Recursively (i.e. approximates it with a truly Recursive Sequence)
Training the Model allows visualising the Fit on the first 100 points
- Graph "Training Fit": https://www.youtube.com/watch?v=UfOUisfQPZc?t=1m05s
Generate, say, 40 new points using the Tuned Linear Regressor
- Graph "Generative Fit": https://www.youtube.com/watch?v=UfOUisfQPZc?t=1m07s
Check how the Generative Fit compares to the Original
- Note: It may not be a strong fit to the True Sequence since the underlying dataset is more complex than the architecture we're using for the Recursive Approximation
- Regardless of the architecture, precisely predicting the stock price MANY time periods into the future using historical price alone is impossible
- SHORT time periods into the future may be predictable based on historical price alone
If the task is to predict the stock price over SHORT periods into the future then:
- We DO NOT need the Regressor as a Generative Model
- We only need the Regressor as a Training / Testing instrument
- https://www.youtube.com/watch?v=UfOUisfQPZc?t=1m46s
i.e. Given the Trained Regressor with Window size == 5, use it to perform Predictions that are 1 period in the future on the financial time-series (see the sketch below):
- Slice the financial time-series into 2x Parts:
  - Training (already done)
  - Testing
- Test the efficiency of the predictor by Windowing the last 5x elements of the Training Set and using the Predictor to estimate the next value
- Repeat for the next unit by moving the Window forward 1x unit (using 4x units from the Tail of the Training Set and the 1x First unit of the Testing Set), and use the Predictor to estimate the next value
  - https://www.youtube.com/watch?v=UfOUisfQPZc?t=2m35s
- Repeat the above until we have all our Predictions, then overlay the Test Set Predictions on the True Sequence for visual comparison
  - https://www.youtube.com/watch?v=UfOUisfQPZc?t=3m12s
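A minimal sketch of the one-step-ahead Test Set prediction loop described above (assumes the trained Keras `model`, the full `series` array, and `window_size = 5`; names and the 100-element split are illustrative):
```python
import numpy as np

window_size = 5
split = 100                      # first 100 elements were used for Training
history = list(series[:split])   # start the window at the tail of the Training Set

test_predictions = []
for t in range(split, len(series)):
    x_window = np.array(history[-window_size:]).reshape(1, window_size)
    pred = float(model.predict(x_window, verbose=0))   # 1-period-ahead prediction
    test_predictions.append(pred)
    history.append(series[t])    # slide forward using the TRUE value, not the prediction
```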
- Recap
- Goal: Model Ordered Sequences Recursively using a Feedforward Neural Network (FNN) approach
- Approach: Resolve a Recursive Formula
- Using the Recursive Formula to construct Recursive Approximation
- Recursive Approximation used to:
- Choose Architecture (Order, Functionality)
- Break Recursion into Levels
- Windowing the Sequence (producing Input/Output Pairs)
- Minimising the Loss to Tune Parameters of this Architecture
- If Sequences are Continuous values
- Then use “Least Squares Loss”
- If Sequence is Discrete values (i.e. text-data)
- Then use “Logistic / Softmax Loss”
- If Sequences are Continuous values
- Using Tuned Regressor as Generative Model (if possible)
- Recursive Formula
- When properly Tuned, gives a 100% Recursive Sequence that approximates the True Sequence (the Original, which may or may not be recursive)
- Noted that Generative Models are not appropriate for some applications (i.e. financial time-series) where traditional Train/Test should be used
- Recurrent Neural Network (RNN) Framework Fundamentals
- Previously
- Using the FNN approach we model recursivity correctly, but when we tune the parameters we completely lose the dependence of further levels on earlier levels (they become effectively IID), which is fundamental to recursivity
-
Derivation of an RNN (which improves on the FNN without losing dependence, offering a better, more structured recursion that stresses the Dependence between levels explicitly). Known as the SimpleRNN model in Keras
-
RNNs came from the desire to enforce greater Dependency of further levels on earlier levels, where each level ingests its predecessor (Hidden States are driven by the Input Sequence); this addresses the failing of the FNN approach, which did not enforce this Dependency
-
Goal: Avoid further levels becoming Independent of earlier levels. So we must enforce more Dependency between levels by enhancing our recursion.
- Step 1:
- Re-write the earlier steps of the recursion; we want the LHS and RHS of each equality to hold as well as possible (i.e. approximately hold)
-
Add an Auxiliary Variable (aka Hidden State) (i.e. h1 to h4) at each line, since while we recurse on h we observe s, and s Drives h. They help organise the derivation and remind us that the RHSs of the levels, taken together, actually define a Sequence based on their Input and Parameter sides
  s1 = h1 = α
  s2 ~= h2 = g(s1)
  s3 ~= h3 = g(s2)
  s4 ~= h4 = g(s3)
- Step 2
-
Remove the LHS of each level, since we want to approximate the True Sequence and get it out of our field of view
  h1 = α
  h2 = g(s1)
  h3 = g(s2)
  h4 = g(s3)
-
Now our aim is to Tune the function so the Hidden values approximate the values s2, s3, and s4 that we just removed
-
- Step 3
- Adjust our Recursion to enforce Dependency between the levels
(avoid Independence across consecutive levels, as was the failing of FNNs)
by Forcing consecutive-level Dependency
(i.e. so each level after the Seed is functionally Dependent on
the preceding level, i.e. 4th level functionally Dependent on the 3rd, etc)
- Force Dependency by making Architecture ingest the previous level (plugging in 3rd line into 4th line of the Architecture). i.e. for 4th level using any parameterised function of 2x Inputs: s3 (as usual) and h3 (for Dependency)
h1 = α
h2 = g(s1) = f(h1, s1)
h3 = g(s2) = f(h2, s2)
h4 = g(s3) = f(h3, s3)
...
ht = g(st-1) = f(ht-1, st-1)
- i.e.
f(h3, s3) = tanh(w0 + w1 * h3 + w2 * s3)
- Adjust our Recursion to enforce Dependency between the levels
(avoid Independence across consecutive levels, as was the failing of FNNs)
by Forcing consecutive-level Dependency
(i.e. so each level after the Seed is functionally Dependent on
the preceding level, i.e. 4th level functionally Dependent on the 3rd, etc)
- Step 4
- https://youtu.be/Y3-YuSbhbQM?t=6m15s
- Roll-up the recursion, showing the RNN in the form of a Hidden Sequence that is Driven by an Input
(i.e. taking a Sequence s and recursively Driving a Sequence h). This was covered previously
  h1 = α
  ht = f(ht-1, st-1), t >= 2
- h is Hidden since it was not directly observed but was instead Driven using an Input Sequence s
- Previously using the FNN-model we turned the definition of recursivity on its head and used it to develop a recursive approximation to our Input.
- Now, with RNNs we've turned the Hidden Sequence concept on its head, by taking the Hidden Sequence model and fitting it to our Input (by tuning f to approximate the Driver s as well as possible) and using it to develop a recursive approximation to our Sequence
- Plot Hidden h vs Driver s
- Step 1:
-
- Formulate Least Squares Error/Loss using RNNs
- Given
s1 = h1 = α
s2 ~= h2 = g(s1) = f(h1, s1)
s3 ~= h3 = g(s2) = f(h2, s2)
s4 ~= h4 = g(s3) = f(h3, s3)
...
st ~= ht = g(st-1) = f(ht-1, st-1)
- Remove Hidden State variables introduced during derivation
s2 ~= f(h1, s1)
s3 ~= f(h2, s2)   i.e. = f(f(h1, s1), s2)
s4 ~= f(h3, s3)   i.e. = f(f(h2, s2), s3) = f(f(f(h1, s1), s2), s3)
...
st ~= f(ht-1, st-1)   where h2 = f(h1, s1)
- RNN level dependent on ALL previous levels (complete history
of sequence values that precede it)
(whereas shallow FNN is only dependent on immediate previous level)
using Hidden State of the previous level
- i.e. s3 dependent on s2 and s1
- i.e. s4 dependent on s3, s2 and s1
- i.e.
- RNN level dependent on ALL previous levels (complete history
of sequence values that precede it)
(whereas shallow FNN is only dependent on immediate previous level)
using Hidden State of the previous level
- Make these approximate equalities hold as tight as possible
by Squaring the Error at each level and add them up
(s2 - f(h1, s1))^2
(s3 - f(h2, s2))^2
(s4 - f(h3, s3))^2
...
(st - f(ht-1, st-1))^2
- Minimise the Sum over the first P elements of the Sequence to get a Least Squares Error/Loss
  ∑ (t = 2 to P) (st - f(ht-1, st-1))^2
-
Note 1: Broke into levels that are each explicitly Dependent on each other (unlike with the FNN approach where we lost it)
-
Note 2: When using Architectures with Bounded Output, i.e. f(h, s) = tanh(w0 + w1 * h + w2 * s), often used in RNNs, it is good to Minimise the Difference between each Sequence element and a Linear Combination of the corresponding Hidden State (to ensure values > 1 may be reached) by either:
- ADJUSTING the Least Squares Error/Loss Function as follows:
  ∑ (t = 2 to P) (st - (b + w * f(ht-1, st-1)))^2
- Alternatively bake-in the Linear Combination into the Recursion directly at each level
- ADJUSTING the
Least Squares Error/Loss Function as follows:
-
- Given
-
Apply the RNN Framework in Keras
- Example 1: ReLU-generated sequence
-
https://www.youtube.com/watch?v=F5PVwVrEVHY
-
Note: RNN Regressor is our generator
-
Fit RNN Architecture to the first 50 elements of the ReLU generated Sequence we saw earlier (shown ‘blue’). Use as a Sequence Generator as well (as was done with FNN-approach)
model = Sequential()
model.add(SimpleRNN(3, input_shape=(2, 1), activation='relu'))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer=optimizer)
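For the SimpleRNN layer above, the windowed inputs need a third (feature) dimension, i.e. shape (num_samples, window_size, 1). A minimal sketch of the reshape, reusing the assumed `window_transform_series` helper sketched earlier (fit settings are illustrative):
```python
import numpy as np

# order-2 windowing to match input_shape=(2, 1)
X, y = window_transform_series(series, window_size=2)
X_rnn = X.reshape(X.shape[0], 2, 1)   # (samples, time steps, features) expected by SimpleRNN

model.fit(X_rnn, y, epochs=1000, batch_size=10, verbose=0)   # illustrative settings
```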
- Plot the Original Sequence, Training Fit, and Generative Fit
-
- Example 2: Apply RNN to fitting a Financial time-series dataset
-
https://www.youtube.com/watch?v=F5PVwVrEVHY
- Fit first 2/3 of the dataset using below code snippet:
model = Sequential()
model.add(SimpleRNN(1, input_shape=(5, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer=optimizer)
-
Plot Original Sequence, Training Fit, Testing / Generative Fit
- Note:
- Recursive models (RNNs) typically used for short-term
predictions on financial time-series datasets.
- Windowing longer sequences is required as a practical matter, not an architectural one
- With financial time-series its more appropriate to use the Regressor as a more traditional Training/Testing tool rather than just a Pure Generator given the complexity of the phenomenon
- Recursive models (RNNs) typically used for short-term
predictions on financial time-series datasets.
-
- Example 1: ReLU-generated sequence
- Previously
-
Recurrent Neural Network (RNN) Framework - Characteristics
-
Related to sequences and lists
-
Memory (loops allow info to persist) and Dependency
-
https://www.youtube.com/watch?v=0B8O2eNv2DY
-
Compare RNN and FNN
-
RNN more expressive and data-driven than FNN (by explicit modelling of dependencies between consecutive levels of the recursion)
-
An RNN level is dependent on ALL previous levels (the complete history of sequence values that precede it) via the Hidden State of the previous level, so it has MEMORY (whereas a shallow FNN is only dependent on the immediately previous level)
-
RNNs have more Memory since each Hidden State contains a complete History of the Input Sequence up to that point
-
-
-
-
Recurrent Neural Network (RNN) Framework - Graphical Models
-
https://www.youtube.com/watch?v=LON9wniFUiE
-
Used Graphical Models to view
-
Unfolded View of recursions
h1 = α
h2 = f(h1, s1)
h3 = f(h2, s2)
h4 = f(h3, s3)
...
ht = f(ht-1, st-1)
-
Unfolded Graphical View of a Purely recursive sequence that is a Hidden Sequence h driven by s
-
Adding a Prediction ^st means the Graphical Model must denote a Supervised Learner, showing where information flows when using Gradient Descent
-
-
Folded View (compact) of recursions
h1 = α
ht = f(ht-1, st-1), t >= 2
- Folded Graphical View of the model of a driven Hidden Sequence, but when adding ^st we know the model is being used as a Predictor
-
-
- Recurrent Neural Network (RNN) Framework - Training - Technical issues
- Technical issues
- Optimisation
- Vanishing Gradient problem (affects FNNs and RNNs)
-
Mitigation
-
Regularising Activation Functions or different level Architectures such as Long Short Term Memory (LSTM)
-
Variations of Stochastic Gradient Descent (SGD) (modifications to avoid the issue)
-
Basic concerns about depth (since each State in an RNN adds a Hidden Layer to the corresponding unrolled network) may be mitigated by Windowing: cutting Longer Sequences into Shorter Sequences and treating them as a Batch
-
-
- Vanishing Gradient problem (affects FNNs and RNNs)
-
Memory life
-
Data requirements
- Deep Networks (function approximators like RNNs) for high performance require large datasets for their expressive power to show cutting edge results
i.e. text-generation with ‘000s of datapoints to play with
- Optimisation
- Technical issues
- RNN Summary
-
https://www.youtube.com/watch?v=EFrAo74C8Ow
-
Notes: Activation Functions
- Sigmoid Layer
- decides what parts of the cell state we’re going to output (i.e. forget is 0, keep is 1)
- Tanh Layer
- put the cell state through tanh (to push the values to be between −1 and 1) as candidates
- ReLU
- outputs the maximum of 0 and the input value
- Sigmoid Layer
-
- Links
- RNNs from Deep Learning https://github.com/angelmtenor/deep-learning/blob/master/intro-to-rnns/Anna_KaRNNa.ipynb
-
Long Short Term Memory Networks (LSTMs)
- Slides
- Architecture of LSTMS
- https://www.youtube.com/watch?v=70MgF-IwAr8
- https://www.youtube.com/watch?v=gjb68a4XsqE
- Architecture of RNN
- https://www.youtube.com/watch?v=ycwthhdx8ws
- Architecture of LSTMS
- Definition:
-
Useful when neural network needs to switch between remembering recent things, and things from long time ago
-
AWESOME Links about LSTMs
-
WOW - http://colah.github.io/posts/2015-08-Understanding-LSTMs/
-
WOW - http://blog.echen.me/2017/05/30/exploring-lstms/
-
Augmented RNNs & Attention http://distill.pub/2016/augmented-rnns/
-
Other RNN/LSTM with Python Code character-level language models http://karpathy.github.io/2015/05/21/rnn-effectiveness/
-
Translation https://arxiv.org/pdf/1508.07909.pdf
-
- Issue with RNNs
- RNNs mostly store Short-Term Memory
- Key past information we’re trying to predict with may be in an earlier RNN (in sequence of RNNs) but input information is squished repeatedly by Sigmoid functions (that output numbers between 0 and 1 describing how much each component should be let through) and lost, so we need Long Term Memory
- Training a network (of multiple RNNs) using Backpropagation (recursive application of the chain rule from calculus) all the way back to much earlier RNN may lead to problems like the Vanishing Gradient
-
Solution - LSTM Networks
-
Input to LSTM has both Long-Term Memory and Short-Term Memory that both get merged at each stage with current Event (what we just saw) (protects old information better)
-
Goal of LSTM Node Architecture of Gates:
- Create Prediction
- of what the image is (i.e. long-term memory may be required to know this) by combining Long-Term Memory, Short-Term Memory, and Event
- Update with New Long Term Memory for next stage
- by merging Long-Term Memory and Short-Term Memory and Event
- Update with New Short-Term Memory for next stage
- by merging Long-Term Memory and Short-Term Memory and Event
- Create Prediction
-
Output is a prediction of what the Input is, and also forms part of the Input for the next iteration of the Neural Network
-
-
Architecture of LSTMs
-
Gates in LSTM Node Architecture:
-
Summary of All Gates using an arbitrary Architecture that is known to work
-
https://www.youtube.com/watch?v=IF8FlKW-Zo0
-
TODO - Invent new Architectures that actually WORK as this spaces is under development
-
-
Forget Gate
-
https://www.youtube.com/watch?v=iWxpfxLUPSU
-
Input is Long-Term Memory (LTM), which is multiplied by a Forget Factor ft (to forget everything no longer considered useful)
- ft is calculated with a small 1-Layer Neural Network that combines the inputs STM and E with a Linear Function:
  ft = σ(Wf[STMt-1, Et] + bf)
-
-
Learn Gate
-
Slide https://www.youtube.com/watch?v=aVHVI7ovbHY
- Combines the inputs Short-Term Memory (STM) and Event (E) by joining the Vectors (STM and E), multiplying by a matrix of Weights (W), adding a Bias (b), and squishing the result with a tanh Activation Function to create New Info (N):
  Nt = tanh(Wn[STMt-1, Et] + bn)
- Forgets/Ignores any unnecessary info by multiplying N element-wise by an Ignore Factor Vector it, so the Learn Gate output is Nt * it
- Note: it is calculated by building a small Neural Network that accepts Inputs of Short-Term Memory and Event, passing them through a Linear Function (with a new Weights matrix and a new Bias) and squishing with a Sigmoid Function to keep values between 0 and 1:
  it = σ(Wi[STMt-1, Et] + bi)
-
-
Remember Gate
-
https://www.youtube.com/watch?v=0qlm86HaXuU
- Accepts combination of
- Forget Gate (input is LTM)
- Learn Gate (input is combined STM and E)
- Outputs a New Long Term Memory
LTMt = LTMt-1 * ft + Nt * it
-
-
Use Gate
-
https://www.youtube.com/watch?v=2kDufi6FDjU
-
Decides what information to use from what we previously knew plus what we now know, and uses it to make a Prediction
- Accepts combination of
- Forget Gate (input is LTM)
- Learn Gate (input is combined STM and E)
- Outputs a New Short Term Memory (which is the Prediction)
- Input to the FORGET GATE is LTMt-1
- Output of the FORGET GATE is passed through a small Neural Network #1 that uses the tanh Activation Function:
  Ut = tanh(Wu * (LTMt-1 * ft) + bu)
- Inputs STM and E are applied to another small Neural Network #2 using the Sigmoid Activation Function:
  Vt = σ(Wv[STMt-1, Et] + bv)
- The Final Output multiplies the Outputs of small Neural Network #1 and small Neural Network #2 together:
  STMt = Ut * Vt
-
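A minimal numpy sketch of the Forget / Learn / Remember / Use gate equations above (vector shapes and weight names are illustrative, not from the course code):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(LTM_prev, STM_prev, E, W_f, b_f, W_i, b_i, W_n, b_n, W_u, b_u, W_v, b_v):
    """One LSTM step following the gate equations in the notes above."""
    x = np.concatenate([STM_prev, E])            # joined [STM_{t-1}, E_t] vector

    f_t = sigmoid(W_f @ x + b_f)                 # Forget factor
    i_t = sigmoid(W_i @ x + b_i)                 # Ignore factor
    N_t = np.tanh(W_n @ x + b_n)                 # New info (Learn gate)

    LTM_t = LTM_prev * f_t + N_t * i_t           # Remember gate: new Long-Term Memory

    U_t = np.tanh(W_u @ (LTM_prev * f_t) + b_u)  # Use gate, part 1 (from the Forget gate output)
    V_t = sigmoid(W_v @ x + b_v)                 # Use gate, part 2 (from STM and E)
    STM_t = U_t * V_t                            # new Short-Term Memory (the prediction)

    return LTM_t, STM_t
```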
-
-
-
- Slides
-
Other Architectures that work
-
Slide https://www.youtube.com/watch?v=MsxFDuYlTuQ
-
Gated Recurrent Unit (GRU)
-
LSTM with Peephole Connections
- Previously
- Forget Factor calculated as combo of STM and E (but LTM was not included in decision)
- Now
- Connect the LTM into the Neural Network that calculates the Forget Factor (makes decisions inside the LSTM), where mathematically the Input to Sigmoid is larger since we’re concatenating it with LTM matrix
- Previously
-
-
-
RNN Project
- Time Series Prediction and Text Generation.
- Goal: Use RNNs and LSTMs for two major purposes:
- Predict stock prices.
- Generate Sherlock Holmes text.
- Goal: Use RNNs and LSTMs for two major purposes:
- Time Series Prediction and Text Generation.
- About
Natural Language Processing (NLP)
-
NLP Pipeline
- My Code Examples of it all:
- https://github.com/ltfschoen/AIND-NLP
- Stages
- Text Processing
- Raw text input
- Pandas working with text
- https://pandas.pydata.org/pandas-docs/stable/text.html
- Sources
- Website textual content
- Raw HTML markup
- PDFs
- Word docs
- Speech recognition system
- Book scanned with OCR
- Website textual content
- Pandas working with text
- Build text processing functions
- Clean
- Python Regular Expressions
- https://docs.python.org/3/library/re.html
- Remove HTML tags
- Parse HTML using BeautifulSoup to
extract text without tags
(since Regular Expressions not suitable)
- https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Use Beautiful Soup to walk the DOM tree
- Parse HTML using BeautifulSoup to
extract text without tags
(since Regular Expressions not suitable)
- Remove non-relevant data
- Remove source-specific markers
- Retain only plain text
to reduce complexity of procedures
(i.e. and, the, are, of)
- Use NLTK Stopwords
- Python Regular Expressions
- Normalise
- Lowercase (so each word represented by unique token)
- Remove punctuation
- for Document Classification and Clustering where low-level details don’t matter much
- Replace with Space so words don’t concatenate
- Remove extra spaces
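A minimal sketch of the Clean + Normalise steps above (the HTML string is illustrative; assumes the `bs4` package is installed):
```python
import re
from bs4 import BeautifulSoup

raw_html = "<p>The first Udacity review is <b>great</b>!</p>"   # illustrative input

# Clean: strip HTML tags by walking the parse tree rather than using Regular Expressions
text = BeautifulSoup(raw_html, "html.parser").get_text()

# Normalise: lowercase, replace punctuation with spaces, collapse extra spaces
text = text.lower()
text = re.sub(r"[^a-z0-9]", " ", text)
text = re.sub(r"\s+", " ", text).strip()

print(text)   # "the first udacity review is great"
```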
- Tokenise (aka Symbol)
- http://www.nltk.org/api/nltk.tokenize.html
- Split text into Tokens (Sequence of Words)
- Remove common words that don’t offer meaning to reduce complexity and still may be inferred (aka NLTK Stop Words)
- NLTK
- Smart way of tokenising
- Even includes one for parsing Twitter handles, hashtags, emoticons, etc
- Process Words
- Identify different Parts of Speech using NLTK pos_tag:
- Parts of Speech (Nouns, Verbs, Named Entities)
- Understand words in sentence to better understand what’s being said
- Identify Relationships between Words
- Identify Cross-References between Words
- Named Entities are Noun phrases that refer to specific Object,
Person, or Place. Use
ne_chunk
to label Named Entities in text- Usage: Index and search for news articles of companies that are of interest
- Parts of Speech (Nouns, Verbs, Named Entities)
- Convert Words into Canonical forms (to further simplify and normalise
different variations of words) using:
- Stemming
- Dfn: Reducing a word to its ‘stem’ (aka root form) to reduce complexity
whilst retaining essence of meaning of Words
- i.e. branch is the root of branches, branching, branched, since all convey the same thing.
- i.e.
- Note: Important that all Words are reduced to the SAME STEM since
captures same common idea (ok spelling mistake in root)
- i.e. cach is ok as the root of caches, caching
- i.e.
- NLTK Stemming Types include:
- Porter
- Snowball
- Language-specific Stemmers
- Dfn: Reducing a word to its ‘stem’ (aka root form) to reduce complexity
whilst retaining essence of meaning of Words
- Lemmatisation
- Dfn: Reduces words to normalised root form, but uses a Dictionary to
map different variances of a Word back to its root to overcome
non-trivial inflections
- i.e. be is the root of is, were, was
- i.e. one is the root of the plural ones
- i.e.
- NLTK Lemmatisers
- WordNet database (Default)
- Usage: Initialise instance of WordNet Lemmatiser and pass
individual words to the
lemmatize
method - Note:
- Lemmatisers may be more Memory intensive than Stemming since stores in Dict
- Lemmatisers' final form is a meaningful root word (i.e. cache, not cach as would be produced by a Stemmer)
- A Lemmatiser must make assumptions about the Part of Speech (PoS)
for each word it’s trying to transform. i.e. WordNet Lemmatiser
defaults to Nouns, which may be overridden by specifying the
parameter
pos='v'
for Verbs - Chained procedures are often used
- Dfn: Reduces words to normalised root form, but uses a Dictionary to
map different variances of a Word back to its root to overcome
non-trivial inflections
- Stemming
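A minimal NLTK sketch of the tokenise / stop-word / stem / lemmatise steps above (the input string is illustrative; assumes the punkt, stopwords, and wordnet corpora have been downloaded):
```python
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')  # one-time

text = "the quick brown foxes were jumping over the branches"   # illustrative, already cleaned/normalised

tokens = word_tokenize(text)                                          # split into word Tokens
tokens = [t for t in tokens if t not in stopwords.words('english')]   # drop Stop Words

stems = [PorterStemmer().stem(t) for t in tokens]                     # e.g. 'branches' -> 'branch'
lemmas = [WordNetLemmatizer().lemmatize(t, pos='v') for t in tokens]  # e.g. 'were' -> 'be' with pos='v'
```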
- Identify Different using NLTK
- Transform ready for next stage
- Raw text input
- Feature Extraction
- Build feature extractors
- Text
- Considerations
- Text data is represented on modern computers using an Encoding (i.e. ASCII, Unicode) that maps each character to a number, stored and transmitted as Binary, which has an implicit ordering (i.e. 65 < 67)
- Incorrect to assume that A < B < C since may mislead our NLP algorithms
- Words carry meaning of concern, NOT individual Characters
- Computers DO NOT have standard representation for Words (since just Sequences of ASCII or Unicode values without Meaning or Relationships captured between Words)
- Goal
- Generate a representation for Text Data
(similar to Pixels used for images) that
we may use as Features for Modelling
- Depends on Model we’re using
- Graph-based Model to extract
insights
- Words represented as symbolic Nodes with Relationships between them (like Coordinates)
- Statistical Models
- Numerical representation required to represent Words
- Graph-based Model to extract
insights
- Depends on Task trying to accomplish
- Document-level task
- Spam Detection
-
Sentiment Analysis
- Per-Document Representation
- Bag-of-Words (BoW)
- Dfn: Treats each Document (unit of text to analyse)
as an Unordered Collection (Bag of Words)
- Example: Document types
- Compare essays prepared by students for plagiarism, then each essay would be a Document
- Analysing the sentiment conveyed by tweets then each tweet would be a Document
- Example: Document types
-
Issue: https://www.youtube.com/watch?v=LYYWIrWbBq4
-
BoW Treats each Word as being equally important, but intuitively we know some Words occur more frequently in a Corpus.
- Solution:
-
TF-IDF assigns Weights to words that signifies their relevance in documents
-
About TF-IDF approach:
- Count the Document Frequency of each Term (the number of Documents in which that Term, aka Column of the Document-Term Matrix, occurs)
- Divide the Term Frequencies by the Document Frequency associated with that Term (giving a Metric that’s proportional to the frequency of occurrence of a term in a document, but inversely proportional to number of documents it appears in).
- Highlights words that are more Unique to a document (with higher value) and are better for characterising it
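A minimal scikit-learn sketch of the TF-IDF weighting described above (the corpus strings are illustrative; note that scikit-learn's TfidfVectorizer uses a smoothed variant of the formula):
```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the first document about dogs",
    "the second document about cats",
    "dogs and cats together",
]   # each string is one Document

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)        # Document-Term Matrix with TF-IDF weights

print(vectorizer.get_feature_names_out())       # the Terms (Columns)
print(tfidf.toarray())                          # higher weight = more unique to that Document
```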
-
- Usage of BoW or TF-IDF Representation
- Document Classification Task
- Spam Detection
- Use TF-IDF Vectors as Features as well as labels “Spam” and “Not Spam” to setup a Supervised Learning Problem
- Spam Detection
- Document Classification Task
-
- Steps
- To obtain a Bag of Words from raw text we apply the
Text Processing steps
(cleaning, normalising, splitting into words,
stemming, lemmatisation, etc), then:
- INEFFICIENT: treating the
resulting Tokens as an Unordered Collection
(aka a Set), where multiple occurrences are not counted,
i.e. "Little house on little Prairie" --> { "littl", "hous", "prairi" }
- EFFICIENT: Document-Term Matrix
(illustrates relationship between Documents is Rows and
Words/Terms in Columns where each element is a
Term Frequency i.e. how frequently term occurs in a document)
https://www.youtube.com/watch?v=A7M1z8yLl0w
Convert each Document
(where a set of Documents is a Corpus)
into a Vector of numbers that represents how many times
a Word occurs in a Document
- Collect all Unique Words present in Corpus (C) to form Vocabulary (V), arrange them in some order as Vector element positions OR Table Columns, then assume each Document is a Row, then count the number of occurrences of each Word in each
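A minimal sketch of building a Document-Term Matrix of raw Term Frequencies (the corpus strings are illustrative):
```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "little house on the little prairie",
    "the house of the rising sun",
]   # each Document becomes a Row

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(corpus)      # rows = Documents, columns = Terms (the Vocabulary)

print(vectorizer.get_feature_names_out())   # the Vocabulary V collected from the Corpus C
print(dtm.toarray())                        # each element is a Term Frequency
```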
- INEFFICIENT to then treat the
resulting Tokens as an Unordered Collection
(aka a Set), and multiple occurrences not included,
- Usage:
- Compare two documents based on how many words they
have in common (or how similar Term Frequencies are).
-
BAD - Mathematically performed by calculating the Dot Product between two row vectors, which equals the sum of the products of the corresponding elements (the greater the Dot Product, the more similar the two vectors are). The flaw of the Dot Product is that it only compares the values that overlap and is not affected by other values that aren't in common, so different pairs may get the same Dot Product as identical pairs.
-
EFFICIENT - Cosine Similarity https://youtu.be/A7M1z8yLl0w?t=3m15s, where we divide the Dot Product of two Vectors by the product of their magnitudes (or Euclidean norms); when Vectors are thought of as arrows in n-dimensional space, this equals the Cosine of the angle Theta between the two vectors.
- Most Similar - Identical Vectors - Cosine output: 1
- No Similarity - Orthogonal Vectors - Cosine output: 0
- Opposite - Exactly Opposite Vectors - Cosine output: -1
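A minimal numpy sketch of the Cosine Similarity calculation above (the vectors are illustrative Term Frequency rows):
```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms (magnitudes)."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

doc1 = np.array([2, 1, 0, 1])   # Term Frequencies for Document 1
doc2 = np.array([1, 1, 1, 0])   # Term Frequencies for Document 2

print(cosine_similarity(doc1, doc1))   # 1.0  (identical vectors)
print(cosine_similarity(doc1, doc2))   # between -1 and 1
```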
-
- Compare two documents based on how many words they
have in common (or how similar Term Frequencies are).
- To obtain a Bag of Words from raw text we apply the
Text Processing steps
(cleaning, normalising, splitting into words,
stemming, lemmatisation, etc), then:
- Treat each document like a BoW
allows computing simple statistics that
characterise it, where the Statistics may be
improved by assigning appropriate Weights
to Words using a TFIDF Scheme
(Term Frequency–Inverse Document Frequency) for more
accurate comparison between documents.
- i.e. certain apps need Numerical representations of individual Words, using the Word Embeddings method
- One-Hot Encoding
  - https://www.youtube.com/watch?v=a0j1CDXFYZI
  - Background:
    - Previously we characterised an entire Document (Collection of Words) as one Unit, where inferences made are at Document-level
      - Document Topics
      - Document Similarity
      - Document Sentiment
  - Purpose:
    - Deeper Analysis of Text requires a Numerical representation for each Word (treat each word like a Class, and assign each a Vector that has 1 in the position matching the word and 0 elsewhere)
    - Similar to BoW, but we keep a Word in each Bag and build a Vector for each
  - Issues
    - Doesn't work when we have a Large Vocabulary to deal with, since the size of the Word representation grows with the qty of Words, so need to use:
      - **Word Embeddings** as a way to **Control the Size of the Word Representation by limiting it to a fixed-sized Vector** (i.e. find an embedding for each Word in Vector space that exhibits desired properties)
        - i.e.
          - if two words are similar in meaning they should be closer together than those that aren't
          - if two pairs of words have a similar difference in their meaning they should be separated by a similar distance in vector space
- **Word2Vec** (Word Embedding type)
  - Dfn: Popular **Word Embedding** used in practice, transforming Words into Vectors
  - Approaches:
    - https://www.youtube.com/watch?v=7jjappzGRe0
    - **Continuous Skip-Gram Model**
      - About:
        - Model that, given a Middle Word, predicts the Neighbouring Words in a Sentence; such a Model is likely to capture the contextual meaning of Words
      - Steps
        - Pick any Word in a Sentence
        - Convert the Word into a One-Hot Encoded Vector
        - Feed the One-Hot Encoded Vector into a Neural Network (or other probabilistic model) that's designed to Predict the surrounding words (its context)
        - The Loss Function optimises the Weights or parameters of the Model; repeat until it learns to predict context words as best possible
        - Note: Take an intermediate representation such as a Hidden Layer in the Neural Network, then the Outputs of that layer for a given word become the corresponding "Word Vector"
    - **Continuous Bag of Words Model**
      - About
        - Model that, given some Neighbouring Words in a Sentence, predicts the Word in the Sentence; likewise likely to capture the contextual meaning of Words
  - Robust representation of words since the Meaning of each Word is distributed throughout the Vector
  - Vector size is independent of vocabulary (the size of the word vector is up to us, trading performance vs complexity, but the **Vector size remains Constant no matter how many Words we Train on**). Note that this differs from Bag of Words, where Vector size grows with the number of unique words.
  - Pre-Train a Large number of Word Vectors, so they can be used Efficiently without having to transform repeatedly, since they are Trained once and Stored in a Lookup Table
  - Ready for use in Deep Learning Architectures
    - i.e. may be used as an Input Vector for RNNs
    - Possible to use RNNs to learn even **better Word Embeddings**
  - **Optimisations** possible to further reduce Model and Training Complexity, such as
    - Representing Output Words using:
      - **Hierarchical Softmax**
    - Computing Loss using:
      - **Sparse Cross Entropy**
- **GloVe (Global Vectors for Word Representation)** (Word Embedding type)
  - https://www.youtube.com/watch?v=KK3PMIiIn8o
  - About
    - TODO - https://nlp.stanford.edu/pubs/glove.pdf
    - Directly Optimise the Vector representation of each Word using **Co-Occurrence Probability Statistics** (with Context and Target Word occurrences) to capture Similarities and Differences between Words
    - Differs from Word2Vec (which sets up an auxiliary word prediction task)
    - Note: Use the Log of the values since the values of the Co-Occurrence Probabilities are small
    - Note: A Refinement over using Raw Probabilities is to Optimise for the Ratio of Probabilities
- **Distributional Hypothesis**
  - https://www.youtube.com/watch?v=gj8u1KG0H2w
  - Words occurring in the same Context tend to have similar Meanings
  - When a large context of Sentences is used to Learn a Word Embedding, Sentences with common context Words yield Vectors that are closer together
  - Add another **Dimension** in the Word Vectors to capture **Differences** and **Similarities** where Word meanings vary, making the Word Vector more expressive
- **Example: Neural Network Architecture for NLP task of Predicting a Word** (see the sketch below)
  - Add a Word Embedding Layer and apply **Transfer Learning**
    - Narrow scope model (i.e. medical terminology)
    - Broad scope model
  - RNN Layer Example https://youtu.be/gj8u1KG0H2w?t=3m36s
    - Use a Pre-Trained Word Embedding as a Lookup (i.e. Word2Vec or GloVe)
    - Then only need to Learn/Train the later Recurrent Layers specific to our task
      ```
      - One-hot encoded word
      - Word Embedding - Lookup (Word2Vec or GloVe)
      - Word Vector
      - Learn
        - Recurrent Layers
        - Dense Layers
      - One-hot encoded output
      ```
- **t-SNE (t-Distributed Stochastic Neighbor Embedding)**
  - Dfn:
    - Dimensionality reduction technique that maps high-dimensional vectors to a lower-dimensional space; useful for **Visualising Word Embeddings** since it preserves the linear Substructures and Relationships learnt by the Word Embedding Model
    - Clusters groups of Words or Images according to associated Class Labels
    - Tool for better understanding the representation that a network learns and for identifying bugs
    - Similar to Principal Component Analysis (PCA) but adds an extra property when performing the transformation, whereby it tries to maintain relative distances b/w objects so
      - Similar objects stay close together
      - Dissimilar objects stay apart
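A minimal Keras sketch of the Embedding-then-Recurrent-layers architecture above (vocab size, embedding dimension, sequence length, and layer sizes are illustrative; a pre-trained Word2Vec/GloVe matrix could be supplied to the Embedding layer via its `weights` argument):
```python
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

vocab_size = 10000      # illustrative vocabulary size
embedding_dim = 100     # illustrative fixed-size word vector length
sequence_length = 20    # illustrative number of words per input sequence

model = Sequential()
# Embedding layer acts as the Word Vector lookup; set trainable=False to reuse pre-trained embeddings
model.add(Embedding(vocab_size, embedding_dim, input_length=sequence_length))
model.add(SimpleRNN(64))                             # Recurrent layer learnt for our specific task
model.add(Dense(vocab_size, activation='softmax'))   # one-hot style output over the vocabulary
model.compile(loss='categorical_crossentropy', optimizer='adam')
```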
- Dfn: Treats each Document (unit of text to analyse)
as an Unordered Collection (Bag of Words)
- Doc2Vec
- Bag-of-Words (BoW)
- Individual Words and Phrases for
Text Generation and Machine Translation
- Word-level Representation
(i.e. fox -> 0.4,0.7,0.1,0.5
dog -> 0.4,0.5,0.2,0.6)
- Word2Vec
- GloVe
- Word-level Representation
(i.e. fox -> 0.4,0.7,0.1,0.5
dog -> 0.4,0.5,0.2,0.6)
- Document-level task
- Depends on Model we’re using
- Generate a representation for Text Data
(similar to Pixels used for images) that
we may use as Features for Modelling
- Considerations
- Images
- Images stored in computer memory, where each
pixel contains the relative intensity of light
at a location in the image.
- Colour images have 1x value per Primary colour Red, Blue, Green, that carry relevant info, so Two Pixels with similar values are perceived similar, so is OK to use Pixel values in Numerical Model (after Edge Detection and Filtering)
- Images stored in computer memory, where each
pixel contains the relative intensity of light
at a location in the image.
- Text
- Extract/produce relevant feature representations
that are:
- appropriate for model type planning to use
- appropriate for the NLP task we're trying to accomplish
- Build feature extractors
- Modelling
- Dfn and Usage:
- Observations in a form that allows us to understand them better and predict new unseen occurrences
- Build models that achieve various NLP tasks:
- Classification Models
- Sentiment Analysis
- Spam Detection
- Topic Modelling
- Grouping Related Documents
- Ranking
- Improving Search Relevance
- Machine Translation Systems
- Converting Text between languages
- Others
- Extending and adapting techniques to design an appropriate solution
- Classification Models
- Steps
- Build models for NLP tasks
- Design Baseline Model
- Statistical model
- Machine Learning model
- Fit Model Parameters to Training data using an Optimisation procedure (Known data)
- Use to make Predictions on Unseen data
- Considerations
- Numerical Features allow use of any
Machine Learning Model
- Support Vector Machines
- Decision Trees
- Neural Networks
- Custom Models (combining multiple for improved performance)
- Numerical Features allow use of any
Machine Learning Model
- Utilising
- Deploy as Web/mobile app
- Integrate with other services
- Dfn and Usage:
- Iterate
- Rethink features that are required and in turn our text processing routines
- Text Processing
- Considerations
- Dependencies between steps
- Design decisions
- Choose existing libraries and tools
- Non-linear workflow of iterating repeatedly
- Project: Machine Translation
- Different Methods:
- Rule-Based Machine Translation (RBMT) - Classical
- https://en.wikipedia.org/wiki/Rule-based_machine_translation
- Based on Linguistic Info about Source and Target languages retrieved from Multi-lingual Dictionaries and Grammars that cover Semantic, Morphological, and Syntactic regularities of each language
- Given Input Sentences (Language A), the RBMT System generates Output Sentences (Language B) based on Semantic, Morphological, and Syntactic Analysis of both the Source and Target Languages in the translation task
- Statistical Machine Translation
- https://en.wikipedia.org/wiki/Statistical_machine_translation
- Translations generated based on Statistical Models whose parameters are derived from the analysis of a Bilingual Text Corpus
- Example-based Machine Translation
- https://en.wikipedia.org/wiki/Example-based_machine_translation
- Bilingual Corpus with parallel texts, with translation by analogy (a case-based reasoning approach)
- Rule-Based Machine Translation (RBMT) - Classical
- Problems
- Still unsolved, just many papers
- Solutions
- Neural Networks large leap forward
- Task
- Build Deep Neural Network that functions as part of end-to-end Machine Translation Pipeline that accepts English text Input and returns French translation Output
- Different Methods:
- My Code Examples of it all: