Artificial Intelligence Nanodegree Term 1

Table of Contents

Chapter 1 - Artificial Intelligence Nanodegree Term 1

Term 2 About

  • https://medium.com/udacity/ai-nanodegree-program-syllabus-term-2-deep-learning-in-depth-d935197b66ec
  • Orientation http://go.udacity.com/aind-term2-orientation
    • Slack - #ai-deeplearning
    • Stanford Deep Learning https://www.youtube.com/playlist?list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6

Background and Definition

  • AI comprises designing and building intelligent entities/agents and requires:
    • Intelligence
    • Artifact (i.e. computer)
  • AI methods concern:
    • Human Thinking Automation (i.e. decision-making, problem solving, learning)
      • Cognitive Science Modelling Approach
        • Build a precise and testable theory, expressed as a computer program, whose input/output behaviour (including the trace of reasoning steps when solving problems) matches human performance (similar to the General Problem Solver of 1961), using:
          • Introspection (i.e. of human thoughts)
          • Psychological Experiments (i.e. observe human action)
          • Brain Imaging (i.e. observe brain action, incorporating neurophysiological evidence into computer models)
    • Mental Computation Models (i.e. perception, reasoning, action)
    • Behaviour
  • AI measures success against:
    • Human Intelligence Performance Comparison (i.e. Turing Test to see whether one can decipher if a response came from a human or a computer, extended with Computer Vision and Robotics for object perception/manipulation)
      • Computer Capabilities
        • Natural Language Processing (i.e. communicates in English)
        • Knowledge Representation (i.e. store current knowledge, store received/heard info)
        • Automated Reasoning (i.e. reason and respond to requests using stored data)
        • Machine Learning (i.e. adapt to new situations and detect/deal with patterns)
    • Rationality (i.e. ideal performance with given data and assumptions)
      • THOUGHT - Logic Notation (originating with Aristotle): reasoning with statements about any object and the relationships between them, with guidance on which reasoning steps to try first to minimise computational resources; Intelligent Computational Reasoning Systems can solve problems in “principle” (which differs from solving them in “practice”)
      • ACTION (by rational agent) -
        • Autonomous operation
        • Perception of environment
        • Adapt to change
        • Uptime maintained
        • Goal pursuit
        • Deliver best outcome/decision
        • Rationality using inference given uncertainty (using Knowledge Representation)
  • AI approach to:
    • Simple Problems
      • Hypothesis that perfect rationality is required for analysis
    • Complex Problems
      • Realise that perfect rationality is unachievable due to its high computational demands
  • Definitions (of Mind and Matter)
    • Rationalism - Reasoning to understand the world
    • Dualism - Human mind has dual qualities governed by both soul/spiritual laws and physical laws
    • Materialism - Human mind is constituted from brain operation according to laws of physics
    • Empiricism
      • Flow of Knowledge
        • Source of Knowledge
        • Physical mind uses Senses to manipulate Knowledge with reasoning to achieve Understanding
    • Logical Positivism (Doctrine) - Combines Rationalism and Empiricism whereby Knowledge arises from interconnected logical theories connected to Observation Sentences corresponding to Sensory Inputs
    • Free will - Perception of choices to a choosing entity
    • Induction Principle - Senses are reliant on rules (Confirmation Theory) when acquiring Knowledge from their Experience (exposure to repeated associations between elements)
    • Actions (occur after Reasoning) - Actions justified through logical connection between Goals, and Knowledge of an Action’s outcome (assuming the possible end outcome based on Experience)
  • Rational Decision Theory
    • Issues
      • Decision-making Rules when several Actions achieve same Goal
      • Decision-making Rules when no Action achieves Goal
      • Decision-making Rules when uncertainty
    • AI Science
      • Maths (areas contributing to AI)
        • Logic
          • Algorithms using mathematical reasoning for logical deduction. There is a procedure to prove any true statement in first-order logic of objects and relations, but first-order logic does not capture the principle of mathematical induction required to characterise the natural numbers. The Incompleteness Theorem proved that in arithmetic (elementary natural numbers) some true statements are undecidable (have no proof within the theory), so some functions on integers cannot be represented by an algorithm.
          • The Turing Machine was developed and showed that some functions cannot be computed
          • The Theory of Reference relates logical objects to real-world objects
        • Computation
          • Tractability - (easy) problems where the time required to solve an instance grows polynomially with the size of the instance. Tractable sub-problems should be approached when generating intelligent behaviour
          • Intractable (hard) problems - time required to solve instance of problem grows Exponentially with size of instances
          • Exponential growth in complexity means even moderately large problem instances cannot be solved in reasonable timeframe
          • NP-completeness Theory showed there are many NP-complete problems (combination of search and reasoning), such that if a given problem class may be reduced to one of these NP-complete problems, it is likely Intractable
        • Probability
          • Quantitative Science
            • Bayes’ Rule used for uncertain measurements/reasoning of AI systems. Statistical methods used to overcome incomplete theories
      • Science
        • Economics
          • Definitions:
            • Economies comprise individual agents making choices that lead to preferred outcomes that maximise economic well-being
            • Sellers may assert they prefer money whilst they hope Buyers prefer their goods
          • Large Economies
            • Decision Theory (framework for decisions under uncertain environment)
              • Probability Theory
              • Utility Theory
              • Usage: Suitable for large economies where individual agents should not be concerned about Actions undertaken jointly by multiple other agents
          • Small Economies
            • Game Theory
              • Usage: Suitable for small economies where individual agents may be significantly affected by Actions of other individual agents
              • Note: Discovered that rational agents should appear to adopt random policies (ambiguous prescription for Actions)
          • Other
            • Satisficing
              • Definition: “Good enough” decisions rather than optimally calculated decisions more accurately describe human behaviour
            • Optimisation of Sequence Operations
              • Definition: Rational agent decisions when multiple Actions in sequence result in payoff
        • Neuroscience
          • Brain
            • Thought, Action, Consciousness - the Mind is enabled by the Brain, a collection of neurons
              • Mysticism Theory - Alternative theory whereby the mind operates in a mystical realm beyond physical science
            • Cognitive Functions Segregated
              • Speech production (left hemisphere)
            • Neuronal Structure
              • Electrochemical signals sent via neuron junctions to control brain activity and learning
              • Information processing in the outer layer of the brain using columns of tissue 0.5mm dia / 4.0mm depth containing 20,000 neurons
              • Unknown: how the functions of damaged areas are taken over by other areas
              • Unknown: how individual memories are stored
              • Note: Single-cell neuron activity artificially stimulated (electrically, chemically, optically) for recording of neuron I/O mapping relating to cognitive processes
            • Sensory Signals
              • Publish/Subscribe mapping between brain areas and bodily parts (control output or sensory input received)
              • Note: Multiple maps apparent in some creatures that change every few weeks
              • Note: Brain activity images using EEG and MRI
          • Research Applications
            • Math Models applied to nerve cells (neurons) to study nervous system
        • Psychology
          • Behaviourism
            • Achievements: Discoveries in some animals but less with humans
            • Human Behaviour Experiments
              • Subjective methodology
                • Introspective data is subjective (an experimenter is unlikely to disagree with their own theory)
              • Note: Objective methodology applied to humans
            • Animal Behaviour Experiments
              • Objective measures of stimulus given and resulting actions/response
                • Lack of introspective data
          • Cognitive Psychology
            • Definition: Brain viewed as a CPU
            • Knowledge-based Agent Info Processing Steps
              • Translate stimulus to internal representation
              • Manipulate internal representation using Cognitive Processes to achieve new internal representation
              • Translate back into Action
          • Computer Models should mirror Cognitive Theory and address the psychology of the following by describing an information-processing mechanism that implements the cognitive function:
            • Memory
            • Language
            • Logical Thought
        • Computer Engineering
          • Future power increases from mass parallelism
          • AI work has pioneered ideas used in computer science (i.e. object-oriented programming, etc)
        • Control Theory and Cybernetics
          • Control Theory Tools
            • Calculus
            • Matrix Algebra
          • Control Theory Systems
            • Systems described by “fixed” sets of continuous variables
          • Artifact (computer) Systems
            • Self-regulating feedback control system machines (modify behaviour in response to env changes). i.e. submarine, thermostat
          • AI
            • Escapes limitations (NOT “fixed”)
            • Logical Inference and Computation Tools solve problems in:
              • Language
              • Vision
              • Planning
          • Purposive Behaviour
            • Regulatory mechanism that minimises error toward goal state
          • Intelligence AI Source
            • Homeostatic devices with feedback loops creating stable adaptive behaviour
          • Stochastic Optimal Control
            • Goal of system designs that behave optimally (maximising objective function over time)
        • Linguistics
          • Behaviourist Approach - to Language Learning
            • Ignores Creativity in language (i.e. child makes sentences)
          • Syntactic Structures & Chomsky’s Theory
          • Natural Language Processing (aka Computational Linguistics)
            • Understand sentence structure, subject matter, and context in order to understand a language, then use Knowledge Representation to convert it into a form a computer may reason with

History of AI

up to page 18

  • Python zip, map, lambda https://bradmontgomery.net/blog/pythons-zip-map-and-lambda/
  • Iterate Dictionary/Hash: for k, v in values.items():
  • Iterate Array (index and value): for i, val in enumerate(a): http://www.diveintopython.net/file_handling/for_loops.html
  • Iterate over String: [i for i in k]
  • Append to Array: elim_box.append((k, v))
  • Replace values in string: removed = original.replace("M", "") http://www.petercollingridge.co.uk/book/export/html/362
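  • A minimal sketch pulling these idioms together (variable names like values, a, and elim_box are illustrative only, not from the course code):

        # Minimal sketch of the Python idioms above
        values = {'A1': '123', 'B5': '47'}

        # Iterate over a dictionary's key/value pairs
        for k, v in values.items():
            print(k, v)

        # Iterate over a list with index and value
        a = ['x', 'y', 'z']
        for i, val in enumerate(a):
            print(i, val)

        # Iterate over the characters of a string
        chars = [ch for ch in 'A123']

        # Append a tuple to a list
        elim_box = []
        elim_box.append(('A1', '123'))

        # Remove all occurrences of a character from a string
        removed = "MAMMAL".replace("M", "")   # -> "AAL"

        # zip, map, and lambda
        pairs = list(zip([1, 2, 3], ['a', 'b', 'c']))        # [(1, 'a'), (2, 'b'), (3, 'c')]
        doubled = list(map(lambda x: x * 2, [1, 2, 3]))      # [2, 4, 6]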

Extra Curricular

  • AIMA textbook example algorithms https://github.com/aimacode/aima-python
    • Lecture 6 (right ahead of Project 2) only requires AIMA sections 5.1-5.4, which specifically reference DFS and iterative deepening from earlier chapters
  • Github projects

https://github.com/udacity?utf8=%E2%9C%93&q=aind https://github.com/udacity/artificial-intelligence https://github.com/danainschool/aima-python https://github.com/udacity/AIND-Isolation https://github.com/udacity/AIND-Planning https://github.com/ltfschoen/AIND-Sudoku https://github.com/ltfschoen/hmmlearn

Python requirements

  • itertools, list comprehensions, tuples, function with multiple return values, default values as function params, functions as first class citizen (pass functions like variables), creating classes
  • Anaconda
  • Python 3
  • Courses preparation https://www.udacity.com/course/design-of-computer-programs–cs212

OpenAI

https://github.com/openai https://openai.com/blog/

Research Papers in AI

https://github.com/songrotek/Deep-Learning-Papers-Reading-Roadmap https://github.com/terryum/awesome-deep-learning-papers?files=1 http://www.arxiv-sanity.com/ https://research.google.com/pubs/papers.html

  • Summaries already done https://docs.google.com/spreadsheets/d/1xej5Nca2xUUtrZ1GCyPjFMqI9ZgNq_OhgnTxOOMQ2js/edit#gid=404493967
  • DL related (Jack Clark at OpenAI to crowdsource via dhruvp): https://docs.google.com/spreadsheets/d/1MT8LRuEn3xdJ7j6_XQjRZwV_fRmrpjaWwZKGk2Dqm4k/edit#gid=0

Competitions in AI

  • General AI Challenge

https://www.general-ai-challenge.org/active-rounds https://discourse.general-ai-challenge.org/

Convert Markdown to PDF

  • https://gitprint.com/

Reading - Python

  • Book “Data structures and algorithms in Python”
    • Recursion chapter, Tree/graph algorithms
  • Official Python Docs - itertools and other Python packages
  • AI Overview
    • DARPA https://youtu.be/-O01G3tSYpU
  • AIMA book
    • Chapters http://aima.cs.berkeley.edu/2nd-ed/newchap11.pdf
    • Code examples for each of the algorithms https://github.com/aimacode
  • Python e-books: http://www.diveintopython3.net https://learnpythonthehardway.org/book/ http://www.deeplearningbook.org/

  • MIT AI course (similar topics to AIND) https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/lecture-videos/

Week 0

  • Kick-off event https://youtu.be/z7V7Ri6P_dA
  • About https://www.udacity.com/course/artificial-intelligence-nanodegree–nd889

Week 1

  • Guides
    • Sebastian Thrun - Udacity co-founder
      • Key takeaways:
        • Reinforces that AI is surpassing human efficiency for repetitive tasks
        • Highlights self-driving car applications, benefits (i.e. safety), and modern suppliers
        • Identifies need for multi-tasking AI platform
        • Suggests students build innovative experiments and forward them to Sebastian (i.e. using Arduino)
      • Q&A (Starring Luke Schoen)
        • https://youtu.be/AijSw9Q3oas
        • Initiative to re-write or summarise research papers so they are more easily understood by a layperson, using visuals and animations like Udacity does
          • TODO - Central repository that unifies papers into a single site
            • Interactive tool where can play with parameters with different outputs
            • i.e. create site with all animations associated with Artificial Intelligence (Udacity would like to work with people involved)
        • Cyc http://www.cyc.com/
    • Peter Norvig - Google - AI book author
    • Thad Starner - Georgia Tech - Contextual Computing & Google X researcher
    • Arpan Chakraborty - Georgia Tech - Computer Vision - AI research
    • David Joyner - Georgia Tech - AI lecturer - applies AI to tools
    • Dhruv Parthasarathy - AI Program director
    • Lisbeth Ortega - Community Manager
    • Kathleen Mullaney - Career Services
    • Rachel Higgens - Career Services
    • Trinh Nguyen - Career Services - Program Coordinator
    • Kirsten Bailer - Career Services - Employer Relations
    • Luis Serrano - Instructor/Course Developer
    • Dean Webb - Mentor Online
    • Shelly - provides Quizzes during course
    • Help: ai-support@udacity.com OR Chris LaPollo (outside enrolment)
  • Udacity Code Style Guide - https://udacity.github.io/git-styleguide/

  • Project 1 - Sudoku AI-agent
    • Build and code an AI-agent that solves any Sudoku problem
      • Learn 2x AI techniques
        • Constraint Propagation
        • Search
      • Advanced Sudoku strategy to improve agent speed
    • Based on blogpost http://norvig.com/sudoku.html
  • Project 2 - Game playing AI-agent
    • AI-agent plays games and defeats opponents in the game of Isolation
      • Learn 2x AI Strategies (used by Google’s Go agent AlphaGo)
        • Minimax
        • Alpha-beta pruning
    • AI-agent guides pacman through maze faster and eats food efficiently
      • Learn 2x AI strategies
        • Breadth-first search
        • Depth-first search
        • A* search
  • Project 3 - Planning AI-agent
    • AI-agent plans an optimal route to transport cargo from A to B (used in logistics)
  • Project 4 - Natural Language Processing ??? AI-agent
    • AI-agent automatically reads sign language characters and converts them to English
      • Learn 1x AI strategy
        • Hidden Markov Models
  • Deadlines
    • Deadline means submitted and Reviewed
  • Code Reviews - Unlimited
  • Mentor - Leadership, Guidance, SME
    • Weekly check-in to track progress.
    • Help set learning goals.
    • Guide to supplementary resources when you get stuck
    • Respond to any questions about the program
  • Office Hours https://calendar.google.com/calendar/embed?src=knowlabs.com_h7k8bibekume01f054v7ns8v1o%40group.calendar.google.com
    • instructors, content creators and other Udacity staff to answer questions about the content and projects
  • Forums - Tech Ques https://discussions.udacity.com/
    • AIND Term 1 Categories - https://discussions.udacity.com/categories
      • Projects category – post anything specific to projects in respective sub-categories there
      • Announcements specific to cohort including reminders and additional resources
      • Cafe interact with your peers
  • Community - Online/Offline Study Groups arranged by Udacity
  • Career Services - http://career-resource-center.udacity.com
    • Udacity partners
      • Google, IBM Watson, Accenture, AT&T, Priceline.com, Nvidia, Mercedes-Benz, Amazon
    • Career Path - Location, etc
    • Career Development in Classroom
      • Resume updates
      • Udacity Reviews for free
  • Other Students -
    • Sample Search https://www.google.com.au/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=code+solution+naked+twins+sudoku+aind+github
    • Solution to Sudoku Naked Twins - https://github.com/vxy10/AIND-Sudoku/blob/master/solution.py
    • https://github.com/Kiyo921/AI-sudoku
    • Project 2
      • https://github.com/philferriere/aind-projects
      • https://github.com/bdiesel?tab=repositories
      • https://github.com/ziyanfeng/AIND-Isolation
      • https://github.com/davidventuri/udacity-aind
      • https://github.com/mohankarthik?tab=repositories
      • https://github.com/mohankarthik/AIND-P2-Isolation
      • https://github.com/mohankarthik/AIND-L1-Pacman
      • https://github.com/mlgill?tab=repositories
      • https://github.com/mlgill/AIND-Isolation
      • https://github.com/mlgill/AIND-Planning
      • https://github.com/mlgill/AIND-Simulated_Annealing
      • https://github.com/morganel?tab=repositories
      • https://github.com/morganel/AIND-Isolation-1
      • https://github.com/morganel/AIND-Planning
      • https://github.com/madhavajay/nd889
      • https://github.com/frogermcs
      • arunkv
        • https://github.com/arunkv/AIND-Planning/commits/master
      • sumitbinnani
    • Recogniser project
      • DONE + Part 4 https://github.com/frogermcs
      • DONE + Part 4 https://github.com/climberwb/AIND-SignLangRecognizer
      • DONE https://github.com/IlanStefanon/AIND-Recognizer
      • DONE https://github.com/binojohnthomas/AIND-Recognizer
      • DONE https://github.com/iamsteveng/AIND-Recognizer
      • DONE + Part 4 https://github.com/ArthurLu/AIND-Recognizer
      • HALF-DONE https://github.com/curiousily/AIND-Recognizer
      • HALF-DONE https://github.com/tienthanh8490/AIND-Recognizer
      • ??? https://github.com/RenSilvaAU/AIND-Recognizer
      • ??? https://github.com/Blindshooter/AIND-Recognizer
      • No progress https://github.com/zouyu9631/AIND-Recognizer
      • No progress https://github.com/edhzsz/AIND-Recognizer
      • No progress https://github.com/AgremE/AIND-Recognizer
      • ??? https://github.com/WestonLu/AIND-Recognizer
  • Chat - Slack https://artificial-intelligence.udacity.com/
    • ai-nd.slack.com
    • LinkedIn Links https://docs.google.com/spreadsheets/d/1kdiF8weRpmaF91xqmeZ27tqf7IPgWqhpYxeynDb–SA/edit#gid=0
      • Intro @channel Hello everyone. My name is Mike Salem and I am the Nanodegree Service Lead for AI. I want to welcome each and every one of you to Artificial Intelligence!

      For introductions:

      1. Your name (first name is OK)
      2. Where are you from / located
      3. What’s your background / have you taken any AI courses in the past?
      4. Why did you sign up for the AIND? What would you like to learn?
      5. (Optional) Do you have any cool AI related things you would like to show off? be creative. It could be a github account, a youtube videos, etc.

      We want your experience at Udacity to be the best possible. So please chat with other people in the channel, make friends, and most importantly learn and have fun!

  • Bug/Issue/Problem Reporting https://waffle.io/udacity/aind-issue-reports
    • https://github.com/udacity/aind-issue-reports/issues
  • IntelliJ Python Debugger Setup

Find Miniconda Python: $ which python => /Users/Ls/miniconda3/bin/python

File > Project Structure > Platform Settings > SDKs > + > Python SDK > /Users/Ls/miniconda3/bin/python

File > Project Settings > Project > Project SDK > Python 3.6.0 (~/miniconda3/bin/python…)

Run > Debug > + > Python

  • Name: Sudoku
  • Script: /Users/Ls/code/aind/term01/lesson01/function.py
  • Use Specified Interpreter: Python 3.6.0 (~/miniconda3/bin/python…)

  • Lesson 1 - Anaconda
    • About - Version management (specifically built for data science)
    • About Conda
      • Dfn (Conda) - Package manager (install library/software) and Virtual Environment manager (similar to virtualenv and pyenv)
        • Note: pip is the default Python package manager (general packages)
        • Note: conda is Python package manager (focus on data science packages), not specific to just Python packages
      • Dfn (Anaconda) - Large software distribution that comes with Conda, python, and 150 scientific packages (data science) and dependencies
        • i.e. comes with: Numpy, Scipy and Scikit-learn compiled with the MKL library, speeding up various math operations
      • Dfn (Miniconda) - Small software distribution that comes with Conda and Python (install required packages manually)
    • Usage of Conda
      • Conda Documentation https://conda.io/docs/using/index.html
      • Cheat Sheet https://conda.io/docs/_downloads/conda-cheatsheet.pdf
      • Cheat Sheet Advanced http://know.continuum.io/rs/387-XNW-688/images/conda-cheatsheet.pdf?mkt_tok=eyJpIjoiWm1Sak56UmpZMlZsTW1aaSIsInQiOiJOc3Rud2NraTI4ODlzVmJBSnllZ0JJWk1ZSlhLbE5QOEhZUWRMMDY2NWUrNGZxalRnSXNcL2NOM3h2UVFJN0ZJV0YybGpGOFNrdEJWa0Rnd1hlRnFzXC9wdUJcL1c1bkxsWVZIT1JzXC9vRGwwSmNvREpGXC9UYk9LV01UcjV6b1Z3TlN2In0%3D
      • Another https://leifengtechblog.files.wordpress.com/2016/01/conda-cheatsheet.pdf
      • Steps
        • Create new environment for each project (separate package versions so no version conflicts) conda create -n tea_facts python=3
        • Delete env conda remove -n tea_facts --all (or conda env remove -n tea_facts)
        • Activate environment source activate tea_facts
        • Deactivate environment source deactivate
        • Show default packages installed conda list
        • Show non-Python packages conda search --canonical | grep -v 'py\d\d'
        • Update environment with new packages (i.e. to work with data, visualisations, code development) conda install numpy pandas matplotlib and conda install jupyter notebook
        • Update a package conda update ____
        • Update all packages conda update --all
        • Find a package conda search ___
        • Export list of packages in environment to file for easy loading of dependencies by others (similar to pip freeze > requirements.txt) conda env export > environment.yaml
      • Other
        • Help conda install --help
    • Installation of Conda
      • Anaconda
        • Download https://www.continuum.io/downloads
        • Cheat Sheet of Installation https://docs.continuum.io/_downloads/Anaconda_CheatSheet.pdf
      • Miniconda https://conda.io/miniconda.html
        • https://github.com/conda/conda
        • Download v3.5
        • Run with bash Miniconda3-latest-MacOSX-x86_64.sh
        • Check installation location which conda => /Users/Ls/miniconda3/bin/conda then list installed packages with conda list

             # packages in environment at /Users/Ls/miniconda3:
             #
             cffi                      1.9.1                    py36_0  
             conda                     4.3.11                   py36_0  
             conda-env                 2.6.0                         0  
             cryptography              1.7.1                    py36_0  
             idna                      2.2                      py36_0  
             openssl                   1.0.2k                        0  
             pip                       9.0.1                    py36_1  
             pyasn1                    0.1.9                    py36_0  
             pycosat                   0.6.1                    py36_1  
             pycparser                 2.17                     py36_0  
             pyopenssl                 16.2.0                   py36_0  
             python                    3.6.0                         0  
             readline                  6.2                           2  
             requests                  2.12.4                   py36_0  
             ruamel_yaml               0.11.14                  py36_1  
             setuptools                27.2.0                   py36_0  
             six                       1.10.0                   py36_0  
             sqlite                    3.13.0                        0  
             tk                        8.5.18                        0  
             wheel                     0.29.0                   py36_0  
             xz                        5.2.2                         1  
             yaml                      0.1.6                         0  
             zlib                      1.2.8                         3
          
      • List envs

      conda info --envs (or conda env list)

         # conda environments:
         #
         root                  *  /Users/Ls/miniconda3
      
      • Clone the root env

      conda create --name rootclone --clone root

      • Create env with specific Python version conda create --name aind-3.6.0 python=3.6.0

      • Checkout the aind env

      source activate aind

      • Save env (incl packages and Python version for sharing). Also include a pip requirements.txt file using pip freeze for people not using conda

      conda env export > environment.yaml
      pip freeze > requirements.txt

      • Load env from YAML env file conda env create -f environment.yaml
    • Installation of Spyder IDE
      • https://github.com/spyder-ide/spyder

      • Create subenv with Spyder IDE, update to latest versions

            source activate root
            (root) $ python --version ==> 3.6.0
            (root) $ conda upgrade conda
            (root) $ conda upgrade --all
            (root) $ conda create -n spyder python=3.6.0
            (root) $ source activate spyder

      • Install Spyder IDE in subenv

            (spyder) $ conda install qt pyqt
            (spyder) $ conda install spyder
            (spyder) $ conda upgrade --all

      • Run Spyder IDE (spyder) $ spyder
    • Python 2 vs Python 3
      • Use Python 3’s print function rather than Python 2’s print statement (surround arguments with parentheses: print("a", "b"), not print "a", "b")
      • Use new print function in old Python 2 code by importing from __future__ import print_function
    • Env variables
      • https://conda.io/docs/using/envs.html
    • Pre-load with AI Nanodegree Python environment
      • Download aind-environment-unix.yml
      • Load conda env create -f /Users/Ls/Downloads/aind-environment-unix.yml
      • Switch to AIND env source activate aind
    • Install PyGame for visualisations of AI programs for portfolio sharing

          brew install sdl sdl_image sdl_mixer sdl_ttf portmidi mercurial
          pip install pygame
    • Get PyGame library directory
        >>> import site;print(site.getsitepackages())
        ['~/miniconda3/envs/aind/lib/python3.6/site-packages']
      
    • Within IntelliJ, go to Project Structure (CMD+;)> Platform Settings > SDKs, and Add ~/miniconda3/bin/python as a Python 3.6.0 SDK home path, then add to this an additional Classpath of ~/miniconda3/envs/aind/lib/python3.6/site-packages
    • Within IntelliJ, go to Project Structure > Platform Settings > Modules
  • Project 1 - Sudoku
    • About Sudoku http://www.conceptispuzzles.com/?uri=puzzle/sudoku/rules
      • Given: 9x9 grid partially completed
      • Objective: Fill grid with missing digits so each row, each col, and each of 9 principal 3x3 sub-squares contains all digits 1-9
      • Extras:
        • Strategies other than Naked Twins (i.e. Standard + Diagonal Sudoku)
    • Goal
      • Build an AI-agent that solves all Sudoku problems and provides intro to techniques: Constraint Propagation, and Search
    • Note: AI
      • Composed of combining many simple ideas to solve complex problem
    • Dfn: Constraint Propagation
      • e.g: Extract max info from applying simple Local Constraints (i.e. to each square) to iteratively narrow search space of possible solutions
      • Risk: May never solve every puzzle (incl. hard ones)
      • Example: Elimination of possible values for the peers of each square; http://norvig.com/sudoku.html notes that even using the Naked Twins strategy we never know whether we can solve every puzzle (even hard ones)
    • Dfn: Search
      • e.g. Given two or more possibilities available, branch out and create whole tree of possibilities, then traverse tree and consider all until find solution
      • Risk: May take forever to run
      • Example:
        • Option 1: Brute Force Approach: Single machine instruction http://norvig.com/sudoku.html (i.e. try all possibilities, takes a long time)
        • Option 2: Constraint Propagation (Constraint satisfaction problem): Multiple machine instructions each processing multiple possibilities (i.e. eliminate possibilities on each try, so not need try all possibilities)
          • https://en.wikipedia.org/wiki/Constraint_satisfaction
          • https://en.wikipedia.org/wiki/Search_algorithm
          • Depth-First Search Algorithm:
            • Dfn: The Depth-First Search (DFS) algorithm traverses a graph in a depthward motion, using a stack to remember where to resume the search from when a dead end is reached in any iteration
            • Check not already found solution or contradiction
            • Select an unfilled square and try its possible values; for each possible value, check whether we can successfully search for a solution with that value (see the code sketch after the Search Implementation list below)
            • Note: This is a recursive search, since we recursively consider all possible outcomes for a value before trying a different value
            • Note: Create new copy of values for each recursive call to search so each branch of the search tree is independent and does not confuse other branches (hence may be more efficient to implement set of possible values for a Sudoku square as a string, rather than a Set or a List)
          • Search Implementation
            • Variable Ordering - i.e. choice of which square to try first
              • Minimum Remaining Values Heuristic - i.e. choose a square with min. qty of possible values
                • i.e. if choose square with 7 possibilities we expect to guess wrong with probability of 6/7, whereas if choose square with 2 possibilities we expect only 1/2 (50%) probability
            • Value Ordering - i.e. choice of which digit to try for first Sudoku square
            • Randomised Ordering Heuristic - Randomising the value ordering may avoid deadly combinations of value choices that could take 10,000 times longer to find a contradiction
            • Least-Constraining Value Heuristic - Chooses the first value that imposes the least constraints on peers
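            • A minimal sketch of this depth-first search with the Minimum Remaining Values heuristic, assuming the boxes list and values dict built under Naming Conventions further below, and a reduce_puzzle() constraint-propagation helper sketched later under Constraint Propagation (not the official project solution):

                  def search(values):
                      """Depth-first search with constraint propagation: propagate first,
                      then branch on the box with the fewest remaining possibilities (MRV)."""
                      values = reduce_puzzle(values)   # assumed constraint-propagation helper
                      if values is False:
                          return False                 # contradiction found while propagating
                      if all(len(values[b]) == 1 for b in boxes):
                          return values                # solved
                      # Minimum Remaining Values heuristic: pick the unfilled box with fewest options
                      _, box = min((len(values[b]), b) for b in boxes if len(values[b]) > 1)
                      for digit in values[box]:
                          attempt = values.copy()      # copy so each branch stays independent
                          attempt[box] = digit
                          result = search(attempt)
                          if result:
                              return result
                      return False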
    • Problem:
        . . 3 |. 2 . |6 . .
        9 . . |3 . 5 |. . 1
        . . 1 |8 . 6 |4 . .
        ------+------+------
        . . 8 |1 . 2 |9 . .
        7 . . |. . . |. . 8
        . . 6 |7 . 8 |2 . .
        ------+------+------
        . . 2 |6 . 9 |5 . .
        8 . . |2 . 3 |. . 9
        . . 5 |. 1 . |3 . .
      
    • Solution: ‘483921657, 967345821, 251876493, 548132976, 729564138, 136798245, 372689514, 814253769, 695417382’
          1 2 3  4 5 6  7 8 9
        A 4 8 3 |9 2 1 |6 5 7
        B 9 6 7 |3 4 5 |8 2 1
        C 2 5 1 |8 7 6 |4 9 3
          ------+------+------
        D 5 4 8 |1 3 2 |9 7 6
        E 7 2 9 |5 6 4 |1 3 8
        F 1 3 6 |7 9 8 |2 4 5
          ------+------+------
        G 3 7 2 |6 8 9 |5 1 4
        H 8 1 4 |2 5 3 |7 6 9
        I 6 9 5 |4 1 7 |3 8 2
      
    • Steps:
      • If box has a value, then all boxes in same row, same column, or same 3x3 square cannot have same value
      • If only one allowed value for given box in row, column, or 3x3 square, then the box is assigned that value
      • Naming Conventions
        • Rows and Columns
          • Rows labelled A, B, C, D, E, F, G, H, I.
          • Columns labelled 1, 2, 3, 4, 5, 6, 7, 8, 9
        • 3x3 squares will not be labelled
        • Boxes, Units and Peers (names for the important elements created by rows and columns)
          • Boxes - Individual squares at the intersection of rows and columns (i.e. ‘A1’, ‘A2’, …, ‘I9’)
          • Units - Complete rows, columns, and 3x3 squares. Each unit is a set of 9 boxes, and there are 27 units in total.
          • Peers - All other boxes associated with a particular box (such as ‘A1’) that belong to the same unit (i.e. the same row, column, or 3x3 square). i.e. Peers of ‘A1’:
            • row: A2, A3, A4, A5, A6, A7, A8, A9
            • column: B1, C1, D1, E1, F1, G1, H1, I1
            • 3x3 square: B2, B3, C2, C3 (since A1, A2, A3, B1, C1 are already counted)
          • Note: Each box has 20 peers
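          • A minimal sketch of how these could be built, following the cross-product style of the Norvig blog post referenced above (the names cross, boxes, unitlist, units, and peers are assumed conventions, not mandated here):

              rows = 'ABCDEFGHI'
              cols = '123456789'

              def cross(a, b):
                  """Cross product of the characters in a and b."""
                  return [s + t for s in a for t in b]

              boxes = cross(rows, cols)                                  # 81 boxes: 'A1' ... 'I9'
              row_units = [cross(r, cols) for r in rows]
              column_units = [cross(rows, c) for c in cols]
              square_units = [cross(rs, cs) for rs in ('ABC', 'DEF', 'GHI')
                                            for cs in ('123', '456', '789')]
              unitlist = row_units + column_units + square_units         # 27 units of 9 boxes
              units = {box: [u for u in unitlist if box in u] for box in boxes}
              peers = {box: set(sum(units[box], [])) - {box} for box in boxes}   # 20 peers each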
      • Elimination Strategy:
        • Eliminate possible values for a box by analysing peers
        • If a box has a value assigned, then none of its peers may also have its value
        • Update grid with possible values for each box
        • Use this to compute grid_values() so each box value represents all the available values for that box, i.e. box ‘B5’ would have value ‘47’ (because 4 and 7 are its only possible values). The starting value returned for every empty box will be ‘123456789’ instead of ‘.’
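        • A minimal sketch of grid_values() and an eliminate() function, assuming the boxes/peers structures sketched above (not the official project solution):

            def grid_values(grid):
                """Convert an 81-character grid string into {box: possible values},
                replacing each '.' with '123456789'."""
                assert len(grid) == 81
                return {box: (val if val != '.' else '123456789')
                        for box, val in zip(boxes, grid)}

            def eliminate(values):
                """Elimination strategy: remove each solved box's value from the
                possibilities of all of its peers."""
                solved = [box for box in values if len(values[box]) == 1]
                for box in solved:
                    digit = values[box]
                    for peer in peers[box]:
                        values[peer] = values[peer].replace(digit, '')
                return values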
      • Only Choice Technique:
        • If a possible value in one of the unsolved boxes of a unit does not appear among the possible values of the other unsolved boxes in that unit, then it is the Only Choice for that box
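        • A minimal sketch of an only_choice() function, assuming the unitlist structure sketched earlier:

            def only_choice(values):
                """Only Choice strategy: if a digit can only go in one unsolved box
                of a unit, assign it to that box."""
                for unit in unitlist:
                    for digit in '123456789':
                        places = [box for box in unit if digit in values[box]]
                        if len(places) == 1:
                            values[places[0]] = digit
                return values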
      • Search Strategy:
        • Example Usage:
          • Game Playing
          • Route Planning (to find solutions efficiently)
          • Alpha Go https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf
            • Selection
            • Expansion
            • Evaluation
            • Backup
          • Depth First Search Algorithm Steps:
            • Start from root, traverse the tree always in the most-left-first fashion, and when reach end leaf, move to next branch
            • Step 1: Pick an unsolved box that has the fewest possible numbers to try out i.e. 89 (less work), and create a Tree of possibilities to try out
            • Step 2:
              • Step 2a: Try the first branch’s possible value of 8, using Constraint Propagation
              • Step 2b: If not done after the first branch’s first level, then choose from that unit/square another box with the fewest possible numbers (i.e. 159), and create 3x 2nd-level branches, trying to solve using 8 combined with each of its values (i.e. 1, 5, and 9) via Constraint Propagation
              • Step 2c: Repeat with more levels as required
            • Step 3: Try for remaining branches
            • Step 4: If Steps 1-3 result in no solution, repeat for another one of its possible values
            • Note: 4x more work, but each step gets easier
      • Constraint Propagation:
        • Using local constraints in a space (i.e. Sudoku square constraints) to reduce search space. Each constraint that is enforced introduces new constraints for other parts of board that helps reduce remaining possibilities
          • Example Usage:
            • Map Coloring Problem: Prevent two adjacent map segments from having same color
            • Crypto-Arithmetic Puzzles: Where each unique letter represents a digit in a calculation (i.e. TWO + TWO = FOUR), and no number starts with a 0. The goal is to find a mapping from letters to digits that satisfies the equation, using the constraints imposed by the equation to create an AI-agent that solves the problem via Constraint Propagation
            • Sudoku:
              • Options:
                • Constraints: Elimination
                  • Pick solved box, and apply the Eliminate function to remove the value of the box from the possibilities of its peers
                • Constraints: Only Choice
                  • Pick a Unit, if there is one value in the unsolved boxes that is only allowed in one of the boxes, then assign it to the box
                • Constraints: Search
                  • Use Depth First Search algorithm to solve harder Sudoku problems using recursion and calling function that uses Constraint Propagation (i.e. Elimination and Only Choice techniques)
                • Note: Constraint Propagation Technique (using Elim & Only Choice) Applied
                  • Step 1: Pick a solved box and apply Eliminate function
                  • Step 2: With the more complete grid, pick a Unit and apply Only Choice function
                  • Step 3: Repeat Step 1 for a different solved box
                  • Step 4: Repeat Step 2
                  • etc
                  • Solution ?? Or Stuck ?
                  • Important Note: May only handle EASY puzzle
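                  • A minimal sketch of this repeated-application loop (a reduce_puzzle() helper), assuming the eliminate() and only_choice() functions sketched earlier:

                      def reduce_puzzle(values):
                          """Apply Eliminate and Only Choice repeatedly until the puzzle stops
                          changing (stalled) or a box runs out of possible values (contradiction)."""
                          stalled = False
                          while not stalled:
                              solved_before = len([b for b in values if len(values[b]) == 1])
                              values = eliminate(values)
                              values = only_choice(values)
                              solved_after = len([b for b in values if len(values[b]) == 1])
                              stalled = solved_before == solved_after
                              if any(len(values[b]) == 0 for b in values):
                                  return False   # contradiction: some box has no possible values
                          return values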
      • Naked Twins Technique (extension) http://www.sudokudragon.com/tutorialnakedtwins.htm
        • Given a partially completed Sudoku puzzle with some boxes holding multiple possible values: if there are two boxes in the same row or column (unit) that both contain the same two possible values, i.e. 23 and 23, then we know for sure that 2 and 3 can only occur in those boxes in that unit (column 3), so:
          • Remove 2 and 3 from possible values in that column
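        • A minimal sketch of a naked_twins() function, assuming the unitlist structure sketched earlier (not the official project solution):

            def naked_twins(values):
                """Naked Twins strategy: if two boxes in a unit share the same pair of
                candidate digits, remove those digits from the unit's other boxes."""
                for unit in unitlist:
                    pairs = [box for box in unit if len(values[box]) == 2]
                    for i, box1 in enumerate(pairs):
                        for box2 in pairs[i + 1:]:
                            if values[box1] == values[box2]:
                                for box in unit:
                                    if box not in (box1, box2):
                                        for digit in values[box1]:
                                            values[box] = values[box].replace(digit, '')
                return values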
      • Diagonal Sudoku (by modifying the algorithm)
    • Project 1:
      • Added Python Debugging functionality from main.py file (see my Gists on GitHub)
      • Added assertions
      • Get PyGame library directory
          >>> import site;print(site.getsitepackages())
          ['/Users/Ls/miniconda3/envs/aind/lib/python3.6/site-packages']
        
      • Code Review Comments:
        • Logging with default level ERROR could be added to debug the code. Logs can also help to understand the algorithms. See https://docs.python.org/3/howto/logging.html
        • Assert statements could be used https://wiki.python.org/moin/UsingAssertionsEffectively
      • Utility methods in separate file
    • MyPy - Static Type Checking
      • Install MyPy python3 -m pip install mypy.
      • Install Typing python3 -m pip install typing.
      • Import Typing import typing; from typing import *,
      • Apply MyPy static type checking to an existing function where expect warnings def solve(grid: str) -> None:
      • Run MyPy Linter with mypy solution.py
      • Apply MyPy with expected types def solve(grid: str) -> typing.Dict:

      • Installation http://mypy.readthedocs.io/en/latest/getting_started.html#getting-started python3 -m pip install mypy
        • Also installed typing, even though at this website it says it’s only necessary when running old Python 2 http://mypy.readthedocs.io/en/latest/faq.html python3 -m pip install typing
          • And then added to top of file: import typing; from typing import *
          • Implemented into function with: def solve(grid: str) -> typing.Dict:
        • Run MyPy Linter mypy solution.py
      • Basic Summary http://mypy.readthedocs.io/en/latest/basics.html
      • Note:
        • Example built-in types: http://mypy.readthedocs.io/en/latest/builtin_types.html
      • Run MyPy Linter that performs type checking of program without running it and provides warnings mypy program.py
          PySudoku.py:1: error: Cannot find module named 'pygame'
          PySudoku.py:1: note: (Perhaps setting MYPYPATH or using the "--ignore-missing-imports" flag would help)
          PySudoku.py:3: error: Cannot find module named 'SudokuSquare'
          PySudoku.py:4: error: Cannot find module named 'GameResources'
          PySudoku.py:65: error: Name 'main' is not defined
        
      • Extend MyPy Linter to perform type checking of Libraries that have Stubs that are skeletons of public interface of library classes, variables, functions and their types but with Dummy function bodies (i.e. those incorporated in the Typeshed project from Python Built-ins and Standard Library)
        • Type inference uses Stubs for definitions
            # Sample code in .py file
            x = chr(4)
          
            # Sample MyPy Library Stub in .pyi file (ellipsis used for function body)
            def chr(code: int) -> str: ...
          
        • Create Custom Stubs https://github.com/python/mypy/wiki/Creating-Stubs-For-Python-Modules
          • Write Library Stub in .pyi file within either same directory as the Library Module OR in specific Stub directory (i.e. myproject/stubs) and set environment variable pointing to export MYPYPATH=~/work/myproject/stubs
          • Use subdirectory with __init__.pyi for packages
          • In directories with both .pyi and .py the .pyi file takes precedence
      • Contribute Custom Library Stubs created into the Typeshed Repo
      • Static Type Checking (instead of Dynamic Type Checking)
        • Type Annotations added to functions (type checker reports type errors within function)
        • Example: annotating the return type as None while the body returns a string means the statically typed context results in a type check error
            def greeting(name: str, prefix: str = 'Mr.') -> None:
                 return 'Hello, {} {}'.format(name, prefix)
          
        • Use typing module to use type definitions in statically typed code
            from typing import Iterable
          
            def greet_all(names: Iterable[str]) -> None:
                for name in names:
                    print('Hello, {}'.format(name))
          
      • Previously Dynamic Type Checking
        • No type checking annotation in function signatures
    • Lesson 4: Intro to AI
      • Optimisation of Navigation and Route Finding Domain
        • Computationally hard problems
          • Find shortest path on map from A to B
            • Apply graph of city nodes
            • Connect city nodes by road edges
          • Option 1: Breadth-First Search
            • Inefficiencies
              • Considers all possible paths out of city until reaches destination via one of them
              • Only considers major roadways
            • Issues
              • Intractable space of possible paths to search if consider smaller streets/alleys
          • Option 2: Heuristic to Manually Solve (i.e. AI for smart idea, a rule, function, or constraint that informs an otherwise Brute-force Algorithm to act more optimally)
            • Consider major roads
            • Defer considering roads that head too far from the destination
            • Prioritise paths that take closer to destination
          • Option 3: A* Search Algorithm
            • Draw straight line path between source and destination
            • Prioritise how the algorithm considers and evaluates different possibilities by testing (using the Direct Distance Heuristic) all possible next steps from the current position and how each changes the direct distance (prevents unnecessary searching and converges on the optimal direction quickly). Choose the Most Promising Choice with the smallest direct distance as the next step (though it may not turn out to be Optimal), and store every other path for later consideration.
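            • A minimal sketch of A* with a straight-line-distance heuristic (the graph/coords structures and parameter names are hypothetical, not from the course):

                  import heapq
                  import math

                  def a_star(graph, coords, start, goal):
                      """A* search: expand the frontier by f(n) = g(n) + h(n), where g is the road
                      distance travelled so far and h is the straight-line distance to the goal.
                      graph maps a city to {neighbour: road_distance}; coords maps a city to (x, y)."""
                      def h(city):
                          (x1, y1), (x2, y2) = coords[city], coords[goal]
                          return math.hypot(x2 - x1, y2 - y1)

                      frontier = [(h(start), 0, start, [start])]   # (f, g, city, path so far)
                      best_g = {start: 0}
                      while frontier:
                          f, g, city, path = heapq.heappop(frontier)   # most promising choice first
                          if city == goal:
                              return path
                          for neighbour, cost in graph[city].items():
                              new_g = g + cost
                              if new_g < best_g.get(neighbour, float('inf')):
                                  best_g[neighbour] = new_g
                                  heapq.heappush(frontier, (new_g + h(neighbour), new_g,
                                                            neighbour, path + [neighbour]))
                      return None   # no route found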
      • Intelligent Agent
        • Dfn: Actions taken to Maximise its expected Utility given desired Goal
        • Comparison of Agents
          • Rational Behaviour extent - requires Optimal Behaviour (regardless of what Cognitive Mechanism used), however, the Optimal Solution is hard to achieve (i.e. Sudoku problem, etc), but AI-agents are not always expected to perform Optimally
          • Constraints faced by the AI-agent must also be considered i.e.
            • Partially Observable env
            • Limited computational resources (speed, memory)
            • Rules of task such as require respond by deadline
          • Bounded Optimality
            • Quantifying Intelligence feasibly and practically
            • Level of Performance / Bound for AI-agent to meet (i.e. Chess AI-agent to win 60% of time against human chess masters, or Route Finding Algorithm that always find route no more than 2km longer than Optimal route)
        • i.e. Anticipate and React to changes in environment (i.e. other drivers)
      • Game AI-Agent (Tic Tac Toe)
        • Goal - Design AI-agent to play Game (Tic Tac Toe) using Search Strategy (similar to route-finding problem)
        • Step 1
          • Determine Nodes and Edges in the Graph we search
            • Option A -
              • Nodes - Each cell
              • Edges - Connect two if adjacent
              • INSUFFICIENT - Does not capture enough info on what to do next
            • Option B -
              • Nodes & Edges - Current Player position as a row/col Tuple with values in {0, 1, 2}. Note: Two Nodes are connectable if row/col differ by 1 (i.e. (0, 1) is connectable to (1, 0) but not to (2, 2))
              • INSUFFICIENT
            • Option C -
              • Opponent Tracking - Track of opponents position too
              • Node - Tuple (i.e. <(0, 1), (2, 0)>)
              • Edges - Connect if both Player positions differ by max 1
              • INSUFFICIENT - Since we need to know which cells are already filled, as players cannot move into them
            • CORRECT - Option D -
              • Node - Whole Board is Node, with one Node for every possible arrangement of X’s and O’s
              • Edges - Connect two Nodes if there is a valid move that changes the board from one node to another
                • The edges embody the Rules of the game
                  • i.e. only possible to go to finite qty of next states from current state (limits qty of moves need to consider)
              • SUFFICIENT - Usable for Tic Tac Toe!!! Since we will know the positions of the Computer Player and the Opponent Player, we know what to do next (it excludes any extraneous details we don’t need to solve the problem)
              • Note:
                • 3x3 Board -
                  • First Computer Move - 9 possibilities
                  • First Opponent Move - 8 possibilities
                  • etc
              • Note: Search Indicators
                • Pruning the Search Tree
                  • Ignorable / Non-Winnable Moves (some moves are wasteful)
                    • Evaluated Condition - i.e. When enemy has X on same row and column and proposed placement
                • Adversarial Search
                  • Minimax Algorithm (maximise chance of win on our turn, whilst enemy trying to minimise our chance of win on their turn) Note: AI-agent Dfn - Anticipates and plans around expected changes in its environment, including those introduced by other agents
                    • Anticipate Enemy Moves (intuition of recognising that the enemy is Human and also trying to Win)
                      • Evaluate Condition - i.e. When enemy can make sequence of moves that win after our move
                        • Ruling Out certain moves as being BAD removes potential successors from its consideration
      • Monty Hall Problem
        • Given 3x doors, randomly assigned behind each is either car or goat
          • Strategy:
            • Stay - Wins 33.33% of time
            • Switch - Wins 66.67% of time
          • Strategy enumeration (doors arranged Car, Goat, Goat; the cases vary by which door the player picks and which door Monty opens):
            1) Pick door 1 (Car), Monty opens door 2 (Goat) - better to STAY
            2) Pick door 1 (Car), Monty opens door 3 (Goat) - better to STAY
            3) Pick door 2 (Goat), Monty opens door 3 (Goat) - better to SWITCH
               (Monty never opens the car door, so this single case carries all of case 3’s probability)
            4) Pick door 3 (Goat), Monty opens door 2 (Goat) - better to SWITCH
               (likewise, this single case carries all of case 4’s probability)
        • Probability Theory - captures uncertainty in real-world to make sensible decisions when unsure
          • Prior Probability - best guess given no other evidence
              P(CarA) + P(CarB) + P(CarC) == 1
              P(CarA) == 1/3, P(CarB) == 1/3, P(CarC) == 1/3 (aka **Prior Probability**)
            
            • Equals 1 since the probability that the car is behind at least one of the doors is 100% (the probabilities sum to 1)
            • So each door has 1/3 probability of car behind it
          • Move 1 -
            • Player 1 Select door A
            • Monty Open door B
              • OpenB (Observation)
              • Posterior Probability - belief after making an Observation
                  P(CarB | OpenB) == 0
                
              • Note: 2x doors still undisclosed, probability of car behind either is NOT 1/2, since Monty (enemy) knowingly opened door with Goat behind it (never reveal the car on this move)
                  NOT THIS - P(CarA) == 1/2, P(CarC) == 1/2
                
              • We are actually interested in Posterior Probability of finding car behind Door C, given that Monty opened Door B, and the implicit Rule that Monty never reveals the car, so express the Posterior Probability in terms of other quantities that can compute using Bayes’ Theorem
                  P(CarC | OpenB) ==
                
                      == (P(OpenB | CarC) * P(CarC)) / P(OpenB)
                      == (Likelihood * Prior) / Marginal Probability
                
                      Likelihood:
                          where P(OpenB | CarC) == 1
                          since if CarC true, Monty's only option is OpenB,
                          since OpenA already picked by us (not option to Monty)
                
                      Prior:
                          where P(CarC) == 1/3,
                          which previously computed as 1/3
                
                      Marginal Probability:
                          P(OpenB) == Possibility 1 + Possibility 2 + Possibility 3
                                          where Possibility 1:
                                              Car behind A with 1/3 possibility, and in this case Monty either:
                                              OpenB or OpenC, both with 1/2 possibility,
                                               so Probability of OpenB **given** CarA is 1/2 (i.e. P(OpenB | CarA) == 1/2)
                
                                          where Possibility 2:
                                              Car behind B has 1/3 possibility, but in this case
                                              Monty would not reveal it, so Probability of OpenB
                                              given CarB is 0
                                   == P(CarA) * P(OpenB | CarA) +
                                      P(CarB) * P(OpenB | CarB) +
                                      P(CarC) * P(OpenB | CarC)
                                   == 1/3     * 1/2             +
                                      1/3     * 0               +
                                      1/3     * 1
                                   == 1/6 + 1/3
                                   == 1/2
                
                       so, updated Bayes' Theorem
                       == (Likelihood * Prior) / Marginal Probability
                       == (1          * 1/3  ) / 1/2
                       == 2/3
                       (so probability that the car is behind the other closed door has
                        increased to 2/3)
                
          • Note: Empirical Evaluation (by characterisation of given situation coupled with a formal method helps make decisions in face of Uncertainty) shows that consistent switch strategy wins 2/3 of time
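        • A minimal Monte Carlo sketch of that empirical evaluation (assumes Monty always opens an unpicked goat door; function and variable names are illustrative):

            import random

            def monty_hall(trials=100000):
                """Simulate the Monty Hall game and return (stay win rate, switch win rate)."""
                stay_wins = switch_wins = 0
                for _ in range(trials):
                    car = random.randrange(3)     # door hiding the car
                    pick = random.randrange(3)    # player's initial pick
                    # Monty opens a door that is neither the pick nor the car
                    monty = next(d for d in range(3) if d != pick and d != car)
                    # Switching means taking the remaining unopened door
                    switch = next(d for d in range(3) if d != pick and d != monty)
                    stay_wins += (pick == car)
                    switch_wins += (switch == car)
                return stay_wins / trials, switch_wins / trials   # ~1/3 vs ~2/3

            print(monty_hall())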
      • Intelligence Dfn:
        • Ability to produce Reasonable behaviour while dealing with different sources of complexity
          • Intelligent Objects: Thermos, Plant, Siri, Rock, Arpan
          • E.g: If a person you know was not human but acted the same, you probably wouldn’t be able to tell the difference
        • Note:
          • Intelligence is often used to label things we cannot explain
          • Artificial General Intelligence Dfn (sub-field of AI) deals with goal of:
            • Ability to replicate complex nature of human thinking that accomplishes a variety of intellectual tasks like humans do
          • Algorithms or Formulas used to label things we CAN explain
          • Philosophically this implies that:
            • Anything we design can never be intelligent, as we would know how it works
            • Humans assume they are intelligent now, as we do not fully understand how brains work, but when we do, will we not consider ourselves intelligent anymore?
          • Intelligence - must not be contingent on how we Perceive something
          • Intelligence must be a Property that emerges from the system itself
          • Intelligence - should be defined as the ability to do something with a decent rate of success within the Context of a task so then can say you are Intelligent at solving Sudoku puzzles

          (quote by Luke Schoen i.e. Our judgement of the “truth” of something, similar to our judgement of the “intelligence” of something, should not be contingent on how we “perceive” that something. Instead our judgement should be based on assessing “properties” that emerge from the “source” of that something itself, and should be defined based on that somethings’ “success rate” at being able to be “truthful” within a defined “context” such as a task)

          • Humans are Intelligent at doing Multiple tasks (many things)
      • Agent Design (i.e. AI-agent) (i.e. aka Intelligent Systems)
        • AI-agent Dfn:
          • Cartesian View:
            • Only the software is the AI-agent, whereas the Hardware (i.e. sensors) is considered part of the Environment
          • High-level View:
            • The entire robot is the AI-agent; all components it can directly and reliably access/control are part of the agent, and the rest is the Environment
        • Example:
          • Automatic Vacuum - i.e. specification

            • Environment Understanding (Domain of Task / Problem)
              • room operating in, floor, walls, objects to avoid bumping into, except objects not of concern
            • State stored is only items necessary to complete task
              • current position
              • orientation while making actions
              • places already been before
              • obstacle location map (but they may move over time)
            • Goal State stored - result the AI-agent trying to achieve (i.e. final location(s) acceptable)
      • Agent Interaction with Environment
        • Input
          • Perception - Sensing the Properties of Env
          • Effectors
        • Internal
          • Cognition - Based on Perceived Inputs, the process of:
            • Reasoning
            • Decision Making
            • Action
          • Note:
            • Reactive / Behaviour-based Agents
              • Simple pre-programmed behaviours that directly associate Actions with Perception (bypassing the Cognition process), i.e. an automated rubbish bin that senses something in front of it
          • Classification of Agents
            • Classification Measures: - based on
              • Processing they do, the kind of
              • Environment Properties / States they must capture
                • Fully Observable (i.e. Tic Tac Toe where whole board viewable)
                • Partially Observable (i.e. Battleship game where opponent positions hidden)
                • Deterministic (i.e. where Certain of Action outcomes)
                • Stochastic (i.e. where Uncertainty in Action outcomes)
                • Discrete (i.e. finite qty of States the env may be in)
                • Continuous (i.e. infinite qty of States such as due to Properties that need to be stored as real numbers)

                • Benign (i.e. where AI-agent is only one taking Actions that intentionally affect its Goal, i.e other Random events may be occurring)
                • Adversarial (i.e. where one or multiple AI-agents that can take Action to defeat its Goal, i.e. competitive games)
              • e.g.
                • Playing Poker - Partially Obs, Stoch., Adversarial
                • Recognition of Hand-written Text - Stochastic, Continuous (since source is physical object with arbitrary and unpredictable motions)
                • Driving on Road - All except Adversarial
                • Playing Chess - Advers.
            • Layering simple Reactive control in a Hierarchy may achieve complex Behaviour
            • Deliberate / Non-Trivial Processing by agents such as:
              • Knowledge-based Agents
              • Game Playing Agents
              • Path Planning Agents
              • Learning Agents
        • Output
          • Actions - Changing the State of Env
    • Lesson 5: Intro to Game Programming
      • Course Outline
        • Search and Logic and Planning - Peter Norvig
        • Probability and Bayesian Networks (BN) - Sebastian Thrun
        • Other - Thad Starner
      • Goal of Lesson
        • Design a Game AI-agent program smarter than ourselves within the limited context of a single real game
      • Main Topics
        • Adversarial Search - find optimal moves
        • Minimax Algorithm - for Deterministic Games determines best move for any turn in a game
        • Expectimax Algorithm - for Non-Deterministic Games (Chance games) that considers all possible outcomes and chooses the one with maximum expected return (assuming opponent is making best moves available)
          • Expectimax Game Tree
            • Maximising Nodes - indicated by upward facing triangles
            • Minimising Nodes - indicated by downward facing triangles
            • Probabilistic Nodes - indicated by circles with the Chance of each outcome labelled on branches below it
            • Evaluation of Tree Approach - evaluate tree from left to right
            • Note: Game Board Nodes - values may range from -10 to +10
        • Alpha-Beta Pruning - helps optimise the Minimax / Expectimax Algorithm so the AI-agent plays faster
          • Performance Improvements: applies to Expectimax game trees that have finite limits on the values the evaluation function can return
          • Usage: use the Alpha-Beta Algorithm to mark tree branches that do not need to be visited while calculating the value of the optimum choice
        • Evaluation Function - creation of them for games
        • Isolation Game Player - apply knowledge to the Isolation Game (a complex game with simple rules)
          • Objective: To not become isolated and unable to move on your turn (i.e. be the last player able to move)
          • Setup: 5x5 board, 2x players with X and O pieces
          • Movement Rules: Move permitted to any square horizontally, vertically, or diagonally from current position (like a queen), but not through or onto:
            • opponents position
            • previously occupied position (i.e. landed upon)
            • outside game board position
          • https://www.youtube.com/watch?v=n_ExdXeLNTk
          • Note: Hard to estimate chances of winning at start of game, but become more certain as game progresses
          • Minimax Algorithm Usage:
            • Automatic first choice consults the Opening Book - a Table of Best First Moves (also used for End Games) for this board config
            • Discover best moves, i.e.:
              • Move to positions that are always at least above, or in the position immediately above, the opposition's position
              • Move to positions with available escape options (unless you can block and win on that specific move); stay closest to the other player's position; avoid some first moves entirely (i.e. top middle or bottom middle); odds of winning are higher when making the first move on a 3x2 (w/h) board with the bottom right position blocked off
            • Mark (with Downward Arrow above) Scores at Terminal (Leaf) Nodes:
              • +1 for Win
              • -1 for Lose
            • Assumption 0: Maximisation of the score is considered from the perspective of the [Computer] (otherwise we would reverse how we apply Minimax). The [Computer] can only affect the game on its turn. The top of the Minimax Algorithm Tree is always a Maximise node (Upward Arrow Below)
            • Assumption 1: Player X [Computer] always tries to maximise its score
            • Assumption 2: Player O [Human] is smart and always plays to win, trying to minimise Player X [Computer]'s score (do not rely on them making bad moves and losing soon)
            • Mark Assumption 1 (with Upward Arrow Below, i.e. [Computer]'s Turns) on the Tree: turns when Player X [Computer] is trying to Maximise the score
            • Mark Assumption 2 (with Downward Arrow Below, i.e. [Human]'s Turns) on the Tree: turns when Player O [Human] is assumed to Minimise Player X [Computer]'s score

            • Move to positions that partition the board (fill positions on one side of board to reduce positions to move to to force early end to the game)
            • Move to positions that try to Trap the opposition
          • Example:
            • Build a Game Tree (showing all possible moves) based on a 2x3 game board
            • The Game Tree is used to select the moves with the best chance of winning
            • Rules:
              • Unavailable Moves: Any filled spot on game board
            • Human is X, Computer is O.
            • Mark Leaf (Terminal Nodes) branches where WIN with +1, and LOSE with -1.
            • Level 0 (Beginning Game Level) - bottom left is unavailable immediately.
              • Draw Level 0 of Game Tree as top of tree
            • Level 1 [Computer] - 5x possible moves.
              • Draw Level 1 of Game Tree as 5x branches possibilities below Level 0
            • Level 2 [Human] - 4x possible moves (reduces)
              • Draw Level 2 of Game Tree as 4x branches below EACH Level 1 branch
            • Level 3 [Computer] - 3x OR 2x OR 1x possible moves (exponential growth) (some moves BLOCKED by X, or BLINDSPOT)
            • Level 4 [Human] - 2x OR 1x possible moves (tree becomes denser)
            • Level 5 [Computer] - 1x OR 0x possible moves (multiple LOSE possibility)
            • Note: Warn [Computer] not to make Level 3 moves that allow a Level 4 branch giving [Human] a move that leaves [Computer] with 0x possible moves at Level 5
            • Level 6 [Human] - 0x possible moves (i.e. Computer Win if get to this level)

            • Minimax Algorithm Dfn - Back Propagate possible future wins/losses up game tree to [Computer] so best possible first move is made
              • Start Bottom Left Leaf Terminal Node
              • Go Upward and if get:
                • Downward Arrow: choose the Min value (from -1 or +1) appearing in the Child Nodes (turns where the opponent is assumed to Minimise Player X [Computer]'s score)
                  • Note that since [Computer] plays to win, it should never choose branches that allow [Human] to win; we Mark these branches as -1 so we Avoid future opportunities that MAY result in our loss; and
                • Upward Arrow nodes on way up, pick the Max Value among its Child Nodes
              • Note that if [Human] player makes a wiser choice at a higher tree level it can avoid situation of [Computer] being able to play to win
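              • A minimal runnable sketch of this back-propagation (assuming a hypothetical game object with get_legal_moves() for the side to move and forecast_move(m) returning the resulting board, scored from the [Computer]'s perspective; these names are illustrative, not a specific library API):

                  def minimax_value(game, computer_to_move):
                      """Back-propagate +1 (Win) / -1 (Lose) scores up the game tree."""
                      moves = game.get_legal_moves()
                      if not moves:                       # terminal leaf: side to move is isolated
                          return -1 if computer_to_move else +1
                      if computer_to_move:                # upward-arrow (Maximise) node
                          return max(minimax_value(game.forecast_move(m), False) for m in moves)
                      else:                               # downward-arrow (Minimise) node
                          return min(minimax_value(game.forecast_move(m), True) for m in moves)

                  def best_first_move(game):
                      """Pick the [Computer]'s move whose subtree back-propagates the highest value."""
                      return max(game.get_legal_moves(),
                                 key=lambda m: minimax_value(game.forecast_move(m), False))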
        • Multi-Player, Probabilistic Games
      • Problem Solving and Game Playing
        • Strategy Games - i.e. checkers, isolation, othello, chess
        • Algorithms
          • Code is reusable - only the code that generates all the next possible moves and the UI needs modifying - to create a Computer Player for simple Deterministic games (where the outcome of each move is known) such as Tic Tac Toe, or games as complex as Chess
          • Chance games (i.e. backgammon)
      • Max Qty of Nodes Visited
        • Given
          • Isolation Game board: 5x5
          • Time to process the Upper Limit of all Nodes:
              # Move 1 [Computer] - 25 possible places
              # Move 2 [Human]    - 24 possible places
              #                     23 ...
            
              # i.e. 25 x 24 x 23 x 22 ... x 1
                     = 25!
                     = 10 ^ 25
            
              # assume multi-core gigahertz processor able to do
              # 10 ^ 9 operations/s it will take 10 ^ 16 seconds
              # to search the entire game,
              # where 10 ^ 16 sec
                given 3600 sec/hr
                given 24 hr/day
                given 365 day/yr
                = 317,097,920 years
            
          • Time to process only up to Second Move:
              # 25 x 24 = 600
            
          • Subsequent Moves from Third Move:
            • Each subsequent move is made like a queen's move, with fewer moves available each turn
            • Center Position on board with Max Options to move on Third Move
          • Branching Factor
            • Center has 16 possible positions
            • Other Positions have <= 13 possible positions
            • Max moves (depth of tree) in a game is 25
            • Max possibilities from First and Second moves: 25 x 24 = 600
            • 23 remaining moves after First and Second moves:
              • <= 13 possible moves available (except center position)
            • End Nodes Worst Case in game tree:
              • 25 * 24 * 12 ^ 23 > 10 ^ 27 (a rough overestimate)
            • Assuming a Branching Factor of 12 for every one of the 23 remaining moves is far too pessimistic, since few possibilities remain for the last moves:
              • Fourth Last Move possibilities = <=3
              • Third Last Move possibilities = <=2
              • Second Last Move possibilities = 1
              • Last Move possibilities = 0
            • Instead assume this approximation is sufficient: of the 23 remaining moves, take the first 11 at the full branching factor of 12, and let the last 12 taper off as a factorial
                12 ^ 11 ~= 10 ^ 12
                12 * 11 * ... * 3 * 2 * 1 = 12! ~= 5 * 10 ^ 8

                25 * 24 * (12 ^ 11) * (5 * 10 ^ 8) ~= 2 * 10 ^ 23
                (much better/faster than 10 ^ 27 since most branches will have less
                 than the max branching factor)
              
            • Qty of Nodes in Game Tree that Minimax Algorithm must visit to find Optimal solution is:
              • b ^ d
                • Given grid: 5x5
                • b (avg branch factor)
                • d (depth of game tree)
            • Calculate Average Branching Factor
              • Option 1: Brute Force
                • Try simple test first and add intelligence later
                • Create a game player that moves randomly and calculate the average branching factor (collect statistics on each move: branches remaining, total games, total plays, total branches, average branching), and we find the
                  • Average Branching Factor: 8
                  • Estimated Max Nodes visited: 8 ^ 25 ~= 10 ^ 22 (i.e. 1.2 million years for answer)
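                  • A rough sketch of this brute-force estimate (random playouts on a simplified 5x5 Isolation board with queen-style moves; the board representation here is an assumption for illustration, not the course's actual game class):

                      import random

                      DIRS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

                      def legal_moves(pos, used):
                          """Queen-style sliding moves from pos on a 5x5 board, blocked by used squares."""
                          moves = []
                          for dr, dc in DIRS:
                              r, c = pos[0] + dr, pos[1] + dc
                              while 0 <= r < 5 and 0 <= c < 5 and (r, c) not in used:
                                  moves.append((r, c))
                                  r, c = r + dr, c + dc
                          return moves

                      def estimate_branching_factor(n_games=2000):
                          total_branches = total_plies = 0
                          for _ in range(n_games):
                              used, pos, player = set(), [None, None], 0
                              while True:
                                  if pos[player] is None:   # opening move: any empty square (25 then 24)
                                      moves = [(r, c) for r in range(5) for c in range(5)
                                               if (r, c) not in used]
                                  else:
                                      moves = legal_moves(pos[player], used)
                                  if not moves:             # current player is isolated and loses
                                      break
                                  total_branches += len(moves)
                                  total_plies += 1
                                  pos[player] = random.choice(moves)
                                  used.add(pos[player])
                                  player = 1 - player
                          return total_branches / total_plies

                      print(estimate_branching_factor())    # the lecture quotes roughly 8 on average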
              • Option 2: Depth Limited Search (Clever alternative to long end game search)
                • Assume:
                  • Search Speed: 10^9 nodes/sec * 2 sec = 2 * 10^9 nodes
                  • Solve for the max depth x: 8^x < 2 * 10^9, i.e. x < log_8 (2 * 10^9)

                      Math Formula (change of base):
                      log_a (x) = log_b (x) / log_b (a)

                  • So: x < log_10 (2 * 10^9) / log_10 (8), i.e. x < 10.3

                    • Assuming Branching Factor of 8 is good, to be on the safe side we should only go to Max Depth of 9 (only go as deep as think need to meet deadline)

                    • At Depth of 9 we want an Evaluation Function of a Node to evaluate goodness of node at depth of 9 based on how much we expect it to lead to Win for [Computer] player

                      • Note: The only way for the [Computer] player to Win is for it to still have moves left at the end of the game, so a natural idea is for the [Computer] to maximise the number of moves it has available (motivating the #my_moves Evaluation Function below)

                      • Evaluation Function (aka “#my_moves”: for convenience, the number of moves available to Player 0, the [Computer])

                        • Note: On each game tree branch:
                          • Min Branch (third last level) - Mark with the min value of the Max Branches below it
                          • Max Branch (second last level) - Mark with the max value of the leaves below it
                          • Leaves - Mark with the number of moves available to [Computer]
                        • Arguments: each game board generated at level 9 of the Minimax game tree
                        • Returns: a number used to compare that level 9 node with all other nodes at level 9; a higher returned number represents a board that is better for the [Computer] player
                        • Task: counting the number of moves the [Computer] has available at a given node allows the [Computer] to select branches of the Minimax tree that lead to spaces where it has the most options (see the code sketch below, after the quiescence notes)
                        • Also, use the Evaluation Function to test our assumption that if [Computer] moves to any middle position it will always lose. Try applying the Evaluation Function ONLY down to Level 2 of the game tree, and notice that completely different nodes get higher scores (the ones that would guarantee a loss!). They may vary widely because a critical decision is being made at that level.
                          • Demonstrates better to search to at least level 3
                          • Check which branch the Minimax Algorithm recommends at each level of the search
                            • Use the #my_moves Evaluation Function on the Isolation Game with Max score 5, Min score 0. If we find a branch where [Computer] loses, give it score -1, or if we find a branch where [Computer] wins, give it score 100.

                            • Quiescence Search - the state reached if we continue to Level 3, where the recommended branches no longer change much, so we know we have gone far enough... if we continue to Level 5 we just learn what we already know, and the values settle to -1's and +1's

                              • Use this approach of determining deepest level to go to if get consistent results
                              • Use this approach at Start and End of game when gives better results
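                        • A minimal sketch of a depth-limited Minimax using the #my_moves Evaluation Function at the cut-off depth (same hypothetical game interface as the earlier sketch, plus get_legal_moves(player); names are illustrative):

                            def my_moves(game, computer_player):
                                """Evaluation Function: number of moves the [Computer] has on this board."""
                                return float(len(game.get_legal_moves(computer_player)))

                            def depth_limited_minimax(game, depth, computer_player, maximizing):
                                moves = game.get_legal_moves()
                                if not moves:                          # terminal: side to move is isolated
                                    return float("-inf") if maximizing else float("inf")
                                if depth == 0:                         # cut-off: score the board instead of recursing
                                    return my_moves(game, computer_player)
                                values = (depth_limited_minimax(game.forecast_move(m), depth - 1,
                                                                computer_player, not maximizing)
                                          for m in moves)
                                return max(values) if maximizing else min(values)

                            # e.g. choose the [Computer]'s move searching 9 plies deep:
                            # best = max(game.get_legal_moves(),
                            #            key=lambda m: depth_limited_minimax(game.forecast_move(m), 8,
                            #                                                computer_player, False))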
                          • Depth-First Iterative Deepening (DFID)
                            • References:
                              • AI Textbook: Section 3.4.5 of Russell & Norvig
                              • UBC Slides: https://www.cs.ubc.ca/~hutter/teaching/cpsc322/2-Search6-final.pdf
                              • Video showing how DFID vs DFS http://movingai.com/dfid.html
                            • Note: End Game is best critical time to apply DFID to see what will happen (i.e. in Isolation game)
                            • Note: Using DFID means that our [Computer] player will always have an ANSWER READY in case it runs out of time (when there is limited time PER MOVE), and using it we can search as far as possible within the time constraints. In other games there is limited time PER GAME (i.e. speed chess), so we would want the [Computer] to search to different depths depending on the point in the game
                            • Strategy developed on how deep to search at different points in the game
                              • Strategy 1:
                                • Beginning of game: Standard Initial Moves
                                • Middle of game: Deeper Search (more time allocated) for Moves
                                • End of game: Less time searching (as far as possible in the remaining time) for Moves (relying on the reduction in branching factor)
                              • Strategy 2:
                                • Conservative amount of time dedicated per Move
                                • Use DFID and Quiescent Search to determine few moves want to spend extra time on
                              • RISKS: Horizon Effect - critical situations where the game will be won/lost on the next move, but the [Computer] is unable to search far enough into the future to discover the problem exists

                                  * i.e. If a player is blocked on a side of
                                  a Partition where they have less moves than if they
                                  chose to move to the other side of the Partition
                                  (in the board where Partition is created by previously
                                  occupied positions)
                                
                                  * **Potential Solution** Add a process in your existing #my_move
                                  **Simple Evaluation Function** to make it a
                                  **Complex Evaluation Function** by additionally checking if a
                                  **Partition** is being formed by the next move, and if so, count
                                  the number of moves left to each player
                                  (BUT this causes our Evaluation Function to take exponentially longer)
                                
                                  * **Solution**
                                      * **Strategy** Depending on the game time available, think carefully
                                      about how efficiently the Evaluation Function can be implemented for the
                                      strategy chosen, and whether it captures the desired behaviour, then use
                                      either of the two options below. But BEFORE the **Evaluation Function**,
                                      use **Alpha-Beta Pruning** (to improve efficiency of game tree search):
                                
                                          * **Alpha-Beta Pruning Algorithm**
                                              * Dfn: Pruning technique to allow ignore whole sections
                                              of game tree but with same answer as with **Minimax Algorithm**
                                              so is MORE EFFICIENT, saving a lot of time.
                                              Where
                                              * **Minimax Algorithm** runs in a timeframe of `b^d`
                                              * **Minimax with Alpha-Beta Pruning (applied first)** runs in a timeframe of
                                              `b^(d/2)` (if nodes are ordered optimally with best moves first), and roughly
                                              `b^(3d/4)` (with random move ordering) - see the code sketch after this block
                                                  * Firstly, When performing the **Minimax Algorithm** from left to right,
                                                  after propagating the highest value up to the far-left-most bottom
                                                  Max Node (Level -2) from that branch's two nodes (Level -1, aka leaf),
                                                  (i.e. a value of 2) we know that the upward Min Node's (Level -3) subtree
                                                  value will be <= 2... (since Opponent X will choose branch that minimises
                                                  the value), so for each remaining Level -2 node, as soon as we find
                                                   it has a leaf Level -1 node of value >= 2, we can ignore the rest of
                                                   its leaves and move on to next Level -2 node (i.e. if second
                                                    Level -2 Max Node has 3x leaf nodes, and we find first one is 2,
                                                    we can ignore the other two as they will never be selected), saving time!!
                                                    And based on this we know that at the highest Top node Level -4,
                                                    we will get a value of >= 2
                                                  * Then, assuming Level -4 is the Top of the game tree.. when we
                                                   check the second from the left Level -3 subtree's ([Computer] turn)
                                                   Level -1's leaf nodes,
                                                   as soon as we find a value of 2 we know that is a Winning leaf, with its
                                                   upward Level -3 subtrees Min value being <= 2, so we can ignore the rest of
                                                   the branches below that subtree
                                                  * Then, for third from the left Level -3 subtree's Level -1's leaf nodes,
                                                   as soon as we find a leaf with value of 2, we can ignore the remaining Level -2's
                                                    and associated leaves in that subtree, since we have already found a Max (Winning)
                                                   move in that subtree
                                                  * Note: The above assumes our goal is to play the game, and that
                                                  we choose from left to right using Minimax Algorithm, but not keep a list of all
                                                  the equally good moves on a given level
                                
                                                  * **Reordering leaf nodes by only swapping siblings of a given parent**
                                                  to increase amount pruned
                                                      * i.e. Since evaluate left to right, in right
                                                      subtree swap the Level -2 Max nodes around
                                                      so smallest is on left instead of on right, as Level -3 is Min, and may
                                                      only need to evaluate the left if we find Level -3 Min is less than
                                                      first subtree
                                
                                
                                          * **"Simple" Evaluation Function** and DEEP SEARCH, OR
                                          * **"Complex" Evaluation Function** and SHALLOW SEARCH
                                          to catch dangerous situations like the **Horizon Effect**, such as
                                          extensions of the **#my_move** **Simple Evaluation Function**
                                          that are correlated with our chances of winning including
                                          (try lots of Variants of Complex Evaluation Functions to see
                                          which ones are the best):
                                              * (**#my_moves** - **#opponent_moves**)
                                                  * Note: Useful for Simple Isolation games
                                                  * Favours moves where [Computer] has most options
                                                  * Penalises moves where opponent has most options
                                                  * Note: Possible to **weight the formula components** for
                                                  more or less aggressive game play.
                                                      * i.e. to cause [Computer] to chase after Opponent, and
                                                      to make it so the **Winning Move has the highest Evaluation
                                                      Function result**
                                                          * (**#my_moves** - 2 * **#opponent_moves**)
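                                  * A minimal sketch of depth-limited Minimax with **Alpha-Beta Pruning**
                                  (reusing the hypothetical game interface and my_moves evaluation from the
                                  earlier sketches; alpha/beta are the best values already guaranteed to the
                                  maximiser/minimiser higher up the tree):

                                      def alphabeta(game, depth, alpha, beta, computer_player, maximizing):
                                          moves = game.get_legal_moves()
                                          if not moves:                        # terminal: side to move is isolated
                                              return float("-inf") if maximizing else float("inf")
                                          if depth == 0:                       # cut-off: use the Evaluation Function
                                              return my_moves(game, computer_player)
                                          if maximizing:
                                              value = float("-inf")
                                              for m in moves:
                                                  value = max(value, alphabeta(game.forecast_move(m), depth - 1,
                                                                               alpha, beta, computer_player, False))
                                                  alpha = max(alpha, value)
                                                  if alpha >= beta:            # minimiser above will never allow this
                                                      break                    # prune the remaining siblings
                                              return value
                                          else:
                                              value = float("inf")
                                              for m in moves:
                                                  value = min(value, alphabeta(game.forecast_move(m), depth - 1,
                                                                               alpha, beta, computer_player, True))
                                                  beta = min(beta, value)
                                                  if beta <= alpha:            # maximiser above already has better
                                                      break
                                              return value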
                                
                            • Init Goal: Want [Computer] to respond within < 2 sec per move
                            • Previously used Max Depth Calc: Calculated Max Depth we thought could search to within that timeframe
                            • Simpler Iterative Deepening Approach:
                              • Search to Depth Level 1, get an answer for the best move, and store it to use in case we run out of time (L1 has 1 tree node, L2 has 3 tree nodes; L1 has 1 iterative deepening node in total, L2 has 4 iterative deepening nodes in total)
                              • Repeat process but search to Depth Level 2
                              • Repeat process until run out of time (returning best move from level we get up to)
                            • Note: number of Tree Nodes for each level is the total number of nodes visited when exploring the tree till that depth (not just the nodes on that level)
                            • Note: number of Iterative Deepening Nodes are simply the cumulative sum at each level (always less than double the number of tree nodes)
                            • Note:
                                    b = 2                    (Branching Factor of 2 given)
                                    n = 2^(d+1) - 1          (Tree Nodes total)
                              
                                    b = 3                    (Branching Factor of 3 given)
                                    n = (3^(d+1) - 1) / 2    (Tree Nodes total)
                              
                                    b = k                       (Branching Factor of k given)
                                    n = (k^(d+1) - 1) / (k - 1) (Tree Nodes total)
                              
                            • Note: Iterative Deepening's re-searching of the tree does not waste too much time: the total time is dominated by the last level searched, so the repeated shallower searches add only a small multiplier (node counts grow exponentially with depth)
                            • Note: With Branching Factor of 2, DFID expands less than 2x the qty of nodes a depth limited search would have done at the same level
                            • Note: With Branching Factor > 2 (i.e. most games) then the redundancy caused by iterative deepening is even less
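                            • A minimal sketch of the iterative deepening loop with a per-move time budget (assuming the alphabeta sketch above and a hypothetical time_left() callback returning the remaining milliseconds; here the clock is only checked between depths, which is a simplification):

                                import math

                                def iterative_deepening(game, time_left, computer_player, safety_ms=50):
                                    """Search to depth 1, 2, 3, ... keeping the best move found so far,
                                    so an ANSWER is READY whenever the per-move clock expires."""
                                    moves = game.get_legal_moves()
                                    best_move = moves[0] if moves else None
                                    depth = 1
                                    while moves and time_left() > safety_ms:
                                        best_move = max(moves,
                                                        key=lambda m: alphabeta(game.forecast_move(m), depth - 1,
                                                                                -math.inf, math.inf,
                                                                                computer_player, False))
                                        depth += 1
                                    return best_move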
      • AI Definitions
        • AI NP (Hard) Problem Types
          • Non-Exponential
          • Exponential in Time OR
          • Exponential in Space OR
          • Exponential in Both Time AND Space
        • i.e. “AI is the study of finding clever hacks to exponential problems”
          • When problem finally solved or computer finally fast enough to solve it, the world no longer thinks of the problem as belonging to AI
        • i.e. “AI consists of all the NP hard problems that have not yet been solved”
        • i.e. “AI involves working on everyday problems that improve people’s lives”
        • AI Systems
          • System that helps consumers choose plane flights
          • System that determines when to deploy airbag in a car
      • Lesson 6.16:
        • Solving 5x5 Isolation Game
          • Searching to endgame in reasonable timeframe:
            • Minimax Algorithm with Alpha-Beta Pruning (reduces search space size of possible game states from b^d to b^(d/2) i.e. from approx 8^25 to 8^12)
            • Symmetry Analysis of board state (where a state is defined as a series of ordered moves)
              • http://stackoverflow.com/questions/9082829/creating-2d-coordinates-map-in-python
              • http://stackoverflow.com/questions/32334322/find-adjacent-elements-in-a-2d-numpy-grid
              • http://stackoverflow.com/questions/6313308/get-all-the-diagonals-in-a-matrix-list-of-lists-in-python
              • http://stackoverflow.com/questions/2373306/pythonic-and-efficient-way-of-finding-adjacent-cells-in-grid

              • Best at Beginning of Game and Up to Level 3 of the Search Tree (beyond Level 3, symmetry is rare, and effort to check for it wasn’t worthwhile) (when Branching Factor is High, i.e. Player 1 has 25 possible moves, but only 6 Unique moves, since can use Symmetry about horiz, vert, diagonal axis, and the center move)
                • Heuristic of Equivalent Moves by Rotating Board State 90 degrees (realising some moves equivalent)
                  • Reuse Game Tree known for Move (0,0) as is same as (4,0), (4,4), (0,4)
                  • Do not have to search the game tree further for these non-unique moves
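                  • A small sketch of this symmetry check for opening moves on the empty 5x5 board (plain (row, col) coordinates, an assumption about the board representation):

                      def rotate90(move, n=5):
                          """Rotate a (row, col) coordinate 90 degrees clockwise on an n x n board."""
                          r, c = move
                          return (c, n - 1 - r)

                      def reflect(move, n=5):
                          """Reflect a (row, col) coordinate about the vertical axis."""
                          r, c = move
                          return (r, n - 1 - c)

                      def equivalent_moves(move, n=5):
                          """All openings reachable from `move` by rotating/reflecting the empty board."""
                          equiv, m = set(), move
                          for _ in range(4):
                              m = rotate90(m, n)
                              equiv.add(m)
                              equiv.add(reflect(m, n))
                          return equiv

                      print(equivalent_moves((0, 0)))   # {(0, 0), (0, 4), (4, 0), (4, 4)} - search only one of them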
            • Partitions (know the game outcome as soon as get partition, as separates both players completely, so player with longest path left Wins, so avoid having to search to end of game tree)
              • Best at Beginning of Game to create Partition so your player is on side with most remaining moves
            • Avoid being Player 2
              • Note: Assuming both players play Optimally, Player 1 always wins the 5x5 Isolation game (hence avoid being Player 2)
            • If Player 1, always first Move to Center square, and then if possible use Reflection on all subsequent moves to win
              • The Reflection strategy is to always copy the opponent's move rotated 180 degrees to the other side of the board, on every move they make, i.e. Player 1 Move 1: Center; Player 2 Move 1: Diagonal; Player 1 Move 2: Opposite Diagonal; etc.
            • If Player 2, detect if Player 1 is using the Reflection strategy and if so, move to a position that Player 1 cannot reflect (there are 8 such moves of the 24 available to Player 2)
            • If Player 2, create Book of Opening Moves and Hints, such as:
              • If Player 1’s first move is to Center Square, then Player 2’s first move afterward should be an Unreflectable position
              • If Player 1’s first move is NOT Center Square, then Player 2’s first move afterward should be the Center square
            • After the first moves:
              • Load and search our Book of Opening Moves and Hints efficiently, comprising:
                • Equivalent Moves
                • Hash Tables
              • Implement in order:
                • Minimax
                • Iterative Deepening
                • Alpha-Beta Pruning
                • Evaluation Function
      • Lesson 6.17 - 6.18
        • Multiplayer Games + MAXN (for any number of players)
          • 3-Player Isolation Game
            • Last player still able to move wins
            • Pairs of players may form alliances
            • Do Not use Minimax Algorithm in multi-player games
            • Evaluate game board Leafs from perspective of each player and propagate values (as triplets) up the tree, i.e.
              • Level 3: Max triplet from Player 3's perspective (i.e. propagate up the triplet leaf with the largest Player 3 value)
              • Level 2 Max triplet from Player 2’s “
              • Level 1 Max triplet from Player 1’s “
              • Level 0 (Top) receives propagation from Level 1
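              • A minimal sketch of this MAXN propagation for 3 players (hypothetical game interface: get_legal_moves(player), forecast_move(m), and evaluate_all() returning one score per player, e.g. each player's #my_moves; names are illustrative):

                  def maxn(game, depth, player, n_players=3):
                      """Each node returns a score vector (one entry per player); the player to
                      move picks the child vector whose own component is largest."""
                      moves = game.get_legal_moves(player)
                      if depth == 0 or not moves:
                          return game.evaluate_all()        # e.g. (#moves_p1, #moves_p2, #moves_p3)
                      best = None
                      for m in moves:
                          child = maxn(game.forecast_move(m), depth - 1,
                                       (player + 1) % n_players, n_players)
                          if best is None or child[player] > best[player]:
                              best = child
                      return best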
      • Lesson 6.19
        • Multiplayer Games + Alpha-Beta Pruning
          • Lesson 6.21
            • Reading:
              • References: Korf, 1991, Multi-player alpha-beta pruning
                • Why might alpha-beta pruning only work well in the two-player case?
                • How does the size of the search tree change with more than two players?
              • Questions Asked in Forum https://discussions.udacity.com/t/lesson-6-part-21-multi-player-alpha-beta-pruning-paper-reading-by-korf-figure-7-and-section-3-5/226209
          • A-B Pruning works if:
            • There is an Upper Bound on the SUM of all players' Evaluation Functions
            • Each player's Evaluation Function has a Lower Bound
          • Evaluation Function for number of #my_moves has:
            • Lower Bound of 0
            • Upper Bound - calculate using this Evaluation Function:
              • The sum, over all players, of the number of moves remaining should not exceed the remaining number of empty squares on the isolation board (with 3x players, divide the number of empty squares by 3 to get the moves remaining for each player). Note: the Upper Bound changes with each game tree Level.
              • Simplify calculation that allows both Immediate Pruning and Shallow Pruning, BUT NOT Deep Pruning like before, and use 3-Player MAX-MAX-MAX Pruning (see Lesson 6.20):
                • SUM of players scores (within a triplet group) must be Max of 10, i.e.
                  • Level 2: Max triplet from Player 2's perspective (i.e. if a Leaf has a Player 2 [middle] value of 10, then Player 1's and Player 3's values must be 0 since the Max SUM of all three is 10; propagate up the triplet Leaf with the largest Player 2 value, and prune the remaining leaves on that branch)
                  • Level 0 (Top) receives ranges based on triplets propagated up, which gets refined on outcome of propagating up subsequent subtrees, i.e. (>=3, <=7, <=7)
                  • When doing subsequent subtrees, apply the range to Level 1 (i.e. (>=3, <=7, <=7) ), and if the subsequent branches have player values already within those ranges but less than the player values we have propagated up in previous subtrees to Level 1, then Prune them
      • Lesson 6.22 Probabilistic Games
        • Stochastic Games (i.e. Backgammon)
          • Note: Moves are limited on each turn based on the outcome of rolling two dice (we do not know the value of each die ahead of time; one might assume we cannot build a game tree for this, but we can)
          • Game Tree
            • Expectimax Algorithm to make Best Choice decisions on Moves
              • Only allowed to Prune when have Known Bounds on the values that will be returned by the Expectimax Function
              • If player is going to lose anyway, unless the Opposition makes a bad choice or is unlucky, then still use Expectimax Algorithm to increase likelihood of winning incase Opposition makes bad choice or is unlucky, BUT we would need to search all the way to endgame
              • When evaluating leaves, process highest leaf value first with highest probability (change order) for the left-most branch
          • Sloppy Isolation (Variant of Isolation Game that is a Probabilistic Game)
            • Add probability nodes between Levels to illustrate the probability (between 0 and 1) of making that move vs a different move
            • Use the #my_moves Evaluation Function at leaf nodes: if we make the move at a leaf node, calculate how many moves will still be available to the player after the opposition's next move, multiply this value by the probability of that move occurring (compared to the others in the same branch), then sum those values across the leaf nodes
            • Eliminate evaluating nodes altogether if the return is already higher when looking for a Min Level
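            • A small self-contained sketch of Expectimax over an explicit game tree with chance nodes (toy data, not the Sloppy Isolation game itself):

                def expectimax(node):
                    """node is ('max'|'min'|'chance'|'leaf', payload):
                    leaf -> numeric value; max/min -> list of children;
                    chance -> list of (probability, child) pairs."""
                    kind, payload = node
                    if kind == "leaf":
                        return payload
                    if kind == "max":
                        return max(expectimax(child) for child in payload)
                    if kind == "min":
                        return min(expectimax(child) for child in payload)
                    return sum(p * expectimax(child) for p, child in payload)   # chance node

                # e.g. a gamble worth -10 or +10 with equal probability vs a certain 3:
                tree = ("max", [("chance", [(0.5, ("leaf", -10)), (0.5, ("leaf", 10))]),
                                ("leaf", 3)])
                print(expectimax(tree))   # 3: the certain value beats the 0 expected value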
    • Project 2:
      • Goal: Program player that beats opponent consistently
      • Design different Heuristics to perform the Adversarial Search
      • Analyse and compare the performance of each Heuristic

      • Reference Projects by others:
        • https://github.com/morganel/AIND-Isolation-1
        • https://github.com/bdiesel/AIND-Isolation
        • https://github.com/ziyanfeng/AIND-Isolation
        • https://github.com/davidventuri/udacity-aind/tree/master/isolation
        • https://github.com/mohankarthik/AIND-P2-Isolation
        • https://github.com/mlgill/AIND-Isolation

      • Reference Game Simulation https://deepmind.com/research/alphago/alphago-games-english/
      • Summary of AlphaGo, excellent https://www.tastehit.com/blog/google-deepmind-alphago-how-it-works/
    • Lesson 7 - Search:
      • AI dev dfn: Figure out what to do when don’t know what to do
        • Plan sequence of steps to figure out what to do even when do not know what first step should be until we solve the search problem
        • Note: Experiment and keep measuring so you can improve. Pay attention to people by keeping them happy and productive
        • A* Algorithm
      • non-AI dev dfn: Write instructions to tell computer what to do when you do know what to do
      • Challenge 1:
        • Tri-City Search Problem
          • Problem: visit 3x places on a bicycle, minimising travel distance (willing to go the wrong way down one-way streets), with minimal runtime and minimal overhead
          • Step 0
            • Consider Space and Time advantages/disadvantages of each algorithm
          • Step 1
            • Option 1: Simultaneously (trick) start Search from A, B, and C using A* Algorithm (aka Tri-Directional A* Search), by growing them out until first two connect, then continuing until third location connects somewhere with first two (avoids repeated node visits)
              • Tri-Directional A* Search uses an Admissible Heuristic function that accounts for two potential goals at the same time otherwise would just be an uninformed Tri-Directional Search
            • Option 2: DFS - not viable as may cross country multiple times just for first branch
            • Option 3: BFS (UCS) - works, but 3x searches required,
              • Sequentially Search from: A to B, B to C, and C to A, then determine Shortest two of the three paths, and set the polling station between those two paths as the next to visit and start at either of the other two cities
              • Issue: Would explore street intersections / nodes repeatedly across the 3x searches more than necessary (duplicates), and the search range extends further than necessary
            • Option 4: Bi-Directional Search
              • Instead of a search requiring ~b^d nodes, each bi-directional search only needs ~2 * b^(d/2) nodes (searching from both ends and meeting in the middle), but would still revisit nodes in the triangle between the cities
      • Challenge 2:
        • Rubik’s Cube
          • Problem: the Goal is all same colours on the same side; cube layers are twistable vertically or horizontally. Design a search algorithm that guarantees the least qty of moves to finish from any starting state, where each 1/4 (90 degree) turn counts as a move
          • Solution: Iterative Deepening with the A* Search Algorithm (IDA*), which looks for increasingly longer solutions in a series of iterations, using a special admissible heuristic described in Korf, 1997, "Finding Optimal Solutions to Rubik's Cube Using Pattern Databases":
            • Uses a lower-bound heuristic to prune branches when their estimated length exceeds the current iteration bound
            • The heuristic function is based on large memory-based lookup tables or "pattern databases" (storing the exact qty of moves required to solve sub-goals such as individual movable cubies)
            • The effectiveness of the admissible heuristic function is characterised by its expected value (see the paper for details)
            • Searches to a median depth of 18 for the random configs attempted
            • It was later proved that God's Number is 26 in the Quarter-Turn Metric (the max quarter turns required to solve the cube from any state), using about 29 CPU-years of idle super-computer time
            • Note: this overcomes the issue that the search space is huge (the number of nodes at depth 18 of the search tree is on the order of a Quintillion, ~10^18), so naive search is Intractable
      • “Problem Solving” AI-Agents
        • i.e. theory and tech of building AI-agents to plan ahead to solve problems, where problem complexity is due to many problem states
        • Conditions that must be True in order to successfully search for a plan that solves a problem and be guaranteed of working
          • Domain must be Fully Observable (i.e. must be able to see the initial state we start with)
          • Domain must be Known (i.e. must know the actions that are available to us)
          • Domain must be Discrete (i.e. must be finite qty of actions to choose from)
          • Domain must be Deterministic (i.e. must know the result of taking an action)
          • Domain must be Static (i.e. nothing else in the world can change the world except our own actions)
        • State Spaces of many Types may be dealt with using Problem-Solving through Search, such as:
          • 2D
            • Route finding problems
            • Sliding blocks puzzle called “15 Puzzle”
              • Note: requires a Heuristic function to be applied in order for the search algorithm to work
              • Starting State is mixed up positions
              • Approach - Slide tiles around until reach goal
              • Goal - Tiles in certain configuration in order left-to-right and top-to-bottom
                • i.e.
                     1  2  3  4
                     5  6  7  8
                     9 10 11 12
                    13 14 15
                  
              • Heuristic Usage Solution Approach
                • Heuristic h1 = qty of misplaced blocks
                  • Admissible (never overestimates, since each time in wrong position must be moved at least once to get to right position)
                • Heuristic h2 = sum of distances that each block would have to move to get to right position
                  • Admissible (since each tile in wrong position can be moved closer to correct position no faster than one space per move)
                • Heuristic h3 = always exact cost at every node (so naturally expands the least number of nodes)
                • Heuristic h4 = always 0 (so naturally expands the most number of nodes of all possible heuristics)
                • Note: h2 >= h1
                  • So with the exception of breaking ties, an A* Search using h2 always expands fewer paths than one using h1
                • Note: An admissible heuristic that is greater than or equal to another one gets us closer to the perfect heuristic and expands at most the same qty of nodes (never more)
              • Deriving Candidate Heuristics Automatically (aka Generating a Relaxed Problem) (by manipulating a formal description of the problem where many constraints, such as harder to move tiles around, and made easier by relaxing one or more of the constraints)
                • Note: By relaxing the problem, it is as though new operators are being added that traverse the state in new ways, making the problem easier to solve (never overestimates, and is admissible)
                • Program that is:
                  • Given Description of a Problem (i.e. sliding blocks puzzle)
                      Line 1 - A block can move from any tile A to any tile B
                      Line 2 - if (A is adjacent to B)
                      Line 3 - and if (B is the blank tile)
                    
                • Loosen the restriction by:
                  • Crossing out Line 3, results in Heuristic h2 (since a block can move anywhere to an adjacent state)
                      Line 1 - A block can move from any tile A to any tile B
                      Line 2 - if (A is adjacent to B)
                    
                • Loosen the restriction again further by:
                  • Crossing out Line 2, results in Heuristic h1 (where a block can move anywhere, regardless of any condition)
                      Line 1 - A block can move from any tile A to any tile B
                    
              • Generating Another Heuristic
                • Guaranteed Admissible (as long as h1 and h2 Admissible) and never overestimates, and is guaranteed better as gets closer to true value
                    h = max (h1, h2)
                  
                • Note: Problem with combining heuristics is that there is cost to compute
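                • A small sketch of h1, h2 and their combination for the 15 Puzzle (board as a tuple of rows with 0 standing for the blank - a representation assumption). For these two specifically h2 >= h1, so max(h1, h2) just equals h2; the max pattern matters when neither heuristic dominates the other:

                    GOAL_N = 4   # 4x4 board

                    def goal_position(tile, n=GOAL_N):
                        """(row, col) of `tile` in the goal configuration 1..15."""
                        return divmod(tile - 1, n)

                    def h1(board):
                        """Heuristic h1: qty of misplaced tiles (the blank is not counted)."""
                        return sum(1 for r, row in enumerate(board) for c, tile in enumerate(row)
                                   if tile != 0 and (r, c) != goal_position(tile))

                    def h2(board):
                        """Heuristic h2: sum of Manhattan distances of tiles from their goal positions."""
                        total = 0
                        for r, row in enumerate(board):
                            for c, tile in enumerate(row):
                                if tile != 0:
                                    gr, gc = goal_position(tile)
                                    total += abs(r - gr) + abs(c - gc)
                        return total

                    def h(board):
                        """max of admissible heuristics is still admissible and never weaker."""
                        return max(h1(board), h2(board))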
          • 3D
            • i.e. Abstract properties other than x-y position on a plane
            • Robot vacuum cleaner world state space
              • In either of only 2x possible positions
                • Each position may or may not have dirt in it (additional property to deal with)
              • State Space required: 2x2x2 = 8 states
                • 2x physical states (State A or B)
                  • 2x dirty states for State A (dirty A-dirty or not A-not-dirty)
                  • 2x dirty states for State B (dirty B-dirty or not B-not-dirty)
              • Actions connect between each State
              • Example 1 (Simple):
                • Action of Move Right - Changes to State with position to the right that is dirty
                • Action of Sucking - Change to State with same position but no longer dirty
              • Example 2 (More Complex):
                • Given:
                  • Robot position
                    • 10 possible positions
                  • Given Power Switch in one of 3 possible positions
                    • On
                    • Off
                    • Sleep
                  • Given Robot with Dirt Sensing Camera either (2 possible positions):
                    • On
                    • Off
                  • Given Robot with vacuum Brush Height (5 possible heights)
                    • 1-5
                  • Given each position Dirty or Not Dirty (2 possible states per position)
                    • i.e. 2^10 dirt combinations over the 10 positions
                • State Space has qty of states of:
                  • Cross product of all variables (since each independent and any combination may occur)
                  • = 3 * 2 * 5 * 2^10 * 10 = 307,200 states
        • Paths in the State Space
          • Linked List of Nodes implement the Paths within a computer algorithm, where a Node is a Data Structure that has four Fields.
            • State Field - indicates the state at end of the Path (i.e. B)
            • Action - action take to get there (i.e. AB)
            • Cost - total path cost of the path so far (i.e. 234)
            • Parent - pointer to another node state (i.e. A), or null pointer if no parent
          • Note:
            • Path - abstract idea
            • Node - same thing as Path, but terminology for representation in computer memory
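            • A minimal sketch of the four-field Node described above:

                from dataclasses import dataclass
                from typing import Any, Optional

                @dataclass
                class Node:
                    state: Any                       # State Field: state at the end of the Path, e.g. "B"
                    action: Optional[Any] = None     # Action taken to reach this state, e.g. "A->B"
                    cost: float = 0.0                # total path cost so far, e.g. 234
                    parent: Optional["Node"] = None  # pointer to the previous Node (None for the root)

                def states_along(node):
                    """Recover the whole Path (sequence of states) by following parent pointers."""
                    states = []
                    while node is not None:
                        states.append(node.state)
                        node = node.parent
                    return list(reversed(states))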
        • Data Structures that deal with Nodes/Paths and their implementation
          • Frontier
            • Operations
              • Remove best item
              • Add new items
              • Membership test (i.e. is item in frontier?)
            • Implementation
              • Priority Queue (knows how to keep track of best items in proper order)
            • Representation
              • Set (i.e. built from Hash Table or Tree)
            • Build
              • Hash Table
              • Tree
          • Explored List
            • Operations
              • Add new members
              • Check for membership
            • Implementation
            • Representation
              • Single Set
            • Build
              • Hash Table
              • Tree
        • Example 1: Navigation WITHOUT Fog Problem
          • Initially many choices
          • Complexity in Sequence of Actions (i.e. of picking right choice of turn at each intersection)
        • Example 2: Navigating WITH Fog Problem
          • Complexity in Partial Observability (i.e. if unable to see through fog where possible paths are, actions themselves are unknown, and can’t see result of actions)
        • Example A: Route-Finding 2D Problem (Partial Observability)
          • Given
            • Start location (i.e. Arad)
            • Destination location (i.e. Bucharest). Note: Corner of map
            • Partial map not showing the Destination
          • Problem:
            • Find route from Start to Destination location
            • Actions executable by AI-agent is driving from city-to-city along roads shown on a map
          • Question:
            • Is there a solution the AI-agent can return?
          • Answer:
            • No, not if Start or Destination are not on the map
        • Example B: Route-Finding Problem (Full Observability)
          • Given
            • Same as above but Full map showing Destination
          • Problem
            • Same as above
          • Question
            • Same as above
          • Answer
            • Yes, many possible Sequences of Actions (routes) chained together that each achieve the goal
        • Defining a Problem
          • Break down into components
            • Initial State (of AI-agent) - s0 (i.e. being in Arad)
            • Actions Function takes state input argument and returns set of possible actions executable when agent in the state (i.e. sometimes same actions available in multiple states) (i.e. sometimes actions are dependent on the state) (i.e. possible to go to neighboring cities but not any city in one move)
              • myActionFunc(s) -> {a1, a2, a3}
            • Result function takes state and action as input argument and returns new state as output
              • myResultFunc(s, a) -> s’
              • i.e.
                • State - Initial AI-agent state - s0 (in Arad)
                • Action - Go via route E671 to Timisoara
                • Result - New State where the AI-agent is in Timisoara (from applying the Action in the given State)
            • GoalTest function takes state and returns Boolean whether state is goal or not (i.e. only True in final destination)
            • Path Cost function takes a path, sequence of transitions (from performing Actions to move between States), and returns Cost of the path. Note: Usually just additive (i.e. Path Cost just sum of cost of individual Action steps) so implement the Path Cost function in terms of a Step Cost function (i.e Path cost is just the sum of the Step Costs along a path)
            • Step Cost function takes a State, Action, and the Resulting State from that Action, and returns the Action Cost as a number (n)
              • i.e. in route finding example this cost may by qty of kms travelled or minutes taken to get to destination)
        • Apply to Route Finding example the Definition of a Problem
          • Given Initial State of Arad
          • Given Goal State of Bucharest
          • Given possible Actions of roads to follow when in specific city
          • Sequences of Actions are built from paths we follow
          • Note: Path Length of 0 when just in Initial State
          • Note: A path of “X” Actions from the Initial State has Path Length “X”
          • Note: State Space is the set of all states
          • Note: Actions are used to navigate the State Space
          • Note: Step Cost of going between two adjacent States are shown
          • Note: Path Cost for a path is sum of Step Costs along that path
          • Note: At each point we separate the State Space into 3x different components:
            • Frontier consists of the States at the end of the Paths that have been explored
            • Explored region States before the end of the Paths that have been explored
            • Unexplored region States after the end of the Paths that have been explored
        • Define a Function for Solving Problems called Tree.Search
          • Note: Tree.Search represents a family of algorithms/functions, not a single algorithm, which superimposes a search tree over the state space
          • Note: How it works:
            • Initialise the Frontier to be the path consisting of only the Initial State
            • Loop: if nothing is left in the Frontier, fail with no solution
            • Otherwise use the Tree.Search family's "choice" function to pick one of the paths in the Frontier, and remove that path from the Frontier
            • If the State at the end of that Path is a Goal State, return success
            • Else Expand that Path: iterate over all possible Actions from that State, and for each, append the Action and its resulting State to get a New Path (the Old Path, plus the Action, plus the result of the Action). All New Paths are added to the Frontier
        • Tree.Search Family of Algorithms - Commonalities and Differences
          • Commonalities:
            • All algorithms look at Frontier, popping items to check against Goal Test
          • Differences between the algorithms:
            • The “choice” of which path to look at first (i.e. remove.choice of how you expand the next item of the Frontier)
        • Search Algorithms
          • Breadth-First Search (BFS) Algorithm (aka Shortest steps first)
            • Dfn: i.e. Searches by expanding first the shortest possible path steps, since always choose from the Frontier one of the paths that hasn’t been considered yet, which is shortest possible
            • Optimal and Complete (if goal at finite depth)
          • Cheapest-First (Uniform Cost) Search (UCS) Algorithm (aka Cheapest cost first) (aka Dijkstra’s algorithm)
            • Dfn: i.e. Searches by expanding first for path with cheapest possible total cost first
            • Optimal and Complete (if goal has finite cost)
          • Depth-First Search (Opposite of Breadth-First Search)
            • Dfn: i.e. Searches by expanding first the longest path, with the most links in it
            • NOT Optimal, and NOT Complete (since if infinite path, DFS would keep following it and never get to path consisting of the goal)
            • DFS only uses Storage Space of n nodes (instead of 2^n nodes in BFS and CFS), so it saves a huge amount of space. But if we also track the “Explored Set” we do not save as much.
          • Greedy Best-First (GBF) Search Algorithm (UCS focussing on Estimate)
            • Dfn: i.e. Searches by expanding first the path that's closest to the goal according to the Estimate (the search leads us directly toward the goal, searching in ellipses toward the goal, instead of circles expanding in all directions even away from the goal)
            • Note: If obstacles/barrier in the way, may lead us to the goal in a longer path than necessary if we continue toward the goal without back-tracking back to states farther from the goal and exploring in the opposite direction around the barrier)
            • Similar to Uniform Cost Search by expanding in terms of Contours, like on a topological map, first expanding to a certain step distance, then to a farther distance, etc, from the initial state until at some point we reach the goal state, but note that the search was not directed toward the goal, but instead was expanding out everywhere in the space and we should expect to explore Half the state space on average to find the goal, so when the state space is large, we won’t get to goal fast enough). To find the goal faster than with UCS, we add more Knowledge to the search by using an Estimate of the Distance from the Start State to the Goal State. In Route finding problem we can move in any direction (up, down, left, right), but for the Estimate we use the straight line distance between a State and Goal, and use that Estimate to find way to goal fastest.
          • A* Algorithm (aka Best estimated total path cost first) (combines UCS and GFS)
            • Dfn: i.e. Searches by always expanding path with minimum value of the Heuristic function f (defined as sum of g and h components) so result is a search strategy that is the best possible (i.e. it finds shortest length path whilst expanding minimum qty of paths possible), where g of a path equals path cost, where h of a path equals h value of the state (final state of the path), which equals the estimated distance to the goal
              • Heuristic Function (generated by relaxing the problem definition)
                  f = g + h
                  g(path) = path cost (sum of path costs from initial state to current state)
                  h(path) = h(s) = estimated straight line distance to goal (from current state)
                
            • Note: A* with heuristic function h finds the lowest cost path: when A* Search ends it returns a path P with estimated cost C, and C is also the actual cost, since the h component is 0 at the goal state (so the estimated total cost equals the actual path cost)
              • Path P must have a cost less than true cost of other paths on Frontier.
              • Paths beyond the Frontier must have cost greater than Path P’s cost, since step cost is always 0 or more
              • Note: All paths on Frontier have estimated cost > C, since Frontier is explored in cheapest-first order
            • Note: Minimising g helps keep path short
            • Note: Minimising h helps keep focused toward finding goal
            • Conditions under which the A* Algorithm will find the lowest cost path to the goal: A* always finds the lowest cost path provided the heuristic estimate h for a state never exceeds the true cost of the path to the goal through that state:
              • the h function must never overestimate the distance to the goal
              • h is Optimistic
              • h is Admissible (since admissible to use it to find lowest cost path)
            • Goal test works by taking paths off the Frontier, not putting paths on the Frontier
            • Additional Heuristic
              • Use additional heuristic of the Straight Line distance between a State and a Goal, for each State
            • Combines Greedy Best-First Search of exploring a smaller number of nodes directed toward the goal state, and Uniform Cost Search that is guaranteed to find the shortest path to the goal state
            • Example
              • Given a found path through state space to State X, and trying to give value to the path. The measure f is sum of g (path cost so far) and h (estimated distance remaining for path to get to goal)
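             • A minimal Python sketch of this idea (an illustration only, not the course or Pac-Man lab code; the names a_star, neighbors, and h are made up). neighbors(s) is assumed to yield (successor, step cost) pairs and h(s) is the straight-line-style estimate; with h always returning 0 it behaves like Uniform Cost Search:
                 import heapq, itertools

                 def a_star(start, goal, neighbors, h):
                     """Expand the frontier path with minimum f = g + h."""
                     tie = itertools.count()                  # tie-breaker so the heap never compares states
                     frontier = [(h(start), 0, next(tie), start, [start])]
                     explored = set()
                     while frontier:
                         f, g, _, s, path = heapq.heappop(frontier)
                         if s == goal:                        # goal test on removal from the frontier
                             return path, g
                         if s in explored:
                             continue
                         explored.add(s)
                         for s2, cost in neighbors(s):        # assumed to yield (successor, step cost)
                             if s2 not in explored:
                                 g2 = g + cost
                                 heapq.heappush(frontier, (g2 + h(s2), g2, next(tie), s2, path + [s2]))
                     return None, float('inf')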
          • Note: Resolve ties in left to right order
          • Optimality Property of Algorithms (i.e. guaranteed to find the shortest path, cheapest total cost path, and longest path respectively)
            • Yes, Optimal - Breadth-First Search
            • Yes, Optimal (assuming individual step costs are non-negative) - Cheapest-First Search
             • Not Optimal - Depth-First Search (it returns the first goal path it finds, which need not be the shortest)
          • Storage Requirements Property of Algorithms
            • Given a state space of “very large” Binary Tree (each level down the nodes double)
              • BFS “Frontier” requires at Level n a Storage Space of 2^n paths
              • CFS “Frontier” determines a sort of contour of cost but with a similar total number of nodes as BFS as Storage Space
              • DFS goes all the way down levels of branches to leaf and then back up working left to right, but at any point the “Frontier” has only n nodes (instead of 2^n nodes). If also tracking the “Explored Set” then do not get as much savings over BFS and CFS.
          • Completeness Property of Algorithms
            • i.e. if there is a Goal State somewhere, will the algorithm find it?
            • Given a state space of infinite trees where a Goal State is hidden deep down in the tree then the algorithms guaranteed to find a path to the goal include:
              • BFS, CFS, but NOT DFS
        • Tree.Search Family of Algorithms - List of Algorithms
          • General Tree Search (Non-Graph Search) using Breadth-First Search (BFS) Algorithm
              Tree.Search (problem p) returns path
                  frontier = { path(p.initial) }
                  loop:
                      if frontier is empty: return FAIL
                      path = remove.choice(frontier)
                      s = path.end
                      if Goal Test (s): return path
                      for a in p.Actions (s):
                           add [path + a -> Result(s,a)] to frontier
            
            • Steps
              • Sequence of Paths 1:
                • Pick Initial State with Path Length 0 (only path in Frontier, so is shortest)
                • Expand Initial State, adding in all paths that result from applying possible Actions
              • Sequence of Paths 2:
                • Remove Sequence of Paths 1 from Frontier
                • Find Shortest possible path resulting from next expansion
                  • Get all new paths resulting from applying possible Actions from Initial State
                  • If multiple new paths, then pick the shortest one
                    • If all have same length, then break the tie at random, or use a technique
              • Sequence of Paths 3:
                • Remove other Sequence of Paths 2 from Frontier (i.e. keep only the shortest path)
                • Add new paths to the Frontier, those that extend one level from new State (and including the back-tracking path back to the previous state that is also now part of the Frontier)
              • See Graph Search approach from here
             • Note: Build a tree, always keeping track of the Frontier (the ends of paths we haven’t explored yet) as the search space is explored; behind the Frontier is the set of Explored states, and ahead of it are the Unexplored states
            • Note: Explored States are tracked for when we want to expand and we find a duplicate, so we can point back to a previously seen state (without having to create a new state in the tree) and make a regressive step into the previously Explored state
              • See slide “7.9. Tree Search Continued”
          • Graph Search Algorithm using Breadth-First Search (BFS) Algorithm
              Graph.Search (problem p) returns path
                  frontier = { path(p.initial) };
                  explored = { };
                  loop:
                      if frontier is empty: return FAIL
                      path = remove.choice(frontier)
                      s = path.end; add s to explored
                      if Goal Test (s): return path
                      for a in p.Actions (s):
                           add [path + a -> Result(s,a)] to frontier
                           unless Result(s,a) in frontier or explored
            
             • Avoids repeated paths by modifying the Tree.Search function into the Graph.Search function (tracking an Explored set)
            • Steps

              • Sequence of Paths 3:
                • With Graph Search
                  • After we initialise “Explored Set” of paths already explored
                  • And after exploring new path, we add new state to set of “Explored Set”
                  • So when expanding the path and adding new states to end of it, we don’t add an end if we’ve already seen that new state before in either the “Frontier” or the “Explored Set”
              • Sequence of Paths 4
                • When using Breadth-First Search (BFS) Algorithm, the paths that are candidates to be explored next are the “shortest” ones (i.e. length 1) that haven’t already been explored yet
                • When we try to add new paths to the Frontier from this “shortest” state, if we find that those that extend one level from new State are either already in the “Frontier” or the “Explored Set”, then we won’t add them, and instead try the next “shortest” path, and try and add new paths from it to the Frontier
              • Sequence of Paths 5
                • Now with Frontier states that are all length of 2, we choose one of them,
                • We then add state paths to the Frontier that expand one level from this chosen state (but not the back-tracking path state since we’re doing Graph Search), and determine whether we terminate the algorithm at this point (if we’ve reached Goal State) or if continue
                • Note: The Goal Test isn’t applied when we add a goal node path to the Frontier, but instead when we remove a path from the Frontier. This is because it may not be the BEST path to the Goal.
                   • Note: An optimisation can be made when using Breadth-First Search by writing a specific BFS routine that checks each state as soon as it is added to the Frontier, and returns the result and terminates as soon as the goal state is added to the Frontier (instead of waiting until it is removed from the Frontier and expanded). This is safe for BFS because all steps cost the same, so there is no possibility of a shorter path to the goal (of, say, length 2.5 instead of 3) still waiting on the Frontier
            • Note: Breadth-First Search
              • Will find shortest path in terms of least number of steps
              • Will NOT find shortest path in terms of shortest total cost (by adding up the step costs), perhaps try Uniform Cost Search instead
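             • A minimal Python sketch of Graph Search with BFS, illustration only (successors(s) returning an iterable of next states is an assumed helper). It includes the early goal test described above, which is safe only because every BFS step costs the same:
                 from collections import deque

                 def breadth_first_graph_search(start, goal_test, successors):
                     if goal_test(start):
                         return [start]
                     frontier = deque([[start]])      # FIFO queue of paths: shallowest expanded first
                     seen = {start}                   # states already on the frontier or explored
                     while frontier:
                         path = frontier.popleft()
                         for s2 in successors(path[-1]):
                             if s2 not in seen:
                                 if goal_test(s2):    # early goal test (valid for BFS only)
                                     return path + [s2]
                                 seen.add(s2)
                                 frontier.append(path + [s2])
                     return None                      # FAIL: frontier is empty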
          • Graph Search Algorithm using Uniform Cost Search (UCS) Algorithm
            • Steps
              • Initial State “Unexplored” - Step 0
              • Change Initial State to “Explored”
               • Expand out, looking at and adding the paths out of that State (e.g. 3 of them) to the “Frontier”
              • Pick path X to be expanded next (using UCS rules of “Cheapest total cost first”)
              • Remove that path X from the “Frontier” and add X to the “Explored”
               • Expand out, looking at and adding the paths out of State X (without back-tracking) to the “Frontier”, summing the total cost of each path (across all steps from the initial state) so they can be compared on the “Frontier”
              • Pick path Y branching from initial state to be expanded next (using UCS rules of “Cheapest total cost first”)
              • Remove that path Y from the “Frontier” and add Y to the “Explored”
              • Add new paths to neighboring/successor states from Y (without back-tracking) to the “Frontier” and sum the total cost of those paths (across steps from initial state) and put on “Frontier” to compare
              • Repeat picking path, etc
              • Note, cannot terminate the algorithm until reach Goal State and have popped that cheapest path off the “Frontier” (i.e. continue searching to see if a better cheaper path that also reaches the Goal State)
    • Lab 7: Pac-Man
      • Help Pac-Man navigate his world in the most efficient way to find food and other objects by implementing BFS, DFS, and A* Search. Opportunity to create a visually good app for portfolio.
      • Reference http://inst.eecs.berkeley.edu/~cs188/pacman/project_overview.html
      • Udacity Forums https://discussions.udacity.com/c/nd889-search-optimization/nd889-lab-pac-man
    • Lesson 8 - Simulated Annealing (in family of Local Search Algorithms):
      • Reading: AIMA Chapter 4.1, + 4.2-4.5
      • References: Different viewpoints and extensions
      • Dfn: Used in AI when State Space is large and other techniques required to search it and find optimal solutions.
      • Example Problems: N-Queens problem where N Queens set up on a board in such a way that they cannot attack each other

      • Iterative Improvement Problems and Algorithms
        • Example #1: Travelling Salesman Problem
          • Problem:
            • Given a Salesman
            • Given 5x cities possible to visit
            • Rules:
              • Must visit all 5x cities before tour is finished
              • Must return to starting city at finish
            • Goal: Find most efficient order to visit the cities to minimise overall distance flown
            • Note:
              • This is an NP-hard Problem where all NP Problems are about as hard as each other (exponentially difficult)
                • N - Non-Deterministic
                • P - Polynomial Time
              • Use tricks to solve problem efficiently
          • Solution
            • Approaches
              • Randomly connect the cities
              • Identify where paths cross
                 • Iterative Improvement Algorithm: add a small amount of intelligence by iteratively revising the tour to uncross each pair of crossing paths, which reduces the distance travelled and gets very close to the optimum solution (i.e. do this with 1000’s of cities and the result will be within about 1% of the optimum solution)
        • Example #2: Goodness graph
          • Problem:
             • Given a graph of a Goodness (objective) Function over the State Space (drawn on an x-y axis) that we are trying to Optimise for a problem
            • Find the Maximum point along it
          • Solution:
            • Approach #1:
               • Hill Climbing - start somewhere (say the right-most point) and improve the answer by travelling up the gradient toward a maximum point. The problem is we may get stuck at a Local Maximum when there are multiple maxima. Unstick from a Local Maximum to find the Global Maximum by using these techniques
               • Randomness Techniques to use to get over the problem of getting stuck:
                • Random Restart
                  • Start many particles at random positions on graph, then Hill Climb, and take best results
                • Genetic Algorithms
                  • Select positions based on Fitness Function to breed children that mutate and eventually converge on solution
                • Simulated Annealing
                   • Analogy of gradually cooling a particle from a very high temperature so it moves in progressively less random ways while hill climbing, until it converges on the Global Maximum
                • Stochastic Beam Search
                   • Combines the ideas of Random Restart, Genetic Algorithms and Simulated Annealing, such that multiple particles are released on the graph and their successors are generated with progressively less randomness, eventually converging on the maximum value
               • Note that Minimax is only relevant to Game playing, whilst UCS and BFS are difficult to apply to continuous 2D spaces since infinitely many steps are possible, and steps may need to be infinitesimally small to avoid skipping the solution
        • Example #3: N-Queens Problem
          • Problem
             • Given N queens placed randomly on an N*N chessboard; the goal is positions where they cannot attack each other, i.e. only one queen on any one horizontal, vertical, or diagonal
            • Find queen positions to reduce attacks down to zero
          • Solution
            • N-Queens Heuristic Function
              • Try the stupid thing first, then add intelligence until solve the problem
               • Move queens to separate columns first (one queen per column)
              • Constraint Situation
                • i.e. only allow each queen to move up/down its column (i.e. Hill Climbing so only allow move left/right along x-axis of graph, so not too many pieces to move)
              • Try Move queen subject to most attacks first (move that improves situation the most)
               • Test First Move - see how many attacks result from moving each of the queens to different positions up/down its column, and compare to find the move with the lowest resulting attacks
              • If more than one move reduces attacks by same amount then choose one randomly
              • Iterate until reach goal of 0 attacks
               • Local Minimum: reached if we encounter a situation where no move decreases the qty of attacks (every move keeps it the same or increases it, so no move improves the situation)
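             • A minimal Python sketch of one such greedy improvement step (illustration only; a board is assumed to be a list where board[col] is the row of the queen in that column):
                 import random

                 def attacking_pairs(board):
                     """Count pairs of queens sharing a row or a diagonal (columns are unique by construction)."""
                     n, count = len(board), 0
                     for c1 in range(n):
                         for c2 in range(c1 + 1, n):
                             if board[c1] == board[c2] or abs(board[c1] - board[c2]) == c2 - c1:
                                 count += 1
                     return count

                 def greedy_move(board):
                     """Move one queen up/down its own column to whichever square minimises attacks."""
                     n = len(board)
                     best_boards, best_score = [list(board)], attacking_pairs(board)
                     for col in range(n):
                         for row in range(n):
                             if row != board[col]:
                                 candidate = list(board)
                                 candidate[col] = row
                                 score = attacking_pairs(candidate)
                                 if score < best_score:
                                     best_boards, best_score = [candidate], score
                                 elif score == best_score:
                                     best_boards.append(candidate)
                     return random.choice(best_boards), best_score   # break ties at random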
        • Randomness Technique - Other Usages
          • Particle Filters Technique
          • Pattern Recognition with Hidden Markov Models
          • Monte Carlo Markov Chains
        • Random Restart (with use of Improvements)
          • Given the following:
            • Objective Function (y-axis) in State Space (x-axis) Graph
              • Shoulder - positive gradient interrupted part-way with no increase so can get stuck here like at Local Maximum
              • Global Maximum - largest maximum in graph
              • Local Maximum - smaller maximum along the graph
              • “Flat” local maximum
          • Goal:
            • Find Global Maximum and get Unstuck if encounter a Local Maximum by using Random Restart
          • Steps
            • Repeat the following:
              • Randomly chooses a particle on the graph
                 • Tabu Search trick: skip to the next random position if we detect it is a place we have been before, since we are tracking prior places
              • Heuristic #1 - Large Step size and reduce over time
                • Start with Large Step size and decrease it over time to ensure we reach the Global Maximum
              • Heuristic #2 - Simulated Annealing (Alternative to Heuristic #1)
              • Potential Issues
                 • Step Size Too Small Issue: may land on a flat plateau or Shoulder where there is not enough info available to determine which way to go (to improve the answer or check whether a maximum has been reached), which may result in wandering, so only allow a max amount of steps with no improvement in score; however, with a small enough step size we may think we are not improving and that the algorithm should stop
                • Step Size Too Large Issue
                   • Step-across-to-a-different-Hill Issue: a step may miss the intended hill altogether and go up a different hill with the same incline instead
                   • Infinite Loop Risk: if the step is big enough to land on the other side of the hill (where the incline reverses direction), and then big enough to step back over again, the search oscillates and never converges on an answer
                     • Solution: reduce the Step Size if any oscillation is detected
              • Follow up the positive gradient steps until reach Maximum
              • Check if its the Global Maximum
               • Stage Algorithm Variant: keep track of the maxima found so far and try to infer the shape of the problem to predict where a new Maximum may be among places not yet explored
               • Keep track of places we have been before
            • Find the Maximum of all the iterations
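           • A minimal Python sketch of Random Restart hill climbing on a bumpy 1-D objective (illustration only; the objective function, step size, and bounds are arbitrary placeholder choices):
               import math, random

               def hill_climb(f, x, step=0.1):
                   """Keep stepping to the better neighbouring point until no neighbour improves f."""
                   while True:
                       best = max((x - step, x + step), key=f)
                       if f(best) <= f(x):
                           return x            # local maximum (or plateau) reached
                       x = best

               def random_restart(f, n_restarts=20, lo=-10.0, hi=10.0):
                   """Run hill climbing from many random start points and keep the best result."""
                   best = max((hill_climb(f, random.uniform(lo, hi)) for _ in range(n_restarts)), key=f)
                   return best, f(best)

               if __name__ == "__main__":
                   # a bumpy objective with several local maxima
                   f = lambda x: math.sin(3 * x) - 0.1 * x * x
                   print(random_restart(f))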
        • “Simulated” Annealing
           • Dfn: Repeat the Annealing process according to a formula and timing to refine toward desired properties
          • Dfn: “Simulated” since we are borrowing ideas from physicists such as energy minimisation where:
             • When external conditions make molecules mobile and their mobility then slowly reduces, the molecules arrange themselves into a Minimum Energy Configuration, which results in regular Patterns
            • i.e. mud cracks where decrease in water in mud reduces molecule mobility over time and causes mud cracks to change into regular Patterns
             • i.e. iron molecules compact into a Minimum Energy Configuration, forming a lattice with other molecules like carbon, which is useful for sword making where a mixture of hardness and ductility is needed (the Annealing process heats to a temperature where atoms can move about and form new structures with carbon and other molecules, then cools the iron molecules repeatedly so they preserve alignment in the desired lattice structures and the swords are hard but not brittle; recrystallisation in sword making)
             • i.e. honeycombs with Minimisation by Design, where bees try to optimise storage space relative to the building materials used for the structure
            • i.e. lava cooling slowly and rocks form
            • i.e. columns of basalt rock where when it cools it shrinks and cracks into hexagonal lattice which is a Minimum Energy Configuration
           • Note: Use the idea of Heating and Cooling to get Unstuck from Local Maxima toward finding the Global Maximum, where the following applies until we converge on a solution:
            • High Temperature equates to Higher Randomness
            • Gradual Cooling results in Decreasing Randomness
          • Algorithm:

              for t=1 to inf do
            
                  # schedule of T values that is guaranteed to converge to the
                  # Global Maximum if it starts with very high values
                  # of T (temperature), causing lots of random graph motion as
                  # we take all random positions offered to us, even bad positions
                  # and climb up and down hills. then we'll decrease T slowly enough
                  # until it's very small with no randomness where we just
                  # Hill Climb to nearest peak normally
                  #
                  # when temperature T is VERY HIGH approaching infinity
                  # then probability (e^(change_in_E/inf) = e^0 = 1) approaches 1
                  # (even if change_in_E is negative)
                  #
                  # when T is VERY SMALL (the opposite situation) normal Hill Climbing occurs (instead of random moves)
                  #
                  # given temperature T approaching 0
                  # T = 0.01 (i.e. near 0)
                  # note: do not want T to be precisely 0 as will give undefined answer
                  #
                  # if new random pos. improves score E then take it (up the gradient)
                  # if new random pos. does not improve E then change_in_E = 0
                  # if new random pos. worsens score then change_in_E < 0 (down the gradient)
                  #   e^change_in_E/T = e^(-1/0.01) = e^-100 (very small number)
                  #   (so almost no chance of taking this suggested new random pos.)
                  #
                  # if get stuck at a flat plateau (i.e. shoulder) where change_in_E == 0, and regardless of T's value,
                  # then e^0 = 1 and the algorithm will revert back to taking new random positions again
                  # until walk off the plateau to a position with positive gradient that can improve E toward the maximum peak
                  #
                  # similarly in "real-world" annealing, when temperature
                  # is very high the particles jump around a lot at the
                  # start, so that if the particle gets stuck at a
                  # Local Maximum, then given sufficient time and
                  # randomness it has the ability to leave that peak
                  # (ignoring the fact that it is temporarily going the
                  # wrong way and ending up at a different part of the graph).
                  # while temperature T is high, the next random point
                  # may make it move both down and up the slope until it
                  # hits the Global Maximum
            
                  T <- schedule (t)
                  if T=0 then return current
            
                  # select points randomly in nearby region
            
                  next <- a randomly selected successor of current
            
                  # compute the change in the objective value E, then
                  # take the new point if it improves the value

                  change_in_E <- value(next) - value(current)
                  if change_in_E > 0 then current <- next
            
                  # select points also if poorer value but only with
                  # a probability
            
                  else current <- next only with probability e^(change_in_E/T)
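           • A minimal self-contained Python version of the pseudo-code above, maximising a toy 1-D objective (illustration only; the objective and the exponential cooling schedule are arbitrary choices, not the AIMA or lab code):
               import math, random

               def simulated_annealing_1d(f, x, schedule=lambda t: 10.0 * math.exp(-0.005 * t), t_max=10000):
                   """Accept every improving move; accept worsening moves with probability e^(dE/T)."""
                   for t in range(t_max):
                       T = schedule(t)
                       if T < 1e-9:                          # T ~ 0: stop (and avoid dividing by zero)
                           return x
                       nxt = x + random.uniform(-1.0, 1.0)   # a randomly selected successor of current
                       dE = f(nxt) - f(x)
                       if dE > 0 or random.random() < math.exp(dE / T):
                           x = nxt                           # uphill moves always; downhill moves sometimes
                   return x

               if __name__ == "__main__":
                   # a bumpy objective; high T early on lets the search escape local maxima
                   f = lambda x: math.sin(3 * x) - 0.1 * x * x
                   best = simulated_annealing_1d(f, random.uniform(-10, 10))
                   print(best, f(best))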
            
        • Local Beam Search
          • Given a graph, uses k multiple particles (positions) instead of just one
          • At each time-frame inspect all randomly generated neighbors of each particle and compare them with each other and retain the k best particles for the next iteration.
          • Shares information between each particle and iteration, which differs from Normal Random Restart that doesn’t share any info between iterations
          • Terminate if any particle reaches a Goal State
        • Stochastic Beam Search
          • Combines the ideas of Random Restart, Genetic Algorithms and Simulated Annealing, and
          • Same as Local Beam Search but more useful, since successors are chosen not just based on fitness, but also based on Randomness so we don’t get stuck in a Local Maximum (similar to Simulated Annealing within class of Genetic Algorithms)
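           • A minimal Python sketch covering both variants (illustration only; successors(s) and value(s) are assumed problem-supplied helpers, and value is assumed non-negative for the stochastic weighting):
               import random

               def beam_search(starts, successors, value, k=10, iterations=100, stochastic=False):
                   """Keep k candidate states; each iteration pools their successors and selects k survivors."""
                   beam = list(starts)[:k]
                   for _ in range(iterations):
                       pool = [s2 for s in beam for s2 in successors(s)]
                       if not pool:
                           break
                       if stochastic:
                           # stochastic beam search: survivors sampled in proportion to their value
                           weights = [max(value(s), 1e-9) for s in pool]
                           beam = random.choices(pool, weights=weights, k=k)
                       else:
                           # plain local beam search: keep the k best successors
                           beam = sorted(pool, key=value, reverse=True)[:k]
                   return max(beam, key=value)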
        • Genetic Algorithms (aka, second best solution to ANY problem)
          • Dfn: Genetic Algorithms is analogy to Natural Selection in Biology that uses Breeding and Mutation to find Optimal answer to problem (fancy version of Stochastic Beam Search that happens to make a nice analogy with biology)
          • Reducing Randomness over time (like in Simulated Annealing)
            • Fitness Function reduces the Randomness as each Generation gets closer to solution
            • Additional measures to Optimise and converge quickly such as:
              • Tuning Parameters
                • Crossover
                • Mutations qty
                • Parents qty
          • Convention to represent given board
            • Given one Queen in each column, encode a board with a number shown below each column that indicates where Queen is (i.e. grid distance of the Queen from the bottom) (i.e. 24748552, for 8x8 board)
            • Given 8x8 board, where there are 8 Queens, then after examining every possible pair of Queens that could attack each other we find there are 28 pairs.
                8 choose 2 = 8! / ((8-2)! * 2!) = 28
              
            • Goal is to reduce qty of attacking pairs of Queens to 0, so use Fitness Function as follows, such that when it is equal to 28 we know we have Won:
                Fitness Function for board
                    = Max qty of attacking pairs of Queens - qty of attacking pairs for the given board
                    = 28 - qty of attacking pairs of Queens for the given board
              
          • N-Queens Problem using Genetic Algorithms
            • Gene Pool Choose 4x random boards (i.e. 24748552, …, …, …)
            • Combine Genes from Gene Pool in different Patterns to Breed better boards
            • Choosing Genes from Gene Pool
             • Evaluate each board state in the initial population using the Fitness Function to get a Fitness Score (aka qty of non-attacking pairs of Queens) for each board; the score is calculated by going through each column from left-to-right and, for each Queen, looking only to the right to see if it attacks another (to prevent counting duplicates)
            • Proportional Probabilities (%) based on the Fitness Scores indicative of how likely each board is to be chosen as a Parent and breed, when then list them in descending order of % (i.e. add the 4x scores and Normalise each into a percentage)
            • Randomly Select 4x Parents from the 4x Boards and order from best to worst (best on top) (i.e. given 4x boards with 31%, 29%, 26%, and 14% respectively, by rolling a 100 sided dice to select 1st Parent, and if it rolls between 1 and 31 we select the 1st board, 32 to 60 (i.e. 31% + 29% = 60%) we select 2nd board, 61 to 76 we select the 3rd board, and 77 to 100 we select the 4th board)
             • Create 2x Children from each of the 4x Parents using a Crossover Process (see Lesson 8.21): group the Parents into Pairs and pick a random Crossover Point (between two columns) for each Pair. Create the 1st Child by combining the part of Parent 1 to the left of the Crossover Point with the part of Parent 2 to the right of it, then repeat by combining the left part of Parent 2 with the right part of Parent 1 (for the 1st Pair), and repeat for the remaining Pairs… This way at least one of the Children may inherit the best attributes of the Parents
              • Change Parameters in Genetic Algorithms to try and Optimise how quickly we converge to a good result, such as the qty of Children that each Parent may create
             • Note: Survival of the fittest: with an increasing number of Generations of Children we eventually evolve an 8-Queens game board that solves the problem; boards with fewer attacking Queens have a higher Fitness Function value and a higher chance of creating children, whereas boards with more attacking Queens have a lower Fitness Function value and less chance of creating children
            • Genetic Algorithms Mutation
              • Problem
                 • A Parent that the Fitness Function gives a low Proportional Probability may rarely be selected to breed, yet that Parent may hold a critical piece that can never be produced by breeding the other Parents
              • Solution
                 • Genetic Mutations add Randomness after Crossover with a Mutation step that avoids the risk of never reaching the goal state (similar to occasionally choosing a random direction in Simulated Annealing, and similar to the randomness in Stochastic Beam Search), whereby each digit of a board’s encoding has a small but significant chance of mutating into a different digit
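           • A minimal Python sketch of the whole Genetic Algorithm loop for 8-Queens (illustration only; the population size, mutation rate, and helper names are arbitrary choices). It reuses the encoding above, where digit i is the row of the queen in column i, with fitness 28 meaning solved:
               import random

               N, MAX_PAIRS = 8, 28                     # 8 choose 2 = 28 possible pairs of queens

               def fitness(board):
                   """28 minus the number of attacking pairs: 28 means the board is solved."""
                   attacks = sum(1 for c1 in range(N) for c2 in range(c1 + 1, N)
                                 if board[c1] == board[c2] or abs(board[c1] - board[c2]) == c2 - c1)
                   return MAX_PAIRS - attacks

               def crossover(p1, p2):
                   cut = random.randrange(1, N)         # crossover point between two columns
                   return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

               def mutate(board, rate=0.1):
                   """With small probability, a digit mutates into a different digit."""
                   return [random.randrange(N) if random.random() < rate else row for row in board]

               def genetic_algorithm(pop_size=4, generations=5000):
                   population = [[random.randrange(N) for _ in range(N)] for _ in range(pop_size)]
                   for _ in range(generations):
                       scores = [fitness(b) for b in population]
                       if MAX_PAIRS in scores:
                           return population[scores.index(MAX_PAIRS)]
                       # fitness-proportional parent selection (+1 avoids an all-zero-weight edge case)
                       parents = random.choices(population, weights=[s + 1 for s in scores], k=pop_size)
                       children = []
                       for i in range(0, pop_size, 2):
                           c1, c2 = crossover(parents[i], parents[i + 1])
                           children += [mutate(c1), mutate(c2)]
                       population = children
                   return max(population, key=fitness)  # best found if no perfect board appeared

               if __name__ == "__main__":
                   print(genetic_algorithm())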
    • Lab 8 - Simulated Annealing
      • Discussion Forums https://discussions.udacity.com/c/nd889-search-optimization/nd889-simulated-annealing
      • Use simulated annealing to implement the algorithm in a Jupyter notebook and using it to solve the Traveling Salesman Problem (TSP) between US state capitals.
      • Steps
        • Fork https://github.com/udacity/AIND-Simulated_Annealing
        • Open AIND-Simulated_Annealing.ipynb with: jupyter notebook AIND-Simulated_Annealing.ipynb
        • Follow instructions from AIMA to implement Simulated Annealing in the Jupyter Notebook
          • Simulated Annealing pseudo-code https://github.com/aimacode/aima-pseudocode/blob/master/md/Simulated-Annealing.md
          • Simulated Annealing code https://github.com/aimacode/aima-python/blob/master/search.py
              def simulated_annealing(problem, schedule=exp_schedule()):
                  """[Figure 4.5] CAUTION: This differs from the pseudocode as it
                  returns a state instead of a Node."""
                  current = Node(problem.initial)
                  for t in range(sys.maxsize):
                      T = schedule(t)
                      if T == 0:
                          return current.state
                      neighbors = current.expand(problem)
                      if not neighbors:
                          return current.state
                      next = random.choice(neighbors)
                      delta_e = problem.value(next.state) - problem.value(current.state)
                      if delta_e > 0 or probability(math.exp(delta_e / T)):
                          current = next
            
        • Install Jupyter Notebook with Miniconda
          • http://jupyter.readthedocs.io/en/latest/install.html
          • pip3 install --upgrade pip
          • pip3 install jupyter
    • Lesson 9 - Constraint Satisfaction:
      • Reading - AIMA Chapter 6
      • Techniques
         • Implement these algorithms to optimise searching for solutions to constraint satisfaction problems, in the following order: start with the stupidest one first, then add intelligence as required
          • Backtracking Search (aka Simple Search) (i.e. stupidest algorithm)
            • i.e. States are defined by values assigned so far
            • Initial State is empty assignment
            • Check if current assignment is goal
            • Assign a value to unassigned variable that does not conflict with current assignment
            • Dead End is reached when no legal assignments (if so, backtrack to previous state and try different assignment)
            • Repeat recursively until find answer or have tried all possible assignments and reported failure
          • Forward Checking
            • Tracking remaining legal values for each unassigned Variable, so if ever make decision causing unassigned Variable to not be able to have a value, we stop search and backtrack
            • Benefit: If stuck (i.e. can’t assign a colour to a territory) can backtrack earlier than would have if continued assigning variables (i.e. early warning system that our search is going down the wrong branch)
            • Weakness: Does not provide early detection for all failures (i.e. detect when two unassigned Variables are adjacent and only the same Domain is available to populate them with, so we need to use other Constraint Propagation repeatedly to enforce all local constraints)
           • Arc Consistency (aka Simple Constraint Propagation)
             • A Variable in a Constraint Satisfaction Problem is considered Arc Consistent wrt another Variable if, for every value still available to the First Variable, some value remains available to the Second Variable after that assignment. The network is Arc Consistent if all Variables in the graph satisfy this condition.
             • Benefit: Potentially saves a lot of unnecessary deep searching for more complicated problems
            • Example (see Map Colouring example below)
          • Heuristics
            • Minimum Remaining Values (MRV) (see Map Colouring example below for explanation)
            • Degree Heuristic (see Map Colouring example below)
            • Least Constraining Value
               • Assign the Domain (value) to a Variable that least constrains future choices (i.e. reuse previously used Domains unless absolutely necessary to use a new one) to avoid Dead Ends and Backtracking (i.e. rules out the fewest values in the remaining Domains, improving Backtracking Search algorithm efficiency)
      • Example:
        • Airport departure scheduling:
          • Given:
            • 2500 arrivals / day
            • 2500 departures / day
            • Plane departs every 30 sec
          • Problem:
            • How schedule all the flights?
      • Example: Crypto-arithmetic Q1
        • Given
          • Each letter represents a digit
           • Different letters represent different digits
          • None start with 0
        • Setup problem by listing
        • Variables F T U W R O X1 X2 X3
        • Domains { 0,1,2,3,4,5,6,7,8,9 }
        • Constraints/Rules: alldiff { F, T, U, W, R, O }
          • Global Constraints
            • 1) All letters different. No letter can represent same digit
             • O + O == R + 10*X1 (X1 is the carry into the next column, the tens place)
               • where X1 is 0 or 1
             • X1 + W + W == U + 10*X2
               • where X2 is the carry into the hundreds place
             • X2 + T + T == O + 10*X3
               • where X3 is the carry into the thousands place
            • X3 == F
            • F != 0 (since problem does not allow leading 0’s)
          • Preference Constraints
            • Unary Constraints
              • Show constraints as square boxes on Constraint Hypergraph
                • Top square represents Global Constraint 1)
         • Use the Backtracking Search procedure improved with the Minimum Remaining Values (MRV) Heuristic, Degree Heuristic, and Least Constraining Value Heuristic to minimise backtracking
          • Given Variables, Domains, and Constraints
          • Select unassigned Variable (i.e. X3), with Domain of 0 or 1
          • Assign Variable X3 with a Domain value (i.e. 1), since can’t choose 0 as would fail constraint and not pass Forward Checking
          • Apply MRV Heuristic and select unassigned Variable (i.e. F) since only has 1 remaining value and assign it with Domain value 1
           • Now, the MRV Heuristic finds that X2 and X1 are tied with min remaining values at 2 (i.e. 0 or 1), so choose 0 for both (either value passes Forward Checking)
           • Now, O must be 4, since:
               O + O == R + 10*X1, with X1 == 0
               2*O == R
               O == R/2
               so R is even and O <= 4 (since R max is 8); O == 4 (with R == 8) passes Forward Checking

               X2 + T + T == O + 10*X3, with X2 == 0 and X3 == 1
               2*T == O + 10
               T == 5 + (O/2)
               so if O == 4, then T == 7

               X1 + W + W == U + 10*X2, with X1 == 0 and X2 == 0
               U == 2*W
               so U must be an even number less than 9, so
               choose W == 3 giving U == 6 (passes Forward Checking)
            
        • Outcome (see lecture notes for details https://www.youtube.com/watch?time_continue=97&v=iqJbOGWEd6Y)
          • X3: {0,1} => X3 = 1
          • X2: 0
          • X1: 0
         • Answer i.e. F=1, O=4, R=8, T=7, U=6, W=3
               TWO
             + TWO
             =====
              FOUR
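         • A minimal Python brute-force check of TWO + TWO == FOUR (illustration only; the lesson solves it with constraint propagation rather than enumeration, and the puzzle has more than one valid assignment):
             from itertools import permutations

             def solve_two_two_four():
                 letters = "TWOFUR"
                 for digits in permutations(range(10), len(letters)):
                     a = dict(zip(letters, digits))          # distinct digits guaranteed by permutations
                     if a["T"] == 0 or a["F"] == 0:          # no leading zeros
                         continue
                     two = 100 * a["T"] + 10 * a["W"] + a["O"]
                     four = 1000 * a["F"] + 100 * a["O"] + 10 * a["U"] + a["R"]
                     if two + two == four:
                         yield a

             for assignment in solve_two_two_four():
                 print(assignment)   # includes T=7, W=3, O=4, F=1, U=6, R=8 (734 + 734 = 1468)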
      • Example: Map Colouring
         • A Constraint Satisfaction Problem; when Preference Constraints are added it becomes a Constraint Optimisation Problem, which is often solved with Linear Programming (e.g. Factory Job Scheduling)
          • e.g. Map Colouring, Sudoku, Car Assembly, Class Scheduling, Spreadsheets, Transport Scheduling, Floor Planning, N-Queens (1000-Queens)
        • Given
          • Add colours to each territory in Australia
          • Ensure no neighboring territory use same colour
        • Setup problem by listing:
          • Variables WA, NT, Q, NSW, V, SA, T
          • Domains Di = { green, blue, orange }
           • Constraints/Rules: adjacent regions must have different colours (i.e. list all pairwise constraints such as WA != NT), OR list all allowable colour assignments for each pair of adjacent territories
            • Preference Constraints
              • Unary Constraints (restrict value of given variable) (i.e. WA not orange)
              • Binary Constraints (relate two variables at most)
                • Constraint Graph uses for representation visually
                   • Nodes - the variables (i.e. WA, NT, ...). Note TAS is independent from the rest of the problem
                   • Lines - show constraints between variables
                  • Apply General Constraint Satisfaction Problem Algorithms to the Constraint Graph Structure to search quickly for answer
              • Multiple Constraints (i.e. 3 or more variables)
              • Soft Constraints (i.e. prefer to teach on Tues and Thu)
        • Solution:
          • An assignment of colours to each of the territories that satisfies all constraints
        • Concrete Example Solution (i.e. using Backtracking Search ONLY)
          • Given Variables, Domains, and Constraints
          • Select unassigned Variable (i.e. WA)
          • Assign Variable a Domain orange from possible (i.e. green, blue, orange)
          • Select unassigned Variable (i.e. NT)
          • Assign Variable a Domain blue from possible (i.e. green, blue)
           • Select unassigned Variable (i.e. QLD)
           • Assign Variable a Domain green from possible (i.e. green, orange)
          • Dead End when try unassigned Variable (i.e. SA)
          • Backtrack up to previous branch where only orange and green had been used
          • Continue until find solution
            • Improve efficiency of algorithm
         • Concrete Example Solution (i.e. Backtracking Search improved with the Minimum Remaining Values (MRV) Heuristic, Degree Heuristic, and Least Constraining Value Heuristic to minimise backtracking)
          • Given Variables, Domains, and Constraints
          • Structured CSPs Algorithm (Optimisation)
            • Reference: Graph Algorithm: Topological Sorting https://courses.cs.washington.edu/courses/cse326/03wi/lectures/RaoLect20.pdf
            • Benefit: Save time when deep searching
            • Approach: Investigate “Structure” of problem and decompose into independent sub-problems (i.e. TAS is separate), which removes a level from the search tree
            • Example 1:
              • Given problem with:
                • n = 80 possible Variables
                • d = 2 possible values each (Domain makes Variables binary in nature)
                • c = 20 possible Variables in each sub-problem (after dividing into 4x problems of 20 each)
                • d^n = 2^80 (original Search Space)
                  • n/c * d^c = 4 * 2^20 (afterward Search Space)
             • Example 2: Tree-Structured CSPs Algorithm (fast method, but must convert the Constraint Graph into a tree first)
              • Given problem with No Loops, can solve problem in O(n*d^2) Time instead of O(d^n) Time (where n is qty Variables, and d is size of Domain) by:
                • If Constraint Graph is not a “tree”, then try to modify the problem by assigning a Domain to some Variables first in order to change the Constraint Graph into a “tree” (i.e. if use MRV first and assign Domain to unassigned Variable so that Variable is removed from the possibilities of its neighbors then the problem becomes a “tree” and can be solved using this fast method)
                • Choose a Variable as root
                • Choose an ordering of Variables from the root to the leaves such that each Variable node appears after its parent node in the tree
                • Start at end Variables (leaves) and make each parent Arc Consistent going up the tree until we get to the root, and if not possible, report failure of the problem
                • Start at root of tree and choose any Domain assignment available until we get to the leaves (since when the tree is Arc Consistent any of the available assignments will solve the problem)
          • Forward Checking Status:
            • All territories could have any of the 3x possible colours
          • Select unassigned Variable (i.e. WA)
          • Assign Variable a Domain orange from possible (i.e. green, blue, orange)
          • Forward Checking Status Update:
            • All territories could have any of the 3x possible colours, EXCEPT:
              • Orange removed from colours available to NT and SA
          • Select unassigned Variable (i.e. NT)
          • Assign Variable a Domain green from possible (i.e. green, blue)
          • Arc Consistency (aka Simple Constraint Propagation check)
            • When assigning green to NT, ensure Arc Consistency by potentially causing a chain of changes to nested neighbors. First checking NT’s unassigned Variable primary-neighbors (i.e. QLD, SA) and check if assignment of “green” reduces qty of Domains available to them, and if so, remove “green” from the possible Domains available to each of them, then check the unassigned Variable secondary-neighbors and so forth to propagate remaining Domains. Repeat until no more unassigned Variables, or if a neighbor is found to not have any remaining Domains available in which case we force the search to try another option for NT. Note: Entire region of combined territories is Arc Consistent if there are no regions without a possible colour
          • Forward Checking Status Update:
            • etc
          • Optimise and improve Backtracking efficiency by choosing next Variable using a Heuristic as follows:
            • Minimum Remaining Values (MRV) Heuristic to assign Domain to a Variable that has fewest legal moves (i.e. remaining unused Domains)
              • i.e. Select unassigned Variable SA
            • Degree Heuristic used if there is a Tie when performing MRV, and involves choosing Variable with most Constraints on remaining Variables (i.e. has the most borders with other territories so is most complex) and we will run into problems sooner rather than delaying them
          • Optimise and improve Backtracking efficiency by choosing next Domain using a Heuristic as follows:
             • Least Constraining Value Heuristic used to assign the Domain to a Variable that least constrains future choices (i.e. reuse previously used Domains unless absolutely necessary to use a new one) to avoid Dead Ends and Backtracking
               • i.e. Assign the Variable the Domain orange again as it least constrains future choices (leaving blue unused)
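         • A minimal Python sketch of Backtracking Search with Forward Checking and the MRV Heuristic for this Australia example (illustration only; the data structures and helper names are my own, not the lab's csp.py API):
             NEIGHBORS = {
                 "WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"], "SA": ["WA", "NT", "Q", "NSW", "V"],
                 "Q": ["NT", "SA", "NSW"], "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"], "T": [],
             }
             COLOURS = ["green", "blue", "orange"]

             def backtrack(assignment, domains):
                 if len(assignment) == len(NEIGHBORS):
                     return assignment
                 # MRV heuristic: pick the unassigned variable with the fewest remaining values
                 var = min((v for v in NEIGHBORS if v not in assignment), key=lambda v: len(domains[v]))
                 for colour in domains[var]:
                     # forward checking: prune this colour from the neighbours of var
                     new_domains = {v: [c for c in d if not (v in NEIGHBORS[var] and c == colour)]
                                    for v, d in domains.items()}
                     if all(new_domains[v] for v in NEIGHBORS if v not in assignment and v != var):
                         result = backtrack({**assignment, var: colour}, new_domains)
                         if result:
                             return result
                 return None                              # dead end: backtrack

             print(backtrack({}, {v: list(COLOURS) for v in NEIGHBORS}))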
      • Example: N-Queens (Constraints Satisfaction Problem)
        • Iterative Improvement Algorithms
          • Examples: Genetic Algorithms or Hill Climbing Algorithms
          • Most efficient when either:
            • Many solutions OR
            • Few solutions
          • Very Hard (requiring significant CPU time) to solve at a Critical Ratio point between qty of Variables for a given qty of Constraints
          • Solved by randomly assigning values to Variables and iteratively improve the assignments until reach solution. Minimising Conflict Method of minimizing qty of attacks with each iterative improvement can find solution for 4-Queens problem which has 4^4 = 256 possible states. It can also solve 1,000,000-Queens in average of 50 moves (where large qty of solutions to problem)
    • Lab 9 - Constraint Satisfaction
      • Discussion Forums https://discussions.udacity.com/c/nd889-search-optimization/nd889-constraint-satisfaction
      • 8-Queens Puzzle Solution
        • Constraint Satisfaction Problem of solving N-queens problem using symbolic constraints and backtracking search in a Jupyter notebook.
      • Steps
        • Clone https://github.com/ltfschoen/AIND-Constraint_Satisfaction
         • Open AIND-Constraint_Satisfaction.ipynb with: jupyter notebook AIND-Constraint_Satisfaction.ipynb
        • Follow the instructions to complete the constraints for the N-Queens CSP and implement Backtracking-Search from AIMA in the notebook.
          • CSP pseudo-code https://github.com/aimacode/aima-pseudocode/blob/master/md/Tree-CSP-Solver.md
          • CSP code https://github.com/aimacode/aima-python/blob/master/csp.py
        • Install Jupyter Notebook with Miniconda
          • http://jupyter.readthedocs.io/en/latest/install.html
          • pip3 install --upgrade pip
          • pip3 install jupyter
    • Lesson 10 - Logic and Reasoning:
      • Dfn:
        • Logic - A logic is a way of describing the world. If logical statements about world selectively capture what want to know about world correctly, then we may reason within the logic to come to conclusions (without having to reference the world again)
      • AI Background
        • 50’s & 70’s - Requirement for better languages and systems for logical statements
          • Mistakes:
            • Trying to make statements only in Boolean Logic (where only True or False) even though world is uncertain and Probability Logic is better fit with world
        • 80’s - Better algorithms for Probability Logic
          • Issues
            • Too laborious writing by hand
        • 90’s - More data available online
          • Opportunities
            • Learning about world from data instead of writing rules by hand
      • Uncertainty Domains
        • Goal: Eliminate uncertainty of real-world and solve significant problems using logic
        • Future of Planning Algorithms:
          • Learning from examples
          • Transferring learning across domains (i.e. learn in one domain and execute in another)
          • Interactive Planning where human-machine team solves problems together and being able to explain things in terms that a human can understand, allowing humans to add things that were overlooked in the original problem statement (i.e. need correct UI interface as well as correct answer)
        • Examples:
          • Logistics
            • Planning Algorithms for when actions or world is uncertain such as for all deliveries of packages using a large fleet of vehicles
              • Describe as logic problem where want find solution minimising time and expense
      • Valid Sentence
        • Dfn: True in every possible Model for every combination of values of Propositional Symbols
      • Satisfiable Sentence
        • Dfn: True in some Models, but not necessarily all Models
       • Unsatisfiable Sentence
         • Dfn: False for all Models
      • Logic Types
        • Propositional Logic
          • Truth Tables for all Logical Connectives
            • Lists all four possibilities for combinations of Propositional Symbols
            • Note: “v means either OR both”
            • Note: “> means P implies Q” (aka if P then Q)
            • Note: Meaning in Logic is defined explicitly by Truth Table
                P  Q  !P  P ^ Q  P v Q  P > Q  P <> Q
                F  F  T   F      F      T      T
                F  T  T   F      T      T      F
                T  F  F   F      T      F      F
                T  T  F   T      T      T      T
              
          • Example:
            • Given the following propositions
              • [O] means Five is odd number
              • [P] means Paris is capital of France
              • [E] means Five is even number
              • [M] means Moscow is capital of France
            • Connectives:
              • O > P (i.e. IF Five is odd number THEN implies Paris is capital of France) IF True THEN implies True
              • E > M (i.e. IF Five is even number THEN implies Moscow is capital of France) IF False THEN implies False
                 • Note: In terms of formal logic both statements are True: according to the Truth Table definition, P > Q is True when P and Q are both True, and also True whenever P is False (e.g. when both P and Q are False)
          • Example: Alarm Problem
            • Given propositional Symbols B, E, A, and M, where our belief in Propositional Logic is that each is either True, False, or Unknown (similar to in Probabilistic Models, but dissimilar in that our degree of belief in Propositional Logic is not a number). They correspond to proposed Events:
              • [B]urglary occurring
              • [E]arthquake occurring
              • [A]larm going off
              • [M]ary calling
              • [J]ohn calling
            • Propositional Logical Sentences are either True or False with respect to a Model of the world (a set of True/False values for all the Propositional Symbols), (i.e. Model may be the set {B: True, E: False}). Truth Tables used to define Truth of a Sentence in terms of the Truth of the Symbols with respect to the Models. They may be created by combining these Symbols with Logical Constants of True and False, and Logical Operators, where Connectives include: “^ means AND”, “v means OR”, “> means Implies”, “<> means Equivalent” (biconditional)” (when one true the other is true, and when one is false the other is false), and “! means Negation”
              • (E v B) > A (i.e. imply Alarm is True whenever Earthquake and Burglary is True) i.e. E OR B == A
              • A > (J ^ M) (i.e. imply both John and Mary call when Alarm is True) i.e. A == J AND M
               • J <> M (i.e. John calls if and only if (iff) Mary calls, so J is equivalent to M) i.e. J == M
               • J <> !M (i.e. John calls if and only if Mary does not call, and vice versa, so J is equivalent to not M) i.e. J == !M
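             • A minimal Python sketch that enumerates every Model of these Symbols to classify a Sentence as valid, satisfiable, or unsatisfiable (illustration only; Sentences are written as plain Python lambdas over a dict of truth values, and implies(p, q) is just (not p) or q):
                 from itertools import product

                 SYMBOLS = ["B", "E", "A", "J", "M"]
                 implies = lambda p, q: (not p) or q
                 iff = lambda p, q: p == q

                 def classify(sentence):
                     """Valid: true in every model; Satisfiable: true in some model; Unsatisfiable: true in none."""
                     truths = [sentence(dict(zip(SYMBOLS, values)))
                               for values in product([True, False], repeat=len(SYMBOLS))]
                     return "valid" if all(truths) else "satisfiable" if any(truths) else "unsatisfiable"

                 s1 = lambda m: implies(m["E"] or m["B"], m["A"])    # (E v B) > A
                 s2 = lambda m: implies(m["A"], m["J"] and m["M"])   # A > (J ^ M)
                 print(classify(s1))                                 # satisfiable (but not valid)
                 print(classify(s2))                                 # satisfiable (but not valid)
                 print(classify(lambda m: m["A"] or not m["A"]))     # valid
                 print(classify(lambda m: m["A"] and not m["A"]))    # unsatisfiable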
        • Propositional Logic Limitations
           • Advantage: Inference Mechanisms that are efficient at determining Validity and Satisfiability
          • Limitations
            • Only able to handle True and False values (in Truth Table), with no capability to handle Uncertainty like covered in Probability Theory
            • Only talks about True or False events in the world
              • Does not talk about objects with properties such as size, weight, etc.
              • Does not talk about relations between objects
               • Does not include any shortcuts to succinctly talk about multiple things happening (i.e. in a vacuum world with 1000 locations, to say every location is free of dirt we need a conjunction of 1000 propositions; we can’t just write a single sentence saying all locations are clean)
        • First-Order Logic (FOL)
          • Dfn: First-Order Logic means that relations are on Objects but not on Relations
            • Note: Every model must have at least one object, and multiple variables may refer to the same object
            • Note: P(x) means there may exist an x that is a member of P, where P may be an empty relation
             • Note: When defining something (e.g. adjacency or membership), don’t write Sentences with just a one-way implication; include the condition in both directions (If and Only If) so the negative of the assertion can also be proved
          • Dfn: Higher-Order Logic (HOL) means that relations are on Relations
            • i.e. define notion of a transitive relation about relations itself
             • Valid statement in Higher-Order Logic (but invalid in FOL):
              For All R, transitive of R is equivalent to, ...
              For All a,b,c, R(a,b) and R(b,c) implies R(a,c)
            
              ∀R Transitive(R) <=> (∀a,b,c R(a,b) ^ R(b,c) => R(a,c))
            
          • About: Overcomes Limitations encountered in Propositional Logic

              Logic varies in what you can say about the world, and what you can believe
              about what's been said about the world
            
                                      World                                   Belief
                                      (Ontological Commitment)                (Epistemological Commitments)
              ======================================================================================
              First Order Logic       Rel'ships, Obj's, Functions on Obj's    True, False, Unknown (?)
              (extension on Prop logic)
              Propositional Logic     Facts (Symbols, Variables)              True, False, Unknown (?)
              Probability Theory      Facts (")                               [0..1] (real numbers)
            
            
              Atomic           Factored                       Structured
              ===========================================================
            
              Search &
              Problem Solving  Variables                      Relations b/w Objects
                               (i.e. Propositional Logic &    (i.e. Programming languages, SQL)
                                Probability Theory)
            
            • Atomic Representation
               • Dfn: the representation of a state is just the individual state itself, with nothing inside it (no internal structure to states)
               • i.e. given a transition from State A to State B we can only say whether they are identical or not, and whether a state is a goal state
              • Examples: Search, Problem Solving
            • Factored Representation
              • Dfn: break up world into set of True/False facts
               • Dfn: a representation in which an individual state of the world is factored into several Variables (i.e. B, E, A, M, J), which may be boolean (as in Propositional Logic) or of other types
              • Examples: Propositional Logic, Probability Theory
            • Structured Representation
              • Dfn: Most complex representation. Individual state is not just a set of values for variables, but also includes rel’ships b/w objects, a branching structure, complex representations, relations b/w different objects
              • Examples: Programming languages, structured databases (with SQL over them)
          • Process
            • Definitions
              • Function defined as mapping from objects to objects
             • A Model in FOL differs from a Propositional Logic Model, where the value for each Propositional Symbol in the Model corresponds to what is going on in a possible world, i.e. { P: True, Q: False }
            • Model in FOL instead starts with Set of Objects and Set of Constants that refer to those Objects but don’t need one-to-one correspondence b/w Constants and Objects (i.e. multiple Constants may refer to same Object, for instance CEE could refer to Object C3)

              • Set of Objects include: 4x tiles A1 C3 B3 D2 Note: with numbers 1,2,3 that are also Objects in Model

              • Set of Constants: {A, B, C, D, 1, 2, 3, CEE}

              • Set of Functions:

                “Number” Function (i.e maps from tile to number on the tile) Defined as mapping:

                                  { A -> 1, B -> 3, C -> 3, D -> 2 }
                
              • Set of Relations:

                “Above” Relation (i.e. where in this model of the world, this relation is a set of two pulls where A is above B, and C is above D) Note: A binary relation holding b/w two objects, say where one block is above another block

                                  { [A, B], [C, D] }
                

                 “Vowel” (Unary Relation) if Vowel is True only for the Object A, then that’s a set of Tuples of length 1 containing just A

                                  { [A] }
                

                “Rainy” (Relation over no objects, just refers to current situation) if not rainy, represent as empty set of no tuples:

                                  { }
                
                          if is raining, represent with Singleton Set,
                          and since Arity of rainy is zero, there'd be
                          zero elements in each of the tuples
                                  { [] }
                
            • Syntax in FOL
              • Atomic Sentences - describe facts as either True or False and are predicates corresponding to Relations i.e.
                  Vowel(A)
                  Above(A, B)
                  2 = 2          (i.e. Equality Relation in every model to distinguish)
                
                • Operators Sentences may be combined with all operators from Propositional Logic (i.e. ^, v, !, >, <>, () )
                • Quantifiers Complex Sentence Types (unique to FOL)
                  • ∀x - For All, followed by Variable it introduces such as x
                  • Ǝy - There Exists, followed by Variable it introduces such as y
                  • Note: If no Quantifier provided, its a shortcut meaning For All
                  • Note: Typically don’t want to say something about every object in the domain since objects can be so different, instead we usually want to say something about a particular type of object (i.e. Vowels)
                  • Note: Typically when have a “There Exists an x” or any other Variable, it typically doesn’t go with a Conditional since only describing one object
                 • Example Sentences typical in FOL:
                   • ∀x Vowel(x) => Number of (x) = 1 (i.e. For All x, if x is a Vowel, then the Number of x equals 1)
                   • Ǝx Number of (x) = 2 (i.e. There Exists an x such that the Number of x equals 2; means there is some Object in the Domain to which the Number of Function applies and has a value of 2, but without saying what the Object is)
              • Terms - describe Objects, may be Constants, Variables, or Functions i.e.
                  A,B,2         (i.e. Constants)
                  x, y          (i.e. Variables)
                  Number of (A) (i.e. Functions - just another expression
                                 referring to the same object as 1, at least
                                 in the model shown previously)
                
          • Example: Vacuum world represented using FOL
            • Add:
                Locations
                =========
                A - Left Location
                B - Right Location
                V - Vacuum
                D1, D2 - Dirt
              
                Relations
                =========
                Loc - True of any location
                Vacuum - True of the vacuum
                Dirt - True of dirt
                At(o,l) - True of object and location
              
            • i.e. Say “vacuum is at location A”: At(V,A)
            • i.e. Say “no dirt at any location”:
              • Note: This Sentence still holds for 1000’s of locations, which is the power of FOL

                  For All d and For All l, if d is Dirt and l is a Location,
                  then d is not at l.

                  ∀d ∀l Dirt(d) ^ Loc(l) => ¬At(d,l)

            • i.e. Say “vacuum is in a location with dirt” (without specifying the location)
                There Exists an l, and There Exists a d, such that d is the Dirt,
                and l is the Location, and the vacuum is at the location,
                and the dirt is at that same location
              
                Ǝl Ǝd Dirt(d) ^ Loc(l) ^ At(V,l) ^ At(d,l)
              
      • Expert Systems
        • Dfn:
          • Expert Systems are common throughout the world
          • Algorithms to help understand how Expert Systems work
        • Examples:
          • Decisions
            • Superhuman decision made by military expert system such as deciding where to place a supply base to save money (higher return than their investment in AI)
      • Resolution Algorithm - Inferring new knowledge from a knowledge base
      • GraphPlan - Used to make planning practical for new range of problems
      • Value Iteration Concept - Key concept for Markov Decision Processing
    • Lesson 11 - Planning:
      • About:
        • Problem Solving AI-agent
          • Given a State Space, Start Space, and Goal
          • Planning ahead, thinking only to figure out path to find goal, without interacting/sensing world
          • Executes path
        • Real-Life differs from Problem Solving AI-agent and requires:
          • Avoiding blindfolded execution and getting lost by:
            • Interleaving planning and executing
            • Feedback from environment required, cannot just plan ahead and derive whole plan
      • Planning and Executing
        • Interleaving required due to Properties of the environment that make it difficult:
        • Change Perspective to overcome Difficulties by planning in Space of Belief States instead of Space of World States
        • Difficulties
          • Stochastic Environments - don’t know outcome of performing an action, so need to deal with Contingencies, i.e. try turn right and wheel slipped, skidding, feet not move forward entirely straight, or traffic lights red (prevents go forward)
          • Multiagent Environments - i.e. other cars or people obstacles, plan about what they will do and react to unexpected (only know at Execution time, not at Planning time)
          • Partial Observability i.e. given plan to go from A to S to F to B, which appears it will work [A, S, F, B] but suppose we know at S the road to F sometimes closed (and sign there tells us), but when start off we cannot read that sign i.e. when start, we may not know what state we are in (i.e. may know we are in state A, but may not know if road is open or closed until we get to S in order to know if can continue down that path or not)
          • Unknown i.e. due to lack of knowledge on our part, where some model of world is unknown (such as map or GPS that is inaccurate or incomplete), then may not be able to execute
          • Hierarchical i.e. may need to deal with plans that are hierarchical at a high level steps that cannot be executed (i.e. action of go from A to S) without detailed low level plan steps (i.e. turn right, left, accelerate more, etc) being scheduled during execution
      • Example 1: Vacuum Cleaner (Deterministic environment)
        • Add:
            Locations
            =========
            A - Left Location
            B - Right Location
            V - Vacuum
            D1, D2 - Dirt (Yes or No)
          
            Possible States == 2 * 2 * 2 == 8
          
            State Space diagram - shows 3x possible Actions transitioning between states
                - Moving Right
                - Moving Left
                - Dirt Sucking
          
          • Given “Fully Deterministic” and “Fully Observable” world it is easy to plan i.e. Start State to Goal State where both sides clean

            0) Initially in Left Location
            1) Execute Action to Dirt Suck
            2) Move to Right Location
            3) Execute Action to Dirt Suck
            4) Transition to Goal State of no dirt

          • Given “Unobservable” (Sensorless) and Deterministic world where Vacuum’s sensors stop working, so can’t determine current State (i.e. what Location it is in or whether any Dirt).

            • How does the AI-agent represent the State of the world?

            • Solution: Search in the State Space of Belief States (instead of the State Space of actual world states), i.e. we believe we are in one of the 8 states, and when we execute a subsequent Action we go to another Belief State and gain knowledge about the world even without sensing, ending in either a known or an unknown state

              • i.e. Move Right, according to State Space of Belief States diagram we know we’re in Right-Hand Location (either Moved from Start to Right Wall or we were already at the Right Wall and stayed there)
                • Note: Reduces possibilities from 8 down to 4 so know more about world even though haven’t observed anything
          • Note: In real-world, operations of Move Left and Move Right are inverses of each other
          • Conformant Plans - Plans that reach the goal without ever observing the world i.e. to achieve Goal of clean location, simply Dirt Suck, where move from Initial State where 8 possibilities to one of 4 possible states where one or both Locations are clean (do not know which of the 4 states we are in, but we know we achieved the Goal)

          • Given “Partially Observable” Vacuum Cleaner in Deterministic world

            • Given Local Sensing - where Vacuum knows its current Location, and sub-states going on i.e. if dirty or clean, but does not know about sub-state of other Locations (if contain dirt or not)

            • Draw Partial Diagram of part of Belief State from that world, where Belief State unfolds as the following occurs:

              • 1) Initial State
              • 2) Action taken to Move Right
              • 3) Observation (in Act-Percept Cycle) allows us to split the Belief State of our world to say:

                • if we observe we are in Location B and it’s Dirty then we know we’re in a certain state according to the diagram, and if we observe we’re in Location B and it’s Clean we know we’re in another state according to the diagram
      • Act-Percept Cycle
        • In a Deterministic world, each individual world state within a Belief State maps to exactly one successor state (which is why we say “deterministic”), so the size of the Belief State either stays the same or may decrease (if two Actions accidentally take you to the same world state)
      • Observation
        • Works opposite of the Act-Percept Cycle: when we Observe the world, we take the current Belief State and partition it into smaller pieces according to what we observe. Observations cannot introduce a new world state into the Belief State, and we cannot learn less about the world after an Observation.
      • Stochastic (non-Deterministic)
        • Multiple outcomes for an Action
      • Example 2: Robot with slippery wheels (Stochastic and Partially Observable Environment)
        • Given Robot:
          • Which has slippery wheels
          • Where Move Left or Move Right causes wheels to slip, so most of time No Location Change, but sometimes Location Changes
          • Assume Suck always works perfectly
        • Draw Belief State Diagram where result of Actions often result in a Belief State increase (larger than before), since Action increases Uncertainty as do not know what result of Action will be
          • For each of individual world states belonging to a Belief State we have multiple outcomes for the Action (meaning of Stochastic), so end up with larger Belief State, since Actions tend to increase Uncertainty (in Stochastic and Partially Observable environment)
          • Observation-wise, same holds as in Deterministic world, where observation partitions Belief State into smaller Belief States and Observations tend to reduce the Uncertainty (in Stochastic and Partially Observable environment)
        • Planning approach i.e. want to go from initial Belief State to one where all squares are clean, need to know if will Always work or Maybe sometimes work for possible plans?

          All are Maybe, since with a plan that is a linear sequence we are never guaranteed to achieve the goal with a Finite Sequence of Actions: a successful plan must include at least one Move Action, but if we try a Move Action a finite number of times, the wheels may slip on every one of those attempts and the robot never moves.

          Note: We can introduce a new notion of plans that include Infinite Sequences

      • Infinite Sequences
        • Instead of writing plans as Linear Sequence, i.e.
            (suck, move right, suck)
          
            [SRS]
          

          instead write as Tree Structure

        • Tree Structure (where Finite Representation represents an Infinite Sequence of plans)
          • New State - Starting Belief State
          • Action - Suck
          • New State - A
          • Action - Move Right
          • Observation - either [A, Clean] or [B, Clean] or [B, Dirty] of current State
            • If identifies we are still in previous state, then go back in plan to A
            • Else continue
          • Action - Suck
        • Write infinite sequence of plans in more linear notation:
            S, while we observe A do R, and then do S
          
            [S, while A: R,S]
          
        • Note: If the Stochasticity is independent (i.e. sometimes the action works and sometimes it doesn’t), then in the limit this Plan achieves the Goal with Probability 1. However, we cannot state any bounded number of steps within which the goal will be achieved; we can only say the plan may run infinitely
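        • A minimal simulation sketch of the [S, while A: R, S] plan (Python; the slip probability and two-location state encoding are made-up assumptions for illustration). Every run reaches the goal, but no finite bound on the number of Move attempts holds for all runs:

            ```python
            import random

            SLIP_PROB = 0.5   # assumed chance that a Move action slips (illustrative)

            def run_plan(rng):
                """Execute [S, while A: R, S]; return how many Move attempts were needed."""
                location = "A"
                dirt = {"A": True, "B": True}
                dirt[location] = False            # S: Suck (assumed to always work)
                attempts = 0
                while location == "A":            # while we observe we are still at A ...
                    attempts += 1
                    if rng.random() > SLIP_PROB:  # ... keep trying R; the wheels may slip
                        location = "B"
                dirt[location] = False            # S: Suck again at B
                assert not any(dirt.values())     # goal reached: both locations clean
                return attempts

            rng = random.Random(0)
            trials = [run_plan(rng) for _ in range(10_000)]
            print("max Move attempts seen:", max(trials))
            # Every trial terminates (probability-1 convergence in the limit),
            # but the maximum grows as we run more trials -- no finite bound.
            ```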
      • Finding a Successful Plan

        • Approach with Problem Solving (using Trees) (Previously)
          • Start in Single State (not a Belief State)
          • Search possible states in a Tree until we find one path that takes us to the Goal State
          • Pick single path from Tree (sequence of steps with no branches in it where one leaf node is goal)
        • Approach with Problem Solving (using Maths Notation)
          • Given plan with straight line sequence
              [A,S,F]
            
          • Decide if it is a plan that satisfies the goal by checking if the end state is a goal state

          • Mathematical formulation for plan to be a goal
              1 Started in start state S1
              2 Transitioned to state S2 that is result of
                applying to that start state S1 the action
                of going from A to S
              3 Apply the result of starting
                in intermediate state S2 and applying action
                of going from S to F
              4 Check if resulting state is an element of
                the set of goals to see if plan is valid
                and gives us a solution
            
                Result (Result (A, A -> S), S -> F) ∈ Goals
            
            
          • Stochastic and Partially Observable
            • Note: In Stochastic and Partially Observable worlds the equations are more complicated

            • Instead of just dealing with Individual States, such as getting the result of applying an action to an initial state (i.e. s1 = Result(s, a)), we are dealing with Belief States: we make a Prediction by starting in Belief State “b”, taking Action “a”, and applying each possible result of the action (since actions are stochastic) to each possible member of the Belief State, which gives a larger Belief State; we then Update that Belief State by taking the Observation “o” into account (which gives a same-sized or smaller Belief State), and we get the new Belief State “b1” (see the sketch below)

            • Note: Use this Predict-Update Cycle to Track where we are

              • tells everything we need to know
              • Weakness: Belief States start to get large since they just list all the possible world states
                • Alternative: Instead represent the possible states using Variables (i.e. using Classical Planning) and an overarching Formula over the Variables that describes the states
              b1 = Update(Predict(b, a), o)
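            • A minimal sketch of b1 = Update(Predict(b, a), o) for the slippery-robot vacuum world (Python; the (location, dirtA, dirtB) state encoding, slip model, and function names are illustrative assumptions):

                ```python
                # Belief states are sets of world states (location, dirtA, dirtB).

                def results(state, action):
                    """All possible successors of one world state (actions are stochastic)."""
                    loc, dirt_a, dirt_b = state
                    if action == "Suck":                       # assumed to always work
                        return {("A", False, dirt_b)} if loc == "A" else {("B", dirt_a, False)}
                    if action == "Right":
                        return {state, ("B", dirt_a, dirt_b)}  # may slip (stay) or move
                    if action == "Left":
                        return {state, ("A", dirt_a, dirt_b)}
                    raise ValueError(action)

                def predict(belief, action):
                    """Union of Results(s, a) over every s in the belief state (may grow)."""
                    return {s2 for s in belief for s2 in results(s, action)}

                def update(belief, observation):
                    """Keep only states consistent with the (location, dirty-here) percept."""
                    loc, dirty_here = observation
                    return {s for s in belief
                            if s[0] == loc and (s[1] if loc == "A" else s[2]) == dirty_here}

                b = {("A", True, True), ("A", True, False)}    # unsure whether B is dirty
                b1 = update(predict(b, "Right"), ("B", True))  # b1 = Update(Predict(b,a), o)
                print(b1)                                      # {('B', True, True)}
                ```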
            
        • Approach with Belief States and Branching Plan Structures
          • Same process as for “Problem Solving”, but more complex tree with different possibilities
          • Example (of searching through tree)
            • Init State
              • Possible Actions for Next State
                • Move Right
                • Suck
              • Branch of the Plan (not part of Search Space)
            • Try each Action, expanding nodes until we find a Portion of the tree that is a Successful Plan according to the criterion of reaching a Goal
        • Unbounded Solutions
          • A plan is an Unbounded Solution if every leaf node in the plan is a goal, since we cannot guide execution in one direction or another (no matter what path we execute based on observations; observations come to us, we do not pick them)
        • Bounded Solution
          • A plan with no loops (with a loop we could not know how many steps it will take), but it is ok to have branches
      • Notation of Classical Planning (i.e. Action Schema)
        • Dfn:
          • Representation language for dealing with States, Actions, and Plans
          • Used as approach to deal with complexity by factoring the world into Variables
          • State Space
            • All possible assignments to k-Boolean Variables
          • State Space Size - 2^k
        • Example - Vacuum world with 2 Locations
          • States
            • Possible States: 2*2*2 = 8 (succinctly represented through the 3 Variables below)
            • Variables: 3 (Boolean)
              • DirtA - Dirt in Location A
              • DirtB - Dirt in Location B
              • VacA - Vacuum in Location A
            • World State - Complete Assignment of True or False to each of the 3 Variables
            • Belief States (depends on type of env we want to deal with)
              • Complete Assignment (in Core Classical Planning, useful in Fully Deterministic Fully Observable domains)
              • Partial Assignment (by extending Classical Planning) i.e. Belief State where VacA is True, whereas other 2 Variables are unknown, such that formula represents 4 possible world states
              • Arbitrary Formula (in boolean logic)
          • Actions
            • Represented in Classical Planning by Action Schema since represents a set of actions (i.e. for all possible planes, for all x and y)
            • Example 1: Fly Cargo
              • Given
                • Goal - need to send cargo around world
                • Assets - planes at airports, cargo
                • Action Schema (of plane flying from one location to another) -
                    Action - since Action Schema
                    Operator and Args - i.e. Fly from x to y
                    Preconditions - what need to know to be true
                      to execute actions
                        - P must be a plane
                        - ^ - AND from Boolean Propositional Logic
                        - At - p better be at x
                    Effects (what will happen as a result of action
                    in terms of transition between state spaces) -
                      i.e. when fly from x to y, the plane is no longer at x
                  
                    Action(Fly(p, x, y),
                        PRECOND: Plane(p) ^ Airport(x) ^ Airport(y)
                                           ^ At(p, x)
                        EFFECT: ¬ At(p, x) ^ At(p, y) )
                  
              • Note: We can apply the Action Schema to specific world states where p, x, and y take specific values; conceptually we substitute the values into the schema (concatenating the operator name and arguments into one ground action name), which makes it easy to write one Action Schema that covers all of the individual flying actions (see the grounding sketch below)
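              • A minimal grounding sketch (Python; the literal/substitution encoding is an illustrative assumption): substitute concrete objects for the schema variables, check the PRECOND against a state of ground literals, and apply the EFFECT:

                  ```python
                  # "~" marks a negative effect literal; states are sets of ground literals.
                  fly_schema = {
                      "params": ("p", "x", "y"),
                      "precond": [("Plane", "p"), ("Airport", "x"), ("Airport", "y"), ("At", "p", "x")],
                      "effect":  [("~At", "p", "x"), ("At", "p", "y")],
                  }

                  def substitute(literals, binding):
                      return [tuple(binding.get(term, term) for term in lit) for lit in literals]

                  def apply_schema(state, schema, *args):
                      """Ground the schema with args; if PRECOND holds, return the new state."""
                      binding = dict(zip(schema["params"], args))
                      if not all(lit in state for lit in substitute(schema["precond"], binding)):
                          return None                     # ground action not applicable
                      new_state = set(state)
                      for lit in substitute(schema["effect"], binding):
                          if lit[0].startswith("~"):
                              new_state.discard((lit[0][1:],) + lit[1:])
                          else:
                              new_state.add(lit)
                      return new_state

                  state = {("Plane", "P1"), ("Airport", "SFO"), ("Airport", "JFK"), ("At", "P1", "SFO")}
                  print(apply_schema(state, fly_schema, "P1", "SFO", "JFK"))
                  # At(P1, SFO) is removed and At(P1, JFK) added
                  ```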
            • Example 2: Fly Cargo
              • More complete representation of Problem Solving domain in language of classical planning
              • Note: The representation and algorithms below are good enough to handle 100,000+ cargo planes, representing millions of ground actions at dozens of airports. If there were 100 planes there would be 10,000 different Fly actions; with 1000’s of cargo pieces there would be many more Load and Unload actions (all representable by the succinct Schema below)

                  ###################
                  # Initial State (2x Cargo pieces, 2x Planes, 2x Airports) and where things are
                  ###################
                  Init( At(C1, SFO) ^ At(C2, JFK)
                      ^ At(P1, SFO) ^ At(P2, JFK)
                      ^ Cargo(C1) ^ Cargo(C2)
                      ^ Plane(P1) ^ Plane(P2)
                      ^ Airport(JFK) ^ Airport(SFO) )

                  ###################
                  # Goal State (i.e. C1 cargo must be delivered to JFK airport and ...)
                  ###################
                  Goal( At(C1, JFK) ^ At(C2, SFO) )

                  ###################
                  # Operator "Load" (loads cargo "c" onto Plane "p" at Airport "a")
                  ###################
                  Action( Load(c, p, a),
                      PRECOND: At(c, a) ^ At(p, a) ^ Cargo(c) ^ Plane(p) ^ Airport(a)
                      EFFECT: ¬ At(c, a) ^ In(c, p) )

                  ###################
                  # Operator "Unload" (unloads cargo "c" from Plane "p" at Airport "a")
                  ###################
                  Action( Unload(c, p, a),
                      PRECOND: In(c, p) ^ At(p, a) ^ Cargo(c) ^ Plane(p) ^ Airport(a)
                      EFFECT: At(c, a) ^ ¬ In(c, p) )

                  ###################
                  # Fly Action Schema
                  ###################
                  Action( Fly(p, from, to),
                      PRECOND: At(p, from) ^ Plane(p) ^ Airport(from) ^ Airport(to)
                      EFFECT: ¬ At(p, from) ^ At(p, to) )

      • Progression (Forward) State Space Search - Planning using Notation of Classical Planning (i.e. Action Schema)
        • Searching through Concrete world states
        • Planning using Action Schema is done same as we did planning for Problem Solving
        • Searching through space of exact states, where each state is an Individual World State and where Actions are Deterministic then its the same approach as in Problem Solving
        • Since we are using this representation using Action Schema, there are other possibilities available

        • Example: Fly Cargo
          • Initial State At(P1, SFO), At(C1, SFO)
            • Branching to possible Actions, i.e. one possible Action is Load(C1, P1, SFO)
            • Continue Branching until satisfy Goal predicate
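          • A minimal Progression (breadth-first) search sketch over ground states (Python; the one-plane/one-cargo problem and literal encoding are simplified illustrative assumptions):

              ```python
              from collections import deque
              from itertools import product

              AIRPORTS, PLANES, CARGO = ["SFO", "JFK"], ["P1"], ["C1"]

              def ground_actions(state):
                  """Enumerate applicable ground actions and their successor states."""
                  for p, x, y in product(PLANES, AIRPORTS, AIRPORTS):
                      if x != y and ("At", p, x) in state:
                          yield f"Fly({p},{x},{y})", (state - {("At", p, x)}) | {("At", p, y)}
                  for c, p, a in product(CARGO, PLANES, AIRPORTS):
                      if ("At", c, a) in state and ("At", p, a) in state:
                          yield f"Load({c},{p},{a})", (state - {("At", c, a)}) | {("In", c, p)}
                      if ("In", c, p) in state and ("At", p, a) in state:
                          yield f"Unload({c},{p},{a})", (state - {("In", c, p)}) | {("At", c, a)}

              def forward_search(init, goal):
                  frontier, explored = deque([(frozenset(init), [])]), set()
                  while frontier:
                      state, plan = frontier.popleft()
                      if goal <= state:                 # Goal predicate satisfied
                          return plan
                      if state in explored:
                          continue
                      explored.add(state)
                      for name, nxt in ground_actions(state):
                          frontier.append((frozenset(nxt), plan + [name]))
                  return None

              init = {("At", "C1", "SFO"), ("At", "P1", "SFO")}
              goal = {("At", "C1", "JFK")}
              print(forward_search(init, goal))
              # ['Load(C1,P1,SFO)', 'Fly(P1,SFO,JFK)', 'Unload(C1,P1,JFK)']
              ```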
      • Regression (Backwards) State Space Search
        • Searching through Abstract states (where some Variables unspecified)
        • Where we start at the Goal representation (of many world states)
        • Use the same representation for arrows that represent a set of possible Actions

        • Example: Fly Cargo
          • Take description of Goal State At(C1, JFK), At(C2, SFO) (all we know about the state is that these two Propositions are True)
          • Backward Search by asking what Actions would lead to that Goal State, which represents a whole family of States with different Variables and values
          • Individually inspect, one at a time, the “Definition” of each possible Action that could result in each part of the Goal
            • i.e. inspect Actions in Action Schema that would result in At(C1, JFK), where we find the only Action that results in an At is the Unload Action Schema, such that Unload(c,p,a) results in At(c,a), so we know to achieve goal we’d have to perform an Action Unload(C1, anything, JFK), (i.e. branch leading into Goal state is Unload(c,p,a), which represents all possible Actions for any Plane “p” for Unloading cargo at the destination)
          • Regress the State over the Operator to get another representation before the Action that also will be Unknown (i.e. do not know what “p” is involved)
          • Continue searching backwards until enough of Variables are filled in and where we match against Initial State so we have our Solution that we found going Backwards
          • Apply the found Solution going Forwards
        • Example: Buying a Book
            # Single Action (buying a book)
            # Precondition is identifying what book and Effect is that own the book
            # Goal is to own the book satisfying precondition ISBN
            Action(Buy(b),
                PRECOND: ISBN(b)
                EFFECT: Own(b) )
            Goal( Own(0136042597) )
          
          • Progression Forward Search Solution:
            • Initial State - own nothing
            • Actions - which are possible to apply, i.e. if there are 10,000,000 possible books then the branching factor coming out of the initial state node is 10,000,000, and we have to try them all in order until we hit the goal. Very Inefficient
          • Regression Backward Search Solution:
            • Goal State - own ISBN No. 0136042597
            • Actions - Inspect the possible ones out of the 10,000,000 via the Action Schema, which can only match the Goal in one way, when “b” == 0136042597, so we know the Action is to buy 0136042597 and we are able to connect the Goal State to the Initial State in the backwards direction in one step (see the regression sketch below)
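            • A minimal regression-matching sketch for the book example (Python; the schema encoding and the unify/relevant-action helper names are illustrative assumptions): match the goal literal against the schema's EFFECT to bind its variable in one step, rather than trying 10,000,000 Buy actions forwards:

                ```python
                buy_schema = {
                    "name": "Buy",
                    "params": ("b",),
                    "precond": [("ISBN", "b")],
                    "effect": [("Own", "b")],
                }

                def unify_literal(effect, goal, binding):
                    """Match one effect literal against one goal literal, extending the binding."""
                    if len(effect) != len(goal) or effect[0] != goal[0]:
                        return None
                    binding = dict(binding)
                    for term, value in zip(effect[1:], goal[1:]):
                        if term.islower():                      # lowercase = schema variable
                            if binding.get(term, value) != value:
                                return None
                            binding[term] = value
                        elif term != value:                     # constants must match exactly
                            return None
                    return binding

                def relevant_actions(goal, schema):
                    """All ways the schema could achieve the goal literal (usually very few)."""
                    for effect in schema["effect"]:
                        binding = unify_literal(effect, goal, {})
                        if binding is not None:
                            yield schema["name"], binding

                print(list(relevant_actions(("Own", "0136042597"), buy_schema)))
                # [('Buy', {'b': '0136042597'})] -- one step connects Goal to Initial State
                ```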

      Plan Space Search

        * Dfn: Searching through space of plans (rather than searching through space of states)
        * Note:
            * Plan Space Search - Old approach
            * Forward Search - Popular as easier to come up with good Heuristics for search, as deals with Concrete Plan States

                * Example: Sliding Puzzle (using **Heuristics**)
                    * **Action Schema** for puzzle is formal representation
                    of **encoding Actions in logical form, so a program
                    could modify and generate Heuristics from it
                    (rather than requiring a human to manually do it)**
      
                        ```
                        # Action - slide tile to from location a to b
                        Action(Slide (t,a,b)
      
                                # Precondition - tile must be on location a,
                                  and must be a tile, and location b
                                  must be blank, and location a and b
                                  must be adjacent
      
                                PRECOND: On(t,a) ^ Tile(t) ^ Blank(b) ^ Adj(a,b)
      
                                # Effect - tile is now on location b, and blank
                                  is now on location a, and tile is no longer on
                                  location a, and blank is no longer on location b
      
                                EFFECT: On(t,b) ^ Blank(a) ^
                                        ¬ On(t,a) ^ ¬ Blank(b)
                              )
      
                        ```
                    * Use **Action Schema** to automatically derive
                    good **Heuristics** by modifying original Problem Heuristic
                        * Heuristic #1
                            * Remove a Precondition to make the problem easier
                            and generate a different Heuristic.
                            i.e. remove `Blank(b)` to end up with
                            the **"Manhattan" (aka City Block) Heuristic**
                            (i.e. a tile may then slide onto an adjacent
                            square even if that square is not Blank)
                        * Heuristic #2
                            * i.e. remove `Adj(a,b)` to get the
                            **Qty Misplaced Tiles Heuristic**
                            (i.e. allows sliding a tile from any location a
                            to any location b, no matter how far apart they are)
                        * Heuristic #3 (Ignore negative effects)
                            i.e. remove `¬ Blank(b)` from EFFECT,
                            to make the problem easier (more relaxed problem)
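                    * A minimal sketch of the two relaxed Heuristics above for
                    the 8-puzzle (Python; the tile -> (row, col) board encoding
                    is an illustrative assumption). Dropping `Adj(a,b)` yields
                    the Misplaced-Tiles count; dropping `Blank(b)` yields the
                    Manhattan (City Block) distance:

                        ```python
                        # Goal layout 1..8 with the blank (0) in the last square
                        GOAL = {t: divmod(i, 3) for i, t in enumerate([1, 2, 3, 4, 5, 6, 7, 8, 0])}

                        def misplaced_tiles(board):
                            """Relaxation dropping Adj(a,b): each misplaced tile costs one move."""
                            return sum(1 for tile, pos in board.items()
                                       if tile != 0 and pos != GOAL[tile])

                        def manhattan(board):
                            """Relaxation dropping Blank(b): sum of city-block distances to goal."""
                            return sum(abs(r - GOAL[tile][0]) + abs(c - GOAL[tile][1])
                                       for tile, (r, c) in board.items() if tile != 0)

                        board = {1: (0, 0), 2: (0, 1), 3: (0, 2),
                                 4: (1, 0), 0: (1, 1), 5: (1, 2),
                                 7: (2, 0), 8: (2, 1), 6: (2, 2)}
                        print(misplaced_tiles(board), manhattan(board))  # 2 2 (true cost is 2 moves)
                        ```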
      
            * Backward - Benefit as may be more Efficient than Forward Search
      
        * Example: Getting Dressed with clothes in right order
            * Initial Empty Plan (flawed as does not lead from Start to Goal)
                ```
                Start State
                Goal State
                ```
            * Update Plan
                * Add "Right Shoe" Operator
                    ```
                    Start State
                    Right Shoe
                    Goal State
                    ```
            * Check if updated plan solves problem, otherwise further refine plan
            * Update Plan
                * Add parallel sequence of Operators in branching structure
                    * "Left Shoe"
                    * "Right Shoe"
                * Note: Captures that can achieve in any order
                ```
                    Start State
                     /       \
                Left Shoe    Right Shoe
                     \       /
                    Goal State
                ```
            * Repeat until get plan that works
      

      Planning Representation

        * **Situation Calculus (using FOL for Planning)**
            * Dfn: Regular FOL with a set of conventions for how to
            represent states and actions
            * **Possibility Axioms** and **Successor-State Axioms** are most of what
            comprises Situational Calculus and are used to describe entire
            **Domains** (i.e. airport cargo domain).
            We describe our specific Problem within that Domain by
            describing the Initial State/Situation, where we make different
            types of assertions using different predicates
            * Note: Deriving a Solution from a Problem described in
            Situation Calculus supports the full power of FOL
            (we can write anything we want, which is much more flexible
            than in Problem Solving or Classical Planning) and doesn't need
            any special program, since there are already Theorem Provers for FOL:
            simply state the Problem, apply the normal Theorem Prover
            that we already had for other uses, and
            it comes up with a Situation as an answer which
            corresponds to a Path that satisfies the Goal, given the Initial
            State and Action descriptions.
      
            * **Conventions**
                * **Actions** represented as Objects in FOL (normally by functions)
                    ```
      
      
                    # Function Fly - represents an Object that is the Action
                    # (i.e. fly plane from airport x to y)
      
                    Fly(p,x,y)
                    ```
                * **Situations** described by Objects in the logic and
                correspond to the **Past History of Actions** that
                are in state space search
      
                    i.e. arriving at same world state from two different sets of
                    actions, they would be considered two different situations
      
                    ```
                    # Initial Situation
                    S0
      
                    # Function of Situations (aka "Result") where
                    result of situation object "s" and action object
                    "a" equals another situation S'
      
                    S' = Result(s, a)
      
                    # Possible
      
                        # Note: Situation Calculus does not describe Actions
                        applicable in a Situation with predicate Actions of
                        "s" as `Actions(s)`. Instead Situation Calculus talks
                        about Actions that are possible in a state by using
                        a predicate `Poss(a,s)`, which is an Action "a" that is
                        possible in a state
      
                    Poss(a, s)
      
                    # Precondition
      
                        # Note: Form for describing predicates (i.e. Poss(a,s)),
                        which has the form of a Precondition of state "s"
                        that implies it's possible to do Action "a" in
                        state "s"
      
                    SomePrecond(s) -> Poss(a, s)
      
                    # Possibility Axiom for the Action "Fly"
      
                        # if some "p" exists which is a Plane in state "s" and
                        there's some "x" which is an Airport in state "x" and
                        there's also some "y" which is also an Airport in state "s" and
                        "p" is at Location "x" in state "s", then that implies
                        it's possible to fly "p" from "x" to "y" in state "s"
      
                    Plane(p,s) ^ Airport(x,s) ^ Airport(y,s) ^ At(p,x,s) -> Poss(Fly(p,x,y), s)
      
                    ```
      
                * **Predicates**
      
                    * **Fluents** (fluidity or change over time)
                    ```
                    i.e.
      
                    # Plane "p" at Airport "x" in Situation "s"
      
                    At(p,x,s)
                    ```
      
                * **Possibility Axiom** `Poss(...)`
      
                * **Successor-State Axiom** `In(...)`
      
                * **Describing what Changed and what Does Not Change
                in Classical Planning vs Situational Calculus**
                    * **Classical Planning** had "Action Schemas" that
                    described one Action at a time and what changed
                    * **Situation Calculus** however does it the other way around
                    by instead of writing one Action, Schema, or Axiom for
                    each Action, we instead do one for each Fluent (Predicate)
                    that can change using a **convention** called
                    **Successor-State Axioms (SSA)** that describe what happens in the
                    state that is a Successor of executing an Action.
                    SSAs have the form of saying:
                    ```
                    i.e. For All Actions and States, if it's possible to execute
                    Action "a" in State "s", then the Fluent is True
                    if and only if (IFF) Action "a" made it True OR
                    if Action "a" didn't undo it
      
                    (i.e. either it wasn't true before and "a" made it True,
                    or it was true before and "a" didn't stop it from being True)
      
                    ∀a,s Poss(a,s) -> (fluent true <-> "a" made it true V "a" didn't undo it)
                    ```
                    * Example: **Successor State Axiom** for the `In` Predicate
      
                    ```
                    i.e. For All Actions and States, for which it's Possible to execute
                    Action "a" in Situation "s", and if that's true then the In
                    Predicate holds between some cargo "c" in some plane "p" which is
                    in the state that is the "result" of executing Action "a" in state "s",
                    so the In Predicate will hold, IFF "a" was a Load Action
                    (i.e. if we load the cargo into the plane, then the result of
                    executing that Action "a" is that the cargo is in the Plane "p")
                    OR it might be already true that the cargo was in the Plane "p" in
                    Situation "s", AND Action "a" is not an Unload Action
      
                    ∀a,s Poss(a,s) -> (In(c,p,Result(s,a)) <-> (a = Load(c,p,x) V (In(c,p,s) ^ a != Unload(c,p,x))))
      
                    # Predicate states valid sentences in FOL like the following
                    to assert the initial state
                        - Plane P1 is at Airport JFK in Situation S0
                        - For All C, if C is Cargo then that C is at JFK
                        in Situation S0
      
                    Initial state: S0
                        At (P1,JFK,S0)
                        ∀c Cargo(c) -> At(c,JFK,S0)
      
                    # There Exists a Goal State "s" such that For All "c"
                    if "c" is Cargo, we want all that cargo
                    to be at SFO in state "s"
      
                    Goal:
                        Ǝs ∀c Cargo(c) -> At(c,SFO,s)
      
                    ```
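                    * A minimal sketch that evaluates the `In` SSA over a short
                    action history (Python; the tuple encoding of actions and the
                    `in_fluent` helper are illustrative assumptions, and the
                    Poss(a,s) guard is omitted for brevity):

                    ```python
                    def in_fluent(c, p, situation):
                        """In(c, p, s): the last action made it true, OR it was
                        already true and the last action didn't undo it (the SSA)."""
                        if not situation:                 # S0: nothing loaded initially
                            return False
                        *rest, a = situation              # s = Result(rest, a)
                        made_true = a == ("Load", c, p)
                        undone    = a == ("Unload", c, p)
                        return made_true or (in_fluent(c, p, rest) and not undone)

                    S0 = []
                    S1 = S0 + [("Load", "C1", "P1")]
                    S2 = S1 + [("Fly", "P1", "SFO", "JFK")]   # Fly neither loads nor unloads
                    S3 = S2 + [("Unload", "C1", "P1")]
                    for s in (S0, S1, S2, S3):
                        print(in_fluent("C1", "P1", s))       # False, True, True, False
                    ```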
      
            * Example: Fly Cargo
                * Goal: Move all cargo from Airport A to Airport B,
                        regardless of qty of cargo
                    * Note: In FOL, you can express the notion of "All",
                    but NOT in Propositional Languages like
                    Classical Planning Notation
      
      • Predicates i.e. At(...)

      • Complete Assignment Dfn: when all Variables have values
      • Partial Assignment Dfn: when some Variables have values and others do not

      • Theorem Provers - My Research

        I did some web research this morning to try and answer my question and prepared the below summary of my findings. Unfortunately I couldn’t find any helpful nested infographics that present the big picture concisely, perhaps this still hasn’t been done before (if you happen to know of any, please share). So to understand how everything fits together better, I guess I’d just have to find the time to read, summarise, and consolidate a lot of different research papers, books, lecture notes from various sources.

        Problem Types =========

        • CSP - Constraint Satisfaction Problem (CSP) problems
        • SAT - Boolean Satisfiability (SAT) problems
        • IP - Integer Programming (IP) problems
        • LP - Linear Programming (LP) problems

        Algorithms =========

        • DPLL - Davis–Putnam–Logemann–Loveland (DPLL) algorithm - complete backtracking-based search algorithm

        Approaches =========

        • ASP - Answer Set Programming (ASP) problem solving approach

        Strategies / Techniques =========

        • Resolution - Theorem Proving Strategy for Propositional Logic and First-Order Logic, which is the way most modern automated Theorem Provers are implemented
        • Unification - Tool of matching two terms so they look the same by determining what variables/functions/constants combine, as an enabler of performing Resolution in FOL

        Formats =========

        • CNF - Conjunctive Normal Form (CNF) format, which may be used to represent the input to SAT engines/solvers
        • STRIPS - formal language of the inputs to the STRIPS planner

        Solvers =========

        • Randomised Systematic Solvers - ?
        • SATzilla - SAT-based and DPLL-based
        • PrecoSAT (and earlier well-known solvers such as RelSAT, Siege, and ZChaff) - SAT-based and DPLL-based

        Planners / Provers =========

        • SATPlan - general propositional theorem prover that uses propositional logic instead of FOL. Input is a set of Axiom Schemas
        • GraphPLAN - Input is a set of STRIPS-style operators
        • Blackbox - planning system that works by converting problems specified in STRIPS notation into SAT problems
        • STRIPS (Stanford Research Institute Problem Solver) system - automated planner. STRIPS problems may be translated into Boolean Satisfiability (SAT) problems
        • Tweak - non-linear planning

        Misc =========

        • Ground - Variable-free (i.e. FOL is a lifted version of Ground Propositional Logic where Variables are used)

        References: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-825-techniques-in-artificial-intelligence-sma-5504-fall-2002/lecture-notes/Lecture7FinalPart1.pdf

    • Project 3:
      • Discussion: https://discussions.udacity.com/c/nd889-logic-reasoning-planning/nd889-implement-planning-search
    • Lesson 12 - Probability:
      • Reading: Chapter 13 AIMA
      • Conditional Probability and Bayes’ Rule (used in Machine Learning Algorithms)
        • Bayes’ Rule (from Statistics)
            P(X|Y)     = P(Y|X)     * P(X)  / P(Y)
          
            posterior  = likelihood * prior / marginal likelihood
          
            where "prior"     = prior probability (aka unconditional probability)
            where "posterial" = posterior probability (aka conditional probability)
            where "|"         = given
          
            where P(Y|X) = 1 - P(~Y|X)
          
            where P(Y) equals the "Theorem of Total Probability"
          
            P(Y) = ∑i P(Y|X=i) P(X=i)
          
            (prob. of Y is sum over all probabilities of Y
            conditional on X equals i, times probability of X equals i)
          
        • Example
            LHS of Bayes' Rule
          
            Given Y (i.e. test result),
            but we want to know X (probability of have or not have cancer)
            where P(X|Y) is diagnostic reasoning (from evidence to causes)
          
            RHS of Bayes' Rule (turns LHS on its head and correct for the
            inversion by multiplying by the "prior" of the cause to be
            the case in the first place, i.e. whether have cancer or not,
            and dividing by the probability of the evidence Y, where the divisor
            is often expanded using the Theorem of Total Probability )
          
            where P(Y|X) is causal reasoning (i.e. hypothetically given
            we know the cause X, what is probability of evidence Y we just
            observed)
          
        • Applied to Cancer Question 4 to understand (further below)
            P(C|+) = P(+|C) * P(C) / P(+)
                   = 0.9 * 0.01 / Theorem of Total Probability
                   = 0.9 * 0.01 / (prob. of pos. test result and have cancer +
                                   prob. of pos. test result and do not have cancer)
                   = 0.9 * 0.01 / (0.9 * 0.01 + 0.2 * 0.99)
                   = 0.009 / (0.009 + 0.198)
                   = 0.0434
          
            care about prob. of cancer (cause) conditional on
            result of hidden cause, which is the positive test
            result in this case (evidence)
            =
            probability of seeing pos. test result given have cancer
            *
            prior probability of having cancer
            /
            prob. of seeing pos. test result
          
          
        • Question:
            1 - Probability of a given population positively having
            disease X being 3%.
            2 - Drug company has test Y to show if
            person has disease X.
            3 - Given probability of negative detection
            using test Y given a person with
            disease X being 1%
            4 - Given probability of positive detecting
            using test Y but person not actually having
            disease X is 10% (false positive)
            5 -
            Find probability person positively has disease X
            given they took test Y with positive result?
          
        • Solution:
          • Translate problem into Probability Notation:
              1 - P(X)    = 0.03 (prob. with X)
              so  P(~X)   = 0.97 (prob. not with X)
              3 - P(~Y/X) = 0.01 (prob. Y neg. detect with X)
              4 - P(Y/~X) = 0.1  (prob. Y pos. detect but not with X)
            
                - P(Y/X)  = 0.99 (prob. Y given pos. detect with X)
                - P(X/Y)  = ?    (prob. with X given pos. detect)
            
          • Find P(Y/X) first to use with Bayes’ Rule
              P(Y/X) = 1 - P(~Y/X) = 0.99 (prob. pos. detect)
            
          • Find P(Y) second using Venn Diagrams, where P(Y) is the overlap intersection of X and Y plus the area of Y without the overlap intersection
              P(Y) = P(Y/X) P(X) + P(Y/~X) P(~X)
                   = 0.99 * 0.03 + 0.1 * 0.97
                   = 0.1267
            
          • Substitute into Bayes’ Rule
              P(X/Y) = P(Y/X) P(X) / P(Y)
                     = 0.99 * 0.03 / 0.1267
                     = 0.234
            
              so, 23.4%,
            
              mainly since disease X is rare with 3% in numerator
              so doctors would rely on multiple tests before
              giving definitive diagnosis
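          • A minimal sketch reproducing the calculation above (Python; function and parameter names are illustrative):

              ```python
              def posterior(prior, likelihood, false_positive_rate):
                  """Bayes' Rule: P(X|Y) = P(Y|X) P(X) / [P(Y|X) P(X) + P(Y|~X) P(~X)]."""
                  p_y = likelihood * prior + false_positive_rate * (1 - prior)
                  return likelihood * prior / p_y

              p_x_given_y = posterior(prior=0.03,                # P(X)   = 3%
                                      likelihood=0.99,           # P(Y|X) = 1 - P(~Y|X)
                                      false_positive_rate=0.10)  # P(Y|~X)
              print(round(p_x_given_y, 3))                       # 0.234, i.e. 23.4%
              ```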
            
      • Graph Structure of Probability using Bayes’ Networks
        • Bayes’ Network
          • Dfn: Compact representation of a distribution over a very large Joint Probability Distribution of all Variables (i.e. Node)
            • Assume that events are discrete (i.e. Binary Events)
            • Specify the Bayes’ Network
            • Observe values of Variables (Nodes) (i.e. car not start, oil light, etc)
            • Compute Probabilities of hypotheses (i.e. probability fanbelt broken)
          • Dfn: Complex structure composed of Nodes that correspond to events (known or unknown random variables). Nodes are linked by Arcs, which suggest that the Child of an Arc is Influenced by its Parent (in a Probabilistic rather than a Deterministic way) (i.e. an older battery gives a higher chance that the battery is dead, but not every old battery is dead). Describes a way to understand how a car (a complex system) fails to start, where many variables are not immediately measurable but sensors let us understand a bit about the state of the car. Assists in reasoning from observable variables (i.e. car not starting, dipstick value) to hidden causes (i.e. is the fanbelt broken? is the battery dead?)

          • Usage: Used extensively in most fields of smart computer systems:
            • Diagnostics
            • Prediction
            • Machine Learning
            • Finance
            • Google
            • Robotics
            • Computer Vision
            • Bayes’ Network Inference
          • Advanced AI Techniques that extend from Bayes’s Networks
            • Particle Filters
            • Hidden Markov Models
            • MDP’s
            • POMDP’s
            • Kalman Filters
        • Conditional Independence
          • Dfn: ??
        • D-Separation Concept
          • Dfn: ??
        • Parameter Counts
          • Dfn: ??
        • Inference of Bayes’ Networks
          • Dfn: ??
        • Question: Car not start
          • Problem: Given car not start in morning.
          • Solution:
            • Specify the Bayes’ Network by drawing it (Influence Diagram) graph of possible causes and show methods of diagnosis (increase/decrease belief that battery may cause car failure) and Effects
              • Diagnostic Tool
              • Check battery age
              • Problem:
                • Car not start
                  • Possible Causes:
                    • Battery ok, but not charging OR Battery dead
                      • Possible Causes:
                        • Alternator broken
                        • Fanbelt broken
                        • Starter broken
                      • Diagnostics
                        • Battery Meter (Diagnostic Tool)
                        • Battery age
                    • Lights
                      • Possible Causes:
                        • Lights draining battery
                      • Diagnostics
                        • Check if lights on
                    • Oil
                      • Possible Causes:
                        • No oil
                      • Diagnostics
                        • Check oil light
                    • Fuel
                      • Possible Causes:
                        • No fuel
                        • Blocked fuel line
                      • Diagnostics
                        • Check fuel gauge
                  • Possible Effects:
                    • Battery Flat
                      • Affects:
                        • Lights (no power)
                        • Oil light
                        • Fuel gauge as no power
                      • Not Affect:
                        • Oil level dipstick
                    • No Oil
                      • Affects:
                        • Oil level dipstick
                        • Oil light
                      • Not Affect: *
                    • No Gas
                      • Affects:
                        • Fuel gauge
                        • Car not start
                      • Not Affect: *
            • Count Total Variable Centers
              • Note: Graph Structure and associated Probabilities specify a large Probability Distribution in space of all Variables (i.e. one Variable for each Node)
              • Assuming each Variable (Node) is Binary then given 16 Variables, we assume there can be 2^16 different Values
          • Probabilities
            • Total Probability
                P(Y) = ∑i P(Y|X=i) P(X=i)
              
                prob. any random variable Y can be written:
                    sum of all possible outcomes i for the random variable X of the:
                    prob. of Y given another random variable X of assumed value i
                    * prob. of X equals i
              
            • Negation of Probabilities
                P(¬X|Y) = 1 - P(X|Y)
              
            • Complementary Probability
                P(A) = p
                    =>  P(¬A) = 1 - p
              
            • Independence
              • Probability of the Joint of any two Variables is the Product of the Marginals
                  X ⊥ Y : P(X) P(Y)  = P(X,Y)
                          marginals  = joint probability
                
              • Usage: When asked for the probability of any combination of a number of coins, look at the probability of each coin individually and multiply them
              • Important note:

                  If X and Y are not independent, you can't replace
                  P(X,Y) with P(X) P(Y), and you can't replace P(Y|X) with P(Y).

                  But if X and Y are independent, you can replace in the formula:

                  P(X and Y) = P(Y/X) P(X)
                  X and Y independent, so:
                  P(X) P(Y)  = P(Y/X) P(X)
                  eliminate P(X) on both sides (P(X) not 0):
                  P(Y)       = P(Y/X)

            • Product Rule (Joint - Multiplication Rule)
                P(X,Y) = P(X and Y)
                       = P(X) * P(Y|X)
              
              
              • http://www.mathgoodies.com/lessons/vol6/conditional.html
            • Dependence
              • Given two coins flipped
                  P(X1 = H) = 0.5 (prob. 1st flip heads)
                      |
                      branch to pick another coin
                      based on 1st flip outcome
                      |
                      `--> H : P(X2 = H | X1 = H) = 0.9
                      |
                      `--> T : P(X2 = T | X1 = T) = 0.8
                
                      i.e. prob. 2nd flip X2 is H given that
                      1st flip X1 was H is 0.9
                      (by Conditional Probability)
                
                  where X1 is outcome of first flip
                
                  What is probability of 2nd flip being H ?
                
                  P(X2 = H) = ???
                
              • Solution:
                  compute probability catering for the complement
                
                  using Theorem of Total Probability
                
                  P(X2 = H)
                    =
                      P(X2 = H | X1 = H) * P(X1 = H)
                      + P(X2 = H | X1 = T) * P(X1 = T)
                    =
                      0.9 * 0.5
                      + 0.2 * 0.5
                    =
                      0.45 + 0.1
                    =
                      0.55
                
                
          • Question: Heads or Tails 1
              1 - Given coin, with possible values heads or tails
              P(H) = 0.5
              P(T) = ?
            
            • Assume only two possible outcomes
          • Solution
            • Recall Bayes’ Rule
                P(X/Y) = P(Y/X) P(X) / P(Y)
                where P(Y/X) = 1 - P(~Y/X)
              
            • Translate problem into Probability Notation:
                1-  P(H)    = 0.5 (prob. with H)
              
                where P(T) = 1 - P(H) = P(~H)
                           = 1 - 0.5
                           = 0.5
              
          • Question: Heads or Tails 2
              1 - Given coin, with possible values heads or tails
              P(H) = 0.25
              P(T) = ?
            
            • Assume only two possible outcomes
          • Solution
            • Recall Bayes’ Rule
                P(X/Y) = P(Y/X) P(X) / P(Y)
                where P(Y/X) = 1 - P(~Y/X)
              
            • Translate problem into Probability Notation:
                1)  P(H)   = 0.25 (prob. with H)
              
                where P(T) = 1 - 0.25 = P(~H)
                           = 1 - 0.25
                           = 0.75
              
          • Question: Heads or Tails 3
              1 - Given coin, with possible values heads or tails
            
              Find probability of Heads 3x times in a row
            
              P(H)     = 0.5 (per flip)
              P(H,H,H) = ?
            
            • Assume only two possible outcomes
            • Assume coin flips independent
          • Solution
            • Translate problem into Probability Notation:
                1)  P(H)   = 0.5 (prob. with H)
              
                * Multiply probabilities since each is an
                independent event
              
                P(H,H,H)   = 0.5 * 0.5 * 0.5 = 0.125
              
                12.5%
              
          • Question: Heads or Tails 4
              1 - Given coin, with possible values heads or tails
            
              Find probability of same side 4x times in a row
            
              P(H)     = 0.5 (per flip)
              P(~H)    = 0.5
              P(X1 = X2 = X3 = X4) = same result
            
              Pi(H) = 0.5 ∀i
            
              where Xi = result of i'th flip
                    Xi = { H, T }
            
            • Assume only two possible outcomes
            • Assume coin flips independent
            • Assume each flip is identically distributed with equal probability of a heads result
          • Solution
            • Translate problem into Probability Notation:
              • Multiply probabilities since each is an independent event
              • Already know that probability of 4x Heads is:
                • 0.5 * 0.5 * 0.5 * 0.5 = 0.0625
              • Already know that probability of 4x Tails is 0.0625

                  P(X1 = X2 = X3 = X4) = 0.0625 + 0.0625 = 0.125

                  12.5%

          • Question: Heads or Tails 5
              1 - Given coin, with possible values heads or tails
            
              Find probability of at least 3x heads
              out of 4x coin flips
            
              P(H)     = 0.5 (per flip)
              P( {X1 X2 X3 X4} contains >= 3H )
            
            • Assume only two possible outcomes
            • Assume coin flips independent
            • Assume each flip is identically distributed with equal probability of a heads result
          • Solution
            • Identify different Sequences where head occurs at least 3x
                HHHH
                HHHT
                HHTH
                HTHH
                THHH
              
            • Translate problem into Probability Notation:
              • Already know that probability of 4x Heads is: 0.5 * 0.5 * 0.5 * 0.5 = 0.0625

                  P( {X1 X2 X3 X4} contains >= 3H ) = 5 * 0.0625 = 0.3125

                  31.25%
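            • A minimal brute-force check of the answer above (Python; simply enumerating all 16 equally likely flip sequences):

                ```python
                from itertools import product

                outcomes = list(product("HT", repeat=4))          # all 2^4 sequences
                favourable = [s for s in outcomes if s.count("H") >= 3]
                print(len(favourable), "/", len(outcomes), "=",
                      len(favourable) / len(outcomes))            # 5 / 16 = 0.3125
                ```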

          • Question: Weather 1
              1 - Given weather being sunny or rainy
            
              P(D1 = sunny)              = 0.9   (prob. sunny day)
              P(D2 = sunny | D1 = sunny) = 0.8   (prob. sunny day follows a sunny day)
              P(D2 = rainy | D1 = sunny) = ???   (prob. rainy day follows a sunny day)
            
              What is prob. of rainy day given prev. day was sunny day?
            
            • Assume sunny or rainy are only two possible outcomes
          • Solution
            
              Using Negation of Probabilities
                  P(¬X|Y) = 1 - P(X|Y)
            
              P(D2 = rainy | D1 = sunny)
                = 1 - P(¬ D2 = rainy | D1 = sunny)
                = 1 - P(D2 = sunny | D1 = sunny)
                = 1 - 0.8
                = 0.2
            
              20%
            
          • Question: Weather 2
              1 - Given weather being sunny or rainy
            
              P(D1 = sunny)              = 0.9   (prob. sunny day)
              P(D2 = sunny | D1 = sunny) = 0.8   (prob. sunny day follows a sunny day)
              P(D2 = rainy | D1 = sunny) = 0.2   (prob. rainy day follows a sunny day)
              P(D2 = sunny | D1 = rainy) = 0.6   (prob. sunny day follows a rainy day)
              P(D2 = rainy | D1 = rainy) = ???   (prob. rainy day follows a rainy day)
            
              What is prob. of rainy day given prev. day was rainy day?
            
            • Assume sunny or rainy are only two possible outcomes
          • Solution
            
              Using Negation of Probabilities
                  P(¬X|Y) = 1 - P(X|Y)
            
              P(D2 = rainy | D1 = rainy)
                = 1 - P(¬ D2 = rainy | D1 = rainy)   (replace with complement probability)
                = 1 - P(D2 = sunny | D1 = rainy)     (substitute with negation that we already know)
                = 1 - 0.6
                = 0.4
            
              40%
            
          • Question: Weather 3
              1 - Given weather being sunny or rainy
            
              P(D1 = sunny)              = 0.9   (prob. sunny day)
              P(D2 = sunny | D1 = sunny) = 0.8   (prob. sunny day follows a sunny day)
              P(D2 = rainy | D1 = sunny) = 0.2   (prob. rainy day follows a sunny day)
              P(D2 = sunny | D1 = rainy) = 0.6   (prob. sunny day follows a rainy day)
              P(D2 = rainy | D1 = rainy) = 0.4   (prob. rainy day follows a rainy day)
            
              What is prob. that day 2 is sunny?
              What is prob. that day 3 is sunny?
            
            • Assume sunny or rainy are only two possible outcomes
            • Assume that same Conditional Probability dynamics of D2 followed by D1 apply also to D3 followed by D2
          • Solution
              Use "Dependence"
            
              i.e. Given sunny or rainy
              P(D1 = sunny) = 0.9 (prob. 1st day sunny)
                 |
                 branch to pick another day
                  based on 1st day outcome
                 |
                 `--> sunny : P(D2 = sunny | D1 = sunny) = 0.8
                 |
                 `--> rainy : P(D2 = rainy | D1 = rainy) = 0.4
            
                  i.e. prob. 2nd day D2 is sunny given that
                 1st day D1 was sunny is 0.8
                 (by Conditional Probability)
            
              P(D2 = sunny) = ???
            
              compute probability catering for the complement
            
              P(D2 = sunny)
                =
                  P(D2 = sunny | D1 = sunny) * P(D1 = sunny)
                  + P(D2 = sunny | D1 = rainy) * P(D1 = rainy)
                =
                  0.8 * 0.9
                  + 0.6 * 0.1
                =
                  0.72 + 0.06
                =
                  0.78
            
              now, what is prob. that day 3 is sunny?
            
              P(D3 = sunny) = ???
            
              P(D3 = sunny)
                =
                  P(D3 = sunny | D2 = sunny) * P(D2 = sunny)
                  + P(D3 = sunny | D2 = rainy) * P(D2 = rainy)
            
                now, we know from last answer that
                P(D2 = sunny) = 0.78, and its complement
                P(D2 = rainy) = 0.22, so substitute:
            
                and given the assumptions we know the following are
                the same (only relevant ones from previous
                to get the answer here are shown):
            
                P(D2 = sunny | D1 = sunny)
                  = P(D3 = sunny | D2 = sunny)
                  = 0.8
                P(D2 = sunny | D1 = rainy)
                  = P(D3 = sunny | D2 = rainy)
                  = 0.6
            
                so substitute these too:
            
                =
                  P(D3 = sunny | D2 = sunny) * 0.78
                  + P(D3 = sunny | D2 = rainy) * 0.22
            
                = 0.8 * 0.78 + 0.6 * 0.22
                = 0.624 + 0.132
                = 0.756
            
              75.6%
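            • A minimal sketch of the same forward recursion for any number of days (Python; the p_sunny helper and loop structure are illustrative):

                ```python
                # P(D_{n+1} = sunny) = P(sunny|sunny) P(D_n = sunny)
                #                    + P(sunny|rainy) P(D_n = rainy)
                P_SUNNY_GIVEN_SUNNY = 0.8
                P_SUNNY_GIVEN_RAINY = 0.6

                def p_sunny(day, p_day1_sunny=0.9):
                    p = p_day1_sunny
                    for _ in range(day - 1):
                        p = P_SUNNY_GIVEN_SUNNY * p + P_SUNNY_GIVEN_RAINY * (1 - p)
                    return p

                for day in (1, 2, 3):
                    print(day, round(p_sunny(day), 4))   # 1 0.9, 2 0.78, 3 0.756
                ```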
            
          • Question: Cancer 1
              1 - Probability of given population having cancer C is 1%
            
              P(C)   = 0.01        (prob. of given population having cancer C)
            
              What is prob. of not having the cancer?
            
            • Assume having or not having are only two possible outcomes
          • Solution
              P(¬ C) = 1 - 0.01
                     = 0.99        (prob. of given population not having cancer C)
            
          • Question: Cancer 2
              1 - Probability of given population having cancer C is 1%
              2 - Drug company has test Y to show if person has cancer C
              (with a probabilistic answer)
            
              where P(+|C)   - prob. of test Y Positive detection given we have a cancer C
              where P(-|C)   - prob. of test Y Negative detection given we have a cancer C
            
              P(C)   = 0.01        (prob. of given population having cancer C)
              P(¬ C) = 0.99        (prob. of given population not having cancer C)
            
              P(+|C) = 0.9         (prob. of test Y Positive detection given we have a cancer C)
              P(-|C) = ???
            
            
              What is prob. of test Y being Negative given we have a cancer C?
            
            • Assume having or not having are only two possible outcomes
          • Solution
              recall, Bayes' Rule:
            
                  P(X|Y) = P(Y|X) P(X) / P(Y)
            
                  where P(Y|X) = 1 - P(¬ Y|X)
            
            
            
            • Translate problem into Probability Notation:
                1- P(C)   = 0.01            (prob. of given population having cancer C)
                2- P(¬ C) = 0.99            (prob. of given population not having cancer C)
                3- P(Y|C) = P(+|C) = 0.9    (prob. Y pos. detect with C present)
              
                P(¬ Y|C) = P(-|C) = ??      (prob. Y neg. detect with C present)
              
            • Find P(¬ Y|C) = P(-|C) first, using the complement rule
                P(¬ Y|C) = 1 - P(Y|C)
                         = 1 - P(+|C)
                         = 1 - 0.9
                         = 0.1              (prob. Y neg. detect with C)
              
                         probability of the test coming out negative given that the
                         cancer is present
              
          • Question: Cancer 3
              In addition to that given in Question Cancer 2,
              assume:
            
              P(Y/ ¬ C)   = P(+|¬ C) = 0.2      (prob. Y pos. detect with no C present)
              P(¬ Y/ ¬ C) = P(-|¬ C) = 0.8      (prob. Y neg. detect with no C present)
            
              Find the "Joint Probabilities" (i.e. not Conditional):
            
              P(Y,     C) = P(+, C)   =         (prob. Y pos. detect and having C)
              P(¬ Y,   C) = P(-, C)   =
              P(Y,   ¬ C) = P(+, ¬ C) =
              P(¬ Y, ¬ C) = P(-, ¬ C) =
            
          • Solution: Substitute values into the Product Rule: P(X,Y) = P(X and Y) = P(X) * P(Y|X)

              P(+, C) = P(+|C) * P(C)
                      = 0.9 * 0.01
                      = 0.009
            
              P(-, C) = P(-|C) * P(C)
                      = 0.1 * 0.01
                      = 0.001
            
              P(+, ¬ C) = P(+| ¬ C) * P(¬ C)
                      = 0.2 * 0.99
                      = 0.198
            
              P(-, ¬ C) = P(-| ¬ C) * P(¬ C)
                      = 0.8 * 0.99
                      = 0.792
            
          • Question: Cancer 4
              In addition to that given in Question Cancer 2 and 3, find:
            
              P(C|Y) = P(C|+) = ???
            
              (prob. of having cancer C, given received a positive test Y diagnosis)
            
            
              Review prior cases...
              we know we have a Positive test (i.e. +)
              where we found
                  P(+, C)   = 0.009
                  P(+, ¬ C) = 0.198
            
              Noting and using Conditional Probability (Bayes' Rule)
            
                   P(X|Y) = P(Y|X) P(X) / P(Y)
              P(X|Y) P(Y) = P(Y|X) P(X)
            
            
              Using the Product Rule
                  P(X,Y) = P(X and Y)
                         = P(X) * P(Y|X)
            
              Substituting Product Rule into Conditional Probability:
                  P(X|Y) = P(X,Y) / P(Y)
            
            
              Solving using rearranged form of Product Rule
              P(C|+) = P(C|Y)
                     = P(C,Y) / P(Y)
            
                     where P(Y,C) = P(C,Y) = P(Y and C) = P(C and Y)
                     = P(C and Y) / P(Y)
                     substitute equivalent values from Q3
                       where P(+, C) = (prob. Y pos. detect and having C)
                       where P(Y) = Theorem of Total Probability
                                  = (prob. Y pos. detect and either having or not having C)
                                  = P(+, C) + P(+, ¬ C)
                     = P(+, C) / Theorem of Total Probability
                     = P(+, C) / ( P(+, C) + P(+, ¬ C) )
                     = 0.009 / (0.009 + 0.198)
                     = 0.043
            
                4.3%
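          • Code sketch: the same arithmetic in a short Python snippet (the variable names are my own, not from the lesson), using the Joint Probabilities from Cancer 3 and the Theorem of Total Probability in the Denominator:
            ```
            # P(C | +) = P(+, C) / (P(+, C) + P(+, ¬C))   -- Bayes' Rule via joint probabilities
            p_c, p_not_c = 0.01, 0.99
            p_pos_given_c, p_pos_given_not_c = 0.9, 0.2

            joint_pos_c = p_pos_given_c * p_c              # P(+, C)  = 0.009
            joint_pos_not_c = p_pos_given_not_c * p_not_c  # P(+, ¬C) = 0.198

            p_c_given_pos = joint_pos_c / (joint_pos_c + joint_pos_not_c)
            print(round(p_c_given_pos, 3))   # 0.043
            ```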
            
        • Ch 13 Questions
          • What are “Query Propositions”? (bottom of actual pg 490)
    • Lesson 13 - Bayes Nets
      • Dfn: Bayes’ (Baysian) Networks (theory of the world) Constructed and used to make Inferences about new Situations to takes uncertainty in probability and marry it with Efficient Structures, so can see how each Uncertain Variable influences other Uncertain Variables
        • Usage:
          • Construct (a Bayes’ Nets)
          • Inference (using Bayes' Nets)
        • Solve using either approach of:
          • Summing the Results of All Relevant Situations
          • Inference by Sampling (to handle bigger Bayesian Networks)
        • Algorithms:
          • Markov Chain Monte Carlo (MCMC)
      • Bayesian Network - Graphical Representation of Distribution of Two Variables
          (A)   (NOT OBSERVABLE)
           |
           |
          (B)   (OBSERVABLE)
        
          where A and B are "Internal Variables"
          where A means: have or not have cancer
          where B means: test result pos. or neg. diagnosis
        
          where P(A) is "Prior Probability" (i.e. best guess given no other evidence)
          where P(B | A) and
                P(B | ¬ A) is prob. of B given different values of A
        
          where P(A | B) and
                P(A | ¬ B) are "Diagnostic Reasonings" (Inverse of Causal Reasoning)
        
        • Question: What is the qty of numerical parameters (of underlying probabilities) required to specify the entire joint prob. between A and B?
          • Answer:
              Recall that "Joint Probability" using the "Independence" formula
              between two variables A and B is:
            
              (i.e. When asked for the probability of any
              combination of values of A and B,
              look at the probability of each individually and multiply them)
            
               A ⊥ B : P(A) P(B)  = P(A,B)
                       marginals  = joint probability
            
              Also recall the Product Rule
            
              P(A, B) = P(A and B)
                      = P(A) * P(B | A)
            
            
              Parameter 1:
                  one number to specify P(A), from which we can
                  "derive" P(¬ A)
            
              Parameters 2 and 3:
                  two numbers to specify P(B | A) and P(B | ¬ A), from which we can
                  "derive" P(¬ B | A) and P(¬ B | ¬ A)
            
              Answer: 3x parameters for this Bayesian Network
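        • Code sketch: as a check on the 3-parameter claim, a small Python snippet (my own names; the three numeric values are illustrative, not from the lesson) that builds the full joint table over A and B from just P(A), P(B|A) and P(B|¬A):
          ```
          # Three parameters fully determine the joint distribution over (A, B).
          p_a = 0.01                 # P(A)       -- illustrative value
          p_b_given_a = 0.9          # P(B | A)   -- illustrative value
          p_b_given_not_a = 0.2      # P(B | ¬A)  -- illustrative value

          joint = {}
          for a in (True, False):
              pa = p_a if a else 1 - p_a
              pb_a = p_b_given_a if a else p_b_given_not_a
              for b in (True, False):
                  joint[(a, b)] = pa * (pb_a if b else 1 - pb_a)   # Product Rule

          print(joint)                      # four joint entries, derived from only 3 numbers
          print(round(sum(joint.values()), 10))   # ~1.0
          ```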
            
      • Example: Student Job https://www.youtube.com/watch?v=8dvd80dgsEE
          Given
          ========
          Student about to graduate
          Applied for jobs at 2x companies
        
          - G represents Graduated Student
          - O1 represents Student Received Job Offer from Company 1
          - O2 represents Student Received Job Offer from Company 2
        
          G ------> O1
              |          P(O1 | G)   = 0.5
              |          P(O1 | ¬ G) = 0.05
              |
              |___> O2
                         P(O2 | G)   = 0.75
                         P(O2 | ¬ G) = 0.25
        
          P(G) = 0.9   (prior prob. that the student graduated, used in the solution below)
        
          Question:
          =========
          Calc. prob. that student received offer from
          Company 2, given we know she received an offer
          from Company 1.
        
      • Solution
          Find the ratio of situations where both O1 and O2 are True,
          to all situations where O1 is True.
        
          Calculate answer using:
              "Approach of Summing the Results of All Relevant Situations"
        
          Note: Alternative approach is to use:
              "Inference by Sampling to handle bigger Bayesian Networks"
        
          P(O2 | O1)
        
          =
        
          Sum of all possible Grad. situations where O1 and O2 are True /
          Sum over all possible Grad. and O2 situations where O1 is True
        
          =
        
          Sum of all possible Grad. situations where O1 and O2 are True /
          (Normalising Number (i.e. the Numerator) + the situations where O2 is False)
        
          =
        
          (Sum G) P(O2, O1, G) /
          (Sum G) (Sum O2) P(O2, O1, G)
        
          =  [ P(O1, O2, G) + P(O1, O2, ¬ G) ] /
             (
               [P(O1, O2, G) + P(O1, O2, ¬ G)] +
               [P(O1, ¬ O2, G) + P(O1, ¬ O2, ¬ G)]
             )
        
          = ???
        
          For Numerator and Denominator, we know:
        
          P(O1, O2, G) = P(O1 | G) * P(O2 | G) * P(G)
        
              Substitute values we know from Bayes' Network:
        
                       = 0.5       * 0.75      * 0.9
                       = 0.3375
        
        
          P(O1, O2, ¬ G) = P(O1 | ¬ G) * P(O2 | ¬ G) * P(¬ G)
        
              Substitute values we know from Bayes' Network:
        
                       = 0.05       * 0.25      * 0.10
                       = 0.00125
        
          P(O1, ¬ O2, G) = P(O1 | G) * P(¬ O2 | G) * P(G)
        
              Substitute values we know from Bayes' Network:
        
                       = 0.5        * 0.25      * 0.9
                       = 0.1125
        
          P(O1, ¬ O2, ¬ G) = P(O1 | ¬ G) * P(¬ O2 | ¬ G) * P(¬ G)
        
              Substitute values we know from Bayes' Network:
        
                       = 0.05       * 0.75      * 0.1
                       = 0.00375
        
          P(O2 | O1) = (0.3375 + 0.00125) / ( [0.3375 + 0.00125] + [0.1125 + 0.00375] )
                     = 0.7445
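      • Code sketch: a brute-force Python enumeration of P(O2 | O1) (the helper name joint is my own); it sums P(O1, O2, G) over the hidden variable G exactly as in the working above:
        ```
        # Enumerate the hidden variable G to compute P(O2 | O1) for the Student Job network.
        p_g = 0.9
        p_o1_given = {True: 0.5, False: 0.05}    # P(O1 | G), P(O1 | ¬G)
        p_o2_given = {True: 0.75, False: 0.25}   # P(O2 | G), P(O2 | ¬G)

        def joint(o1, o2, g):
            """P(O1=o1, O2=o2, G=g) using conditional independence of O1, O2 given G."""
            pg = p_g if g else 1 - p_g
            po1 = p_o1_given[g] if o1 else 1 - p_o1_given[g]
            po2 = p_o2_given[g] if o2 else 1 - p_o2_given[g]
            return pg * po1 * po2

        numerator = sum(joint(True, True, g) for g in (True, False))
        denominator = sum(joint(True, o2, g) for o2 in (True, False) for g in (True, False))
        print(round(numerator / denominator, 4))   # ≈ 0.7445
        ```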
        
      • Complex Bayesian Rule (and Complex Bayes’ Networks)

        
          Bayes Rule
        
              P(A | B) = P(B | A) * P(A)         / P(B)
                       = easily computable term  / hard to compute term
                                                   (but P(B) does not depend on the assumed value of Var A)
              Now the "complementary" range is:
        
              P(¬ A | B) = P(B | ¬ A) * P(¬ A)   / P(B)
        
              Note: "Normaliser" is both P(B) whether use A on LHS or ¬ A on LHS
        
              Since two complementary events, also know that:
        
              P(A | B) + P(¬ A | B) = 1
        
        
              Compute the "posterior" probability that is
              not "Normalised" (i.e. whilst ignoring/omitting the "Normaliser" P(B))
        
              P'(A | B) = P(B | A) * P(A)
        
              P'(¬ A | B) = P(B | ¬ A) * P(¬ A)
        
              where P' is marked "prime" since it is not a real probability
              where P'(A | B) and P'(¬ A | B) equal the Numerators of the previous equations
                    (i.e. the "Normaliser" Denominator P(B) has been left out)
        
              Now, recover original probabilities by "Normalising" based
              on values P'(A | B) and P'(¬ A | B)
        
              P(A | B) = η * P'(A | B)          (actual probability)
        
              P(¬ A | B) = η * P'(¬ A | B)
        
              where η is "Normaliser" eta (times the non-"Normalised" form)
        
              and where:
        
              η = 1 / ( P'(A | B) + P'(¬ A | B) )
        
              Using these computed non-"Normalised" pseudo-probabilities we
              defer the calculation of "Normaliser" P(B), and make
              the calculation much easier
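        • Code sketch: a tiny Python helper for the deferred-normaliser trick (the function name is my own): compute the un-normalised P' values, then rescale by η = 1 / their sum; the example call uses the one-test cancer numbers (P(+|C)=0.9, P(C)=0.01, P(+|¬C)=0.2, P(¬C)=0.99):
          ```
          def normalise_posteriors(p_b_given_a, p_a, p_b_given_not_a, p_not_a):
              """Return (P(A|B), P(¬A|B)) without ever computing P(B) directly."""
              p_prime_a = p_b_given_a * p_a                # P'(A | B)
              p_prime_not_a = p_b_given_not_a * p_not_a    # P'(¬A | B)
              eta = 1.0 / (p_prime_a + p_prime_not_a)      # the "Normaliser" η
              return eta * p_prime_a, eta * p_prime_not_a

          print(normalise_posteriors(0.9, 0.01, 0.2, 0.99))   # ≈ (0.043, 0.957)
          ```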
        
        
        
      • Absolute Independence (aka Independence) Bayes Network
          B
        
          C
        
        • Dfn: A ⊥ B where two nodes not connected at all
        • Note: Absolute Independence DOES NOT imply Conditional Independence
      • Conditional Independence Bayes Network
          A ------> B
              |
              |___> C
        
        • Dfn: A ⊥ B | C where two nodes A and B are “conditionally independent” given a hidden cause A caused two independent different measurements (we know information from a common branch C)
        • Note: Conditional Independence DOES NOT imply Absolute Independence (i.e. since if A is test result returning pos., this will affect the probability of the subsequent B test result likelihood of being pos.)
      • Confounding Cause Bayes Network
          A -----,
                  |--- C
          B _____/
        
        • Dfn: (two independent hidden causes confounded within a single observational variable)

        • C adds a dependence on A and B

      • Example: Two Tests for Cancer (Complex) and Conditional Independence

        • Conditional Independence Dfn: Given C, T1, and T2 in below diagram,

          • T1 ⊥ T2 | C (i.e. T1 is “conditionally independent” of T2 given we know C)
          • Question:
            • If we did not know C (i.e. knowing whether we had Cancer) and if T1 is “conditionally independent” of T2, does that imply that T1 and T2 are “independent” of each other?
              • No, since getting a pos. test result with T1 of a specific cancer C, gives us info about whether we have cancer or not (i.e. raises probability of having cancer). With that increased probability outcome, we will predict that a subsequent test T2 will have a higher probability of giving a pos. test result (than if we didn’t do T1 first), so they are NOT “independent”
          Given,
        
              C  (Not Observable)
             /  \
          T1 ___ |
                 T2
        
          where C is Cancer
          where T1 is Test 1
          where T2 is Test 2
        
          P(C)       = 0.01  (prior probability of Cancer)
          P(+ | C)   = 0.9   (probability of receiving positive test result for either test)
          P(- | ¬ C) = 0.8   (prob. of neg. test result for either test, i.e. cancer free)
        
          Compute all other probabilities using the above:
        
          P(¬ C)     = 0.99
          P(- | C)   = 0.1
          P(+ | ¬ C) = 0.2
        
          Assuming both T1 and T2 results are pos., i.e.
        
              P(C | T1 = +, T2 = +)  =  P(C | + +) = ??
        
          What is probability of cancer now?
        
          Answer:
        
              Use the trick to compute:
        
                      prior   +       +       P'      P(C|++)
              C       0.01    0.9     0.9     0.0081  0.1698
              ¬ C     0.99    0.2     0.2     0.0396  0.8301
                                              ------
                                              0.0477 "Normaliser"
        
                             (use non-"Normalised" Bayes Rule
                              and multiple 0.01 * 0.9 * 0.9 to get
                              non-"Normalised" probability P' of
                              having cancer C given a TWO pos. test results
                              from T1 and T2
                              of 0.0081)
        
                              P' values are not probabilities.
                              Add the P' values of 0.0081 and 0.0396
                              we get 0.0477.
        
                              Lastly we divide (i.e. "Normalise")
                              the non-"Normalised" probabilities
                              0.0081 and 0.0396 by the Factor 0.0477
                              to get the correct "posterior probability"
                              answers
        
              P(C|++) = 0.1698
        
        
          Other notes:
        
          Hidden Variable C, causes the still Stochastic test results T1 and T2
          We assumed T1 and T2 are identically distributed (i.e. use same 0.9 for each)
          We assumed T1 and T2 were "Conditionally Independent"
              i.e. if knew for certain that had cancer C,
                   then knowing T1 would not help us make a statement about T2
          We assumed that the probability of T2 given C and T1
          is the same as the probability of T2 given C
        
              i.e. P(T2 | C, T1) = P(T2 | C)
        
              (this is known as "Conditional Independence":
               if we knew the value of the cancer variable C
               for sure, then T2 would be "independent" of T1;
               the independence is only "Conditional" since it
               only holds if we actually know C,
               because C separately causes T1 and T2 as shown in the diagram)
        
        
          What is the prob. of second T2 test result being pos. if we
          know that first T1 test result outcome was pos. ?
        
          P(T2 = + | T1 = +)  =  ???
        
          Could solve using Bayes Rule:
        
          P(+2 | +1) = P(+1 | +2) P(+2) / P(+1)
        
          Instead, solve using the "Theorem of Total Probability"
          (note that everything is conditional on +1)
        
          (i.e. expand like the Denominator on the RHS of Bayes Rule)
        
          P(T2 = + | T1 = +)
              = P(+2 | +1, C) * P(C | +1)  +  P(+2 | +1, ¬ C) * P(¬ C | +1)
        
              Note that P(+2 | +1, C) is an example of "conditional independence"
              since 1st test T1 gives no more information to 2nd test T2, as it
              only gives more info if C was unknown (which is not the case here),
              so we can re-write it as P(+2 | C), since we know now that
              P(+2 | +1, C) = P(+2 | C)
        
              So, simplifying the equation, by removing +1 from P(+2 | +1, C), since it
              doesn't give any more info
        
              = P(+2 | C) * P(C | +1) + P(+2 | ¬ C) * P(¬ C | +1)
        
              Substitute with values we already know:
        
              = 0.9       * P(C | +1) + 0.2         * P(¬ C | +1)
        
        
        
              Use the same trick used previously to
              compute P(C | +1) and P(¬ C | +1), but with only one
              pos. test result (T1), instead of both of them:
        
                      prior   +       P'      P(C|+)
              C       0.01    0.9     0.009  0.009/0.207 = 0.043
              ¬ C     0.99    0.2     0.198  0.198/0.207 = 0.957
                                      ------
                                      0.207 "Normaliser"
        
                             (use non-"Normalised" Bayes Rule
                              and multiple 0.01 * 0.9 to get
                              non-"Normalised" probability P' of
                              having cancer C given one pos. test result
                              T1 of 0.009)
        
                              P' values are not probabilities.
                              Add the P' values of 0.009 and 0.198
                              we get 0.207.
        
                              Lastly we divide (i.e. "Normalise")
                              the non-"Normalised" probabilities
                              0.009 and 0.198 by the Factor 0.207
                              to get the correct "posterior probability"
                              answers
        
              P(C | +1)   = 0.043
        
              P(¬ C | +1) = 0.957
        
        
              = 0.9       * 0.043     + 0.2             * 0.957
        
              = 0.2301
        
              = 0.230
        
              (i.e. if the 1st test returns pos., then expect the 2nd test to return
              pos. with probability of 23% after seeing the pos. result of the 1st test,
              which is an increase over the
              default probability P(T2 = +) ≈ 20.7% of the 2nd test returning pos.
              without a 1st test being performed; that default equals the
              "Normaliser" from Bayes Rule, which was 0.207 )
        
        
        
      • Example: Happiness (using Confounding Cause)
          R -----,
                  |--- H
          S _____/
        
          where happiness H (or unhappy) driven by two causes,
          whether it's sunny S or job raise R
        
          where Specification of a Probability Distribution is:
        
          P(S) = 0.7
          P(R) = 0.01
        
          P(H | S, R) = 1
          P(H | ¬ S, R) = 0.9
          P(H | S, ¬ R) = 0.7
          P(H | ¬ S, ¬ R) = 0.1
        
          What is:
        
          P(R | S) = ???
        
          Note that S has no influence on R (they are independent
          causes with no arc between them), so P(R | S) = P(R)
          (i.e. prob. of raise given sunny is same as
                prob. of raise given any weather)
        
          P(R | S) = P(R) = 0.01
        
      • Explaining Away (special instance of Bayes Network reasoning)
          R -----,
                  |--- H
          S _____/
        
          If we know we are happy H,
          then sunny S weather can "explain away" the cause of happiness H
        
          If we also know that it's sunny S weather,
          then the probability that we received a raise R DECREASES
          (since the "observer" can already "explain away" the reason
          for being happy H as likely being that it's sunny S, having first
          observed that it's sunny outside)
        
          Such that "seeing" one of the causes R or S first, can
          "explain away" any other potential cause of the effect H
        
          1) What is:
        
          P(R | H, S) = ???     (prob. raise given that i'm happy and its sunny)
        
          expanding the term using Bayes Rule and Total Probability:
        
          P(R | H, S) = P(H | R, S) * P(R | S) / P(H | S)
        
          now simplifying P(R | S) into just P(R) since R is independent of S:
        
                      = P(H | R, S) * P(R)     / P(H | S)
        
          now expand the Denominator P(H | S) by folding in R and ¬ R
        
                      = P(H | R, S) * P(R)     / ( P(H | R, S) * P(R) + P(H | ¬ R, S) * P(¬ R) )
        
          substitute values from tables:
        
                      = (1 * 0.01)              / (1 * 0.01 + 0.7 * 0.99)
        
                      = 0.0142
        
          2) What is:
        
          P(R | H) = ???      (prob. of raise given we're observed as being happy,
                               but we do not know anything about the weather)
        
          compute the auxiliary value P(H) as a combination of the different variable settings
          that can lead to happiness
        
          P(H) =   P(H | S, R)     * P(S, R) +
                   P(H | ¬ S, R)   * P(¬ S, R) +
                   P(H | S, ¬ R)   * P(S, ¬ R) +
                   P(H | ¬ S, ¬ R) * P(¬ S, ¬ R)
        
               using "independence" to replace Join Probabilities with Marginals
        
               =   P(H | S, R)     * P(S)P(R) +
                   P(H | ¬ S, R)   * P(¬ S)P(R) +
                   P(H | S, ¬ R)   * P(S)P(¬ R) +
                   P(H | ¬ S, ¬ R) * P(¬ S)P(¬ R)
        
              now substituting in values we know
        
               =   1               * (0.7 * 0.01) +
                   0.9             * (0.3 * 0.01) +
                   0.7             * (0.7 * 0.99) +
                   0.1             * (0.3 * 0.99)
        
               = 0.007 + 0.0027 + 0.4851 + 0.0297
        
               = 0.5245
        
          now using Bayes Rule to substitute into the initial equation
          with values we already know
        
          P(R | H) = P(H | R) * P(R) / P(H)
        
                   = P(H | R) * 0.01 / 0.5245
        
          now find out what P(H | R) is using "Total Probability", and substitute numbers:
        
          P(H | R) = P(H | R, S) * P(S) + P(H | R, ¬ S) * P(¬ S)
        
                   = 1           * 0.7  + 0.9           * 0.3
                   = 0.7 + 0.27
                   = 0.97
        
          now substitute this result back into equation to find P(R | H):
        
        
          P(R | H) = 0.97 * 0.01 / 0.5245
                   = 0.0185
        
        
          What is P(R | H, ¬ S) (the extreme case, which makes a raise R most likely)? Solve similarly to before:
        
          P(R | H, ¬ S) = P(H | R, ¬ S) * P(R | ¬ S) / P(H | ¬ S)
        
                        = 0.9 * 0.01 / ( P(H | R, ¬ S) * P(R) + P(H | ¬ R, ¬ S) * P(¬ R) )
        
                        = 0.009      / (0.9 * 0.01 + 0.1 * 0.99)
        
                        = 0.0833
        
          so now it is clear that if we don't know about the weather,
          there is a higher chance that the raise R caused happiness H:
        
          P(R | H) > P(R | H, S)     (0.0185 > 0.0142)
        
          and if we observe it to not be sunny, then it is much
          more likely that the cause of happiness H is a job raise
          (a brute-force check of these numbers is sketched below):
        
          P(R | S) < P(R | H, ¬ S)   (0.01 < 0.0833)
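        • Code sketch: a brute-force Python check of the explaining-away numbers above (the helper names joint and prob_r_given are my own), enumerating the full joint over S, R, H:
          ```
          # Happiness network: S and R are independent causes of H.
          p_s, p_r = 0.7, 0.01
          p_h_given = {(True, True): 1.0, (False, True): 0.9,
                       (True, False): 0.7, (False, False): 0.1}   # P(H | S, R)

          def joint(s, r, h):
              ps = p_s if s else 1 - p_s
              pr = p_r if r else 1 - p_r
              ph = p_h_given[(s, r)] if h else 1 - p_h_given[(s, r)]
              return ps * pr * ph

          def prob_r_given(h, s=None):
              """P(R | H=h [, S=s]) by summing the joint over the unconstrained variables."""
              states = [(sv, rv) for sv in (True, False) for rv in (True, False)
                        if s is None or sv == s]
              num = sum(joint(sv, rv, h) for sv, rv in states if rv)
              den = sum(joint(sv, rv, h) for sv, rv in states)
              return num / den

          print(round(prob_r_given(True, True), 4))    # P(R | H, S)  ≈ 0.0142
          print(round(prob_r_given(True), 4))          # P(R | H)     ≈ 0.0185
          print(round(prob_r_given(True, False), 4))   # P(R | H, ¬S) ≈ 0.0833
          ```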
        
        
        
        
      • Bayes Networks (General Definition)
        • Dfn: Define Probability Distributions over Graphs of Random Variables (i.e. A, B, C, D, E)
          A         B
           \       /
            \     /
             *   *
               C
             /   \
            /     \
           *       *
          D         E
        
        • Bayes Network is defined by prob. distrib. inherent to each individual node
            i.e. nodes A and B are defined as the following
                 since A and B have no incoming arcs
          
                    P(A)
                    P(B)
          
            i.e. node C is defined as a "conditional distribution"
                 (conditioned on A and B)
          
                    P(C | A, B)
          
            i.e. node D and E are conditioned on C
          
                    P(D | C)
                    P(E | C)
          
            The "Joint Probability" represented by a Bayes Network
            is the product of various Bayes Network probabilities
            defined over individual nodes,
            where each node's probability is only conditioned on
            the incoming arcs
          
                    P(A, B, C, D, E) = P(A) * P(B) * P(C | A, B) * P(D | C) * P(E | C)
          
            Required Variables (of Bayes Network vs Combinatorial Approach)
            ==================
          
            Advantage of using Bayes Networks here is that only requires 10 values
            (so compactness of Bayes Network leads to representation
             that scales much better on LARGE Networks, than using the
             Combinatorial Approach that goes through all combinations of variable values)
          
              required parameters       1      1       4                2           2
                                       P(A)   P(B)   P(C | A,B)      P(D | C)    P(E | C)
                                                     (2 binary        (1 binary   (1 binary
                                                      parents A, B)    parent C)   parent C)
          
            Using a Combinatorial Approach (unstructured representation)
            the "Joint Distribution" over any 5x variables
            would require all combinations of variable values
          
              required parameters      (2 ^ 5) - 1 = 31   (probability values)
          
            Noting that any Boolean variable that has K Boolean inputs requires 2^K parameters
          
        • Question
            How many probability values are required to specify this Bayes Network?
          
                   A
                 / | \
                /  |  \
               *   *   *
              B    C    D
               \    \   /
                \    \ /
                 *    *
                 E    F
          
            Answer:
          
                P(A, B, C, D, E, F) = P(A) * P(B | A) * P(E | B) * P(C | A) * P(D | A) * P(F | C, D)
                                       1        2           2          2          2          4
          
                Summing up the qty of required parameters (search variables) is
          
                13
          
                (i.e. 13 numeric probabilities are required to specify this Bayes Network)
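        • Code sketch: a small Python snippet (my own representation, not from the lesson) that counts the parameters of a Boolean Bayes Network given each node's parents; it reproduces the 13 above and the 10 of the earlier A–E network:
          ```
          # Each Boolean node with K Boolean parents needs 2^K parameters.
          def num_parameters(parents_of):
              return sum(2 ** len(parents) for parents in parents_of.values())

          network_abcdef = {'A': [], 'B': ['A'], 'C': ['A'], 'D': ['A'],
                            'E': ['B'], 'F': ['C', 'D']}
          network_abcde = {'A': [], 'B': [], 'C': ['A', 'B'], 'D': ['C'], 'E': ['C']}

          print(num_parameters(network_abcdef))   # 13
          print(num_parameters(network_abcde))    # 10
          ```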
          
          
        • D-Separation

          • Rule: Any two variables are “independent” if they are not linked by a path of only unknown (unobserved) variables, i.e. if we know B, then any variable downstream of B becomes “independent” of anything upstream of B

              Given the following Bayes Network
            
                          A
                        /   \
                       /     \
                      *       *
                     B         D
                    /           \
                   /             \
                  *               *
                 C                 E
            
              Are the following "independent" of each other:
            
              C ⊥ A       NO, since A influence C by virtue of B,
                          but if we know B, then A becomes independent of C.
                          So the only determinant of C is B, if you know B for certain
                          since in this case A won't tell C anything B hasn't already told it.
              C ⊥ A | B   YES
              C ⊥ D       NO, same as above
              C ⊥ D | A   YES, since if C already knows A, then D won't tell C anything it doesn't
                          already know
              E ⊥ C | D   YES, since knowing D blocks the only path between C and E
                          (the path C - B - A - D - E passes through D)
            
            
            
              Note: Knowledge of B does NOT render A and E independent
              (the path A → D → E does not pass through B)
            
            
            
          • Example 2

            
              Given the following Bayes Network
            
            
              A         B
               \       /
                \     /
                 *   *
                   C
                 /   \
                /     \
               *       *
              D         E
            
              Are the following "independent" of each other?
            
              A ⊥ E       NO, since through C we can see influence of A to E
              A ⊥ E | B   NO, since A still influences C despite knowing B
              A ⊥ E | C   YES, since if know C the influence is cut off so
                          A cannot influence E if we know C
              A ⊥ B       YES, different entry variables with no arcs
              A ⊥ B | C   NO, since given C, A and B become "dependent"
                          due to the "Explain Away" effect
            
                          (Knowledge of variable C, renders previously
                          "independent" variables as "dependent")
            
                          (i.e. since if know C to be True, then knowledge of A
                          will substantially affect what we believe about B)
            
                          (i.e. since if there are two joint causes for C, i.e. A and B,
                          and we know A is True, then we will discredit
                          cause B; conversely, if we know A is False, it will
                          increase our belief that B is True)
            
            
    • Lesson 14 - Bayes Networks using Probabilistic Inference (i.e. how to answer questions we’re interested in, given certain inputs)
        * This Lesson 14:
            * Probabilistic Inference
                * Dfn: How to answer Probability questions using Bayes Networks

        * Previous Lesson 13:
            * Probability Theory of how a Bayes network can concisely represent
             a "joint probability distribution", including representation of
             "independence" between variables
        * Example: Earthquake
            ```
            Given Bayes Network graph
      
            (B)urglary       (E)arthquake
                  \         /
                   \       /
                    *     *
                    (A)larm
                     /  \
                    /    \
                   /      \
                  *        *
            (J)ohn calls   (M)ary calls
      
            "Inference" question:
                What are outputs given some inputs ?
      
            "Probabilistic Inference" Terminology:
      
                "Evidence" variables = Inputs = B and E  (i.e. variables we know value of)
      
                "Query" variables = Outputs = J and M  (i.e. variabes we want to find value of)
      
                    Important Note: "Probabilistic Inference" outputs are not single numbers, instead
                    outputs are a "Probability Distribution", so the answer will be a
                    Complete "Joint Probability Distribution" over the "Query" variables,
                    which is known as the:
      
                    "Posterior Distribution" (given the "Evidence")
                    ===============================================
      
                    P(Q1, Q2 ... | E1 = e1, E2 = e2)
      
                    (i.e. "Posterior Distribution" of one or more "Query" variables
                    given the exact values of the zero or more "Evidence" variables)
      
                "Hidden" variables = Internal = (i.e. variables that are neither "Evidence" nor "Query" that need to compute with internally)
      
            Maximisable "Query" values (Q), given the "Evidence" values:
            ===========================================================
      
                argmax q  P(Q1 = q1, Q2 = q2, ... | E1 = e1, ...)
      
      
                Note: The most likely explanation out of all the possible values
                for all the "Query" variables, which combination of values
                has the highest probability
      
            Note: In ordinary programming, each function goes only one way:
                * Input variables
                * Computation
                * Output result variable(s)
            However, in Bayes Networks, we can go MULTIPLE DIRECTIONS
                i.e.
                    CAUSAL DIRECTION FLOW
                        - giving as "Evidence" the root nodes of the tree
                        - asking as "Query" values the nodes at the bottom of the tree
      
                    REVERSE CAUSAL DIRECTION FLOW
                        - giving as "Evidence" the J and M variables
                        - asking as "Query" variables the B and E variables
      
                    or any other combination...
      
                    OTHER
                        - giving as "Evidence" the M variable
                        - asking as "Query" variables the B and J variables
      
      
            ```
                * Question
                    ```
                    Situation where Mary (M) has called to report
                    Alarm (A) is going off, and we want to know
                    whether or not a Burglary (B) occurred.
      
                    Show what nodes are "Evidence", "Hidden" or "Query" nodes
      
                    B - "Query"
                    E - "Hidden"
                    A - "Hidden"
                    J - "Hidden"
                    M - "Evidence" (known)
                    ```
        * **Inference on Bayes Networks** https://youtu.be/nJqloQLLYaI?t=3m25s
            * **Enumeration** Method
                * Dfn: Goes through all possibilities, adds them up, and derives answer
            ```
            Question:
      
                What is probability that the Burglar Alarm occurred
                given that John called and Mary called?
      
            Answer:
      
                Use the definition of "Conditional Probability"
                and perform "Inference" by "Enumeration" on the Belief network
      
                P(Q | E) = P(Q, E) / P(E)
      
                P(+b | +j, +m)    ("conditional probability")
      
                   = joint probability distribution of all three variables / conditionalised variables
      
                   = P(+b, +j, +m) / P(+j, +m)   (using shorter notation)
      
                   Note: here we "Normalise" by the Denominator P(+j, +m)
      
                   (re-written without any "conditional probability")
      
                   Notation: P(E = True) is written P(+e),
                       P(E = False) is written P(¬ e) = 1 - P(+e),
                       and a bare "e" ranges over both values
      
                   now enumerate all "atomic" probabilities and
                   calculate the sum of products
      
                   the Numerator can be determined by Enumerating an expression with
                   all possible values of the Hidden variables
                   (i.e. in this case E and A)
      
                   P(+b, +j, +m)
                     = ∑ "e" ∑ "a" P(+b, +j, +m, e, a)     (note how only b, j, and m are fixed as +)
      
                     (sum over all values for "e" and "a" where only two values for each since Boolean,
                     we ask what is the probability of this unconditional term)
      
                     to get the values of the "atomic" events, we re-write the equation in a
                     form that corresponds to the "conditional probability" tables that we have
                     associated with the Bayes Network
      
                     re-write the part after the ∑ "e" ∑ "a", as being an expression
                      in terms of the Parents of each of the nodes in the network, giving a product
                      of 5x terms, which we then sum over all values of Hidden variable values for
                      "e" and "a"
      
                     = ∑ "e" ∑ "a" P(+b) * P(e) * P(a | +b,e) * P(+j | a) * P(+m | a)
      
                     = ∑ "e" ∑ "a" f(e, a)
      
                     = f(+e, +a) + f(+e, ¬a) + f(¬e, +a) + f(¬e, ¬a)
      
                     now expand and substitute +/¬e and +/¬a for each respectively:
      
                     =   P(+b) * P(+e) * P(+a | +b,+e) * P(+j | +a) * P(+m | +a)
      
                         + P(+b) * P(+e) * P(¬a | +b,+e) * P(+j | ¬a) * P(+m | ¬a)
      
                         + P(+b) * P(¬e) * P(+a | +b,¬e) * P(+j | +a) * P(+m | +a)
      
                         + P(+b) * P(¬e) * P(¬a | +b,¬e) * P(+j | ¬a) * P(+m | ¬a)
      
                     (sum of f, for all values of "e" and "a")
                     (i.e. the sum of 4x terms, where each of the terms is the product of 5x numbers)
      
                     now, we get the numbers to substitute into this equation
                     from the Conditional Probability Tables from our Model
      
                    Question:
      
                    Given the case where "e" is pos. and "a" is pos., look up
                    the values for each of the 5x terms in the Conditional Probability Tables
                    https://youtu.be/nJqloQLLYaI?t=3m25s
                    substitute them in, and multiply them together
                    to determine their product
      
      
      
      
            ```
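            * Code sketch: the lecture's CPT numbers aren't reproduced in these notes, so below is a sketch of Inference by Enumeration that assumes the standard Burglary/Earthquake values from the AIMA textbook (an assumption on my part), computing P(+b | +j, +m):
            ```
            # Inference by Enumeration on the Burglary network (CPT values assumed from AIMA, not the lecture).
            P_B, P_E = 0.001, 0.002
            P_A = {(True, True): 0.95, (True, False): 0.94,
                   (False, True): 0.29, (False, False): 0.001}   # P(+a | b, e)
            P_J = {True: 0.90, False: 0.05}                      # P(+j | a)
            P_M = {True: 0.70, False: 0.01}                      # P(+m | a)

            def joint(b, e, a, j, m):
                p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
                p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
                p *= P_J[a] if j else 1 - P_J[a]
                p *= P_M[a] if m else 1 - P_M[a]
                return p

            bools = (True, False)
            # Numerator: sum the hidden variables e, a with b fixed True; Denominator: sum b too.
            num = sum(joint(True, e, a, True, True) for e in bools for a in bools)
            den = sum(joint(b, e, a, True, True) for b in bools for e in bools for a in bools)
            print(round(num / den, 4))   # ≈ 0.2842 with the assumed AIMA numbers
            ```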
      
            * **Efficient Inference Technique #1 -
                Faster technique for doing Inference on Bayes Networks
                that involves "Pulling out terms" from the Enumeration**
                * Reasoning:
                    * e.g. Network 1: If had 5x Hidden variables, there'd only be 32 rows in Conditional Probability Table to sum up
                    * e.g. Network 2: If we had 27x Hidden variables, and all were Boolean, that would give ~100 million rows to sum out; but since some aren't Boolean,
                    representing the whole enumeration we'd have to sum over roughly a quadrillion rows,
                    so we need methods that are faster than enumerating everything
      
                * e.g.
      
                ```
                given:
      
                ∑ "e" ∑ "a" P(+b) * P(e) * P(a | +b,e) * P(+j | a) * P(+m | a)
      
                here the prob. of "b" is the same for all values of "e" and "a"
                so we can take that out of the equation and have less work to do
                (i.e. multiply by P(+b) once, instead of having it in each row
                of the table)
      
                also, here the prob. of "e" does not depend on the summation of "a"
                so we can move it left outside the summation of "a", resulting in
                less work so the inner loop of the summation has less terms
      
                P(+b) ∑ "e" P(e) ∑ "a" P(a | +b,e) * P(+j | a) * P(+m | a)
      
                so we've reduced the cost of doing each row of the table,
                but still have same qty of rows in the table,
                so we'll need to try a different technique #2
      
      
      
                ```
      
            * **Efficient Inference Technique #2 - "Maximise Independence"** of
                variables, in order to reduce the number
                of rows in the Conditional Probability Table
                * Note: The structure of a Bayes Network determines the
                efficiency for performing "inference" on it
      
                    ```
                    i.e.
      
                      1) Linear Bayesian Network:
      
                         - with string of variables X1 through to Xn    O(n)
      
                         here "inference" can be done in a timeframe
                         that's proportional to the number "n"
      
                      2) "Complete" Network:
      
                         - where every node points to every other node  O(2^n)
                           could take up to 2^n timeframe, if all "n"
                           variables are Boolean variables
      
                      3) Example Alarm Network:
      
                         - where had all "independence" relations
                         represented in the structure of the network
      
                         - BUT if we put the nodes in different order,
                         we'd end up with different structure
      
                         i.e. change to order with J node first then
                         M node second (i.e. John calls then Mary calls).
      
                         now, given just these two nodes J and M,
                         is the M node "dependent" on or "independent" of
                         the J node?
      
                         ANSWER: "dependent" (since here we don't know that
                         the alarm had occurred)
      
                              J -------* M
                              \         /
                               \       /
                                \     /
                                 *   *
                                   A
                                 /   \
                                /     \
                               /       \
                              *         *
                             B  ------*  E
      
                         If we add A to the network, it is "dependent" on both
                         M and J
      
                         If we add B to the network, it's just "dependent" on A
                         (i.e. B is "independent" of J and M, given A)
      
                         If we add E to the network, it's "dependent" on
                         both A and B
      
                         (since if E did occur, then it's more likely that
                         A would go off,
                         and since if B did occur that would explain why A
                         is going off, which would make E less likely)
      
      
                    ```
                * **Causal Direction**
                    * Bayes Networks easier to do "inference" on when
                    they are written in the "Causal Direction"
                    (i.e. when networks flow from "causes" to "effects")
      
            * **Efficient Technique #3 - "Variable Elimination"**
                * Reasoning:
                    ```
                    previously to evaluate this equation, we first joined
                    up the whole Joint Distribution, before Summing over the
                    Hidden Variables "e" and "a", which is SLOW, since we
                    end up repeating most of the work
      
                    ∑ "e" ∑ "a"   (whole joint distribution)
      
                    = ∑ "e" ∑ "a" P(+b) * P(e) * P(a | +b,e) * P(+j | a) * P(+m | a)
                    ```
                * Dfn:
                    * Faster than doing "inference" by Enumeration in most cases, but
                    is still a difficult computation (it's NP-hard to do "inference" on
                    Bayes Networks in general)
                    * Requires an Algebra for Manipulating Factors
                    (names for multi-dimensional arrays that come out of these
                    probabilistic terms)
                * Example
                    ```
                    Given a Bayes Network with 3x Boolean variables:
      
                    where T "dependent" on R
                    where L "dependent" on T
      
                    R  --------*    T   --------*   L
                    (raining)    (traffic)   (late for appointment)
      
                    Conditional Probability Tables - for each variable
                    ==============================
      
                    P(R)        P(T | R)        P(L | T)
                    +r  0.1     +r  +t  0.8     +t  +l  0.3
                    -r  0.9     +r  -t  0.2     +t  -l  0.7
                                -r  +t  0.1     -t  +l  0.1
                                -r  -t  0.9     -t  -l  0.9
      
                    now, use "inference" to determine answer to questions:
      
                    QUESTION: Am I going to be Late ?
      
                    ANSWER (using "Enumeration" ok for this simple example):
      
                        by going through all possible values for "r" and "t",
                        and then Summing the product of the 3x nodes
      
                        P(+l) = ∑ r ∑ t P(r) P(t | r) P(+l | t)
      
                    ANSWER (using "Variable Elimination" alternative for complex examples):
      
                        Variable Elimination Steps
                        ====================
      
                        NOTE:
      
                        (continued process of joining together "factors"
                         to form a larger factor, and then eliminating
                         variables by summing out. importantly, if we apply
                         a good order by which we apply the operations in
                         this process, then Variable Elimination can be
                         much more efficient than just doing the whole
                         "Enumeration")
      
                        STEP 1 - starts with a big network and perform "Joining Factors"
      
                            where each "factor" is one of the
                            Conditional Probability Tables (multi-dimensional matrix)
      
                            choose two or more of the "factors" (i.e. R and T)
      
                            combine them together to form a NEW "factor" called RT
                            that represents the "Joint Probability" of all the
                            variables (i.e. "r" and "t") in that "factor"
      
                            P(R, T)
                            +r  +t  0.1 * 0.8 = 0.08
                            +r  -t  0.1 * 0.2 = 0.02
                            -r  +t  0.9 * 0.1 = 0.09
                            -r  -t  0.9 * 0.9 = 0.81
      
                            so we converted from before to after
      
                            BEFORE    R  --------*    T   --------*   L
      
                            AFTER           R T  --------*    L
      
                        STEP 2 - eliminate some variables (aka Summing Out, aka Marginalisation)
      
                            the objective of this step is to take the new Table
                            called RT, and REDUCE it
      
                            P(R, T)             P(L | T)
                            +r  +t  0.08        +t  +l  0.3
                            +r  -t  0.02        +t  -l  0.7
                            -r  +t  0.09        -t  +l  0.1
                            -r  -t  0.81        -t  -l  0.9
      
                        STEP 3 - compute by summing out (aka marginalising)
                            over the variable "r" to give us a
                            P(T) Table that just operates on T,
                            which has two entries, +t and -t,
                            where +t is formed by "summing out" all the entry values in
                            the P(R, T) Table for all values of "r" for which we have +t,
                            and the -t entry is formed the same way
      
                        ("combine" parts of the big network
                         into "smaller parts", so we have a
                         smaller network to deal with)
      
                            P(T)
                            +t  0.08 + 0.09 = 0.17
                            -t  0.02 + 0.81 = 0.83
      
                            so we converted from before to after
      
                            BEFORE    R T --------*   L
      
                            AFTER      T  --------*   L
      
                            so now we have Probability Tables:
      
                            P(T)            P(L | T)
                            +t  0.17        +t  +l  0.3
                            -t  0.83        +t  -l  0.7
                                            -t  +l  0.1
                                            -t  -l  0.9
      
                        STEP 4 - continue "combining" by performing a "Join" over T and L
                          variables using point-wise multiplication
                          to give a new Table with "Joint Probability" P(T, L) and
                          then "Enumerate" over it
      
                          now we are down to a Single Node with this "Joint Probability" table:
      
                          P(T, L)
                          +t  +l  0.17 * 0.3 = 0.051
                          +t  -l  0.17 * 0.7 = 0.119
                          -t  +l  0.83 * 0.1 = 0.083
                          -t  -l  0.83 * 0.9 = 0.747
      
                        STEP 5 - finally, "Sum Out" to give us just the L node
      
                          P(L)
                          +l  0.051 + 0.083 = 0.134
                          -l  0.119 + 0.747 = 0.866
      
      
                    ```
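                * Code sketch: the same join / sum-out steps in Python (my own factor representation as dicts, not from the lesson), reproducing P(+l) = 0.134:
                    ```
                    # Variable Elimination on R -> T -> L; factors are dicts keyed by variable values.
                    P_R = {True: 0.1, False: 0.9}
                    P_T_given_R = {(True, True): 0.8, (True, False): 0.2,     # (r, t) : P(t | r)
                                   (False, True): 0.1, (False, False): 0.9}
                    P_L_given_T = {(True, True): 0.3, (True, False): 0.7,     # (t, l) : P(l | t)
                                   (False, True): 0.1, (False, False): 0.9}

                    # Joining Factors: combine R and T into a joint factor P(R, T)
                    f_rt = {(r, t): P_R[r] * P_T_given_R[(r, t)]
                            for r in (True, False) for t in (True, False)}

                    # Summing Out (marginalising) R to get P(T)
                    f_t = {t: sum(f_rt[(r, t)] for r in (True, False)) for t in (True, False)}

                    # Join T and L into P(T, L)
                    f_tl = {(t, l): f_t[t] * P_L_given_T[(t, l)]
                            for t in (True, False) for l in (True, False)}

                    # Sum Out T to get P(L)
                    f_l = {l: sum(f_tl[(t, l)] for t in (True, False)) for l in (True, False)}
                    print({l: round(p, 3) for l, p in f_l.items()})   # {True: 0.134, False: 0.866}
                    ```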
        * **"Sampling" approach to achieve Approximate Inference**
            * Advantage 1 - of "Sampling" over "Inference" is that
              we know a procedure for coming up with at least an
              approximate value for the "Joint Prob. Dist.",
              as opposed to EXACT "Inference" where the computation
              may be very complex
            * Advantage 2 - If do not know the numeric values of the
            "Cond. Prob. Tables" are, but we can Simulate the process,
            then we can still proceed by using "Sampling", whereas we
            could not proceed in this situation using EXACT "Inference"
            * Example:
                ```
                Given a "Joint Probability Distribution"
                i.e. distribution of Heads or Tails over two coins 1c and 5c
      
                STEP 1 - Build "Probability Table"
      
                1c 5c
                =========
                h  h  I
                h  t  IIII
                t  h  II
                t  t  I
      
                STEP 2 - Start counting by "Sampling"
      
                i.e. flip the coins, and add the result
                     to the relevant row in the Table
      
                STEP 3 - Repeat "Sampling" process until enough counts
                         until we are close to the True distribution so
                         that we can "Estimate" the "Joint Prob. Dist"
                         by inspecting the count tally
      
                         If only a few "Samples" then results
                         may not be accurate
                         (random variation that prevents them
                          from converging to true values)
      
                ```
      
            * Example: Using "Sampling" to do "Inference"
            and **Rejection Sampling**
                ```
                Given 4x boolean variables
      
                         Cloudy
                         /     \
                        /       \
                       /         \
                      *           *
                Sprinkler         Rain
                     *             *
                      \           /
                       \         /
                        \       /
                         Wet Grass
      
      
                And given "Condit. Prob. Tables";
      
                P(S | C)
                ===============
                +c  +s  0.1
                 "  -s  0.9
                -c  +s  0.5
                 "  -s  0.5
      
                P(W | S, R)
                ===============
                +s  +r  +w  0.99
                 "   "  -w  0.01
                 "  -r  +w  0.9
                 "   "  -w  0.1
                -s  +r  +w  0.9
                 "   "  -w  0.1
                 "  -r  +w  0.01
                 "   "  -w  0.99
      
                 P(R | C)
                 ==============
                 +c  +r  0.8
                  "  -r  0.2
                 -c  +r  0.2
                  "  -r  0.8
      
                 "inference" over this network using "Sampling"
      
                 STEP 1: Choose Variable where all Parents are defined
      
                    i.e. Cloudy is the only such variable
      
                    Conditional Prob. Table
      
                    P(C)
                    +c  0.5
                    -c  0.5
      
                 STEP 2: "Sample" from P(C) for variable C
      
                    generate a random sample number, and let's say it comes up
                    as a +c value, which is defined in P(C) Table
      
                    now, choose another variable whose Parents have all been
                    sampled, say S, and then look at the rows of the P(S | C) Table
                    for which Cloudy (the Parent) is pos. +c, where we see +c +s is 0.1 and +c -s is 0.9
      
                 STEP 3: "Sample" from P(S | C) for variable S
      
                    generate a random sample number by "sampling" for S,
                    and let's say it comes up -s
      
                    now for the Rain variable R, go to P(R | C) Table,
                    find row where the Parent C is pos +c
                    (since we got a +c sample before), i.e.
                    +c  +r  0.8 and  +c  -r  0.2,
      
                 STEP 4: "Sample" from P(R | C) for variable R
      
                    generate a random sample number, and let's say it comes up
                    as a +r value
      
                    now, choose final Parent W, and look at rows
                    of P(W | S, R) Table for which
      
                    Which rows of P(W | S, R) should we be considering ?
      
                        since previously we generated "Samples" of +c, -s, and +r we
                        choose rows with:
      
                        -s +r +w
                        -s +r -w
      
                    Is it more likely to have a +w or a -w ?
      
                        +w  (since it has higher value % of 0.9 than -w does)
      
                STEP 5: Write down our Complete "Sample" based on outcomes of choosing random
                        values and working our way through the network to see if + or - for each variable
      
                    Samples: +c, -s, +r, +w
      
      
                Note:
      
                    prob. of choosing a particular variable value (i.e. +w vs -w),
                    depends on the Parent variable values, which are chosen
                    according to the Conditional Probability Tables
                    (i.e. P(C), P(S | C), P(R | C), and P(W | S, R) ).
                    So the count of each Sampled variable value will approach the
                    true probability (i.e. with an infinite qty of Samples this
                    procedure computes the True "Joint Probability Distribution"),
                    such that the Sampling method is "Consistent".
      
                    Use this kind of Sampling to compute the
                    Complete "Joint Probability Distribution"
                    OR use this kind of Sampling
                    just to compute the value of an Individual Variable
      
                ```
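            * Sketch: a minimal Python sketch (not from the lesson) of the ancestral/prior sampling procedure above, hard-coding the CPT values from the example; names like `sample_once` are illustrative only
                ```
                # Ancestral ("prior") sampling over the Cloudy/Sprinkler/Rain/WetGrass network.
                # CPT values are taken from the tables above; True stands for "+" and False for "-".
                import random

                P_C = 0.5                                    # P(+c)
                P_S_given_C = {True: 0.1, False: 0.5}        # P(+s | c)
                P_R_given_C = {True: 0.8, False: 0.2}        # P(+r | c)
                P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                                (False, True): 0.9, (False, False): 0.01}   # P(+w | s, r)

                def sample_once():
                    """Sample each variable in topological order, parents before children."""
                    c = random.random() < P_C
                    s = random.random() < P_S_given_C[c]
                    r = random.random() < P_R_given_C[c]
                    w = random.random() < P_W_given_SR[(s, r)]
                    return c, s, r, w

                # With many samples the counts approach the true joint distribution ("consistent").
                samples = [sample_once() for _ in range(100_000)]
                p_w = sum(w for _, _, _, w in samples) / len(samples)
                print(f"Estimated P(+w) ~= {p_w:.3f}")
                ```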
            * **Rejection Sampling** Procedure
                * Dfn:
                    ```
                    "Rejection Sampling" Procedure
                    ====================
                    If want to compute a Conditional Probability
                    like P(w | -c), then the Sample we got previously
                    (i.e. +c, -s, +r, +w) wouldn't be helpful at all,
                    since it had to do with it being Cloudy +c, but not
                    NOT Cloudy -c, so we would "Reject the Sample"
                    by crossing it off the list of Samples
                    (repeat rejecting for any other Sample that doesn't match
                    conditional probabilities we're interested in,
                    but keeping those Samples that do match).
                    ```
                * Issues:
                    * If the evidence is unlikely, then most of the Samples
                    are Rejected. Overcome this issue by using the
                    **Likelihood Weighting** Technique
                        ```
                        e.g. given the Burglary and the Alarm
      
                          B  ----------*  A
      
                        want to compute probability of a B given that
                        the alarm A goes off
      
                        P(B | +a)
      
                        Issue: B's are infrequent
      
                        Procedure for generating Samples:
                        =============
      
                        1 Generate a B,
                        2 Results are added to Samples list
                        3 Go back and check if results match what
                        we want (i.e. does -b, -a match P(B | +a)? )
                        4 Reject the Sample if no match
                        5 Repeat from step 1
      
                        i.e.
                            -b, -a  REJECT
                            -b, -a  REJECT
                            +b, -a  KEEP
                            -b, -a  REJECT
                            ...
      
                        ```
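                * Sketch: a minimal Python sketch (not from the lesson) of Rejection Sampling for a query like P(+w | -c) on the Sprinkler network above; helper names are illustrative
                    ```
                    # Rejection Sampling: draw full samples from the network, then throw away
                    # every sample whose Cloudy value does not match the evidence (-c).
                    import random

                    def sample_once():
                        c = random.random() < 0.5                                 # P(C)
                        s = random.random() < (0.1 if c else 0.5)                 # P(S | C)
                        r = random.random() < (0.8 if c else 0.2)                 # P(R | C)
                        p_w = {(True, True): 0.99, (True, False): 0.9,
                               (False, True): 0.9, (False, False): 0.01}[(s, r)]  # P(W | S, R)
                        w = random.random() < p_w
                        return c, s, r, w

                    kept = [w for (c, s, r, w) in (sample_once() for _ in range(200_000))
                            if not c]                     # reject any sample with +c
                    print(f"kept {len(kept)} samples, estimated P(+w | -c) ~= {sum(kept)/len(kept):.3f}")
                    ```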
      
            * **Likelihood Weighting** Technique for Sampling
                * Dfn: Generates Samples so we can keep them all
                and do not have to Reject any, by
                **fixing** the value of the "Evidence"(input) variables
                to be what we want to compute (i.e. if we want to compute
                P(B | +a), then we'd fix the variable value of A to
                always be +a when generating Samples. So we'd get Samples
                    i.e.
                        -b, +a
                        -b, +a
                        +b, +a
                        ...
      
                * Issue: **Fixing** values in Samples causes
                    results that are "Inconsistent", but we can
                    overcome this by applying **Likelihood Weighting**
                    * Solve: Fix this issue by assigning a
                    **"Probabilistic Weight"** to each Sample
                * Example: 1
                    ```
                    Say we want to compute P(R | +s, +w)
      
                    (prob. of Rain given that the Sprinklers are on and
                     the Wet Grass is wet)
      
                        1 Pick random value for Cloudy C of +c
                        2 Pick random value for Sprinkler S, but we are
                          "Constrained" to always choosing +s
                          (since want it to match what we want to compute)
                          so select from P(S | C) Table the row
                          +c +s, which as prob. of only 0.1,
                          so we insert that prob. into the "Weight"
                        3 Pick random value for Rain R, (no constraints
                          of matching with what we want to compute)
                          by selecting row from P(R | C) Table
                          with +c (prev. chosen) and selecting say +r,
                          so row has +c +r with prob. 0.8
                        4 Pick random value for Wet Grass W (but
                          we are "Constrained" to choose +w, since
                          "w" appears in what we want to compute P(R | +s, +w) ),
                          and we know that the Parents are also both positive
                          (i.e. previously chose +s +r) so
                          choose row +s +r +w with prob. of 0.99
                          since it's a "Constrained" choice, we
                          add in this prob. into the "Weight" Table
                          and multiply it with the previously recorded
                          weight value
      
                        "Weight"              "Sample"
                        ========              ========
                        0.10 * 0.99 = 0.099   +c, +s, +r, +w
      
                     When we include the "Weights", then counting
                     a "Sample" that was forced to have +s, +w, with a
                     "Weight of 0.099 instead of counting it as a full
                     1.0 sample, makes it such that "Likelihood Weighting"
                     is "Consistent"
      
                    ```
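                * Sketch: a minimal Python sketch (not from the lesson) of Likelihood Weighting for the query P(R | +s, +w) walked through in Example 1; function and variable names are illustrative
                    ```
                    # Likelihood Weighting: evidence variables S and W are fixed to +s and +w,
                    # and each sample carries a weight equal to the probability of that evidence
                    # given its sampled parents.
                    import random

                    def weighted_sample():
                        weight = 1.0
                        c = random.random() < 0.5                  # sample Cloudy from P(C)
                        weight *= 0.1 if c else 0.5                # evidence S = +s, multiply in P(+s | c)
                        r = random.random() < (0.8 if c else 0.2)  # sample Rain from P(R | C)
                        weight *= 0.99 if r else 0.9               # evidence W = +w, multiply in P(+w | +s, r)
                        return r, weight

                    samples = [weighted_sample() for _ in range(100_000)]
                    num = sum(w for r, w in samples if r)          # total weight of +r samples
                    den = sum(w for _, w in samples)               # total weight of all samples
                    print(f"Estimated P(+r | +s, +w) ~= {num / den:.3f}")
                    ```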
                * Example: 2 - Doesn't solve all our problems
                    ```
                    want to compute P(C | +s, +r)
      
                    (i.e. "Constraint" S and R to always be positive +s +r)
                    now, since we use the "Evidence" when we
                    generate a node that has that "Evidence" as Parents,
                    then the Wet Grass W node will always get good values
                    for each Sample based on the "Evidence"
                    but the Cloudy C node won't, so it will be generating
                    values at random without looking at these values,
                    and some of the time it'll generate
                    values that don't go well with the evidence.
                    We won't have to reject them (like in Rejection Sampling),
                    but they will have a low probability associated with them.
      
                    ```
            * **Gibbs Sampling** Technique
                * Dfn: Takes all the "Evidence" into account, not just the
                Upstream "Evidence" (according to the Causal Directions) by
                using the **Markov Chain Monte Carlo (MCMC)**, which is "Consistent"
                and which involves
                re-Sampling just one variable at a time, Conditioned on all
                the other variables
                    ```
                    i.e. given a Set of variables
      
                    C    S    R    W
      
                    first we initialise them to random variable values
                    (keeping the "Evidence" values Fixed) to create
                    a Sample like below:
      
                    +c   +s   -r   -w
      
                    now, at each iteration through the loop, just select one
                    non-"Evidence" variable, and re-Sample it based on
                    all the other variables, which will result in outputting
                    another Sample, i.e.
      
                    +c   -s   -r   -w
      
                    repeat (walking around in the space of assignments of
                            variables randomly)
      
                    +c   +s   +r   -w
      
      
                    Note: In "Rejection Sampling" and in "Likelihood Weighting Sampling"
                          each sample was "independent" of the other Samples.
                          However, in MCMC, this isn't true, instead in
                          MCMC, the Samples are "dependent" on each other,
                          and Adjacent Samples are very similar
                          (i.e. only differ in one place). MCMC is still "Consistent"
                    ```
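                * Sketch: a minimal Python sketch (not from the lesson) of Gibbs Sampling (MCMC) for P(R | +s, +w), re-sampling one non-evidence variable at a time conditioned on its Markov blanket; helper names are illustrative
                    ```
                    # Gibbs Sampling: S and W stay fixed at the evidence (+s, +w); at every step one
                    # of the non-evidence variables (C or R) is re-sampled given everything else,
                    # so adjacent samples differ in at most one variable.
                    import random

                    P_C = 0.5
                    P_S = {True: 0.1, False: 0.5}                      # P(+s | c)
                    P_R = {True: 0.8, False: 0.2}                      # P(+r | c)
                    P_W = {(True, True): 0.99, (True, False): 0.9}     # P(+w | +s, r)

                    def bern(p_true, p_false):
                        """Return True with probability p_true / (p_true + p_false)."""
                        return random.random() < p_true / (p_true + p_false)

                    c, r = random.random() < 0.5, random.random() < 0.5   # random initialisation
                    hits, n = 0, 200_000
                    for _ in range(n):
                        # Re-sample C: proportional to P(c) * P(+s | c) * P(r | c)
                        pc_t = P_C * P_S[True] * (P_R[True] if r else 1 - P_R[True])
                        pc_f = (1 - P_C) * P_S[False] * (P_R[False] if r else 1 - P_R[False])
                        c = bern(pc_t, pc_f)
                        # Re-sample R: proportional to P(r | c) * P(+w | +s, r)
                        pr_t = P_R[c] * P_W[(True, True)]
                        pr_f = (1 - P_R[c]) * P_W[(True, False)]
                        r = bern(pr_t, pr_f)
                        hits += r
                    print(f"Estimated P(+r | +s, +w) ~= {hits / n:.3f}")
                    ```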
      
            * **Monty Hall Problem**
                ```
                Given 3x doors D1 D2 D3
                Behind each door is a prize
                One door's prize is sports car
                Other two doors contain a goat (undesirable)
      
                Step 1: Choose D1, but don't open it
                Step 2: Monty opens one of the other two doors,
                        but deliberately chooses D3, the one with the Goat
                Step 3: Now we need to decide whether to Stay or Switch
      
                What is prob. of winning if you stick to D1?
                (staying with D1 wins 1/3 of the time; switching to D2 wins 2/3)
      
      
                Note: We learned info about D2, since it wasn't the door
                      flipped over by the host to reveal the Goat,
                      so that additional info has updated D2's probability,
                      but we haven't learnt any new info about D1, since it
                      was never an option for the host to open D1
                ```
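            * Sketch: a minimal Python simulation (not from the lesson) of the Monty Hall problem, confirming that staying wins about 1/3 of the time and switching about 2/3
                ```
                # Simulate many Monty Hall games: the host always opens a non-chosen door with a goat.
                import random

                def play(switch, trials=100_000):
                    wins = 0
                    for _ in range(trials):
                        car = random.randrange(3)       # door hiding the sports car
                        pick = random.randrange(3)      # our initial choice
                        # host opens a door that is neither our pick nor the car
                        opened = next(d for d in range(3) if d != pick and d != car)
                        if switch:
                            pick = next(d for d in range(3) if d != pick and d != opened)
                        wins += (pick == car)
                    return wins / trials

                print(f"stay:   {play(switch=False):.3f}")   # ~0.333
                print(f"switch: {play(switch=True):.3f}")    # ~0.667
                ```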
      
    • Lesson 15 - Hidden Markov Models (HMMs) (Probabilistic Models)
      • References:
        • Probabilistic Theory: http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf
      • Other videos: https://www.youtube.com/playlist?list=PLjLGYJqya5UiWa1bNPAAX8oOQ1NCUbvB8
      • Dfn: Machine Learning Algorithms (tools) to process sequences of time-series data such as Pattern Recognition through time (i.e. speech, hand-writing, of signals with a language-like structure such as sign-language recognition, and other human activities like playing basketball, driving a car, vacuuming the floor), which all have familiar units (i.e. movement), combinations of those units, and a source statistical grammar (most promising for future research in Thad’s opinion)
      • Dfn: HMMs have:
        • States that represent Observed phenomena
        • Transitions that describe how to move between states
      • Dfn: “Hidden” in HMM refers to the fact we don’t know exactly what State matches what Physical Event. Instead each State may yield certain Outputs, that we Observe over time, and Encode a Sequence of States over time based on how likely they were to produce that Output
      • Decoding HMMs Find out which Model “best fits” a Sequence of Data
      • HMM Performance Boosters tricks
        • State Tiling
        • Context
        • Stochastic Grammars
        • Boosting
      • First-Order HMMs
        • Dfn: Only depend on State immediately preceding them, but not an entire history of multiple States
      • HMM Representation
        • Example: (from Norvig textbook) https://www.youtube.com/watch?v=34Noy-7bPAo
        ```
        Given Markov chain with Output node for each State (common in Machine Learning community)

        X0 --* X1 --* ... --* XK --* ... --* XT
               |              |             |
               *              *             *
               E1             EK            ET

        where X's each represent a Data Frame
        where X0 is the initial state (useful for book keeping)
        where X1 represents first timeframe T = 1
        where E1 represents Output at T = 1

        HMM States are implicit in the representation

        BUT HMMs are better thought of in terms of their States

        Given a graph of signal (y-axis) through time (x-axis) (0:59) Create an x-State HMM, where x is the qty of different Gradients in the signal to design a Model that may have generated the Data shown in the graph

        Assume we can create a HMM by “Inspection” to represent at given signal. In reality we want many examples of signal we want to Model, and need a Model that can accommodate all different examples (balancing generalisation and overfitting)

        Use a “Left-To-Right HMM” (i.e. never transition back to a previous state after leaving it)

        Loops at 1:31 represent “Self-Transitions”, indicating that the Model may remain in same State for several timeframes

        Sometimes Entry Point values before entering a State are shown.

        “Output Probabilities” (Emission Probabilities) - values that are allowable whilst in a given State (but Output Distributions are “densities”, so the Output Probabilities aren’t really probabilities)

        Step 1: “Output Probabilities” for first State may be shown as a “Box Car” Distribution (2:07) since values from -2 to -1 are all equally represented. Repeat for other States.

        Step 2: Determine values for “Transition Probabilities”. Note that at start of graph we spend 10 timeframes before “transitioning” to second part of the graph, so want to assign a probability at the end of the first part where we escape State No. 1 (i.e. 0.1, so on average we expect to stay in that state for 10 frames)

        Note: All probabilities out of a state must add up to 1, and since we only have the “Self-Transition” remaining, we know it has probability of 0.9

        Note: Qty of timeframes we expect to stay in a given State is

          Qty_Timeframes_in_State = 1 / (1 - "Self-Transition Probability")
        

        Step 3: Repeat similar to Step 2, which has 5 timeframes, so make “Escape Probability” of 1/5 = 0.2, and “Self-Transition Prob.” of 1 - 0.2 = 0.8

        Repeat…

        …
        ```
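        • Sketch: a minimal Python sketch (not from the lesson) of building a Left-To-Right transition matrix "by inspection" from expected state durations, using the relation Qty_Timeframes_in_State = 1 / (1 - "Self-Transition Probability"); the durations are illustrative
          ```
          def left_to_right_transitions(expected_durations):
              """Left-to-right transition matrix: each state i has self-transition
              1 - 1/duration[i]; the remaining mass escapes to state i+1
              (or out of the model, for the final state)."""
              n = len(expected_durations)
              A = [[0.0] * n for _ in range(n)]
              for i, d in enumerate(expected_durations):
                  stay = 1.0 - 1.0 / d          # self-transition probability
                  A[i][i] = stay
                  if i + 1 < n:
                      A[i][i + 1] = 1.0 - stay  # escape probability to next state
              return A

          # e.g. 10 frames in the first state and 5 in the second, as in Steps 2-3 above
          # (the third duration is made up for illustration)
          for row in left_to_right_transitions([10, 5, 8]):
              print([round(p, 2) for p in row])
          ```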

      • Dynamic Time Warp
        • See in example
      • Note: In this section, focus on situations where language-like structure to a problem (i.e. a fundamental unit like a “letter” in handwriting) that’s combined with other units to make larger constructs such as “words”, which are then combined into larger constructs like “sentences”

      • Example: Characters and Words https://www.youtube.com/watch?v=0DJanDy8-eA
          Given 2x HMMs (HMM A, HMM B), each with 3x states
          Graph for each state shows "topology" and "transition probability"
          Graph for each state shows only "output probability" (1x gaussian, including a "mean" and "standard deviation" of 1)
          Goal is to use HMM A and HMM B for "Optical Character Recognition"
        
          Given graphs with "y" (y-axis) and "t" (x-axis)
          that depict a Character, and Words.
        
          Given graph Inputs, a sequence of "v" values indexed over time
          from left to right, where "v" computed by counting qty of "black" squares
          in each column of time
        
          i.e. L character has sequence of "v" values of 5, 1, 1, 1, 1
        
          Given Figures 1 to 4 showing a sequence of characters over time,
          indicate which of HMM A or HMM B most likely produced the data?
        
          Note: All states from both HMM A and HMM B have same "transition probability"
          The "output probabilities" of HMM A and HMM B are what makes them different
          Each HMM state's "output probability" gaussian shows high or low "mean" values at different times,
          so match these with qty of black squares over columns of time in each of Figures 1 to 4
        
      • Example: Dolphin Communication (Euclidean Distance, Dynamic Time Warping, Sakoe Chiba Bounds)
          Given Spectrogram of dolphin whistle, where:
              - x-axis is Time
              - y-axis is Frequency
              - Pixel Brightness indicates Power in a Frequency Band
              - Data in Spectrogram shows same Dolphin Whistle repeated twice
              where each Dolphin Whistle contains 2x parts.
        
                  Note:
                  - Noise in Ocean at Lower Frequencies when recording across broadband
                      - 5 kHz to 17 kHz (Atlantic Spotted Dolphin whistles)
                        and up to 110 kHz (Ultrasonic)
                          - <= 3 miles (detection audible range under water)
                          - Signature Whistles for Dolphin Identification
                          (if in new area, emits Signature whistle to indicate to friends
                           location where it wants their company)
                      - < 5 kHz (Boat Wave Noise)
                  - Dolphins have Sound Categories vocalisation:
                      - frequency modulated whistles (Signature whistle)
                      - burst pulse squawks (aggression, close proximity, social)
                      - echolocation clicks (sonar, navigation, feeding by dig under sand, close proximity 1km)
                      - close proximity sounds (aggressive contacts)
                      - buzzes (discipline, close proximity, tactile effect used during courtship)
                  - Human audible range is up to 15 kHz (i.e. Whistles)
                  - Correlate Sound Categories with certain Behaviour types
        
          Goal:
              * Handle different Signal Categories (Classes)
              where each example may be "warped" differently in time
          Issues:
              * Cannot just use the Whistle frequency over time
              as Dolphins Raise/Lower the basic frequency of their Whistle
          Important:
              * Identify Patterns of Rise/Falls of Whistle frequency over time
              * Plot "Delta Frequency" (the Change in frequency) instead
              (i.e. instead of plot 5, 14, 10,  7, 10, 14...
                    instead use:        9, -4, -3,  3, 4...
                that way Recogniser can handle Whistle frequencies starting at
                different values
              * Try plot "Time Warping" (i.e. saying quickly vs slowly)
                  * Solution:
                      * Match "features" that make Whistle distinct
                      whether produced quickly or slowly
              * Try plot "Euclidean Distance" (INSUFFICIENT, use DTW instead)     https://www.youtube.com/watch?v=IdIiVJG4M0c
                  * Weakness: Sensitivity to distortion in time axis
                  i.e. plot two graphs, first G1 with 21 samples (S0, S1...) of Delta Frequency,
                  and second G2 with 12 samples with remaining 9 padded with 0's (no change),
                  or fill in with average or last value
                  * Calculate "Euclidean Distance" by calculating
                  square root of the squares of the distances
                  (i.e. sqrt(
                          t0 G1's S0 of 2 and G2's S0 of 5 would be (2-5)^2 = 9 +
                          t1 G1's S1 of 3 and G2's S1 of 5 would be (3-5)^2 = 4 +
                          ...
                        )
                 )
              * Try plot "Dynamic Time Warping" (DTW)  https://www.youtube.com/watch?v=RvO7DhXKHn0
                  * Benefit: When we know the two signals we are comparing will have
                  some sections that are faster, whilst others are slower
                      * Allows two similar time series but locally out of phase to align in
                       non-linear manner
                  * Disadvantage: O(n^2) complexity, but applying the Lower Bounding Technique
                  on the DTW distance becomes O(n) (even with 4S approach).
                  * Lower Bound Technique: Bounding envelope below and above the
                  warping window. Used for sequences of same length.
                  If sequences have different lengths, one of them must be reinterpolated
                  and becomes the 4S approach
                  (see Part 2.1 of paper for code)
                  * Objective: Align 2x signal samples of Whistles to see/compare
                  how well they match together (see the DTW sketch below)
                  * Important Note: A "Match" without any "time warping" on the
                  x and y axis graph, would be a straight line diagonal
                  from lower left to upper right corner (rare)
                  * Process
                      * Draw dotted diagonal line
                      * Start plotting 1st Sample of each signal
                      * Plot transitions to try and Match values as "close as possible
                      to the ideal matching diagonal dotted line" to minimise error,
                      and if values are tied, try and take preference in choice
                      to keep to the dotted diagonal
                      * After all dots plotted, calculate for each dot
                        the square of the difference between the x-axis value and
                        the y-axis value, then add all these values up and take
                        the sqrt of the total
                        (which is half the amount of using just plain Euclidean Distance
                        padding the remaining with 0's)
        
              * "Sakoe-Chiba Bounds"
                  * If allow too much "warping" then distance may be smaller than desired
                  * Use "Sakoe Chiba Bounds" to force more "reasonable" matching by bounding
                  amount can deviate from diagonal
                  "Sakoe Chiba Bounds" does not allow matches outside limits of two diagonal
                  bounding dotted lines, causes matching to be worse and to lead to
                  a bigger distance, which is what we want
                  * Calculate "bounds" often empirically. Set different "bounds" and use
                  cross-validation to ensure they're reasonable
                  * Set different "bounds" for different sections of the signal
                  each with different amounts of allowed warping
                  (i.e. where lots of potential variance of short/long, but
                  where hump shapes may have small variation)
                  * Avoid different "Sakoe Chiba Bounds" for different sections of signal
                  but would be complicated to train, instead use Hidden Markov Models (HMMs)
                  * Note: DTW with lower bound requires two sequences being compared
                  both of same length where amount of warping is constrained
                  * Note: The 10% constraint on warping inherited from speech processing
                  community is actually too large for real world data mining
        
              * Readings on Dynamic Time Warping
                Ratanamahatana and Keogh’s “Everything you know about Dynamic Time Warping is Wrong”
        
                http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf
                  * Section 2.1
                      * shows how given two sequences, the optimal path
                      is a path that minimises the
                      warping cost (minimises the total cumulative distance between them)
                      * shows how to align two sequences by constructing a warping matrix
                      to search for the optimum path, and using constraints to
                      restrict number of paths needed to be considered
                      (using Sakoe-Chiba Bounds)
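              * Sketch: a minimal Python sketch (not from the lesson or the paper) of classic DTW with an optional Sakoe-Chiba band, using a cumulative-cost matrix; the delta-frequency sequences below are made up
                  ```
                  # Dynamic Time Warping: fill a cumulative-cost matrix of squared differences,
                  # allowing match / insertion / deletion moves; an optional Sakoe-Chiba band
                  # forbids cells too far from the diagonal.
                  import math

                  def dtw_distance(a, b, band=None):
                      n, m = len(a), len(b)
                      D = [[math.inf] * (m + 1) for _ in range(n + 1)]
                      D[0][0] = 0.0
                      for i in range(1, n + 1):
                          for j in range(1, m + 1):
                              if band is not None and abs(i - j) > band:
                                  continue                        # outside the Sakoe-Chiba bounds
                              cost = (a[i - 1] - b[j - 1]) ** 2
                              D[i][j] = cost + min(D[i - 1][j],      # step in a only
                                                   D[i][j - 1],      # step in b only
                                                   D[i - 1][j - 1])  # step in both (match)
                      return math.sqrt(D[n][m])

                  fast = [9, -4, -3, 3, 4]                 # whistle-like delta-frequency sequence
                  slow = [9, 9, -4, -4, -3, 3, 3, 4]       # same shape, "said" more slowly
                  same = [8, -4, -2, 3, 5]
                  print(dtw_distance(fast, slow))          # small: DTW absorbs the time warping
                  print(dtw_distance(fast, same, band=1))  # equal lengths, warping limited to 1 frame
                  ```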
        
        
        
        
      • Readings:
        • Dolphins https://www.youtube.com/watch?v=blmTrZMTcUs&feature=youtu.be after 26 mins it talks about interaction with other dolphin species using underwater camera
        • Prairie Dogs http://www.cogs.indiana.edu/spackled/2009readings/Slobodchikoff%202009.PDF
          Notes from Dolphins YouTube video:
                  - Together or Apart from different Species
                      - Mutual dialects (when with other species)
                      - Individual dialect (when separate from other species of Dolphin)
                  - Signals
                      - Referential (i.e. refer to some label like a name or a word)
                          i.e. Monkey Alarm Calls for different predators so take appropriate evasive action
                          i.e. Prairie Dog Alarm Calls
                      - Graded (i.e. intensity/frequency of sound as motivation or emotional state of animal)
                  - Multiple Sounds
                      - Dolphins can make two Sound Categories at the same time
                  - Sound Direction
                      - Dolphins are Directional senders/receivers of ultrasonic
                      sounds and can internally vary angle by 20 degrees
                  - Key Challenges:
                      - Recording power to sample and decoding High Frequency sounds
                      - Localising to deciphering which Dolphin made the sound
                          - i.e. Hydrophone array
                      - Whistles are the best studied sound, since easier to measure
                      (but not the most frequently used sound, nor do they have much broadband
                      characteristics with important information)
                      - Goals with Thad's Georgia Tech CS:
                          1 Categorise difficult Sounds (make sense of info, match with Behaviour types)
                          2 Match with Behaviour
                          3 Look for Structure and Order (mine data, look for longer Sequences,
                          Structure, and Order, as not yet applied to animals)
        
                  - Tech Overview
                      - Neural Networks to train computer if have enough Samples of
                      each individual to see quantitatively measurement of
                      differences between Whistles
                      - Analysis of Dolphin sounds using "Multi Model Signal Integration"
                      with Noldus "Observer Behaviour Software" that
                      shows interaction of Sound and Behaviour over timeline:
                          - Body postures timeline from video breakdown on Ethogram over time are labelled (of behaviour)
                          - Sounds from Spectrogram over time and labelling the Sound Categories
                  - Tools and Techniques (Thad)
                      - Goal
                          - Find Fundamental Units (FU) used by Dolphins
                          - Check if they combine FU to form new vocalisations
                      - Implementation
                          - Algorithm takes large DB of data and search for FUs and Categorise different parts
                          (previously done for handwriting recognition to pull out individual letters
                          and words used repeatedly in a sentence)
                          - Goal is Universal Algorithm that takes any structure from given data,
                          uncovers FUs (those of interest excluding noise), put the FUs in a
                          Visualisation tool and automatically tune it to recognise
                          repeated Patterns in data set specifically
                          (grammars, rules to the behaviour), by labelling the data set with
                          colour codes for each discovered Pattern.
                          Find "Rules, using Regular Expressions (Alignment Based Learning)
                          where there is a structure in the data set (i.e. when a pattern
                          is allowed or not to be used in the data set). The "Rules"
                          indicated much more structure than first thought, and were
                          better predictors of visual behaviour.
        
                          Scenarios:
                              - Add Dolphin "D1" and "D2" to scene
                              - Find patterns in Behaviour (i.e. head to head, head to tail)
                              - Try to "Predict" what Labels would be
                                  - DISCOVERED that "Rules" (combination of Patterns) were
                                  very good at "Predicting" which Dolphin was around at certain time
                              - Try to use Acoustic Data to "Predict" Categories of distinct
                              biological data. Found that all are distinct
                                  - aggression
                                  - play
                                  - foraging
                                  - reunion
                              - Statistical Testing - use System and Rules to use Acoustic Data to
                              "Predict" different "labels" of Visual Behaviours
                                  Tables of Correlations:   https://youtu.be/blmTrZMTcUs?t=22m55s
                                  (where "blue" means very similar
                                         "green" means very different
                                         "yellow" means some correlation
                                      - Discovered "Rules" (Patterns generated) of aggression don't correspond to "Rules" of play
                                      - Discovered "Rules of play don't correspond with "Rules" of foraging
                              - Tool that takes 30 years of Data, processes it allowing
                              hypotheses to be made and tested quickly
                                  (similar to Tool for gene banks in genetics that looked
                                   at repeated patterns in different Human characteristics
                                   such as people with Parkinsons', or blue eyes, etc.
                                   it uses discrete 4-base-pairs and sequence to predict when something
                                   interesting will occur, but for Dolphin Acoustics
                                    we are looking at Continuous time instead of Discrete time)
                  - Tech Interface for Mutual Shared Communication dialect b/w Humans and Dolphins
                      - Missing element is Tool allowing Dolphins to communicate back
                          - Created portable Keyboard activated by Action touch
                          point by Dolphin. Labelled Objects with Visual and Acoustic cues:
                              (shapes such as Scarf, Rope, Bow Ride)
                              - Goal of Dolphins mimic human-Whistles so humans could
                              respond to their requests
                          - Problem: Too Slow response time
        
                      - SOLUTION:
                              Built underwater wearable computer called
                              CHAT (Cetacean Hearing And Telemetry)
                              (PROJECT 1 FUTURE FROM 2015 - ideally want Google Glass but waterproof
                              so can see all FU as a script moving across i.e. ABBC
                              that can manipulate and repeat back to Dolphins)
                                  - Bone Conduction Headphones - speaker sends
                                  vibrations through skull to cochlea (for underwater)
                                  - Hydrophones - receive sounds
                                  - Speaker - plays sounds
                                  - Keypad - allows user generate sounds
                                      - Instead of being on Arm, they put it on the box
                                      (cast buttons and display in resin)
                                          - Paper was released on how they designed this
                                  - Real-Time Sound Recognition
                                  (so if Dolphin mimics a sound we know in real-time
                                  what sound it mimicked, and respond accordingly)
                              Issue: Salt water intrusion (took 4 years)
                                  - Delays ordering IP67 connectors to aluminium box
                                  and $400, but now did with 3D Printer much cheaper
                                  - Clear cover on case too, so Wifi signals go through
                                  (can't go through aluminium) to download video quickly, and
                                  send patch over wifi in real-time
                                  - Hydrophones not entirely waterproof through movement
                                  - Need a low power, high throughput computer that does not
                                  overheat in the small aluminium box in the Bahamas sun
                                  - 192 kHz audio system (2 channel)
                                  - Audio amplifier (to make non-compressible media
                                  sound go through water
                                  lots of power required, so had to design their own
                                  speaker systems, for certain frequencies, where
                                  amplifier matches speaker and powered off batteries)
                                  - LED screen for status
                                  - Risk of battery on fire near chest
                                  - Risk of electrical current underwater attracts sharks
                                  - Only tuned to a narrow bandwidth; not powerful
                                  enough to recognise (and Sample at) the full 96 kHz band
                                  used by Dolphins when they tried to mimic,
                                  so needed a Faster computer,
                                  so used an Nvidia Jetson board and replaced the
                                  big fan with a large heatsink,
                                  so could get the Spectrogram running Samples at 192 kHz
                                  and 32 bits per channel using 256 GB SSD
        
                              ACOUSTICALLY CONTROLLED SUB (DOLPHIN SUB)
                                  - undergraduate project
                                  - must reward animals immediately
                                  within 50 ms otherwise cannot be trained
                                      - Issue Identified:
                                          - Maybe time to pass scarf underwater
                                          taking too long for Dolphin to associate
                                              - ideally a toy that reacted to
                                              Dolphin vocalisations,
                                              i.e. ultimately put CHAT inside a
                                              DOLPHIN SUB, that makes sounds with a
                                              chat box and flashes lights and performs
                                              a corresponding
                                              action like move left or right
                                              (to spark interest from Dolphins)
        
                              RECOGNISING SPOTS SYSTEM
                                  - 95% of time recognises a Dolphin by its spots
                                  (even works from tail-end view)
                                  - applied same to Humpbacks
        
                          GET INVOLVED
        
                              CHAT Society Technical & Scientific Student Support
        
                              www.WildDolphinProject.org
        
        
          Goal:
              Recognise the Types of Dolphin Whistles so marine mammal researchers
              can automatically annotate their database of recordings
        
          Issue:
              The second time the Dolphin Whistle is repeated, it lasts longer
        
      • Example: Sign Language Recognition (real problem using HMM Representation)
        • Step 1: Create HMMs for Word “I” (finger point to chest) Gesture using Delta Y as our first feature (a bad feature), purposely chosen to show how powerful HMMs can be (we can still tell the difference b/w the two words by the difference in timing, since “I” will show a higher probability in the middle state than “WE”, and the Transition Probabilities will be lower in “I” than “WE” since less time is spent there moving the hand across the chest; the longer a gesture is, the easier it is to recognise)
          • Note: Delta Y is similar for both Word “I” and Word “WE”; Delta X makes it easier to differentiate the two words “I” and “WE”
          • Visualise the Derivative of a Signal
            • Derivative Plots https://www.youtube.com/watch?v=LiDdrJjjjfc
        • Step 2: Plot HMM “I” with 3x States for each of the 3x separate Motions of the Gesture
          • State 1 - arm comes up toward chest
          • State 2 - arm pauses for a moment
          • State 3 - arm comes down away from chest
        • Note: Higher probability on State 1 and 3 “Self-Transition” since spend more time on them. “Transition Probabilities” for each State based on the timing
        • Step 3: Observe Plot of “Delta Y over time” for the Model to check it’s reasonable
        • Step 4: Observe Plot of “Output Probability Distribution” (i.e. P(Delta Y) over Delta Y), where State 2 has higher probability of Delta Y of 0, where used Gaussians to model upper probabilities. https://youtu.be/CBW2C3lwbz4?t=47s

        • Step 5: Create HMMs for Word “WE” (finger from side to right chest then left chest then down to side) https://youtu.be/1mtzJGhRYHg
          • Use 3x States
          • State 2 Delta Y varies more so wider/shorter for “WE” than “I” (see “Output Prob. Distrib” plot)
          • State 2 more time spent for “WE” than “I” (see “Delta Y over time” plot)
          • State 2 more time spent for “WE” so higher prob. on Self-Transition (loop) (see HMM Model)
        • Note: Distinguish between two gestures by observing the following sequence properties of Delta Y:
          • Duration of time spent in a state
          • Distribution of Delta Y observed
      • Recognition of HMM Models
        • Given set of Observations “O” (Samples in example to recognise)
        • Create a Viterbi Trellis to check likelihood that the Model generated Samples in “O”.
          • Correct Match will be the Model with highest probability
          • Note: “Hidden” is the exact Sequence of States that created the Output, so need to go through all possibilities
        • Lambda λ I is the combination of the HMM Model and the “Output Probability Distribution” Plot
        • List the Data, a Sequence of Delta Δ Y values over time that we want to recognise as either “I” or “WE”
        • Goal is to solve P(O | λI) for “I” and “WE” and compare them (i.e. probability of Observation Sequence given HMM Model of “I”)
        • Create a “Trellis” with separate row for each State S1, S2, S3, in the HMM Model (want find State we’re in at each Time). Fill out the Trellis
          • Eliminate impossible States and paths using the HMM. Fill in the possible paths of state transitions (note that Transition Probabilities limit what State you can be in at any given time) i.e. obviously S1 leads to S1 or S2 at the Start, and the last two States must be S2 and/or S3 https://www.youtube.com/watch?v=aOPeQAxDS4E
          • In Middle of Trellis, many more options for State Transitions exist.
          • Add “Transition Probabilities” directly from HMM Model λI
          • Go through Observation Sequence over time, determine probability of getting the Delta Y value at each time, by looking at the “Output Probability Distribution” to see what the probability the Gaussian would generate for a Delta Y value at each time https://youtu.be/tqYJnaQfsn0?t=1m7s
            • Note: Lower likelihood for tighter Gaussian width
        • Find the Viterbi Path (most likely path)
          • At each timeframe, we multiply the “Transition Probability” by the “Output Probability” to find the Maximum Path
          • Note: The “Viterbi Path” may not be the “Greedy Path” (i.e. the highest expected value with each transition may not lead to highest valued overall path)
          • i.e.
            • Expected Value of Staying in a state S1 instead of moving to S2 at Time 2 is 0.8 * 0.5 versus 0.2 * 10^-7
            • Greedy Algorithm would expect to Stay in S1 since the value from the multiplication is bigger there
        • Note: To account for the overall path, we need to track the probability of each possible sequence through the Trellis. Keep track of the Maximum value path to each State in the sequence at each time through the Trellis https://youtu.be/Hq2P6zWA9F8?t=2m14s and at the end choose the Most likely path (ideally use Log space to calculate the probabilities to avoid running out of precision in the way the OS represents the numbers; see the sketch after this note). At Time = 2, there are only 2x node sequences that could be chosen:
            1 * 0.5 * 0.8 * 0.5 = 0.2
          
            OR
          
            1 * 0.5 * 0.2 * 10 ^ -7 = 10 ^ -8
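        • Sketch: a minimal Python sketch (not from the lesson) of the Viterbi algorithm in log space over a 3-state Left-To-Right model with Gaussian outputs; the means, deviations, transition values, and observation sequence are illustrative placeholders
          ```
          # Viterbi: at each timeframe, keep for every state the log probability of the best
          # path ending there (transition log-prob + output log-prob), then backtrack.
          import math

          def log_gaussian(x, mean, std):
              return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

          def viterbi(obs, start_logp, trans_logp, means, stds):
              n_states = len(start_logp)
              best = [[start_logp[s] + log_gaussian(obs[0], means[s], stds[s])
                       for s in range(n_states)]]
              back = []
              for x in obs[1:]:
                  row, ptr = [], []
                  for s in range(n_states):
                      logp, prev = max((best[-1][p] + trans_logp[p][s], p)
                                       for p in range(n_states))
                      row.append(logp + log_gaussian(x, means[s], stds[s]))
                      ptr.append(prev)
                  best.append(row)
                  back.append(ptr)
              path = [max(range(n_states), key=lambda s: best[-1][s])]
              for ptr in reversed(back):               # backtrack the most likely path
                  path.append(ptr[path[-1]])
              return list(reversed(path)), max(best[-1])

          NEG = -1e9                                      # stand-in for log(0)
          trans = [[math.log(0.8), math.log(0.2), NEG],   # 3-state left-to-right topology
                   [NEG, math.log(0.5), math.log(0.5)],
                   [NEG, NEG, 0.0]]
          path, logp = viterbi(obs=[8, 7, 0, 1, -7, -8],    # made-up Delta Y sequence
                               start_logp=[0.0, NEG, NEG],  # must start in S1
                               trans_logp=trans,
                               means=[7.0, 0.0, -7.0], stds=[2.0, 2.0, 2.0])
          print(path, logp)                            # e.g. [0, 0, 1, 1, 2, 2]
          ```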
          
      • HMM Training for “I”
        • Dfn: We usually Train using the Data itself, instead of just creating the Models by inspection. When using HMMs for Gesture Recognition, have minimum 5x examples, and ideally we have 12x examples for each Gesture we’re trying to recognise. Each example set of Data may have different amount of Data Points (timeframes)

        • Example:

            Given 3x examples of Data, each with different amount of Data Points (timeframes)
            Use a 3x State Left-To-Right Topology for HMM.
          
            What examples of Data match with which State?
          
          
            STAGE 1: Use Viterbi Alignment to Initialise the values for each HMM State
                     (for each timeframe in each example of Data we assigned a State
                     for that timeframe and calculated the resulting averages of all
                     values assigned to each State, along with average amount of
                     time we expect to stay in each State)
            ==========================================================================
          
                Calculate "Transition Probabilities"
                ==================================
                - Assume each State has approx equal qty of timeframes.
                - Divide each example's timeframes into 3rds
                - Average length of each Data is 12 timeframes long (i.e. 4x Data Points per State per example Data)
                So "Transition Probability" out of each State is approx 0.25
                And so "Self-Transitions" must be 0.75 (since total probability for each State is 1)
          
                Calculate "Output Probabilities"
                ==============================
                - Assign the first 4x Data Points to the 1st State S1 for each example of Data
                Average of all the values is Mean: 5.4; Standard Deviation: 3.1, and then plot on single Gaussian
                - Repeat for State S2 and S3
          
                Iterate in Process similar to "Expectation Maximisation"
                ==============================
                - Try to get a Gaussian with less variance by performing calculation
                and moving the Mean and Std Deviations between the States S1, S2, S3.
                (See if moving boundaries in these examples of Data between
                States S1, S2 and S3 will lead to improved explanation of the Data)
                - Recalculate, by re-counting qty of Data Points in S1 for each example and
                 recalculate the "Transition Probability" and "Self-Transition" in the HMM.
                 Repeat for each State
                - Update the "Output Probabilities" on the Gaussians
                - Repeat until "Convergence" when none of the boundaries need to move,
                nor are the "Output Probabilities"
                going to change
          
          
            STAGE 2: Finish HMM Training
            ============================
          
                "Baum Welch (aka Expectation-Maximization (EM) algorithm) Re-Estimation Process"
                using the "Forward-Backward Algorithm" to keep track of calculations
                (final Model often gives better results than initial estimate).
          
                (similar to "Expectation Maximisation", but instead,
                every sample of Data contributes to every State
                proportionally to the probability of that frame
                of Data being in that State)
          
                (i.e. even though a Data Point is most likely in say S1,
                its value still affects S2 and S3 a little bit, since its
                likelihood of belonging to S2 and S3 is low)
          
                Note: Some Data Points have higher likelihood of being in
                two out of the three States i.e. S1 or S2, and so has a
                larger Influence on the final "Output Probabilities" for
                S1 and S2 states.
                ================================
          
                References:
          
                    - Rabiner Tutorial on Baum Welch
                        http://www.cogsci.ucsd.edu/~ajyu/Teaching/Tutorials/hmm.pdf
                    - Thad's Master's Thesis (continues with Sign Language example using Baum Welch)
                    - Review code of HMM Toolkit like HTK
                    (watch Means and Variances change as re-estimation process
                    iterates to build intuition)
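        • Sketch: a minimal Python sketch (not from the lesson) of the Stage 1 "flat start" above: split every training example evenly across 3 States, estimate one Gaussian per State, and derive "Self-Transition" probabilities from the average State duration; the example sequences are made up
          ```
          # Flat-start initialisation for a 3-state Left-To-Right HMM.
          import statistics

          def flat_start(examples, n_states=3):
              per_state = [[] for _ in range(n_states)]   # frames assigned to each state
              durations = [[] for _ in range(n_states)]   # frames spent in each state, per example
              for seq in examples:
                  chunk = len(seq) / n_states
                  for s in range(n_states):
                      frames = seq[round(s * chunk):round((s + 1) * chunk)]
                      per_state[s].extend(frames)
                      durations[s].append(len(frames))
              model = []
              for s in range(n_states):
                  avg_dur = statistics.mean(durations[s])
                  model.append({
                      "mean": statistics.mean(per_state[s]),
                      "std": statistics.pstdev(per_state[s]),
                      "self_transition": 1 - 1 / avg_dur,   # stay ~avg_dur frames on average
                  })
              return model

          examples = [[5, 6, 7, 8, 0, 1, 0, -1, -6, -7, -8, -6],
                      [4, 6, 7, 1, 0, 0, -7, -8, -5],
                      [6, 7, 8, 8, 1, 0, 1, 0, -7, -8, -9, -8, -7]]
          for s, params in enumerate(flat_start(examples), start=1):
              print(f"S{s}: {params}")
          ```
          (Stage 2, the Baum-Welch re-estimation, would then soft-assign every frame to every State and update these parameters iteratively.)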
          
          
          
      • HMM Improvement using Delta X Feature (instead of Delta Y done previously)
        • Note: Choice of Delta X or Delta Y as being the better Feature depends on the Sign being recognised. Since Sign Language is two-handed, we need these features both for Left and Right Hands, so would have 8x features being tracked for time frames that we need to integrate into our HMM Models by adding more dimensions (i.e. calculate multi-dimensional distances instead of just one dimension) for our Multidimensional Output Probabilities, and all the other HMM Training and recognition are same as before. It’s hard to show on Graphs, but easy to code https://youtu.be/xDN1V4Y1Ung?t=56s

        • Examples of Good Features:

          • “Size” of the hand is a good feature, giving info about whether the hand is coming toward or away from the camera
          • “Angle” of hand to the horizontal
      • “Mixture of Gaussians” Technique
        • Dfn: Use two Gaussians to Model whenever “Output Probabilities” are Bimodal instead of Gaussian. The more Gaussians we use, the closer we will model it (try to limit to 2-3 Gaussians, otherwise it tends to overfit); see the sketch below
        • Note: Central Limit Theorem states we should get Gaussians if enough factors affect the data
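        • Sketch: a minimal Python sketch (not from the lesson) of a two-component mixture-of-Gaussians output density for a bimodal state; the weights and parameters are illustrative
          ```
          # A mixture of Gaussians: weighted sum of component densities.
          import math

          def gaussian_pdf(x, mean, std):
              return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / (std * math.sqrt(2 * math.pi))

          def mixture_pdf(x, components):
              # components: list of (weight, mean, std); the weights should sum to 1
              return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)

          bimodal_output = [(0.6, -3.0, 1.0), (0.4, 4.0, 1.5)]
          for x in (-3, 0, 4):
              print(x, round(mixture_pdf(x, bimodal_output), 4))   # two bumps, low density between
          ```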
      • HMM Topologies
        • Strategy: try a simple Topology first and only add States when necessary. Many different Topologies may be tried, using cross-validation with randomly selected training-independent test sets to find the best one
        • Note: Increase size of vocabulary from just “I” and “WE” by creating a Topology for each new Sign that we can use to start making phrases.
        • “WANT” has onset of lifting arms up, arms stretched out in front, and motion for the sign itself of pulling them towards the body, then drop hands. Requires 3x States.
        • “NEED” has onset of lift one hand up with index and thumb like a claw, then lower hand about 10cm, and then drop hand. Requires 3x States.
        • “TABLE” has onset of both arms crossed horizontally with right on top about 5cm apart, then bounce up/down connecting both arms twice or three times (means same thing), then drop arms.
          • HMM Alternative Topology (instead of Left-To-Right Topology)
            • Loop in HMM from 3rd to 2nd State so motions may repeat a number of times.
        • “CAT” can be either with one hand lifted to mouth, then pull away horizontally away from body OR with both hands (means same thing)
          • Find best Topology by using Strategy of simplest first, starting with normal 3-State HMM, then use “Mixture of Gaussians” Technique to help handle the “Output Probabilities” we get
          • Alternatively (results in highest accuracy given sufficient examples to train them, whilst adding more signs to vocabulary) recognise as two different signs: Train a Model for the first variation of the sign called “CAT 1”, and a separate Model for the second variation called “CAT 2”; after recognising “CAT 1” or “CAT 2” we’ll just have the recogniser output CAT
      • Phrase Level Sign Language Recognition

        • Example
            Give X Signs and Y Phrases to recognise.
          
            Note: If have 2x Variants of CAT ("CAT 1" and "CAT 2" to recognise)
            we need more Phrases.
          
            Note:
                - Grammar used to simplify with order [Pronoun, Verb, Noun]
                    i.e. PRONOUN = {I     | WE}, etc
                         VERB    = {NEED  | WANT}
                         NOUN    = {TABLE | CAT}
                         CAT     = {CAT1  | CAT2}
                - We will only recognise one phrase at a time
          
            Given Data from someone using Sign Language that we want to recognise
            (i.e. the Phrase "I NEED CAT")
          
            Inspecting the Delta Y "feature" of the Data over time gives us
            Observations (i.e. 20 Samples through time):
            3 7 3 0 1 3 1 0 -1 -3 -2 0 2 1 0 1 0 -3 -7 -3
          
            Use recognition as before but using Big "Trellis"
          
            Create horizontal row with 20 Samples
          
            Create vertical row with 22 States, break-down as follows:
                3x States for "I", "WE", "NEED", "WANT", "CAT 1", "CAT 2"
                4x States for "TABLE"
          
            "Breadth-First Search"-like expansion of each possible path in each
            time step
          
            At T = 1, Grammar says must start with "I" or "WE" (Pronouns)
                   so must start in State 1 (S1) of these
          
            At T = 2, could be in S1 or S2 of "I" or "WE"
          
            At T = 3, could be in S1, S2, or S3 of "I" or "WE"
          
            At T = 4, could be in S1, S2, or S3 of "I" or "WE" (Pronouns)
                   OR transitioned to S1 of "NEED" or "WANT" (Verbs)
                   https://youtu.be/0M-WJCgqtEc?t=1m45s
          
                   ... complexity begins (even though simple example with 3x letters)
          
                   ... see 2:45 to 2:53 that shows elimination of impossible final States that
                    aren't Nouns
          
          
          
          
      • Stochastic Beam Search (SBS)
        • Note: HMMs use a lot of memory (especially with a vocabulary of all 6000 signs in American Sign Language (ASL) and the variations that give it its expressiveness and speed). Recognisers will find it hard to keep the full “Trellis” in memory and to keep all the paths updated, so we use SBS to overcome the issue instead of a Breadth-First Search-like expansion of each possible path in each time step. SBS involves pruning some paths
          • Remove some Low Probability Paths
            • i.e. some paths get low probability quickly (i.e. staying in S1 until T = 1 seems impossible)
            • Do not remove all low probability paths (since possibility of Bad match in beginning of the Phrase that becomes a Good match later)
              • i.e. signer might hesitate or accidentally start with wrong sign before changing it (like stuttering or false start in spoken language)
            • Keep paths Randomly (important in AI) in proportion to their probability (similar to the Fitness Function used previously in Genetic Algorithms and to the randomness of Simulated Annealing); see the pruning sketch after this list
          • Draw Low Probability Paths through the Trellis
          • Draw High Probability Paths through the Trellis
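        • Sketch: a minimal Python sketch (not from the lesson) of stochastic beam pruning: keep a fixed number of trellis paths sampled in proportion to their probability, so low-probability paths can occasionally survive; the candidate paths and scores are placeholders
          ```
          # Stochastic beam pruning of trellis paths (fitness-proportional selection).
          import math
          import random

          def stochastic_beam(paths, beam_width):
              """paths: list of (path, log_prob). Keep up to beam_width paths, sampled in
              proportion to their probability (with replacement here, for simplicity;
              a real pruner would sample without replacement)."""
              if len(paths) <= beam_width:
                  return paths
              weights = [math.exp(lp) for _, lp in paths]   # proportional to probability
              return random.choices(paths, weights=weights, k=beam_width)

          candidates = [(("I-S1", "I-S2"), -2.1), (("I-S1", "I-S1"), -9.5),
                        (("WE-S1", "WE-S2"), -2.4), (("WE-S1", "WE-S1"), -12.0)]
          print(stochastic_beam(candidates, beam_width=2))
          ```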
      • Context Training Trick (reduce error rate by factor of 2 when applied to language-like structure)
        • Difference in movement between Combination of signs in a Phrase vs just Isolated signs (i.e. in “I” vs “I NEED CAT”, in the Phrase the hand transitions skipping the last “I” position to a mid “NEED” position, and so on, so hands don’t have to move as much for each word)
          • The signs before and after a given sign affect how it appears when being signed in a Phrase
        • So, for Verbs that connect with other signs at the beginning and end, we’ll concatenate a combined 6x State model of “I NEED” and “NEED CAT” (in practice lots of examples of Phrases of a sign to Train on, rather than individual signs). Assume we have many examples Data of three sign phrase
        • Note: In Thad’s Paper on recognising Phrases of ASL, he had 40 signs in vocabulary, but the only Training Data used was 500 Phrases that each contained 5x signs.
        • Use the same steps as in Stage 1 and Stage 2 of HMM Training for “I” (shown previously) for the Phrase “I NEED CAT”. Do the same thing for the first step, but assume the Data is evenly divided between each sign, and then divide the Data of each State within each sign; then, as before, iterate and adjust the boundaries for each State and each sign until convergence. After everything has converged for each sign, go back and find every place where “I NEED” occurs (whilst we’ll assume there are enough examples, there will be fewer examples of “I NEED” than of “NEED”), cut out all the Data that we think belongs to “I NEED”, and use the Context Training Trick to Train a combined 6x State HMM Model on it. The “Output Probabilities” at the boundary b/w “I” and “NEED”, as well as the “Transition Probabilities” in that region, will then be tuned to better represent “I NEED” rather than the general case, which would also include Phrases like “WE NEED” (in speech, the effect of one phoneme on an adjacent phoneme is called Coarticulation). Repeat for “NEED CAT 1” and every other two-sign Combination. Iterate using Baum Welch with these larger two-sign Contexts for embedded Training until convergence again. If there is sufficient Data, ideally use larger three-sign Contexts (or more) when Phrases are complex, since the benefits are huge

        • Benefits: Context Training reduces error rate by 2 in recognition tasks where there’s a language structure
      • Statistical Grammar Trick (reduce error rate by factor of 4 when applied to language-like structure)
        • Dfn: In real life, language isn’t well segmented. If we record a large qty of language and record the proportion of times that “I” comes before “NEED” versus “I” coming before “WANT” (i.e. “I NEED” 6%, “I WANT” 10%, etc), we can use these probabilities to help bias our recognition based on the expected distribution of co-occurrence of the signs (see the bigram sketch below).

        Use these statistics instead of the Simple Grammar (i.e. order [PRONOUN, VERB, NOUN]), which places a strong limit on the start and end of the Viterbi Trellis

        • Benefits: Statistical Grammar reduces error rate by a factor of 4
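        • Sketch: a minimal Python sketch (not from the lesson) of biasing recognition with a bigram Statistical Grammar: each candidate phrase's HMM score gets the log-probability of its sign-to-sign transitions added; apart from the “I NEED” 6% and “I WANT” 10% figures, the numbers are made up
          ```
          # Bigram grammar scoring of candidate phrase decodings.
          import math

          bigram = {("I", "NEED"): 0.06, ("I", "WANT"): 0.10,
                    ("WE", "NEED"): 0.04, ("WE", "WANT"): 0.08,
                    ("NEED", "CAT"): 0.05, ("WANT", "CAT"): 0.03,
                    ("NEED", "TABLE"): 0.02, ("WANT", "TABLE"): 0.07}

          def phrase_score(signs, hmm_log_likelihood):
              # sum the log co-occurrence probability of each adjacent pair of signs
              grammar = sum(math.log(bigram.get(pair, 1e-6))
                            for pair in zip(signs, signs[1:]))
              return hmm_log_likelihood + grammar

          # Two candidate decodings with similar HMM scores; the grammar biases the choice.
          print(phrase_score(["I", "NEED", "CAT"], hmm_log_likelihood=-42.0))
          print(phrase_score(["I", "WANT", "TABLE"], hmm_log_likelihood=-42.5))
          ```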
      • State Tying Trick
        • Dfn: Combining Training for States, where the States within the HMM Model have close Means and Variances during Training, and then determine if performing “State Tying” looks logical given the motion we expect
        • Example:
            Given HMM Models and "Output Probability Distribution"
            for "I" and "WE"
          
            Note that when recognising Isolated Signs, the initial movement
            of right hand to chest is similar in both HMM Models.
            So instead of having a separate "I" State 1 (S1) and
            "WE" State 1 (S1), we'll define a "COMMON" Initial State
            and include it in the definition of both the "I" and "WE"
            HMM Models, so when we Train the HMM, we have twice as much data
            to Train the Initial State.
          
            Repeat for the Final State S3, since inspecting the
            "Output Probability Distribution" of the hand movement back down
            to rest looks the same for both "I" and "WE"
          
          
            Care must be taken since "State Tying" gets complicated when
            "Context Training" is used
          
      • Segmentally Boosted HMMs (SBH) Trick
        • About: Combines some advantages of discriminative models with generative models. Not yet in HTK Toolkit or Georgia Tech Gesture Toolkit https://wiki.cc.gatech.edu/ccg/projects/gt2k/gt2k
        • Example: Hand Appearance Models
          • Similarity Metric of how closely the current hand looked like different visual models of the hand, used as “features” for the HMM. The issue encountered by using many dimensions was that models that didn’t match well produced relatively random results and caused noise, such that many dimensions became meaningless
            • Boosting (to Weight the “feature” vector) used as mitigation measure and is called Segmentally Boosted HMMs
        • SBH Process:
          • Align and Train HMMs as normal
          • Use the Training to Align the Data that belongs to each State as best as possible
          • Iteratively examine each State of each Model
          • Boost by finding what “features” help us most in differentiating the Data for our chosen State versus the rest of the States
          • Weight (Up/Down) the Dimensions appropriately in that HMM
        • Benefit: 20%-70% better results than with normal HMMs (depending on the Data set used)
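        • Sketch: a rough stand-in for the boosting step only (the published SBHMM work uses AdaBoost; here a simple Fisher-style separability score illustrates the idea of up-weighting the dimensions that best separate one state's frames from everyone else's):

```python
import numpy as np

def feature_weights(state_frames, other_frames, eps=1e-8):
    """Per-dimension separability of one state's frames vs all other states' frames."""
    mu1, mu2 = state_frames.mean(axis=0), other_frames.mean(axis=0)
    var1, var2 = state_frames.var(axis=0), other_frames.var(axis=0)
    score = (mu1 - mu2) ** 2 / (var1 + var2 + eps)
    return score / score.sum()        # normalised weights, one per feature dimension

# state_frames / other_frames: (n_frames, n_features) arrays from the forced-alignment
# step above; the weights then scale (or select) dimensions before re-training the state.
```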
      • Using HMMs to Generate Data
        • Initially it was hoped that the same HMMs used to recognise speech could also be used to generate speech, but this was not a good idea: when plotting the sequence of generated data points in time, the "Output Distributions" do not understand the concept of "continuity" (this could be solved by using many States to force a better ordering of the Output, but that would lead to over-fitting)
          • Solution:
            • Model the State Transitions with a HMM, but using a different process whilst in each State, a process that is more aware of the Context required to generate the final Data with “continuity”, using techniques such as the following for better results:
              • Combining HMMs with Deep Belief Networks at Google
      • TODO Readings on HMMs
        • AIMA: Chapter 15.1-15.3
        • Rabiner’s famous Tutorial on hidden Markov models and selected applications in speech recognition
          • Slides http://www.cogsci.ucsd.edu/~ajyu/Teaching/Tutorials/hmm.pdf
          • PDF http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
          • errata http://alumni.media.mit.edu/~rahimi/rabiner/rabiner-errata/
        • Thad Starner’s MS thesis: Visual Recognition of American Sign Language Using Hidden Markov Models
          • link: https://dspace.mit.edu/handle/1721.1/29089
          • PDF https://dspace.mit.edu/bitstream/handle/1721.1/29089/32601581-MIT.pdf
        • The Hidden Markov Model Toolkit (HTK)
          • Download http://htk.eng.cam.ac.uk/
          • About http://svr-www.eng.cam.ac.uk/research/projects/HTK_HMM_Toolkit/
        • Chapter 1 The Fundamentals of HTK (pages 3-13) in The HTK Book (version 3.4)
          • PDF http://speech.ee.ntu.edu.tw/homework/DSP_HW2-1/htkbook.pdf
          • HTML http://www.ee.columbia.edu/ln/LabROSA/doc/HTKBook21/HTKBook.html
      • TODO HMM Resources
        • AIMA: Chapter 15.4-15.6 (provides another viewpoint on HMMs with a natural extension to Kalman filters, particle filtering, and Dynamic Bayes Nets), Chapter 20.3 (hidden variables, EM algorithm)

        • Yechiam Yemini’s “slides on HMMs used in genetics (gene sequencing, decoding).” http://www.cs.columbia.edu/4761/notes07/chapter4.1-HMM.pdf

        • Sebastian Thrun and Peter Norvig’s AI course:
          • HMMs and Kalman filters https://classroom.udacity.com/courses/cs271/lessons/48734405/concepts/last-viewed
          • Natural Language Processing https://classroom.udacity.com/courses/cs271/lessons/48641663/concepts/last-viewed
          • Natural Language Processing II https://classroom.udacity.com/courses/cs271/lessons/48734403/concepts/last-viewed
        • Further study Huang, Ariki, and Jack’s book “Hidden Markov Models for Speech Recognition.” http://www.amazon.com/Hidden-Recognition-Edinburgh-Information-Technology/dp/0748601627
      • TODO Resources for Segmentally Boosted HMMs
        • Gesture and Activity Recognition Toolkit (GART), which utilises HTK
          • https://wiki.cc.gatech.edu/ccg/projects/gt2k/gt2k
        • SBHMM project at Georgia Tech http://www.cc.gatech.edu/cpl/projects/sbhmm/

        • Further study
          • Pei Yin’s dissertation: Segmental discriminative analysis for American Sign Language recognition and verification https://smartech.gatech.edu/handle/1853/33939
      • TODO HMMs for Speech Synthesis
        • Junichi Yamagishi’s An Introduction to HMM-Based Speech Synthesis https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/TrajectoryModelling/HTS-Introduction.pdf

        • Heiga Zen’s Deep Learning in Speech Synthesis http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41539.pdf

      • TODO Project 4
        • file:///Users/Ls/Downloads/0dreuw07-interspeech.pdf
        • https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
          • http://www.speech.sri.com/projects/srilm/
        • https://discussions.udacity.com/t/slm-data-for-this-asl-dataset/230822/16
        • http://www.cs.cmu.edu/~bhiksha/courses/11-756.asr/spring2010/class.19apr/ngrams.pdf
      • TODO
        • Optional AI lectures http://artificialbrain.xyz/artificial-intelligence-complete-lectures-01-23/
    • Project 4
      • Discussions https://discussions.udacity.com/c/nd889-probabilistic-models/nd889-project-translate-sign-language

      • Part 4 Optional Notes:

        • Definitions
          • SLM - Statistical Language Models
          • Visual
        • References
          • Refer to folder: docs_part4
            • DONE READ Slides - Introduction to N-grams (Stanford Jurafsky) https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
            • Detailed Slides - Language Model Topics http://wwwhomes.uni-bielefeld.de/~gibbon/Handbooks/gibbon_handbook_1997/node199.html
            • Paper - Speech Recognition Techniques for Sign Language https://www-i6.informatik.rwth-aachen.de/publications/download/154/Dreuw–2007.pdf
            • DONE SKIM READ IT - http://www.cs.cmu.edu/~bhiksha/courses/11-756.asr/spring2010/class.19apr/ngrams.pdf
          • Other links
            • DONE SKIM READ IT - https://en.wikipedia.org/wiki/Language_model
            • http://masatohagiwara.net/training-an-n-gram-language-model-and-estimating-sentence-probability.html
          • ARPA Format links
            • http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html
            • https://msdn.microsoft.com/en-us/library/office/hh378460(v=office.14).aspx
          • Discussion Forums
            • https://discussions.udacity.com/t/slm-data-for-this-asl-dataset/230822/9
        • Tutorials
          • Data Analysis
            • Data Analysis - https://classroom.udacity.com/courses/ud170
          • Pandas
            • Pandas Cheat Sheet - https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf
            • Panda DataFrames - https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python#gs.AzoIXlk
            • Pandas quick guide - http://pandas.pydata.org/pandas-docs/stable/10min.html
        • Goal
          • Improve the WER of the "0-gram" (equivalent) recogniser by using the SLM data in the downloaded data set, i.e. its "1-gram", "2-gram", and/or "3-gram" statistics. Use the probability data already calculated there, and convert it into a pandas DataFrame if desired (a minimal ARPA reader sketch follows the Setup steps below).
        • Setup
          • Download SLM data for this ASL dataset:
            • ftp://wasserstoff.informatik.rwth-aachen.de/pub/rwth-boston-104/lm/
          • Extract all gzip *.gz
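          • Sketch: a minimal reader for the extracted files, assuming they are in the standard ARPA n-gram format referenced above (the file name and word pair below are hypothetical):

```python
def read_arpa(path):
    """Return {order: {(w1, ..., wn): log10 probability}} from an ARPA-format LM file."""
    ngrams, order = {}, None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line.startswith("\\") and "-grams:" in line:
                order = int(line[1:line.index("-")])     # e.g. "\2-grams:" -> 2
                ngrams[order] = {}
            elif order and line and not line.startswith("\\"):
                parts = line.split()
                logp = float(parts[0])
                words = tuple(parts[1:1 + order])         # ignore any trailing backoff weight
                ngrams[order][words] = logp
            elif line == "\\end\\":
                break
    return ngrams

# lm = read_arpa("ukn.3.lm")        # hypothetical file name from the extracted data
# lm[2].get(("JOHN", "WRITE"))      # log10 P(WRITE | JOHN), if that bigram is present
```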
        • Natural Language Processing (NLP) Summary of: https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf

          • Language Model (LM) (aka grammar)
            • Dfn: Model that computes the probability of a whole sequence of words (sentence), OR of an upcoming word given the preceding words
                P(W)                      = P(w1, w2, w3, ..., wn)
                P(wn | w1, w2, ..., wn-1) = probability of the upcoming word
              
            • N-gram Model Issues:
              • language has long-distance dependencies
              • Always in Log space (avoids underflow, and adding is faster than multiplying)
                  log(p1 * p2 * p3) = log p1 + log p2 + log p3
                
              • Overfitting, where N-Grams only work well for word prediction if the Test corpus looks like the Training corpus (but in real life it often doesn’t)
                • Overcome by Training the Models to generalise and account for where a word or sequence occurs in the Test Set but does not exist in the Training Set
                  i.e.
                
                  Training set        Test set
                  I am                I did
                  Dog am
                
                  P(did|I) = 0
                
                
                
                • Zero Probability Bigrams
                  • i.e. cannot divide by 0 to compute Perplexity
            • Smoothing
              • i.e. given broad statistics of how often each word w follows "I":

                  P(w|I)   (i.e. probability of a word w given the previous word is "I")
                      3   am
                      2   eat
                      1   have
                    -----
                      6   total

                Steal some "Probability Mass" from the seen words to "Generalise" better to unseen ones:

                  P(w|I)
                      2.5 am
                      1.5 eat
                      0.5 have
                      1.5 other
                    -----
                      6   total

            • Laplace Smoothing (Add-One Estimation)
              • Usage: Not used for N-Grams (better options out there), only used to “Smooth” other NLP methods (i.e. when not many zeros)
              • Dfn: Pretend we saw each word one more time than we actually did (i.e. add 1 to all counts) to overcome the Zero Probability issue. Modify the MLE Equation into the Add-One estimate (where V is the Vocabulary size, i.e. the number of distinct word types)
                  P(wi|wi-1) = ( count(wi-1,wi) + 1 ) / ( count(wi-1) + V )
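              • Sketch: a minimal Python version of the Add-One estimate above, assuming the bigram/unigram counts live in plain dictionaries (function and argument names are mine):

```python
def add_one_bigram(w_prev, w, bigram_counts, unigram_counts, V):
    """Laplace (Add-One) estimate of P(w | w_prev); V = vocabulary size (word types)."""
    return (bigram_counts.get((w_prev, w), 0) + 1) / (unigram_counts.get(w_prev, 0) + V)
```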
                
            • Example: (Perplexity)
                Compute "Joint Probability" of:
              
                    P(W) = P(its, water, is, so)
              
                Use "Chain Rule of Probability"
              
                    P(A,B,C,D)     = P(A)  * P(B|A)   * P(C|A,B) * P(D|A,B,C)
                    P(x1,x2,..,xn) = P(x1) * P(x2|x1) * P(x3|x1,x2) ... P(xn|x1,...,xn-1)
              
                P(W) = P(its) * P(water|its) * P(is|its,water) * P(so|its,water,is)
              
                Estimate Probabilities with N-gram Models:
                =================================
              
                    * "Unigram" Model (1-gram)
              
                        P(w1,w2,...,wn) ~= ∏i P(wi)
              
                    * "Bigram" Model (2-gram) (conditional on previous word)
              
                        P(wi|w1,w2,...,wi-1) ~= P(wi|wi-1)
              
              
                Maximum Likelihood Estimate (MLE)
                =================================
                    * "Bigram"
              
                        P(wi|wi-1) = count(wi-1,wi) / count(wi-1)
              
                        Dfn: MLE of some Parameter of a Model M from
                        Training Set T maximises the likelihood of
                        T given M
              
                        * NOTE: Modify the equation to use "Laplace Smoothing"
                        to overcome the "Zero Probability" issue
              
                        e.g. given word "bagel" occurs 10 times in 100 words
                        the probability that a random word from another
                        text will be "bagel" using the MLE estimate is:
                        10/100 = 0.1
              
                        e.g. given these sentences:
              
                            "now I am Sam ok"
                            "now Sam am I ok"
                            "now I am dog ok"
              
                        P(am|I) = 2/3 = 0.67
                        P(ok|Sam) = 0.5
                        P(Sam|now) = 1/3 = 0.33
              
              
                Raw "Bigram" Counts
                =====================
              
                    * "Bigram" counts given various sentences (3 OFF), draw table of count
                    of first (rows) and second (cols) words
              
                          now   I   am   Sam   dog  ok
                    now    0    2    0   1     0     0
                    I      0    0    2   0     0     1
                    am     0    1    0   1     1     0
                    Sam    0    0    1   0     0     1
                    dog    0    0    0   0     0     1
                    ok     0    0    0   0     0     0
              
                    * "Normalise" by "Unigrams" counts (i.e. individual word counts)
              
                    now  I   am   Sam  dog  ok
                    3    3   3    2    1    3
              
                    * Result (i.e. divide each value in the "Bigram" table by the respective
                    unigram count of the first word in the combo, from the "Normalised" counts):
              
                          now   I    am    Sam  dog   ok
                    now    0    2/3  0     1/3  0      0
                    I      0    0    2/3   0    0      1/3
                    am     0    1/3  0     1/3  1/3    0
                    Sam    0    0    1/2   0    0      1/2
                    dog    0    0    0     0    0      1
                    ok     0    0    0     0    0      0
              
                    * "Bigram" Estimate of Sentence Order probability
              
                    P(am Sam dog) = P(Sam|am) * P(dog|Sam)
                                  = 1/3       * 0
                                  = 0
              
                Evaluate Model goodness
              
                    * Check assigns higher probability to sentences
                        * real or frequently observed   VS   ungrammatical or rarely observed
                    * Train Model Params on a Training Set
                    * Test Model Performance on Test Set (unseen dataset)
                    and use Evaluation Metric to show performance
                        * Note: Test Set is unused, differs from Training Set
              
              
                Extrinsic Evaluation of N-Gram Models
              
                    i.e. comparing accuracy of Models A vs Model B
                    by comparing WER
              
              
                Intrinsic Evaluation of N-Gram Models:
              
                    i.e. Perplexity
              
                    Benefit:  shorter duration than Extrinsic
              
                Perplexity
              
                    Dfn: The best Language Model is one that best predicts
                    (gives the best performance on) an unseen Test Set during the
                    Evaluate Model goodness phase, i.e. assigns the
                    highest probability to the test sentences
              
                      Perplexity is Weighted Equivalent Branching Factor
              
                      i.e. Perplexity of recognising digits "0,1,2"
                      is equal to 3
              
                      Minimising Perplexity == Maximising Probability
              
                      Lower Perplexity == Better Model
                      (i.e. Perplexity decreases the higher the N in N-Gram)
              
                    PP(W) = P(w1,w2,...,wN) ^ (-1/N)
              
              
                    Dfn: Perplexity is inverse probability of test set,
                      normalised by qty of words.
              
                    so, represented for "Bigram" using Chain Rule, gives:
              
                    PP(W) = ( ∏ (i=1 to N) of 1 / P(wi|wi-1) ) ^ (1/N)    (i.e. the N-th root of the product)
              
              
                    Example: Perplexity
              
                      given sentence with random digits
                      where Model assigns probability of 1/10 to each digit:
              
                          PP(W) = P(w1,w2,...,wN) ^ -1/N
                                = ( (1/10)^N ) ^ -1/N
                                = (1/10) ^-1
                                = 10
              
                      given sentence with 2x random words
                      where Model assigns probabilities:
                        dog (probability 1 in 4)
                        cat (probability 1 in 3)
              
                          PP(W) = P(w1,w2,...,wN) ^ -1/N
                                = ( (1/4) * (1/3) ) ^ -1/2
                                = (1/12) ^ -1/2
                                = sqrt(12) ≈ 3.46
                          (see the Python sketch after this block for a worked bigram perplexity)
              
                Shannon Visualisation Method
                    * Dfn: Generate text and show the chosen 2-grams/3-grams in a
                    step-like layout: for a 2-gram model, choose a random bigram at
                    each step whose first element matches the second element of the
                    previous bigram, and repeat until the sentence-end token is reached
              
                Reconstituted Counts (pg 53/88) ???
                    i.e. compared with raw bi-gram counts
              
              
                Backoff
                    * Dfn:
                        * uses 3-gram only when good evidence
                        * use 2-gram / 1-gram when poor evidence
              
                Interpolation
                    * Dfn:
                        * Mixes 1-gram, 2-gram, 3-gram
              
                        * i.e. Extended Interpolated Kneser-Ney
              
                    * Types:
              
                        * Linear Interpolation
              
                            * Simple Interpolation
              
                                P(wn|wn-1,wn-2) = λ1 * P(wn|wn-1,wn-2)
                                                  + λ2 * P(wn|wn-1)
                                                  + λ3 * P(wn)
                              where ∑i λi = 1
              
                            * Lambdas Conditional on Context
              
                                P(wn|wn-2,wn-1) = λ1(wn-2,wn-1) * P(wn|wn-2,wn-1)
                                                  + λ2(wn-2,wn-1) * P(wn|wn-1)
                                                  + λ3(wn-2,wn-1) * P(wn)
              
                    * Set Lambda:
                        * Use "held-out" corpus:
                            * Training data
                            * Held-out data
                            * Test data
              
                        * Choose λs to maximise probabilities of "held-out" data
                        by fixing N-gram probabilities on training data
                        and then searching for λs giving largest prob. to "held-out" data
              
                        logP(w1,...,wn|M(λ1...λk)) = ∑i log P _M(λ1..λk) (wi|wi-1)
              
                        where _ means subscript
              
                Open vs Closed Vocabulary Tasks (unknown words possible vs all words known)
              
                    * Closed Vocab. Tasks
                        * Vocab. V is fixed
              
                    * Open Vocab. Tasks
                        * Out of Vocabulary = OOV words
              
                        * Create Unknown word token <UNK> then Train the
                        <UNK> word probabilities
                        (create fixed lexicon L of size V, and at text Normalisation
                        phase change to <UNK> any Training word not in L,
                        then Train probabilities of <UNK> like a normal word, and use
                        <UNK> probabilities when decoding any word not in Training)
              
                Large-scale N-Grams (i.e. Google N-gram)
                    * see Skipped pg 62-63 (too advanced) including
                    "Stupid Back-offs", "Pruning", etc
              
                Smoothing Algorithms
                  (use count of things seen to help estimate count
                  of things we've never seen)
              
                    * Good-Turing
                    * Kneser-Ney
                    * Witten-Bell
              
                ... skipped the rest
              
              
            • Google N-Gram (Google Machine Translation Team at Google Research)
              • https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
              • Shared large dataset of 1 trillion words and counts of all 1 billion five-word sequences that appear at least 40 times (where total of ~13 million unique words)
                i.e.
              
              
                3-gram Data in Corpus
              
                sentence      count
                ===================
                I am dog      4
              
                4-gram Data in Corpus
              
                sentence      count in db
                ===================
                I am a dog    92
              
            • Google Books N-Gram Viewer
              • https://books.google.com/ngrams/graph?content&year_start=1800&year_end=2000&corpus=0&smoothing=3&direct_url=
              • Raw data set http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
          • Automatic Sign Language Recognition (ASLR) Summary of: https://www-i6.informatik.rwth-aachen.de/publications/download/154/Dreuw–2007.pdf
            • ASLR difficulties are mainly with Computer Vision, whereas the corresponding Automatic Speech Recognition (ASR) issues are already largely solved
              • Try adopting “speech” recognition system’s large Vocabulary to recognise sentences of continuous sign language.
                • Obtain “features” from video cameras
                • Focus on:
                  • Feature and Model combo techniques used in ASR
                  • Language Models (LM) and pronunciation in sign language
              • Vision issues:
                • capturing, tracking, segmentation
            • Sign language
              • Basic components
                • Manual components
                  • Hand configuration
                  • Place of articulation
                  • Hand movement
                  • Hand orientation
                • Non-Manual Components
                  • Facial expression
                  • Body posture
            • Continuous Sign Language Recognition must deal with
              • Coarticulation effects (probability of sign depends on preceding and succeeding signs)
              • Inter- and Intra-personal variability
            • Goal:
              • Sign Language Recognition System that is person-independent and recognises sentences of continuous sign language using a vision-based approach (i.e. no data acquisition devices like gloves or motion capture), using a webcam
            • System Overview
              • Based on Bayes Decision Rule
              • Search for the “best” fit word sequence to recognise for the current observation against the Acoustic Model (trained word model inventory from ASR) and against the LM
              • Visual Modelling
                • Phonological Model
                  • Dfn:
                    • Signs are divided into "chireme" units
                    • Sign language words are split into sub-word "phoneme" units
                  • e.g.
                      Split Words

                          each word model is split into 1-3 pseudo-"phonemes"
                          (modelling the avg word length from training)

                          the lexicon ends up with 247 pseudo-"phonemes"
                          after dividing up 104 words

                      Model the Split Words

                          model each pseudo-"phoneme" with a 3-state
                          left-to-right HMM with 3x Gaussian mixtures
                          and a pooled Covariance Matrix

                  • Issues and Solutions
                    • Different dialects (many words with the same meaning)
                    • HMMs fix small differences between the appearance and length of utterances
                    • Different Models are required to fix different pronunciations of signs
                      (with different qty of States and GMMs)
                  • RWTH-Boston-104 Corpus Data (of sign language sentences) used (see Section 3)
                    • Perplexity PP of the language model is low since its
                      sentences have a simple structure
                    • Only 1 out-of-vocabulary (OOV) word is in the Test Corpus
                      (and it cannot be recognised correctly)
              • Language Models
                • Bayes Decision Rule
                  • In the plain Bayes Decision Rule, the Acoustic Model (AM) and Language Model (LM) have the same impact on the decision
                  • Increase the Weight/scale of LM (alpha scale) above AM (beta scale) where Language Model Factor = alpha/beta (see equation in section 2.2)
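                  • Sketch: a minimal form of the scaled decision rule above (my own function and variable names; the example scale values are purely illustrative):

```python
def decide(candidates, am_logp, lm_logp, alpha=1.0, beta=1.0):
    """Pick the word sequence w maximising log P(x|w) + (alpha/beta) * log P(w)."""
    lm_factor = alpha / beta      # increase alpha relative to beta to weight the LM more
    return max(candidates, key=lambda w: am_logp[w] + lm_factor * lm_logp[w])

# decide(["I NEED CAT", "WE NEED CAT"], am_logp, lm_logp, alpha=15.0, beta=1.0)
```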
Written on January 5, 2016