Artificial Intelligence Nanodegree Term 1
Table of Contents
Chapter 1 - Artificial Intelligence Nanodegree Term 1
Term 2 About
- https://medium.com/udacity/ai-nanodegree-program-syllabus-term-2-deep-learning-in-depth-d935197b66ec
- Orientation http://go.udacity.com/aind-term2-orientation
- Slack - #ai-deeplearning
- Stanford Deep Learning https://www.youtube.com/playlist?list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6
Background and Definition
- AI comprises design and building intelligent entities/agents and requires
- Intelligence
- Artifact (i.e. computer)
- AI methods concern:
- Human Thinking Automation (i.e. decision-making, problem solving, learning)
- Cognitive Science Modelling Approach
- Build precise and testable theory to express as
computer program algorithm AI models
with matching input/output behaviour including
trace of reasoning steps when solving problems
comparable to human performance (similar to General Problem Solver of
1961), using:
- Introspection (i.e. of human thoughts)
- Psychological Experiments (i.e. observe human action)
- Brain Imaging (i.e. observe brain action, incorporating neurophysiological evidence into computer models)
- Mental Computation Models (i.e. perception, reasoning, action)
- Behaviour
- AI measures success against:
- Human Intelligence Performance Comparison (i.e. Turing Test
to see if can decipher whether response by human or computer
using Computer Vision and Robotics for object perception/manipulation)
- Computer Capabilities
- Natural Language Processing (i.e. communicates in English)
- Knowledge Representation (i.e. store current knowledge, store received/heard info)
- Automated Reasoning (i.e. reason and respond to requests using stored data)
- Machine Learning (i.e. adapt to new situations and detect/deal with patterns)
- Rationality (i.e. ideal performance with given data and assumptions)
- THOUGHT - Logic Notation (originating with Aristotle): reasoning with statements about any objects and the relationships between them, with guidance on which reasoning steps to try first to minimise computational resources; Intelligent Computational Reasoning Systems can solve such problems in “principle” (which differs from solving them in “practice”)
- ACTION (by rational agent) -
- Autonomous operation
- Perception of environment
- Adapt to change
- Uptime maintained
- Goal pursuit
- Deliver best outcome/decision
- Rationality using inference given uncertainty (using Knowledge Representation)
- AI approach to:
- Simple Problems
- Hypothesis that perfect rationality is required for analysis
- Complex Problems
- Recognise that perfect rationality is unachievable due to its high computational demands
- Definitions (of Mind and Matter)
- Rationalism - Reasoning to understand the world
- Dualism - Human mind has dual qualities governed by both soul/spiritual laws and physical laws
- Materialism - Human mind is constituted from brain operation according to laws of physics
- Empiricism
- Flow of Knowledge
- Source of Knowledge
- Physical mind uses Senses to manipulate Knowledge with reasoning to achieve Understanding
- Logical Positivism (Doctrine) - Combines Rationalism and Empiricism whereby Knowledge arises from interconnected logical theories connected to Observation Sentences corresponding to Sensory Inputs
- Free will - Perception of choices to a choosing entity
- Induction Principle - Senses are reliant on rules (Confirmation Theory) when acquiring Knowledge from their Experience (exposure to repeated associations between elements)
- Actions (occur after Reasoning) - Actions justified through logical connection between Goals, and Knowledge of an Action’s outcome (assuming the possible end outcome based on Experience)
- Rational Decision Theory
- Issues
- Decision-making Rules when several Actions achieve same Goal
- Decision-making Rules when no Action achieves Goal
- Decision-making Rules when uncertainty
- AI Science
- Maths (areas contributing to AI)
- Logic
- Algorithms using mathematical reasoning for logical deduction. There is a procedure to prove any true statement in first-order logic of objects and relations, but first-order logic does not capture the principle of mathematical induction required to characterise the natural numbers. The Incompleteness Theorem proved that in arithmetic (elementary natural numbers) some true statements are undecidable (they have no proof within the theory), whereby some functions on integers cannot be represented by an algorithm.
- Turing Machine was developed and showed some functions could not be Computed
- Theory of Reference relates logic objects to real-world objects
- Computation
- Tractability - (easy) problems where time required to solve instance of
problem grows with Polynomial growth.
Tractable sub-problems should be approached when generating
intelligent behaviour
- Polynomial growth
- Intractable (hard) problems - time required to solve instance of problem grows Exponentially with size of instances
- Exponential growth in complexity means even moderately large problem instances cannot be solved in reasonable timeframe
- NP-completeness Theory showed there are many NP-complete problems (combination of search and reasoning), such that if a given problem class may be reduced to one of these NP-complete problems, it is likely Intractable
- Probability
- Quantitative Science
- Bayes’ Rule used for uncertain measurements/reasoning of AI systems. Statistical methods used to overcome incomplete theories
- Science
- Economics
- Definitions:
- Economies comprise individual agents making choices that lead to preferred outcomes that maximise economic well-being
- Sellers may assert they prefer money whilst they hope Buyers prefer their goods
- Large Economies
- Decision Theory
(framework for decisions under uncertain environment)
- Probability Theory
- Utility Theory
- Usage: Suitable for large economies where individual agents should not be concerned about Actions undertaken jointly by multiple other agents
- Small Economies
- Game Theory
- Usage: Suitable for small economies where individual agents may be significantly affected by Actions of other individual agents
- Note: Discovered that rational agents should appear to adopt random policies (ambiguous prescription for Actions)
- Other
- Satisficing
- Definition: “Good enough” decisions rather than optimally calculated decisions more accurately describe human behaviour
- Optimisation of Sequence Operations
- Definition: Rational agent decisions when multiple Actions in sequence result in payoff
- Neuroscience
- Brain
- Thought, Action, Consciousness enables Minds -
based on collection of neurons
- Mysticism Theory - Alternative theory whereby mind operates in a mystical realm beyond physical science
- Cognitive Functions Segregated
- Speech production (left hemisphere)
- Neuronal Structure
- Electrochemical signals sent via neuron junctions to control brain activity and learning
- Information processing in outer layer of brain using columns of tissue 0.5mm dia / 4.0mm depth containing 20,000 neurons
- Unknown is how damaged functions adopted by others
- Unknown is how individual memory stored
- Note: Single-cell neuron activity artificially stimulated (electrically, chemically, optically) for recording of neuron I/O mapping relating to cognitive processes
- Sensory Signals
- Publish/Subscribe mapping between brain areas and bodily parts (control output or sensory input received)
- Note: Multiple maps apparent in some creatures that change every few weeks
- Note: Brain activity images using EEG and MRI
- Research Applications
- Math Models applied to nerve cells (neurons) to study nervous system
- Psychology
- Behaviourism
- Achievements: Discoveries in some animals but less with humans
- Human Behaviour Experiments
- Subjective methodology
- Introspective data is subjective (unlikely that an experimenter will disagree with their own theory)
- Note: Objective methodology applied to humans
- Animal Behaviour Experiments
- Objective measures of stimulus given and resulting actions/response
- Lack of introspective data
- Cognitive Psychology
- Definition: Brain viewed as a CPU
- Knowledge-based Agent Info Processing Steps
- Translate stimulus to internal representation
- Manipulate internal representation using Cognitive Processes to achieve new internal representation
- Translate back into Action
- Computer Models should be like Cognitive Theory and
address psychology of the following by describing information
processing mechanism for cognitive function implementation:
- Memory
- Language
- Logical Thought
- Computer Engineering
- Future power increases from mass parallelism
- AI work has pioneered ideas used in computer science (i.e. object-oriented programming, etc)
- Control Theory and Cybernetics
- Control Theory Tools
- Calculus
- Matrix Algebra
- Control Theory Systems
- Systems described by “fixed” sets of continuous variables
- Artifact (computer) Systems
- Self-regulating feedback control system machines (modify behaviour in response to env changes). i.e. submarine, thermostat
- AI
- Escapes limitations (NOT “fixed”)
- Logical Inference and Computation Tools solve problems in:
- Language
- Vision
- Planning
- Purposive Behaviour
- Regulatory mechanism that minimises error toward goal state
- Intelligence AI Source
- Homeostatic devices with feedback loops creating stable adaptive behaviour
- Stochastic Optimal Control
- Goal of system designs that behave optimally (maximising objective function over time)
- Linguistics
- Behaviourist Approach - to Language Learning
- Ignores Creativity in language (i.e. child makes sentences)
- Syntactic Structures & Chomsky’s Theory
- Natural Language Processing (aka Computational Linguistics)
- Understand sentence structures, subject matter, and context to understand a language and use Knowledge Representation to convert into form a computer may reason with
History of AI
up to page 18
Python Links
- Python
- zip, map, lambda: https://bradmontgomery.net/blog/pythons-zip-map-and-lambda/
- Iterate Dictionary/Hash:
for k, v in values.items():
- Iterate Array:
for i, val in enumerate(a):
    print(val)
http://www.diveintopython.net/file_handling/for_loops.html
- Iterate over String:
[i for i in k]
- Append to Array:
elim_box.append((k, v))
- Replace values in string:
removed = original.replace("M", "")
http://www.petercollingridge.co.uk/book/export/html/362
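- A small runnable sketch pulling the idioms above together (the values and names are illustrative only):
```python
# zip, map, lambda: pair two lists and add element-wise
a, b = [1, 2, 3], [10, 20, 30]
sums = list(map(lambda pair: pair[0] + pair[1], zip(a, b)))  # [11, 22, 33]

# Iterate Dictionary/Hash
values = {'A1': '123', 'B5': '47'}
for k, v in values.items():
    print(k, v)

# Iterate Array with its index
for i, val in enumerate(a):
    print(i, val)

# Iterate over String
chars = [c for c in 'ABC']  # ['A', 'B', 'C']

# Append to Array
elim_box = []
elim_box.append(('A1', '123'))

# Replace values in string
removed = 'MAMBO'.replace('M', '')  # 'ABO'
```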
Extra Curricular
Code links
- AIMA textbook example algorithms https://github.com/aimacode/aima-python
- Lecture 6 (right before Project 2) only requires AIMA sections 5.1-5.4, which specifically reference DFS and iterative deepening from earlier chapters
- Github projects
https://github.com/udacity?utf8=%E2%9C%93&q=aind https://github.com/udacity/artificial-intelligence https://github.com/danainschool/aima-python https://github.com/udacity/AIND-Isolation https://github.com/udacity/AIND-Planning https://github.com/ltfschoen/AIND-Sudoku https://github.com/ltfschoen/hmmlearn
Python requirements
- itertools, list comprehensions, tuples, function with multiple return values, default values as function params, functions as first class citizen (pass functions like variables), creating classes
- Anaconda
- Python 3
- Courses preparation https://www.udacity.com/course/design-of-computer-programs--cs212
OpenAI
https://github.com/openai https://openai.com/blog/
Research Papers in AI
https://github.com/songrotek/Deep-Learning-Papers-Reading-Roadmap https://github.com/terryum/awesome-deep-learning-papers?files=1 http://www.arxiv-sanity.com/ https://research.google.com/pubs/papers.html
- Summaries already done https://docs.google.com/spreadsheets/d/1xej5Nca2xUUtrZ1GCyPjFMqI9ZgNq_OhgnTxOOMQ2js/edit#gid=404493967
- DL related (Jack Clark at OpenAI to crowdsource via dhruvp): https://docs.google.com/spreadsheets/d/1MT8LRuEn3xdJ7j6_XQjRZwV_fRmrpjaWwZKGk2Dqm4k/edit#gid=0
Competitions in AI
- General AI Challenge
https://www.general-ai-challenge.org/active-rounds https://discourse.general-ai-challenge.org/
Convert Markdown to PDF
- https://gitprint.com/
Reading - Python
- Book “Data structures and algorithms in Python”
- Recursion chapter, Tree/graph algorithms
- Official Python Docs - itertools and other Python packages
- AI Overview
- DARPA https://youtu.be/-O01G3tSYpU
- AIMA book
- Chapters http://aima.cs.berkeley.edu/2nd-ed/newchap11.pdf
- Code examples for each of the algorithms https://github.com/aimacode
-
Python e-books: http://www.diveintopython3.net https://learnpythonthehardway.org/book/ http://www.deeplearningbook.org/
- MIT AI course (similar topics to AIND) https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-034-artificial-intelligence-fall-2010/lecture-videos/
Week 0
- Kick-off event https://youtu.be/z7V7Ri6P_dA
- About https://www.udacity.com/course/artificial-intelligence-nanodegree--nd889
Week 1
- Guides
- Sebastian Thrun - Udacity co-founder
- Key takeaways:
- Reinforces that AI is surpassing human efficiency for repetitive tasks
- Highlights self-driving car applications, benefits (i.e. safety), and modern suppliers
- Identifies need for multi-tasking AI platform
- Suggests students build innovative experiments and forward them to Sebastian (i.e. using Arduino)
- Q&A (Starring Luke Schoen)
- https://youtu.be/AijSw9Q3oas
- Initiative to rewrite or summarise research papers so they are more easily understood by laypeople, using visuals and animations in the way Udacity does
- TODO - Central repository that unifies papers into a single site
- Interactive tool where can play with parameters with different outputs
- i.e. create site with all animations associated with Artificial Intelligence (Udacity would like to work with people involved)
- Cyc http://www.cyc.com/
- Peter Norvig - Google - AI book author
- Thad Starner - Georgia Tech - Contextual Computing & Google X researcher
- Arpan Chakraborty - Georgia Tech - Computer Vision - AI research
- David Joyner - Georgia Tech - AI lecturer - applies AI to tools
- Dhruv Parthasarathy - AI Program director
- Lisbeth Ortega - Community Manager
- Kathleen Mullaney - Career Services
- Rachel Higgens - Career Services
- Trinh Nguyen - Career Services - Program Coordinator
- Kirsten Bailer - Career Services - Employer Relations
- Luis Serrano - Instructor/Course Developer
- Dean Webb - Mentor Online
- Shelly - provides Quizzes during course
- Help: ai-support@udacity.com OR Chris LaPollo (outside enrolment)
-
Udacity Code Style Guide - https://udacity.github.io/git-styleguide/
- Project 1 - Sudoku AI-agent
- Build and code an AI-agent that solves any Sudoku problem
- Learn 2x AI techniques
- Constraint Propagation
- Search
- Advanced Sudoku strategy to improve agent speed
- Based on blogpost http://norvig.com/sudoku.html
- Project 2 - Game playing AI-agent
- AI-agent plays games and defeats opponents in the game Isolation
- Learn 2x AI Strategies (used by Google’s Go agent AlphaGo)
- Minimax
- Alpha-beta pruning
- AI-agent guides Pac-Man through a maze quickly and eats food efficiently
- Learn 2x AI strategies
- Breadth-first search
- Depth-first search
- A* search
- Project 3 - Planning AI-agent
- AI-agent plans an optimal route to transport cargo from A to B (used in logistics)
- Project 4 - Natural Language Processing ??? AI-agent
- AI-agent automatically reads sign language characters and converts them to English
- Learn 1x AI strategy
- Hidden Markov Models
- Deadlines
- Deadline means submitted and Reviewed
- Code Reviews - Unlimited
- Mentor - Leadership, Guidance, SME
- Weekly check-in to track progress.
- Help set learning goals.
- Guide to supplementary resources when get stuck
- Respond to any questions about the program
- Office Hours https://calendar.google.com/calendar/embed?src=knowlabs.com_h7k8bibekume01f054v7ns8v1o%40group.calendar.google.com
- instructors, content creators and other Udacity staff to answer questions about the content and projects
- Forums - Tech Ques https://discussions.udacity.com/
- AIND Term 1 Categories - https://discussions.udacity.com/categories
- Projects category – post anything specific to projects in respective sub-categories there
- Announcements specific to cohort including reminders and additional resources
- Cafe - interact with your peers
- Community - Online/Offline Study Groups arranged by Udacity
- Career Services - http://career-resource-center.udacity.com
- Udacity partners
- Google, IBM Watson, Accenture, AT&T, Priceline.com, Nvidia, Mercedes-Benz, Amazon
- Career Path - Location, etc
- Career Development in Classroom
- Resume updates
- Udacity Reviews for free
- Other Students -
- Sample Search https://www.google.com.au/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=code+solution+naked+twins+sudoku+aind+github
- Solution to Sudoku Naked Twins - https://github.com/vxy10/AIND-Sudoku/blob/master/solution.py
- https://github.com/Kiyo921/AI-sudoku
- Project 2
- https://github.com/philferriere/aind-projects
- https://github.com/bdiesel?tab=repositories
- https://github.com/ziyanfeng/AIND-Isolation
- https://github.com/davidventuri/udacity-aind
- https://github.com/mohankarthik?tab=repositories
- https://github.com/mohankarthik/AIND-P2-Isolation
- https://github.com/mohankarthik/AIND-L1-Pacman
- https://github.com/mlgill?tab=repositories
- https://github.com/mlgill/AIND-Isolation
- https://github.com/mlgill/AIND-Planning
- https://github.com/mlgill/AIND-Simulated_Annealing
- https://github.com/morganel?tab=repositories
- https://github.com/morganel/AIND-Isolation-1
- https://github.com/morganel/AIND-Planning
- https://github.com/madhavajay/nd889
- https://github.com/frogermcs
- arunkv
- https://github.com/arunkv/AIND-Planning/commits/master
- sumitbinnani
- Recogniser project
- DONE + Part 4 https://github.com/frogermcs
- DONE + Part 4 https://github.com/climberwb/AIND-SignLangRecognizer
- DONE https://github.com/IlanStefanon/AIND-Recognizer
- DONE https://github.com/binojohnthomas/AIND-Recognizer
- DONE https://github.com/iamsteveng/AIND-Recognizer
- DONE + Part 4 https://github.com/ArthurLu/AIND-Recognizer
- HALF-DONE https://github.com/curiousily/AIND-Recognizer
- HALF-DONE https://github.com/tienthanh8490/AIND-Recognizer
- ??? https://github.com/RenSilvaAU/AIND-Recognizer
- ??? https://github.com/Blindshooter/AIND-Recognizer
- No progress https://github.com/zouyu9631/AIND-Recognizer
- No progress https://github.com/edhzsz/AIND-Recognizer
- No progress https://github.com/AgremE/AIND-Recognizer
- ??? https://github.com/WestonLu/AIND-Recognizer
- Chat - Slack https://artificial-intelligence.udacity.com/
- ai-nd.slack.com
- LinkedIn Links https://docs.google.com/spreadsheets/d/1kdiF8weRpmaF91xqmeZ27tqf7IPgWqhpYxeynDb–SA/edit#gid=0
- Intro @channel Hello everyone. My name is Mike Salem and I am the Nanodegree Service Lead for AI. I want to welcome each and every one of you to Artificial Intelligence!
For introductions:
- Your name (first name is OK)
- Where are you from / located
- What’s your background / have you taken any AI courses in the past?
- Why did you sign up for the AIND? What would you like to learn?
- (Optional) Do you have any cool AI related things you would like to show off? be creative. It could be a github account, a youtube videos, etc.
We want your experience at Udacity to be the best possible. So please chat with other people in the channel, make friends, and most importantly learn and have fun!
- Bug/Issue/Problem Reporting https://waffle.io/udacity/aind-issue-reports
- https://github.com/udacity/aind-issue-reports/issues
- IntelliJ Python Debugger Setup
Find Miniconda Python: $ which python => /Users/Ls/miniconda3/bin/python
File > Project Structure > Platform Settings > SDKs > + > Python SDK > /Users/Ls/miniconda3/bin/python
File > Project Settings > Project > Project SDK > Python 3.6.0 (~/miniconda3/bin/python…)
Run > Debug > + > Python
- Name: Sudoku
- Script: /Users/Ls/code/aind/term01/lesson01/function.py
-
Use Specified Interpreter: Python 3.6.0 (~/miniconda3/bin/python…)
- Lesson 1 - Anaconda
- About - Package, environment, and version management (a distribution specifically built for data science)
- About Conda
- Dfn (Conda) - Package manager (install library/software) and Virtual Environment manager (similar to virtualenv and pyenv)
- Note: pip is the default Python package manager (general packages)
- Note: conda is a package manager with a focus on data science packages, and is not specific to just Python packages
- Dfn (Anaconda) - Large software distribution that comes with Conda, python, and 150 scientific packages (data science) and dependencies
- i.e. comes with: Numpy, Scipy and Scikit-learn compiled with the MKL library, speeding up various math operations
- Dfn (Miniconda) - Small software distribution that comes with Conda and Python (install required packages manually)
- Usage of Conda
- Conda Documentation https://conda.io/docs/using/index.html
- Cheat Sheet https://conda.io/docs/_downloads/conda-cheatsheet.pdf
- Cheat Sheet Advanced http://know.continuum.io/rs/387-XNW-688/images/conda-cheatsheet.pdf?mkt_tok=eyJpIjoiWm1Sak56UmpZMlZsTW1aaSIsInQiOiJOc3Rud2NraTI4ODlzVmJBSnllZ0JJWk1ZSlhLbE5QOEhZUWRMMDY2NWUrNGZxalRnSXNcL2NOM3h2UVFJN0ZJV0YybGpGOFNrdEJWa0Rnd1hlRnFzXC9wdUJcL1c1bkxsWVZIT1JzXC9vRGwwSmNvREpGXC9UYk9LV01UcjV6b1Z3TlN2In0%3D
- Another https://leifengtechblog.files.wordpress.com/2016/01/conda-cheatsheet.pdf
- Steps
- Create new environment for each project (separate package versions so no version conflicts)
conda create -n tea_facts python=3
- Delete env
conda remove -n tea_facts --all
(or conda env remove -n tea_facts)
- Activate environment
source activate tea_facts
- Deactivate environment
source deactivate tea_facts
- Show default packages installed
conda list
- Show non-Python packages
conda search --canonical | grep -v 'py\d\d'
- Update environment with new packages (i.e. to work with data, visualisations, code development)
conda install numpy pandas matplotlib
conda install jupyter notebook
- Update a package
conda update ____
- Update all packages
conda update --all
- Find a package
conda search ___
- Export list of packages in environment to a file for easy loading of dependencies by others (similar to pip freeze > requirements.txt):
conda env export > environment.yaml
- Other
- Help
conda install --help
- Installation of Conda
- Anaconda
- Download https://www.continuum.io/downloads
- Cheat Sheet of Installation https://docs.continuum.io/_downloads/Anaconda_CheatSheet.pdf
- Miniconda https://conda.io/miniconda.html
- https://github.com/conda/conda
- Download v3.5
- Run with
bash Miniconda3-latest-MacOSX-x86_64.sh
-
Check installation location
which conda
=> /Users/Ls/miniconda3/bin/conda
- List installed packages
conda list
```
# packages in environment at /Users/Ls/miniconda3:
cffi          1.9.1    py36_0
conda         4.3.11   py36_0
conda-env     2.6.0    0
cryptography  1.7.1    py36_0
idna          2.2      py36_0
openssl       1.0.2k   0
pip           9.0.1    py36_1
pyasn1        0.1.9    py36_0
pycosat       0.6.1    py36_1
pycparser     2.17     py36_0
pyopenssl     16.2.0   py36_0
python        3.6.0    0
readline      6.2      2
requests      2.12.4   py36_0
ruamel_yaml   0.11.14  py36_1
setuptools    27.2.0   py36_0
six           1.10.0   py36_0
sqlite        3.13.0   0
tk            8.5.18   0
wheel         0.29.0   py36_0
xz            5.2.2    1
yaml          0.1.6    0
zlib          1.2.8    3
```
- List envs
conda info --envs
(orconda env list
)# conda environments: # root * /Users/Ls/miniconda3
- Clone the root env
conda create --name rootclone --clone root
-
Create env with specific Python version
conda create --name aind-3.6.0 python=3.6.0
-
Checkout the aind env
source activate aind
- Save env (incl packages and Python version for sharing).
Also include a pip requirements.txt file using
pip freeze
for people not using conda
conda env export > environment.yaml
pip freeze > requirements.txt
- Load env from YAML env file
conda env create -f environment.yaml
- Installation of Spyder IDE
-
https://github.com/spyder-ide/spyder
- Create subenv with Spyder IDE, update to latest versions
source activate root
(root) $ python --version
==> 3.6.0(root) $ conda upgrade conda
(root) $ conda upgrade --all
(root) $ conda create -n spyder python=3.6.0
(root) $ source activate spyder
- Install Spyder IDE in subenv
(spyder) $ conda install qt pyqt
(spyder) $ conda install spyder
(spyder) $ conda upgrade --all
- Run Spyder IDE
(spyder) $ spyder
-
- Python 2 vs Python 3
- Use Python 3's print function rather than Python 2's print statement (surround the arguments with parentheses: print("a", "b"), not print "a" "b")
- Use the new print function in old Python 2 code by importing
from __future__ import print_function
- Env variables
- https://conda.io/docs/using/envs.html
- Pre-load with AI Nanodegree Python environment
- Download aind-environment-unix.yml
- Load
conda env create -f /Users/Ls/Downloads/aind-environment-unix.yml
- Switch to AIND env
source activate aind
- Install PyGame for visualisations of AI programs for portfolio sharing
brew install sdl sdl_image sdl_mixer sdl_ttf portmidi mercurial
pip install pygame
- Get PyGame library directory
>>> import site;print(site.getsitepackages()) ['~/miniconda3/envs/aind/lib/python3.6/site-packages']
- Within IntelliJ, go to Project Structure (CMD+;)> Platform Settings > SDKs, and Add
~/miniconda3/bin/python
as a Python 3.6.0 SDK home path, then add to this an additional Classpath of~/miniconda3/envs/aind/lib/python3.6/site-packages
- Within IntelliJ, go to Project Structure > Platform Settings > Modules
- Project 1 - Sudoku
- About Sudoku http://www.conceptispuzzles.com/?uri=puzzle/sudoku/rules
- Given: 9x9 grid partially completed
- Objective: Fill grid with missing digits so each row, each col, and each of 9 principal 3x3 sub-squares contains all digits 1-9
- Extras:
- Strategies other than Naked Twins like: (Standard + Diagonal Sudoku)
- Goal
- Build an AI-agent that solves all Sudoku problems and provides intro to techniques: Constraint Propagation, and Search
- Note: AI
- Composed of combining many simple ideas to solve complex problem
- Dfn: Constraint Propagation
- e.g: Extract max info from applying simple Local Constraints (i.e. to each square) to iteratively narrow search space of possible solutions
- Risk: Never solve every puzzle (incl. hard ones)
- Example: Elimination of possible values for each peer within each square here http://norvig.com/sudoku.html states that even using Naked Twins strategy we never know if we can solve every puzzle (even hard ones)
- Dfn: Search
- e.g. Given two or more possibilities available, branch out and create whole tree of possibilities, then traverse tree and consider all until find solution
- Risk: May take forever to run
- Example:
- Option 1: Brute Force Approach: Single machine instruction http://norvig.com/sudoku.html (i.e. try all possibilities, takes a long time)
- Option 2: Constraint Propagation (Constraint satisfaction problem): Multiple machine instructions each processing
multiple possibilities (i.e. eliminate possibilities on each try, so not need try all possibilities)
- https://en.wikipedia.org/wiki/Constraint_satisfaction
- https://en.wikipedia.org/wiki/Search_algorithm
- Depth-First Search Algorithm:
- Dfn: Depth First Search (DFS) algorithm traverses a graph in a depthward motion and uses a stack to remember to get the next vertex to start a search, when a dead end occurs in any iteration
- Check not already found solution or contradiction
- Select unfilled square and try possible values until find successful value, and for each possible value check if with that possible value we can successfully search for a solution.
- Note: This is a recursive search since we recursively consider all possible outcomes for a value before trying a different value
- Note: Create new copy of values for each recursive call to search so each branch of the search tree is independent and does not confuse other branches (hence may be more efficient to implement set of possible values for a Sudoku square as a string, rather than a Set or a List)
- Search Implementation
- Variable Ordering - i.e. choice of which square to try first
- Minimum Remaining Values Heuristic - i.e. choose a square with min. qty of possible values
- i.e. if choose square with 7 possibilities we expect to guess wrong with probability of 6/7, whereas if choose square with 2 possibilities we expect only 1/2 (50%) probability
- Value Ordering - i.e. choice of which digit to try for first Sudoku square
- Randomize Ordering Heuristic - Randomising the value ordering may avoid deadly combinations of value choices that could take 10,000 times longer to find a contradiction
- Least-Constraining Value Heuristic - Chooses the first value that imposes the least constraints on peers
- Problem:
```
. . 3 |. 2 . |6 . .
9 . . |3 . 5 |. . 1
. . 1 |8 . 6 |4 . .
------+------+------
. . 8 |1 . 2 |9 . .
7 . . |. . . |. . 8
. . 6 |7 . 8 |2 . .
------+------+------
. . 2 |6 . 9 |5 . .
8 . . |2 . 3 |. . 9
. . 5 |. 1 . |3 . .
```
- Solution: ‘483921657, 967345821, 251876493, 548132976, 729564138, 136798245, 372689514, 814253769, 695417382’
```
    1 2 3  4 5 6  7 8 9
A   4 8 3 |9 2 1 |6 5 7
B   9 6 7 |3 4 5 |8 2 1
C   2 5 1 |8 7 6 |4 9 3
    ------+------+------
D   5 4 8 |1 3 2 |9 7 6
E   7 2 9 |5 6 4 |1 3 8
F   1 3 6 |7 9 8 |2 4 5
    ------+------+------
G   3 7 2 |6 8 9 |5 1 4
H   8 1 4 |2 5 3 |7 6 9
I   6 9 5 |4 1 7 |3 8 2
```
- Steps:
- If box has a value, then all boxes in same row, same column, or same 3x3 square cannot have same value
- If only one allowed value for given box in row, column, or 3x3 square, then the box is assigned that value
- Naming Conventions
- Rows and Columns
- Rows labelled A, B, C, D, E, F, G, H, I.
- Columns labelled 1, 2, 3, 4, 5, 6, 7, 8, 9
- 3x3 squares will not be labelled
- Boxes, Units and Peers (names for the important elements created by rows and columns)
  - Boxes - Individual squares at the intersection of rows and columns (i.e. ‘A1’, ‘A2’, …, ‘I9’)
  - Units - Complete rows, columns, and 3x3 squares. Each unit is a set of 9 boxes, and there are 27 units in total.
  - Peers - All other boxes associated with a particular box (such as ‘A1’) that belong to the same unit (i.e. the same row, column, or 3x3 square). E.g. peers of ‘A1’ are - row: A2, A3, A4, A5, A6, A7, A8, A9; column: B1, C1, D1, E1, F1, G1, H1, I1; 3x3 square: B2, B3, C2, C3 (since A1, A2, A3, B1, C1 are already counted).
  - Note: Each box has 20 peers
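- A minimal sketch of how these elements might be built with list comprehensions, assuming the row/column labels above (the `cross` helper is illustrative, not the project starter code):
```python
def cross(a, b):
    """All concatenations of a letter in a with a digit in b."""
    return [s + t for s in a for t in b]

rows, cols = 'ABCDEFGHI', '123456789'
boxes = cross(rows, cols)                                   # 81 boxes: 'A1' ... 'I9'
row_units = [cross(r, cols) for r in rows]
col_units = [cross(rows, c) for c in cols]
square_units = [cross(rs, cs)
                for rs in ('ABC', 'DEF', 'GHI')
                for cs in ('123', '456', '789')]
unitlist = row_units + col_units + square_units             # 27 units of 9 boxes each
units = {box: [u for u in unitlist if box in u] for box in boxes}
peers = {box: set(sum(units[box], [])) - {box} for box in boxes}

assert len(unitlist) == 27 and all(len(peers[b]) == 20 for b in boxes)
```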
- Rows and Columns
- Elimination Strategy:
- Eliminate possible values for a box by analysing peers
- If a box has a value assigned, then none of its peers may also have its value
- Update grid with possible values for each box
- Use to compute grid_values() so each box value represents all the available values for that box, i.e. box ‘B5’ would have value ‘47’ (because 4 and 7 are its only possible values). The starting value returned for every empty box will be ‘123456789’ instead of ‘.’ (see the sketch below)
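- A sketch of grid_values() and the Elimination step under these rules (it reuses the boxes/peers structures from the sketch above; names are illustrative, not the project starter code):
```python
def grid_values(grid):
    """Map each box to its value; empty boxes ('.') get '123456789'."""
    assert len(grid) == 81
    return {box: ('123456789' if ch == '.' else ch)
            for box, ch in zip(boxes, grid)}

def eliminate(values):
    """Remove each solved box's value from the possibilities of its peers."""
    for box in [b for b in values if len(values[b]) == 1]:
        for peer in peers[box]:
            values[peer] = values[peer].replace(values[box], '')
    return values
```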
- Only Choice Technique:
- If one of the possible values of an unsolved box in a unit is not among the remaining values of the unit's other unsolved boxes, then it is the Only Choice for that box (see the sketch below)
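- A matching sketch of the Only Choice step (reusing the `unitlist` structure from the earlier sketch):
```python
def only_choice(values):
    """If a digit fits in only one box of a unit, assign it to that box."""
    for unit in unitlist:
        for digit in '123456789':
            places = [box for box in unit if digit in values[box]]
            if len(places) == 1:
                values[places[0]] = digit
    return values
```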
- Search Strategy:
- Example Usage:
- Game Playing
- Route Planning (to find solutions efficiently)
- Alpha Go https://storage.googleapis.com/deepmind-media/alphago/AlphaGoNaturePaper.pdf
- Selection
- Expansion
- Evaluation
- Backup
- Depth First Search Algorithm Steps:
- Start from root, traverse the tree always in the most-left-first fashion, and when reach end leaf, move to next branch
- Step 1: Pick an unsolved box that has the fewest possible numbers to try out i.e. 89 (less work), and create a Tree of possibilities to try out
- Step 2:
- Step 2a: Try the first branch’s possible value of 8, using Constraint Propagation
- Step 2b: If not done after first branch’s first level, then choose from that unit/square another box with fewest possible numbers (i.e. 159), and create 3x 2nd level branches to Try solving with using 8 with each of its values (i.e. 1, 5, and 9) using Constraint Propagation
- Step 2c: Repeat with more levels as required
- Step 3: Try for remaining branches
- Step 4: If Steps 1-3 result in no solution, repeat for another one of its possible values
- Note: 4x more work, but each step gets easier
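- A sketch of the recursive depth-first search described above, combining constraint propagation with the minimum-remaining-values choice (it reuses the eliminate/only_choice sketches; `reduce_puzzle` is an assumed helper defined here, not the project starter code):
```python
def reduce_puzzle(values):
    """Apply eliminate() and only_choice() until no further progress; False on contradiction."""
    stalled = False
    while not stalled:
        solved_before = sum(len(v) == 1 for v in values.values())
        values = only_choice(eliminate(values))
        solved_after = sum(len(v) == 1 for v in values.values())
        stalled = solved_before == solved_after
        if any(len(v) == 0 for v in values.values()):
            return False  # a box has lost all its possibilities: contradiction
    return values

def search(values):
    """Depth-first search combined with constraint propagation."""
    values = reduce_puzzle(values)
    if values is False:
        return False
    if all(len(v) == 1 for v in values.values()):
        return values  # solved
    # Minimum Remaining Values: pick an unsolved box with the fewest possibilities
    _, box = min((len(v), b) for b, v in values.items() if len(v) > 1)
    for digit in values[box]:
        attempt = values.copy()  # keep each branch of the search tree independent
        attempt[box] = digit
        result = search(attempt)
        if result:
            return result
    return False
```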
- Constraint Propagation:
- Using local constraints in a space (i.e. Sudoku square constraints)
to reduce search space. Each constraint that is enforced introduces
new constraints for other parts of board that helps reduce remaining
possibilities
- Example Usage:
- Map Coloring Problem: Prevent two adjacent map segments from having same color
- Crypto-Arithmetic Puzzles: Where each unique letter represents a digit in a calculation (i.e. TWO + TWO = FOUR), and no number starts with a 0. The goal is to find a mapping from letters to digits that satisfies the equation, using the constraints imposed by the equation to create an AI-agent that solves the problem via Constraint Propagation (see the brute-force sketch below)
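- For the crypto-arithmetic item above, a purely illustrative brute-force sketch (no constraint propagation yet) that finds digit assignments satisfying TWO + TWO = FOUR:
```python
from itertools import permutations

def two_plus_two_equals_four():
    """Yield digit assignments satisfying TWO + TWO == FOUR (no leading zeros)."""
    letters = 'TWOFUR'
    for digits in permutations(range(10), len(letters)):
        a = dict(zip(letters, digits))
        if a['T'] == 0 or a['F'] == 0:
            continue
        two = 100 * a['T'] + 10 * a['W'] + a['O']
        four = 1000 * a['F'] + 100 * a['O'] + 10 * a['U'] + a['R']
        if two + two == four:
            yield a

print(next(two_plus_two_equals_four()))  # one valid mapping, e.g. 734 + 734 = 1468
```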
- Sudoku:
- Options:
- Constraints: Elimination
- Pick solved box, and apply the Eliminate function to remove the value of the box from the possibilities of its peers
- Constraints: Only Choice
- Pick a Unit, if there is one value in the unsolved boxes that is only allowed in one of the boxes, then assign it to the box
- Constraints: Search
- Use Depth First Search algorithm to solve harder Sudoku problems using recursion and calling function that uses Constraint Propagation (i.e. Elimination and Only Choice techniques)
- Note: Constraint Propagation Technique
(using Elim & Only Choice) Applied
- Step 1: Pick a solved box and apply Eliminate function
- Step 2: With the more complete grid, pick a Unit and apply Only Choice function
- Step 3: Repeat Step 1 for a different solved box
- Step 4: Repeat Step 2
- etc
- Solution ?? Or Stuck ?
- Important Note: May only handle EASY puzzle
- Naked Twins Technique (extension) http://www.sudokudragon.com/tutorialnakedtwins.htm
- Given a Sudoku puzzle partially completed, with some boxes holding several possible values: if there are two boxes in the same unit (e.g. the same row or column) that both contain the same two possible values, i.e. 23 and 23, then we know for sure that 2 and 3 can only occur in those two boxes in that unit, so:
- Remove 2 and 3 from the possible values of every other box in that unit (see the sketch below)
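- A sketch of the Naked Twins elimination described above (again reusing the `unitlist` structure from the earlier sketch; not the project starter code):
```python
def naked_twins(values):
    """Remove the two shared values of naked twins from the other boxes in their unit."""
    for unit in unitlist:
        pairs = [box for box in unit if len(values[box]) == 2]
        for i, box1 in enumerate(pairs):
            for box2 in pairs[i + 1:]:
                if values[box1] == values[box2]:  # naked twins found
                    for box in unit:
                        if box not in (box1, box2):
                            for digit in values[box1]:
                                values[box] = values[box].replace(digit, '')
    return values
```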
- Diagonal Sudoku (by modifying the algorithm)
- Project 1:
- Added Python Debugging functionality from main.py file (see my Gists on GitHub)
- Added assertions
- Get PyGame library directory
>>> import site;print(site.getsitepackages()) ['/Users/Ls/miniconda3/envs/aind/lib/python3.6/site-packages']
- Code Review Comments:
- Logging with default level ERROR could be added to debug the code. Logs can also help to understand the algorithms. See https://docs.python.org/3/howto/logging.html
- Assert statements could be used https://wiki.python.org/moin/UsingAssertionsEffectively
- Utility methods in separate file
- MyPy - Static Type Checking
- Install MyPy
python3 -m pip install mypy
- Install Typing
python3 -m pip install typing
- Import Typing
import typing; from typing import *
- Apply MyPy static type checking to an existing function where warnings are expected
def solve(grid: str) -> None:
- Run MyPy Linter with
mypy solution.py
-
Apply MyPy with expected types
def solve(grid: str) -> typing.Dict:
- Installation http://mypy.readthedocs.io/en/latest/getting_started.html#getting-started
python3 -m pip install mypy
- Also installed
typing
, even though at this website it says it’s only necessary when running old Python 2 http://mypy.readthedocs.io/en/latest/faq.html
python3 -m pip install typing
- And then added to top of file:
import typing; from typing import *
- Implemented into function with:
def solve(grid: str) -> typing.Dict:
- Run MyPy Linter
mypy solution.py
- Basic Summary http://mypy.readthedocs.io/en/latest/basics.html
- Note:
- Example built-in types: http://mypy.readthedocs.io/en/latest/builtin_types.html
- Run MyPy Linter that performs type checking of program without running it and provides warnings
mypy program.py
```
PySudoku.py:1: error: Cannot find module named 'pygame'
PySudoku.py:1: note: (Perhaps setting MYPYPATH or using the "--ignore-missing-imports" flag would help)
PySudoku.py:3: error: Cannot find module named 'SudokuSquare'
PySudoku.py:4: error: Cannot find module named 'GameResources'
PySudoku.py:65: error: Name 'main' is not defined
```
- Extend MyPy Linter to perform type checking of Libraries that have Stubs that
are skeletons of public interface of library classes, variables, functions and their types
but with Dummy function bodies (i.e.
those incorporated in the Typeshed project from Python Built-ins and Standard Library)
- Type inference uses Stubs for definitions
```python
# Sample code in .py file
x = chr(4)

# Sample MyPy Library Stub in .pyi file (ellipsis used for function body)
def chr(code: int) -> str: ...
```
- Create Custom Stubs https://github.com/python/mypy/wiki/Creating-Stubs-For-Python-Modules
- Write Library Stub in .pyi file within either same directory as the Library Module OR
in specific Stub directory (i.e. myproject/stubs) and set environment variable
pointing to
export MYPYPATH=~/work/myproject/stubs
- Use subdirectory with
__init__.pyi
for packages - In directories with both
.pyi
and.py
the.pyi
file takes precedence
- Contribute Custom Library Stubs created into the Typeshed Repo
- Static Type Checking (instead of Dynamic Type Checking)
- Type Annotations added to functions (type checker reports type errors within function)
- Apply return type as None by default so statically typed context results in type check error
```python
def greeting(name: str, prefix: str = 'Mr.') -> None:
    return 'Hello, {} {}'.format(name, prefix)
```
- Use
typing
module to use type definitions in statically typed codefrom typing import Iterable def greet_all(names: Iterable[str]) -> None: for name in names: print('Hello, {}'.format(name))
- Previously Dynamic Type Checking
- No type checking annotation in function signatures
- Lesson 4: Intro to AI
- Optimisation of Navigation and Route Finding Domain
- Computationally hard problems
- Find shortest path on map from A to B
- Apply graph of city nodes
- Connect city nodes by road edges
- Option 1: Breadth-First Search
- Inefficiencies
- Considers all possible paths out of city until reaches destination via one of them
- Only considers major roadways
- Issues
- Intractable space of possible paths to search if consider smaller streets/alleys
- Option 2: Heuristic to Manually Solve (i.e. a smart idea, rule, function, or constraint that informs an otherwise Brute-force Algorithm so it acts more optimally)
- Consider major roads
- Defer considering major roads going too far
- Prioritise paths that take closer to destination
- Option 3: A* Search Algorithm
- Draw straight line path between source and destination
- Prioritise how the algorithm considers and evaluates different possibilities by testing (using a Direct Distance Heuristic) all possible next steps from the current position and how the direct distance changes (preventing unnecessary searching and converging on the optimal direction quickly). Choose the Most Promising Choice with the smallest direct distance as the next step (it may not turn out to be Optimal), and store each other path for later consideration (see the sketch below).
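- A minimal A* sketch matching the description above, over a hypothetical toy graph with straight-line-distance estimates (graph layout and heuristic values are assumptions for illustration):
```python
import heapq

def a_star(graph, h, start, goal):
    """graph: {node: [(neighbour, edge_cost), ...]}; h: straight-line estimate to the goal."""
    frontier = [(h[start], 0, start, [start])]  # (estimated total, cost so far, node, path)
    best_cost = {start: 0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path, cost
        for neighbour, step in graph.get(node, []):
            new_cost = cost + step
            if new_cost < best_cost.get(neighbour, float('inf')):
                best_cost[neighbour] = new_cost
                heapq.heappush(frontier,
                               (new_cost + h[neighbour], new_cost, neighbour, path + [neighbour]))
    return None, float('inf')

# Hypothetical toy map and straight-line estimates to 'D'
graph = {'A': [('B', 2), ('C', 5)], 'B': [('D', 4)], 'C': [('D', 1)], 'D': []}
h = {'A': 4, 'B': 3, 'C': 1, 'D': 0}
print(a_star(graph, h, 'A', 'D'))  # (['A', 'B', 'D'], 6)
```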
- Intelligent Agent
- Dfn: Actions taken to Maximise its expected Utility given desired Goal
- Comparison of Agents
- Rational Behaviour extent - requires Optimal Behaviour (regardless of what Cognitive Mechanism used), however, the Optimal Solution is hard to achieve (i.e. Sudoku problem, etc), but AI-agents are not always expected to perform Optimally
- Constraints faced by the AI-agent must also be considered
i.e.
- Partially Observable env
- Limited computational resources (speed, memory)
- Rules of task such as require respond by deadline
- Bounded Optimality
- Quantifying Intelligence feasibly and practically
- Level of Performance / Bound for AI-agent to meet (i.e. Chess AI-agent to win 60% of time against human chess masters, or Route Finding Algorithm that always find route no more than 2km longer than Optimal route)
- i.e. Anticipate and React to changes in environment (i.e. other drivers)
- Game AI-Agent (Tic Tac Toe)
- Goal - Design AI-agent to play Game (Tic Tac Toe) using Search Strategy (similar to route-finding problem)
- Step 1
- Determine Nodes and Edges in the Graph we search
- Option A -
- Nodes - Each cell
- Edges - Connect two if adjacent
- INSUFFICIENT - Does not capture enough info on what to do next
- Option B -
- Node & Edges - Current Player position (row/col Tuple) {0, 1, 2} Note: Two Nodes connectable if row/col differ by 1 (i.e. (0, 1) connectable to (1, 0) but not (2, 2)
- INSUFFICIENT
- Option C -
- Opponent Tracking - Track of opponents position too
- Node - Tuple (i.e. <(0, 1), (2, 0)>
- Edges - Connect if both Player positions differ by max 1
- INSUFFICIENT - Since need know which cells already filled, as cannot move into them
- CORRECT - Option D -
- Node - Whole Board is Node, with one Node for every possible arrangement of X’s and O’s
- Edges - Connect two Nodes if there is a valid move
that changes the board from one node to another
- The edges embody the Rules of the game
- i.e. only possible to go to finite qty of next states from current state (limits qty of moves need to consider)
- The edges embody the Rules of the game
- SUFFICIENT - Usable for Tic Tac Toe! Since we will know the positions of the Computer Player and the Opponent Player, we know what to do next (and it excludes any extraneous details we don't need to solve the game)
- Note:
- 3x3 Board -
- First Computer Move - 9 possibilities
- First Opponent Move - 8 possibilities
- etc
- 3x3 Board -
- Note: Search Indicators
- Pruning the Search Tree
- Ignorable / Non-Winnable Moves (some moves are wasteful)
- Evaluated Condition - i.e. When enemy has X on same row and column and proposed placement
- Adversarial Search
- Minimax Algorithm (maximise chance of win on our turn, whilst enemy trying to
minimise our chance of win on their turn)
Note: AI-agent Dfn - Anticipates and plans around expected changes in its environment,
including those introduced by other agents
- Anticipate Enemy Moves (intuition of recognising that the enemy is a Human also trying to Win)
- Evaluate Condition - i.e. When enemy can make sequence of moves that win after our move
- Rule Out certain moves as being BAD removes potential successors from its consideration
- Monty Hall Problem
- Given 3x doors, randomly assigned behind each is either car or goat
- Strategy:
- Stay - Wins 33.33% of time
- Switch - Wins 66.67% of time
- Strategy (enumerate the arrangement C G G, where x marks the player's initial pick and 0 the door Monty opens):
1) C(x) G(0) G - better to STAY
2) C(x) G G(0) - better to STAY
3) C G(x) G(0) - better to SWITCH (the sub-case where Monty opens the car door never happens, so its probability counts towards case 3)
4) C G(0) G(x) - better to SWITCH (likewise, Monty opening the car door has probability 0 and counts towards case 4)
- Probability Theory - captures uncertainty in
real-world to make sensible decisions when unsure
- Prior Probability - best guess given no other evidence
P(CarA) + P(CarB) + P(CarC) == 1
P(CarA) == 1/3, P(CarB) == 1/3, P(CarC) == 1/3 (aka **Prior Probability**)
- Equal to 1 since the probability that at least one door hides the car is 100% (the sum of the probabilities equals 1)
- So each door has 1/3 probability of car behind it
- Move 1 -
- Player 1 Select door A
- Monty Open door B
- OpenB (Observation)
- Posterior Probability - belief after making an Observation
P(CarB | OpenB) == 0
- Note: 2x doors still undisclosed, probability of car behind either
is NOT 1/2, since Monty (enemy) knowingly opened door with
Goat behind it (never reveal the car on this move)
NOT THIS - P(CarA) == 1/2, P(CarC) == 1/2
- We are actually interested in Posterior Probability of finding
car behind Door C, given that Monty opened Door B, and the implicit
Rule that Monty never reveals the car, so
express the Posterior Probability in terms of other quantities
that can compute using Bayes’ Theorem
```
P(CarC | OpenB)
  == (P(OpenB | CarC) * P(CarC)) / P(OpenB)
  == (Likelihood * Prior) / Marginal Probability

Likelihood: P(OpenB | CarC) == 1, since if CarC is true, Monty's only option is OpenB
            (OpenA is not an option for Monty since we already picked door A)
Prior:      P(CarC) == 1/3, as previously computed

Marginal Probability:
P(OpenB) == P(CarA) * P(OpenB | CarA) + P(CarB) * P(OpenB | CarB) + P(CarC) * P(OpenB | CarC)
  where Possibility 1: Car behind A with 1/3 probability; in this case Monty opens either B or C,
                       each with 1/2 probability, so P(OpenB | CarA) == 1/2
  where Possibility 2: Car behind B with 1/3 probability, but Monty would not reveal it,
                       so P(OpenB | CarB) == 0
  where Possibility 3: Car behind C with 1/3 probability; Monty must open B, so P(OpenB | CarC) == 1
  == 1/3 * 1/2 + 1/3 * 0 + 1/3 * 1
  == 1/6 + 1/3
  == 1/2

so, updated Bayes' Theorem
  == (Likelihood * Prior) / Marginal Probability
  == (1 * 1/3) / (1/2)
  == 2/3
(so the probability that the car is behind the other closed door has increased to 2/3)
```
- Note: Empirical Evaluation (by characterisation of given situation coupled with a formal method helps make decisions in face of Uncertainty) shows that consistent switch strategy wins 2/3 of time
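- A quick simulation sketch to check the 2/3 switch-win figure empirically:
```python
import random

def monty_hall(trials=100_000):
    stay_wins = switch_wins = 0
    for _ in range(trials):
        car, pick = random.randrange(3), random.randrange(3)
        # Monty opens a door that is neither the player's pick nor the car
        monty = random.choice([d for d in range(3) if d not in (pick, car)])
        switch = next(d for d in range(3) if d not in (pick, monty))
        stay_wins += (pick == car)
        switch_wins += (switch == car)
    return stay_wins / trials, switch_wins / trials

print(monty_hall())  # roughly (0.333, 0.667)
```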
- Intelligence Dfn:
- Ability to produce Reasonable behaviour while dealing with different
sources of complexity
- Intelligent Objects: Thermos, Plant, Siri, Rock, Arpan
- E.g: If a person you know was not human but acted same, you probably wouldn’t tell the difference
- Note:
- Intelligence is often used to label things we cannot explain
- Artificial General Intelligence Dfn (sub-field of AI) deals with goal of:
- Ability to replicate complex nature of human thinking that accomplishes a variety of intellectual tasks like humans do
- Algorithms or Formulas used to label things we CAN explain
- Philosophically this implies that:
- Anything we design can never be intelligent, as we would know how it works
- Humans assume they are intelligent now, as we do not fully understand how brains work, but when we do, will we not consider ourselves intelligent anymore?
- Intelligence - must not be contingent on how we Perceive something
- Intelligence must be a Property that emerges from the system itself
- Intelligence - should be defined as the ability to do something with a decent rate of success within the Context of a task so then can say you are Intelligent at solving Sudoku puzzles
(quote by Luke Schoen i.e. Our judgement of the “truth” of something, similar to our judgement of the “intelligence” of something, should not be contingent on how we “perceive” that something. Instead our judgement should be based on assessing “properties” that emerge from the “source” of that something itself, and should be defined based on that somethings’ “success rate” at being able to be “truthful” within a defined “context” such as a task)
- Humans are Intelligent at doing Multiple tasks (many things)
- Agent Design (i.e. AI-agent) (i.e. aka Intelligent Systems)
- AI-agent Dfn:
- Cartesian View:
- Only the Software is the AI-agent, whereas the Hardware (i.e. sensors) is considered part of the Environment
- High-level View:
- Entire robot is the AI-agent, all components directly reliably accessible/controllable part of the Environment
- Example:
-
Automatic Vacuum - i.e. specification
- Environment Understanding (Domain of Task / Problem)
- room operating in, floor, walls, objects to avoid bumping into, except objects not of concern
- State stored is only items necessary to complete task
- current position
- orientation while making actions
- places already been before
- obstacle location map (but they may move over time)
- Goal State stored - result the AI-agent trying to achieve (i.e. final location(s) acceptable)
- Agent Interaction with Environment
- Input
- Perception - Sensing the Properties of Env
- Effectors
- Internal
- Cognition - Based on Perceived Inputs, the process of:
- Reasoning
- Decision Making
- Action
- Note:
- Reactive / Behaviour-based Agents
- Simple pre-programmed behaviours that directly associate Actions with Perception (bypassing the Cognition process) (i.e. an automated rubbish bin that senses something in front of it)
- Classification of Agents
- Classification Measures - based on:
- the kind of Processing they do
- the Environment Properties / Environment State components they must capture:
- Fully Observable (i.e. Tic Tac Toe where whole board viewable)
- Partially Observable (i.e. Battleship game where opponent positions hidden)
- Deterministic (i.e. where Certain of Action outcomes)
- Stochastic (i.e. where Uncertainty in Action outcomes)
- Discrete (i.e. finite qty of States the env may be in)
-
Continuous (i.e. infinite qty of States such as due to Properties that need to be stored as real numbers)
- Benign (i.e. where AI-agent is only one taking Actions that intentionally affect its Goal, i.e other Random events may be occurring)
- Adversarial (i.e. where one or multiple AI-agents that can take Action to defeat its Goal, i.e. competitive games)
- e.g.
- Playing Poker - Partially Obs, Stoch., Adversarial
- Recognition of Hand-written Text - Stochastic, Continuous (since source is physical object with arbitrary and unpredictable motions)
- Driving on Road - All except Adversarial
- Playing Chess - Advers.
- Layering simple Reactive control in a Hierarchy may achieve complex Behaviour
- Deliberate / Non-Trivial Processing by agents such as:
- Knowledge-based Agents
- Game Playing Agents
- Path Planning Agents
- Learning Agents
- Output
- Actions - Changing the State of Env
- Lesson 5: Intro to Game Programming
- Course Outline
- Search and Logic and Planning - Peter Norvig
- Probability and Bayesian Networks (BN) - Sebastian Thrun
- Other - Thad Starner
- Goal of Lesson
- Design a Game AI-agent program smarter than ourselves when playing within the limited context of a single real game
- Main Topics
- Adversarial Search - find optimal moves
- Minimax Algorithm - for Deterministic Games determines best move for any turn in a game
- Expectimax Algorithm - for Non-Deterministic Games (Chance games)
that considers all possible outcomes and chooses the
one with maximum expected return
(assuming opponent is making best moves available)
- Expectimax Game Tree
- Maximising Nodes - indicated by upward facing triangles
- Minimising Nodes - indicated by downward facing triangles
- Probabilistic Nodes - indicated by circles with the Chance of each outcome labelled on branches below it
- Evaluation of Tree Approach - evaluate tree from left to right
- Note: Game Board Nodes - values may range from -10 to +10
- Alpha-Beta Pruning - helps optimise the Minimax / Expectimax Algorithm to make the AI-agent play faster (see the sketch below)
  - Performance Improvements: apply to expectimax game trees that have finite limits on the values the evaluation function returns
  - Usage: use the Alpha-Beta Algorithm to mark tree branches that do not need to be visited when calculating the value of the optimum choice
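- A sketch of minimax with alpha-beta pruning over a game tree written as nested lists (leaf numbers are evaluation scores; the tree values are illustrative assumptions, not from the course):
```python
def alphabeta(node, alpha=float('-inf'), beta=float('inf'), maximizing=True):
    """node is either a numeric leaf score or a list of child nodes."""
    if not isinstance(node, list):
        return node
    if maximizing:
        value = float('-inf')
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:
                break  # beta cut-off: the minimiser would never allow this branch
        return value
    value = float('inf')
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True))
        beta = min(beta, value)
        if alpha >= beta:
            break  # alpha cut-off: the maximiser already has a better option
    return value

# Tiny illustrative tree: the maximiser chooses between two minimising subtrees
print(alphabeta([[3, 5], [2, 9]]))  # 3 -- the 9 in the second subtree is never visited
```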
- Evaluation Function - creation of them for games
- Isolation Game Player - apply knowledge to the Isolation Game (a complex game
with simple rules)
- Objective: To not become isolated and unable to move on your turn (i.e. be the last player to move)
- Setup: 5x5 board, 2x players with X and O pieces
- Movement Rules: Move permitted to any square horizontally, vertically, or diagonally
from current position (like a queen in chess), but not through:
- opponents position
- previously occupied position (i.e. landed upon)
- outside game board position
- https://www.youtube.com/watch?v=n_ExdXeLNTk
- Note: Hard to estimate chances of winning at start of game, but become more certain as game progresses
- Minimax Algorithm Usage:
  - Automatic first choice consults an Opening Book - a Table of Best First Moves (also used for End Games) for this board config
  - Discover best moves, i.e.
    - Move to positions that are always at least above, or in the position immediately above, the opposition position
    - Move to positions with available escape options (unless you can block and win on that specific move); stay closest to the other player's position; avoid some first moves entirely (i.e. top middle or bottom middle); there are higher odds of winning when making the first move on a 3x2 (w/h) board with the bottom-right position blocked off
  - Mark (with a Downward Arrow above) Scores at Terminal (Leaf) Nodes
    - +1 for Win
    - -1 for Lose
  - Assumption 0: Maximisation of the score is considered from the perspective of the [Computer] (otherwise we'd reverse how we apply Minimax). The [Computer] can only affect the game on its turn. The top of the Minimax Algorithm Tree is always a Maximise (Upward Arrow Below) node
  - Assumption 1: Player X [Computer] always tries to maximise its score
  - Assumption 2: Player O [Human] is smart and always plays to win, trying to minimise Player X's score (we cannot rely on them making bad moves and losing soon)
  - Mark Assumption 1 (with Upward Arrow Below, i.e. [Computer]'s Turns) on the Tree: turns when Player X [Computer] is trying to Maximise the score
  - Mark Assumption 2 (with Downward Arrow Below, i.e. [Human]'s Turns) on the Tree: turns when Player O [Human] is trying to Minimise Player X's score
- Move to positions that partition the board (fill positions on one side of board to reduce positions to move to to force early end to the game)
- Move to positions that try to Trap the opposition
- Example:
- Build a Game Tree (showing all possible moves) based on a 2x3 game board
- Game Tree used to select moves with the best chance of winning
- Rules:
- Unavailable Moves: Any filled spot on game board
- Human is X, Computer is O.
- Mark Leaf (Terminal Nodes) branches where WIN with +1, and LOSE with -1.
- Level 0 (Beginning Game Level) - bottom left is unavailable immediately.
- Draw Level 0 of Game Tree as top of tree
- Level 1 [Computer] - 5x possible moves.
- Draw Level 1 of Game Tree as 5x branches possibilities below Level 0
- Level 2 [Human] - 4x possible moves (reduces)
- Draw Level 2 of Game Tree as 4x branches below EACH Level 1 branch
- Level 3 [Computer] - 3x OR 2x OR 1x possible moves (exponential growth) (some moves BLOCKED by X, or a BLINDSPOT)
- Level 4 [Human] - 2x OR 1x possible moves (tree becomes denser)
- Level 5 [Computer] - 1x OR 0x possible moves (multiple LOSE possibilities)
- Note: Warn [Computer] not to make Level 3 moves that allow a Level 4 branch where [Human] has a move resulting in 0x possible moves for [Computer] at Level 5
- Level 6 [Human] - 0x possible moves (i.e. Computer Wins if the game gets to this level)
- Minimax Algorithm Dfn -
Back Propagate possible future wins/losses up game tree to [Computer] so
best possible first move is made
- Start Bottom Left Leaf Terminal Node
- Go Upward and if get:
- Downward Arrow choose the Min value (from -1 or +1) appearing in Child Nodes
(Turn where Player X [Computer] is trying to Minimise their opponents score)
- Note that since [Computer] plays to win, it never chooses branches that allow [Human] to win; we Mark these branches as -1, as we want to avoid future opportunities that MAY result in our loss; and
- Upward Arrow nodes on the way up: pick the Max value among the Child Nodes
- Note that if [Human] player makes a wiser choice at a higher tree level it can avoid situation of [Computer] being able to play to win
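- Sketch (not the course's project code): a minimal recursive Minimax that back-propagates +1/-1 terminal scores as described above. The `game` object and its methods (`is_terminal`, `utility`, `legal_moves`, `apply`) are illustrative assumptions only.

```python
def minimax(game, state, maximising):
    """Back-propagate terminal scores (+1 win / -1 loss for [Computer]) up the game tree."""
    if game.is_terminal(state):
        return game.utility(state)            # +1 if [Computer] wins, -1 if it loses
    values = [minimax(game, game.apply(state, move), not maximising)
              for move in game.legal_moves(state)]
    # Upward-arrow level ([Computer]'s turn): take the max; downward-arrow level: take the min
    return max(values) if maximising else min(values)

def best_first_move(game, state):
    """Choose the move whose subtree back-propagates the best value for [Computer]."""
    return max(game.legal_moves(state),
               key=lambda m: minimax(game, game.apply(state, m), maximising=False))
```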
- Multi-Player, Probabilistic Games
- Problem Solving and Game Playing
- Strategy Games - i.e. checkers, isolation, othello, chess
- Algorithms
- Reusable code (only the code that generates the next possible moves and the UI needs modifying) to create a Computer Player for simple Deterministic games (where the outcome of each move is known) such as Tic Tac Toe, or games as complex as Chess
- Chance games (i.e. backgammon)
- Max Qty of Nodes Visited
- Given
- Isolation Game board: 5x5
- Time to process all Upper Limit of Nodes:
      # Move 1 [Computer] - 25 possible places
      # Move 2 [Human]    - 24 possible places
      # 23 ...
      # i.e. 25 x 24 x 23 x 22 ... x 1 = 25! ~= 10 ^ 25
      # assume a multi-core gigahertz processor able to do
      # 10 ^ 9 operations/s: it will take 10 ^ 16 seconds
      # to search the entire game,
      # where 10 ^ 16 sec / (3600 sec/hr * 24 hr/day * 365 day/yr) ~= 317,097,920 years
- Time to process only up to Second Move:
# 25 x 24 = 600
- Subsequent Moves from Third Move:
- Each subsequent move is like a queen's move, with fewer moves available
- Center Position on board with Max Options to move on Third Move
- Branching Factor
- Center has 16 possible positions
- Other Positions have <= 13 possible positions
- Max moves (depth of tree) in a game is 25
- Max possibilities from First and Second moves: 25 x 24 = 600
- 23 remaining moves after First and Second moves:
- <= 13 possible moves available (except center position)
- End Nodes Worst Case in game tree:
- 25 * 24 * 12 ^ 23 = >10 ^ 27 estimate …
- Assuming a Branching Factor of 12 for all 23 remaining moves is far too much,
since for the last moves not many
possibilities remain
- Fourth Last Move possibilities = <=3
- Third Last Move possibilities = <=2
- Second Last Move possibilities = 1
- Last Move possibilities = 0
- Instead assume this approximation is sufficient:
      12 ^ 11 ~= 7 * 10 ^ 11                             (branching factor of 12 for the next 11 moves)
      12 * 11 * ... * 3 * 2 * 1 = 12! ~= 5 * 10 ^ 8      (factorial decline over the final 12 moves)
      25 * 24 * (12 ^ 11) * (5 * 10 ^ 8) ~= 2 * 10 ^ 23
      (much better/faster than 10 ^ 27, since most branches will have less than the max branching factor)
- Qty of Nodes in Game Tree that Minimax Algorithm must visit
to find Optimal solution is:
- b ^ d (given grid: 5x5), where:
  - b = average branching factor
  - d = depth of game tree
- Calculate Average Branching Factor
- Option 1: Brute Force
- Try simple test first and add intelligence later
- Create a game player that moves randomly and calculate
the avg branching factor
(use statistics on each move: branches remaining, total games,
total plays, total branches, avg branching factor)
and we find the
- Average Branching Factor: 8
- Estimated Max Nodes visited: 8 ^ 25 ~= 10 ^ 22 (i.e. 1.2 million years for answer)
- Option 2: Depth Limited Search (Clever alternative to long end game search)
- Assume:
- Search Speed: 10^9 nodes/sec * 2 sec per move = 2 * 10^9 nodes per move
- Solve for the max depth x:
      8^x < 2 * 10^9
      x < log_8 (2 * 10^9)
- Change of base formula: log_a (x) = log_b (x) / log_b (a)
- So: x < log_10 (2 * 10^9) / log_10 (8), i.e. x < 10.3
- Assuming a Branching Factor of 8 is accurate, to be on the safe side we should only go to a Max Depth of 9 (only go as deep as we think we need to meet the deadline); see the quick calculation sketch after this Depth-Limited Search list
- At a Depth of 9 we want an Evaluation Function of a Node that evaluates the goodness of a node at depth 9 based on how much we expect it to lead to a Win for the [Computer] player
- Note: The only way for the [Computer] player to Win is for it to have moves left at the end of the game, so maybe [Computer] shouldn't maximise the number of moves it has early on!
- Evaluation Function (aka “Number of #my_moves”, for convenience, for Player 0)
- Note On each game tree branch
- Min Branch (third last) - Mark min value of all below Max Branches
- Max Branch (second last) - Mark max value of leaves below it
- Leaves - Mark the maximum number of moves available to [Computer]
- Arguments: Each game board generated level 9 of Minimax game tree
- Returns: A number used to compare that level 9 node with all other nodes at level 9. A higher returned number represents how good the board is for the [Computer] player
- Task: Counting number of moves the [Computer] has available at a given node, allows [Computer] to select branches from Minimax tree that lead to spaces where [Computer] has the most options
- Also, use the Evaluation Function to test our assumption
that if [Computer] makes a move to any middle position it will always lose.
Try ONLY applying the Evaluation Function down to Level 2 of the game tree,
to see that completely different nodes get higher scores (the ones that
would guarantee a loss!). Maybe they vary widely because a critical decision is being
made at that level.
- Demonstrates better to search to at least level 3
- Check which branch the Minimax Algorithm recommends at
each level of the search
- Use the #my_moves Evaluation Function on the Isolation Game with Max score 5, Min score 0. If we find a branch where [Computer] loses, give it score -1, or if we find a branch where [Computer] wins, give it score 100.
- Quiescence Search: the state we reach if we continue to Level 3, where the recommended branches no longer change much, so we know we have gone far enough… and if we continue to Level 5, we just learn what we already know and the values become -1's and +1's
- Use this approach of determining deepest level to go to if get consistent results
- Use this approach at the Start and End of the game, when it gives better results
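- Quick check of the max-depth calculation above (a minimal sketch; the 2-second node budget and branching factor of 8 are the figures used earlier):

```python
import math

nodes_per_move = 2 * 10**9           # 10^9 nodes/sec * 2 sec budget per move
branching_factor = 8
max_depth = math.log(nodes_per_move, branching_factor)
print(max_depth)                      # ~10.3, so search to depth 9 to stay on the safe side
```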
- Depth-First Iterative Deepening (DFID)
- References:
- AI Textbook: 3.4.5 of Russel, Norvig
- UBC Slides: https://www.cs.ubc.ca/~hutter/teaching/cpsc322/2-Search6-final.pdf
- Video showing how DFID vs DFS http://movingai.com/dfid.html
- Note: End Game is best critical time to apply DFID to see what will happen (i.e. in Isolation game)
- Note: Using DFID means that our [Computer] player will always have an ANSWER READY in case it runs out of time (when there is limited time PER MOVE), and using it we can search as far as possible within the time constraints. In other games there is limited time PER GAME (i.e. speed chess), so we would want [Computer] to search to different depths depending on the point in the game
- Strategy developed on how deep to search at different points in the game
- Strategy 1:
- Beginning of game: Standard Initial Moves
- Middle of game: Deeper Search (more time allocated) for Moves
- End of game: Less Time searching (as far as possible in remaining time) for Moves (relying on reduction and branching factor)
- Strategy 2:
- Conservative amount of time dedicated per Move
- Use DFID and Quiescent Search to determine few moves want to spend extra time on
- RISKS: Horizon Effect
  - Highlighting to the [Computer] player critical situations when the game will be won/lost on the next move (but where [Computer] is unable to search far enough into the future to discover the problem exists)
    - i.e. If a player is blocked on one side of a Partition where they have fewer moves than if they had moved to the other side of the Partition (a Partition being created on the board by previously occupied positions)
  - **Potential Solution**: Add a process to your existing #my_moves **Simple Evaluation Function** to make it a **Complex Evaluation Function** by additionally checking if a **Partition** is being formed by the next move, and if so, count the number of moves left for each player (BUT this causes our Evaluation Function to take exponentially longer)
  - **Solution**
    - **Strategy**: Depending on the game time available, carefully think about how efficiently the Evaluation Function may be implemented for the strategy chosen, and whether it captures the desired behaviour or not. BEFORE the **Evaluation Function**, use **Alpha-Beta Pruning** (to improve the efficiency of game tree search); a sketch follows this block:
      - **Alpha-Beta Pruning Algorithm**
        - Dfn: Pruning technique that lets us ignore whole sections of the game tree yet return the same answer as the **Minimax Algorithm**, so it is MORE EFFICIENT, saving a lot of time. Where:
          - **Minimax Algorithm** runs in a timeframe of `b ^ d`
          - **Minimax with Alpha-Beta Pruning (first)** runs in a timeframe of `b ^ (d/2)` (if nodes are ordered optimally with best moves first), and down to a timeframe of `b ^ (3d/4)` (with random move ordering)
        - Firstly, when performing the **Minimax Algorithm** from left to right, after propagating the highest value up to the far-left bottom Max Node (Level -2) from that branch's two nodes (Level -1, aka leaves) (i.e. a value of 2), we know that the Min Node above it (Level -3) will have a subtree value <= 2 (since Opponent X will choose the branch that minimises the value). So for each remaining Level -2 node, as soon as we find it has a Level -1 leaf of value 2, we can ignore the rest of its leaves and move on to the next Level -2 node (i.e. if the second Level -2 Max Node has 3x leaf nodes and we find the first one is 2, we can ignore the other two as they will never be selected), saving time. Based on this we know that at the highest Top node (Level -4) we will get a value >= 2
        - Then, assuming Level -4 is the Top of the game tree, when we check the second-from-the-left Level -3 subtree's ([Computer]'s turn) Level -1 leaf nodes, as soon as we find a value of 2 we know that is a Winning leaf, with that Level -3 subtree's Min value being <= 2, so we can ignore the rest of the branches below that subtree
        - Then, for the third-from-the-left Level -3 subtree's Level -1 leaf nodes, as soon as we find a leaf with a value of 2, we can ignore the remaining Level -2 nodes and associated leaves in that subtree, since we have already found a Max (Winning) move in that subtree
        - Note: The above assumes our goal is to play the game, that we choose from left to right using the Minimax Algorithm, and that we do not keep a list of all the equally good moves on a given level
        - **Reordering leaf nodes by only swapping siblings of a given parent** to increase the amount pruned
          - i.e. Since we evaluate left to right, in the right subtree swap the Level -2 Max nodes around so the smallest is on the left instead of the right; as Level -3 is Min, we may only need to evaluate the left one if we find the Level -3 Min is less than the first subtree's value
    - Then use either:
      - **"Simple" Evaluation Function** and DEEP SEARCH, OR
      - **"Complex" Evaluation Function** and SHALLOW SEARCH to catch dangerous situations like the **Horizon Effect**, such as extensions of the **#my_moves** **Simple Evaluation Function** that are correlated with our chances of winning, including (try lots of Variants of Complex Evaluation Functions to see which ones are best):
        - (**#my_moves** - **#opponent_moves**)
          - Note: Useful for Simple Isolation games
          - Favours moves where [Computer] has the most options
          - Penalises moves where the opponent has the most options
        - Note: It is possible to **weight the formula components** for more or less aggressive game play, i.e. to cause [Computer] to chase after the Opponent, and to make the **Winning Move have the highest Evaluation Function result**:
          - (**#my_moves** - 2 * **#opponent_moves**)
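- Sketch (illustrative only, not the project's API): depth-limited Minimax with Alpha-Beta Pruning using the weighted (#my_moves - 2 * #opponent_moves) heuristic above. The `game` methods (`move_count`, `opponent`, `active_player`, `legal_moves`, `apply`, `is_terminal`) are assumptions for the example.

```python
def weighted_moves_eval(game, state, player):
    """Favour our mobility, penalise the opponent's mobility twice as hard."""
    return game.move_count(state, player) - 2 * game.move_count(state, game.opponent(player))

def alphabeta(game, state, depth, alpha, beta, player):
    """Depth-limited Minimax with Alpha-Beta Pruning; `player` is the maximising (root) player."""
    if depth == 0 or game.is_terminal(state):
        return weighted_moves_eval(game, state, player)
    maximising = (game.active_player(state) == player)
    value = float("-inf") if maximising else float("inf")
    for move in game.legal_moves(state):
        child = alphabeta(game, game.apply(state, move), depth - 1, alpha, beta, player)
        if maximising:
            value = max(value, child)
            alpha = max(alpha, value)
        else:
            value = min(value, child)
            beta = min(beta, value)
        if beta <= alpha:          # this subtree cannot change the final choice
            break                  # prune the remaining sibling moves
    return value
```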
- Strategy 1:
- Init Goal: Want [Computer] to respond within < 2 sec per move
- Previously used Max Depth Calc: Calculated Max Depth we thought could search to within that timeframe
- Simpler Iterative Deepening Approach:
- Search to Depth Level 1, get an answer for the best move, and store the answer to use in case we run out of time (L1 has 1 tree node, L2 has 3 tree nodes; L1 has 1 iterative deepening node total, L2 has 4 iterative deepening nodes total); see the sketch after these notes
- Repeat process but search to Depth Level 2
- Repeat process until run out of time (returning best move from level we get up to)
- Note: number of Tree Nodes for each level is the total number of nodes visited when exploring the tree till that depth (not just the nodes on that level)
- Note: number of Iterative Deepening Nodes are simply the cumulative sum at each level (always less than double the number of tree nodes)
- Note:
      b = 2 (Branching Factor of 2 given): n = 2^(d+1) - 1            (Tree Nodes total)
      b = 3 (Branching Factor of 3 given): n = (3^(d+1) - 1) / 2      (Tree Nodes total)
      b = k (Branching Factor of k given): n = (k^(d+1) - 1) / (k - 1) (Tree Nodes total)
- Note: Iterative Deepening's re-searching of the tree does not waste too much time, as the time taken is dominated by the last level searched and is a small multiple compared to the DFS approach of checking every node from the bottom leaves left to right (exponential)
- Note: With Branching Factor of 2, DFID expands less than 2x the qty of nodes a depth limited search would have done at the same level
- Note: With Branching Factor > 2 (i.e. most games) then the redundancy caused by iterative deepening is even less
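- Sketch of the iterative deepening loop above (minimal and illustrative; `search_to_depth` stands in for any depth-limited minimax/alpha-beta search you already have):

```python
import time

def iterative_deepening(game, state, time_limit=2.0):
    """Search depth 1, 2, 3, ... keeping the best move found so far,
    so an answer is always ready if we run out of time on this move."""
    deadline = time.time() + time_limit
    best_move = None
    depth = 1
    while time.time() < deadline:
        # a real player would also check the clock inside the search itself
        best_move = search_to_depth(game, state, depth)
        depth += 1
    return best_move
```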
- AI Definitions
- AI NP (Hard) Problem Types
- Non-Exponential
- Exponential in Time OR
- Exponential in Space OR
- Exponential in Both Time AND Space
- i.e. “AI is the study of finding clever hacks to exponential problems”
- When problem finally solved or computer finally fast enough to solve it, the world no longer thinks of the problem as belonging to AI
- i.e. “AI consists of all the NP hard problems that have not yet been solved”
- i.e. “AI involves working on everyday problems that improve people’s lives”
- AI Systems
- System that helps consumers choose plane flights
- System that determines when to deploy airbag in a car
- AI NP (Hard) Problem Types
- Lesson 6.16:
- Solving 5x5 Isolation Game
- Searching to endgame in reasonable timeframe:
- Minimax Algorithm with Alpha-Beta Pruning (reduces search space size of possible game states from b^d to b^(d/2) i.e. from approx 8^25 to 8^12)
- Symmetry Analysis of board state (a state being defined as a series of ordered moves)
  - http://stackoverflow.com/questions/9082829/creating-2d-coordinates-map-in-python
  - http://stackoverflow.com/questions/32334322/find-adjacent-elements-in-a-2d-numpy-grid
  - http://stackoverflow.com/questions/6313308/get-all-the-diagonals-in-a-matrix-list-of-lists-in-python
  - http://stackoverflow.com/questions/2373306/pythonic-and-efficient-way-of-finding-adjacent-cells-in-grid
- Best at Beginning of Game and Up to Level 3 of the Search Tree
(beyond Level 3, symmetry is rare, and effort to check for it wasn’t worthwhile)
(when Branching Factor is High,
i.e. Player 1 has 25 possible moves, but only 6 Unique moves,
since can use Symmetry about horiz, vert, diagonal axis, and
the center move)
- Heuristic of Equivalent Moves by Rotating Board State 90 degrees
(realising some moves equivalent)
- Reuse Game Tree known for Move (0,0) as is same as (4,0), (4,4), (0,4)
- Do not have to search the game tree further for these non-unique moves
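- Sketch of the 90-degree rotation idea for spotting equivalent opening moves on a 5x5 board (plain nested lists, function names are illustrative only):

```python
def rotate90(board):
    """Rotate a square board (list of rows) 90 degrees clockwise."""
    n = len(board)
    return [[board[n - 1 - r][c] for r in range(n)] for c in range(n)]

def symmetric_variants(board):
    """All four rotations plus the horizontal mirror of each (up to 8 equivalent states)."""
    variants, current = [], board
    for _ in range(4):
        variants.append(current)
        variants.append([list(reversed(row)) for row in current])
        current = rotate90(current)
    return variants
```

Using this, an opening move at (0,0) can be recognised as equivalent to (4,0), (4,4) and (0,4), so only one of them needs to be searched.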
- Partitions (know the game outcome as soon as get partition,
as separates both players completely, so player with longest path left Wins,
so avoid having to search to end of game tree)
- Best at Beginning of Game to create Partition so your player is on side with most remaining moves
- Avoid being Player 2
- Note: Assuming both players play Optimally, Player 1 always wins the 5x5 Isolation game (hence avoid being Player 2)
- If Player 1, always first Move to Center square,
and then if possible use Reflection on all subsequent moves to win
- Reflection Phenomenon: always copying the opponent's move, reflected 180 degrees on the other side of the board, on every move they make, i.e. Player 1 Move 1: Center; Player 2 Move 1: Diagonal; Player 1 Move 2: Opposite Diagonal; etc.
- If Player 2, detect if Player 1 is using the Reflection Phenomenon and if so, move to a position that Player 1 cannot reflect (there are 8 such moves of the 24 available to Player 2)
- If Player 2, create Book of Opening Moves and Hints,
such as:
- If Player 1’s first move is to Center Square, then Player 2’s first move afterward should be an Unreflectable position
- If Player 1’s first move is NOT Center Square, then Player 2’s first move afterward should be the Center square
- **After the first moves**:
  - Load and search our ordered Book of Opening Moves and Hints efficiently, comprising:
- Equivalent Moves
- Hash Tables
- Implement in order:
- Minimax
- Iterative Deepening
- Alpha-Beta Pruning
- Evaluation Function
- Lesson 6.17 - 6.18
- Multiplayer Games + MAXN (for any number of players)
- 3-Player Isolation Game
- Last player still able to move wins
- Pairs of players may form alliances
- Do Not use Minimax Algorithm in multi-player games
- Evaluate game board Leafs from perspective of
each player and propagate values (as triplets)
up the tree, i.e.
- Level 3 Max triplet from Player 3’s perspective (i.e. propagate up the triplet leaf with the largest Player 3 value)
- Level 2 Max triplet from Player 2’s “
- Level 1 Max triplet from Player 1’s “
- Level 0 (Top) receives propagation from Level 1
- Lesson 6.19
- Multiplayer Games + Alpha-Beta Pruning
- Lesson 6.21
- Reading:
- References: Korf, 1991, Multi-player alpha-beta pruning
- Why might alpha-beta pruning only work well in the two-player case?
- How does the size of the search tree change with more than two players?
- Questions Asked in Forum https://discussions.udacity.com/t/lesson-6-part-21-multi-player-alpha-beta-pruning-paper-reading-by-korf-figure-7-and-section-3-5/226209
- A-B Pruning works if:
- Some players’ Evaluation Functions have an Upper Bound
- All players’ Evaluation Functions have a Lower Bound
- Evaluation Function for number of #my_moves has:
- Lower Bound of 0
- Upper Bound - calculate using this Evaluation Function:
- The Evaluation Function should sum the total number of moves remaining for each player, which should not exceed the remaining number of empty squares on the isolation board (if 3x players, divide the number of empty squares by 3 to get the moves remaining for each player). Note: The Upper Bound would change with each game tree Level.
- Simplify calculation that allows both
Immediate Pruning and Shallow Pruning,
BUT NOT Deep Pruning like before, and use
3-Player MAX-MAX-MAX Pruning (see Lesson 6.20):
- SUM of players scores (within a triplet group) must be Max of 10, i.e.
- Level 2 Max triplet from Player 2’s perspective (i.e. if a Leaf has Player 2’s [middle] value of 10 value where Player 1 and Player 3 value is 0 since Max SUM of all three is 10, then propagate up triplet Leaf with largest Player 2 value, and prune remaining leaves on that branch)
- Level 0 (Top) receives ranges based on triplets propagated up, which gets refined on outcome of propagating up subsequent subtrees, i.e. (>=3, <=7, <=7)
- When doing subsequent subtrees, apply the range to Level 1 (i.e. (>=3, <=7, <=7) ), and if the subsequent branches have player values already within those ranges but less than the player values we have propagated up in previous subtrees to Level 1, then Prune them
- Lesson 6.22 Probabilistic Games
- Stochastic Games (i.e. Backgammon)
- Note: i.e. Moves limited on each turn, based on the outcome of rolling two dice (we do not know the value of each die ahead of time; it might seem we cannot build a game tree for it, but we can)
- Game Tree
- Expectimax Algorithm to make Best Choice decisions on Moves
- Only allowed to Prune when have Known Bounds on the values that will be returned by the Expectimax Function
- If the player is going to lose anyway unless the Opposition makes a bad choice or is unlucky, then still use the Expectimax Algorithm to increase the likelihood of winning in case the Opposition makes a bad choice or is unlucky, BUT we would need to search all the way to the endgame
- When evaluating leaves, process highest leaf value first with highest probability (change order) for the left-most branch
- Expectimax Algorithm to make Best Choice decisions on Moves
- Sloppy Isolation (Variant of Isolation Game that is a Probabilistic Game)
- Add probability nodes between Levels to illustrate the probability of making the move vs a different move, with a probability between 0 and 1
- Use the #my_moves Evaluation Function at the leaf nodes: if we make the move at a leaf node, calculate how many moves will still be available for the player after the opposition's next move, multiply this value for each leaf by the probability of that move occurring compared to the others in the same branch, then sum those values across the leaf nodes
- Eliminate evaluating nodes altogether if the return is higher when looking for the Min Level
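- Sketch of an expectimax evaluation over a tree with chance nodes, as described above (the dict-based node representation is an assumption made for the example):

```python
def expectimax(node):
    """node: {'type': 'max'|'min'|'chance'|'leaf', 'value', 'children', 'probs'}."""
    if node["type"] == "leaf":
        return node["value"]                          # e.g. the #my_moves evaluation
    child_values = [expectimax(c) for c in node["children"]]
    if node["type"] == "max":
        return max(child_values)
    if node["type"] == "min":
        return min(child_values)
    # chance node: probability-weighted sum of the children
    return sum(p * v for p, v in zip(node["probs"], child_values))

tree = {"type": "chance", "probs": [0.8, 0.2],
        "children": [{"type": "leaf", "value": 5}, {"type": "leaf", "value": 0}]}
print(expectimax(tree))   # 4.0
```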
- Stochastic Games (i.e. Backgammon)
- Course Outline
- Project 2:
- Goal: Program player that beats opponent consistently
- Design different Heuristics to perform the Adversarial Search
- Analyse and compare the performance of each Heuristic
- Reference Projects by others:
  - https://github.com/morganel/AIND-Isolation-1
  - https://github.com/bdiesel/AIND-Isolation
  - https://github.com/ziyanfeng/AIND-Isolation
  - https://github.com/davidventuri/udacity-aind/tree/master/isolation
  - https://github.com/mohankarthik/AIND-P2-Isolation
  - https://github.com/mlgill/AIND-Isolation
- Reference Game Simulation https://deepmind.com/research/alphago/alphago-games-english/
- Summary of AlphaGo, excellent https://www.tastehit.com/blog/google-deepmind-alphago-how-it-works/
- Lesson 7 - Search:
- AI dev dfn: Figure out what to do when don’t know what to do
- Plan sequence of steps to figure out what to do even when do not know what first step should be until we solve the search problem
- Note: Experiment and keep measuring so you can improve. Pay attention to people by keeping them happy and productive
- A* Algorithm
- non-AI dev dfn: Write instructions to tell computer what to do when you do know what to do
- Challenge 1:
- Tri-City Search Problem
- Problem: visit 3x places on a bicycle, minimising travel distance (willing to go the wrong way down one-way streets), with minimal runtime and minimal overhead
- Step 0
- Consider Space and Time advantages/disadvantages of each algorithm
- Step 1
- Option 1: Simultaneously (trick) start Search from A, B, and C
using A* Algorithm (aka Tri-Directional A* Search),
by growing them out until first two connect, then continuing
until third location connects somewhere with first two
(avoids repeated node visits)
- Tri-Directional A* Search uses an Admissible Heuristic function that accounts for two potential goals at the same time otherwise would just be an uninformed Tri-Directional Search
- Option 2: DFS - not viable as may cross country multiple times just for first branch
- Option 3: BFS (UCS) - works, but 3x searches required,
- Sequentially Search from: A to B, B to C, and C to A, then determine Shortest two of the three paths, and set the polling station between those two paths as the next to visit and start at either of the other two cities
- Issue: Would explore more streets intersections / nodes repeatedly in the 3x searches more than necessary (duplicates), and search range is further than necessary
- Option 4: Bi-Directional Search
- Instead of each of the 3x searches requiring b^d nodes, each bi-directional search only needs about b^(d/2) nodes (two half-depth searches meeting in the middle), but would still revisit nodes in the triangle between the cities
- Challenge 2:
- Rubik’s Cube
- Problem: Goal of all same colours on the same side, with cube layers twistable vertically or horizontally. Design a search algorithm that guarantees the least qty of moves to finish from any starting state, where each 1/4 (90 degree) turn is a move
- Solution Iterative Deepening with A* Search Algorithm (IDA*), which
looks for increasingly longer solutions in series of iterations, and using a special
admissible Heuristic (search technique) described in Korf, 1997, Finding Optimal Solutions to Rubik’s Cube Using Pattern Databases,
that uses a lower-bound heuristic to prune branches when their estimated
length exceeds the current iteration bound, and a heuristic function
based on large memory-based lookup tables
or “pattern databases” (storing exact qty of moves req’d to solve sub-goals
such as individual movable cubies) and where the effectiveness of the
admissible heuristic function is characterised by its expected value (see paper
for details),
searching to a median depth of 18 for the random configs attempted.
It has since been proved that the max quarter turns required to solve the cube (God’s Number in the Quarter-Turn Metric) is 26,
using 29 CPU years of idle supercomputer time
- Note: This overcomes the issue that the search space is large (i.e. the number of nodes at depth 18 in the search tree is on the order of a quintillion), so brute-force search is intractable
- Rubik’s Cube
- “Problem Solving” AI-Agents
- i.e. theory and tech of building AI-agents to plan ahead to solve problems, where problem complexity is due to many problem states
- Conditions that must be True in order to successfully
search for a plan that solves a problem and be guaranteed of working
- Domain must be Fully Observable (i.e. must be able to see the initial state we start with)
- Domain must be Known (i.e. must know the actions that are available to us)
- Domain must be Discrete (i.e. must be finite qty of actions to choose from)
- Domain must be Deterministic (i.e. must know the result of taking an action)
- Domain must be Static (i.e. nothing else in the world can change the world except our own actions)
- State Spaces of many Types may be dealt with using Problem-Solving through Search,
such as:
- 2D
- Route finding problems
- Sliding blocks puzzle called “15 Puzzle”
- Note: requires a Heuristic function to be applied in order for the search algorithm to work
- Starting State is mixed up positions
- Approach - Slide tiles around until reach goal
- Goal - Tiles in certain configuration in order left-to-right
and top-to-bottom
- i.e.
       1  2  3  4
       5  6  7  8
       9 10 11 12
      13 14 15
- Heuristic Usage Solution Approach
- Heuristic h1 = qty of misplaced blocks
- Admissible (never overestimates, since each tile in the wrong position must be moved at least once to get to the right position)
- Heuristic h2 = sum of distances that each block would
have to move to get to right position
- Admissible (since each tile in wrong position can be moved closer to correct position no faster than one space per move)
- Heuristic h3 = always exact cost at every node (so naturally expands the least number of nodes)
- Heuristic h4 = always 0 (so naturally expands the most number of nodes of all possible heuristics)
- Note: h2 >= h1
- So with the exception of breaking ties, an A* Search using h2 always expands fewer paths than one using h1
- Note: A Heuristic that is greater than or equal to another one gets us closer to the perfect heuristic and expands at most the same qty of nodes
- Heuristic h1 = qty of misplaced blocks
- Deriving Candidate Heuristics Automatically
(aka Generating a Relaxed Problem) (by manipulating a formal
description of the problem, which has many constraints, such as making it hard to move
tiles around, and making it easier by relaxing one or more of the constraints)
- Note: By relaxing the problem, it is as though new operators are being added that traverse the state in new ways, making the problem easier to solve (never overestimates, and is admissible)
- Program that is:
- Given Description of a Problem (i.e. sliding blocks puzzle)
      Line 1 - A block can move from any tile A to any tile B
      Line 2 - if (A is adjacent to B)
      Line 3 - and if (B is the blank tile)
- Given Description of a Problem (i.e. sliding blocks puzzle)
- Loosen the restriction by:
- Crossing out Line 3, results in Heuristic h2
(since a block can move anywhere to an adjacent state)
      Line 1 - A block can move from any tile A to any tile B
      Line 2 - if (A is adjacent to B)
- Crossing out Line 3, results in Heuristic h2
(since a block can move anywhere to an adjacent state)
- Loosen the restriction again further by:
- Crossing out Line 2, results in Heuristic h1
(where a block can move anywhere, regardless of any condition)
Line 1 - A block can move from any tile A to any tile B
- Crossing out Line 2, results in Heuristic h1
(where a block can move anywhere, regardless of any condition)
- Generating Another Heuristic
- Guaranteed Admissible (as long as h1 and h2 Admissible) and
never overestimates, and is guaranteed better as gets closer to true value
h = max (h1, h2)
- Note: Problem with combining heuristics is that there is cost to compute
- Guaranteed Admissible (as long as h1 and h2 Admissible) and
never overestimates, and is guaranteed better as gets closer to true value
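- Sketch (illustrative, not the course's code) of h1, h2 and the combined h = max(h1, h2) for the sliding-block puzzle, assuming a state given as a flat tuple with 0 as the blank:

```python
def h1(state, goal):
    """Number of misplaced tiles (ignoring the blank, 0)."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def h2(state, goal, width=4):
    """Sum of Manhattan distances of each tile from its goal position."""
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue
        goal_idx = goal.index(tile)
        total += abs(idx // width - goal_idx // width) + abs(idx % width - goal_idx % width)
    return total

def h(state, goal):
    """Still admissible if h1 and h2 are, and never worse than either."""
    return max(h1(state, goal), h2(state, goal))
```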
- 3D
- i.e. Abstract properties other than x-y position on a plane
- Robot vacuum cleaner world state space
- In either of only 2x possible positions
- Each position may or may not have dirt in it (additional property to deal with)
- States Space required: 2x2x2 = 8 states
- 2x physical states (State A or B)
- 2x dirty states for State A (dirty A-dirty or not A-not-dirty)
- 2x dirty states for State B (dirty B-dirty or not B-not-dirty)
- 2x physical states (State A or B)
- Actions connect between each State
- Example 1 (Simple):
- Action of Move Right - Changes to State with position to the right that is dirty
- Action of Sucking - Change to State with same position but no longer dirty
- Example 2 (More Complex):
- Given possible Robot positions 10 OFF
- Given Robot possible positions
- 10 possible positions
- Given Power Switch in one of 3 possible positions
- On
- Off
- Sleep
- Given Robot with Dirt Sensing Camera either
(2 possible positions):
- On
- Off
- Given Robot with vacuum Brush Heights 5 OFF
(5 possible positions):
- 1-5
- Given each position may be Dirty or Not Dirty
(2 possible states per position, giving 2^10 combinations over the 10 positions)
- Given Robot possible positions
- State Space has qty of states of:
- Cross product of all variables (since each independent and any combination may occur)
- = 3 * 2 * 5 * 2^10 * 10 = 307,200 states
- Given possible Robot positions 10 OFF
- In either of only 2x possible positions
- 2D
- Paths in the State Space
- Linked List of Nodes implement the Paths within a computer algorithm,
where a Node is a Data Structure that has four Fields.
- State Field - indicates the state at the end of the Path (i.e. B)
- Action - the action taken to get there (i.e. A->B)
- Cost - the total path cost so far (i.e. 234)
- Parent - a pointer to the previous node (i.e. the one ending at state A), or a null pointer if no parent
- Note:
- Path - abstract idea
- Node - same thing as Path, but terminology for representation in computer memory
- Linked List of Nodes implement the Paths within a computer algorithm,
where a Node is a Data Structure that has four Fields.
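- Sketch of the four-field Node data structure described above (state, action, cost, parent), written as a Python dataclass for illustration:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    state: Any                       # state at the end of the path (i.e. B)
    action: Optional[Any] = None     # action taken to get here (i.e. A->B)
    cost: float = 0.0                # total path cost so far (i.e. 234)
    parent: Optional["Node"] = None  # pointer to the previous node, or None at the root

def path_to(node):
    """Follow parent pointers back to the root to recover the full path of states."""
    states = []
    while node is not None:
        states.append(node.state)
        node = node.parent
    return list(reversed(states))
```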
- Data Structures that deal with Nodes/Paths and their implementation
- Frontier
- Operations
- Remove best item
- Add new items
- Membership test (i.e. is item in frontier?)
- Implementation
- Priority Queue (knows how to keep track of best items in proper order)
- Representation
- Set (i.e. built from Hash Table or Tree)
- Build
- Hash Table
- Tree
- Operations
- Explored List
- Operations
- Add new members
- Check for membership
- Implementation *
- Representation
- Single Set
- Build
- Hash Table
- Tree
- Operations
- Frontier
- Example 1: Navigation WITHOUT Fog Problem
- Initially many choices
- Complexity in Sequence of Actions (i.e. of picking right choice of turn at each intersection)
- Example 2: Navigating WITH Fog Problem
- Complexity in Partial Observability (i.e. if unable to see through fog where possible paths are, actions themselves are unknown, and can’t see result of actions)
- Example A: Route-Finding 2D Problem (Partial Observability)
- Given
- Start location (i.e. Arad)
- Destination location (i.e. Bucharest). Note: Corner of map
- Partial map not showing the Destination
- Problem:
- Find route from Start to Destination location
- Actions executable by AI-agent is driving from city-to-city along roads shown on a map
- Question:
- Is there a solution the AI-agent can return?
- Answer:
- No, not if Start or Destination are not on the map
- Given
- Example B: Route-Finding Problem (Full Observability)
- Given
- Same as above but Full map showing Destination
- Problem
- Same as above
- Question
- Same as above
- Answer
- Yes, many possible Sequences of Actions (routes) chained together that each achieve the goal
- Given
- Defining a Problem
- Break down into components
- Initial State (of AI-agent) - s0 (i.e. being in Arad)
- Actions Function takes state input argument and returns
set of possible actions executable when agent in the state
(i.e. sometimes same actions available in multiple states)
(i.e. sometimes actions are dependent on the state) (i.e. possible to go to
neighboring cities but not any city in one move)
- myActionFunc(s) -> {a1, a2, a3}
- Result function takes state and action as input argument
and returns new state as output
- myResultFunc(s, a) -> s’
- i.e.
- State - Initial AI-agent state - s0 (in Arad)
- Action - Go via route E671 to Timisora
- Result - New State where AI-agent in Timisora (of applying Action in given State)
- GoalTest function takes state and returns Boolean whether state is goal or not (i.e. only True in final destination)
- Path Cost function takes a path, sequence of transitions (from performing Actions to move between States), and returns Cost of the path. Note: Usually just additive (i.e. Path Cost just sum of cost of individual Action steps) so implement the Path Cost function in terms of a Step Cost function (i.e Path cost is just the sum of the Step Costs along a path)
- Step Cost function takes a State, Action, and the
Resulting State from that Action, and returns the Action Cost as a
number (n)
- i.e. in route finding example this cost may by qty of kms travelled or minutes taken to get to destination)
- Break down into components
- Apply to Route Finding example the Definition of a Problem
- Given Initial State of Arad
- Given Goal State of Bucharest
- Given possible Actions of roads to follow when in specific city
- Sequences of Actions are built from paths we follow
- Note: Path Length of 0 when just in Initial State
- Note: Path Length with “X” Sequences of Actions from Initial State has Path Length of “X”
- Note: State Space is the set of all states
- Note: Actions are used to navigate the State Space
- Note: Step Cost of going between two adjacent States are shown
- Note: Path Cost for a path is sum of Step Costs along that path
- Note: At each Point we separate State into 3x different components:
- Frontier consists of the States at the end of the Paths that have been explored
- Explored region States before the end of the Paths that have been explored
- Unexplored region States after the end of the Paths that have been explored
- Define a Function for Solving Problems called Tree.Search
- Note: Tree.Search represents a family of algorithms/functions, not a single algorithm, which superimposes a search tree over the state space
- Note: It initialises the frontier to be the path consisting of only the Initial State, then a loop checks if anything is still left in the Frontier, otherwise it fails with no solution. If something is left in the Frontier, we make a choice (using the particular Tree.Search family member's rule) of one of the paths in the Frontier, and remove that path from the Frontier. If the State at the end of the Path is a Goal State we return success; Else we Expand that Path by iterating over all possible Actions from that State and adding to the Path each Action and the result of that Action to get a New Path (which has the Old Path, the Action, and the result of the Action). All New Paths are added to the Frontier
- Tree.Search Family of Algorithms - Commonalities and Differences
- Commonalities:
- All algorithms look at Frontier, popping items to check against Goal Test
- Differences between the algorithms:
- The “choice” of which path to look at first (i.e.
remove.choice
of how you expand the next item of the Frontier)
- The “choice” of which path to look at first (i.e.
- Commonalities:
- Search Algorithms
- Breadth-First Search (BFS) Algorithm (aka Shortest steps first)
- Dfn: i.e. Searches by expanding the shortest paths first, since we always choose from the Frontier one of the paths that hasn't been considered yet, which is the shortest possible
- Optimal and Complete (if goal at finite depth)
- Cheapest-First (Uniform Cost) Search (UCS) Algorithm (aka Cheapest cost first) (aka Dijkstra’s algorithm)
- Dfn: i.e. Searches by expanding first for path with cheapest possible total cost first
- Optimal and Complete (if goal has finite cost)
- Depth-First Search (Opposite of Breadth-First Search)
- Dfn: i.e. Searches by expanding first the longest path, with the most links in it
- NOT Optimal, and NOT Complete (since if infinite path, DFS would keep following it and never get to path consisting of the goal)
- DFS only uses Storage Space of n nodes (instead of 2^n nodes in BFS and CFS), so it saves a huge amount of space. But if we also track the “Explored Set” we do not save as much.
- Greedy Best-First (GBF) Search Algorithm (UCS focussing on Estimate)
- Dfn: i.e. Searches by expanding first the path that’s closest to the goal according to the Estimate (the search leads us directly toward the goal, searching in ellipses toward the goal instead of circles searching in all directions, even away from the goal)
- Note: If an obstacle/barrier is in the way, it may lead us to the goal via a longer path than necessary, if we continue toward the goal without back-tracking to states farther from the goal and exploring in the opposite direction around the barrier
- By contrast, Uniform Cost Search expands in terms of Contours, like on a topographical map, first expanding to a certain distance, then to a farther distance, etc., from the initial state until at some point we reach the goal state. That search is not directed toward the goal; it expands everywhere in the space, so we should expect to explore half the state space on average to find the goal, and when the state space is large we won't reach the goal fast enough. To find the goal faster than with UCS, we add more Knowledge to the search by using an Estimate of the distance from a State to the Goal State. In the Route-finding problem we can move in any direction (up, down, left, right), but for the Estimate we use the straight-line distance between a State and the Goal, and use that Estimate to find our way to the goal fastest.
- A* Algorithm (aka Best estimated total path cost first)
(combines UCS and GFS)
- Dfn: i.e. Searches by always expanding path with minimum value of the
Heuristic function f (defined as sum of g and h components) so result is a search
strategy that is the best possible (i.e. it finds shortest length path
whilst expanding minimum qty of paths possible),
where g of a path equals path cost,
where h of a path equals h value of the state (final state of the path),
which equals the estimated distance to the goal
- Heuristic Function (generated by relaxing the problem definition)
      f = g + h
      g(path) = path cost (sum of step costs from the initial state to the current state)
      h(path) = h(s) = estimated straight-line distance to the goal (from the current state)
- Heuristic Function (generated by relaxing the problem definition)
- Note: the Heuristic function h finds the lowest cost path since,
when A* Search ends, it returns path P with estimated cost C,
where C is also the actual cost, since the h component is 0 at the goal state
(so the estimated total cost equals the actual path cost).
- Path P must have a cost less than true cost of other paths on Frontier.
- Paths beyond the Frontier must have cost greater than Path P’s cost, since step cost is always 0 or more
- Note: All paths on Frontier have estimated cost > C, since Frontier is explored in cheapest-first order
- Note: Minimising g helps keep path short
- Note: Minimising h helps keep focused toward finding goal
- Conditions under which the A* Algorithm will find the lowest cost
path to the goal: A* will always find the lowest cost path to the goal provided
that the heuristic estimate function h for a state never exceeds the
true cost of the path to the goal through that state:
- the h function must never overestimate the distance to the goal
- h is Optimistic
- h is Admissible (since admissible to use it to find lowest cost path)
- Goal test works by taking paths off the Frontier, not putting paths on the Frontier
- Additional Heuristic
- Use additional heuristic of the Straight Line distance between a State and a Goal, for each State
- Combines Greedy Best-First Search of exploring a smaller number of nodes directed toward the goal state, and Uniform Cost Search that is guaranteed to find the shortest path to the goal state
- Example
- Given a found path through state space to State X, and trying to give value to the path. The measure f is sum of g (path cost so far) and h (estimated distance remaining for path to get to goal)
- Note: Resolve ties in left to right order
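- Sketch (a minimal illustration, not the lecture's exact pseudocode) of A* over a graph given as an adjacency dict with step costs, using f = g + h with a caller-supplied heuristic:

```python
import heapq

def a_star(graph, start, goal, h):
    """graph: {state: [(neighbour, step_cost), ...]}, h(state): estimated cost to goal."""
    frontier = [(h(start), 0, start, [start])]    # (f, g, state, path)
    explored = set()
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if state == goal:                          # goal test on removal, not insertion
            return path, g
        if state in explored:
            continue
        explored.add(state)
        for neighbour, step_cost in graph.get(state, []):
            if neighbour not in explored:
                g2 = g + step_cost
                heapq.heappush(frontier, (g2 + h(neighbour), g2, neighbour, path + [neighbour]))
    return None, float("inf")
```

Setting h to always return 0 reduces this to Uniform Cost (Cheapest-First) Search.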
- Optimality Property of Algorithms (i.e. guaranteed to find the shortest path, cheapest total cost
path, and longest path respectively)
- Yes, Optimal - Breadth-First Search
- Yes, Optimal (assuming individual step costs are non-negative) - Cheapest-First Search
- Not Optimal (if more than one goal state) - Depth-First Search
- Storage Requirements Property of Algorithms
- Given a state space of “very large” Binary Tree (each level down
the nodes double)
- BFS “Frontier” requires at Level n a Storage Space of 2^n paths
- CFS “Frontier” determines a sort of contour of cost but with a similar total number of nodes as BFS as Storage Space
- DFS goes all the way down levels of branches to leaf and then back up working left to right, but at any point the “Frontier” has only n nodes (instead of 2^n nodes). If also tracking the “Explored Set” then do not get as much savings over BFS and CFS.
- Completeness Property of Algorithms
- i.e. if there is a Goal State somewhere, will the algorithm find it?
- Given a state space of infinite trees where a Goal State is hidden deep down
in the tree then the algorithms guaranteed to find a path to the goal include:
- BFS, CFS, but NOT DFS
- Breadth-First Search (BFS) Algorithm (aka Shortest steps first)
- Tree.Search Family of Algorithms - List of Algorithms
- General Tree Search (Non-Graph Search) using Breadth-First Search (BFS) Algorithm
      Tree.Search (problem p) returns path
          frontier = { path(p.initial) }
          loop:
              if frontier is empty: return FAIL
              path = remove.choice(frontier)
              s = path.end
              if GoalTest(s): return path
              for a in p.Actions(s):
                  add [path + a -> Result(s,a)] to frontier
- Steps
- Sequence of Paths 1:
- Pick Initial State with Path Length 0 (only path in Frontier, so is shortest)
- Expand Initial State, adding in all paths that result from applying possible Actions
- Sequence of Paths 2:
- Remove Sequence of Paths 1 from Frontier
- Find Shortest possible path resulting from next expansion
- Get all new paths resulting from applying possible Actions from Initial State
- If multiple new paths, then pick the shortest one
- If all have same length, then break the tie at random, or use a technique
- Sequence of Paths 3:
- Remove other Sequence of Paths 2 from Frontier (i.e. keep only the shortest path)
- Add new paths to the Frontier, those that extend one level from new State (and including the back-tracking path back to the previous state that is also now part of the Frontier)
- See Graph Search approach from here
- Sequence of Paths 1:
- Note: Build a tree, always keeping track of the Frontier (ends of paths we haven't explored yet) as the search space is explored, whilst behind the Frontier is the set of Explored states, and ahead are the Unexplored states
- Note: Explored States are tracked for when we want to expand and we find a duplicate,
so we can point back to a previously seen state (without having to create a new state in the
tree) and make a regressive step into the previously Explored state
- See slide “7.9. Tree Search Continued”
- Steps
- Graph Search Algorithm using Breadth-First Search (BFS) Algorithm
      Graph.Search (problem p) returns path
          frontier = { path(p.initial) }; explored = { }
          loop:
              if frontier is empty: return FAIL
              path = remove.choice(frontier)
              s = path.end; add s to explored
              if GoalTest(s): return path
              for a in p.Actions(s):
                  add [path + a -> Result(s,a)] to frontier
                      unless Result(s,a) in frontier or explored
- Avoids repeated paths by modifying the Tree.Search function into the Graph.Search function
- Steps
- Sequence of Paths 3:
- With Graph Search
- After we initialise “Explored Set” of paths already explored
- And after exploring new path, we add new state to set of “Explored Set”
- So when expanding the path and adding new states to end of it, we don’t add an end if we’ve already seen that new state before in either the “Frontier” or the “Explored Set”
- With Graph Search
- Sequence of Paths 4
- When using Breadth-First Search (BFS) Algorithm, the paths that are candidates to be explored next are the “shortest” ones (i.e. length 1) that haven’t already been explored yet
- When we try to add new paths to the Frontier from this “shortest” state, if we find that those that extend one level from new State are either already in the “Frontier” or the “Explored Set”, then we won’t add them, and instead try the next “shortest” path, and try and add new paths from it to the Frontier
- Sequence of Paths 5
- Now with Frontier states that are all length of 2, we choose one of them,
- We then add state paths to the Frontier that expand one level from this chosen state (but not the back-tracking path state since we’re doing Graph Search), and determine whether we terminate the algorithm at this point (if we’ve reached Goal State) or if continue
- Note: The Goal Test isn’t applied when we add a goal node path to the Frontier, but
instead when we remove a path from the Frontier. This is because it may not be
the BEST path to the Goal.
- Note: Optimisation may be made if using Breadth-First Search by changing the algorithm and writing a specific BFS Routine so that it checks the state as soon as they are added to the Frontier and gives result as soon as the goal state is added to the Frontier and terminates early (instead of after removing it from the Frontier by waiting until they are expanded), when we know there isn’t the possibility of having a shorter path to the goal of say length 2.5 instead of 3
- Sequence of Paths 3:
- Note: Breadth-First Search
- Will find shortest path in terms of least number of steps
- Will NOT find shortest path in terms of shortest total cost (by adding up the step costs), perhaps try Uniform Cost Search instead
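- Sketch of a breadth-first Graph Search over an adjacency dict (a minimal illustration of the pseudocode above; it tests new states against both the frontier and the explored set before adding them):

```python
from collections import deque

def breadth_first_graph_search(graph, start, goal):
    """graph: {state: [neighbour, ...]}; returns the path with the fewest steps."""
    frontier = deque([[start]])
    explored = {start}
    while frontier:
        path = frontier.popleft()           # shortest (oldest) path first
        state = path[-1]
        if state == goal:
            return path
        for neighbour in graph.get(state, []):
            if neighbour not in explored:   # skip states already seen on the frontier/explored
                explored.add(neighbour)
                frontier.append(path + [neighbour])
    return None
```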
- Graph Search Algorithm using Uniform Cost Search (UCS) Algorithm
- Steps
- Initial State “Unexplored” - Step 0
- Change Initial State to “Explored”
- Expand out, looking at and adding paths out of that State (i.e. 3 of them) to the “Frontier”
- Pick path X to be expanded next (using UCS rules of “Cheapest total cost first”)
- Remove that path X from the “Frontier” and add X to the “Explored”
- Expand out, looking at and adding paths out of State X (without back-tracking) to the “Frontier”, and sum the total cost of those paths (across steps from the initial state) and put them on the “Frontier” to compare
- Pick path Y branching from initial state to be expanded next (using UCS rules of “Cheapest total cost first”)
- Remove that path Y from the “Frontier” and add Y to the “Explored”
- Add new paths to neighboring/successor states from Y (without back-tracking) to the “Frontier” and sum the total cost of those paths (across steps from initial state) and put on “Frontier” to compare
- Repeat picking path, etc
- Note, cannot terminate the algorithm until reach Goal State and have popped that cheapest path off the “Frontier” (i.e. continue searching to see if a better cheaper path that also reaches the Goal State)
- Steps
- General Tree Search (Non-Graph Search) using Breadth-First Search (BFS) Algorithm
- AI dev dfn: Figure out what to do when don’t know what to do
- Lab 7: Pac-Man
- Help Pac-Man navigate his world in the most efficient way to find food and other objects by implementing BFS, DFS, and A* Search. Opportunity to create a visually good app for portfolio.
- Reference http://inst.eecs.berkeley.edu/~cs188/pacman/project_overview.html
- Udacity Forums https://discussions.udacity.com/c/nd889-search-optimization/nd889-lab-pac-man
- Lesson 8 - Simulated Annealing (in family of Local Search Algorithms):
- Reading: AIMA Chapter 4.1, + 4.2-4.5
- References: Different viewpoints and extensions
- Dfn: Used in AI when State Space is large and other techniques required to search it and find optimal solutions.
- Example Problems: the N-Queens problem, where N Queens are set up on a board in such a way that they cannot attack each other
- Iterative Improvement Problems and Algorithms
- Example #1: Travelling Salesman Problem
- Problem:
- Given a Salesman
- Given 5x cities possible to visit
- Rules:
- Must visit all 5x cities before tour is finished
- Must return to starting city at finish
- Goal: Find most efficient order to visit the cities to minimise overall distance flown
- Note:
- This is an NP-hard Problem
where all NP Problems are about as hard as each
other (exponentially difficult)
- N - Non-Deterministic
- P - Polynomial Time
- Use tricks to solve problem efficiently
- Solution
- Approaches
- Randomly connect the cities
- Identify where paths cross
- Iterative Improvement Algorithm: applied by adding a small amount of intelligence (i.e. iteratively revising the figure to uncross each pair of crossing paths to reduce the distance travelled), which gets very close to the optimum solution (i.e. do this with 1000's of cities and the result will be within 1% of the optimum solution); see the sketch below
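- Sketch of the uncrossing idea as a simple 2-opt style improvement loop (a minimal illustration; `dist` is any symmetric distance function you supply):

```python
def tour_length(tour, dist):
    """Total length of a closed tour (returns to the starting city)."""
    return sum(dist(tour[i], tour[(i + 1) % len(tour)]) for i in range(len(tour)))

def two_opt(tour, dist):
    """Keep reversing segments while doing so shortens the tour (uncrosses crossed edges)."""
    improved = True
    while improved:
        improved = False
        for i in range(1, len(tour) - 1):
            for j in range(i + 1, len(tour)):
                candidate = tour[:i] + tour[i:j][::-1] + tour[j:]
                if tour_length(candidate, dist) < tour_length(tour, dist):
                    tour, improved = candidate, True
    return tour
```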
- Approaches
- Problem:
- Example #2: Goodness graph
- Problem:
- Given an N-dimensional x-y axis graph that represents a Goodness Function that we are trying to Optimise for a problem
- Find the Maximum point along it
- Solution:
- Approach #1:
- Hill Climbing - start somewhere (say the right-most point) and improve the answer by travelling up the gradient toward one of the maximum points; the problem is we may get stuck at a Local Maximum when there are multiple maxima. Unstick from a Local Maximum to find the Global Maximum by using these techniques:
- Randomness Techniques to use to get over problem
of getting stuck :
- Random Restart
- Start many particles at random positions on graph, then Hill Climb, and take best results
- Genetic Algorithms
- Select positions based on Fitness Function to breed children that mutate and eventually converge on solution
- Simulated Annealing
- Analogy of gradual cooling temperature of particle from very high temperature to make it move in progressively less random ways while hill climbing until converge on Global Maximum
- Stochastic Beam Search
- Combines the ideas of Random Restart, Genetic Algorithms and Simulated Annealing, such that multiple particles are released on graph and their children generate progressively less randomly and eventually converge on the maximum value
- Random Restart
- Note that Minimax is only relevant to Game playing, whilst UCS and BFS are difficult to apply to continuous 2D graphs since infinite steps are possible, and steps may need to be infinitely small to avoid skipping the solution
- Approach #1:
- Problem:
- Example #3: N-Queens Problem
- Problem
- Given N queens placed randomly on an N*N chessboard, we want positions where they cannot attack each other, i.e. at most one queen on any one horizontal, vertical, or diagonal line
- Find queen positions to reduce attacks down to zero
- Solution
- N-Queens Heuristic Function
- Try the stupid thing first, then add intelligence until solve the problem
- Move queens to separate axis
- Constraint Situation
- i.e. only allow each queen to move up/down its column (i.e. Hill Climbing so only allow move left/right along x-axis of graph, so not too many pieces to move)
- Try to move the queen subject to the most attacks first (the move that improves the situation the most)
- Test First Move - See how many attacks can be reduced by moving each of the queens to different positions up/down its column, comparing to find the lowest resulting number of attacks
- If more than one move reduces attacks by same amount then choose one randomly
- Iterate until reach goal of 0 attacks
- Local Minimum: a situation where No move decreases the qty of attacks (every move keeps it the same or increases it), i.e. no move improves the situation
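- Sketch of this hill-climbing heuristic (a minimal illustration, not the lecture's code): one queen per column, each queen only moves up/down its column, ties are broken at random, and the search stops at a local minimum (where a random restart would take over).

```python
import random

def attacks(queens):
    """queens[col] = row; count attacking pairs (same row or same diagonal)."""
    n, count = len(queens), 0
    for a in range(n):
        for b in range(a + 1, n):
            if queens[a] == queens[b] or abs(queens[a] - queens[b]) == b - a:
                count += 1
    return count

def hill_climb(n=8, max_steps=1000):
    queens = [random.randrange(n) for _ in range(n)]
    for _ in range(max_steps):
        current = attacks(queens)
        if current == 0:
            return queens
        # try every single-queen move within its column and keep the best improvements
        best_score, best_moves = current, []
        for col in range(n):
            for row in range(n):
                if row == queens[col]:
                    continue
                candidate = queens[:col] + [row] + queens[col + 1:]
                score = attacks(candidate)
                if score < best_score:
                    best_score, best_moves = score, [candidate]
                elif score == best_score and score < current:
                    best_moves.append(candidate)
        if not best_moves:
            return queens            # stuck at a local minimum (random restart from here)
        queens = random.choice(best_moves)
    return queens
```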
- N-Queens Heuristic Function
- Problem
- Randomness Technique - Other Usages
- Particle Filters Technique
- Pattern Recognition with Hidden Markov Models
- Monte Carlo Markov Chains
- Random Restart (with use of Improvements)
- Given the following:
- Objective Function (y-axis) in State Space (x-axis) Graph
- Shoulder - positive gradient interrupted part-way with no increase so can get stuck here like at Local Maximum
- Global Maximum - largest maximum in graph
- Local Maximum - smaller maximum along the graph
- “Flat” local maximum
- Objective Function (y-axis) in State Space (x-axis) Graph
- Goal:
- Find Global Maximum and get Unstuck if encounter a Local Maximum by using Random Restart
- Steps
- Repeat the following:
- Randomly chooses a particle on the graph
- Tabu Search Trick: skip to the next position if we detect this is a place we've been before, since we are tracking prior places
- Heuristic #1 - Large Step size and reduce over time
- Start with Large Step size and decrease it over time to ensure we reach the Global Maximum
- Heuristic #2 - Simulated Annealing (Alternative to Heuristic #1)
- Potential Issues
- Step Size Too Small Issue - may land on a flat plateau or Shoulder where not enough info is available to determine which way to go (to improve the answer or check if we reached the maximum) and may result in wandering, so only allow a max number of steps with no improvement in score; however, with a small enough step size we may think we are not improving and wrongly conclude the algorithm should stop
- Step Size Too Large Issue
- Step across to different Hill Issue - may miss the intended hill altogether and go up a different hill instead
- Infinite Loop Risk since if step big enough to step over to other
side of hill where alternate incline direction and step big enough
over again and repeat (oscillating and not converge on answer)
- Solution Reduce Step Size if detect any oscillations
- Follow up the positive gradient steps until reach Maximum
- Check if its the Global Maximum
- Stage Algorithm Variant - keep track of the Maxima found so far and try to model the shape of the problem to predict where a new Maximum may be among the places not yet explored
- Keep track of places we have been before
- Randomly chooses a particle on the graph
- Find the Maximum of all the iterations
- Repeat the following:
- Given the following:
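- Note: a minimal Python sketch of Random Restart hill climbing on a 1-D objective; the fixed step size and the toy Goodness Function `f` are illustrative assumptions, not from the lecture:
```
import math
import random

def hill_climb(objective, x, step=0.01, max_steps=10_000):
    """Follow the positive gradient (by sampling left/right) until no neighbour improves."""
    for _ in range(max_steps):
        best = max((x - step, x + step), key=objective)
        if objective(best) <= objective(x):
            return x            # local maximum (or plateau) reached
        x = best
    return x

def random_restart(objective, lo, hi, restarts=20):
    """Release many particles at random positions, hill climb each, keep the best result."""
    best = max((hill_climb(objective, random.uniform(lo, hi)) for _ in range(restarts)),
               key=objective)
    return best, objective(best)

# e.g. a bumpy toy objective with several local maxima
f = lambda x: math.sin(3 * x) + math.sin(x)
print(random_restart(f, 0.0, 10.0))
```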
- “Simulated” Annealing
- Dfn: Repeat the Annealing process according to a formula and timing so as to refine toward the desired properties
- Dfn: “Simulated” since we are borrowing ideas from physicists
such as energy minimisation where:
- When external conditions may result in molecules becoming mobile, and where their mobility slowly reduces, they then arrange themselves into Minimum Energy Configuration, which result in regular Patterns
- i.e. mud cracks where decrease in water in mud reduces molecule mobility over time and causes mud cracks to change into regular Patterns
- i.e. iron molecules compact into a Minimum Energy Configuration forming a lattice with other molecules like carbon, useful for sword making where a mixture of hardness and ductility is needed (the Annealing process repeatedly heats to a temperature where atoms can move about and form new structures with carbon and other molecules, then cools the iron molecules so they preserve alignment in the desired lattice structures, so swords are hard but not brittle; recrystallisation in sword making)
- i.e. honeycombs with Minimisation by Design, where bees optimise the storage space against the building materials used for the structure
- i.e. lava cooling slowly and rocks form
- i.e. columns of basalt rock where when it cools it shrinks and cracks into hexagonal lattice which is a Minimum Energy Configuration
- Note: Use idea of Heating and Cooling to get Unstuck from Local Minima
toward finding Global Maxima, where the following applies until
converge on solution:
- High Temperature equates to Higher Randomness
- Gradual Cooling results in Decreasing Randomness
- Algorithm:
```
# Schedule of T values that is guaranteed to converge to the Global Maximum
# if it starts with very high values of T (temperature), causing lots of random
# graph motion as we take all random positions offered to us, even bad ones,
# and climb up and down hills; then we decrease T slowly enough until it is
# very small with no randomness, where we just Hill Climb to the nearest peak.
#
# When temperature T is VERY HIGH (approaching infinity), the probability
# e^(change_in_E/inf) = e^0 = 1 approaches 1 (even if change_in_E is negative).
#
# When T is VERY SMALL (the opposite situation), normal Hill Climbing occurs
# instead of random moves, e.g. with T = 0.01 (near 0; do not want T to be
# precisely 0 as that gives an undefined answer):
#   - if the new random position improves score E, take it (up the gradient)
#   - if the new random position does not change E, then change_in_E = 0
#   - if the new random position worsens the score, then change_in_E < 0
#     (down the gradient) and e^(change_in_E/T) = e^(-1/0.01) = e^-100
#     (a very small number), so there is almost no chance of taking it
#
# If stuck at a flat plateau (i.e. a shoulder) where change_in_E == 0, then
# regardless of T's value e^0 = 1, so the algorithm reverts to taking new
# random positions until it walks off the plateau to a position with positive
# gradient that can improve E toward the maximum peak.
#
# Similarly in "real-world" annealing, when the temperature is very high the
# particles jump around a lot at the start, so if the particle gets stuck at a
# Local Maximum then, given sufficient time and randomness, it can leave that
# peak (even going the wrong way and ending up at a different part of the
# graph); since T is high, the next random point may move it both down and up
# the slope until it hits the Global Maximum.

for t = 1 to inf do
    T <- schedule(t)
    if T = 0 then return current
    # select a point randomly in the nearby region
    next <- a randomly selected successor of current
    # take the point if it has an improved value
    if change_in_E > 0 then current <- next
    # take a poorer point too, but only with a probability
    else current <- next only with probability e^(change_in_E/T)
```
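- Note: a self-contained Python sketch of the same idea (this is not the AIMA implementation, which is quoted in Lab 8 below); the exponential cooling constants and the `value`/`neighbor` helper names are illustrative assumptions:
```
import math
import random

def simulated_annealing(value, neighbor, state, t_max=10_000):
    """Maximise value(state); neighbor(state) must return a random successor."""
    for t in range(1, t_max):
        T = 20 * math.exp(-0.005 * t)       # cooling schedule: high T -> mostly random walk
        if T < 1e-6:
            return state                     # T ~ 0: plain hill climbing has finished
        nxt = neighbor(state)
        delta_e = value(nxt) - value(state)
        # always accept improvements; accept worse moves with probability e^(delta_e / T)
        if delta_e > 0 or random.random() < math.exp(delta_e / T):
            state = nxt
    return state
```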
- Local Beam Search
- Given a graph, uses k multiple particles (positions) instead of just one
- At each time-frame inspect all randomly generated neighbors of each particle and compare them with each other and retain the k best particles for the next iteration.
- Shares information between each particle and iteration, which differs from Normal Random Restart that doesn’t share any info between iterations
- Terminate if any particle reaches a Goal State
- Stochastic Beam Search
- Combines the ideas of Random Restart, Genetic Algorithms and Simulated Annealing, and
- Same as Local Beam Search but more useful, since successors are chosen not just based on fitness, but also based on Randomness so we don’t get stuck in a Local Maximum (similar to Simulated Annealing within class of Genetic Algorithms)
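- Note: a rough sketch of Stochastic Beam Search, assuming `value` and `neighbors` helpers like those in the annealing sketch above and non-negative values; survivors are sampled in proportion to their fitness rather than taking the strict top-k (which would be plain Local Beam Search):
```
import random

def stochastic_beam_search(value, neighbors, start_states, k=10, iterations=100):
    """Keep k particles; sample the next generation in proportion to fitness."""
    beam = list(start_states)
    for _ in range(iterations):
        pool = [s for state in beam for s in neighbors(state)]   # all successors of all particles
        if not pool:
            break
        weights = [max(value(s), 1e-9) for s in pool]            # assumes values are non-negative
        beam = random.choices(pool, weights=weights, k=k)        # fitness-proportional + random
    return max(beam, key=value)
```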
- Genetic Algorithms (aka, second best solution to ANY problem)
- Dfn: Genetic Algorithms is analogy to Natural Selection in Biology that uses Breeding and Mutation to find Optimal answer to problem (fancy version of Stochastic Beam Search that happens to make a nice analogy with biology)
- Reducing Randomness over time (like in Simulated Annealing)
- Fitness Function reduces the Randomness as each Generation gets closer to solution
- Additional measures to Optimise and converge quickly such as:
- Tuning Parameters
- Crossover
- Mutations qty
- Parents qty
- Tuning Parameters
- Convention to represent given board
- Given one Queen in each column, encode a board with a number shown below each column that indicates where Queen is (i.e. grid distance of the Queen from the bottom) (i.e. 24748552, for 8x8 board)
- Given 8x8 board, where there are 8 Queens, then after examining
every possible pair of Queens that could attack each other we find there are
28 pairs.
8 choose 2 = 8! / ((8-2)! * 2!) = 28
- Goal is to reduce qty of attacking pairs of Queens to 0,
so use Fitness Function as follows, such that when it is
equal to 28 we know we have Won:
Fitness Function for board = Max qty of attacking pairs of Queens - qty of attacking pairs for the given board = 28 - qty of attacking pairs for the given board
- N-Queens Problem using Genetic Algorithms
- Gene Pool Choose 4x random boards (i.e. 24748552, …, …, …)
- Combine Genes from Gene Pool in different Patterns to Breed better boards
- Choosing Genes from Gene Pool
- Evaluate each board state in initial population using Fitness Function to get a Fitness Score Value (aka qty of non-attacking pairs of Queens) for each is calculated by going through each column from left-to-right and looking only to the right for each Queen to see if it attacks another (to prevent duplicates)
- Proportional Probabilities (%) based on the Fitness Scores indicative of how likely each board is to be chosen as a Parent and breed, when then list them in descending order of % (i.e. add the 4x scores and Normalise each into a percentage)
- Randomly Select 4x Parents from the 4x Boards and order from best to worst (best on top) (i.e. given 4x boards with 31%, 29%, 26%, and 14% respectively, by rolling a 100 sided dice to select 1st Parent, and if it rolls between 1 and 31 we select the 1st board, 32 to 60 (i.e. 31% + 29% = 60%) we select 2nd board, 61 to 76 we select the 3rd board, and 77 to 100 we select the 4th board)
- Create 2x Children from each Pair of the 4x Parents using a Crossover Process (see Lesson 8.21): group the Parents into Pairs and pick a random Crossover Point (between two columns) for each Pair. Create the 1st Child by combining the LHS of the Crossover Point from Parent 1 with the RHS from Parent 2, and repeat by combining the LHS of Parent 2 with the RHS of Parent 1 (for the 1st Pair), then repeat for the remaining Pairs. This way at least one of the Children may inherit the best attributes of both Parents
- Change Parameters in Genetic Algorithms to try and Optimise how quickly we converge to a good result, such as the qty of Children that each Parent may create
- Note: Survival of the fittest - with an increasing number of Generations of Children we eventually evolve an 8-Queens game board that solves the problem; boards with fewer attacking Queens have a higher Fitness Function value and a higher chance of creating children, whereas boards with more attacking Queens have a lower Fitness Function value and less chance of creating children
- Genetic Algorithms Mutation
- Problem
- A Parent whose Fitness Function gives it a low Proportional Probability may never be selected to breed, even though it holds a critical piece of the solution that never appears from breeding the other Parents
- Solution
- Genetic Mutations add Randomness after Crossover with a Mutation step that avoids the risk of never reaching the goal state (similar to occasionally choosing a random direction in Simulated Annealing, and similar to the randomness in Stochastic Beam Search), whereby each digit (queen position) of a given board has a small but significant chance of mutating into a different digit (see the sketch below)
- Problem
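- Note: a rough, self-contained sketch of the Genetic Algorithm loop described above for 8-Queens (fitness-proportional parent selection, single-point crossover, per-board mutation); the population size, mutation rate, and generation budget are illustrative choices, not from the lecture:
```
import random
from itertools import combinations

N, MAX_PAIRS = 8, 28   # 8 queens; C(8, 2) = 28 pairs that could attack each other

def fitness(board):
    """Non-attacking pairs (28 == solved); board[i] is the row of the queen in column i."""
    attacking = sum(1 for (c1, r1), (c2, r2) in combinations(enumerate(board), 2)
                    if r1 == r2 or abs(r1 - r2) == c2 - c1)
    return MAX_PAIRS - attacking

def genetic_algorithm(pop_size=100, mutation_rate=0.1, generations=1000):
    population = [[random.randrange(N) for _ in range(N)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(b) for b in population]
        if max(scores) == MAX_PAIRS:
            return population[scores.index(MAX_PAIRS)]       # solved board found
        weights = [s + 1 for s in scores]                     # proportional probabilities (avoid all-zero)
        children = []
        for _ in range(pop_size):
            p1, p2 = random.choices(population, weights=weights, k=2)   # select two parents
            cut = random.randrange(1, N)                                # random crossover point
            child = p1[:cut] + p2[cut:]
            if random.random() < mutation_rate:                         # mutation step
                child[random.randrange(N)] = random.randrange(N)
            children.append(child)
        population = children
    return max(population, key=fitness)   # best board found within the budget
```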
- Example #1: Travelling Salesman Problem
- Lab 8 - Simulated Annealing
- Discussion Forums https://discussions.udacity.com/c/nd889-search-optimization/nd889-simulated-annealing
- Use simulated annealing to implement the algorithm in a Jupyter notebook and using it to solve the Traveling Salesman Problem (TSP) between US state capitals.
- Steps
- Fork https://github.com/udacity/AIND-Simulated_Annealing
- Open AIND-Simulated_Annealing.ipynb with:
jupyter notebook AIND-Simulated_Annealing.ipynb
- Follow instructions from AIMA to implement Simulated Annealing
in the Jupyter Notebook
- Simulated Annealing pseudo-code https://github.com/aimacode/aima-pseudocode/blob/master/md/Simulated-Annealing.md
- Simulated Annealing code https://github.com/aimacode/aima-python/blob/master/search.py
```
def simulated_annealing(problem, schedule=exp_schedule()):
    """[Figure 4.5] CAUTION: This differs from the pseudocode as it
    returns a state instead of a Node."""
    current = Node(problem.initial)
    for t in range(sys.maxsize):
        T = schedule(t)
        if T == 0:
            return current.state
        neighbors = current.expand(problem)
        if not neighbors:
            return current.state
        next = random.choice(neighbors)
        delta_e = problem.value(next.state) - problem.value(current.state)
        if delta_e > 0 or probability(math.exp(delta_e / T)):
            current = next
```
- Install Jupyter Notebook with Miniconda
- http://jupyter.readthedocs.io/en/latest/install.html
pip3 install --upgrade pip
pip3 install jupyter
- Lesson 9 - Constraint Satisfaction:
- Reading - AIMA Chapter 6
- Techniques
- Implement these algorithms to optimise searching for solutions in constraint satisfaction problems in the following order, starting with the stupidest one first, then add intelligence as required
- Backtracking Search (aka Simple Search) (i.e. stupidest algorithm)
- i.e. States are defined by values assigned so far
- Initial State is empty assignment
- Check if current assignment is goal
- Assign a value to unassigned variable that does not conflict with current assignment
- Dead End is reached when no legal assignments (if so, backtrack to previous state and try different assignment)
- Repeat recursively until find answer or have tried all possible assignments and reported failure
- Forward Checking
- Tracking remaining legal values for each unassigned Variable, so if ever make decision causing unassigned Variable to not be able to have a value, we stop search and backtrack
- Benefit: If stuck (i.e. can’t assign a colour to a territory) can backtrack earlier than would have if continued assigning variables (i.e. early warning system that our search is going down the wrong branch)
- Weakness: Does not provide early detection for all failures (e.g. it cannot detect when two adjacent unassigned Variables have only the same single Domain value left), so we need to use other Constraint Propagation repeatedly to enforce all local constraints
- Arc Consistency (aka Simple Constraint Propagation)
- A Variable in a Constraint Satisfaction Problem is Arc Consistent wrt another Variable if, for any value assigned to the First Variable, a value is still available for the Second Variable. The network is Arc Consistent if all Variables in the graph satisfy this condition (see the propagation sketch below)
- Benefit: Potentially saves a lot of unnecessary deep searching for more complicated problems
- Example (see Map Colouring example below)
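- Note: a compact sketch of an AC-3 style propagation loop that enforces arc consistency; the inputs `domains`, `neighbors`, and `constraint(Xi, x, Xj, y)` (True when the pair of values is allowed) are assumed for illustration:
```
from collections import deque

def ac3(domains, neighbors, constraint):
    """Make every arc (Xi, Xj) consistent; return False if some domain becomes empty."""
    queue = deque((xi, xj) for xi in domains for xj in neighbors[xi])
    while queue:
        xi, xj = queue.popleft()
        # remove values of Xi that have no supporting value left in Xj's domain
        removed = [x for x in domains[xi]
                   if not any(constraint(xi, x, xj, y) for y in domains[xj])]
        if removed:
            domains[xi] = [x for x in domains[xi] if x not in removed]
            if not domains[xi]:
                return False                  # dead end detected early: backtrack
            for xk in neighbors[xi]:
                if xk != xj:
                    queue.append((xk, xi))    # re-check arcs pointing into Xi
    return True
```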
- Heuristics
- Minimum Remaining Values (MRV) (see Map Colouring example below for explanation)
- Degree Heuristic (see Map Colouring example below)
- Least Constraining Value
- Assign the Domain value to a Variable that least constrains future choices (i.e. reuse previously used Domain values unless absolutely necessary to use a new one) to avoid Dead Ends and Backtracking (i.e. rules out the fewest values in the remaining Domains, improving Backtracking Search algorithm efficiency)
- Backtracking Search (aka Simple Search) (i.e. stupidest algorithm)
- Implement these algorithms to optimise searching for solutions in constraint satisfaction problems in the following order, starting with the stupidest one first, then add intelligence as required
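- Note: a minimal sketch combining Backtracking Search with the MRV Heuristic and Forward Checking for a map-colouring style CSP (as described below); the data layout (`domains` dict of value lists, `neighbors` adjacency dict, a "neighbors must differ" constraint) is an assumption for illustration:
```
def backtracking_search(domains, neighbors, assignment=None):
    """domains: {var: [values]}, neighbors: {var: [vars]}; constraint: neighbors must differ."""
    assignment = assignment or {}
    if len(assignment) == len(domains):
        return assignment                                    # goal: every variable assigned
    unassigned = [v for v in domains if v not in assignment]
    var = min(unassigned, key=lambda v: len(domains[v]))     # MRV: fewest legal values first
    for value in list(domains[var]):
        # forward checking: remove this value from unassigned neighbors' domains
        pruned = [(n, value) for n in neighbors[var]
                  if n not in assignment and value in domains[n]]
        for n, _ in pruned:
            domains[n].remove(value)
        if all(domains[n] for n in neighbors[var] if n not in assignment):
            assignment[var] = value
            result = backtracking_search(domains, neighbors, assignment)
            if result is not None:
                return result
            del assignment[var]
        for n, v in pruned:                                   # undo pruning and backtrack
            domains[n].append(v)
    return None

# e.g. colouring Australia with three colours (illustrative data)
AUS = {r: ['green', 'blue', 'orange'] for r in ['WA', 'NT', 'SA', 'Q', 'NSW', 'V', 'T']}
ADJ = {'WA': ['NT', 'SA'], 'NT': ['WA', 'SA', 'Q'], 'SA': ['WA', 'NT', 'Q', 'NSW', 'V'],
       'Q': ['NT', 'SA', 'NSW'], 'NSW': ['Q', 'SA', 'V'], 'V': ['SA', 'NSW'], 'T': []}
# backtracking_search(AUS, ADJ)
```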
- Example:
- Airport departure scheduling:
- Given:
- 2500 arrivals / day
- 2500 departures / day
- Plane departs every 30 sec
- Problem:
- How schedule all the flights?
- Given:
- Airport departure scheduling:
- Example: Crypto-arithmetic Q1
- Given
- Each letter represents a digit
- Different letters represent different digits
- No leading letter represents 0
- Setup problem by listing
- Variables
F T U W R O X1 X2 X3
- Domains
{ 0,1,2,3,4,5,6,7,8,9 }
- Constraints/Rules:
alldiff { F, T, U, W, R, O }
- Global Constraints
- 1) All letters different - no two letters can represent the same digit
- 2) Column arithmetic, where X1, X2, X3 are carry digits (each either 0 or 1):
O + O == R + 10*X1
(X1 is the carry into the tens column)
X1 + W + W == U + 10*X2
(X2 is the carry into the hundreds column)
X2 + T + T == O + 10*X3
(X3 is the carry into the thousands column)
X3 == F
F != 0
(since the problem does not allow leading 0's)
- Preference Constraints
- Unary Constraints
- Show constraints as square boxes on
Constraint Hypergraph
- Top square represents Global Constraint 1)
- Show constraints as square boxes on
Constraint Hypergraph
- Unary Constraints
- Global Constraints
- Use the procedure of Backtracking Search improved with the Minimum Remaining Values (MRV) Heuristic, Degree Heuristic, and Least Constraining Value Heuristic (to minimise backtracking)
- Given Variables, Domains, and Constraints
- Select unassigned Variable (i.e. X3), with Domain of 0 or 1
- Assign Variable X3 with a Domain value (i.e. 1), since can’t choose 0 as would fail constraint and not pass Forward Checking
- Apply MRV Heuristic and select unassigned Variable (i.e. F) since only has 1 remaining value and assign it with Domain value 1
- Now, MRV Heuristic find that X2 and X1 are tied with min remaining values at 2 (i.e. 0 or 1), so choose 0 for both (either value passes Forward Checking)
- Now, O must be 4, since:
`O + O == R + 10*X1` with X1 == 0 gives `2*O == R`, so R is even and O <= 4 (since R <= 9)
`X2 + T + T == O + 10*X3` with X2 == 0 and X3 == 1 gives `2*T == O + 10`, i.e. `T == 5 + O/2`, so O must also be even; O == 0 would force R == O, and O == 2 later leads to a Dead End via the W/U constraint, so O == 4, giving R == 8 and T == 7
`X1 + W + W == U + 10*X2` with X1 == X2 == 0 gives `U == 2*W`, so U must be an even number less than 10; choose W == 3 giving U == 6 (passes Forward Checking)
- Outcome (see lecture notes for details https://www.youtube.com/watch?time_continue=97&v=iqJbOGWEd6Y)
- X3: {0,1} => X3 = 1
- X2: 0
- X1: 0
- Answer i.e. F=1, O=4, R=8, T=7, U=6, W=3 (i.e. 734 + 734 = 1468)
```
  T W O
+ T W O
=======
F O U R
```
- Given
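- Note: a tiny brute-force check of TWO + TWO = FOUR (pure illustration; the lecture solves it with constraint propagation instead). It prints every valid all-different assignment, including the lecture's 734 + 734 = 1468:
```
from itertools import permutations

# assign distinct digits to F, T, U, W, R, O; no leading zeros; TWO + TWO == FOUR
for f, t, u, w, r, o in permutations(range(10), 6):
    two = 100 * t + 10 * w + o
    four = 1000 * f + 100 * o + 10 * u + r
    if f != 0 and t != 0 and two + two == four:
        print(f"{two} + {two} = {four}")
```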
- Example: Map Colouring
- Note: a problem with Preference Constraints is a Constraint Optimisation Problem, typically solved with Linear Programming (e.g. Factory Job Scheduling)
- e.g. CSPs: Map Colouring, Sudoku, Car Assembly, Class Scheduling, Spreadsheets, Transport Scheduling, Floor Planning, N-Queens (1000-Queens)
- Given
- Add colours to each territory in Australia
- Ensure no neighboring territory use same colour
- Setup problem by listing:
- Variables
WA, NT, Q, NSW, V, SA, T
- Domains
Di = { green, blue, orange }
- Constraints/Rules: adjacent regions must have different colours
- Either list all possible pairwise constraints (i.e. `WA != NT`, ...) OR list all allowable assignments for each pair of adjacent territories
- Constraint types:
- Preference Constraints
- Unary Constraints (restrict the value of a single variable) (i.e. `WA not orange`)
- Binary Constraints (relate two variables at most)
- Multiple Constraints (i.e. 3 or more variables)
- Soft Constraints (i.e. prefer to teach on Tues and Thu)
- Constraint Graph used to represent the problem visually
- Nodes - the variables (i.e. `WA, NT, ...`). Note: `TAS` is independent from the rest of the problem
- Lines - show constraints between variables
- Apply general Constraint Satisfaction Problem algorithms to the Constraint Graph structure to search quickly for an answer
- Variables
- Solution:
- An assignment of colours to each of the territories that satisfies all constraints
- Concrete Example Solution (i.e. using Backtracking Search ONLY)
- Given Variables, Domains, and Constraints
- Select unassigned Variable (i.e. `WA`)
- Assign Variable a Domain value `orange` from possible (i.e. `green, blue, orange`)
- Select unassigned Variable (i.e. `NT`)
- Assign Variable a Domain value `blue` from possible (i.e. `green, blue`)
- Select unassigned Variable (i.e. `QLD`)
- Assign Variable a Domain value `blue` from possible (i.e. `blue, orange`)
- Dead End when we try unassigned Variable (i.e. `SA`)
- Backtrack up to the previous branch where only `orange` and `green` had been used
- Continue until find solution
- Improve efficiency of algorithm
- Concrete Example Solution (i.e. Backtracking Search improved with the Minimum Remaining Values (MRV) Heuristic, Degree Heuristic, and Least Constraining Value Heuristic to minimise backtracking)
- Given Variables, Domains, and Constraints
- Structured CSPs Algorithm (Optimisation)
- Reference: Graph Algorithm: Topological Sorting https://courses.cs.washington.edu/courses/cse326/03wi/lectures/RaoLect20.pdf
- Benefit: Save time when deep searching
- Approach: Investigate the “Structure” of the problem and decompose it into independent sub-problems (i.e. `TAS` is separate), which removes a level from the search tree
- Example 1:
- Given problem with:
- n = 80 possible Variables
- d = 2 possible values each (Domain makes Variables binary in nature)
- c = 20 possible Variables in each sub-problem (after dividing into 4x problems of 20 each)
- d^n = 2^80 (original Search Space)
- n/c * d^c = 4 * 2^20 (afterward Search Space)
- Given problem with:
- Example 2: Tree-Structured CSPs Algorithm (fast method, but must convert the Constraint Graph into a tree first)
- Given problem with No Loops, can solve problem
in O(n*d^2) Time instead of O(d^n) Time
(where n is qty Variables, and d is size of Domain) by:
- If Constraint Graph is not a “tree”, then try to modify the problem by assigning a Domain to some Variables first in order to change the Constraint Graph into a “tree” (i.e. if use MRV first and assign Domain to unassigned Variable so that Variable is removed from the possibilities of its neighbors then the problem becomes a “tree” and can be solved using this fast method)
- Choose a Variable as root
- Choose an ordering of Variables from the root to the leaves such that each Variable node appears after its parent node in the tree
- Start at end Variables (leaves) and make each parent Arc Consistent going up the tree until we get to the root, and if not possible, report failure of the problem
- Start at root of tree and choose any Domain assignment available until we get to the leaves (since when the tree is Arc Consistent any of the available assignments will solve the problem)
- Given problem with No Loops, can solve problem
in O(n*d^2) Time instead of O(d^n) Time
(where n is qty Variables, and d is size of Domain) by:
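- Note: a compact sketch of the Tree-Structured CSP procedure above (O(n*d^2) time); it assumes the constraint graph really is a tree and takes `domains`, a `neighbors` adjacency dict, and `constraint(Xi, x, Xj, y)` (True when the pair of values is allowed, assumed symmetric) as inputs:
```
def tree_csp_solver(domains, neighbors, constraint, root):
    # 1) order variables so each node appears after its parent (BFS from the root)
    order, parent = [root], {root: None}
    for v in order:
        for n in neighbors[v]:
            if n not in parent:
                parent[n] = v
                order.append(n)
    # 2) from the leaves up, make every parent arc-consistent with its child
    for child in reversed(order[1:]):
        p = parent[child]
        domains[p] = [x for x in domains[p]
                      if any(constraint(p, x, child, y) for y in domains[child])]
        if not domains[p]:
            return None                               # report failure of the problem
    if not all(domains[v] for v in order):
        return None
    # 3) from the root down, any remaining consistent value works
    assignment = {}
    for v in order:
        p = parent[v]
        assignment[v] = next(x for x in domains[v]
                             if p is None or constraint(p, assignment[p], v, x))
    return assignment
```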
- Forward Checking Status:
- All territories could have any of the 3x possible colours
- Select unassigned Variable (i.e. `WA`)
- Assign Variable a Domain value `orange` from possible (i.e. `green, blue, orange`)
- Forward Checking Status Update:
- All territories could have any of the 3x possible colours,
EXCEPT:
- Orange removed from colours available to NT and SA
- All territories could have any of the 3x possible colours,
EXCEPT:
- Select unassigned Variable (i.e. `NT`)
- Assign Variable a Domain value `green` from possible (i.e. `green, blue`)
- Arc Consistency (aka Simple Constraint Propagation check)
- When assigning `green` to NT, ensure Arc Consistency by potentially causing a chain of changes to nested neighbors: first check NT’s unassigned primary-neighbors (i.e. QLD, SA) to see if the assignment of “green” reduces the qty of Domain values available to them, and if so remove “green” from the Domain values available to each of them; then check the unassigned secondary-neighbors and so forth to propagate the remaining Domain values. Repeat until there are no more unassigned Variables, or if a neighbor is found to have no remaining Domain values available, in which case force the search to try another option for NT. Note: the entire region of combined territories is Arc Consistent if there are no regions without a possible colour
- Forward Checking Status Update:
- etc
- Optimise and improve Backtracking efficiency by choosing next
Variable using a Heuristic as follows:
- Minimum Remaining Values (MRV) Heuristic to assign Domain to a
Variable that has fewest legal moves (i.e. remaining unused Domains)
- i.e. Select unassigned Variable `SA`
- Degree Heuristic used if there is a Tie when performing MRV, and involves choosing Variable with most Constraints on remaining Variables (i.e. has the most borders with other territories so is most complex) and we will run into problems sooner rather than delaying them
- Minimum Remaining Values (MRV) Heuristic to assign Domain to a
Variable that has fewest legal moves (i.e. remaining unused Domains)
- Optimise and improve Backtracking efficiency by choosing next
Domain using a Heuristic as follows:
- Least Constraining Value Heuristic used to assign the Domain value to a Variable that least constrains future choices (i.e. reuse previously used Domain values unless absolutely necessary to use a new one) to avoid Dead Ends and Backtracking
- i.e. Assign the Variable the Domain value `orange` again as it least constrains future choices (leaving `blue` unused)
- Least Constraining Value Heuristic used to assign the Domain value to a Variable that least constrains future choices (i.e. reuse previously used Domain values unless absolutely necessary to use a new one) to avoid Dead Ends and Backtracking
- Note: a problem with Preference Constraints is a Constraint Optimisation Problem, typically solved with Linear Programming (e.g. Factory Job Scheduling)
- Example: N-Queens (Constraints Satisfaction Problem)
- Iterative Improvement Algorithms
- Examples: Genetic Algorithms or Hill Climbing Algorithms
- Most efficient when either:
- Many solutions OR
- Few solutions
- Very Hard (requiring significant CPU time) to solve at a Critical Ratio point between qty of Variables for a given qty of Constraints
- Solved by randomly assigning values to Variables and iteratively improve the assignments until reach solution. Minimising Conflict Method of minimizing qty of attacks with each iterative improvement can find solution for 4-Queens problem which has 4^4 = 256 possible states. It can also solve 1,000,000-Queens in average of 50 moves (where large qty of solutions to problem)
- Iterative Improvement Algorithms
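- Note: a self-contained sketch of the Minimising Conflict Method for N-Queens described above; picking any random column (rather than only a conflicted one) and the step budget are simplifications for illustration:
```
import random
from itertools import combinations

def conflicts(board):
    """Number of attacking pairs; board[i] is the row of the queen in column i."""
    return sum(1 for (c1, r1), (c2, r2) in combinations(enumerate(board), 2)
               if r1 == r2 or abs(r1 - r2) == c2 - c1)

def min_conflicts(n=8, max_steps=10_000):
    """Randomly assign values, then iteratively move one queen (within its column)
    to the row that minimises the number of conflicts."""
    board = [random.randrange(n) for _ in range(n)]
    for _ in range(max_steps):
        if conflicts(board) == 0:
            return board                              # reached a solution
        col = random.randrange(n)
        board[col] = min(range(n),
                         key=lambda r: conflicts(board[:col] + [r] + board[col + 1:]))
    return None                                        # no solution within the step budget
```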
- Lab 9 - Constraint Satisfaction
- Discussion Forums https://discussions.udacity.com/c/nd889-search-optimization/nd889-constraint-satisfaction
- 8-Queens Puzzle Solution
- Constraint Satisfaction Problem of solving N-queens problem using symbolic constraints and backtracking search in a Jupyter notebook.
- Steps
- Clone https://github.com/ltfschoen/AIND-Constraint_Satisfaction
- Open AIND-Constraint_Satisfaction.ipynb with:
jupyter notebook AIND-Constraint_Satisfaction.ipynb
- Follow the instructions to complete the constraints for the N-Queens CSP
and implement Backtracking-Search from AIMA in the notebook.
- CSP pseudo-code https://github.com/aimacode/aima-pseudocode/blob/master/md/Tree-CSP-Solver.md
- CSP code https://github.com/aimacode/aima-python/blob/master/csp.py
- Install Jupyter Notebook with Miniconda
- http://jupyter.readthedocs.io/en/latest/install.html
pip3 install --upgrade pip
pip3 install jupyter
- Lesson 10 - Logic and Reasoning:
- Dfn:
- Logic - A logic is a way of describing the world. If logical statements selectively and correctly capture what we want to know about the world, then we may reason within the logic to reach conclusions (without having to reference the world again)
- AI Background
- 50’s & 70’s - Requirement for better languages and systems for logical statements
- Mistakes:
- Trying to make statements only in Boolean Logic (where only True or False) even though world is uncertain and Probability Logic is better fit with world
- Mistakes:
- 80’s - Better algorithms for Probability Logic
- Issues
- Too laborious writing by hand
- Issues
- 90’s - More data available online
- Opportunities
- Learning about world from data instead of writing rules by hand
- Opportunities
- 50’s & 70’s - Requirement for better languages and systems for logical statements
- Uncertainty Domains
- Goal: Eliminate uncertainty of real-world and solve significant problems using logic
- Future of Planning Algorithms:
- Learning from examples
- Transferring learning across domains (i.e. learn in one domain and execute in another)
- Interactive Planning where human-machine team solves problems together and being able to explain things in terms that a human can understand, allowing humans to add things that were overlooked in the original problem statement (i.e. need correct UI interface as well as correct answer)
- Examples:
- Logistics
- Planning Algorithms for when actions or world is uncertain such as
for all deliveries of packages using a large fleet of vehicles
- Describe as logic problem where want find solution minimising time and expense
- Planning Algorithms for when actions or world is uncertain such as
for all deliveries of packages using a large fleet of vehicles
- Logistics
- Valid Sentence
- Dfn: True in every possible Model for every combination of values of Propositional Symbols
- Satisfiable Sentence
- Dfn: True in some Models, but not necessarily all Models
- Unsatisfiable Sentence
- Dfn: False for all Models
- Logic Types
- Propositional Logic
- Truth Tables for all Logical Connectives
- Lists all four possibilities for combinations of Propositional Symbols
- Note: “v means either OR both”
- Note: “> means P implies Q” (aka if P then Q)
- Note: Meaning in Logic is defined explicitly by Truth Table
```
P | Q | !P | P ^ Q | P v Q | P > Q | P <> Q
F | F |  T |   F   |   F   |   T   |   T
F | T |  T |   F   |   T   |   T   |   F
T | F |  F |   F   |   T   |   F   |   F
T | T |  F |   T   |   T   |   T   |   T
```
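- Note: the same table can be generated mechanically; a small, purely illustrative Python sketch that enumerates the four models and evaluates each connective:
```
from itertools import product

print("P Q | !P  P^Q  PvQ  P>Q  P<>Q")
for P, Q in product([False, True], repeat=2):
    row = [not P, P and Q, P or Q, (not P) or Q, P == Q]   # negation, and, or, implies, iff
    print("{} {} | {}".format("FT"[P], "FT"[Q], "    ".join("FT"[v] for v in row)))
```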
- Example:
- Given the following propositions
- [O] means Five is odd number
- [P] means Paris is capital of France
- [E] means Five is even number
- [M] means Moscow is capital of France
- Connectives:
- O > P (i.e. IF Five is odd number THEN implies Paris is capital of France) IF True THEN implies True
- E > M (i.e. IF Five is even number THEN implies Moscow is capital of France)
IF False THEN implies False
- Note: In terms of real-world formal logic both statements are True since according to the Truth Table definition, P > Q is True when both P and Q are False
- Given the following propositions
- Example: Alarm Problem
- Given propositional Symbols B, E, A, M, and J, where our belief in
Propositional Logic is that each is either True, False, or Unknown
(similar to in Probabilistic Models, but dissimilar in that
our degree of belief in Propositional Logic is not a number).
They correspond to proposed Events:
- [B]urglary occurring
- [E]arthquake occurring
- [A]larm going off
- [M]ary calling
- [J]ohn calling
- Propositional Logical Sentences are either True or False with respect
to a Model of the world (a set of True/False values for all the
Propositional Symbols), (i.e. Model may be the set {B: True, E: False}).
Truth Tables used to define Truth of a Sentence in terms of the
Truth of the Symbols with respect to the Models.
They may be created by combining these Symbols with
Logical Constants of True and False, and Logical Operators, where
Connectives include: “^ means AND”, “v means OR”, “> means Implies”,
“<> means Equivalent” (biconditional)” (when one true the other is true, and when
one is false the other is false), and “! means Negation”
- (E v B) > A (i.e. imply Alarm is True whenever Earthquake and Burglary is True) i.e. E OR B == A
- A > (J ^ M) (i.e. imply both John and Mary call when Alarm is True) i.e. A == J AND M
- J <> M (i.e. John calls if and only if (iff) Mary calls, so John is equivalent to Mary) i.e. J == M
- J <> !M (i.e. when John calls Mary does not call, and vice versa, so John is equivalent to not Mary) i.e. J == !M
- Given propositional Symbols B, E, A, M, and J, where our belief in
Propositional Logic is that each is either True, False, or Unknown
(similar to in Probabilistic Models, but dissimilar in that
our degree of belief in Propositional Logic is not a number).
They correspond to proposed Events:
- Truth Tables for all Logical Connectives
- Propositional Logic Limitations
- Inference Mechanisms that are efficient at determining Validity and Satisfiability
- Limitations
- Only able to handle True and False values (in Truth Table), with no capability to handle Uncertainty like covered in Probability Theory
- Only talks about True or False events in the world
- Does not talk about objects with properties such as size, weight, etc.
- Does not talk about relations between objects
- Does not include any shortcuts to succinctly talk about multiple things happening (i.e. vacuum world with multiple 1000 locations, to say every location is free of dirt, we need a conjunction of 1000 propositions, but can’t just write a single sentence saying all locations are clean)
- First-Order Logic (FOL)
- Dfn: First-Order Logic means that relations are on Objects but
not on Relations
- Note: Every model must have at least one object, and multiple variables may refer to the same object
- Note: P(x) means there may exist an x that is a member of P, where P may be an empty relation
- Note: Don’t just have Sentences with an assertion, always include a condition (If and Only If), so can prove the negative of the assertion without just an implication in one direction, when doing the definition of an adjacency or a membership problem
- Dfn: Higher-Order Logic (HOL) means that relations are on Relations
- i.e. define notion of a transitive relation about relations itself
- Valid statement in Higher-Order Logic (but invalid in FOL:
For All R, Transitive(R) is equivalent to: For All a,b,c, R(a,b) and R(b,c) implies R(a,c)
∀R Transitive(R) <=> (∀a,b,c R(a,b) ^ R(b,c) => R(a,c))
- About: Overcomes Limitations encountered in Propositional Logic
- Logic varies in what you can say about the world, and what you can believe about what has been said about the world:
```
                        World (Ontological Commitment)          Belief (Epistemological Commitment)
====================================================================================================
First Order Logic       Rel'ships, Obj's, Functions on Obj's    True, False, Unknown
(extends Prop logic)
Propositional Logic     Facts (Symbols, Variables)              True, False, Unknown
Probability Theory      Facts (")                               [0..1] (real numbers)

Representations:        Atomic                     Factored                     Structured
====================================================================================================
                        Search & Problem Solving   Variables (i.e.              Relations b/w Objects (i.e.
                                                   Propositional Logic &        Programming languages, SQL)
                                                   Probability Theory)
```
- Atomic Representation
- Dfn: where representation of state is just individual state when no piece is inside it (no internal structure to states)
- i.e. given transition from State A to State B can say if identical or not, if goal state?
- Examples: Search, Problem Solving
- Factored Representation
- Dfn: break up world into set of True/False facts
- Dfn: representation that individual state of world is factored into several Variables (i.e. B, E, A, M, J), i.e. boolean or other type of representation, but Not in Propositional Logic
- Examples: Propositional Logic, Probability Theory
- Structured Representation
- Dfn: Most complex representation. Individual state is not just a set of values for variables, but also includes rel’ships b/w objects, a branching structure, complex representations, relations b/w different objects
- Examples: Programming languages, structured databases (with SQL over them)
- Atomic Representation
- Process
- Definitions
- Function defined as mapping from objects to objects
- Model in FOL differs from Propositional Logic Model, where value for each Propositional Symbol in Model corresponding to what is going on in a possible world i.e. { P: True, Q: False }
-
Model in FOL instead starts with Set of Objects and Set of Constants that refer to those Objects but don’t need one-to-one correspondence b/w Constants and Objects (i.e. multiple Constants may refer to same Object, for instance CEE could refer to Object C3)
-
Set of Objects includes: 4x tiles A, B, C, D (with the numbers 1, 3, 3, 2 written on them); the numbers 1, 2, 3 are also Objects in the Model
-
Set of Constants: {A, B, C, D, 1, 2, 3, CEE}
-
Set of Functions:
“Number” Function (i.e. maps from a tile to the number on the tile), defined as the mapping:
{ A -> 1, B -> 3, C -> 3, D -> 2 }
-
Set of Relations:
“Above” Relation (i.e. in this model of the world, this relation is a set of two tuples where A is above B, and C is above D). Note: a binary relation holding b/w two objects, say where one block is above another block
{ [A, B], [C, D] }
“Vowel” (Unary Relation) - if Vowel is True only for the Object A, then it is a set of tuples of length 1 containing just A
{ [A] }
“Rainy” (Relation over no objects; it just refers to the current situation) - if it is not raining, represent it as the empty set of no tuples:
{ }
if it is raining, represent it with a Singleton Set containing the empty tuple (since the Arity of Rainy is zero, there are zero elements in the tuple):
{ [] }
-
- Syntax in FOL
- Atomic Sentences - describe facts as either True or False
and are predicates corresponding to Relations
i.e.
Vowel(A)
Above(A, B)
2 = 2 (i.e. Equality, a Relation included in every model)
- Operators Sentences may be combined with all operators from Propositional Logic (i.e. ^, v, !, >, <>, () )
- Quantifiers Complex Sentence Types (unique to FOL)
- ∀x - For All, followed by Variable it introduces such as x
- Ǝy - There Exists, followed by Variable it introduces such as y
- Note: If no Quantifier provided, its a shortcut meaning For All
- Note: Typically don’t want to say something about every object in the domain since objects can be so different, instead we usually want to say something about a particular type of object (i.e. Vowels)
- Note: Typically when have a “There Exists an x” or any other Variable, it typically doesn’t go with a Conditional since only describing one object
- Example Sentences Typical in FOL:
∀x Vowel(x) => NumberOf(x) = 1
(i.e. For All x, if x is a Vowel, then the Number of x equals 1)
Ǝx NumberOf(x) = 2
(i.e. There Exists an x such that the Number of x equals 2; means there is some Object in the Domain to which the NumberOf Function applies with a value of 2, without saying which Object it is)
- Terms - describe Objects, may be Constants, Variables, or Functions
i.e.
A, B, 2 (i.e. Constants)
x, y (i.e. Variables)
NumberOf(A) (i.e. Functions) - just another expression referring to the same Object as 1 (at least in the model shown previously)
- Atomic Sentences - describe facts as either True or False
and are predicates corresponding to Relations
i.e.
- Definitions
- Example: Vacuum world represented using FOL
- Add:
```
Locations
=========
A - Left Location
B - Right Location
V - Vacuum
D1, D2 - Dirt

Relations
=========
Loc     - True of any location
Vacuum  - True of the vacuum
Dirt    - True of dirt
At(o,l) - True of object o at location l
```
- i.e. Say “vacuum is at location A”:
At(V,A)
- i.e. Say “no dirt at any location”;
- Note: This Sentence still holds for 1000’s of locations, which is the power of FOL
```
For All d and For All l, if d is Dirt and l is a Location, then d is not at l:
∀d ∀l Dirt(d) ^ Loc(l) => ¬At(d,l)
```
- i.e. Say “vacuum is in a location with dirt” (without specifying the location)
```
There Exists an l and There Exists a d, such that d is Dirt and l is a Location, and the vacuum is at that location, and the dirt is at that same location:
Ǝl Ǝd Dirt(d) ^ Loc(l) ^ At(V,l) ^ At(d,l)
```
- Add:
- Dfn: First-Order Logic means that relations are on Objects but
not on Relations
- Propositional Logic
- Expert Systems
- Dfn:
- Expert Systems are common throughout the world
- Algorithms to help understand how Expert Systems work
- Examples:
- Decisions
- Superhuman decision made by military expert system such as deciding where to place a supply base to save money (higher return than their investment in AI)
- Decisions
- Dfn:
- Resolution Algorithm - Inferring new knowledge from a knowledge base
- GraphPlan - Used to make planning practical for new range of problems
- Value Iteration Concept - Key concept for Markov Decision Processing
- Dfn:
- Lesson 11 - Planning:
- About:
- Problem Solving AI-agent
- Given a State Space, Start Space, and Goal
- Planning ahead, thinking only to figure out path to find goal, without interacting/sensing world
- Executes path
- Real-Life differs from Problem Solving AI-agent and requires:
- Avoiding blindfolded execution and getting lost by:
- Interleaving planning and executing
- Feedback from environment required, cannot just plan ahead and derive whole plan
- Avoiding blindfolded execution and getting lost by:
- Problem Solving AI-agent
- Planning and Executing
- Interleaving required due to Properties of the environment that make it difficult:
- Change Perspective to overcome Difficulties by planning in Space of Belief States instead of Space of World States
- Difficulties
- Stochastic Environments - don’t know outcome of performing an action, so need to deal with Contingencies, i.e. try turn right and wheel slipped, skidding, feet not move forward entirely straight, or traffic lights red (prevents go forward)
- Multiagent Environments - i.e. other cars or people obstacles, plan about what they will do and react to unexpected (only know at Execution time, not at Planning time)
- Partial Observability i.e. given plan to go from A to S to F to B, which appears it will work [A, S, F, B] but suppose we know at S the road to F sometimes closed (and sign there tells us), but when start off we cannot read that sign i.e. when start, we may not know what state we are in (i.e. may know we are in state A, but may not know if road is open or closed until we get to S in order to know if can continue down that path or not)
- Unknown i.e. due to lack of knowledge on our part, where some model of world is unknown (such as map or GPS that is inaccurate or incomplete), then may not be able to execute
- Hierarchical i.e. may need to deal with plans that are hierarchical at a high level steps that cannot be executed (i.e. action of go from A to S) without detailed low level plan steps (i.e. turn right, left, accelerate more, etc) being scheduled during execution
- Example 1: Vacuum Cleaner (Deterministic environment)
- Add:
```
Locations
=========
A - Left Location
B - Right Location
V - Vacuum
D1, D2 - Dirt (Yes or No)

Possible States == 2 * 2 * 2 == 8

State Space diagram - shows 3x possible Actions transitioning between states:
- Moving Right
- Moving Left
- Dirt Sucking
```
-
Given a “Fully Deterministic” and “Fully Observable” world it is easy to plan, i.e. from Start State to the Goal State where both sides are clean:
0) Initially in Left Location
1) Execute Action to Dirt Suck
2) Move to Right Location
3) Execute Action to Dirt Suck
4) Transition to Goal State of no dirt
-
Given “Unobservable” (Sensorless) and Deterministic world where Vacuum’s sensors stop working, so can’t determine current State (i.e. what Location it is in or whether any Dirt).
-
How does the AI-agent represent the State of the world?
-
Solution: Search in the State Space of Belief States (instead of the State Space of Actual Spaces) i.e. we believe we are in one of the 8 states, and when we execute a subsequent Action we go to another Belief State and gain knowledge about the world even without sensing, which is either a known, or unknown state
- i.e. Move Right, according to State Space of
Belief States diagram we know we’re in Right-Hand Location
(either Moved from Start to Right Wall or we were already
at the Right Wall and stayed there)
- Note: Reduces possibilities from 8 down to 4 so know more about world even though haven’t observed anything
- i.e. Move Right, according to State Space of
Belief States diagram we know we’re in Right-Hand Location
(either Moved from Start to Right Wall or we were already
at the Right Wall and stayed there)
-
- Note: In real-world, operations of Move Left and Move Right are inverses of each other
-
Conformant Plans - Plans that reach the goal without ever observing the world, i.e. to achieve the Goal of a clean location, simply Dirt Suck, moving from the Initial State of 8 possibilities to one of 4 possible states where one or both Locations are clean (we do not know which of the 4 states we are in, but we know we achieved the Goal)
-
Given “Partially Observable” Vacuum Cleaner in Deterministic world
-
Given Local Sensing - where Vacuum knows its current Location, and sub-states going on i.e. if dirty or clean, but does not know about sub-state of other Locations (if contain dirt or not)
-
Draw Partial Diagram of part of Belief State from that world, where Belief State unfolds as the following occurs:
- 1) Initial State
- 2) Action taken to Move Right
-
3) Observation (in Act-Percept Cycle) allows us to split the Belief State of our world to say:
- if we observe we are in Location B and it’s Dirty then we know where in a certain state according to the diagram, and if observe we’re in Location B and its Clean we know we’re in another state according to the diagram
-
-
- Add:
- Act-Percept Cycle
- In a Deterministic world, each individual world state in a Belief State maps to exactly one successor state (the reason we say “deterministic”), so the size of the Belief State either stays the same or may decrease (if two Actions accidentally take you to the same state)
- Observation
- Works opposite of Act-Percept Cycle, in that when we Observe the world, we’re taking the current Belief State and partitioning it into pieces of where we know they are. Observations cannot introduce a new world state into the Belief State. Cannot learn less about the world after an Observation.
- Stochastic (non-Deterministic)
- Multiple outcomes for an Action
- Example 2: Robot with slippery wheels (Stochastic and Partially Observable Environment)
- Given Robot:
- Which has slippery wheels
- Where Move Left or Move Right causes wheels to slip, so most of time No Location Change, but sometimes Location Changes
- Assume Suck always works perfectly
- Draw Belief State Diagram where result of Actions often result
in a Belief State increase (larger than before),
since Action increases Uncertainty as do not know
what result of Action will be
- For each of individual world states belonging to a Belief State we have multiple outcomes for the Action (meaning of Stochastic), so end up with larger Belief State, since Actions tend to increase Uncertainty (in Stochastic and Partially Observable environment)
- Observation-wise, same holds as in Deterministic world, where observation partitions Belief State into smaller Belief States and Observations tend to reduce the Uncertainty (in Stochastic and Partially Observable environment)
-
Planning approach i.e. want to go from initial Belief State to one where all squares are clean, need to know if will Always work or Maybe sometimes work for possible plans?
All are Maybe, since with a plan that is a linear sequence we are never guaranteed to achieve the goal with a Finite Sequence of Actions: a successful plan must include at least one move action, but if we try the move action a finite number of times, each of those times the wheels may slip and the robot won't move.
Note: We can introduce a new notion of plans that include Infinite Sequences
- Given Robot:
- Infinite Sequences
- Instead of writing plans as Linear Sequence,
i.e.
(suck, move right, suck) [SRS]
instead write as Tree Structure
- Tree Structure (where Finite Representation represents an
Infinite Sequence of plans)
- New State - Starting Belief State
- Action - Suck
- New State - A
- Action - Move Right
- Observation - either [A, Clean] or [B, Clean] or [B, Dirty]
of current State
- If identifies we are still in previous state, then go back in plan to A
- Else continue
- Action - Suck
- Write infinite sequence of plans in more linear notation:
S, while we observe A do R, and then do S [S, while A: R,S]
- Note: If Stochasticity is independent (i.e. sometimes works and sometimes it doesn’t work), then with Probability of 1 as Limit, then this Plan will achieve the Goal. However, can’t state any bound qty of steps with which can achieve the goal, can only say it’s infinite
- Instead of writing plans as Linear Sequence,
i.e.
-
Finding a Successful Plan
- Approach with Problem Solving (using Trees) (Previously)
- Start in Single State (not a Belief State)
- Search possible states in a Tree until find one pair that takes us to Goal State
- Pick single path from Tree (sequence of steps with no branches in it where one leaf node is goal)
- Approach with Problem Solving (using Maths Notation)
- Given plan with straight line sequence
[A,S,F]
-
Decide if is plan that satisfies the goal by checking if end state is goal state
- Mathematical formulation for the plan to achieve the goal:
```
1. Start in start state S1 (= A)
2. Transition to state S2, the result of applying to S1 the action of going from A to S
3. Apply to intermediate state S2 the action of going from S to F
4. Check whether the resulting state is an element of the set of Goals, to see if the plan is valid and gives us a solution

Result(Result(A, A->S), S->F) ∈ Goals
```
- Stochastic and Partially Observable
-
Note: In Stochastic and Partially Observable worlds the equations are more complicated
-
Instead of just dealing with Individual States such as getting result of applying action to initial state (i.e.
s1 = Result(s, a)
), we’re actually dealing with Belief States: we make a Prediction where we start in belief state “b”, look at the action “a” and every possible result of that action applied to each possible member of the Belief State (since the results are stochastic, this gives us a larger Belief State), then we Update the Belief State by taking the Observation “o” into account (which gives the same or a smaller Belief State), and we get a new belief state “b1” -
Note: Use this Predict-Update Cycle to Track where we are
- tells everything we need to know
- Weakness: Belief States start to get large
since they just list all the possible world states
- Alternative: Instead represent the possible states using Variables (i.e. using Classical Planning) and an overarching Formula over the Variables that describes the states
b1 = Update(Predict(b, a), o)
-
- Given plan with straight line sequence
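- Note: a small Python sketch of the Predict-Update cycle `b1 = Update(Predict(b, a), o)` over explicit Belief States (sets of world states); the problem-specific helpers `results(s, a)` (set of possible outcomes) and `percept(s)` (observation produced in state s) are assumptions for illustration:
```
def predict(belief, action, results):
    """Stochastic prediction: union of all possible results of the action
    from every state we might be in (belief states tend to grow here)."""
    return {s2 for s in belief for s2 in results(s, action)}

def update(belief, observation, percept):
    """Observation: keep only states consistent with what we perceived
    (belief states can only shrink or stay the same here)."""
    return {s for s in belief if percept(s) == observation}

def step(belief, action, observation, results, percept):
    """b1 = Update(Predict(b, a), o)"""
    return update(predict(belief, action, results), observation, percept)
```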
- Approach with Belief States and Branching Plan Structures
- Same process as for “Problem Solving”, but more complex tree with different possibilities
- Example (of searching through tree)
- Init State
- Possible Actions to for Next State
- Move Right
- Suck
- Branch of the Plan (not part of Search Space)
- Possible Actions to for Next State
- Try each Action and expanding nodes until find a Portion of the tree that is a Successful Plan according to criteria of reaching a Goal
- Init State
- Unbounded Solutions
- If every leaf node in a plan is a goal since we can’t guide it in one direction or another (no matter what path with execute based on observations, noting that observations come to us, we do not pick them)
- Bounded Solution
- No loops (do not know how many steps it will take), but ok to have branches
- Approach with Problem Solving (using Trees) (Previously)
- Notation of Classical Planning (i.e. Action Schema)
- Dfn:
- Representation language for dealing with States, Actions, and Plans
- Used as approach to deal with complexity by factoring the world into Variables
- State Space
- All possible assignments to k-Boolean Variables
- State Space Size - 2^k
- Example - Vacuum world with 2 Locations
- States
- Possible States: 2*2*2 = 8 (succinctly represented through the 3 Variables below)
- Variables: 3 (Boolean)
- DirtA - Dirt in Location A
- DirtB - Dirt in Location B
- VacA - Vacuum in Location A
- World State - Complete Assignment of True or False to each of the 3 Variables
- Belief States (depends on type of env we want to deal with)
- Complete Assignment (in Core Classical Planning, useful in Fully Deterministic Fully Observable domains)
- Partial Assignment (by extending Classical Planning) i.e. Belief State where VacA is True, whereas other 2 Variables are unknown, such that formula represents 4 possible world states
- Arbitrary Formula (in boolean logic)
- Actions
- Represented in Classical Planning by Action Schema since represents a set of actions (i.e. for all possible planes, for all x and y)
- Example 1: Fly Cargo
- Given
- Goal - need to send cargo around world
- Assets - planes at airports, cargo
- Action Schema (of a plane flying from one location to another):
```
# Action - an Action Schema names an Operator and its Args - i.e. Fly from x to y
# Preconditions - what needs to be true to execute the action:
#   p must be a Plane, x and y must be Airports, and p must be At x
#   (^ is AND from Boolean Propositional Logic)
# Effects - what will happen as a result of the action, in terms of the transition
#   between state spaces - i.e. when flying from x to y, the plane is no longer at x

Action( Fly(p, x, y),
  PRECOND: Plane(p) ^ Airport(x) ^ Airport(y) ^ At(p, x)
  EFFECT: ¬At(p, x) ^ At(p, y) )
```
- Note: Can apply the Action Schema to specific world states where p and x would have specific values (we concatenate the names together to come up with one variable where the name has a complex form with commas and parenthesis to make easier to write one Action Schema that covers all individual flying actions)
- Given
- Example 2: Fly Cargo
- More complete representation of Problem Solving domain in language of classical planning
- Note: The below representation and algorithms are good enough to handle 100,000+ cargo planes, representing millions of ground actions at dozens of airports. If there were 100 planes there would be 10,000 different fly actions; if there were 1000's of cargo pieces we'd have more Load and Unload actions (all representable by the below succinct Schema)
```
###################
# Initial State (2x Cargo pieces, 2x Planes, 2x Airports) and where things are
###################
Init( At(C1, SFO) ^ At(C2, JFK) ^ At(P1, SFO) ^ At(P2, JFK)
      ^ Cargo(C1) ^ Cargo(C2) ^ Plane(P1) ^ Plane(P2)
      ^ Airport(JFK) ^ Airport(SFO) )

###################
# Goal State (i.e. C1 cargo must be delivered to JFK airport and ...)
###################
Goal( At(C1, JFK) ^ At(C2, SFO) )

###################
# Operator "Load" (loads cargo "c" onto Plane "p" at Airport "a")
###################
Action( Load(c, p, a),
  PRECOND: At(c, a) ^ At(p, a) ^ Cargo(c) ^ Plane(p) ^ Airport(a)
  EFFECT: ¬At(c, a) ^ In(c, p) )

###################
# Operator "Unload" (unloads cargo "c" from Plane "p" at Airport "a")
###################
Action( Unload(c, p, a),
  PRECOND: In(c, p) ^ At(p, a) ^ Cargo(c) ^ Plane(p) ^ Airport(a)
  EFFECT: At(c, a) ^ ¬In(c, p) )

###################
# Fly Action Schema
###################
Action( Fly(p, from, to),
  PRECOND: At(p, from) ^ Plane(p) ^ Airport(from) ^ Airport(to)
  EFFECT: ¬At(p, from) ^ At(p, to) )
```
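- Note: a rough Python sketch of how such Action Schemas (as ground instances) can be represented and applied to a state for Progression search; the frozenset-of-fluents state representation and all names are illustrative assumptions, not the lecture's code:
```
from collections import namedtuple

Action = namedtuple("Action", ["name", "precond", "effect_add", "effect_del"])

# a ground instance of Fly(P1, SFO, JFK)
fly_p1 = Action(name="Fly(P1, SFO, JFK)",
                precond={"Plane(P1)", "Airport(SFO)", "Airport(JFK)", "At(P1, SFO)"},
                effect_add={"At(P1, JFK)"},
                effect_del={"At(P1, SFO)"})

def applicable(state, action):
    """All positive preconditions must hold in the state."""
    return action.precond <= state

def apply_action(state, action):
    """Progression: remove negative effects, add positive effects."""
    return (state - action.effect_del) | action.effect_add

state0 = frozenset({"Plane(P1)", "Airport(SFO)", "Airport(JFK)", "At(P1, SFO)",
                    "Cargo(C1)", "At(C1, SFO)"})
if applicable(state0, fly_p1):
    state1 = apply_action(state0, fly_p1)   # now contains At(P1, JFK) and not At(P1, SFO)
```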
- States
- Dfn:
- Progression (Forward) State Space Search - Planning using Notation of Classical Planning (i.e. Action Schema)
- Searching through Concrete world states
- Planning using Action Schema is done same as we did planning for Problem Solving
- Searching through space of exact states, where each state is an Individual World State and where Actions are Deterministic then its the same approach as in Problem Solving
-
Since we are using this representation using Action Schema, there are other possibilities available
- Example: Fly Cargo
- Initial State
At(P1, SFO), At(C1, SFO)
- Branching to possible Actions
i.e. one possible Action is to Load
Load(C1, P1, SFO)
- Continue Branching until satisfy Goal predicate
- Initial State
- Regression (Backwards) State Space Search
- Searching through Abstract states (where some Variables unspecified)
- Where we start at the Goal representation (of many world states)
-
Use the same representation for arrows that represent a set of possible Actions
- Example: Fly Cargo
- Take the description of the Goal State `At(C1, JFK), At(C2, SFO)` (all we know about the state is that these two Propositions are True)
- Backward Search by asking what Actions would lead to that Goal State, which represents a whole family of States with different Variables and values
- Individually inspect, one at a time, the “Definition” of each possible Action that could result in each part of the Goal
- i.e. inspect Actions in the Action Schema that would result in `At(C1, JFK)`, where we find the only Action that results in an `At` is the Unload Action Schema, such that `Unload(c,p,a)` results in `At(c,a)`, so we know that to achieve the goal we’d have to perform an Action `Unload(C1, anything, JFK)` (i.e. the branch leading into the Goal state is `Unload(c,p,a)`, which represents all possible Actions for any Plane “p” Unloading cargo at the destination)
- i.e. inspect Actions in Action Schema that would result in
- Regress the State over the Operator to get another representation before the Action that also will be Unknown (i.e. do not know what “p” is involved)
- Continue searching backwards until enough of Variables are filled in and where we match against Initial State so we have our Solution that we found going Backwards
- Apply the found Solution going Forwards
- Take description of Goal State
- Example: Buying a Book
```
# Single Action (buying a book)
# Precondition is identifying which book (by ISBN); Effect is that we own the book
# Goal is to own the book whose ISBN satisfies the precondition
Action( Buy(b),
  PRECOND: ISBN(b)
  EFFECT: Own(b) )

Goal( Own(0136042597) )
```
- Progression Forward Search Solution:
- Initial State - own nothing
- Actions - decide which are possible to apply, i.e. if there are 10,000,000 possible books then the branching factor is 10,000,000 coming out of the initial state node and we have to try them all in order until we hit the goal - very inefficient
- Regression Backward Search Solution:
- Goal State - own ISBN No. 0136042597
- Actions - Inspect possible ones out of the 10,000,000, where there is Action Schema, which can only match Goal in one way when “b” == 0136042597 so we know Action is to buy 0136042597, so able to connect Goal State to Initial State in backwards direction in one step
- Progression Forward Search Solution:
- Plan Space Search
- Dfn: Searching through the space of plans (rather than searching through the space of states)
- Note:
- Plan Space Search - old approach
- Forward Search - popular as it is easier to come up with good Heuristics for search, since it deals with Concrete Plan States
- Example: Sliding Puzzle (using Heuristics)
- Action Schema for the puzzle is a formal representation encoding Actions in logical form, so a program could modify it and generate Heuristics from it (rather than requiring a human to manually do it)
```
# Action - slide tile t from location a to b
Action( Slide(t, a, b)
# Precondition - tile must be on location a, it must be a tile, location b must be blank,
# and locations a and b must be adjacent
  PRECOND: On(t,a) ^ Tile(t) ^ Blank(b) ^ Adj(a,b)
# Effect - tile is now on location b, blank is now on location a, tile is no longer on
# location a, and blank is no longer on location b
  EFFECT: On(t,b) ^ Blank(a) ^ ¬ On(t,a) ^ ¬ Blank(b) )
```
- Use the Action Schema to automatically derive good Heuristics by modifying the original problem
- Heuristic #1 - Remove a Precondition to make the problem easier and generate a different Heuristic, i.e. remove `Blank(b)` to end up with the "Manhattan" (aka City Block) Heuristic (i.e. since it is not Blank and does not contain a Tile, it contains something else)
- Heuristic #2 - i.e. remove `Adj(a,b)` to get the Qty Misplaced Tiles Heuristic (i.e. allows sliding a tile from any location a to any location b, no matter how far apart they are)
- Heuristic #3 (Ignore negative effects) - i.e. remove `¬ Blank(b)` from EFFECT, to make the problem easier (a more relaxed problem)
- Backward Search - benefit is that it may be more efficient than Forward Search
- Example: Getting Dressed with clothes in the right order
- Initial Empty Plan (flawed as it does not lead from Start to Goal)
```
Start State
Goal State
```
- Update Plan - add the "Right Shoe" Operator
```
Start State
Right Shoe
Goal State
```
- Check if the updated plan solves the problem, otherwise further refine the plan
- Update Plan - add a parallel sequence of Operators in a branching structure ("Left Shoe", "Right Shoe"). Note: this captures that they can be achieved in any order
```
     Start State
      /       \
Left Shoe   Right Shoe
      \       /
     Goal State
```
- Repeat until we get a plan that works
Planning Representation
* **Situation Calculus (using FOL for Planning)** * Dfn: Regular FOL with a set of conventions for how to represent states and actions * **Possibility Axioms** and **Successor-State Axioms** are most of what comprises Situational Calculus and are used to describe entire **Domains** (i.e. airport cargo domain). We describe our specific Problem within that Domain by describing the Initial State/Situation, where we make different types of assertions using different predicates * Note: Deriving Solution from a Problem described in that supports full power of FOL using Situation Calculus where we can write anything we want (much more flexible than in Problem Solving or Classical Planning) and doesn't need any special program since there are already Theorem Provers for FOL, so simply state as a Problem, apply the Normal Theorem Prover that we already had for other uses so it comes up with a Situation as an answer which corresponds to a Path that satisfies the Goal, given the Initial State and Action descriptions. * **Conventions** * **Actions** represented as Objects in FOL (normally by functions) ``` # Function Fly - represents an Object that is the Action # (i.e. fly plane from airport x to y) Fly(p,x,y) ``` * **Situations** described by Objects in the logic and correspond to the **Past History of Actions** that are in state space search i.e. arriving at same world state from two different sets of actions, they would be considered two different situations ``` # Initial Situation S0 # Function of Situations (aka "Result") where result of situation object "s" and action object "a" equals another situation S' S' = Result(s, a) # Possible # Note: Situation Calculus does not describe Actions applicable in a Situation with predicate Actions of "s" as `Actions(s)`. Instead Situation Calculus talks about Actions that are possible in a state by using a predicate `Poss(a,s)`, which is an Action "a" that is possible in a state Poss(a, s) # Precondition # Note: Form for describing predicates (i.e. Poss(a,s)), which has the form of a Precondition of state "s" that implies it's possible to do Action "a" in state "s" SomePrecond(s) -> Poss # Possibility Axiom for the Action "Fly" # if some "p" exists which is a Plane in state "s" and there's some "x" which is an Airport in state "x" and there's also some "y" which is also an Airport in state "s" and "p" is at Location "x" in state "s", then that implies it's possible to fly "p" from "x" to "y" in state "s" Plane(p,s) ^ Airport(x,s) ^ Airport(y,s) ^ At(p,x,s) -> Poss(Fly(p,x,y)) ``` * **Predicates** * **Fluents** (fluidity or change over time) ``` i.e. # Plane "p" at Airport "x" in Situation "s" At(p,x,s) ``` * **Possibility Axiom** `Poss(...)` * **Successor-State Axiom** `In(...)` * **Describing what Changed and what Does Not Change in Classical Planning vs Situational Calculus** * **Classical Planning** had "Action Schemas" that described one Action at a time and what changed * **Situation Calculus** however does it the other way around by instead of writing one Action, Schema, or Axiom for each Action, we instead do one for each Fluent (Predicate) that can change using a **convention** called **Successor-State Axioms (SSA)** that describe what happens in the state that is a Successor of executing an Action. SSAs have the form of saying: ``` i.e. For All Actions and States, if it's possible to execute Action "a" in State "s", then the Fluent is True if and only if (IFF) Action "a" made it True OR if Action "a" didn't undo it (i.e. 
either it wasn't true before and "a" made it True, or it was true before and "a" didn't stop it from being True) ∀a,s Poss(a,s) -> (fluent true <-> "a" made it true V "a" didn't undo it) ``` * Example: **Successor State Axiom** for the `In` Predicate ``` i.e. For All Actions and States, for which it's Possible to execute Action "a" in Situation "s", and if that's true then the In Predicate holds between some cargo "c" in some plane "p" which is in the state that is the "result" of executing Action "a" in state "s", so the In Predicate will hold, IFF "a" was a Load Action (i.e. if we load the cargo into the plane, then the result of executing that Action "a" is that the cargo is in the Plane "p") OR it might be already true that the cargo was in the Plane "p" in Situation "s", AND Action "a" is not an Unload Action ∀(a,s) Poss(a,s) -> In(c,p,result(s,a)) <-> (a=Load(c,p,x) V (In(c,p,s) ^ "a" != Unload(c,p,x))) # Predicate states valid sentences in FOL like the following to assert the initial state - Plane P1 is at Airport JFK in Situation S0 - For All C, if C is Cargo then that C is at JFK in Situation S0 Initial state: S0 At (P1,JFK,S0) ∀c Cargo(c) -> At(c,JFK,S0) # There Exists a Goal State "s" such that For All "c" if "c" is Cargo, we want all that cargo to be at SFO in state "s" Goal: Ǝs ∀c Cargo(c) -> At(c,SFO,s) ``` * Example: Fly Cargo * Goal: Move all cargo from Airport A to Airport B, regardless of qty of cargo * Note: In FOL, you can express the notion of "All", but NOT in Propositional Languages like Classical Planning Notation
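- A minimal Python sketch (my own encoding, with assumed action/fluent names) of evaluating the Successor-State Axiom for the `In` fluent described above:
```
# Successor-State Axiom sketch:
#   Poss(a,s) -> ( In(c,p,Result(s,a)) <-> a = Load(c,p,x)  V  (In(c,p,s) ^ a != Unload(c,p,x)) )
def in_fluent_holds_after(action, cargo, plane, held_before):
    """Evaluate In(cargo, plane) in the successor situation Result(s, action)."""
    name, *args = action
    made_true = name == "Load" and args[:2] == [cargo, plane]
    undone = name == "Unload" and args[:2] == [cargo, plane]
    return made_true or (held_before and not undone)

# Loading C1 into P1 makes the fluent true; unloading it makes it false again
assert in_fluent_holds_after(("Load", "C1", "P1", "JFK"), "C1", "P1", held_before=False)
assert not in_fluent_holds_after(("Unload", "C1", "P1", "SFO"), "C1", "P1", held_before=True)
```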
- Predicates i.e. At(...)
- Complete Assignment Dfn: when all Variables have values
- Partial Assignment Dfn: when some Variables have values and others do not
-
Theorem Provers - My Research
I did some web research this morning to try and answer my question and prepared the below summary of my findings. Unfortunately I couldn’t find any helpful nested infographics that present the big picture concisely, perhaps this still hasn’t been done before (if you happen to know of any, please share). So to understand how everything fits together better, I guess I’d just have to find the time to read, summarise, and consolidate a lot of different research papers, books, lecture notes from various sources.
Problem Types =========
- CSP - Constraint Satisfaction Problem (CSP) problems
- SAT - Boolean Satisfiability (SAT) problems
- IP - Integer Programming (IP) problems
- LP - Linear Programming (LP) problems
Algorithms =========
- DPLL - Davis–Putnam–Logemann–Loveland (DPLL) algorithm - complete backtracking-based search algorithm
Approaches =========
- ASP - Answer Set Programming (ASP) problem solving approach
Strategies / Techniques =========
- Resolution - Theorem Proving Strategy for Propositional Logic and First-Order Logic, which is the way most modern automated Theorem Provers are implemented
- Unification - Tool of matching two terms so they look the same by determining what variables/functions/constants combine, as an enabler of performing Resolution in FOL
Formats =========
- CNF - Conjunctive Normal Form (CNF) format may be used to represent the inputs to SAT engines/solvers
- STRIPS - formal language of the inputs to the STRIPS planner
Solvers =========
- Randomised Systematic Solvers - ?
- SATzilla - SAT-based and DPLL-based
- PrecoSAT - SAT-based and DPLL-based (related earlier DPLL-based solvers include RelSAT, Siege, and ZChaff)
Planners / Provers =========
- SATPlan - general propositional theorem prover that uses propositional logic instead of FOL. Input is a set of Axiom Schemas
- GraphPLAN - Input is a set of STRIPS-style operators
- Blackbox - planning system that works by converting problems specified in STRIPS notation into SAT problems
- STRIPS (Stanford Research Institute Problem Solver) system - automated planner. STRIPS problems may be translated into Boolean Satisfiability (SAT) problems
- Tweak - non-linear planning
Misc =========
- Ground - Variable-free (i.e. FOL is a lifted version of Ground Propositional Logic where Variables are used)
References: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-825-techniques-in-artificial-intelligence-sma-5504-fall-2002/lecture-notes/Lecture7FinalPart1.pdf
- About:
- Project 3:
- Discussion: https://discussions.udacity.com/c/nd889-logic-reasoning-planning/nd889-implement-planning-search
- Lesson 12 - Probability:
- Reading: Chapter 13 AIMA
- Conditional Probability and Bayes’ Rule (used in Machine Learning Algorithms)
- Bayes’ Rule (from Statistics)
P(X|Y) = P(Y|X) * P(X) / P(Y)
posterior = likelihood * prior / marginal likelihood
where "prior" = prior probability (aka unconditional probability)
where "posterior" = posterior probability (aka conditional probability)
where "|" = given
where P(Y|X) = 1 - P(~Y|X)
where P(Y) equals the "Theorem of Total Probability"
P(Y) = ∑i P(Y|X=i) P(X=i)
(prob. of Y is the sum over all probabilities of Y conditional on X equals i, times the probability of X equals i)
- Example
LHS of Bayes' Rule Given Y (i.e. test result), but we want to know X (probability of have or not have cancer) where P(X|Y) is diagnostic reasoning (from evidence to causes) RHS of Bayes' Rule (turns LHS on its head and correct for the inversion by multiplying by the "prior" of the cause to be be the case in the first place, i.e. whether have cancer or not, and dividing by the probability of the evidence Y, where the divisor is often expanded using the Theorem of Total Probability ) where P(Y|X) is causal reasoning (i.e. hypothetically given we know the cause X, what is probability of evidence Y we just observed)
- Applied to Cancer Question 4 to understand (further below)
P(C|+) = P(+|C) * P(C) / P(+)
       = 0.9 * 0.01 / Theorem of Total Probability
       = 0.9 * 0.01 / (prob. of pos. test and having cancer + prob. of pos. test and not having cancer)
       = 0.9 * 0.01 / (0.9 * 0.01 + 0.2 * 0.99)
       = 0.009 / (0.009 + 0.198)
       = 0.0434
we care about prob. of cancer (cause) conditional on the evidence, which is the positive test result in this case
= probability of seeing pos. test result given have cancer * prior probability of having cancer / prob. of seeing pos. test result
- Question:
1 - Probability of a given population positively having disease X being 3%. 2 - Drug company has test Y to show if person has disease X. 3 - Given probability of negative detection using test Y given a person with disease X being 1% 4 - Given probability of positive detecting using test Y but person not actually having disease X is 10% (false positive) 5 - Find probability person positively has disease X given they took test Y with positive result?
- Solution:
- Translate problem into Probability Notation:
1 - P(X) = 0.03 (prob. with X) so P(~X) = 0.97 (prob. not with X) 3 - P(~Y/X) = 0.01 (prob. Y neg. detect with X) 4 - P(Y/~X) = 0.1 (prob. Y pos. detect but not with X) - P(Y/X) = 0.99 (prob. Y given pos. detect with X) - P(X/Y) = ? (prob. with X given pos. detect)
- Find P(Y/X) first to use with Bayes’ Rule
P(Y/X) = 1 - P(~Y/X) = 0.99 (prob. pos. detect)
- Find P(Y) second using Venn Diagrams where
P(Y) is overlap intersection of X and Y plus
area of Y without the overlap intersection
P(Y) = P(Y/X) P(X) + P(Y/~X) P(~X) = 0.99 * 0.03 + 0.1 * 0.97 = 0.1267
- Substitute into Bayes’ Rule
P(X/Y) = P(Y/X) P(X) / P(Y) = 0.99 * 0.03 / 0.1267 = 0.234 so, 23.4%, mainly since disease X is rare (3% prior in the numerator), so doctors would rely on multiple tests before giving a definitive diagnosis
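- A small Python sketch (my own, not from the lecture) verifying the numbers above with Total Probability and Bayes' Rule:
```
# Verify the disease-test calculation.
p_x = 0.03              # P(X): prior probability of having the disease
p_y_given_x = 0.99      # P(Y|X) = 1 - P(~Y|X) = 1 - 0.01
p_y_given_not_x = 0.10  # P(Y|~X): false positive rate

p_y = p_y_given_x * p_x + p_y_given_not_x * (1 - p_x)  # Theorem of Total Probability
p_x_given_y = p_y_given_x * p_x / p_y                  # Bayes' Rule

print(round(p_y, 4))          # 0.1267
print(round(p_x_given_y, 3))  # 0.234 -> about 23.4%
```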
- Translate problem into Probability Notation:
- Bayes’ Rule (from Statistics)
- Graph Structure of Probability using Bayes’ Networks
- Bayes’ Network
- Dfn: Compact representation of a distribution
over a very large Joint Probability Distribution
of all Variables (i.e. Node)
- Assume that events are discrete (i.e. Binary Events)
- Specify the Bayes’ Network
- Observe values of Variables (Nodes) (i.e. car not start, oil light, etc)
- Compute Probabilities of hypotheses (i.e. probability fanbelt broken)
-
Dfn: Complex structure composed of nodes that correspond to events (known or unknown, random variables). Nodes are linked by Arcs that suggest a Child of an Arc is Influenced by its Parent (perhaps in Probabilistic way but not a Deterministic way) (i.e. battery age older has higher chance of causing battery dead but not clear that every old battery is dead). Describes a way to understand how a car (complex system) doesn’t start, where many variables not immediately measurable, but with sensors that allow understand a bit about state of the car Assists in reasoning from observable variables (i.e. car not start, dipstick value) to hidden causes (i.e. is fanbelt broken?, is battery dead?)
- Usage: Used extensively in most fields of
smart computer systems:
- Diagnostics
- Prediction
- Machine Learning
- Finance
- Robotics
- Computer Vision
- Bayes’ Network Inference
- Advanced AI Techniques that extend from Bayes' Networks
- Particle Filters
- Hidden Markov Models
- MDP’s
- POMDP’s
- Kalman Filters
- Conditional Independence
- Dfn: ??
- D-Separation Concept
- Dfn: ??
- Parameter Counts
- Dfn: ??
- Inference of Bayes’ Networks
- Dfn: ??
- Question: Car not start
- Problem: Given car not start in morning.
- Solution:
- Specify the Bayes’ Network by drawing it (Influence Diagram) graph of possible causes
and show methods of diagnosis
(increase/decrease belief that battery may cause car failure)
and Effects
- Diagnostic Tool
- Check battery age
- Problem:
- Car not start
- Possible Causes:
- Battery ok, but not charging
OR Battery dead
- Possible Causes:
- Alternator broken
- Fanbelt broken
- Starter broken
- Diagnostics
- Battery Meter (Diagnostic Tool)
- Battery age
- Possible Causes:
- Lights
- Possible Causes:
- Lights draining battery
- Diagnostics
- Check if lights on
- Possible Causes:
- Oil
- Possible Causes:
- No oil
- Diagnostics
- Check oil light
- Possible Causes:
- Fuel
- Possible Causes:
- No fuel
- Blocked fuel line
- Diagnostics
- Check fuel gauge
- Possible Causes:
- Battery ok, but not charging
OR Battery dead
- Possible Effects:
- Battery Flat
- Affects:
- Lights (no power)
- Oil light
- Fuel gauge as no power
- Not Affect:
- Oil level dipstick
- Affects:
- No Oil
- Affects:
- Oil level dipstick
- Oil light
- Not Affect: *
- Affects:
- No Gas
- Affects:
- Fuel gauge
- Car not start
- Not Affect: *
- Affects:
- Battery Flat
- Possible Causes:
- Car not start
- Count Total Variable Centers
- Note: Graph Structure and associated Probabilities specify a large Probability Distribution in space of all Variables (i.e. one Variable for each Node)
- Assuming each Variable (Node) is Binary then given 16 Variables, we assume there can be 2^16 different Values
- Probabilities
- Total Probability
P(Y) = ∑i P(Y|X=i) P(X=i) prob. any random variable Y can be written: sum of all possible outcomes i for the random variable X of the: prob. of Y given another random variable X of assumed value i * prob. of X equals i
- Negation of Probabilities
P(¬X|Y) = 1 - P(X|Y)
- Complementary Probability
P(A) = p => P(¬A) = 1 - p
- Independence
- Probability of the Join of any two
Variables is the Product of the Marginals
X ⊥ Y : P(X) P(Y) = P(X,Y) marginals = joint probability
- Usage: When asked for the probability of any combination of a number of coins, look at the probability of each coin individually and multiply them
- Important note: ``` If X and Y are not independent, you can’t use P(X and Y) = P(X) P(Y) (i.e. you can’t replace P(Y|X) with P(Y)).
But if X and Y are independent, you can replace the formula:
P(X and Y) = P(Y/X) P(X) X and Y independent so, P(X) P(Y) = P(Y/X) P(X) so, if I eliminate P(X) in both sides (P(X) not 0) P(Y) = P(Y/X) ```
- Probability of the Join of any two
Variables is the Product of the Marginals
- Product Rule (Joint - Multiplication Rule)
P(X,Y) = P(X and Y) = P(X) * P(Y|X)
- http://www.mathgoodies.com/lessons/vol6/conditional.html
- Dependence
- Given two coins flipped
P(X1 = H) = 0.5 (prob. 1st flip heads) | branch to pick another coin based on 1st flip outcome | `--> H : P(X2 = H | X1 = H) = 0.9 | `--> T : P(X2 = T | X1 = T) = 0.8 i.e. prob. 2nd flip X2 is H given that 1st flip X1 was H is 0.9 (by Conditional Probability) where X1 is outcome of first flip What is probability of 2nd flip being H ? P(X2 = H) = ???
- Solution:
compute probability catering for the complement using Theorem of Total Probability P(X2 = H) = P(X2 = H | X1 = H) * P(X1 = H) + P(X2 = H | X1 = T) * P(X1 = T) = 0.9 * 0.5 + 0.2 * 0.5 = 0.45 + 0.1 = 0.55
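- Quick Python check (my own sketch) of the same Total Probability calculation:
```
# Dependent coin flips: P(X2 = H) via the Theorem of Total Probability.
p_x1_h = 0.5                # P(X1 = H)
p_x2_h_given_x1_h = 0.9     # P(X2 = H | X1 = H)
p_x2_h_given_x1_t = 0.2     # P(X2 = H | X1 = T) = 1 - 0.8

p_x2_h = p_x2_h_given_x1_h * p_x1_h + p_x2_h_given_x1_t * (1 - p_x1_h)
print(round(p_x2_h, 2))     # 0.55
```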
- Given two coins flipped
- Total Probability
- Question: Heads or Tails 1
1 - Given coin, with possible values heads or tails P(H) = 0.5 P(T) = ?
- Assume only two possible outcomes
- Solution
- Recall Bayes’ Rule
P(X/Y) = P(Y/X) P(X) / P(Y) where P(Y/X) = 1 - P(~Y/X)
- Translate problem into Probability Notation:
1- P(H) = 0.5 (prob. with H) where P(T) = 1 - P(H) = P(~H) = 1 - 0.5 = 0.5
- Recall Bayes’ Rule
- Question: Heads or Tails 2
1 - Given coin, with possible values heads or tails P(H) = 0.25 P(T) = ?
- Assume only two possible outcomes
- Solution
- Recall Bayes’ Rule
P(X/Y) = P(Y/X) P(X) / P(Y) where P(Y/X) = 1 - P(~Y/X)
- Translate problem into Probability Notation:
1) P(H) = 0.25 (prob. with H) where P(T) = 1 - 0.25 = P(~H) = 1 - 0.25 = 0.75
- Recall Bayes’ Rule
- Question: Heads or Tails 3
1 - Given coin, with possible values heads or tails Find probability of Heads 3x times in a row P(H) = 0.5 (per flip) P(H,H,H) = ?
- Assume only two possible outcomes
- Assume coin flips independent
- Solution
- Translate problem into Probability Notation:
1) P(H) = 0.5 (prob. with H) * Multiply probabilities since each is an independent event P(H,H,H) = 0.5 * 0.5 * 0.5 = 0.125 12.5%
- Translate problem into Probability Notation:
- Question: Heads or Tails 4
1 - Given coin, with possible values heads or tails Find probability of same side 4x times in a row P(H) = 0.5 (per flip) P(~H) = 0.5 P(X1 = X2 = X3 = X4) = same result Pi(H) = 0.5 ∀i where Xi = result of i'th flip Xi = { H, T }
- Assume only two possible outcomes
- Assume coin flips independent
- Assume each flip has identity and equal probability of heads result
- Solution
- Translate problem into Probability Notation:
```
- Multiply probabilities since each is an independent event
- Already know that probability of 4x Heads is: 0.5 * 0.5 * 0.5 * 0.5 = 0.0625
- Already know that probability of 4x Tails is 0.0625
P(X1 = X2 = X3 = X4) = 0.0625 + 0.0625 = 0.125
12.5%
```
- Question: Heads or Tails 5
1 - Given coin, with possible values heads or tails Find probability of at least 3x heads out of 4x coin flips P(H) = 0.5 (per flip) P( {X1 X2 X3 X4} contains >= 3H )
- Assume only two possible outcomes
- Assume coin flips independent
- Assume each flip has identity and equal probability of heads result
- Solution
- Identify different Sequences where
head occurs at least 3x
HHHH HHHT HHTH HTHH THHH
- Translate problem into Probability Notation:
```
- Already know that probability of 4x Heads is: 0.5 * 0.5 * 0.5 * 0.5 = 0.0625
P( {X1 X2 X3 X4} contains >= 3H ) = 5 * 0.0625 = 0.3125
31.25%
```
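- A small enumeration check in Python (my own sketch) of the result above:
```
# Enumerate all 2^4 equally likely sequences and count those with >= 3 heads.
from itertools import product

sequences = list(product("HT", repeat=4))
favourable = [s for s in sequences if s.count("H") >= 3]
print(len(favourable))                   # 5 (HHHH, HHHT, HHTH, HTHH, THHH)
print(len(favourable) / len(sequences))  # 5 / 16 = 0.3125
```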
- Question: Weather 1
1 - Given weather being sunny or rainy P(D1 = sunny) = 0.9 (prob. sunny day) P(D2 = sunny | D1 = sunny) = 0.8 (prob. sunny day follows a sunny day) P(D2 = rainy | D1 = sunny) = ??? (prob. rainy day follows a sunny day) What is prob. of rainy day given prev. day was sunny day?
- Assume sunny or rainy are only two possible outcomes
- Solution
Using Negation of Probabilities P(¬X|Y) = 1 - P(X|Y) P(D2 = rainy | D1 = sunny) = 1 - P(¬ D2 = rainy | D1 = sunny) = 1 - P(D2 = sunny | D1 = sunny) = 1 - 0.8 = 0.2 20%
- Question: Weather 2
1 - Given weather being sunny or rainy P(D1 = sunny) = 0.9 (prob. sunny day) P(D2 = sunny | D1 = sunny) = 0.8 (prob. sunny day follows a sunny day) P(D2 = rainy | D1 = sunny) = 0.2 (prob. rainy day follows a sunny day) P(D2 = sunny | D1 = rainy) = 0.6 (prob. sunny day follows a rainy day) P(D2 = rainy | D1 = rainy) = ??? (prob. rainy day follows a rainy day) What is prob. of rainy day given prev. day was rainy day?
- Assume sunny or rainy are only two possible outcomes
- Solution
Using Negation of Probabilities P(¬X|Y) = 1 - P(X|Y) P(D2 = rainy | D1 = rainy) = 1 - P(¬ D2 = rainy | D1 = rainy) (replace with complement probability) = 1 - P(D2 = sunny | D1 = rainy) (substitute with negation that we already know) = 1 - 0.6 = 0.4 40%
- Question: Weather 3
1 - Given weather being sunny or rainy P(D1 = sunny) = 0.9 (prob. sunny day) P(D2 = sunny | D1 = sunny) = 0.8 (prob. sunny day follows a sunny day) P(D2 = rainy | D1 = sunny) = 0.2 (prob. rainy day follows a sunny day) P(D2 = sunny | D1 = rainy) = 0.6 (prob. sunny day follows a rainy day) P(D2 = rainy | D1 = rainy) = 0.4 (prob. rainy day follows a rainy day) What is prob. that day 2 is sunny? What is prob. that day 3 is sunny?
- Assume sunny or rainy are only two possible outcomes
- Assume that same Conditional Probability dynamics of D2 followed by D1 apply also to D3 followed by D2
- Solution
Use "Dependence" i.e. Given sunny or rainy P(D1 = sunny) = 0.9 (prob. 1st day sunny) | branch to pick another day based on 1st flip outcome | `--> sunny : P(D2 = sunny | D1 = sunny) = 0.8 | `--> rainy : P(D2 = rainy | D1 = rainy) = 0.4 i.e. prob. 2nd day D2 is sunny given that 1st day D1 was sunny is 0.8 (by Conditional Probability) P(D2 = sunny) = ??? compute probability catering for the complement P(D2 = sunny) = P(D2 = sunny | D1 = sunny) * P(D1 = sunny) + P(D2 = sunny | D1 = rainy) * P(D1 = rainy) = 0.8 * 0.9 + 0.6 * 0.1 = 0.72 + 0.06 = 0.78 now, what is prob. that day 3 is sunny? P(D3 = sunny) = ??? P(D3 = sunny) = P(D3 = sunny | D2 = sunny) * P(D2 = sunny) + P(D3 = sunny | D2 = rainy) * P(D2 = rainy) now, we know from last answer that P(D2 = sunny) = 0.78, and its complement P(D2 = rainy) = 0.22, so substitute: and given the assumptions we know the following are the same (only relevant ones from previous to get the answer here are shown): P(D2 = sunny | D1 = sunny) = P(D3 = sunny | D2 = sunny) = 0.8 P(D2 = sunny | D1 = rainy) = P(D3 = sunny | D2 = rainy) = 0.6 so substitute these too: = P(D3 = sunny | D2 = sunny) * 0.78 + P(D3 = sunny | D2 = rainy) * 0.22 = 0.8 * 0.78 + 0.6 * 0.22 = 0.624 + 0.132 = 0.756 75.6%
- Question: Cancer 1
1 - Probability of given population having cancer C is 1% P(C) = 0.01 (prob. of given population having cancer C) What is prob. of not having the cancer?
- Assume having or not having are only two possible outcomes
- Solution
P(¬ C) = 1 - 0.01 = 0.99 (prob. of given population not having cancer C)
- Question: Cancer 2
1 - Probability of given population having cancer C is 1% 2 - Drug company has test Y to show if person has cancer C (with a probabilistic answer) where P(+|C) - prob. of test Y Positive detection given we have a cancer C where P(-|C) - prob. of test Y Negative detection given we have a cancer C P(C) = 0.01 (prob. of given population having cancer C) P(¬ C) = 0.99 (prob. of given population not having cancer C) P(+|C) = 0.9 (prob. of test Y Positive detection given we have a cancer C) P(-|C) = ??? What is prob. of test Y being Negative given we have a cancer C?
- Assume having or not having are only two possible outcomes
- Solution
recall, Bayes' Rule: P(X/Y) = P(Y/X) P(X) / P(Y) where P(Y/X) = 1 - P(~Y/X)
- Translate problem into Probability Notation:
1- P(C) = 0.01 (prob. of given population having cancer C) 2- P(¬ C) = 0.99 (prob. of given population not having cancer C) 3- P(Y/C) = P(+|C) = 0.9 ?? (prob. Y pos. detect with C present) P(¬ Y/C) = P(-|C) = ?? (prob. Y neg. detect with C present)
- Find P(¬ Y/C) = P(-|C) = ?? first to use with Bayes’ Rule
P(¬ Y/C) = 1 - P(Y/C) = 1 - P(+|C) = 1 - 0.9 = 0.1 (prob. Y neg. detect with C) probability of the test coming out negative given that the cancer is present
- Translate problem into Probability Notation:
- Question: Cancer 3
In addition to that given in Question Cancer 2, assume: P(Y/ ¬ C) = P(+|¬ C) = 0.2 (prob. Y pos. detect with no C present) P(¬ Y/ ¬ C) = P(-|¬ C) = 0.8 (prob. Y neg. detect with no C present) Find the "Joint Probabilities" (i.e. not Conditional): P(Y, C) = P(+, C) = (prob. Y pos. detect and having C) P(¬ Y, C) = P(-, C) = P(Y, ¬ C) = P(+, ¬ C) = P(¬ Y, ¬ C) = P(-, ¬ C) =
- Solution:
```
Substitute values into the Product Rule: P(X,Y) = P(X and Y) = P(X) * P(Y|X)
P(+, C)   = P(+|C) * P(C)     = 0.9 * 0.01 = 0.009
P(-, C)   = P(-|C) * P(C)     = 0.1 * 0.01 = 0.001
P(+, ¬ C) = P(+| ¬ C) * P(¬ C) = 0.2 * 0.99 = 0.198
P(-, ¬ C) = P(-| ¬ C) * P(¬ C) = 0.8 * 0.99 = 0.792
```
- Question: Cancer 4
In addition to that given in Question Cancer 2 and 3, find:
P(C|Y) = P(C|+) = ??? (prob. of having cancer C, given we received a positive test Y diagnosis)
Review prior cases... we know we have a Positive test (i.e. +) where we found
P(+, C) = 0.009
P(+, ¬ C) = 0.198
Noting and using Conditional Probability (Bayes' Rule)
P(X|Y) = P(Y|X) P(X) / P(Y)
P(X|Y) P(Y) = P(Y|X) P(X)
Using the Product Rule
P(X,Y) = P(X and Y) = P(X) * P(Y|X)
Substituting the Product Rule into Conditional Probability:
P(X|Y) = P(X,Y) / P(Y)
Solving using the rearranged form of the Product Rule
P(C|+) = P(C|Y) = P(C,Y) / P(Y)
where P(Y,C) = P(C,Y) = P(Y and C) = P(C and Y)
substitute equivalent values from Q3
where P(+, C) = (prob. Y pos. detect and having C)
where P(Y) = Theorem of Total Probability = (prob. Y pos. detect and either having or not having C) = P(+, C) + P(+, ¬ C)
P(C|+) = P(+, C) / (P(+, C) + P(+, ¬ C))
       = 0.009 / (0.009 + 0.198)
       = 0.043
4.3%
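- Quick Python check (my own sketch) of the posterior from the joint probabilities above:
```
# Posterior from the joint probabilities of Question Cancer 3.
p_pos_and_c = 0.009      # P(+, C)
p_pos_and_not_c = 0.198  # P(+, ¬C)

p_c_given_pos = p_pos_and_c / (p_pos_and_c + p_pos_and_not_c)
print(round(p_c_given_pos, 3))  # 0.043 -> about 4.3%
```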
- Ch 13 Questions
- What are “Query Propositions”? (bottom of actual pg 490)
- Bayes’ Network
- Lesson 13 - Bayes Nets
- Dfn: Bayes’ (Bayesian) Networks (theory of the world)
Constructed and used to make Inferences
about new Situations
to take uncertainty in probability
and marry it with Efficient Structures,
so can see how each Uncertain Variable influences
other Uncertain Variables
- Usage:
- Construct (a Bayes’ Nets)
- Indifference (using Bayes’ Nets)
- Solve using either approach of:
- Summing the Results of All Relevant Situations
- Inference by Sampling (to handle bigger Bayesian Networks)
- Algorithms:
- Monte Carlo Markov Chain
- Usage:
- Bayesian Network - Graphical Representation of Distribution of Two Variables
(A) (NOT OBSERVABLE) | | (B) (OBSERVABLE) where A and B are "Internal Variables" where A means: have or not have cancer where B means: test result pos. or neg. diagnosis where P(A) is "Prior Probability" (i.e. best guess given no other evidence) where P(B | A) and P(B | ¬ A) is prob. of B given different values of A where P(A | B) and P(A | ¬ B) are "Diagnostic Reasonings" (Inverse of Causal Reasoning)
- Question: What is qty of numerical parameters (of underlying probabilities)
required to specify the entire joint prob. between A and B
- Answer:
Recall that "Joint Probability" using the "Independence" formula between two variables A and B is: (i.e. When asked for probability of any combination of a number of A and B, look at probability of each individually and multiple them) A ⊥ B : P(A)(B) = P(A,B) marginals = joint probability Also recall the Product Rule P(A, B) = P(A and B) = P(A) * P(B | A) Parameter 1: A to specify P(A) from which can "derive" P(¬ A) Parameters 2, and 3: A and B to specify P(B | A) and P(¬ A) from which we can "derive" P(¬ B | A) and P(¬ B | ¬ A) Answer: 3x parameters for this Bayesian Network
- Example: Student Job
https://www.youtube.com/watch?v=8dvd80dgsEE
Given ======== Student about to graduate Applied for jobs at 2x companies - G represents Graduated Student - O1 represents Student Received Job Offer from Company 1 - O2 represents Student Received Job Offer from Company 2 G ------> O1 | P(O1 | G) = 0.5 | P(O1 | ¬ G) = 0.05 | |___> O2 P(O2 | G) = 0.75 P(O2 | ¬ G) = 0.25 Question: ========= Calc. prob. that student received offer from Company 2, given we know she received an offer from Company 1.
- Solution
Find the ratio of times that O2 is True when given that O1 is True, to all situations when just O1 is True.
Calculate the answer using the "Approach of Summing the Results of All Relevant Situations"
Note: An alternative approach is to use "Inference by Sampling" (to handle bigger Bayesian Networks)
Note: The Bayes' Network also gives the prior P(G) = 0.9, so P(¬ G) = 0.1

P(O2 | O1) = Sum over G of all situations where O1 and O2 are True
             / Sum over G and O2 of all situations where O1 is True (the Normalising Number)
           = (Sum G) P(O1, O2, G) / (Sum G) (Sum O2) P(O1, O2, G)
           = [P(O1, O2, G) + P(O1, O2, ¬ G)]
             / ( [P(O1, O2, G) + P(O1, O2, ¬ G)] + [P(O1, ¬ O2, G) + P(O1, ¬ O2, ¬ G)] )

For the Numerator and Denominator, we know (substituting values from the Bayes' Network):
P(O1, O2, G)     = P(O1 | G) * P(O2 | G) * P(G)         = 0.5  * 0.75 * 0.9 = 0.3375
P(O1, O2, ¬ G)   = P(O1 | ¬ G) * P(O2 | ¬ G) * P(¬ G)   = 0.05 * 0.25 * 0.1 = 0.00125
P(O1, ¬ O2, G)   = P(O1 | G) * P(¬ O2 | G) * P(G)       = 0.5  * 0.25 * 0.9 = 0.1125
P(O1, ¬ O2, ¬ G) = P(O1 | ¬ G) * P(¬ O2 | ¬ G) * P(¬ G) = 0.05 * 0.75 * 0.1 = 0.00375

P(O2 | O1) = (0.3375 + 0.00125) / ( [0.3375 + 0.00125] + [0.1125 + 0.00375] )
           = 0.33875 / 0.455
           = 0.7445
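- A small Python sketch (my own, assuming the prior P(G) = 0.9 from the video) that performs the same enumeration over the hidden variable G:
```
# Enumerate the hidden variable G to compute P(O2 | O1).
p_g = 0.9
p_o1 = {True: 0.5, False: 0.05}   # P(O1 | G), P(O1 | ¬G)
p_o2 = {True: 0.75, False: 0.25}  # P(O2 | G), P(O2 | ¬G)

def joint(o1, o2, g):
    pg = p_g if g else 1 - p_g
    po1 = p_o1[g] if o1 else 1 - p_o1[g]
    po2 = p_o2[g] if o2 else 1 - p_o2[g]
    return po1 * po2 * pg

numerator = sum(joint(True, True, g) for g in (True, False))
denominator = sum(joint(True, o2, g) for o2 in (True, False) for g in (True, False))
print(round(numerator / denominator, 4))  # 0.7445
```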
-
Complex Bayesian Rule (and Complex Bayes’ Networks)
Bayes Rule
P(A | B) = P(B | A) * P(A) / P(B)
         = easily computable term / hard-to-compute term (but one that does not depend on the assumption for Var A)
Now the "complementary" event is:
P(¬ A | B) = P(B | ¬ A) * P(¬ A) / P(B)
Note: the "Normaliser" is the same P(B) whether we use A on the LHS or ¬ A on the LHS
Since the two events are complementary, we also know that:
P(A | B) + P(¬ A | B) = 1
Compute the "posterior" probability that is not "Normalised" (i.e. whilst ignoring/omitting the "Normaliser" P(B)):
P'(A | B) = P(B | A) * P(A)
P'(¬ A | B) = P(B | ¬ A) * P(¬ A)
where P' is "prime" since it is not a real probability
where the sum P'(A | B) + P'(¬ A | B) equals the "Normaliser" P(B) (the Denominator of the prev. equations)
Now, recover the original probabilities by "Normalising" based on the values P'(A | B) and P'(¬ A | B):
P(A | B) = η * P'(A | B) (actual probability)
P(¬ A | B) = η * P'(¬ A | B)
where η is the "Normaliser" eta (times the non-"Normalised" form) and where:
η = 1 / ( P'(A | B) + P'(¬ A | B) )
Using these computed non-"Normalised" pseudo-probabilities we defer the calculation of the "Normaliser" P(B), and make the calculation much easier
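- A minimal Python sketch (my own) of the deferred-normaliser trick, checked against the earlier cancer numbers:
```
# Compute unnormalised posteriors P' and divide by their sum instead of computing P(B) directly.
def posterior(p_b_given_a, p_a, p_b_given_not_a):
    p_prime_a = p_b_given_a * p_a                 # P'(A | B) = P(B | A) * P(A)
    p_prime_not_a = p_b_given_not_a * (1 - p_a)   # P'(¬A | B) = P(B | ¬A) * P(¬A)
    eta = 1 / (p_prime_a + p_prime_not_a)         # normaliser; the sum equals P(B)
    return eta * p_prime_a, eta * p_prime_not_a

# e.g. the earlier cancer numbers: P(C) = 0.01, P(+|C) = 0.9, P(+|¬C) = 0.2
print([round(p, 3) for p in posterior(0.9, 0.01, 0.2)])  # [0.043, 0.957]
```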
- Absolute Independence (aka Independence) Bayes Network
A          B
- Dfn: A ⊥ B where two nodes not connected at all
- Note: Absolute Independence DOES NOT imply Conditional Independence
- Conditional Independence Bayes Network
A ------> B | |___> C
- Dfn: B ⊥ C | A where the two measurement nodes B and C are “conditionally independent” given the hidden cause A that caused the two different, independent measurements (i.e. once we know the common parent A)
- Note: Conditional Independence DOES NOT imply Absolute Independence (i.e. if A is not known and test result B returns pos., this raises the probability of the subsequent test result C also being pos.)
- Confounding Cause Bayes Network
A -----, |--- C B _____/
-
Dfn: (two independent hidden causes confounded within a single observational variable)
-
Observing C induces a dependence between A and B
-
Example: Two Tests for Cancer (Complex) and Conditional Independence
-
Conditional Independence Dfn: Given C, T1, and T2 in below diagram,
-
T1 ⊥ T2 | C (i.e. T1 is “conditionally independent” of T2 given we know C) - Question:
- If we did not know C (i.e. knowing whether we had Cancer)
and if T1 is “conditionally independent” of T2,
does that imply that T1 and T2 are “independent” of each other?
- No, since getting a pos. test result with T1 of a specific cancer C, gives us info about whether we have cancer or not (i.e. raises probability of having cancer). With that increased probability outcome, we will predict that a subsequent test T2 will have a higher probability of giving a pos. test result (than if we didn’t do T1 first), so they are NOT “independent”
-
Given, C (Not Observable) / \ T1 ___ | T2 where C is Cancer where T1 is Test 1 where T2 is Test 2 P(C) = 0.01 (prior probability of Cancer) P(+ | C) = 0.9 (probability of receiving positive test result for either test) P(- | ¬ C) = 0.8 (prob. of neg. test result for either test, i.e. cancer free) Compute all other probabilities using the above: P(¬ C) = 0.99 P(- | C) = 0.1 P(+ | ¬ C) = 0.2 Assuming both T1 and T2 results a pos., i.e. P(C | T1 = +, T2 = +) = P(C | + +) = ?? What is probability of cancer now? Answer: Use the trick to compute: prior + + P' P(C|++) C 0.01 0.9 0.9 0.0081 0.1698 ¬ C 0.99 0.2 0.2 0.0396 0.8301 ------ 0.0477 "Normaliser" (use non-"Normalised" Bayes Rule and multiple 0.01 * 0.9 * 0.9 to get non-"Normalised" probability P' of having cancer C given a TWO pos. test results from T1 and T2 of 0.0081) P' values are not probabilities. Add the P' values of 0.0081 and 0.0396 we get 0.0477. Lastly we divide (i.e. "Normalise") the non-"Normalised" probabilities 0.0081 and 0.0396 by the Factor 0.0477 to get the correct "posterior probability" answers P(C|++) = 0.1698 Other notes: Hidden Variable C, causes the still Stochastic test results T1 and T1 We assumed T1 and T2 are identically distributed (i.e. use same 0.9 for each) We assumed T1 and T2 were "Conditionally Independent" i.e. if knew for certain that had cancer C, then knowing T1 would not help us make a statement about T2 We assumed that probability of T2 given C and T1, which is same as probability of T2 given C i.e. P(T2 | C, T1) = P(T2 | C) (this is known as "Conditional Independence", where if we knew the value of cancer variable C for sure, then T2 would be "independent" of T1, but it's "Conditionally Dependent" since independence only holds True if we actually know C, since C separately causes T1 and T2 as shown in the diagram) What is the prob. of second T2 test result being pos. if we know that first T1 test result outcome was pos. ? P(T2 = + | T1 = +) = ??? Solving using Bayes Rule P(+2 | +1) = P(+1 | +2) P(+2) Solve using "Theorem of Total Probability" (note that everything is conditional on +1) (i.e. Denominator fn RHS in Bayes Rule) P(T2 = + | T1 = +) = P(+2 | +1, C) * P(C | +1) + P(+2 | +1, ¬ C) * P(¬ C | +1) Note that P(+2 | +1, C) is an example of "conditional independence" since 1st test T1 gives no more information to 2nd test T2, as it only gives more info if C was unknown (which is not the case here), so we can re-write it as P(+2 | C), since we know now that P(+2 | +1, C) = P(+2 | C) So, simplifying the equation, by removing +1 from P(+2 | +1, C), since it doesn't give any more info = P(+2 | C) * P(C | +1) + P(+2 | ¬ C) * P(¬ C | +1) Substitute with values we already know: = 0.9 * P(C | +1) + 0.2 * P(¬ C | +1) Use the same trick used previously to compute P(C | +1) and P(¬ C | +1), but with only one test T2, instead of both of them: prior + P' P(C|+) C 0.01 0.9 0.009 0.009/0.207 = 0.043 ¬ C 0.99 0.2 0.198 0.198/0.207 = 0.957 ------ 0.207 "Normaliser" (use non-"Normalised" Bayes Rule and multiple 0.01 * 0.9 to get non-"Normalised" probability P' of having cancer C given one pos. test result T1 of 0.009) P' values are not probabilities. Add the P' values of 0.009 and 0.198 we get 0.207. Lastly we divide (i.e. "Normalise") the non-"Normalised" probabilities 0.009 and 0.198 by the Factor 0.207 to get the correct "posterior probability" answers P(C | +1) = 0.043 P(¬ C | +1) = 0.957 = 0.9 * 0.043 + 0.2 * 0.957 = 0.2301 = 0.230 (i.e. if 1st test returns pos. then expect 2nd test to return pos. 
with probability of 23% after seeing the pos. test result of the 1st test, which is an increase over the default probability P(T2 = +) of the 2nd test returning pos. without a 1st test being performed, of about 20.7%, which is the "Normaliser" of Bayes Rule calculated above, 0.207 )
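- A small Python sketch (my own) verifying both results, assuming conditional independence of T1 and T2 given C:
```
p_c, p_pos_given_c, p_pos_given_not_c = 0.01, 0.9, 0.2

# P(C | +, +): unnormalised weights = prior * likelihood of each test
w_c = p_c * p_pos_given_c * p_pos_given_c
w_not_c = (1 - p_c) * p_pos_given_not_c * p_pos_given_not_c
print(round(w_c / (w_c + w_not_c), 4))  # 0.1698

# P(T2 = + | T1 = +) = P(+|C) P(C|+1) + P(+|¬C) P(¬C|+1)
w1_c = p_c * p_pos_given_c
w1_not_c = (1 - p_c) * p_pos_given_not_c
p_c_given_pos1 = w1_c / (w1_c + w1_not_c)
print(round(p_pos_given_c * p_c_given_pos1 + p_pos_given_not_c * (1 - p_c_given_pos1), 3))  # 0.23
```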
-
- Example: Happiness (using Confounding Cause)
R -----, |--- H S _____/ where happiness H (or unhappy) is driven by two causes, whether it's sunny S or we got a job raise R where the Specification of the Probability Distribution is: P(S) = 0.7 P(R) = 0.01 P(H | S, R) = 1 P(H | ¬ S, R) = 0.9 P(H | S, ¬ R) = 0.7 P(H | ¬ S, ¬ R) = 0.1 What is: P(R | S) = ??? Note that S and R are independent causes with no connecting path other than through H (which is unobserved), so observing S tells us nothing about R and P(R | S) = P(R) (i.e. prob. of raise given sunny is same as prob. of raise given any weather) P(R | S) = P(R) = 0.01
- Explaining Away (special instance of Bayes Network reasoning)
R -----, |--- H S _____/
If we know we are happy H, then sunny S weather can "explain away" the cause of happiness H.
If we also know that it's sunny S weather, then the probability that we received a raise R DECREASES (since the "observer" can already "explain away" the happiness H as likely being because it's sunny S, if they first observe that it's sunny outside).
So "seeing" one of the causes R or S first can "explain away" any other potential cause of the effect H.
1) What is: P(R | H, S) = ??? (prob. of raise given that I'm happy and it's sunny)
expanding the term using Bayes Rule and using Total Probability:
P(R | H, S) = P(H | R, S) * P(R | S) / P(H | S)
now simplifying P(R | S) into just P(R) since S carries no information about R:
            = P(H | R, S) * P(R) / P(H | S)
now expand the Denominator by folding in R and ¬ R:
            = P(H | R, S) * P(R) / ( P(H | R, S) * P(R) + P(H | ¬ R, S) * P(¬ R) )
substitute values from the tables:
            = 1 * 0.01 / (1 * 0.01 + 0.7 * 0.99)
            = 0.0142
2) What is: P(R | H) = ??? (prob. of raise given we're observed as being happy, but we do not know anything about the weather)
compute the auxiliary variable P(H) as the combination of the different variables that can lead to happiness:
P(H) = P(H | S, R) * P(S, R) + P(H | ¬ S, R) * P(¬ S, R) + P(H | S, ¬ R) * P(S, ¬ R) + P(H | ¬ S, ¬ R) * P(¬ S, ¬ R)
using "independence" to replace the Joint Probabilities with Marginals:
     = P(H | S, R) * P(S)P(R) + P(H | ¬ S, R) * P(¬ S)P(R) + P(H | S, ¬ R) * P(S)P(¬ R) + P(H | ¬ S, ¬ R) * P(¬ S)P(¬ R)
now substituting in values we know:
     = 1 * (0.7 * 0.01) + 0.9 * (0.3 * 0.01) + 0.7 * (0.7 * 0.99) + 0.1 * (0.3 * 0.99)
     = 0.007 + 0.0027 + 0.4851 + 0.0297
     = 0.5245
now using Bayes Rule to substitute into the initial equation with values we already know:
P(R | H) = P(H | R) * P(R) / P(H) = P(H | R) * 0.01 / 0.5245
now find out what P(H | R) is using "Total Probability", and substitute the numbers:
P(H | R) = P(H | R, S) * P(S) + P(H | R, ¬ S) * P(¬ S)
         = 1 * 0.7 + 0.9 * 0.3
         = 0.7 + 0.27
         = 0.97
now substitute this result back into the equation to find P(R | H):
P(R | H) = 0.97 * 0.01 / 0.5245 = 0.0185
3) What is (the extreme case for making the raise R likely), solved similarly to before:
P(R | H, ¬ S) = P(H | R, ¬ S) * P(R | ¬ S) / P(H | ¬ S)
              = 0.9 * 0.01 / ( P(H | R, ¬ S) * P(R) + P(H | ¬ R, ¬ S) * P(¬ R) )
              = 0.009 / (0.9 * 0.01 + 0.1 * 0.99)
              = 0.0833
so now it is clear that if we don't know about the weather, there is a higher chance that the raise R caused the happiness H:
P(R | H) > P(R | H, S) i.e. 0.0185 > 0.0142
and if we observe it to not be sunny, then it is much more likely that the cause of the happiness is a job raise:
P(R | S) < P(R | H, ¬ S) i.e. 0.01 < 0.0833
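- A small Python sketch (my own) reproducing the three "explaining away" numbers from the tables above:
```
# Happiness network: P(S), P(R), and P(H | S, R) as given in the notes.
p_s, p_r = 0.7, 0.01
p_h = {(True, True): 1.0, (False, True): 0.9, (True, False): 0.7, (False, False): 0.1}  # P(H | S, R)

def prior(value, p):  # P(var) if value else P(¬var)
    return p if value else 1 - p

# P(R | H, S): raise largely "explained away" by the sunny weather
num = p_h[(True, True)] * p_r
den = num + p_h[(True, False)] * (1 - p_r)
print(round(num / den, 4))  # 0.0142

# P(R | H): weather unknown
p_h_marg = sum(p_h[(s, r)] * prior(s, p_s) * prior(r, p_r) for s in (True, False) for r in (True, False))
p_h_given_r = sum(p_h[(s, True)] * prior(s, p_s) for s in (True, False))
print(round(p_h_given_r * p_r / p_h_marg, 4))  # 0.0185

# P(R | H, ¬S): not sunny, so the raise becomes the more likely cause
num = p_h[(False, True)] * p_r
den = num + p_h[(False, False)] * (1 - p_r)
print(round(num / den, 4))  # 0.0833
```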
- Bayes Networks (General Definition)
- Dfn: Define Probability Distributions over Graphs of Random Variables (i.e. A, B, C, D, E)
A B \ / \ / * * C / \ / \ * * D E
- Bayes Network is defined by prob. distrib. inherent to
each individual node
i.e. nodes A and B are defined as the following since A and B have no incoming arcs P(A) P(B) i.e. node C is defined as "condition distribution" (conditioned on A and B) P(C | A, B) i.e. node D and E are conditioned on C P(D | C) P(E | C) The "Joint Probability" represented by a Bayes Network is the product of various Bayes Network probabilities defined over individual nodes, where each node's probability is only conditioned on the incoming arcs P(A, B, C, D, E) = P(A) * P(B) * P(C | A, B) * P(D | C) * P(E | C) Required Variables (of Bayes Network vs Combinatorial Approach) ================== Advantage of using Bayes Networks here is that only requires 10 values (so compactness of Bayes Network leads to representation that scales much better on LARGE Networks, than using the Combinatorial Approach that goes through all combinations of variable values) required parameters 1 1 4 2 2 (A and B binary) (C binary) (C binary) Using a Combinatorial Approach (unstructured representation) the "Joint Distribution" over any 5x variables would require all combinations of variable values required parameters (2 ^ 5) - 1 = 31 (probability values) Noting that any variable that has K inputs, requires 2 ^ K search variables
- Question
How many probability values are required to specify this Bayes Network? A / | \ / | \ * * * B C D \ \ / \ \ / * * E F Answer: P(A, B, C, D, E, F) = P(A) * P(B | A) * P(E | B) * P(C | A) * P(D | A) * P(F | C, D) 1 2 2 2 2 4 Summing up the qty of required parameters (search variables) is 13 (i.e. 13 numeric probabilities are required to specify this Bayes Network)
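- A minimal Python sketch (my own) of the parameter-counting rule for networks of boolean variables:
```
# Each node contributes 2^len(parents) CPT parameters when all variables are boolean.
def parameter_count(parents):
    return sum(2 ** len(p) for p in parents.values())

five_node = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"]}
six_node = {"A": [], "B": ["A"], "C": ["A"], "D": ["A"], "E": ["B"], "F": ["C", "D"]}
print(parameter_count(five_node))  # 10 (vs 2^5 - 1 = 31 for the unstructured joint)
print(parameter_count(six_node))   # 13
```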
-
D-Separation
-
Rule: Any two variables are “independent” if they are not linked by just unknown variables i.e. If know B, then any variable downstream of B becomes “independent” of anything upstream of B
Given the following Bayes Network A / \ / \ * * B D / \ / \ * * C E Are the following "independent" of each other: C ⊥ A NO, since A influence C by virtue of B, but if we know B, then A becomes independent of C. So the only determinant of C is B, if you know B for certain since in this case A won't tell C anything B hasn't already told it. C ⊥ A | B YES C ⊥ D NO, same as above C ⊥ D | A YES, since if C already knows A, then D won't tell C anything it doesn't already know E ⊥ C | D NO E is not "independent of C | D Knowledge of B does NOT render A and E independent
-
Example 2
Given the following Bayes Network A B \ / \ / * * C / \ / \ * * D E Are the following "independent" of each other? A ⊥ E NO, since through C we can see influence of A to E A ⊥ E | B NO, since A still influences C despite knowing B A ⊥ E | C YES, since if know C the influence is cut off so A cannot influence E if we know C A ⊥ B YES, different entry variables with no arcs A ⊥ B | C NO, since given C, A and B become "dependent" due to the "Explain Away" effect (Knowledge of variable C, renders previously "independent" variables as "dependent") (i.e. since if know C to be True, then knowledge of A will substantially affect what we believe about B) (i.e. if there are two joint causes for C, i.e. A and B, and if we know A is True, then we will discredit cause B, and conversely, if we know A is False, it will increase our belief that B is True)
-
-
Lesson 14 - Bayes Networks using Probabilistic Inference (i.e. how to answer questions we’re interested in, given certain inputs) * This Lesson 14: * Probabilistic Inference * Dfn: How to answer Probability questions using Bayes Networks
* Previous Lesson 13: * Probability Theory of how a Bayes network can concisely represent a "joint probability distribution", including representation of "independence" between variables * Example: Earthquake ``` Given Bayes Network graph (B)urglary (E)arthquake \ / \ / * * (A)larm / \ / \ / \ * * (J)ohn calls (M)ary calls "Inference" question: What are outputs given some inputs ? "Probabilistic Inference" Terminology: "Evidence" variables = Inputs = B and E (i.e. variables we know value of) "Query" variables = Outputs = J and M (i.e. variabes we want to find value of) Important Note: "Probabilistic Inference" outputs are not single numbers, instead outputs are a "Probability Distribution", so the answer will be a Complete "Joint Probability Distribution" over the "Query" variables, which is known as the: "Posterior Distribution" (given the "Evidence") =============================================== P(Q1, Q2 ... | E1 = e1, E2 = e2) (i.e. "Posterior Distribution" of one or more "Query" variables given the exact values of the zero or more "Evidence" variables) "Hidden" variables = Internal = (i.e. variables that are neither "Evidence" nor "Query" that need to compute with internally) Maximisable "Query" values (Q), given the "Evidence" values: =========================================================== argmax q P(Q1 = q1, Q1 = q2, ... | E1 = e1, ...) Note: The most likely explanation out of all the possible values for all the "Query" variables, which combination of values has the highest probability Note: In ordinary programming, each function goes only one way: * Input variables * Computation * Output result variable(s) However, in Bayes Networks, we can go MULTIPLE DIRECTIONS i.e. CAUSAL DIRECTION FLOW - giving as "Evidence" the root nodes of the tree - asking as "Query" values the nodes at the bottom of the tree REVERSE CAUSAL DIRECTION FLOW - giving as "Evidence" the J and M variables - asking as "Query" variables the B and E variables or any other combination... OTHER - giving as "Evidence" the M variable - asking as "Query" variables the B and J variables ``` * Question ``` Situation where Mary (M) has called to report Alarm (A) is going off, and we want to know whether or not a Burglary (B) occurred. Show what nodes are "Evidence", "Hidden" or "Query" nodes B - "Query" E - "Hidden" A - "Hidden" J - "Hidden" M - "Evidence" (known) ``` * **Inference on Bayes Networks** https://youtu.be/nJqloQLLYaI?t=3m25s * **Enumeration** Method * Dfn: Goes through all possibilities, adds them up, and derives answer ``` Question: What is probability that the Burglar Alarm occurred given that John called and Mary called? Answer: Use the definition of "Conditional Probability" and perform "Inference" by "Enumeration" on the Belief network P(Q | E) = P(Q, E) / P(E) P(+b | +j, +m) ("conditional probability") = joint probability distribution of all three variables / conditionalised variables = P(+b, +j, +m) / P(+j, +m) (using shorter notation) Note: here we "Normalise" by the Denominator P(+j, +m) (re-written without any "conditional probability") where P(B = True) = P(+e) = 1 - P(¬ e) = P(e) now enumerate all "atomic" probabilities and calculate the sum of products the Numerator can be determined by Enumerating an expression with all possible values of the Hidden variables (i.e. 
in this case E and A) P(+b, +j, +m) = ∑ "e" ∑ "a" P(+b, +j, +m, e, a) (note how only b, j, and m are fixed as +) (sum over all values for "e" and "a" where only two values for each since Boolean, we ask what is the probability of this unconditional term) to get the values of the "atomic" events, we re-write the equation in a form that corresponds to the "conditional probability" tables that we have associated with the Bayes Network re-write the part after the ∑ "e" ∑ "a", as being an expression in terms of the Parents of each of the nodes in the network, giving a product of 5x terms, which we then sum over all values of Hidden variable values for "e" and "a" = ∑ "e" ∑ "a" P(+b) * P(e) * P(a | +b,e) * P(+j | a) * P(+m | a) = ∑ "e" ∑ "a" f(e, a) = f(+e, +a) + f(+e, ¬a) + f(¬e, +a) + f(¬e, ¬a) now expand and substitute +/¬e and +/¬a for each respectively: = P(+b) * P(+e) * P(a | +b,+e) * P(+j | +a) * P(+m | +a) + P(+b) * P(+e) * P(¬a | +b,+e) * P(+j | ¬a) * P(+m | ¬a) + P(+b) * P(¬e) * P(+a | +b,¬e) * P(+j | +a) * P(+m | +a) + P(+b) * P(¬e) * P(¬a | +b,¬e) * P(+j | ¬a) * P(+m | ¬a) (sum of f, for all values of "e" and "a") (i.e. the sum of 4x terms, where each of the terms is the product of 5x numbers now, we get the numbers to substitute into this equation from the Conditional Probability Tables from our Model Question: Given the case where "e" is pos. and "a" is pos., to lookup in the Conditional Probability Tables https://youtu.be/nJqloQLLYaI?t=3m25s and substitute the values for each of the 5x terms, then multiple them together, and to determine their product ``` * **Efficient Inference Technique #1 - Faster technique for doing Inference on Bayes Networks that involves "Pulling out terms" from the Enumeration** * Reasoning: * e.g. Network 1: If had 5x Hidden variables, there'd only be 32 rows in Conditional Probability Table to sum up * e.g. Network 2: If had 27x Hidden variables (if all Boolean), gives 100 million rows to sum out, but since some aren't Boolean, so representing the whole enumeration, we'd have to sum over a quadrillion rows, so we need methods that are faster than enumerating everything * e.g. ``` given: ∑ "e" ∑ "a" P(+b) * P(e) * P(a | +b,e) * P(+j | a) * P(+m | a) here the prob. of "b" is the same for all values of "e" and "a" so we can take that out of the equation and have less work to do (i.e. multiply by P(+b) once, instead of having it in each row of the table) also, here the prob. of "e" does not depend on the summation of "a" so we can move it left outside the summation of "a", resulting in less work so the inner loop of the summation has less terms P(+b) ∑ "e" P(e) ∑ "a" P(a | +b,e) * P(+j | a) * P(+m | a) so we've reduced the cost of doing each row of the table, but still have same qty of rows in the table, so we'll need to try a different technique #2 ``` * **Efficient Inference Technique #2 - "Maximise Independence"** of variables, in order to reduce the number of rows in the Conditional Probability Table * Note: The structure of a Bayes Network determines the efficiency for performing "inference" on it ``` i.e. 
1) Linear Bayesian Network: - with string of variables X1 through to Xn O(n) here "inference" can be done in a timeframe that's proportional to the number "n" 2) "Complete" Network: - where every node points to every other node O(2^n) could take up to 2^n timeframe, if all "n" variables are Boolean variables 3) Example Alarm Network: - where had all "independence" relations represented in the structure of the network - BUT if we put the nodes in different order, we'd end up with different structure i.e. change to order with J node first then M node second (i.e. John calls then Mary calls). now, given just these two nodes J and M, is M node "dependent" or "independent" of the the J node ANSWER: "dependent" (since here we don't know that the alarm had occurred) J -------* M \ / \ / \ / * * A / \ / \ / \ * * B ------* E If we add A to the network, it is "dependent" on both M and J If we add B to the network, it's just "dependent" on A (i.e. B is "independent" of J and M, given A) If we add E to the network, it's "dependent" on both A and B (since if E did occur, then its more likely that A would go off, and since if B did occur that would explain why A is going off, and would mean E is less likely) ``` * **Causal Direction** * Bayes Networks easier to do "inference" on when they are written in the "Causal Direction" (i.e. when networks flow from "causes" to "effects" * **Efficient Technique #3 - "Variable Elimination"** * Reasoning: ``` previously to evaluate this equation, we first joined up the whole Joint Distribution, before Summing over the Hidden Variables "e" and "a", which is SLOW, since we end up repeating most of the work ∑ "e" ∑ "a" (whole joint distribution) = ∑ "e" ∑ "a" P(+b) * P(e) * P(a | +b,e) * P(+j | a) * P(+m | a) ``` * Dfn: * Faster that doing "inference" by Enumeration in most cases, but is still a difficult computation (its NP-hard to do "inference" on Bayes Networks in general) * Requires an Algebra for Manipulating Factors (names for multi-dimensional arrays that come out of these probabilistic terms) * Example ``` Given a Bayes Network with 3x Boolean variables: where T "dependent" on R where L "dependent" on T R --------* T --------* L (raining) (traffic) (late for appointment) Conditional Probability Tables - for each variable ============================== P(R) P(T | R) P(L | T) +r 0.1 +r +t 0.8 +t +l 0.3 -r 0.9 +r -t 0.2 +t -l 0.7 -r +t 0.1 -t +l 0.1 -r -t 0.9 -t -l 0.9 now, use "inference" to determine answer to questions: QUESTION: Am I going to be Late ? ANSWER (using "Enumeration" ok for this simple example): by going through all possible values for "r" and "t", and then Summing the product of the 3x nodes P(+l) = ∑ r ∑ t P(r) P(t | r) P(+l | t) ANSWER (using "Variable Elimination" alternative for complex examples): Variable Elimination Steps ==================== NOTE: (continued process of joining together "factors" to form a larger factor, and then eliminating variables by summing out. importantly, if we apply a good order by which we apply the operations in this process, then Variable Elimination can be much more efficient that just doing the whole "Enumeration") STEP 1 - starts with a big network and perform "Joining Factors" where each "factor" is one of the Conditional Probability Tables (multi-dimensional matrix) choose two or more of the "factors" (i.e. R and T) combine them together to form a NEW "factor" called RT that represents the "Joint Probability" of all the variables (i.e. 
"r" and "t") in that "factor" P(R | T) +r +t 0.1 * 0.8 = 0.08 +r -t 0.1 * 0.2 = 0.02 -r +t 0.9 * 0.1 = 0.09 -r -t 0.9 * 0.9 = 0.81 so we converted from before to after BEFORE R --------* T --------* L AFTER R T --------* L STEP 2 - eliminate some variables (aka Summing Out, aka Marginalisation) the objective of this step is to take the new Table called RT, and REDUCE it P(R | T) P(L | T) +r +t 0.08 +t +l 0.3 +r -t 0.02 +t -l 0.7 -r +t 0.09 -t +l 0.1 -r -t 0.81 -t -l 0.9 STEP 3 - compute by summing out (aka marginalising) over the variable "r" to give us a P(T) Table that just operators on T, which has two entries, +t and -t. where +t is formed by "summing out" all the entry values in the P(R | T) Table for all values of "r" for which we have +t, and form the -t entry the same way ("combine" parts of the big network into "smaller parts", so we have a smaller network to deal with) P(T) +t 0.08 + 0.09 = 0.17 -t 0.02 + 0.81 = 0.83 so we converted from before to after BEFORE R T --------* L AFTER T --------* L so now we have Probability Tables: P(T) P(L | T) +t 0.17 +t +l 0.3 -t 0.83 +t -l 0.7 -t +l 0.1 -t -l 0.9 STEP 5 - continue "combining" by performing a "Join" over T and L variables using point-wise multiplication to give a new Table with "Joint Probability" P(T, L) and then "Enumerate" over it now we are down to a Single Node with this "Joint Probability" table: P(T, L) +t +l 0.17 * 0.3 = 0.051 +t -l 0.17 * 0.7 = 0.119 -t +l 0.83 * 0.1 = 0.083 -t -l 0.83 * 0.9 = 0.747 STEP 6 - finally, "Sum Out" to give us just the L node P(L) +l 0.051 + 0.083 = 0.134 -l 0.119 + 0.747 = 0.866 ``` * **"Sampling" approach to achieve Approximate Inference** * Advantage 1 - of "Sampling" over "Inference" is that we know a procedure for coming up with at least an approximate value for the "Joint Prob. Dist.", as opposed to EXACT "Inference" where the computation may be very complex * Advantage 2 - If do not know the numeric values of the "Cond. Prob. Tables" are, but we can Simulate the process, then we can still proceed by using "Sampling", whereas we could not proceed in this situation using EXACT "Inference" * Example: ``` Given a "Joint Probability Distribution" i.e. distribution of Heads or Tails over two coins 1c and 5c STEP 1 - Build "Probability Table" 1c 5c ========= h h I h t IIII t h II t t I STEP 2 - Start counting by "Sampling" i.e. flip the coins, and add the result to the relevant row in the Table STEP 3 - Repeat "Sampling" process until enough counts until we are close to the True distribution so that we can "Estimate" the "Joint Prob. Dist" by inspecting the count tally If only a few "Samples" then results may not be accurate (random variation that prevents them from converging to true values) ``` * Example: Using "Sampling" to do "Inference" and **Rejection Sampling** ``` Given 4x boolean variables Cloudy / \ / \ / \ * * Sprinkler Rain * * \ / \ / \ / Wet Grass And given "Condit. Prob. Tables"; P(S | C) =============== +c +s 0.1 " -s 0.9 -c +s 0.5 " -s 0.5 P(W | S, R) =============== +s +r +w 0.99 " " -w 0.01 " -r +w 0.9 " " -w 0.1 -s +r +w 0.9 " " -w 0.1 " -r +w 0.01 " " -w 0.99 P(R | C) ============== +c +r 0.8 " -r 0.2 -c +r 0.2 " -r 0.8 "inference" over this network using "Sampling" STEP 1: Choose Variable where all Parents are defined i.e. Cloudy is the only such variable Conditional Prob. 
Table P(C) +c 0.5 -c 0.5 STEP 2: "Sample" from P(C) for variable C generate a random sample number, and let's say it comes up as a +c value, which is defined in P(C) Table now, choose another variable from a Parent, say S, and then look at rows of P(S | C) Table for which Cloudy (the Parent) is pos. +c, where we see +c +s is 0.1 and +c -s is 0.9 STEP 3: "Sample" from P(S | C) for variable S generate a random sample number by "sampling" for S, and let's say it comes up -s now for the Rain variable R, go to P(R | C) Table, find row where the Parent C is pos +c (since we got a +c sample before), i.e. +c +r 0.8 and +c -r 0.2, STEP 4: "Sample" from P(R | C) for variable R generate a random sample number, and let's say it comes up as a +r value now, choose final Parent W, and look at rows of P(W | S, R) Table for which Which rows of P(W | S, R) should we be considering ? since previously we generated "Samples" of +c, -s, and +r we choose rows with: -s +r +w -s +r -w Is it more likely to have a +w or a -w ? +w (since it has higher value % of 0.9 than -w does) STEP 5: Write down our Complete "Sample" based on outcomes of choosing random values and working our way through the network to see if + or - for each variable Samples: +c, -s, +r, +w Note: prob. of choosing a particular variable value (i.e. +w vs -w), depends on the Parent variable values, which are chosen according to the Conditional Probability Tables (i.e. P(C), P(S | C), P(R | C), and P(W | S, R) ). So, in the count of each Sampled variable will approach the true probability (i.e. with infinite qty of Samples then this procedure computes the True "Joint Probability Distribution", such that the Sampling method is "Consistent". Use this kind of Sampling to compute the Complete "Joint Probability Distribution" OR use this kind of Sampling just to compute the value of an Individual Variable ``` * **Rejection Sampling** Procedure *Dfn: ``` "Rejection Sampling" Procedure ==================== If want to compute a Conditional Probability like P(w | -c), then the Sample we got previously (i.e. +c, -s, +r, +w) wouldn't be helpful at all, since it had to do with it being Cloudy +c, but not NOT Cloudy -c, so we would "Reject the Sample" by crossing it off the list of Samples (repeat rejecting for any other Sample that doesn't match conditional probabilities we're interested in, but keeping those Samples that do match). ``` * Issues: * If the evidence is unlikely, then most of the Samples are Rejected. Overcome this issue by using the **Likelihood Weighting** Technique ``` e.g. given the Burglary and the Alarm B ----------* A want to compute probability of a B given that the alarm A goes off P(B | +a) Issue: B's are infrequent Procedure for generating Samples: ============= 1 Generate a B, 2 Results are added to Samples list 3 Go back and check if results match what we want (i.e. does -b, -a match P(B | +a)? ) 4 Reject the Sample if no match 5 Repeat from 1 to 5 i.e. -b, -a REJECT -b, -a REJECT +b, -a KEEP -b, -a REJECT ... ``` * **Likelihood Weighting** Technique for Sampling * Dfn: Generates Samples so we can keep them all and do not have to Reject any, by **fixing** the value of the "Evidence"(input) variables to be what we want to compute (i.e. if we want to compute P(B | +a), then we'd fix the variable value of A to always be +a when generating Samples. So we'd get Samples i.e. -b, +a -b, +a +b, +a ... 
* Issue: **Fixing** values in Samples causes results that are "Inconsistent", but we can overcome this by applying **Likelihood Weighting** * Solve: Fix this issue by assigning a **"Probabilistic Weight"** to each Sample * Example: 1 ``` Say we want to compute P(R | +s, +w) (prob. of Rain given that the Sprinklers are on and the Wet Grass is wet) 1 Pick random value for Cloudy C of +c 2 Pick random value for Sprinkler S, but we are "Constrained" to always choosing +s (since want it to match what we want to compute) so select from P(S | C) Table the row +c +s, which as prob. of only 0.1, so we insert that prob. into the "Weight" 3 Pick random value for Rain R, (no constraints of matching with what we want to compute) by selecting row from P(R | C) Table with +c (prev. chosen) and selecting say +r, so row has +c +r with prob. 0.8 4 Pick random value for Wet Grass W (but we are "Constrained" to choose +w, since "w" appears in what we want to compute P(R | +s, +w) ), and we know that the Parents are also both positive (i.e. previously chose +s +r) so choose row +s +r +w with prob. of 0.99 since it's a "Constrained" choice, we add in this prob. into the "Weight" Table and multiply it with the previously recorded weight value "Weight" "Sample" ======== ======== 0.10 * 0.99 = 0.099 +c, +s, +r, +w When we include the "Weights", then counting a "Sample" that was forced to have +s, +w, with a "Weight of 0.099 instead of counting it as a full 1.0 sample, makes it such that "Likelihood Weighting" is "Consistent" ``` * Example: 2 - Doesn't solve all our problems ``` want to compute P(C | +s, +r) (i.e. "Constraint" S and R to always be positive +s +r) now, since we use the "Evidence" when we generate a node that has that "Evidence" as Parents, then the Wet Grass W node will always get good values for each Sample based on the "Evidence" but the Cloudy C node won't, so it will be generating values are random without looking at these values, and most of the time or some of the time it'll generate values that don't go well with the evidence. We won't have to reject them (like in Rejection Sampling), but they will have a low probability associated with them. ``` * **Gibbs Sampling** Technique * Dfn: Takes all the "Evidence" into account, not just the Upstream "Evidence" (according to the Causal Directions) by using the **Markov Chain Monte Carlo (MCMC)**, which is "Consistent" and which involves re-Sampling just one variable at a time, Conditioned on all the other variables ``` i.e. given a Set of variables C S R W first we initialise them to random variable values (keeping the "Evidence" values Fixed) to create a Sample like below: +c +s -r -w now, at each iteration through the loop, just select one non-"Evidence" variable, and re-Sample it based on all the other variables, which will result in outputting another Sample, i.e. +c -s -r -w repeat (walking around in the space of assignments of variables randomly) +c +s +r -w Note: In "Rejection Sampling" and in "Likelihood Weighting Sampling" each sample was "independent" of the other Samples. However, in MCMC, this isn't true, instead in MCMC, the Samples are "dependent" on each other, and Adjacent Samples are very similar (i.e. only differ in one place). 
MCMC is still "Consistent"
```
* **Monty Hall Problem** ```
Given 3x doors D1 D2 D3
Behind each door is a prize
One door's prize is a sports car
Other two doors contain a goat (undesirable)
Step 1: Choose D1, but don't open it
Step 2: Monty opens one of the other two doors, but deliberately chooses D3, the one with the Goat
Step 3: Now we need to decide whether to Stay or Switch
What is prob. of winning if you stick to D1??
Note: We learned info about D2, since it wasn't the door with the Goat flipped over by the host, so that additional info has updated the probability, but we haven't learnt any new info about D1 since it was never an option for the host to open D1
```
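- Note: a minimal Python sketch (not from the lectures) of the Likelihood Weighting procedure described above, hard-coded to the Cloudy/Sprinkler/Rain/WetGrass tables; it estimates P(+r | +s, +w) by fixing the evidence variables and weighting each sample, and it only handles evidence on S and W.
```
import random

# Conditional probability tables from the example above: P(variable is True | parents).
P_C = 0.5
P_S_given_C = {True: 0.1, False: 0.5}                         # P(+s | C)
P_R_given_C = {True: 0.8, False: 0.2}                         # P(+r | C)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.01}     # P(+w | S, R)

def weighted_sample(evidence):
    """One sample with the evidence variables (S and W here) fixed; returns (sample, weight)."""
    weight = 1.0
    sample = {'C': random.random() < P_C}                      # Cloudy: no parents, not evidence
    # Sprinkler: evidence in this query, so fix it and multiply the weight by its probability
    p_s = P_S_given_C[sample['C']]
    sample['S'] = evidence['S']
    weight *= p_s if sample['S'] else (1.0 - p_s)
    # Rain: not evidence, so sample it normally (no weight change)
    sample['R'] = random.random() < P_R_given_C[sample['C']]
    # Wet Grass: evidence, so fix it and multiply the weight by its probability
    p_w = P_W_given_SR[(sample['S'], sample['R'])]
    sample['W'] = evidence['W']
    weight *= p_w if sample['W'] else (1.0 - p_w)
    return sample, weight

def estimate_rain(evidence, n=100000):
    """Estimate P(+r | evidence) as a weight-normalised count (no sample is ever rejected)."""
    num = den = 0.0
    for _ in range(n):
        sample, w = weighted_sample(evidence)
        den += w
        if sample['R']:
            num += w
    return num / den

print(estimate_rain({'S': True, 'W': True}))                   # approximate P(+r | +s, +w)
```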
- Lesson 15 - Hidden Markov Models (HMMs) (Probabilistic Models)
- References:
- Probabilistic Theory: http://www.med.mcgill.ca/epidemiology/hanley/bios601/GaussianModel/JaynesProbabilityTheory.pdf
- Other videos: https://www.youtube.com/playlist?list=PLjLGYJqya5UiWa1bNPAAX8oOQ1NCUbvB8
- Dfn: Machine Learning Algorithms (tools) to process sequences of time-series data such as Pattern Recognition through time (i.e. speech, hand-writing, of signals with a language-like structure such as sign-language recognition, and other human activities like playing basketball, driving a car, vacuuming the floor), which all have fundamental units (i.e. movements), combinations of those units, and an underlying statistical grammar (most promising for future research in Thad’s opinion)
- Dfn: HMMs have:
- States that represent Observed phenomena
- Transitions that describe how to move between states
- Dfn: “Hidden” in HMM refers to the fact we don’t know exactly what State matches what Physical Event. Instead each State may yield certain Outputs, that we Observe over time, and Encode a Sequence of States over time based on how likely they were to produce that Output
- Decoding HMMs - Find out which Model “best fits” a Sequence of Data
- HMM Performance Boosters tricks
- State Tiling
- Context
- Stochastic Grammars
- Boosting
- First-Order HMMs
- Dfn: Only depend on State immediately preceding them, but not an entire history of multiple States
- HMM Representation
- Example: (from Norvig textbook) https://www.youtube.com/watch?v=34Noy-7bPAo ``` Given Markov chain with Output node for each State (common in Machine Learning community)
X0 —* X1 —* … —* XK —* … —* XT
      |          |          |
      *          *          *
      E1         EK         ET
where X’s each represent a Data Frame
where X0 is the initial state (useful for book keeping)
where X1 represents the first timeframe T = 1
where E1 represents the Output at T = 1
HMM States are implicit in the representation
BUT HMMs are better thought of in terms of their States
Given a graph of signal (y-axis) through time (x-axis) (0:59) Create an x-State HMM, where x is the qty of different Gradients in the signal to design a Model that may have generated the Data shown in the graph
Assume we can create a HMM by “Inspection” to represent a given signal. In reality we want many examples of the signal we want to Model, and need a Model that can accommodate all different examples (balancing generalisation and overfitting)
Use a “Left-To-Right HMM” (i.e. never transition back to a previous state after leaving it)
Loops at 1:31 represent “Self-Transitions”, indicating that the Model may remain in same State for several timeframes
Sometimes Entry Point values before entering a State are shown.
“Output Probabilities” (Emission Probabilities) - values that are allowable whilst in a given State (but Output Distributions are “densities”, so the Output Probabilities aren’t really probabilities)
Step 1: “Output Probabilities” for first State may be shown as a “Box Car” Distribution (2:07) since values range from -2 to -1 and are all equally represented. Repeat for other States.
Step 2: Determine values for “Transition Probabilities”. Note that at start of graph we spend 10 timeframes before “transitioning” to second part of the graph, so want to assign a probability at the end of the first part where we escape State No. 1 (i.e. 0.1, so on average we expect to stay in that state for 10 frames)
Note: All probabilities out of a state must add up to 1; since only the “Self-Transition” remains, we know it has probability of 0.9 (see the dwell-time sketch after this example)
Note: Qty of timeframes we expect to stay in a given State is
Qty_Timeframes_in_State = 1 / (1 - "Self-Transition Probability")
Step 3: Repeat similar to Step 2, which has 5 timeframes, so make “Escape Probability” of 1/5 = 0.2, and “Self-Transition Prob.” of 1 - 0.2 = 0.8
Repeat…
… ```
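- Note: a tiny sketch (assumed values, not from the video) checking the dwell-time formula above, Qty_Timeframes_in_State = 1 / (1 - "Self-Transition Probability"), by simulation.
```
import random

def expected_frames(p_self):
    """Expected frames spent in a State: 1 / (1 - "Self-Transition Probability")."""
    return 1.0 / (1.0 - p_self)

def simulated_frames(p_self, trials=100000):
    """Average dwell time measured by simulating the self-transition coin flip."""
    total = 0
    for _ in range(trials):
        frames = 1
        while random.random() < p_self:
            frames += 1
        total += frames
    return total / trials

p_self = 0.9                         # self-transition from the example (escape prob. 0.1)
print(expected_frames(p_self))       # 10.0
print(simulated_frames(p_self))      # close to 10
```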
- Dynamic Time Warp
- See in example
- Note: In this section, focus on situations where there is language-like structure to a problem (i.e. a fundamental unit like a “letter” in handwriting) that’s combined with other units to make larger constructs such as “words”, which are then combined into larger constructs like “sentences”
- Example: Characters and Words https://www.youtube.com/watch?v=0DJanDy8-eA
Given 2x HMMs (HMM A, HMM B), each with 3x states Graph for each state shows "topology" and "transition probability" Graph for each state shows only "output probability" (1x gaussian, including a "mean" and "standard deviation" of 1) Goal is to use HMM A and HMM B for "Optical Character Recognition" Given graphs with "y" (y-axis) and "t" (x-axis) that depict a Character, and Words. Given graph Inputs, a sequence of "v" values indexed over time from left to right, where "v" computed by counting qty of "black" squares in each column of time i.e. L character has sequence of "v" values of 5, 1, 1, 1, 1 Given Figures 1 to 4 showing a sequence of characters over time, indicate which of HMM A or HMM B most likely produced the data? Note: All states from both HMM A and HMM B have same "transition probability" The "output probabilities" of HMM A and HMM B are what makes them different Each HMM state's "output probability" gaussian shows high or low "mean" values at different times, so match these with qty of black squares over columns of time in each of Figures 1 to 4
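- Note: a minimal sketch of the "v" feature used in the example above (count of black squares per column of time); the 5x5 "L" bitmap is a made-up illustration chosen so the sequence comes out as 5, 1, 1, 1, 1.
```
# Hypothetical 5x5 binary bitmap of an "L" (1 = black square); columns are time steps.
L_CHAR = [
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]

def column_counts(bitmap):
    """Feature sequence v: number of black squares in each column over time."""
    return [sum(row[c] for row in bitmap) for c in range(len(bitmap[0]))]

print(column_counts(L_CHAR))         # [5, 1, 1, 1, 1]
```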
- Example: Dolphin Communication
(Euclidean Distance, Dynamic Time Warping, Sakoe Chiba Bounds)
Given Spectogram of dolphin whistle, where: - x-axis is Time - y-axis is Frequency - Pixel Brightness indicates Power in a Frequency Band - Data in Spectrum shows same Dolphin Whistle repeated twice where each Dolphin Whistle contains 2x parts. Note: - Noise in Ocean at Lower Frequencies when recording across broadband - 5 kHz to 17 kHz (Atlantic Spotted Dolphin whistles) and up to 110 kHz (Ultrasonic) - <= 3 miles (detection audible range under water) - Signature Whistles for Dolphin Identification (if in new area, emits Signature whistle to indicate to friends location where it wants their company) - < 5 kHz (Boat Wave Noise) - Dolphins have Sound Categories vocalisation: - frequency modulated whistles (Signature whistle) - burst pulse squawks (aggression, close proximity, social) - echolocation clicks (sonar, navigation, feeding by dig under sand, close proximity 1km) - close proximity sounds (aggressive contacts) - buzzes (discipline, close proximity, tactile effect used during courtship) - Human audible range is up to 15 kHz (i.e. Whistles) - Correlate Sound Categories with certain Behaviour types Goal: * Handle different Signal Categories (Classes) where each example may be "warped" differently in time Issues: * Cannot just use the Whistle frequency over time as Dolphins Raise/Lower the basic frequency of their Whistle Important: * Identify Patterns of Rise/Falls of Whistle frequency over time * Plot "Delta Frequency" (the Change in frequency) instead (i.e. instead of plot 5, 14, 10, 7, 10, 14... instead use: 9, -4, -3, 3, 4... that way Recogniser can handle Whistle frequencies starting at different values * Try plot "Time Warping" (i.e. saying quickly vs slowly) * Solution: * Match "features" that make Whistle distinct whether produced quickly or slowly * Try plot "Euclidean Distance" (INSUFFICIENT, use DTW instead) https://www.youtube.com/watch?v=IdIiVJG4M0c * Weakness: Sensitivity to distortion in time axis i.e. plot two graphs, first G1 with 21 samples (S0, S1...) of Delta Frequency, and second G2 with 12 samples with remaining 9 padded with 0's (no change), or fill in with average or last value * Calculate "Euclidean Distance" by calculating square root of the squares of the distances (i.e. sqrt( t0 G1's S0 of 2 and G2's S0 of 5 would be (2-5)^2 = 9 + t1 G1's S1 of 3 and G2's S1 of 5 would be (3-5)^2 = 4 + ... ) ) * Try plot "Dynamic Time Warping" (DTW) https://www.youtube.com/watch?v=RvO7DhXKHn0 * Benefit: When we know the two signals we are comparing will have some sections that are faster, whilst others are slower * Allows two similar time series but locally out of phase to align in non-linear manner * Disadvantage: O(n^2) complexity, but applying the Lower Bounding Technique on the DTW distance becomes O(n) (even with 4S approach). * Lower Bound Technique: Bounding envelope below and above the warping window. Used for sequences of same length. 
If sequences have different lengths, one of them must be reinterpolated and becomes the 4S approach (see Part 2.1 of paper for code) * Objective: Align 2x signal samples of Whistles to see/compare how we are matching togther * Important Note: A "Match" without any "time warping" on the x and y axis graph, would be a straight line diagonal from lower left to upper right corner (rare) * Process * Draw dotted diagonal line * Start plotting 1st Sample of each signal * Plot transitions to try and Match values as "close as possible to the ideal matching diagonal dotted line" to minimise error, and if values are tied, try and take preference in choice to keep to the dotted diagonal * After all dots plotted, calculate for each dot the sqrt of the x-axis value minus the y-axis value. then add all these values up and take the sqrt of the total (which is half the amount of using just plain Euclidean Distance padding the remaining with 0's) * "Sakoe-Chiba Bounds" * If allow too much "warping" then distance may be smaller than desired * Use "Sakoe Chiba Bounds" to force more "reasonable" matching by bounding amount can deviate from diagonal "Sakoe Chiba Bounds" does not allow matches outside limits of two diagonal bounding dotted lines, causes matching to be worse and to lead to a bigger distance, which is what we want * Calculate "bounds" often empirically. Set different "bounds" and use cross-validation to ensure they're reasonable * Set different "bounds" for different sections of the signal each with different amounts of allowed warping (i.e. where lots of potential variance of short/long, but where hump shapes may have small variation) * Avoid different "Sakoe Chiba Bounds" for different sections of signal but would be complicated to train, instead use Hidden Markov Models (HMMs) * Note: DTW with lower bound requires two sequences being compared both of same length where amount of warping is constrained * Note: The 10% constraint on warping inherited from speech processing community is actually too large for real world data mining * Readings on Dynamic Time Warping Ratanamahatana and Keogh’s “Everything you know about Dynamic Time Warping is Wrong” http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf * Section 2.1 * shows how given two sequences, the optimal path is a path that minimises the warping cost (minimises the total cumulative distance between them) * shows how to align two sequences by constructing a warping matrix to search for the optimum path, and using constraints to restrict number of paths needed to be considered (using Sakoe-Chiba Bounds)
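- Note: a minimal sketch of plain Dynamic Time Warping with an optional Sakoe-Chiba band, following the description above; this is the basic O(n^2) dynamic program, not the lower-bounding/4S method from the Ratanamahatana and Keogh paper, and the delta-frequency sequences are made up.
```
import math

def dtw_distance(a, b, band=None):
    """Dynamic Time Warping distance between sequences a and b.
    band is an optional Sakoe-Chiba bound: the warping path may only visit
    cells with |i - j| <= band, which keeps the match near the diagonal."""
    n, m = len(a), len(b)
    INF = float('inf')
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        lo, hi = (1, m) if band is None else (max(1, i - band), min(m, i + band))
        for j in range(lo, hi + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j],       # stretch a
                                 D[i][j - 1],       # stretch b
                                 D[i - 1][j - 1])   # match
    return math.sqrt(D[n][m])

# Two made-up delta-frequency sequences: same shape, produced at different speeds.
slow = [0, 2, 4, 4, 4, 2, 0]
fast = [0, 4, 4, 0]
print(dtw_distance(slow, fast))              # unconstrained warping
print(dtw_distance(slow, fast, band=3))      # with a Sakoe-Chiba bound (too small a band makes the end cell unreachable)
```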
- Readings:
- Dolphins https://www.youtube.com/watch?v=blmTrZMTcUs&feature=youtu.be after 26 mins it talks about interaction with other dolphin species using underwater camera
- Prairie Dogs http://www.cogs.indiana.edu/spackled/2009readings/Slobodchikoff%202009.PDF
Notes from Dolphins YouTube video: - Together or Apart from different Species - Mutual dialects (when with other species) - Individual dialect (when separate from other species of Dolphin) - Signals - Referential (i.e. refer to some label like a name or a word) i.e. Monkey Alarm Calls for different predators so take appropriate evasive action i.e. Prairie Dog Alarm Calls - Graded (i.e. intensity/frequency of sound as motivation or emotional state of animal) - Multiple Sounds - Dolphins can make two Sound Categories at the same time - Sound Direction - Dolphins are Directional senders/receivers of ultrasonic sounds and can interally vary angle by 20 degrees - Key Challenges: - Recording power to sample and decoding High Frequency sounds - Localising to deciphering which Dolphin made the sound - i.e. Hypdrophone array - Whistles are best studied sound, since easier to measure (but not most frequently used sound, not do they have much broadband characteristics with important information) - Goals with Thad's Georgia Tech CS: 1 Categorise difficult Sounds (make sense of info, match with Behaviour types) 2 Match with Behaviour 3 Look for Structure and Order (mine data, look for longer Sequences, Structure, and Order, as not yet applied to animals) - Tech Overview - Neural Networks to train computer if have enough Samples of each individual to see quantitatively measurement of differences between Whistles - Analysis of Dolphin sounds using "Multi Model Signal Integration" with Noldus "Observer Behaviour Software" that shows interaction of Sound and Behaviour over timeline: - Body postures timeline from video breakdown on Ethergram over time are labelled (of behaviour) - Sounds from Spectogram over time and labelling the Sound Categories - Tools and Techniques (Thad) - Goal - Find Fundamental Units (FU) used by Dolphins - Check if they combine FU to form new vocalisations - Implementation - Algorithm takes large DB of data and search for FUs and Categorise different parts (previously done for handwriting recognition to pull out individual letters and words used repeatedly in a sentence) - Goal is Universal Algorithm that takes any structure from given data, uncovers FUs (those of interest excluding noise), put the FUs in a Visualisation tool and automatically tune it to recognise repeated Patterns in data set specifically (grammars, rules to the behaviour), by labelling the data set with colour codes for each discovered Pattern. Find "Rules, using Regular Expressions (Alignment Based Learning) where there is a structure in the data set (i.e. when a pattern is allowed or not to be used in the data set). The "Rules" indicated much more structure than first thought, and were better predictors of visual behaviour. Scenarios: - Add Dolphin "D1" and "D2" to scene - Find patterns in Behaviour (i.e. head to head, head to tail) - Try to "Predict" what Labels would be - DISCOVERED that "Rules" (combination of Patterns) were very good at "Predicting" which Dolphin was around at certain time - Try to use Acoustic Data to "Predict" Categories of distinct biological data. 
Found that all are distinct - aggression - play - foraging - reunion - Statistical Testing - use System and Rules to use Acoustic Data to "Predict" different "labels" of Visual Behaviours Tables of Correlations: https://youtu.be/blmTrZMTcUs?t=22m55s (where "blue" means very similar "green" means very different "yellow" means some correlation - Discovered "Rules" (Patterns generated) of aggression don't correspond to "Rules" of play - Discovered "Rules of play don't correspond with "Rules" of foraging - Tool that takes 30 years of Data, processes it allowing hypotheses to be made and tested quickly (similar to Tool for gene banks in genetics that looked at repeated patterns in different Human characteristics such as people with Parkinsons', or blue eyes, etc. it uses discrete 4-base-pairs and sequence to predict when something interesting will occur, but for Dolphin Acoustics we are looking at Continous time instead of Discrete time - Tech Interface for Mutual Shared Communication dialect b/w Humans and Dolphins - Missing element is Tool allowing Dolphins to communicate back - Created portable Keyboard activated by Action touch point by Dolphin. Labeled Objects with Visual and Acoutistic cues: (shapes such as Scarf, Rope, Bow Ride) - Goal of Dolphins mimic human-Whistles so humans could respond to their requests - Problem: Too Slow response time - SOLUTION: Built underwater wearable computer called CHAT (Cetacean Hearing And Telemetry) (PROJECT 1 FUTURE FROM 2015 - ideally want Google Glass but waterproof so can see all FU as a script moving across i.e. ABBC that can manipulate and repeat back to Dolphins) - Bond Conducting Headphones - speaker sends vibrations through skull to cochlea (for underwater) - Hyrophones - receives sounds - Speaker - plays sounds - Keypad - allows user generate sounds - Instead of being on Arm, they put it on the box (cast buttons and display in resin) - Paper was released on how they designed this - Real-Time Sound Recognition (so if Dolphin mimics a sound we know in real-time what sound it mimiced, and respond accordingly) Issue: Salt water intrusion (took 4 years) - Delays ordering IP67 connectors to aluminium box and $400, but now did with 3D Printer much cheaper - Clear cover on case too, so Wifi signals go through (can't go through aluminium) to download video quickly, and send patch over wifi in real-time - Hydrophones not entirely waterproof through movement - Not want low power high throughput computer to not overheat in small box of aluminium in Bahama sun - 192 kHz auio system (2 channel) - Audio amplifier (to make non-compressible media sound go through water lots of power required, so had to design their own speaker systems, for certain frequencies, where amplifier matches speaker and powered off batteries) - LED screen for status - Risk of battery on fire near chest - Risk of electrical current underwater attracts sharks - Only tuned to narrowband width, but Not powerful enough to recognise full 96 kHz band used by Dolphins when they tried to mimic, so they could Sample at, so needed Faster computer, so used an Nvideo Jetson board and replaced big fan with a large heatsink, so could get Spectogram running Samples at 192 kHz and 32 bits per channel using 256 GB SSD ACOUSTICALLY CONTROLLED SUB (DOLPHIN SUB) - undergraduate project - must reward animals immediately within 50 ms otherwise cannot be trained - Issue Identified: - Maybe time to pass scarf underwater taking too long for Dolphin to associate - ideally a toy that reacted to Dolphin 
vocalisations, i.e. ultimately put CHAT inside a DOLPHIN SUB, that makes sounds with a chat box and flashes lights and performs an corresponding action like move left or right (to spark interest from Dolphins) RECOGNISING SPOTS SYSTEM - 95% of time recognises a Dolphin by its spots (even works from tail-end view) - applied same to Humpbacks GET INVOLVED CHAT Society Technical & Scientific Student Support www.WildDolphinProject.org Goal: Recognise the Types of Dolphin Whistles so marine mammal researchers can automatically annotate their database of recordings Issue: The second time the Dolphin Whistle is repeated, its lasts longer
- Example: Sign Language Recognition (real problem using HMM Representation)
- Step 1: Create HMMs for Word “I” (finger point to chest) Gesture
using Delta Y as our first feature (a bad feature), to purposely show
how powerful HMMs can be (can still tell difference b/w two words by
difference in timing, since it will show a higher probability in middle state
for I than WE, and the Transition Probabilities lower in I than WE since
it spends less time there moving hand across chest, since longer a
gesture is, the easier it is to recognise)
- Note: Delta Y is similar for both Word “I” and Word “WE”; Delta X is easier to differentiate the two words “I” and “WE”
- Visualise the Derivative of a Signal
- Derivative Plots https://www.youtube.com/watch?v=LiDdrJjjjfc
- Step 2: Plot HMM “I” with 3x States
for each of the 3x separate Motions of the Gesture
- State 1 - arm comes up toward chest
- State 2 - arm pauses for a moment
- State 3 - arm comes down away from chest
- Note: Higher probability on State 1 and 3 “Self-Transition” since spend more time on them. “Transition Probabilities” for each State based on the timing
- Step 3: Observe Plot of “Delta Y over time” for the Model to check it’s reasonable
- Step 4: Observe Plot of “Output Probability Distribution” (i.e. P(Delta Y) over Delta Y), where State 2 has higher probability of Delta Y of 0, and where Gaussians were used to model the output probabilities. https://youtu.be/CBW2C3lwbz4?t=47s
- Step 5: Create HMMs for Word “WE” (finger from side to right chest then
left chest then down to side) https://youtu.be/1mtzJGhRYHg
- Use 3x States
- State 2 Delta Y varies more so wider/shorter for “WE” than “I” (see “Output Prob. Distrib” plot)
- State 2 more time spent for “WE” than “I” (see “Delta Y over time” plot)
- State 2 more time spent for “WE” so higher prob. on Self-Transition (loop) (see HMM Model)
- Note: Distinguish between two gestures by observing the following sequence
properties of Delta Y:
- Duration of time spent in a state
- Distribution of Delta Y observed
- Recognition of HMM Models
- Given set of Observations “O” (Samples in example to recognise)
- Create a Viterbi Trellis to check likelihood that the
Model generated Samples in “O”.
- Correct Match will be the Model with highest probability
- Note: “Hidden” is the exact Sequence of States that created the Output, so need to go through all possibilities
- Lambda λ I is the combination of the HMM Model and the “Output Probability Distribution” Plot
- List the Data, a Sequence of Delta Δ Y values over time that we want to recognise as either “I” or “WE”
- Goal is to solve P(O | λI) for “I” and “WE” and compare them (i.e. probability of Observation Sequence given HMM Model of “I”)
- Create a “Trellis” with separate row for each State S1, S2, S3, in the HMM Model
(want find State we’re in at each Time). Fill out the Trellis
- Eliminate impossible States and paths using HMM. Fill in path of state transitions possible (note that Transition Probabilities limit what State you can be in at any given time) i.e. obviously S1 leads to S1 or S2 at the Start. and the last two States must be S2 and/or S3 https://www.youtube.com/watch?v=aOPeQAxDS4E
- In Middle of Trellis, many more options for State Transitions exist.
- Add “Transition Probabilities” directly from HMM Model λI
- Go through Observation Sequence over time, determine probability of
getting the Delta Y value at each time, by looking at the
“Output Probability Distribution” to see what the probability the Gaussian
would generate for a Delta Y value at each time https://youtu.be/tqYJnaQfsn0?t=1m7s
- Note: Lower likelihood for tighter Gaussian width
- Find the Viterbi Path (most likely path)
- At each timeframe, we multiply the “Transition Probability” by the “Output Probability” to find the Maximum Path
- Note: The “Viterbi Path” may not be the “Greedy Path” (i.e. the highest expected value with each transition may not lead to highest valued overall path)
- i.e.
- Expected Value of Staying in a state S1 instead of moving to S2 at Time 2 is 0.8 * 0.5 versus 0.2 * 10^-7
- Greedy Algorithm would expect to Stay in S1 since the value from the multiplication is bigger there
- Note: To account for the overall path, we need to track
the probability of each possible sequence of the Trellis.
Keep track of Maximum value path to each State in the sequence
at each time through the Trellis
https://youtu.be/Hq2P6zWA9F8?t=2m14s
and at the end choose the Most likely path
At Time = 2, there are only 2x nodes each sequence that could be chosen:
(ideally use Log space to calculate the probabilities to avoid
running out of precision in the way the OS represents the numbers)
1 * 0.5 * 0.8 * 0.5 = 0.2 OR 1 * 0.5 * 0.2 * 10 ^ -7 = 10 ^ -8
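- Note: a minimal sketch of filling the Viterbi Trellis in log space (as suggested above to avoid running out of precision); the 3-state Left-To-Right model, Gaussian parameters, and Delta Y sequence are made-up illustrations, not the "I"/"WE" models from the videos.
```
import math

def safe_log(p):
    return math.log(p) if p > 0 else float('-inf')

def log_gaussian(x, mean, std):
    """Log of the Gaussian output density (the "Output Probability" of a State)."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def viterbi(obs, log_start, log_trans, means, stds):
    """Fill the Viterbi Trellis in log space; return (best log score, best state path)."""
    n = len(means)
    # First column of the Trellis: start probability + output probability (in log space).
    V = [log_start[s] + log_gaussian(obs[0], means[s], stds[s]) for s in range(n)]
    back = []
    for x in obs[1:]:
        col, ptrs = [], []
        for j in range(n):
            # Best predecessor: max over i of (previous score + transition i -> j).
            best_i = max(range(n), key=lambda i: V[i] + log_trans[i][j])
            col.append(V[best_i] + log_trans[best_i][j] + log_gaussian(x, means[j], stds[j]))
            ptrs.append(best_i)
        V = col
        back.append(ptrs)
    # Trace the maximum-value path back through the Trellis.
    best_last = max(range(n), key=lambda s: V[s])
    path = [best_last]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    path.reverse()
    return V[best_last], path

# Made-up 3-state Left-To-Right model for a Delta Y sequence.
log_start = [safe_log(1.0), safe_log(0.0), safe_log(0.0)]            # must start in S1
log_trans = [[safe_log(p) for p in row] for row in
             [[0.8, 0.2, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 1.0]]]    # S1->S1/S2, S2->S2/S3, S3->S3
means, stds = [3.0, 0.0, -3.0], [1.0, 1.0, 1.0]                      # per-state Gaussians
delta_y = [3, 4, 2, 0, 0, -1, -3, -4]
print(viterbi(delta_y, log_start, log_trans, means, stds))           # (log score, 0-indexed state path)
```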
- HMM Training for “I”
- Dfn: We usually Train using the Data itself, instead of just creating the Models by inspection. When using HMMs for Gesture Recognition, have a minimum of 5x examples, and ideally 12x examples, for each Gesture we’re trying to recognise. Each example set of Data may have a different amount of Data Points (timeframes)
- Example:
Given 3x examples of Data, each with different amount of Data Points (timeframes) Use a 3x State Left-To-Right Topology for HMM. What examples of Data match with which State? STAGE 1: Use Verterbi Alignment to Initialise the values for each HMM State (for each timeframe in each example of Data we assigned a State for that timeframe and calculated the resulting averages of all values assigned to each State, along with average amount of time we expect to stay in each State) ========================================================================== Calculate "Transition Probabilities" ================================== - Assume each State has approx equal qty of timeframes. - Divide each example's timeframes into 3rds - Average length of each Data is 12 timeframes long (i.e. 4x Data Points per State per example Data) So "Transition Probability" out of each State is approx 0.25 And so "Self-Transitions" must be 0.75 (since total probability for each State is 1) Calculate "Output Probabilities" ============================== - Assign the first 4x Data Points to the 1st State S1 for each example of Data Average of all the values is Mean: 5.4; Standard Deviation: 3.1, and then plot on single Gaussian - Repeat for State S2 and S3 Iterate in Process similar to "Expectation Maximisation" ============================== - Try to get a Gaussian with less variance by performing calculation and moving the Mean and Std Deviations between the States S1, S2, S3. (See if moving boundaries in these examples of Data between States S1, S2 and S3 will lead to improved explanation of the Data) - Recalculate, by re-counting qty of Data Points in S1 for each example and recalculate the "Transition Probability" and "Self-Transition" in the HMM. Repeat for each State - Update the "Output Probabilities" on the Gaussians - Repeat until "Convergence" when none of the boundaries need to move, nor are the "Output Probabilities" going to change STAGE 2: Finish HMM Training ============================ "Baum Welch (aka Expectation-Maximization (EM) algorithm) Re-Estimation Process" using the "Forward-Backward Algorithm" to keep track of calculations (final Model often gives better results than initial estimate). (similar to "Expectation Maximisation", but instead, every sample of Data contributes to every State proportionally to the probability of that frame of Data being in that State) (i.e. even though a Data Point is most likely in say S1, its value still affects S2 and S3 a little bit, since its likelihood of belonging to S2 and S3 is low) Note: Some Data Points have higher likelihood of being in two out of the three States i.e. S1 or S2, and so has a larger Influence on the final "Output Probabilities" for S1 and S2 states. ================================ References: - Rabiner Tutorial on Baum Welch http://www.cogsci.ucsd.edu/~ajyu/Teaching/Tutorials/hmm.pdf - Thad's Master's Thesis (continues with Sign Language example using Baum Welch) - Review code of HMM Toolkit like HTK (watch Means and Variances change as re-estimation process iterates to build intuition)
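- Note: a minimal sketch of the Stage 1 initialisation described above (divide each example evenly across the States, estimate each State's Gaussian, and derive the Self-Transition from the average dwell length); Baum-Welch re-estimation is not shown, and the example Data values are made up.
```
import math

def initialise_left_to_right(examples, n_states=3):
    """Flat-start initialisation: split every example evenly across the States,
    then estimate each State's Gaussian and Self-Transition probability."""
    frames_per_state = [[] for _ in range(n_states)]
    for seq in examples:
        chunk = len(seq) / n_states
        for t, value in enumerate(seq):
            state = min(int(t / chunk), n_states - 1)
            frames_per_state[state].append(value)
    params = []
    for frames in frames_per_state:
        mean = sum(frames) / len(frames)
        var = sum((x - mean) ** 2 for x in frames) / len(frames)
        avg_dwell = len(frames) / len(examples)          # average frames spent in this State
        self_trans = 1.0 - 1.0 / avg_dwell               # from avg_dwell = 1 / (1 - p_self)
        params.append({'mean': round(mean, 2), 'std': round(math.sqrt(var), 2),
                       'self_transition': round(self_trans, 2),
                       'escape': round(1.0 - self_trans, 2)})
    return params

# Three made-up training examples with different numbers of timeframes.
examples = [[5, 6, 5, 4, 0, 1, 0, -4, -5, -6, -5, -4],
            [6, 5, 4, 1, 0, 0, 1, 0, -5, -6, -5],
            [5, 5, 6, 4, 5, 0, 1, 0, 0, -4, -5, -6, -5]]
for s, p in enumerate(initialise_left_to_right(examples), start=1):
    print('S%d' % s, p)
```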
- HMM Improvement using Delta X Feature (instead of Delta Y done previously)
- Note: Choice of Delta X or Delta Y as the better Feature depends on the Sign being recognised. Since Sign Language is two-handed, we need these features for both the Left and Right Hands, so we would have 8x features being tracked per timeframe that we need to integrate into our HMM Models by adding more dimensions (i.e. calculate multi-dimensional distances instead of just one dimension) for our Multidimensional Output Probabilities; all the other HMM Training and recognition steps are the same as before. It’s hard to show on Graphs, but easy to code (see the sketch after the feature list below) https://youtu.be/xDN1V4Y1Ung?t=56s
- Examples of Good Features:
- “Size” of the hand is a good feature as it gives info about whether the hand is coming toward or away from the camera
- “Angle” of hand to the horizontal
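- Note: once more features are tracked, the “Output Probability” of a State becomes multi-dimensional, as noted above; a minimal sketch assuming a diagonal-covariance Gaussian (the per-dimension log densities simply add), with made-up feature names and values.
```
import math

def diag_gaussian_logpdf(x, means, stds):
    """Log density of a multivariate Gaussian with a diagonal covariance:
    the per-dimension log densities simply add together."""
    total = 0.0
    for xi, mu, sd in zip(x, means, stds):
        total += -0.5 * math.log(2 * math.pi * sd ** 2) - (xi - mu) ** 2 / (2 * sd ** 2)
    return total

# Hypothetical 4-dimensional feature vector: [dY_right, dX_right, dY_left, dX_left].
state_means = [0.0, 2.0, 0.0, 0.0]
state_stds = [1.0, 1.5, 1.0, 1.0]
frame = [0.3, 1.8, -0.1, 0.2]
print(diag_gaussian_logpdf(frame, state_means, state_stds))
```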
- “Mixture of Gaussians” Technique
- Dfn: Use two Gaussians to Model whenever “Output Probabilities” are Bimodal instead of Gaussian. The more Gaussians we use, the closer we will model it (try to limit to 2-3 Gaussians, otherwise it tends to overfit)
- Note: Central Limit Theorem states we should get Gaussians if enough factors affect the data
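- Note: a minimal sketch of a Bimodal “Output Probability” modelled as a weighted sum of two Gaussians, per the “Mixture of Gaussians” technique above; the weights and parameters are made up.
```
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-(x - mean) ** 2 / (2 * std ** 2)) / math.sqrt(2 * math.pi * std ** 2)

def mixture_pdf(x, components):
    """Output probability density as a weighted sum of Gaussians.
    components is a list of (weight, mean, std); the weights should sum to 1."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)

# Made-up bimodal output distribution for one HMM State (2 Gaussians).
bimodal_state = [(0.6, -2.0, 0.5), (0.4, 3.0, 1.0)]
for x in (-2.0, 0.5, 3.0):
    print(x, round(mixture_pdf(x, bimodal_state), 4))
```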
- HMM Topologies
- Strategy of try simple Topology first and only add States when necessary. Many different Topologies may be used with cross-validation with randomly selected training-independent test sets to find the best one
- Note: Increase size of vocabulary from just “I” and “WE” by creating a Topology for each new Sign that we can use to start making phrases.
- “WANT” has an onset of lifting the arms up stretched out in front, the motion for the sign itself of pulling them towards the body, then dropping the hands. Requires 3x States.
- “NEED” has onset of lift one hand up with index and thumb like a claw, then lower hand about 10cm, and then drop hand. Requires 3x States.
- “TABLE” has onset of both arms crossed horizontally with right on top
about 5cm apart, then bounce up/down connecting both arms twice
or three times (means same thing),
then drop arms.
- HMM Alternative Topology (instead of Left-To-Right Topology)
- Loop in HMM from 3rd to 2nd State so motions may repeat a number of times.
- HMM Alternative Topology (instead of Left-To-Right Topology)
- “CAT” can be either with one hand lifted to mouth, then pull away horizontally
away from body OR with both hands (means same thing)
- Find best Topology by using Strategy of simplest first, starting with normal 3-State HMM, then use “Mixture of Gaussians” Technique to help handle the “Output Probabilities” we get
- Alternatively (results in highest accuracy given sufficient examples to train them, whilst adding more signs to vocabulary) recognise as two different signs. Train a Model for first variation of the sign called “CAT 1”, and train a separate Model for second variation of the sign called “CAT 2”, and after recognise “CAT 1” or “CAT 2” we’ll just have recogniser output CAT
- Phrase Level Sign Language Recognition
- Example
Give X Signs and Y Phrases to recognise. Note: If have 2x Variants of CAT ("CAT 1" and "CAT 2" to recognise) we need more Phrases. Note: - Grammar used to simplify with order [Pronoun, Verb, Noun] i.e. PRONOUN = {I | WE}, etc VERB = {NEED | WANT} NOUN = {TABLE | CAT} CAT = {CAT1 | CAT2} - We will only recognise one phrase at a time Given Data from someone using Sign Language that we want to recognise (i.e. the Phase "I NEED CAT") Inspecting the Delta Y "feature" of the Data over time gives us Observations (i.e. 20 Samples through time): 3 7 3 0 1 3 1 0 -1 -3 -2 0 2 1 0 1 0 -3 -7 -3 Use recognition as before but using Big "Trellis" Create horizontal row with 20 Samples Create vertical row with 22 States, break-down as follows: 3x States for "I", "WE", "NEED", "WANT", "CAT 1", "CAT 2" 4x States for "TABLE" "Breadth-First Search"-like expansion of each possible path in each time step At T = 1, Grammar says must start with with "I" or "WE" (Pronouns) so must start in State 1 (S1) these At T = 2, could be in S1 or S2 of "I" or "WE" At T = 3, could be in S1, S2, or S3 of "I" or "WE" At T = 4, could be in S1, S2, or S3 of "I" or "WE" (Pronouns) OR transitioned to S1 of "NEED" or "WANT" (Verbs) https://youtu.be/0M-WJCgqtEc?t=1m45s ... complexity begins (even though simple example with 3x letters) ... see 2:45 to 2:53 that shows elimination of impossible final States that aren't Nouns
- Stochastic Beam Search (SBS)
- Note: HMM uses a lot of memory (especially if had
vocabulary with all 6000 signs in American Sign Language ASL
and variations that give its expressiveness and speed.
Recognisers will find it hard to keep full “Trellis” in memory,
and keeping all the paths updated, so we use SBS to overcome the issue
instead of Breadth-First Search-like expansion of each possible path
in each time step.
SBS involves pruning some paths
- Remove some Low Probability Paths
- i.e. some paths get low probability quickly (i.e. staying in S1 until T = 1 seems impossible)
- Do not remove all low probability paths (since possibility of
Bad match in beginning of the Phrase that becomes a Good match later)
- i.e. signer might hesitate or accidentally start with the wrong sign before changing it (like stuttering or a false start in spoken language)
- Keep paths Randomly (important in AI) in proportion to their probability (similar to the Fitness Function used previously in Genetic Algorithms and to the randomness of Simulated Annealing); see the pruning sketch after this list
- Draw Low Probability Paths through the Trellis
- Draw High Probability Paths through the Trellis
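- Note: a minimal sketch of the pruning idea above: survivors are drawn randomly (without replacement) in proportion to their probability rather than by always keeping the top-scoring paths; the paths and probabilities are made up.
```
import random

def stochastic_prune(paths, beam_size):
    """Keep up to beam_size paths, drawn randomly in proportion to their
    probability, so a currently weak path may still survive."""
    candidates = list(paths)
    survivors = []
    for _ in range(min(beam_size, len(candidates))):
        total = sum(p for _, p in candidates)
        r = random.uniform(0, total)
        cum = 0.0
        for idx, (path, p) in enumerate(candidates):
            cum += p
            if r <= cum or idx == len(candidates) - 1:
                survivors.append(candidates.pop(idx))
                break
    return survivors

# Made-up partial paths through the Trellis with their probabilities.
paths = [('S1 S1 S2', 0.20), ('S1 S2 S2', 0.50), ('S1 S2 S3', 0.25), ('S1 S1 S1', 0.05)]
print(stochastic_prune(paths, beam_size=2))
```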
- Context Training Trick (reduce error rate by factor of 2 when applied to language-like structure)
- Difference in movement between Combination of signs in a Phrase
vs just Isolated signs (i.e. in “I” vs “I NEED CAT”, in the
Phrase the hand transitions skipping the last “I” position
to a mid “NEED” position, and so on, so hands don’t have to move
as much for each word)
- The signs before and after a given sign affect how it appears when being signed in a Phrase
- So, for Verbs that connect with other signs at the beginning and end, we’ll concatenate a combined 6x State model of “I NEED” and “NEED CAT” (in practice lots of examples of Phrases of a sign to Train on, rather than individual signs). Assume we have many examples Data of three sign phrase
- Note: In Thad’s Paper on recognising Phrases of ASL, he had 40 signs in vocabulary, but the only Training Data used was 500 Phrases that each contained 5x signs.
-
Use same steps used in Stage 1 and Stage 2 of HMM Training for “I” shown previously for Phrase “I NEED CAT”, doing same thing for first step, but assuming the Data is evenly divided between each sign, and then divide the Data of each State within each sign, then as before, iterate and adjust boundaries for each State and each sign until convergence. After converged everything for each sign, go back and find every place where “I NEED” occurs (whilst we’ll assume there are enough examples, there will be less examples of “I NEED” than of “NEED”), and cut out all Data that we think belongs to “I NEED”, and use the modelling approach Context Training Trick to Train the Combined 6x State HMM Model on it, since the “Output Probabilities” at the boundary b/w “I” and “NEED”, as well as the “Transition Probabilities” in that region will then be tuned to better represent “I NEED”, rather than the general case which would also include Phrases like “WE NEED”, since in speech, the effect of one phoneme on an adjacent phoneme is called Coarticulation. Repeat for “NEED CAT 1” and every other two-sign Combination. Iterate using Baum Welch using these larger two-sign Contexts for embedded Training until converge again. If have sufficient Data, ideally use larger three-sign Contexts (or more) when Phrases are complex since huge benefits
- Benefits: Context Training reduces error rate by a factor of 2 in recognition tasks where there’s a language structure
- Statistical Grammar Trick (reduce error rate by factor of 4 when applied to language-like structure)
- Dfn: In real life, language isn’t well segmented. If we record a large qty of language and record the proportion of times that “I” comes before “NEED” versus “I” coming before “WANT” (i.e. “I NEED” 6%, “I WANT” 10%, etc.), we can use these probabilities to help bias our recognition based on the expected distribution of co-occurrence of the signs,
instead of a Simple Grammar (i.e. order [PRONOUN, VERB, NOUN]) that places a strong limit on the start and end states in the Viterbi Trellis (a small rescoring sketch follows below)
- Benefits: Statistical Grammar reduces error rate by a factor of 4
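- Note: a minimal sketch of biasing recognition with a Statistical Grammar: add a weighted bigram log-probability of the sign sequence to the HMM log-likelihood and re-rank the candidates; the bigram table, scores, and grammar weight are made up, not the numbers from Thad's work.
```
import math

# Made-up bigram probabilities for sign pairs, e.g. P(NEED | I) = 0.06.
bigram = {('I', 'NEED'): 0.06, ('I', 'WANT'): 0.10,
          ('NEED', 'CAT'): 0.04, ('WANT', 'CAT'): 0.03}

def grammar_logprob(signs):
    """Sum of log bigram probabilities over consecutive signs in the phrase."""
    return sum(math.log(bigram.get(pair, 1e-6)) for pair in zip(signs, signs[1:]))

def rescore(hmm_loglik, signs, grammar_weight=1.0):
    """Combine the HMM log-likelihood with the weighted Statistical Grammar score."""
    return hmm_loglik + grammar_weight * grammar_logprob(signs)

# Two candidate phrases with made-up HMM log-likelihoods.
candidates = [(['I', 'NEED', 'CAT'], -42.0), (['I', 'WANT', 'CAT'], -42.5)]
best = max(candidates, key=lambda c: rescore(c[1], c[0], grammar_weight=2.0))
print(best[0])
```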
- State Tying Trick
- Dfn: Combining Training for States, where the States within the HMM Model have close Means and Variances during Training, and then determine if performing “State Tying” looks logical given the motion we expect
- Example:
Given HMM Models and "Output Probability Distribution" for "I" and "WE" Note that when recognising Isolated Signs, the initial movement of right hand to chest is similar in both HMM Models. So instead of having a separate "I" State 1 (S1) and "WE" State 1 (S1), we'll define a "COMMON" Initial State and included it in the definition of both the "I" and "WE" HMM Models, so when we Train the HMM, we have twice as much data to Train the Initial State. Repeat for the Final State S3, since inspecting the "Output Probability Distribution" of the hand movement back down to rest looks the same for both "I" and "WE" Care must be taken since "State Tying" gets complicated when "Context Training" is used
- Segmentally Boosted HMMs (SBH) Trick
- About: Combines some advantages of discriminative models with generative models. Not yet in HTK Toolkit or Georgia Tech Gesture Toolkit https://wiki.cc.gatech.edu/ccg/projects/gt2k/gt2k
- Example: Hand Appearance Models
- Similarity Metric of how closely current hand looked
like different visual models of the hand as “features” for
the HMM. Issue encountered by using many dimensions
was that when models that didn’t match
well were producing relatively random results and caused
noise such that many dimensions became meaningless
- Boosting (to Weight the “feature” vector) used as mitigation measure and is called Segmentally Boosted HMMs
- SBH Process:
- Align and Train HMMs as normal
- Use the Training to Align the Data that belongs to each State as best as possible
- Iteratively examine each State of each Model
- Boost by finding what “features” help us most in differentiating the Data for our chosen State versus the rest of the States
- Weight (Up/Down) the Dimensions appropriately in that HMM
- Benefit: 20%-70% better results than with normal HMMs (depending on the Data set used)
- Using HMMs to Generate Data
- Initially hoped that same HMMs used to recognise speech could
also be used to generate speech, but was not a good idea,
since when plot sequence of data points in time, the
“Output Distributions” do not understand the concept of
“continuity” (could solve by using many States to force a better
ordering of the Output, but would lead to over-fitting)
- Solution:
- Model the State Transitions with a HMM, but
using a different process whilst in each State,
a process that is more aware of the Context required
to generate the final Data with “continuity”, using techniques
such as the following for better results:
- Combining HMMs with Deep Belief Networks at Google
- TODO Readings on HMMs
- AIMA: Chapter 15.1-15.3
- Rabiner’s famous Tutorial on hidden Markov models and selected applications in speech recognition
- Slides http://www.cogsci.ucsd.edu/~ajyu/Teaching/Tutorials/hmm.pdf
- PDF http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf
- errata http://alumni.media.mit.edu/~rahimi/rabiner/rabiner-errata/
- Thad Starner’s MS thesis: Visual Recognition of American Sign Language Using Hidden Markov Models
- link: https://dspace.mit.edu/handle/1721.1/29089
- PDF https://dspace.mit.edu/bitstream/handle/1721.1/29089/32601581-MIT.pdf
- The Hidden Markov Model Toolkit (HTK)
- Download http://htk.eng.cam.ac.uk/
- About http://svr-www.eng.cam.ac.uk/research/projects/HTK_HMM_Toolkit/
- Chapter 1 The Fundamentals of HTK (pages 3-13) in The HTK Book (version 3.4)
- PDF http://speech.ee.ntu.edu.tw/homework/DSP_HW2-1/htkbook.pdf
- HTML http://www.ee.columbia.edu/ln/LabROSA/doc/HTKBook21/HTKBook.html
- TODO HMM Resources
- AIMA: Chapter 15.4-15.6 (provides another viewpoint on HMMs with a natural extension to Kalman filters, particle filtering, and Dynamic Bayes Nets), Chapter 20.3 (hidden variables, EM algorithm)
- Yechiam Yemini’s “slides on HMMs used in genetics (gene sequencing, decoding).” http://www.cs.columbia.edu/4761/notes07/chapter4.1-HMM.pdf
- Sebastian Thrun and Peter Norvig’s AI course:
- HMMs and Kalman filters https://classroom.udacity.com/courses/cs271/lessons/48734405/concepts/last-viewed
- Natural Language Processing https://classroom.udacity.com/courses/cs271/lessons/48641663/concepts/last-viewed
- Natural Language Processing II https://classroom.udacity.com/courses/cs271/lessons/48734403/concepts/last-viewed
- Further study Huang, Ariki, and Jack’s book “Hidden Markov Models for Speech Recognition.” http://www.amazon.com/Hidden-Recognition-Edinburgh-Information-Technology/dp/0748601627
- TODO Resources for Segmentally Boosted HMMs
- Gesture and Activity Recognition Toolkit (GART), which utilises HTK
- https://wiki.cc.gatech.edu/ccg/projects/gt2k/gt2k
- SBHMM project at Georgia Tech http://www.cc.gatech.edu/cpl/projects/sbhmm/
- Further study
- Pei Yin’s dissertation: Segmental discriminative analysis for American Sign Language recognition and verification https://smartech.gatech.edu/handle/1853/33939
- Gesture and Activity Recognition Toolkit (GART), which utilises HTK
- TODO HMMs for Speech Synthesis
- Junichi Yamagishi’s An Introduction to HMM-Based Speech Synthesis https://wiki.inf.ed.ac.uk/twiki/pub/CSTR/TrajectoryModelling/HTS-Introduction.pdf
- Heiga Zen’s Deep Learning in Speech Synthesis http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41539.pdf
- TODO Project 4
- file:///Users/Ls/Downloads/0dreuw07-interspeech.pdf
- https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
- http://www.speech.sri.com/projects/srilm/
- https://discussions.udacity.com/t/slm-data-for-this-asl-dataset/230822/16
- http://www.cs.cmu.edu/~bhiksha/courses/11-756.asr/spring2010/class.19apr/ngrams.pdf
- TODO
- Optional AI lectures http://artificialbrain.xyz/artificial-intelligence-complete-lectures-01-23/
- References:
- Project 4
- Discussions https://discussions.udacity.com/c/nd889-probabilistic-models/nd889-project-translate-sign-language
- Part 4 Optional Notes:
- Definitions
- SLM - Statistical Language Models
- Visual
- References
- Refer to folder: docs_part4
- DONE READ Slides - Introduction to N-grams (Stanford Jurafsky) https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
- Detailed Slides - Language Model Topics http://wwwhomes.uni-bielefeld.de/~gibbon/Handbooks/gibbon_handbook_1997/node199.html
- Paper - Speech Recognition Techniques for Sign Language https://www-i6.informatik.rwth-aachen.de/publications/download/154/Dreuw–2007.pdf
- DONE SKIM READ IT - http://www.cs.cmu.edu/~bhiksha/courses/11-756.asr/spring2010/class.19apr/ngrams.pdf
- Other links
- DONE SKIM READ IT - https://en.wikipedia.org/wiki/Language_model
- http://masatohagiwara.net/training-an-n-gram-language-model-and-estimating-sentence-probability.html
- ARPA Format links
- http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html
- https://msdn.microsoft.com/en-us/library/office/hh378460(v=office.14).aspx
- Discussion Forums
- https://discussions.udacity.com/t/slm-data-for-this-asl-dataset/230822/9
- Tutorials
- Data Analysis and Pandas
- Data Analysis - https://classroom.udacity.com/courses/ud170
- Pandas
- Pandas Cheat Sheet - https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf
- Panda DataFrames - https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python#gs.AzoIXlk
- Pandas quick guide - http://pandas.pydata.org/pandas-docs/stable/10min.html
- Goal
- Improve the WER of the “0-gram” (equivalent) recogniser by using SLM data in the downloaded data set using “1-gram”, “2-gram”, and/or “3-gram” statistics. Use probabilities data already calculated and convert into a pandas DataFrame if desired.
- Setup
- Download SLM data for this ASL dataset:
- ftp://wasserstoff.informatik.rwth-aachen.de/pub/rwth-boston-104/lm/
- Extract all
gunzip *.gz
- Natural Language Processing (NLP) Summary of: https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf
- Language Model (LM) (aka grammar)
- Dfn: Model that computes probability of upcoming word
OR probability of a certain sequence of words (sentence)
P(W) = P(w1, w2, w3, ..., wn) OR P(wn | w1, w2, w3, ..., wn-1)
- N-gram Model Issues:
- language has long-distance dependencies
- Always in Log space (avoids underflow, and adding is faster than multiplying)
log(p1 * p2 * p3) = log p1 + log p2 + log p3
- Overfitting,
where N-Grams only work well for
word prediction if the Test corpus looks like the
Training corpus (but in real life it often doesn’t)
- Overcome by Training the Models to generalise and account for where a word or sequence occurs in the Test Set but does not exist in the Training Set
i.e. Training set: “I am”, “Dog am”; Test set: “I did” → P(did|I) = 0
- Zero Probability Bigrams
- i.e. cannot divide by 0 to compute Perplexity
- Smoothing
- i.e. ``` given broad statistics
P(w|I) (i.e. probability of a word w given previous word is “I”)
3 am
2 eat
1 have
---
6 total
Use a “Probability Mass” to “Generalize” better
P(w|I)
2.5 am
1.5 eat
0.5 have
1.5 other
---
6 total
```
- Laplace Smoothing (Add-One Estimation)
- Usage: Not used for N-Grams (better options out there), only used to “Smooth” other NLP methods (i.e. when not many zeros)
- Dfn: Pretend saw each word one more time than
we actually did (i.e. add 1 to all counts)
to overcome Zero Probability issue.
Modify the MLE Equation into the Add-One Equation estimate
(where V is the vocabulary size)
P(wi|wi-1) = (count(wi-1,wi) + 1) / (count(wi-1) + V)
- Example: (Perplexity)
Compute "Joint Probability" of: P(W) = P(its, water, is, so) Use "Chain Rule of Probability" P(A,B,C,D) = P(A) * P(B|A) * P(C|A,B) * P(D|A,B,C) P(x1,x2,..,xn) = P(x1) * P(x2|x1) * P(x3|x1,x2) ... P(xn|x1,...,xn-1) P(W) = P(its) * P(water|its) * P(is|its,water) * P(so|its,water,is) Estimate Probabilities with N-gram Models: ================================= * "Unigram" Model (1-gram) P(w1,w1,...,wn) ~= ∑i P(wi) * "Bigram" Model (2-gram) (conditional on previous word) P(wi|w1,w2,...,wi-1) ~= P(wi|wi-1) Maximum Likelihood Estimate (MLE) ================================= * "Bigram" P(wi|wi-1) = count(wi-1,wi) / count(wi-1) Dfn: MLE of some Parameter of a Model M from Training Set T maximisess the likelihood of T given M * NOTE: Modify the equation to use "Laplace Smoothing" to overcome the "Zero Probability" issue e.g. given word "bagel" occurs 10 times in 100 words the probability that a random word from another text will be "bagel" using the MLE estimate is: 10/100 = 0.1 e.g. given these sentences: "now I am Sam ok" "now Sam am I ok" "now I am dog ok" P(am|I) = 2/3 = 0.67 P(ok|Sam) = 0.5 P(Sam|now) = 1/3 = 0.33 P(am|I) = 2/3 = 0.67 Raw "Bigram" Counts ===================== * "Bigram" counts given various sentences (3 OFF), draw table of count of first (rows) and second (cols) words now I am Sam dog ok now 0 0 0 1 0 0 I 0 0 2 0 0 0 am 0 1 0 1 1 0 Sam 0 0 1 0 0 1 ok 0 0 0 0 0 0 * "Normalise" by "Unigrams" counts (i.e. individual word counts) now I am Sam dog ok 3 3 3 2 1 3 * Result (i.e. divide each in value in "Bigram" by respective count for the first word in combo in the "Normalised" table: now I am Sam dog ok now 0 0 0 1/3 0 0 I 0 0 2/3 0 0 0 am 0 1/3 0 1/3 1/3 0 Sam 0 0 1/2 0 0 1/2 ok 0 0 0 0 0 0 * "Bigram" Estimate of Sentence Order probability P(am Sam dog) = P(Sam|am) * P(dog|Sam) = 1/3 * 0 = 0 Evaluate Model goodness * Check assigns higher probability to sentences * real or frequently observed VS ungrammatical or rarely observed * Train Model Params on a Training Set * Test Model Performance on Test Set (unseen dataset) and use Evaluation Metric to show performance * Note: Test Set is unused, differs from Training Set Extrinsic Evaluation of N-Gram Models i.e. comparing accuracy of Models A vs Model B by comparing WER Intrinsic Evaluation of N-Gram Models: i.e. Perplexity Benefit: shorter duration than Extrinsic Perplexity Dfn: Best language Modle is one that best predicts (best performance) during Evaluate Model goodness phase the an Test Set (unseen test set) and gives highest probability of a sentence Perplexity is Weighted Equivalent Branching Factor i.e. Perplexity of recognising digits "0,1,2" is equal to 3 Minimising Perplexity == Maximising Probability Lower Perplexity == Better Model (i.e. Perplexity decreases the higher the N in N-Gram) PP(W) = P(w1,w2,...,wN) ^ 1/N Dfn: Perplexity is inverse probability of test set, normalised by qty of words. 
so, represented for "Bigram" using Chain Rule, gives: PP(W) = Nsqrt ( of Sum(j=1 to N) of 1 / P(wi|wi-1) ) Example: Perplexity given sentence with random digits where Model assigns probability of 1/10 to each digit: PP(W) = P(w1,w2,...,wN) ^ -1/N = ( (1/10)^N ) ^ -1/N = (1/10) ^-1 = 10 given sentence with 2x random words where Model assigns probabilities: dog (probability 1 in 4) cat (probability 1 in 3) PP(W) = P(w1,w2,...,wN) ^ -1/N = ( ) Shannon Visualisation Model * Dfn: Showing 2-gram/3-gram of words in step-like appearance, for 2-gram choose random N-gram for each next stage that has first element in 2-gram the same as the second element in previous, until add obvious last element Reconstituted Counts (pg 53/88) ??? i.e. compared with raw bi-gram counts Backoff * Dfn: * uses 3-gram only when good evidence * use 2-gram / 1-gram when poor evidence Interpolation * Dfn: * Mixes 1-gram, 2-gram, 3-gram * i.e. Extended Interpolated Kneser-Ney * Types: * Linear Interpolation * Simple Interpolation P(wn|wn-1,wn-2) = λ1 * P(wn|wn-1,wn-2) + λ2 * P(wn|wn-1) + λ3 * P(wn) where ∑i λi = 1 * Lambdas Conditional on Context P(wn|wn-2,wn-1) = λ1(wn-1,n-2) * P(wn|wn-2,wn-1) + λ2(wn-1,n-2) * P(wn|wn-1) + λ3(wn-1,n-2) * P(wn) * Set Lambda: * Use "held-out" corpus: * Training data * Held-out data * Test data * Choose λs to maximise probabilities of "held-out" data by fixing N-gram probabilities on training data and then searching for λs giving largest prob. to "held-out" data logP(w1,...,wn|M(λ1...λk)) = ∑i log P _M(λ1..λk) (wi|wi-1) where _ means subscript Open vs Closed Vocabulary Tasks (All Unknown words vs Known words) * Closed Voca. Tasks * Vocab. V is fixed * Open Vocab. Tasks * Out of Vocabulary = OOV words * Create Unknown word token <UNK> then Train the <UNK> word probabilities (create fixed lexicon L of size V, and at text Normalisation phase change to <UNK> any Training word not in L, then Train probabilities of <UNK> like a normal word, and use <UNK> probabilities when decoding any word not in Training Large-scale N-Grams (i.e. Google N-gram) * see Skipped pg 62-63 (too advanced) including "Stupid Back-offs", "Pruning", etc Smoothing Algorithms (use count of things seen to help estimate count of things we've never seen) * God-Turing * Kneser-Ney * Witten-Bell ... skipped the rest
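- Note: a minimal sketch tying together the "Bigram" MLE, Laplace (Add-One) smoothing, and Perplexity formulas summarised above, using the toy "now I am Sam ok" sentences from the notes; sentence-boundary markers (<s>, </s>) are omitted for brevity.
```
import math
from collections import Counter

sentences = [['now', 'I', 'am', 'Sam', 'ok'],
             ['now', 'Sam', 'am', 'I', 'ok'],
             ['now', 'I', 'am', 'dog', 'ok']]

unigrams = Counter(w for s in sentences for w in s)
bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
V = len(unigrams)                                   # vocabulary size

def p_mle(w, prev):
    """Maximum likelihood estimate: count(prev, w) / count(prev)."""
    return bigrams[(prev, w)] / unigrams[prev]

def p_laplace(w, prev):
    """Add-one (Laplace) estimate: (count(prev, w) + 1) / (count(prev) + V)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def perplexity(sentence, prob):
    """Inverse probability of the word sequence, normalised by the number of bigrams:
    PP(W) = (product over i of 1 / P(wi | wi-1)) ** (1 / N)."""
    log_p = sum(math.log(prob(w, prev)) for prev, w in zip(sentence, sentence[1:]))
    return math.exp(-log_p / (len(sentence) - 1))

print(p_mle('am', 'I'))                              # 2/3, as in the notes
print(p_laplace('did', 'I'))                         # non-zero even though "I did" never occurs
print(perplexity(['now', 'I', 'am', 'Sam', 'ok'], p_laplace))
```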
- Google N-Gram (Google Machine Translation Team at Google Research)
- https://research.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html
- Shared large dataset of 1 trillion words and counts of all 1 billion five-word sequences that appear at least 40 times (where total of ~13 million unique words)
i.e. 3-gram Data in Corpus: “I am dog” → count 4; 4-gram Data in Corpus: “I am a dog” → count in db 92
- Google Books N-Gram Viewer
- https://books.google.com/ngrams/graph?content&year_start=1800&year_end=2000&corpus=0&smoothing=3&direct_url=
- Raw data set http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
- Automatic Sign Language Recognition (ASLR) Summary of: https://www-i6.informatik.rwth-aachen.de/publications/download/154/Dreuw–2007.pdf
- ASLR difficulties are with Computer Vision, whereas Auto Speech Recog. (ASR) issues already solved
- Try adopting “speech” recognition system’s large Vocabulary
to recognise sentences of continuous sign language.
- Obtain “features” from video cameras
- Focus on:
- Feature and Model combo techniques used in ASR
- Language Models (LM) and pronunciation in sign language
- Vision issues:
- capturing, tracking, segmentation
- Sign language
- Basic components
- Manual components
- Hand configuration
- Place of articulation
- Hand movement
- Hand orientation
- Non-Manual Components
- Facial expression
- Body posture
- Continuous Sign Language Recognition must deal with
- Coarticulation effects (probability of sign depends on preceding and succeeding signs)
- Inter- and Intra-personal variability
- Goal:
- Sign Language Recognition System that is person-independent and recognises sentences of continuous sign language using a vision-based approach (i.e. no data acquisition devices like gloves or motion capture), using a webcam
- System Overview
- Based on Bayes Decision Rule
- Search for the “best” fit word sequence to recognise for the current observation against the Acoustic Model (trained word model inventory from ASR) and against the LM
- Visual Modelling Phonological Model Dfn: * Signs divided into “chiremes” units * Sign language words split into sub-word “phonemes” units * e.g. ``` Split Words
each word model split into 1-3 pseudo-"phonemes" modelling avg word length from training) lexicon ends up with 247 pseudo-"phonemes" after dividing up 104 words Model the Split Words model each pseudo-"phoneme" with 3-state left-to-right HMM with 3x Gaussian mixtures and a pooled Covariance Matrix ``` * Issues and Solutions * Different dialects (many words with same meaning) * HMMs fix small difference between appearance and length of utterances * Different Models required to fix different pronounciations of signs (with different qty of States and GMMs) * RWTH-Boston-104 Corpus Data (of sign language sentences) used (see Section 3) * Perplexity PP of the language model is low since its sentences have simple structure * Out-of-vocabulary (OOV) word (only 1 OFF) is in the Test Corpus (that cannot be recognised correctly)
- Language Models
- Bayes Decision Rule
- Acoustic Model (AM) and Language Model (LM) have same impact on decision
- Increase the Weight/scale of LM (alpha scale) above AM (beta scale) where Language Model Factor = alpha/beta (see equation in section 2.2)
- About Sudoku http://www.conceptispuzzles.com/?uri=puzzle/sudoku/rules