
Artificial Intelligence
 symbol search
 intro
 knowledge representation
 expert systems
 decisions
 logic and planning
[toc]
symbol search
 computer science  empirical inquiry
symbols and physical symbol systems
 intelligence requires the ability to store and manipulate symbols
 laws of qualitative structure
 cell doctrine in biology
 plate tectonics in geology
 germ theory of disease
 doctrine of atomism
 “physical”
 obey laws of physics
 not restricted to human systems
 designation  an expression designates an object if, given the expression, the system can affect the object or behave in ways depending on it
 interpretation  the system can interpret an expression if it designates a process the system can carry out
 physical symbol system hypothesis  a physical symbol system has the necessary and sufficient means for general intelligent action
 from { cite newell1980physical }
 identify a task domain calling for intelligence; then construct a program for a digital computer that can handle tasks in that domain
 no boundaries have come up yet
 wanted general problem solver  leads to generalized schemes of representation
 goes along with information processing psychology
 observe human actions requiring intelligence
 program systems to model human actions
heuristic searching
 symbol systems solve problems with heuristic search
 Heuristic Search Hypothesis  solutions are represented as symbol structures. A physical symbol system exercises its intelligence in problem solving by search, that is, by generating and progressively modifying symbol structures until it produces a solution structure
 from { cite newell1976computer }
 there are practical limitations on how fast computers can search
 To state a problem is to designate
 1. a test for a class of symbol structures (solutions of the problem)
 2. a generator of symbol structures (potential solutions)
 To solve a problem is to generate a structure, using (2), that satisfies the test of (1).
 searching is generally organized as a tree
intro
 AI  field of study concerned with creating intelligence
 intelligent agent  system that perceives its environment and takes actions that maximize its chances of success
 expert task examples  medical diagnosis, equipment repair, computer configuration, financial planning
 formal systems  use axioms and formal logic
 ontologies  structuring knowledge in graph form
 statistical methods
 turing test  can a machine's behavior be indistinguishable from a human's { cite turing1950computing }
 chinese room argument  rebuts turing test { cite searle1980minds }
 china brain  what if different people hit buttons to fire individual neurons
knowledge representation
 physical symbol system hypothesis  a physical symbol system has the necessary and sufficient means for general intelligent action
 computers and minds are both physical symbol systems
 symbol  meaningful pattern that can be manipulated
 symbol system  creates, modifies, destroys symbols
 want to represent
 metaknowledge  knowledge about what we know
 objects  facts
 performance  knowledge about how to do things
 events  actions
 two levels
 knowledge level  where facts are described
 symbol level  lower level, where the representations of that knowledge reside
 properties
 representational adequacy  ability to represent the required knowledge
 inferential adequacy  ability to derive new knowledge from old
 inferential efficiency  ability to direct inference toward promising directions
 acquisitional efficiency  ability to acquire new information
 two views of knowledge
 logic
 a logic is a language with concrete rules
 syntax  rules for constructing legal sentences
 semantics  how we interpret / read
 assigns a meaning
 higher-order logic  functions / predicates are also objects
 multi-valued logics  more than 2 truth values
 fuzzy logic  uses degrees of truth in [0, 1] rather than booleans
 match-resolve-act cycle
 associationist
 knowledge based on observation
 semantic networks  objects and relationships between them  like is-a, can, has
 graphical representation
 equivalent to logical statements
 ex. nlp  conceptual dependency theory  sentences with same meaning have same graphs
 frame representations  semantic networks where nodes have structure
 ex. each frame has age, height, weight, …
 when agent faces new situation  slots can be filled in, may trigger actions / retrieval of other frames
 inheritance of properties between frames
 frames can contain relationships and procedures to carry out after various slots filled
expert systems
 expert system  program that contains some of the subject-specific knowledge of one or more human experts.
 problems
 planning
 monitoring
 instruction
 control
 need lots of knowledge to be intelligent
 rule-based architecture  condition-action rules & database of facts
 acquire new facts
 from human operator
 interacting with environment directly
 forward chaining
 until a special HALT symbol appears in the DB, keep applying logical rules whose conditions match, adding each result to the DB
 conflict resolution  which rule to apply when many choices available
 pattern matching  logic in the if statements
 backward chaining  check if something is true
 check database
 check if it appears on the right side (consequent) of any rules, then recursively check their premises
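The two chaining strategies above can be sketched as follows (the `forward_chain`/`backward_chain` names, rules, and facts are invented for illustration; the backward version assumes acyclic rules):

```python
def forward_chain(rules, facts, query):
    """rules: list of (premises, conclusion) pairs; facts: set of atoms."""
    facts = set(facts)
    changed = True
    while changed and query not in facts:
        changed = False
        for premises, conclusion in rules:
            # pattern match: fire any rule whose premises are all in the DB
            if conclusion not in facts and all(p in facts for p in premises):
                facts.add(conclusion)       # add result to DB
                changed = True
    return query in facts

def backward_chain(rules, facts, query):
    if query in facts:                      # 1. check database
        return True
    # 2. find rules whose right side matches, then recurse on their premises
    return any(all(backward_chain(rules, facts, p) for p in premises)
               for premises, conclusion in rules if conclusion == query)

rules = [({"croaks", "eats flies"}, "frog"), ({"frog"}, "green")]
print(forward_chain(rules, {"croaks", "eats flies"}, "green"))   # True
print(backward_chain(rules, {"croaks", "eats flies"}, "green"))  # True
```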
 CLIPS  expert system shell
 define rules and functions…
 explanation subsystem  provide explanation of reasoning that led to conclusion
 people
 knowledge engineer  computer scientist who designs / implements ai
 domain expert  has domain knowledge
 user interface
 knowledge engineering  art of designing and building expert systems
 determine characteristics of problem
 automatic knowledge-acquisition  set of techniques for gaining new knowledge
 ex. parse Wikipedia
 crowdsourcing
 creating an expert system can be very hard
 only useful when expert isn’t available, problem uses symbolic reasoning, problem is well-structured
 MYCIN  one of first successful expert systems { cite shortliffe2012computer }
 Stanford in 1970s
 used backward chaining but would ask patient questions  sometimes too many questions
 advantages
 can explain reasoning
 can free up human experts to deal with rare problems
decisions
game trees – R&N 5.2-5.5
 minimax algorithm
 ply  half a move in a tree
 for multiplayer, the backedup value of a node n is the vector of the successor state with the highest value for the player choosing at n
 time complexity  $O(b^m)$
 space complexity  O(bm) or even O(m)
 alpha-beta pruning  with good move ordering, effectively cuts the exponent in half ($O(b^{m/2})$), letting search go about twice as deep
 once we have found out enough about n, we can prune it
 depends on moveordering
 might want to explore best moves = killer moves first
 transposition table can hash different movesets that are just transpositions of each other
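A minimal sketch of minimax with alpha-beta pruning on a small hand-built tree (the tree and its leaf values are invented for illustration):

```python
# Leaves are numbers; internal nodes are lists of children.
def alphabeta(node, alpha, beta, maximizing):
    if isinstance(node, (int, float)):   # leaf: static value
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:            # beta cutoff: MIN won't allow this node
                break
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, alpha, beta, True))
            beta = min(beta, value)
            if alpha >= beta:            # alpha cutoff
                break
        return value

# MAX root over three MIN nodes; the second MIN node gets pruned after its 2
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, float("-inf"), float("inf"), True))  # 3
```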
 imperfect real-time decisions
 can evaluate nodes with a heuristic and cutoff before reaching goal
 heuristic uses features
 want quiescent search  consider if something dramatic will happen in the next ply
 horizon effect  a position is bad but isn’t apparent for a few moves
 singular extension  allow searching for certain specific moves that are always good at deeper depths
 forward pruning  ignore some branches
 beam search  consider only n best moves
 PROBCUT  prunes branches that are probably outside the current ($\alpha, \beta$) window
 search vs lookup
 often just use lookup in the beginning
 program can solve and just lookup endgames
 stochastic games
 include chance nodes
 change minimax to expectiminimax
 $O(b^m \cdot numRolls^m)$
 cutoff evaluation function is sensitive to scaling  evaluation function must be a positive linear transformation of the probability of winning from a position
 can do alphabeta pruning analog if we assume evaluation function is bounded in some range
 alternatively, could simulate games with Monte Carlo simulation
utilities / decision theory – R&N 16.1-16.3

$P(RESULT(a)=s' \mid a,e)$  outcome state $s'$, observations $e$, action $a$
 utility function U(s)
 rational agent should choose action with maximum expected utility

expected utility $EU(a \mid e) = \sum_{s'} P(RESULT(a)=s' \mid a,e)\, U(s')$

 notation
 A>B  agent prefers A over B
 A~B  agent is indifferent between A and B
 preference relation has 6 axioms of utility theory
 orderability  A>B, A~B, or A<B
 transitivity
 continuity
 substitutability  can do algebra with preference eqns
 monotonicity  if A>B then the agent must prefer the lottery with the higher probability of A
 decomposability  2 consecutive lotteries can be compressed into single equivalent lottery
 these axioms yield a utility function
 isn’t unique (ex. a positive affine transformation yields an equivalent utility function)
 sometimes ranking, not numbers needed  value function = ordinal utility function
 agent might not be explicitly maximizing the utility function
utility functions
 preference elicitation  finds utility function
 normalized utility to have min and max value
 assess utility of s by asking agent to choose between s and the lottery (p: min, (1-p): max)
 micromort  one in a million chance of death
 QALY  qualityadjusted life year
 money
 agents exhibit a monotonic preference for more money
 a gamble has an expected monetary value = EMV
 when utility of money is sublinear  risk averse
 value agent will accept in lieu of lottery = certainty equivalent
 EMV - certainty equivalent = insurance premium
 when supralinear  risk-seeking; when linear  risk-neutral
 optimizer’s curse  tendency for the estimated E[utility] of the chosen action to be too high
 descriptive theory  how actual agents work
 decision theory  normative theory
 certainty effect  people are drawn to things that are certain
 ambiguity aversion
 framing effect  wording can influence people’s judgements
 evolutionary psychology
 anchoring effect  buy the middle-tier wine because the expensive one is there
decision theory / VPI – R&N 16.5 & 16.6
 decision network
 chance nodes  represent RVs (like a BN)
 decision nodes  points where decision maker has a choice of actions
 utility nodes  represent agent’s utility function
 can ignore chance nodes
 then the action-utility function = Q-function maps directly from actions to utility
 evaluation
 set evidence
 for each possible value of decision node
 set decision node to that value
 calculate probabilities of parents of utility node
 calculate resulting utility
 return action with highest utility
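The evaluation loop above, sketched numerically (the umbrella actions, probabilities, and utilities are invented for illustration):

```python
def best_action(actions, outcome_probs, utility):
    """outcome_probs[a]: {outcome: P(outcome | a, evidence)}."""
    def eu(a):  # expected utility over the parents of the utility node
        return sum(p * utility[o] for o, p in outcome_probs[a].items())
    return max(actions, key=eu)  # action with highest expected utility

actions = ["take umbrella", "leave it"]
outcome_probs = {
    "take umbrella": {"dry, encumbered": 1.0},
    "leave it": {"dry": 0.7, "wet": 0.3},
}
utility = {"dry, encumbered": 80, "dry": 100, "wet": 0}
print(best_action(actions, outcome_probs, utility))  # take umbrella (EU 80 vs 70)
```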
the value of information
 information value theory  enables agent to choose what info to acquire
 observations only affect the agent’s belief state
 value of info = difference in expected value between the best actions before and after the info is obtained
 value of perfect information VPI  assume we can obtain exact evidence on some variable $E_j$

$VPI_e(E_j) = \left(\sum_k P(E_j = e_{jk} \mid e)\, EU(\alpha_{e_{jk}} \mid e, E_j = e_{jk})\right) - EU(\alpha \mid e)$  info is more valuable when it is likely to cause a change of plan
 info is more valuable when the new plan will be much better than the old plan
 VPI is not linearly additive, but is order-independent

 informationgathering agent
 myopic  greedily obtain the evidence with highest VPI until it falls below some threshold
 conditional plan  non-myopic; plans for sequences of observations
mdps and rl – R&N 17.1-17.4, 21.1-21.6
 fully observable  agent knows its state
 markov decision process
 set of states
 set of actions

transition model $P(s' \mid s,a)$  reward function $R(s)$
 solution is policy $\pi^* (s)$  what action to do in state s
 optimal policy yields highest expected utility
 optimizing MDP  multiattribute utility theory
 could sum rewards, but results are infinite
 instead define objective function (maps infinite sequences of rewards to single real numbers)
 ex. set a finite horizon and sum rewards
 optimal action in a given state could change over time = non-stationary
 ex. discounting to prefer earlier rewards (most common)
 could discount reward n steps away by $\gamma^n$, $0 < \gamma < 1$
 ex. average reward rate per time step
 ex. agent is guaranteed to get to terminal state eventually  proper policy
 expected utility executing $\pi$: $U^\pi (s) = E[\sum_t \gamma^t R(S_t)]$
 when we use discounted utilities, $\pi$ is independent of starting state

$\pi^*(s) = \underset{\pi}{\operatorname{argmax}}\, U^\pi (s) = \underset{a}{\operatorname{argmax}} \sum_{s'} P(s' \mid s,a)\, U(s')$  utility of a state is the immediate reward for that state plus the expected discounted utility of the next state, assuming agent chooses optimal action
value iteration
 value iteration  calculates utility of each state and uses utilities to find optimal policy

bellman eqn  $U(s) = R(s) + \gamma\, \underset{a}{\max} \sum_{s'} P(s' \mid s, a)\, U(s')$  start with arbitrary utilities
 recalculate several times with the Bellman update to approximate solns to the bellman eqn: $U_{i+1}(s) = R(s) + \gamma\, \underset{a}{\max} \sum_{s'} P(s' \mid s, a)\, U_i(s')$

 value iteration eventually converges
 contraction  function of one variable that when applied to 2 different inputs in turn produces 2 output values that are closer together than the original inputs
 contraction only has 1 fixed point
 Bellman update is a contraction on the space of utility vectors and therefore converges
 error is reduced by factor of $\gamma$ each iteration

also, terminating condition  if $\|U_{i+1}-U_i\| < \epsilon (1-\gamma) / \gamma$ then $\|U_{i+1}-U\| < \epsilon$
what actually matters is policy loss $\|U^{\pi_i}-U\|$  the most the agent can lose by executing $\pi_i$ instead of the optimal policy $\pi^*$
if $\|U_i - U\| < \epsilon$ then $\|U^{\pi_i} - U\| < 2\epsilon \gamma / (1-\gamma)$
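A sketch of value iteration with the Bellman update on a tiny 2-state MDP (states, transitions, and rewards invented for illustration):

```python
def value_iteration(states, actions, P, R, gamma=0.9, eps=1e-6):
    """P[(s, a)]: {s2: P(s2 | s, a)}; R[s]: reward."""
    U = {s: 0.0 for s in states}
    while True:
        delta, newU = 0.0, {}
        for s in states:
            # Bellman update: reward plus discounted best expected next utility
            newU[s] = R[s] + gamma * max(
                sum(p * U[s2] for s2, p in P[(s, a)].items()) for a in actions)
            delta = max(delta, abs(newU[s] - U[s]))
        U = newU
        if delta < eps * (1 - gamma) / gamma:   # termination test from the notes
            return U

states, actions = ["a", "b"], ["stay", "go"]
P = {("a", "stay"): {"a": 1.0}, ("a", "go"): {"b": 1.0},
     ("b", "stay"): {"b": 1.0}, ("b", "go"): {"a": 1.0}}
R = {"a": 0.0, "b": 1.0}
U = value_iteration(states, actions, P, R)
print(U)   # fixed point: U["b"] = 1/(1-0.9) = 10, U["a"] = 0.9*U["b"] = 9
```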

policy iteration
 another way to find optimal policies
 policy evaluation  given a policy $\pi_i$, calculate $U_i=U^{\pi_i}$, the utility of each state if $\pi_i$ were to be executed
 like value iteration, but with a set policy so there’s no max
 can solve exactly for small spaces, or approximate
 policy improvement  calculate a new MEU policy $\pi_{i+1}$ using onestep lookahead based on $U_i$

same as above, just $\pi_{i+1}(s) = \underset{a}{\operatorname{argmax}} \sum_{s'} P(s' \mid s,a)\, U_i(s')$

 asynchronous policy iteration  don’t have to update all states at once
partially observable markov decision processes (POMDP)
 agent is not sure what state it’s in

same elements but add sensor model $P(e \mid s)$  maintain prob. distr $b(s)$ over states = the belief state
 updates like the HMM

$b'(s') = \alpha P(e \mid s') \sum_s P(s' \mid s, a)\, b(s)$  changes based on observations
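The belief update above as a numpy sketch (the transition and sensor numbers are invented for illustration):

```python
import numpy as np

def update_belief(b, T_a, O_e):
    """b: belief vector; T_a[s, s2] = P(s2|s,a); O_e[s2] = P(e|s2)."""
    b2 = O_e * (b @ T_a)        # predict with the action, weight by the evidence
    return b2 / b2.sum()        # alpha is the normalizing constant

b = np.array([0.5, 0.5])
T_a = np.array([[0.9, 0.1],
                [0.2, 0.8]])
O_e = np.array([0.8, 0.3])
print(update_belief(b, T_a, O_e))   # shifts mass toward the state matching e
```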
 optimal action depends only on the agent’s current belief state  use belief states as the states of an MDP and solve as before
 changes because state space is now continuous
 value iteration
 expected utility of executing plan $p$ in belief state $b$ is just $b \cdot \alpha_p$  dot product
 $U(b) = U^{\pi^*}(b)=\underset{p}{max} : b \cdot \alpha_p$
 utility over the continuous belief space [0,1] is piecewise linear, so we can represent it by storing the finitely many linear pieces
 do this by iterating and keeping any values that are optimal at some point
 remove dominated plans  generally this is far too inefficient
 dynamic decision network  online agent
 still don’t really understand this
reinforcement learning
 reinforcement learning  use observed rewards to learn optimal policy for the environment
 3 agent designs
 utilitybased agent  learns utility function on states
 requires model of the environment
 Q-learning agent
 learns action-utility function = Q-function mapping state-action pairs directly to utility
 reflex agent  learns policy that maps directly from states to actions
passive reinforcement learning
 given policy $\pi$, learn $U^\pi (s)$
 like policy evaluation, but transition model / reward function are unknown
 direct utility estimation  run a bunch of trials to sample utility = expected total reward from each state
 adaptive dynamic programming (ADP)  learn transition model and rewards, and then plug into Bellman eqn
 prioritized sweeping  prefers to make adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates
 two ways to add prior
 Bayesian reinforcement learning  assume a prior P(h) on the transition model

use prior to calculate $P(h \mid e)$  let $u_h^\pi$ be expected utility averaged over all possible start states, obtained by executing policy $\pi$ in model h

$\pi^* = \underset{\pi}{\operatorname{argmax}} \sum_h P(h \mid e)\, u_h^\pi$

 give best outcome in the worst case over H (from robust control theory)
 $\pi^* = \underset{\pi}{argmax} \underset{h}{min} u_h^\pi$
 temporaldifference learning  adjust utility estimates towards the ideal equilibrium that holds locally when the utility estimates are correct
 $U^\pi (s) \leftarrow U^\pi (s) + \alpha (R(s) + \gamma U^\pi (s') - U^\pi (s))$
 like a crude approximation of ADP
active reinforcement learning
 explore states to find their utilities and exploit model to get highest reward
 bandit problems  determining exploration policy
 should be GLIE  greedy in the limit of infinite exploration  visits all states infinitely, but eventually become greedy
 ex. choose random action 1/t of the time
 better ex. give optimistic prior utility to unexplored states
 uses exploration function f(u,numTimesVisited) in utility update rule
 n-armed bandit  pulling n levers on a slot machine, each with a different payoff distr.
 Gittins index  function of number of pulls / payoff
learning actionutility function
 $U(s) = \underset{a}{\max}\, Q(s,a)$

does require $P(s' \mid s,a)$ if we use ADP
doesn’t require knowing $P(s' \mid s,a)$ if we use TD: $Q(s,a) \leftarrow Q(s,a) + \alpha (R(s) + \gamma\, \underset{a'}{\max}\, Q(s', a') - Q(s,a))$

 SARSA is related: $Q(s,a) \leftarrow Q(s,a) + \alpha (R(s) + \gamma Q(s', a') - Q(s,a))$
 here, a’ is action actually taken
 SARSA is on-policy while Q-learning is off-policy
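The TD Q-learning update above as a sketch on a single logged transition (the `q_update` name, states, actions, and numbers are invented for illustration):

```python
def q_update(Q, actions, s, a, r, s2, alpha=0.5, gamma=0.9):
    """One TD Q-learning step on an observed transition (s, a, r, s2)."""
    best_next = max(Q[(s2, a2)] for a2 in actions)   # off-policy: max over a'
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

actions = ["left", "right"]
Q = {(s, a): 0.0 for s in [0, 1] for a in actions}
Q[(1, "left")] = 2.0
q_update(Q, actions, 0, "right", r=1.0, s2=1)
print(Q[(0, "right")])   # 0 + 0.5*(1 + 0.9*2 - 0) = 1.4
```

SARSA would replace `best_next` with `Q[(s2, a2)]` for the action `a2` actually taken next.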
generalization
 approximate Qfunction
 ex. linear function of parameters
 can learn params online with the delta rule = Widrow-Hoff rule: $\theta_i \leftarrow \theta_i - \alpha\, \frac{\partial Loss}{\partial \theta_i}$
policy search
 keep twiddling the policy as long as it improves, then stop
 store one Qfunction (parameterized by $\theta$) for each action
 $\pi(s) = \underset{a}{\operatorname{argmax}}\, \hat{Q}_\theta (s,a)$
 this is discontinuous, so instead often use a stochastic policy representation (ex. softmax for $\pi_\theta (s,a)$)
 learns $\theta$ that results in good performance
 Q-learning learns the actual Q* function  the searched $\hat{Q}_\theta$ could be different (off by a scaling factor etc.) yet give the same policy
 to find $\pi$ maximize policy value $p(\theta)$
 could do this with gradient ascent / empirical gradient hill climbing
 when environment/policy is stochastic, more difficult
 could sample multiple times to compute gradient
 REINFORCE algorithm  could approximate gradient at $\theta$ by just sampling at $\theta$: $\nabla_\theta p(\theta) \approx \frac{1}{N} \sum_{j=1}^N \frac{(\nabla_\theta \pi_\theta (s,a_j)) R_j (s)}{\pi_\theta (s,a_j)}$
 PEGASUS  correlated sampling  ex. 2 blackjack programs would both be dealt same hands
applications
 game playing
 robot control
logic and planning
 knowledgebased agents  intelligence is based on reasoning that operates on internal representations of knowledge
logical agents  7.1-7.7 (omitting 7.5.2)
 3 steps  given a percept, the agent
 adds the percept to its knowledge base
 asks the knowledge base for the best action
 tells the knowledge base that it has taken that action
 declarative approach  tell sentences until agent knows how to operate
 procedural approach  encodes desired behaviors as program code
 ex. Wumpus World
 logical entailment between sentences
 B follows logically from A (A implies B)
 $A \vDash B$
 model checking  try everything to see if A $\implies$ B
 this is sound = truth-preserving
 complete  can derive any sentence that is entailed
 grounding  connection between logic and real environment (usually sensors)
 inference
 TT-ENTAILS  recursively enumerate all models (truth-table rows)  check that the query holds wherever the KB does
 theorem properties
 satisfiable  true under some model
 validity  tautology  true under all models
 monotonicity  set of entailed sentences can only increase as info is added to the knowledge base
 if $KB \vDash A$ then $KB \land B \vDash A$
theorem proving
 resolution rule  resolves complementary literals in two clauses  leads to a complete inference procedure
 CNF  conjunctive normal form  conjunction of clauses
 any sentence can be expressed in CNF
 skip this  resolution algorithm: to check if $KB \vDash A$, check if $KB \land \lnot A$ is unsatisfiable
 keep adding clauses until
 nothing can be added
 get empty clause so $KB \vDash A$
 ground resolution thm  if a set of clauses is unsatisfiable, then the resolution closure of those clauses contains the empty clause
 resolution closure  set of all clauses derivable by repeated application of resolution rule
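A propositional resolution sketch, with clauses as frozensets of literals ("p" / "~p"); the `entails` helper and the KB are invented for illustration:

```python
from itertools import combinations

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """All resolvents of two clauses on complementary literals."""
    return [(c1 - {lit}) | (c2 - {negate(lit)})
            for lit in c1 if negate(lit) in c2]

def entails(clauses):
    """clauses already include the negated query; True iff unsatisfiable."""
    clauses = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolve(c1, c2):
                if not r:
                    return True          # derived the empty clause
                new.add(frozenset(r))
        if new <= clauses:
            return False                 # resolution closure reached
        clauses |= new

# KB: p, p → q; query q, so add ¬q and resolve to the empty clause
kb = [frozenset({"p"}), frozenset({"~p", "q"}), frozenset({"~q"})]
print(entails(kb))  # True
```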
 restricted knowledge bases
 horn clause  at most one positive literal
 definite clause  disjunction of literals with exactly one positive
 goal clause  no positive literals
 benefits
 easy to understand
 forwardchaining / backwardchaining are applicable
 deciding entailment is linear in size(KB)
 forward/backward chaining
 checks if q is entailed by KB of definite clauses
 keep adding until query is added or nothing else can be added
 backward chaining works backwards from the query
 checking satisfiability
 complete backtracking
 Davis-Putnam algorithm = DPLL  like TT-ENTAILS with 3 improvements
 early termination
 pure symbol heuristic  pure symbol appears with same sign in all clauses
 unit clause heuristic  clause with just one literal, or with one literal not already assigned false
 other improvements (similar to search)
 component analysis
 variable and value ordering
 intelligent backtracking
 random restarts
 clever indexing
 local search
 evaluation function can just count the number of unsatisfied clauses (MIN-CONFLICTS algorithm for CSPs)
 WALKSAT  picks an unsatisfied clause, then randomly chooses between a min-conflicts flip and a random flip
 runs forever if no soln
 underconstrained problems are easy to find solns to
 satisfiability threshold conjecture  for random clauses, probability of satisfiability goes to 0 or 1 based on ratio of clauses to symbols
 hardest problems are at the threshold
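A WALKSAT sketch: literals are (symbol, sign) pairs; the tiny CNF instance, parameter values, and helper names are invented for illustration:

```python
import random

def walksat(clauses, p=0.5, max_flips=10000, seed=0):
    rng = random.Random(seed)
    syms = sorted({s for c in clauses for s, _ in c})
    model = {s: rng.random() < 0.5 for s in syms}   # random initial assignment

    def unsat_after_flip(sym):
        model[sym] = not model[sym]
        n = sum(1 for c in clauses
                if not any(model[s] == sign for s, sign in c))
        model[sym] = not model[sym]
        return n

    for _ in range(max_flips):
        unsat = [c for c in clauses
                 if not any(model[s] == sign for s, sign in c)]
        if not unsat:
            return model                             # satisfying assignment
        clause = rng.choice(unsat)
        if rng.random() < p:
            sym = rng.choice(clause)[0]              # random walk step
        else:                                        # min-conflicts step
            sym = min((s for s, _ in clause), key=unsat_after_flip)
        model[sym] = not model[sym]
    return None                                      # gives up (runs "forever" if no soln)

# (a ∨ b) ∧ (¬a ∨ b) ∧ (a ∨ ¬b): satisfied by a = b = True
clauses = [[("a", True), ("b", True)],
           [("a", False), ("b", True)],
           [("a", True), ("b", False)]]
print(walksat(clauses))
```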
 state variables that change over time also called fluents
 can index these by time
 effect axioms  specify the effect of an action at the next time step
 frame axioms  assert that all propositions remain the same
 successor-state axiom: $F^{t+1} \iff ActionCausesF^t \lor (F^t \land \lnot ActionCausesNotF^t)$
 keeping track of belief state
 can just use 1CNF
 1CNF includes all states that are in fact possible given the full percept history
 conservative approximation
 SATPLAN  how to make plans for future actions that solve the goal by propositional inference
 must add precondition axioms  state that an action occurrence requires its preconditions to be satisfied
 action exclusion axioms  one action at a time
first-order logic  8.1-8.3.3
 declarative language  semantics based on a truth relation between sentences and possible worlds
 has compositionality  meaning decomposes
 SapirWhorf hypothesis  understanding of the world is influenced by the language we speak
 3 elements
 objects
 relations
 functions  only one value for given input
 first-order logic assumes more about the world than propositional logic
 epistemological commitments  the possible states of knowledge that a logic allows with respect to each fact
 higher-order logic  views relations and functions as objects in themselves
 first-order logic consists of symbols
 constant symbols  stand for objects
 predicate symbols  stand for relations
 function symbols  stand for functions
 arity  fixes number of args
 term  logical expression that refers to an object
 atomic sentence  formed from a predicate symbol optionally followed by a parenthesized list of terms
 true if relation holds among objects referred to by the args
 quantifiers  $\forall, \exists$, etc.
 interpretation  specifies exactly which objects, relations and functions are referred to by the symbols
inference in first-order logic  9.1-9.4
 propositionalization  can convert first-order logic to propositional logic and do propositional inference
 universal instantiation  we can infer any sentence obtained by substituting a ground term for the variable
 replace “forall x” with a specific x
 existential instantiation  variable is replaced by a new constant symbol
 replace “there exists x” with a specific x
 Skolem constant  new name of constant
 only need finite subset of propositionalized KB  can stop nested functions at some depth
 semidecidable  algorithms exist that say yes to every entailed sentence, but no algorithm exists that also says no to every nonentailed sentence
 generalized modus ponens
 unification  finding substitutions that make different logical expressions look identical
 UNIFY(Knows(John,x), Knows(x,Elizabeth)) = fail
 use different x’s  standardizing apart
 want most general unifier
 need occur check  S(x) can’t unify with S(S(x))
 storage and retrieval
 STORE(s)  stores a sentence s into the KB
 FETCH(q)  returns all unifiers such that the query q unifies with some sentence in the KB
 only try to unify reasonable facts, using indexing
 query such as Employs(IBM, Richard)
 all possible unifying queries form subsumption lattice
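A unification sketch with the occur check: variables are strings starting with "?", compound terms are tuples (representation and example terms invented for illustration):

```python
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def occurs(v, t, theta):
    """Occur check: does variable v appear inside term t under theta?"""
    if v == t:
        return True
    if is_var(t) and t in theta:
        return occurs(v, theta[t], theta)
    if isinstance(t, tuple):
        return any(occurs(v, ti, theta) for ti in t)
    return False

def unify_var(v, t, theta):
    if v in theta:
        return unify(theta[v], t, theta)
    if is_var(t) and t in theta:
        return unify(v, theta[t], theta)
    if occurs(v, t, theta):              # ?x can't unify with S(?x)
        return False
    return {**theta, v: t}

def unify(x, y, theta=None):
    """Return a substitution making x and y identical, or False."""
    theta = {} if theta is None else theta
    if theta is False or x == y:
        return theta
    if is_var(x):
        return unify_var(x, y, theta)
    if is_var(y):
        return unify_var(y, x, theta)
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for xi, yi in zip(x, y):
            theta = unify(xi, yi, theta)
            if theta is False:
                return False
        return theta
    return False

# succeeds once the two x's are standardized apart into ?x and ?y
print(unify(("Knows", "John", "?x"), ("Knows", "?y", "Elizabeth")))
```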
 forward chaining
 first-order definite clauses  disjunctions of literals of which exactly one is positive (could also be an implication whose consequent is a single positive literal)
 Datalog  language restricted to firstorder definite clauses with no function symbols
 simple forward-chaining: FOL-FC-ASK
 pattern matching is expensive
 rechecks every rule
 generates irrelevant facts
 efficient forward chaining (solns to above problems)
 conjunct ordering problem  find an ordering to solve the conjuncts of the rule premise so the total cost is minimized
 requires heuristics (ex. minimumremainingvalues)
 incremental forward chaining  ignore redundant rules
 every new fact inferred on iteration t must be derived from at least one new fact inferred on iteration t-1
 rete algorithm was first to do this
 irrelevant facts can be ignored by backward chaining
 could also use deductive database to keep track of relevant variables
 backwardchaining
 simple backward-chaining: FOL-BC-ASK
 is a generator  returns multiple times, each giving one possible result
 logic programming: algorithm = logic + control
 ex. prolog
 a lot more here
 can have parallelism
 redundant inference / infinite loops because of repeated states and infinite paths
 can use memoization (similar to the dynamic programming that forwardchaining does)
 generally easier than converting it into FOLD
 constraint logic programming  allows variables to be constrained rather than bound
 allows for things with infinite solns
 can use metarules to determine which conjuncts are tried first
classical planning 10.1-10.2
 planning  devising a plan of action to achieve one’s goals
 Planning Domain Definition Language (PDDL)  uses factored representation of world
 closedworld assumption  fluents that aren’t present are false
 set of ground (variable-free) actions can be represented by a single action schema
 like a method
algorithms for planning as state-space search
knowledge representation 12.1-12.3
 ontological engineering  representing objects and their relationships
 upper ontology  tree; more general at the top, more specific at the bottom
 must represent categories
 subcategories make a taxonomy
 can also define functions
 mass noun  function that includes only intrinsic properties
 count noun  function that includes any extrinsic properties

Deep Learning
 neural networks
 training
 CNNs
 1  AlexNet (2012)
 2  ZFNet (2013)
 3  VGGNet (2014)
 4  GoogLeNet (2015)
 5  Msft ResNet (2015)
 6  Region Based CNNs (RCNN  2013, Fast RCNN  2015, Faster RCNN  2015)
 7 GAN (2014)
 8  Karpathy Generating image descriptions (2014)
 9  Spatial transformer networks (2015)
 10  Segnet (2015)
 11  Unet (2015)
 12  Pixelnet (2017)
 recent papers
 RNNs
[toc]
neural networks
 basic perceptron update rule
 if output is 0 but should be 1: raise weights on active connections by d
 if output is 1 but should be 0: lower weights on active connections by d
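The update rule above, sketched on the AND function (the data, learning rate d, and a bias term treated as an always-active input are added for illustration):

```python
def train_perceptron(data, d=0.1, epochs=20):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in data:
            out = 1 if w[0]*x[0] + w[1]*x[1] + b > 0 else 0
            if out == 0 and target == 1:      # raise weights on active inputs
                w = [wi + d*xi for wi, xi in zip(w, x)]
                b += d
            elif out == 1 and target == 0:    # lower weights on active inputs
                w = [wi - d*xi for wi, xi in zip(w, x)]
                b -= d
    return w, b

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
print([1 if w[0]*x0 + w[1]*x1 + b > 0 else 0 for (x0, x1), _ in data])  # [0, 0, 0, 1]
```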
 transfer / activation functions
 sigmoid(z) = $\frac{1}{1+e^{-z}}$
 Binary step
 TanH
 Rectifier = ReLU
 deep  more than 1 hidden layer
 regression loss = $\frac{1}{2}(y-\hat{y})^2$
 classification loss = $-y \log (\hat{y}) - (1-y) \log(1-\hat{y})$
 can’t use SSE because not convex here
 multiclass classification loss $= -\sum_j y_j \ln \hat{y}_j$
 backpropagation  application of reverse-mode automatic differentiation to a neural network’s loss
 apply the chain rule from the end of the program back towards the beginning
 $\frac{dL}{dx_i} = \frac{dL}{dz} \frac{\partial z}{\partial x_i}$
 sum $\frac{dL}{dz}$ if neuron has multiple outputs z
 L is output
 $\frac{\partial z}{\partial x_i}$ is actually a Jacobian (deriv each $z_i$ wrt each $x_i$  these are vectors)
 each gate usually has some sparsity structure so you don’t compute whole Jacobian
 pipeline
 initialize weights, and final derivative ($\frac{dL}{dL}=1$)
 for each batch
 run network forward to compute outputs at each step
 compute gradients at each gate with backprop
 update weights with SGD
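The pipeline above on a tiny 2-layer network in numpy: forward pass, chain-rule backward pass, then an SGD step (the shapes, target, and learning rate are invented for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, W2):
    # forward: run network to compute outputs at each step
    h = sigmoid(W1 @ x)                  # hidden activations
    yhat = W2 @ h                        # linear output
    loss = 0.5 * (yhat - y) ** 2         # regression loss from the notes
    # backward: chain rule from the loss toward the inputs
    dyhat = yhat - y                     # dL/dyhat
    dW2 = dyhat * h                      # dL/dW2
    dh = dyhat * W2                      # dL/dh
    dz = dh * h * (1 - h)                # sigmoid'(z) = h(1-h)
    dW1 = np.outer(dz, x)                # dL/dW1 (Jacobian chain collapses here)
    return loss, dW1, dW2

rng = np.random.default_rng(0)
x, y = np.array([1.0, -2.0]), 0.5
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=3)
lr = 0.1
for _ in range(100):                     # SGD on a single example
    loss, dW1, dW2 = forward_backward(x, y, W1, W2)
    W1 -= lr * dW1
    W2 -= lr * dW2
print(float(loss))                       # loss shrinks toward 0
```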
training
 vanishing gradients problem  neurons in earlier layers learn more slowly than in later layers
 happens with sigmoids
 dead ReLUs
 exploding gradients problem  gradients are significantly larger in earlier layers than later layers
 RNNs
 batch normalization  whiten inputs to all neurons (zero mean, variance of 1)
 do this for each input to the next layer
 dropout  randomly zero outputs of p fraction of the neurons during training
 like learning large ensemble of models that share weights
 2 ways to compensate (pick one)
 at test time multiply all neurons’ outputs by (1-p)
 during training divide surviving neurons’ outputs by (1-p) (inverted dropout)
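The train-time compensation as a numpy sketch, taking p as the fraction dropped so survivors are scaled by 1/(1-p) (function name and sizes invented for illustration):

```python
import numpy as np

def dropout_train(activations, p, rng):
    """Zero roughly a p fraction of outputs, scale survivors by 1/(1-p)."""
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

rng = np.random.default_rng(0)
a = np.ones(100000)
out = dropout_train(a, p=0.5, rng=rng)
print(out.mean())   # ≈ 1.0 in expectation, so test time needs no rescaling
```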
 softmax  takes vector z and returns vector of the same length
 makes it so output sums to 1 (like probabilities of classes)
CNNs
 kernel here means filter
 convolution G takes a windowed average of an image F with a filter H where the filter is flipped horizontally and vertically before being applied
 G = H $\ast$ F
 if we do a filter with just a 1 in the middle, we get the exact same image
 you can basically always pad with zeros as long as you keep 1 in middle
 can use these to detect edges with small convolutions
 can do Gaussian filters
 convolutions typically sum over all color channels
 weight matrices have special structure (Toeplitz or block Toeplitz)
 input layer is usually centered (subtract mean over training set)
 usually crop to fixed size (square input)
 receptive field  input region
 stride m  compute only every mth pixel
 downsampling
 max pooling  backprop error back to neuron w/ max value
 average pooling  backprop splits error equally among input neurons
 data augmentation  random rotations, flips, shifts, recolorings
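A sketch of the convolution described above, with the filter flip and zero padding made explicit (the function name and shapes are my own); the identity-filter fact is easy to check:

```python
import numpy as np

def conv2d(F, H):
    """'Same'-size convolution G = H * F: flip the filter horizontally and
    vertically, then take windowed sums over a zero-padded image."""
    kh, kw = H.shape
    Hf = H[::-1, ::-1]                        # flip both ways
    ph, pw = kh // 2, kw // 2
    Fp = np.pad(F, ((ph, ph), (pw, pw)))      # zero padding keeps output size
    G = np.zeros_like(F, dtype=float)
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            G[i, j] = np.sum(Fp[i:i + kh, j:j + kw] * Hf)
    return G

img = np.arange(16, dtype=float).reshape(4, 4)
identity = np.zeros((3, 3))
identity[1, 1] = 1.0   # filter with just a 1 in the middle
```

Convolving with the identity filter returns the exact same image, as the notes state.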
1  AlexNet (2012)
 landmark (5 conv layers, some pooling/dropout)
2  ZFNet (2013)
 fine tuning and deconvnet
3  VGGNet (2014)
 19 layers, all 3x3 conv layers and 2x2 maxpooling
4  GoogLeNet (2015)
 lots of parallel elements (called Inception module)
5  Msft ResNet (2015)
 very deep  152 layers
 connections straight from initial layers to end
 only learn “residual” from top to bottom
6  Region Based CNNs (RCNN  2013, Fast RCNN  2015, Faster RCNN  2015)
 object detection
7 GAN (2014)
 might not converge
 generative adversarial network
 goal: want G to generate distribution that follows data
 ex. generate good images
 two models
 G  generative
 D  discriminative
 G generates adversarial sample x for D
 G has prior z
 D gives probability p that x comes from data, not G
 like a binary classifier: 1 if from data, 0 from G
 adversarial sample  from G, but tricks D to predicting 1
 training goals
 G wants D(G(z)) = 1
 D wants D(G(z)) = 0
 D(x) = 1
 converge when D(G(z)) = 1/2
 G loss function: $G = \text{argmin}_G \log(1-D(G(z)))$
 overall: $\min_G \max_D \; \log D(x) + \log(1-D(G(z)))$ (in expectation over data x and prior z)
 training algorithm
 in the beginning, since G is bad, $\log(1-D(G(z)))$ saturates, so train G by maximizing $\log D(G(z))$ instead
 later, alternate: max D by SGD, min G by SGD
8  Karpathy Generating image descriptions (2014)
 RNN+CNN
9  Spatial transformer networks (2015)
 transformations within the network
10  Segnet (2015)
 encoder-decoder network
11  Unet (2015)
 Ronneberger  applies to biomedical segmentation
12  Pixelnet (2017)
 predicts pixel-level outputs for different tasks with the same architecture
 convolutional layers then 3 FC layers which use outputs from all convolutional layers together
recent papers
 deepmind’s learning to learn
 optimal brain damage  starts with fully connected and weeds out connections (LeCun)
 tiling  train networks on the error of previous networks
RNNs
 feedforward NNs have no memory so we introduce recurrent NNs
 able to have memory
 could theoretically unfold the network and train with backprop
 truncated  limit number of times you unfold
 $state_{new} = f(state_{old},input_t)$
 ex. $h_t = \tanh(W_1 h_{t-1}+W_2 x_t)$
 train with backpropagation through time (unfold through time)
 truncated backprop through time  only backpropagate through the last k time steps
 error gradients vanish exponentially quickly with time lag
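A minimal forward unfold of the recurrence above; the weight scales, sizes, and sequence length are arbitrary illustrations:

```python
import numpy as np

def rnn_forward(xs, W1, W2, h0):
    """Unfold h_t = tanh(W1 h_{t-1} + W2 x_t) over an input sequence,
    returning the hidden state at every step."""
    h, hs = h0, []
    for x in xs:
        h = np.tanh(W1 @ h + W2 @ x)   # new state from old state and input
        hs.append(h)
    return hs

rng = np.random.default_rng(1)
W1 = rng.normal(size=(3, 3)) * 0.1      # state-to-state weights
W2 = rng.normal(size=(3, 2)) * 0.1      # input-to-state weights
xs = [rng.normal(size=2) for _ in range(5)]
hs = rnn_forward(xs, W1, W2, np.zeros(3))
```

Training would backpropagate through this unfolded chain (BPTT), which is where the exponential vanishing of gradients with time lag shows up.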
LSTMs
 have gates for forgetting, input, output
 easy to let hidden state flow through time, unchanged
 gate $\sigma$  pointwise multiplication
 multiply by 0  let nothing through
 multiply by 1  let everything through
 forget gate  conditionally discard previously remembered info
 input gate  conditionally remember new info
 output gate  conditionally output a relevant part of memory
 GRUs  similar, merge input / forget units into a single update unit

Dimensionality Reduction
PCA
 have p random variables
 want new set of K axes (linear combinations of the original p axes) in the direction of greatest variability
 this is best for visualization, reduction, classification, noise reduction
 to find axis  maximize sum of squares of projections onto line ($v^TX^TXv$ subject to $v^T v=1$)
 $\implies v^T(X^TXv\lambda v)=0$
 SVD: let $X = U D V^T$
 $V_q$ (pxq) is first q columns of V
 $H = V_q V_q^T$ is the projection matrix
 to transform: $\hat{x} = Hx$
 columns of $UD$ (Nxp) are called the principal components of X
 eigenvectors of covariance matrix → principal components
 most important corresponds to largest eigenvalue (eigenvalue corresponds to variance)
 finding eigenvectors can be hard to solve, so 3 other methods
 singular value decomposition (SVD)
 multidimensional scaling (MDS)
 based on eigenvalue decomposition
 adaptive PCA
 extract components sequentially, starting with highest variance so you don’t have to extract them all
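A PCA-via-SVD sketch following the notation above ($V_q$ as the first q columns of V; squared singular values give the per-component variances); the data and sizes are invented:

```python
import numpy as np

def pca(X, q):
    """Project centered data onto V_q, the first q right singular vectors of X."""
    Xc = X - X.mean(axis=0)                     # center each feature
    U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vq = Vt[:q].T                               # p x q
    return Xc @ Vq, D ** 2 / (len(X) - 1)       # scores, component variances

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[:, 0] *= 10                                   # first feature dominates variance
scores, var = pca(X, 2)
```

The variances come out sorted, matching "most important corresponds to largest eigenvalue."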
nonlinear PCA
 usually uses an autoassociative neural network
ICA
 like PCA, but instead of the dot product between components being 0, the mutual info between components is 0
 goals
 minimizes statistical dependence between its components
 maximize information transferred in a network of nonlinear units
 uses information theoretic unsupervised learning rules for neural networks
 problem  doesn’t rank features for us

Learning Theory
 books
 bounds
 approximations
 evolution
 sample problems
 computational learning theory
 concept learning and the generaltospecific ordering
[toc]
books
 Machine Learning  Tom Mitchell
 An Introduction to Computational Learning Theory  Kearns & Vazirani
bounds
 2 major inequalities
 Markov’s inequality
 $P(X \geq a) \leq \frac{E[X]}{a}$
 X is typically running time of the algorithm
 if we don’t have E[X], can use upper bound for E[X]
 Chebyshev’s inequality

 $P(|X-\mu| \geq a) \leq \frac{Var[X]}{a^2}$
 utilizes the variance to get a better bound
 CLT
 law of large numbers
 Chernoff bounds
 Hoeffding bounds

approximations
 $\binom{n}{k} < \left( \frac{ne}{k} \right)^k$
 $\left( \frac{n}{e} \right)^n < n!$
 $(1-x)^N \leq e^{-Nx}$
 Poisson pmf approximates binomial when N large, p small
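These bounds can be spot-checked numerically; the values n=20, k=5, x=0.1, N=50 are arbitrary choices:

```python
import math

n, k, x, N = 20, 5, 0.1, 50
lhs_binom = math.comb(n, k)             # C(n, k)
rhs_binom = (n * math.e / k) ** k       # (ne/k)^k
lhs_fact = (n / math.e) ** n            # (n/e)^n
rhs_fact = math.factorial(n)            # n!
lhs_exp = (1 - x) ** N                  # (1-x)^N
rhs_exp = math.exp(-N * x)              # e^{-Nx}
```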
evolution
 performance is correlation $Perf_D (h,c) = \sum h(x) \cdot c(x) \cdot P(x)$
 want $P(Perf_D(h,c) < Perf_D(c,c)-\epsilon) < \delta$
sample problems
 ex: N marbles in a bag. How many draws with replacement needed before we draw all N marbles?
 write $P_i = \frac{N-(i-1)}{N}$ where i is number of distinct drawn marbles
 transition from i to i+1 is geometrically distributed with probability $P_i$
 mean time is the sum of the means of each geometric
 in order to get probabilities of seeing all the marbles instead of just mean[# draws], want to use Markov’s inequality
 box full of 1e6 marbles
 if we have 10 evenly distributed classes of marbles, what is probability we identify all 10 classes of marbles after 100 draws?
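A sketch of the marble problem above for N=10: the analytic mean from summing the geometric means, a simulation, and the Markov bound. The threshold 60 and the simulation count are my own choices:

```python
import random

def expected_draws(N):
    """E[# draws to see all N marbles]: after i distinct marbles, a new one
    appears with probability (N - i)/N, so that stage takes N/(N - i) draws
    on average; sum the stages."""
    return sum(N / (N - i) for i in range(N))

def draws_to_collect(N, rng):
    seen, draws = set(), 0
    while len(seen) < N:
        seen.add(rng.randrange(N))
        draws += 1
    return draws

rng = random.Random(0)
E = expected_draws(10)                                 # 10 * H_10, about 29.3
sims = [draws_to_collect(10, rng) for _ in range(2000)]
# Markov: P(X >= 60) <= E/60; the empirical fraction should respect the bound
markov_frac = sum(d >= 60 for d in sims) / len(sims)
```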
computational learning theory
 frameworks
 PAC
 mistake-bound  split into b processes which each fail with probability at most $\delta / b$
 questions
 sample complexity  how many training examples needed to converge
 computational complexity  how much computational effort needed to converge
 mistake bound  how many training examples will learner misclassify before converging
 must define convergence based on some probability
PAC  probably learning an approximately correct hypothesis  Mitchell
 want to learn C
 data X is sampled with Distribution D
 learner L considers set H of possible hypotheses
 true error $err_D(h)$ of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.
 $err_D(h) = \underset{x\in D}{Pr}[c(x) \neq h(x)]$
 getting $err_D(h)=0$ is infeasible
 PAC learnable  consider concept class C defined over set of instances X of length n and a learner L using hypothesis space H
 C is PAC-learnable by L using H if for all $c \in C$, distributions D over X, $\epsilon$ s.t. $0 < \epsilon < 1/2$, and $\delta$ s.t. $0<\delta<1/2$, learner L will with probability at least $(1-\delta)$ output a hypothesis $h \in H$ s.t. $err_D(h) \leq \epsilon$
 efficiently PAC-learnable  time that is polynomial in $1/\epsilon, 1/\delta, n, size(c)$
 probably  probability of failure bounded by some constant $\delta$
 approximately correct  err bounded by some constant $\epsilon$
 assumes H contains a hypothesis with arbitrarily small error for every target concept in C
sample complexity for finite hypothesis space  Mitchell
 sample complexity  growth in the number of training examples required
 consistent learner  outputs hypotheses that perfectly fit training data whenever possible
 outputs a hypothesis belonging to the version space
 consider hypothesis space H, target concept c, instance distribution $\mathcal{D}$, training examples D of c. The version space $VS_{H,D}$ is $\epsilon$-exhausted with respect to c and $\mathcal{D}$ if every hypothesis h in $VS_{H,D}$ has error less than $\epsilon$ with respect to c and $\mathcal{D}$: $(\forall h \in VS_{H,D}) err_\mathcal{D} (h) < \epsilon$
rectangle learning game  Kearns
 data X is sampled with Distribution D
 simple soln: tightestfit rectangle
 define region T so prob a draw misses T is $1-\epsilon /4$
 then, m draws miss with probability $(1-\epsilon /4)^m$
 choose m to satisfy $4(1-\epsilon/4)^m \leq \delta$
VC dimension
 VC dimension measures the capacity of a space of functions that can be learned by a statistical classification algorithm
 let H be set of sets and C be a set

 $H \cap C := \{ h \cap C : h \in H \}$
 a set C is shattered by H if $H \cap C$ contains all subsets of C
 The VC dimension of $H$ is the largest integer $D$ such that there exists a set $C$ with cardinality $D$ that is shattered by $H$
 VC dimension 0 → hypothesis either always returns false or always returns true
 Sauer’s lemma  let $d \geq 0, m \geq 1$, $H$ hypothesis space, VCdim(H) = d. Then, $\Pi_H(m) \leq \phi (d,m)$
 fundamental theorem of learning theory provides bound of m that guarantees learning: $m \geq [\frac{4}{\epsilon} \cdot (d \cdot ln(\frac{12}{\epsilon}) + ln(\frac{2}{\delta}))]$
concept learning and the generaltospecific ordering
 definitions
 concept learning  acquiring the definition of a general category given a sample of positive and negative training examples of the category
 concept is boolean function that returns true for specific things
 can represent hypothesis as a vector of acceptable feature values, ? (any value), or null (if any element is null, the hypothesis classifies everything negative)
 general hypothesis  more generally true
 general defines a partial ordering
 a hypothesis is consistent with the training examples if it correctly classifies them
 an example x satisfies a hypothesis h if h(x) = 1
 find-S  finding a maximally specific hypothesis
 start with most specific possible
 generalize each time it fails to cover an observed positive training example
 flaws
 ignores negative examples
 only guaranteed to find the right answer if 1. training data has no errors and 2. there exists a hypothesis in H that describes target concept c
 version space  set of all hypotheses consistent with the training examples
 list-then-eliminate  list all hypotheses and eliminate any that are inconsistent (slow)
 candidate-elimination  represent most general (G) and most specific (S) members of version space
 version space representation theorem  version space can be found from most general / specific version space members
 for positive examples
 make S more general
 fix G
 for negative examples
 fix S
 make G more specific
 in general, optimal query strategy is to generate instances that satisfy exactly half the hypotheses in the current version space
 testing?
 classify as positive if satisfies S
 classify as negative if doesn’t satisfy G
 bias
 unbiased learner
 might have to learn a union of rules  then target concept is expressible
 however, this doesn’t generalize at all
 thus need the fundamental property of inductive inference: a learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances
 define inductive bias of a learner as the set of additional assumptions B sufficient to justify its inductive inferences as deductive inferences
 required to generalize
 inductive bias of candidate-elimination  target concept c is contained in H

Machine Learning
[toc]
Overview
 3 types
 supervised
 unsupervised
 reinforcement
Evaluation
 accuracy = number of correct classifications / total number of test cases
 balanced accuracy = 1/2 (TP/P + TN/N)
 recall  TP/(TP+FN)
 precision  TP/(TP+FP)
 you train by minimizing SSE on training data
 report MSE for test samples
 cross validation  don’t have enough data for a test set
 k-fold  split data into k pieces, test on only 1
 LOOCV  train on all but one
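The metric definitions above can be checked from confusion-matrix counts; the counts here are made up:

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, balanced accuracy, recall, and precision from the
    definitions in the notes."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    balanced = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, balanced, recall, precision

acc, bal, rec, prec = metrics(tp=8, fp=2, tn=85, fn=5)
```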
Error
 Define a loss function $\mathcal{L}$

 0-1 loss: $I(C \neq f(X))$
 $L_2$ loss: $(C-f(X))^2$

 Expected Prediction Error EPE(f) = $E_{X,C} [\mathcal{L}(C,f(X))]$

 $=E_{X}\left[ \sum_i \mathcal{L}(C_i,f(X)) Pr(C_i|X) \right]$

 Minimize EPE

 Bayes Classifier minimizes 0-1 loss: $\hat{f}(X)=C_i$ if $P(C_i|X)=\max_C P(C|X)$
 KNN minimizes $L_2$ loss: $\hat{f}(X)=E(Y|X)$

 EPE(f(X)) = $noise^2+bias^2+variance$
 noise  unavoidable
 bias=$E[(\bar{\theta}-\theta_{true})^2]$  error due to incorrect assumptions
 simple linear regression has 0 bias
 variance=$E[(\bar{\theta}-\theta_{estimated})^2]$  error due to variance of training samples
 more complex models (more nonzero parameters) have lower bias, higher variance
 if high bias, train and test error will be very close (model isn’t complex enough)
Classification
 asymptotic classifier  assume you get infinite training / testing points

 discriminative  model $P(C|X)$ directly
 smaller asymptotic error
 slow convergence ~ O(p)
 generative  model $P(X|C)$ directly
 generally has higher bias → can handle missing data
 fast convergence ~ O(log(p))
Discriminative
SVMs
 svm benefits
 maximum margin separator generalizes well
 kernel trick makes it very nonlinear
 nonparametric  can retain training examples, although often get rid of many
 notation
 $y \in \{-1,1\}$
 $h(x) = g(w^tx +b)$
 g(z) = 1 if $z \geq 0$ and $-1$ otherwise
 define functional margin $\gamma^{(i)} = y^{(i)} (w^t x +b)$
 want to limit the size of (w,b) so we can’t arbitrarily increase functional margin
 functional margin $\hat{\gamma}$ is smallest functional margin in a training set

 geometric margin = functional margin / $\|w\|$
 if $\|w\|=1$, then same as functional margin  invariant to scaling of w

 optimal margin classifier

 want $\max_{\gamma,w,b} \; \gamma$ s.t. $y^{(i)} (w^T x^{(i)} + b) \geq \gamma,\; i=1,\dots,m;\; \|w\|=1$
 difficult to solve, especially because of $\|w\|=1$ constraint
 assume $\hat{\gamma}=1$  just a scaling factor
 now we are maximizing $1/\|w\|$
 equivalent to this formulation: $\min \; \frac{1}{2}\|w\|^2$ s.t. $y^{(i)}(w^Tx^{(i)}+b)\geq 1,\; i = 1,\dots,m$

 Lagrange duality
 dual representation is found by solving $\underset{\alpha}{\text{argmax}} \sum_j \alpha_j - 1/2 \sum_{j,k} \alpha_j \alpha_k y_j y_k (x_j \cdot x_k)$ subject to $\alpha_j \geq 0$ and $\sum_j \alpha_j y_j = 0$
 convex
 data only enter in form of dot products, even when predicting $h(x) = sgn(\sum_j \alpha_j y_j (x \cdot x_j) - b)$
 weights $\alpha_j$ are zero except for support vectors
 replace dot product $x_j \cdot x_k$ with kernel function $K(x_j, x_k)$
 faster than just transforming x
 allows to find optimal linear separators efficiently
 soft margin classifier  lets examples fall on wrong side of decision boundary
 assigns them penalty proportional to distance required to move them back to correct side
 want to maximize margin $M = \frac{2}{\sqrt{w^T w}}$

 we get this from $M= \|x^+ - x^-\| = \|\lambda w\| = \lambda \sqrt{w^Tw}$
 separable case: argmin($w^Tw$) subject to
 $w^Tx+b\geq 1$ for all x in +1 class
 $w^Tx+b\leq -1$ for all x in -1 class

 solve with quadratic programming
 nonseparable case: argmin($w^T w/2 + C \sum_i^n \epsilon_i$) subject to
 $w^Tx_i +b \geq 1-\epsilon_i $ for all x in +1 class
 $w^Tx_i +b \leq -1+\epsilon_i $ for all x in -1 class
 $\forall i, \epsilon_i \geq 0$
 large C can lead to overfitting
 benefits
 number of parameters remains the same (and most are set to 0)
 we only care about support vectors
 maximizing margin is like regularization: reduces overfitting
 these can be solved with quadratic programming QP
 solve a dual formulation (Lagrangian) instead of QPs directly so we can use kernel trick
 primal: $min_w max_\alpha L(w,\alpha)$
 dual: $max_\alpha min_w L(w,\alpha)$
 KKT condition for strong duality
 complementary slackness: $\lambda_i f_i(x) = 0, i=1,…,m$
 VC (Vapnik-Chervonenkis) dimension  if data is mapped into sufficiently high dimension, then samples will be linearly separable (N points, N-1 dims)
 kernel functions  new ways to compute dot product (similarity function)
 original testing function: $\hat{y}=sign(\Sigma_{i\in train} \alpha_i y_i x_i^Tx_{test}+b)$
 with kernel function: $\hat{y}=sign(\Sigma_{i\in train} \alpha_i y_i K(x_i,x_{test})+b)$
 linear $K(x,z) = x^Tz$
 polynomial $K (x, z) = (1+x^Tz)^d$

 radial basis kernel $K (x, z) = \exp(-r\|x-z\|^2)$
 computing these is O($m^2$), but dot-product is just O(m)
 function that corresponds to an inner product in some expanded feature space
 practical guide
 represent an m-category feature with m binary numbers (one-hot)
 scale before applying
 fill in missing values
 start with RBF
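A sketch of the RBF kernel as a pairwise similarity function; a valid kernel matrix should be symmetric, have ones on the diagonal, and no negative eigenvalues. The gamma value and points are arbitrary:

```python
import numpy as np

def rbf_kernel(X, Z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2), computed for all pairs."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)  # squared distances
    return np.exp(-gamma * d2)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X)
```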
Logistic Regression

 $p = P(Y=1|X)=\frac{\exp(\theta^T x)}{1+\exp(\theta ^Tx)}$
 logit (log-odds) of $p$: $\ln\left[ \frac{p}{1-p} \right] = \theta^T x$
 predict using Bernoulli distribution with this parameter p
 can be extended to multiple classes  multinomial distribution
Generative
Naive Bayes Classifier
 let $C_1,…,C_L$ be the classes of Y

 want Posterior $P(C|X) = \frac{P(X|C)P(C)}{P(X)}$
 MAP rule  maximum a posteriori rule
 use Prior P(C)
 using x, predict $C^*=\text{argmax}_C P(C|X_1,…,X_p)=\text{argmax}_C P(X_1,…,X_p|C) P(C)$
 generally ignore denominator
 naive assumption  assume that all input attributes are conditionally independent given C
 $P(X_1,…,X_p|C) = P(X_1|C)\cdot…\cdot P(X_p|C) = \prod_i P(X_i|C)$
 learning
 1. learn L distributions $P(C_1),P(C_2),…,P(C_L)$
 2. learn $P(X_j=x_{jk}|C_i)$
 for j in 1:p, i in 1:$|C|$, k in 1:$|X_j|$
 for discrete case we store $P(X_j|c_i)$, otherwise we assume a prob. distr. form
 naive: $|C| \cdot (|X_1| + |X_2| + … + |X_p|)$ distributions
 otherwise: $|C| \cdot (|X_1| \cdot |X_2| \cdot … \cdot |X_p|)$
 testing
 $P(X|c)$  look up for each feature $X_i|C$ and try to maximize
 smoothing  used to fill in 0s
 $P(x_i|c_j) = \frac{N(x_i, c_j) +1}{N(c_j)+|X_i|}$
 then, $\sum_i P(x_i|c_j) = 1$
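The add-one smoothing formula above in code; the vocabulary and counts are invented:

```python
from collections import Counter

def smoothed_probs(values, vocab):
    """P(x|c) = (N(x,c) + 1) / (N(c) + |X|): add-one smoothing fills in
    zero counts while keeping the distribution normalized."""
    counts = Counter(values)
    denom = len(values) + len(vocab)
    return {v: (counts[v] + 1) / denom for v in vocab}

vocab = ["red", "green", "blue"]
probs = smoothed_probs(["red", "red", "green"], vocab)  # "blue" never observed
```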
Gaussian classifiers
 distributions
 Normal $P(X_j|C_i) = \frac{1}{\sigma_{ij} \sqrt{2\pi}} \exp\left( -\frac{(X_j-\mu_{ij})^2}{2\sigma_{ij}^2}\right)$  requires storing $|C| \cdot p$ distributions
 Multivariate Normal $\frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$ where $\Sigma$ is covariance matrix

 decision boundary is the set of points satisfying $P(C_i|X) = P(C_j|X)$
 LDA  linear discriminant analysis  assume covariance matrix is the same across classes
 Gaussian distributions are shifted versions of each other
 decision boundary is linear
 QDA  different covariance matrices
 estimate the covariance matrix separately for each class C
 decision boundaries are quadratic
 fits data better but has more parameters to estimate
 Regularized discriminant analysis  shrink the separate covariance matrices towards a common matrix
 $\Sigma_k = \alpha \Sigma_k + (1-\alpha) \Sigma$
 treat each feature attribute and class label as random variables
 we assume distributions for these
 for 1D Gaussian, just set mean and var to sample mean and sample var
Text classification
 bag of words  represent text as a vector of word frequencies X
 remove stopwords, stemming, collapsing multiple  NLTK package in python
 assumes word order isn’t important
 can store n-grams

 multivariate Bernoulli: $P(X|C)=P(w_1=true,w_2=false,…|C)$
 multivariate Binomial: $P(X|C)=P(w_1=n_1,w_2=n_2,…|C)$
 this is inherently naive
 time complexity
 training: O(n · average_doc_length_train + $|C| \cdot |dict|$)
 testing: O($|C|$ · average_doc_length_test)
 implementation
 have symbol for unknown words
 underflow prevention  take logs of all probabilities so we don’t get 0
 c = argmax $\log P(c) + \sum_i \log P(X_i|c)$
InstanceBased (ex. K nearest neighbors)
 also called lazy learners
 makes Voronoi diagrams
 can take majority vote of neighbors or weight them by distance
 distance can be Euclidean, cosine, or other
 should scale attributes so largevalued features don’t dominate
 Mahalanobis distance metric takes into account covariance between features
 in higher dimensions, distances tend to be much farther, worse extrapolation
 sometimes need to use invariant metrics
 ex. rotate digits to find the most similar angle before computing pixel difference
 could just augment data, but can be infeasible
 computationally costly so we can approximate the curve these rotations make in pixel space with the invariant tangent line
 stores this line for each point and then find distance as the distance between these lines
 finding NN with kd (kdimensional) tree
 balanced binary tree over data with arbitrary dimensions
 each level splits in one dimension
 might have to search both branches of tree if close to split
 finding NN with localitysensitive hashing
 approximate
 make multiple hash tables
 each uses random subset of bitstring dimensions to project onto a line
 union candidate points from all hash tables and actually check their distances
 comparisons
 error rate of 1 NN is never more than twice that of Bayes error
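A minimal kNN majority-vote sketch matching the description above; the toy data and k are made up:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)   # distance to every training point
    nearest = np.argsort(d)[:k]               # indices of the k closest
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],   # class 0 cluster
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])  # class 1 cluster
y = np.array([0, 0, 0, 1, 1, 1])
```

Weighting votes by distance or swapping in another metric only changes the `d` line.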
Feature Selection
Filtering
 ranks features or feature subsets independently of the predictor
 univariate methods (consider one variable at a time)
 ex. T-test of y for each variable
 ex. Pearson correlation coefficient  this can only capture linear dependencies
 mutual information  covers all dependencies
 multivariate methods
 feature subset selection
 need a scoring function
 need a strategy to search the space
 sometimes used as preprocessing for other methods
Wrapper
 uses a predictor to assess features of feature subsets
 learner is considered a black box  use train, validation, and test sets
 forward selection  start with nothing and keep adding
 backward elimination  start with all and keep removing
 others: Beam search  keep k best paths at each step, GSFS, PTA(l,r), floating search  SFS then SBS
Embedding
 uses a predictor to build a model with a subset of features that are internally selected
 ex. lasso, ridge regression
Unsupervised Learning
 labels are not given
 intracluster distances are minimized, intercluster distances are maximized
 Distance measures
 symmetric D(A,B)=D(B,A)
 self-similarity D(A,A)=0
 positivity separation D(A,B)=0 iff A=B
 triangular inequality D(A,B) <= D(A,C)+D(B,C)
 ex. Minkowski Metrics $d(x,y)=\sqrt[r]{\sum |x_i-y_i|^r}$
 r=1 Manhattan distance
 r=1 when y is binary → Hamming distance
 r=2 Euclidean
 r=$\infty$ “sup” distance
 correlation coefficient  unit independent
 edit distance
Hierarchical
 Two approaches:
 Bottomup agglomerative clustering  starts with each object in separate cluster then joins
 Topdown divisive  starts with 1 cluster then separates
 ex. starting with each item in its own cluster, find best pair to merge into a new cluster
 repeatedly do this to make a tree (dendrogram)
 distances between clusters
 single-link = nearest neighbor = their closest members (long, skinny clusters)
 complete-link = furthest neighbor = their furthest members (tight clusters)
 average = average of all cross-cluster pairs  most widely used
 Complexity: $O(n^2p)$ for first iteration and then can only get worse
Partitional
 partition n objects into a set of K clusters (must be specified)
 globally optimal: exhaustively enumerate all partitions
 minimize sum of squared distances from cluster centroid
 Evaluation w/ labels  purity  ratio between dominant class in cluster and size of cluster
Expectation Maximization (EM)
 general procedure that includes Kmeans
 Estep
 calculate how strongly each data point “belongs” to each mode (expectation step)
 Mstep  calculate what each mode’s mean and covariance should be given the various responsibilities (maximization step)
 known to converge
 can be suboptimal
 monotonically improves goodness measure (likelihood never decreases)
 can also partition around medoids
 mixturebased clustering
 KMeans
 start with random centers

 assign everything to nearest center: O($|clusters| \cdot np$)
 recompute centers O(np) and repeat until nothing changes
 partition amounts to Voronoi diagram
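The k-means loop above as a sketch (Lloyd's algorithm). Initial centers are passed in explicitly, here one from each blob, to keep the toy example deterministic; a real run would handle empty clusters and random restarts:

```python
import numpy as np

def kmeans(X, centers, iters=20):
    """Assign each point to its nearest center, recompute centers as means,
    and repeat."""
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)                      # nearest-center step
        centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return labels, centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)),            # blob near (0, 0)
               rng.normal(5, 0.1, (20, 2))])           # blob near (5, 5)
labels, centers = kmeans(X, centers=X[[0, 20]].copy()) # one seed per blob
```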
Gaussian Mixture Model (GMM)
 continue deriving new mean and variance at each step
 “soft” version of Kmeans  update means as weighted sums of data instead of just normal mean
Derivations
normal equation
 $L(\theta) = \frac{1}{2} \sum_{i=1}^n (\hat{y}_i-y_i)^2$
 $L(\theta) = 1/2 (X \theta - y)^T (X \theta - y)$
 $L(\theta) = 1/2 (\theta^T X^T - y^T) (X \theta - y)$
 $L(\theta) = 1/2 (\theta^T X^T X \theta - 2 \theta^T X^T y +y^T y)$
 $0=\frac{\partial L}{\partial \theta} = X^TX\theta - X^T y$
 $\theta = (X^TX)^{-1} X^Ty$
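The closed form above checked numerically on synthetic data (the true coefficients and noise level are invented; `np.linalg.solve` avoids forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept absorbed
theta_true = np.array([1.0, 2.0, -3.0])
y = X @ theta_true + rng.normal(scale=0.01, size=50)

# theta = (X^T X)^{-1} X^T y, solved as a linear system
theta = np.linalg.solve(X.T @ X, X.T @ y)
```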
ridge regression

 $L(\theta) = \sum_{i=1}^n (\hat{y}_i-y_i)^2+ \lambda \|\theta\|_2^2$
 $L(\theta) = (X \theta - y)^T (X \theta - y)+ \lambda \theta^T \theta$
 $L(\theta) = \theta^T X^T X \theta - 2 \theta^T X^T y +y^T y + \lambda \theta^T \theta$
 $0=\frac{\partial L}{\partial \theta} = 2X^TX\theta - 2X^T y+2\lambda \theta$
 $\theta = (X^TX+\lambda I)^{-1} X^T y$
single Bernoulli

 L(p) = P(Train|Bernoulli(p)) = $P(X_1,…,X_n|p)=\prod_i P(X_i|p)=\prod_i p^{X_i} (1-p)^{1-X_i}$
 $=p^x (1-p)^{n-x}$ where x = $\sum x_i$
 $\log(L(p)) = \log(p^x (1-p)^{n-x})=x \log(p) + (n-x) \log(1-p)$
 $0=\frac{dL(p)}{dp} = \frac{x}{p} - \frac{n-x}{1-p} = \frac{x-xp - np+xp}{p(1-p)} \implies x-np=0$
 $\implies \hat{p} = \frac{x}{n}$
multinomial

 $L(\theta)=P(Train|Multinomial(\theta))=P(d_1,…,d_n|\theta_1,…,\theta_p)$ where d is a document of counts x
 $=\prod_i^n P(d_i|\theta_1,…\theta_p)=\prod_i^n factorials \cdot \theta_1^{x_1},…,\theta_p^{x_p}$  ignore factorials because they are always same
 require $\sum \theta_i = 1$
 $\implies \theta_i = \frac{\sum_{j=1}^n x_{ij}}{N}$ where N is total number of words in all docs

Regression
[toc]
problem formulation
 absorb intercept into feature vector of 1
 x = column
 add $x^{(0)} = 1$ as the first element
 matrix formation
 $\hat{y} = f(x) = x^T \theta = \theta^T x = \theta_0 + \theta_1 x^1 + \theta_2 x^2 + …$
 $\pmb{x_1}$ is all the features for one data sample
 $\pmb{x^1}$ is the first feature over all the data samples
 our goal is to pick the optimal theta to minimize least squares
 loss function  minimize SSE
 SSE is a convex function
 single point with 0 derivative
 second derivative always positive
 hessian is psd (positive semidefinite)
optimization
 gradient layout  we always use denominator layout
 denominator layout  gradient has the size of the variable you differentiate with respect to
 numerator layout  you transpose that size
 optimization  find values of variables that minimize objective function while satisfying constraints
 normal equations
 $L(\theta) = \frac{1}{2} \sum_{i=1}^n (f(x_i)y_i)^2$
 = $1/2 (X \theta  y)^T (X \theta y)$
 set derivative equal to 0 and solve
 $\theta = (X^TX)^{1} X^Ty$
 solving the normal equations is computationally expensive  that’s why we use iterative methods like gradient descent (matrix inversion is $O(n^3)$)
 gradient descent = batch gradient descent
 gradient  vector that points to direction of maximum increase
 at every step, subtract gradient multiplied by learning rate: $x_k = x_{k1}  \alpha \nabla_x F(x_{k1})$
 alpha = 0.05 seems to work
 $J(\theta) = 1/2 (\theta ^T X^T X \theta  2 \theta^T X^T y + y^T y)$
 $\nabla_\theta J(\theta) = X^T X \theta  X^T Y$
 = $\sum_i x_i (x_i^T \theta - y_i)$
 this represents residuals * examples
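A batch gradient descent sketch using the gradient above; the data, step size, and iteration count are arbitrary choices (the step size satisfies the usual stability condition for this problem):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5])       # noiseless targets, known coefficients

theta = np.zeros(3)
alpha = 0.01                              # learning rate
for _ in range(2000):
    grad = X.T @ X @ theta - X.T @ y      # gradient of 1/2 ||X theta - y||^2
    theta -= alpha * grad                 # subtract gradient times learning rate

theta_exact = np.linalg.solve(X.T @ X, X.T @ y)   # normal-equation solution
```

With a small enough step size, the iterates converge to the same solution the normal equations give in closed form.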
 stochastic gradient descent
 don’t use all training examples  approximates gradient
 singlesample
 minibatch (usually better in offline case)
 coordinatedescent algorithm
 online algorithm  update theta while training data is changing
 when to stop?
 predetermined number of iterations
 stop when improvement drops below a threshold
 each pass of the whole data = 1 epoch
 benefits
 less prone to getting stuck to shallow local minima
 don’t need huge ram
 faster
 newton’s method for optimization
 secondorder optimization  requires 1st & 2nd derivatives
 $\theta_{k+1} = \theta_k - H_k^{-1} g_k$
 update with inverse of Hessian as the step size  this comes from a second-order Taylor approximation
 finding inverse of Hessian can be hard / expensive
evaluation
 accuracy = number of correct classifications / total number of test cases
 you train by lowering SSE or MSE on training data
 report MSE for test samples
 cross validation  don’t have enough data for a test set
 data is reused
 k-fold  split data into k pieces
 k-1 pieces to fit model, 1 for test
 cycle through all k cases
 average the values we get for testing
 leave one out (LOOCV)
 train on all the data and only test on one
 then cycle through everything
 regularization path of a regression  plot each coeff v. $\lambda$
 tells you which features get pushed to 0 and when
1  simple LR
 ml: task > representation > score function > optimization > models
 all of these things are assumptions
2  LR with nonlinear basis functions
 can have nonlinear basis functions (ex. polynomial regression)
 radial basis function  ex. kernel function (Gaussian RBF)
 $\exp(-(x-r)^2 / (2 \lambda ^2))$
 nonparametric algorithm  don’t get any parameters theta; must keep data
3  locally weighted LR
 recompute model for each target point
 instead of minimizing SSE, we minimize SSE weighted by each observation’s closeness to the sample we want to query
4  linear regression model with regularizations
 when $(X^T X)$ isn’t invertible, can’t use normal equations and gradient descent is likely unstable
 X is nxp, usually n » p and X almost always has rank p
 problems when n < p
 intuitive way to fix this problem is to reduce p by getting rid of features
 a lot of papers assume your data is already zerocentered
 conventionally don’t regularize the intercept term
regularizations
 ridge regression (L2)
 if $(X^T X)$ not invertible, add a small element to the diagonal
 then it becomes invertible
 small lambda → numerical solution is unstable
 proof of why it’s invertible is difficult

 argmin $\sum_i (y_i - \hat{y_i})^2+ \lambda \|\beta\|_2^2$
 equivalent to minimizing $\sum_i (y_i - \hat{y_i})^2$ s.t. $\sum_j \beta_j^2 \leq t$
 solution is $\hat{\beta_\lambda} = (X^TX+\lambda I)^{-1} X^T y$
 When $X^TX=I$, $\beta_{Ridge} = \frac{1}{1+\lambda} \beta_{LeastSquares}$
 lasso regression (L1)

 $\sum_i (y_i - \hat{y_i})^2+\lambda \|\beta\|_1$
 equivalent to minimizing $\sum_i (y_i - \hat{y_i})^2$ s.t. $\sum_j |\beta_j| \leq t$
 “least absolute shrinkage and selection operator”
 acts in a nonlinear manner on the outcome y
 keep the same SSE loss function, but add constraint of L1 norm
 doesn’t have closed form for Beta
 because of the absolute value, gradient doesn’t exist
 can use directional derivatives
 best solver is LARS  least angle regression
 if tuning parameter is chosen well, will set lots of coordinates to 0
 convex functions / convex sets (like circle) are easier to solve
 disadvantages
 if p>n, lasso selects at most n variables
 if pairwise correlations are very high, lasso only selects one variable

 elastic net  hybrid of the other two

$\beta_{Naive ENet} = \underset{\beta}{argmin} \sum_i (y_i - \hat{y_i})^2+\lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$  l1 part generates sparse model
 l2 part encourages grouping effect, stabilizes l1 regularization path
 grouping effect  group of highly correlated features should all be selected
 naive elastic net has too much shrinkage so we scale $\beta_{ENet} = (1+\lambda_2) \beta_{NaiveENet}$
 to solve, fix l2 and solve lasso

 absorb intercept by adding a constant feature of 1

qi notes
 linear discriminant analysis
 datasets
 algorithms
 kendall tau
 gaussian graphical model
 graphical lasso
 structure learning
 representational similarity learning
 latent dirichlet allocation
 latent variable model
linear discriminant analysis
 PCA  find component axes that maximize the variance of our data
 “unsupervised”  ignores class labels
 LDA  maximize the separation between multiple classes
 “supervised”  computes directions (linear discriminants) that represent axes that maximize separation between multiple classes
 used as dimensionality reduction technique
 project a dataset onto a lowerdimensional space with good classseparability
datasets
 resting state fMRI gives a timeseries of things turning on
 we want to model correlations between everything, we use a gaussian graphical model
 brain atlas  serial sections of brain images
 histology  the study of the microscopic structure of tissues
 leaveoneout cross validation and try to classify autism / nonautism
 see how well autism subjects are identified
 how good is the final connectome?
 ABIDE
 normal group ~500
 has brain imaging
 subjects as rows, features as cols
 each feature can be an ROI
 autism group ~500
 has brain imaging
 no molecular measurements
 ABA data
 has both genotype & phenotype level data
 don’t know what kind of phenotype
 mostly about genotype data
 human ROI has ~200
 in clustering case, 30000 recordings, need to cluster into groups
algorithms
 SLFA
 the features (ex. ROI) have clusters
 SLFA tries to find dependencies between the clusters instead of the variables
 group and group dependency
 this works better in genotype case
 SIMULE
 contextsensitive graph
kendall tau
 https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient
 (# concordant pairs - # discordant pairs) / (n*(n-1)/2)
 concordant if both ranks agree
 must do something special if tied
 matlab has pretty fast implementation, R slow
 want to speed up / parallelize this  use multicore, not gpu
 the Kendall correlation is not affected by how far from each other ranks are but only by whether the ranks between observations are equal or not
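the count-based formula above, as a naive O(n^2) sketch (real implementations such as scipy's `kendalltau` handle ties and run in O(n log n)):

```python
def kendall_tau(x, y):
    """(# concordant pairs - # discordant pairs) / (n*(n-1)/2).
    Ties are simply skipped here; tau-b style corrections are omitted."""
    n = len(x)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1      # ranks agree on this pair
            elif s < 0:
                disc += 1      # ranks disagree
    return (conc - disc) / (n * (n - 1) / 2)
```

only the pairwise order matters, not the distance between ranks, matching the note above.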
gaussian graphical model
 likelihood = probability of seeing data given parameters of the model
 clustering algorithm  associate conditional probability with each node (relation between the nodes, given all the other nodes)
 http://www.cis.upenn.edu/~mkearns/papers/barbados/jordantut.pdf
 the weights in a network make local assertions about the relationships between neighboring nodes
 inference algorithms turn these local assertions into global assertions about the relationships between nodes

$P(A|B) = P(A \wedge B) / P(B)$  can be used for learning (given inputs, outputs)
 A Gaussian graphical model is a graph in which all random variables are continuous and jointly Gaussian.
 see defs.png
 precision matrix  inverse of covariance matrix; gives pairwise correlations
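a small numpy check of the claim above: a zero in the precision matrix encodes conditional independence even though the covariance itself is dense (the numbers are made up for illustration):

```python
import numpy as np

# Precision matrix with a structural zero: variables 0 and 2 are
# conditionally independent given variable 1 (in the Gaussian case).
theta = np.array([[1.0, 0.3, 0.0],
                  [0.3, 1.0, 0.3],
                  [0.0, 0.3, 1.0]])
cov = np.linalg.inv(theta)       # the marginal covariance is dense anyway
recovered = np.linalg.inv(cov)   # inverting back exposes the zero again
```

the marginal correlation between 0 and 2 is nonzero (information flows through 1), but the precision entry is exactly zero.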
graphical lasso
 optimize parameter to minimize regression between Y and B*X
 problem is hard because far fewer samples than nodes so can’t invert covariance matrix
 coordinatedescent methods: optimize over one variable at a time
 l1normalization makes it so there have to be a lot of 0s in B
 so does l0, but this is harder to solve
 have to regress all the variables against all the other variables
 graphical lasso lets us do this very efficiently with coordinate descent
structure learning
 structure learning aims to discover the topology of a probabilistic network of variables such that this network represents accurately a given dataset while maintaining low complexity
 accuracy of representation  likelihood that the model explains the observed data
 complexity of a graphical model  number of parameters
representational similarity learning
 aims to discover features that are important in representing (humanjudged) similarities among objects
 can be posed as a sparsityregularized multitask regression problem
 related to representational similarity analysis
latent dirichlet allocation
 generative model  explain observations from unobserved variables
 In LDA, each document may be viewed as a mixture of various topics
 similar to probabilistic latent semantic analysis (pLSA), except that in LDA the topic distribution is assumed to have a Dirichlet prior
latent variable model
 relates manifest variables to latent variables
 responses on the manifest variables are result of latent variables
 manifest variables have local independence  nothing in common after controlling for latent variable
 latent factor models have been proposed to find concise descriptions of data

Search
 Uninformed Search – Russell & Norvig 3rd ed. (R&N) 3.1-3.4
 A* Search and Heuristics – R&N 3.5-3.6
 Local Search – R&N 4.1-4.2
 Constraint satisfaction problems – R&N 6.1-6.5
[toc]
Uninformed Search – Russell & Norvig 3rd ed. (R&N) 3.1-3.4
problemsolving agents
 goal  1st step
 problem formulation  deciding what actions and states to consider given a goal
 uninformed  given no info about problem besides definition
 an agent with several immediate options of unknown value can decide what to do first by examining future actions that lead to states of known value
 5 components
 initial state
 applicable actions at each state
 transition model
 goal states
 path cost function
problems
 toy problems
 vacuum world
 8puzzle (type of slidingblock puzzle)
 8queens problem
 Knuth conjecture
 realworld problems
 routefinding
 TSP (and other touring problems)
 VLSI layout
 robot navigation
 automatic assembly sequencing
searching for solutions
 start at a node and make a search tree
 frontier = open list = set of all leaf nodes available for expansion at any given point
 search strategy determines which state to expand next
 want to avoid redundant paths
 TREESEARCH  continuously expand the frontier
 GRAPHSEARCH  keep track of previously visited states in explored set = closed set and don’t revisit
infrastructure
 node  data structure that contains parent, state, pathcost, action
 metrics
 completeness  does it find a solution
 optimality  does it find the best solution
 time/space complexity

theoretical CS: V + E
 b  branching factor  max number of branches of any node
 d  depth  number of steps from the root
 m  max length of any path in the search space

 search cost  just time/memory
 total cost  search cost + path cost
uninformed search = blind search
 bfs
 uniformcost search  always expand node with lowest path cost g(n)
 frontier is priority queue ordered by g
 dfs
 backtracking search  dfs but only one successor is generated at a time; each partially expanded node remembers which successor to generate next
 only O(m) memory instead of O(bm)
 depthlimited search
 diameter of state space  longest possible distance to goal from any start
 iterative deepening dfs  like bfs explores entire depth before moving on
 iterative lengthening search  instead of depth limit has pathcost limit
 bidirectional search  search from start and goal and see if frontiers intersect
 just because they intersect doesn’t mean it was the shortest path
 can be difficult to search backward from goal (ex. Nqueens)
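the frontier / explored-set machinery above can be sketched as uniform-cost search on a toy graph (the dict-of-edge-lists encoding is a convention invented for this demo):

```python
import heapq

def uniform_cost_search(graph, start, goal):
    """Always expand the frontier node with lowest path cost g(n).
    graph: dict mapping node -> list of (neighbor, step_cost)."""
    frontier = [(0, start, [start])]          # priority queue ordered by g
    explored = set()                          # GRAPH-SEARCH: closed set
    while frontier:
        g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if node in explored:
            continue                          # skip redundant paths
        explored.add(node)
        for nbr, cost in graph.get(node, []):
            if nbr not in explored:
                heapq.heappush(frontier, (g + cost, nbr, path + [nbr]))
    return None
```

with unit step costs this degenerates to bfs; dfs would use a stack instead of the priority queue.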
A* Search and Heuristics – R&N 3.5-3.6
informed search
 informed search  uses problemspecific knowledge
 has evaluation function f which likely incorporates g and h
 heuristic h = estimated cost of cheapest path from state at node n to a goal state
 bestfirst  choose nodes with best f
 greedy bestfirst search  keep expanding node closest to goal
 A* search
 $f(n) = g(n) + h(n)$ represents the estimated cost of the cheapest solution through n
 A* (with tree search) is optimal and complete if h(n) is admissible
 h(n) never overestimates the cost to reach the goal
 A* (with graph search) is optimal and complete if h(n) is consistent (stronger than admissible)
 $h(n) \leq cost(n \to n’) + h(n’)$
 can draw contours of f (because nondecreasing)
 A* is also optimally efficient (guaranteed to expand fewest nodes) for any given consistent heuristic because any algorithm that expands fewer nodes runs the risk of missing the optimal solution
 for a heuristic, absolute error $\delta := h^* - h$ and relative error $\epsilon := \delta / h^*$
 here $h^*$ is actual cost of root to goal
 bad when lots of solutions with small absolute error because it must try them all
 bad because it must store all nodes in memory
 memorybounded heuristic search
 iterative-deepening A*  iterative deepening with cutoff f-cost
 recursive bestfirst search  like standard bestfirst search but with linear space
 each node keeps f_limit variable which is best alternative path available from any ancestor
 as it unwinds, each node is replaced with backedup value  best fvalue of its children
 decides whether it’s worth reexpanding subtree later
 often flips between different good paths (h is usually less optimistic for nodes close to the goal)
 SMA*  simplified memory-bounded A*  best-first until memory is full, then forget worst leaf node and add new leaf
 store forgotten leaf node info in its parent
 on hard problems, too much time switching between nodes
 agents can also learn to search with metalevel learning
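a minimal A* sketch on a 4-connected grid, using Manhattan distance as the admissible heuristic (the grid encoding, 0 = free / 1 = wall, is an assumption for the demo):

```python
import heapq

def astar(grid, start, goal):
    """A* with f = g + h; Manhattan distance never overestimates on a
    4-connected grid with unit step costs, so it is admissible."""
    def h(p):
        return abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    rows, cols = len(grid), len(grid[0])
    frontier = [(h(start), 0, start)]         # ordered by f = g + h
    best_g = {start: 0}
    while frontier:
        f, g, (r, c) = heapq.heappop(frontier)
        if (r, c) == goal:
            return g
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get((nr, nc), float('inf')):
                    best_g[(nr, nc)] = ng     # cheaper path found
                    heapq.heappush(frontier, (ng + h((nr, nc)), ng, (nr, nc)))
    return None
```

with h = 0 this reduces to uniform-cost search; a stronger (dominating) heuristic expands fewer nodes.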
heuristic functions
 effective branching factor $b^*$  if total nodes generated by A* is N and solution depth is d, then $b^*$ is the branching factor a uniform tree of depth d would need in order to contain N+1 nodes:
 want $b^*$ close to 1
 generally want bigger heuristic because everything with f(n) < C* will be expanded; the fewer nodes with f(n) < C*, the better
 h1 dominates h2 if $h_1(n) \geq h_2(n) \; \forall n$
 relaxed problem  removes constraints and adds edges to the graph
 solution to original problem still solves relaxed problem
 cost of optimal solution to a relaxed problem is an admissible heuristic for the original problem
 also is consistent
 when there are several good heuristics, pick h(n) = max(h1(n), …, hm(n)) for each node
 pattern database  heuristic stores exact solution cost for every possible subproblem instance
 disjoint pattern database  break into independent possible subproblems
 can learn heuristic by solving lots of problems using useful features
 aren’t necessarily admissible / consistent
Local Search – R&N 4.1-4.2
 local search looks for solution not path
 maintains only current node and its neighbors
 more like optimization
 complete  finds a goal
 optimal  finds global min/max
 hillclimbing = greedy local search
 also stochastic hill climbing and randomrestart hill climbing
 simulated annealing  pick random move
 if move better, then accept
 otherwise accept with some probability proportional to how bad it is and accept less as time goes on
 local beam search  pick k starts, then choose the best k states from their neighbors
 stochastic beam search  pick best k with prob proportional to how good they are
 genetic algorithms  population of k individuals
 each scored by fitness function
 pairs are selected for reproduction using crossover point
 each location subject to random mutation
 schema  substring in which some of the positions can be left unspecified (ex. $246**$)
 want schema to be good representation because chunks tend to be passed on together
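the accept-worse-moves-with-decaying-probability rule above, as a sketch (the cooling schedule and fixed seed are choices made for this demo):

```python
import math
import random

def simulated_annealing(f, x0, neighbors, T0=1.0, cooling=0.95, steps=500):
    """Minimize f: pick a random neighbor; accept if better, else accept
    with probability exp(-delta / T); T shrinks so bad moves die out."""
    rng = random.Random(0)                  # fixed seed: deterministic demo
    x = best = x0
    T = T0
    for _ in range(steps):
        x_new = rng.choice(neighbors(x))
        delta = f(x_new) - f(x)
        if delta < 0 or rng.random() < math.exp(-delta / T):
            x = x_new                       # downhill, or accepted uphill move
        if f(x) < f(best):
            best = x                        # remember best state seen
        T *= cooling
    return best
```

as T approaches 0 the acceptance rule becomes pure hill climbing, which is the point of the schedule.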
continuous space
 hillclimbing / simulated annealing still work
 could just discretize neighborhood of each state
 use gradient
 if possible, solve $\nabla f = 0$
 otherwise gradient ascent $x = x + \alpha \nabla f(x)$
 can estimate gradient by evaluating response to small increments
 line search  repeatedly double $\alpha$ until f starts to increase again
 NewtonRaphson method
 uses 2nd deriv: $x = x - g(x) / g’(x)$
 $x = x  H_f^{1} (x) \nabla f(x)$ where H is the Hessian of 2nd derivs
 constrained optimization
 contains linear programming problems in which constraints must be linear inequalities forming a convex set
 these have no local minima
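the Newton-Raphson update above, applied to the classic root-finding example g(x) = x^2 - a (so the fixed point is sqrt(a)); the helper name is made up for the sketch:

```python
def newton_sqrt(a, x=1.0, iters=25):
    """Newton-Raphson x <- x - g(x)/g'(x) with g(x) = x^2 - a.
    Converges quadratically from a reasonable starting point."""
    for _ in range(iters):
        x = x - (x * x - a) / (2 * x)
    return x
```

for optimization, the same update is applied to g = f', which is where the second derivative in the note comes from.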
Constraint satisfaction problems – R&N 6.1-6.5
 CSP
 set of variables $X_1, …, X_n$
 set of domains $D_1, …, D_n$
 set of constraints $C$ specifying allowable values
 each state is an assignment of variables
 consistent  doesn’t violate constraints
 complete  every variable is assigned
 constraint graph  nodes are variables and links connect any 2 variables that participate in a constraint
 unary constraint  restricts value of single variable
 binary constraint
 global constraint  arbitrary number of variables (doesn’t have to be all)
 converting graphs to only binary constraints
 every finitedomain constraint can be reduced to set of binary constraints w/ enough auxiliary variables
 another way to convert an nary CSP to a binary one is the dual graph transformation  create a new graph in which there is one variable for each constraint in the original graph and one binary constraint for each pair of original constraints that share variables
 also can have preference constraints instead of absolute constraints
inference
 node consistency  prune domains violating unary constraints
 arc consistency  satisfy binary constraints
 uses AC3 algorithm
 set of all arcs = binary constraints
 pick one and apply it
 if things changed, readd all the neighboring arcs to the set

$O(cd^3)$  domain = d, # arcs = c
 variable can be generalized arc consistent
 path consistency  consider constraints on triplets  PC2 algorithm
 extends to kconsistency (although path consistency assumes binary constraint networks)
 strongly kconsistent  also (k1) consistent, (k2) consistent, … 1consistent
 implies $O(k^2d)$
 establishing kconsistency time/space is exponential in k
 global constraints can have more efficient algorithms
 ex. assign different colors to everything
 resource constraint = atmost constraint  sum of variable must not exceed some limit
 bounds propagation  make sure variables can be allotted to solve resource constraint
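the AC-3 loop described above, as a small sketch (the domain/constraint encoding — sets of values and directed predicates per arc — is a convention chosen for this demo):

```python
from collections import deque

def ac3(domains, constraints):
    """AC-3: revise arcs until every binary constraint is arc-consistent.
    domains: var -> set of values; constraints: (x, y) -> predicate(vx, vy)."""
    queue = deque(constraints.keys())        # set of all arcs
    while queue:
        x, y = queue.popleft()
        revised = False
        for vx in set(domains[x]):
            # vx survives only if some vy in y's domain supports it
            if not any(constraints[(x, y)](vx, vy) for vy in domains[y]):
                domains[x].discard(vx)
                revised = True
        if revised:
            if not domains[x]:
                return False                 # domain wiped out: inconsistent
            for (a, b) in constraints:       # re-add neighboring arcs into x
                if b == x and a != y:
                    queue.append((a, b))
    return True
```

this matches the O(cd^3) bound noted above: each of the c arcs can be re-queued at most d times, and each revision costs O(d^2).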
backtracking
 CSPs are commutative  order of choosing states doesn’t matter
 backtracking search  depthfirst search that chooses values for one variable at a time and backtracks when no legal values left
 variable and value ordering
 minimumremainingvalues heuristic  assign variable with fewest choices
 degree heuristic  pick variable involved in largest number of constraints on other unassigned variables
 leastconstrainingvalue heuristic  prefers value that rules out fewest choices for neighboring variables
 interleaving search and inference
 forward checking  when we assign a variable in search, check arcconsistency on its neighbors
 maintaining arc consistency (MAC)  when we assign a variable, call AC3, initializing with arcs to neighbors
 intelligent backtracking  looking backward
 keep track of conflict set for each node (list of variable assignments that deleted things from its domain)
 backjumping  backtracks to most recent assignment in conflict set
 too simple  forward checking makes this redundant
 conflictdirected backjumping
 let $X_j$ be current variable and $conf(X_j)$ be conflict set. If every possible value for $X_j$ fails, backjump to the most recent variable $X_i$ in $conf(X_j)$ and set $conf(X_i) = conf(X_i) \cup conf(X_j) \setminus \{X_i\}$
 constraint learning  finding minimum set of variables/values from conflict set that causes the problem = nogood
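backtracking search with the minimum-remaining-values heuristic from above, sketched for graph coloring (the `variables`/`domains`/`neighbors` encoding is an assumption for the demo):

```python
def backtrack(assignment, variables, domains, neighbors):
    """Depth-first assignment, one variable at a time; MRV picks the
    unassigned variable with the fewest remaining legal values."""
    if len(assignment) == len(variables):
        return assignment                    # complete and consistent
    unassigned = [v for v in variables if v not in assignment]
    def legal(v):
        return [d for d in domains[v]
                if all(assignment.get(n) != d for n in neighbors[v])]
    var = min(unassigned, key=lambda v: len(legal(v)))   # MRV heuristic
    for value in legal(var):
        assignment[var] = value
        result = backtrack(assignment, variables, domains, neighbors)
        if result is not None:
            return result
        del assignment[var]                  # no legal values left: backtrack
    return None
```

the degree and least-constraining-value heuristics from the notes would slot in as tie-breakers for `var` and as a sort key over `legal(var)`.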
local search for csps
 start with some assignment to variables
 minconflicts heuristic  change variable to minimize conflicts
 can escape plateaus with tabu search  keep small list of visited states
 could use constraint weighting
structure of problems
 connected components of constraint graph are independent subproblems
 tree  any 2 variables are connected by only one path
 directed arc consistency  ordered variables $X_i$, every $X_i$ is consistent with each $X_j$ for j>i
 tree with n nodes can be made directed arcconsisten in O(n) steps  $O(nd^2)$
 two ways to reduce constraint graphs to trees
 assign variables so remaining variables form a tree
 assigned variables called cycle cutset with size c
 $O(d^c \cdot (n-c) d^2)$
 finding smallest cutset is hard, but can use approximation called cutset conditioning
 tree decomposition  view each subproblem as a megavariable
 tree width w  size of largest subproblem - 1
 solvable in $O(nd^{w+1})$
 also can look at structure in variable values
 ex. value symmetry  can assign different colorings
 use symmetrybreaking constraint  assign colors in alphabetical order

structure learning
 1  introduction
 2  binary classification
 3  multiclass classification
 4  neural networks
 5  structure
 6  sequential models
 7  graphical models
 8  constrained conditional models
 9  inference
 10/11  learning protocols
[toc]
1  introduction
 structured prediction  have multiple interdependent output variables
 output assignments are evaluated jointly
 requires joint (global) inference
 can’t use classifier because output space is combinatorially large
 three steps
 model  pick a model
 learning = training
 inference = testing
 representation learning  picking features
 usually use domain knowledge
 combinatorial  ex. map words to higher dimensions
 hierarchical  ex. first layers of CNN
 usually use domain knowledge
2  binary classification
 learn  learn w
 $w$ should point to positive examples
 inference  predict
 $\hat{y}=sign(w^T x)$
 losses
 usually don’t minimize 01 loss (combinatorial)
 usually $w^Tx$ includes b term, but generally we don’t want to regularize b
 perceptron  tries to find separating hyperplane
 whenever misclassified, update w
 can add in delta term to maximize margin
 $\hat{w} = argmin_w \sum_i max(0, -y_i \cdot w^T x_i)$
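the misclassify-then-update rule above in numpy; the tiny separable dataset in the test is made up for the demo:

```python
import numpy as np

def perceptron(X, y, epochs=10):
    """Whenever an example is misclassified (y * w^T x <= 0), nudge the
    weight vector toward it: w += y * x."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:          # wrong side (or on) the hyperplane
                w = w + yi * xi
    return w
```

by the convergence theorem mentioned later, this terminates with a separating w whenever the data is linearly separable.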
 linear svm
 $\hat{w} = argmin_w [w^Tw + C \sum_i max(0,1-y_i \cdot w^T x_i)]$
 minimize norm of weights s.t. the closest points to the hyperplane have a score 1
 stochastic subgradient descent
 learning different ws differently
 logistic regression = loglinear model
 $\hat{w} = argmin_w [w^Tw + C \sum_i log(1+exp(-y_i \cdot w^T x_i))]$
3  multiclass classification
 reducing multiclass (K categories) to binary
 oneagainstall
 train K binary classifiers
 class i = positive otherwise negative
 take max of predictions
 onevsone = allvsall
 train C(K,2) binary classifiers
 labels are class i and class j
 inference  any class can get up to K-1 votes, must decide how to break ties
 flaws  learning only optimizes local correctness
 single classifier
 multiclass perceptron (Kesler)
 if label=i, want $w_i ^Tx > w_j^T x \quad \forall j$
 if not, update $w_i$ and $w_j$ accordingly
 Kesler construction
 $w = [w_1 … w_k] $
 want $w_i^T x > w_j^T x \quad \forall j$
 rewrite $w^T \phi (x,i) > w^T \phi (x,j) \quad \forall j$
 here $\phi (x,i)$ puts x in the ith spot and zeros elsewhere
 $\phi$ is often used for feature representation
 define margin: $\Delta (y,y’) = \begin{cases} \delta& if y \neq y’ \\ 0& if y=y’ \end{cases}$
 check if $y=argmax_{y’}w^T \phi(x,y’) + \Delta (y,y’)$
 multiclass SVMs (Crammer&Singer)
 minimize total norm of weights s.t. true label is score at least 1 more than second best label
 multinomial logistic regression = multiclass loglinear model

$P(y|x,w)=\frac{exp(w^T_y x)}{\sum_{y’ \in \{1,…,K\}} exp(w_{y’}^T x)}$  we control the peakedness of this by dividing by stddev
 softmax: sometimes substitute this for $w^T_y x$

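the normalized-exponential formula above in numpy; subtracting the max before exponentiating is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(scores):
    """Turn class scores w_y^T x into a probability distribution."""
    z = np.exp(scores - np.max(scores))   # shift for numerical stability
    return z / z.sum()
```

scaling the scores up before calling this makes the distribution more peaked, which is the "peakedness" control mentioned above.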
4  neural networks
 if neurons within layer are connected, not feedforward
 remember to include bias unit at every layer
 typically, first layer has (# input features)
 last layer has (# classes)
 must convert labels to 1ofK representation
 perceptron convergence thm  if data is linearly separable, perceptron learning algorithm will converge
 backprop yields a local optimum but works well in practice
5  structure
 structured output can be represented as a graph
 outputs y
 inputs x
 two types of info are useful
 relationships between x and y
 relationships between y and y
 complexities
 modeling  how to model?
 train  can’t train separate weight vector for each inference outcome
 inference  can’t enumerate all possible structures
 need to score nodes and edges
 could score nodes and edges independently
 could score each node and its edges together
6  sequential models
sequence models
 goal: learn distribution $P(x_1,…,x_n)$ for sequences $x_1,…,x_n$
 ex. text generation
 discrete Markov model

$P(x_1,…,x_n) = \prod_i P(x_i|x_{i-1})$  requires
 initial probabilites
 transition matrix

 mth order Markov model  keeps history of previous m states
 each state is an observation
hidden Markov model  this is generative
 goal
 learn distribution $P(x_1,…,x_n,y_1,…,y_n)$
 ex. POS tagging
 learn distribution $P(x_1,…,x_n,y_1,…,y_n)$
 model

define $P(x_1,…,x_n,y_1,…,y_n) = P(y_1) P(x_1|y_1) \prod_{i} P(y_i|y_{i-1}) P(x_i|y_i)$  each output label is dependent on its neighbors in addition to the input

 definitions
 $\mathbf{y}$  state
 states are not observed
 $\mathbf{x}$  observation
 $\pi$  initial state probabilities

A = transition probabilities $P(y_2|y_1)$ 
B = emission probabilities $P(x_1|y_1)$  each state stochastically emits an observation
 inference
 given $(\pi,A,B)$ and $\mathbf{x}$
 calculate probability of $\mathbf{x}$
 calculate most probable $\mathbf{y}$

define $P(x_1,…,x_n,y_1,…,y_n) = P(y_1) P(x_1|y_1) \prod_{i} P(y_i|y_{i-1}) P(x_i|y_i)$ 
use MAP: $\hat{y}=\underset{y}{argmax} \; P(y|x,\pi, A,B)=\underset{y}{argmax} \; P(y \land x|\pi, A,B)$

 use Viterbi algorithm
 initial for each state s

$score_1(s) = P(s) P(x_1|s) = \pi_s B_{x_1,s}$

 recurrence  for i = 2 to n, calculate scores using previous score only

$score_i(s) = \underset{y_{i-1}}{max} \; P(s|y_{i-1}) P(x_i|s) \cdot score_{i-1}(y_{i-1})$

 final state

$\hat{y}=\underset{y}{argmax} \; P(y,x|\pi, A,B) = \underset{s}{max} \; score_n (s)$

 complexity
 K = number of states
 M = number of observations
 n = length of sequence
 memory  nK
 runtime  $O(nK^2)$
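the recurrence above as a numpy sketch with the stated O(nK^2) time / O(nK) memory; the B[state, observation] indexing convention is mine, not fixed by the notes:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most probable HMM state sequence. pi[s]: initial prob,
    A[s, s']: transition s -> s', B[s, o]: emission prob of obs o in state s."""
    n, K = len(obs), len(pi)
    score = np.zeros((n, K))
    back = np.zeros((n, K), dtype=int)
    score[0] = pi * B[:, obs[0]]                 # initial: pi_s * emission
    for i in range(1, n):
        for s in range(K):
            prev = score[i - 1] * A[:, s]        # best way to arrive at s
            back[i, s] = int(np.argmax(prev))
            score[i, s] = prev[back[i, s]] * B[s, obs[i]]
    states = [int(np.argmax(score[-1]))]         # best final state
    for i in range(n - 1, 0, -1):                # follow back-pointers
        states.append(int(back[i, states[-1]]))
    return states[::-1]
```

real implementations work in log space (sums instead of products) to avoid underflow on long sequences.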
 learning  learn $(\pi,A,B)$
 supervised (given y)
 basically just count (maximizing joint likelihood of input and output)
 $\pi_s = \frac{count(start \to s)}{n}$
 $A_{s’,s} = \frac{count(s \to s’)}{count(s)}$
 $B_{s,x} = \frac{count (s \to x)}{count(s)}$
 unsupervised (not given y)
conditional models and local classifiers  discriminative model
 conditional models = discriminative models

goal: model $P(Y|X)$  learns the decision boundary only
 ignores how data is generated (unlike generative models)

 ex. loglinear models

$P(\mathbf{y}|\mathbf{x},w) = \frac{exp(w^T \phi (x,y))}{\sum_{y’} exp(w^T \phi (x,y’))}$ 
training: $w = \underset{w}{argmax} \sum_i log \; P(y_i|x_i,w)$

 ex. nextstate model

$P(\mathbf{y}|\mathbf{x})=\prod_i P(y_i|y_{i-1},x_i)$

 ex. maximum entropy markov model

$P(y_i|y_{i-1},x) \propto exp( w^T \phi(x,i,y_i,y_{i-1}))$  adds more things into the feature representation than HMM via $\phi$
 has label bias problem
 if state has fewer next states they get high probability

effectively ignores x if $P(y_i|y_{i-1})$ is too high


 ex. conditional random fields=CRF
 a global, undirected graphical model
 divide into factors

$P(Y|x) = \frac{1}{Z} \prod_i exp(w^T \phi (x,y_i,y_{i-1}))$  $Z = \sum_{\hat{y}} \prod_i exp(w^T \phi (x,\hat{y_i},\hat{y}_{i-1}))$
 $\phi (x,y) = \sum_i \phi (x,y_i,y_{i-1})$
 prediction via Viterbi (with sum instead of product)
 training

maximize log-likelihood $\underset{w}{max} \; -\frac{\lambda}{2} w^T w + \sum_i log \; P(y_i|x_i,w)$  requires inference

 linearchain CRF  only looks at current and previous labels
 ex. structured perceptron
 HMM is a linear classifier
7  graphical models
 graphical models represent prob. distributions over multiple random variables
1  Bayesian networks = causal networks (directed graphs)
 must be acyclic
 local independence  each node is independent of its nondescendants given its parents

$P(z_1,…,z_n)=\prod P(z_i|Parents(z_i))$

 topological independence  a node is independent of all other nodes given its parents, children, and children’s parents = markov blanket
 compact representation of the joint prob. distr.
 global independencies  Dseparation
 sometimes Bayes nets cannot represent the independence relations we want conveniently
 arrow direction unclear
 independencies structures can be strange
2  Markov networks = Markov random fields (undirected graphs)
 each node is independent of all other nodes given its immediate neighbors
 the nodes in a complete subgraph form a clique
 complete  all connections present
 $P_\theta(X) = \frac{1}{Z(\theta)} \prod_{c \in Cliques} f(x_c,\theta)$
 the joint probability decouples over cliques
 every clique $x_c$ is associated with a potential function $f(x_c,\theta)$
 partition function $Z(\theta) = \sum_x \prod_{c \in Cliques} f(x_c,\theta)$
 local independence  a node is independent of all other nodes given its neighbors
 global interdependence  if X,Y,Z are sets of nodes, X is conditionally independent of Y given Z if removing all nodes of Z removes all paths from X to Y
 factor graph  makes the factorization explicit
 replaces cliques with factors
 if x is dependent on all its neighbors
 Ising model  if x is binary
 Potts model  x is multiclass
learning
 train via maximum likelihood
inference
 compute probability of subset of states
 exact inference
 variable elimination
 belief propagation
 approximate inference
 MCMC
 variational algorithms
 loopy belief propagation
8  constrained conditional models
consistency of outputs and the value of inference
 ex. POS tagging  sentence shouldn’t have more than 1 verb
 inference
 a global decision comprising of multiple local decisions and their interdependencies
 local classifiers
 constraints
 a global decision comprising of multiple local decisions and their interdependencies
 learning
 global  learn with inference (computationally difficult)
constrained conditional models via an example
hard constraints and integer programs
soft constraints
9  inference
 inference constructs the output given the model
 goal: find highest scoring state sequence
 $argmax_y : score(y) = argmax_y w^T \phi(x,y)$
 naive: score all and pick max  terribly slow
 viterbi  decompose scores over edges
 questions
 exact v. approximate inference
 exact  search, DP, ILP
 approximate = heuristic  Gibbs sampling, belief propagation, beam search, linear programming relaxations
 randomized v. deterministic
 if run twice, do you get same answer
 exact v. approximate inference
 ILP  integer linear programs
 combinatorial problems can be written as integer linear programs
 many commercial solvers and specialized solvers
 NPhard in general
 special case of linear programming  minimizing/maximizing a linear objective function subject to a finite number of linear constraints (equality or inequality)
 in general, $x^* = \underset{x}{argmax} \; c^Tx $ subject to $Ax \leq b$
 maybe more constraints like $x \geq 0$
 the constraint matrix defines a polytope
 only the vertices or faces of the polytope can be solutions
 $\implies$ can be solved in polynomial time
 in ILP, each $x_i$ is an integer
 LPrelaxation  drop the integer constraints and hope for the best
 0-1 ILP  $\mathbf{x} \in \{0,1\}^n$
 decision variables for each label $z_A = 1$ if output=A, 0 otherwise
 don’t solve multiclass classification with an ILP solver (makes it harder)
 belief propagation
 variable elimination
 fix an ordering of the variables
 iteratively, find the best value given previous neighbors
 use DP  ex. Viterbi is maxproduct variable elimination
 when there are loops, require approximate solution
 uses message passing to determine marginal probabilities of each variable
 message $m_{ij}(x_j)$ high means node i believes $P(x_j)$ is high
 use beam search  keep sizelimited priority queue of states
10/11  learning protocols
structural svm
 $\underset{w}{min} \; \frac{1}{2} w^T w + C \sum_i \underset{y}{max} (w^T \phi (x_i,y)+ \Delta(y,y_i) - w^T \phi(x_i,y_i) )$
empirical risk minimization
 subgradients
 ex. $f(x) = max ( f_1(x), f_2(x))$, solve the max then compute gradient of whichever function is argmax
sgd for structural svm
 highest scoring assignment to some of the output random variables for a given input?
 lossaugmented inference  which structure most violates the margin for a given scoring function?
 adagrad  frequently updated features should get smaller learning rates

Algorithms
 asymptotics
 recursion
 dynamic programming
 mincut / maxcut
 hungarian
 maxflow
 sorting
 searching
 computational geometry
[toc]
asymptotics
 BigO
 bigoh: O(g): functions that grow no faster than g  upper bound, runs in time less than g
 f(n)≤c*g(n) for some c, large n
 bigtheta: Θ(g): functions that grow at the same rate as g
 O(g) and Ω(g)  asymptotic tight bound
 bigomega: Ω(g): functions that grow at least as fast as g
 f(n)≥c*g(n) for some c, large n
 Example: f = 57n+3
 O(n^2)  or anything bigger
 Θ(n)
 Ω(n^.5)  or anything smaller
 input must be positive
 We always analyze the worst case runtime
 littleomega: ω(g)  functions that grow faster than g
 littleo: o(g)  functions that grow slower than g
 we write f(n) ∈ O(g(n)), not f(n) = O(g(n))
 They are all reflexive and transitive, but only Θ is symmetric.
 Θ defines an equivalence relation.
 The difference between log10n and log2n is always a constant (about 3.322)
 existence then efficiency
 Upper bound: O(g(n))  set of functions s.t. there exists c,k>0, 0 ≤ f(n) ≤ c*g(n), for all n > k
 o(g(n))  O(g(n)) and not Ω(g(n))
 Tight bound: Θ(g(n))  set of functions s.t. O(g(n)) and Ω(g(n))
 Lower bound: Ω(g(n))  set of functions s.t. there exists c,k>0, 0 ≤ c*g(n) ≤ f(n), for all n > k
 ω(g(n))  Ω(g(n)) but not O(g(n))
 add 2 functions, growth rate will be $O(max(g_1(n), g_2(n)))$ (same for sequential code)
 recurrence (master) thm: $f(n) = O(n^{c-\epsilon}) \Rightarrow T(n) = \Theta(n^c)$
 $T(n) = a*T(n/b) + f(n)$
 $c = log_b(a)$
 Stirling’s formula: $n! \approx (\frac{n}{e})^n$
 corollary: log(n!) = Θ(n log n)
 gives us a bound on sorting
 over bounded number of elements, almost everything is constant time
recursion
 moving down/right on an NxN grid  each path has length (N-1)+(N-1)
 we must move right N-1 times
 ans = (N-1+N-1 choose N-1)
 for recursion, if a list is declared outside static recursive method, it shouldn’t be static
 generate permutations  recursive, add char at each spot
 think hard about the base case before starting
 look for lengths that you know
 look for symmetry
 nqueens  one array of length n, go row by row
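the one-array, row-by-row scheme above as a sketch (recursive backtracking; the helper names are made up):

```python
def n_queens(n):
    """queens[r] = column of the queen in row r; place row by row and
    backtrack when a column or diagonal conflicts."""
    queens = [-1] * n
    def safe(r, c):
        # same column, or same diagonal (column offset == row offset)
        return all(queens[i] != c and abs(queens[i] - c) != r - i
                   for i in range(r))
    def place(r):
        if r == n:
            return True                      # all rows filled
        for c in range(n):
            if safe(r, c):
                queens[r] = c
                if place(r + 1):
                    return True
        queens[r] = -1                       # no column worked: backtrack
        return False
    return queens if place(0) else None
```

the one-array representation bakes in the "one queen per row" constraint for free, which is the point of the note.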
dynamic programming
//returns max value for knapsack of capacity W, weights wt, vals val int knapSack(int W, int wt[], int val[]) int n = wt.length; int K[n+1][W+1]; //build table K[][] in bottom up manner for (int i = 0; i <= n; i++) for (int w = 0; w <= W; w++) if (i==0  w==0) // base case K[i][w] = 0; else if (wt[i1] <= w) //max of including weight, not including K[i][w] = max(val[i1] + K[i1][wwt[i1]], K[i1][w]); else //weight too large K[i][w] = K[i1][w]; pipes are connected at their endpoints. What is the maximum amount of water that you can route from a given starting point to a given ending point? return K[n][W];
mincut / maxcut
hungarian
 assign N things to N targets, each with an associated cost
maxflow
 A list of pipes is given, with different flow capacities. These pipes are connected at their endpoints. What is the maximum amount of water that you can route from a given starting point to a given ending point?
sorting
 you can assume w.l.o.g. all input numbers are unique
 comparison sorting requires Ω(n log n) comparisons (proof via decision tree)
 considerations: worst case, average case, performance in practice, input distribution, stability (incoming order is preserved for equal keys), in situ (in-place), stack depth, having to read/write to disk (disk is much slower), parallelizable, online (handles more data coming in)
 adaptive  changes its behavior based on the input (ex. bubble sort stops early on sorted input)
comparison-based
bubble sort
 keep swapping adjacent pairs
for i = 1 to n-1:
    if a[i+1] < a[i]: swap(a, i, i+1)
 have a flag that tells if you did no swaps  done
 number of passes ~ how far elements are from their final positions
 O(n^2)
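The swap-flag early exit described above, in a small Python sketch (function name is mine):

```python
def bubble_sort(a):
    # Repeatedly swap adjacent out-of-order pairs; stop early on a
    # pass with no swaps (the "done" flag from the notes).
    n = len(a)
    for _ in range(n - 1):
        swapped = False
        for i in range(n - 1):
            if a[i + 1] < a[i]:
                a[i], a[i + 1] = a[i + 1], a[i]
                swapped = True
        if not swapped:
            break
    return a
```

On already-sorted input the flag makes this a single O(n) pass.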
oddeven sort
 swap even pairs
 then swap odd pairs
 parallelizable
selection sort
 move the largest remaining element into its final position
for i = n down to 2:
    jmax = 1
    for j = 2 to i:
        if a[j] > a[jmax]: jmax = j
    swap(a, i, jmax)
 O(n^2)
insertion sort
 insert each item into the sorted prefix
for i = 2 to n:
    insert a[i] into a[1..(i-1)], shifting larger elements right
 O(n^2), O(nk) where k is max dist from final position
 best when almost sorted
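A runnable Python sketch of the shifting insertion described above (function name is mine):

```python
def insertion_sort(a):
    # Insert a[i] into the sorted prefix a[0..i-1], shifting
    # larger elements one slot to the right.
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a
```

When every element is at most k slots from its final position, the inner loop does at most k shifts, giving the O(nk) bound from the notes.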
heap sort
 insert everything into heap
 keep removing the max
 can do in place by storing everything in array
 can use any height-balanced tree instead of a heap
 traverse tree to get order
 ex. B-tree: multi-rotations occur infrequently, average O(log n) height
 O(n log n)
smooth sort
 adaptive heapsort
 collection of heaps (each one is a factor larger than the one before)
 can add and remove in essentially constant time if data is in order
merge sort
 split into smaller arrays, sort, merge
 T(n) = 2T(n/2) + n = O(n log n)
 stable, parallelizable (if parallel, not in place)
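The split/sort/merge recurrence above, as a Python sketch (function name is mine; stability comes from taking ties from the left half):

```python
def merge_sort(a):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    # Merge: stable because ties take from the left half first.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]
```

Each level of recursion does O(n) merge work across O(log n) levels, matching T(n) = 2T(n/2) + n.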
quicksort
 split on pivot, put smaller elements on left, larger on right
 O(n log n) average, O(n^2) worst
 O(log n) expected extra space (recursion depth); worst case O(n)
shell sort
 generalize insertion sort
 insertionsort all items i apart where i starts big and then becomes small
 sorted after last pass (i=1)
 O(n^2), O(n^(3/2)), …  the exact complexity depends on the gap sequence (open in general)
 regardless of gap sequence, the worst case must exceed n log n
 not used much in practice
not comparisonbased
counting sort
 use values as array indices into a count array
 keep count of number of times at each index
 for specialized data only, need small values
 O(n+k) time, O(k) space
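A minimal Python sketch of the counting idea above (function name is mine; assumes non-negative ints below k):

```python
def counting_sort(a, k):
    # Assumes all values are ints in range(k).
    # One pass to count, one pass over the counts to rebuild:
    # O(n + k) time, O(k) extra space.
    count = [0] * k
    for x in a:
        count[x] += 1
    out = []
    for v in range(k):
        out.extend([v] * count[v])
    return out
```

This only pays off when k (the value range) is small relative to n.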
bucket sort
 spread data into buckets based on value
 sort the buckets
 O(n+k) time
 buckets could be trees
radix sort
 sort each digit in turn
 stable sort on each digit
 like bucket sort d times
 O(d*n) time, O(k+n) space
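The "stable bucket pass per digit" idea can be sketched as an LSD radix sort (Python; function name is mine; assumes non-negative ints):

```python
def radix_sort(a, base=10):
    # LSD radix sort: one stable bucket pass per digit,
    # least significant digit first.
    if not a:
        return a
    digits, m = 1, max(a)
    while m >= base:
        m //= base
        digits += 1
    for d in range(digits):
        buckets = [[] for _ in range(base)]
        for x in a:  # appending preserves order -> stable pass
            buckets[(x // base ** d) % base].append(x)
        a = [x for b in buckets for x in b]
    return a
```

Stability of each pass is what lets earlier (lower) digits survive later passes.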
meta sort
 like quicksort, but O(n log n) worst case
 run quicksort, mergesort in parallel
 stop when one stops
 there is an overhead but doesn’t affect bigoh analysis
 average, worst case = O(n log n)
sorting overview
 in exceptional cases insertion sort or radix sort are much better than the generic quicksort / mergesort / heapsort answers.
 merge sorted a and b (b into a’s extra space)  start from the back
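The merge-from-the-back trick above, sketched in Python (function name and interface are mine; a has empty slots at the end for b):

```python
def merge_into(a, m, b):
    # a has m sorted elements followed by len(b) empty slots;
    # b is sorted. Fill a from the back so nothing is overwritten.
    i, j, k = m - 1, len(b) - 1, m + len(b) - 1
    while j >= 0:
        if i >= 0 and a[i] > b[j]:
            a[k] = a[i]; i -= 1
        else:
            a[k] = b[j]; j -= 1
        k -= 1
    return a
```

Writing back-to-front means an element of a is never clobbered before it is read.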
searching
 binary search can’t do better than linear if there are duplicates
 if data is too large, we need to do external sort (sort parts of it and write them back to file)
 write binary search recursively
 use low<= val and high >=val so you get correct bounds
 binary search with empty strings  make sure that there is an element at the end of it
 “a”.compareTo(“b”) is negative (-1)
 we always round up for these
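The recursive binary search with the low/high bound discipline from the notes, as a Python sketch (function name is mine):

```python
def binary_search(a, target, low=0, high=None):
    # Invariant: if target is present, it lies in a[low..high].
    if high is None:
        high = len(a) - 1
    if low > high:
        return -1          # empty range: not found
    mid = (low + high) // 2
    if a[mid] == target:
        return mid
    if a[mid] < target:
        return binary_search(a, target, mid + 1, high)
    return binary_search(a, target, low, mid - 1)
```

Shrinking to mid+1 / mid-1 (never mid itself) is what guarantees termination on the boundary cases.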
 finding minimum is Ω(n)
 pf: assume an element was ignored, that element could have been minimum
 simple algorithm  keep track of best so far
 thm: n/2 comparisons are necessary because each comparison involves 2 elements
 thm: n1 comparisons are necessary  need to keep track of knowledge gained
 every non-min element must win at least once (move from unknown to known)
 find min and max
 naive solution has 2n-2 comparisons
 pairwise compare all elements, array of maxes, array of mins = n/2 comparisons
 check min array, max array = 2*(n/2 - 1)
 3n/2 - 2 comparisons are sufficient (and necessary)
 pf: 4 categories (not tested, only won, only lost, both)
 not tested → won or lost: n/2 comparisons
 won or lost → both: n/2 - 1 each (for the min side and the max side)
 therefore 3n/2 - 2 comparisons necessary
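The pairwise scheme above (one comparison inside the pair, one against the running min, one against the running max) in runnable form (Python; function name is mine; assumes a non-empty list):

```python
def min_and_max(a):
    # ~3 comparisons per 2 elements instead of 4: about 3n/2 total.
    n = len(a)
    if n % 2:
        lo = hi = a[0]
        start = 1
    else:
        lo, hi = (a[0], a[1]) if a[0] < a[1] else (a[1], a[0])
        start = 2
    for i in range(start, n - 1, 2):
        x, y = a[i], a[i + 1]
        small, big = (x, y) if x < y else (y, x)  # 1 comparison
        if small < lo:                            # 1 comparison
            lo = small
        if big > hi:                              # 1 comparison
            hi = big
    return lo, hi
```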
 find max and nexttomax
 thm: n + ⌈log n⌉ - 2 comparisons are sufficient
 consider elimination tournament, pairwise compare elements repeatedly
 2nd best must have played best at some point  look for it in log(n)
 selection  find ith largest integer
 repeatedly finding median finds ith largest
 finding median linear yields ith largest linear
 T(n) = T(n/2) + M(n) where M(n) is time to find median
 quickselect  partition around pivot and recur
 average time linear, worst case O(n^2)
 median in linear time  quickly eliminate a constant fraction and repeat
 partition into n/5 groups of 5
 sort each group high to low
 find median of each group
 compute median of medians recursively
 move groups with larger medians to right
 move groups with smaller medians to left
 now we know 3/10 of elements larger than median of medians
 3/10 of elements smaller than median of medians
 partition all elements around median of medians
 recur like quickselect
 guarantees each partition contains at most 7n/10 elements
 T(n) = T(n/5) + T(7n/10) + O(n) = O(n), because n/5 + 7n/10 = 9n/10 < n (the subproblem sizes sum to less than n, so the work shrinks geometrically)
 equivalently T(n) ≤ T(9n/10) + O(n) = O(n)
computational geometry
 range queries
 input = n points (vectors) with preprocessing
 output  number of points within any query rectangle
 1D
 range query is a pair of binary searches
 O(log n) time per query
 O(n) space, O(n log n) preprocessing time
 2D
 subtract out rectangles you don’t want
 add back things you double subtracted
 we want rectangles anchored at origin
 nD
 make regions by making a grid that includes all points
 precompute southwest counts for all regions  different ways to do this  tradeoffs between space and time
 O(log n) time per query (after precomputing)  binary search x,y
 polygonpoint intersection
 polygon  a closed sequence of segments
 simple polygon  has no self-intersections
 thm (Jordan)  a simple polygon partitions the plane into 3 regions: interior, exterior, boundary
 convex polygon  intersection of halfplanes
 polytope  higherdimensional polygon
 raycasting
 intersections = odd  interior, even  exterior
 check for tangent lines, intersecting corners
 O(n) time per query, O(1) space and preprocessing time
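The odd/even crossing test above, sketched in Python (function name is mine; the tangent/corner edge cases from the notes are deliberately ignored):

```python
def point_in_polygon(pt, poly):
    # Cast a ray to the right from pt; count crossings with polygon
    # edges. Odd count -> interior. Assumes pt is not exactly on a
    # boundary edge or vertex.
    x, y = pt
    inside = False
    n = len(poly)
    for k in range(n):
        (x1, y1), (x2, y2) = poly[k], poly[(k + 1) % n]
        # does this edge straddle the horizontal line through pt?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            cross_x = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if cross_x > x:
                inside = not inside
    return inside
```

Using a strict `>` on one side of the straddle test is one common way to avoid double-counting a crossing at a shared vertex.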
 convex case
 preprocessing
 find an interior point p (pick a vertex or average the vertices)
 partition into wedges (slicing through vertices) w.r.t. p
 sort wedges by polar angle
 query
 find containing wedge (look up by angle)
 test interior / exterior
 check triangle  cast ray from p to point, see if it crosses edge
 O(log n) time per query (we binary search the wedges)
 O(n) space and O(n log n) preprocessing time
 nonconvex case
 preprocessing
 sort vertices by x
 find vertical slices
 partition into trapezoids (triangle is trapezoid)
 sort slice trapezoids by y
 query
 find containing slice
 find trapezoid in slice
 report interior/ exterior
 O(log n) time per query (two binary searches)
 O(n^2) space and O(n^2) preprocessing time
 convex hull
 input: set of n points
 output: smallest containing convex polygon
 simple solution 1  Jarvis’s march
 simple solution 2  Graham’s scan
 mergehull
 partition into two sets  compute the MergeHull of each set
 merge the two resulting CHs
 pick point p with least x
 form anglemonotone chains w.r.t p
 merge chains into anglesorted list
 run Graham’s scan to form CH
 T(n) = 2T(n/2) + n = O(n log n)
 generalizes to higher dimensions
 parallelizes
 quickhull (like quicksort)
 find right and leftmost points
 partition points along this line
 find points farthest from line  make quadrilateral
 eliminate all internal points
 recurse on 4 remaining regions
 concatenate resulting CHs
 O(n log n) expected time
 O(n^2) worstcase time  ex. circle
 generalizes to higher dim, parallelizes
 lower bound  CH requires Ω(n log n) comparisons
 pf  reduce sorting to convex hull
 consider arbitrary set of x_i to be sorted
 raise the x_i to the parabola (x_i, x_i^2)  could be any convex function
 compute the convex hull of the lifted points  all of them lie on the hull, connected in x-order
 from the convex hull we can read off the sorted x_i => the convex hull did the sorting, so it needs at least n log n comparisons
 corollary  Graham’s scan is optimal
 Chan’s convex hull algorithm
 assume we know CH size m=h
 partition points into n/m sets of m each
 convex polygon diameter
 Voronoi diagrams  input n points  takes O(nlogn) time to compute
 problems that are solved
 Voronoi cell  the set of points closer to any given point than all others form a convex polygon
 generalizes to other metrics (not just Euclidean distance)
 a Voronoi cell is unbounded if and only if its point is on the convex hull
 corollary  convex hull can be computed in linear time
 Voronoi diagram has at most 2n-5 vertices and 3n-6 edges
 every nearest neighbor of a point defines an edge of the Voronoi diagram
 corollary  all nearest neighbors can be computed from the Voronoi diagram in linear time
 corollary  nearest neighbor search in O(log n) time using planar subdivision search (binary search in 2D)
 connecting the points of neighboring Voronoi cells forms a triangulation (Delaunay triangulation)
 a Delaunay triangulation maximizes the minimum angle over all triangulations  no long slivery triangles
 the Euclidean minimum spanning tree is a subset of the Delaunay triangulation (so it can be computed easily)
 calculating Voronoi diagram
 discrete case / bitmap  expand breadthfirst waves from all points
 time is O(bitmap size)
 time is independent of #points
 intersecting half planes
 Voronoi cell of a point is the intersection of all half-planes induced by the perpendicular bisectors w.r.t. all other points
 use intersection of convex polygons to intersect half-planes (O(n log n) time per cell)
 the whole diagram can be computed in O(n log n) total time
 idea divide and conquer
 merging is complex
 sweep line using parabolas

C/C++ Reference
 The C memory model: global, local, and heap variables. Where they are stored, their properties, etc.
 Variable scope
 Using pointers, pointer arithmetic, etc.
 Managing the heap: malloc/free, new/delete
C basic
 #include <stdio.h>
 printf("the variable 'i' is: %d", i);
 can only use /* */ for comments (in C89; C99 added //)
 for constants:
#define MAX_LEN 1024
malloc
 malloc ex.
 There is no bool keyword in C
/* We're going to allocate enough space for an integer
   and assign 5 to that memory location */
int *foo;
/* malloc call: note the cast of the return value from
   a void * to the appropriate pointer type */
foo = (int *) malloc(sizeof(int));
*foo = 5;
free(foo);
char *some_memory = "Hello World";
 this creates a pointer to a read-only part of memory
 it’s disastrous to free a piece of memory that’s already been freed
 variables must be declared at the beginning of a function and must be declared before any other code
memory
 heap variables persist until freed  they are allocated with malloc
 local variables are stored on the stack
 global variables are stored in an initialized data segment
structs
struct point {
    int x;
    int y;
};
struct point *p;
p = (struct point *) malloc(sizeof(struct point));
p->x = 0;
p->y = 0;
strings
 array with extra null character at end ‘\0’
 strlen() doesn’t include the null character
pointers
int val = 20;
int *x;            // declare a pointer
int *fake = NULL;  // initialize unused pointers to NULL (not to an int)
x = &val;          // take the address of a variable
// can use pointer++ and pointer-- to step to adjacent elements
// Hello World
#include <iostream>
using namespace std; // always comes after the includes, like a weaker version of packages
// without it you have to write std::cout to look in iostream; for very long programs prefer the explicit form
int main() { // main function, not part of a class, must return an int
    cout << "Hello World" << endl;
    return 0; // always return this; means it didn't crash
}
Preprocessor
 #include <iostream>  system file  angle brackets
 #include "ListNode.h"  user file  inserts the contents of the file in this place
 #ifndef  "if not defined"
 #define  defines a macro (direct text replacement)
 #define TRUE 0 // like a final int; we usually put it in all caps: if(TRUE == 0)
 #define MY_OBJECT_H // doesn't give it a value  all it does is make #ifdef true and #ifndef false
 #if / #ifdef needs to be closed with #endif
 if 2 files include each other, we get into an include loop
 we can solve this with include guards in the .h files  everything is only defined once
odd.h:
#ifndef ODD_H
#define ODD_H
#include "even.h"
bool odd (int x);
#endif
even.h:
#ifndef EVEN_H
#define EVEN_H
#include "odd.h"
bool even (int x);
#endif
I/O
#include <iostream>
using namespace std;
int main() {
    int x;
    cout << "Enter a value for x: "; // the arrows show you which way the data is flowing
    cin >> x;
    return 0;
}
C++ Primitive Types
 int can be 16, 32, or 64 bits depending on the platform
 double is better than float (float loses precision quickly)
 an if statement can take an int: if (0) is false, anything nonzero is true // don’t write a single equals instead of double equals  if (x = 0) assigns and silently evaluates to false
Compiler: clang++
Functions
 you can only call functions that appear above you in the file
 function prototype (forward declaration)  to compile mutually recursive functions, declare the function with a semicolon instead of brackets and no body
bool even(int x); // forward declaration / function prototype
bool odd(int x) {
    if (x == 0) return false;
    return even(x - 1);
}
bool even(int x) {
    if (x == 0) return true;
    return odd(x - 1);
}
Classes
Need 3 separate files:
1. Header file that contains the class definition  like an interface  IntCell.h
#ifndef INTCELL_H // all .h files start with these
#define INTCELL_H
class IntCell {
  public: // visibility block  everything in this block is public
    IntCell(int initialValue = 0); // default parameter: call it with 1 or 0 arguments
    ~IntCell(); // destructor, takes no parameters
    int getValue() const; // const placed here means the method doesn't modify the object
    void setValue(int val);
  private:
    int storedValue;
    int max(int m);
};
#endif // all .h files end with this
2. C++ file that contains the class implementation  IntCell.cpp
#include "IntCell.h"
using namespace std; // (not really necessary, but…)
IntCell::IntCell(int initialValue) :  // default value only listed in the .h file
    storedValue(initialValue) {       // initializer list: fieldName(value) shorthand
}
int IntCell::getValue() const {
    return storedValue;
}
void IntCell::setValue(int val) { // this is how you define the body of a method
    storedValue = val;
}
int IntCell::max(int m) {
    return 1;
}
3. C++ file that contains a main()  TestIntCell.cpp
#include <iostream>
#include "IntCell.h"
using namespace std;
int main() {
    IntCell m1; // calls default constructor  we don't use parentheses!
    IntCell m2(37);
    cout << m1.getValue() << " " << m2.getValue() << endl;
    m1 = m2; // there are no references  copies the bits of m2 into m1
    m2.setValue(40);
    cout << m1.getValue() << " " << m2.getValue() << endl;
    return 0;
}
Pointers
 stores the memory address of another object // we will assume everything is 32-bit
 can point to a primitive type or a class type
int *x;              // pointer to int
char *y;             // pointer to char
Rational *rPointer;  // pointer to Rational
 all pointers are the same size (32 bits here) because they are just addresses
 in a definition, * declares a pointer type: int *x;
 in an expression, * dereferences: *x = 2; (sets the value the pointer points to)
 in a definition, & declares a reference type
 in an expression, &x means "the address of x"
int x = 1;            // address 1000, value 1  don't forget to make the pointee
int y = 5;            // address 1004, value 5
int *x_pointer = &x;  // address 1008, value 1000
cout << x_pointer;    // prints the address 1000
cout << *x_pointer;   // prints the value at the address
*x_pointer = 2;       // this changes the value of x to 2
x_pointer = &y;       // x_pointer now stores the address of y
*x_pointer = 3;       // this changes the value of y to 3
int n = 30;
int *p;         // variables are not initialized to any value
*p = n;         // error: no memory was requested, unless p happens to point into memory you own
int *p2 = NULL; // will still crash on dereference, but it is a better way to initialize
void swap(int *x, int *y) {
    int temp = *x; // temp takes the value x points to
    *x = *y;
    *y = temp;
} // at the end, x and y themselves still hold the same addresses
int main() {
    int a = 0;
    int b = 3;
    cout << "Before swap(): a: " << a << " b: " << b << endl;
    swap(&b, &a);
    cout << "After swap(): a: " << a << " b: " << b << endl;
    return 0;
}
Dynamic Memory Allocation
 static memory allocation  the compiler knows at compile time how much memory is needed
int someArray[10];                   // declare array of 10 elements
int *value1_address = &someArray[3]; // declare a pointer to int
 the new keyword returns a pointer to a newly created "thing"
int main() {
    int n;
    cout << "Please enter an integer value: "; // read in a value from the user
    cin >> n;
    int *ages = new int[n]; // use the user's input to create an array of int
    for (int i = 0; i < n; i++) { // prompt the user to initialize the array
        cout << "Enter a value for ages[" << i << "]: ";
        cin >> ages[i];
    }
    for (int i = 0; i < n; i++) { // print out the contents of the array
        cout << "ages[" << i << "]: " << ages[i];
    }
    delete [] ages; // finished with the array  clean up by calling delete
    return 0;       // everything you allocate with new needs to be deleted (no GC, faster than Java)
}
Generally: SomeTypePtr = new SomeType;
int *intPointer = new int;
delete intPointer; // for an array: delete [] ages;
 delete only deals with the pointee, not the pointer
Accessing parts of an object
 regular object: Rational r; r.num = 3;
 for a pointer, dereference it: Rational *r = new Rational(); (*r).num = 4; // or r->num = 4; (shorthand)
char* x, y; // y is NOT a pointer! write char *x, y; to make that clear
Linked Lists
 List object keeps track of size, pointers to head and tail
 head and tail are dummy nodes
 ListNode holds a value, previous, and next
 ListItr has a pointer to the current ListNode
Friend
class ListNode {
  public:
    ListNode(); // constructor
  private: // only this class can modify these fields
    int value;
    ListNode *next, *previous; // for doubly linked lists
    friend class List;   // these classes can bypass private visibility
    friend class ListItr;
};
Constructor  just has to initialize fields
Foo() {
    ListNode* head = new ListNode(); // because of the type ListNode*, this creates a new LOCAL variable and does not modify the field
    // head = new ListNode();  this works
}
Foo() {
    ListNode temp;
    head = &temp; // this ListNode is deallocated after the constructor ends  doesn't work
}
Assume int *x has been declared and int y comes from user input:
x = new int[10]; // 40 bytes
x = new int;     // 4 bytes
x = new int[y];  // y*4 bytes
sizeof(int)  tells you how big an integer is (4 bytes)
References  like a pointer, holds an address, with 3 main differences:
1. Its address cannot change (its address is constant)
2. It MUST be initialized upon declaration  cannot (easily) be initialized to NULL
3. Has implicit dereferencing  if you try to change the value of the reference, it assumes you mean the value the reference points to // can't use one when the target needs to change, ex. ListItr's current pointer, which changes a lot
Declaration
List sampleList;
List & theList = sampleList; // a reference must be initialized to the object, not the address
void swap(int & x, int & y) { // this passes in references
    int temp = x;
    x = y;
    y = temp;
}
int main() { // easier to call  references are nice when dealing with parameters
    int a = 0;
    int b = 3;
    cout << "Before swap(): a: " << a << " b: " << b << endl;
    swap(b, a);
    cout << "After swap(): a: " << a << " b: " << b << endl;
    return 0;
}
 you can access a reference's value with just a period
 location: in a definition, * means "pointer to" and & means "reference to"; in a statement, * means "dereference" and & means "address of"
subroutines
 methods are in a class; functions are outside a class
Parameter passing
 call by value  actual parameter is copied into the formal parameter
  what Java always does  can be slow if it has to copy a lot; the actual object can't be modified
 call by reference  pass references as parameters
  use when the formal parameter should be able to change the value of the actual argument
  void swap(int &x, int &y);
 call by constant reference  parameters are constant and are passed by reference
  both efficient and safe
  bool compare(const Rational & left, const Rational & right);
 values can also be returned by value, by reference, or by constant reference
C++ default class members
1. Destructor  frees up any resources allocated during the use of an object // the default does nothing
2. Copy constructor  creates a new object based on an old object
  IntCell copy = original; // or IntCell copy(original);
  automatically called when an object is passed by value into a subroutine
  automatically called when an object is returned by value from a subroutine
3. operator=()  also known as the copy assignment operator
  intended to copy the state of original into copy
  called when = is applied to two objects after both have been previously constructed
  IntCell original; // constructor called
  IntCell copy;
  copy = original;  // operator= called
  overrides the = operator // operator overloads only work on objects, not pointers
(and a default constructor, if you don't supply one)
C++ has visibility specifiers on inheritance (public / protected / private base classes)
class Name {
  public:
    Name(void) : myName("") { }
    ~Name(void) { }
    void SetName(string theName) { myName = theName; }
    void print(void) { cout << myName << endl; }
  private:
    string myName;
};
class Contact : public Name { // this is like "Contact extends Name"
  public:
    Contact(void) { myAddress = ""; }
    ~Contact(void) { }
    void SetAddress(string theAddress) { myAddress = theAddress; }
    void print(void) {
        Name::print(); // can't access private fields in Name  calls print from the superclass
        cout << myAddress << endl;
    }
  private:
    string myAddress;
};
C++ has multiple inheritance  you can have as many parent classes as you want class Sphere : public Shape, public Comparable, public Serializable { };
Dispatch
 static  decision on which member function to invoke is made using the compile-time type of an object
  Person *p; p = new Student(); p->print(); // always calls the Person print method  uses the type of the pointer
 dynamic  decision made using the run-time type of an object
  incurs runtime overhead: the program must maintain extra information, and the compiler must generate code to determine which member function to invoke
  syntax in C++: the virtual keyword (Java does this by default, i.e. everything is virtual)
Example
class A {
  public:
    virtual void foo() { }
};
class B : public A {
  public:
    virtual void foo() { }
};
int main() {
    int which = rand() % 2;
    A *bar;
    if (which) bar = new A();
    else bar = new B();
    bar->foo();
    return 0;
}
Virtual method tables  store the virtual methods in an array
 each object contains a pointer to the virtual method table, in addition to any other fields
 that table holds the addresses of the methods
 any virtual call must follow the pointer to the object (one pointer dereference), then follow the virtual method table pointer (second dereference), then look up the method pointer
 in Java the default is dynamic; in C++ the default is static  this is faster
 when creating a subclass object, the constructor of each subclass overwrites the appropriate pointers in the virtual method table with the overridden method pointers
Abstract Classes class foo { public: virtual void bar() = 0; };
Types of multiple inheritance 1. Shared What Person is in the diagram on the previous slide 2. Replicated (or repeated) What gp_list_node is in the diagram on the previous slide 3. Nonreplicated (or nonrepeated) A language that does not allow shared or replicated (i.e. no common ancestors allowed) 4. Mixin What Java (and others) use to fake multiple inheritance through the use of interfaces
 in C++, replicated is the default
 shared can be done by specifying that a base class is virtual:
class student : public virtual person, public gp_list_node { };
class professor : public virtual person, public gp_list_node { };
 Java has ArrayStoreException  makes sure the thing you are adding to the array is of the correct type
String[] a = new String[1];
Object[] b = a;
b[0] = new Integer(1); // throws ArrayStoreException at run time

Java Reference
data structures
 LinkedList, ArrayList  add(E e), add(int idx, E e), get(int idx), remove(int index), remove(Object o)
 Stack  push(E item), peek(), pop()
 PriorityQueue  peek(), poll()  default is a min-heap
  PriorityQueue(int initialCapacity, Comparator<? super E> comparator)
  PriorityQueue(Collection<? extends E> c)
 HashSet, TreeSet  add, remove
 HashMap  put(K key, V value), get(Object key), keySet()  if you try to get something that's not there, returns null
 default initial capacities are all around 10-20
 clone() has to be cast from Object
useful
iterator
 it.next()  returns value  it.hasNext()  returns boolean  it.remove()  removes last returned value
strings
 String.split("[ .?]") // split on space, ., and ? (regex character class)
 StringBuilder  much faster at concatenating strings
  StringBuffer  thread safe, but slower
  StringBuilder s = new StringBuilder(CharSequence seq);
  s.append("cs3v");
  s.charAt(int x), s.deleteCharAt(int x), substring
  s.reverse()
 since String is immutable it can safely be shared between many threads
 formatting
  String s = String.format("%d", 3);
  "%05d" // zero-pad to fill 5 spaces
  "%8.3f" // width 8, 3 digits after the decimal point
  "%-d" // left justify
  "%,d" // print commas, ex. 1,000,000
  conversions: int  %d, double  %f, string  %s
 reverse a string: new StringBuilder(s).reverse().toString()
 int count = StringUtils.countMatches(s, something); // Apache Commons
 integer
  String toString(int i, int base)
  int parseInt(String s, int base)
 array to string:
  char[] data = {'a', 'b', 'c'};
  String str = new String(data);
sorting
 Arrays.sort(arr)
 Collections.sort(List l), Collections.sort(List l, Comparator c)
  uses mergesort (with insertion sort for very small runs)
 Collections.reverseOrder() returns a comparator opposite of the default
class ComparatorTest implements Comparator<String> {
    public int compare(String one, String two) { ... } // if negative, one comes first
}
class Test implements Comparable<Object> {
    public int compareTo(Object two) { ... }
}
exceptions
 ArrayIndexOutOfBoundsException
throw new Exception("Chandan type")
higher level
 primitives 
byte, short, char, int, long, float, double, boolean
 java only has primitive and reference types
 when you assign primitives to each other, it’s fine
 when you pass in a primitive, its value is copied
 when you pass in an object, its reference is copied
 you can modify the object through the reference, but can’t change the object’s address
 garbage collection
 once an object no longer referenced, gc removes it and reclaims memory
 jvm intermittently runs a mark-and-sweep algorithm
 runs when shortterm stuff gets full
 older stuff moves to different part
 eventually older stuff is cleared
objectoriented
 declare / instantiate / initialize: Robot k = new Robot();  (Robot k declares, new instantiates, Robot() initializes)
 class method = static
 called with Foo.DoIt()
 initialized before constructor
 class shares one copy, can’t refer to nonstatic
 instance method  invoked on specific instance of the class
 called with f.DoIt()
 protected member is accessible within its class and subclasses

Theory of Computation
 introduction
 ch 1-3  finite automata, regular expressions
 ch 4  properties of regular languages (except Sections 4.2.3 and 4.2.4)
 ch 5  context free grammars and languages
 ch 6  pushdown automata (don’t need to know 6.3 proofs)
 ch 7  properties of CFLs (except pp. 295-297)
 ch 8  intro to turing machines (except 8.5.3)
 ch 9  undecidability (9.1,9.2,9.3)
 ch 10  10.1-10.4  know the additional problems that are NP-complete
 more on NP Completeness
introduction
 Chomsky hierarchy of languages: $L_3 \subset L_2 \subset L_1 \subset L_R \subset L_0 \subset Σ*$
 each L is a set of languages
 $L_0=L_{RE}$  unrestricted grammars  general phrase structure grammars  recursively enumerable languages  include all formal grammars; they generate exactly the languages that can be recognized by a Turing machine
 computable, maybe undecidable (if not in L_R)
 L_R  recursive grammars  Turing machine that halts eventually
 decidable
 L_1  contextsensitive grammars  all languages that can be recognized by a linear bounded automaton
 L_2  contextfree grammars  these languages are exactly all languages that can be recognized by a nondeterministic pushdown automaton.
 L_3  regular grammars  all languages that can be decided by a finite state automaton

contains Σ*, Σ* is countably infinite

 strings
 languages
 Σ* Kleene Closure has multiple definitions

{w | w is a finite length string ∧ w is a string over Σ}
{xw | w ∈ Σ* ∧ x ∈ Σ} ∪ {Ɛ}

 Σ_i has strings of length i
 problems
 automata
 δ vs δ̂  δ̂ transitions on a whole string, not a single symbol

 notation writes the state between the symbols you have read and have yet to read 
 notation with * writes the state before the symbols you have to read and after what you have read
 grammars
 leftmost grammar  expand leftmost variables first  doesn’t matter for contextfree
 parse tree  write string on bottom
 sets
 finite
 countably infinite
 not countably infinite
 mappings
 onto  each output has at least 1 input
 11  each output has at most 1 input
 total  each input has at least 1 output
 function  each input has at most 1 output
 equivalence relation  reflexive, symmetric, transitive
 proof methods
 **read induction **
 library of babel
 distinct number of books, each contained, but infinite room
ch 1-3  finite automata, regular expressions
 alphabet  any nonempty finite set
 string  finite sequence of symbols from an alphabet
 induction hypothesis  assumption that P(i) is true
 lexicographic ordering  {Ɛ,0,1,00,01,10,11,000,…}
 finite automata  like a Markov chain w/out probabilities  5 parts
 states
 E  finite set called the alphabet
 f: Q x E > Q is the transition function
 ex. f(q,0) = q’
 start state
 final states
 language  L(M)=A  means A is the set of all strings that the machine M accepts

A* = {$x_1x_2 \dots x_k \mid k \geq 0 \wedge x_i \in A$}  A+ = A* − {Ɛ}

concatenation A ∘ B = {xy | x ∈ A ∧ y ∈ B}  regular language  one recognized by a finite automaton
 class of regular languages is closed under union, concatenation, star operation
 nondeterministic automata
 can have multiple transition states for one symbol
 can transition on Ɛ
 can be thought of as a tree
 After reading that symbol, the machine splits into multiple copies of itself and follows all the possibilities in parallel. Each copy of the machine takes one of the possible ways to proceed and continues as before. If there are subsequent choices, the machine splits again.
 If the next input symbol doesn’t appear on any of the arrows exiting the current state, that copy of the machine dies.
 if any copy is in an accept state at the end of the input, the NFA accepts the input string.
 can also use regular expressions (stuff like unions) instead of finite automata
 to convert, first convert to gnfa
 gnfa (generalized nfa)  start state isn’t accept state
 nonregular languages  isn’t recognized by a finite automata

ex. C = {w | w has an equal number of 0s and 1s}  requires infinitely many states

ch 4  properties of regular languages (except Sections 4.2.3 and 4.2.4)
 pumping lemma proves languages not to be regular

if L is regular, there exists a constant n such that every string w in L with |w| ≥ n can be broken into 3 strings w = xyz such that:
 y ≠ Ɛ
 |xy| ≤ n
 for all k ≥ 0, x y^k z is also in L
 closed under union, intersection, complement, concatenation, closure, difference, reversal
 convert NFA to DFA  write the possible routes to the final state, write the intermediate states, remove unnecessary ones
 minimization of DFAs
 eliminate any state that can’t be reached
 partition remaining states into blocks so all states in same block are equivalent
 can’t do this grouping for nfas
ch 5  context free grammars and languages
 $w^R$ = reverse
 context-free grammar  more powerful way to describe a language
 ex. substitution rules (generates strings $0^n\#1^n$)
 A > 0A1
 A > B
 B > #
 def
 variables  finite set
 terminals  alphabet
 productions
 start variable
 recursive inference  start with terminals, show that string is in grammar
 derivation  sequence of substitutions to obtain a string
 can also make these into parse trees
 leftmost derivation  at each step we expand leftmost variable
 arrow with a star ($\to^*$) means zero or more derivation steps at once
 parse tree  derived string is read off the leaves at the bottom
 sentential form  the string at any step in a derivation
 proofs in 5.2s
 w equivalence
 parse tree
 leftmost derivation
 rightmost derivation
 recursive inference
 derivation

if-else grammar: $S \to \epsilon \mid SS \mid iS \mid iSeS$
 context-free grammars used for parsers (compilers), matching parentheses, palindromes, if-else, html, xml
 if a grammar generates a string in several different ways, we say that the string is derived ambiguously in that grammar
 ambiguity resolution
 some operators take precedence
 make things leftassociative
 think about terms, expressions, factors
 if unambiguous, leftmost derivation will be unique
 inherently ambiguous language  all its grammars are ambiguous
 ex: $L = \{a^nb^nc^md^m\} \cup \{a^nb^mc^md^n\}$, $n \geq 1, m\geq1$
ch 6  pushdown automata (don’t need to know 6.3 proofs)
 pushdown automata  have an extra stack of memory  equivalent to context-free grammars
 similar to the parser in a typical compiler
 two ways of accepting
 entering accept state
 accept by emptying stack
 convert from empty stack to accept state
 add a new bottom-of-stack marker X_1
 new start state pushes X_1, then pushes Z_1 on top of it, then spontaneously transitions to q_0
 every state gets an epsilon-transition to a new final accepting state when it reads X_1 (the original PDA has emptied its stack)
 convert accept state to empty stack
 add symbol X_1 under Z_1 (this is so we never empty the stack unless we are in p  there are no transitions on X_1)
 all accept states epsilon-transition to a new state p
 p epsilon-transitions to itself, popping one element from the stack each time
 6.3
 convert context free grammar to empty stack
 simulate leftmost derivations
 put answer on stack, most recent variable on top
 if terminal remove
 if variable nondeterministically expand
 if empty stack, accept
 convert PDA to grammar
 every variable is of the form [pXq]
 [pXq] (X is on the stack)  generates the inputs that take the PDA from state p to state q while popping X
 [pXq] > a where a is an input that takes p to q while popping X
 pushdown automata can transition on epsilon
 def:
 transition function  takes (state,symbol,stack symbol)  returns set of pairs (new state, new string to put on stack  length 0, 1, or more)
 start state
 start symbol (stack starts with one instance of this symbol)
 set of accepting states
 set of all states
 alphabet
 stack alphabet
 ex. palindromes
 push onto stack and continue OR
 assume we are in middle and start popping stack  if empty, accept input up to this point
 label diagrams with i, X/Y  what input is used and new/old tops of stack
 ID for PDA: (state,remaining string,stack)
 conventionally, we put top of stack on left
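The guess-the-middle palindrome PDA above can be sketched by replacing the nondeterministic guess with a loop over every possible midpoint, each tried with its own explicit stack. Even-length palindromes $ww^R$ over {0,1} are assumed here for illustration.

```java
import java.util.*;

// Simulate the nondeterministic palindrome PDA for {w w^R : w in {0,1}*}.
// The NPDA guesses the middle; we try every midpoint (one "copy" each).
public class PalinPda {
    static boolean accepts(String input) {
        for (int mid = 0; mid <= input.length(); mid++) {
            Deque<Character> stack = new ArrayDeque<>();
            // phase 1: push symbols up to the guessed middle
            for (int i = 0; i < mid; i++) stack.push(input.charAt(i));
            // phase 2: pop and match the rest; a mismatch kills this copy
            boolean ok = true;
            for (int i = mid; i < input.length(); i++) {
                if (stack.isEmpty() || stack.pop() != input.charAt(i)) { ok = false; break; }
            }
            if (ok && stack.isEmpty()) return true; // accept by empty stack
        }
        return false;
    }
}
```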
 parsers generally behave like deterministic PDA
 DPDAs accept all regular languages, but not all context-free languages
 DPDA languages all have unambiguous grammars
ch 7  properties of CFLs (except pp. 295297)
 Chomsky Normal Form
 A>BC
 A>a
 no epsilon transitions
 for any variable that derives epsilon (ex. A *> epsilon)
 if B > CAD
 replace with B > CAD and B > CD, and remove all places where A could become epsilon
 no unit productions
 eliminate useless symbols
 works for any CFL
 Greibach Normal Form
 A>aw where a is terminal, w is string of 0 or more variables
 every derivation takes exactly n steps (n length)
 generating  if x produces some terminal string w
 reachable  X reachable if S ${\to}^*$ αXβ for some α, β
 CFL pumping lemma  pick two small strings to pump

If L is a CFL, there exists a constant n such that for every z in L with |z| ≥ n, we can break z into 5 strings z = uvwxy, such that:
 vx ≠ Ɛ
 |vwx| ≤ n, middle portion not too long
 For all i ≥ 0, $u v^i w x^i y \in$ L
 ex. ${0^n1^n}$
 often have to break it into cases
 proof uses Chomsky Normal Form
 not context free examples

{$0^n1^n2^n \mid n\geq1$}
{$0^i1^j2^i3^j \mid i\geq 1, j\geq 1$}
{ww $\mid$ w $\in \{0,1\}^*$}
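A worked case for the first example above:

```latex
Suppose $L = \{0^n 1^n 2^n \mid n \ge 1\}$ were a CFL with pumping constant $n$;
take $z = 0^n 1^n 2^n$. In any split $z = uvwxy$ with $|vwx| \le n$ and
$vx \ne \varepsilon$, the window $vwx$ spans at most two of the three blocks,
so pumping with $i = 2$ increases the counts of at most two of the symbols
$0, 1, 2$, leaving the three counts unequal: $u v^2 w x^2 y \notin L$.
This contradicts the lemma, so $L$ is not context-free.
```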

 closed under union, concatenation, closure, and positive closure, homomorphism, reversal, inverse homomorphism, substitutions
 intersection with a regular language (basically run in parallel)
 not closed under intersection, complement
 substitution  replace each letter of alphabet with a language
 s(a) = $L_a$
 if $w = ax$, $s(w) = L_a s(x)$
 if L CFL, s(L) CFL
 time complexities
 O(n)
 CFG to PDA
 PDA final state > empty stack
 PDA empty stack > final state
 PDA to CFG: O($n^3$) with size O($n^3$)
 conversion to CNF: O($n^2$) with size O($n^2$)
 emptiness of CFL: O(n)
 which symbols are reachable
 test membership with dynamic programming table  O(n^3)
 CYK algorithm
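The dynamic-programming membership test above (the CYK algorithm) can be sketched as follows. The CNF grammar here, generating {0^n 1^n : n ≥ 1}, is an assumed example: S > AC | AB, B > SC, A > 0, C > 1.

```java
// CYK membership test, O(n^3), for a fixed CNF grammar (assumed example):
//   S -> AC | AB,  B -> SC,  A -> 0,  C -> 1   (generates 0^n 1^n, n >= 1)
public class Cyk {
    // variable indices: 0 = S, 1 = A, 2 = B, 3 = C
    static boolean accepts(String w) {
        int n = w.length();
        if (n == 0) return false;
        boolean[][][] t = new boolean[n][n][4]; // t[i][j][V]: V derives w[i..j]
        for (int i = 0; i < n; i++) {           // terminal rules fill the diagonal
            if (w.charAt(i) == '0') t[i][i][1] = true; // A -> 0
            if (w.charAt(i) == '1') t[i][i][3] = true; // C -> 1
        }
        for (int len = 2; len <= n; len++)
            for (int i = 0; i + len - 1 < n; i++) {
                int j = i + len - 1;
                for (int k = i; k < j; k++) {   // split w[i..k] | w[k+1..j]
                    if (t[i][k][1] && t[k + 1][j][3]) t[i][j][0] = true; // S -> AC
                    if (t[i][k][1] && t[k + 1][j][2]) t[i][j][0] = true; // S -> AB
                    if (t[i][k][0] && t[k + 1][j][3]) t[i][j][2] = true; // B -> SC
                }
            }
        return t[0][n - 1][0]; // does S derive the whole string?
    }
}
```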
ch 8  intro to turing machines (except 8.5.3)
 Turing Machine def
 states
 start state
 final states
 input symbols
 tape symbols (includes input symbols)
 transition function $\delta(q,X)=(p,Y,D)$  new state, new tape symbol, direction
 B  blank symbol
 infinite blanks on either side
 arc has X/Y D with old/new tape symbols and direction
 if the TM enters an accepting state, it accepts
 assume it halts if it accepts
 we can think of Turing machine as having multiple tracks (symbol could represent a tuple like [X,Y])
 multitape TM has each head move independently, multitrack doesn’t
 common use one track for data, one track for mark
 running time  number of steps that TM makes
 NTM  nondeterministic Turing machine  accepts no languages not accepted by a deterministic TM
 halts if enters a state q, scanning X, and there is no move for (q,X)
 restrictions that don’t change things
 tape infinite only to right
 TM can’t print blank
 simplified machines
 two-stack machine  one stack keeps track of the tape to the left of the head, one to the right
 every recursively enumerable language is accepted by a two-counter machine
 TM can simulate computer, and time is some polynomial multiple of computer time (O(n^3))
 limit on how big a number the computer can store  in one instruction, a word can only grow by 1 bit
 LBA  linear bounded automaton  Turing machine with left and right end markers
 programs might take infinitely long before terminating  can’t be decided
 turing machine can take 2 inputs: program P and input I
 ID  instantaneous description
 write $X_1X_2…qX_iX_{i+1}…$ where q is scanning X_i
 suppose a program that takes a program as input and answers yes or no: does it print “h”?
 imagine that instead of printing “no” it prints “h”
 now feed it to itself
 if it would print “h”, it now prints “yes” instead  paradox! therefore such a machine can’t exist
 TM simulating computer
 tape that has memory
 tape with instruction counter
 tape with memory address
 reduction  we know X is undecidable  if solving Y implies solving X, then Y is undecidable
 if X reduces to Y, solving Y solves X
 define a total mapping from X to Y
 $X \leq _m Y$  X reduces to Y  mapping reduction, solving Y solves X
 intractable  take a very long time to solve (not polynomial)
 <> notation means bitstring representation
 $\langle n \rangle = 0^n$
 $\langle M,w \rangle$ means $w \in L(M)$
 KD  “known to be distinct”
 idempotent  R + R = R
ch 9  undecidability (9.1,9.2,9.3)
 does this TM accept (the code for) itself as input?
 enumerate binary strings  add a leading 1
 express TM as binary string
 give it a number
 TM uses this for each transition
 separate transitions with 11
 diagonalization language  set of strings w_i such that w_i is not in L(M_i)
 make table with M_i as rows, w_j as cols
 complement of the diagonal is the characteristic vector of $L_d$
 diagonal can’t be characteristic vector of any TM
 not RE
 recursive  complement is also recursive
 just switch accept and reject
 if language and complement are both RE, then L is recursive
 universal language  set of binary strings that encode a pair (M,w) where M is a TM, w $\in (0+1)^*$  set of strings representing a TM and an input accepted by that TM
 there is a universal Turing machine such that L_u = L(U)
 L_u is undecidable: RE but not recursive
 halting problem  RE but not recursive
 Rice’s Thm  all nontrivial properties of the RE languages are undecidable
 property of the RE languages is a set of RE languages
 property is trivial if it is either empty or is all RE languages
 empty property $\emptyset$ is different from the property of being an empty language {$\emptyset$}
 ex. “the language is contextfree, empty, finite, regular”
 however, properties of the machine itself, such as having 5 states, are decidable
Ch 10  10.1–10.4, know the additional problems that are NP-complete
 intractable  can’t be solved in polynomial time
 NPcomplete examples
 boolean satisfiability
 symbols ^, etc. are represented by themselves
 x_i is represented by x followed by i in binary
 Cook’s thm  SAT is NPcomplete
 show SAT in NP
 show all other NP reduce to SAT
 pf involves matrix of cell/ID facts
 cols are ID 0,1,…,p(n)
 rows are alpha_0,alpha_1,…alpha_p(n)
 for any problem’s machine M, there is polynomialtimeconverter for M that uses SAT decider to solve in polynomial time
 3SAT  easier to reduce things to
 AND of clauses each of which is the OR of exactly three variables or negated variables
 conjunctive normal form  if it is the AND of clauses
 conversion to cnf isn’t always polynomial time  don’t have to convert to an equivalent expression, the two just have to be satisfiable at the same time
 push all negatives down the expression tree  linear
 put it in cnf  demorgans, double negation
 literal  variable or a negated variable
 kconjunctive normal form  k is number of literals in clauses
 traveling salesman problem  find cycle of weight less than W
 O(m!)
 Independent Set  graph G and a lower bound k  yes if and only if G has an independent set of k nodes
 none of them are connected by an edge
 reduction from 3SAT
 nodecover problem
 node cover  a set of nodes such that every edge has at least one endpoint in the set
 Undirected Hamiltonian circuit problem
 TSP with all weights 1
 Directed HamiltonianCircuit Problem
 subset sum
 is there a subset of numbers that sums to a number
 boolean satisfiability
 reductions must be polynomialtime reductions
 P  solvable in polynomial by deterministic TM
 NP  solvable in polynomial time by nondeterministic TM
 NP-completeness (Karp-completeness)  a problem L is at least as hard as any problem in NP = for every language L’ in NP, there is a polynomial-time reduction of L’ to L
 Cook-completeness is equivalent to NP-completeness  if given a mechanism that in one unit of time would answer any question about membership of a string in L, it was possible to recognize any language in NP in polynomial time
 NPhard  we don’t know if L is in NP, but every problem in NP reduces to L in polynomial time
 if some NPcomplete problem p is in P then P=NP
 there are things between polynomial and exponential time (like $n^{\log n}$), and we group these in with the exponential category
 machines could run forever when they don’t accept
 could simply tell them to stop after a certain number of steps
 there are algorithms called verifiers
more on NP Completeness
 a language A is polynomial-time reducible to a language B if there is a polynomial-time computable function that maps one to the other
 to solve a problem, efficiently transform to another problem, and then use a solver for the other problem
 satisfiability problem  check if a boolean expression can be made true
 have to test every possible boolean value  2^n where n is number of variables
 this can be mapped to all problems of NP
 ex. traveling salesman can be reduced to satisfiability
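The 2^n brute force above can be sketched as follows; the literal encoding (+i for x_i, −i for ¬x_i) is an assumption for illustration.

```java
// Brute-force CNF-SAT: try all 2^n assignments.
// A formula is a list of clauses; each clause is a list of literals,
// where literal +i means x_i and -i means NOT x_i (variables 1..n).
public class BruteSat {
    static boolean satisfiable(int[][] clauses, int n) {
        for (int mask = 0; mask < (1 << n); mask++) { // each mask is one assignment
            boolean all = true;
            for (int[] clause : clauses) {
                boolean sat = false;
                for (int lit : clause) {
                    int v = Math.abs(lit);
                    boolean val = (mask >> (v - 1) & 1) == 1; // value of x_v
                    if (lit > 0 ? val : !val) { sat = true; break; }
                }
                if (!sat) { all = false; break; } // one false clause kills it
            }
            if (all) return true; // found a satisfying assignment
        }
        return false;
    }
}
```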
 P  set of all problems that can be solved in polynomial time
 NP  solved in polynomial time if we allow nondeterminism
 we count the time as the length of the shortest accepting path
 NPhard problem L’
 every L in NP reduces to L’ in polynomial time
 NPcomplete L’
 L’ is NPhard
 L’ is in NP
 ex. graph coloring
 partitioning into equal sums
 if one NPcomplete problem is in P, P=NP
 decider vs. optimizer
 decider tells whether it was solved or not
 if you keep asking it boolean questions it gives you the answer
 graph clique problem  given a graph and an integer k is there a subgraph in G that is a complete graph of size k
 this is reduction from boolean satisfiability
 graph 3colorability
 reduction from satisfiability  prove with or gate type structure
 approximation algorithms
 find minimum
 greedy  keep going down
 genetic algorithms  pretty bad
 minimum vertex cover problem  given a graph, find a minimum set of vertices such that each edge is incident to at least one of these vertices
 NPcomplete
 cannot be approximated within 1.36*solution
 can be approximated within 2*solution in linear time
 pick an edge, pick its endpoints
 put them in solution
 eliminate these points and their edges from the graph
 repeat
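The steps above can be sketched as one pass over the edge list: adding both endpoints of any still-uncovered edge implicitly eliminates all edges they touch.

```java
import java.util.*;

// Linear-time 2-approximation for minimum vertex cover: for each edge that is
// not yet covered, add BOTH endpoints; those endpoints form a matching, and
// any cover must use at least one endpoint per matched edge (hence <= 2x opt).
public class VertexCoverApprox {
    // edges given as pairs {u, v}
    static Set<Integer> cover(int[][] edges) {
        Set<Integer> c = new HashSet<>();
        for (int[] e : edges) {
            if (!c.contains(e[0]) && !c.contains(e[1])) { // edge still uncovered
                c.add(e[0]);
                c.add(e[1]);
            }
        }
        return c;
    }
}
```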
 maximum cut problem  given a graph, find a partition of the vertices maximizing the number of crossing edges
 cannot be approximated within 17/16*solution
 can be approximated within 2*solution
 if moving arbitrary node across partition will improve the cut, then do so
 repeat
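The local-search loop above can be sketched as follows; it terminates because each move strictly increases the cut, which is bounded by the number of edges.

```java
// Local-search 2-approximation for max cut: while moving some vertex to the
// other side of the partition increases the number of crossing edges, move it.
public class MaxCutLocal {
    static int maxCut(int n, int[][] edges) { // edges as pairs {u, v}
        boolean[] side = new boolean[n];      // start with everyone on one side
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int v = 0; v < n; v++) {
                int gain = 0;
                for (int[] e : edges) {
                    if (e[0] != v && e[1] != v) continue;
                    // a crossing edge loses 1 if v flips; a non-crossing one gains 1
                    gain += (side[e[0]] != side[e[1]]) ? -1 : 1;
                }
                if (gain > 0) { side[v] = !side[v]; improved = true; }
            }
        }
        int cut = 0;
        for (int[] e : edges) if (side[e[0]] != side[e[1]]) cut++;
        return cut;
    }
}
```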

Data Structures
lists
arrays and strings
 start by checking for null, length 0
 ascii is 128, extended is 256
queue  linked list
 has insert at back (enqueue) and remove from front (dequeue)
class Node { Node next; int val; public Node(int d) { val = d; } }
 finding a loop is tricky, use visited
 reverse a linked list
 requires 3 ptrs (one temporary to store next)
 return pointer to the new head (the old tail)
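The 3-pointer reversal can be sketched with a Node class like the one in these notes:

```java
// Reverse a singly linked list in place with three pointers.
public class LinkedListUtil {
    static class Node { Node next; int val; Node(int d) { val = d; } }

    static Node reverse(Node head) {
        Node prev = null, curr = head;
        while (curr != null) {
            Node next = curr.next; // temporary pointer to preserve the rest
            curr.next = prev;      // flip this link
            prev = curr;
            curr = next;
        }
        return prev; // new head (the old tail)
    }
}
```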
stack
class Stack { // assumes a Node variant whose data field is an Object
  Node top;
  Object pop() {
    if (top != null) {
      Object item = top.data;
      top = top.next;
      return item;
    }
    return null;
  }
  void push(Object item) {
    Node t = new Node(item);
    t.next = top;
    top = t;
  }
}
 sort a stack with 2 stacks
 make a new stack called ans
 pop from old
 while old element is > ans.peek(), old.push(ans.pop())
 then new.push(old element)
 stack with min  each el stores min of things below it
 queue with 2 stacks  keep popping everything off of one and putting them on the other
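The two-stack queue above can be sketched as follows (with `java.util.ArrayDeque` standing in for the hand-rolled Stack class):

```java
import java.util.*;

// Queue built from two stacks: push onto `in`; to dequeue, if `out` is empty,
// pour everything from `in` into `out`, which reverses the order to FIFO.
public class TwoStackQueue {
    private final Deque<Integer> in = new ArrayDeque<>();
    private final Deque<Integer> out = new ArrayDeque<>();

    void enqueue(int x) { in.push(x); }

    int dequeue() {
        if (out.isEmpty())
            while (!in.isEmpty()) out.push(in.pop()); // each element moves once: amortized O(1)
        return out.pop();
    }
}
```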
trees
 Balanced binary trees are generally logarithmic
 Root: a node with no parent; there can only be one root
 Leaf: a node with no children
 Siblings: two nodes with the same parent
 Height of a node: length of the longest path from that node to a leaf
 Thus, all leaves have height of zero
 Height of a tree: maximum depth of a node in that tree = height of the root
 Depth of a node: length of the path from the root to that node
 Path: sequence of nodes n1, n2, …, nk such that ni is parent of ni+1 for 1 ≤ i < k
 Length: number of edges in the path
 Internal path length: sum of the depths of all the nodes
 Binary Tree  every node has at most 2 children
 Binary Search Tree  Each node has a key value that can be compared
 Every node in left subtree has a key whose value is less than the root’s key value
 Every node in right subtree has a key whose value is greater than the root’s key value
void BST::insert(int x, BinaryNode * & curNode) {
  // we pass in by reference because we want a change in the method to actually
  // modify the parameter (the parameter is the curNode *)
  // left associative, so this is a reference to a pointer
  if (curNode == NULL)
    curNode = new BinaryNode(x, NULL, NULL);
  else if (x < curNode->element)
    insert(x, curNode->left);
  else if (x > curNode->element)
    insert(x, curNode->right);
}
 BST Remove
 if no children: remove node (reclaiming memory), set parent pointer to null
 one child: Adjust pointer of parent to point at child, and reclaim memory
 two children: successor is min of right subtree
 replace node with successor, then remove successor from tree
 worst-case depth = n−1 (this happens when the data is already sorted)
 maximum number of nodes in tree of height h is 2^(h+1)  1
 minimum height h ≥ log(n+1) − 1
 Perfect Binary tree  impractical because you need the perfect amount of nodes
 all leaves have the same depth
 number of leaves 2^h
AVL Tree
 For every node in the tree, the height of the left and right subtrees differs at most by 1
 guarantees log(n)
 balance factor := The height of the right subtree minus the height of the left subtree
 “Unbalanced” trees: a balance factor of +2 or −2
 AVL Insert  needs to update balance factors
 same sign > single rotation
 −2, −1 > needs right rotation
 −2, +1 > needs left then right
 Find: Θ(log n) time: height of tree is always Θ(log n)
 Insert: Θ(log n) time: find() takes Θ(log n), then may have to visit every node on the path back up to root to perform up to 2 single rotations
 Remove: Θ(log n): left as an exercise
 Print: Θ(n): no matter the data structure, it will still take n steps to print n elements
RedBlack Trees
 definition
 A node is either red or black
 The root is black
 All leaves are black (the leaves may be the NULL children)
 Both children of every red node are black (therefore, a black node is the only possible parent for a red node)
 Every simple path from a node to any descendant leaf contains the same number of black nodes
 properties
 The heights of the right and left subtrees can differ by at most a factor of 2 (so height stays O(log n))
 insert (Assume node is red and try to insert)
 The new node is the root node
 The new node’s parent is black
 Both the parent and uncle (aunt?) are red
 Parent is red, uncle is black, new node is the right child of parent
 Parent is red, uncle is black, new node is the left child of parent
 Removal
 Do a normal BST remove
 Find the next highest/lowest value, put its value in the node to be deleted, remove that highest/lowest node
 Note that that node won’t have 2 children!
 We replace the node to be deleted with its left child
 This child is N, its sibling is S, its parent is P
Splay Trees
 A selfbalancing tree that keeps “recently” used nodes close to the top
 This improves performance in some cases
 Great for caches
 Not good for uniform access
 Anytime you find / insert / delete a node, you splay the tree around that node
 Perform tree rotations to make that node the new root node
 Splaying is Θ(h) where h is the height of the tree
 At worst this is linear time  Θ(n)
 We say it runs in Θ(log n) amortized time  individual operations might take linear time, but other operations take almost constant time  averages out to logarithmic time
 m operations will take m*log(n) time
other trees
 to go through bst (without recursion) in order, use stacks
 push and go left
 if can’t go left, pop
 add new left nodes
 go right
 breadth-first traversal
 recursively print only at a particular level each time
 create pointers to nodes on the right
 balanced tree = no 2 nodes differ in height by more than 1
 (maxDepth − minDepth) <= 1
 “trie” comes from the middle of the word “retrieval”  a trie can find a word in a dictionary given only a prefix of the word
 root is empty string
 each node stores a character in the word
 if ends, full word
 need a way to tell if prefix is a word > each node stores a boolean isWord
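The trie described above can be sketched as:

```java
import java.util.*;

// Trie: each node holds children keyed by character, plus an isWord flag
// marking where a complete word ends (so prefixes aren't mistaken for words).
public class Trie {
    private final Map<Character, Trie> children = new HashMap<>();
    private boolean isWord;

    void insert(String word) {
        Trie node = this;
        for (char c : word.toCharArray())
            node = node.children.computeIfAbsent(c, k -> new Trie());
        node.isWord = true;
    }

    boolean contains(String word) {          // exact word lookup
        Trie node = find(word);
        return node != null && node.isWord;
    }

    boolean hasPrefix(String prefix) { return find(prefix) != null; }

    private Trie find(String s) {            // walk down one character at a time
        Trie node = this;
        for (char c : s.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return null;
        }
        return node;
    }
}
```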
heaps
 used for priority queue
 peek(): just look at the root node
 add(val): put it at correct spot, percolate up
 percolate  Repeatedly exchange node with its parent if needed
 expected run time: $\sum_{i=1}^{\infty} i \cdot \frac{1}{2^i} = 2$  constant on average
 pop(): put last leaf at root, percolate down
 Remove root (that is always the min!)
 Put “last” leaf node at root
 Repeatedly find smallest child and swap node with smallest child if needed.
 Priority Queue  a binary heap is the standard implementation of a priority queue
 insert
 inserts with a priority
 findMin
 finds the minimum element
 deleteMin
 finds, returns, and removes minimum element
 perfect binary tree  binary tree with all leaf nodes at the same depth; all internal nodes have 2 children
 height h: $2^{h+1}-1$ nodes, $2^h-1$ nonleaves, and $2^h$ leaves
 Full Binary Tree
 A binary tree in which each node has exactly zero or two children.
 Minheap  parent is min
 Heap Structure Property: A binary heap is an almost complete binary tree, which is a binary tree that is completely filled, with the possible exception of the bottom level, which is filled left to right.
 in an array  this is faster than pointers
 left child: 2*i
 right child: (2*i)+1
 parent: floor(i/2)
 pointers need more space, are slower
 multiplying, dividing by 2 are very fast
 Heap ordering property: For every nonroot node X, the key in the parent of X is less than (or equal to) the key in X. Thus, the tree is partially ordered.
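The percolate-up/percolate-down operations can be sketched as an array-based min-heap using the 1-based indexing from these notes (children at 2i and 2i+1, parent at i/2):

```java
import java.util.*;

// Array-based binary min-heap, 1-based indexing (index 0 is unused).
public class MinHeap {
    private final List<Integer> a = new ArrayList<>(List.of(0));

    void insert(int x) { // put at the end, percolate up
        a.add(x);
        int i = a.size() - 1;
        while (i > 1 && a.get(i / 2) > a.get(i)) { swap(i, i / 2); i /= 2; }
    }

    int deleteMin() { // move last leaf to root, percolate down
        int min = a.get(1);
        a.set(1, a.get(a.size() - 1));
        a.remove(a.size() - 1);
        int i = 1, n = a.size() - 1;
        while (2 * i <= n) {
            int child = 2 * i;                                   // left child
            if (child + 1 <= n && a.get(child + 1) < a.get(child)) child++; // smaller child
            if (a.get(i) <= a.get(child)) break;                 // heap order restored
            swap(i, child);
            i = child;
        }
        return min;
    }

    private void swap(int i, int j) { int t = a.get(i); a.set(i, a.get(j)); a.set(j, t); }
}
```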
 Compression
 Lossless compression: X = X’
 Lossy compression: X != X’
 Information is lost (irreversible)

Compression ratio: |X| / |Y|
where |X| is the number of bits (i.e., file size) of the original X and |Y| is the size of the compressed output Y

 Huffman coding
 Compression
 Determine frequencies
 Build a tree of prefix codes
 no code is a prefix of another code
 start with minheap, then keep putting trees together
 Write the prefix codes to the output
 reread source file and write prefix code to the output file
 Decompression
 read in prefix code  build tree
 read in one bit at a time and follow tree
 Compression
 ASCII characters  8 bits each; $2^7$ = 128 standard characters
 cost  total number of bits
 “straight cost”  bits / character = log2(numDistinctChars)
 Priority Queue Example
 insert (x)
 deleteMin()
 findMin()
 isEmpty()
 makeEmpty()
 size()
Hash Tables
 java: load factor = .75, default init capacity: 16, uses buckets
 string hash function: (s[0]·31^(n−1) + s[1]·31^(n−2) + … + s[n−1]) mod table_size, where n is the length
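A sketch of this polynomial hash (the same formula Java's `String.hashCode` uses), evaluated with Horner's rule:

```java
// Polynomial string hash: s[0]*31^(n-1) + ... + s[n-1], reduced mod table size.
public class StringHash {
    static int hash(String s, int tableSize) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);        // Horner's rule, overflow wraps like hashCode
        return Math.floorMod(h, tableSize);  // floorMod keeps the index non-negative
    }
}
```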
 Standard set of operations: find, insert, delete
 No ordering property!
 Thus, no findMin or findMax
 Hash tables store keyvalue pairs
 Each value has a specific key associated with it
 fixed size array of some size, usually a prime number
 A hash function takes in a “thing” (string, int, object, etc.)
 returns hash value  an unsigned integer value which is then mod’ed by the size of the hash table to yield a spot within the bounds of the hash table array
 Three required properties
 Must be deterministic
 Meaning it must return the same value each time for the same “thing”
 Must be fast
 Must be evenly distributed
 implies avoiding collisions
 Technically, only the first is required for correctness, but the other two are required for fast running times
 A perfect hash function has:
 No blanks (i.e., no empty cells)
 No collisions
 Lookup table is at best logarithmic
 We can’t just make a very large array  we assume the key space is too large
 you can’t just hash by social security number
 hash(s) = $\left(\sum_{i=0}^{k-1} s_i \cdot 37^i\right)$ mod table_size
 you would precompute the powers of 37
 collision  putting two things into same spot in hash table
 Two primary ways to resolve collisions:
 Separate Chaining (make each spot in the table a ‘bucket’ or a collection)
 Open Addressing, of which there are 3 types:
 Linear probing
 Quadratic probing
 Double hashing
 Two primary ways to resolve collisions:
 Separate Chaining
 each bucket contains a data structure (like a linked list)
 analysis of find
 The load factor, λ, of a hash table is the ratio of the number of elements divided by the table size
 For separate chaining, λ is the average number of elements in a bucket
 Average time on unsuccessful find: λ
 Average length of a list at hash(k)
 Average time on successful find: 1 + (λ/2)
 One node, plus half the average length of a list (not including the item)
 typical case will be constant time, but worst case is linear because everything hashes to same spot
 λ = 1
 Make hash table be the number of elements expected
 So average bucket size is 1
 Also make it a prime number
 λ = 0.75
 Java’s Hashtable default, but it can be set to another value
 Table will always be bigger than the number of elements
 This reduces the chance of a collision!
 Good tradeoff between memory use and running time
 λ = 0.5
 Uses more memory, but fewer collisions
 Open Addressing: The general idea with all of them is that, if a spot is occupied, to ‘probe’, or try, other spots in the table to use
 3 Types:
 General: pi(k) = (hash(k) + f(i)) mod table_size
1. Linear Probing: f(i) = i
 Check spots in this order :
 hash(k)
 hash(k)+1
 hash(k)+2
 hash(k)+3
 These are all mod table_size
 find  keep going until you find an empty cell (or get back)
 problems
 cannot have a load factor > 1; as you get close to 1, you get a lot of collisions
 clustering  large blocks of occupied cells
 “holes” when an element is removed
2. Quadratic probing: f(i) = i^2
 hash(k)
 hash(k)+1
 hash(k)+4
 hash(k)+9
 you move out of clusters much quicker
3. Double hashing: f(i) = i * hash2(k)
 hash2 is another hash function  typically the fastest
 problem: you can cycle over the same filled spots  happens when hash2(k) yields a factor of the table size
 solve by making table size prime
 hash(k) + 1 * hash2(k)
 hash(k) + 2 * hash2(k)
 hash(k) + 3 * hash2(k)
 a prime table size helps hash function be more evenly distributed
 problem: when the table gets too full, running time for operations increases
 solution: create a bigger table and hash all the items from the original table into the new table
 position is dependent on table size, which means we have to rehash each value
 this means we have to recompute the hash value for each element, and insert it into the new table!
 When to rehash?
 When half full (λ = 0.5)
 When mostly full (λ = 0.75)
 Java’s hashtable does this by default
 When an insertion fails
 Some other threshold
 Cost of rehashing
 Let’s assume that the hash function computation is constant
 We have to do n inserts, and if each key hashes to the same spot, then it will be a Θ($n^2$) operation!
 Although it is not likely to ever run that slow
 Removing
 You could rehash on delete
 You could put in a ‘placeholder’ or ‘sentinel’ value
 gets filled with these quickly
 perhaps rehash after a certain number of deletes
 hash functions
 MD5 is a good hash function (given a string or file contents)
 generates 128 bit hash
 when you download something, you download the MD5, your computer computes the MD5 and they are compared to make sure it downloaded correctly
 not reversible  when a file has more than 128 bits, it won’t be a 1-1 mapping
 you can lookup a MD5 hash in a rainbow table  gives you what the password probably is based on the MD5 hash
 SHA (secure Hash algorithm) is much more secure
 generates hash up to 512 bits

Graphical Models
big data
 marginal correlation  covariance matrix
 estimates are bad unless n ≫ d
 eigenvalues are not wellapproximated
 often enforce sparsity
 ex. threshold each value in the cov matrix (set to 0 unless greater than thresh)  this threshold can depend on different things
 can also use regularization to enforce sparsity
 POET doesn’t assume sparsity
 conditional correlation  inverse covariance matrix = precision matrix
1  bayesian networks

A and B are conditionally independent given C if $P(A,B \mid C) = P(A \mid C)\,P(B \mid C)$
bayesian networks intro
 represented by directed acyclic graph
 could get an expert to design Bayesian network
 otherwise, have to learn it from data
 each node is random variable
 weights as tables of conditional probabilities for all possibilities
 encodes conditional independence relationships
 compact representation of joint prob. distr. over the variables
 markov condition  given its parents, a node is conditionally independent of its nondescendants

therefore joint distr: $P(X_1 = x_1,\dots,X_n=x_n)=\prod_{i=1}^n P(X_i = x_i \mid Parents(X_i))$
 inference  using a Bayesian network to compute probabilities
 sometimes have unobserved variables
sampling
 exact inference is feasible in small networks, but takes a long time in large networks
 approximate inference techniques
 learning
 prior sampling
 draw N samples from a distribution S
 approximate posterior probability based on observed values
 ex. flip a weighted coin to find out what the probabilities are
 then move to child nodes and repeat

suppose we want to know P(D | ¬A)  sample the network N times, report the probability of D being true when A is false
 more samples is better

rejection sampling  if we want to know P(D | ¬A)  sample N times, throw out samples where A isn’t false
 return probability of D being true
 this is slow
 likelihood weighting  fix our evidence variables to their observed values, then simulate the network
 can’t just fix variables  distr. might be inconsistent
 instead we weight by probability of evidence given parents, then add to final prob
 for each observation
 if correct, Count = Count+(1*W)
 always, Total = Total+(1*W)
 return Count/Total
 this way we don’t have to throw out wrong samples
 doesn’t solve all problems  evidence only influences the choice of downstream variables
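The weighting loop above can be sketched on a hypothetical two-node network A → D; the probabilities P(A)=0.3, P(D|A)=0.8, P(D|¬A)=0.2 are made-up numbers for illustration. The evidence A=false is fixed rather than sampled, and each sample carries weight P(A=false)=0.7.

```java
import java.util.*;

// Likelihood weighting on an assumed two-node network A -> D.
// Query: P(D | A = false). Exact answer here is P(D | !A) = 0.2.
public class LikelihoodWeighting {
    static double estimatePDGivenNotA(int samples, long seed) {
        Random rng = new Random(seed);
        double count = 0, total = 0;
        for (int s = 0; s < samples; s++) {
            double w = 0.7;                     // weight = P(evidence A=false | parents)
            boolean d = rng.nextDouble() < 0.2; // sample D from P(D | A=false)
            if (d) count += w;                  // Count = Count + (1 * W)
            total += w;                         // Total = Total + (1 * W)
        }
        return count / total;
    }
}
```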
16 learning overview
 notation
 P*  true distribution
 $\hat{P}$  sample distribution of P
 $\tilde{P}$  estimated distribution P
 density estimation  construct a model $\tilde{M}$ such that $\tilde{P}$ is close to the generating distribution $P^*$
 this can be estimated with relative entropy distance = $\mathbf{E}_{X}\left[ \log\frac{P^*(X)}{\tilde{P}(X)} \right]$
 also $= -\mathbf{H}_{P^*}(X) - \mathbf{E}_X [\log \tilde{P}(X)]$
 intuitively measures extent of compression loss (in bits)
 can ignore first term because it is unaffected by the model

concentrate on expected log-likelihood $\ell(D \mid M) = \mathbf{E}_X [\log \tilde{P}(X)]$
 maximizes probability of data given the model
 maximizes prediction assuming we are given complete instances
 could design test suite of queries to evaluate performance on a range of queries
 classification
 can set loss function to classification error (0/1 loss)
 this doesn’t work well for multiclass labeling
 Hamming loss  counts number of variables Y in which pred differs from ground truth

conditional log-likelihood = $\mathbf{E}_{x,y \sim P}[\log \tilde{P}(y \mid x)]$  only measures likelihood with respect to the predicted y
 knowledge discovery
 far more critical to assess the confidence in a prediction
 the amount of data required to estimate parameters reliably grows linearly with the number of parameters, so that the amount of data required can grow exponentially with the network connectivity
 goodness of fit  how well does the learned distribution represent the real distribution?
17 parameter estimation

assume parametric model $P(x \mid \theta)$  a sufficient statistic can be used to calculate likelihood
18 structure learning
 structure learning  learning the structure (e.g. connections) in the model
20 learning undirected models
 a potential function is a nonnegative function
 values with higher potential are more probable
 can maximize entropy in order to impose as little structure as possible while satisfying constraints

Graphs
 Edges are of the form (v1, v2)
 Can be ordered pair or unordered pair
 Definitions
 A weight or cost can be associated with each edge  this is determined by the application
 w is adjacent to v iff (v, w) $\in$ E
 path: sequence of vertices w1, w2, w3, …, wn such that (wi, wi+1) ∈ E for 1 ≤ i < n
 length of a path: number of edges in the path
 simple path: all vertices are distinct
 cycle:
 directed graph: path of length $\geq$ 1 such that w1 = wn
 undirected graph: same, except all edges are distinct
 connected: there is a path from every vertex to every other vertex
 loop: (v, v) $\in$ E
 complete graph: there is an edge between every pair of vertices
 digraph
 directed acyclic graph: no cycles; often called a “DAG”
 strongly connected: there is a path from every vertex to every other vertex
 weakly connected: the underlying undirected graph is connected
 For Google Maps, an adjacency matrix would be infeasible  almost all zeros (sparse)
 an adjacency list would work much better
 an adjacency matrix would work for airline routes
 detect cycle
 dfs from every vertex and keep track of visited, if repeat then cycle
 Topological Sort
 Given a directed acyclic graph, construct an ordering of the vertices such that if there is a path from vi to vj, then vj appears after vi in the ordering
 The result is a linear list of vertices
 indegree of v: number of edges (u, v) – meaning the number of incoming edges
 Algorithm
 start with something of indegree 0
 take it out, and take out the edges that start from it
 keep doing this as we take out more and more edges
 can have multiple possible topological sorts
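The indegree-based steps above are Kahn's algorithm; a minimal C sketch, where the adjacency-matrix representation, the size `N`, and the name `topo_sort` are my own choices:

```c
#define N 4

/* adj[u][v] = 1 means there is an edge u -> v */
int topo_sort(const int adj[N][N], int order[N]) {
    int indeg[N] = {0}, done[N] = {0}, k = 0;
    for (int u = 0; u < N; u++)
        for (int v = 0; v < N; v++)
            indeg[v] += adj[u][v];          /* count incoming edges */
    while (k < N) {
        int u = -1;
        /* start with something of indegree 0 that hasn't been taken out yet */
        for (int i = 0; i < N; i++)
            if (!done[i] && indeg[i] == 0) { u = i; break; }
        if (u < 0) return 0;                /* no such vertex: the graph has a cycle */
        order[k++] = u;
        done[u] = 1;
        /* take out the edges that start from it */
        for (int v = 0; v < N; v++)
            if (adj[u][v]) indeg[v] -= 1;
    }
    return 1;                               /* order[] is now a topological sort */
}
```

Ties are broken by smallest vertex number here, which is why multiple topological sorts can exist but this sketch returns a deterministic one.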
 Shortest Path
single-source  start somewhere, get shortest path to everywhere
 unweighted shortest path  breadth first search
 Weighted Shortest Path
 We assume no negative weight edges
Dijkstra’s algorithm: uses similar ideas as the unweighted case
 Greedy algorithms: do what seems to be best at every decision point
Dijkstra: O(v^2)
 Initialize each vertex’s distance as infinity
 Start at a given vertex s
 Update s’s distance to be 0
 Repeat
 Pick the next unknown vertex with the shortest distance to be the next v
 If no more vertices are unknown, terminate loop
 Mark v as known
 For each edge from v to adjacent unknown vertices w
 If the total distance to w is less than the current distance to w
 Update w’s distance and the path to w
It picks the unvisited vertex with the lowest distance, calculates the distance through it to each unvisited neighbor, and updates the neighbor’s distance if smaller. Mark it visited (set to red) when done with its neighbors.
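The loop described above as a C sketch on a small adjacency matrix; `NV`, `INF`, and the encoding (weight 0 = no edge) are assumptions of mine:

```c
#define NV 5
#define INF 1000000000

/* dist[i] gets the shortest distance from s to i (INF if unreachable);
   w[u][v] > 0 is the weight of edge u -> v, 0 means no edge */
void dijkstra(const int w[NV][NV], int s, int dist[NV]) {
    int known[NV] = {0};
    for (int i = 0; i < NV; i++) dist[i] = INF;  /* initialize each distance as infinity */
    dist[s] = 0;                                 /* update s's distance to be 0 */
    for (int iter = 0; iter < NV; iter++) {
        int v = -1;
        /* pick the next unknown vertex with the shortest distance */
        for (int i = 0; i < NV; i++)
            if (!known[i] && (v < 0 || dist[i] < dist[v])) v = i;
        if (v < 0 || dist[v] == INF) break;      /* no more reachable unknown vertices */
        known[v] = 1;                            /* mark v as known */
        /* for each edge from v, update the neighbor's distance if smaller */
        for (int u = 0; u < NV; u++)
            if (w[v][u] && dist[v] + w[v][u] < dist[u])
                dist[u] = dist[v] + w[v][u];
    }
}
```

The linear scan for the minimum is what makes this O(v^2); a heap-based version gets Θ(e log v).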
 Shortest path from a start node to a finish node

 We can just run Dijkstra until we get to the finish node

 Have different kinds of nodes
 Assume you are starting on a “side road”
 Transition to a “main road”
 Transition to a “highway”
 Get as close as you can to your destination via the highway system
 Transition to a “main road”, and get as close as you can to your destination
 Transition to a “side road”, and go to your destination
 Have different kinds of nodes

 Traveling Salesman
 Given a number of cities and the costs of traveling from any city to any other city, what is the least-cost round-trip route that visits each city exactly once and then returns to the starting city?
 Hamiltonian path: a path in a connected graph that visits each vertex exactly once
 Hamiltonian cycle: a Hamiltonian path that ends where it started
 The traveling salesperson problem is thus to find the least weight Hamiltonian path (cycle) in a connected, weighted graph
 Minimum Spanning Tree
 Want fully connected
 Want to minimize number of links used
 We won’t have cycles
 Any solution is a tree
 Slow algorithm: Construct a spanning tree:
 Start with the graph
 Remove an edge from each cycle
 What remains has the same set of vertices but is a tree
 Spanning Trees
 Minimal-weight spanning tree: spanning tree with the minimal total weight
 Generic Minimum Spanning Tree Algorithm
 KnownVertices ← {}
 while KnownVertices does not form a spanning tree, loop:
 find edge (u,v) that is “safe” for KnownVertices
 KnownVertices ← KnownVertices ∪ {(u,v)}
 end loop
 Prim’s algorithm
 Idea: Grow a tree by adding an edge to the “known” vertices from the “unknown” vertices. Pick the edge with the smallest weight.
 Pick one node as the root,
 Incrementally add edges that connect a “new” vertex to the tree.
 Pick the edge (u,v) where:
 u is in the tree, v is not, AND
 where the edge weight is the smallest of all edges (where u is in the tree and v is not)
 Running time: Same as Dijkstra’s: Θ(e log v)
 Kruskal’s algorithm
 Idea: Grow a forest out of edges that do not create a cycle. Pick an edge with the smallest weight.
 When optimized, it has the same running time as Prim’s and Dijkstra’s: Θ(e log v)
 unoptimized: O(v^2)

Computer Architecture
 units
 numbers
 x86
 intro to C  we use ANSI standard
 compile steps
 strings
 memory in C
 call stack
 types
 boolean operators
 AT&T x86 assembly
 y86  all we use
 hardware
 executing instructions
 pipelining
 memory
 optimization
 exceptions
 processes
 threads
 system calls
 software exceptions
 signals, setjmps
 virtual memory
 overview
 segments
 quiz rvw
 labs
 reading
units
 we will use only the i versions (don’t have to write i):
 K  10^3: Ki  1024
 M  10^6: Mi  1024^2
 G  10^9: Gi  1024^3
 convert to these: 2^27 = 128M
 log(8K)=13
 hardware is parallel by default
 amdahl’s law: tells you how much of a speedup you get
 S = 1 / ((1 − a) + a/k)
 a = portion optimized, k = level of parallelization, S = total speedup
 if you really want performance increase in java, allocate a very large array, then keep track of it on your own
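Amdahl's law above as a one-liner (the function name `amdahl` is my own): a is the fraction of the work optimized, k the speedup on that fraction.

```c
/* Amdahl's law: S = 1 / ((1 - a) + a/k)
   a = portion optimized, k = speedup of that portion, returns total speedup S */
double amdahl(double a, double k) {
    return 1.0 / ((1.0 - a) + a / k);
}
```

Note that even with k → ∞, the speedup is capped at 1/(1 − a): the unoptimized portion dominates.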
numbers
 0x means hexadecimal, a leading 0 means octal
 bit  stores 0 or 1
 byte  8 bits  2 hex digits
 integer  almost always 32 bits
 “Big-endian”: most significant byte first (lowest address)  how we think
 1000 0000 0000 0000 = 2^15 = 32768
 “Little-endian”: most significant byte last (highest address)  this is what computers do
 1000 0000 0000 0000 = 2^0 = 1
 note that although all the bits are reversed, usually it is displayed with just the bytes reversed
 consider 0xdeadbeef: on a big-endian machine that’s 0xdeadbeef; on a little-endian machine that’s 0xefbeadde
 0xdeadbeef is used as a memory allocation pattern by some OSes
 representing integers
 sign and magnitude  first bit specifies the sign
 one’s complement  encode using n-1 bits, flip all bits if negative
 two’s complement  encode using n-1 bits, flip if negative, add 1
 only one representation for 0
 maximum: 2^(n-1) - 1
 minimum: -2^(n-1)
 to negate: flip everything to the left of the rightmost 1
 floating point  like scientific notation: 3.24 * 10^6
 mantissa  3.24  between 1 and the base (10)
 for binary, the mantissa must be between 1 and 2; we assume the base is 2
 32 bits are split as follows:
 bit 1: sign bit, 1 means negative (1 bit)
 bits 2-9: exponent (8 bits)
 exponent values  0: zeros; 1-254: exponent-127; 255: infinities, overflow, underflow, NaN
 bits 10-32: mantissa (23 bits)
 mantissa = 1.0 + ∑(i=1..23) b_i/2^i  we don’t encode the leading 1. because it has to be there
 value = (1 − 2*sign) * mantissa * 2^(exponent−127)
 the largest float has:
 0 as the sign bit (it’s positive)
 254 as the exponent (1111 1110)  255 is reserved for infinities and overflows  that exponent means 254−127 = 127
 all 1’s for the mantissa, which yields almost 2
 2 * 2^127 = 2^128 ≈ 3.402823 * 10^38  actually a little bit lower
 minimum positive (normalized): 1 * 2^−126 ≈ 1.175494 x 10^−38  this is exact
 floating point numbers are not spatially uniform  depending on the exponent, the difference between two successive numbers is not the same
 union  converts from one data type to another  when you write one field, it overrides the other

```cpp
// reinterpret a float's bytes: writing f and reading x sees the same bits
union foo {
    float f;
    int *x;
} bar;

int main() {
    bar.f = 42.125;
    cout << bar.x << endl; // prints 0x42288000 - the bit pattern of 42.125f
    // if you were to dereference bar.x, bad things would happen
}
```

 never compare floating point numbers with ==  even if they print the same, they might be stored differently internally
 any fraction that doesn’t have a power of 2 as the denominator will be repeating in binary

```cpp
// C++ (need to #include <math.h> and compile with -lm)
#define EPSILON 0.000001
bool foo = fabs(a - b) < EPSILON;
```

 you could use a rational type or use more digits
 64-bit doubles: 11 exponent bits, offset = 1023
 Cowlishaw encoding: stores decimal digits directly in bits  less space-efficient than pure binary
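The two's complement negation rule above (flip all the bits, add 1) in C; the name `negate` is my own:

```c
#include <stdint.h>

/* two's complement negation: flip all the bits, then add 1
   (done in unsigned arithmetic to avoid signed-overflow UB) */
int32_t negate(int32_t x) {
    return (int32_t)(~(uint32_t)x + 1u);
}
```

This also shows why there is only one zero: ~0 + 1 wraps back around to 0.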
x86
 Assembly Language  assembler translates text into machine code
 x86 is the type of chip
 8 registers, although you can’t use 2 (stack pointer and base pointer)
 they are all 32-bit registers; 1 byte = 8 bits
 declare variables with 3 things: identifier, how big it is, and value  doesn’t give you a type
 ? means uninitialized
 x DD 1, 2, 3  declares x with 3 4-byte integers
 y TIMES 8 DB 0  declares 8 bytes all with value 0
 nasm assumes you are using 32 bits
 mov dest, src  more like copy
 where dest and src can be: a register, a constant, a variable name, or a pointer: [ebx]
 always put square brackets around a variable
 an address can ADD up to two registers, add one constant, and premultiply ONE register by 2, 4, or 8
 the destination cannot be a constant (would overwrite the constant)
 you cannot access memory twice in one instruction  not enough time to do that at clock speed
 stack starts at the end of memory and goes backwards  when you push onto it, the new item is actually at a lower address
 ESP points to the most recently pushed item
 push  first decrements ESP (stack pointer) by 4 (stack grows down), then moves the operand onto the stack (4 bytes  we make this assumption, not always true)
 pop  pops the top element of the stack to memory or a register, then increments the stack pointer (ESP) by 4  the value is written to the parameter
 commands
 0fH  the H at the end specifies a hex number
 lea  load effective address  like & (get the address of)
 places the address of the second parameter into the first parameter
 faster than arithmetic because you can do things in a single command
 add, sub  a += b
 inc, dec  a++
 imul  a *= b
 idiv  use shift if possible; have to load a,b into one 64-bit integer
 and, xor  bitwise
 cmp  compare two things
 je  jump when equal  specify where you are going to jump to
 others: jne, jz, jg, jge, jl, jle, js
 call  a subroutine may not know how many parameters are passed to it  thus, the 1st arg must be at ebp+8 and the rest are pushed above it
 every subroutine call puts the return address and an ebp backup on the stack
 Activation Records
 every time a subroutine is called, a number of things are pushed onto the stack: registers, parameters, old base/stack pointers, local variables, return address
 the callee also pushes callee-saved registers
 typically the stack stops around 100-200 Megabytes, although this can be changed
 Memory  there are two types of memory that need to be handled:
 dynamic memory (via new, malloc(), etc.)  stored on the heap
 static memory (on the stack)  where the activation records are kept
 the binary program starts at the beginning of the 2^32 = 4 GB of memory
 the heap starts right after this
 the stack starts at the end of this 4 GB of memory, and grows backward
 if they meet, you run out of memory
 Buffer Overflow

```c
void security_hole() {
    char buffer[12];
    scanf("%s", buffer); // how C handles input - no bounds check
}
```

 the stack looks like this, with sizes in parentheses and addresses increasing to the right (the stack grows to the left):

```
esi (4) | edi (4) | buffer (12) | ebp (4) | ret addr (4)
```

 what happens if the value stored into buffer is 13 bytes long?  we overwrite one byte of ebp
 what happens if it is 16 bytes long?  we completely overwrite ebp
 what if it is exactly 20 bytes long?  we overwrite the return address!
 Buffer Overflow Attack
 when you read in a string (etc.) that goes beyond the size of the buffer, you can overwrite the return address and set it to your own code
 for example, code that is included later on in the string  overwrite ebp, overwrite ret addr with the beginning of the malicious code
 we are using nasm as our assembler for the x86 labs
 assembly looks different when you use the compiler
 in C, you can only have one function with a given name  C translates more cleanly into assembly
 optimization rearranges stuff to lessen memory access
 _Z3maxii  the ii is the parameter list (two integers)  C++ name mangling
 in little-endian, the entire 32-bit word and its 8-bit least significant byte have the same address  this makes casting very easy
 RISC  reduced instruction set computer
 fewer and simpler instructions (maybe 50 or so)
 less chip complexity means they can run fast
 CISC  complex instruction set computer
 more and more complex instructions (300-400 or so)
 more chip complexity means harder to make run fast
 Caller
 parameters: pushed on the stack
 registers: saved on the stack
 eax, ecx, edx can be modified; ebx, edi, esi shouldn’t be
 call  places the return address on the stack
 local variables: placed in memory on the stack
 return value: eax
 Callee: the function which is called by another function

```
push ebp
mov ebp, esp
sub esp, 4            ; allocate local variables
push ebx              ; back up callee-saved register before using it
mov dword [ebp-4], 1  ; load 1 into the local variable
; ... body ...
pop ebx               ; restore ebx
add esp, 4            ; deallocate the local var
pop ebp
ret
```
intro to C  we use ANSI standard
 all C is valid C++
 // comments don’t work in ANSI C
 always use += 1, not ++
 compile with gcc -ansi -pedantic -Wall -Werror program.c
 -Werror will stop the program from compiling on any warning
 all variables have to be declared at the top of a block

```c
int main(int argc, char *argv[]) {
    int x = 1;
    int y = 34;
    int z = y * y / x;
    x = 13;
    int w = 1; /* <- this will not work: declaration after a statement */
label_name:
    printf("omg!");
    goto label_name; /* this goes to the label_name line - don't do this,
                        but assembly only has this */
    return 0;
}
```

 printf(const char *format, ...)
 printf("%d %f %g %s\n", ...);  /* int, double (these must be explicitly doubles), double (as small as possible), string */
compile steps
source (text) > preprocessor > modified source (text) > compiler > assembly (text) > assembler > binary program > linker > executable
 preprocessing  deals with hashtags  sets up line numbers for errors, includes external definitions, normal defines (.i)
 compile  turns it into assembly (.s)
 this assembly has commands with destination, src
 assemble  turns assembly into binary with a lot of wrappers (.o)
 link  makes the file executable, gets the code from includes together (a.out)
strings
 char  number that fits in 1 byte
 string is an array of chars: char*
 all strings end with null character \0, bad security
 length of string doesn’t include null character
 ex. the string “hello” stored at addresses 10-15:  h e l l o \0
memory in C
 byte is smallest accessible memory unit  2 hex digits (ex: 0x3a)
```
bits  name
1     bit
4     nyble
8     byte
16    word
32    double word, long word
64    quad word
```
 theoretically would work:

```
void *p = 3;   // address 3
*p = 0x3a;     // value 3a
p[0] = 0x3a;   // value 3a
p[3] is the same as *(p+3)
```
 can even use negative addresses (at the end it wraps around  overflows)
in practice:
 int *p  sizeof(int) == 4; the pointer itself is just one address  it points to the first of four consecutive bytes that form an int
 indexing this pointer will tell you how much to offset memory by
 the address must be 4-byte aligned (means it will be a multiple of 4)
 littleendian  least significant byte first
 bigendian  most significant byte first (normal)  networks use this
 we will use littleendian, this is what most computers use
 low addresses are unused  will throw error if accessed
 top has the os  will throw error if accessed
 contains heap, stack, code globals, shared stuff
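The byte reversal in the little-/big-endian bullets above can be made concrete with a byte swap; `bswap32` is a name of my own:

```c
#include <stdint.h>

/* reverse the byte order of a 32-bit word: converts between the
   big-endian and little-endian views of the same value */
uint32_t bswap32(uint32_t x) {
    return (x >> 24)                 /* top byte to the bottom    */
         | ((x >> 8) & 0x0000ff00u)  /* byte 2 down one slot      */
         | ((x << 8) & 0x00ff0000u)  /* byte 1 up one slot        */
         | (x << 24);                /* bottom byte to the top    */
}
```

This reproduces the document's 0xdeadbeef ↔ 0xefbeadde example.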
call stack
 return address
 local variables
 backups
 top pointer
 base pointer
 next pointer
 parameters (sometimes)
 return values (sometimes)
one frame  between the lines is everything the current method has  largest addresses at top, grows downwards
 parameters (backwards)
 ret address
 base pointer
 previous stack base
 saved stuff
 locals
 top pointer (actually at the bottom)
 return value
 in practice, most parameters and return values are put in registers
types
 2’s complement: positives are normal; a negative is its magnitude subtracted from one more than the biggest n-bit number (i.e. from 2^n)
 flip everything to the left of the rightmost one
 math is exactly the same, discard extra bits
```
type       signed?  bits
char       ?        8
short      signed   16 (usually)
int        signed   16 or 32
long       signed   >= 32, >= int
long long  signed   64
```
 everything can have an unsigned / signed in front of the type
 C will cast things without throwing error
boolean operators
 9 = 1001 = 1.001 x 2^3
 x && y → {0,1}
 x & y → {0,…,2^32−1} (bit operations)
 ^ is xor
 !x  0 → 1, anything else → 0
 and is often called masking because only the bits with 1s go through
 shifting will pad with 0s
 (1 << 3) − 1 gives us three 1s
 ~((1 << 3) − 1) gives us 111…111000
 x & ~((1 << 3) − 1) gives us x’s bits with the last 3 zeroed: x31x30…000

 arithmetic right shift copies the msb (sign extension)
 then we can or it in order to change those last 3 bits
 ternary operator  a ? b : c means if (a) b; else c
 bitwise version: (a & b) | (~a & c)  only works if all of a’s bits are the same (a = 00…0 or a = 11…1)
 !x is 1 if x == 0; !!x is 1 if x != 0  so we want a = !!x, sign-extended to all 1s
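A sketch of the masking and bitwise-select tricks above; the name `select_bits` is mine, and the mask argument must be all 0s or all 1s:

```c
/* bitwise select: returns b if every bit of m is 1, c if every bit is 0;
   m = -!!x turns any x into such a mask (0 stays 0, nonzero becomes all 1s) */
int select_bits(int m, int b, int c) {
    return (m & b) | (~m & c);
}
```

The mask-building identities from the notes, (1 << 3) - 1 == 0b111 and its complement, are checked alongside it.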
AT&T x86 assembly
 there are 2 hardwares
 x86-64 (desktop market)
 Arm7 (mobile market)  all instructions are like cmov
 think → not =  ex. mov $3, %rax moves 3 into rax (source first, then destination)
 prefixes
 $34  $ is an immediate (literal) value
 %rax  the contents of register rax
 main  label (no prefix)  assembler turns this into an immediate
 3330(%rax,%rdi,8)  memory at (3330+rax+8*rdi)  in x86 but not y86
 you could do 23(%rax)
 gcc -S will give you a .s file w/ assembly
 what would actually be compiled
 gcc -c will give you an object file
 then, objdump -d will disassemble the object file and print its assembly
 leaves out suffixes that tell you sizes
 can’t be compiled, but is informative
 gcc -O3 will be fast, default is -O0, -Os optimizes for size
 we call the part of x86 we use y86
 registers
 general purpose registers (program registers)
 PC  program counter  cannot directly change this  what line to run next
 CC  condition codes  sign of last math op result
 remembers whether the answer was 0, −, or +
 cmp  basically subtraction (works backwards, 2nd − 1st), but only sets the condition codes
 in assembly, arithmetic works as +=
 ex. imul %rdi, %rax multiplies and stores the result in rax (the second operand)
 doesn’t really matter: eax, rax are same register but eax is bottom half of rax
 on some hardwares eax is faster than rax
 call example
 PC=0x78 callq 0x56
 PC=0x7d next command (because callq is 5 bytes long, it could be different)
 puts 0x7d on the stack to return to later
 the stack pointer (0x100) is decremented by the number of bytes in an address (8)
 this value (0x0f8) is put into rsp (in C the return address is always on the stack)
 rsp stores address of the top of the stack
 PC becomes 56
 call
 subq $8, %rsp
 movq (next PC), (%rsp)  ~PC can’t actually be accessed directly
 jmp $0x56
 ret
 movq (%rsp), (PC)  ==same as== jmp *(%rsp)
 addq $8, %rsp
 push
 subq $8, %rsp
 movq _, (%rsp)
 pop does the opposite
 movq (%rsp), _
 addq $8, %rsp
 cmp sets the condition codes that conditional moves look at
 cmovle %rdi, %rax  move only if the condition codes say ≤
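The push/pop bookkeeping above can be modeled in C; a toy sketch, assuming a tiny 256-byte memory, with the names `mem`, `rsp`, `push`, and `pop` my own:

```c
#include <stdint.h>

/* toy model of the stack: rsp moves down on push, up on pop */
#define MEM_SIZE 256
static uint8_t mem[MEM_SIZE];
static uint64_t rsp = MEM_SIZE;   /* the stack starts at the high end of memory */

void push(uint64_t v) {
    rsp -= 8;                     /* decrement first: the stack grows down */
    for (int i = 0; i < 8; i++)
        mem[rsp + i] = (uint8_t)(v >> (8 * i));   /* store little-endian */
}

uint64_t pop(void) {
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)mem[rsp + i] << (8 * i);   /* read little-endian */
    rsp += 8;                     /* increment after reading */
    return v;
}
```

After matched push/pop pairs, rsp ends back where it started, which is exactly the invariant the calling convention relies on.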
y86  all we use
 halt  stops the chip
 nop  do nothing
 op_q
 addq, subq, andq, xorq
 takes 2 registers, stores values in second
 sub is 2nd1st
 jxx
 jl, jle, jg, je, jne, jmp
 takes immediate
 movq longer PC increment because it stores offset, register always comes first (is rA)
 rrmovq (registerregister, same as cmov where condition is always)
 irmovq (immediateregister)
 rmmovq
 mrmovq (memory)
 cmovxx (registerregister)
 call
 takes immediate
 pushes the return address on the stack and jumps to the destination address.
 ret
 pops a value and jumps to it
 pushq
 one register
 decrements %rsp by 8, then stores the register at the new top
 popq
 one register
 programmervisible state
 registers
 program
 rax-r14 (15 registers; x86-64 has 16)
 rsp is the special one
 64 bit integer (or pointer)  there is no floating point, in x86 floating point is stored in other registers
 other
 program counter (PC), instruction pointer
 64bit pointer
 condition codes (CC)  not important, tell us <, =, > on last operation
 only set by the op_q commands
 program
 memory  all one byte array
 instruction
 data
 registers
 encoding (assembly > bytes)

 1st byte  high-order nybble is the opcode (add, sub, …)
 low-order nybble is either the instruction function or a flag (le, g, …)  usually 0
 remaining bytes
 argument in littleendian order
 examples
 call $0x123 > 80 23 01 00 00 00 00 00 00
 ret > 90
 subq %rcx, %r11 > 61 1b (opq is opcode 6, subq is function 1; there are 15 registers, specify a register with one nybble)
 irmov $0x3330, %rdi > 30 f7 30 33 00 00 00 00 00 00 (registerfirst always, f means no source, but destination of register 7)
 compact binary (variablelength encoding) vs. simple binary (fixedlength encoding)
 x86 vs. ARM
 people can’t decide
 compact binary  complex instruction set computer (cisc)  emphasizes programmer
 simple binary  reduced instruction set computer (risc)  emphasizes hardware
 have more complex compilers
 fixed width instructions
 lots of registers
 few memory addressing modes  no memory operands (only mrmov, rmmov)
 few opcodes
 passes parameters in registers, not the stack (usually)
 no condition codes (uses condition operands)
 in general, computers use compact and tablets/phones use simple
 if we can get away from x86 backwards compatibility, we will probably meet in the middle
 study RISC vs. CISC

hardware
 flows when control is high
 power  everything loses power by creating heat (every gate consumes power)
 changing a transistor takes more power than leaving it
 voltage  threshold above which transistor is open
 register  on rising clock edge store input
 overclock computer  could work, or logic takes too long to get back  things break
 could be fixed with colder, more power
 chips are small because of how fast they are
 mux  selectors pick which input goes through
 out = [ guard:value; … ];
 this is a language called HCL written by the book’s authors
 out = g1 ? in1 : g2 ? in2 : … : 0
 if first is true return that, otherwise keep going otherwise return 0
executing instructions
 we have wires
 register file
 data memory
 instruction memory
 status output  3bits
 ex. popq %rbx
 todo: get icode, check if it was pop, read mem at rsp, put value in rbx, inc rsp by 8
 getting icode
 instruction in instruction memory: B0 3F
 register pP (p is inputs), (P is outputs)
 pP { pc:64 = 0; }  stores the next pc
 pc ← P_pc  the fixed functionality will produce i10bytes (the next 10 instruction bytes)
 textbook: icode:ifun ← M_1[PC]  gets one byte at PC
 HCL (HCL uses =): wire icode:4; icode = i10bytes[4..8]  little-endian values, one byte at a time  this grabs the B from B0 3F
 assume icode was b (in reality, this must be picked with a mux)

```
valA <- R[4]         # gets %rsp (rsp is register 4)
rA:rB <- M_1[PC+1]   # book's notation: splits a byte into 2 nybbles; PC+1 because
                     # we want the second byte - 3 is loaded into rA, F into rB
valE <- valA + 8     # inc rsp by 8
valM <- M_8[valA]    # read 8 bytes of memory at the old %rsp
R[rA] <- valM        # writes to %rbx
R[4] <- valE         # writes to %rsp
p_pc = P_pc + 2      # increment PC by 2 because popq is a 2-byte instruction
```
 steps
 fetch  what is wanted
 decode  find what to do it to  read prog registers
 execute and/or memory  do it
 write back  tell you the result
 example trace:

```
020: 10                      nop
  fetch:      change pc to 0x021
021: 6303                    xorq %rax,%rbx
  fetch:      pc <- 0x023
  decode:     read reg. file at 0,3 to get 17 and 11
  execute:    17^11 = 10001^01011 = 11010 = 26, also sets CC to >0
  write back: write 26 to regFile[3]
023: 50 23 1000000000000000  mrmovq 16(%rbx),%rcx
  # the immediate 16 is little-endian but 8 bytes: first in memory is last in
  # the number
  fetch:      read bytes, understand, PC <- 0x02d
  decode:     read regFile to get 26, 13
  execute:    16+26 = 42 (to be the new address)
  memory:     ask RAM for address 42, it says 0x0000000071000000 (little-endian)
  write back: put 0x71000000 into regFile[2]
02d: 71 0000000032651131     jle 0x3111653200000000
  fetch:      valP <- 0x036
  decode:     CC > 0, not jump, PC <- valP
036: 00                      halt
  # set STAT to HALT and the computer shuts off; STAT is always going on in the
  # background
```

 push
 reads source register  reads rsp  dec rsp by 8  writes read value to new rsp address
 hardware wires  opq example

```
# fetch
# 1. set pc
# 1.a. make a register to store the next PC
register qS {
    pc : 64 = 0;
    lt : 1 = 0;
    eq : 1 = 0;
    gt : 1 = 0;
}
# 2. read i10bytes
pc = S_pc;
# 3. parse out pieces of i10bytes
wire icode:4, ifun:4, rA:4, rB:4;
icode = i10bytes[4..8];   # 1st byte: 0..8; high-order nibble is 4..8
ifun = i10bytes[0..4];
const OPQ = 6;
const NO_REGISTER = 0xf;
rA = [ icode == OPQ : i10bytes[12..16]; 1 : NO_REGISTER; ];
rB = [ icode == OPQ : i10bytes[8..12]; 1 : NO_REGISTER; ];
wire valP : 64;
valP = [
    icode == OPQ : S_pc + 2;
    1 : S_pc + 1;    # picked at random
];
Stat = STAT_HLT;     # fix this

# decode
# 1. set srcA and srcB
srcA = rA;
srcB = rB;
dstE = [ icode == OPQ : rB; 1 : NO_REGISTER; ];

# execute
wire valE : 64;
valE = [
    icode == OPQ && ifun == 0 : rvalA + rvalB;
    1 : 0xdeadbeef;
];
q_lt = [ icode == OPQ : valE < 0; ];    # ...

# memory

# writeback
wvalE = [ icode == OPQ : valE; 1 : 0x1234567890abcdef; ];

# PC update
q_pc = valP;
```
pipelining
 nonuniform partitioning  stages don’t all take same amount of time
 register  changes on the clock
 normal  output your input
 bubble  puts “nothing” in registers  register outputs nop
 stall  put output into input (same output)
 see notes on transitions
 stalling a stage (usually because we are waiting for some earlier instruction to complete)
 stall every stage before you
 bubble stage after you so nothing is done with incomplete work  this will propagate
 stages after that are normal
 bubbling a stage  the work being performed should be thrown away
 bubble stage after it
 basically send a nop  use values NO_REGISTER and 0s
 often there are some fields that don’t matter
 stalling a pipeline = stalling a stage
 bubbling a pipeline  bubble all stages
```
irmovq $31, %rax
addq %rax, %rax
jle ...
```
 the stages are offset  everything always working
 when you get a jmp, you have to wait for the thing before you to writeback. 2 possible solns
 stall decode
 forward value from the stage it is currently in
 look at online notes  save everything in register that needs to be used later
problems
 dependencies  outcome of one instruction depends on the outcome of another  in the software
 data  data needed before advancing  destination of one thing is used as source of another
 load/use
 control  which instruction to run depends on what was done before
 data  data needed before advancing  destination of one thing is used as source of another
 hazards  potential for dependency to mess up the pipeline  in the hardware design
 hardware may or may not have a hazard for a dependency
 can detect them by comparing the wire that reads / writes to regfile (rA,rB / dstE)  they shouldn’t be the same because you shouldn’t be reading/writing to the same register (except when all NO_REGISTER)
solutions
 P is PC, then FDEMW
 stall until it finishes if there’s a problem

```
stall_P = 1;   // stall the fetch/decode stage
bubble_E = 1;  // completes and then starts a nop, gives it time to write values
```
 forward values (find what will be written somewhere)
 in a 2stage system, we have dstE and we use it to check if there’s a problem
 usually if we can check that there’s a problem, we have the right answer
 if we have the answer, put value where it should be
 this is difficult, but doesn’t slow down hardware
 we decide whether we can forward based on a pipeline diagram  boxes with stages (time is y axis, but also things are staggered)
 we need to look at when we get the value and when we need it
 we can pipeline if we need the value after we get it
 if we don’t, we need to stall
```
subq %rax,%rbx
jge bazzle
```
 we could stall until CC is set in execute of subq to fetch for jge  this is slow
 speculative execution  we make a guess and sometimes we’ll be right (modern processors are right ~90% of the time)
 branch prediction  process of picking branch
 jumps to a smaller address are taken more often than not  this algorithm and more things make it complicated
 if we were wrong, we need to correct our work
 Example:
```
1: subq
2: jge 4
3: irmovq
4: rmmovq
```
 the stage performing the ir (irmovq) is wrong  this stage needs to be bubbled
 in a 5-stage pipeline like this, we only need to bubble decode
 ret  doesn’t know the next address to fetch until after memory stage, also can’t be predicted well
 returns are slow, we just wait
 some large processors will guess, can still make mistakes and will have to correct
real processors
 memory is slow and unpredictably slow (10-100 cycles is reasonable)
 pipelines are generally 10-15 stages (the Pentium 4 had 30 stages)
 multiple functional units
 alu could take different number of cycles for addition/division
 multiple functional units lets one thing wait while another continues sending information down the pipeline
 outoforder execution
 compilers look at whether instructions can be done out of order
 it might start them out of order so one can compute in functional unit while another goes through
 can swap operations if the 2nd operation doesn’t depend on the 1st
 (x86) turns the assembly into another language
 micro-ops
 makes things specific to chips
 profiler  software that times software  should be used to see what is taking time in code
 hardware vs. software
 software: .c > compiler > assembler > linker > executes
 compiler  lexes, parses, O_1, O_2, …
 hardware: register transfer languages (.hcl) > … > circuit-level descriptions > layout (verify this) > mask > add silicon in oven > processor
memory
 we want highest speed at lowest cost (higher speeds correspond to higher costs)
 fastest to slowest
 register  processor  about 1K
 SRAM  memory (RAM)  about 1M
 DRAM  memory (RAM)  4-16 GB
 SSD (solidstate drives)  mobile devices
 Disk/Harddrive  filesystem  500GB
 the first three are volatile  if you turn off power you lose them
 the last three are nonvolatile
 things have gotten a lot faster
 where we put things affects speed
 registers near ALU
 SRAM very close to CPU
 CPU also covered with heat sinks
 DRAM is large  pretty far away (some other things between them)
 locality
 temporal  close in time  the characteristic of code that repeatedly uses the same address
 spatial  close in space  the characteristic of code that uses addresses that are numerically close
 realtime performance is not bigO  a tree can be faster than hash because of locality
 caching  to keep a copy for the purpose of speed
 if we need to read a value, we want to read from SRAM (we call SRAM the cache)
 if we must read from DRAM (we call DRAM the main memory), we usually want to store the value in SRAM
 cache  what they wanted most recently (good guess because of temporal locality)
 slower caches are still bigger
 cache nearby bytes to recent accesses (good guess because of spatial locality)
 simplified cache
 4 GB RAM > 32bit address
 1MB cache
 64bit words
 ex. addr: 0x12345678
 simple  use entire cache as one block
 chop off the bottom log(1MB)=20 bits
 send addr = 0x12300000
 fill entire cache with 1MB starting at that address
 tag cache with 0x123
 send value from cache at address offset=0x45678
 if the tag of the address is the same as the tag of the cache, then return from cache
 otherwise redo everything
 slightly better cache
 beginning of address is tag
 middle of address might give you index of block in table = log(num_blocks_in_cache)
 end of address is block offset (offset size=log(block_size))
 a (tag,block) pair is called a line
 1. Fully-associative cache  set of (tag,block) pairs
 table with first column being tags, second column being blocks
 to read, check tag against every tag in cache
 if found, read from that block
 else, pick a line to evict (this is discussed in operating systems)
 read DRAM into that line, read data from that line’s block
 long sets are slow!  typically 2 or 3 lines would be common
 2. Direct-mapped cache  array of (tag,block) pairs
 table with 1st column tags, 2nd column blocks
 to read, check the block at the index given in the address
 if found, read from block
 else, load into that line
 good spatial locality would make the tags adjacent (you read one full block, then the next full block)
 this is faster (like a mux, instead of many = comparisons), typically big (1K-1M lines)
 3. Set-associative cache  hybrid  array of sets
 indices each link to a set, each set with multiple elements
 we search through the tags of this set
 look at examples in book, know how to tell if we have hit or miss
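A sketch of the tag/index/offset split described above, for a hypothetical direct-mapped cache with 64-byte blocks and 1024 lines (the sizes and names here are my own choices):

```c
#include <stdint.h>

/* address layout: [ tag | index | block offset ] */
#define OFFSET_BITS 6    /* log2(64-byte blocks)  */
#define INDEX_BITS 10    /* log2(1024 lines)      */

uint32_t block_offset(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);                  /* low 6 bits   */
}
uint32_t line_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* middle bits  */
}
uint32_t tag_bits(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);                /* whatever is left */
}
```

A hit means the cache line at line_index(addr) holds tag_bits(addr) with its valid bit set; the data comes from the block at block_offset(addr).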
writing
 assume write-back, write-allocate cache
 load block into cache (if not already)  write-allocate cache
 change it in cache
 two options
 write back  wait until remove the line to update RAM
 line needs to have tag, block, and dirty bit
 dirty = wrote but did not update RAM
 write through  update RAM now
 no-write-allocate bypasses cache and writes straight to memory if block not already in cache (typically goes with write-through cache)
 valid bits  whether or not the line contains meaningful information
 kinds of misses
 cold miss  never looked at that line before
 valid bit is 0 or we’ve only loaded other lines into the cache
 associated with tasks that need a lot of memory only once, not much you can do
 capacity miss  we have n lines in the cache and have read ≥ n other lines since we last read this one
 typically associated with a fully-associative cache
 code has bad temporal locality
 conflict miss  recently read line with same index but different tag
 typically associated with a direct-mapped cache
 characteristic of the cache more than the code
cache anatomy
 i-cache  holds instructions
 typically read only
 d-cache  holds program data
 unified cache  holds both
 associativity  number of cache lines per set
optimization
 only need to worry about locality for accessing memory
 things that are in registers / immediates don’t matter
 compilers are often cautious
 some things can’t be optimized because you could pass in the same pointer twice as an argument
 loop unrolling  often dependencies across iterations
```c
for (int i = 0; i < n; i++) {  // this line is bookkeeping overhead  want to reduce it
    a[i] += 1;  // temporal locality: add, then store back to the same address; obviously spatial locality too
}

// unrolled
for (int i = 0; i < n - 2; i += 3) {  // fewer comparisons; can also get benefits from vectorization (complicated)
    a[i]     += 1;
    a[i + 1] += 1;
    a[i + 2] += 1;
}
if (n % 3 >= 1) a[n - 1] += 1;  // handle the leftover elements
if (n % 3 >= 2) a[n - 2] += 1;
```
 n%4 = n&3
 n%8 = n&7
```java
// less error prone
for (int i = 0; i < n - 2; i += 1) {  // can be slower: the data dependency on i
                                      // between lines might not allow full
                                      // parallelism (more stalling)
    a[i] += 1; i += 1;
    a[i] += 1; i += 1;
    a[i] += 1;                        // the for's own i += 1 completes the step of 3
}
```
 loop order
 the order of loops can make this way faster
```c
for (i ...)
    for (j ...)
        a[i][j] += 1;  // two memory accesses
```
 flatten arrays  this is faster, especially if width is a power of 2; the end of one row and the beginning of the next row are spatially local
 row0 then row1 then row2 ….
 float[height*width], access with array[row*width+column]
 problem  loop can’t be picked
```c
for (i ...)
    for (j ...)
        a[i][j] = a[j][i];
```
 solution  blocking
 pick two chunks
 one we read by rows and is happy  only needs one cache line
 one we read by columns  needs blocksize cache lines (by the time we get to the second column, we want all rows to be present in cache)
```c
int bs = 8;
for (bx = 0; bx < N; bx += bs)            // for each block
    for (by = 0; by < N; by += bs)
        for (x = bx; x < bx + bs; x += 1) // for each element of block
            for (y = by; y < by + bs; y += 1)
                swap(a[x][y], a[y][x]);   // do stuff
```
 conditions for blocking
 the whole thing doesn’t fit in cache
 there is no spatially local loop order
 block size must be able to fit in cache
 reassociation optimization  compiler will do this for integers, but not floats (because of roundoff errors)
 a+b+c+d+e > this is slower (addition is sequential by default; the sum of the first two is then added to the third number)
 ((a+b)+(c+d))+e > this can do things in parallel, we have multiple adders (we can see logarithmic performance if the chip can be fully parallel)
 using methods can increase your instruction cache hit rate
exceptions
 processor is connected to I/O Bridge which is connected to Memory Bus and I/O Bus
 these things together are called the motherboard
 we don’t want to wait for I/O Bus
 Polling  CPU periodically checks if ready
 Interrupts  CPU asks device to tell the CPU when the device is ready
 this is what is usually done
 CPU has an interrupt pin (interrupt sent by I/O Bridge)
 steps for an interrupt
 pause my work  save where I can get back to it
 decide what to do next
 do the right thing
 resume my suspended work
 jump table  array of code addresses
 each address points to handler code
 CPU must pause, get index from bus, jump to exception_table[index]
 need the exception table, exception code in memory, need register that tells where the exception table is
 the user’s code should not be able to use exception memory
 memory
 mode register (1bit): 2 modes
 kernel mode (operating system)  allows all instructions
 user mode  most code, blocks some instructions, blocks some memory
 can't set mode register
 can’t talk to the I/O bus
 largest addresses are kernel only
 some of this holds exception table, exception handlers
 between is user things
 smallest addresses are unused  they are null (people often try to dereference them  we want this to throw an error)
 exceptions
 interrupts, index: bus, who causes it: I/O Bridge
 i1,i2,i3,i4 > interrupt during i3 instruction
 let i3 finish (maybe)
 handle interrupt
 resume i4 (or rest of i3)
 trap, index: %al (user), who causes it: int assembly instruction (user code)
 trap is between instructions, simple
 fault, index: based on what failed, who causes it: failing user-mode instruction
 fault during i3
 suspend i3
 handle fault
 rerun i3 (assuming we corrected the fault  ex. allocating new memory), otherwise abort
 abort  reaction to an exception (usually to a fault)  quits instead of resuming
 suspending
 save PC, program register, condition codes (put them in a struct in kernel memory)
 on an exception
 switch to kernel mode
 suspend program
 jump to exception handler
 execute exception handler
 resume in user mode
processes
 (user) read file > (kernel) send request to disk, wait, clean up > (user) resume
 this has lots of waiting so we run another program while we wait (see pic)
 process  code with an address space
 CPU has a register that maps user addresses to physical addresses (memory pointers to each process)
 generally we don’t call the kernel a process
 also had pc, prog. registers, cc, etc.
 each process has a pointer to the kernel memory
 also has more (will learn in OS)…
 context switch  changing from one process to another
 generally each core of a computer is running one process
 freeze one process
 let OS do some bookkeeping
 resume another process
 takes time because of bookkeeping and cache misses on the resume
 you can time context switches

```
while(true)
    getCurrentTime()
    if(increased a lot) contextSwitches++
```
threads
 threads are like processes that user code manages, not the kernel
 within one address space, I have 2 stacks
 save/restores registers and stack
 hardware usually has some thread support
 save/restore instructions
 a way to run concurrent threads in parallel
 python threads don’t run in parallel
system calls
 how user code asks the kernel to do stuff
 exception table  there are 30ish, some free spots for OS to use
 system call  Linux uses exception 128 for almost all user > kernel requests
 uses rax to decide what you are asking //used in a jump table inside the 128 exception handler
 most return 0 on success
 nonzero on error where the # is errno
 you can write assembly in C code
software exceptions
 you can throw an exception and then you want to return to something several method calls before you
 nonlocal jump
 change PC
 and registers (all of them)
 try{}  freezes what we need for catch
 catch{}  what is frozen
 throw{}  resume
 hardware exception can freeze state
signals, setjmps
 exceptions  caused by hardware (mostly), handled by kernel
 signal  caused by kernel, handled by user code (or kernel)
 mimics exception (usually a fault)
 user-defined signal handler known to the OS
 various signals (identified by number)
 implemented with a jump table
 we can mask (ignore) some signals
 turns hardware fault into software exception (ex. divide by 0, access memory that doesn’t exist), this way the user can handle it
 SIGINT (ctrlc)  usually cancels, can be blocked
 ctrlc > interrupt > handler > terminal (user code) > trap (SIGINT is action) > handler > sends signal
 SIGKILL  cancels, can’t be blocked
 SIGSEGV  seg fault
 setjmp/longjmp  caused by user code, handled by user code
 functions in standard C library in setjmp.h
 jumps somewhere where pointer is something that stores current state
 setjmp  succeeds first time (returns 0)
 longjmp  never returns  makes the earlier setjmp return again with a different return value
 you usually use if(setjmp(env)==0) {...} else {handle error}  basically try/catch
virtual memory
 2 address spaces
 virtual address space (addressable space)
 used by code
 fixed by ISA designer
 memory management unit (MMU)  takes in a virtual address and spits out physical address
 page fault  MMU says this virtual address does not have a physical address
 when there’s a page fault, go to exception handler in kernel
 usually we go to disk
 physical address space (cannot be discovered by code)
 used by memory chips
 constrained by size of RAM
 assume all virtual addresses have a physical address in RAM (this is not true, will come back to this)
 each process has code, globals, heap, shared functions, stack
 lots of unused at bottom, top because few programs use 2^64 bytes
 RAM  we’ll say this includes all caches
 virtual memory is usually mostly empty
 allocated in a few blocks / regions
 MMU
 bad idea 1: could be a mapping from every virtual address to every physical address, but this wastes a lot
 instead, we split memory into pages (page is continuous block of addresses ~ usually 4k)
 bigger = fewer things to map, more likely to include unused addresses
 address = loworder bits: page offset, highorder bits: page number
 page offset takes log_2(page_size)
 bad idea 2: page table  map from virtual page number > physical page number
 we put the map in RAM, we have a register (called the PTBR) that tells us where it is
 we change the PTBR for each process
 CPU sends MMU a virtual address
 MMU splits it into a virtual page number and page offset
 takes 2 separate accesses to memory
 uses register to read out page number from page table
 page table  array of physical page numbers, 2^numbits(virtual page numbers)
 page table actually stores page table entries (PTEs)
 PTE = PPN, readonly?, code or data?, user allowed to see it?
 MMU will check this and fault on error
 then it sends page number and page offset and gets back data
 lookup address PTBR + VPN*numbytes(PPN)
 consider 32bit VA, 16k page (too large)
 page offset is 14 bits
 2^18 PTEs = 256k PTEs
 each PTE could be 4 bytes so the page table takes about 1 Megabyte
 64bit VA, 4k page
 2^52 PTE > the page table is too big to store
 good idea: multilevel page table
 virtual address: page offset, multiple virtual page numbers $VPN_0,VPN_1,VPN_2,VPN_3$ (could have different number of these)
 start by reading highest VPN: PTBR[VPN_3] > PTE_3
 read PPN_3[VPN_2] > PTE_2
 read PPN_2[VPN_1] > PTE_1
 read PPN_1[VPN_0] > PTE_0
 PPN_0 combined with the page offset gives the physical address
 check at each level if valid, if unallocated/kernel memory/not usable then fault and stop looking
 highest level VP_n is highest bits of address, likely that it is unused
 therefore we don’t have to check the other addresses
 they don’t exist so we save space, only create page tables when we need them  OS does this
 look at these in the textbook
 virtual memory ends up looking like a tree
 top table points to several tables which each point to more tables
 TLB  maps from virtual page numbers to physical page numbers
 TLB vs L1, L2, etc:
 Similarities
 They are all caches i.e., they have an index and a tag and a valid bit (and sets)
 Differences
 TLB has a 0bit BO (i.e., 1 entry per block; lg(1) = 0)
 TLB is not writable (hence not writeback or writethrough, no dirty bit)
 TLB entries are PPN, L* entries are bytes
 TLB does VPN → PPN; the L* do PA → data
overview
 CPU > creates virtual address
 virtual address: 36 bits (VPN), 12 bits (PO) //other bits are disregarded
 VPN broken into 32 bits (Tag), 4 bits (set index)
 set index tells us which set in the Translation Lookaside Buffer to look at
 there are 2^4 sets in the TLB
 currently there are 4 entries per set ~ this could be different
 each entry has a valid bit
 a tag  same length as VP Tag
 value  normally called block  but here only contains one Physical page number  PPN = 40 bits ~ this could be different
 when you go into kernel mode, you reload the TLB
segments
 memory block
 kernel at top
 stack (grows down)
 shared code
 heap (grows up)
 empty at bottom
base: 0x60
read: 0xb6 val at 0x6b > 0x3d val at 0xd6 > ans
read: 0xa4 val at 0x6a > 0x53 val at 0x34 > ans
0xb3a6
read: 0xb3 val at 0x6b > 0x3d val at 0xd3 > 0x0f val at 0xfa > 0x6b val at 0xb6
[toc]
quiz rvw
 commands
 floats
 labs
 In method main you declare an int variable named x. The compiler might place that variable in a register, or it could be in which region of memory?  Stack
 round to even is default
 Which Y86-64 command moves the program counter to a runtime-computed address?  ret
 [] mux defaults to 0
 callersave register  caller must save them to preserve them
 calleesaved registers  callee must save them to edit them
 in the sequential y86 architecture valA<eax
 valM is read out of memory  used in ret, mrmovl
 labels are turned into addresses when we assemble files
 accessing memory is slow
 most negative binary number: 100000
 floats can represent fewer numbers than unsigned ints
 +0 and -0 are the same
 NaN doesn’t count
 push/pop  sub/add to %rsp, put value into (%rsp)
 opl is 32bit, opq is 64bit
 fetch determines what the next PC will be
 fetch reads rA,rB,icode,ifun  decode reads values from these
labs
strlen
```c
unsigned int strlen(const char *s) {
    unsigned int i = 0;
    while (s[i])
        i++;
    return i;
}
```
strsep
```c
char *strsep(char **stringp, char delim) {
    char *ans = *stringp;
    if (*stringp == 0)
        return 0;
    while (**stringp != delim && **stringp != 0)  /* don't need this 0 check, 0 is same as '\0' */
        *stringp += 1;
    if (**stringp == delim) {
        **stringp = 0;
        *stringp += 1;
    } else {
        *stringp = 0;
    }
    return ans;
}
```
###lists
 always test after malloc
 singlylinked list: node* { TYPE payload, struct node *next }
 length: while(list) list = (*list).next
 allocate: malloc(sizeof(node)*length), head[i].next = (i+1 >= length) ? 0 : (head+i+1)
 access: (*list).payload or list[i].payload (for accessing)
 array: TYPE*
 length: while(list[i] != sentinel)
 allocate: malloc(sizeof(TYPE) * (length+1));
 access: list[i]
 range: { unsigned int length, TYPE *ptr }
 length: list.length
 allocate: list.ptr = malloc(sizeof(TYPE) * length); ans.length = length;
 access: list.ptr[i]
bit puzzles
```c
// leastBitPos  return a mask that marks the position of the least significant 1 bit
int leastBitPos(int x) {
    return x & (~x + 1);
}

int bitMask(int highbit, int lowbit) {
    int zeros = ~1 << highbit;  /* 1100 0000 */
    int ones  = ~0 << lowbit;   /* 1111 1000 */
    return ~zeros & ones;       /* 0011 1000 */
}

/* satAdd  adds two numbers, but when positive overflow occurs, returns the maximum
   possible value, and when negative overflow occurs, returns the minimum possible value */
// soln  overflow when operands have same sign and sum and operands have different sign
int satAdd(int x, int y) {
    int x_is_neg = x >> 31;
    int y_is_neg = y >> 31;
    int sum = x + y;
    int same_sign = (x_is_neg & y_is_neg) | (~x_is_neg & ~y_is_neg);
    int overflow = same_sign & (x_is_neg ^ (sum >> 31));
    int pos_overflow = overflow & ~x_is_neg;
    int neg = 0x1 << 31;
    int ans = (~overflow & sum) |
              (overflow & ((pos_overflow & ~neg) | (~pos_overflow & neg)));
    return ans;
}
```
reading
ch 1 (1.7, 1.9)
 files are stored as bytes, most in ascii
 all files are either text files or binary files
 i/o devices are connected to the bus by a controller or adapter
 processor holds PC, main memory holds program
 oslayer between hardware and applications  protects hardware and unites different types of hardware
 concurrent  instructions of one process are interleaved with another
 does a context switch to switch between processes
 concurrency  general concept of multiple simultaneous activities
 parallelism  use of concurrency to make a system faster
 virtual memoryabstraction that provides each process with illusion of full main memory
 memory  codedataheapshared librariesstack
 threads allow us to have multiple control flows at the same time  switching
 multiprocessor: either multicore or hyperthreaded (one CPU, repeated parts)
 processors can do several instructions per clock cycle
 SingleInstruction, MultipleData (SIMD)  ex. add four floats
ch 2 (2.1, 2.4.2, 2.4.4)
 floating points (float, double)
 sign bit (1)
 exponentfield (8, 11)
 bias = 2^(k-1)-1, ex. 127
 normalized: exponent = exp - bias, mantissa = 1.mantissa
 denormalized: exp  all 0s
 exponent = 1 - bias, mantissa without leading 1  0 and very small values
 exp: all 1s
 infinity (if mantissa 0)
 NaN otherwise
 mantissa
 rounding
 round-to-even  if halfway go to closest even number  avoids statistical bias
 roundtowardzero
 rounddown
 roundup
 leading 0 specifies octal
 leading 0x specifies hex
 leading 0b specifies binary
ch 3 (3.6, 3.7)
 computers execute machine code
 intel processors are all backcompatible
 ISA  instruction set architecture
 control  condition codes are set after every instruction (1bit registers)
 Zero Flag  recent operation yielded 0
 Carry Flag  yielded carry
 Sign Flag  yielded negative
 Overflow Flag  had overflow (pos or neg)
 guarded do can check if a loop is infinite
 instruction src, destination
 parentheses dereference a pointer
 there is a different add command for 16bit operands than for 64bit operands
 all instructions change the program counter
 call instruction only changes the stack pointer
4.1,4.2
 eight registers
 esp is stack pointer
 CC and PC
 4byte values are littleendian
 status code State
 1 AOK
 2 HLT
 3 ADR  seg fault
 4 INS  invalid instruction code
 lines starting with “.” are assembler directives
 assembly code is assembled resulting in just addresses and instruction codes
 pushl %esp  pushes the old (undecremented) value of %esp
 popl %esp  loads the popped value into %esp
 high voltage = 1
 digital system components
 logic
 memory elements
 clock signals
 mux  picks a value and lets it through
 int Out = [
s: A;
1: B;
];
 B is the default
 combinatorial circuit  many bits as input simultaneously
 ALU  three inputs, A, B, func
 clocked registers store individual bits or words
 RAM stores several words and uses address to retrieve them
 stored in register file
4.3.14
 SEQ  sequential processor
 stages
 fetch
 read icode,ifun < byte 1
 maybe read rA, rB < byte 2
 maybe read valC < 8 bytes
 decode
 read operands usually from rA, rB  sometimes from %esp
 call these valA, valB
 execute
 adds something, called valE
 for jmp tests condition codes
 memory
 reads something from memory called valM or writes to memory
 write back
 writes up to two results to regfile
 PC update
 fetch
 popl reads two copies so that it can increment before updating the stack pointer
 components: combinational logic, clocked registers (the program counter and condition code register), and randomaccess memories
 reading from RAM is fast
 only have to consider PC, CC, writing to data memory, regfile
 processor never needs to read back the state updated by an instruction in order to complete the processing of this instruction.
 based on icode, we can compute three 1bit signals :
 instr_valid: Does this byte correspond to a legal Y86 instruction? This signal is used to detect an illegal instruction.
 need_regids: Does this instruction include a register specifier byte?
 need_valC: Does this instruction include a constant word?
4.4 pipelining
 the task to be performed is divided into a series of discrete stages
 increases the throughput  # customers served per unit time
 might increase latency  time required to service an individual customer.
 when pipelining, have to add time for each stage to write to register
 time is limited by slowest stage
 more stages has diminishing returns for throughput because there is constant time for saving into registers
 latency increases with stages
 throughput approaches 1/(register time)
 we need to deal with dependencies between the stages
4.5.3, 4.5.8
 several copies of values such as valC, srcA
 registers dD, eD, mM, wW  lowercase letter is input, uppercase is output
 we try to keep all the info of one instruction within a stage
 merge signals for valP in call and valP in jmp as valA
 load/use hazard  (try using before loaded) one instruction reads a value from memory while the next instruction needs this value as a source operand
 we can stop this by stalling and forwarding (the use of a stall here is called a load interlock)
5  optimization
 eliminate unnecessary calls, tests, memory references
 instructionlevel parallelism
 profilers  tools that measure the performance of different parts of the program
 critical paths  chains of data dependencies that form during repeated executions of a loop
 compilers can only apply safe operations
 watch out for memory aliasing  two pointers designating same memory location
 functions can have side effects  calling them multiple times can have different results
 small boost from replacing function call by body of function (although this can be optimized in compiler sometimes)
 measure performance with CPE  cycles per element
 reduce procedure calls (ex. length in for loop check)
 loop unrolling  increase number of elements computed on each iteration
 enhance parallelism
 multiple accumulators
 limiting factors
 register spilling  when we run out of registers, values stored on stack
 branch prediction  has misprediction penalties, but these are uncommon
 trinary operator could make things faster
 understand memory performance
 using macros lets compiler optimize more, lessens bookkeeping
6.1.1, 6.2, 6.3
 SRAM is bistable as long as power is on  will fall into one of 2 positions
 DRAM loses its value in ~10-100 ms
 memory controller sends row,col (i,j) to DRAM and DRAM sends back contents
 matrix organization reduces number of inputs, but slower because must use 2 steps to load row then column
 memory modules
 enhanced DRAMS
 nonvolatile memory
 ROM  read-only memories  firmware
 accessing main memory
 buses  collection of parallel wires that carry address, data, control
 locality
 locality of references to program data
 visiting things sequentially (like looping through array)  stride1 reference pattern or sequential reference pattern
 locality of instruction fetches
 like in a loop, the same instructions are repeated
 memory hierarchy
 blocksizes for caching can differ between different levels
 when accessing memory from cache, we either get cache hit or cache miss
 if we miss we replace or evict a block
 can use random replacement or least-recently used
 cold cache  cold misses / compulsory misses  when cache is empty
 need a placement policy for level k+1 > k (could be something like put block i into i mod 4)
 conflict miss  miss because placement policy gets rid of block you need  ex. block 0 then 8 then 0 with above placement policy
 capacity misses  the cache just can’t hold enough
6.4, 6.5  cache memories & writing cachefriendly code
 Miss rate. The fraction of memory references during the execution of a program, or a part of a program, that miss. It is computed as #misses/#references.
 Hit rate. The fraction of memory references that hit. It is computed as 1 − miss rate.
 Hit time. The time to deliver a word in the cache to the CPU, including the time for set selection, line identification, and word selection. Hit time is on the order of several clock cycles for L1 caches.
 Miss penalty. Any additional time required because of a miss. The penalty for L1 misses served from L2 is on the order of 10 cycles; from L3, 40 cycles; and from main memory, 100 cycles.
 Traditionally, high-performance systems that pushed the clock rates would opt for smaller associativity for L1 caches (where the miss penalty is only a few cycles) and a higher degree of associativity for the lower levels, where the miss penalty is higher
 In general, caches further down the hierarchy are more likely to use writeback than writethrough
8.1 Exceptions
 exceptions  partly hardware, partly OS
 when an event occurs, indirect procedure call (the exception) through a jump table called exception table to OS subroutine (exception handler).
 three possibilities
 returns to I_curr
 returns to I_next
 program aborts
 exception table  entry k contains address for handler code for exception k
 processor pushes address, some additional state
 four classes
 interrupts
 signal from I/O device, Async, return next instruction
 traps
 intentional exception (interface for making system calls), Sync, return next
 faults
 potentially recoverable error (ex. page fault exception), Sync, might return curr
 aborts
 nonrecoverable error, Sync, never returns
 examples
 general protection fault  seg fault
 machine check  fatal hardware error
8.2 Processes
 process  instance of program in execution
 every program runs in the context of some process (context has code, data stack, pc, etc.)
 logic control flow  like we have exclusive use of processor
 processes execute partially and then are preempted (temporarily suspended)
 concurrency/multitasking/time slicing  if things trade off
 parallel  concurrent and on separate things
 kernel uses context switches
 private address space  like we have exclusive use of memory
 each process has stack, shared libraries, heap, executable
8.3 System Call Error Handling
 system level calls return -1 on error and set the global integer variable errno
 this should be checked for
9.5 Virtual Memory
 address translation  converts virtual to physical address
 translated by the MMU
 VM partitions virtual memory into fixedsize blocks called virtual pages partitioned into three sets
 unallocated
 cached
 uncached
 virtual pages tend to be large because cache misses are large
 DRAM will be fully associative, writeback
 each process has a page table  maps virtual pages to physical pages
 managed by OS
 has PTEs
 PTE  valid bit, nbit address field
 valid bit  whether it’s currently cached in DRAM
 if yes, address is the start of corresponding physical page
 if valid bit not set && null address  has not been allocated
 if valid bit not set && real address  points to start of virtual page on disk
 PTE  3 permission bits
 SUP  does it need to be in kernel (supervisor) mode?
 READ  read access
 WRITE  write access
 page fault  DRAM cache miss
 read valid bit is not set  triggers handler in kernel
 demand paging  waiting until a miss occurs to swap in a page
 malloc creates room on disk
 thrashing  not good locality  pages are swapped in and out continuously
 virtual address space is typically larger
 multiple virtual pages can be mapped to the same shared physical page (ex. everything points to printf)
 VM simplifies many things
 linking
 each process follows same basic format for its memory image
 loading
 loading executables / shared object files
 sharing
 easier to communicate with OS
 memory allocation
 physical pages don’t have to be contiguous
 memory protection
 private memories are easily isolated
9.6 Address Translation
 low order 4 bits serve 2,3  fault 8c: 1000 1100 b6

operating systems
 1  introduction
 1.1 what operating systems do
 1.2 computersystem organization
 1.3 computersystem architecture
 1.4 operatingsystem structure
 1.5 operatingsystem operations
 1.6 process management
 1.7 memory management
 1.8 storage management
 1.9 protection & security
 1.10 basic data structures
 1.11 computing environments
 2  OS Structures
 3  processes
 4  threads
 5  process synchronization
 7  main memory
 6  cpu scheduling
 8  virtual memory
 9  massstorage structure
 10  filesystem interface
 11  filesystem implementation
 12  i/o systems
1  introduction
1.1 what operating systems do
 computer system  hierarchical approach = layered approach
 hardware
 operating system
 application programs
 users
 views
 user view  OS maximizes work user is performing
 system view
 os allocates resources  CPU time, memory, filestorage, I/O
 os is a control program  manages other programs to prevent errors
 program types
 os is the kernel  one program always running on the computer
 only kernel can access resources provided by hardware
 system programs  associated with OS but not in kernel
 application programs
 middleware  set of software frameworks that provide additional services to application developers
1.2 computersystem organization
 when computer is booted, needs bootstrap program
 initializes things then loads OS
 also launches system processes
 ex. Unix launches “init”
 events
 hardware signals with interrupt
 software signals with system call
 interrupt vector holds addresses for all types of interrupts
 have to save address of interrupted instruction
 memory
 von Neumann architecture  uses instruction register
 main memory is RAM
 volatile  lost when power off
 secondary storage is nonvolatile (ex. hard disk)
 ROM is unwriteable so static programs like bootstrap are ROM
 access
 uniform memory access (UMA)
 nonuniform memory access (NUMA)
 I/O
 device driver for I/O devices
 direct memory access (DMA)  transfers entire blocks of data w/out CPU intervention
 otherwise device controller must move data to its local buffer and return pointer to that
 multiprocessor systems
 increased throughput
 economies of scale (costwise)
 increased reliability (fault tolerant)
1.3 computersystem architecture
 singleprocessor system  one main cpu
 usually have specialpurpose processors (e.g. keyboard controller)
 multiprocessor system / multicore system
 multicore means multiprocessor on same chip
 multicore is generally faster
 multiple processors in close communication
 advantages
 increased throughput
 economy of scale
 increased reliability = graceful degradation = fault tolerant
 types
 asymmetric multiprocessing  boss processor controls the system
 symmetric multiprocessing (SMP)  each processor performs all tasks
 more common
 blade server  multiple independent multiprocessor systems in same chassis
 clustered system  multiple loosely coupled cpus
 types
 asymmetric clustering  one machine runs while other monitors it (hotstandby mode)
 symmetric clustering  both run something
 parallel clusters
 require distributed lock manager to stop conflicting parallel operations
 can share same data via storageareanetworks
 beowulf cluster  use ordinary PCs to make cluster
1.4 operatingsystem structure
 multiprogramming  increases CPU utilization so CPU is always doing something
 keeps job pool ready on disk
 time sharing / multitasking  multiple jobs switch so fast that both can be interacted with
 requires an interactive computer system
 process  program loaded into memory
 scheduling
 job scheduling  picking jobs from job pool (disk > memory)
 CPU scheduling  what to run first (memory > cpu)
 memory
 processes are swapped from main memory to disk
 virtual memory allows for execution of process not in memory
1.5 operatingsystem operations
 trap / exception  softwaregenerated interrupt
 usermode and kernel mode (also called system mode)
 when in kernel mode, mode bit is 0
 separate mode for virtual machine manager (VMM)
 this is built into hardware
 kernel can use a timer to avoid getting stuck in user mode
1.6 process management
 program is passive, process is active
 process needs resources
 process is unit of work
 singlethreaded process has one program counter
1.7 memory management
 cpu can only directly read from main memory
 computers must keep several programs in memory
 hardware design is important
1.8 storage management
 defines file as logical storage unit
 most programs stored on disk until loaded
 in addition to secondary storage, there is tertiary storage (like DVDs)
 caching  save frequent items on faster things
 cache coherency  make sure all cached copies of shared data are kept up to date when parallel processes modify it
1.9 protection & security
 process can execute only within its address space
 protection  controlling access to resources
 security  defends a system from attacks
 maintain list of user IDs and group IDs
 can temporarily escalate privileges to an effective UID  setuid command
1.10 basic data structures
 bitmap  string of n binary digits
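A minimal sketch of the bit operations behind a bitmap (the `bitmap_*` names are illustrative, not from any particular OS): bit i lives in byte i/8 at bit position i%8.

```c
#include <stdbool.h>
#include <stdint.h>

/* Set, clear, and test bit i of a bitmap stored as an array of bytes. */
void bitmap_set(uint8_t *map, int i)   { map[i / 8] |= (uint8_t)(1u << (i % 8)); }
void bitmap_clear(uint8_t *map, int i) { map[i / 8] &= (uint8_t)~(1u << (i % 8)); }
bool bitmap_test(const uint8_t *map, int i) { return (map[i / 8] >> (i % 8)) & 1u; }
```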
1.11 computing environments
 network computers  are essentially terminals that understand webbased computing
 distributed system  shares resources among separate computer systems
 network  communication path between two or more computers
 TCP/IP is most common network protocol
 networks
 PAN  personalarea network (like bluetooth)
 LAN  localarea network connects computers within a room, building, or campus
 WAN  widearea network
 MAN  metropolitanarea network
 network OS provides features like file sharing across the network
 distributed OS provides less autonomy  makes it feel like one OS controls entire network
 clientserver computing
 computeserver  performs actions for user
 fileserver  stores files
 peertopeer computing
 all clients w/ central lookup service, ex. Napster
 no centralized lookup service
 uses discovery protocol  puts out request and other peer must respond
 virtualization  allows OS to run within another OS
 interpretation  run programs as nonnative code (ex. java runs on JVM)
 BASIC can be compiled or interpreted
 cloudcomputing  computing, storage, and applications as a service across a network
 public cloud
 private cloud
 hybrid cloud
 software as a service (SAAS)
 platform as a service (PAAS)
 infrastructure as a service (IAAS)
 cloud is behind a firewall, can only make requests to it
 embedded systems  like microwaves / robots
 specific tasks
 have realtime OS  fixed time constraints
2  OS Structures
2.1 os services
 for the user
 user interface  commandline interface and graphical user interface
 program execution  load a program and run it
 I/O operations  file or device
 Filesystem manipulation
 communications  between processes / computer systems
 error detection
 for system operation
 resource allocation
 accounting  keeping stats on users / processes
 protection / security
2.2 user and os interface
 command interpreter = shell  gets and executes next userspecified command
 could contain the code to execute the command itself
 more often, it runs separate system programs, such as “rm”
 GUI
2.3 system calls
 system calls  provide an interface to os services
 API usually wraps system calls (ex. java)
 libc  provided by Linux/Mac OS for C
 systemcall interface links API calls to system calls
 passing parameters
 pass parameters in registers
 parameters stored in block of memory and address passed in register
 parameters pushed onto stack
2.4 system call types
 process control  halting, ending
 lock shared data  no other process can access until released
 file manipulation
 device manipulation
 similar to file manipulation
 information maintenance  time, date, dump()
 single step  CPU mode that throws a trap after every instruction, used by debuggers
 communications
 messagepassing model
 each computer has host name and network identifier (IP address)
 each process has process name
 daemons  system programs for receiving connections (like servers waiting for a client)
 sharedmemory model
 protection
2.5 system programs
 system programs = system utilities
 some provide interfaces for system calls
 other uses
 file management
 status info
 file modification
 programminglanguage support
 program loading and execution
 communications
 background services
2.6 os design and implementation
 mechanism  how to do something
 want this to be general so only certain parameters change
 policy  what will be done
 os mostly in C, lowlevel kernel in assembly
 highlevel is easier to port but slower
2.7 os structure
 want modules but current models aren’t very modularized
 monolithic system has performance advantages  very little overhead
 in practice everything is a hybrid
 system can be modularized with a layered approach
 layers: hardware, …, user interface
 easy to construct and debug
 hard to define layers, less efficient
 microkernel approach  used in os Mach
 move nonessential kernel components to system / userlevel
 smaller kernel, everything communicates with message passing
 makes extending os easier, but slower functions due to system overhead
 loadable kernel modules
 more flexible  kernel modules can change
 examples (see pics)
2.8 os debugging
 errors are written to log file and core dump (memory snapshot) is written to file
 if kernel crashes, must save its dump to a special area
 performance tuning  removing bottlenecks
 monitor trace listings  log of interesting events with times / parameters
 Solaris DTrace is a tool to debug and tune the os
 profiling  periodically samples instruction pointer to determine which code is being executed
2.9 generation
 system generation  configuring os on a computer
 usually on a CDROM
 lots of things must be determined (like what CPU to use)
2.10 system boot
 bootstrap program
3  processes
3.1 process concept
 process  program in execution
 batch system executes jobs = processes
 timeshared system has user programs or tasks
 program is passive while process is active
 parts
 program code  text section
 program counter
 registers
 stack
 data section
 heap
 same program can have many processes
 process can be execution environment for other code (ex. JVM)
 process state
 new
 running
 waiting
 ready
 terminated
 process control block (PCB) = task control block  repository for any info that varies process to process
 process state
 program counter
 CPU registers
 CPUscheduling information
 memorymanagement information
 accounting information
 I/O status information
 could include information for each thread
 parent  process that created another process
3.2 process scheduling
 process scheduler  selects available process for multitasking
 processes begin in job queue
 processes that are ready and waiting are in the ready queue until they are dispatched  usually stored as a linked list
 lots of things can happen here (fig 3_6)
 ex. make I/O request and go to I/O queue
 I/Obound process  spends more time doing I/O
 CPUbound process  spends more time doing computations
 each device has a list of process waiting in its device queue
 scheduler  selects processes from queues
 longterm scheduler  selects from processes on disk to load into memory
 controls the degree of multiprogramming = number of processes in memory
 has much more time than shortterm scheduler
 want good mix of I/Obound and CPUbound processes
 sometimes this doesn’t exist
 shortterm / CPU scheduler  selects from processes ready to execute and allocates CPU to one of them
 sometimes mediumterm scheduler
 does swapping  remove a process from memory and later reintroduce it
 context switch  occurs when switching processes
 when interrupt occurs, kernel saves context of old process and loads saved context of new process
 context is in the PCB
 might be more or less work depending on hardware
3.3 operations on processes
 usually each process has unique process identifier (pid)
 linux everything starts with init process (pid=1)
 restricting a child process to a subset of the parent’s resources prevents system overload
 parent may pass along initialization data
 after creating new process
 parent continues to execute concurrently with children
 parent waits until some or all of its children terminate
 two addressspace possibilities for the new process:
 child is duplicate of parent (it has the same program and data as the parent).
 child loads new program
 forking
 when fork() is called, both processes continue from the call; fork() returns 0 in the child and the child’s pid (nonzero) in the parent
 child is a copy of the parent
 after fork, usually one process calls exec() to load binary file into memory
 overwrites the process’s memory image with the new program; doesn’t return unless an error occurs
 parent can call wait() until child finishes (moves itself off ready queue until child finishes)
 on Windows, uses CreateProcess() which requires loading a new program rather than sharing address space
 STARTUPINFO 
 PROCESSINFORMATION 
 process termination
 exit() kills process (return in main calls exit)
 process can return status value
 parent can terminate child if it knows its pid
 cascading termination  if parent dies, its children die
 zombie process  terminated but parent hasn’t called wait() yet
 remains because parent wants to know what exit status was
 if parent terminates without wait(), orphan child is assigned init as new parent (init periodically invokes wait())
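The fork() / exec() / wait() pattern above can be sketched in C. `spawn_and_wait` is an illustrative helper (not from the text); it exec()s a command assumed to exist on the PATH, such as the standard `true` utility.

```c
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Fork a child that exec()s `cmd`, wait for it, and return its exit status.
   fork() returns 0 in the child and the child's pid in the parent. */
int spawn_and_wait(const char *cmd) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;                       /* fork failed */
    if (pid == 0) {                      /* child */
        execlp(cmd, cmd, (char *)NULL);  /* replaces the child's memory image */
        _exit(127);                      /* exec only returns on error */
    }
    int status;                          /* parent */
    waitpid(pid, &status, 0);            /* blocks until the child terminates */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```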
3.4 interprocess communication
 process cooperation
 information sharing
 computation speedup
 modularity
 convenience
 interprocess communication (IPC)  allows exchange of data and info
 shared memory  shared region of memory is established
 one process establishes region
 other process must attach to it (OS must allow this)
 less overhead (no system calls)
 suffers from cache coherency
 ex. producer consumer
 producer fills buffer and consumer empties it
 unbounded buffer  producer can keep producing indefinitely
 bounded buffer  consumer waits if empty, producer waits if full
 in points to next free position
 out points to first full position
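The in/out bookkeeping above can be sketched as a circular buffer. This is a single-threaded illustration only (a real producer/consumer adds synchronization); the names and the `return false` convention for "must wait" are illustrative.

```c
#include <stdbool.h>

#define BUFFER_SIZE 8

/* in points to the next free slot, out to the first full slot.
   Empty when in == out; full when (in + 1) % BUFFER_SIZE == out,
   so one slot always stays unused. */
static int buffer[BUFFER_SIZE];
static int in = 0, out = 0;

bool produce(int item) {
    if ((in + 1) % BUFFER_SIZE == out)
        return false;                  /* full: producer must wait */
    buffer[in] = item;
    in = (in + 1) % BUFFER_SIZE;
    return true;
}

bool consume(int *item) {
    if (in == out)
        return false;                  /* empty: consumer must wait */
    *item = buffer[out];
    out = (out + 1) % BUFFER_SIZE;
    return true;
}
```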
 message passing  messages between coordinating processes
 useful for smaller data
 easier in a distributed system
 direct or indirect communication
 direct requires knowing the id of process to send / receive
 can be asymmetrical  need to know id of process to send to, but not receive from
 results in hardcoding
 indirect  messages are sent / received from mailboxes
 more flexible, can send message to whoever shares mailbox
 mailbox owned by process  owner receives those messages
 mailbox owned by os  unclear
 synchronous or asynchronous communication
 synchronous = blocking
 when both send and receive are blocking = rendezvous
 automatic or explicit buffering
 queue for messages can have 3 implementations
 zero capacity (must be blocking)
 bounded capacity
 unbounded capacity
3.5 examples of IPC systems
 POSIX  shared memory
 Mach  message passing
 Windows  shared memory for message passing
3.6 communication in clientserver systems
 sockets  endpoint for communication
 IP address + port number
 connecting
 server listens on a port
 client creates socket and requests connection to server’s port
 server accepts connection (then usually writes data to socket)
 all ports below 1024 are well known
 connectionoriented=TCP
 connectionless = UDP
 special IP address 127.0.0.1  loopback  refers to itself
 sockets are lowlevel  can only send unstructured bytes
 remote procedure calls (RPCs)  remote messagebased communication
 like IPC, but between different computers
 message addressed to an RPC daemon listening to a port
 messages are wellstructured
 specifies a port  a number included at the start of a message packet
 system has many ports to differentiate different services
 uses stubs to hide details
 they marshal the parameters
 might have to convert data into external data representation (XDR) (to avoid issues like bigendian vs. littleendian)
 must make sure each message is acted on exactly once
 client must know port
 binding info (port numbers) may be predetermined and unchangeable
 binding can be dynamic with rendezvous daemon (matchmaker) on a fixed RPC port
 pipes  conduit allowing 2 processes to communicate
 four issues
 bidirectional?
 full duplex (data can travel in both directions at same time?) or half duplex (only one way)?
 parentchild relationship?
 communicate over a network?
 ordinary pipe  write at one end, read at the other
 unix function
pipe(int fd[])
 fd[0] is readend and fd[1] is writeend
 only exists while a child and parent process are communicating
 therefore only on same machine
 parent and child should both close unused ends of the pipe
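A sketch of the ordinary-pipe pattern above: the child writes, the parent reads, and each closes its unused end. `pipe_demo` is an illustrative wrapper returning 0 on success.

```c
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Returns 0 if the parent reads back what the child wrote. */
int pipe_demo(void) {
    int fd[2];
    char buf[16] = {0};

    if (pipe(fd) != 0)          /* fd[0] = read end, fd[1] = write end */
        return -1;
    pid_t pid = fork();
    if (pid == 0) {             /* child writes, so it closes the read end */
        close(fd[0]);
        write(fd[1], "hello", 6);
        close(fd[1]);
        _exit(0);
    }
    close(fd[1]);               /* parent reads, so it closes the write end */
    read(fd[0], buf, sizeof buf);
    close(fd[0]);
    wait(NULL);
    return strcmp(buf, "hello") == 0 ? 0 : -1;
}
```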
 on windows, called anonymous pipes
 requires security attributes
 unix function
 named pipe  can be bidirectional
 called FIFOs in Unix
 only halfduplex, requires same machine
 Windows  full duplex and can be different machines
 many processes can use them
4  threads
 thread  basic unit of CPU utilization
 program counter
 register set
 stack
 making a thread is quicker and less resourceintensive than making a process
 used in RPC and kernels
 benefits
 responsiveness
 resource sharing
 economy
 scalability
4.2  multicore programming (skipped)
 amdahl’s law: $speedup \leq \frac{1}{S+(1-S)/N_{cores}}$
 S is serial portion
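Worked example of the bound: with serial fraction S = 0.25 on 4 cores, speedup ≤ 1/(0.25 + 0.75/4) ≈ 2.29, and no number of cores can push it past 1/S = 4. The helper name is illustrative.

```c
/* Amdahl's law: upper bound on speedup with serial fraction s on n cores. */
double amdahl_bound(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}
```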
 parallelism
 data parallelism  distributing subsets of data across cores and performing same operation on each core
 task parallelism  distributing tasks across cores
4.3  multithreading models
 need relationship between user threads and kernel threads
 manytoone model  maps userlevel threads to one kernel thread
 can’t be parallel on multicore systems
 ex. used by Green threads
 onetoone model
 small overhead for creating each thread
 used by Linux and Windows
 manytomany model
 less than or equal number of kernel threads
 twolevel model mixes a onetoone model and a manytomany model
4.4  thread libraries
 thread library  provides programmer with an API for creating/managing threads
 asynchronous v. synchronous threading
1  POSIX Pthreads
```c
pthread_t tid;
pthread_attr_t attr;

/* get the default attributes */
pthread_attr_init(&attr);
/* create the thread */
pthread_create(&tid, &attr, runner, argv[1]);  /* runner is a func to call */
/* wait for the thread to exit */
pthread_join(tid, NULL);
```
 shared data is declared globally
2  Windows
3  Java  uses Runnable interface
4.5  implicit threading (skipped)
 implicit threading  handle threading in runtime libraries and compilers
 thread pool  number of threads at startup that sit in a pool and wait for work
 OpenMP  set of compiler directives / API for parallel programming
 identifies parallel regions
 uses #pragma
 Grand central dispatch  extends C
 uses dispatch queue
4.6  threading issues
 fork/exec need to know if should fork all threads / when to replace program
 signal notifies a process that a particular event has occurred
 has a default signal handler
 userdefined signal handler
 delivering a signal to a process:
kill(pid_t pid, int signal)
 delivering a signal to a thread:
pthread_kill(pthread_t tid, int signal)
 thread cancellation  terminating target thread before it has completed
 asynchronous cancellation  one thread immediately terminates target thread
 deferred cancellation  target thread periodically checks whether it should terminate
 pthread_cancel(tid)
 uses deferred cancellation
 cancellation occurs only when thread reaches cancellation point
 threadlocal storage  when threads need separate copies of data
 lightweight process = LWP  between user thread and kernel thread
 scheduler activation  kernel provides application with LWPs
 upcall  kernel informs application about certain events
4.7  linux (skipped)
 linux process / thread are same = task
 uses clone() system call
5  process synchronization
 cooperating process can affect or be affected by other executing processes
 ex. consumer/producer
 if counter++ and counter-- execute concurrently, don’t know what will happen
 this is a race condition
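One unlucky interleaving of the machine-level steps behind counter++ and counter-- (each is a load, modify, store), replayed sequentially to show the lost update. The function name is illustrative.

```c
/* Replays an interleaving where both processes load before either stores. */
int race_result(void) {
    int counter = 5;
    int r1 = counter;   /* P1: register1 = counter              */
    int r2 = counter;   /* P2: register2 = counter (before P1 stores) */
    r1 = r1 + 1;        /* P1: register1 = register1 + 1        */
    r2 = r2 - 1;        /* P2: register2 = register2 - 1        */
    counter = r1;       /* P1: counter = register1 (6)          */
    counter = r2;       /* P2: counter = register2 (4)          */
    return counter;     /* 4, although one ++ and one -- should leave 5 */
}
```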
5.2  criticalsection problem
 each process has critical section where it updates common variables
 <img src="pics/5_1.png" width=40%/>
 3 requirements
 mutual exclusion  2 processes can’t concurrently do critical section
 progress  if no process is in its critical section, selection of the next process to enter cannot be postponed indefinitely
 bounded waiting  there is a bound on how many times others can enter before a waiting process gets its turn
 kernels
 preemptive kernels
 more responsive
 nonpreemptive kernels
 no race conditions
5.3  peterson’s solution
 peterson’s solution
 <img src=”pics/5_2.png” width=40%/>
 here i is one task and j is the other
 not guaranteed to work on modern architectures (compilers / CPUs may reorder memory operations)
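A sketch of Peterson's two-process solution as in the figure, with flag[] and turn; note `volatile` alone does not restore the ordering guarantees modern hardware breaks, which is why it isn't guaranteed to work there.

```c
#include <stdbool.h>

volatile bool flag[2];   /* flag[i]: process i wants to enter */
volatile int turn;       /* tie-breaker when both want in */

void enter_section(int i) {
    int j = 1 - i;
    flag[i] = true;
    turn = j;                      /* defer to the other process */
    while (flag[j] && turn == j)
        ;                          /* busy wait */
}

void leave_section(int i) {
    flag[i] = false;
}
```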
5.4  synchronization hardware
 locking  protecting critical regions using locks
 singleprocessor solution
 prevent interrupts while shared variable is being modified
 ex.
test_and_set()
 instructions do things like swapping atomically  as one uninterruptable unit
 these are basically locked instructions
 ex.
compare_and_swap()
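What test_and_set() does, sketched without real atomicity — a real implementation is a single atomic hardware instruction, which is the whole point. The lock built on top spins until the returned old value is false.

```c
#include <stdbool.h>

/* Atomically (in hardware) returns the old value and sets the target true.
   This C sketch is NOT actually atomic. */
bool test_and_set(volatile bool *target) {
    bool old = *target;
    *target = true;
    return old;
}

volatile bool lock = false;

void acquire(void) {
    while (test_and_set(&lock))
        ;                  /* spin until the old value was false */
}

void release(void) {
    lock = false;
}
```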
5.5  mutex locks
 mutex: <img src=”pics/5_8.png” width=40%/>
 simplest synchronization tool
 this type of mutex lock is called spinlock because requires busy waiting  processes waiting for the lock loop continuously
 good when locks are short
5.6  semaphores
 semaphore S  integer variable accessed through wait() (like trying to execute) and signal() (like releasing)
 counting semaphore  unrestricted domain
 binary semaphore  0 and 1
```
wait(S) {
    while (S <= 0)
        ;   // busy wait
    S--;
}

signal(S) {
    S++;
}
```
 to improve performance, replace busy wait by process blocking itself
 places itself into a waiting queue
 restarted when other process executes a signal() operation
```
typedef struct {
    int value;
    struct process *list;
} semaphore;

wait(semaphore *S) {
    S->value--;
    if (S->value < 0) {
        add this process to S->list;
        block();
    }
}

signal(semaphore *S) {
    S->value++;
    if (S->value <= 0) {
        remove a process P from S->list;
        wakeup(P);   // resumes execution
    }
}
```
 deadlocked  2 processes are in waiting queues, can’t wakeup unless other process signals them
 indefinite blocking = starvation  could happen if we remove processes from waiting queue in LIFO order
 bottom never gets out
 priority inversion
 only occurs when processes have > 2 priorities
 usually solved with a priorityinheritance protocol
 when a process accesses resources needed by a higherpriority process, it inherits the higher priority until they are finished with the resources in question
5.7  classic synchronization problems
 boundedbuffer problem
 readerswriters problem
 writers must have exclusive access
 readers can read concurrently
 diningphilosophers problem
5.8  monitors
 monitor  high-level synchronization construct
 only 1 process can run at a time
 abstract data type which includes a set of programmerdefined operations with mutual exclusion
 has condition variables
 these can only call wait() or signal()
 when a signal is encountered, 2 options
 signal and wait
 signal and continue
 can implement with semaphores
 1st semaphore: mutex  process must wait before entering and signal after leaving the monitor
 2nd semaphore: next  signaling processes use next to suspend themselves
 3rd (an integer): next_count  number of suspended processes
 each external function F is wrapped as:

```
wait(mutex);
    // body of F
if (next_count > 0)
    signal(next);
else
    signal(mutex);
```
 conditionalwait construct can help with resuming
x.wait(c);
 priority number c stored with name of process that is suspended
 when x.signal() is executed, process with smallest priority number is resumed next
5.9.4  pthreads synchronization
```c
#include <pthread.h>

pthread_mutex_t mutex;

/* create the mutex lock */
pthread_mutex_init(&mutex, NULL);  // NULL specifies default attributes

pthread_mutex_lock(&mutex);        // acquire the mutex lock
/* critical section */
pthread_mutex_unlock(&mutex);      // release the mutex lock
```
 these functions return 0 w/ correct operation otherwise error code
 POSIX specifies named and unnamed semaphores
 named semaphore has a name and can be shared by different processes
```c
#include <semaphore.h>

sem_t sem;

/* create the semaphore and initialize it to 1 */
sem_init(&sem, 0, 1);

/* acquire the semaphore */
sem_wait(&sem);
/* critical section */
/* release the semaphore */
sem_post(&sem);
```
5.10  alternative approaches (skip)
5.11  deadlocks
 resource utilization
 request
 use
 release
 deadlock requires 4 simultaneous conditions
 mutual exclusion
 hold and wait
 no preemption
 circular wait
 deadlocks can be described by system resourceallocation graph
 request edge  directed edge from process P to resource R means P has requested instance of resource type R
 assignment edge  directed edge from resource R to process P (R → P)
 if the graph has no cycles, not deadlocked
 if cycle, possible deadlock
 three ways to handle
 use protocol to never enter deadlock
 enter, detect, recover
 ignore the problem
 developers must write code that avoids deadlocks
7  main memory
7.1  background
 CPU can only directly access main memory and registers
 accessing memory is slower than registers
 processor must stall or use cache
 processes need separate memory spaces
 base register  holds smallest usable address
 limit register  specifies size of range
 os / hardware check these, throw a trap if there was error
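The base/limit check sketched: a logical address is legal only if it lies in [base, base + limit); anything else traps to the OS. The example values in the test are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hardware legality check for a logical address against the
   base and limit registers. */
bool address_ok(uint32_t addr, uint32_t base, uint32_t limit) {
    return addr >= base && addr < base + limit;
}
```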
 input queue holds processes waiting to be brought into memory
 compiler binds symbolic addresses to relocatable addresses
 linkage editor binds relocatable addresses to absolute addresses
 CPU uses virtual address=logical address
 memorymanagement unit (MMU) maps from virtual to physical address
 simple ex. add virtual address to a process’s base register = relocation register
 memorymanagement unit (MMU) maps from virtual to physical address
 dynamic loading  don’t load whole process, only load things when called
 dynamically linked libraries  system libraries linked to user programs when the programs are run
 stub  tells how to load / locate library routine
 shared libraries  all use same library
7.2 (skipped)
7.3  contiguous memory allocation
 contiguous memory allocation  each process has a section
 put OS in low memory and process memory in higher
 transient OS code  not often used
 ex. drivers
 can remove this and change OS memory usage by decreasing val in OS limit register
 split mem into partitions
 each partition can only have 1 process
 multiplepartition method  free partitions take a new process
 variablepartition scheme  OS keeps table of free mem
 all available mem = hole
 holes are divided between processes
 firstfit  allocate first hole big enough
 bestfit  allocate smallest hole that is big enough
 worstfit  allocate largest hole (leaves the largest leftover hole)
 performs worst of the three
 external fragmentation  there is enough free mem, but it isn’t contiguous
 50percent rule  with firstfit, N allocated blocks lose about 0.5N blocks to fragmentation, so 1/3 of mem is unusable
 solved with compaction  shuffle mem to put free mem together
 can be expensive to move mem around
 internal fragmentation  extra mem a proc is allocated but not using (because given in block sizes)
 2 types of noncontiguous solutions
 segmentation
 paging
7.4  segmentation (skip)
 segments make up logical address space
 name (or number)
 length
 logical address is a tuple
 (segmentnumber, offset)
 segment table
 each entry has segment base and segment limit
 doesn’t avoid external fragmentation
7.5  paging (skip)
 break physical mem into fixedsize frames and logical mem into corresponding pages
 CPU address = [page number | page offset]
 page table contains base address of each page in physical mem
 usually, each process gets a page table
 <img src=”pics/7_10.png” width=40%/>
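The split and lookup sketched in C, with an assumed 4 KB page size and a made-up page table (the names and values are illustrative):

```c
#include <stdint.h>

#define PAGE_SIZE 4096   /* assumption: 4 KB pages, 12-bit offset */

/* Split a logical address into page number and page offset. */
uint32_t page_number(uint32_t addr) { return addr / PAGE_SIZE; }
uint32_t page_offset(uint32_t addr) { return addr % PAGE_SIZE; }

/* Translate via a page table mapping page numbers to frame numbers. */
uint32_t translate(uint32_t addr, const uint32_t *page_table) {
    return page_table[page_number(addr)] * PAGE_SIZE + page_offset(addr);
}
```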
 frame table keeps track of which frames are available / who owns them
 paging is prevalent
 avoids external fragmentation, but has internal fragmentation
 small page tables can be stored in registers
 usually pagetable base register points to page table in mem
 has translation lookaside buffer  stores some pagetable entries
 some entries are wired down  cannot be removed from TLB
 some TLBs store addressspace identifiers (ASIDs)
 identify a process
 otherwise hard to contain entries for several processes
 want high hit ratio
 pagetable often stores a bit for readwrite or readonly
 validinvalid bit sets whether page is in a process’s logical address space
 OR pagetable length register  says how long page table is
 can share reentrant code = pure code
 nonselfmodifying code
7.6  page table structure (skip)
 page tables can get quite large (one entry per page: address space size / page size)
 hierarchical paging  ex. twolevel page table
 <img src=”pics/7_18.png” width=40%/>
 also called forwardmapped page table
 unused things aren’t filled in
 for 64bit, would generally require too many levels
 hashed page tables
 virtual page number is hash key > physical page number
 clustered page tables  each entry stores several pages, can be faster
 inverted page tables
 only one page table in system
 one entry for each page/frame of memory
 <img src=”pics/7_20.png” width=40%/>
 takes more time to lookup
 hash table can speed this up
 difficulty with shared memory
7.79 (skipped)
6  cpu scheduling
 preemptive  can stop and switch a process that is currently running
6.3  algorithms
 firstcome, firstserved
 shortestjobfirst
 can be preemptive or non preemptive
 priorityscheduling
 indefinite blocking / starvation
 roundrobin
 every process gets some time
 multilevel queue scheduling
 ex. foreground and background
 multilevel feedback queues
 allows processes to move between queues
6.4  thread scheduling
 process contention scope  competition for CPU takes place among threads belonging to same process
 PTHREAD_SCOPE_PROCESS  userlevel threads onto available LWPs
 PTHREAD_SCOPE_SYSTEM  binds LWP for each userlevel thread
6.5  multipleprocessor scheduling
 asymmetric vs. symmetric
 almost everything is symmetric (SMP)
 processor affinity  try to not switch too much
 load balancing  try to make sure all processes share work
 multithreading
 coarsegrained  thread executes until longlatency event, such as memory stall
 finegrained  switches between instruction cycle
6.6  realtime systems
 event latency  amount of time that elapses from when an event occurs to when it is serviced
 interrupt latency  period of time from the arrival of an interrupt at the CPU to the start of the routine that services the interrupt
 dispatch latency
 Preemption of any process running in the kernel
 Release by lowpriority processes of resources needed by a highpriority process
 ratemonotonic scheduling  schedules periodic tasks using a static priority policy with preemption
6.7  SKIP
8  virtual memory
8.1  background
 lots of code is seldom used
 virtual mem allows the execution of processes that are not completely in memory
 benefits
 programs can be larger than physical mem
 more processes in mem at same time
 less swapping programs into mem
 sparse address space  virtual address spaces with holes (e.g. between heap and stack)
 <img src=”pics/8_3.png” width=40%/>
8.2  demand paging
 demand paging  load pages only when they are needed
 lazy pager  only swaps a page into memory when it is needed
 can use validinvalid bit in page table to signal whether a page is in memory
 memory resident  residing in memory
 accessing page marked invalid causes page fault
 <img src=”pics/8_6.png” width=40%/>
 must restart after fetching page
 don’t let anything change while fetching
 use registers to store state before fetching
 pure demand paging  never bring a page into memory until it is required
 programs tend to have locality of reference, so we bring in chunks at a time
 extra time when there is a page fault
 service the pagefault interrupt
 read in the page
 restart the process
 effective access time is directly proportional to pagefault rate
 anonymous memory  pages not associated with a file
8.3  copyonwrite
 copyonwrite  allows parent and child processes initially to share the same pages
 if either process writes, copy of shared page is created
 new pages can come from a set pool
 zerofillondemand  zeroed out before being allocated
 virtual memory fork  not copyonwrite
 child uses address space of parent
 parent suspended
 meant for when child calls exec() immediately
8.4  page replacement  select which frames to replace
 multiprogramming might overallocate memory
 all programs might need all their mem at once
 buffers for I/O also use a bunch of mem
 when overallocated, 3 options
 terminate user process
 swap out a process
 page replacement
 want lowest pagefault rate
 test with reference string, which is just a list of memory references
 if no frame is free, find one not being used and free it
 write its contents to swap space
 <img src=”pics/8_10.png” width=40%/>
 modify bit=dirty bit reduces overhead
 if hasn’t been modified then don’t have to rewrite it to disk
 page replacement examples
 FIFO
 Belady’s anomaly  for some algorithms, pagefault rate may increase as number of allocate frames increases
 optimal (OPT / MIN)
 replace the page that will not be used for the longest period of time
 LRU  least recently used (last used)
 implement with counters since each use
 stack of page numbers (whenever something is used, put it on top)
 stack algorithms  set of pages in memory for n frames is always a subset of the set of pages that would be in memory with n + 1 frames
 don’t suffer from Belady’s anomaly
 LRUapproximation
 reference bit  set whenever a page is used
 can keep additional reference bits by recording reference bits at regular intervals
 secondchance algorithm  FIFO, but if ref bit is 1, set ref bit to 0 and move on to next FIFO page
 can have clock algorithm
 <img src=”pics/8_17.png” width=40%/>
 enhanced secondchance  uses reference bit and modify bit
 give preference to pages that have been modified
 countingbased  count and implement LFU (least frequently used) or MFU (most frequently used)
 pagebuffering algorithms
 pool of free frames  makes things faster
 list of modified pages  written to disk whenever paging device is idle
 some programs, like databases, perform better when given their own memory region to manage (raw disk) instead of going through the OS
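FIFO replacement is easy to simulate, and doing so reproduces Belady's anomaly on the classic reference string 1,2,3,4,1,2,5,1,2,3,4,5: 9 faults with 3 frames but 10 with 4. `fifo_faults` is an illustrative helper (frame count assumed ≤ 16).

```c
#include <stdbool.h>

/* Count page faults for FIFO replacement on a reference string refs[0..n-1]
   with nframes frames (nframes <= 16 assumed). */
int fifo_faults(const int *refs, int n, int nframes) {
    int frames[16];
    int used = 0, next = 0, faults = 0;
    for (int i = 0; i < n; i++) {
        bool hit = false;
        for (int f = 0; f < used; f++)
            if (frames[f] == refs[i]) { hit = true; break; }
        if (hit)
            continue;
        faults++;
        if (used < nframes)
            frames[used++] = refs[i];   /* free frame available */
        else {
            frames[next] = refs[i];     /* evict the oldest page */
            next = (next + 1) % nframes;
        }
    }
    return faults;
}
```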
8.5 frameallocation algorithms  how many frames to allocate to each process in memory (skipped)
8.6  thrashing
 if lowpriority process gets too few frames, swap it out
 thrashing  process spends more time paging than executing
 CPU utilization stops increasing
 local replacement algorithm = priority replacement algorithm  if one process starts thrashing, cannot steal frames from another
 locality model  each locality is a set of pages actively used together
 give process enough for its current locality
 workingset model  still based on locality
 defines workingset window $\delta$
 defines working set as pages in most recent $\delta$ refs
 OS adds / suspends processes according to working set sizes
 approximate with fixedinterval timer
 pagefault frequency  add / decrease pages based on target pagefault rate
8.7  (skipped)
8.8.1  buddy system
 memory allocated with powerof2 allocator  requests are given powers of 2
 each page is split into 2 buddies and each of those splits again recursively
 coalescing  buddies can be combined quickly
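Finding a block's buddy for coalescing is a single XOR: the two buddies of size s are the halves of the enclosing block of size 2s, so their offsets differ only in the bit worth s. Offsets are relative to the start of the managed region; the helper name is illustrative.

```c
#include <stdint.h>

/* Buddy of the block of power-of-two `size` at offset `off`. */
uint32_t buddy_of(uint32_t off, uint32_t size) {
    return off ^ size;
}
```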
9  massstorage structure
9.1
9.2
9.4  disk scheduling
 bandwidth  total number of bytes transferred, divided by time
 firstcome firstserved
 shortestseektimefirst (SSTF)
 SCAN algorithm  disk swings side to side servicing requests on the way
 also called elevator algorithm
 also has circularscan
9.5  disk management
 lowlevel formatting  dividing disk into sectors that controller can read/write
 blocks have header / trailer with errorcorrecting codes
 bad blocks are corrupted  need to replace them with others = sector sparing = forwarding
 sector slipping  just renumbers to not index bad blocks
10  filesystem interface
10.1
 os maintains openfile table
 might require file locking
 must support different file types
10.2  access methods
 simplest  sequential
 direct access = relative access
 uses relative block numbers
10.3
 disk can be partitioned
 two-level directory
 users are first level
 directory is 2nd level
 extend this into a tree
 acyclic makes it faster to search
 cycles require very slow garbage collection
 link  pointer to another thing
10.4  file system mounting
11  filesystem implementation
11.1
 file-control block (FCB) contains info about file ownership, etc.
11.2
11.3 (SKIP)
11.4
 contiguous allocation
 linked allocation
 FAT
 indexed allocation  all the pointers in 1 block
11.5
 keep track of free-space list
 implemented as bit map
 keep track of linked list of free space
 grouping  block stores n-1 free block addresses and 1 pointer to next block
 counting  keep track of ptr to next block and the number of free blocks after that
12  i/o systems
 bus  shared set of wires
 registers
 datain  read by the host
 dataout
 status
 control
 interrupt chaining  each element in the interrupt vector points to the head of a list of interrupt handlers
 system calls use software interrupt
 direct memory access  read large chunks instead of one byte at a time
 device-status table
 spool  buffer for device (ex. printer) that can’t hold interleaved data

Information Retrieval
 introduction
 related fields
 web crawler
 inverted index
 query processing
 user
 ranking model
 latent semantic analysis  removes noise
 probabilistic ranking principle  different approach, ML
 retrieval evaluation
 reading
introduction
 building blocks of search engines
 search (user initiates)
 recommendations  proactive search engine (program initiates e.g. pandora, netflix)
 information retrieval  activity of obtaining info relevant to an information need from a collection of resources
 information overload  too much information to process
 memex  device which stores records so it can be consulted with exceeding speed and flexibility (search engine)
 IR pieces
 Indexed corpus (static)
 crawler and indexer  gathers the info constantly, takes the whole internet as input and outputs some representation of the document
 web crawler  automatic program that systematically browses web
 document analyzer  knows which section contains what; takes in the metadata and outputs the (condensed) index; manages content to provide efficient access to web documents
 User
 query parser  parses the search terms into managed system representation
 Ranking
 ranking model takes in the query representation and the indices, sorts according to relevance, outputs the results
 also need nice display
 query logs  record user’s search history
 user modeling  assess user’s satisfaction
 Indexed corpus (static)
 steps
 repository > document representation
 query > query representation
 ranking is performed between the 2 representations and given to the user
 evaluation  by users
 information retrieval:
 recommendation
 question answering
 text mining
 online advertisement
related fields
they are all getting closer; database approximate search and information extraction convert unstructured data to structured:

| database systems | information retrieval |
| --- | --- |
| structured data | unstructured data |
| semantics are well-defined | semantics are subjective |
| structured query languages (ex. SQL) | simple keyword queries |
| exact retrieval | relevance-driven retrieval |
| emphasis on efficiency | emphasis on effectiveness |

 natural language processing  currently the bottleneck
 deep understanding of language
 cognitive approaches vs. statistical
 small scale problems vs. large
 developing areas
 currently mobile search is big  needs to use less data, everything needs to be more summarized
 interactive retrieval  like a human being, should collaborate
 core concepts
 information need  desire to locate and obtain info to satisfy a need
 query  a designed representation of user’s need
 document  representation of info that could satisfy need
 relevance  relatedness between documents and need, this is vague
 multiple perspectives: topical, semantic, temporal, spatial (ex. gas stations shouldn’t be behind you)
 Yahoo used to have system where you browsed based on structure (browsing), but didn’t have queries (querying)
 better when user doesn’t know keywords, just wants to explore
 push mode  systems push relevant info to users without a query
 pull mode  users pull out info using keywords
web crawler
 web crawler determines upper bound for search engine
 loop over all URL’s (difficult to set its order)
 make sure it’s not visited
 read it and save it as indexed
 setItVisited
 visiting strategy
 breadth first  has to memorize all nodes on previous level
 depth first  explore the web by branch
 focused crawling  prioritize the new links by predefined strategies
 not all documents are equally important
 prioritize by indegree
 prioritize by PageRank  breadth-first in early stage, then approximate periodically
 prioritize by topical relevance
 estimate the similarity by anchortext or text near anchor
 some websites provide site map for google, disallows certain pages (ex. cnn.com/robots.txt)
 some websites push info to google so it doesn’t need to be crawled (ex. news websites)
 need to revisit to get changed info
 uniform revisiting (what google does)
 proportional revisiting (visiting frequency is proportional to page’s update frequency)
 html parsing
 shallow parsing  only keep text between title and p tags
 automatic wrapper generation  regular expression for HTML tags’ combination
 representation
 long string has no semantic meaning
 list of sentences  sentence is just short document (recursive definition)
 list of words
 tokenization  break a stream of text into meaningful units
 several statistical methods
 bagofwords representation
 we get frequencies, but lose grammar and order
 N-grams (improved)
 contiguous sequence of n items from a given sequence of text
 for example, keep pairs of words
 google uses n = 7
 increase vocabulary to V^n
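Extracting word n-grams is a one-liner worth making concrete (a minimal sketch, not from the notes):

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "information retrieval is fun".split()
assert ngrams(tokens, 2) == [("information", "retrieval"),
                             ("retrieval", "is"), ("is", "fun")]
```

With vocabulary size V, there are up to V^n distinct n-grams, which is the storage blow-up noted above.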
 full text indexing
 pros: preserves all information, fully automatic
 cons: vocab gap: car vs. cars, very large storage
 Zipf’s law  frequency of any word is inversely proportional to its rank in the frequency table
 frequencies decrease linearly on a log-log plot
 discrete version of power law
 stopwords  we ignore these and get meaningful part
 head words take large portion but are meaningless e.g. the, a, an
 tail words  major portion of dictionary, but rare e.g. dextrosinistral
 risk: we lose structure ex. this is not a good option > option
 normalization
 convert different forms of a word to normalized form
 USA St. Louis > Saint Louis
 rule based: delete period, all lower case
 dictionary based: equivalence classes ex. cell phone > mobile phone
 stemming: ladies > lady, referring > refer
 risks: lay > lie
 solutions
 Porter stemmer  patterns of vowel-consonant sequences
 Krovetz stemmer  morphological rules
 empirically, stemming still hurts performance
 modern search engines don’t do stemming or stopword removal
 more advanced NLP techniques are applied  ex. did you search for a person? location?
inverted index
 simple attempt
 documents have been crawled from web, tokenized/normalized, represented as bag-of-words
 try to match keywords to the documents
 space complexity O(d*v) where d = # docs, v = vocab size
 Zipf's law: most of the space is wasted, so we only store the words that actually occur
 instead of an array, we store a linked list for each doc
 time complexity O(q * d_length * d_num) where q = length of query
 solution
 lookup table for each word, key is word, value is list of documents that contain it
 time complexity O(q*l) where l is average length of the list of documents containing a word
 by Zipf's law, l « d_length * d_num
 data structures
 hashtable  modest size (length of dictionary)
 postings  very large  sequential access, contain docId, term freq, term position…
 compression is needed
 sortingbased inverted index construction  mapreduce
 from each doc extract tuples of (termId (key in hashtable), docId, count)
 sort by termId within each doc
 merge sort to get one list sorted by key in hashtable
 compress terms with same termId and put into hashtable
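An in-memory version of the lookup-table idea can be sketched as follows (a hypothetical minimal builder; real systems use the sort-and-merge construction above and compress the postings):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a postings list of (docId, term frequency),
    with postings kept sorted by docId."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():   # stand-in for real tokenization
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {t: sorted(p.items()) for t, p in index.items()}

idx = build_inverted_index(["new home sales", "home prices rise", "new car"])
assert idx["home"] == [(0, 1), (1, 1)]
assert idx["new"] == [(0, 1), (2, 1)]
```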
 features
 needs to support approximate search, proximity search, dynamic index update
 dynamic index update
 periodically rebuild the index  acceptable if change is small over time and missing new documents is fine
 auxiliary index
 keep index for new docs in memory
 merge to index when size exceeds threshold
 soln: multiple auxiliary indices on disk, logarithmic merging
 index compression
 save space
 increase cache efficiency
 improve diskmemory transfer rate
 coding theory: expected code length $E[L] = \sum_l p(x_l) \cdot l$
 instead of storing docId, we store gap between docIDs since they are ordered
 biased distr. gives great compression: frequent words have smaller gaps, infrequent words have large gaps, so the large numbers don’t matter (Zipf’s law)
 variablelength coding: less bits for small (high frequency) integers
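Gap encoding plus a variable-length code can be made concrete with variable-byte coding: 7 payload bits per byte, with the high bit marking the final byte of each gap (a sketch of one common convention; byte-layout details vary between systems):

```python
def vb_encode(gaps):
    """Variable-byte encode a list of positive integers (docId gaps)."""
    out = bytearray()
    for g in gaps:
        chunk = [g & 0x7F]          # low-order 7 bits first
        g >>= 7
        while g:
            chunk.append(g & 0x7F)
            g >>= 7
        chunk[0] |= 0x80            # flag the last (low-order) byte
        out.extend(reversed(chunk))
    return bytes(out)

def vb_decode(data):
    gaps, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:                # final byte of this gap
            gaps.append(n)
            n = 0
    return gaps

doc_ids = [33, 89, 94, 1025]
gaps = [33, 56, 5, 931]             # gaps between successive docIds
assert vb_decode(vb_encode(gaps)) == gaps
```

Small gaps (frequent words) fit in one byte, which is where the compression comes from.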
 more things put into index
 document structure
 title, abstract, body, bullets, anchor
 entity annotation
 these things are fed to the query
 document structure
query processing
 parse syntax ex. Barack AND Obama, orange OR apple
 same processing as on documents: tokenization > normalization > stemming > stopword removal
 speedup: start from lowest frequency to highest ones (easy to toss out documents)
 phrase matching “computer science”
 Ngrams doesn’t work, could be very long phrase
 soln: generalized postings match
 equality condition check with requirement of position pattern between two query terms
 ex. t2.pos = t1.pos + 1 (t1 must be immediately before t2 in any matched document)
 proximity query: |t2.pos - t1.pos| <= k
 spelling correction
 pick nearest alternative or pick most common alternative
 proximity between query terms
 edit distance = minimum number of edit operations to transform one string to another
 insert, replace, delete
 speedup
 fix prefix length
 build characterlevel inverted index
 consider layout of a keyboard
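The edit distance used for spelling correction is the classic dynamic program (a minimal sketch without the speedups listed above):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and replacements
    to turn string a into string b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # replace (or match)
        prev = cur
    return prev[-1]

assert edit_distance("herman", "hermann") == 1
assert edit_distance("kitten", "sitting") == 3
```

The keyboard-layout idea corresponds to replacing the uniform cost 1 with per-pair costs.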
 phonetic similarity ex. “herman” > “Hermann”
 solve with phonetic hashing  similarsounding terms hash to same value
user
 result display
 relevance
 diversity
 navigation  query suggestion, search by example
 list of links has always been there
 search engine recommendations largely bias the user
 direct answers (advanced I’m feeling lucky)
 ex. 100 cm to inches
 google was using user’s search result feedback
 spammers were abusing this
 social things have privacy concerns
 instant search (refreshes search while you’re typing)
 slightly slows things down
 carrot2  browsing not querying
 foam tree display, has circles with sizes representing popularity
 has a learning curve
 pubmed  knows something about users, has keyword search and more
 result display
 relevance
 most users only look at top left
 this can be changed with multimedia content
 HCI is attracting more attention now
 mobile search
 multitouch
 less screen space
ranking model
 naive boolean query “obama” AND “healthcare” NOT “news”
 unions, intersects, lists
 often overconstrained or underconstrained
 also doesn’t give you relevance of returned documents
 you can’t actually return all the documents
 instead we have rank docs for the users (topk retrieval) with different kinds of relevance
 vector space model (uses similarity between query and document)
 how to define similarity measure
 both doc and query represented by concept vectors
 k concepts define highdimensional space
 element of vector corresponds to concept weight
 concepts should be orthogonal (nonoverlapping in meaning)
 could use terms, ngrams, topics, usually bagofwords
 weights: not all terms are equally important
 TF  term frequency weighting  a frequent term is more important
 sublinear normalization: tf(t,d) = 1 + log f(t,d) if f(t,d) > 0, else 0
 or proportionally: tf(t,d) = a + (1-a) * f(t,d) / max_t f(t,d)
 IDF weighting  a term is more discriminative if it occurs only in fewer docs
 IDF(t) = 1+log(N/(d_num(t))) where N = total # docs, d_num(t) = # docs containing t
 total term frequency doesn’t work because words can frequently occur in a subset
 combining TF and IDF  most widely used
 w(t,d) = TF(t,d) * IDF(t)
 similarity measure
 Euclidean distance  penalizes longer docs too much
 cosine similarity  dot product and then normalize
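The TF-IDF weighting and cosine similarity above can be put together in a few lines (a minimal sketch over toy documents; the token lists are hypothetical examples):

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, idf):
    """w(t,d) = TF(t,d) * IDF(t), with sublinear TF = 1 + log f(t,d)."""
    tf = Counter(doc_tokens)
    return {t: (1 + math.log(f)) * idf.get(t, 0.0) for t, f in tf.items()}

def cosine(u, v):
    """Dot product divided by the product of vector norms."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["news", "about", "obama"], ["news", "of", "presidential", "campaign"]]
N = len(docs)
df = Counter(t for d in docs for t in set(d))
idf = {t: 1 + math.log(N / df[t]) for t in df}   # IDF(t) = 1 + log(N/df(t))
vecs = [tf_idf_vector(d, idf) for d in docs]
query = tf_idf_vector(["obama", "news"], idf)
# doc 0 shares both query terms, doc 1 only "news"
assert cosine(query, vecs[0]) > cosine(query, vecs[1])
```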
 drawbacks
 assumes term independence
 assume query and doc to be the same
 lack of predictive adequacy
 lots of parameter tuning
 (uses probability of relevance)
 vocabulary  set of words user can query with
latent semantic analysis  removes noise
 terms aren’t necessarily orthogonal in vectors space model
 synonyms: car vs. automobile
 polysemes: fly (action vs. insect)
 independent concept space is preferred (axes could be sports, economics, etc.)
 constructing concept space
 automatic term expansion  cluster words based on thesaurus (WordNet does this)
 word sense disambiguation  use dictionary, wordusage context
 latent semantic analysis
 assumption  there is some underlying structure that is obscured by randomness of word choice
 random noise contaminates term-document data
 linear algebra  singular value decomposition
 m x n matrix C with rank r
 decompose into U * D * V^T, where D is an r x r diagonal matrix of singular values (their squares are the eigenvalues of $C^TC$)
 U and V are orthogonal matrices
 we put the singular values in D into descending order and only keep the first k as nonzero
 this is low rank decomposition
 compare documents via their representations in the reduced space to get similarity
 eigenvector is new representation of each doc
 principal component analysis  separate things based on direction that maximizes variance
 put query into lowrank space
 LSA can also be used beyond text
 complexity: $O(MN^2)$
probabilistic ranking principle  different approach, ML
 total probability  use Bayes' rule over a partition
 Hypothesis space H={H_1,…,H_n}, training data E

$P(H_i|E) = P(E|H_i)P(H_i)/P(E)$  prior = $P(H_i)$, posterior = $P(H_i|E)$
to pick the most likely hypothesis H*, we drop P(E): $P(H_i|E) \propto P(E|H_i)P(H_i)$

 losses  rank by descending loss

a1 = loss(retrieved non-relevant)
a2 = loss(not retrieved relevant)

 we need to make a relevance measure function
 assume independent relevance, sequential browsing
 most existing ir research has fallen into this line of thinking

conditional models for P(R=1|Q,D)  basic idea  relevance depends on how well a query matches a document

P(R=1|Q,D) = g(Rep(Q,D), t)  linear regression

MLE: prediction = argmax P(X|θ)
Bayesian: prediction = argmax P(X|θ) P(θ)
ml
 features/attributes for ranking  many things
 use logistic regression to find relevance
 little guidance on feature selection
 this model has completely taken over
generative models for P(R=1|Q,D)

compute Odds(R=1|Q,D) using Bayes' rule
language models
 a model specifying probability distributions over word sequences (generative model)
 n-gram models need too much memory, so we use unigrams
 generate text by sampling from discrete distribution
 maximum likelihood estimation (MLE)
 sampling with replacement (like picking marbles from bag)  gives you probability distributions
 when you get a query see which document is more likely to generate the query
 MLE can’t represent unseen words (ex. ipad)
 smoothing
 we want to avoid log zero for these words, but we can’t arbitrarily add to the zero
 instead we add to the zero probabilities and subtract from the probabilities of observed words
 additive smoothing  add a constant delta to the counts of each word
 skews the counts in favor of infrequent terms  all words are treated equally
 absolute discounting  subtract from each nonzero word, distribute among zeros
 reference smoothing  use reference language model to choose what to add
 linear interpolation  subtract a percentage of your probability, distribute among zeros
 dirichlet prior/bayesian  not affected by document length
 effect of smoothing is to get rid of log(0) and to devalue very common words and add weight to infrequent words
 longer documents should borrow less because they see the more uncommon words
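Query likelihood with linear-interpolation smoothing can be sketched as follows (a minimal unigram example over toy documents; the token lists are hypothetical):

```python
import math
from collections import Counter

def query_log_likelihood(query, doc_tokens, coll_tf, coll_len, lam=0.8):
    """log P(q|d) under a unigram language model with linear interpolation:
    p(t|d) = lam * f(t,d)/|d| + (1-lam) * p(t|collection)."""
    tf = Counter(doc_tokens)
    score = 0.0
    for t in query:
        p_doc = tf[t] / len(doc_tokens)
        p_ref = coll_tf[t] / coll_len
        score += math.log(lam * p_doc + (1 - lam) * p_ref)
    return score

docs = [["a", "b", "a"], ["b", "c", "c"]]
coll = Counter(t for d in docs for t in d)
clen = sum(coll.values())
scores = [query_log_likelihood(["a", "c"], d, coll, clen) for d in docs]
# "a" never occurs in doc 1 and "c" never in doc 0, but the reference model
# keeps both probabilities nonzero, so no log(0) appears
assert all(s < 0 and math.isfinite(s) for s in scores)
```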
retrieval evaluation
 evaluation criteria
 small things  speed, # docs returned, spelling correction, suggestions
 most important thing is satisfying user's information need
 Cranfield experiments  retrieved documents' relevance is a good proxy of a system's utility in satisfying user's information need
 standard benchmark  TREC, hosted by NIST
 elements of evaluation
 document collection
 set of information needs expressible as queries
 relevance judgements  binary relevant, nonrelevant for each querydocument pair
 stats
 type 1: false positive  wrongly returned
 precision  fraction of retrieved documents that are relevant = p(relevant|retrieved) = tp/(tp+fp)
 recall  fraction of relevant documents that are retrieved = p(retrieved|relevant) = tp/(tp+fn)
 they generally trade off
 evaluation is in terms of one query
 unordered evaluation  consider the documents unordered
 calculate the precision P and recall R
 combine them with the weighted harmonic mean: F = 1/(a(1/P)+(1-a)(1/R)) where a assigns weights; usually pick a = 1/2
 F = 2/(1/P+1/R)
 we use this instead of the arithmetic mean because a value very close to 0 produces a very large denominator, dragging F down
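The F-measure formula is short enough to check directly (a minimal sketch):

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean F = 1/(alpha/P + (1-alpha)/R);
    alpha = 1/2 gives the usual F1 = 2PR/(P+R)."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

assert f_measure(0.5, 0.5) == 0.5
assert abs(f_measure(1.0, 0.2) - 1 / 3) < 1e-9  # harmonic mean punishes the low value
```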
 ranked evaluation w/ binary relevance  consider the ranked results
 precision vs recall has sawtooth shape curve
 recall never decreases
 precision increases if we find a relevant doc, decreases if irrelevant
 1. eleven-point interpolated (use recall levels 0, .1, .2, …, 1.0)
 shouldn't really use 1.0  not very meaningful
 2. precision@k
 ignore all docs ranked lower than k
 only use relevant docs
 recall@k is problematic because it is hard to know how many docs are relevant
 3. MAP  mean average precision  usually best
 considers rank position of each relevant doc
 compute p@k for each relevant doc
 average precision = average of those p@k
 mean average precision = mean over all the queries
 weakness  assumes users are interested in finding many relevant docs, requires many relevance judgements
 4. MRR  mean reciprocal rank  only want one relevant doc
 uses: looking for fact, knownitem search, navigational queries, query auto completion
 reciprocal rank = 1/k where k is ranking position of 1st relevant document
 mean reciprocal rank = mean over all the queries
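Both per-query quantities can be sketched over a 0/1 judgment list in rank order (hypothetical helpers, not from the notes):

```python
def average_precision(ranked_relevance):
    """AP: mean of precision@k over the ranks k that hold a relevant doc."""
    hits, precisions = 0, []
    for k, rel in enumerate(ranked_relevance, 1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

def reciprocal_rank(ranked_relevance):
    """RR = 1/k for the first relevant document at rank k."""
    for k, rel in enumerate(ranked_relevance, 1):
        if rel:
            return 1 / k
    return 0.0

run = [1, 0, 1, 0, 0]                       # relevant at ranks 1 and 3
assert average_precision(run) == (1 / 1 + 2 / 3) / 2
assert reciprocal_rank([0, 0, 1]) == 1 / 3
```

MAP and MRR are then just the means of these values over all queries.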
 ranked evaluation w/ numerical relevance
 binary relevance is insufficient  highly relevant documents are more useful
 gain is accumulated starting at the top and discounted at lower ranks
 typical discount is 1/log(rank)
 DCG (discounted cumulative gain)  total gain accumulated at a particular rank position p
 DCG_p = rel_1 + sum(i=2 to p) rel_i/log_2(i)
 DCG_p = sum_{i=1 to p}(2^rel_i  1)/(log_2(1+i)) where rel_i is usually 0 to 4
 this is what is actually used
 emphasize on retrieving highly relevant documents
 different queries have different numbers of relevant docs  have to normalize DCG
 normalized DCG  normalize by the DCG of the ideal ranking
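The gain-based variant of DCG and its normalization can be sketched as follows (a minimal version over a list of graded relevance values in rank order):

```python
import math

def dcg(gains):
    """DCG_p = sum over ranks i of (2^rel_i - 1) / log2(i + 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(gains, 1))

def ndcg(gains):
    """Normalize by the DCG of the ideal (descending) ranking."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0

assert ndcg([3, 2, 1]) == 1.0    # already ideally ordered
assert 0 < ndcg([1, 2, 3]) < 1   # good docs ranked low are discounted
```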
 statistical significance tests  difference could just be because of p values you chose
 pvalue  prob of data using null hypothesis, if p < alpha we reject null hypothesis
 1. sign test
 hypothesis  difference median is zero
 2. wilcoxon signed rank test
 hypothesis  data are paired and come from the same population
 3. paired t-test
 difference has zero mean value
 4. one-tail vs. two-tail
 in practice, use two-tail
 kappa statistic  measures agreement between assessors  (P(judges agree) - P(judges agree randomly)) / (1 - P(judges agree randomly))
 = 1 if they fully agree, = 0 if they agree only at chance level, < 0 if worse than chance
 P(judges agree randomly) = sum of the yes-yes and no-no marginal products
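Computed from a 2x2 agreement table, the kappa statistic looks like this (a minimal sketch with hypothetical counts):

```python
def kappa(yes_yes, yes_no, no_yes, no_no):
    """Cohen's kappa between two judges:
    (P(agree) - P(agree by chance)) / (1 - P(agree by chance)),
    with chance agreement computed from the marginals."""
    n = yes_yes + yes_no + no_yes + no_no
    p_agree = (yes_yes + no_no) / n
    p_yes1 = (yes_yes + yes_no) / n      # judge 1 marginal for "yes"
    p_yes2 = (yes_yes + no_yes) / n      # judge 2 marginal for "yes"
    p_chance = p_yes1 * p_yes2 + (1 - p_yes1) * (1 - p_yes2)
    return (p_agree - p_chance) / (1 - p_chance)

assert kappa(40, 0, 0, 60) == 1.0         # perfect agreement
assert abs(kappa(25, 25, 25, 25)) < 1e-9  # no better than chance
```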
 pooling  hard to annotate all docs  relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems
feedback as model interpolation
 important that we take distance from Q to D not D to Q
 this is because the measure is asymmetric
mp3
 2^rel  rel can be 0 or 1
 whenever you change stopword removal/stemming, have to rebuild index
 otherwise, you will think they are all important
reading
as we may think
 there are too many published things, hard to keep track
19 web search basics
 client server design
 server communicates with client via a protocol such as http, in a markup language such as html
 client  generally a browser  can ignore what it doesn't understand
 we need to include authoritativeness when thinking about a document's relevance
 we can view html pages as nodes and hyperlinks as directed edges
 power law: number of web pages w/ indegree i ~ 1/(i^a)
 bowtie structure: three types of webpages IN > SCC > OUT
 spam  would repeat keywords to be included in searches
 there is paid inclusion
 cloaking  different page is shown to crawler than to user
 doorway page  text and metadata to rank highly  then redirects
 SEO (search engine optimizers)  consulting for helping people rank highly
 search engine marketing  how to budget different keywords
 some search engines started out without advertising
 advertising  per click, per view
 competitors can click spam the ads of opponents
 types of queries
 informational  general info
 navigational  specific website
 transactional  buying or downloading
 difficult to get size of index
 shingling  detect near-duplicate pages by comparing their sets of consecutive word sequences (shingles)
2.2, 20.1, 20.2
 hard to tokenize
2.3, 2.4, 4, 5.2, 5.3
 compression and vocabulary
1.3, 1.4 boolean retrieval
 find lists for each term, then intersect or union or complement
 lists need to be sorted by docId so we can just increment the pointers
 we start with shortest lists and do operations to make things faster
 at any point we only want to look at the smallest possible list
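The docId-sorted merge that makes boolean AND fast can be sketched as follows (a minimal version of the pointer-increment idea):

```python
def intersect(p1, p2):
    """Merge two docId-sorted postings lists by always advancing the
    pointer at the smaller current docId; O(len(p1) + len(p2))."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

# process the shortest list first so intermediate results stay small
assert intersect([1, 3, 5, 9], [3, 4, 5, 10]) == [3, 5]
```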
6.2, 6.3, 6.4 vector space model
 tf(t,d) = term frequency of term t in doc d
 uses bag of words  order doesn’t matter, just frequency
 often replaced by wf(t,d) = 1+log(tf(t,d)) if tf > 0, else 0, because many more occurrences of a term doesn't make a document proportionally more relevant
 also could normalize ntf(t,d) = a + (1a)*tf(t,d)/tf_max(d) where a is a smoothing term
 idf(t) = inverse document frequency of term t
 collection frequency = total number of occurrences of a term in the collection.
 document frequency df(t) = #docs that contain term t.
 idf(t) = log(N/df(t)) where N = #docs
 combination weighting scheme: tf_idf(t,d) = tf(t,d)*idf(t)  (tf is actually log)
 document score = sum over terms tf_idf(t,d)

cosine similarity = (doc1 · doc2) / (|doc1| |doc2|) (dot product over the product of norms)  we want the highest possible similarity
 euclidean distance penalizes long documents too much
 similarity = cosine similarity of (query,doc)
 only return top k results
 pivoted normalized document length?  generally penalizes long document, but avoids overpenalizing
13 text classification
 machine learning approach
 naive bayes text classification
12

Real Analysis
 ch 1  the real numbers
 ch 2  sequences and series
 ch 3  basic topology of R
 ch 4  functional limits and continuity
 ch 5  the derivative
 ch 6  sequences and series of function
 ch 7  the Riemann Integral
 overview
ch 1  the real numbers
 there is no rational number whose square is 2 <div class="collapse" id="111"> proof by contradiction </div>
 contrapositive: $\neg B \Rightarrow \neg A$  logically equivalent to $A \Rightarrow B$

 triangle inequality: $|a+b| \leq |a| + |b|$ <div class="collapse" id="121"> often use $a-b = (a-c)+(c-b)$ </div>
 axiom of completeness  every nonempty set $A \subseteq \mathbb{R}$ that is bounded above has a least upper bound
 doesn’t work for $\mathbb{Q}$
 supremum = supA = least upper bound (similarly, infimum)
 supA is an upper bound of A
 if $s \in \mathbb{R}$ is another u.b. then $s \geq supA$
 can be restated as $\forall \epsilon > 0, \exists a \in A$ s.t. $s-\epsilon < a$
 nested interval property  for each $n \in N$, assume we are given a closed interval $I_n = [a_n,b_n]={ x \in \mathbb{R} : a_n \leq x \leq b_n }$ Assume also that each $I_n$ contains $I_{n+1}$. Then, the resulting nested sequence of nonempty closed intervals $I_1 \supseteq I_2 \supseteq …$ has a nonempty intersection <div class="collapse" id="141"> use AoC with x = sup{$a_n: n \in \mathbb{N}$} in the intersection of all sets</div>
 archimedean property
 $\mathbb{N}$ is unbounded above (sup $\mathbb{N}=\infty$)
 $\forall x \in \mathbb{R}, x>0, \exists n \in \mathbb{N}, 0 < \frac{1}{n} < x$
 pf: contradiction with AoC
 $\mathbb{Q}$ is dense in $\mathbb{R}$  for every $a,b \in \mathbb{R}, a<b$, $\exists r \in \mathbb{Q}$ s.t. $a<r<b$
 pf: want $a < \frac{m}{n} < b$
 by Archimedean property, want $\frac{1}{n} < ba$
 corollary: the irrationals are dense in $\mathbb{R}$
 pf: apply density of $\mathbb{Q}$ to get $a-\sqrt{2} < r < b-\sqrt{2}$; then $r+\sqrt{2}$ is irrational and lies in $(a,b)$
 there exists a real number $r \in \mathbb{R}$ satisfying $r^2 = 2$
 pf: let $r = \sup \{ t \in \mathbb{R} : t^2 < 2 \}$. disprove $r^2<2, r^2>2$ by considering $r+\frac{1}{n}, r-\frac{1}{n}$
 A ~ B if there exists $f: A \to B$ that is 1-1 and onto
 A is finite  there exists n $\in \mathbb{N}$ s.t. $\mathbb{N}_n$~A
 countable = $\mathbb{N}$~A.
 uncountable  infinite set that isn't countable
 Q is countable
 pf: Let $A_n = { \pm \frac{p}{q}:$ where p,q $\in \mathbb{N}$ are in lowest terms with p+q=n}
 R is uncountable
 pf: Assume we can enumerate $\mathbb{R}$. Use NIP to exclude one point from $\mathbb{R}$ each time. The intersection is still nonempty, so we didn't successfully enumerate $\mathbb{R}$
 $\frac{x}{x^2-1}$ maps (0,1) $\to \mathbb{R}$
 countable union of countable sets is countable
 if $A \subseteq B$ and B countable, then A is either countable or finite
 if $A_n$ is a countable set for each $n \in \mathbb{N}$, then their union is countable
 the open interval (0,1) = ${ x \in \mathbb{R} : 0 < x < 1 }$ is uncountable
 pf: diagonalization  assume $(0,1)$ can be enumerated; list the decimal expansions as rows of a matrix; the number whose nth digit differs from the nth digit of row n is not in the list
 cantor’s thm  Given any set A, there does not exist a function f:$A \to P(A)$ that is onto
 P(A) is the set of all subsets of A
ch 2  sequences and series

a sequence $(a_n)$ converges to a real number $a$ if $\forall \epsilon > 0, \exists N \in \mathbb{N}$ such that $\forall n\geq N, |a_n-a| < \epsilon$  otherwise it diverges
 if a limit exists, it is unique

a sequence $(x_n)$ is bounded if there exists a number M > 0 such that $|x_n| \leq M$ $\forall n \in \mathbb{N}$  every convergent sequence is bounded
 algebraic limit thm  let lim $a_n = a$ and lim $b_n$ = b. Then
 lim($ca_n$) = ca
 lim($a_n+b_n$) = a+b
 lim($a_n b_n$) = ab
 lim($a_n/b_n$) = a/b, provided b $\neq$ 0

pf 3: use triangle inequality, $|a_nb_n-ab| = |a_nb_n-ab_n+ab_n-ab| \leq |b_n||a_n-a| + |a||b_n-b|$
pf 4: show $(b_n) \to b$ implies $(\frac{1}{b_n}) \to \frac{1}{b}$

 order limit thm  Assume lim $a_n = a$ and lim $b_n$ = b.
 If $a_n \geq 0$ $\forall n \in \mathbb{N}$, then $a \geq 0$
 If $a_n \leq b_n$ $\forall n \in \mathbb{N}$, then $a \leq b$
 If $\exists c \in \mathbb{R}$ for which $c \leq b_n$ $\forall n \in \mathbb{N}$, then $c \leq b$
 pf 1: by contradiction
 monotone  increasing or decreasing (not strictly)
 monotone convergence thm  if a sequence is monotone and bounded, then it converges
 convergence of a series
 define $s_m=a_1+a_2+…+a_m$
 $\sum_{n=1}^\infty a_n$ converges to A $\iff (s_m)$ converges to A
 cauchy condensation test  suppose $a_n$ is decreasing and satisfies $a_n \geq 0$ for all $n \in \mathbb{N}$. Then, the series $\sum_{n=1}^\infty a_n$ converges iff the series $\sum_{n=1}^\infty 2^na_{2^n}$ converges
 p-series $\sum_{n=1}^\infty 1/n^p$ converges iff p > 1
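As a worked check, the p-series claim follows directly from the condensation test (added derivation):

```latex
\sum_{n=1}^{\infty} 2^n a_{2^n}
  = \sum_{n=1}^{\infty} 2^n \cdot \frac{1}{(2^n)^p}
  = \sum_{n=1}^{\infty} \left(2^{1-p}\right)^n
```

which is a geometric series, convergent iff $2^{1-p} < 1$, i.e. iff $p > 1$.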
2.5
 let $(a_n)$ be a sequence and $n_1<n_2<…$ be an increasing sequence of natural numbers. Then $(a_{n_1},a_{n_2},…)$ is a subsequence of $(a_n)$
 subsequences of a convergent sequence converge to the same limit as the original sequence
 can be used as divergence criterion
 bolzano-weierstrass thm  every bounded sequence contains a convergent subsequence
 pf: use NIP, keep splitting interval into two
2.6

$(a_n)$ is a cauchy sequence if $\forall \epsilon > 0, \exists N \in \mathbb{N}$ such that $\forall m,n\geq N, |a_n-a_m| < \epsilon$  cauchy criterion  a sequence converges $\iff$ it is a cauchy sequence
 cauchy sequences are bounded
 overview: AoC $\iff$ NIP $\iff$ MCT $\iff$ BW $\iff$ CC
2.7
 algebraic limit thm  let $\sum_{n=1}^\infty a_n$ = A, $\sum_{n=1}^\infty b_n$ = B
 $\sum_{n=1}^\infty ca_n$ = cA
 $\sum_{n=1}^\infty a_n+b_n$ = A+B
 cauchy criterion for series  series converges $\iff$ $(s_m)$ is a cauchy sequence
 if the series $\sum_{n=1}^\infty a_n$ converges then lim $a_n=0$
 comparison test
 geometric series  $\sum_{n=0}^\infty a r^n = \frac{a}{1-r}$ for $|r|<1$
 $s_m = a+ar+…+ar^{m-1} = \frac{a(1-r^m)}{1-r}$
 absolute convergence test
 alternating series test  if $(a_n)$ is (1) decreasing and (2) lim $a_n$ = 0
 then, $\sum_{n=1}^\infty (-1)^{n+1} a_n$ converges
 rearrangements: there exists a one-to-one correspondence between the terms
 if a series converges absolutely, any rearrangement converges to same limit
ch 3  basic topology of R
3.1 cantor set
 C has small length, but its cardinality is uncountable
 discussion of dimensions, doubling sizes leads to 2^dimension sizes
 Cantor set is about dimension .631
3.2 open/closed sets
 A set O $\subseteq \mathbb{R}$ is open if for all points a $\in$ O there exists an $\epsilon$neighborhood $V_{\epsilon}(a) \subseteq O$

$V_{\epsilon}(a)=\{ x \in \mathbb{R} : |x-a| < \epsilon \}$  the union of an arbitrary collection of open sets is open
 the intersection of a finite collection of open sets is open

 a point x is a limit point of a set A if every $\epsilon$neighborhood $V_{\epsilon}(x)$ of x intersects the set A at some point other than x
 a point x is a limit point of a set A if and only if x = lim $a_n$ for some sequence ($a_n$) contained in A satisfying $a_n \neq x$ for all n $\in$ N
 isolated point  not a limit point
 set $F \subseteq \mathbb{R}$ closed  contains all limit points
 closed iff every Cauchy sequence contained in F has a limit that is also an element of F
 density of Q in R  for every $y \in \mathbb{R}$, there exists a sequence of rational numbers that converges to y
 closure  set with its limit points
 closure $\bar{A}$ is smallest closed set containing A
 iff set open, complement is closed
 R and $\emptyset$ are both open and closed
 the union of a finite collection of closed sets is closed
 the intersection of an arbitrary collection of closed sets is closed
 R and $\emptyset$ are both open and closed
3.3
 a set K $\subseteq \mathbb{R}$ is compact if every sequence in K has a subsequence that converges to a limit that is also in K
 Nested Compact Set Property  intersection of nested sequence of nonempty compact sets is not empty
 let A $\subseteq \mathbb{R}$. open cover for A is a (possibly infinite) collection of open sets whose union contains the set A.
 given an open cover for A, a finite subcover is a finite subcollection of open sets from the original open cover whose union still manages to completely contain A
 HeineBorel thm  let K $\subseteq \mathbb{R}$. All of the following are equivalent
 K is compact
 K is closed and bounded
 every open cover for K has a finite subcover
ch 4  functional limits and continuity
4.1
 dirichlet function: 1 if r $\in \mathbb{Q}$ 0 otherwise
4.2 functional limits

def 1. Let f:$A \to R$, and let c be a limit point of the domain A. We say that $lim_{x \to c} f(x) = L$ provided that for all $\epsilon$ > 0, there exists a $\delta$ > 0 s.t. whenever $0 < |x-c| < \delta$ (and x $\in$ A) it follows that $|f(x)-L| < \epsilon$  def 2. Let f:$A \to R$, and let c be a limit point of the domain A. We say that $lim_{x \to c} f(x) = L$ provided that for every $\epsilon$-neighborhood $V_{\epsilon}(L)$ of L, there exists a $\delta$-neighborhood $V_{\delta}(c)$ around c with the property that for all x $\in V_{\delta}(c)$ different from c (with x $\in$ A) it follows that f(x) $\in V_{\epsilon}(L)$.
 sequential criterion for functional limits  Given function f:$A \to R$ and a limit point c of A, the following 2 statements are equivalent:
 $lim_{x \to c} f(x) = L$
 for all sequences $(x_n) \subseteq$ A satisfying $x_n \neq$ c and $(x_n) \to c$, it follows that $f(x_n) \to L$.
 algebraic limit thm for functional limits
 divergence criterion for functional limits
4.3 continuous functions

a function f:$A \to R$ is continuous at a point c $\in$ A if, for all $\epsilon$>0, there exists a $\delta$>0 such that whenever $|x-c| <\delta$ (and x $\in$ A) it follows that $|f(x)-f(c)| <\epsilon$. f is continuous if it is continuous at every point in the domain A
 characterizations of continuity
 criterion for discontinuity
 algebraic continuity theorem
 if f is continuous at c and g is continuous at f(c), then g $\circ$ f is continuous at c
4.4 continuous functions on compact sets
 preservation of compact sets  if f continuous and K compact, then f(K) is compact as well
 extreme value theorem  if f is continuous on a compact set K, then f attains a maximum and minimum value. In other words, there exist $x_0,x_1 \in K$ such that $f(x_0) \leq f(x) \leq f(x_1)$ for all x $\in$ K

f is uniformly continuous on A if for every $\epsilon$>0, there exists a $\delta$>0 such that for all x,y $\in$ A, $|x-y| < \delta \implies |f(x)-f(y)| < \epsilon$
a function f fails to be uniformly continuous on A iff there exists a particular $\epsilon_0$ > 0 and two sequences $(x_n),(y_n)$ in A satisfying $|x_n - y_n| \to 0$ but $|f(x_n)-f(y_n)| \geq \epsilon_0$
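The negation criterion above in action: a quick numeric sketch showing $f(x)=x^2$ is not uniformly continuous on $\mathbb{R}$, using $x_n = n$, $y_n = n + 1/n$ (so $|x_n - y_n| \to 0$ but the function gap stays above 2).

```python
# x_n = n, y_n = n + 1/n: the inputs get arbitrarily close,
# yet |f(x_n) - f(y_n)| = 2 + 1/n^2 never drops below 2
gaps = []
for n in range(1, 6):
    x, y = n, n + 1 / n
    gaps.append(abs(x * x - y * y))   # equals 2 + 1/n^2 exactly
```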

 a function that is continuous on a compact set K is uniformly continuous on K
4.5 intermediate value theorem
 intermediate value theorem  Let f:[a,b]$ \to R$ be continuous. If L is a real number satisfying f(a) < L < f(b) or f(a) > L > f(b), then there exists a point c $\in (a,b)$ where f( c) = L
 a function f has the intermediate value property on an interval [a,b] if for all x < y in [a,b] and all L between f(x) and f(y), it is always possible to find a point c $\in (x,y)$ where f(c)=L.
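The IVT is the basis of bisection root-finding: if a continuous f changes sign on [a,b], a root exists inside. A minimal Python sketch (names made up):

```python
import math

def bisect(f, a, b, tol=1e-10):
    """Halve a sign-changing bracket [a, b] for a continuous f;
    the IVT guarantees a root stays inside the bracket."""
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "need a sign change on [a, b]"
    while b - a > tol:
        m = (a + b) / 2
        if fa * f(m) <= 0:
            b = m
        else:
            a, fa = m, f(m)
    return (a + b) / 2

root = bisect(math.cos, 0.0, 2.0)  # cos changes sign on [0, 2]
```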
ch 5  the derivative
5.2 derivatives and the intermediate value property
 let g: $A \to R$ be a function defined on an interval A. Given c $\in$ A, the derivative of g at c is defined by g'(c) = $\lim_{x \to c} \frac{g(x) - g(c)}{x-c}$, provided this limit exists. Then g is differentiable at c. If g' exists for all points in A, we say g is differentiable on A
 identity: $x^n - c^n = (x-c)(x^{n-1}+cx^{n-2}+c^2x^{n-3}+…+c^{n-1})$
 differentiable $\implies$ continuous
 algebraic differentiability theorem
 adding
 scalar multiplying
 product rule
 quotient rule
 chain rule: let f:$A \to R$ and g:$B \to R$ satisfy f(A)$\subseteq$ B so that the composition g $\circ$ f is defined. If f is differentiable at c in A and g is differentiable at f(c) in B, then g $\circ$ f is differentiable at c with (g$\circ$f)'(c)=g'(f(c))*f'(c)
 interior extremum thm  let f be differentiable on an open interval (a,b). If f attains a maximum or minimum value at some point c $\in$ (a,b), then f’( c) = 0.
 Darboux's thm  if f is differentiable on an interval [a,b], and $\alpha$ satisfies f'(a) < $\alpha$ < f'(b) (or f'(a) > $\alpha$ > f'(b)), then there exists a point c $\in (a,b)$ where f'(c) = $\alpha$
 derivative satisfies intermediate value property
5.3 mean value theorems
 mean value theorem  if f:[a,b] > R is continuous on [a,b] and differentiable on (a,b), then there exists a point c $\in$ (a,b) where $f’( c) = \frac{f(b)f(a)}{ba}$
 Rolle's thm  if additionally f(a)=f(b), then there exists c $\in$ (a,b) with f'(c)=0
 if f’(x) = 0 for all x in A, then f(x) = k for some constant k
 if f and g are differentiable functions on an interval A and satisfy f’(x) = g’(x) for all x $\in$ A, then f(x) = g(x) + k for some constant k

generalized mean value theorem  if f and g are continuous on the closed interval [a,b] and differentiable on the open interval (a,b), then there exists a point c $\in (a,b)$ where $[f(b)-f(a)]g'(c) = [g(b)-g(a)]f'(c)$. If g' is never 0 on (a,b), then this can be restated as $\frac{f'(c)}{g'(c)} = \frac{f(b)-f(a)}{g(b)-g(a)}$
given g: $A \to R$ and a limit point c of A, we say that $lim_{x \to c} g(x) = \infty$ if, for every M > 0, there exists a $\delta$ > 0 such that whenever $0 < |x-c| < \delta$ it follows that g(x) ≥ M
 L'Hospital's Rule: 0/0  let f and g be continuous on an interval containing a, and assume f and g are differentiable on this interval with the possible exception of the point a. If f(a) = g(a) = 0 and g'(x) ≠ 0 for all x ≠ a, then $lim_{x \to a} \frac{f'(x)}{g'(x)} = L \implies lim_{x \to a} \frac{f(x)}{g(x)} = L$
 L'Hospital's Rule: $\infty / \infty$  assume f and g are differentiable on (a,b) and g'(x) ≠ 0 for all x in (a,b). If $lim_{x \to a} g(x) = \infty$, then $lim_{x \to a} \frac{f'(x)}{g'(x)} = L \implies lim_{x \to a} \frac{f(x)}{g(x)} = L$
ch 6  sequences and series of function
6.2 uniform convergence of a sequence of functions
 for each n $\in \mathbb{N}$ let $f_n$ be a function defined on a set A$\subseteq R$. The sequence ($f_n$) of functions converges pointwise on A to a function f if, for all x in A, the sequence of real numbers $f_n(x)$ converges to f(x)

let ($f_n$) be a sequence of functions defined on a set A$\subseteq$R. Then ($f_n$) converges uniformly on A to a limit function f defined on A if, for every $\epsilon$>0, there exists an N in $\mathbb{N}$ such that $\forall n ≥N, x \in A$, $|f_n(x)-f(x)| <\epsilon$
Cauchy Criterion for uniform convergence  a sequence of functions $(f_n)$ defined on a set A $\subseteq$ R converges uniformly on A iff $\forall \epsilon > 0 \exists N \in \mathbb{N}$ s.t. whenever m,n ≥N and x in A, $|f_n(x)-f_m(x)| <\epsilon$

 continuous limit thm  Let ($f_n$) be a sequence of functions defined on A that converges uniformly on A to a function f. If each $f_n$ is continuous at c in A, then f is continuous at c
6.3 uniform convergence and differentiation
 differentiable limit theorem  let $f_n \to f$ pointwise on the closed interval [a,b], and assume that each $f_n$ is differentiable. If $(f’_n)$ converges uniformly on [a,b] to a function g, then the function f is differentiable and f’=g
 let ($f_n$) be a sequence of differentiable functions defined on the closed interval [a,b], and assume $(f’_n)$ converges uniformly to a function g on [a,b]. If there exists a point $x_0 \in [a,b]$ for which $f_n(x_0)$ is convergent, then ($f_n$) converges uniformly. Moreover, the limit function f = lim $f_n$ is differentiable and satisfies f’ = g
6.4 series of functions
 termbyterm continuity thm  let $f_n$ be continuous functions defined on a set A $\subseteq$ R and assume $\sum f_n$ converges uniformly on A to a function f. Then, f is continuous on A.
 termbyterm differentiability thm  let $f_n$ be differentiable functions defined on an interval A, and assume $\sum f’_n(x)$ converges uniformly to a limit g(x) on A. If there exists a point $x_0 \in [a,b]$ where $\sum f_n(x_0)$ converges, then the series $\sum f_n(x)$ converges uniformly to a differentiable function f(x) satisfying f’(x) = g(x) on A. In other words, $f(x) = \sum f_n(x)$ and $f’(x) = \sum f’_n(x)$

Cauchy Criterion for uniform convergence of series  A series $\sum f_n$ converges uniformly on A iff $\forall \epsilon > 0 \exists N \in \mathbb{N}$ s.t. whenever n>m≥N, x in A, $|f_{m+1}(x) + f_{m+2}(x) + f_{m+3}(x) + …+f_n(x)| < \epsilon$
Weierstrass M-Test  For each n in N, let $f_n$ be a function defined on a set A $\subseteq$ R, and let $M_n > 0$ be a real number satisfying $|f_n(x)| ≤ M_n$ for all x in A. If $\sum M_n$ converges, then $\sum f_n$ converges uniformly on A

6.5 power series
 power series f(x) = $\sum_{n=0}^\infty a_n x^n = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + …$

if a power series converges at some point $x_0 \in \mathbb{R}$, then it converges absolutely for any x satisfying $|x| < |x_0|$
 if a power series converges pointwise on the set A, then it converges uniformly on any compact set K $\subseteq$ A

if a power series converges absolutely at a point $x_0$, then it converges uniformly on the closed interval [-c,c], where c = $|x_0|$
 Abel's thm  if a power series converges at the point x = R > 0, then the series converges uniformly on the interval [0,R]. A similar result holds if the series converges at x = -R

 if $\sum_{n=0}^\infty a_n x^n$ converges for all x in (-R,R), then the differentiated series $\sum_{n=1}^\infty n a_n x^{n-1}$ converges at each x in (-R,R) as well. Consequently the convergence is uniform on compact sets contained in (-R,R).
 can take infinite derivatives
6.6 taylor series
 Taylor's Formula $f(x) = \sum_{n=0}^\infty a_n x^n = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + …$
 centered at 0: $a_n = \frac{f^{(n)}(0)}{n!}$
 Lagrange's Remainder thm  Let f be differentiable N+1 times on (-R,R), define $a_n = \frac{f^{(n)}(0)}{n!}$ …
 not every infinitely differentiable function is represented by its Taylor series (the series may converge, but to a different function)
ch 7  the Riemann Integral
7.2 def of Riemann integral
 partition of [a,b] is a finite set of points from [a,b] that includes both a and b
 lower sum L(f,P)  sum over the subintervals of (inf of f on the subinterval) × (subinterval width)
 a partition Q is a refinement of a partition P if $P \subseteq Q$
 if $P \subseteq Q$, then L(f,P)≤L(f,Q) and U(f,P)≥U(f,Q)
 a bounded function f on the interval [a,b] is Riemannintegrable if U(f) = L(f) = $\int_a^b f$
 iff $\forall \epsilon >0$, there exists a partition P of [a,b] such that $U(f,P)L(f,P)<\epsilon$
 U(f) = inf{U(f,P)} for all possible partitions P
 if f is continuous on [a,b] then it is integrable
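A quick numeric sketch of the upper and lower (Darboux) sums squeezing the integral for $f(x)=x^2$ on [0,1]. Sampling the endpoints of each piece gives the exact inf/sup only because f is monotone there, an assumption of this sketch:

```python
def lower_upper_sums(f, a, b, n):
    """Darboux sums on a uniform partition with n pieces; endpoint
    sampling is exact when f is monotone on each subinterval."""
    h = (b - a) / n
    L = U = 0.0
    for i in range(n):
        x0, x1 = a + i * h, a + (i + 1) * h
        lo, hi = min(f(x0), f(x1)), max(f(x0), f(x1))
        L += lo * h
        U += hi * h
    return L, U

L, U = lower_upper_sums(lambda x: x * x, 0.0, 1.0, 1000)
# L and U squeeze the integral 1/3 as n grows, so U(f) = L(f) here
```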
7.3 integrating functions with discontinuities
 if f:[a,b] $\to$ R is bounded and f is integrable on [c,b] for all c in (a,b), then f is integrable on [a,b]
7.4 properties of Integral
 assume f: [a,b] $\to$ R is bounded and let c in (a,b). Then, f is integrable on [a,b] iff f is integrable on [a,c] and [c,b]. In this case we have $\int_a^b f = \int_a^c f + \int_c^b f$.
 integrable limit thm  Assume that $f_n \to f$ uniformly on [a,b] and that each $f_n$ is integrable. Then, f is integrable and $lim_{n \to \infty} \int_a^b f_n = \int_a^b f$.
7.5 fundamental theorem of calculus
 If f:[a,b] > R is integrable, and F:[a,b]>R satisfies F’(x) = f(x) for all x $\in$ [a,b], then $\int_a^b f = F(b)  F(a)$
 Let g: [a,b] $\to$ R be integrable and for x $\in$ [a,b] define G(x) = $\int_a^x g$. Then G is continuous on [a,b]. If g is continuous at some point $c \in [a,b]$ then G is differentiable at c and G'(c) = g(c).
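A numeric sanity check of the FTC statements above, with a made-up integrand g(t) = cos t and a midpoint-rule approximation of G (a sketch, not a proof):

```python
import math

def G(x, n=2000):
    """G(x) = integral of cos from 0 to x, via the midpoint rule."""
    h = x / n
    return sum(math.cos((i + 0.5) * h) for i in range(n)) * h

# part 2 of the theorem: G'(c) = g(c) where g is continuous;
# approximate G'(1) with a symmetric difference quotient
c, eps = 1.0, 1e-4
deriv = (G(c + eps) - G(c - eps)) / (2 * eps)
```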
overview
 convergence
 sequences
 series
 functional limits
 normal, uniform
 sequence of funcs
 pointwise, uniform
 series of funcs
 pointwise, uniform
 integrability
 sequential criterion  usually good for proving discontinuity
 limit points
 functional limits
 continuity
 absence of uniform continuity
 algebraic limit theorem ~ scalar multiplication, addition, multiplication, division
 limit thm
 sequences
 series  can’t multiply / divide these
 functional limits
 continuity
 differentiability
 ~integrability~
 limit thms
 continuous limit thm  Let ($f_n$) be a sequence of functions defined on A that converges uniformly on A to a function f. If each $f_n$ is continuous at c in A, then f is continuous at c
 differentiable limit theorem  let $f_n \to f$ pointwise on the closed interval [a,b], and assume that each $f_n$ is differentiable. If $(f’_n)$ converges uniformly on [a,b] to a function g, then the function f is differentiable and f’=g
 convergent derivatives almost proves that $f_n \to f$
 let ($f_n$) be a sequence of differentiable functions defined on the closed interval [a,b], and assume $(f’_n)$ converges uniformly to a function g on [a,b]. If there exists a point $x_0 \in [a,b]$ for which $(f_n(x_0))$ converges, then ($f_n$) converges uniformly
 integrable limit thm  Assume that $f_n \to f$ uniformly on [a,b] and that each $f_n$ is integrable. Then, f is integrable and $lim_{n \to \infty} \int_a^b f_n = \int_a^b f$.
 functions are continuous at isolated points, but limits don’t exist there

uniform continuity: bound $|f(x)-f(y)|$ using only $|x-y|$ (one $\delta$ works for all x,y)
 derivative doesn't have to be continuous
 integrable if finite number of discontinuities and bounded

Calculus
Singlevariable calculus
Derivatives:
$\frac{d}{dx}x^n = nx^{n1}$
$\frac{d}{dx}a^x = a^{x}ln(a)$
$\frac{d}{dx}ln(x) = 1/x$
$\frac{d}{dx}tan(x)= sec^2(x)$
$\frac{d}{dx}cot(x)= -csc^2(x)$
$\frac{d}{dx}sec(x)= sec(x)tan(x)$
$\frac{d}{dx}csc(x)= -csc(x)cot(x)$
$\int tan = ln|sec|$, $\int cot = ln|sin|$, $\int sec = ln|sec+tan|$, $\int csc = ln|csc-cot|$, $\int \frac{du}{\sqrt{a^2-u^2}} = sin^{-1}(\frac{u}{a})$
$\int \frac{du}{u\sqrt{u^2-a^2}} = \frac{1}{a}sec^{-1}(\frac{|u|}{a})$
$\int \frac{du}{a^2+u^2} = \frac{1}{a} tan^{-1}(\frac{u}{a})$
Continuous: left limit = right limit = value
Differentiable: continuous and no sharp points / asymptotes
L'Hospital's  for indeterminate forms: $lim \frac{f(x)}{g(x)} = lim \frac{f'(x)}{g'(x)}$
Integration by parts: $\int{u\,dv}=uv-\int{v\,du}$, LIATE
Expansions:
$e^x = \sum{\frac{x^n}{n!}}$
$sin(x) = \sum_0^\infty{\frac{(-1)^n x^{2n+1}}{(2n+1)!}}$
$cos(x) = \sum_0^\infty{\frac{(-1)^n x^{2n}}{(2n)!}}$
Geometric Sum: $a_{1st}\frac{1-r^{n+1}}{1-r}$
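A quick check of the expansions and the geometric-sum formula above (helper names made up):

```python
import math

def exp_series(x, N=30):
    # partial sum of e^x = sum x^n / n!
    return sum(x ** n / math.factorial(n) for n in range(N))

def geom_sum(a, r, n):
    # a + ar + ... + ar^n, closed form a(1 - r^(n+1)) / (1 - r)
    return a * (1 - r ** (n + 1)) / (1 - r)

approx = exp_series(1.0)          # should be close to e
gs = geom_sum(3.0, 0.5, 10)       # compare against the explicit sum
```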
Multivariable calculus
Cylindrical (polar): r,$\theta$,z
Spherical: $\rho,\theta,\phi$
Clairaut's Thm: if the mixed partials are continuous, $f_{xy}=f_{yx}$

Chaos
 Normal forms of 1D bifurcations
 important figs
 Systems of Linear ODEs
 Discrete Nonlinear Dynamical Systems
 Conservative Systems
 Ref
Normal forms of 1D bifurcations
 pitchfork:
 supercritical pitchfork: $\dot{x} = \lambda x - x^3$
 subcritical pitchfork: $\dot{x} = \lambda x + x^3$
 saddle node (turning point): $\dot{x} = \lambda - x^2$
 transcritical: $\dot{x} = \lambda x - x^2$
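The 1D stability test (stable where f'(x*) < 0) applied to the saddle-node normal form, as a small Python sketch (function name made up):

```python
import math

def saddle_node_fixed_points(lam):
    """Fixed points of dx/dt = lam - x^2 and their stability;
    here f'(x) = -2x, so x* = +sqrt(lam) is stable, -sqrt(lam) unstable."""
    if lam < 0:
        return []                 # no fixed points below the bifurcation
    r = math.sqrt(lam)
    return [(r, "stable"), (-r, "unstable")]

pts = saddle_node_fixed_points(1.0)
```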
important figs
 period-doubling (flip bifurcation) $f = \mu x (1-x)$ ($f = \mu sin(\pi x)$ is similar)
 inverse tangent bifurcation  unstable and stable P3 orbits coalesce, move slightly off bisector and becomes chaotic
 pendulum
 energy surface  trajectories run around the surface, not down it
 Conservative systems: 6.5
 study Hamiltonian p. 187188
 Pendulum: 6.7
 dynamics  study of things that evolve with time
 chaos  deterministic, aperiodic, sensitive, longterm prediction impossible
 phase space  has coordinates $x_1,…,x_n$
 phase portrait  variable xaxis, derivative yaxis
 bifurcation diagram  parameter xaxis, steady state yaxis
 draw separate graphs for these
 first check  look for fixed points
 for 1D, if f’ $<$ 0 then stable
 stable f.p. = all possible ICs in a.s.b.f.n. result in trajectories that remain in a.s.b.f.n. for all time
 asymptotically stable f.p.  stable and approaches f.p. as $t\ra\infty$
 hyperbolic f.p.  eigenvals aren’t strictly imaginary
 bifurcation point of f.p.  point where num solutions change or phase portraits change significantly
 globally stable  stable from any ICs
 autonomous = f is a function of x, not t
 we can always make a system autonomous by having $x_n$ = t, so $\dot{x_n}$ = 1
 dimension = number of 1st order ODEs, dimension of phasespace
 existence and uniqueness thm: if f and $\frac{\partial f}{\partial x}$ are continuous, then there is some unique solution (locally)
 linearization  used to find stability of f.p.s
 solving Hopf: use polar to get $\dot{\rho}, \dot{\theta}$

multiply one thing by cos, one by sin, then add
 $\rho = \sqrt{x_1^2 + x_2^2}$, $\theta = tan^{-1}(\frac{x_2}{x_1})$
 Hysteresis curve  S-shaped curve of fixed branches
 ruler getting larger  snap bifurcation  both axes are parameters
Systems of Linear ODEs
 solutions are of the form $\underbar{x}(t) = \underbar{C}_1e^{\alpha_1 t} + \underbar{C}_2e^{\alpha_2 t}$
 Eigenspaces: $E^S$ (stable), $E^U$ (unstable), $E^C$ (center  zero real part)  plot eigenvectors
 how to solve these systems?
 solve eigenvectors
 positive real part  goes out
 negative real part  goes in
 bifurcation requires 0 as eigenvalue
 has imaginary component: spiral / focus
 purely imaginary  center = stable, but not a.s.
 finite velocity = $\frac{dRe(\alpha)}{d\lambda}$
 change coordinates to polar
 for $\lambda \geq 0$, solution is a stable L.C. (from either direction spirals into a circular orbit)
 attracting  any trajectory that starts within $\delta$ of $\bar{\underbar{x}}$ evolves to $\bar{\underbar{x}}$ as t $\to \infty$ (it doesn't have to remain within $\delta$ at all times)
 stable (Lyapunov stable)  any trajectory that starts within $\delta$ remains within $\varepsilon$ for all time ($\varepsilon$ is chosen first)
 asymptotically stable  attracting and stable
 hyperbolic f.p.  iff all eigenvals of the linearization of the nds about the f.p. have nonzero real parts
Discrete Nonlinear Dynamical Systems
 functional iteration: $x_{n+m} = f^m(x_n)$ (apply f m times)
 fixed point: $f(x^*)=x^*$

f.p. stable if $|\frac{df}{dx}(x^*)| < 1$, unstable if $> 1$
 check n-orbit stability with the chain rule: $\frac{df^n}{dx}(x_0^*) = \prod_{i=0}^{n-1} \frac{df}{dx}(x_i^*)$
 perioddoubling bifurcations
 superstability  orbit for which the stability-determining derivative is zero. This means that the max of the map and the point at which the max occurs are in the orbit.
 type I intermittency  exhibited by inverse tangent bifurcation
 Feigenbaum sequence  perioddoubling path to chaos, keep increasing parameter until period is chaotic
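The period-doubling route can be seen by iterating the logistic map; a sketch (parameter values chosen for illustration):

```python
def orbit(mu, x0=0.5, burn=1000, keep=16):
    """Iterate f(x) = mu*x*(1-x), discard a transient, and return
    the rounded tail of the orbit (reveals period-1, 2, 4, ... regimes)."""
    x = x0
    for _ in range(burn):
        x = mu * x * (1 - x)
    tail = []
    for _ in range(keep):
        x = mu * x * (1 - x)
        tail.append(round(x, 6))
    return tail

period1 = len(set(orbit(2.8)))   # below mu = 3: a single stable fixed point
period2 = len(set(orbit(3.2)))   # after the first period-doubling: a 2-cycle
```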
3D Attractors  sign of Lyapunov exponents:
 Fixed Point  (-, -, -)
 Limit Cycle  (0, -, -)
 Torus  (0, 0, -)
 Strange Attractor  (+, 0, -)
 homoclinic orbit  connects unstable manifold of saddle point to its own stable manifold
 e.g. trajectory that starts and ends at the same fixed point
 manifolds are denoted by a W (ex. $W^S$ is the stable manifold)
 heteroclinic orbit  connects unstable manifold of fp to stable manifold of another fp
Conservative Systems
 $F(x) = -\frac{dV}{dx}$ (by defn.)
 $m\ddot{x}+\frac{dV}{dx}=0$, multiply by $\dot{x} \to \frac{d}{dt}[\frac{1}{2}m\dot{x}^2+V(x)]=0$
 so total energy $E=\frac{1}{2}m\dot{x}^2+V(x)$
 motion of pendulum: $\frac{d^2\theta}{dt^2}+\frac{g}{L}sin\theta=0$
 nondimensionalize with $\omega=\sqrt{g/L}, \tau=\omega t \to \ddot{\theta}+sin\theta =0$
 can multiply this by $\dot{\theta}$
 $\omega$-limit  $t \to \infty$
 $\alpha$-limit  $t \to -\infty$
 libration  small orbit surrounding center
 system: $\dot{\theta}=\nu$, $\dot{\nu} = -sin\theta$
Hamiltonian Dynamical System
 $\dot{\underbar{x}}=\frac{\partial H}{\partial y}(\underbar{x},\underbar{y})$ , $\dot{\underbar{y}}=-\frac{\partial H}{\partial x}(\underbar{x},\underbar{y})$ for some function H called the Hamiltonian
 we can only have centers (minima in the potential) and saddle points (maxima)
 separatrix  orbit that separates trapped and passing orbits
 Poincaré–Bendixson Thm  can't have chaos in a 2D (continuous) system
Ref
 $\frac{\partial}{\partial x}(f_1 * f_2 * f_3) = \frac{\partial f_1}{\partial x} f_2 f_3 + \frac{\partial f_2}{\partial x} f_1 f_3 + \frac{\partial f_3}{\partial x} f_1 f_2$
 $e^{\mu it} = cos(\mu t)+ isin(\mu t)$
 $x = A e^{(\lambda + i)t} + B e^{(\lambda - i)t} \implies x = (A' sin(t) + B' cos(t)) e^{\lambda t}$
 If we have $\dot{x_1},\dot{x_2}$ then we can get $x_2(x_1)$ with $\frac{dx_2}{dx_1} = \frac{\dot{x_2}}{\dot{x_1}}$

Differential Equations
Differential Equations
Separable: Separate and Integrate
FOLDE: y' + p(x)y = g(x), IF: $e^{\int{p(x)}dx}$
Exact: Mdx+Ndy = 0, $M_y=N_x$; Integrate Mdx or Ndy, make sure all terms are present
Constant Coefficients: Plug in $e^{rt}$, solve characteristic polynomial; repeated root solutions: $e^{rt},te^{rt}$; complex root solutions: $r=a\pm bi, y=c_1e^{at} cos(bt)+c_2e^{at} sin(bt)$
SOLDE (nonconstant): py''+qy'+ry=0
Reduction of Order: Know one solution, can find other
Undetermined Coefficients (doesn't have to be homogeneous): solve the homogeneous equation first, then plug in the form of the solution with variable coefficients, solve to get the coefficients
Variation of Parameters: start with homogeneous solutions $y_1,y_2$; $Y_p=-y_1\int \frac{y_2 g}{W(y_1,y_2)}dt+y_2\int \frac{y_1 g}{W(y_1,y_2)}dt$
Laplace Transforms  for anything, best when g is discontinuous
$\mathcal{L}(f(t))=F(s)=\int_0^\infty e^{-st}f(t)dt$
Series Solutions: More difficult
Wronskian: $W(y_1,y_2)=y_1y_2' - y_2y_1'$; W = 0 $\implies$ solns linearly dependent
Abel's Thm: y''+py'+qy=0 $\implies W=ce^{-\int p\,dt}$

linear algebra
SVD + eigenvectors
strang 5.1  intro
 elimination changes eigenvalues
 eigenvector application to diff eqs $\frac{du}{dt}=Au$
 soln is exponential: $u(t) = c_1 e^{\lambda_1 t} x_1 + c_2 e^{\lambda_2 t} x_2$
 eigenvalue eqn: $Ax = \lambda x \implies (A\lambda I)x=0$
 set $det(A\lambda I) = 0$ to get characteristic polynomial
 eigenvalue properties
 0 eigenvalue signals that A is singular
 eigenvalues are on the main diagonal when the matrix is triangular
 checks
 sum of eigenvalues = trace(A)
 prod eigenvalues = det(A)
 defective matrices  lack a full set of eigenvalues
strang 5.2  diagonalization
 assume A (nxn) has n eigenvectors
 S := eigenvectors as columns
 $S^{1} A S = \Lambda$ where corresponding eigenvalues are on diagonal of $\Lambda$
 if matrix A has no repeated eigenvalues, eigenvectors are independent
 other S matrices won’t produce diagonal
 only diagonalizable if n independent eigenvectors
 not related to invertibility
 eigenvectors corresponding to different eigenvalues are lin. independent
 there are always n complex eigenvalues (counted with multiplicity)
 eigenvalues of $A^2$ are squared, eigenvectors remain same
 eigenvalues of $A^{1}$ are inverse eigenvalues
 eigenvalues of the 90° rotation matrix are $\pm i$
 eigenvalues for $AB$ only multiply when A and B share eigenvectors
 diagonalizable matrices share the same eigenvector matrix S iff $AB = BA$
strang 5.3  difference eqs and power $A^k$
 compound interest
 solving for fibonacci numbers
 Markov matrices
 steadystate Ax = x
 corresponds to $\lambda = 1$
 stability of $u_{k+1} = A u_k$

stable if all eigenvalues satisfy $|\lambda_i| < 1$
neutrally stable if some $|\lambda_i| = 1$
unstable if at least one $|\lambda_i| > 1$
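The stability check for a 2x2 system via the characteristic polynomial, as a pure-Python sketch (helper name made up); the example is a Markov matrix, so $\lambda = 1$ appears as expected:

```python
def eig2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] from lam^2 - tr*lam + det = 0,
    returned as complex numbers."""
    tr, det = a + d, a * d - b * c
    disc = complex(tr * tr - 4 * det) ** 0.5
    return (tr + disc) / 2, (tr - disc) / 2

# Markov matrix (columns sum to 1): lam = 1 is always an eigenvalue,
# so u_{k+1} = A u_k is neutrally stable with a steady state Ax = x
l1, l2 = eig2(0.9, 0.2, 0.1, 0.8)
spectral_radius = max(abs(l1), abs(l2))
```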

 Leontief’s inputoutput matrix
 PerronFrobenius thm  if A is a positive matrix (positive values), so is its largest eigenvalue. Every component of the corresponding eigenvector is also positive.
strang 6.3  singular value decomposition
 SVD for any m x n matrix: $A=U\Sigma V^T$
 U (mxm) are eigenvectors of $AA^T$
 columns of V (nxn) are eigenvectors of $A^TA$
 r singular values on diagonal of $\Sigma$ (m x n)  square roots of nonzero eigenvalues of both $AA^T$ and $A^TA$
 properties
 for PD matrices, $\Sigma=\Lambda$, $U\Sigma V^T = Q \Lambda Q^T$
 for other symmetric matrices, any negative eigenvalues in $\Lambda$ become positive in $\Sigma$
 applications
 very numerically stable because U and V are orthogonal matrices
 condition number of invertible nxn matrix = $\sigma_{max} / \sigma_{min}$
 $A=U\Sigma V^T = u_1 \sigma_1 v_1^T + … + u_r \sigma_r v_r^T$
 we can throw away columns corresponding to small $\sigma_i$
 pseudoinverse $A^+ = V \Sigma^+ U^T$
Linear Basics
 Linear
 Superposition f(x+y) = f(x)+f(y)
 Proportionality $f(kx) = kf(x)$
 Vector Space
 Closed under addition
 Contains Identity
 Inner Product  returns a scalar
 Linear
 Symmetric
 Something Tricky
 Determinant  sum of products including one element from each row / column with correct sign
 Eigenvalues: $det(A\lambda I)=0$
 Eigenvectors: Find null space of A$\lambda$I
 Linearly Independent: $c_1x_1+c_2x_2=0 \implies c_1=c_2=0$
 Mapping $f: a \mapsto b$
 Onto (surjective): $ \forall b\in B \exists a\in A \, f(a)=b$
 11 (injective): $f(a_1)=f(a_2) \implies a_1=a_2 $

norms: $\|x\|_p = (\sum_{i=1}^n |x_i|^p)^{1/p}$
 $L_0$ norm  number of nonzero elements
 $\|x\|_1 = \sum |x_i|$
 $\|x\|_\infty = max_i |x_i|$
matrix norm  given a vector norm $\|x\|$, the induced matrix norm is $\|A\| = max_{x ≠ 0} \|Ax\| / \|x\|$  represents the maximum stretching that A does to a vector x
 pseudoinverse $A^+ = (A^T A)^{-1} A^T$ (when A has full column rank)
 inverse
 if orthogonal, inverse is transpose
 if diagonal, inverse is invert all elements
 inverting 3x3  transpose, find all mini dets, multiply by signs, divide by det
 to find eigenvectors, values
 $det(A\lambda I)=0$ and solve for lambdas
 $A = QDQ^T$ where Q columns are eigenvectors
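The pseudoinverse formula $A^+ = (A^TA)^{-1}A^T$ as a pure-Python sketch for a tall matrix with 2 independent columns (all names made up; the 2x2 inverse is done by hand):

```python
def pinv_tall(A):
    """Least-squares pseudoinverse of an m x 2 matrix with
    independent columns: A+ = (A^T A)^(-1) A^T."""
    at = list(zip(*A))                                                # A^T
    g = [[sum(x * y for x, y in zip(r, c)) for c in at] for r in at]  # A^T A
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    ginv = [[g[1][1] / det, -g[0][1] / det],
            [-g[1][0] / det, g[0][0] / det]]                          # (A^T A)^(-1)
    return [[sum(ginv[i][k] * at[k][j] for k in range(2))
             for j in range(len(A))] for i in range(2)]

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
P = pinv_tall(A)  # A+ A should be the 2x2 identity
```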
Singularity
 nonsingular = invertible = nonzero determinant = null space of zero
 only square matrices
 rank of mxn matrix  max number of linearly independent columns / rows
 if rank = m = n, then nonsingular
 illconditioned matrix  matrix is close to being singular  very small determinant
 positive semidefinite  $A \in R^{nxn}$
 intuitively is like having upwards curve
 if $\forall x \in R^n, x^TAx \geq 0$ then A is positive semi definite (PSD)
 if $\forall x \in R^n, x^TAx > 0$ then A is positive definite (PD)
 PD $\to$ full rank, invertible
 Gram matrix  G = $X^T X \implies $PSD
 if X full rank, then G is PD
Matrix Calculus
 gradient $\nabla_A f(\mathbf{A})$  partial derivatives with respect to each element of matrix
 f has to be a function that takes a matrix, returns a scalar
 output will be the same size as the variable you take the gradient of (in this case A)
 $\nabla^2$ is not gradient of the gradient (here it denotes the Hessian)
 examples
 $\nabla_x a^T x = a$
 $\nabla_x x^TAx = 2Ax$ (if A symmetric)
 $\nabla_x^2 x^TAx = 2A$ (if A symmetric)
 function f(x,y)
 1st derivative is vector of derivatives
 2nd derivative is Hessian matrix
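Finite differences can sanity-check the gradient identities above; a sketch for $\nabla_x x^TAx = 2Ax$ with a small symmetric A (values arbitrary):

```python
def quad(A, x):
    # x^T A x as an explicit double sum
    return sum(x[i] * A[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))

A = [[2.0, 1.0], [1.0, 3.0]]   # symmetric, so the identity applies
x = [0.5, -1.0]
eps = 1e-6

# central difference approximation of the gradient, one coordinate at a time
grad = []
for i in range(2):
    xp, xm = x[:], x[:]
    xp[i] += eps
    xm[i] -= eps
    grad.append((quad(A, xp) - quad(A, xm)) / (2 * eps))

expected = [2 * sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]  # 2Ax
```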

Math Basics
Misc
$\Gamma(n)=(n-1)!=\int_0^\infty x^{n-1}e^{-x}dx$
$\zeta(x) = \sum_1^\infty \frac{1}{n^x}$
stochastic processes
Stochastic  random process evolving with time
Markov: $P(X_t=x \mid X_{t-1})=P(X_t=x \mid X_{t-1},…,X_1)$
Martingale: $E[X_t \mid X_{t-1},…,X_1]=X_{t-1}$
abstract algebra
Group: set of elements endowed with operation satisfying 4 properties:
1. closed 2. identity 3. associative 4. inverses
Equivalence Relation:
1. reflexive 2. transitive 3. symmetric
discrete math
Goldbach's strong conjecture: Every even integer greater than 2 can be expressed as the sum of two primes (Goldbach considered 1 a prime).
Goldbach's weak conjecture: All odd numbers greater than 5 are the sum of three primes (equivalently, all odd numbers greater than 7 are the sum of three odd primes).
Set  An unordered collection of items without replication
Proper subset  a subset that is not equal to the whole set
A $\cup$ A = A (Idempotent law)
Disjoint: $A \cap B = \emptyset$
Partition: mutually disjoint, union fills space
powerset $\mathcal{P}$(A) = set of all subsets
Converse: $q\ra p$ (same truth table as the inverse: $\neg p \ra \neg q$)
$p_1 \ra p_2 \iff \neg p_1 \lor p_2$
The greatest common divisor of two integers a and b is the largest integer d such that $d \mid a$ and $d \mid b$
Proof Techniques

Proof by Induction

Direct Proof

Proof by Contradiction  assume p $\land \neg$q, show contradiction

Proof by Contrapositive  show $\neg q \ra \neg p$
identities
$e^{-2lnx}= \frac{1}{e^{2lnx}} = \frac{1}{e^{lnx}e^{lnx}} = \frac{1}{x^2}$
$ln(xy) = ln(x)+ln(y)$
$lnx * lny = ln(x^{lny})$
$e^{\mu it} = cos(\mu t)+ isin(\mu t)$
Partial Fractions $\frac{3x+11}{(x3)(x+2)} = \frac{A}{x3} + \frac{B}{x+2}$
$(ax+b)^k = \frac{A_1}{ax+b}+\frac{A_2}{(ax+b)^2}+…$
$(ax^2+bx+c)^k = \frac{A_1x+B_1}{ax^2+bx+c}+…$
$cos(a\pm b) = cos(a)cos(b)\mp sin(a)sin(b)$
$sin(a \pm b) = sin(a)cos(b) \pm sin(b)cos(a)$

Proofs
proofs
 induction
 must already know formula
 doesn’t give intuition
 there are undecidable problems, e.g. the Halting Problem (and generalized versions of the 3x+1 problem)
 nonexistence proofs
 must cover all possible scenarios, harder than existence
 pigeonhole principle
 induction

Real Analysis
 ch 1  the real numbers
 ch 2  sequences and series
 ch 3  basic topology of R
 ch 4  functional limits and continuity
 ch 5  the derivative
 ch 6  sequences and series of function
 ch 7  the Riemann Integral
 overview
ch 1  the real numbers
 there is no rational number whose square is 2 <div class="collapse" id="111"> proof by contradiction </div>
 contrapositive: $\neg q \ra \neg p$  logically equivalent to the original implication

triangle inequality: $|a+b| \leq |a| + |b|$ <div class="collapse" id="121"> often use a-b = (a-c)+(c-b) </div>
 axiom of completeness  every nonempty set $A \subseteq \mathbb{R}$ that is bounded above has a least upper bound
 doesn’t work for $\mathbb{Q}$
 supremum = supA = least upper bound (similarly, infimum)
 supA is an upper bound of A
 if $s \in \mathbb{R}$ is another u.b. then $s \geq supA$
 can be restated as $\forall \epsilon > 0, \exists a \in A$ s.t. $s-\epsilon < a$
 nested interval property  for each $n \in N$, assume we are given a closed interval $I_n = [a_n,b_n]={ x \in \mathbb{R} : a_n \leq x \leq b_n }$ Assume also that each $I_n$ contains $I_{n+1}$. Then, the resulting nested sequence of nonempty closed intervals $I_1 \supseteq I_2 \supseteq …$ has a nonempty intersection <div class="collapse" id="141"> use AoC with x = sup{$a_n: n \in \mathbb{N}$} in the intersection of all sets</div>
 archimedean property
 $\mathbb{N}$ is unbounded above (sup $\mathbb{N}=\infty$)
 $\forall x \in \mathbb{R}, x>0, \exists n \in \mathbb{N}, 0 < \frac{1}{n} < x$
contradiction with AoC
 $\mathbb{Q}$ is dense in $\mathbb{R}$  for every $a,b \in \mathbb{R}, a<b$, $\exists r \in \mathbb{Q}$ s.t. $a<r<b$
 pf: want $a < \frac{m}{n} < b$
 by Archimedean property, want $\frac{1}{n} < ba$
 corollary: the irrationals are dense in $\mathbb{R}$
 pf: apply density of $\mathbb{Q}$ to get $a-\sqrt{2} < r < b-\sqrt{2}$; then $r+\sqrt{2}$ is irrational and lies between a and b
 there exists a real number $r \in \mathbb{R}$ satisfying $r^2 = 2$
 pf: let r = $sup { t \in \mathbb{R} : t^2 < 2 }$. disprove $r^2<2, r^2>2$ by considering $r+\frac{1}{n},r-\frac{1}{n}$
 A ~ B if there exists f:A>B that is 11 and onto
 A is finite  there exists n $\in \mathbb{N}$ s.t. $\mathbb{N}_n$~A
 countable = $\mathbb{N}$~A.
 uncountable  infinite set that isn't countable
 Q is countable
 pf: Let $A_n = { \pm \frac{p}{q}:$ where p,q $\in \mathbb{N}$ are in lowest terms with p+q=n}
 R is uncountable
 pf: Assume we can enumerate $\mathbb{R}$. Use NIP to exclude one point from $\mathbb{R}$ each time. The intersection is still nonempty, so we didn't successfully enumerate $\mathbb{R}$
 $\frac{x}{x^2-1}$ maps (-1,1) onto $\mathbb{R}$
 countable union of countable sets is countable
 if $A \subseteq B$ and B countable, then A is either countable or finite
 if $A_n$ is a countable set for each $n \in \mathbb{N}$, then their union is countable
 the open interval (0,1) = ${ x \in \mathbb{R} : 0 < x < 1 }$ is uncountable
 pf: diagonalization  assume there is an onto function from $\mathbb{N}$ to (0,1). List the decimal expansions of the outputs as rows of an array. The number whose digits differ from the diagonal in every position is not in the list.
 cantor’s thm  Given any set A, there does not exist a function f:$A \to P(A)$ that is onto
 P(A) is the set of all subsets of A
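The diagonal argument on a finite sample, as a toy Python sketch (digits arbitrary; this illustrates, not proves, uncountability):

```python
def diagonal_complement(rows):
    """Given n decimal strings, build a string that differs from the
    k-th row in the k-th digit, so it appears nowhere in the list."""
    return "".join("5" if r[i] != "5" else "4" for i, r in enumerate(rows))

rows = ["141592", "718281", "414213", "302585", "577215", "693147"]
d = diagonal_complement(rows)
missing = all(d[i] != rows[i][i] for i in range(len(rows)))
```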
ch 2  sequences and series

a sequence $(a_n)$ converges to a real number a if $\forall \epsilon > 0, \exists N \in \mathbb{N}$ such that $\forall n\geq N, |a_n-a| < \epsilon$
 otherwise it diverges
 if a limit exists, it is unique

a sequence $(x_n)$ is bounded if there exists a number M > 0 such that $|x_n| \leq M$ $\forall n \in \mathbb{N}$
 every convergent sequence is bounded
 algebraic limit thm  let lim $a_n = a$ and lim $b_n$ = b. Then
 lim($ca_n$) = ca
 lim($a_n+b_n$) = a+b
 lim($a_n b_n$) = ab
 lim($a_n/b_n$) = a/b, provided b $\neq$ 0

pf 3: use triangle inequality, $|a_nb_n-ab| = |a_nb_n-ab_n+ab_n-ab| \leq |b_n||a_n-a| + |a||b_n-b|$
 pf 4: show $(b_n) \to b$ implies $(\frac{1}{b_n}) \to \frac{1}{b}$

 order limit thm  Assume lim $a_n = a$ and lim $b_n$ = b.
 If $a_n \geq 0$ $\forall n \in \mathbb{N}$, then $a \geq 0$
 If $a_n \leq b_n$ $\forall n \in \mathbb{N}$, then $a \leq b$
 If $\exists c \in \mathbb{R}$ for which $c \leq b_n$ $\forall n \in \mathbb{N}$, then $c \leq b$
 pf 1: by contradiction
 monotone  increasing or decreasing (not strictly)
 monotone convergence thm  if a sequence is monotone and bounded, then it converges
 convergence of a series
 define $s_m=a_1+a_2+…+a_m$
 $\sum_{n=1}^\infty a_n$ converges to A $\iff (s_m)$ converges to A
 cauchy condensation test  suppose $a_n$ is decreasing and satisfies $a_n \geq 0$ for all $n \in \mathbb{N}$. Then, the series $\sum_{n=1}^\infty a_n$ converges iff the series $\sum_{n=1}^\infty 2^na_{2^n}$ converges
 p-series  $\sum_{n=1}^\infty 1/n^p$ converges iff p > 1
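A numeric sanity check (not a proof) of the condensation test for the p = 2 case; the term counts and tolerances below are arbitrary choices:

```python
import math

# a_n = 1/n^2 is decreasing and nonnegative; the condensed series
# sum 2^n * a_{2^n} = sum 2^{-n} is geometric and sums to 1, so the
# p-series with p = 2 converges (to pi^2/6, the Basel value).
condensed = sum(2.0**n * (1.0 / (2.0**n) ** 2) for n in range(1, 60))
partial = sum(1.0 / n**2 for n in range(1, 200001))

assert abs(condensed - 1.0) < 1e-12
assert abs(partial - math.pi**2 / 6) < 1e-4
```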
2.5
 let $(a_n)$ be a sequence and $n_1<n_2<…$ be an increasing sequence of natural numbers. Then $(a_{n_1},a_{n_2},…)$ is a subsequence of $(a_n)$
 subsequences of a convergent sequence converge to the same limit as the original sequence
 can be used as divergence criterion
 bolzano-weierstrass thm  every bounded sequence contains a convergent subsequence
 pf: use NIP, keep splitting interval into two
2.6

$(a_n)$ is a cauchy sequence if $\forall \epsilon > 0, \exists N \in \mathbb{N}$ such that $\forall m,n\geq N, |a_n-a_m| < \epsilon$
 cauchy criterion  a sequence converges $\iff$ it is a cauchy sequence
 cauchy sequences are bounded
 overview: AoC $\iff$ NIP $\iff$ MCT $\iff$ BW $\iff$ CC
2.7
 algebraic limit thm  let $\sum_{n=1}^\infty a_n$ = A, $\sum_{n=1}^\infty b_n$ = B
 $\sum_{n=1}^\infty ca_n$ = cA
 $\sum_{n=1}^\infty (a_n+b_n)$ = A+B
 cauchy criterion for series  series converges $\iff$ $(s_m)$ is a cauchy sequence
 if the series $\sum_{n=1}^\infty a_n$ converges then lim $a_n=0$
 comparison test
 geometric series  $\sum_{n=0}^\infty a r^n = \frac{a}{1-r}$ for $|r| < 1$
 $s_m = a+ar+…+ar^{m-1} = \frac{a(1-r^m)}{1-r}$
 absolute convergence test
 alternating series test 1. $(a_n)$ decreasing 2. lim $a_n$ = 0
 then, $\sum_{n=1}^\infty (-1)^{n+1} a_n$ converges
 rearrangements: reorder terms via a one-to-one correspondence
 if a series converges absolutely, any rearrangement converges to same limit
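The absolute-convergence hypothesis matters: a conditionally convergent series can be rearranged to a different sum. A toy sketch of Riemann's rearrangement idea for the alternating harmonic series (the target 1.0 and term counts are arbitrary):

```python
import math

# The alternating harmonic series converges conditionally (to ln 2),
# so its terms can be reordered to approach any chosen target: add
# positive terms until we overshoot, then negative terms, and so on.
def rearranged_sum(target, n_terms=100000):
    pos, neg = 1, 2        # next unused odd (+) and even (-) denominator
    total = 0.0
    for _ in range(n_terms):
        if total <= target:
            total += 1.0 / pos
            pos += 2
        else:
            total -= 1.0 / neg
            neg += 2
    return total

assert abs(sum((-1) ** (n + 1) / n for n in range(1, 100001)) - math.log(2)) < 1e-4
assert abs(rearranged_sum(1.0) - 1.0) < 1e-3   # same terms, different sum
```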
ch 3  basic topology of R
3.1 cantor set
 C has zero length, but its cardinality is uncountable
 discussion of dimensions  doubling length scales multiplies size by 2^dimension
 Cantor set has dimension log 2 / log 3 ≈ 0.631
3.2 open/closed sets
 A set O $\subseteq \mathbb{R}$ is open if for all points a $\in$ O there exists an $\epsilon$neighborhood $V_{\epsilon}(a) \subseteq O$

$V_{\epsilon}(a) = \{ x \in \mathbb{R} : |x-a| < \epsilon \}$
 the union of an arbitrary collection of open sets is open
 the intersection of a finite collection of open sets is open

 a point x is a limit point of a set A if every $\epsilon$neighborhood $V_{\epsilon}(x)$ of x intersects the set A at some point other than x
 a point x is a limit point of a set A if and only if x = lim $a_n$ for some sequence ($a_n$) contained in A satisfying $a_n \neq x$ for all n $\in$ N
 isolated point  not a limit point
 set $F \subseteq \mathbb{R}$ closed  contains all limit points
 closed iff every Cauchy sequence contained in F has a limit that is also an element of F
 density of Q in R  for every $y \in \mathbb{R}$, there exists a sequence of rational numbers that converges to y
 closure  set with its limit points
 closure $\bar{A}$ is smallest closed set containing A
 a set is open iff its complement is closed
 R and $\emptyset$ are both open and closed
 the union of a finite collection of closed sets is closed
 the intersection of an arbitrary collection of closed sets is closed
3.3
 a set K $\subseteq \mathbb{R}$ is compact if every sequence in K has a subsequence that converges to a limit that is also in K
 Nested Compact Set Property  intersection of nested sequence of nonempty compact sets is not empty
 let A $\subseteq \mathbb{R}$. open cover for A is a (possibly infinite) collection of open sets whose union contains the set A.
 given an open cover for A, a finite subcover is a finite subcollection of open sets from the original open cover whose union still manages to completely contain A
 HeineBorel thm  let K $\subseteq \mathbb{R}$. All of the following are equivalent
 K is compact
 K is closed and bounded
 every open cover for K has a finite subcover
ch 4  functional limits and continuity
4.1
 dirichlet function: 1 if x $\in \mathbb{Q}$, 0 otherwise
4.2 functional limits

def 1. Let f:$A \to \mathbb{R}$, and let c be a limit point of the domain A. We say that $\lim_{x \to c} f(x) = L$ provided that for all $\epsilon$ > 0, there exists a $\delta$ > 0 s.t. whenever $0 < |x-c| < \delta$ (and x $\in$ A) it follows that $|f(x)-L| < \epsilon$
 def 2. Let f:$A \to \mathbb{R}$, and let c be a limit point of the domain A. We say that $\lim_{x \to c} f(x) = L$ provided that for every $\epsilon$-neighborhood $V_{\epsilon}(L)$ of L, there exists a $\delta$-neighborhood $V_{\delta}(c)$ around c with the property that for all x $\in V_{\delta}(c)$ different from c (with x $\in$ A) it follows that f(x) $\in V_{\epsilon}(L)$.
 sequential criterion for functional limits  Given function f:$A \to R$ and a limit point c of A, the following 2 statements are equivalent:
 $lim_{x \to c} f(x) = L$
 for all sequences $(x_n) \subseteq$ A satisfying $x_n \neq$ c and $(x_n) \to c$, it follows that $f(x_n) \to L$.
 algebraic limit thm for functional limits
 divergence criterion for functional limits
4.3 continuous functions

a function f:$A \to \mathbb{R}$ is continuous at a point c $\in$ A if, for all $\epsilon$>0, there exists a $\delta$>0 such that whenever $|x-c|<\delta$ (and x$\in$ A) it follows that $|f(x)-f(c)|<\epsilon$. f is continuous if it is continuous at every point in the domain A
 characterizations of continuity
 criterion for discontinuity
 algebraic continuity theorem
 if f is continuous at c and g is continuous at f(c) then g $\circ$ f is continuous at c
4.4 continuous functions on compact sets
 preservation of compact sets  if f continuous and K compact, then f(K) is compact as well
 extreme value theorem  if f is continuous on a compact set K, then f attains a maximum and minimum value. In other words, there exist $x_0,x_1 \in K$ such that $f(x_0) \leq f(x) \leq f(x_1)$ for all x $\in$ K

f is uniformly continuous on A if for every $\epsilon$>0, there exists a $\delta$>0 such that for all x,y $\in$ A, $|x-y| < \delta \implies |f(x)-f(y)| < \epsilon$
a function f fails to be uniformly continuous on A iff there exists a particular $\epsilon_0 > 0$ and two sequences $(x_n),(y_n)$ in A satisfying $|x_n - y_n| \to 0$ but $|f(x_n)-f(y_n)| \geq \epsilon_0$

 a function that is continuous on a compact set K is uniformly continuous on K
4.5 intermediate value theorem
 intermediate value theorem  Let f:[a,b]$\to \mathbb{R}$ be continuous. If L is a real number satisfying f(a) < L < f(b) or f(a) > L > f(b), then there exists a point c $\in (a,b)$ where f(c) = L
 a function f has the intermediate value property on an interval [a,b] if for all x < y in [a,b] and all L between f(x) and f(y), it is always possible to find a point c $\in (x,y)$ where f(c)=L.
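The IVT has a constructive flavor via bisection: keep the half-interval where the sign change survives. A minimal sketch (the test function and tolerance are arbitrary examples):

```python
# Bisection: f continuous with f(a) < 0 < f(b) must cross zero in
# (a,b); halving the bracketing interval converges to such a point.
def bisect(f, a, b, tol=1e-10):
    assert f(a) < 0 < f(b)
    while b - a > tol:
        mid = (a + b) / 2
        if f(mid) < 0:
            a = mid
        else:
            b = mid
    return (a + b) / 2

root = bisect(lambda x: x * x - 2, 0.0, 2.0)   # locates sqrt(2)
assert abs(root - 2**0.5) < 1e-9
```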
ch 5  the derivative
5.2 derivatives and the intermediate value property
 let g: $A \to \mathbb{R}$ be a function defined on an interval A. Given c $\in$ A, the derivative of g at c is defined by g’(c) = $\lim_{x \to c} \frac{g(x) - g(c)}{x-c}$, provided this limit exists. Then g is differentiable at c. If g’ exists for all points in A, we say g is differentiable on A
 identity: $x^n-c^n = (x-c)(x^{n-1}+cx^{n-2}+c^2x^{n-3}+…+c^{n-1})$
 differentiable $\implies$ continuous
 algebraic differentiability theorem
 adding
 scalar multiplying
 product rule
 quotient rule
 chain rule: let f:$A \to \mathbb{R}$ and g:$B \to \mathbb{R}$ satisfy f(A)$\subseteq$ B so that the composition g $\circ$ f is defined. If f is differentiable at c in A and g differentiable at f(c) in B, then g $\circ$ f is differentiable at c with (g$\circ$f)’(c) = g’(f(c)) $\cdot$ f’(c)
 interior extremum thm  let f be differentiable on an open interval (a,b). If f attains a maximum or minimum value at some point c $\in$ (a,b), then f’(c) = 0.
 Darboux’s thm  if f is differentiable on an interval [a,b], and $\alpha$ satisfies f’(a) < $\alpha$ < f’(b) (or f’(a) > $\alpha$ > f’(b)), then there exists a point c $\in (a,b)$ where f’(c) = $\alpha$
 derivative satisfies intermediate value property
5.3 mean value theorems
 mean value theorem  if f:[a,b] $\to \mathbb{R}$ is continuous on [a,b] and differentiable on (a,b), then there exists a point c $\in$ (a,b) where $f’(c) = \frac{f(b)-f(a)}{b-a}$
 Rolle’s thm  if additionally f(a)=f(b), then f’(c)=0 for some c $\in$ (a,b)
 if f’(x) = 0 for all x in A, then f(x) = k for some constant k
 if f and g are differentiable functions on an interval A and satisfy f’(x) = g’(x) for all x $\in$ A, then f(x) = g(x) + k for some constant k

generalized mean value theorem  if f and g are continuous on the closed interval [a,b] and differentiable on the open interval (a,b), then there exists a point c $\in (a,b)$ where $[f(b)-f(a)] g’(c) = [g(b)-g(a)] f’(c)$. If g’ is never 0 on (a,b), then can be restated $\frac{f’(c)}{g’(c)} = \frac{f(b)-f(a)}{g(b)-g(a)}$
given g: $A \to \mathbb{R}$ and a limit point c of A, we say that $\lim_{x \to c} g(x) = \infty$ if, for every M > 0, there exists a $\delta$ > 0 such that whenever $0 < |x-c| < \delta$ it follows that g(x) ≥ M
 L’Hospital’s Rule: 0/0  let f and g be continuous on an interval containing a, and assume f and g are differentiable on this interval with the possible exception of the point a. If f(a) = g(a) = 0 and g’(x) ≠ 0 for all x ≠ a, then $\lim_{x \to a} \frac{f’(x)}{g’(x)} = L \implies \lim_{x \to a} \frac{f(x)}{g(x)} = L$
 L’Hospital’s Rule: $\infty / \infty$  assume f and g are differentiable on (a,b) and g’(x) ≠ 0 for all x in (a,b). If $\lim_{x \to a} g(x) = \infty$, then $\lim_{x \to a} \frac{f’(x)}{g’(x)} = L \implies \lim_{x \to a} \frac{f(x)}{g(x)} = L$
ch 6  sequences and series of function
6.2 uniform convergence of a sequence of functions
 for each n $\in \mathbb{N}$ let $f_n$ be a function defined on a set A$\subseteq R$. The sequence ($f_n$) of functions converges pointwise on A to a function f if, for all x in A, the sequence of real numbers $f_n(x)$ converges to f(x)

let ($f_n$) be a sequence of functions defined on a set A$\subseteq \mathbb{R}$. Then ($f_n$) converges uniformly on A to a limit function f defined on A if, for every $\epsilon$>0, there exists an N in $\mathbb{N}$ such that $\forall n ≥N, x \in A$, $|f_n(x)-f(x)| <\epsilon$
Cauchy Criterion for uniform convergence  a sequence of functions $(f_n)$ defined on a set A $\subseteq \mathbb{R}$ converges uniformly on A iff $\forall \epsilon > 0 \exists N \in \mathbb{N}$ s.t. whenever m,n ≥N and x in A, $|f_n(x)-f_m(x)| <\epsilon$

 continuous limit thm  Let ($f_n$) be a sequence of functions defined on A that converges uniformly on A to a function f. If each $f_n$ is continuous at c in A, then f is continuous at c
6.3 uniform convergence and differentiation
 differentiable limit theorem  let $f_n \to f$ pointwise on the closed interval [a,b], and assume that each $f_n$ is differentiable. If $(f’_n)$ converges uniformly on [a,b] to a function g, then the function f is differentiable and f’=g
 let ($f_n$) be a sequence of differentiable functions defined on the closed interval [a,b], and assume $(f’_n)$ converges uniformly to a function g on [a,b]. If there exists a point $x_0 \in [a,b]$ for which $f_n(x_0)$ is convergent, then ($f_n$) converges uniformly. Moreover, the limit function f = lim $f_n$ is differentiable and satisfies f’ = g
6.4 series of functions
 term-by-term continuity thm  let $f_n$ be continuous functions defined on a set A $\subseteq \mathbb{R}$ and assume $\sum f_n$ converges uniformly on A to a function f. Then, f is continuous on A.
 term-by-term differentiability thm  let $f_n$ be differentiable functions defined on an interval A, and assume $\sum f’_n(x)$ converges uniformly to a limit g(x) on A. If there exists a point $x_0 \in A$ where $\sum f_n(x_0)$ converges, then the series $\sum f_n(x)$ converges uniformly to a differentiable function f(x) satisfying f’(x) = g(x) on A. In other words, $f(x) = \sum f_n(x)$ and $f’(x) = \sum f’_n(x)$

Cauchy Criterion for uniform convergence of series  A series $\sum f_n$ converges uniformly on A iff $\forall \epsilon > 0 \exists N \in \mathbb{N}$ s.t. whenever n>m≥N and x in A, $|f_{m+1}(x) + f_{m+2}(x) + f_{m+3}(x) + …+f_n(x)| < \epsilon$
Weierstrass M-Test  For each n in $\mathbb{N}$, let $f_n$ be a function defined on a set A $\subseteq \mathbb{R}$, and let $M_n > 0$ be a real number satisfying $|f_n(x)| ≤ M_n$ for all x in A. If $\sum M_n$ converges, then $\sum f_n$ converges uniformly on A
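A numeric illustration of the M-test with an assumed example (not from the notes): $f_n(x) = \sin(nx)/n^2$, so $M_n = 1/n^2$ works and the tails shrink uniformly in x:

```python
import math

# Tail of sum sin(kx)/k^2 past k = 100 is bounded by sum_{k>100} 1/k^2
# < 1/100, for every x at once; check that on a grid of x values.
xs = [i * 0.01 for i in range(-300, 301)]

def tail(x, m, n):
    """Partial tail sum_{k=m+1}^{n} sin(k x) / k^2."""
    return sum(math.sin(k * x) / k**2 for k in range(m + 1, n + 1))

worst = max(abs(tail(x, 100, 1000)) for x in xs)
assert worst < 1.0 / 100    # uniform bound, independent of x
```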

6.5 power series
 power series f(x) = $\sum_{n=0}^\infty a_n x^n = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + …$

if a power series converges at some point $x_0 \in \mathbb{R}$, then it converges absolutely for any x satisfying $|x| < |x_0|$
 if a power series converges pointwise on the set A, then it converges uniformly on any compact set K $\subseteq$ A

if a power series converges absolutely at a point $x_0$, then it converges uniformly on the closed interval [-c,c], where c = $|x_0|$
 Abel’s thm  if a power series converges at the point x = R > 0, then the series converges uniformly on the interval [0,R]. A similar result holds if the series converges at x = -R

 if $\sum_{n=0}^\infty a_n x^n$ converges for all x in (-R,R), then the differentiated series $\sum_{n=1}^\infty n a_n x^{n-1}$ converges at each x in (-R,R) as well. Consequently the convergence is uniform on compact sets contained in (-R,R).
 can differentiate term by term infinitely many times
6.6 taylor series
 Taylor’s Formula $\sum_{n=0}^\infty a_n x^n = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + …$
 centered at 0: $a_n = \frac{f^{(n)}(0)}{n!}$
 Lagrange’s Remainder thm  Let f be differentiable N+1 times on (-R,R), define $a_n = \frac{f^{(n)}(0)}{n!}…..$
 not every infinitely differentiable function can be represented by its Taylor series (radius of convergence zero)
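A quick remainder check with the assumed example f(x) = eˣ centered at 0: every derivative is eˣ, bounded by e on [0,1], so the Lagrange bound is $e\,|x|^{N+1}/(N+1)!$:

```python
import math

# Compare the true truncation error of the degree-N Taylor polynomial
# of e^x against the Lagrange remainder bound.
def taylor_exp(x, N):
    return sum(x**n / math.factorial(n) for n in range(N + 1))

x, N = 1.0, 8
error = abs(math.exp(x) - taylor_exp(x, N))
bound = math.e * abs(x) ** (N + 1) / math.factorial(N + 1)
assert error <= bound    # Lagrange bound holds
assert error < 1e-5      # degree 8 already very accurate at x = 1
```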
ch 7  the Riemann Integral
7.2 def of Riemann integral
 partition of [a,b] is a finite set of points from [a,b] that includes both a and b
 lower sum L(f,P)  $\sum_k m_k (x_k - x_{k-1})$ where $m_k = \inf\{f(x) : x \in [x_{k-1},x_k]\}$ (sum of the smallest rectangles)
 a partition Q is a refinement of a partition P if $P \subseteq Q$
 if $P \subseteq Q$, then L(f,P)≤L(f,Q) and U(f,P)≥U(f,Q)
 a bounded function f on the interval [a,b] is Riemann-integrable if U(f) = L(f), and then $\int_a^b f$ denotes this common value
 iff $\forall \epsilon >0$, there exists a partition P of [a,b] such that $U(f,P)L(f,P)<\epsilon$
 U(f) = inf{U(f,P)} for all possible partitions P
 if f is continuous on [a,b] then it is integrable
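The upper/lower-sum definition can be tried numerically with an assumed example, f(x) = x² on [0,1] (f is increasing there, so the inf/sup on each subinterval are the endpoint values):

```python
# Lower and upper sums over a uniform partition; for monotone f,
# U(f,P) - L(f,P) = (f(b) - f(a)) * h, which shrinks under refinement.
def lower_upper(f, a, b, n):
    h = (b - a) / n
    pts = [a + i * h for i in range(n + 1)]
    lower = sum(min(f(pts[i]), f(pts[i + 1])) * h for i in range(n))
    upper = sum(max(f(pts[i]), f(pts[i + 1])) * h for i in range(n))
    return lower, upper

L, U = lower_upper(lambda x: x * x, 0.0, 1.0, 10000)
assert L <= 1.0 / 3.0 <= U     # the integral of x^2 on [0,1] is 1/3
assert U - L < 1e-3            # the gap closes as the partition refines
```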
7.3 integrating functions with discontinuities
 if f:[a,b] $\to \mathbb{R}$ is bounded and f is integrable on [c,b] for all c in (a,b), then f is integrable on [a,b]
7.4 properties of Integral
 assume f: [a,b] $\to \mathbb{R}$ is bounded and let c in (a,b). Then, f is integrable on [a,b] iff f is integrable on [a,c] and [c,b]. In this case we have $\int_a^b f = \int_a^c f + \int_c^b f$.
 integrable limit thm  Assume that $f_n \to f$ uniformly on [a,b] and that each $f_n$ is integrable. Then, f is integrable and $\lim_{n \to \infty} \int_a^b f_n = \int_a^b f$.
7.5 fundamental theorem of calculus
 If f:[a,b] > R is integrable, and F:[a,b]>R satisfies F’(x) = f(x) for all x $\in$ [a,b], then $\int_a^b f = F(b)  F(a)$
 Let g: [a,b] $\to \mathbb{R}$ be integrable and for x $\in$ [a,b] define G(x) = $\int_a^x g$. Then G is continuous on [a,b]. If g is continuous at some point $c \in [a,b]$ then G is differentiable at c and G’(c) = g(c).
overview
 convergence
 sequences
 series
 functional limits
 normal, uniform
 sequence of funcs
 pointwise, uniform
 series of funcs
 pointwise, uniform
 integrability
 sequential criterion  usually good for proving discontinuity
 limit points
 functional limits
 continuity
 absence of uniform continuity
 algebraic limit theorem ~ scalar multiplication, addition, multiplication, division
 limit thm
 sequences
 series  can’t multiply / divide these
 functional limits
 continuity
 differentiability
 ~integrability~
 limit thms
 continuous limit thm  Let ($f_n$) be a sequence of functions defined on A that converges uniformly on A to a function f. If each $f_n$ is continuous at c in A, then f is continuous at c
 differentiable limit theorem  let $f_n \to f$ pointwise on the closed interval [a,b], and assume that each $f_n$ is differentiable. If $(f’_n)$ converges uniformly on [a,b] to a function g, then the function f is differentiable and f’=g
 convergent derivatives almost proves that $f_n \to f$
 let ($f_n$) be a sequence of differentiable functions defined on the closed interval [a,b], and assume $(f’_n)$ converges uniformly to a function g on [a,b]. If there exists a point $x_0 \in [a,b]$ for which $f_n(x_0)$ converges, then ($f_n$) converges uniformly
 integrable limit thm  Assume that $f_n \to f$ uniformly on [a,b] and that each $f_n$ is integrable. Then, f is integrable and $\lim_{n \to \infty} \int_a^b f_n = \int_a^b f$.
 functions are continuous at isolated points, but limits don’t exist there

uniform continuity: minimize $|f(x)-f(y)|$
 derivative doesn’t have to be continuous
 integrable if finitely many discontinuities and bounded

brain basics
 Cerebrum  The cerebrum is the largest portion of the brain and is responsible for most of the brain’s function. It is divided into four lobes: the temporal lobe, the occipital lobe, the parietal lobe and the frontal lobe. The cerebrum is divided into a right and left hemisphere which are connected by axons that relay messages from one to the other. Its nerve cells carry signals between the brain and the nerve cells which run through the body.
 Frontal Lobe  The frontal lobe is one of four lobes in the cerebral hemisphere. This lobe controls several elements including creative thought, problem solving, intellect, judgment, behavior, attention, abstract thinking, physical reactions, muscle movements, coordinated movements, smell and personality.
 Parietal Lobe Located in the cerebral hemisphere, this lobe focuses on comprehension. Visual functions, language, reading, internal stimuli, tactile sensation and sensory comprehension will be monitored here.
 Sensory Cortex  The sensory cortex, located in the front portion of the parietal lobe, receives information relayed from the spinal cord regarding the position of various body parts and how they are moving. This middle area of the brain can also be used to relay information from the sense of touch, including pain or pressure which is affecting different portions of the body.
 Motor Cortex  This helps the brain monitor and control movement throughout the body. It is located in the top, middle portion of the brain.
 Temporal Lobe: The temporal lobe controls visual and auditory memories. It includes areas that help manage some speech and hearing capabilities, behavioral elements, and language. It is located in the cerebral hemisphere.
 Wernicke’s Area This portion of the temporal lobe is formed around the auditory cortex. While scientists have a limited understanding of the function of this area, it is known that it helps the body formulate or understand speech.
 Occipital Lobe: The occipital lobe is located in the cerebral hemisphere in the back of the head. It helps to control vision.
 Broca’s Area  This area of the brain controls the facial neurons as well as the production of speech and language. It is located in the triangular and opercular sections of the inferior frontal gyrus.
 Cerebellum
 This is commonly referred to as “the little brain,” and is considered to be older than the cerebrum on the evolutionary scale. The cerebellum controls essential body functions such as balance, posture and coordination, allowing humans to move properly and maintain their structure.
Limbic System
 The limbic system contains glands which help relay emotions. Many hormonal responses that the body generates are initiated in this area. The limbic system includes the amygdala, hippocampus, hypothalamus and thalamus.
 Amygdala: The amygdala helps the body respond to emotions, memories and fear. It is a large portion of the telencephalon, located within the temporal lobe, which can be seen from the surface of the brain. This visible bulge is known as the uncus.
 Hippocampus: This portion of the brain is used for learning and memory, specifically converting temporary memories into permanent memories which can be stored within the brain. The hippocampus also helps people analyze and remember spatial relationships, allowing for accurate movements. This portion of the brain is located in the cerebral hemisphere.
 Hypothalamus:The hypothalamus region of the brain controls mood, thirst, hunger and temperature. It also contains glands which control the hormonal processes throughout the body.
 Thalamus: The thalamus is located in the center of the brain. It helps control attention span and the sensing of pain, and monitors input moving in and out of the brain to keep track of the sensations the body is feeling.
Brain Stem
 All basic life functions originate in the brain stem, including heartbeat, blood pressure and breathing. In humans, this area contains the medulla, midbrain and pons. This is commonly referred to as the simplest part of the brain, as most creatures on the evolutionary scale have some form of brain structure that resembles the brain stem. The brain stem consists of the midbrain, pons and medulla.
 Midbrain:The midbrain, also known as the mesencephalon is made up of the tegmentum and tectum. These parts of the brain help regulate body movement, vision and hearing. The anterior portion of the midbrain contains the cerebral peduncle which contains the axons that transfer messages from the cerebral cortex down the brain stem, which allows voluntary motor function to take place.
 Pons: This portion of the metencephalon is located in the hindbrain, and links to the cerebellum to help with posture and movement. It interprets information that is used in sensory analysis or motor control. The pons also creates the level of consciousness necessary for sleep.
 Medulla: The medulla or medulla oblongata is an essential portion of the brain stem which maintains vital body functions such as the heart rate and breathing.

Computational Neuroscience
[toc]
 1 introduction
 2  neural encoding
 3 neural decoding
 4  information theory
 5  computing in carbon
 6  computing with networks
 7  networks that learn: plasticity in the brain & learning
 8 
 ml analogies
1 introduction
1.1  overview
 three types
 descriptive brain model  encode / decode external stimuli
 mechanistic brain cell / network model  simulate the behavior of a single neuron / network
 interpretive (or normative) brain model  why do brain circuits operate how they do
1.2  descriptive
 receptive field  the things that make a neuron fire
1.3  mechanistic and interpretive
 retina has on-center / off-surround cells  stimulated by points
 then, V1 has differently shaped receptive fields
 efficient coding hypothesis  learns different combinations (e.g. lines) that can efficiently represent images
 sparse coding (Olshausen and Field 1996)
 ICA (Bell and Sejnowski 1997)
 Predictive Coding (Rao and Ballard 1999)
 brain is trying to learn faithful and efficient representations of an animal’s natural environment
 same goes for auditory cortex
2  neural encoding
2.1  defining neural code
 extracellular
 fMRI
 averaged over space
 slow, requires seconds
 EEG
 noisy
 averaged, but faster
 multielectrode array
 record from several individual neurons at once
 calcium imaging
 cells have calcium indicator that fluoresce when calcium enters a cell
 intracellular  can use patch electrodes
 raster plot
 replay a movie many times and record from retinal ganglion cells during movie

encoding: P(response | stimulus)
 tuning curve  neuron’s response (ex. firing rate) as a function of stimulus
 orientation / color selective cells are distributed in organized fashion
 some neurons fire to a concept, like “Pamela Anderson”
 retina (simple) > V1 (orientations) > V4 (combinations) > ?
 also massive feedback

decoding: P(stimulus | response)
2.2  simple encoding

want P(response | stimulus)
 response := firing rate r(t)
 stimulus := s
 simple linear model
 r(t) = c * s(t)
 weighted linear model  takes into account previous states weighted by f
 temporal filtering
 r(t) = $f_0 \cdot s_0 + … + f_t \cdot s_t = \sum_k s_{t-k} f_k$ where f weights stimulus over time
 could also make this an integral, yielding a convolution:
 r(t) = $\int_{-\infty}^t s(t-\tau) f(\tau) \, d\tau$
 a linear system can be thought of as a system that searches for portions of the signal that resemble its filter f
 leaky integrator  sums its inputs with f decaying exponentially into the past
 flaws
 no negative firing rates
 no extremely high firing rates
 adding a nonlinear function g of the linear sum can fix this
 r(t) = $g\left(\int_{-\infty}^t s(t-\tau) f(\tau) \, d\tau\right)$
 spatial filtering
 r(x,y) = $\sum_{x’,y’} s_{x-x’,y-y’} f_{x’,y’}$ where f again is spatial weights that represent the spatial field
 could also write this as a convolution
 for a retinal center surround cell, f is positive for small $\Delta x$ and then negative for large $\Delta x$
 can be calculated as a narrow, large positive Gaussian + spread out negative Gaussian
 can combine above to make spatiotemporal filtering
 filtering = convolution = projection
 temporal filtering
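The linear-nonlinear rate model above can be sketched in a few lines; the exponential (leaky-integrator) filter shape and the rectifying g are assumed choices, not the course's specific parameters:

```python
import math
import random

# Linear filtering of a white-noise stimulus followed by a
# nonlinearity g that keeps firing rates nonnegative.
random.seed(0)
stimulus = [random.gauss(0, 1) for _ in range(500)]
f = [math.exp(-k / 5.0) for k in range(30)]        # filter weights f_k

def rate(t):
    # r(t) = g( sum_k f_k * s_{t-k} ), summing over the recent past
    linear = sum(f[k] * stimulus[t - k] for k in range(len(f)) if t - k >= 0)
    return max(0.0, linear)                        # g: half-wave rectification

rates = [rate(t) for t in range(len(stimulus))]
assert all(r >= 0 for r in rates)                  # no negative firing rates
```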
2.3  feature selection

P(response | stimulus) is very hard to get
 stimulus can be high-dimensional (e.g. video)
 stimulus can take on many values
 need to keep track of stimulus over time

solution: sample P(response | s) for many stimuli to characterize what input triggers responses
 find vector f that captures features that lead to spike
 dimensionality reduction  ex. discretize
 value at each time $t_i$ is new dimension
 commonly use Gaussian white noise
 time step sets cutoff of highest frequency present
 prior distribution  distribution of stimulus
 multivariate Gaussian  Gaussian in any dimension, or any linear combination of dimensions
 look at where spike-triggering points are and calculate spike-triggered average f of features that led to spike
 use this f as filter
 determining the nonlinear input/output function g

replace stimulus in P(spike | stimulus) with P(spike | $s_1$), where $s_1$ is our filtered stimulus
use bayes rule: $g = P(spike | s_1) = \frac{P(s_1 | spike)P(spike)}{P(s_1)}$
if $P(s_1 | spike) \approx P(s_1)$ then response doesn’t seem to have to do with stimulus


 incorporating many features $f_1,…,f_n$
 here, each $f_i$ is a vector of weights
 $r(t) = g(f_1\cdot s,f_2 \cdot s,…,f_n \cdot s)$
 could use PCA  discovers low-dimensional structure in high-dimensional data
 each f represents a feature (maybe a curve over time) that fires the neuron
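The spike-triggered average idea can be demonstrated with a toy neuron whose true filter is known; the filter shape, threshold, and sample count below are all assumptions for illustration:

```python
import random

# Drive a threshold neuron with Gaussian white noise, then average
# the stimulus windows that preceded each spike (the STA). For white
# noise and this model, the STA is proportional to the true filter.
random.seed(1)
true_f = [0.0, 0.2, 0.5, 1.0, 0.5, 0.2]           # assumed temporal filter
D = len(true_f)
s = [random.gauss(0, 1) for _ in range(200000)]

spikes = [t for t in range(D, len(s))
          if sum(true_f[k] * s[t - k] for k in range(D)) > 2.0]

sta = [sum(s[t - k] for t in spikes) / len(spikes) for k in range(D)]
# the STA's peak lag should match the filter's peak lag (k = 3)
assert max(range(D), key=lambda k: sta[k]) == 3
```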
2.4  variability
 hidden assumptions about time-varying firing rate and single spikes
 smooth firing-rate function r(t) can miss some stimuli

statistics of stimulus can affect P(spike | stimulus)
 Gaussian white noise is nice because no way to filter it to get structure
 identifying good filter

want $P(s_f | spike)$ to differ from $P(s_f)$ where $s_f$ is calculated via the filter
 instead of PCA, could look for f that directly maximizes this difference (Sharpee & Bialek, 2004)
 Kullback-Leibler divergence  calculates difference between 2 distributions
 $D_{KL}(P(s) \| Q(s)) = \int ds \, P(s) \log_2 \frac{P(s)}{Q(s)}$
 maximizing KL divergence is equivalent to maximizing mutual info between spike and stimulus
 this is because we are looking for most informative feature
 this technique doesn’t require that our stimulus is white noise, so can use natural stimuli
 maximization isn’t guaranteed to uniquely converge

 modeling the noise
 need to go from r(t) $\to$ spike times
 divide time T into n bins with p = probability of firing per bin
 over some chunk T, number of spikes follows binomial distribution (n, p)
 mean = np
 var = np(1p)
 if n gets very large, binomial approximates Poisson
 $\lambda$ = spikes in some set time
 mean = $\lambda$
 var = $\lambda$
 can test if distribution is Poisson: Fano factor = var/mean = 1
 interspike intervals have exponential distribution
 if the neuron fires a lot, this can be a bad assumption (due to refractory period)
 generalized linear model adds explicit spikegeneration / postspike filter (Pillow et al. 2008)
 postspike filter models refractory period
 Paninski showed that using exponential nonlinearity allows this to be optimized
 could add in firing of other neurons
 timerescaling theorem  tests how well we have captured influences on spiking (Brown et al 2001)
 rescaled ISIs $\int_{t_{i-1}}^{t_i} r(t) \, dt$ should be exponentially distributed
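The Poisson model above can be simulated by thinning small time bins; the rate, bin size, and trial count are arbitrary choices for illustration:

```python
import random

# Homogeneous Poisson spiking: each bin of width dt fires with
# probability p = rate * dt; check the Poisson signatures quoted
# above (mean count ~ lambda, Fano factor ~ 1).
random.seed(2)
rate, dt, T, trials = 20.0, 0.001, 1.0, 2000
p = rate * dt
n_bins = int(T / dt)

counts = [sum(1 for _ in range(n_bins) if random.random() < p)
          for _ in range(trials)]
mean = sum(counts) / trials
var = sum((c - mean) ** 2 for c in counts) / trials
fano = var / mean

assert abs(mean - rate * T) < 1.0    # mean count ~ lambda = rate * T
assert 0.8 < fano < 1.2              # Fano factor near 1 for Poisson
```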
3 neural decoding
3.1  neural decoding and signal detection

decoding: P(stimulus | response)
 ex. you hear noise and want to tell what it is
 here r = response = firing rate
 monkey is trained to move eyes in same direction as dot pattern (Britten et al. 92)
 when dots all move in same direction (100% coherence), easy
 neuron recorded in MT  tracks dots
 count firing rate when monkey tracks in right direction
 count firing rate when monkey tracks in wrong direction
 as coherence decreases, these firing rates blur

need to get P(+|r) and P(-|r)

P(r|+) and P(r|-) are likelihoods

 Neyman-Pearson lemma  likelihood ratio test is the most efficient statistic, in that it has the most power for a given size

$\frac{p(r|+)}{p(r|-)} > 1?$

 can set a threshold on r by maximizing likelihood
 accumulated evidence  we can accumulate evidence over time by multiplying these probabilities
 instead we sum the logs, and compare to 0

$\sum_i \ln \frac{p(r_i|+)}{p(r_i|-)} > 0?$
 once we hit some threshold for this sum, we can make a decision + or -
 experimental evidence (Kiani, Hanks, & Shadlen, Nat. Neurosci 2006)
 monkey is making decision about whether dots are moving left/right
 neuron firing rates increase over time, representing integrated evidence
 neuron always seems to stop at same firing rate
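The accumulate-to-threshold idea can be sketched with a toy version of the test above; the two Gaussian response distributions and the threshold are assumptions, not the experiment's values:

```python
import random

# Each observation r adds ln p(r|+)/p(r|-) to a running sum; decide
# '+' or '-' when the sum crosses a fixed threshold. With r drawn
# from N(+1,1) under '+' and N(-1,1) under '-', the log-likelihood
# ratio simplifies algebraically to 2r per observation.
random.seed(3)

def decide(true_mean, threshold=5.0):
    total = 0.0
    while abs(total) < threshold:
        r = random.gauss(true_mean, 1.0)
        total += 2.0 * r                 # ln p(r|+)/p(r|-) = 2r here
    return '+' if total > 0 else '-'

results = [decide(+1.0) for _ in range(200)]
assert results.count('+') > 190          # decisions are almost always correct
```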
 priors  ex. tiger is much less likely than breeze

scale P(+|r) by prior P(+)
neuroscience ex. photoreceptor cells: P(noise|r) is much larger than P(signal|r)
 therefore threshold on r is high to minimize total mistakes

 cost of acting/not acting

loss for predicting + when it is -: Loss$_+$ = $L_+ \cdot P[-|r]$
loss for predicting - when it is +: Loss$_-$ = $L_- \cdot P[+|r]$
 cut your losses: answer + when average Loss$_+$ < Loss$_-$

i.e. $L_+ \cdot P[-|r] < L_- \cdot P[+|r]$

 rewriting with Bayes’ rule yields new test:

$\frac{p(r|+)}{p(r|-)} > \frac{L_+ \cdot P[-]}{L_- \cdot P[+]}$
 here the loss terms replace the 1 in the Neyman-Pearson test


3.2  population coding and bayesian estimation
 population vector  sums vectors for cells that point in different directions weighted by their firing rates
 ex. cricket cercal cells sense wind in different directions
 since neuron can’t have negative firing rate, need overcomplete basis so that can record wind in both directions along an axis
 can do the same thing for direction of arm movement in a neural prosthesis
 not general  some neurons aren’t tuned, are noisier
 not optimal  making use of all information in the stimulus/response distributions
 bayesian inference

$p(s|r) = \frac{p(r|s)p(s)}{p(r)}$
maximum likelihood: s* which maximizes p(r|s)
MAP = maximum a posteriori: s* which maximizes p(s|r)

 simple continuous stimulus example
 setup
 s  orientation of an edge
 each neuron’s average firing rate=tuning curve $f_a(s)$ is Gaussian (in s)
 let $r_a$ be number of spikes for neuron a
 assume receptive fields of neurons span s: $\sum_a f_a(s)$ is const
 solving
 maximizing loglikelihood with respect to s  take derivative and set to 0
 soln $s^* = \frac{\sum r_a s_a / \sigma_a^2}{\sum r_a / \sigma_a^2}$
 if all the $\sigma$ are same, $s^* = \frac{\sum r_a s_a}{\sum r_a}$
 this is the population vector
 maximum a posteriori

$\ln p(s|r) = \ln P(r|s) + \ln p(s) - \ln P(r)$
 $s^* = \frac{T \sum_a r_a s_a / \sigma_a^2 + s_{prior} / \sigma_{prior}^2}{T \sum_a r_a / \sigma_a^2 + 1/\sigma_{prior}^2}$
 this takes into account the prior
 narrow prior makes it matter more

 doesn’t incorporate correlations in the population
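The ML and MAP estimates can be checked numerically on toy values (spike counts, preferred orientations, tuning widths, and the prior below are all made up):

```python
import numpy as np

# Numeric check of the ML and MAP estimates; all values are illustrative.
s_a = np.array([10.0, 20.0, 30.0, 40.0])   # preferred stimulus of each neuron
sigma_a = np.full(4, 5.0)                  # tuning-curve widths
r = np.array([1.0, 4.0, 2.0, 0.0])         # observed spike counts

# maximum likelihood: variance-weighted population average
s_ml = (r * s_a / sigma_a**2).sum() / (r / sigma_a**2).sum()

# MAP: same terms plus a Gaussian prior contribution
T, s_prior, sigma_prior = 1.0, 25.0, 10.0
s_map = ((T * (r * s_a / sigma_a**2).sum() + s_prior / sigma_prior**2)
         / (T * (r / sigma_a**2).sum() + 1.0 / sigma_prior**2))
```

Here the prior mean sits above the ML estimate, so the MAP estimate is pulled slightly toward it; a narrower prior would pull harder.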
3.3  stimulus reconstruction (not sure about this)
 decoding s > $s^*$
 want an estimator $s_{Bayes}=s_B$ given some response r
 error function $L(s,s_{B})=(ss_{B})^2$

 minimize $\int ds \: L(s,s_{B}) \: p(s|r)$ by taking derivative with respect to $s_B$
 $s_B = \int ds \: p(s|r) \: s$  the conditional mean (spike-triggered average)
 add in spike-triggered average at each spike
 if spike-triggered average looks exponential, can never have smooth downward stimulus
 could use 2 neurons (like in H1) and replay the second with negative sign
 LGN neurons can reconstruct a video, but with noise
 recreated 1 sec long movies  (Jack Gallant  Nishimoto et al. 2011, Current Biology)
 voxel-based encoding model samples ton of prior clips and predicts signal

 get $p(r|s)$
 pick best $p(r|s)$ by comparing predicted signal to actual signal  input is filtered to extract certain features
 filtered again to account for slow timescale of BOLD signal

 decoding

 maximize $p(s|r)$ by maximizing $p(r|s) p(s)$, and assume p(s) uniform  30 signals that have highest match to predicted signal are averaged
 yields pretty good pictures

4  information theory
4.1  information and entropy
 surprise for seeing a spike h(p) = $-\log_2 (p)$
 entropy = average information
 code might not align spikes with what we are encoding
 how much of the variability in r is encoding s
 define q as the error probability

 $P(r_+|s=+)=1-q$
 $P(r_-|s=+)=q$  similar for when s=−
 total entropy: $H(R) = -P(r_+) \log P(r_+) - P(r_-)\log P(r_-)$
 noise entropy: $H(R|S=+) = -q \log q - (1-q) \log (1-q)$
 mutual info $I(S;R) = H(R) - H(R|S)$ = total entropy − average noise entropy = $D_{KL} (P(R,S) \| P(R)P(S))$
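A short sketch computing I(S;R) for this binary channel, assuming P(s=+) = 1/2 and symmetric error probability q:

```python
import numpy as np

# I(S;R) = H(R) - H(R|S) for a binary channel with error probability q.
def bernoulli_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)   # avoid log(0)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def mutual_info(q, p_plus=0.5):
    p_rplus = p_plus * (1 - q) + (1 - p_plus) * q   # P(r_+)
    total_entropy = bernoulli_entropy(p_rplus)      # H(R)
    noise_entropy = bernoulli_entropy(q)            # H(R|S=+) = H(R|S=-)
    return total_entropy - noise_entropy
```

A perfect channel (q = 0) carries 1 bit per symbol; a useless one (q = 0.5) carries 0.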
 grandma’s famous mutual info recipe
 for each s

 $P(R|s)$  take one stimulus and repeat many times (or run for a long time)
 $H(R|s)$  noise entropy
 $H(R|S)=\sum_s P(s) H(R|s)$
 $H(R)$ calculated using $P(R) = \sum_s P(s) P(R|s)$
4.2 information in spike trains
 information in spike patterns
 divide pattern into time bins of 0 (no spike) and 1 (spike)
 binary words w with letter size $\Delta t$, length T (Reinagel & Reid 2000)
 can create histogram of each word
 can calculate entropy of word
 look at distribution of words for just one stimulus
 distribution should be narrower
 calculate $H_{noise}$  average over time with random stimuli and calculate entropy
 varied parameters of word: length of bin (dt) and length of word (T)
 there’s some limit to dt at which information stops increasing
 this represents temporal resolution at which jitter doesn’t stop response from identifying info about the stimulus
 corrections for finite sample size (Panzeri, Nemenman,…)
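A sketch of the word-entropy calculation on a synthetic spike train (the 10% spike probability per bin and the word length are arbitrary toy choices; real data would come from repeated stimulus presentations):

```python
import numpy as np

# Entropy of binary spike "words" (Reinagel & Reid style), on a toy train
# with independent 10%-probability bins.
rng = np.random.default_rng(0)
spikes = (rng.random(100_000) < 0.1).astype(int)

def word_entropy(train, L):
    """entropy (bits) of the distribution of length-L binary words"""
    n = len(train) // L
    words = train[:n * L].reshape(n, L)
    codes = words @ (2 ** np.arange(L))          # map each word to an integer
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

H8 = word_entropy(spikes, 8)   # at most 8 bits for 8-bin words
```

Noise entropy would be estimated the same way, but from the word distribution at a fixed time across repeated trials.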
 information in single spikes  how much info does single spike tell us about stimulus
 don’t have to know encoding, mutual info doesn’t care
 1. calculate entropy for random stimulus
 $p=\bar{r} \Delta t$ where $\bar{r}$ is the mean firing rate
 2. calculate entropy for specific stimulus
 let $P(r=1|s) = r(t) \Delta t$
 let $P(r=0|s) = 1 - r(t) \Delta t$
 get r(t) by having stimulus on for long time
 ergodicity  a time average is equivalent to averaging over the s ensemble
 info per spike $I(r,s) = \frac{1}{T} \int_0^T dt \frac{r(t)}{\bar{r}} \log_2 \frac{r(t)}{\bar{r}}$
 timing precision reduces r(t)
 low mean spike rate > high info per spike
 ex. rat runs through place field and only fires when it’s in place field
 spikes can be sharper, more / less frequent
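The info-per-spike integral can be sketched for a toy “place field” rate profile (all numbers made up); when the cell fires only in a fraction f of the trial, the integral works out to log2(1/f) bits per spike:

```python
import numpy as np

# Info per spike for a toy "place field": 20 Hz inside a 1 s field,
# essentially silent elsewhere during a 10 s trial.
dt = 0.001
t = np.arange(0, 10, dt)
r = np.where((t > 4) & (t < 5), 20.0, 1e-9)   # firing rate r(t) in Hz

def info_per_spike(r):
    """(1/T) * integral of (r/rbar) log2(r/rbar) dt, as a mean over bins"""
    x = r / r.mean()
    # x*log2(x) -> 0 as x -> 0; clip inside the log to avoid log(0)
    return (x * np.log2(np.maximum(x, 1e-300))).mean()

I = info_per_spike(r)   # fires in ~1/10 of the trial -> about log2(10) bits/spike
```

A constant firing rate gives zero information per spike, matching the note that low mean rate with sharp timing yields high info per spike.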
4.3 coding principles
 natural stimuli
 huge dynamic range  variations over many orders of magnitude (ex. brightness)
 power law scaling  structure at many scales (ex. far away things)
 efficient coding  in order to have maximum entropy output, a good encoder should match its outputs to the distribution of its inputs
 want to use each of our “symbols” (ex. different firing rates) equally often
 should assign equal areas of input stimulus PDF to each symbol
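A sketch of this equalization idea as an encoder design (the exponential stimulus distribution and 16 output levels are arbitrary toy choices):

```python
import numpy as np

# Efficient coding by histogram equalization: choose response boundaries so
# every output symbol is used equally often under the input distribution.
rng = np.random.default_rng(1)
stimulus = rng.exponential(scale=1.0, size=100_000)

def equalizing_boundaries(samples, n_levels=16):
    """interior quantiles: equal probability mass per output symbol"""
    return np.quantile(samples, np.linspace(0, 1, n_levels + 1)[1:-1])

edges = equalizing_boundaries(stimulus)
codes = np.digitize(stimulus, edges)          # encoded symbol per stimulus
counts = np.bincount(codes, minlength=16)     # should be roughly flat
```

Equal usage of the symbols maximizes the output entropy for a fixed number of response levels.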
 adaptation to stimulus statistics
 feature adaptation (Atick and Redlich)
 spatial filtering properties in retina / LGN change with varying light levels
 at low light levels surround becomes weaker
 coding schemes
 redundancy reduction
 population code $P(R_1,R_2)$
 entropy $H(R_1,R_2) \leq H(R_1) + H(R_2)$  being independent would maximize entropy
 correlations can be good
 error correction and robust coding
 correlations can help discrimination
 retina neurons are redundant (Berry, Chichilnisky)
 more recently, sparse coding
 penalize weights of basis functions
 instead, we get localized features
 we ignored the behavioral feedback loop
5  computing in carbon
5.1  modeling neurons
 nernst battery
 osmosis (for each ion)
 electrostatic forces (for each ion)
 together these yield Nernst potential $E = \frac{k_B T}{zq} \ln \frac{[out]}{[in]}$
 T is temp
 q is ionic charge
 z is num charges
 part of voltage is accounted for by nernst battery $V_{rest}$
 yields $\tau \frac{dV}{dt} = -V + V_\infty$ where $\tau=R_mC_m=r_mc_m$
 equivalently, $\tau_m \frac{dV}{dt} = -(V-E_L) - g_s(t)(V-E_s) r_m + I_e R_m$
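A quick numeric check of the Nernst potential, using typical textbook potassium concentrations (the exact values are illustrative):

```python
import math

# Nernst potential E = (k_B T / zq) ln([out]/[in]) at body temperature.
# For K+, [in] >> [out], so E comes out negative.
k_B = 1.380649e-23    # Boltzmann constant, J/K
q = 1.602177e-19      # elementary charge, C

def nernst(c_in, c_out, z=1, T=310.0):
    return (k_B * T / (z * q)) * math.log(c_out / c_in)

E_K = nernst(c_in=140e-3, c_out=5e-3)   # roughly -0.089 V for these values
```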
5.2  spikes
5.3  simplified model neurons
 integrate-and-fire neuron
 passive membrane (neuron charges)
 when V = V$_{thresh}$, a spike is fired
 then V = V$_{reset}$
 doesn’t have good modeling near threshold
 can include threshold by saying
 when V = V$_{max}$, a spike is fired
 then V = V$_{reset}$
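A minimal leaky integrate-and-fire simulation sketch (parameters are generic textbook-style values, not taken from these notes):

```python
# Leaky integrate-and-fire: tau dV/dt = -(V - E_L) + R*I_e; spike when V
# reaches V_thresh, then reset to V_reset. All parameters are toy values.
def simulate_lif(I_e, T=1.0, dt=1e-4, tau=0.02, E_L=-0.070,
                 V_thresh=-0.054, V_reset=-0.080, R=10e6):
    V, n_spikes = E_L, 0
    for _ in range(int(T / dt)):
        V += dt / tau * (-(V - E_L) + R * I_e)   # forward-Euler membrane update
        if V >= V_thresh:                        # threshold crossing = spike
            n_spikes += 1
            V = V_reset                          # instantaneous reset
    return n_spikes

subthreshold = simulate_lif(1e-9)   # R*I = 10 mV, never reaches threshold
firing = simulate_lif(3e-9)         # R*I = 30 mV, fires repeatedly
```

Below rheobase the neuron charges toward a steady state under threshold and stays silent; above it, the reset makes the cell fire periodically.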
 modeling multiple variables
 also model a K current
 can capture things like resonance
 theta neuron (Ermentrout and Kopell)
 often used for periodically firing neurons (it fires spontaneously)
5.4  a forest of dendrites
 cable theory  Kelvin
 voltage V is a function of both x and t
 separate into sections that don’t depend on x
 coupling conductances link the sections (based on area of compartments / branching)
 Rall model for dendrites
 if branches obey a certain branching ratio, can replace each pair of branches with a single cable segment with equivalent surface area and electrotonic length
 $d_1^{3/2} = d_{11}^{3/2} + d_{12}^{3/2}$
 dendritic computation (London and Hausser 2005)
 hippocampus  when inputs arrive at soma, similar shape no matter where they come in = synaptic scaling
 where inputs enter influences how they sum
 dendrites can generate spikes (usually calcium) / backpropagating spikes
 ex. Jeffress model  sound localized based on timing difference between ears
 ex. direction selectivity in retinal ganglion cells  if events arrive at dendrite far > close, all get to soma at same time and add
6  computing with networks
6.1  modeling connections between neurons
 model effects of synapse by using synaptic conductance $g_s$ with reversal potential $E_s$
 $g_s = g_{s,max} \cdot P_{rel} \cdot P_s$
 $P_{rel}$  probability of release given an input spike
 $P_s$  probability of postsynaptic channel opening = fraction of channels opened
 basic synapse model
 assume $P_{rel}=1$
 model $P_s$ with kinetic model
 open based on $\alpha_s$
 close based on $\beta_s$
 yields $\frac{dP_s}{dt} = \alpha_s (1-P_s) - \beta_s P_s$
 3 synapse types
 AMPA  wellfit by exponential
 GABA  fit by “alpha” function  has some delay
 NMDA  fit by “alpha” function  has some delay
 linear filter model of a synapse
 pick filter (ex. K(t) ~ exponential)
 $g_s = g_{s,max} \sum_i K(t-t_i)$
 network of integrate-and-fire neurons
 if 2 neurons inhibit each other, get synchrony (fire at the same time)
6.2  intro to network models
 comparing spiking models to firingrate models
 advantages
 spike timing
 spike correlations / synchrony between neurons
 disadvantages
 computationally expensive
 uses linear filter model of a synapse
 developing a firingrate model
 replace spike train $\rho_1(t) \to u_1(t)$
 can’t make this replacement when there are correlations / synchrony?
 input current $I_s$: $\tau_s \frac{dI_s}{dt}=-I_s + \mathbf{w} \cdot \mathbf{u}$
 works only if we let K be exponential
 output firing rate: $\tau_r \frac{d\nu}{dt} = -\nu + F(I_s(t))$
 if synapses are fast ($\tau_s \ll \tau_r$)
 $\tau_r \frac{d\nu}{dt} = -\nu + F(\mathbf{w} \cdot \mathbf{u})$
 if synapses are slow ($\tau_r \ll \tau_s$)
 $\nu = F(I_s(t))$
 if static inputs (input doesn’t change)  this is like artificial neural network, where F is sigmoid
 $\nu_{\infty} = F(\mathbf{w} \cdot \mathbf{u})$
 could make these all vectors to extend to multiple output neurons
 recurrent networks
 $\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} + F(W\mathbf{u} + M \mathbf{v})$
 $-\mathbf{v}$ is decay
 $W\mathbf{u}$ is input
 $M \mathbf{v}$ is feedback
 with constant input, $v_{\infty} = W \mathbf{u}$
 ex. edge detectors
 V1 neurons are basically computing derivatives
6.3  recurrent networks
 linear recurrent network: $\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} + W\mathbf{u} + M \mathbf{v}$
 let $\mathbf{h} = W\mathbf{u}$
 want to investigate different M
 can solve eq for $\mathbf{v}$ using eigenvectors
 suppose M (NxN) is symmetric (connections are equal in both directions)
 $\to$ M has N orthogonal eigenvectors / eigenvalues
 let $e_i$ be the orthonormal eigenvectors
 output vector $\mathbf{v}(t) = \sum c_i (t) \mathbf{e_i}$
 allows us to get a closedform solution for $c_i(t)$
 eigenvalues determine network stability
 if any $\lambda_i > 1, \mathbf{v}(t)$ explodes $\implies$ network is unstable
 otherwise stable and converges to steadystate value
 $\mathbf{v}_\infty = \sum \frac{h\cdot e_i}{1-\lambda_i} e_i$
 amplification of input projection by a factor of $\frac{1}{1-\lambda_i}$
 ex. each output neuron codes for an angle between −180° and 180°
 define M as cosine function of relative angle
 excitation nearby, inhibition further away
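The eigenvector solution for the steady state can be cross-checked against directly solving $(I - M)\mathbf{v} = \mathbf{h}$ (the symmetric 4-neuron coupling below is made up, chosen so all eigenvalues are below 1):

```python
import numpy as np

# Steady state of the linear recurrent network tau dv/dt = -v + h + Mv,
# via the eigenvectors of a toy symmetric M with all eigenvalues < 1.
M = np.array([[0.5, 0.2, 0.0, 0.0],
              [0.2, 0.3, 0.1, 0.0],
              [0.0, 0.1, 0.4, 0.2],
              [0.0, 0.0, 0.2, 0.1]])
lam, E = np.linalg.eigh(M)       # symmetric M -> orthonormal eigenvectors
h = np.array([1.0, 0.0, -1.0, 0.5])

# v_inf = sum_i (h . e_i)/(1 - lambda_i) e_i: each input projection is
# amplified by a factor 1/(1 - lambda_i)
v_inf = sum((h @ E[:, i]) / (1 - lam[i]) * E[:, i] for i in range(4))

# the same steady state solves v = h + Mv, i.e. (I - M) v = h
v_direct = np.linalg.solve(np.eye(4) - M, h)
```

The two agree; eigenvalues close to 1 would make the corresponding amplification factor blow up, which is the instability condition above.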
 memory in linear recurrent networks
 suppose $\lambda_1=1$ and all other $\lambda_i < 1$
 then $\tau \frac{dc_1}{dt} = h \cdot e_1$  keeps memory of input
 ex. memory of eye position in medial vestibular nucleus (Seung et al. 2000)
 integrator neuron maintains persistent activity
 nonlinear recurrent networks: $\tau \frac{d\mathbf{v}}{dt} = \mathbf{v} + F(\mathbf{h}+ M \mathbf{v})$
 ex. rectification linearity F(x) = max(0,x)
 ensures that firing rates never go below zero
 can have eigenvalues > 1 but stable due to rectification
 can perform selective “attention”
 network performs “winner-takes-all” input selection
 gain modulation  adding constant amount to input h multiplies the output
 also maintains memory
 nonsymmetric recurrent networks
 ex. excitatory and inhibitory neurons
 linear stability analysis  find fixed points and take partial derivatives
 use eigenvalues to determine dynamics of the nonlinear network near a fixed point
7  networks that learn: plasticity in the brain & learning
7.1  synaptic plasticity, hebb’s rule, and statistical learning
 if 2 neurons keep firing at same time, get LTP  long-term potentiation
 if input fires, but output doesn’t, then could get LTD  long-term depression
 Hebb rule $\tau_w \frac{d\mathbf{w}}{dt} = \mathbf{x}v$
 $\mathbf{x}$  input
 $v$  output
 translates to $\mathbf{w}_{i+1}=\mathbf{w}_i + \epsilon \cdot \mathbf{x}v$
 average effect of the rule is to change based on the input correlation matrix $Q = \langle \mathbf{x}\mathbf{x}^T \rangle$
 covariance rule: $\tau_w \frac{d\mathbf{w}}{dt} = \mathbf{x}(v-E[v])$
 includes LTD as well as LTP
 Oja’s rule: $\tau_w \frac{d\mathbf{w}}{dt} = \mathbf{x}v - \alpha v^2 \mathbf{w}$ where $\alpha>0$
 stability
 Hebb rule  derivative of w is always positive $\implies$ w grows without bound
 covariance rule  derivative of w is still always positive $\implies$ w grows without bound
 could add constraint that $\|\mathbf{w}\|=1$ and normalize w after every step
 Oja’s rule  $\|\mathbf{w}\| \to 1/\sqrt{\alpha}$, so stable
 solving Hebb rule $\tau_w \frac{d\mathbf{w}}{dt} = Q \mathbf{w}$ where Q represents the input correlation matrix
 write w(t) in terms of eigenvectors of Q
 lets us solve for $\mathbf{w}(t)=\sum_i c_i(0) \exp(\lambda_i t / \tau_w) \mathbf{e}_i$
 when t is large, largest eigenvalue dominates
 hebbian learning implements PCA
 hebbian learning learns w aligned with principal eigenvector of input correlation matrix
 this is same as PCA
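A sketch showing Oja’s rule converging to the principal eigenvector of the input correlation matrix (toy correlated 2-D inputs; the learning rate and alpha are arbitrary choices):

```python
import numpy as np

# Oja's rule (dw ~ v*x - alpha*v^2*w, with v = w.x) on correlated 2-D inputs;
# it should align w with the principal eigenvector, i.e. perform PCA.
rng = np.random.default_rng(3)
C = np.array([[1.0, 0.9], [0.9, 1.0]])            # toy input covariance
X = rng.multivariate_normal([0.0, 0.0], C, size=5000)

w, eps, alpha = rng.normal(size=2), 0.01, 1.0
for x in X:
    v = w @ x                                     # output
    w += eps * (v * x - alpha * v**2 * w)         # Oja update

lam, E = np.linalg.eigh(X.T @ X / len(X))         # input correlation matrix
e1 = E[:, -1]                                     # principal eigenvector
alignment = abs(w @ e1) / np.linalg.norm(w)       # |cos| between w and e1
```

The norm of w also settles near $1/\sqrt{\alpha} = 1$, matching the stability note above.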
7.2  intro to unsupervised learning

 most active neuron is the one whose w is closest to x
 competitive learning
 updating weights given a new input
 pick a cluster (corresponds to most active neuron)
 set weight vector for that cluster to running average of all inputs in that cluster
 $\Delta \mathbf{w} = \epsilon \cdot (\mathbf{x} - \mathbf{w})$
 related to self-organizing maps = Kohonen maps
 in self-organizing maps, also update other neurons in the neighborhood of the winner
 update winner closer
 update neighbors to also be closer
 ex. V1 has orientation preference maps that do this
 generative model
 prior P(C)
 likelihood P(X|C)
 posterior P(C|X)
 mixture of Gaussians model  Gaussian assumption: P(X|C=c) is Gaussian
 EM = expectation-maximization
 estimate P(C|X)  pick which cluster each point belongs to
 for the Gaussian model, each cluster gets a probability of containing the point
 this probability weights the change  “soft” assignment
 learn parameters of generative model  change parameters of each Gaussian (mean and variance) for the clusters
 assumes you have all the points at once
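A compact EM sketch for a two-component 1-D mixture of Gaussians on synthetic data (all constants below are arbitrary toy choices):

```python
import numpy as np

# EM for a 1-D mixture of two Gaussians. E step: soft responsibilities
# P(C|X); M step: refit means, variances, and mixing weights.
rng = np.random.default_rng(4)
x = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)])

mu, var, mix = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(50):
    # E step: resp[n, k] proportional to mix_k * N(x_n | mu_k, var_k)
    lik = mix * np.exp(-(x[:, None] - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M step: responsibility-weighted mean, variance, and mixing weights
    Nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    var = (resp * (x[:, None] - mu)**2).sum(axis=0) / Nk
    mix = Nk / len(x)

mu_sorted = np.sort(mu)   # should recover means near -3 and +3
```

Replacing the soft responsibilities with hard argmax assignments recovers the competitive-learning update above.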
7.3  sparse coding and predictive coding
 eigenface  Turk and Pentland 1991
 eigenvectors of the input covariance matrix are good features
 can represent images using sum of eigenvectors (orthonormal basis)
 suppose you use only first M principal eigenvectors
 then there is some noise
 can use this for compression
 not good for local components of an image (e.g. parts of face, local edges)
 if you assume Gaussian noise, maximizing likelihood = minimizing squared error
 generative model
 images X
 causes

 likelihood P(X=x|C=c)  Gaussian
 proportional to $\exp(-|x-Gc|^2)$
 want posterior P(C|X)
 prior p(C)
 assume prior causes are independent
 want sparse distribution
 has heavy tail (super-Gaussian distribution)
 then P(C) = $k \cdot \prod_i \exp(g(C_i))$
 can implement sparse coding in a recurrent neural network
 Olshausen & Field, 1996  learns receptive fields in V1
 sparse coding is a special case of predictive coding
 there is usually a feedback connection for every feedforward connection (Rao & Ballard, 1999)
8 
8.2  reinforcement learning  predicting rewards
 dopamine serves as the brain’s reward signal
ml analogies
Brain theories
 Computational Theory of Mind
 Classical associationism
 Connectionism
 Situated cognition
 Memoryprediction framework
 Fractal Theory: https://www.youtube.com/watch?v=axaH4HFzA24
 Brain sheets are made of cortical columns (about .3mm diameter, 1000 neurons / column)
 Have ~6 layers
brain as a computer
 Brain as a Computer – Analog VLSI and Neural Systems by Mead (VLSI – very large scale integration)
 Brain Computer Analogy
 Process info
 Signals represented by potential
 Signals are amplified = gain
 Power supply
 Knowledge is not stored in knowledge of the parts, but in their connections
 Based on electrically charged entities interacting with energy barriers
 http://en.wikipedia.org/wiki/Computational_theory_of_mind
 http://scienceblogs.com/developingintelligence/2007/03/27/whythebrainisnotlikeaco/
 Brain’s storage capacity is about 2.5 petabytes (Scientific American, 2005)
 Electronics
 Voltage can be thought of as water in a reservoir at a height
 It can flow down, but the water will never reach above the initial voltage
 A capacitor is like a tank that collects the water under the reservoir
 The capacitance is the crosssectional area of the tank
 Capacitance – electrical charge required to raise the potential by 1 volt
 Conductance = 1/ resistance = mho, siemens
 We could also say the world is a computer with individuals as the processors; given all the wasted thoughts we have, the solution is probably to identify global problems and channel people’s focus towards working on them
 Brain chip: http://www.research.ibm.com/articles/brainchip.shtml
 Differences: What Can AI Get from Neuroscience?
 Brains are not digital
 Brains don’t have a CPU
 Memories are not separable from processing
 Asynchronous and continuous
 Details of brain substrate matter
 Feedback and Circular Causality
 Asking questions
 Brains has lots of sensors
 Lots of cellular diversity
 NI uses lots of parallelism
 Delays are part of the computation
Brain v. Deep Learning
 http://timdettmers.com/
 problems with brain simulations:
 Not possible to test specific scientific hypotheses (compare this to the large hadron collider project with its perfectly defined hypotheses)
 Does not simulate real brain processing (no firing connections, no biological interactions)
 Does not give any insight into the functionality of brain processing (the meaning of the simulated activity is not assessed)
 Neuron information processing parts
 Dendritic spikes are like first layer of conv net
 Neurons will typically have a genome that is different from the original genome that you were assigned to at birth. Neurons may have additional or fewer chromosomes and have sequences of information removed or added from certain chromosomes.
 http://timdettmers.com/2015/03/26/convolutiondeeplearning/
 The adult brain has 86 billion neurons, about 10 trillion synapses, and about 300 billion dendrites (tree-like structures with synapses on them)

Development
 22 early development
 23 circuit formation
 24 plasticity in systems
 25 repair and regeneration
 26 diseases
22 early development
 ways to study
 top-down: rosy retrospection
 bottom-up: e.g. LTP/LTD
 human disease: stroke-by-stroke
 development=ontogeny
 timeframe
 month 1  gastrulation
 most sensitive time for mom
 months 2-5  cells being born
 up to year 2  axon guidance / synapse formation
 gastrulation  process by which early embryo undergoes folds = shapes of NS
 diseases
 spina bifida  neural tube fails to seal
 folic acid (vitamin B9) can prevent this
 anencephaly  neural tube fails to close higher up
 parts
 roofplate at top (back)
 floorplate on bottom (stomach)
 neural crest  pinches off top of roofplate
 neuroblasts = classic stem cells
 asymmetric division  cells generate themselves and differentiated progeny
 ultimate stem cell  fertilized eggs
 differentiation
 cells made by neuroblasts decide what they are going to become
 morphogens
 BMP  roofplate
 cyclopia  fatal defect in BMP
 Hedgehog (HH)  at floor plate
 Retinoids  axial, affect skin
 affected by thalidomide  helps morning sickness but causes missing limb segments
 also affected by accutane
 FGFs  axial symmetry
 Wnts  skin, gut, hair
 loss of wnts is loss of hair
 floor plate loses function after embryogenesis except glioblastoma
 measure BMP and HH gradient to figure out where you are
 treat ALS by adding HH to make more alpha motor neurons
 1. dorsal direction
 roofplate makes BMP
 low HH  interneurons, sensory neurons (ex. nociceptors)
 even BMP/HH  sympathetic
 high HH  more motor neurons
 floorplate makes HH (hedgehog)
 2. axial specification (anterior/posterior)
 tube swells into bulbs that become cerebellum, superior colliculus, cortex
 homeotic genes = hox genes  set of genes (transcription factors) in order on chromosome
 order corresponds to order of your body parts
 rhombomeres  segments in brainstem made by hox gene patterns
 lineages
 when neuroblast is born, starts producing progeny (family tree of neuron types)
 very often, cells are produced in certain order
 timing: cellcell interations and tyrosine kinases determine order
 first alpha neurons, then GABAergic to control those, last is glia
 neural crest function
 migratory  moves out and divides:
 neuroblastoma  developed early  severe problem because missing parts of NS
 makes DRG and associated glial cells (schwann cells)
 makes sympathetic NS and target ganglia, enteric NS, parasympathetic NS targets
 makes melanocytes  know how to migrate and divide but can make melanoma (cancer)
 cortex is made inside out (layers 6 > 1)
 starts with stem cells called radial glia
 cortical dysplasia  missing a layer / duplicating a layer
 small part with two layer 3s  severe epilepsy
 5. cell death
 1/2 of cells die in development
 axon guidance (ch 23)
 each cell born and axon grows and are guided to a target
 dendrite basically follows same rules
 synapse formation (ch 23, 24)
 pruning and plasticity
 NMDA receptor type
 form synapses and if they don’t look right  get rid of them
 K1/K1 synapses breaking and forming
 after age 21, K1 starts increasing and net loss of synapses
23 circuit formation
 growth cone  motile tip of axon
 actin tip
 lamellipodium  sheet (hand)
 filopodium  huge curves (fingers)
 chemo attraction (actin assembly) and chemo repulsion (actin disassembly)
 microtubule shaft  tubulin is much more cemented in
 Mauthner cell of tadpole  first recorded growth cone
 can’t regrow (that’s why we can’t regrow spinal cord)
 signals in growing axons
 pioneer axons (Betz cells) are first  often die
 follower axons (other Betz cells) can jump onto these and connect before pioneer dies
 trophic support  neuron survives on contact
 frog tectum (has superior colliculus) with map of retina:
 ephrin (EPH) repulses axon
 retinal N-T (nasal-temporal) axis maps onto tectum A-P (anterior-posterior) axis
 axons have different amount of EPH receptors (in retina temporal has more than nasal)
 gradient of EPH (in tectum anterior has less than posterior)
 if we flip eye upside down (on nasaltemporal axis), image will be upside down
 3 classes of axon guidance molecules:
 ECM/integrins
 NCAM (homophilic—binds to another neuron that is NCAM),
 follower neurons bind to pioneer through NCAMNCAM interactions
 Cadherin (homophilic)
 involved in recognition of being some place
 4 important ligands/receptors
 ephrins/eph
 gradient of eph receptor
 netrin/DCC  netrin is the guidance molecule, DCC is its receptor
 attracts axons to floorplate (midline)
 cells without DCC don’t cross midline
 slit/robo  slit is the ligand, robo is the receptor
 chases axons off (away from midline)
 axons not destined to cross midline are born expressing robo
 axons destined to cross the midline only express robo after crossing
 if DCC (−) and robo (−), will continue wandering around
 robo 4 is associated with Tourette’s
 semaphorins/plexins
 combinatorial code  use combinations of these to guide axons
 these are the same genes that move cancer around
 synaptic formation
 neurexins  further recognition
 turn up in autism and schizophrenia
 DSCAM
 associated with Down’s syndrome
 doesn’t use gradients
 makes different kinds of proteins by differential splicing
 competition
 neurotrophins are secreted by muscle
 in early development, a muscle fiber has many alpha motor neurons innervating it
 all innervating neurons suck up neurotrophin and whichever sucks up most, kills all the others
 eventually, each muscle fiber is innervated by one alpha motor neuron
 only enough neurotrophin in target cells for a certain number of synapses
 happens everywhere
 ex. sympathetic ganglia
 ex. sensory neurons in skin get axons to correct cell types based on neurotrophin
 merkel  BDNF
 proprioceptor  NT3
 nociceptor  NGF
 ex. muscles  produce NGF
 treating ALS with NGF hyperactivates sensory neurons with trkA > causes chicken pox
 signals/receptors
 NGF  trk a (Trk receptor  survival signaling pathways)
 BDNF  trk b
 NT3  trk b and c
 NT4/5  trk b
 all bind p75 (death receptor)
 want to keep neurotrophins local, because there aren’t that many of them
24 plasticity in systems
 experiencedependent plasticity 
 ex. ducks’ imprinting is non-reversible
 learning is crystallized during critical period
 CREB and protein synthesis
 NMDA receptors
 epigenetics  histones control transcription and other things
 follow Hebb’s postulate  fire together, wire together
 different eyes firing together will sync up (NMDA receptors to strengthen synapses)
 systems
 ocular dominance
 left/right neurons terminate in adjacent zones
 LGN in cortex uses efferents just like superior colliculus
 label injected into retina can make it into cortex
 cat experiments
 some cells see only one eye, some see both
 cats need to form visual map in short critical period (<6 days)
 this is why you need cochlear implant early
 both eyes open  equal OD columns
 one eye closed  unequal OD columns
 branches coming out of LGN neurons grow more branches based on relative light exposure (they compete for eye’s real estate)
 strabismus = lazy eye  poor coordination with one of the muscles
 one eye is not quite seeing
 treat with patch on good eye > allows bad eye to catch up since eyes compete for ocular dominance columns
 more stimulus = more branches
 dye from retina goes through thalamus into cortex
 rabies virus does same thing: cell>ganglion>brain
 tonotopic map
 connection between MSO and inferior/superior colliculus
 playing one tone increases representation
 playing white noise disorganizes map
 birdsong
 hear song 1020 times when young  crystallized
 afterwards can’t learn new skills
 stress
 early stress sets stress points later in life
 uses serotonin
 shifts
 superior colliculus  integrate visual, auditory, motor to get X,Y coordinate
 auditory map  plastic (but only when young)
 visual map  not plastic
 if you shift visual map (with a prism), auditory map can shift over to meet the visual
 optic neuritis  MS-related optic nerve disease that shifts map
 only young animals can shift unless they were shifted before and are now unadapting
25 repair and regeneration
 full repair  human PNS  skin, muscles
 1-2 mm/day growth  speed of slow axonal transport
 thinnest axons first (thermal receptors and nociceptors)
 proprioceptors last
 process
 perineurium / schwann cells surround axons  helps regeneration
 growth cones that are cut form stumps > distal axons degenerate = Wallerian degeneration
 macrophages come in and eat up the damaged stuff
 neurotrophins are involved
 miswiring is common  regrow and may not find right target
 bell’s palsy  loss of facial nerve  recovers with miswiring (salivary / tear)
 neuromuscular junctions (NMJ)
 damaged cells leave synaptic ghost = glia and protein matrix for nerve to regrow into
 repairs easily after heavy training
 no repair / glial scar  human CNS
 no ghost because so spread out
 glia cover wound (scar) but can’t develop further
 has astrocytes and oligodendrocytes (types of glial cells)
 don’t support regrowth
 involved in scarring
 microglia  from immune system
 control inflammation
 release cytokines
 nogo  protein that blocks regrowth (but there are other proteins as well)
 we try repairing with shunts  piece of sciatic nerve from other part of body with schwann cells from PNS to try to repair a connection in the CNS
 stem cell regeneration  new neurons being formed; 2 places in humans
 nonhuman examples
 floor plate of lizards can make new tail
 fish retina always making new cells
 canary brain part has stem cells that learn new song every year
 small C14 incorporation after early development  suggests we don’t regenerate neurons  C14 was from nuclear testing
 human areas that do regenerate
 hippocampus
 memories you store temporarily
 subventricular zone makes glomeruli in olfactory bulb cells
 turnover daily
 sensory neurons and their targets constantly die and regenerate
 niche  places where stem cells stay alive
 ex. places in CNS with Wnt molecular signals
 damage control  remove these signals for apoptosis = cell death
 glutamate increase  excitotoxicity
 can stop with NMDA blockers
 induce a coma by cooling them down or GABA drugs
 cytokines increase  immune system (like neurotrophins), inflammation
 hypoxia/stress
 neurotrophin withdrawal
 in stress times neurotrophin goes down
26 diseases
alzheimer’s
 overview
 ageassociated  tons of people get it
 doesn’t kill you, secondary complications like pneumonia will kill you
 rate is going up
 very expensive to treat
 declarative memories are affected by Alzheimer’s
 these are memories that you know
 first 2 areas to go in Alzheimer’s
 hippocampus
 patient HM had no hippocampus
 no anterograde memory  learning new things
 hippocampus stores 1 day of info
 offloading occurs during sleep (REM sleep) to prefrontal cortex, temporal lobe, V4
 dreaming  might see images as you are offloading
 basal forebrain  spread synapses all over cortex
 uses Ach
 ignition key for entire cortex
 alzheimer’s characteristics only found in autopsy
 amyloid plaques
 maybe Abeta causes it
 Abeta comes from APP
 Abeta42 binds to itself
 prion-like (seeds the formation of more of itself)
 this cycle could be exacerbated by injury
 clumps and attracts immune system which kills local important cells
 this could cause Alzheimer’s
 rare genetic mutations in Abeta increase probability you get Alzheimer’s
 antiinflammation may be too late
 can take drugs that increase Ach functions  ex. cholinergic agonists, cholinesterase inhibitors
 tangles
 tangles made of protein called Tau
 most people think these are just dead cells resulting from Alzheimer’s but some think they cause it
parkinson’s
 loss of substantia nigra pars compacta dopaminergic neurons
 when you get down to 20% what you were born with
 dopaminergic neurons form melanin = dark color
 hits to head can give inflammation
 know what they need to do  don’t have enough dopamine to act
 treat with L-Dopa > something like dopamine > or take out globus pallidus
 Lewy bodies are clumps of alpha synuclein  appear at dopaminergic synapses
 clumps like Abeta42
 associated with earlyonset Parkinson’s (rare) associated with genetic mutations
 bradykinesia  slowness of movement
 age can give Parkinson’s
 no direct evidence that environmental toxins induce Parkinson’s in humans
 MPTP / pesticides can induce Parkinson’s in test animals
 1/500 people

memory
 The Neuron Doctrine – the neuron is the fundamental building block and elementary signaling unit of the brain
 Golgi – develops silver staining method which allows Cajal to see entire neuron
 Santiago Ramon y Cajal – Spanish anatomist who simplifies neuron forest and looks at individual neurons, develops model with dendrites, cell body, and axon
 4 parts: neuron, synapse, connection specificity (specific neurons connect to specific others), dynamic polarization (signals travel in one direction)
 Freud looks into Cajal’s theories, but doesn’t incorporate them
 1906 – they share Nobel despite Golgi hating Cajal’s theories
 1955 – Cajal’s intuitions borne out conclusively
 Next Generation
 Sherrington – builds on Cajal’s work – finds that neurons integrate signals and some signals are inhibitory
 Shares Nobel with Adrian in 1932 – Adrian is younger and grateful
 Phases of Nerve Signaling
 Galvani discovers electrical activity in animals, Helmholtz measures speed of electrical signals in neurons
 Adrian measures action potentials and sees that they all have the same size and that intensity correlates with their frequency
 Bernstein (student of Helmholtz) finds that ions carry the electrical current – he investigates only the potassium ion
 Ionic hypothesis – Hodgkin, Huxley (and Katz) – find sodium, potassium in squid axon using voltage clamp, discover voltagegated channels, win Nobel in 1963
 Interneuronal signaling
 1920s – Dale and Loewi find that synaptic transmission is chemical  identify acetylcholine using frog hearts
 synaptic potential – has different sizes, slower – only over synapses (can be excitatory or inhibitory, can generate action potential)
 longterm potentiation (LTP) is a persistent strengthening of synapses based on recent patterns of activity
 Eccles – believed in spark theory (synaptic transmission was electrical), but after talking to Popper disproves it with Katz and believes in soup theory (synaptic transmission is chemical)
 Glutamate – major excitatory neurotransmitter
 GABA – main inhibitory transmitter
 Katz’s lab later showed there are a few synapses that are electrical
 Katz – found that neurotransmitters were released when voltage-gated channels let in Ca ions
 Packets of neurotransmitters called quanta are released in synaptic vesicles
 Confirmed in 1955
 Modern Generation
 Four Lobes
 Frontal – working memory and lots of stuff
 Temporal – auditory processing, language, and memory
 Parietal – sensory information
 Occipital – vision
 Brain maps  Marshall showed that, even though the different sensory systems carry different types of information and end up in different regions of the cerebral cortex, they share a common logic in their organization: all sensory information is organized topographically in the brain in the form of precise maps of the body’s sensory receptors
 Broca and Wernicke find that specific brain regions are in charge of specific functions
 Broca’s area – expression of language
 Wernicke’s area – perception of language
 Patient H.M. – research by Brenda Milner
 He couldn’t store new memories although he could learn new skills
 Memory is a distinct mental function, clearly separate from other perceptual, motor, and cognitive abilities.
 Shortterm memory and longterm memory can be stored separately. Loss of medial temporal lobe structures, particularly loss of the hippocampus, destroys the ability to convert new shortterm memory to new longterm memory.
 There is explicit and implicit memory (implicit is a collection of processes)
 Milner showed that at least one type of memory can be traced to specific places in the brain
 Early Eric Kandel
 Gets lucky start recording in hippocampus
 Decides to start recording in Aplysia – large and has only 20,000 neurons separated into nine ganglia (human brain ~ 100 billion)
 Hypothesizes that persistent changes in the strength of synaptic connections results in memory storage
 just as neurons and their synaptic connections are exact and invariant, so, too, the function of those connections is invariant.
 First, we found that the changes in synaptic strength that underlie the learning of a behavior may be great enough to reconfigure a neural network and its informationprocessing ability
 a given set of synaptic connections between two neurons can be modified in opposite ways— strengthened or weakened—by different forms of learning.
 Third, the duration of shortterm memory storage depends on the length of time a synapse is weakened or strengthened.
 Fourth, we were beginning to understand that the strength of a given chemical synapse can be modified in two ways, depending on which of two neural circuits is activated by learning—a mediating circuit or a modulatory circuit
 Learning may be a matter of combining various elementary forms of synaptic plasticity into new and more complex forms, much as we use an alphabet to form words.
 forgetting had at least two phases: a rapid initial decline that was sharpest in the first hour after learning and then a much more gradual decline that continued for about a month.
 Homosynaptic  the depression occurred in the same neural pathway that was stimulated
 Strengthening synapses = greater responses
 Shortterm to Longterm
 Memory consolidation – shortterm is subject to disruption
 Head injuries or seizures can lead to retrograde amnesia – you forget what was in your short-term memory
 Electric shocks were able to get rid of shortterm memory
 A shortterm memory lasting minutes is converted—by a process of consolidation that requires the synthesis of new protein—into stable, longterm memory lasting days, weeks, or even longer
 Longterm memory results in growing or shedding synapses
 As the memory fades, the number of synapses returns almost to normal; the remaining difference accounts for why relearning the task is easier
 Recall
 Based on cues, in the case of Aplysia gillwithdrawal, external stimulus
 Shortterm
 Shortterm memory changes are presynaptic  during shortterm habituation lasting minutes, the sensory neuron releases less neurotransmitter, and during shortterm sensitization it releases more neurotransmitter
 Relies on interneurons:
 Mediating circuits – produce behavior directly, sensory neurons that innervate the siphon, the interneurons, and the motor neurons that control the gillwithdrawal reflex
 Modulatory circuits  not directly involved in producing a behavior but instead finetunes the behavior in response to learning by modulating—heterosynaptically—the strength of synaptic connections between the sensory and motor neurons
 For example, in gillwithdrawal sensitization, interneurons release serotonin into the presynaptic terminals of the sensory neurons
 This generates a long, slow synaptic potential in the motor neurons
 Ionotropic receptors – neurotransmittergated, open ion channels
 Metabotropic receptors – gated, activate enzymes; these enzymes can make cyclic AMP, act much longer
 Serotonin binds to metabotropic receptors in the presynaptic terminal of sensory neurons increasing the amount of cAMP and in turn the amount of glutamate
 This was verified in studies of Drosophila
 Sensory regions map to specific places in brain, keeping proximity
 Pavlov
 Habituation – animal repeatedly presented with neutral sensory stimulus learns to ignore it
 Sensitization – animal learns strong stimulus is dangerous and enhances its defensive reflexes
 Classical Conditioning – pair neutral stimulus with potentially dangerous stimulus, animal learns to respond to neutral stimulus
 Longterm memory
 Proteins must be made  DNA makes RNA, and RNA makes protein
 The serotonin itself was able to make new synapses grow via synthesis of proteins in the nucleus
 Genes
 There are effector genes which mediate cellular functions and regulatory genes which switch them on and off
 Genes have regions called promoters and repressors and regulatory proteins must bind to these in order to express them
 With repeated pulses of serotonin, kinase A moves into the nucleus and turns on a regulatory protein called CREB, which turns on some genes
 Also requires turning off other genes
 MAP kinase also moves into nucleus and depresses CREB2
 Together, activating CREB and deactivating CREB2 transfers shortterm memories to longterm
 Synaptic marking – the proteins produced in the nucleus know which synapses to go to because of their shortterm changes
 Proteins must be synthesized locally at the synapses
 Dormant mRNA is sent to all the synapses
 There is a protein called CPEB that is activated by serotonin and is required by the synapses to maintain protein synthesis
 Resembles a prion – special protein with dominant and recessive conformation, dominant can be harmful
 Dominant is selfperpetuating – turns recessive into dominant
 Explicit memory  depends on the elaborate neural circuitry of the hippocampus and the medial temporal lobe, and it has many more possible storage sites.
 Longterm potentiation  synaptic strengthening mechanism in the hippocampus
 longterm potentiation describes a family of slightly different mechanisms, each of which increases the strength of the synapse in response to different rates and patterns of stimulation
 glutamate acts on two different types of ionotropic receptors in the hippocampus, the AMPA receptor and the NMDA receptor
 the AMPA receptor mediates normal synaptic transmission and responds to an individual action potential in the presynaptic neuron
 the NMDA receptor responds only to extraordinarily rapid trains of stimuli and is required for long-term potentiation
 the resulting flow of calcium ions into the postsynaptic cell acts as a second messenger (much as cyclic AMP does), triggering long-term potentiation
 the NMDA receptor allows calcium ions to flow through its channel if and only if it detects the coincidence of two neural events, one presynaptic and the other postsynaptic: the presynaptic neuron must be active and release glutamate, and the AMPA receptor in the postsynaptic cell must bind glutamate and depolarize the cell
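The NMDA receptor’s coincidence requirement reduces to a logical AND over the two events. A minimal boolean sketch (function names are mine, for illustration only):

```python
def ampa_open(presynaptic_glutamate):
    """AMPA: responds to presynaptic glutamate alone (normal transmission)."""
    return presynaptic_glutamate

def nmda_calcium_flux(presynaptic_glutamate, postsynaptic_depolarized):
    """NMDA: passes Ca2+ only on the coincidence of presynaptic glutamate
    release AND postsynaptic depolarization (driven by AMPA receptors)."""
    return presynaptic_glutamate and postsynaptic_depolarized

# only the paired condition opens the coincidence detector
cases = [(True, True), (True, False), (False, True)]
flux = [nmda_calcium_flux(p, q) for p, q in cases]   # [True, False, False]
```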
 explicit memory in the mammalian brain, unlike implicit memory in Aplysia or Drosophila, requires several gene regulators in addition to CREB
 Cognitive science  Kantian notion that the brain is born with a priori knowledge, “knowledge that is . . . independent of experience.”
 Visual system
 Different layers respond to different things  each cell in the primary visual cortex responds only to a specific orientation of such lightdark contours
 The brain does not simply take the raw data that it receives through the senses and reproduce it faithfully. Instead, each sensory system first analyzes and deconstructs, then restructures the raw, incoming information according to its own builtin connections and rules
 What and where are different neural pathways
 there is no single cortical area to which all other cortical areas report exclusively, either in the visual or in any other system. In sum, the cortex must be using a different strategy for generating the integrated visual image.
 Spatial Map
 the hippocampus of rats contains a representation—a map—of external space and that the units of that map are the pyramidal cells of the hippocampus, referred to as “place cells.”
 The brain sometimes codes with coordinates centered on the receiver and sometimes relative to the outside world
 Attention acts as a filter, selecting some objects for further processing
 A modulating circuit involving dopamine in the hippocampus forms spatial maps
 The dopamine comes from the cerebral cortex (a higher level part of the brain)
 Eric Kandel tried to translate his research into a cure for agerelated memory loss
 Alzheimer’s: This degeneration of tissue is caused in large part by the accumulation of an abnormal material known as β-amyloid in the form of insoluble plaques in the spaces between brain cells.
 We found that drugs which activate these dopamine receptors, and thereby increase cyclic AMP, overcome the deficit in the late phase of longterm potentiation. They also reverse the hippocampusdependent memory deficit.
 Various disorders are being solved slowly through research
 Consciousness
 consciousness in people is an awareness of self, an awareness of being aware
 The brain does reconstruct our perception of an object, but the object perceived—the color blue or middle C on the piano—appears to correspond to the physical properties of the wavelength of the reflected light or the frequency of the emitted sound
 Claustrum is connected to a bunch of different brain parts – could regulate attention
 viewing frightening stimuli activates two different brain systems, one that involves conscious, presumably topdown attention and one that involves unconscious, bottomup attention, or vigilance
 readiness potential can measure what a person is going to do before they know they want to do it

Motor system
 16 lower
 17 upper
 18 basal ganglia (choose what you want to do)
 19 cerebellum (fine tuning all your motion)
 20 eye movements/integration
 21 visceral (how you control organs, stress levels, etc.)
16 lower
 sensory in dorsal spinal cord, motor in ventral
 farther out neurons control farther out body parts (medial=trunk, lateral=arms,legs)
 one motor neuron (MN) innervates multiple fibers
 the more fibers/neuron, the less precise
 MN pool  group of MNs=motor units
 muscle tone = all your muscles are a little on, kind of like turning on the car engine and when you want to, you can move forward
 more firing = more contraction
 MN types
 fast fatiguable  white muscle
 fast fatigueresistant
 slow  red muscles, make atp
 muscles are innervated by a proportion of these MNs
 reflex
 whenever you get positive signal on one side, also get negative on other
 flexor  curl in (bicep)
 extensor  extend (tricep)
 proprioceptors (+)  measure length  more you stretch, more firing of alpha MN to contract
 intrafusal muscle=spindle  stretches the proprioceptor so that it can measure even when muscle is already stretched
 $\gamma$ motor neuron  adjusts intrafusal muscles until they are just right
 keeps muscles tight so you know how much muscle is stretched
 if alpha fires a lot, gamma will increase as well
 high gamma allows for fast responsiveness  brainstem modulators (serotonin) also do this
 opposes muscle stretch to keep it fixed
 spindle > activates muscles > contracts > turns off
 sensory neurons / gamma MNs innervate muscle spindle
 homonymous MNs go into same muscle, antagonistic muscle pushes other way
 golgi tendon organ (−)  measures tension, not stretch
 safety switch
 inhibits homonymous neuron so you don’t rip muscle off
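The spindle/alpha-MN loop above is a negative feedback controller: the spindle reports stretch beyond a set point, and alpha MN firing drives contraction proportional to that error. A minimal proportional-control sketch, with made-up gain and set point:

```python
def stretch_reflex(length, set_point=1.0, gain=0.5, steps=20):
    """Minimal negative-feedback sketch of the stretch reflex.
    gain and set_point are arbitrary illustration values."""
    history = [length]
    for _ in range(steps):
        error = length - set_point            # spindle signal: over-stretch
        contraction = gain * max(error, 0.0)  # alpha MN drive -> contraction
        length -= contraction                 # muscle shortens toward set point
        history.append(length)
    return history

# an over-stretched muscle (1.5) relaxes back toward the set point (1.0)
traj = stretch_reflex(1.5)
```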
 ALS = Lou Gehrig’s disease
 MNs are degenerating  reflexes don’t work
 progressive loss of $\alpha$ MNs
 MNs for the eye muscles (e.g. superior rectus) are the last to go > people use eyes to talk with tracker
 CPG = central pattern generator
 ex. step on pin, lift up leg
 walking works even if you cut cat’s spinal cord
 collection of interneurons
17 upper
 cAMP is used by GPCR
 lift and hold circuit
 ctx>lateral white matter>lateral ventral horn>limb muscles
 lateral white matter  most sensitive to injury
 brainstem>medial white matter>medial horn>trunk
 medial white matter > goes into trunk
 bulbarspinal tracts
 lateral and medial vestibulospinal tracts  feedback
 automated system  not much thinking
 posture  reflex
 too slow for learning surfing
 reticular  feedforward = anticipate things before they happen
 command / control system for trunk muscles (posture)
 feedforward  not a reflex, lean back before opening drawer
 caudal pontine  feeds into spinal cord
 colliculospinal tract
 has superior colliculus  eye muscles, neck orienting
 see ch. 20  reflex
 corticobulbar tract (premotor>primary motor>brainstem)
 motor cortexes  this info is descending
 can override reticular reflexes in reticular formation
 premotor cortex (P2)  contains all actions you can do
 has mirror neurons that fire ahead of primary neurons
 fire if you think about it or if you do it
 primary motor cortex (P1)
 layer 1 ascending
 layer 4 input
 layer 5  Betz cells  behave like 6 (output)
 layer 6  descending output
 has map like S1 does
 Jacksonian march  seizure that goes from feet to face (usually one side)
 epileptic seizure  neurons fire too much and fire neurons near them
 insular  flashes of moods
 pyriform  flashes of smells
 Betz cells  if they fire, you will do something
 dictate a goal, not single neuron to fire
 axons to ventral horn of spinal cord
 lesions
 upper
 spasticity  unorganized leg motions
 increased tone  tight muscles
 hyperactive deep reflexes
 ex. Babinski’s sign
 big toe extends up (abnormal extensor plantar response) because you don’t know how much to curl
 curling foot down = normal plantar response
 more serotonin can cause this
 lower
 hypoactive deep reflexes
 decreased tone
 severe muscle atrophy
 upper
 pathways
 Betz cell
 90% cross midline in brainstem  control limbs
 10% don’t cross  trunk muscles
18 basal ganglia (choose what you want to do)
 “who you are”
 outputs
 brainstem
 motor cortex
 4 loops (last 2 aren’t really covered)
 motor loops
 body movement loop
 SNc > striatum (caudate/putamen) > (−) GP > (−) VA/VL > motor cortex
 oculomotor loop
 cortex > caudate > substantia nigra pars reticulata > superior colliculus
 nonmotor loops
 prefrontal loop  daydreaming (higherorder function)
 spiny neurons corresponding to a silly idea (alien coming after you) filtered out because not fired enough
 schizophrenia  can’t filter that out
 limbic loop  mood
 has nucleus accumbens
 can make mood better with dopamine
 substantia nigra
 pars compacta  dopaminergic neurons (input to striatum)
 more dopamine = more strength between cortical pyramidal neurons and spiny neurons (turns up the gain)
 dopamine helps activate a spiny neuron
 may be the ones that learn (positive outcome is saved, will result in more dopamine later)
 Parkinson’s  specific loss of dopaminergic neurons
 dopaminergic neurons form melanin = dark color
 when you get down to 20% what you were born with
 know what they need to do  don’t have enough dopamine to act
 treat with L-dopa > something like dopamine > or take out globus pallidus
 cocaine, amphetamine  too much dopamine
 Huntington’s  death of specific class of spiny neurons
 have uncontrolled actions
 Tourette’s  too much dopamine
 also alcohol
 MPPP (synthetic heroin)
 MPTP looks like dopamine but turns into MPP+ and kills dopaminergic neurons
 treated with L-dopa to reactivate spiny neurons
 pars reticulata
 doesn’t have dopamine (output from striatum)
 striatum contains spiny neurons
 caudate (for vision)  output to globus pallidus and substantia nigra (pars reticulata)
 putamen  output only to globus pallidus
 each spiny neuron gets input from ~1000 cortical pyramidal cells
 globus pallidus  each spiny neuron connects to one globus pallidus neuron
 deja vu  spiny neuron you haven’t fired in a while
 VA/VL (thalamus)  all motor actions must go through here before cortex
 has series of commands of all actions you can do
 has parallel set of Betz cells that will elicit those actions
 VA/VL is always firing; globus pallidus inhibits it (tonic connection)
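The loop logic above (striatum inhibits GP; GP tonically inhibits VA/VL) can be checked by multiplying connection signs along a pathway: two inhibitory links in series give net excitation (disinhibition). A tiny sketch:

```python
def net_sign(connections):
    """Multiply signs (+1 excitatory, -1 inhibitory) along a pathway."""
    sign = 1
    for s in connections:
        sign *= s
    return sign

# body-movement loop from the notes:
# striatum -(inhibits)-> globus pallidus -(tonically inhibits)-> VA/VL
# firing the striatum therefore *releases* VA/VL (disinhibition)
loop = net_sign([-1, -1])   # +1: net excitatory effect on VA/VL
```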
19 cerebellum (fine tuning all your motion)
 redundant system  cortex could do all of this, but would be slow
 repeated circuit  interesting for neuroscientists
 all info comes in, gets processed and goes back out
 cerebellum gets motor efferent copy
 all structures on your brain that do processing send out efferent
 cerebellum sends efferent copy back to itself with time delay (through inferior olive)
 cerebrocerebellum
 deals with premotor cortex (mostly motor cortex)
 spinocerebellum = clarke’s nucleus, knows stretch of every muscle, many proprioceptors go straight into here
 motor cortex
 has a map of muscles
 vestibular cerebellum  vestibular>cerebellum>vestibular
 vestibular system leans you back but if wind blows, have to adjust to that
 input
 pontine nuclei (from cortex)
 vestibular nuclei (balance)
 cuneate nucleus (somatosensory from spinal upper body)
 clarke (proprio from spinal lower body)
 processing
 cerebellar deep nuclei
 output
 deep cerebellar nuclei
 go to superior colliculus, reticular formation
 VA/VL (thalamus)  back to cortex
 red nucleus
 circuit 1  finetuning
 circuit 2  detects differences, adjusts
 cerebellum > red nucleus (sends an efferent copy) > inferior olive > cerebellum
 compare new copy to old copy
 cells
 purkinje cells  huge number of dendrite branches  flat, planar tree allows good imaging
 GABAergic
 (input) mossy fibers (+)> granule cells (send parallel fibers) (+)> purkinje cell (−)> deep cerebellar nuclei (output)
 mossy > granule > parallel fibers; each purkinje cell receives input from ~100,000 parallel fibers
 climbing fiber  comes from inferior olive and goes back to purkinje cell (this is the efferent copy) = training signal
 loops
 deep excitatory loop (climbing/mossy) (+)> deep cerebellar nuclei
 cortical inhibitory loop (climbing/granule) (+)> purkinje
 the negative is from purkinje to deep cerebellar nuclei
 alcohol
 can create gaps between the folia
 longterm use causes degeneration = ataxia (lack of coordination)
20 eye movements/integration
 Broca’s view  look at people with problems
 Ramon y Cajal  look at circuits
 5 kinds of eye movements
 saccades
 use cortex, superior colliculus (visual info > LGN > cortex, 10% goes to brainstem)
 constantly moving eyes around (fovea)
 ~scan at 30 Hz
 5 Hz = 200 ms for cortex to process, so eyes pause (get 5–6 images per second)
 there is a little bit of drift
 can’t control this
 humans are better than other animals at seeing things that aren’t moving
 VOR  vestibular ocular reflex  keeps eyes still
 use vestibular system, occurs in comatose
 fast
 works better if you move your head fast
 optokinetic system  tracks with eyes
 ex. stick head out window of car and track objects as they go by
 slower than VOR (takes 200 ms)
 works better if slower
 reflex
 in cortex (textbooks) but probs brainstem (new)
 smooth pursuit  can track things moving very fast
 suppress saccades and track smoothly
 only in higher apes
 area MT is highest area of motion coding and goes up and comes down multiple ways
 high speed processing isn’t understood
 could be retina processing
 vergence  crossing your eyes
 suppresses conjugate eye movements
 we can control this
 only humans  bring objects up very close
 reading uses this
 eye muscles
 rectus
 vertical
 superior
 inferior
 use complicated vertical gaze center
 last to degenerate in ALS
 locked-in syndrome  can only move eyes vertically
 controls oculomotor nucleus
 lateral
 medial
 lateral (controlled by abducens)
 use horizontal gaze center = PPRF, which talks to the abducens nucleus; the MLF connects the abducens to the opposite medial rectus muscle
 oblique  more circular motions
 superior (controlled by trochlear nucleus)
 inferior
 everything else controlled by oculomotor nucleus
 superior colliculus has visual map
 controls saccades, connects to gaze centers
 takes input from basal ganglia (oculomotor loop)
 also gets audio input from inferior colliculus (hear someone behind you and turn)
 gets strokes
 redundant with frontal eye field in secondary motor cortex
 connects to superior colliculus, gaze center, and comes back
 if you lose one of these, the other will replace it
 if you lose both, can’t saccade to that side
21 visceral (how you control organs, stress levels, etc.)
 parasympathetic works against sympathetic
 divisions
 sympathetic  fight-or-flight (adrenaline)
 functions
 neurons to smooth muscle
 pupils dilate
 increases heart rate
 turn off digestive system
 2 things with no parasympathetic counterpart
 increase BP
 sweat glands
 location
 neurons in spinal cord lateral horn
 send out neurons to sympathetic trunk (along the spinal cord)
 all outgoing connections are adrenergic
 neurons in spinal cord lateral horn
 beta-adrenergic drugs block adrenaline
 beta agonist  activates adrenaline receptors (do this before EKG)
 parasympathetic  relaxing (ACh)
 location
 brainstem
 Edinger-Westphal nucleus  pupil constriction
 salivatory nucleus
 vagus nucleus  digestive system, sexual function
 nucleus ambiguus  heart
 nucleus of the solitary tract
 all input/output goes through this
 rostral part (front)  taste neurons
 caudal part (back) contains all sensory information of viscera (ex. BP, heart rate, sexual function)
 sacral spinal cord (bottom)  gut/bladder/genitals
 not parallel to sympathetic – poor design  may cause stressassociated diseases
 brainstem
 hard to make drugs with ACh
 enteric nervous system  in your gut
 takes input through vagus nerve from vagus nucleus
 also has sensory neurons and sends afferents back to brainstem
 pathway
 insular cortex  what you care about
 amygdala  contains emotional memories
 hypothalamus  controls a lot
 mostly peptidergic neurons
 aging, digestion, mood, straight to bloodstream & CNS
 releases hormones
 ex. leptin  signals satiety, stops you eating once you’ve taken in enough calories
 reticular formation  feedforward, prepares digestion before we eat
 three examples
 heart rate
 starts at nucleus ambiguus
 also takes input from chemoreceptors (ex. pH)
 SA node at heart generates heartbeat  balances Ach and adrenaline
 sympathetic sends info from thoracic spinal cord
 heart sends back baroreceptor afferents
 bladder function
 parasympathetic in sacral lateral horn make you pee (contracts bladder)
 turn off sympathetic NS
 open sphincter muscle (voluntary)
 can also control this via skeletal nervous system
 circuit
 amygdala (can’t pee when nervous)
 pontine micturition center > parasympathetic preganglionic neurons > parasympathetic ganglionic neurons
 inhibitory local circuit neurons > somatic MNs
 sexual function
 Viagra turns on parasympathetic NS
 also gives temporary color blindness
 sympathetic involved in ejaculation
 temporal correlation (“Point and Shoot”)

Neural Signaling
 1  introduction
 2  electrical signals of nerve cells
 3  voltagedependent membranes
 4  ion channel transporters
 5  synapses
 6  neurotransmitters
 7  molecular signaling
 8  synaptic plasticity
1  introduction

genomics
 male Drosophila uses body position and environment to add rhythmic notes to song
 female uses this to gauge male’s brain
 neural circuits make up neural systems
 neural systems serve 3 general functions
 sensory systems
 motor systems
 associational systems  link the other two, higher order functions
 gene has coding DNA (exons) and noncoding DNA (introns); regulatory DNA switches genes on and off
 model organisms
 cat  visual cortex
 squid and sea slug had really large neurons
 4 species: worm C. elegans, Drosophila, zebrafish Danio rerio, mouse Mus musculus
 complete genome is available for them
 can try homologous recombination  splicing in new genes
 human genome has ~20k genes, ~14k expressed in brain, ~6k expressed only in brain
 singlegene mutations can cause diseases like microcephaly
 simulate brain as a computer
 passive cabling equation
 theoretical neuroscience
 blue brain project
 human brain project
cellular components
 neuron doctrine by Ramon y Cajal / Sherrington replaces Golgi’s reticular theory
 Cajal uses Golgi’s silver-staining method to show neurons are distinct
 Sherrington finds synapses
 there are rare gap junctions between neurons (where there are electrical synapses)
 neurons
 number of inputs reflects convergence
 number of targets reflects divergence
 local circuit neurons = interneurons  short axons
 projection neurons  long axons
 glial cells  support and regeneration
 outnumber neurons 3:1
 they are stem cells  can generate new glia
 maintaining ion gradients, modulating nerve signals, modulating synaptic action, scaffolding, aiding recovery
 astrocytes  in CNS, maintain chemical environment, retain stem cell properties
 oligodendrocytes  in CNS, lay down myelin  in PNS, Schwann cells do this
 microglial cells  remove debris
 glial stem cells  make more glia and sometimes neurons
cellular diversity
 ~10^11 neurons, more glia
 histology  microscopic analysis of cells
 Nissl method  stains the nucleolus and Nissl substance (rough ER)
neural circuits
 neuropil  bundle of dendrites/axons/glia  where synaptic connectivity occurs
 afferent neuron  carries info toward the brain
 efferent neuron  carries info away
 myotatic reflex example  kneejerk
organization of the human nervous system
 sensory systems

motor systems
 CNS
 brain
 spinal cord
 PNS
 sensory neurons
 somatic motor division  connect CNS to skeletal muscles
 autonomic or visceral motor division  innervate muscles / glands
 autonomic ganglia  peripheral motor neurons that take inputs from brainstem / spinal cord
 enteric system  small ganglia / neurons in gut that influence gastric motility and secretion
 sympathetic division  ganglia lie near the vertebral column and send their axons to a variety of targets
 parasympathetic division  ganglia are found near organs they innervate
 groupings
 ganglia  accumulations of cell bodies / supporting cells
 nerve  bundles of axons in PNS
 tract  bundles of CNS axons
 if they cross the brain midline they are called commissures
 nuclei  local accumulations of similar neurons
 cortex  sheetlike arrays of neurons
 gray matter  has more cell bodies
 white matter  has more axons
neural systems
 unity of function  divide things into different systems  ex. visual
 representation of information  ex. vision can be topographic map, taste can be computational map
 subdivision into subsystems  ex. vision has color, form, motion, all in parallel
structural analysis
 often-used lesion studies
 anterograde  source to termination
 vs retrograde  terminus to source
functional analysis of neural systems
 electrophysiological recording  uses electrodes, neuronlevel
 can determine receptive field  region in sensory space where a specific stimulus elicits a spike
 functional brain imaging  noninvasive, records local activity
 computerized tomography (CT), magnetic resonance imaging (MRI), diffusion tensor imaging (DTI), PET, SPECT, fMRI, MEG, MSI
analyzing complex behavior
 cognitive neuroscience  understanding higherorder functions
 neuroethology  complex behaviors of animals
2  electrical signals of nerve cells
 microelectrode  fine glass tubing filled with good conductor
 all cells have a voltage difference across them
 assume resting potential  we’ll use -58 mV
 depolarized  less negative  we’ll use +58 mV
 potentials
 receptor potential  (small) due to the activation of sensory neurons by external stimuli
 at terminal of dendrite
 graded  depends on how strong the input is
 synaptic potential  (small) caused by activation of synapse
 at middle of dendrite
 action potential  cause by the other 2
 at the axon
 active transporters create differences in concentrations of specific ions  battery
 cells can be depolarized by adding too much K+ outside
 ion channels  make membranes selectively permeable  wires
 ions
 outside: high Na, Cl; low K
 generally 10:1 ratio between inside, outside
 inside of cell has a bunch of negative proteins (in place of chloride)
 ions want to spread out because of entropy (then they factor in charge difference)
 $V_{ion} = \frac{58}{z} \log_{10} \frac{[X]_{out}}{[X]_{in}}$ mV
 calculate for each ion, if able to move
 z is charge on ion
 for potassium: 58/1 * log(.1) = -58 mV
 Cl Nernst potential is actually -70 mV (even though we assumed -58 before)
 Cl works as an inhibitor  ex. alcohol lets chloride in
 whichever ion leaks, this determines the membrane potential
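The 58/z rule above can be computed directly. A small helper using the notes' approximation (58 mV per decade near body temperature):

```python
import math

def nernst_mV(x_out, x_in, z=1):
    """Equilibrium potential from the notes' rule of thumb:
    V = (58/z) * log10([X]out / [X]in), in mV."""
    return 58.0 / z * math.log10(x_out / x_in)

# potassium with the ~10:1 inside:outside ratio from the notes
v_k = nernst_mV(x_out=1, x_in=10)   # 58 * log10(0.1) = -58 mV
```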
 hodgkinhuxley
 large squid escape neurons
 adding K+ outside depolarizes the cell
 adding Na+ outside raises height of spike
3  voltagedependent membranes
 voltage clamp  one electrode outside, one inside
 measured with reference to outside (usually more negative inside)
 keeps voltage constant
 current clamp  holds the injected current constant (often zero) and just measures the voltage without interfering
 patchclamp  suction part of cell into pipette
 passive properties
 current injection: $V_t = V_{\infty} (1 - e^{-t/\tau})$
 voltage decay: $V_t = V_{\infty} e^{-t/\tau}$
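The two passive-membrane equations can be checked numerically; at t = tau the cell has charged to about 63% of its final value (the units here are arbitrary illustration values):

```python
import math

def charging(t, v_inf, tau):
    """Response to a current step: V(t) = V_inf * (1 - e^(-t/tau))."""
    return v_inf * (1.0 - math.exp(-t / tau))

def decay(t, v0, tau):
    """Decay after the current stops: V(t) = V0 * e^(-t/tau)."""
    return v0 * math.exp(-t / tau)

# at t = tau, the membrane has charged to ~63% of its final value
v_at_tau = charging(t=10.0, v_inf=1.0, tau=10.0)   # ~0.632
```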
 block Na+ current with tetrodotoxin
 block K+ current with tetraethylammonium
 refractory period is because Na+ channels need to recover from inactivation
 Na+ is transient, K+ is not
 myelin insulates sections  less ion loss
 called saltatory conduction
 faster and more efficient
 concentrates action potential to nodes
 without myelin ~10 m/s; with myelin ~100 m/s
 deals with JAM receptor system
 unmyelinated can be ok
 might want to regenerate
 don’t care about speed
 PNS
 Schwann cells
 loss  Guillain-Barré syndrome
 CNS
 oligodendrocytes
 loss  multiple sclerosis (MS)
4  ion channel transporters
 patch electrode  pull a piece of membrane out
 cell-attached  don't break membrane
 whole-cell  break membrane
 inside-out  cytoplasmic face of the membrane faces the bath
 outside-out  extracellular face of the membrane faces the bath  this method is often preferred
 tetrodotoxin binds to outside of cell membrane
 some channels are delayed
 self-inactivating = transient  channels turn off by themselves
 take 10-20 ms
 voltagegated channels
 Na+, K+, Ca, Cl
 ion channels are studied in frog Xenopus oocytes
 TRP channels gated by mechanical / heat
 4 K+ channels
 delayed rectifier
 fast acting  shapes AP, used for hearing
 late phase  slow ending, makes AP fire again
 inward rectifier  open at resting potential  establishes membrane potential
 mitten model  protein rotates around
 human genes: ~10 Na, ~10 Ca, ~100 K, ~5 Cl
 there are more types of potassium channels
 channelopathies  diseases can be caused by altered ion channels
 ion transporters
 ATPase Pumps
 Na+/K+ pump
 Ca pump
 ion exchangers
 Na+/K+ pump exchanges 3Na for 2K ions
 1/3 of body’s energy
 2/3 of neuron’s energy
 brains use 20% of body’s energy
 Ouabain blocks this
 SERT  serotonin transporter (Na+-coupled)
 cotransporter
 ligand-gated channels
 respond to a chemical
 usually allow Na, K, Cl to flow in and out
5  synapses
synapse types
 electrical synapses
 gap junction channels
 ions flow through gap junction channels
 presynaptic and postsynaptic cell are almost the same
 delay is fast (~0.1 ms)
 gap junction proteins have been showing up in diseases
 simple organisms have these
 chemical synapses
 bouton  end of presynaptic axon
 spines  on the postsynaptic dendrite
 voltagegated Ca comes in and causes vesicles to fuse with presynaptic membrane
 neurotransmitter released
 bind to ligandgated channels which let ions flow through
 if Cl flows into postsynaptic cell  inhibition
 pumps get rid of neurotransmitters quickly
neurotransmitter types
 released when Ca comes in due to depolarization
 peptides
 ex. oxytocin
 require long Ca exposure
 loaded into vesicles at the cell body  can take days to get to bouton
 neurotransmitter diffuses away  doesn’t always have specific target
 can spread to all neurons in the area (ex. substance P)
 small & fast
 glutamate, Ach, GABA
 loaded into vesicles in bouton
 presynaptic cell takes these back up
discovery
 neurotransmitter lifecycle
 synthesis > receptors > function > removal
 important that they are removed
 60 s to recycle
 Loewi’s experiment showed that neurotransmitters can flow through solution to synchronize heart
synaptic transmission
 minis = MEPPs (miniature end-plate potentials)  not big enough to fire the neuron
 you can treat a muscle as a postsynaptic junction
 chatter from single vesicle release
 quantal basis of neurotransmitter release  1,2,3,etc because vesicles release as allornone
 synapses / vesicles are all about the same size across different species
 receptors receive these neurotransmitters differently
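The quantal, all-or-none vesicle picture above can be sketched with a toy simulation; the site count, release probability, and quantal size are made-up illustrative numbers, not measured values:

```python
import random

random.seed(0)

def epsp_mv(n_sites=10, p_release=0.2, quantal_mv=0.4):
    """Each release site independently releases 0 or 1 vesicle (all-or-none),
    so EPSP amplitude is an integer multiple of the quantal size."""
    vesicles = sum(random.random() < p_release for _ in range(n_sites))
    return vesicles * quantal_mv

amps = [epsp_mv() for _ in range(10000)]
mean_amp = sum(amps) / len(amps)  # approaches n * p * q = 0.8 mV
```

Histogramming `amps` would show the classic peaks at 0, 1, 2, ... quanta.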
 each vesicle is covered with proteins
 SNAPs on plasma membrane
 SNAREs on vesicle
 ex. synaptobrevin
 botulinum toxin, tetanus toxin  cleave synaptobrevin and other SNARE proteins
 render a vesicle inactive
 they recognize each other and lock for priming  ready to release when Ca+ enters
 need to endocytose membrane to make more vesicles
 endocytosis includes clathrin which curves the membrane
 can measure single ligandgated channel by clamping it alone
6  neurotransmitters
receptors
 ionotropic
 | Name | AMPA (GluRx) | NMDA (NRx) | Kainate | GABA_A | Glycine | Nicotinic ACh | 5-HT3 | P2X purinergic |
 |------|--------------|------------|---------|--------|---------|---------------|-------|----------------|
 | Abbrev | AMPA | NMDA | Kainate | GABA | Glycine | nACh | Serotonin | Purines |
 | Ion | Na | Na/Ca | Na | Cl | Cl | Na | Na | Na |
 metabotropic
 | Name | mGluRx | GABA_B | Dx | alpha/beta adrenergic | Hx | 5-HT1-7 | Purinergic A or P | ACh muscarinic Mx |
 |------|--------|--------|----|-----------------------|----|---------|-------------------|-------------------|
 | Abbrev | Glutamate | GABA_B | Dopamine | NE, Epi | Histamine | Serotonin | Purines | Muscarinic |
 | Function | | | cocaine, ADHD drugs | anti-anxiety | unknown, probably sleep | 5-HT3 is ionotropic! | | mushroom drugs |

excitatory
 Acetylcholine  excites muscle cells
 receptors: nAch
 Acetylcholinesterase breaks down Ach after it is released
 neurotoxins (ex. sarin) inhibit acetylcholinesterase so ACh stays and muscles stay on (nerve gases)
 Myasthenia gravis is when you develop antibodies against your own nACh receptors
 you have trouble controlling your eyes
 Glutamate  excites pyramidal cells
 VGLUT pumps Glutamate into vesicles
 EAAT transports extracellular glutamate into presynaptic terminal / nearby glial cells
 Glial cells convert glutamate to glutamine (inactivates) and glutamine is taken up by the presynaptic terminal
 glutamate overload if you overload the inactivating pumps in the glial cells
 glutamate receptors
 AMPA  fast Na only
 NMDA  slow, Na and Ca and also requires depolarization
inhibitory
 GABA / glycine
 have simple transporters that move released GABA into presynaptic terminal / Glia
 Ionotropic GABA receptors  depressants; shut down nervous system
 can bind steroids like estrogen  different effects in men / women
 alcohol binds to this, shuts things down
 immature neurons have high intracellular Cl  people don’t know why
neuromodulators
 lifecycle molecules
 synthesis: L-DOPA, tryptophan
 reuptake: DAT, NET, SerT
 breakdown: MAO
 vesicular transport: VMAT
 catecholamines
 pathway: Tyrosine > L-DOPA > Dopamine > Norepinephrine > Epinephrine
 dopamine  forming of memories
 produced by substantia nigra
 loss of these neurons > Parkinson’s
 norepinephrine
 produced by locus coeruleus
 serotonin = 5HT  happiness
 Tryptophan > serotonin
 serotonin produced by Raphe nuclei
 affected by LSD
 histamine  not wellknown
 antihistamines can make you hallucinate
 atp  sensitivity to pain
 neuropeptides  slow
 substance P  pain
 alphaendorphin  analgesia (block pain)
 vasopressin  blood pressure
 thyrotropin releasing hormone (TRH)  metabolism
 neuropeptide Y  mood/aggression
 endocannabinoids  weed
 ex. anandamide  binds to CB1 (cannabinoid 1 receptor)
 retrograde signal  post back to pre  inhibit the inhibitor
 this increases the signal
 nitric oxide  gas
 binds to guanylyl cyclase
7  molecular signaling
localization
 chemical signaling mechanisms
 synaptic  local
 ex. Ach
 paracrine  medium distance, neurotransmitter sprinkled and nearby targets take it up
 ex. serotonin
 endocrine  get into your blood stream  bodywide signaling
 ex. vasopressin
 amplification = enzyme
 signal that activates enzyme amplifies signal
 cellsignaling molecules
 cell-impermeant molecules
 need transmembrane receptors
 ex. glutamate
 cell-permeant molecules (steroids)
 can have intracellular receptors
 signaling molecules
 adhesion molecules  like a lock and key  binds neurons together
 cell-impermeant molecules
 spine has small neck  hard for proteins to go through it
 keeps information local
 large raft of signaling molecules keeps info local
cellular receptors
 ionotropic  channel-linked receptors
 neurotransmitter binds to a channel that opens
 ex. Glu ionotropic receptor
 signal is sodium coming in
 enzymelinked receptors
 ex. TrkA NGF receptor  Tyrosine kinase
 once it binds, it becomes an enzyme
 metabotropic  G-protein-coupled receptors
 ex. Glu metabotropic receptor
 activate Gprotein that then does something
 these require energy for Gproteins
 Heterotrimeric G-proteins
 G-protein has 3 subunits (alpha, beta, gamma); 3 main alpha types:
 1. Gs  binds norepinephrine  cAMP
 2. Gq  binds glutamate  DAG (diacylglycerol) & IP3
 3. Gi  binds dopamine  inhibits cAMP
 Monomeric G-proteins
 G-protein has just one subunit
 ex. Ras
 intracellular receptors
 ex. estrogen receptors
 activates intracellular transcription factors
second messengers
 Ca must be pumped out of neuron or
 Ca can be stored into internal stores in ER
 Adenylyl cyclase turns ATP into cAMP > PKA
 Guanylyl cyclase turns GTP into cGMP > PKG
 would use inside-out patch to study these
 Phospholipase C > PKC > IP3 lets Ca out of ER (usually kills cell)
protein control
 kinases (on switch) add phosphate to proteins and make them active
 PKA  cAMP binds then catalytic domain can bind
 noncovalent so can diffuse into nucleus
 CAM kinase
 covalent
 protein kinase C
 covalent  very local to membrane
 phosphatases (off switch) remove phosphates
long-term alteration
 long-term alteration requires epigenetic changes (change transcription factors)
 transcription factor CREB requires three things at once:
 1. Ca comes in and binds CaM kinase
 2. activate protein kinase A  this can stay in nucleus for a while  how much to make
 3. MAP kinase, ras  on/off switch
 when all of these come in at once  convergence signaling  CREB will make actin, AMPA receptors
 TrkA binds NGF (Nerve growth factor)  peptide and has 3 pathways:
 PI 3 kinase
 ras
 Phospholipase C
 this displays divergence signaling
 cerebellar synapses
 mGlu inhibits AMPA with negative feedback
 lets out Ca which depresses AMPA receptor
 signal scaling  tyrosine hydroxylase makes dopamine
 more Ca in bouton activates more hydroxylase > makes more dopamine
 Ca comes in whenever fires
 use it or lose it
8  synaptic plasticity
short-term
 measured by firing neuron before muscle and recording response
 synaptic facilitation  frequency-dependent plasticity  fire faster, get bigger EPSPs
 Ca comes in and persists during next pulse
 ms time scale
 synaptic depression  transmitters are depleted
 s time scale
 synaptic potentiation/augmentation  changes in presynaptic proteins
 min time scale
 experiments
 habituation  decrease vesicles on sensory neuron after too much stimuli
 sensitization  associate two stimuli together:
 mechanism
 sensory neuron > serotonergic neuromodulatory interneuron > motor neuron
 interneuron releases serotonin
 cAMP produced in sensory neuron
 long term  CREB in nucleus  synapse growth
 central signal for LTM
 activates PKA
 short term  decreases K+ current
 sensory neuron doesn’t learn, neuromuscular junction gets stronger
 mutant genes associated with cAMP identified
 phosphodiesterase  if you remove this, too much cAMP
 adenylate cyclase
long-term
 HM lost his memory w/out hippocampus
 site of LTP
 at night memories are moved from hippocampus (flash drive) to cortex (longterm hard drive)
 Schaffer collateral pathway is one pathway in hippocampus (also perforant pathway, mossy fiber)
 prestimulate with tetanus
 later when stimulated EPSP is bigger
 usually 20 ms between firing  LTP when multiple firing in less time
 Schaffer mechanism  NMDA receptor key to this
 both AMPA and NMDA exist
 NMDA > Ca > CAM kinase > LTP
 Mg blocks NMDA unless already depolarized
 requires good timing!
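The Mg2+ block makes the NMDA receptor act like a molecular AND gate (coincidence detector), which is where the timing requirement comes from; a toy sketch of that logic:

```python
def nmda_conducts(glutamate_bound, postsynaptic_depolarized):
    """NMDA receptor: needs presynaptic glutamate AND postsynaptic
    depolarization, because depolarization expels the Mg2+ plug."""
    mg_block_removed = postsynaptic_depolarized
    return glutamate_bound and mg_block_removed

assert not nmda_conducts(True, False)   # glutamate alone: still Mg2+-blocked
assert not nmda_conducts(False, True)   # depolarized but no glutamate
assert nmda_conducts(True, True)        # coincidence: Ca flows in, LTP can start
```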
 silent synapses
 shortterm  more AMPA receptors, longterm new synapses
 all synapses born with only NMDA
 protein synthesis needed for LTP (mostly making more AMPA)
 unclear how synapse decides whether to strengthen / make more synapses
 LTD (long-term depression) is the opposite of LTP (long-term potentiation)
 low levels of Ca lead to AMPA being endocytosed
 mGlu > PKC > starts LTD
 epilepsy  neurons fire together and wire together

bmi
 companies
 Neuralink
 Kernel
 data types
 Neural dust
 calcium imaging

Sensory Input
 9 somatosensory
 10  nociception
 11  vision (eye)
 12  central visual system
 13  auditory system
 14  vestibular system
 15  chemical senses
9  somatosensory
cheat sheet
 vocab
 nerve  bundle of axons
 tract  bundle of axons in CNS
 nucleus  bundle of neurons related to some function
 midline  center of nervous system
 brain tends to be lateralized  one side is given control
 ex. speak almost exclusively from left side of brain
 information processing
 feedback (gain)
 almost always with glutamatergic / GABA
 feedforward  anticipation
 estimate things before they happen
 adjust your behavior in advance of the world (ex. lean before you hit a table)
 centersurround inhibition (spatial gain)
 if you touch yourself, brain enhances sensitivity of one point by suppressing information from around it
sensory system overview
 we have dorsal root ganglia (DRG) on spinal cord
 axon goes to CNS
 dendrites go everywhere
 pseudounipolar  born bipolar but become unipolar
 dendrite goes straight into axon with cell body off to the side
 do very little processing
 dorsal horn  top layer that controls sensory information
 in the brain stem, these are called cranial ganglia
 special one is trigeminal ganglia (sensory receptors for face)
 oxytocin important clinically
 Trp channels  connected mechanically into membrane
 dermatomes
 map of sensory parts to brain
 segments of spinal cord correspond to stripes across your body
 brain to feet: cervical, thoracic, lumbar, sacral
 shingles  virus where you get stripes of sores  single DRG
 pops out the skin on the dendrite of one DRG
 peripheral damage won’t give you stripes of pain
 feeling resolution  depends on density of neurons innervating skin
 more neurons  small receptive fields
 twopoint discrimination test  poke you at different points and see if you can tell if the points are different
 higher discrimination is better
 discrimination is different from sensitivity (like how it hurts when wounded)
4 neuron classes
 they have certain structures that tune them into certain kinds of vibrations
 Proprioception
 muscle spindles  on every neuron  fastest
 measures stretch on every muscles
 lets you know where your arm is
 Golgi tendon organ
 measures tension on tendon
 safety switches  numb your body if you’re overstressing something (make you let go of hanging on cliff)
 Ia II  touch neurons
 superficial  most sensitive
 Merkel: high-res, slow adapt
 Meissner: high-res, fast adapt
 deeper  sense vibrations, pressure
 Ruffini: low-res, slow adapt
 Pacinian: low-res, fast adapt
 these are in order of depth
 diabetes  tissue loss and pain / numbness are lost
 A-delta  fast pain
 C fibers  pain, temperature, itch
 very slow, stay on
 no myelination
 pruritus  newly discovered set of sensory neurons
 between pain/touch  itch neurons
 new in mice: massage neurons
 can only fire by stimulating in certain pattern
 goes to emotion center not knowledge  pleasure
 speed proportional to diameter, myelination
 adaptation
 some adapt slowly (you keep feeling something)
 some adapt quickly (stop feeling)
 if you move finger slightly, start firing again when changed
 better if you feel cockroach that starts moving
pathways
 upper-body
 Cuneate nucleus  everything goes into this (in the brain stem)
 VPL  everything accumulates here in the thalamus, then goes to
 S1 cortex  primary somatic sensory cortex  this is the knowledge of where was touched
 lower-body (trunk down)
 everything in the lower body goes to Gracile nucleus  in brain stem
 special case  sensory for face
 trigeminal ganglion connects into vpm (thalamus) then goes into S1 cortex
 proprioceptive pathways
 starts in lower body
 axons split  half go up to Clark’s nucleus
 half go back into muscles
 Clark’s nucleus goes straight into cerebellum
 starts in upper body  goes straight into cerebellum
 thus cerebellum have map of where / how tense muscles are
representation
 cortex  this is where understanding is
 dedicates area based on how many neurons coming in
 lips / hands have more area
 S1  primary somatosensory cortex
 most body parts
 neurons from functionally distinct columns
 cortex assigns space based on how much info comes in
 after amputation and time, map grows into lost space
 map is different when different stimuli are given to fingers
 S2  secondary somatosensory cortex
 processes and codes information from S1
 throat, tongue, teeth, jaw, gum
pathway
 mechanosensory
 DRG
 Cuneate, Gracile
 VPL
 S1
 face mechanosensory
 trigeminal ganglion
 principal nucleus of trigeminal complex
 vpm
 S1
 proprioception
 lower body
 muscle spindles split
 half go to motor neurons
 other half go to Clark’s nucleus
 clark’s nucleus > cerebellum
 upper body  straight to the cerebellum
10  nociception
review
 chronic pain is very important clinically
 cortex  lets you know if you are sensing something
 loss-of-function lesion  piece of cortex is lost  lose awareness
 can come from stroke, migraine aura
 gain-of-function lesion = excitatory lesion  like epilepsy
 cortex comes on when it shouldn’t
 increased awareness
 can come from stroke / migraine
 “sixth sense”  measuring stretch of all your muscles in cerebellum
 nociception = pain
 has nociceptors  neurons that do nociception
 thermoceptors  neurons that sense temperature
 two classes of linking receptors
 Adelta fibers  fast pain
 C fibers  slow and chronic
 Trp channels  mechanically or thermally gated
 let Na+ in
 trpV heat  binds capsaicin
 in the class of vanilloids
 birds not capsaicin sensitive
 trpM cold  binds menthol
 adapts in minutes  stop feeling cold after a while
 synapses of nociceptors go to dorsal horn of the spinal cord
 nociceptor goes contralateral (must cross midline)  if you cut left side of spinal cord, lose mechanoception (ipsilateral) from left and nociception (contralateral) from right
 mechanoreceptors, by contrast, send axon up the spinal cord
 dorsal horn has laminal structure (has layers)
 know where pain is
 somatosensory cortex
 care about pain
 insular cortex  emotional part of brain
 whether or not you care about pain
 pairs up with other senses
 can have both loss-of-function and gain-of-function lesions in both places
 referred pain map  map that refers to a specific problem (ex. esophagus)
 visceral pain  don’t know where the pain is
 hyperalgesia  increased pain sensitivity
 pain sensing neurons are hyperactive because of inflammation
 pain sensing neuron releases substance P into Mast cell or neutrophil which releases histamine which strengthens receptor
 prostaglandins activate nociceptors
 allodynia  when mechanosensation hurts  not understood
 turning off pain  add serotonin
 exercise
 lack of serotonin ~ mood disorders
 central sensitization: allodynia
 these mechanisms work through interoception
 senses chemical imbalances
 phantom limbs and phantom pain  if you lose a limb and still feel pain
 mechanoreceptors inhibit nociceptors
pathway
 nociception
 same as mechanosensory except goes all the way to thalamus
 doesn’t stop in brainstem
 crosses the midline after first synapse
 visceral pain
 axons mainline straight up, go through vpl, go straight to insular cortex
11  vision (eye)
 most of visual system is to read faces
 eye
 aqueous humor
 posterior chamber
 lens
 ciliary muscles
 retina
 fovea
 optic disk
 optic nerve and retinal vessels
 accommodation = changing lens shape to focus (stretched flat for far, rounded for near)
 retina  rods and cones are at back
 cones  color
 retinal ganglion cells send down the signal
 12 days to turnover whole photoreceptor disks into PE (pigment epithelium)
 PE is what the rods / cones are embedded in
 the disks contain rhodopsin, a light-sensitive protein; old disks break off of rods / cones into the PE
 light leads to inhibition
 melanopsin  receptor for blue light
circuits
 accommodation  stretching lens uncrosses lines
 function photoreceptor
 usually cGMP is letting in Na/Ca
 Ca provides negative feedback here
 when light hits, retinal inside rhodopsin activates phosphodiesterase  breaks down cGMP so the channel closes and Na/Ca aren't let in
 light on middle
 depolarizes cone
 excites oncenter
 inhibits offcenter
 these adjust quickly
 horizontal cells  take positive input from a photoreceptor and inhibit it back
 also inhibit the photoreceptors around it  creates contrast
 have these for each color
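This lateral (center-surround) inhibition can be sketched in 1D: each point's signal is reduced by a fraction of its neighbors' signals, which sharpens edges; the inhibition strength `k` is an arbitrary illustrative value:

```python
def lateral_inhibition(signal, k=0.5):
    """Subtract a fraction k of the neighbors' average from each point
    (edges reuse the point's own value, a simple boundary choice)."""
    out = []
    for i, x in enumerate(signal):
        left = signal[i - 1] if i > 0 else x
        right = signal[i + 1] if i < len(signal) - 1 else x
        out.append(x - k * (left + right) / 2)
    return out

step = [1, 1, 1, 5, 5, 5]        # a simple luminance edge
resp = lateral_inhibition(step)  # dips just before the edge, peaks just after
```

The dip/peak pair around the edge is the same effect behind Mach bands.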
pathway

 1. rods / cones
 2. horizontal cells  regulate gain control, how fast it adapts, contrast adaptation
 3. bipolar cells
 4. amacrine cells  processing of movements
 5. retinal ganglion cells
 6. go into thalamus then to cortex; a small amount go into brain stem and control mood / circadian rhythms
12  central visual system
 cortex is a pizza box
 has columns
 autophagy  process by which cells eat parts of themselves
 nobel 2016
 cones  color
 12 day cycle for processing optic disks
 photoreceptors have a cGMP-gated channel
 light shuts down photoreceptors
 cell decreases in activity
 very roughly  each cone connects to cone bipolar cell
 this gets represented by one column in the cortex
 15-30 rods connect to 1 rod bipolar cell
 cortex has 6 layers
 each has tons of neurons, mostly pyramidal neurons
 column is a section through the 6 layers  all does about the same thing
 orientation columns respond to specific x,y
 has subregions that respond to specific orientations
 ocular dominance column  both eyes for same coordinate go to same spot
 dominated by one eye
 distance
 far cells
 tuned cells
 near cells
 V4 in temporal lobe  object recognition
pathways
 overall
 V1
 V2
 V4 or MT
 central projections
 retinal ganglions
 all go through optic stuff
13  auditory system
 ear parts
 outer
 middle
 tympanic membrane
 inner
 cochlea  senses the sound
 oval window
 round window  not understood
 conductive hearing loss  in the outer/middle ear
 sensorineural hearing loss  in the cochlea
 can’t be fixed with hearing aids
 humans
 2-5 kHz ~= human speech (can sometimes hear more)
 30-100x boost for tympanic membrane
 this differs between people
 200x focus onto oval window
 cochlea
 4 layers
 inner hair cells  what you hear with
 outer hair cells  generate sound
 generates noise at every frequency except one you want to hear
 otoacoustic emission  low buzz that is produced
 tinnitus  ringing in the ears
 can be internal
 can be peripheral  generated by otoacoustic emission
 high frequencies sensed at the base of the cochlea
 low frequencies at the distal tip (apex)
 human high frequency cells die with age
 4 layers
 hair cells
 bundle of cilia
 have an orientation
 kinocilium is the tallest
 tall ones are in the back
 dying hair cells  can’t be replaced
 loud sounds
 certain antibiotics
 auditory pathway
 MSO  medial superior olive  decides where the sound is coming from
 takes input from right / left ear, decides which came in first
 medial geniculate complex of the thalamus
 brain shape
 folds are pretty random
 phrenology  shape of skull was based on brain
 thought it could determine personality
 false
 Heschl's gyrus folding pattern is not random
 argument that if you have one, you are more musical
 any sound is made up of a bunch of frequencies
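The MSO's which-ear-heard-it-first computation can be sketched as finding the interaural delay that best aligns the two ears' signals; this is a toy cross-correlation on made-up sample data, not a model of real MSO circuitry:

```python
def best_interaural_lag(left, right, max_lag=5):
    """Return the lag (in samples) of the right-ear signal that
    maximizes the correlation with the left-ear signal."""
    def corr(lag):
        return sum(left[i] * right[i + lag]
                   for i in range(len(left)) if 0 <= i + lag < len(right))
    return max(range(-max_lag, max_lag + 1), key=corr)

left  = [0, 0, 1, 2, 1, 0, 0, 0, 0]   # click reaches the left ear first...
right = [0, 0, 0, 0, 1, 2, 1, 0, 0]   # ...and the right ear 2 samples later
lag = best_interaural_lag(left, right)  # 2
```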
circuits
 K depolarizes hair cells, lets in Ca, releases vesicles
14  vestibular system
 very related to cochlea
 same hair cells
 differences
 1. vestibular system doesn't use cortex (you don't think about it)  goes right into spinal cord
 2. controls eye movements
 one of the fastest circuits in the brain
 clinically important
 you have to be able to have your balance
 each column is computational unit of the cortex
 ocular dominance column
 one for each eye

 labyrinth and its innervation
 semicircular canals
 can only measure one axis of rotation
 remember horizontal canal  measures turning head left to right
 this measures acceleration
 like a hula hoop filled with glitter
 has ampulla at one place in the hoop
 cupula  sits over the ampulla’s hair cells
 if the “glitter” hits the cupula, it will bend the hair cells
 if you keep spinning, fluid starts moving and you stop detecting movement
 this means the canals adapt mechanically
 if you stop spinning, fluid keeps moving and system thinks you’re spinning the other way
 right horizontal canal activated by turn to the right
 same for left
 Scarpa's ganglion  contains the cell bodies of the vestibular afferents
 sends axons into vestibular nuclei
 lots of fluid (high in K+)
 macula  place where all the hair cells are
 Ampullae  at base of canals
 hair cells all in the same direction
 utricle and saccule  measure head tilt
 hair cells in multiple orientations
 these contain otoconia
 these are little crystals that move with gravity
 measure acceleration and tilt
 tilts do not adapt  they keep firing while you’re leaned back
 they basically report tilt / position at all times
 tiplink  connect cilia together for hair cells
 when they move, tiplink move, pull on ion channels
 motor on connected hair cell moves up and down to generate correct amount of tension
 motor uses myosin and actin
 harming these proteins can cause deafness
 both eyes must always be looking in the same direction
 also must be sitting over image for a while
 ipsilateral  same side
 contralateral  different side
 vestibular ocular reflex VOR  turn your head to the right, eyes move left
 doesn’t require cortex
 only have to learn excitatory
15  chemical senses
 cAMP is used by GPCR

Research future
 keeping up to date: https://sanjayankur31.github.io/planetneuroscience/
questions
 problems that are solved, or soon will be
 how do single neurons compute?
 what is the connectome of a small nervous system, like that of C. elegans (300 neurons)?
 how can we image a live brain of 100,000 neurons at cellular and millisecond resolution?
 hydra was completed
 how does sensory transduction work?
 problems that we should be able to solve in the next 50 years
 can we add senses to the brain?
 like cochlear implant
 like vibrations
 how do circuits of neurons compute?
 what is the complete connectome of the mouse brain (70e6 neurons)?
 how can we image a live mouse brain at cellular and millisecond resolution?
 what causes psychiatric and neurological illness?
 how do learning and memory work?
 shortterm vs. longterm
 declarative vs. nondeclarative
 encodes relationships between things not things themselves
 memory retrieval
 why do we sleep and dream?
 sleep is restorative (but then why high neural activity?)
 allows the brain to run simulations
 consolidating memories and forgetting
 where is consciousness?
 at this point, sounds and vision should line up (delayed appropriately)
 how do we make decisions?
 how does the brain represent abstract ideas?
 what does neural baseline activity represent?
 how does the brain solve timing?
 moving eyes
 blinking
 hearing and vision time differences
 how does sensorimotor learning build a model of the world?
 problems that we should be able to solve, but who knows when
 how do brains simulate the future?
 how does the mouse brain compute?
 what is the complete connectome of the human brain (8e10 neurons)?
 how can we image a live human brain at cellular and millisecond resolution?
 how could we cure psychiatric and neurological diseases?
 how could we make everybody’s brain function best?
 brain and quantum?
 some work in quantum neural nets
 how is info coded in neural activity?
 like measuring transistors and guessing what the computer is doing
 neuron gets lots of inputs
 do glial cells and other signaling molecules compute?
 what is intelligence?
 what is iq?
 how do specialized systems integrate?
 problems we may never solve
 what are emotions?
 brain states that quickly assign values
 in the amygdala
 how does the human brain compute?
 how can cognition be so flexible and generative?
 how and why does conscious experience arise?
 thing that flickers on when you wake up that was not there
 evolutionary to manage all the different systems
 metaquestions
 what counts as an explanation of how the brain works? (and which disciplines would be needed to provide it?)
 how could we build a brain? (how do evolution and development do it?)
 what are the different ways of understanding the brain? (what is function, algorithm, implementation?)
 ref David Eaglemen article: http://discovermagazine.com/2007/aug/unsolvedbrainmysteries
 ref Adolphs 2015, “The unsolved problems of neuroscience”
brainmachine interfaces
 neuralink
 surgeons won’t want to put chips into people’s brains
 only for people with serious medical conditions right now
 http://waitbutwhy.com/2017/04/neuralink.html
RNA barcoding
 allows for tagging different neurons
 can then optically get differences
 also can sequence and get differences (http://www.cell.com/neuron/pdf/S08966273%2816%29304214.pdf)
 future of electrophysiology: https://www.technologynetworks.com/neuroscience/articles/shiningalightonthefutureofelectrophysiology286992
brain transplant
 computational hypothesis of the mind
tms
 temporary cure for autism
 can change people’s minds

Research main
 Problems: Alzheimer's, PTSD, autism, addiction, MS, depression, schizophrenia
1  neural decoding
fMRI decoding
 reconstructing visual experiences from brain activity evoked by movies (Gallant, 2011)
 try doing this with music?
 typing
 like fb or neuralink
 new data (e.g. BMI?)
spike sorting
 can be based on electrodes or calciumimaging data
 can get gt with intracellular recordings
 spike sorting with GANs
 simulated datasets also work well
 http://spikefinder.codeneuro.org
neural encoding
 cochlear implant turns sound into neural signal
2  brain mapping
structural connectomics
 random forests / CNNs for neuronal image segmentation
 uses gradient boosting with MALIS
functional connectivity
 computational fMRI (Cohen et al. 2017)
 using graphical models with weighted-l1 regularization
3  computational learning models
neural priors
 cox train cnn w/ fMRI
comparison to cnn
 look at features found in layers
biophysically plausible network learning
 PCA
 antihebbian learning (Foldiak)
 sparse coding (Olshausen & Field)
 ICA (Sejnowski)
 adaptive synaptogenesis with inhibition
4  theoretical neural coding
action potentials
 velocity vs. energy
linearization
 linearization PLOS
 linearization JCNeuro
 interspike interval
5  cnns
 autoencoder with sparsity rules
 rotation project
 train without flips
 neural network compression
 extracting memory with deep learning
 learning how to find the right segments of memory
 learning to decode another neural network?

Research ref
datasets
 senseLab: https://senselab.med.yale.edu/
 modelDB  has NEURON code
 model databases: http://www.cnsorg.org/modeldatabase
 comp neuro databases: http://home.earthlink.net/~perlewitz/database.html
 crns data: http://crcns.org/
 hippocampus spike train data: http://crcns.org/datasets/hc
 visual cortex data (gallant)
 allen brain atlas
 wikipedia page: https://en.wikipedia.org/wiki/List_of_neuroscience_databases
 human fMRI datasets: https://docs.google.com/document/d/1bRqfcJOV7U4faa3h8yPBjYQoLXYLLgeY6_af_N2CTM/edit
 Kay et al 2008 has data on responses to images
 calcium imaging data: http://spikefinder.codeneuro.org/
 spikes: http://www2.le.ac.uk/departments/engineering/research/bioengineering/neuroengineeringlab/software
data types
 | | fMRI | EEG | ECoG | LFP + single-unit (together form microelectrode array) | calcium imaging |
 |--|------|-----|------|--------------------------------------------------------|-----------------|
 | scale | high | high | high | low | tiny |
 | spatial res | mid-low | very low | low | mid-low | x |
 | temporal res | very low | mid-high | high | high | super high |
 | invasiveness | non | non | yes (under skull) | very | very |
 neural dust
ongoing projects
 govsponsored
 human brain project
 blue brain project  largescale brain simulation
 european brain project
 companies
 Neuralink
 Kernel
 Facebook neural typing interface
 google brain
 IBM: project joshua blue
conferences
 Annual Computational Neuroscience Meeting
 Statistical Analysis of Neuronal Data
 2017
 SFN (11/1111/15)  DC
 NIPS (12/412/9)  Long Beach
 2018
 ICCV (March)
 VSS (5/185/23)  Florida (Always)
 ICML (7/107/15)  Stockholm
 CVPR (6/186/23)  Salt Lake City
 SFN (11/311/7)  San Diego
 NIPS (12/312/8)  Montreal
 2019
 ICCV (March)  Korea?
 ICML (7/107/14)  Long Beach
 CVPR (Unknown)
 SFN (10/1910/23)  Chicago
 NIPS (Unknown)
areas
 Basic approaches:
 The problem of neural coding
 Spike trains, point processes, and firing rate
 Statistical thinking in neuroscience
 Overview of stimulus-response function models
 Theory of model fitting / regularization / hypothesis testing
 Bayesian methods
 Estimation of stimulus-response functionals: regression methods, spike-triggered covariance
 Variance analysis of neural response
 Estimation of SNR. Coherence
 Generalized Linear Models
 Information theoretic approaches:
 Information transmission rates and maximally informative dimensions
 Scene statistics approaches and neural modeling
 Techniques for analyzing multiple-unit recordings:
 Event sorting in electrophysiology and optical imaging
 Optophysiology cell detection
 Sparse coding/ICA methods, both vanilla and with statistical models of nonlinear dependencies
 Methods for assessing functional connectivity
 Statistical issues in network identification
 Low-dimensional latent dynamical structure in network activity (Gaussian process factor analysis / newer methods)
 Models of memory, motor control and decision making:
 Neural integrators
 Attractor networks

Information Theory
[toc]
Infotheory basics
entropy
 $H(X) = -\sum p(x) \log p(x) = E[h(p)]$
 $h(p)= -\log(p)$
 writing $H(p)$ with a scalar p denotes the binary entropy $H(p, 1-p)$
 intuition
 higher entropy $\implies$ more uniform
 lower entropy $\implies$ more pure
 expectation of variable $W=W(X)$, which assumes the value $-\log(p_i)$ with probability $p_i$
 the minimum average number of binary questions (like "is X=1?") required to determine the value of X lies between H(X) and H(X)+1
 related to asymptotic behavior of sequence of i.i.d. random variables
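The entropy definition above can be sketched directly in Python (helper name is my own):

```python
import math

def entropy(probs):
    """Shannon entropy H(X) = -sum p log2 p, with the 0 log 0 = 0 convention."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform distribution maximizes entropy; a deterministic one has zero entropy.
print(entropy([0.5, 0.5]))   # 1.0 bit
print(entropy([0.25] * 4))   # 2.0 bits
print(entropy([1.0]) == 0.0) # True
```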

$H(Y|X)=\sum_j p(x_j) H(Y|X=x_j)$
$H(X,Y)=H(X)+H(Y|X) =H(Y)+H(X|Y)$

relative entropy / mutual info
 relative entropy = KL divergence  measures the dissimilarity between 2 distributions (not symmetric, so not a true distance)
 if we knew the true distribution p of the random variable, we could construct a code with average description length H(p).

If, instead, we used the code for a distribution q, we would need $H(p) + D(p\|q)$ bits on average to describe the random variable.
$D(p\|q) \neq D(q\|p)$
 mutual info I(X; Y)

$I(X; Y) = \sum_x \sum_y p(x,y) \log \frac{p(x,y)}{p(x) p(y)} = D(p(x,y)\|p(x)p(y))$
$I(X; Y) = H(X) - H(X|Y)$
$I(X; X) = H(X)$, so entropy is sometimes called self-information
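A minimal sketch of KL divergence and mutual information for finite pmfs (function names are my own, not from the notes):

```python
import math

def kl(p, q):
    """D(p||q) = sum p log2(p/q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_info(joint):
    """I(X;Y) = D(p(x,y) || p(x)p(y)) for a joint pmf given as a 2D list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(pxy * math.log2(pxy / (px[i] * py[j]))
               for i, row in enumerate(joint)
               for j, pxy in enumerate(row) if pxy > 0)

# Independent variables carry zero mutual information.
print(mutual_info([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
# Perfectly correlated bits: I(X;Y) = H(X) = 1 bit (self-information).
print(mutual_info([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
```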

chain rules

entropy  $H(X_1, …, X_n) = \sum_i H(X_i|X_{i-1}, …, X_1)$
conditional mutual info $I(X; Y|Z) = H(X|Z) - H(X|Y,Z)$
$I(X_1, …, X_n; Y) = \sum_i I(X_i; Y|X_{i-1}, … , X_1)$


conditional relative entropy $D(p(y|x)\|q(y|x)) = \sum_x p(x) \sum_y p(y|x) \log \frac{p(y|x)}{q(y|x)}$
$D(p(x, y)\|q(x, y)) = D(p(x)\|q(x)) + D(p(y|x)\|q(y|x))$

Jensen’s inequality
 convex  the function lies below any of its chords
 has nonnegative 2nd derivative (when twice differentiable)
 linear functions are both convex and concave
 Jensen’s inequality  if f is a convex function and X is an R.V., $E[f(X)] \geq f(E[X])$
 if f strictly convex, equality $\implies X=E[X]$
 implications

information inequality $D(p\|q) \geq 0$ with equality iff p(x)=q(x) for all x
$H(X) \leq \log |\mathcal{X}|$ where $|\mathcal{X}|$ denotes the number of elements in the range of X, with equality if and only if X has a uniform distribution
$H(X|Y) \leq H(X)$  information can't hurt
$H(X_1, …, X_n) \leq \sum_i H(X_i)$

axiomatic approach
 fundamental theorem of information theory  it is possible to transmit information through a noisy channel at any rate less than channel capacity with an arbitrarily small probability of error
 to achieve arbitrarily high reliability, it is necessary to reduce the transmission rate to the channel capacity
 uncertainty measure axioms
 H(1/M,…,1/M)=f(M) is a monotonically increasing function of M
 f(ML) = f(M)+f(L) where M,L $\in \mathbb{Z}^+$
 grouping axiom
 H(p,1p) is continuous function of p
 $H(p_1,…,p_M) = -\sum p_i \log p_i = E[h(p_i)]$
 $h(p_i)= -\log(p_i)$
 only solution satisfying above axioms
 H(p,1p) has max at 1/2
 lemma  Let $p_1,…,p_M$ and $q_1,…,q_M$ be arbitrary positive numbers with $\sum p_i = \sum q_i = 1$. Then $-\sum p_i \log p_i \leq -\sum p_i \log q_i$. Only equal if $p_i = q_i \; \forall i$
 intuitively, $\sum p_i \log q_i$ is maximized over q when $q_i=p_i$, like a dot product
 $H(p_1,…,p_M) \leq log M$ with equality iff all $p_i = 1/M$
 $H(X,Y) \leq H(X) + H(Y)$ with equality iff X and Y are independent

 $I(X;Y)=H(Y)-H(Y|X)$
 sometimes allow p=0 by defining 0 log 0 = 0
 information $I(x)=\log_2 \frac{1}{p(x)}=-\log_2 p(x)$
 reduction in uncertainty (amount of surprise in the outcome)
 if the probability of an event is small and it happens, the information is large
 entropy $H(X)=E[I(X)]=\sum_i p(x_i)I(x_i)=-\sum_i p(x_i)\log_2 p(x_i)$

information gain $IG(X,Y)=H(Y)-H(Y|X)$
where $H(Y|X)=-\sum_j p(x_j) \sum_i p(y_i|x_j) \log_2 p(y_i|x_j)$
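The information-gain formula can be exercised on a small hypothetical split (the counts below are made up for illustration; `entropy_from_counts` is my own helper):

```python
import math

def entropy_from_counts(counts):
    """Entropy of an empirical distribution given as class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# Hypothetical: 14 samples (9 yes / 5 no) split by a binary feature into
# a group of 8 (6 yes / 2 no) and a group of 6 (3 yes / 3 no).
h_y = entropy_from_counts([9, 5])
h_y_given_x = (8 / 14) * entropy_from_counts([6, 2]) \
            + (6 / 14) * entropy_from_counts([3, 3])
ig = h_y - h_y_given_x
print(round(ig, 4))  # small positive gain: the split barely reduces entropy
```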

 parts
 random variable X taking on $x_1,…,x_M$ with probabilities $p_1,…,p_M$
 code alphabet = set $a_1,…,a_D$ . Each symbol $x_i$ is assigned to finite sequence of code characters called code word associated with $x_i$
 objective  minimize the average word length $\sum p_i n_i$ where $n_i$ is the length of the code word for $x_i$
 code is uniquely decipherable if every finite sequence of code characters corresponds to at most one message
 instantaneous code  no code word is a prefix of another code word
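A quick check of the average word length objective, together with the Kraft inequality (a standard fact about instantaneous/prefix-free codes, not stated in the notes above):

```python
def avg_word_length(probs, lengths):
    """Average codeword length: sum p_i * n_i."""
    return sum(p * n for p, n in zip(probs, lengths))

def kraft_sum(lengths, D=2):
    """Prefix-free codes over a D-ary alphabet satisfy sum D^{-n_i} <= 1."""
    return sum(D ** -n for n in lengths)

# Hypothetical dyadic source with a prefix-free binary code 0, 10, 110, 111.
probs = [0.5, 0.25, 0.125, 0.125]       # entropy = 1.75 bits
lengths = [1, 2, 3, 3]
print(avg_word_length(probs, lengths))  # 1.75, meeting the entropy bound
print(kraft_sum(lengths))               # 1.0
```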

Linear Models
 ch 1 introduction
 ch 2 simple linear regression
 ch 3 multiple linear regression
 ch 4 multicollinearity
 ch 5 categorical predictors
 ch 6 polynomial regression
 ch 7 model comparison and selection
 ch 8 automated search procedures
 ch 9 model building process overview
ch 1 introduction
 regression analysis studies relationships among variables
 $Y = f(X_1,…X_i) + \epsilon$
 We can use $X_i^2$ as a term in a linear regression, but the function must be a linear combination of terms (no coefficients in the exponent, in a sine, etc.)
 regression analysis
 statement of the problem
 select potential relevant variables
 data collection
 model specification
 choice of fitting method
 model fitting
 model validation  important
 iterative process!
 regressions
 simple linear regression  univariate Y, univariate X
 multiple linear regression  univariate Y, multivariate X
 multivariate linear regression  multivariate Y
 generalized linear regression  Y isn’t normally distributed
 ANOVA  all X are categorical
 Analysis of covariance  part of X are categorical
ch 2 simple linear regression
basics
 $Y = \beta_0 + \beta_1X + \epsilon$
 Take samples $x_i,y_i$
 assume error $\epsilon \sim N(0,\sigma^2)$
 further assume error $\epsilon_i,\epsilon_j,…\epsilon_n$ are i.i.d
 this isn’t always the case, for example if some of the data points were correlated with each other
 given $x_i$
 $Var[Y_i] = Var[\epsilon_i] = \sigma^2$
 $Y_i \sim N(\beta_0 + \beta_1x_i , \sigma^2)$
 $Cov(Y_i,Y_j) = 0$, uncorrelated
 p-value  the probability, computed assuming $H_0$ is true, of observing a test statistic at least as extreme as the one obtained
 want this to be low to reject
parameter estimation (least squares)
 $\epsilon_i = y_i - \beta_0 - \beta_1x_i$
 minimize sum of squared errors
 Sums
 SSE = $\sum_1^n\epsilon_i^2 = \sum_1^n (y_i - \beta_0 - \beta_1x_i)^2$
 $S_{xx}=\sum (x_i-\bar{x})^2$
 $S_{yy}=\sum (y_i-\bar{y})^2$
 $S_{xy}=\sum (x_i-\bar{x})(y_i-\bar{y})$
 estimators
 $\hat{\beta_1}=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2} = \frac{S_{xy}}{S_{xx}}$
 $\hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x}$
 GaussMarkov Theorem  the least squares estimators $\hat{\beta_0}$ and $\hat{\beta_1}$ are unbiased estimators and have minimum variance among all unbiased linear estimators = best linear unbiased estimators.
 unbiased means $E[\hat{x}] = E[x]$
 $Var(\hat{\beta_1})=\frac{\sigma^2}{\sum(x_i-\bar{x})^2}$
 $Var(\hat{\beta_0})=\sigma^2\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum(x_i-\bar{x})^2}\right]$
 $\hat{\sigma}^2 = MSE = \frac{SSE}{n-2}$  n-2 since there are 2 parameters in the linear model
 sometimes we have to enforce $\beta_0=0$, there are different statistics for this
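The least-squares estimators above can be sketched directly (data below are made up; on points that lie exactly on a line, the fit recovers the line):

```python
def fit_simple_lr(xs, ys):
    """Least squares: b1 = Sxy / Sxx, b0 = ybar - b1 * xbar."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Points on the exact line y = 2 + 3x are recovered exactly.
b0, b1 = fit_simple_lr([0, 1, 2, 3], [2, 5, 8, 11])
print(b0, b1)  # 2.0 3.0
```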
evaluate model fitting
 SST  total sum of squares  measure of total variation in response variable
 $\sum(y_i-\bar{y})^2$
 SSR  regression sum of squares  measure of variation explained by predictors
 $\sum(\hat{y_i}-\bar{y})^2$
 SSE  measure of variation not explained by predictors
 $\sum(y_i-\hat{y_i})^2$
 SST = SSR + SSE
 $R^2 = \frac{SSR}{SST}$  coefficient of determination
 measures the proportion of variation in Y that is explained by the predictor
 Cor(X,Y) = $\rho$ = $\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$
 only measures linear relationship
 measure of strength and direction of the linear association between two variables
 better for simple linear regression, doesn’t work later
 $\rho^2$ = $R^2$
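The decomposition SST = SSR + SSE and the resulting $R^2$ can be verified numerically on made-up, nearly linear data:

```python
xs = [0, 1, 2, 3, 4]
ys = [1.0, 2.9, 5.2, 6.8, 9.1]   # roughly y = 1 + 2x plus noise

n = len(xs)
xbar, ybar = sum(xs) / n, sum(ys) / n
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
   / sum((x - xbar) ** 2 for x in xs)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * x for x in xs]

sst = sum((y - ybar) ** 2 for y in ys)             # total variation
ssr = sum((yh - ybar) ** 2 for yh in yhat)         # explained by predictor
sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhat))  # unexplained

print(abs(sst - (ssr + sse)) < 1e-9)  # True: decomposition holds
print(ssr / sst)                      # R^2, close to 1 for near-linear data
```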
inference for simple linear regression
 confidence interval construction
 confidence interval (CI) is range of values likely to include true value of a parameter of interest
 confidence level (CL)  probability that the procedure used to determine CI will provide an interval that covers the value of the parameter
 $\hat{\beta_0} \pm t_{n-2,\alpha /2} \cdot s.e.(\hat{\beta_0}) $
 for $\beta_1$
 with known $\sigma$
 $\frac{\hat{\beta_1}-\beta_1}{\sigma(\hat{\beta_1})} \sim N(0,1)$
 derive CI
 with unknown $\sigma$
 $\frac{\hat{\beta_1}-\beta_1}{s(\hat{\beta_1})} \sim t_{n-2}$
 derive CI
 hypothesis testing
 ttest
 $H_0:\beta_1=b $
 $t_1 = \frac{\hat{\beta_1}-b}{s.e.(\hat{\beta_1})}$, n-2 degrees of freedom
 ftest: $H_0:\beta_1=0 $
 F=MSR/MSE
 reject if F > $F_{1-\alpha;1,n-2}$
 the t-test and F-test are equivalent here ($t^2 = F$)
 two kinds of prediction
 the prediction of the value of the response variable Y which corresponds to any chosen value $x_0$ of the predictor variable
 the estimation of the mean response $\mu_0$ when X = $x_0$
assumptions
 There exists a linear relation between the response and predictor variable(s).
 otherwise predicted values will be biased
 The error terms have the constant variance, usually denoted as $\sigma^2$.
 otherwise prediction / confidence intervals for Y will be affected
 The error terms are independent, have mean 0.
 otherwise a predictor like time might have been omitted from the model
 Model fits all observations well (no outliers).
 otherwise misleading fit
 The errors follow a Normal distribution.
 otherwise usually ok
 assessing regression assumptions
 look at scatterplot
 look at residual plot
 should fall randomly near 0 with similar vertical variation, magnitudes
 QQ plot / normal probability plot
 standardized residuals vs. normal scores
 values should fall near line y = x, which represents normal distribution
 could do histogram of residuals
 look for normal curve  only works with a lot of data points
 lack of fit test  based on repeated Y values at same X values
variable transformations
 if assumptions don’t work, sometimes we can transform data so they work
 transform x  if residuals generally normal and have constant variance
 corrects nonlinearity
 transform y  if relationship generally linear, but nonconstant error variance
 stabilizes variance
 if both problems, try y first
 Box-Cox: $Y' = Y^\lambda$ if $\lambda \neq 0$, else $\log(Y)$
ch 3 multiple linear regression
 $Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + …+ \beta_p X_p + \epsilon$
 each coefficient is the contribution of $x_i$ after both the response and $x_i$ have been linearly adjusted for the other predictor variables
 least squares solved to estimate regression coefficients
 unbiased $\hat{\sigma}^2 = \frac{SSE}{n-p-1}$
matrix form
 write $Y = X\beta+\epsilon$
 each row is one $X_0,X_1,…,X_p$
 $\hat{\underline{\beta}} = (X'X)^{-1}X'Y$
 multicollinearity  sometimes no unique soln  parameter estimates have large variability
 $\hat{\sigma}^2 = \frac{SSE}{n-p-1}$ where there are p predictors
 hat matrix  $\hat{Y}=HY$
 H = $X(X'X)^{-1}X'$
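The matrix form can be sketched with NumPy on a toy design matrix (data invented; `solve` is used rather than forming the inverse explicitly, which is the usual numerically safer choice):

```python
import numpy as np

# Toy design matrix with an intercept column; response generated from
# beta = (1, 2, -1) so the estimate can be checked against the truth.
X = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 1, 1],
              [1, 2, 1]], dtype=float)
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true

# Normal equations beta_hat = (X'X)^{-1} X'Y, solved without an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hat matrix H = X (X'X)^{-1} X' maps observed y to fitted values
H = X @ np.linalg.solve(X.T @ X, X.T)

print(beta_hat)                          # recovers beta_true
print(np.allclose(H @ y, X @ beta_hat))  # True
```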
Ftests
 F statistic tests $H_0$: $\beta_1 = … = \beta_p = 0$
 reject when p ≤ .05
 $R^2$ only gets larger
 adjusted $R^2$  divide each sum of squares by its degrees of freedom
 partial f-test: $H_0$: $\beta_2 = \beta_3 = 0$ ~ any subset of betas is 0
 tries to eliminate these from the model
 (a t-test, by contrast, can only test 1 coefficient at a time)
 extra sum of squares
 first variables given go into model first
 basically partial ftest, but calculate f in different way
anova table

last column is $P(>|t|)$  test is whether the statistic = 0
 default F-value is for all coefficients = 0
extra sums of squares
 regression happens in order you specify
ch 4 multicollinearity
 multicollinearity  when predictors are highly correlated with each other
 roundoff errors
 X'X has determinant close to zero
 X’X elements differ substantially in magnitude
 correlation transformation  normalizes variables
 when linearly dependent, clearly can’t determine coefficients uniquely
 cannot interpret one set of regression coefficients as reflecting effects of the different predictors
 cannot extrapolate
 predicting is fine
 variance inflation factor (VIF)  measure how much the variances of the estimated regression coefficients are inflated as compared to when the predictors are not linearly related
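A VIF sketch, assuming the usual definition $VIF_j = 1/(1-R_j^2)$ where $R_j^2$ comes from regressing predictor j on the remaining predictors (synthetic data, helper name my own):

```python
import numpy as np

def vif(X, j):
    """VIF of column j: regress X[:, j] on the other columns (plus an
    intercept) and return 1 / (1 - R^2)."""
    n = X.shape[0]
    others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
    y = X[:, j]
    beta, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1 / (1 - r2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)               # unrelated predictor
X = np.column_stack([x1, x2, x3])
print(vif(X, 0) > 100)  # True: huge inflation, x1 ~ linear in x2
print(vif(X, 2) < 2)    # True: near 1, x3 is not collinear
```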
ch 5 categorical predictors
 quantitative variable  gets a number
 qualitative variable  ex. color
 A matrix is singular if and only if its determinant is zero
 bonferroni procedure  if we do 3 tests at an overall 5% significance level, we use 5/3% for each individual test in order to restrict the total to 5%
indicator variables with 2 classes
 ancova  at least one categorical and one quantitative predictor
 indicator variables take on the value 0 or 1
 dummy coding  matrix is singular so we drop the last indicator variable  called reference class / baseline class
 additive effects assume that each predictor's effect on the response does not depend on the value of the other predictor (as long as the other one was fixed)
 assume they have the same slope
 interaction effects allow the effect of one predictor on the response to depend on the values of other predictors.
 $y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i1}x_{i2} + \epsilon_i$
 We can use the levene.test() function, from the lawstat package. The null hypothesis for this test is that the variances are equal across all classes.
more than 2 classes
 $β_0$ is the mean response for the reference class when all the other predictors $X_1, X_2, \dots$ are zero.
 $β_1$ is the mean response for the first class of C minus the mean response for the reference class when $X_1, X_2, \dots$ are held constant.
 The F statistic reported with the summary() function for a linear model tests if $β_1 = \dots = β_7 = 0$.
other coding
 effect coding
 one vector is all 1s
 $\beta_0$ should be the weighted average of the class averages
 orthogonal coding (not on test)
ch 6 polynomial regression
 have to get all lower order terms
 beware of overfitting
 must center all the variables to reduce multicollinearity
 hierarchical approach  fit higher order model to check whether lower order model is adequate or not
 if a given order term is retained, all related terms of lower order must be retained
 otherwise it isn’t invariant to transformations of the columns
 interaction terms are similar to before
ch 7 model comparison and selection
 Ockham’s razor  principle of parsimony  given two theories that describe a phenomenon equally well, we should prefer the theory that is simpler
 several different criteria
 don’t penalize many predictors
 $R^2_p$  doesn't penalize
 penalize many predictors
 adjusted $R^2_p$  penalty
 Mallow’s $C_p$
 $AIC_p$
 $BIC_p$
 PRESS
ch 8 automated search procedures
ch 9 model building process overview

Probability
Properties
 Mutually Exclusive: $P(A \cap B)=0$
 Independent: $P(A \cap B) = P(A)P(B)$
 Conditional: $P(A|B) = \frac{P(A \cap B)}{P(B)}$
Measures
 $E[X] = \int P(x)x dx$
 $V[X] = E[(X-\mu)^2] = E[X^2]-E[X]^2$
 for unbiased estimate, divide by n1
 $Cov[X,Y] = E[(X-\mu_X)(Y-\mu_Y)] = E[XY]-E[X]E[Y]$
 $Cor(Y,X) = \rho = \frac{Cov(Y,X)}{s_xs_y}$
 $(Cor(Y,X))^2 = R^2$
 Cov is a measure of how much 2 variables change together
 linearity
 $Cov(aX+bY,Z) = aCov(X,Z)+bCov(Y,Z)$
 $V(a_1X_1+\dots+a_nX_n) = \sum_{i=1}^{n}\sum_{j=1}^{n}a_ia_jCov(X_i,X_j)$
 if $X_1,X_2$ independent, $V(X_1-X_2) = V(X_1) + V(X_2)$

change of variables: if X has pdf $f$ and $X=v(Y)$, then Y has pdf $g(y) = f(v(y)) \left|\frac{d}{dy}v(y)\right|$
bivariate: $g(y_1,y_2) = f(v_1,v_2)\,|\det(M)|$ where M in row-major order is $\frac{\partial v_1}{\partial y_1}, \frac{\partial v_1}{\partial y_2}, \frac{\partial v_2}{\partial y_1}, \frac{\partial v_2}{\partial y_2}$
$Corr(aX+b,cY+d) = Corr(X,Y)$ if a and c have the same sign
 $E[h(X)] \approx h(E[X])$
 $V[h(X)] \approx h’(E[X])^2 V[X]$
 skewness = $E[(\frac{X-\mu}{\sigma})^3]$
Momentgenerating function
 $M_X(t) = E(e^{tX})$
 $E(X^r) = M_X ^ {(r )} (0)$
 sometimes you can use $ln(M_x(t))$ to find $\mu$ and $V(X)$
 Y = aX+b $\to M_Y(t) = e^{bt}M_X(at)$
 Y = $a_1X_1+a_2X_2 \to M_Y(t) = M_{X_1}(a_1t)M_{X_2}(a_2t)$ if $X_i$ independent
 probability plot  straight line is better  plot ([100(i-.5)/n]th percentile, ith ordered observation)
 ordered statistics  variables $Y_i$ such that $Y_i$ is the ith smallest
 If X has pdf f(x) and cdf F(x), the largest order statistic has $G_n(y) = (F(y))^n$, $g_n(y) = n(F(y))^{n-1}f(y)$
 If joint, $g(y_1,…y_n) = n!f(y_1)…f(y_n)$
 $g(y_i) = \frac{n!}{(i-1)!(n-i)!}(F(y_i))^{i-1}(1-F(y_i))^{n-i}f(y_i)$
Distributions

Bernoulli: $f(x;p)= \begin{cases} p^x (1-p)^{1-x},& \text{if } x \in \{0,1\}\\
0, & \text{otherwise} \end{cases}$
Binomial: $f(x;n,p)= \begin{cases} {n \choose x} p^x (1-p)^{n-x},& \text{if } x = 0,1,\dots,n\\
0, & \text{otherwise} \end{cases}$
Exponential: $f(x;\lambda)= \begin{cases} \lambda e^{-\lambda x},& \text{if } x\geq 0\\
0, & \text{otherwise} \end{cases}$
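The three densities above can be evaluated with only the standard library (function names are my own):

```python
import math

def bernoulli_pmf(x, p):
    """p^x (1-p)^(1-x) for x in {0, 1}."""
    return p ** x * (1 - p) ** (1 - x)

def binomial_pmf(x, n, p):
    """C(n, x) p^x (1-p)^(n-x) for x = 0..n."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

def exponential_pdf(x, lam):
    """lam * exp(-lam * x) for x >= 0, else 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

print(bernoulli_pmf(1, 0.3))    # 0.3
print(binomial_pmf(2, 4, 0.5))  # 6 * 0.25 * 0.25 = 0.375
print(exponential_pdf(0, 2))    # 2.0
```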

Statistics
 Statistics and Sampling Distributions
 Point Estimation
 statistical intervals
 Tests on Hypotheses
 Inferences Based on 2 Samples
 ANOVA
Statistics and Sampling Distributions
 can calculate expected values of sample mean and sample $\sigma$ 2 ways: prob. rules and simulation (for simulation fix n and repeat k times)
 CLT  sample means are approximately normally distributed if n is large
 CLT: $\lim_{n\to\infty}P(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\leq z)=P(Z\leq z) = \Phi(z)$
 CLT $\to Y = X_1 \cdots X_n$ has approximately a lognormal distribution if all $X_i>0$
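A small simulation of the CLT for uniform draws (seeded for reproducibility; sample sizes are arbitrary): the mean of the sample means should sit near 0.5 and their variance near $(1/12)/n$.

```python
import random
import statistics

random.seed(0)

# Means of n uniform(0,1) draws: E = 0.5, Var = (1/12)/n by the CLT setup.
n, reps = 30, 2000
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(reps)]

print(statistics.fmean(means))         # close to 0.5
print(statistics.variance(means) * n)  # close to 1/12 = 0.0833
```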
Law of Large Numbers
 $E[(\bar{X}-\mu)^2] \to 0$ as $n \to \infty$
 $P(|\bar{X}-\mu| \geq \epsilon) \to 0$ as $n \to \infty$
 $T_o = X_1+…+X_n$, $E(T_o) = n\mu$, $V(T_o) = n\sigma^2$
 $E(\bar{X}) = \mu$
 $V(\bar{X}) = \frac{\sigma_x^2}{n}$
 chisquared  finding the distribution for sums of squares of normal variables.
 if $Z_1,…, Z_n$ are i.i.d. standard normal, then $Z_1^2+…+Z_n^2 \sim \chi_n^2$
 $(n-1)S^2/\sigma^2 \sim \chi_{n-1}^2$
 t  to use the sample standard deviation to measure precision for the mean X, we combine the square root of a chisquared variable with a normal variable
 f  compare two independent sample variances in terms of the ratio of two independent chisquared variables.
Point Estimation
 point estimate  single number prediction
 point estimator  statistic that predicts a parameter
 MSE  mean squared error  $E[(\hat{\theta}-\theta)^2]$ = $V(\hat{\theta})+[E(\hat{\theta})-\theta]^2$
 bias = $E(\hat{\theta})-\theta$
 after unbiased we want MVUE (minimum variance unbiased estimator)
 Estimators: $ \tilde{X} $ = Median, $X_e$ = Midrange((max+min)/2), $X_{tr(10)}=$ 10 percent trimmed mean (discard smallest and largest 10 percent)
 standard error: $\sigma_{\hat{\theta}} = s_{\hat{\theta}} = \sqrt{Var(\hat{\theta})}$  determines consistency
 bootstrap  computationally estimate the standard error
 $S^2$ (unbiased) $= \sum{\frac{(X_i-\bar{X})^2}{n-1}}$
 $\hat{\sigma^2}$ (MLE) $= \sum{\frac{(X_i-\mu)^2}{n}}$
 Can calculate estimators for a distr. by calculating moments
 A statistic T = t(X1, . . ., Xn) is said to be sufficient for making inferences about a parameter $\theta$ if the joint distribution of X1, X2, . . ., Xn given that T = t does not depend upon $\theta$ for every possible value t of the statistic T.
 Neyman Factorization Thm  $t(X_1,…,X_n)$ is sufficient $\leftrightarrow f = g(t,\theta)*h(x_1,…,x_n)$

 Rao-Blackwell: when estimating $h(\theta)$, if U is unbiased and T is sufficient for $\theta$, then use $U^* = E(U|T)$
 Fisher Information $I(\theta)=V[\frac{\partial}{\partial\theta}\ln f(X;\theta)]$ (for n samples, multiply by n)
 If T is an unbiased estimator for $\theta$ then $V(T) \geq \frac{1}{nI(\theta)}$ (Cramér-Rao lower bound)
 Efficiency of T is the ratio of the lower bound to the variance of T
 hypergeometric  number of successes in n draws (without replacement) from a population with m successes and N-m failures
 negative binomial  fix number of successes, X = number of trials before rth success
 normal  standardized: $\frac{X-\mu}{\sigma}$ (mean 0 and std. dev. = 1)
 gamma: $\Gamma(a) = \int_{0}^{\infty} x^{a-1}e^{-x}dx$, $\Gamma(1/2) = \sqrt{\pi}$
MLE
 MLE  maximize $f(x_1,…,x_n;\theta_1,…\theta_m)$  agreement with chosen distribution  often take ln(f) and then take derivative $\approx$ MVUE, but can be biased
 $\hat{\theta} = $argmax $ L(\theta)$

Likelihood $L(\theta)=P(X_1…X_n|\theta)=\prod_{i=1}^n P(X_i|\theta)$
$\log L(\theta)=\sum_i \log P(X_i|\theta)$  to maximize, set $\frac{\partial \log L(\theta)}{\partial \theta} = 0$


Use $\hat{\theta} = $argmax $ P(\text{Train}|\text{Model}(\theta))$
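A Bernoulli MLE sketch (data invented): setting the derivative of the log-likelihood to zero gives the closed form $\hat{p}$ = sample mean, which a brute-force grid search reproduces.

```python
import math

def bernoulli_loglik(p, xs):
    """log L(p) = sum_i [x_i log p + (1 - x_i) log(1 - p)]."""
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in xs)

xs = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # 7 successes in 10 trials

# Closed form from dLL/dp = 0.
p_hat = sum(xs) / len(xs)

# Grid search over (0, 1) agrees with the closed form.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_loglik(p, xs))
print(p_hat, p_grid)  # 0.7 0.7
```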
statistical intervals
 interval estimates come with confidence levels
 $Z=\frac{\bar{X}-\mu}{\sigma / \sqrt{n}}$
 For p not close to 0.5, use Wilson score confidence interval (has extra terms)
 confidence interval  If multiple samples of trained typists were selected and an interval constructed for each sample mean, 95 percent of these intervals contain the true preferred keyboard height
Tests on Hypotheses
 Var($\bar{X}\bar{Y})=\frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n}$
 tail refers to the side we reject (e.g. upper-tailed = $H_a:\theta>\theta_0$)
 $\alpha$  type 1  reject $H_0$ but $H_0$ true
 $\beta$  type 2  fail to reject $H_0$ but $H_0$ false
 we try to make the null hypothesis a statement of equality
 uppertailed  reject large values
 $\alpha$ is computed using the probability distribution of the test statistic when $H_0$ is true, whereas determination of $\beta$ requires knowing the test statistic distribution when $H_0$ is false
 type 1 error usually more serious, pick $\alpha$ level, then constrain $\beta$
 can standardize values and test these instead
 Pvalue is the probability, calculated assuming that the null hypothesis is true, of obtaining a value of the test statistic at least as contradictory to $H_0$ as the value calculated from the available sample. (observed significance level)
 reject $H_0$ if p $\leq \alpha$
Inferences Based on 2 Samples
 $\sigma_{\bar{X}\bar{Y}} = \sqrt{\frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n}}$
 there are formulas for type 1,2 errors
 If both normal, $Z = \frac{\bar{X}-\bar{Y}-(\mu_1-\mu_2)}{\sqrt{\frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n}}}$
 If both have same variance, do a weighted average (pooled) $S_p^2 = \frac{m-1}{m+n-2}S_1^2+\frac{n-1}{m+n-2}S_2^2$
 If we have a large sample size, these expressions are basically true, we just use the sample standard deviation
 randomized controlled experiment  investigators assign subjects to the two treatments in a random fashion
 small sample sizes  twosample t test
 $T = \frac{\bar{X}-\bar{Y}-(\mu_1-\mu_2)}{\sqrt{\frac{S_1^2}{m}+\frac{S_2^2}{n}}}$
 $\nu= \frac{(se_1^2 + se_2^2)^2}{\frac{se_1^4}{m-1}+\frac{se_2^4}{n-1}}$ (round down)
 $se_1 = \frac{s_1}{\sqrt{m}}$
 $se_2 = \frac{s_2}{\sqrt{n}}$
 two-sample t confidence interval for $\mu_1-\mu_2$ with confidence 100(1-$\alpha$) percent:
 $\bar{x}\bar{y} \pm t_{\alpha/2,v} \sqrt{\frac{s_1^2}{m}+\frac{s_2^2}{n}}$
 very hard to calculate type II errors here
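The two-sample t statistic and the degrees-of-freedom formula above can be computed directly from summary statistics (the numbers below are hypothetical):

```python
import math

def welch_df(s1, s2, m, n):
    """Degrees of freedom nu = (se1^2 + se2^2)^2 / (se1^4/(m-1) + se2^4/(n-1));
    round down when looking up t tables."""
    se1_sq = s1 ** 2 / m
    se2_sq = s2 ** 2 / n
    return (se1_sq + se2_sq) ** 2 / (se1_sq ** 2 / (m - 1) + se2_sq ** 2 / (n - 1))

def two_sample_t(xbar, ybar, s1, s2, m, n):
    """T = (xbar - ybar) / sqrt(s1^2/m + s2^2/n) under H0: mu1 = mu2."""
    return (xbar - ybar) / math.sqrt(s1 ** 2 / m + s2 ** 2 / n)

t = two_sample_t(xbar=10.0, ybar=9.0, s1=2.0, s2=3.0, m=20, n=25)
nu = welch_df(2.0, 3.0, 20, 25)
print(round(t, 3), math.floor(nu))  # 1.336 41
```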
 paired data  not independent
 we do a onesample t test on the differences
 do pairing when large correlation within experimental units
 do independentsamples when correlation within pairs is not large
 proportions when m and n both large:
 $Z=\frac{\hat{p_1}-\hat{p_2}}{\sqrt{\hat{p}\hat{q}(\frac{1}{m}+\frac{1}{n})}}$ where $\hat{p}=\frac{m}{m+n}\hat{p_1}+\frac{n}{m+n}\hat{p_2}$, $\hat{q}=1-\hat{p}$
 bootstrap  computationally compute by taking samples, can use percentile intervals (sort and then pick nth from bottom/top)
 permutation tests  permute the labels on the data  pvalue is the fraction of arrangements that are at least as extreme as the value computed for the original data
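A bootstrap percentile-interval sketch for the sample mean (the data and seed are arbitrary):

```python
import random
import statistics

random.seed(1)
data = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4]

# Resample with replacement, sort the resampled means, then read off the
# 2.5th and 97.5th percentiles for a 95% percentile interval.
boot_means = sorted(
    statistics.fmean(random.choices(data, k=len(data))) for _ in range(1000)
)
lo, hi = boot_means[25], boot_means[974]
print(lo < statistics.fmean(data) < hi)  # True
```

A permutation test would instead shuffle the group labels and report the fraction of arrangements at least as extreme as the observed statistic.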
 for testing if two variances are equal, use $F_{\alpha,m1,n1}$
ANOVA
 ANOVA  analysis of variance
Regression and Correlations
 y  called dependent, response variable
 x  independent, explanatory, predictor variable

notation: $E(Y|x^*) = \mu_{Y\cdot x^*} = $ mean value of Y when x = $x^*$
 Y = f(x) + $\epsilon$
 linear: $Y=\beta_0+\beta_1 x+\epsilon$
 logistic: $odds = \frac{p(x)}{1-p(x)}=e^{\beta_0+\beta_1 x+\epsilon}$
 we minimize least squares: $SSE = \sum_{i=1}^n (y_i-(b_0+b_1x_i))^2$
 $b_1=\hat{\beta_1}=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2} = \frac{S_{xy}}{S_{xx}}$
 $b_0=\bar{y}-\hat{\beta_1}\bar{x}$
 $S_{xy}=\sum x_iy_i-\frac{(\sum x_i)(\sum y_i)}{n}$
 $S_{xx}=\sum x_i^2 - \frac{(\sum x_i)^2}{n}$
 residuals: $y_i-\hat{y_i}$
 SSE = $\sum y_i^2 - \hat{\beta}_0 \sum y_i - \hat{\beta}_1 \sum x_iy_i$
 SST = total sum of squares = $S_{yy} = \sum (y_i-\bar{y})^2 = \sum y_i^2 - (\sum y_i)^2/n$
 $r^2 = 1-\frac{SSE}{SST}=\frac{SSR}{SST}$  proportion of observed variation that can be explained by regression
 $\hat{\sigma}^2 = \frac{SSE}{n-2}$
 $T=\frac{\hat{\beta}_1-\beta_1}{S / \sqrt{S_{xx}}}$ has a t distr. with n-2 df
 $s_{\hat{\beta_1}}=\frac{s}{\sqrt{S_{xx}}}$
 $s_{\hat{\beta_0}+\hat{\beta_1}x^*} = s\sqrt{\frac{1}{n}+\frac{(x^*-\bar{x})^2}{S_{xx}}}$
 sample correlation coefficient $r = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}}$
 this is a point estimate for population correlation coefficient = $\frac{Cov(X,Y)}{\sigma_X\sigma_Y}$
 apply the Fisher transformation to get a test statistic for the correlation
 degrees of freedom
 one-sample T = n-1
 T procedures with paired data  n-1
 T procedures for 2 independent populations  use formula, ~= smaller of $n_1-1$ and $n_2-1$
 variance  n-2
 use z-test if you know the standard deviation