## Notes

### ai

• Artificial Intelligence

[toc]

# symbol search

• computer science - empirical inquiry

## symbols and physical symbol systems

• intelligence requires the ability to store and manipulate symbols
• laws of qualitative structure
• cell doctrine in biology
• plate tectonics in geology
• germ theory of disease
• doctrine of atomism
• “physical”
1. obey laws of physics
2. not restricted to human systems
• designation - an expression designates an object if, given the expression, the system can affect the object
• interpretation - the system can interpret an expression if the expression designates a process and the system can carry out that process
• physical symbol system hypothesis - a physical symbol system has the necessary and sufficient means for general intelligent action
• from { cite newell1980physical }
• identify a task domain calling for intelligence; then construct a program for a digital computer that can handle tasks in that domain
• no boundaries have come up yet
• wanted general problem solver - leads to generalized schemes of representation
• goes along with information processing psychology
• observe human actions requiring intelligence
• program systems to model human actions

## heuristic searching

• symbol systems solve problems with heuristic search
• Heuristic Search Hypothesis - solutions are represented as symbol structures. A physical symbol system exercises its intelligence in problem solving by search–that is, by generating and progressively modifying symbol structures until it produces a solution structure
• from { cite newell1976computer }
• there are practical limitations on how fast computers can search
• To state a problem is to designate
1. a test for a class of symbol structures (solutions of the problem)
2. a generator of symbol structures (potential solutions).
• To solve a problem is to generate a structure, using (2), that satisfies the test of (1).
• searching is generally in a tree-form

# intro

• AI - field of study concerned with creating intelligence
• intelligent agent - system that perceives its environment and takes actions that maximize its chances of success
• expert task examples - medical diagnosis, equipment repair, computer configuration, financial planning
1. formal systems - use axioms and formal logic
2. ontologies - structuring knowledge in graph form
3. statistical methods
• turing test - can a machine behave indistinguishably from a human { cite turing1950computing }
• chinese room argument - rebuts turing test { cite searle1980minds }
• china brain - what if different people hit buttons to fire individual neurons

# knowledge representation

• physical symbol system hypothesis - a physical symbol system has the necessary and sufficient means for general intelligent action
• computers and minds are both physical symbol systems
• symbol - meaningful pattern that can be manipulated
• symbol system - creates, modifies, destroys symbols
• want to represent
1. meta-knowledge - knowledge about what we know
2. objects - facts
3. performance - knowledge about how to do things
4. events - actions
• two levels
1. knowledge level - where facts are described
2. symbol level - lower
• properties
1. representational adequacy - ability to represent the required knowledge
2. inferential adequacy - ability to derive new knowledge from what is represented
3. inferential efficiency
4. acquisitional efficiency - acquire new information
• two views of knowledge
1. logic
• a logic is a language with concrete rules
• syntax - rules for constructing legal logic
• semantics - how we interpret / read
• assigns a meaning
• multi-valued logic - not just booleans
• higher-order logic - functions / predicates are also objects
• multi-valued logics - more than 2 truth values
• fuzzy logic - uses probabilities rather than booleans
• match-resolve-act cycle
2. associationist
• knowledge based on observation
• semantic networks - objects and relationships between them - like is a, can, has
• graphical representation
• equivalent to logical statements
• ex. nlp - conceptual dependency theory - sentences with same meaning have same graphs
• frame representations - semantic networks where nodes have structure
• ex. each frame has age, height, weight, …
• when agent faces new situation - slots can be filled in, may trigger actions / retrieval of other frames
• inheritance of properties between frames
• frames can contain relationships and procedures to carry out after various slots filled

# expert systems

• expert system - program that contains some of the subject-specific knowledge of one or more human experts.
• problems
1. planning
2. monitoring
3. instruction
4. control
• need lots of knowledge to be intelligent
• rule-based architecture - condition-action rules & database of facts
• acquire new facts
• from human operator
• interacting with environment directly
• forward chaining
• until special HALT symbol appears in DB, keep applying rules whose conditions match facts in the DB, adding each result to the DB
• conflict resolution - which rule to apply when many choices available
• pattern matching - logic in the if statements
• backward chaining - check if something is true
• check the database
• otherwise, check if it is the consequent (right side) of any rule, and recursively check that rule’s conditions
• CLIPS - expert system shell
• define rules and functions…
• explanation subsystem - provide explanation of reasoning that led to conclusion
• people
1. knowledge engineer - computer scientist who designs / implements ai
2. domain expert - has domain knowledge
• user interface
• knowledge engineering - art of designing and building expert systems - determine characteristics of the problem
• automatic knowledge acquisition - set of techniques for gaining new knowledge
• ex. parse Wikipedia
• crowdsourcing
• creating an expert system can be very hard
• only useful when expert isn’t available, problem uses symbolic reasoning, problem is well-structured
• MYCIN - one of first successful expert systems { cite shortliffe2012computer }
• Stanford in 1970s
• used backward chaining but would ask patient questions - sometimes too many questions
• can explain reasoning
• can free up human experts to deal with rare problems
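
The forward-chaining loop described above (match rules against the fact database, add conclusions until nothing new fires) can be sketched as follows; the animal rules are hypothetical examples.

```python
# condition-action rules: (set of conditions, conclusion)
rules = [
    ({"has_fur", "says_woof"}, "is_dog"),
    ({"is_dog"}, "is_mammal"),
    ({"is_mammal"}, "is_animal"),
]

def forward_chain(facts):
    facts = set(facts)
    changed = True
    while changed:                       # loop until a fixed point
        changed = False
        for conditions, conclusion in rules:
            if conditions <= facts and conclusion not in facts:
                facts.add(conclusion)    # fire rule, add result to DB
                changed = True
    return facts

print(forward_chain({"has_fur", "says_woof"}))
```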

# decisions

## game trees – R&N 5.2-5.5

• minimax algorithm
• ply - half a move in a tree
• for multiplayer, the backed-up value of a node n is the vector of the successor state with the highest value for the player choosing at n
• time complexity - $O(b^m)$
• space complexity - O(bm) or even O(m)
• alpha-beta pruning - with good move ordering reduces time complexity to $O(b^{m/2})$, effectively doubling the searchable depth
• once we have found out enough about n, we can prune it
• depends on move-ordering
• might want to explore best moves = killer moves first
• transposition table can hash different movesets that are just transpositions of each other
• imperfect real-time decisions
• can evaluate nodes with a heuristic and cutoff before reaching goal
• heuristic uses features
• want quiescent search - consider if something dramatic will happen in the next ply
• horizon effect - a position is bad but isn’t apparent for a few moves
• singular extension - allow searching for certain specific moves that are always good at deeper depths
• forward pruning - ignore some branches
• beam search - consider only n best moves
• PROBCUT prunes some more
• search vs lookup
• often just use lookup in the beginning
• program can solve and just lookup endgames
• stochastic games
• include chance nodes
• change minimax to expectiminimax
• $O(b^m numRolls^m)$
• cutoff evaluation function is sensitive to scaling - evaluation function must be a positive linear transformation of the probability of winning from a position
• can do alpha-beta pruning analog if we assume evaluation function is bounded in some range
• alternatively, could simulate games with Monte Carlo simulation
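
Minimax with alpha-beta pruning, as described above, sketched on a small hand-built game tree (leaves hold MAX’s utilities; internal nodes are lists of children):

```python
import math

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf):
    if isinstance(node, (int, float)):           # leaf: return its utility
        return node
    if maximizing:
        value = -math.inf
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:                    # beta cutoff: prune siblings
                break
        return value
    else:
        value = math.inf
        for child in node:
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:                    # alpha cutoff
                break
        return value

# root is MAX choosing over three MIN nodes
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, True))  # → 3
```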

## utilities / decision theory – R&N 16.1-16.3

• $P(RESULT(a)=s' \mid a,e)$
• s - state, observations e, action a
• utility function U(s)
• rational agent should choose action with maximum expected utility
• expected utility $EU(a \mid e) = \sum_{s'} P(RESULT(a)=s' \mid a,e) \, U(s')$
• notation
• A>B - agent prefers A over B
• A~B - agent is indifferent between A and B
• preference relation has 6 axioms of utility theory
1. orderability - A>B, A~B, or A<B
2. transitivity
3. continuity
4. substitutability - can do algebra with preference eqns
5. monotonicity - if A>B then must prefer higher probability of A than B
6. decomposability - 2 consecutive lotteries can be compressed into single equivalent lottery
• these axioms yield a utility function
• isn’t unique (ex. affine transformation yields new utility function)
• sometimes ranking, not numbers needed - value function = ordinal utility function
• agent might not be explicitly maximizing the utility function

### utility functions

• preference elicitation - finds utility function
• normalized utility to have min and max value
• assess utility of s by asking agent to choose between s and (p:min, (1-p):max)
• micromort - one in a million chance of death
• QALY - quality-adjusted life year
• money
• agents exhibit monotonic preference for more money
• gambling has expected monetary value = EMV
• when utility of money is concave (sublinear) - risk-averse
• value agent will accept in lieu of lottery = certainty equivalent
• EMV - certainty equivalent = insurance premium
• when convex (supralinear) - risk-seeking; when linear - risk-neutral
• optimizer’s curse - tendency for E[utility] to be too high
• descriptive theory - how actual agents work
• decision theory - normative theory
• certainty effect - people are drawn to things that are certain
• ambiguity aversion
• framing effect - wording can influence people’s judgements
• evolutionary psychology
• anchoring effect - buy middle-tier wine because expensive is there

## decision theory / VPI – R&N 16.5 & 16.6

• decision network
1. chance nodes - represent RVS (like BN)
2. decision nodes - points where decision maker has a choice of actions
3. utility nodes - represent agent’s utility function
• can ignore chance nodes
• then action-utility function = Q-function maps directly from actions to utility
• evaluation
1. set evidence
2. for each possible value of decision node
• set decision node to that value
• calculate probabilities of parents of utility node
• calculate resulting utility
3. return the action with highest expected utility
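
The evaluation steps above can be sketched for a decision network with one chance node, one decision node, and one utility node; the umbrella/weather numbers are hypothetical.

```python
P_weather = {"rain": 0.3, "sun": 0.7}             # chance node
U = {("umbrella", "rain"): 5, ("umbrella", "sun"): 2,
     ("none", "rain"): -10, ("none", "sun"): 10}   # utility node

def evaluate(decision):
    # set the decision node, then take the expectation over the chance node
    return sum(P_weather[w] * U[(decision, w)] for w in P_weather)

best = max(["umbrella", "none"], key=evaluate)     # highest expected utility
print(best, evaluate(best))
```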

### the value of information

• information value theory - enables agent to choose what info to acquire
• observations only affect the agent’s belief state
• value of info = expected value between best actions before and after info is obtained
• value of perfect information VPI - assume we can obtain exact evidence on some variable $e_j$
• $VPI_e(E_j) = \left(\sum_k P(E_j = e_{jk} \mid e) \, EU(\alpha_{e_{jk}} \mid e, E_j = e_{jk})\right) - EU(\alpha \mid e)$
• info is more valuable when it is likely to cause a change of plan
• info is more valuable when the new plan will be much better than the old plan
• VPI not linearly additive, but is order-independent
• information-gathering agent
• myopic - greedily obtain evidence which yields highest VPI until some threshold
• conditional plan - considers more things

## mdps and rl – R&N 17.1-17.4, 21.1-21.6

• fully observable - agent knows its state
• markov decision process
• set of states
• set of actions
• transition model $P(s' \mid s,a)$
• reward function R(s)
• solution is policy $\pi^* (s)$ - what action to do in state s
• optimal policy yields highest expected utility
• optimizing MDP - multiattribute utility theory
• could sum rewards, but results are infinite
• instead define objective function (maps infinite sequences of rewards to single real numbers)
• ex. set a finite horizon and sum rewards
• optimal action in a given state could change over time = nonstationary
• ex. discounting to prefer earlier rewards (most common)
• could discount reward n steps away by $\gamma^n$, $0<\gamma<1$
• ex. average reward rate per time step
• ex. agent is guaranteed to get to terminal state eventually - proper policy
• expected utility executing $\pi$: $U^\pi (s) = E[\sum_t \gamma^t R(S_t)]$
• when we use discounted utilities, $\pi$ is independent of starting state
• $\pi^* = \underset{\pi}{\operatorname{argmax}} \, U^\pi (s)$, i.e. $\pi^*(s) = \underset{a}{\operatorname{argmax}} \sum_{s'} P(s' \mid s,a) \, U(s')$ - utility of a state is its immediate reward plus the expected discounted utility of the next state, assuming the agent chooses the optimal action

### value iteration

• value iteration - calculates utility of each state and uses utilities to find optimal policy
• bellman eqn - $U(s) = R(s) + \gamma \, \underset{a}{\max} \sum_{s'} P(s' \mid s, a) \, U(s')$
• recalculate several times with Bellman update to approximate solns to bellman eqn: $U_{i+1}(s) = R(s) + \gamma \, \underset{a}{\max} \sum_{s'} P(s' \mid s, a) \, U_i(s')$
• value iteration eventually converges
• contraction - function of one variable that when applied to 2 different inputs in turn produces 2 output values that are closer together than the original inputs
• contraction only has 1 fixed point
• Bellman update is a contraction on the space of utility vectors and therefore converges
• error is reduced by factor of $\gamma$ each iteration
• also, terminating condition - if $\|U_{i+1}-U_i\| < \epsilon (1-\gamma) / \gamma$ then $\|U_{i+1}-U\| <\epsilon$
• what actually matters is policy loss $\|U^{\pi_i}-U\|$ - the most the agent can lose by executing $\pi_i$ instead of the optimal policy $\pi^*$
• if $\|U_i -U\| < \epsilon$ then $\|U^{\pi_i} - U\| < 2\epsilon \gamma / (1-\gamma)$
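
A minimal value-iteration sketch using the Bellman update and the terminating condition above; the 2-state MDP (states, actions, transitions, rewards) is a hypothetical example.

```python
GAMMA = 0.9
S = ["a", "b"]
A = ["go", "stay"]
P = {  # P(s' | s, a)
    ("a", "go"):   {"b": 1.0},
    ("a", "stay"): {"a": 1.0},
    ("b", "go"):   {"a": 1.0},
    ("b", "stay"): {"b": 1.0},
}
R = {"a": 0.0, "b": 1.0}

def value_iteration(eps=1e-6):
    U = {s: 0.0 for s in S}
    while True:
        # Bellman update: immediate reward + discounted best expected next value
        U_new = {s: R[s] + GAMMA * max(sum(p * U[s2]
                                           for s2, p in P[(s, a)].items())
                                       for a in A)
                 for s in S}
        # terminating condition from the notes above
        if max(abs(U_new[s] - U[s]) for s in S) < eps * (1 - GAMMA) / GAMMA:
            return U_new
        U = U_new

U = value_iteration()
print(U)   # staying in b forever is worth 1/(1-0.9) = 10
```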

### policy iteration

• another way to find optimal policies
1. policy evaluation - given a policy $\pi_i$, calculate $U_i=U^{\pi_i}$, the utility of each state if $\pi_i$ were to be executed
• like value iteration, but with a set policy so there’s no max
• can solve exactly for small spaces, or approximate
2. policy improvement - calculate a new MEU policy $\pi_{i+1}$ using one-step look-ahead based on $U_i$
• same as above, just $\pi_{i+1}(s) = \underset{a}{\operatorname{argmax}} \sum_{s'} P(s' \mid s,a) \, U_i(s')$
• asynchronous policy iteration - don’t have to update all states at once
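
Policy iteration on the same style of tiny MDP: iterative policy evaluation (no max, policy fixed) alternating with greedy one-step-look-ahead improvement. All numbers are hypothetical.

```python
GAMMA = 0.9
S = ["a", "b"]
A = ["go", "stay"]
P = {("a", "go"): {"b": 1.0}, ("a", "stay"): {"a": 1.0},
     ("b", "go"): {"a": 1.0}, ("b", "stay"): {"b": 1.0}}
R = {"a": 0.0, "b": 1.0}

def evaluate(pi, iters=200):
    # policy evaluation: fixed-policy Bellman update (no max over actions)
    U = {s: 0.0 for s in S}
    for _ in range(iters):
        U = {s: R[s] + GAMMA * sum(p * U[s2]
                                   for s2, p in P[(s, pi[s])].items())
             for s in S}
    return U

def policy_iteration():
    pi = {s: "stay" for s in S}
    while True:
        U = evaluate(pi)
        # policy improvement: one-step look-ahead on U
        new_pi = {s: max(A, key=lambda a: sum(p * U[s2]
                                              for s2, p in P[(s, a)].items()))
                  for s in S}
        if new_pi == pi:                 # stable policy: done
            return pi, U
        pi = new_pi

pi, U = policy_iteration()
print(pi)
```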

### partially observable markov decision processes (POMDP)

• agent is not sure what state it’s in
• same elements but add sensor model $P(e \mid s)$
• have prob. distr b(s) for belief states
• updates like the HMM
• $b'(s') = \alpha P(e \mid s') \sum_s P(s' \mid s, a) \, b(s)$
• changes based on observations
• optimal action depends only on the agent’s current belief state - use belief states as the states of an MDP and solve as before
• changes because state space is now continuous
• value iteration
1. expected utility of executing plan p in belief state b is just $b \cdot \alpha_p$ - dot product
2. $U(b) = U^{\pi^*}(b)=\underset{p}{max} : b \cdot \alpha_p$
• belief space is continuous [0,1] so we represent it as piecewise linear, and store these discrete lines in memory
• do this by iterating and keeping any values that are optimal at some point
• remove dominated plans - generally this is far too inefficient
• dynamic decision network - online agent
• still don’t really understand this

## reinforcement learning

• reinforcement learning - use observed rewards to learn optimal policy for the environment
• 3 agent designs
1. utility-based agent - learns utility function on states
• requires model of the environment
2. Q-learning agent
• learns action-utility function = Q-function maps directly from actions to utility
3. reflex agent - learns policy that maps directly from states to actions

### passive reinforcement learning

• given policy $\pi$, learn $U^\pi (s)$
• like policy evaluation, but transition model / reward function are unknown
• direct utility estimation - run a bunch of trials to sample utility = expected total reward from each state
• adaptive dynamic programming (ADP) - learn transition model and rewards, and then plug into Bellman eqn
• prioritized sweeping - prefers to make adjustments to states whose likely successors have just undergone a large adjustment in their own utility estimates
• two ways to add prior
1. Bayesian reinforcement learning - assume a prior P(h) on the transition model
• use prior to calculate $P(h \mid e)$
• let $u_h^\pi$ be expected utility averaged over all possible start states, obtained by executing policy $\pi$ in model h
• $\pi^* = \underset{\pi}{\operatorname{argmax}} \sum_h P(h \mid e) \, u_h^\pi$
2. give best outcome in the worst case over H (from robust control theory)
• $\pi^* = \underset{\pi}{argmax} \underset{h}{min} u_h^\pi$
• temporal-difference learning - adjust utility estimates towards the ideal equilibrium that holds locally when the utility estimates are correct
• $U^\pi (s) \leftarrow U^\pi (s) + \alpha (R(s) + \gamma U^\pi (s') - U^\pi (s))$
• like a crude approximation of ADP
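
The TD update above, run on a hypothetical 5-state chain under a fixed "move right" policy with reward only at the terminal state; the learned utilities should approach $\gamma^{4-s}$.

```python
GAMMA, ALPHA = 0.9, 0.1
U = {s: 0.0 for s in range(5)}     # states 0..4; 4 is terminal
R = {s: 0.0 for s in range(5)}
R[4] = 1.0

for _ in range(2000):              # trials under the fixed policy
    s = 0
    while True:
        if s == 4:                 # terminal: nudge U(s) toward R(s)
            U[s] += ALPHA * (R[s] - U[s])
            break
        s2 = s + 1                 # policy always moves right
        # TD update: nudge U(s) toward R(s) + γ U(s')
        U[s] += ALPHA * (R[s] + GAMMA * U[s2] - U[s])
        s = s2

print(U)   # U[s] ≈ 0.9 ** (4 - s)
```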

### active reinforcement learning

• explore states to find their utilities and exploit model to get highest reward
• bandit problems - determining exploration policy
• should be GLIE - greedy in the limit of infinite exploration - visits all states infinitely, but eventually become greedy
• ex. choose random action 1/t of the time
• better ex. give optimistic prior utility to unexplored states
• uses exploration function f(u,numTimesVisited) in utility update rule
• n-armed bandit - pulling n levers on a slot machine, each with a different distr.
• Gittins index - function of number of pulls / payoff

### learning action-utility function

• U(s) = $\underset{a}{max} Q(s,a)$
• does require $P(s' \mid s,a)$ if we use ADP
• doesn’t require knowing $P(s' \mid s,a)$ if we use TD: $Q(s,a) \leftarrow Q(s,a) + \alpha (R(s) + \gamma \, \underset{a'}{\max} \, Q(s', a') - Q(s,a))$
• SARSA is related: $Q(s,a) \leftarrow Q(s,a) + \alpha (R(s) + \gamma Q(s', a') - Q(s,a))$
• here, a’ is action actually taken
• SARSA is on-policy while Q-learning is off-policy
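
A Q-learning sketch (off-policy TD control with ε-greedy exploration) on a hypothetical 1-D corridor: states 0..4, actions left/right, reward 1 for reaching state 4. The learned greedy policy should move right everywhere.

```python
import random

random.seed(1)
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.2
N = 5
Q = {(s, a): 0.0 for s in range(N) for a in ("L", "R")}

def step(s, a):
    # deterministic corridor dynamics with walls at both ends
    s2 = max(0, s - 1) if a == "L" else min(N - 1, s + 1)
    return s2, (1.0 if s2 == N - 1 else 0.0)

for _ in range(500):
    s = 0
    while s != N - 1:
        if random.random() < EPS:
            a = random.choice(("L", "R"))                       # explore
        else:
            a = max(("L", "R"), key=lambda x: Q[(s, x)])        # exploit
        s2, r = step(s, a)
        best_next = max(Q[(s2, "L")], Q[(s2, "R")])
        # off-policy TD update toward r + γ max_a' Q(s', a')
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

policy = {s: max(("L", "R"), key=lambda a: Q[(s, a)]) for s in range(N - 1)}
print(policy)
```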

### generalization

• approximate Q-function
• ex. linear function of parameters
• can learn params online with delta rule = Widrow-Hoff rule: $\theta_i \leftarrow \theta_i - \alpha \, \frac{\partial Loss}{\partial \theta_i}$
• keep twiddling the policy as long as it improves, then stop
• store one Q-function (parameterized by $\theta$) for each action
• $\pi(s) = \underset{a}{\operatorname{argmax}} \, \hat{Q}_\theta (s,a)$
• this is discontinuous, so instead often use a stochastic policy representation (ex. softmax for $\pi_\theta (s,a)$)
• learns $\theta$ that results in good performance
• Q-learning learns actual Q* function - could be different (scaling factor etc.)
• to find $\pi$ maximize policy value $p(\theta)$
• could do this with gradient ascent / empirical gradient hill climbing
• when environment/policy is stochastic, more difficult
1. could sample multiple times to compute gradient
2. REINFORCE algorithm - could approximate gradient at $\theta$ by just sampling at $\theta$: $\nabla_\theta p(\theta) \approx \frac{1}{N} \sum_{j=1}^N \frac{(\nabla_\theta \pi_\theta (s,a_j)) R_j (s)}{\pi_\theta (s,a_j)}$
3. PEGASUS - correlated sampling - ex. 2 blackjack programs would both be dealt same hands

### applications

• game playing
• robot control

# logic and planning

• knowledge-based agents - intelligence is based on reasoning that operates on internal representations of knowledge

## logical agents - 7.1-7.7 (omitting 7.5.2)

• 3 steps - given a percept, the agent
1. adds the percept to its knowledge base
2. asks the knowledge base for the best action
3. tells the knowledge base that it has taken that action
• declarative approach - tell sentences until agent knows how to operate
• procedural approach - encodes desired behaviors as program code
• ex. Wumpus World
• logical entailment between sentences
• B follows logically from A (A implies B)
• $A \vDash B$
• model checking - try everything to see if A $\implies$ B
• this is sound=truth-preserving
• complete - can derive any sentence that is entailed
• grounding - connection between logic and real environment (usually sensors)
• inference
• TT-ENTAILS - recursively enumerate all models - check that the query holds in every model where the KB holds
• theorem properties
• satisfiable - true under some model
• validity - tautology - true under all models
• monotonicity - set of entailed sentences can only increase as info is added to the knowledge base
• if $KB \vDash A$ then $KB \land B \vDash A$
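
Model checking by enumeration (the TT-ENTAILS idea above) can be sketched with sentences as Python predicates over models; the two-symbol modus-ponens KB is a hypothetical example.

```python
from itertools import product

symbols = ["P", "Q"]
KB = lambda m: (not m["P"] or m["Q"]) and m["P"]   # (P ⇒ Q) ∧ P
A  = lambda m: m["Q"]

def tt_entails(kb, query):
    # KB ⊨ A iff A is true in every model where KB is true
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if kb(model) and not query(model):          # found a counterexample
            return False
    return True

print(tt_entails(KB, A))   # → True (modus ponens)
```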

## theorem proving

• resolution rule - resolves different rules with each other - leads to complete inference procedure
• CNF - conjunctive normal form - conjunction of clauses
• anything can be expressed as this
• skip this - resolution algorithm: to check $KB \vDash A$, show $KB \land \lnot A$ is unsatisfiable
• keep adding clauses until
1. nothing can be added
2. get empty clause, so $KB \vDash A$
• ground resolution thm - if a set of clauses is unsatisfiable, then the resolution closure of those clauses contains the empty clause
• resolution closure - set of all clauses derivable by repeated application of resolution rule
• restricted knowledge bases
• horn clause - at most one positive
• definite clause - disjunction of literals with exactly one positive
• goal clause - no positive
• benefits
• easy to understand
• forward-chaining / backward-chaining are applicable
• deciding entailment is linear in size(KB)
• forward/backward chaining
• checks if q is entailed by KB of definite clauses
• keep adding until query is added or nothing else can be added
• backward chaining works backwards from the query
• checking satisfiability
• complete backtracking
• davis-putnam algorithm = DPLL - like TT-entails with 3 improvements
1. early termination
2. pure symbol heuristic - pure symbol appears with same sign in all clauses
3. unit clause heuristic - clause with just one literal, or with only one literal not already assigned false
• other improvements (similar to search)
1. component analysis
2. variable and value ordering
3. intelligent backtracking
4. random restarts
5. clever indexing
• local search
• evaluation function can just count number of unsatisfied clauses (MIN-CONFLICTS algorithm for CSPs)
• WALKSAT - randomly chooses between flipping based on MIN-CONFLICTS and randomly
• runs forever if no soln
• underconstrained problems are easy to find solns to
• satisfiability threshold conjecture - for random clauses, probability of satisfiability goes to 0 or 1 based on ratio of clauses to symbols
• hardest problems are at the threshold
• state variables that change over time also called fluents
• can index these by time
• effect axioms - specify the effect of an action at the next time step
• frame axioms - assert that all propositions remain the same
• successor-state axiom: $F^{t+1} \iff ActionCausesF^t \lor (F^t \land \lnot ActionCausesNotF^t )$
• keeping track of belief state
• can just use 1-CNF
• 1-CNF includes all states that are in fact possible given the full percept history
• conservative approximation
• SATPLAN - how to make plans for future actions that solve the goal by propositional inference
• must add precondition axioms - states that action occurrence requires preconditions to be satisfied
• action exclusion axioms - one action at a time

## first-order logic - 8.1-8.3.3

• declarative language - semantics based on a truth relation between sentences and possible worlds
• has compositionality - meaning decomposes
• Sapir-Whorf hypothesis - understanding of the world is influenced by the language we speak
• 3 elements
1. objects
2. relations
• functions - only one value for given input
• first-order logic assumes more about the world than propositional logic
• epistemological commitments - the possible states of knowledge that a logic allows with respect to each fact
• higher-order logic - views relations and functions as objects in themselves
• first-order consists of symbols
1. constant symbols - stand for objects
2. predicate symbols - stand for relations
3. function symbols - stand for functions
• arity - fixes number of args
• term - logical expression that refers to an object
• atomic sentence - formed from a predicate symbol optionally followed by a parenthesized list of terms
• true if relation holds among objects referred to by the args
• quantifiers - $\forall, \exists$, etc.
• interpretation - specifies exactly which objects, relations and functions are referred to by the symbols

## inference in first-order logic - 9.1-9.4

• propositionalization - can convert first-order logic to propositional logic and do propositional inference
• universal instantiation - we can infer any sentence obtained by substituting a ground term for the variable
• replace “forall x” with a specific x
• existential instantiation - variable is replaced by a new constant symbol
• replace “there exists x” with a specific x
• Skolem constant - new name of constant
• only need finite subset of propositionalized KB - can stop nested functions at some depth
• semidecidable - algorithms exist that say yes to every entailed sentence, but no algorithm exists that also says no to every nonentailed sentence
• generalized modus ponens
• unification - finding substitutions that make different logical expressions look identical
• UNIFY(Knows(John,x), Knows(x,Elizabeth)) = fail (x can’t be both John and Elizabeth)
• use different x’s - standardizing apart
• want most general unifier
• need occur check - S(x) can’t unify with S(S(x))
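
A unification sketch with the occur check; the "?"-prefix variable convention and the tuple term encoding are arbitrary choices, not from the source.

```python
# variables are strings beginning with "?"; compound terms are tuples
# like ("Knows", "John", "?x") with the functor first.
def is_var(t):
    return isinstance(t, str) and t.startswith("?")

def occurs(v, t, s):
    # occur check: does variable v appear inside term t (under subst s)?
    if t == v:
        return True
    if is_var(t) and t in s:
        return occurs(v, s[t], s)
    if isinstance(t, tuple):
        return any(occurs(v, arg, s) for arg in t[1:])
    return False

def unify(x, y, s=None):
    if s is None:
        s = {}
    if s is False:
        return False
    if x == y:
        return s
    if is_var(x):
        if x in s:
            return unify(s[x], y, s)
        if occurs(x, y, s):
            return False               # ?x cannot unify with S(?x)
        return {**s, x: y}
    if is_var(y):
        return unify(y, x, s)
    if isinstance(x, tuple) and isinstance(y, tuple) \
            and len(x) == len(y) and x[0] == y[0]:
        for a, b in zip(x[1:], y[1:]):
            s = unify(a, b, s)
            if s is False:
                return False
        return s
    return False

# standardizing apart matters: with distinct variables unification succeeds
print(unify(("Knows", "John", "?x"), ("Knows", "?y", "Elizabeth")))
print(unify(("Knows", "John", "?x"), ("Knows", "?x", "Elizabeth")))  # fail
print(unify("?x", ("S", "?x")))                                      # fail
```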
• storage and retrieval
• STORE(s) - stores a sentence s into the KB
• FETCH(q) - returns all unifiers such that the query q unifies with some sentence in the KB
• only try to unify reasonable facts using indexing
• query such as Employs(IBM, Richard)
• all possible unifying queries form subsumption lattice
• forward chaining
• first-order definite clauses - disjunctions of literals of which exactly one is positive (could also be implication whose consequent is a single positive literal)
• Datalog - language restricted to first-order definite clauses with no function symbols
• simple forward-chaining: FOL-FC-ASK
1. pattern matching is expensive
2. rechecks every rule
3. generates irrelevant facts
• efficient forward chaining (solns to above problems)
1. conjunct ordering problem - find an ordering to solve the conjuncts of the rule premise so the total cost is minimized
• requires heuristics (ex. minimum-remaining-values)
2. incremental forward chaining - ignore redundant rules
• every new fact inferred on iteration t must be derived from at least one new fact inferred on iteration t-1
• rete algorithm was first to do this
3. irrelevant facts can be ignored by backward chaining
• could also use deductive database to keep track of relevant variables
• backward-chaining
• simple backward-chaining: FOL-BC-ASK
• is a generator - returns multiple times, each giving one possible result
• logic programming: algorithm = logic + control
• ex. prolog
• a lot more here
• can have parallelism
• redundant inference / infinite loops because of repeated states and infinite paths
• can use memoization (similar to the dynamic programming that forward-chaining does)
• generally easier than converting it into FOL
• constraint logic programming - allows variables to be constrained rather than bound
• allows for things with infinite solns
• can use metarules to determine which conjuncts are tried first

## classical planning 10.1-10.2

• planning - devising a plan of action to achieve one’s goals
• Planning Domain Definition Language (PDDL) - uses factored representation of world
• closed-world assumption - fluents that aren’t present are false
• set of ground (variable-free) actions can be represented by a single action schema
• like a method

## knowledge representation 12.1 - 12.3

• ontological engineering - representing objects and their relationships
• upper ontology - tree more general at the top more specific at bottom
• must represent categories
• subcategories make a taxonomy
• can also define functions
• mass noun - function that includes only intrinsic properties
• count noun - function that includes any extrinsic properties
• Deep Learning

[toc]

# neural networks

• basic perceptron update rule
• if output is 0 but should be 1: raise weights on active connections by d
• if output is 1 but should be 0: lower weights on active connections by d
• transfer / activation functions
• sigmoid(z) = $\frac{1}{1+e^{-z}}$
• Binary step
• TanH
• Rectifier = ReLU
• deep - more than 1 hidden layer
• regression loss = $\frac{1}{2}(y-\hat{y})^2$
• classification loss = $-y \log (\hat{y}) - (1-y) \log(1-\hat{y})$
• can’t use SSE because not convex here
• multiclass classification loss $=-\sum_j y_j \ln \hat{y}_j$
• backpropagation - application of reverse mode automatic differentiation to a neural network’s loss
• apply the chain rule from the end of the program back towards the beginning
• $\frac{dL}{dx_i} = \frac{dL}{dz} \frac{\partial z}{\partial x_i}$
• sum $\frac{dL}{dz}$ if neuron has multiple outputs z
• L is output
• $\frac{\partial z}{\partial x_i}$ is actually a Jacobian (deriv each $z_i$ wrt each $x_i$ - these are vectors)
• each gate usually has some sparsity structure so you don’t compute whole Jacobian
• pipeline
• initialize weights, and final derivative ($\frac{dL}{dL}=1$)
• for each batch
• run network forward to compute outputs at each step
• compute gradients at each gate with backprop
• update weights with SGD
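
The pipeline above, sketched for a tiny 1-hidden-layer sigmoid network on XOR with the squared-error loss and full-batch gradient descent; shapes, seed, and learning rate are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)   # initialize weights
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)
lr = 0.5

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

losses = []
for _ in range(5000):
    # forward: run the network to compute outputs at each step
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(0.5 * np.sum((y - out) ** 2))   # regression loss
    # backward: chain rule from the loss back toward the inputs
    d_out = (out - y) * out * (1 - out)           # dL/dz2
    d_h = (d_out @ W2.T) * h * (1 - h)            # dL/dz1
    # gradient-descent weight update
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)

print(losses[0], losses[-1])   # loss should shrink over training
```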

# training

• vanishing gradients problem - neurons in earlier layers learn more slowly than in later layers
• happens with sigmoids
• exploding gradients problem - gradients are significantly larger in earlier layers than later layers
• RNNs
• batch normalization - whiten inputs to all neurons (zero mean, variance of 1)
• do this for each input to the next layer
• dropout - randomly zero outputs of p fraction of the neurons during training
• like learning large ensemble of models that share weights
• 2 ways to compensate (pick one)
1. at test time multiply all neurons’ outputs by p
2. during training divide all neurons’ outputs by p
• softmax - takes vector z and returns vector of the same length
• makes it so output sums to 1 (like probabilities of classes)
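
A softmax sketch; subtracting the max before exponentiating is the usual numerical-stability trick (an addition, not from the notes).

```python
import math

def softmax(z):
    m = max(z)                                 # for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]           # outputs sum to 1

print(softmax([1.0, 2.0, 3.0]))
```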

# CNNs

• kernel here means filter
• convolution - takes a windowed average of an image F with a filter H, where the filter is flipped horizontally and vertically before being applied
• G = H $\ast$ F
• if we do a filter with just a 1 in the middle, we get the exact same image
• you can basically always pad with zeros as long as you keep 1 in middle
• can use these to detect edges with small convolutions
• can do Gaussian filters
• convolutions typically sum over all color channels
• weight matrices have special structure (Toeplitz or block Toeplitz)
• input layer is usually centered (subtract mean over training set)
• usually crop to fixed size (square input)
• receptive field - input region
• stride m - compute only every mth pixel
• downsampling
• max pooling - backprop error back to neuron w/ max value
• average pooling - backprop splits error equally among input neurons
• data augmentation - random rotations, flips, shifts, recolorings
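A direct sketch of “valid” (no padding) convolution on nested lists, including the horizontal and vertical filter flip; the identity filter with a single 1 in the middle returns the input pixel at each window center:

```python
def conv2d(F, H):
    # true convolution: flip H horizontally and vertically, then slide
    Hf = [row[::-1] for row in H[::-1]]
    kh, kw = len(Hf), len(Hf[0])
    oh, ow = len(F) - kh + 1, len(F[0]) - kw + 1
    G = [[sum(Hf[i][j] * F[r + i][c + j] for i in range(kh) for j in range(kw))
          for c in range(ow)] for r in range(oh)]
    return G

F = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # single 1 in the middle
print(conv2d(F, identity))  # [[5]] - recovers the center pixel
```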

## 1 - AlexNet (2012)

• landmark (5 conv layers, some pooling/dropout)

## 2 - ZFNet (2013)

• fine tuning and deconvnet

## 3 - VGGNet (2014)

• 19 layers, all 3x3 conv layers and 2x2 maxpooling

## 4 - GoogLeNet (2015)

• lots of parallel elements (called Inception module)

## 5 - Msft ResNet (2015)

• very deep - 152 layers
• connections straight from initial layers to end
• only learn “residual” from top to bottom

## 6 - Region Based CNNs (R-CNN - 2013, Fast R-CNN - 2015, Faster R-CNN - 2015)

• object detection

## 7 - GAN (2014)

• might not converge
• generative adversarial network
• goal: want G to generate distribution that follows data
• ex. generate good images
• two models
• G - generative
• D - discriminative
• G generates adversarial sample x for D
• G has prior z
• D gives probability p that x comes from data, not G
• like a binary classifier: 1 if from data, 0 from G
• adversarial sample - from G, but tricks D into predicting 1
• training goals
• G wants D(G(z)) = 1
• D wants D(G(z)) = 0
• D(x) = 1
• converge when D(G(z)) = 1/2
• G loss function: $G^* = argmin_G \: log(1-D(G(z)))$
• overall minimax: $min_G \: max_D \: E_x[log D(x)] + E_z[log(1-D(G(z)))]$
• training algorithm
• in the beginning, since G is bad, only train by minimizing the G loss function
• later, alternate two nested SGD loops:
• inner loop: max D by SGD
• outer loop: min G by SGD
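For a fixed G, the standard GAN analysis gives the optimal discriminator $D^*(x) = p_{data}(x) / (p_{data}(x) + p_G(x))$, so $D^* = 1/2$ exactly when G matches the data; a quick numeric check of the convergence condition:

```python
def d_star(p_data, p_g):
    # optimal discriminator for a fixed generator (standard GAN analysis)
    return p_data / (p_data + p_g)

print(d_star(0.8, 0.2))  # G poor here: D confidently says "data"
print(d_star(0.5, 0.5))  # G matches data: D = 1/2, the convergence point
```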


• RNN+CNN

## 9 - Spatial transformer networks (2015)

• transformations within the network

## 10 - Segnet (2015)

• encoder-decoder network

## 11 - Unet (2015)

• Ronneberger - applies to biomedical segmentation

## 12 - Pixelnet (2017)

• predicts pixel-level for different tasks with the same architecture
• convolutional layers then 3 FC layers which use outputs from all convolutional layers together

# recent papers

• deepmind’s learning to learn
• optimal brain damage - starts with fully connected and weeds out connections (Lecun)
• tiling - train networks on the error of previous networks

# RNNs

• feedforward NNs have no memory so we introduce recurrent NNs
• able to have memory
• could theoretically unfold the network and train with backprop
• truncated - limit number of times you unfold
• $state_{new} = f(state_{old},input_t)$
• ex. $h_t = tanh(W h_{t-1}+W_2 x_t)$
• train with backpropagation through time (unfold through time)
• truncated backprop through time - only run every k time steps
• error gradients vanish exponentially quickly with time lag
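A scalar sketch of the recurrence $h_t = tanh(W h_{t-1} + W_2 x_t)$ unfolded through time (the weights here are arbitrary); the first input’s influence on the state shrinks at each step, the vanishing effect in miniature:

```python
import math

def rnn_forward(xs, w_h=0.5, w_x=1.0, h0=0.0):
    # h_t = tanh(W*h_{t-1} + W2*x_t), unfolded over the input sequence
    h, hs = h0, []
    for x in xs:
        h = math.tanh(w_h * h + w_x * x)
        hs.append(h)
    return hs

hs = rnn_forward([1.0, 0.0, 0.0, 0.0])
print(hs)  # the initial input's influence decays step by step
```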

## LSTMs

• have gates for forgetting, input, output
• easy to let hidden state flow through time, unchanged
• gate $\sigma$ - pointwise multiplication
• multiply by 0 - let nothing through
• multiply by 1 - let everything through
• forget gate - conditionally discard previously remembered info
• input gate - conditionally remember new info
• output gate - conditionally output a relevant part of memory
• GRUs - similar, merge input / forget units into a single update unit
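A scalar LSTM step sketch (all weights here are hypothetical); with the forget gate saturated at 1 and the input gate at 0, the cell memory flows through time unchanged:

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def lstm_step(c_prev, h_prev, x, W):
    # W holds hypothetical scalar weights (h-weight, x-weight, bias) per gate
    f = sigmoid(W['f'][0] * h_prev + W['f'][1] * x + W['f'][2])   # forget gate
    i = sigmoid(W['i'][0] * h_prev + W['i'][1] * x + W['i'][2])   # input gate
    o = sigmoid(W['o'][0] * h_prev + W['o'][1] * x + W['o'][2])   # output gate
    g = math.tanh(W['g'][0] * h_prev + W['g'][1] * x + W['g'][2]) # candidate
    c = f * c_prev + i * g       # conditionally forget old, remember new
    h = o * math.tanh(c)         # conditionally output part of memory
    return c, h

# forget gate saturated at 1, input gate at 0: memory passes through unchanged
W = {'f': (0, 0, 100), 'i': (0, 0, -100), 'o': (0, 0, 100), 'g': (0, 1, 0)}
c, h = lstm_step(3.0, 0.0, 5.0, W)
print(c)  # still 3.0: hidden state let through by the gates
```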
# Dimensionality Reduction

# PCA

• have p random variables
• want new set of K axes (linear combinations of the original p axes) in the direction of greatest variability
• this is best for visualization, reduction, classification, noise reduction
• to find axis - maximize the variance of projections onto the line: maximize $v^TX^TXv$ subject to $v^T v=1$
• $\implies X^TXv=\lambda v$ (an eigenvector equation, via Lagrange multipliers)
• SVD: let $X = U D V^T$
• $V_q$ (pxq) is first q columns of V
• $H = V_q V_q^T$ is the projection matrix
• to project: $\hat{x} = Hx$
• columns of $UD$ (Nxp) are called the principal components of X
• eigenvectors of covariance matrix -> principal components
• most important corresponds to largest eigenvalue (eigenvalue corresponds to variance)
• finding eigenvectors directly can be hard, so other methods:
1. singular value decomposition (SVD)
2. multidimensional scaling (MDS)
• based on eigenvalue decomposition
• extract components sequentially, starting with highest variance so you don’t have to extract them all
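A sketch of extracting just the first principal direction by power iteration on $X^TX$ (assumes X is already centered; the toy data is made up):

```python
import math

def first_pc(X, iters=200):
    # power iteration on X^T X: converges to the top eigenvector,
    # i.e. the direction of greatest variance (assumes centered X)
    p = len(X[0])
    v = [1.0] * p
    for _ in range(iters):
        Xv = [sum(row[j] * v[j] for j in range(p)) for row in X]        # X v
        w = [sum(X[i][j] * Xv[i] for i in range(len(X))) for j in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]                                       # renormalize
    return v

# centered toy data stretched roughly along the direction (1, 1)
X = [[-2, -2], [-1, -1.2], [1, 0.8], [2, 2.4]]
v = first_pc(X)
print(v)  # roughly (0.7, 0.7): the axis of greatest variability
```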

### nonlinear PCA

• usually uses an auto-associative neural network

# ICA

• like PCA, but instead of the dot product between components being 0, the mutual info between components is 0
• goals
• minimizes statistical dependence between its components
• maximize information transferred in a network of non-linear units
• uses information theoretic unsupervised learning rules for neural networks
• problem - doesn’t rank features for us
# Learning Theory

[toc]

# books

1. Machine Learning - Tom Mitchell
2. An Introduction to Computational Learning Theory - Kearns & Vazirani

# bounds

• 2 major inequalities
1. Markov’s inequality
• $P(X \geq a) \leq \frac{E[X]}{a}$
• X is typically running time of the algorithm
• if we don’t have E[X], can use upper bound for E[X]
2. Chebyshev’s inequality
•  $P(|X-\mu| \geq a) \leq \frac{Var[X]}{a^2}$
• utilizes the variance to get a better bound
• related results: CLT, law of large numbers, Chernoff bounds, Hoeffding bounds
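A quick empirical check of both inequalities on a sum of 10 coin flips ($E[X]=5$, $Var[X]=2.5$):

```python
import random

random.seed(0)
# X = number of heads in 10 fair flips: E[X] = 5, Var[X] = 2.5
samples = [sum(random.randint(0, 1) for _ in range(10)) for _ in range(20000)]

a = 8
empirical = sum(x >= a for x in samples) / len(samples)
markov = 5 / a                    # P(X >= 8) <= E[X]/8
chebyshev = 2.5 / (a - 5) ** 2    # P(|X - 5| >= 3) <= Var[X]/9
print(empirical, markov, chebyshev)  # Chebyshev is much tighter here
```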

# approximations

• $\binom{n}{k} < \left( \frac{ne}{k} \right)^k$
• $\left( \frac{n}{e} \right)^n < n!$
• $(1-x)^N \leq e^{-Nx}$
• Poisson pmf approximates binomial when N large, p small
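All four approximations checked numerically (the specific n, k, N, p values are arbitrary):

```python
import math

n, k = 20, 5
assert math.comb(n, k) < (n * math.e / k) ** k      # binomial coefficient bound
assert (n / math.e) ** n < math.factorial(n)         # Stirling-style bound
assert (1 - 0.1) ** 50 <= math.exp(-50 * 0.1)        # (1-x)^N <= e^{-Nx}

# Poisson(Np) approximates Binomial(N, p) for large N, small p
N, p = 1000, 0.003
lam = N * p
binom = math.comb(N, 2) * p**2 * (1 - p) ** (N - 2)  # P(X = 2), exact
poisson = math.exp(-lam) * lam**2 / math.factorial(2)
print(binom, poisson)  # close
```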

# evolution

• performance is correlation $Perf_D (h,c) = \sum h(x) \cdot c(x) \cdot P(x)$
• want $P(Perf_D(h,c) < Perf_D(c,c)-\epsilon) < \delta$

# sample problems

• ex: N marbles in a bag. How many draws with replacement needed before we draw all N marbles?
• write $P_i = \frac{N-(i-1)}{N}$ where i is number of distinct drawn marbles
• transition from i to i+1 is geometrically distributed with probability $P_i$
• mean times is sum of mean of each geometric
• in order to get probabilities of seeing all the marbles instead of just mean[# draws], want to use Markov’s inequality
• box full of 1e6 marbles
• if we have 10 evenly distributed classes of marbles, what is probability we identify all 10 classes of marbles after 100 draws?
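For the first sample problem, the expected number of draws is the sum of the geometric means, $\sum_i \frac{N}{N-(i-1)} = N \cdot H_N$; a simulation sketch agrees:

```python
import random

def expected_draws(N):
    # transition from i-1 to i distinct marbles is Geometric(P_i),
    # P_i = (N - (i-1))/N, with mean 1/P_i; sum the means
    return sum(N / (N - (i - 1)) for i in range(1, N + 1))

def simulate(N, rng):
    seen, draws = set(), 0
    while len(seen) < N:
        seen.add(rng.randrange(N))
        draws += 1
    return draws

rng = random.Random(1)
N = 10
mean_sim = sum(simulate(N, rng) for _ in range(5000)) / 5000
print(expected_draws(N), mean_sim)  # both near N * H_N, about 29.3
```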

# computational learning theory

• frameworks
1. PAC
2. mistake-bound - split into b processes which each fail with probability at most $\delta / b$
• questions
1. sample complexity - how many training examples needed to converge
2. computational complexity - how much computational effort needed to converge
3. mistake bound - how many training examples will learner misclassify before converging
• must define convergence based on some probability

## PAC - probably learning an approximately correct hypothesis - Mitchell

• want to learn C
• data X is sampled with Distribution D
• learner L considers set H of possible hypotheses
• true error $err_D (h)$ of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.
• $err_D(h) = \underset{x\sim D}{Pr}[c(x) \neq h(x)]$
• getting $err_D(h)=0$ is infeasible
• PAC learnable - consider concept class C defined over set of instances X of length n and a learner L using hypothesis space H
• C is PAC-learnable by L using H if for all $c \in C$, all distributions D over X, all $\epsilon$ s.t. $0 < \epsilon < 1/2$, and all $\delta$ s.t. $0<\delta<1/2$, learner L will with probability at least $(1-\delta)$ output a hypothesis $h \in H$ s.t $err_D(h) \leq \epsilon$
• efficiently PAC learnable - time that is polynomial in $1/\epsilon, 1/\delta, n, size(c )$
• probably - probability of failure bounded by some constant $\delta$
• approximately correct - err bounded by some constant $\epsilon$
• assumes H contains a hypothesis with arbitrarily small error for every target concept in C

## sample complexity for finite hypothesis space - Mitchell

• sample complexity - growth in the number of training examples required
• consistent learner - outputs hypotheses that perfectly fit training data whenever possible
• outputs a hypothesis belonging to the version space
• consider hypothesis space H, target concept c, instance distribution $\mathcal{D}$, training examples D of c. The version space $VS_{H,D}$ is $\epsilon$-exhausted with respect to c and $\mathcal{D}$ if every hypothesis h in $VS_{H,D}$ has error less than $\epsilon$ with respect to c and $\mathcal{D}$: $(\forall h \in VS_{H,D}) err_\mathcal{D} (h) < \epsilon$

## rectangle learning game - Kearns

• data X is sampled with Distribution D
• simple soln: tightest-fit rectangle
• define region T so prob a draw misses T is $1-\epsilon /4$
• then, m draws miss with $(1-\epsilon /4)^m$
• choose m to satisfy $4(1-\epsilon/4)^m \leq \delta$
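Solving $4(1-\epsilon/4)^m \leq \delta$ for the smallest such m (the example $\epsilon$ and $\delta$ are chosen arbitrarily):

```python
def sample_size(eps, delta):
    # smallest m with 4 * (1 - eps/4)^m <= delta
    m = 1
    while 4 * (1 - eps / 4) ** m > delta:
        m += 1
    return m

m = sample_size(0.1, 0.05)
print(m)  # enough tightest-fit examples to be probably approximately correct
```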

## VC dimension

• VC dimension measures capacity of a space of functions that can be learned by a statistical classification algorithm
• let H be set of sets and C be a set
•  $H \cap C := \{ h \cap C : h \in H \}$
• a set C is shattered by H if $H \cap C$ contains all subsets of C
• The VC dimension of $H$ is the largest integer $D$ such that there exists a set $C$ with cardinality $D$ that is shattered by $H$
• VC dimension 0 -> hypothesis either always returns false or always returns true
• Sauer’s lemma - let $d \geq 0, m \geq 1$, $H$ hypothesis space, VC-dim(H) = d. Then, $\Pi_H(m) \leq \phi (d,m)$
• fundamental theorem of learning theory provides bound of m that guarantees learning: $m \geq [\frac{4}{\epsilon} \cdot (d \cdot ln(\frac{12}{\epsilon}) + ln(\frac{2}{\delta}))]$
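Plugging into the bound (the VC dimension d=3 here is an arbitrary example value):

```python
import math

def sample_bound(d, eps, delta):
    # m >= (4/eps) * (d*ln(12/eps) + ln(2/delta))
    return math.ceil((4 / eps) * (d * math.log(12 / eps) + math.log(2 / delta)))

print(sample_bound(3, 0.1, 0.05))  # samples sufficient to PAC-learn
```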

# concept learning and the general-to-specific ordering

• definitions
• concept learning - acquiring the definition of a general category given a sample of positive and negative training examples of the category
• concept is boolean function that returns true for specific things
• can represent a hypothesis as a vector of specific acceptable feature values, ? (anything acceptable), or $\emptyset$ (if any entry is $\emptyset$, no instance satisfies the hypothesis)
• general hypothesis - more generally true
• general defines a partial ordering
• a hypothesis is consistent with the training examples if it correctly classifies them
• an example x satisfies a hypothesis h if h(x) = 1
• find-S - finding a maximally specific hypothesis
• generalize each time it fails to cover an observed positive training example
• flaws - ignores negative examples
• finds the correct target concept only if
1. the training data has no errors
2. there exists a hypothesis in H that describes target concept c
• version space - set of all hypotheses consistent with the training examples
• list-then-eliminate - list all hypotheses and eliminate any that are inconsistent (slow)
• candidate-elimination - represent most general (G) and specific (S) members of version space
• version space representation theorem - version space can be found from most general / specific version space members
• for positive examples
• make S more general
• fix G
• for negative examples
• fix S
• make G more specific
• in general, optimal query strategy is to generate instances that satisfy exactly half the hypotheses in the current version space
• testing?
• classify as positive if satisfies S
• classify as negative if doesn’t satisfy G
• bias
• unbiased learner
• might have to learn a union of rules - then target concept is expressible
• however, this doesn’t generalize at all
• thus need inductive inference property: a learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances
• define inductive bias of a learner as the set of additional assumptions B sufficient to justify its inductive inferences as deductive inferences
• required to generalize
• inductive bias of candidate-elimination - target concept c is contained in H
# Machine Learning

[toc]

# Overview

• 3 types
• supervised
• unsupervised
• reinforcement

## Evaluation

• accuracy = number of correct classifications / total number of test cases
• balanced accuracy = 1/2 (TP/P + TN/N)
• recall - TP/(TP+FN)
• precision - TP/(TP+FP)
• you train by minimizing SSE on training data
• report MSE for test samples
• cross validation - don’t have enough data for a test set
• k-fold - split data into k pieces, test on only 1
• LOOCV - train on all but one
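The four metrics above computed from raw confusion-matrix counts (toy numbers); on imbalanced data, plain accuracy can look much better than balanced accuracy:

```python
def metrics(tp, fp, tn, fn):
    # summary statistics straight from the definitions above
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    balanced = 0.5 * (tp / (tp + fn) + tn / (tn + fp))  # 1/2 (TP/P + TN/N)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return accuracy, balanced, recall, precision

# imbalanced toy example: mostly true negatives, few positives
acc, bal, rec, prec = metrics(tp=5, fp=10, tn=80, fn=5)
print(acc, bal, rec, prec)  # accuracy looks fine; balanced accuracy is lower
```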

## Error

• Define a loss function $\mathcal{L}$
•  0-1 loss: $I(C \neq f(X))$ (indicator of misclassification)
• $L_2$ loss: $(C-f(X))^2$
• Expected Prediction Error EPE(f) = $E_{X,C} [\mathcal{L}(C,f(X))]$
•  =$E_{X}\left[ \sum_i \mathcal{L}(C_i,f(X)) Pr(C_i|X) \right]$
• Minimize EPE
•  Bayes Classifier minimizes 0-1 loss: $\hat{f}(X)=C_i$ if $P(C_i|X)=max_C P(C|X)$
•  the conditional mean $\hat{f}(X)=E(Y|X)$ minimizes $L_2$ loss (KNN approximates it locally)
• EPE(f(X)) = $noise^2+bias^2+variance$
• noise - unavoidable
• $bias^2=(E[\hat{\theta}]-\theta_{true})^2$ - error due to incorrect assumptions
• simple linear regression has 0 bias (when the true relationship is linear)
• $variance=E[(\hat{\theta}-E[\hat{\theta}])^2]$ - error due to variance of training samples
• more complex models (more nonzero parameters) have lower bias, higher variance
• if high bias, train and test error will be very close (model isn’t complex enough)

# Classification

• asymptotic classifier - assume you get infinite training / testing points
•  discriminative - model P(C|X) directly
• smaller asymptotic error
• slow convergence ~ O(p)
•  generative - model P(X|C) directly
• generally has higher bias -> can handle missing data
• fast convergence ~ O(log(p))

## Discriminative

### SVMs

• svm benefits
1. maximum margin separator generalizes well
2. kernel trick makes it very nonlinear
3. nonparametric - can retain training examples, although often get rid of many
• notation
• $y \in \{-1,1\}$
• $h(x) = g(w^tx +b)$
• g(z) = 1 if $z \geq 0$ and -1 otherwise
• define functional margin $\gamma^{(i)} = y^{(i)} (w^T x^{(i)} +b)$
• want to limit the size of (w,b) so we can’t arbitrarily increase functional margin
• functional margin $\hat{\gamma}$ is smallest functional margin in a training set
•  geometric margin = functional margin / $\|w\|$
•  if $\|w\|=1$, then same as functional margin
• invariant to scaling of w
• optimal margin classifier
•  want $$max \: \gamma \: s.t. \: y^{(i)} (w^T x^{(i)} + b) \geq \gamma, i=1,..,m; \|w\|=1$$
•  difficult to solve, especially because of the $\|w\|=1$ constraint
• assume $\hat{\gamma}=1$ - just a scaling factor
•  now we are maximizing $1/\|w\|$
•  equivalent to this formulation: $$min \: \frac{1}{2}\|w\|^2 \: s.t. \: y^{(i)}(w^Tx^{(i)}+b)\geq1, i = 1,…,m$$
• Lagrange duality
• dual representation is found by solving $\underset{a}{argmax} \sum_j \alpha_j - 1/2 \sum_{j,k} \alpha_j \alpha_k y_j y_k (x_j \cdot x_k)$ subject to $\alpha_j \geq 0$ and $\sum_j \alpha_j y_j = 0$
• convex
• data only enter in form of dot products, even when predicting $h(x) = sgn(\sum_j \alpha_j y_j (x \cdot x_j) - b)$
• weights $\alpha_j$ are zero except for support vectors
• replace dot product $x_j \cdot x_k$ with kernel function $K(x_j, x_k)$
• faster than just transforming x
• allows to find optimal linear separators efficiently
• soft margin classifier - lets examples fall on wrong side of decision boundary
• assigns them penalty proportional to distance required to move them back to correct side
• want to maximize margin $M = \frac{2}{\sqrt{w^T w}}$
•  we get this from $M= \|x^+ - x^-\| = \lambda \|w\| = \lambda \sqrt{w^Tw}$
• separable case: argmin($w^Tw$) subject to
• $w^Tx+b\geq 1$ for all x in +1 class
• $w^Tx+b\leq -1$ for all x in -1 class
• solve with quadratic programming
• non-separable case: argmin($w^T w/2 + C \sum_i^n \epsilon_i$) subject to
• $w^Tx_i +b \geq 1-\epsilon_i$ for all x in +1 class
• $w^Tx_i +b \leq -1+\epsilon_i$ for all x in -1 class
• $\forall i, \epsilon_i \geq 0$
• large C can lead to overfitting
• benefits
• number of parameters remains the same (and most are set to 0)
• we only care about support vectors
• maximizing margin is like regularization: reduces overfitting
• these can be solved with quadratic programming QP
• solve a dual formulation (Lagrangian) instead of QPs directly so we can use kernel trick
• primal: $min_w max_\alpha L(w,\alpha)$
• dual: $max_\alpha min_w L(w,\alpha)$
• KKT condition for strong duality
• complementary slackness: $\lambda_i f_i(x) = 0, i=1,…,m$
• VC (Vapnik-Chervonenkis) dimension - if data is mapped into sufficiently high dimension, then samples will be linearly separable (N points, N-1 dims)
• kernel functions - new ways to compute dot product (similarity function)
• original testing function: $\hat{y}=sign(\Sigma_{i\in train} \alpha_i y_i x_i^Tx_{test}+b)$
• with kernel function: $\hat{y}=sign(\Sigma_{i\in train} \alpha_i y_i K(x_i,x_{test})+b)$
• linear $K(x,z) = x^Tz$
• polynomial $K (x, z) = (1+x^Tz)^d$
•  radial basis kernel $K (x, z) = exp(-r \|x-z\|^2)$
• computing these is O($m^2$), but dot-product is just O(m)
• function that corresponds to an inner product in some expanded feature space
• practical guide
• use m numbers to represent categorical features
• scale before applying
• fill in missing values
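A sketch of why the kernel trick works: for 2-D inputs, $K(x,z)=(x^Tz)^2$ equals an ordinary dot product under the feature map $\phi(x)=(x_1^2, \sqrt{2}x_1x_2, x_2^2)$, computed without ever building $\phi$:

```python
import math

def k_poly2(x, z):
    # kernel: square of the plain dot product, O(m) to compute
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    # explicit degree-2 feature map this kernel corresponds to
    return [x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2]

x, z = [1.0, 2.0], [3.0, 0.5]
lhs = k_poly2(x, z)
rhs = sum(a * b for a, b in zip(phi(x), phi(z)))
print(lhs, rhs)  # identical: dot product in expanded space, computed cheaply
```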

### Logistic Regression

•  $p = P(Y=1|X)=\frac{exp(\theta^T x)}{1+exp(\theta ^Tx)}$
• logit (log-odds) of $p:ln\left[ \frac{p}{1-p} \right] = \theta^T x$
• predict using Bernoulli distribution with this parameter p
• can be extended to multiple classes - multinomial distribution

## Generative

### Naive Bayes Classifier

• let $C_1,…,C_L$ be the classes of Y
•  want Posterior $P(C|X) = \frac{P(X|C)P(C)}{P(X)}$
• MAP rule - maximum a posteriori rule
• use Prior P(C)
• using x, predict $C^*=\text{argmax}_C P(C|X_1,…,X_p)=\text{argmax}_C P(X_1,…,X_p|C) P(C)$ - generally ignore denominator
• naive assumption - assume that all input attributes are conditionally independent given C
• $P(X_1,…,X_p|C) = P(X_1|C)\cdot…\cdot P(X_p|C) = \prod_i P(X_i|C)$
• learning
1. learn L distributions $P(C_1),P(C_2),…,P(C_L)$
2. learn $P(X_j=x_{jk}|C_i)$
• for j in 1:p
•  i in 1:$|C|$
•  k in 1:$|X_j|$
• for discrete case we store $P(X_j|c_i)$, otherwise we assume a prob. distr. form
• number of distributions stored
•  naive: $|C| \cdot ( |X_1| + |X_2| + … + |X_p| )$
• otherwise: $|C|\cdot (|X_1| \cdot |X_2| \cdot … \cdot |X_p|)$
• testing
• $P(X|c)$ - look up for each feature $X_i|C$ and try to maximize
• smoothing - used to fill in 0s
•  $P(x_i|c_j) = \frac{N(x_i, c_j) +1}{N(c_j)+ |X_i| }$
• then, $\sum_i P(x_i|c_j) = 1$
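A tiny naive Bayes sketch with Laplace smoothing (the dataset and feature values are invented for illustration):

```python
from collections import Counter

# hypothetical toy dataset: (feature vector, class)
data = [(('sunny', 'hot'), 'no'), (('sunny', 'mild'), 'no'),
        (('rainy', 'mild'), 'yes'), (('rainy', 'hot'), 'yes'),
        (('rainy', 'mild'), 'yes')]

classes = Counter(c for _, c in data)
values = [sorted({x[j] for x, _ in data}) for j in range(2)]

def p_given(j, v, c):
    # Laplace smoothing: P(x|c) = (N(x,c)+1)/(N(c)+|X_j|), fills in zeros
    n_xc = sum(1 for x, cl in data if cl == c and x[j] == v)
    return (n_xc + 1) / (classes[c] + len(values[j]))

def predict(x):
    # MAP rule under the naive independence assumption (denominator ignored)
    def score(c):
        p = classes[c] / len(data)
        for j, v in enumerate(x):
            p *= p_given(j, v, c)
        return p
    return max(classes, key=score)

print(predict(('rainy', 'mild')))
```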

### Gaussian classifiers

• distributions
•  Normal $P(X_j|C_i) = \frac{1}{\sigma_{ij} \sqrt{2\pi}} exp\left( -\frac{(X_j-\mu_{ij})^2}{2\sigma_{ij}^2}\right)$ - requires storing $|C| \cdot p$ distributions
• Multivariate Normal $\frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} exp\left(-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)\right)$ where $\Sigma$ is covariance matrix
•  decision boundaries are points satisfying $P(C_i|X) = P(C_j|X)$
• LDA - linear discriminant analysis - assume covariance matrix is the same across classes
• Gaussian distributions are shifted versions of each other
• decision boundary is linear
• QDA - different covariance matrices
• estimate the covariance matrix separately for each class C
• decision boundaries are quadratic
• fits data better but has more parameters to estimate
• Regularized discriminant analysis - shrink the separate covariance matrices towards a common matrix
• $\Sigma_k = \alpha \Sigma_k + (1-\alpha) \Sigma$
• treat each feature attribute and class label as random variables
• we assume distributions for these
• for 1D Gaussian, just set mean and var to sample mean and sample var

### Text classification

• bag of words - represent text as a vector of word frequencies X
• remove stopwords, stemming, collapsing multiple occurrences - NLTK package in python
• assumes word order isn’t important
• can store n-grams
•  multivariate Bernoulli: $P(X|C)=P(w_1=true,w_2=false,…|C)$
• multivariate Binomial: $P(X|C)=P(w_1=n_1,w_2=n_2,…|C)$
• this is inherently naive
• time complexity
•  training O(n $\cdot$ average_doc_length_train + $|C| \cdot |dict|$)
• testing O($|C| \cdot$ average_doc_length_test)
• implementation
• have symbol for unknown words
• underflow prevention - take logs of all probabilities so we don’t get 0
• c = argmax log$P(c)$ + $\sum_i log P(X_i|c)$

## Instance-Based (ex. K nearest neighbors)

• also called lazy learners
• makes Voronoi diagrams
• can take majority vote of neighbors or weight them by distance
• distance can be Euclidean, cosine, or other
• should scale attributes so large-valued features don’t dominate
• Mahalanobis distance metric takes into account the covariance between attributes
• in higher dimensions, distances tend to be much farther, worse extrapolation
• sometimes need to use invariant metrics
• ex. rotate digits to find the most similar angle before computing pixel difference
• could just augment data, but can be infeasible
• computationally costly so we can approximate the curve these rotations make in pixel space with the invariant tangent line
• stores this line for each point and then find distance as the distance between these lines
• finding NN with k-d (k-dimensional) tree
• balanced binary tree over data with arbitrary dimensions
• each level splits in one dimension
• might have to search both branches of tree if close to split
• finding NN with locality-sensitive hashing
• approximate
• make multiple hash tables
• each uses random subset of bit-string dimensions to project onto a line
• union candidate points from all hash tables and actually check their distances
• comparisons
• error rate of 1 NN is never more than twice that of Bayes error

# Feature Selection

## Filtering

• ranks features or feature subsets independently of the predictor
• univariate methods (consider one variable at a time)
• ex. T-test of y for each variable
• ex. Pearson correlation coefficient - this can only capture linear dependencies
• mutual information - covers all dependencies
• multivariate methods
• feature subset selection
• need a scoring function
• need a strategy to search the space
• sometimes used as preprocessing for other methods

## Wrapper

• uses a predictor to assess features of feature subsets
• learner is considered a black-box - use train, validate, test set
• forward selection - start with nothing and keep adding
• backward elimination - start with all and keep removing
• others: Beam search - keep k best paths at each step, GSFS, PTA(l,r), floating search - SFS then SBS

## Embedding

• uses a predictor to build a model with a subset of features that are internally selected
• ex. lasso, ridge regression

# Unsupervised Learning

• labels are not given
• intra-cluster distances are minimized, inter-cluster distances are maximized
• Distance measures
• symmetric D(A,B)=D(B,A)
• self-similarity D(A,A)=0
• positivity separation D(A,B)=0 iff A=B
• triangular inequality D(A,B) <= D(A,C)+D(B,C)
• ex. Minkowski Metrics $d(x,y)=\sqrt[r]{\sum |x_i-y_i|^r}$
• r=1 Manhattan distance
• r=1 when y is binary -> Hamming distance
• r=2 Euclidean
• r=$\infty$ “sup” distance
• correlation coefficient - unit independent
• edit distance

## Hierarchical

• Two approaches:
1. Bottom-up agglomerative clustering - starts with each object in separate cluster then joins
2. Top-down divisive - starts with 1 cluster then separates
• ex. starting with each item in its own cluster, find best pair to merge into a new cluster
• repeatedly do this to make a tree (dendrogram)
• distances between clusters
• single-link=nearest neighbor=their closest members (long, skinny clusters)
• complete-link=furthest neighbor=their furthest members (tight clusters)
• average=average of all cross-cluster pairs - most widely used
• Complexity: $O(n^2p)$ for first iteration and then can only get worse

## Partitional

• partition n objects into a set of K clusters (must be specified)
• globally optimal: exhaustively enumerate all partitions
• minimize sum of squared distances from cluster centroid
• Evaluation w/ labels - purity - ratio between dominant class in cluster and size of cluster

### Expectation Maximization (EM)

• general procedure that includes K-means
• E-step - calculate how strongly each data point “belongs” to each mode (expected responsibilities)
• M-step - calculate what each mode’s mean and covariance should be given the various responsibilities (maximization step)
• known to converge
• can be suboptimal (local optimum)
• monotonically increases the data likelihood
• can also partition around medoids
• mixture-based clustering
• K-Means
•  assign everything to nearest center: $O(|clusters| \cdot np)$
• recompute centers O(np) and repeat until nothing changes
• partition amounts to Voronoi diagram
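A minimal 1-D K-means sketch showing the two alternating steps (the toy points are made up; k is fixed at 2):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point to its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[j].append(p)
        # update step: recompute each center as its cluster mean
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

# 1-D points in two obvious groups
pts = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]
centers = kmeans(pts, 2)
print(centers)  # centers near 1 and 10
```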

### Gaussian Mixture Model (GMM)

• continue deriving new mean and variance at each step
• “soft” version of K-means - update means as weighted sums of data instead of just normal mean

# Derivations

## normal equation

• $L(\theta) = \frac{1}{2} \sum_{i=1}^n (\hat{y}_i-y_i)^2$
• $L(\theta) = 1/2 (X \theta - y)^T (X \theta -y)$
• $L(\theta) = 1/2 (\theta^T X^T - y^T) (X \theta -y)$
• $L(\theta) = 1/2 (\theta^T X^T X \theta - 2 \theta^T X^T y +y^T y)$
• $0=\frac{\partial L}{\partial \theta} = X^TX\theta - X^T y$
• $\theta = (X^TX)^{-1} X^Ty$
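The closed form worked on toy data with the intercept absorbed as a column of ones:

```python
# theta = (X^T X)^{-1} X^T y, for one feature plus intercept
X = [[1, 0], [1, 1], [1, 2], [1, 3]]   # first column absorbs the intercept
y = [1, 3, 5, 7]                        # toy data: exactly y = 1 + 2x

XtX = [[sum(r[i] * r[j] for r in X) for j in range(2)] for i in range(2)]
Xty = [sum(X[n][i] * y[n] for n in range(4)) for i in range(2)]

# invert the 2x2 matrix X^T X directly
(a, b), (c, d) = XtX
det = a * d - b * c
inv = [[d / det, -b / det], [-c / det, a / det]]
theta = [sum(inv[i][j] * Xty[j] for j in range(2)) for i in range(2)]
print(theta)  # recovers intercept 1 and slope 2 (up to float error)
```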

## ridge regression

•  $L(\theta) = \sum_{i=1}^n (\hat{y}_i-y_i)^2+ \lambda \|\theta\|_2^2$
• $L(\theta) = (X \theta - y)^T (X \theta -y)+ \lambda \theta^T \theta$
• $L(\theta) = \theta^T X^T X \theta - 2 \theta^T X^T y +y^T y + \lambda \theta^T \theta$
• $0=\frac{\partial L}{\partial \theta} = 2X^TX\theta - 2X^T y+2\lambda \theta$
• $\theta = (X^TX+\lambda I)^{-1} X^T y$

## single Bernoulli

•  L(p) = P(Train|Bernoulli(p)) = $P(X_1,…,X_n|p)=\prod_i P(X_i|p)=\prod_i p^{X_i} (1-p)^{1-X_i}$
• $=p^x (1-p)^{n-x}$ where x = $\sum x_i$
• $log(L(p)) = log(p^x (1-p)^{n-x})=x \: log(p) + (n-x) \: log(1-p)$
• $0=\frac{d \: log L(p)}{dp} = \frac{x}{p} - \frac{n-x}{1-p} = \frac{x-xp - np+xp}{p(1-p)}=\frac{x-np}{p(1-p)} \implies x-np=0$
• $\implies \hat{p} = \frac{x}{n}$

## multinomial

•  $L(\theta)=P(Train|Multinomial(\theta))=P(d_1,…,d_n|\theta_1,…,\theta_p)$ where d is a document of counts x
• =$\prod_i^n P(d_i|\theta_1,…\theta_p)=\prod_i^n factorials \cdot \theta_1^{x_1}\cdots\theta_p^{x_p}$ - ignore factorials because they are always the same
• require $\sum \theta_i = 1$
• $\implies \hat{\theta}_i = \frac{\sum_{j=1}^n x_{ij}}{N}$ where N is total number of words in all docs
# Regression

[toc]

# problem formulation

• absorb intercept into feature vector of 1
• x = column
• add $x^{(0)} = 1$ as the first element
• matrix formation
• $\hat{y} = f(x) = x^T \theta = \theta^T x = \theta_0 + \theta_1 x^1 + \theta_2 x^2 + …$
• $\pmb{x_1}$ is all the features for one data sample
• $\pmb{x^1}$ is the first feature over all the data samples
• our goal is to pick the optimal theta to minimize least squares
• loss function - minimize SSE
• SSE is a convex function
• single point with 0 derivative
• second derivative always positive
• hessian is psd (positive semi-definite)

# optimization

• gradient - denominator layout - size of variable you are taking - we always use denominator layout
• numerator layout - you transpose the size
• optimization - find values of variables that minimize objective function while satisfying constraints
1. normal equations
• $L(\theta) = \frac{1}{2} \sum_{i=1}^n (f(x_i)-y_i)^2$
• = $1/2 (X \theta - y)^T (X \theta -y)$
• set derivative equal to 0 and solve
• $\theta = (X^TX)^{-1} X^Ty$
• solving the normal equation is computationally expensive - that’s why we use iterative methods like gradient descent (inverting $X^TX$ is $O(p^3)$)
2. gradient descent = batch gradient descent
• gradient - vector that points to direction of maximum increase
• at every step, subtract gradient multiplied by learning rate: $x_k = x_{k-1} - \alpha \nabla_x F(x_{k-1})$
• alpha = 0.05 seems to work
• $J(\theta) = 1/2 (\theta ^T X^T X \theta - 2 \theta^T X^T y + y^T y)$
• $\nabla_\theta J(\theta) = X^T X \theta - X^T Y$
• = $\sum_i x_i (x_i^T \theta - y_i)$
• this represents residuals * examples
3. stochastic gradient descent
• don’t use all training examples - approximates gradient
• single-sample
• mini-batch (usually better in offline case)
• coordinate-descent algorithm
• online algorithm - update theta while training data is changing
• when to stop?
• predetermined number of iterations
• stop when improvement drops below a threshold
• each pass of the whole data = 1 epoch
• benefits
1. less prone to getting stuck to shallow local minima
2. don’t need huge ram
3. faster
4. newton’s method for optimization
• second-order optimization - requires 1st & 2nd derivatives
• $\theta_{k+1} = \theta_k - H_K^{-1} g_k$
• update uses the inverse Hessian in place of a fixed learning rate - derived from a second-order Taylor approximation
• finding inverse of Hessian can be hard / expensive

# evaluation

• accuracy = number of correct classifications / total number of test cases
• you train by lowering SSE or MSE on training data
• report MSE for test samples
• cross validation - don’t have enough data for a test set
• data is reused
• k-fold - split data into N pieces
• N-1 pieces for fit model, 1 for test
• cycle through all N cases
• average the values we get for testing
• leave one out (LOOCV)
• train on all the data and only test on one
• then cycle through everything
• regularization path of a regression - plot each coeff v. $\lambda$
• tells you which features get pushed to 0 and when

# 1 - simple LR

• ml: task -> representation -> score function -> optimization -> models
• all of these things are assumptions

# 2 - LR with non-linear basis functions

• can have nonlinear basis functions (ex. polynomial regression)
• radial basis function - ex. kernel function (Gaussian RBF)
• $exp(-(x-r)^2 / (2 \lambda ^2))$
• non-parametric algorithm - don’t get any parameters theta; must keep data

# 3 - locally weighted LR

• recompute model for each target point
• instead of minimizing SSE, we minimize SSE weighted by each observation’s closeness to the sample we want to query

# 4 - linear regression model with regularizations

• when $(X^T X)$ isn’t invertible can’t use normal equations and gradient descent is likely unstable
• X is nxp, usually $n \gg p$ and X almost always has rank p
• problems when n < p
• intuitive way to fix this problem is to reduce p by getting rid of features
• a lot of papers assume your data is already zero-centered
• conventionally don’t regularize the intercept term

### regularizations

1. ridge regression (L2)
• if (X^T X) not invertible, add a small element to diagonal
• then it becomes invertible
• small lambda -> numerical solution is unstable
• proof of why it’s invertible is difficult
•  argmin $\sum_i (y_i - \hat{y_i})^2+ \lambda \|\beta\|_2^2$
• equivalent to minimizing $\sum_i (y_i - \hat{y_i})^2$ s.t. $\sum_j \beta_j^2 \leq t$
• solution is $\hat{\beta}_\lambda = (X^TX+\lambda I)^{-1} X^T y$
• when $X^TX=I$, $\beta_{Ridge} = \frac{1}{1+\lambda} \beta_{Least \: Squares}$
2. lasso regression (L1)
•  $\sum_i (y_i - \hat{y_i})^2+\lambda \|\beta\|_1$
•  equivalent to minimizing $\sum_i (y_i - \hat{y_i})^2$ s.t. $\sum_j |\beta_j| \leq t$
• “least absolute shrinkage and selection operator” (L1)
• acts in a nonlinear manner on the outcome y
• keep the same SSE loss function, but add constraint of L1 norm
• doesn’t have closed form for Beta
• because of the absolute value, gradient doesn’t exist
• can use directional derivatives
• best solver is LARS - least angle regression
• if tuning parameter is chosen well, will set lots of coordinates to 0
• convex functions / convex sets (like circle) are easier to solve
• if p>n, lasso selects at most n variables
• if pairwise correlations are very high, lasso only selects one variable
3. elastic net - hybrid of the other two
•  $\hat{\beta}_{NaiveENet} = \underset{\beta}{argmin} \sum_i (y_i - \hat{y_i})^2+\lambda_1 \|\beta\|_1 + \lambda_2 \|\beta\|_2^2$
• l1 part generates sparse model
• l2 part encourages grouping effect, stabilizes l1 regularization path
• grouping effect - group of highly correlated features should all be selected
• naive elastic net has too much shrinkage so we scale $\beta_{ENet} = (1+\lambda_2) \beta_{NaiveENet}$
• to solve, fix l2 and solve lasso
• qi notes
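The closed-form ridge solution above is easy to check numerically. A minimal NumPy sketch (the data shapes and λ = 1 are toy values made up for illustration), including the shrinkage identity $\beta_{Ridge} = \beta_{LS}/(1+\lambda)$ for an orthonormal design:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam*I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# toy data with n < p: plain least squares fails (X^T X singular), ridge still works
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))           # n=5 samples, p=10 features (hypothetical)
y = rng.normal(size=5)
beta = ridge(X, y, lam=1.0)

# when X^T X = I, ridge just shrinks the least-squares solution by 1/(1+lam)
Q = np.eye(4)                          # orthonormal design
y2 = np.arange(4.0)
beta_ls = Q.T @ y2                     # least-squares solution for X = I
beta_ridge = ridge(Q, y2, lam=1.0)     # should equal beta_ls / 2
```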

# linear discriminant analysis

• PCA - find component axes that maximize the variance of our data
• “unsupervised” - ignores class labels
• LDA - maximize the separation between multiple classes
• “supervised” - computes directions (linear discriminants) that represent axes that maximize separation between multiple classes
• used as dimensionality reduction technique
• project a dataset onto a lower-dimensional space with good class-separability

# datasets

• resting state fMRI gives a time-series of things turning on
• we want to model correlations between everything, we use a gaussian graphical model
• brain atlas - serial sections of brain images
• histology - the study of the microscopic structure of tissues
• leave-one-out cross validation and try to classify autism / non-autism
• see how well autism subjects are identified
• how good is the final connectome?
1. ABIDE
• normal group ~500
• has brain imaging
• subjects as rows, features as cols
• each feature can be an ROI
• autism group ~500
• has brain imaging
• no molecular measurements
2. ABA data
• has both genotype & phenotype level data
• don’t know what kind of phenotype
• mostly about genotype data
• human ROI has ~200
• in clustering case, 30000 recordings, need to cluster into groups

# algorithms

• SLFA
• the features (ex. ROI) have clusters
• SLFA tries to find dependencies between the clusters instead of the variables
• group and group dependency
• this works better in genotype case
• SIMULE
• context-sensitive graph

# kendall tau

• https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient
• (# concordant pairs - # discordant pairs) / (n*(n-1)/2)
• concordant if both ranks agree
• must do something special if tied
• matlab has pretty fast implementation, R slow
• want to speed up / parallelize this - use multicore, not gpu
• the Kendall correlation is not affected by how far from each other ranks are but only by whether the ranks between observations are equal or not
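The pair-counting definition above translates directly into code; a naive $O(n^2)$ sketch for the tie-free case (tau-a):

```python
def kendall_tau(x, y):
    """Tau-a: (concordant - discordant) / (n choose 2); assumes no ties."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:                  # both rankings order the pair the same way
                concordant += 1
            elif s < 0:                # the rankings disagree on this pair
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

The all-pairs loop is what makes a fast (MATLAB-style) implementation worth parallelizing; each pair is independent, so the work splits cleanly across cores.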

# gaussian graphical model

• likelihood = probability of seeing data given parameters of the model
• clustering algorithm - associate conditional probability with each node (relation between the nodes, given all the other nodes)
• the weights in a network make local assertions about the relationships between neighboring nodes
• inference algorithms turn these local assertions into global assertions about the relationships between nodes
•  $P(A \mid B) = P(A,B) / P(B)$
• can be used for learning (given inputs, outputs)
• A Gaussian graphical model is a graph in which all random variables are continuous and jointly Gaussian.
• see defs.png
• precision matrix - inverse of covariance matrix; its entries give partial (conditional) correlations, and zero entries correspond to conditional independence
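The zero-pattern of the precision matrix can be seen on a toy 3-variable chain (the numbers below are a made-up positive-definite precision matrix, not from any dataset): the endpoints are marginally correlated, but the precision entry linking them is zero.

```python
import numpy as np

# chain X1 - X2 - X3: X1 and X3 are conditionally independent given X2,
# so the (1,3) entry of the precision matrix is zero
theta = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])   # tridiagonal, positive definite
sigma = np.linalg.inv(theta)             # covariance: generally dense
recovered = np.linalg.inv(sigma)         # back to the precision matrix
```

`sigma[0, 2]` is nonzero (marginal dependence between the chain's endpoints), while `recovered[0, 2]` is numerically zero (conditional independence given the middle variable).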

# graphical lasso

• optimize parameter to minimize regression between Y and B*X
• problem is hard because there are far fewer samples than variables, so the sample covariance matrix can’t be inverted
• coordinate-descent methods: optimize over one variable at a time
• l1-normalization makes it so there have to be a lot of 0s in B
• so does l0, but this is harder to solve
• have to regress all the variables against all the other variables
• graphical lasso lets us do this very efficiently with coordinate descent

# structure learning

• structure learning aims to discover the topology of a probabilistic network of variables such that this network represents accurately a given dataset while maintaining low complexity
• accuracy of representation - likelihood that the model explains the observed data
• complexity of a graphical model - number of parameters

# representational similarity learning

• aims to discover features that are important in representing (human-judged) similarities among objects
• can be posed as a sparsity-regularized multi-task regression problem
• related to representational similarity analysis

# latent dirichlet allocation

• generative model - explain observations from unobserved variables
• In LDA, each document may be viewed as a mixture of various topics
• similar to probabilistic latent semantic analysis (pLSA), except that in LDA the topic distribution is assumed to have a Dirichlet prior

# latent variable model

• relates manifest variables to latent variables
• responses on the manifest variables are result of latent variables
• manifest variables have local independence - nothing in common after controlling for latent variable
• latent factor models have been proposed to find concise descriptions of data
• Search

[toc]

# Uninformed Search – Russell & Norvig 3rd ed. (R&N) 3.1-3.4

### problem-solving agents

• goal - 1st step
• problem formulation - deciding what action and states to consider given a goal
• uninformed - given no info about problem besides definition
• an agent with several immediate options of unknown value can decide what to do first by examining future actions that lead to states of known value
• 5 components
1. initial state
2. applicable actions at each state
3. transition model
4. goal states
5. path cost function

#### problems

• toy problems
1. vacuum world
2. 8-puzzle (type of sliding-block puzzle)
3. 8-queens problem
4. Knuth conjecture
• real-world problems
1. route-finding
2. TSP (and other touring problems)
3. VLSI layout
4. automatic assembly sequencing

### searching for solutions

• start at a node and make a search tree
• frontier = open list = set of all leaf nodes available for expansion at any given point
• search strategy determines which state to expand next
• want to avoid redundant paths
1. TREE-SEARCH - continuously expand the frontier
2. GRAPH-SEARCH - keep track of previously visited states in explored set = closed set and don’t revisit
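GRAPH-SEARCH with a FIFO frontier is just BFS; a minimal sketch (the `neighbors` callback and toy graph are illustrative, not from the text):

```python
from collections import deque

def bfs_graph_search(start, goal, neighbors):
    """GRAPH-SEARCH with a FIFO frontier: returns a path with the fewest edges."""
    frontier = deque([[start]])
    explored = {start}                  # closed set: never revisit a state
    while frontier:
        path = frontier.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in neighbors(node):
            if nxt not in explored:     # skip redundant paths
                explored.add(nxt)
                frontier.append(path + [nxt])
    return None

graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
path = bfs_graph_search('A', 'D', lambda s: graph[s])
```

Dropping the `explored` set turns this into TREE-SEARCH, which re-expands states reached by multiple paths.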

### infrastructure

• node - data structure that contains parent, state, path-cost, action
• metrics
• completeness - does it find a solution
• optimality - does it find the best solution
• time/space complexity
•  theoretical CS: V + E
• b - branching factor - max number of branches of any node
• d - depth - number of steps from the root
• m - max length of any path in the search space
• search cost - just time/memory
• total cost - search cost + path cost
• bfs
• uniform-cost search - always expand node with lowest path cost g(n)
• frontier is priority queue ordered by g
• dfs
• backtracking search - dfs but only one successor is generated at a time; each partially expanded node remembers which successor to generate next
• only O(m) memory instead of O(bm)
• depth-limited search
• diameter of state space - longest possible distance to goal from any start
• iterative deepening dfs - like bfs explores entire depth before moving on
• iterative lengthening search - instead of depth limit has path-cost limit
• bidirectional search - search from start and goal and see if frontiers intersect
• just because they intersect doesn’t mean it was the shortest path
• can be difficult to search backward from goal (ex. N-queens)
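Iterative deepening from the list above can be sketched as a depth-limited DFS wrapped in a growing limit (the toy graph is illustrative):

```python
def depth_limited(node, goal, neighbors, limit):
    """DFS that gives up below depth `limit`; returns a path or None."""
    if node == goal:
        return [node]
    if limit == 0:
        return None
    for nxt in neighbors(node):
        result = depth_limited(nxt, goal, neighbors, limit - 1)
        if result is not None:
            return [node] + result
    return None

def iddfs(start, goal, neighbors, max_depth=50):
    """Iterative deepening: DFS memory O(bd), yet finds a shallowest goal like BFS."""
    for limit in range(max_depth + 1):
        result = depth_limited(start, goal, neighbors, limit)
        if result is not None:
            return result
    return None

edges2 = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
found = iddfs('A', 'D', lambda s: edges2[s])
```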

# A* Search and Heuristics – R&N 3.5-3.6

• informed search - uses problem-specific knowledge
• has evaluation function f which likely incorporate g and h
• heuristic h = estimated cost of cheapest path from state at node n to a goal state
• best-first - choose nodes with best f
• greedy best-first search - keep expanding node closest to goal
• A* search
• $f(n) = g(n) + h(n)$ represents the estimated cost of the cheapest solution through n
• A* (with tree search) is optimal and complete if h(n) is admissible
• h(n) never overestimates the cost to reach the goal
• A* (with graph search) is optimal and complete if h(n) is consistent (stronger than admissible)
• $h(n) \leq cost(n \to n') + h(n')$
• can draw contours of f (because f is nondecreasing along any path)
• A* is also optimally efficient (guaranteed to expand the fewest nodes) for any given consistent heuristic, because any algorithm that expands fewer nodes runs the risk of missing the optimal solution
• for a heuristic, absolute error $\delta := h^* - h$ and relative error $\epsilon := \delta / h^*$
• here $h^*$ is actual cost of root to goal
• bad when lots of solutions with small absolute error because it must try them all
• bad because it must store all nodes in memory
• memory-bounded heuristic search
• iterative-deepening A* - iterative deepening with cutoff on f-cost
• recursive best-first search - like standard best-first search but with linear space
• each node keeps f_limit variable which is best alternative path available from any ancestor
• as it unwinds, each node is replaced with backed-up value - best f-value of its children
• decides whether it’s worth reexpanding subtree later
• often flips between different good paths (h is usually less optimistic for nodes close to the goal)
• SMA* - simplified memory-bounded A* - best-first until memory is full, then forget the worst leaf node and add the new leaf
• store forgotten leaf node info in its parent
• on hard problems, too much time switching between nodes
• agents can also learn to search with metalevel learning
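The A* definitions above fit in a few lines with a priority queue ordered by f = g + h (the toy weighted graph and heuristic table are made up, but the heuristic is consistent):

```python
import heapq

def a_star(start, goal, neighbors, h):
    """A* graph search: expand by f(n) = g(n) + h(n); optimal if h is consistent."""
    frontier = [(h(start), 0, start, [start])]   # (f, g, state, path)
    best_g = {start: 0}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path, g
        for nxt, cost in neighbors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float('inf')):  # found a cheaper path to nxt
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None, float('inf')

costs = {'S': [('A', 1), ('B', 4)], 'A': [('B', 2), ('G', 6)],
         'B': [('G', 1)], 'G': []}
h_table = {'S': 3, 'A': 2, 'B': 1, 'G': 0}        # admissible and consistent
path, cost = a_star('S', 'G', lambda s: costs[s], lambda s: h_table[s])
```

With `h = 0` this degenerates to uniform-cost search, which is why the notes treat UCS as the uninformed special case.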

### heuristic functions

• effective branching factor $b^*$ - if total nodes generated by A* is N and solution depth is d, then $b^*$ is the branching factor of a uniform tree of depth d with N+1 nodes: $N+1 = 1+b^* +(b^*)^2 + ... + (b^*)^d$
• want $b^*$ close to 1
• generally want a bigger heuristic: every node with $f(n) < C^*$ will be expanded, so the fewer such nodes the better
• h1 dominates h2 if $h_1(n) \geq h_2(n) \; \forall n$
• relaxed problem - removes constraints and adds edges to the graph
• solution to original problem still solves relaxed problem
• cost of optimal solution to a relaxed problem is an admissible heuristic for the original problem
• also is consistent
• when there are several good heuristics, pick h(n) = max(h1(n), …, hm(n)) for each node
• pattern database - heuristic stores exact solution cost for every possible subproblem instance
• disjoint pattern database - break into independent possible subproblems
• can learn heuristic by solving lots of problems using useful features
• aren’t necessarily admissible / consistent

# Local Search – R&N 4.1-4.2

• local search looks for solution not path
• maintains only current node and its neighbors
• more like optimization
• complete - finds a goal
• optimal - finds global min/max
• hill-climbing = greedy local search
• also stochastic hill climbing and random-restart hill climbing
• simulated annealing - pick random move
• if move better, then accept
• otherwise accept with some probability proportional to how bad it is and accept less as time goes on
• local beam search - pick k starts, then choose the best k states from their neighbors
• stochastic beam search - pick best k with prob proportional to how good they are
• genetic algorithms - population of k individuals
• each scored by fitness function
• pairs are selected for reproduction using crossover point
• each location subject to random mutation
• schema - substring in which some of the positions can be left unspecified (ex. $246**$)
• want schema to be good representation because chunks tend to be passed on together

### continuous space

• hill-climbing / simulated annealing still work
• could just discretize neighborhood of each state
• if possible, solve $\nabla f = 0$
• otherwise SGD $x = x + \alpha \nabla f(x)$
• can estimate gradient by evaluating response to small increments
• line search - repeatedly double $\alpha$ until f starts to increase again
• Newton-Raphson method
• uses 2nd deriv: x = x - g(x) / g’(x)
• $x = x - H_f^{-1} (x) \nabla f(x)$ where H is the Hessian of 2nd derivs
• constrained optimization
• contains linear programming problems in which constraints must be linear inequalities forming a convex set
• these have no local minima
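The Newton-Raphson update from the continuous-space bullets is short enough to sketch in one function (finding √2 as the root of a toy g(x) = x² − 2):

```python
def newton(g, g_prime, x0, tol=1e-12, max_iter=100):
    """Newton-Raphson root finding: x <- x - g(x) / g'(x)."""
    x = x0
    for _ in range(max_iter):
        step = g(x) / g_prime(x)
        x -= step
        if abs(step) < tol:          # converged: update is negligible
            break
    return x

# find sqrt(2) as the root of g(x) = x^2 - 2
root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
```

The multivariate version in the notes replaces `1 / g'(x)` with the inverse Hessian applied to the gradient.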

# Constraint satisfaction problems – R&N 6.1-6.5

• CSP
1. set of variables $X_1, …, X_n$
2. set of domains $D_1, …, D_n$
3. set of constraints $C$ specifying allowable values
• each state is an assignment of variables
• consistent - doesn’t violate constraints
• complete - every variable is assigned
• constraint graph - nodes are variables and links connect any 2 variables that participate in a constraint
• unary constraint - restricts value of single variable
• binary constraint
• global constraint - arbitrary number of variables (doesn’t have to be all)
• converting graphs to only binary constraints
• every finite-domain constraint can be reduced to set of binary constraints w/ enough auxiliary variables
• another way to convert an n-ary CSP to a binary one is the dual graph transformation - create a new graph in which there is one variable for each constraint in the original graph and one binary constraint for each pair of original constraints that share variables
• also can have preference constraints instead of absolute constraints

### inference

• node consistency - prune domains violating unary constraints
• arc consistency - satisfy binary constraints
• uses AC-3 algorithm
• set of all arcs = binary constraints
• pick one and apply it
• if things changed, re-add all the neighboring arcs to the set
•  $O(cd^3)$ - domain = d, # arcs = c
• variable can be generalized arc consistent
• path consistency - consider constraints on triplets - PC-2 algorithm
• extends to k-consistency (although path consistency assumes binary constraint networks)
• strongly k-consistent - also (k-1) consistent, (k-2) consistent, … 1-consistent
• implies $O(k^2d)$
• establishing k-consistency time/space is exponential in k
• global constraints can have more efficient algorithms
• ex. assign different colors to everything
• resource constraint = atmost constraint - sum of variable must not exceed some limit
• bounds propagation - make sure variables can be allotted to solve resource constraint
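AC-3 from the bullets above can be sketched with explicit domains and binary-constraint predicates (the X < Y example and the dict-of-lambdas encoding are my own illustration, not R&N's pseudocode):

```python
def ac3(domains, constraints):
    """AC-3: prune domains until every binary arc is consistent.

    domains: {var: set of values}; constraints: {(xi, xj): predicate(vi, vj)}.
    Returns False if some domain is wiped out (no solution possible)."""
    # include both directions for every constraint
    arcs = set(constraints) | {(j, i) for (i, j) in constraints}
    pred = dict(constraints)
    pred.update({(j, i): (lambda f: lambda a, b: f(b, a))(f)
                 for (i, j), f in constraints.items()})
    queue = list(arcs)
    while queue:
        xi, xj = queue.pop()
        # remove values of xi with no supporting value in xj
        removed = {a for a in domains[xi]
                   if not any(pred[(xi, xj)](a, b) for b in domains[xj])}
        if removed:
            domains[xi] -= removed
            if not domains[xi]:
                return False
            # something changed: re-add all arcs pointing at xi
            queue.extend((xk, xi) for (xk, xm) in arcs if xm == xi and xk != xj)
    return True

domains = {'X': {1, 2, 3}, 'Y': {1, 2, 3}}
ok = ac3(domains, {('X', 'Y'): lambda a, b: a < b})
```

After propagation X loses 3 (no larger Y value supports it) and Y loses 1, without any search.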

### backtracking

• CSPs are commutative - order of choosing states doesn’t matter
• backtracking search - depth-first search that chooses values for one variable at a time and backtracks when no legal values left
1. variable and value ordering
• minimum-remaining-values heuristic - assign variable with fewest choices
• degree heuristic - pick variable involved in largest number of constraints on other unassigned variables
• least-constraining-value heuristic - prefers value that rules out fewest choices for neighboring variables
2. interleaving search and inference
• forward checking - when we assign a variable in search, check arc-consistency on its neighbors
• maintaining arc consistency (MAC) - when we assign a variable, call AC-3, initializing with arcs to neighbors
3. intelligent backtracking - looking backward
• keep track of conflict set for each node (list of variable assignments that deleted things from its domain)
• backjumping - backtracks to most recent assignment in conflict set
• too simple - forward checking makes this redundant
• conflict-directed backjumping
• let $X_j$ be current variable and $conf(X_j)$ be conflict set. If every possible value for $X_j$ fails, backjump to the most recent variable $X_i$ in $conf(X_j)$ and set $conf(X_i) = conf(X_i) \cup conf(X_j) - X_i$
• constraint learning - finding the minimum set of variables/values from the conflict set that causes the failure = no-good
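Backtracking search with the minimum-remaining-values heuristic can be sketched on a small map-coloring CSP (the four-region subset of the Australia example and the plain MRV-without-forward-checking design are my simplifications):

```python
def backtracking(assignment, domains, neighbors):
    """Backtracking search with the minimum-remaining-values (MRV) heuristic."""
    if len(assignment) == len(domains):
        return assignment
    # MRV: pick the unassigned variable with the fewest values in its domain
    var = min((v for v in domains if v not in assignment),
              key=lambda v: len(domains[v]))
    for value in domains[var]:
        # consistent if no already-assigned neighbor has the same color
        if all(assignment.get(n) != value for n in neighbors[var]):
            result = backtracking({**assignment, var: value}, domains, neighbors)
            if result is not None:
                return result
    return None                        # no legal value: backtrack

# toy map coloring: SA touches everything, WA-NT-SA form a triangle
neighbors = {'WA': ['NT', 'SA'], 'NT': ['WA', 'SA', 'Q'],
             'SA': ['WA', 'NT', 'Q'], 'Q': ['NT', 'SA']}
domains = {v: ['r', 'g', 'b'] for v in neighbors}
solution = backtracking({}, domains, neighbors)
```

Here MRV only looks at static domain sizes; combining it with forward checking (pruning neighbor domains on each assignment) is what makes it bite.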

### local search for csps

• start with some assignment to variables
• min-conflicts heuristic - change variable to minimize conflicts
• can escape plateaus with tabu search - keep small list of visited states
• could use constraint weighting

### structure of problems

• connected components of constraint graph are independent subproblems
• tree - any 2 variables are connected by only one path
• directed arc consistency - ordered variables $X_i$, every $X_i$ is consistent with each $X_j$ for j>i
• tree with n nodes can be made directed arc-consistent in O(n) steps - $O(nd^2)$
• two ways to reduce constraint graphs to trees
1. assign variables so remaining variables form a tree
• assigned variables called cycle cutset with size c
• $O(d^c \cdot (n-c) d^2)$
• finding smallest cutset is hard, but can use approximation called cutset conditioning
2. tree decomposition - view each subproblem as a mega-variable
• tree width w - size of largest subproblem - 1
• solvable in $O(nd^{w+1})$
• also can look at structure in variable values
• ex. value symmetry - can assign different colorings
• use symmetry-breaking constraint - assign colors in alphabetical order
• structure learning

[toc]

# 1 - introduction

1. structured prediction - have multiple interdependent output variables
• output assignments are evaluated jointly
• requires joint (global) inference
• can’t use classifier because output space is combinatorially large
• three steps
1. model - pick a model
2. learning = training
3. inference = testing
2. representation learning - picking features
• usually use domain knowledge
1. combinatorial - ex. map words to higher dimensions
2. hierarchical - ex. first layers of CNN

# 2 - binary classification

• learn - learn w
• $w$ should point to positive examples
• inference - predict
• $\hat{y}=sign(w^T x)$
• losses
• usually don’t minimize 0-1 loss (combinatorial)
• usually $w^Tx$ includes b term, but generally we don’t want to regularize b
1. perceptron - tries to find separating hyperplane
• whenever misclassified, update w
• can add in delta term to maximize margin
• $\hat{w} = argmin_w \sum_i max(0, -y_i \cdot w^T x_i)$
2. linear svm
• $\hat{w} = argmin_w [w^Tw + C \sum_i max(0,1-y_i \cdot w^T x_i))]$
• minimize norm of weights s.t. the closest points to the hyperplane have a score 1
• stochastic sub-gradient descent
• learning different ws differently
3. logistic regression = log-linear model
• $\hat{w} = argmin_w [w^Tw + C \sum_i log(1+exp(-y_i \cdot w^T x_i))]$
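The perceptron's mistake-driven update (whenever misclassified, update w) can be sketched in pure Python; the toy 1-D data with an appended bias feature is made up for illustration:

```python
def perceptron(data, epochs=100):
    """Perceptron: on each misclassified (x, y), update w <- w + y * x.

    data: list of (x, y) with x a tuple of floats and y in {-1, +1}."""
    dim = len(data[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            # misclassified (or on the boundary) when y * w^T x <= 0
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:       # converged: data is linearly separable
            break
    return w

# linearly separable toy data; last coordinate is a constant bias feature
data = [((2.0, 1.0), 1), ((1.0, 1.0), 1), ((-1.0, 1.0), -1), ((-2.0, 1.0), -1)]
w = perceptron(data)
```

By the perceptron convergence theorem, the mistake loop terminates on separable data like this.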

# 3 - multiclass classification

• reducing multiclass (K categories) to binary
• one-against-all
• train K binary classifiers
• class i = positive otherwise negative
• take max of predictions
• one-vs-one = all-vs-all
• train C(K,2) binary classifiers
• labels are class i and class j
• inference - any class can get up to k-1 votes, must decide how to break ties
• flaws - learning only optimizes local correctness
• single classifier
• multiclass perceptron (Kesler)
• if label=i, want $w_i ^Tx > w_j^T x \quad \forall j$
• if not, update $w_i$ and $w_j$ accordingly
• Kesler construction
• $w = [w_1 … w_k]$
• want $w_i^T x > w_j^T x \quad \forall j$
• rewrite $w^T \phi (x,i) > w^T \phi (x,j) \quad \forall j$
• here $\phi (x,i)$ puts x in the ith spot and zeros elsewhere
• $\phi$ is often used for feature representation
• define margin: $\Delta (y,y’) = \begin{cases} \delta& if y \neq y’ \\ 0& if y=y’ \end{cases}$
• check if $y=\underset{y'}{argmax} \; w^T \phi(x,y') + \Delta (y,y')$
• multiclass SVMs (Crammer&Singer)
• minimize total norm of weights s.t. true label is score at least 1 more than second best label
• multinomial logistic regression = multi-class log-linear model
•  $P(y \mid x,w)=\frac{exp(w^T_y x)}{\sum_{y' \in \{1,…,K\}} exp(w_{y'}^T x)}$
• we control the peakedness of this by dividing by stddev
• soft-max: sometimes substitute this for $w^T_y x$
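The softmax in the multinomial logistic model above is usually computed with the max subtracted first so that `exp` never overflows; a minimal sketch:

```python
import math

def softmax(scores):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
big = softmax([1000.0, 0.0])        # would overflow without the max shift
```

Dividing the scores by a temperature before calling this is the "control the peakedness" knob mentioned above.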

# 4 - neural networks

• if neurons within layer are connected, not feed-forward
• remember to include bias unit at every layer
• typically, first layer has (# input features)
• last layer has (# classes)
• must convert labels to 1-of-K representation
• perceptron convergence thm - if data is linearly separable, perceptron learning algorithm will converge
• backprop yields a local optimum but works well in practice

# 5 - structure

• structured output can be represented as a graph
• outputs y
• inputs x
• two types of info are useful
• relationships between x and y
• relationships between y and y
• complexities
1. modeling - how to model?
2. train - can’t train separate weight vector for each inference outcome
3. inference - can’t enumerate all possible structures
• need to score nodes and edges
• could score nodes and edges independently
• could score each node and its edges together

# 6 - sequential models

## sequence models

• goal: learn distribution $P(x_1,…,x_n)$ for sequences $x_1,…,x_n$
• ex. text generation
• discrete Markov model
•  $P(x_1,…,x_n) = \prod_i P(x_i \mid x_{i-1})$
• requires
1. initial probabilities
2. transition matrix
• mth order Markov model - keeps history of previous m states
• each state is an observation

## hidden Markov model - this is generative

• goal
• learn distribution $P(x_1,…,x_n,y_1,…,y_n)$
• ex. POS tagging
• model
•  define $P(x_1,…,x_n,y_1,…,y_n) = P(y_1) P(x_1 \mid y_1) \prod_{i} P(y_i \mid y_{i-1}) P(x_i \mid y_i)$
• each output label is dependent on its neighbors in addition to the input
• definitions
• $\mathbf{y}$ - state
• states are not observed
• $\mathbf{x}$ - observation
• $\pi$ - initial state probabilities
•  A = transition probabilities $P(y_2 \mid y_1)$
•  B = emission probabilities $P(x_1 \mid y_1)$
• each state stochastically emits an observation
1. inference
• given $(\pi,A,B)$ and $\mathbf{x}$
1. calculate probability of $\mathbf{x}$
2. calculate most probable $\mathbf{y}$
•  define $P(x_1,…,x_n,y_1,…,y_n) = P(y_1) P(x_1 \mid y_1) \prod_{i} P(y_i \mid y_{i-1}) P(x_i \mid y_i)$
•  use MAP: $\hat{y}=\underset{y}{argmax} \; P(y \mid x,\pi, A,B)=\underset{y}{argmax} \; P(y, x \mid \pi, A,B)$
• use Viterbi algorithm
1. initial for each state s
•  $score_1(s) = P(s) P(x_1 \mid s) = \pi_s B_{s,x_1}$
2. recurrence - for i = 2 to n, calculate scores using previous score only
•  $score_i(s) = \underset{y_{i-1}}{max} \; P(s \mid y_{i-1}) P(x_i \mid s) \cdot score_{i-1}(y_{i-1})$
3. final state
•  $\hat{y}=\underset{y}{argmax} \; P(y,x \mid \pi, A,B)$; its probability is $\underset{s}{max} \; score_n (s)$
• complexity
• K = number of states
• M = number of observations
• n = length of sequence
• memory - nK
• runtime - $O(nK^2)$
1. learning - learn $(\pi,A,B)$
1. supervised (given y)
• basically just count (maximizing joint likelihood of input and output)
• $\pi_s = \frac{count(start \to s)}{n}$
• $A_{s’,s} = \frac{count(s \to s’)}{count(s)}$
• $B_{s,x} = \frac{count (s \to x)}{count(s)}$
2. unsupervised (not given y)
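The Viterbi recurrence above fits in a short function; the toy hot/cold HMM below uses made-up numbers purely to exercise the code:

```python
def viterbi(obs, states, pi, A, B):
    """Most probable state sequence for observations obs.

    pi[s]: initial prob; A[s][s2]: P(s2 | s); B[s][o]: P(o | s)."""
    score = {s: pi[s] * B[s][obs[0]] for s in states}   # score_1(s)
    back = []
    for o in obs[1:]:
        prev, score = score, {}
        back.append({})
        for s in states:
            # recurrence: best predecessor for landing in state s
            best = max(states, key=lambda s0: prev[s0] * A[s0][s])
            score[s] = prev[best] * A[best][s] * B[s][o]
            back[-1][s] = best
    # follow back-pointers from the best final state
    last = max(states, key=lambda s: score[s])
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return list(reversed(path))

# toy weather HMM (hypothetical numbers): hidden states hot/cold, obs in {1,2,3}
pi = {'H': 0.8, 'C': 0.2}
A = {'H': {'H': 0.7, 'C': 0.3}, 'C': {'H': 0.4, 'C': 0.6}}
B = {'H': {1: 0.2, 2: 0.4, 3: 0.4}, 'C': {1: 0.5, 2: 0.4, 3: 0.1}}
best = viterbi([3, 1, 3], ['H', 'C'], pi, A, B)
```

Memory is O(nK) for the back-pointers and runtime O(nK²), matching the complexity bullets above.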

## conditional models and local classifiers - discriminative model

• conditional models = discriminative models
•  goal: model $P(Y \mid X)$
• learns the decision boundary only
• ignores how the data is generated (unlike generative models)
• ex. log-linear models
•  $P(\mathbf{y} \mid \mathbf{x,w}) = \frac{exp(w^T \phi (x,y))}{\sum_{y'} exp(w^T \phi (x,y'))}$
•  training: $w = \underset{w}{argmax} \sum_i \log P(y_i \mid x_i,w)$
• ex. next-state model
•  $P(\mathbf{y} \mid \mathbf{x})=\prod_i P(y_i \mid y_{i-1},x_i)$
• ex. maximum entropy markov model
•  $P(y_i \mid y_{i-1},x) \propto exp( w^T \phi(x,i,y_i,y_{i-1}))$
• adds more things into the feature representation than HMM via $\phi$
• has label bias problem
• if state has fewer next states they get high probability
•  effectively ignores x if $P(y_i \mid y_{i-1})$ is too high
• ex. conditional random fields=CRF
• a global, undirected graphical model
• divide into factors
•  $P(Y \mid x) = \frac{1}{Z} \prod_i exp(w^T \phi (x,y_i,y_{i-1}))$
• $Z = \sum_{\hat{y}} \prod_i exp(w^T \phi (x,\hat{y_i},\hat{y}_{i-1}))$
• $\phi (x,y) = \sum_i \phi (x,y_i,y_{i-1})$
• prediction via Viterbi (with sum instead of product)
• training
•  maximize log-likelihood $\underset{w}{max} \; -\frac{\lambda}{2} w^T w + \sum_i \log P(y_i \mid x_i,w)$
• requires inference
• linear-chain CRF - only looks at current and previous labels
• ex. structured perceptron
• HMM is a linear classifier

# 7 - graphical models

• graphical models represent prob. distributions over multiple random variables

### 1 - Bayesian networks = causal networks (directed graphs)

• must be acyclic
• local independence - each node is independent of its non-descendants given its parents
•  $P(z_1,…,z_n)=\prod_i P(z_i \mid Parents(z_i))$
• topological independence - a node is independent of all other nodes given its parents, children, and children’s parents = markov blanket
• compact representation of the joint prob. distr.
• global independencies - D-separation
• sometimes Bayes nets cannot represent the independence relations we want conveniently
1. arrow direction unclear
2. independencies structures can be strange

### 2 - Markov networks = Markov random fields (undirected graphs)

• each node is independent of all other nodes given its immediate neighbors
• the nodes in a complete subgraph form a clique
• complete - all connections present
• $P_\theta(X) = \frac{1}{Z(\theta)} \prod_{c \in Cliques} f(x_c,\theta)$
• the joint probability decouples over cliques
• every clique $x_c$ is associated with a potential function $f(x_c,\theta)$
• partition function $Z(\theta) = \sum_x \prod_{c \in Cliques} f(x_c,\theta)$
• local independence - a node is independent of all other nodes given its neighbors
• global independence - if X,Y,Z are sets of nodes, X is conditionally independent of Y given Z if removing all nodes of Z removes all paths from X to Y
• factor graph - makes the factorization explicit
• replaces cliques with factors
• if x is dependent on all its neighbors
• Ising model - if x is binary
• Potts model - x is multiclass

### learning

• train via maximum likelihood

### inference

• compute probability of subset of states
• exact inference
• variable elimination
• belief propagation
• approximate inference
• MCMC
• variational algorithms
• loopy belief propagation

# 8 - constrained conditional models

## consistency of outputs and the value of inference

• ex. POS tagging - sentence shouldn’t have more than 1 verb
• inference
• a global decision comprising multiple local decisions and their inter-dependencies
1. local classifiers
2. constraints
• learning
• global - learn with inference (computationally difficult)

# 9 - inference

• inference constructs the output given the model
• goal: find highest scoring state sequence
• $argmax_y : score(y) = argmax_y w^T \phi(x,y)$
• naive: score all and pick max - terribly slow
• viterbi - decompose scores over edges
• questions
1. exact v. approximate inference
• exact - search, DP, ILP
• approximate = heuristic - Gibbs sampling, belief propagation, beam search, linear programming relaxations
2. randomized v. deterministic
• if run twice, do you get same answer
• ILP - integer linear programs
• combinatorial problems can be written as integer linear programs
• many commercial solvers and specialized solvers
• NP-hard in general
• special case of linear programming - minimizing/maximizing a linear objective function subject to a finite number of linear constraints (equality or inequality)
• in general, $c = \underset{c}{argmax}: c^Tx$ subject to $Ax \leq b$
• maybe more constraints like $x \geq 0$
• the constraint matrix defines a polytope
• only the vertices or faces of the polytope can be solutions
• $\implies$ can be solved in polynomial time
• in ILP, each $x_i$ is an integer
• LP-relaxation - drop the integer constraints and hope for the best
• 0-1 ILP - $\mathbf{x} \in {0,1}^n$
• decision variables for each label $z_A = 1$ if output=A, 0 otherwise
• don’t solve multiclass classification with an ILP solver (makes it harder)
• belief propagation
• variable elimination
1. fix an ordering of the variables
2. iteratively, find the best value given previous neighbors
• use DP - ex. Viterbi is max-product variable elimination
• when there are loops, require approximate solution
• uses message passing to determine marginal probabilities of each variable
• message $m_{ij}(x_j)$ high means node i believes $P(x_j)$ is high
• use beam search - keep size-limited priority queue of states

# 10/11 - learning protocols

## structural svm

• $\underset{w}{min} : \frac{1}{2} w^T w + C \sum_i \underset{y}{max} (w^T \phi (x_i,y)+ \Delta(y,y_i) - w^T \phi(x_i,y_i) )$

## empirical risk minimization

• ex. $f(x) = max ( f_1(x), f_2(x))$, solve the max then compute gradient of whichever function is argmax

## sgd for structural svm

• highest scoring assignment to some of the output random variables for a given input?
• loss-augmented inference - which structure most violates the margin for a given scoring function?
• adagrad - frequently updated features should get smaller learning rates
• Algorithms

[toc]

# asymptotics

• Big-O
• big-oh: O(g): functions that grow no faster than g - upper bound, runs in time less than g
• f(n)≤c*g(n) for some c, large n
• big-theta: Θ(g): functions that grow at the same rate as g
• big-oh(g) and big-theta(g) - asymptotic tight bound
• big-omega: Ω(g): functions that grow at least as fast as g
• f(n)≥c*g(n) for some c, large n
• Example: f = 57n+3
• O(n^2) - or anything bigger
• Θ(n)
• Ω(n^.5) - or anything smaller
• input must be positive
• We always analyze the worst case run-time
• little-omega: omega(g) - functions that grow faster than g
• little-o: o(g) - functions that grow slower than g
• we write f(n) ∈ O(g(n)), not f(n) = O(g(n))
• They are all reflexive and transitive, but only Θ is symmetric.
• Θ defines an equivalence relation.
• The difference between log10n and log2n is always a constant (about 3.322)
• existence then efficiency
• Upper bound: O(g(n)) - set of functions s.t. there exists c,k>0, 0 ≤ f(n) ≤ c*g(n), for all n > k
• o(g(n)) - O(g(n)) and not Ω(g(n))
• Tight bound: Θ(g(n)) - set of functions s.t. O(g(n)) and Ω(g(n))
• Lower bound: Ω(g(n)) - set of functions s.t. there exists c,k>0, 0 ≤ c*g(n) ≤ f(n), for all n > k
• ω(g(n)) - Ω(g(n)) but not O(g(n))
• add 2 functions, growth rate will be $O(max(g_1(n)+g_2(n))$ (same for sequential code)
• master theorem for recurrences $T(n) = a \cdot T(n/b) + f(n)$, with $c = log_b(a)$:
• if $f(n) = O(n^{c-\epsilon})$, then $T(n) = \Theta(n^c)$
• if $f(n) = \Theta(n^c)$, then $T(n) = \Theta(n^c \log n)$
• if $f(n) = \Omega(n^{c+\epsilon})$ (with a regularity condition on f), then $T(n) = \Theta(f(n))$
• Stirling’s formula: $n! \approx \sqrt{2 \pi n} \left(\frac{n}{e}\right)^n$
• corollary: $\log(n!) = \Theta(n \log n)$
• gives us a bound on sorting
• over bounded number of elements, almost everything is constant time

# recursion

• moving down/right on an NxN grid - each path has length (N-1)+(N-1)
• we must move right N-1 times
• ans = (N-1+N-1 choose N-1)
• for recursion, if a list is declared outside static recursive method, it shouldn’t be static
• generate permutations - recursive, add char at each spot
• think hard about the base case before starting
• look for lengths that you know
• look for symmetry
• n-queens - one array of length n, go row by row
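The one-array, row-by-row n-queens idea above can be sketched as a solution counter:

```python
def n_queens(n):
    """Count solutions using one array: cols[r] = column of the queen in row r."""
    def safe(cols, r, c):
        # no earlier queen shares the column or either diagonal
        return all(cols[r0] != c and abs(cols[r0] - c) != r - r0
                   for r0 in range(r))

    def place(cols, r):
        if r == n:                       # placed a queen in every row
            return 1
        return sum(place(cols + [c], r + 1)
                   for c in range(n) if safe(cols, r, c))

    return place([], 0)
```

Going row by row bakes the "one queen per row" constraint into the representation, so only columns and diagonals need checking.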

# dynamic programming

//returns max value for knapsack of capacity W, weights wt, vals val
int knapSack(int W, int[] wt, int[] val) {
    int n = wt.length;
    int[][] K = new int[n+1][W+1];
    //build table K[][] in bottom-up manner
    for (int i = 0; i <= n; i++) {
        for (int w = 0; w <= W; w++) {
            if (i == 0 || w == 0)         // base case
                K[i][w] = 0;
            else if (wt[i-1] <= w)        // max of including item i vs. not including it
                K[i][w] = Math.max(val[i-1] + K[i-1][w - wt[i-1]], K[i-1][w]);
            else                          // weight too large to include
                K[i][w] = K[i-1][w];
        }
    }
    return K[n][W];
}


# hungarian

• assign N things to N targets, each with an associated cost

# max-flow

• A list of pipes is given, with different flow-capacities. These pipes are connected at their endpoints. What is the maximum amount of water that you can route from a given starting point to a given ending point?

# sorting

• you can assume w.l.o.g. all input numbers are unique
• sorting requires Ω(nlog n) (proof w/ tree)
• considerations: worst case, average, in practice, input distribution, stability (order coming in is preserved for things with same keys), in-situ (in-place), stack depth, having to read/write to disk (disk is much slower), parallelizable, online (more data coming in)
• adaptive - changes its behavior based on input (ex. bubble sort will stop)

## comparised-based

### bubble sort

• keep swapping adjacent pairs
for i=1:n-1
if a[i+1]<a[i]
swap(a,i,i+1)

• have a flag that tells if you did no swaps - done
• number of passes ~ how far elements are from their final positions
• O(n^2)

### odd-even sort

• swap even pairs
• then swap odd pairs
• parallelizable

### selection sort

• move largest to current position
for i=n:-1:2
    jmax = 1
    for j=1:i
        if a[j] > a[jmax]
            jmax = j
    swap(a,i,jmax)
• O(n^2)

### insertion sort

• insert each item into the sorted prefix
for i=2:n
    insert a[i] into a[1..(i-1)], shifting larger elements right
• O(n^2); O(nk) where k is max distance from final position
• best when almost sorted
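The shifting insertion described above, as a minimal Java sketch (class and method names are mine):

```java
import java.util.Arrays;

public class InsertionSort {
    // insert a[i] into the already-sorted prefix a[0..i-1], shifting right
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];   // shift larger elements one slot right
                j--;
            }
            a[j + 1] = key;        // drop the item into its gap
        }
    }

    public static void main(String[] args) {
        int[] a = {3, 1, 2};
        insertionSort(a);
        System.out.println(Arrays.toString(a)); // [1, 2, 3]
    }
}
```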

### heap sort

• insert everything into heap
• keep removing the max
• can do in place by storing everything in array
• can use any height-balanced tree instead of heap
• traverse tree to get order
• ex. B-tree: multi-rotations occur infrequently, average O(log n) height
• O(n log n)

### smooth sort

• collection of heaps (each one is a factor larger than the one before)
• can add and remove in essentially constant time if data is in order

### merge sort

• split into smaller arrays, sort, merge
• T(n) = 2T(n/2) + n = O(n log n)
• stable, parallelizable (if parallel, not in place)
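The split/sort/merge recursion can be sketched in Java (class and method names are mine); the `<=` in the merge is what keeps it stable:

```java
import java.util.Arrays;

public class MergeSort {
    // split into halves, sort each, merge the sorted halves
    static int[] mergeSort(int[] a) {
        if (a.length <= 1) return a;
        int mid = a.length / 2;
        int[] l = mergeSort(Arrays.copyOfRange(a, 0, mid));
        int[] r = mergeSort(Arrays.copyOfRange(a, mid, a.length));
        int[] out = new int[a.length];
        int i = 0, j = 0, k = 0;
        while (i < l.length && j < r.length)              // merge step
            out[k++] = (l[i] <= r[j]) ? l[i++] : r[j++];  // <= preserves stability
        while (i < l.length) out[k++] = l[i++];
        while (j < r.length) out[k++] = r[j++];
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(mergeSort(new int[]{4, 1, 3, 2}))); // [1, 2, 3, 4]
    }
}
```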

### quicksort

• split on pivot, put smaller elements on left, larger on right
• O(n log n) average, O(n^2) worst
• O(log n) space

### shell sort

• generalize insertion sort
• insertion-sort all items i apart where i starts big and then becomes small
• sorted after last pass (i=1)
• O(n^2), O(n^(3/2)), … the exact complexity depends on the gap sequence
• no gap sequence gets it down to O(n log n)
• not used much in practice

## not comparison-based

### counting sort

• use values as array indices into a count array
• keep a count of the number of occurrences at each index
• for specialized data only - needs small values
• O(n+k) time, O(k) space
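The count-then-rebuild idea as a minimal Java sketch (class and method names are mine; assumes non-negative values up to `maxVal`):

```java
import java.util.Arrays;

public class CountingSort {
    // count occurrences of each value, then rewrite the array in order;
    // only works for small non-negative integer keys
    static void countingSort(int[] a, int maxVal) {
        int[] count = new int[maxVal + 1];
        for (int x : a) count[x]++;     // tally each value
        int k = 0;
        for (int v = 0; v <= maxVal; v++)
            while (count[v]-- > 0)      // emit each value count[v] times
                a[k++] = v;
    }

    public static void main(String[] args) {
        int[] a = {3, 0, 2, 3, 1};
        countingSort(a, 3);
        System.out.println(Arrays.toString(a)); // [0, 1, 2, 3, 3]
    }
}
```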

### bucket sort

• spread data into buckets based on value
• sort the buckets
• O(n+k) time
• buckets could be trees

### radix sort

• sort each digit in turn
• stable sort on each digit
• like bucket sort d times
• O(d·n) time, O(k+n) space

### meta sort

• like quicksort, but O(n log n) worst case
• run quicksort, mergesort in parallel
• stop when one stops
• there is an overhead but doesn’t affect big-oh analysis
• average, worst-case = O(n log n)

## sorting overview

• in exceptional cases insertion-sort or radix-sort are much better than the generic QuickSort / MergeSort / HeapSort answers.
• merge a and b sorted - start from the back
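The "merge from the back" trick above, sketched in Java (class and method names are mine; assumes `a` has `b.length` free slots at its end):

```java
import java.util.Arrays;

public class MergeFromBack {
    // a holds aLen sorted values plus free space for all of b at the end;
    // writing from the back means no element of a is overwritten before read
    static void merge(int[] a, int aLen, int[] b) {
        int i = aLen - 1;               // last real element of a
        int j = b.length - 1;           // last element of b
        int k = aLen + b.length - 1;    // last slot of the merged result
        while (j >= 0)
            a[k--] = (i >= 0 && a[i] > b[j]) ? a[i--] : b[j--];
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 0, 0};      // two slots reserved for b
        merge(a, 3, new int[]{2, 4});
        System.out.println(Arrays.toString(a)); // [1, 2, 3, 4, 5]
    }
}
```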

# searching

• binary search can’t do better than linear if there are duplicates (e.g. finding all occurrences)
• if data is too large, we need to do external sort (sort parts of it and write them back to file)
• write binary search recursively
• use low<= val and high >=val so you get correct bounds
• binary search with empty strings - make sure that there is an element at the end of it
• “a”.compareTo(“b”) is -1
• we always round up for these
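The recursive binary search the notes ask for, with the low/high bound invariant, as a Java sketch (class and method names are mine):

```java
public class BinarySearch {
    // returns an index of target in sorted a[lo..hi], or -1 if absent
    static int search(int[] a, int target, int lo, int hi) {
        if (lo > hi) return -1;             // base case: empty range
        int mid = lo + (hi - lo) / 2;       // written this way to avoid int overflow
        if (a[mid] == target) return mid;
        if (a[mid] < target)                // target must be in the upper half
            return search(a, target, mid + 1, hi);
        return search(a, target, lo, mid - 1); // otherwise the lower half
    }

    public static void main(String[] args) {
        int[] a = {1, 3, 5, 7, 9};
        System.out.println(search(a, 7, 0, a.length - 1)); // 3
        System.out.println(search(a, 4, 0, a.length - 1)); // -1
    }
}
```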
• finding minimum is Ω(n)
• pf: assume an element was ignored, that element could have been minimum
• simple algorithm - keep track of best so far
• thm: n/2 comparisons are necessary because each comparison involves 2 elements
• thm: n-1 comparisons are necessary - need to keep track of knowledge gained
• every non-min element must win at least once (move from unknown to known)
• find min and max
• naive solution has 2n-2 comparison
• pairwise compare all elements, array of maxes, array of mins = n/2 comparisons
• check min array, max array = 2* (n/2-1)
• 3n/2-2 comparisons are sufficient (and necessary)
• pf: 4 categories (not tested, only won, only lost, both)
• not tested-> w or l =n/2 comparisons
• w or l -> both = n/2-1
• therefore 3n/2-2 comparisons necessary
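The pairwise min/max scheme above, as a Java sketch (class and method names are mine): each pair costs one comparison, then the smaller is tested against the min and the larger against the max, giving roughly 3n/2 comparisons total.

```java
public class MinMax {
    // compare elements in pairs; the pair's smaller value can only affect
    // the min, the larger only the max -> ~3n/2 comparisons instead of 2n-2
    static int[] minMax(int[] a) {
        int min = a[0], max = a[0];
        int start = (a.length % 2 == 0) ? 0 : 1;  // odd length: a[0] seeds both
        for (int i = start; i + 1 < a.length; i += 2) {
            int lo = Math.min(a[i], a[i + 1]);
            int hi = Math.max(a[i], a[i + 1]);
            if (lo < min) min = lo;
            if (hi > max) max = hi;
        }
        return new int[]{min, max};
    }

    public static void main(String[] args) {
        int[] r = minMax(new int[]{7, 2, 9, 4, 5});
        System.out.println(r[0] + " " + r[1]); // 2 9
    }
}
```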
• find max and next-to-max
• thm: n-2 + log(n) comparisons are sufficient
• consider elimination tournament, pairwise compare elements repeatedly
• 2nd best must have played best at some point - look for it in log(n)
• selection - find ith largest integer
• repeatedly finding median finds ith largest
• finding median linear yields ith largest linear
• T(n) = T(n/2) + M(n) where M(n) is time to find median
• quickselect - partition around pivot and recur
• average time linear, worst case O(n^2)
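Quickselect as described above (partition around a pivot, recurse into one side), sketched in Java with a Lomuto-style partition; class and method names are mine and `k` is 0-based:

```java
public class QuickSelect {
    // returns the k-th smallest element of a[lo..hi] (k is 0-based)
    static int quickSelect(int[] a, int lo, int hi, int k) {
        if (lo == hi) return a[lo];
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)            // partition: smaller values left
            if (a[j] < pivot) {
                int t = a[i]; a[i] = a[j]; a[j] = t;
                i++;
            }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;   // pivot lands in final slot i
        if (k == i) return a[i];
        return (k < i) ? quickSelect(a, lo, i - 1, k)   // recurse one side only
                       : quickSelect(a, i + 1, hi, k);
    }

    public static void main(String[] args) {
        int[] a = {9, 1, 8, 2, 7};
        System.out.println(quickSelect(a, 0, a.length - 1, 2)); // 7 (the median)
    }
}
```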
• median in linear time - quickly eliminate a constant fraction and repeat
• partition into n/5 groups of 5
• sort each group high to low
• find median of each group
• compute median of medians recursively
• move groups with larger medians to right
• move groups with smaller medians to left
• now we know 3/10 of elements larger than median of medians
• 3/10 of elements smaller than median of medians
• partition all elements around median of medians
• recur like quickselect
• guarantees each partition contains at most 7n/10 elements
• T(n) = T(n/5) + T(7n/10) + O(n); using f(x) + f(y) ≤ f(x+y) (superadditivity)
• T(n) ≤ T(9n/10) + O(n), and since n/5 + 7n/10 = 9n/10 < n this solves to T(n) = O(n)

# computational geometry

• range queries
• input = n points (vectors) with preprocessing
• output - number of points within any query rectangle
• 1D
• range query is a pair of binary searches
• O(log n) time per query
• O(n) space, O(n log n) preprocessing time
• 2D
• subtract out rectangles you don’t want
• add back things you double subtracted
• we want rectangles anchored at origin
• nD
• make regions by making a grid that includes all points
• precompute southwest counts for all regions - different ways to do this - tradeoffs between space and time
• O(log n) time per query (after precomputing) - binary search x,y
• polygon-point intersection
• polygon - a closed sequence of segments
• simple polygon - has no intersections
• thm (Jordan) - a simple polygon partitions the plane into 3 regions: interior, exterior, boundary
• convex polygon - intersection of half-planes
• polytope - higher-dimensional polygon
• raycasting
• intersections = odd - interior, even - exterior
• check for tangent lines, intersecting corners
• O(n) time per query, O(1) space and time
• convex case
• preprocessing
• find an interior point p (pick a vertex or average the vertices)
• partition into wedges (slicing through vertices) w.r.t. p
• sort wedges by polar angle
• query
• find containing wedge (look up by angle)
• test interior / exterior
• check triangle - cast ray from p to point, see if it crosses edge
• O(log n) time per query (we binary search the wedges)
• O(n) space and O(n log n) preprocessing time
• non-convex case
• preprocessing
• sort vertices by x
• find vertical slices
• partition into trapezoids (triangle is trapezoid)
• sort slice trapezoids by y
• query
• find containing slice
• find trapezoid in slice
• report interior/ exterior
• O(log n) time per query (two binary searches)
• O(n^2) space and O(n^2) preprocessing time
• convex hull
• input: set of n points
• output: smallest containing convex polygon
• simple solution 1 - Jarvis’s march
• simple solution 2 - Graham’s scan
• mergehull
• partition into two sets - compute MergeHull of each set
• merge the two resulting CHs
• pick point p with least x
• form angle-monotone chains w.r.t p
• merge chains into angle-sorted list
• run Graham’s scan to form CH
• T(n) = 2T(n/2) + n = O(n log n)
• generalizes to higher dimensions
• parallelizes
• quickhull (like quicksort)
• find right and left-most points
• partition points along this line
• find points farthest from line - make quadrilateral
• eliminate all internal points
• recurse on 4 remaining regions
• concatenate resulting CHs
• O(n log n) expected time
• O(n^2) worst-case time - ex. circle
• generalizes to higher dim, parallelizes
• lower bound - CH requires Ω(n log n) comparisons
• pf - reduce sorting to convex hull
• consider arbitrary set of x_i to be sorted
• raise the x_i to the parabola (x_i, x_i^2) - could be any strictly convex function
• compute the convex hull of the lifted points - they all lie on the hull, connected in x-sorted order
• from the convex hull we can read off the sorted x_i => the convex hull did the sorting, so it needs at least n log n comparisons
• corollary - Graham’s scan is optimal
• Chan’s convex hull algorithm
• assume we know CH size m=h
• partition points into n/m sets of m each
• convex polygon diameter
• Voronoi diagrams - input n points - takes O(n log n) time to compute
• problems that are solved
• Voronoi cell - the set of points closer to any given point than all others form a convex polygon
• generalizes to other metrics (not just Euclidean distance)
• a Voronoi cell is unbounded if and only if its point is on the convex hull
• corollary - convex hull can be computed in linear time
• Voronoi diagram has at most 2n-5 vertices and 3n-6 edges
• every nearest neighbor of a point defines an edge of the Voronoi diagram
• corollary - all nearest neighbors can be computed from the Voronoi diagram in linear time
• corollary - nearest neighbor search in O(log n) time using planar subdivision search (binary search in 2D)
• connection points of neighboring Voronoi diagram cells form a triangulation (Delaunay triangulation)
• a Delaunay triangulation maximizes the minimum angle over all triangulations - no long slivery triangles
• the Euclidean minimum spanning tree is a subset of the Delaunay triangulation (so it can be computed easily)
• calculating Voronoi diagram
• discrete case / bitmap - expand breadth-first waves from all points
• time is O(bitmap size)
• time is independent of #points
• intersecting half planes
• Voronoi cell of a point is intersection of all half-planes induced by the perpendicular bisectors w.r.t all other points
• use intersection of convex polygons to intersect half-planes (O(n log n) time per cell)
• can be computed in O(n log n) total time
1. idea divide and conquer
• merging is complex
2. sweep line using parabolas
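The O(n) raycasting test from the polygon-point intersection notes can be sketched in Java (class and method names are mine; this simple version ignores the degenerate tangent/corner cases the notes warn about):

```java
public class PointInPolygon {
    // cast a horizontal ray to the right from (px, py); an odd number of
    // edge crossings means the point is interior (Jordan curve theorem).
    // degenerate cases (ray through a vertex, point on the boundary) ignored
    static boolean contains(double[] xs, double[] ys, double px, double py) {
        boolean inside = false;
        int n = xs.length;
        for (int i = 0, j = n - 1; i < n; j = i++) {
            boolean straddles = (ys[i] > py) != (ys[j] > py); // edge crosses ray's y
            if (straddles) {
                // x-coordinate where the edge meets the horizontal line y = py
                double xAtY = xs[i] + (py - ys[i]) * (xs[j] - xs[i]) / (ys[j] - ys[i]);
                if (px < xAtY) inside = !inside;   // crossing to the right of the point
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        double[] xs = {0, 4, 4, 0}, ys = {0, 0, 4, 4};  // a 4x4 square
        System.out.println(contains(xs, ys, 2, 2)); // true
        System.out.println(contains(xs, ys, 5, 2)); // false
    }
}
```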

### cs

• C/C++ Reference
• The C memory model: global, local, and heap variables. Where they are stored, their properties, etc.
• Variable scope
• Using pointers, pointer arithmetic, etc.
• Managing the heap: malloc/free, new/delete

# C basic

• #include <stdio.h> for I/O functions like printf
• printf("the variable 'i' is: %d", i);
• can only use /* */ for comments
• for constants: #define MAX_LEN 1024

# malloc

• malloc ex.
• there is no bool keyword in C89 (C99 added _Bool and <stdbool.h>)
  /* We're going to allocate enough space for an integer
and assign 5 to that memory location */
int *foo;
/* malloc call: note the cast of the return value
from a void * to the appropriate pointer type */
foo = (int *) malloc(sizeof(int));
*foo = 5;
free(foo);

• char *some_memory = "Hello World";
• this creates a pointer to a read-only part of memory
• it’s disastrous to free a piece of memory that’s already been freed
• variables must be declared at the beginning of a function and must be declared before any other code

# memory

• heap variables stay - they are allocated with malloc
• local variables are stored on the stack
• global variables are stored in an initialized data segment

# structs

  struct point {
int x;
int y;
};
struct point *p;
p = (struct point *) malloc(sizeof(struct point));
p->x = 0;
p->y = 0;


# strings

• array with extra null character at end ‘\0’
• strlen doesn’t include the null character

# pointers

int *fake = NULL;       // NULL belongs in a pointer, not a plain int
int val = 20;
int * x;                // declare a pointer
x = &val;               // take the address of a variable
- can use pointer ++ and pointer -- to get next values


//Hello World
#include <iostream>
using namespace std; //always comes after the includes, like a weaker version of packages
//everything needs to be in a namespace, otherwise you have to write std::cout to look in iostream - you would use this for very long programs
int main(){ //main function, not part of a class, must return an int
    cout << "Hello World" << endl;
    return 0; //always return this, means it didn't crash
}

Preprocessor
#include <iostream>   //system file - angle brackets
#include "ListNode.h" //user file - inserts the contents of the file in this place
#ifndef - "if not defined"
#define - defines a macro (direct text replacement)
#define TRUE 0 //like a final int, we usually put it in all caps
if(TRUE == 0)
#define MY_OBJECT_H //doesn't give it a value - all it does is make #ifdef true and #ifndef false
#if/#ifdef needs to be closed with #endif
if 2 files include each other, we get into an include loop
we can solve this with include guards at the top of the .h files - everything is only defined once
odd.h:
#ifndef ODD_H
#define ODD_H
#include "even.h"
bool odd (int x);
#endif
even.h:
#ifndef EVEN_H
#define EVEN_H
#include "odd.h"
bool even (int x);
#endif

I/O
#include <iostream>
using namespace std;
int main(){
    int x;
    cout << "Enter a value for x: "; //the arrows show you which way the data is flowing
    cin >> x;
    return 0;
}

C++ Primitive Types - int can be 16, 32, or 64 bits depending on the platform; prefer double over float

If statement can take an int: if (0) is false, anything else is true. //careful: a single equals instead of double equals compiles - if(x = 0) assigns and evaluates to false

Compiler: clang++

Functions - you can only call functions that are declared above you in the file
function prototype - to compile mutually recursive functions, you declare the function with a semicolon instead of brackets and no body

bool even(int x); //called forward declaration / function prototype
bool odd(int x){
    if(x==0) return false;
    return even(x-1);
}
bool even(int x){
    if(x==0) return true;
    return odd(x-1);
}

Classes - need 3 separate files:
1. Header file that contains the class definition - like an interface - IntCell.h

#ifndef INTCELL_H //all .h files start w/ these
#define INTCELL_H
class IntCell{
    public: //visibility blocks, everything in this block is public
        IntCell(int initialValue=0); //default parameter: can be called with 1 or no arguments
        ~IntCell(); //destructor, takes no parameters
        int getValue() const; //const placed here means the method doesn't modify the object
        void setValue(int val);
    private:
        int storedValue;
        int max(int m);
};
#endif //all .h files end w/ this

2. C++ file that contains the class implementation - IntCell.cpp

#include "IntCell.h"
using namespace std; // (not really necessary, but…)
IntCell::IntCell( int initialValue ) : //default value only listed in .h file
    storedValue( initialValue ) { //initializer list: fieldname(value) shorthand
}
int IntCell::getValue( ) const {
    return storedValue;
}
void IntCell::setValue( int val ) { //this is how you define the body of a method
    storedValue = val;
}
int IntCell::max(int m){
    return 1;
}

3. C++ file that contains a main() - TestIntCell.cpp

#include <iostream>
#include "IntCell.h"
using namespace std;
int main(){
    IntCell m1; //calls default constructor - we don't use parentheses!
    IntCell m2(37);
    cout << m1.getValue() << " " << m2.getValue() << endl;
    m1 = m2; //there are no references - copies the bits in m2 into m1
    m2.setValue(40);
    cout << m1.getValue() << " " << m2.getValue() << endl;
    return 0;
}

Pointers - store the memory address of another object //we will assume everything is 32 bit
Can point to a primitive type or a class type
int * x; //pointer to int
char *y; //pointer to char
Rational * rPointer; //pointer to Rational
all pointers are 32 bits in size because they are just addresses
in a definition, * defines a pointer type: int * x;
in an expression, * dereferences: *x = 2; (this sets a value for what the pointer points to)
in a definition, & defines a reference type
in an expression, &x means get the address of x

int x = 1;            //Address 1000, value 1 - don't forget to make the pointee
int y = 5;            //Address 1004, value 5
int * x_pointer = &x; //Address 1008, value 1000
cout << x_pointer;    //prints the address 1000
cout << *x_pointer;   //prints the value at the address
*x_pointer = 2;       //this changes the value of x to 2
x_pointer = &y;       //x_pointer now stores the address of y
*x_pointer = 3;       //this changes the value of y to 3

int n = 30;
int * p;                //variables are not initialized to any value
*p = n;                 //undefined behavior: p doesn't point to memory you requested, unless it happens to point to memory you have allocated
int *p = NULL;          //this will still crash on dereference, but it is a better way to initialize

void swap(int * x, int * y) {
    int temp = *x;      //temp takes the value x points to
    *x = *y;            //the value x points to becomes the value y pointed to
    *y = temp;          //the value y points to becomes temp
}                       //at the end, x and y themselves (the addresses) are unchanged

int main() {
int a=0;
int b=3;
cout << "Before swap(): a: " << a << "b: "
<< b << endl;
swap(&b,&a);
cout << "After swap(): a: " << a << "b: "
<< b << endl;
return 0;
}

Dynamic Memory Allocation   //not very efficient
Static Memory Allocation - the compiler knows at compile time how much memory is needed
int someArray[10];  //declare array of 10 elements
int *value1_address = &someArray[3]; // declare a pointer to int
new keyword
returns a pointer to newly created "thing"
int main() {
int n;
cout << "Please enter an integer value: " ;         // read in a value from the user
cin >> n;
int * ages = new int [n];// use the user's input to create an array of int using new
for (int i=0; i < n; i++) {                // use a loop to prompt the user to initialize the array
cout << "Enter a value for ages[ " << i << " ]: ";
cin >> ages[i];
}
for(int i=0; i<n; i++) {            // print out the contents of the array
cout << "ages[ " << i << " ]: " << ages[i];
}
delete [] ages;  //finished with the array - clean up the memory used by calling delete
return 0;        //everything you allocate with new needs to be deleted - this is faster than Java
}
Generally, SomeTypePtr = new SomeType;
int * intPointer = new int;
delete intPointer; //for array, delete [] ages; -this only deals with the pointee, not the pointer
Accessing parts of an object
regular object:
Rational r;
r.num = 3;
for a pointer, dereference it:
Rational *r = new Rational();
(*r).num=4; //or r->num = 4; (shorthand)
char* x,y; //y is not a pointer!  Write it as char *x,y;

Linked Lists
List object keeps track of size, pointers to head, tail
head and tail are dummy nodes
ListNode holds a value, previous, and next
ListItr has a pointer to the current ListNode

Friend
class ListNode {
public:
ListNode();                //Constructor
private:                       //only this class can modify these fields
int value;
ListNode *next, *previous; //for doubly linked lists
friend class List;         //these classes can bypass private visibility
friend class ListItr;
};

Constructor - just has to initialize fields
Foo() {
ListNode* head = new ListNode(); //because we put the class type ListNode*, then we are creating a new local variable and not modifying the field
//head = new Listnode() - this works
}
Foo() {
ListNode temp;
head = &temp;                 //this ListNode is deallocated after the constructor ends, doesn't work
}
Assume int *x has been declared
And int y is from user input
Consider these separate C++ lines of code:
x = new int[10]; // 40 bytes
x = new int;     // 4 bytes
x = new int[y];  // y*4 bytes; sizeof(int) tells you how big an int is (4 bytes)

References - like a pointer, holds an address, with 3 main differences
1. Its address cannot change (its address is constant)
2. It MUST be initialized upon declaration
Cannot (easily) be initialized to NULL
3. Has implicit dereferencing
If you try to change the value of the reference, it automatically assumes you mean the value that the reference is pointing to
//can't use it when you need to change ex. ListItr has current pointer that changes a lot
Declaration
List sampleList;
List & theList = sampleList; //a reference has to be initialized to the object, not the address
void swap(int & x, int & y) {   //this passes in references
int temp = x;
x = y;
y = temp;
}
int main() {                    //easier to call, references are nice when dealing with parameters
int a=0;
int b=3;
cout << "Before swap(): a: " << a << "b: "
<< b << endl;
swap(b,a);
cout << "After swap(): a: " << a << "b: "
<< b << endl;
return 0;
}
You can access its value with just a period
| Location   | *             | &              |
|------------|---------------|----------------|
| Definition | "pointer to"  | "reference to" |
| Statement  | "dereference" | "address of"   |

subroutines
methods are in a class
functions are outside a class

Parameter passing
Call by value - actual parameter is copied into formal parameter
This is what Java always does - can be slow if it has to copy a lot
-actual object can't be modified
Call by reference - pass references as parameters
Use when formal parameter should be able to change the value of the actual argument
void swap (int &x, int &y);
Call by constant reference - parameters are constant and are passed by reference
Both efficient and safe
bool compare(const Rational & left, const Rational & right);
Can also return in different ways

C++ default class
1. Destructor                                           //this will do nothing
Frees up any resources allocated during the use of an object
2. Copy Constructor                                     //copies something over
Creates a new object based on an old object
IntCell copy = original;                         //or Intcell copy(original)
automatically called when object is passed by value into a subroutine
automatically called when object is returned by value from a subroutine
3. operator=()
also known as the copy assignment operator
intended to copy the state of original into copy
called when = is applied to two objects after both have been previously constructed
IntCell original;   //constructor called
IntCell copy;
copy = original;    //operator called
overrides the = operator    //operator overrides only work on objects, not pointers
(and a default constructor, if you don't supply one) //this will do nothing


C++ has visibility on the inheritance

class Name {
    public:
        Name(void) : myName("") { }
        ~Name(void) { }
        void SetName(string theName) { myName = theName; }
        void print(void) { cout << myName << endl; }
    private:
        string myName;
};

class Contact : public Name { //this is like Contact extends Name
    public:
        Contact(void) { myAddress = ""; }
        ~Contact(void) { }
        void SetAddress(string theAddress) { myAddress = theAddress; }
        void print(void) {
            Name::print(); //this can't access private fields in Name, needs to call print from the super class
            cout << myAddress << endl;
        }
    private:
        string myAddress;
};

C++ has multiple inheritance - you can have as many parent classes as you want
class Sphere : public Shape, public Comparable, public Serializable { };

Dispatch
Static - decision on which member function to invoke is made using the compile-time type of an object
    when you have a pointer:
    Person *p;
    p = new Student();
    p->print(); //will always call the Person print method - uses the type of the pointer
Dynamic - decision on which member function to invoke is made using the run-time type of an object
    incurs runtime overhead
    the program must maintain extra information
    the compiler must generate code to determine which member function to invoke
    syntax in C++: the virtual keyword (Java does this by default, i.e. everything is virtual)
Example
class A
    virtual void foo()
class B : public A
    virtual void foo()
void main()
    int which = rand() % 2;
    A *bar;
    if ( which ) bar = new A();
    else bar = new B();
    bar->foo(); //which foo() runs depends on the run-time type of *bar
    return 0;

    Virtual method tables - stores the virtual methods in an array
Each object contains a pointer to the virtual method table
In addition to any other fields
That table has the addresses of the methods
Any virtual method must follow the pointer to the object... (one pointer dereference)
Then follow the virtual method table pointer... (second pointer dereference)
Then lookup the method pointer
In Java default is Dynamic
In C++, default is Static - this is faster
When creating a subclass object, the constructor of each subclass overwrites the appropriate pointers in the virtual method table with the overridden method pointers


Abstract Classes
class foo {
    public:
        virtual void bar() = 0; //pure virtual - makes the class abstract
};

Types of multiple inheritance
1. Shared - what Person is in the diagram on the previous slide
2. Replicated (or repeated) - what gp_list_node is in the diagram on the previous slide
3. Non-replicated (or non-repeated) - a language that does not allow shared or replicated (i.e. no common ancestors allowed)
4. Mix-in - what Java (and others) use to fake multiple inheritance through the use of interfaces

In C++, replicated is the default
Shared can be done by specifying that a base class is virtual:
class student: public virtual person, public gp_list_node {
class professor: public virtual person, public gp_list_node {

Java has ArrayStoreException - makes sure the thing you are adding to the array is of the correct type
String[] a = new String[1];
Object[] b = a;
b[0] = new Integer (1);

• Java Reference

# data structures

- LinkedList, ArrayList
- add(Element e), add(int idx, Element e), get(int idx)
- remove(int index)
- remove(Object o)
- Stack
- push(E item)
- peek()
- pop()
- PriorityQueue
- peek()
- poll()
- default is min-heap
- PriorityQueue(int initialCapacity, Comparator<? super E> comparator)
- PriorityQueue(Collection<? extends E> c)
- HashSet, TreeSet
- HashMap
- put(K key, V value)
- get(Object key)
- keySet()
- if you try to get something that's not there, will return null

• default init capacities all 10-20
• clone() has to be cast from Object

# useful

iterator

- it.next() - returns value
- it.hasNext() - returns boolean
- it.remove() - removes last returned value


strings

- String.split(" |\\.|\\?") //split on space, ., and ?
- StringBuilder
- much faster at concatenating strings
- StringBuffer is the thread-safe variant, but slower
- StringBuilder s = new StringBuilder(seq); //seq is any CharSequence
- s.append("cs3v");
- s.charAt(int x), s.deleteCharAt(int x), substring
- s.reverse()
- Since String is immutable it can safely be shared between many threads
- formatting
String s = String.format("%d", 3);
"%05d"	//pad to fill 5 spaces
"%8.3f" //field width 8, 3 digits after the decimal point
"%-d"	//left justify
"%,d" 	//print commas ex. 1,000,000
| int | double | string |
|-----|--------|--------|
| d   | f      | s      |
new StringBuilder(s).reverse().toString()
int count = StringUtils.countMatches(s, something);
- integer
- String toString(int i, int base)
- int parseInt(String s, int base)
- array
char[] data = {'a', 'b', 'c'};
String str = new String(data);


sorting

- Arrays.sort(Array a)
- Collections.sort(Collection c), Collections.sort(Collection l, Comparator c)
- use mergeSort (with insertion sort if very small)
- Collections.reverseOrder() returns comparator opposite of default
class ComparatorTest implements Comparator<String>
public int compare(String one, String two) //if negative, one comes first
class Test implements Comparable<Object>
public int compareTo(Object two)


exceptions

• ArrayIndexOutOfBoundsException
• throw new Exception("Chandan type")

# higher level

• primitives - byte, short, char, int, long, float, double, boolean
• java only has primitive and reference types
• when you assign primitives to each other, it’s fine
• when you pass in a primitive, its value is copied
• when you pass in an object, its reference is copied
• you can modify the object through the reference, but can’t change the object’s address
• garbage collection
• once an object no longer referenced, gc removes it and reclaims memory
• jvm intermittently runs a mark-and-sweep algorithm
• runs when short-term stuff gets full
• older stuff moves to different part
• eventually older stuff is cleared

# object-oriented

| declare | instantiate | initialize |
|---------|-------------|------------|
| Robot k | new         | Robot()    |

• class method = static
• called with Foo.DoIt()
• initialized before constructor
• class shares one copy, can’t refer to non-static
• instance method - invoked on specific instance of the class
• called with f.DoIt()
• protected member is accessible within its class and subclasses
• Theory of Computation

# introduction

• Chomsky hierarchy of languages: $L_3 \subset L_2 \subset L_1 \subset L_R \subset L_0 \subset 2^{\Sigma^*}$
• each L is a set of languages
• $L_0=L_{RE}$ - unrestricted grammars - general phase structure grammars - recursively enumerable languages - include all formal grammars. They generate exactly all languages that can be recognized by a Turing machine.
• computable, maybe undecidable (if not in L_R)
• L_R - recursive languages - recognized by a Turing machine that halts on every input
• decidable
• L_1 - context-sensitive grammars - all languages that can be recognized by a linear bounded automaton
• L_2 - context-free grammars - these languages are exactly all languages that can be recognized by a non-deterministic pushdown automaton.
• L_3 - regular grammars - all languages that can be decided by a finite state automaton
•  contains Σ*, Σ* is countably infinite
• strings
• languages
• Σ* Kleene Closure has multiple definitions
•  {w | w is a finite length string ∧ w is a string over Σ}
•  {xw | w in Σ* ∧ x in Σ} ∪ {Ɛ}
• Σ_i has strings of length i
• problems
• automata
• delta v delta-hat - delta hat transitions on a string not a symbol
•  - notation writes the state between the symbols you have read and have yet to read
•  - notation with * writes the state before the symbols you have to read and after what you have read
• grammars
• leftmost grammar - expand leftmost variables first - doesn’t matter for context-free
• parse tree - write string on bottom
• sets
• finite
• countably infinite
• not countably infinite
• mappings
1. onto - each output has at least 1 input
2. 1-1 - each output has at most 1 input
3. total - each input has at least 1 output
4. function - each input has at most 1 output
• equivalence relation - reflexive, symmetric, transitive
• proof methods
• **read induction**
• library of babel
• distinct number of books, each contained, but infinite room

# ch 1-3 - finite automata, regular expressions

• alphabet - any nonempty finite set
• string - finite sequence of symbols from an alphabet
• induction hypothesis - assumption that P(i) is true
• lexicographic ordering - {Ɛ,0,1,00,01,10,11,000,…}
• finite automata - like a Markov chain w/out probabilities - 5 parts
1. states
2. E - finite set called the alphabet
3. f: Q x E -> Q is the transition function
• ex. f(q,0) = q’
4. start state
5. final states
• language - L(M)=A - means A is the set of all strings that the machine M accepts
•  A* = {$x_1x_2 \ldots x_k \mid k\geq0 \wedge x_i \in A$}
• A+ = A* - Ɛ
•  concatenation A ∘ B = {xy | x in A and y in B}
• regular language - is recognized by a finite automata
• class of regular languages is closed under union, concatenation, star operation
• nondeterministic automata
• can have multiple transition states for one symbol
• can transition on Ɛ
• can be thought of as a tree
• After reading that symbol, the machine splits into multiple copies of itself and follows all the possibilities in parallel. Each copy of the machine takes one of the possible ways to proceed and continues as before. If there are subsequent choices, the machine splits again.
• If the next input symbol doesn’t appear on any of the arrows exiting the current state, that copy of the machine dies.
• if any copy is in an accept state at the end of the input, the NFA accepts the input string.
• can also use regular expressions (stuff like unions) instead of finite automata
• to convert, first convert to gnfa
• gnfa (generalized nfa) - transitions are labeled with regular expressions; single accept state, separate from the start state
• nonregular languages - isn’t recognized by a finite automata
•  ex. C = {w | w has an equal number of 0s and 1s}
• requires infinite states

# ch 4 - properties of regular languages (except Sections 4.2.3 and 4.2.4)

• pumping lemma- proves languages not to be regular
•  if L is regular, there exists a constant n such that every string w in L with |w| ≥ n can be broken into 3 strings w=xyz, such that:
1. y≠Ɛ
2. |xy| ≤ n
3. For all k ≥ 0, x y^k z is also in L
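A standard worked example of using the lemma (the classic $\{0^n1^n\}$ argument, sketched here for reference):

```latex
% Claim: L = \{0^n 1^n \mid n \ge 0\} is not regular.
% Suppose L were regular with pumping constant n. Take w = 0^n 1^n \in L.
% Any split w = xyz with |xy| \le n and y \ne \epsilon puts y entirely inside the 0s:
%   x = 0^a, \quad y = 0^b \ (b \ge 1), \quad z = 0^{n-a-b} 1^n.
% Pumping with k = 2 gives
%   x y^2 z = 0^{n+b} 1^n \notin L \quad (\text{since } n + b \ne n),
% contradicting condition 3. Hence L is not regular.
```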
• closed under union, intersection, complement, concatenation, closure, difference, reversal
• convert NFA to DFA - write the possible routes to the final state, write the intermediate states, remove unnecessary ones
• minimization of DFAs
• eliminate any state that can’t be reached
• partition remaining states into blocks so all states in same block are equivalent
• can’t do this grouping for nfas

# ch 5 - context free grammars and languages

• $w^R$ = reverse
• context-free grammar - more powerful way to describe a language
• ex. substitution rules (generates 0#1)
• A -> 0A1
• A -> B
• B -> #
• def
1. variables - finite set
2. terminals - alphabet
3. productions
4. start variable
• recursive inference - start with terminals, show that string is in grammar
• derivation - sequence of substitutions to obtain a string
• can also make these into parse trees
• leftmost derivation - at each step we expand leftmost variable
• arrow with a star ($\Rightarrow^*$) means zero or more derivation steps at once
• parse tree - the derived string is read off the leaves at the bottom
• sentential form - the string at any step in a derivation
• proofs in section 5.2
• equivalent ways to show a string w is in the language:
1. parse tree
2. leftmost derivation
3. rightmost derivation
4. recursive inference
5. derivation
•  if-else grammar: $S \to \epsilon \mid SS \mid iS \mid iSeS$
• context-free grammars used for parsers (compilers), matching parentheses, palindromes, if-else, html, xml
• if a grammar generates a string in several different ways, we say that the string is derived ambiguously in that grammar
• ambiguity resolution
1. some operators take precedence
2. make things left-associative
• think about terms, expressions, factors
• in an unambiguous grammar, the leftmost derivation of each string is unique
• inherently ambiguous language - all its grammars are ambiguous
• ex: $L = \{a^nb^nc^md^m\} \cup \{a^nb^mc^md^n\}, n \geq 1, m\geq1$
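The substitution rules above (A -> 0A1, A -> B, B -> #) can be checked directly by mirroring the productions as recursion; a small sketch (the class name and method are hypothetical):

```java
class TinyCfg {
    // membership test for the grammar A -> 0A1 | B, B -> '#',
    // i.e. strings of the form 0^n # 1^n, mirroring the rules directly
    static boolean derivesA(String s) {
        if (s.equals("#")) return true;                       // A -> B -> #
        if (s.length() >= 3 && s.charAt(0) == '0'
                && s.charAt(s.length() - 1) == '1')
            return derivesA(s.substring(1, s.length() - 1));  // A -> 0A1
        return false;
    }

    public static void main(String[] args) {
        System.out.println(derivesA("00#11")); // true
        System.out.println(derivesA("0#11"));  // false
    }
}
```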

# ch 6 - pushdown automata (don’t need to know 6.3 proofs)

• pushdown automata - have extra stack of memory - equivalent to context-free grammar
• similar to the parser in a typical compiler
• two ways of accepting
1. entering accept state
2. accept by emptying stack
• convert from empty stack to accept state
• add a new bottom-of-stack symbol X_1
• a new start state pushes X_1, then the original start symbol Z_1 on top of it, then spontaneously transitions to q_0
• every state gets an epsilon-transition on X_1 to a new final accepting state - seeing X_1 means the original PDA emptied its stack
• convert from accept state to empty stack
• add symbol X_1 under Z_1 (so we never empty the stack unless we are in p - there are no transitions on X_1)
• all accept states epsilon-transition to a new state p
• p epsilon-transitions to itself, popping one stack symbol each time
• 6.3
• convert context free grammar to empty stack
• simulate leftmost derivations
• keep the remaining sentential form on the stack, leftmost variable on top
• if terminal remove
• if variable nondeterministically expand
• if empty stack, accept
• convert PDA to grammar
• variables of the form [pXq] - meaning the PDA can go from state p to state q while (net) popping X off the stack
• productions like [pXq] -> a, where a is the input consumed in going from p to q
• pushdown automata can transition on epsilon
• def:
1. transition function - takes (state,symbol,stack symbol) - returns set of pairs (new state, new string to put on stack - length 0, 1, or more)
2. start state
3. start symbol (stack starts with one instance of this symbol)
4. set of accepting states
5. set of all states
6. alphabet
7. stack alphabet
• ex. palindromes
1. push onto stack and continue OR
2. assume we are in middle and start popping stack - if empty, accept input up to this point
• label diagrams with i, X/Y - what input is used and new/old tops of stack
• ID for PDA: (state,remaining string,stack)
• conventionally, we put top of stack on left
• parsers generally behave like deterministic PDA
• DPDAs accept all regular languages, but not all context-free languages
• every DPDA language has an unambiguous grammar
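The palindrome PDA above can be sketched with an explicit stack. A real PDA guesses the middle nondeterministically; this simplified sketch handles even-length strings of the form $ww^R$, where the middle is known:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class EvenPalindrome {
    // PDA idea for { w w^R }: push the first half, then pop while matching the
    // second half; here we use the known midpoint instead of a nondeterministic guess
    static boolean accepts(String s) {
        if (s.length() % 2 != 0) return false;
        Deque<Character> stack = new ArrayDeque<>();
        int mid = s.length() / 2;
        for (int i = 0; i < mid; i++) stack.push(s.charAt(i));   // "push onto stack" phase
        for (int i = mid; i < s.length(); i++)
            if (stack.pop() != s.charAt(i)) return false;        // "pop and compare" phase
        return stack.isEmpty();                                  // empty stack => accept
    }

    public static void main(String[] args) {
        System.out.println(accepts("0110")); // true
        System.out.println(accepts("0100")); // false
    }
}
```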

# ch 7 - properties of CFLs (except pp. 295-297)

• Chomsky Normal Form
1. A->BC
2. A->a
• no epsilon productions
• for any variable A that can derive epsilon (A $\Rightarrow^*$ ε)
• if B -> CAD, also add B -> CD, then drop the productions that let A become epsilon
• no unit productions
• eliminate useless symbols
• works for any CFL (minus ε, if the language contains it)
• Greibach Normal Form
1. A->aw where a is terminal, w is string of 0 or more variables
2. every derivation of a string of length n takes exactly n steps
• generating - if x produces some terminal string w
• reachable - X is reachable if S ${\to}^* \alpha X \beta$ for some $\alpha,\beta$
• CFL pumping lemma - pick two small strings to pump
•  If L is a CFL, there exists a constant n such that for every z in L with |z| ≥ n, we can break z into 5 strings z=uvwxy, such that:
1. vx ≠ Ɛ
2. |vwx| ≤ n - the middle portion is not too long
3. For all i ≥ 0, $u v^i w x^i y \in$ L
• ex. ${0^n1^n}$
• often have to break it into cases
• proof uses Chomsky Normal Form
• not context free examples
•  $\{0^n1^n2^n \mid n\geq1\}$
•  $\{0^i1^j2^i3^j \mid i\geq 1,j\geq 1\}$
•  $\{ww \mid w \in \{0,1\}^*\}$
• closed under union, concatenation, closure, and positive closure, homomorphism, reversal, inverse homomorphism, substitutions
• intersection with a regular language (basically run in parallel)
• not closed under intersection, complement
• substitution - replace each letter of alphabet with a language
• s(a) = $L_a$
• if $w = ax$, $s(w) = s(a)s(x) = L_a s(x)$
• if L CFL, s(L) CFL
• time complexities
• O(n)
• CFG to PDA
• PDA final state -> empty stack
• PDA empty stack -> final state
• PDA to CFG: O($n^3$) with size O($n^3$)
• conversion to CNF: O($n^2$) with size O($n^2$)
• emptiness of CFL: O(n)
• testing emptiness - O(n)
• which symbols are reachable
• test membership with dynamic programming table - O(n^3)
• CYK algorithm
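The CYK membership test above can be sketched with the usual dynamic programming table over substring spans. The CNF grammar used here is a made-up example for $\{0^n1^n, n \geq 1\}$: S -> ZT | ZO, T -> SO, Z -> 0, O -> 1:

```java
import java.util.*;

class Cyk {
    // binary rules {head, left, right} of a CNF grammar for { 0^n 1^n, n >= 1 }
    static final String[][] BINARY = {
        {"S", "Z", "T"}, {"S", "Z", "O"}, {"T", "S", "O"}
    };

    static boolean accepts(String w) {
        int n = w.length();
        if (n == 0) return false;
        // table[i][j] = variables deriving the substring of length j+1 starting at i
        Set<String>[][] table = new HashSet[n][n];
        for (int i = 0; i < n; i++) {                    // unit rules Z -> 0, O -> 1
            table[i][0] = new HashSet<>();
            if (w.charAt(i) == '0') table[i][0].add("Z");
            if (w.charAt(i) == '1') table[i][0].add("O");
        }
        for (int len = 2; len <= n; len++)
            for (int i = 0; i + len <= n; i++) {
                Set<String> cell = new HashSet<>();
                for (int k = 1; k < len; k++)            // try every split point
                    for (String[] r : BINARY)
                        if (table[i][k - 1].contains(r[1])
                                && table[i + k][len - k - 1].contains(r[2]))
                            cell.add(r[0]);
                table[i][len - 1] = cell;
            }
        return table[0][n - 1].contains("S");            // start symbol derives whole string
    }

    public static void main(String[] args) {
        System.out.println(accepts("0011")); // true
        System.out.println(accepts("0101")); // false
    }
}
```

The three nested loops over span length, start position, and split point give the O(n^3) bound mentioned above.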

# ch 8 - intro to turing machines (except 8.5.3)

• Turing Machine def
1. states
2. start state
3. final states
4. input symbols
5. tape symbols (includes input symbols)
6. transition function $\delta(q,X)=(p,Y,D)$ - new state, symbol written, direction moved
7. B - blank symbol
• infinite blanks on either side
• arc has X/Y D with old/new tape symbols and direction
• if the TM enters accepting state, it accepts - assume it halts if it accepts
• we can think of Turing machine as having multiple tracks (symbol could represent a tuple like [X,Y])
• multitape TM has each head move independently, multitrack doesn’t
• commonly, one track holds the data and another holds a marker
• running time - number of steps that TM makes
• NTM - nondeterministic Turing machine - accepts no languages not accepted by a deterministic TM
• halts if enters a state q, scanning X, and there is no move for (q,X)
• restrictions that don’t change things
• tape infinite only to right
• TM can’t print blank
• simplified machines
• two stacks machine - one stack keeps track of left, one right
• every recursively enumerable language is accepted by a two-counter machine
• TM can simulate computer, and time is some polynomial multiple of computer time (O(n^3))
• limit on how big a number the computer can store - in one instruction, a word can only grow by 1 bit
• LBA - linear bounded automaton - Turing machine with left and right end markers
• programs might take infinitely long before terminating - can’t be decided
• turing machine can take 2 inputs: program P and input I
• ID - instantaneous description
• write $X_1X_2…qX_iX_{i+1}…$ where q is scanning X_i
• imagine a program that takes a program as input and answers yes or no: does it print “h”?
• modify it so that instead of answering no, it prints “h”
• now feed it to itself
• either way its behavior contradicts its own answer - paradox! therefore such a machine can’t exist
• TM simulating computer
1. tape that has memory
2. tape with instruction counter
3. tape with memory address
• reduction - we know X is undecidable - if solving Y implies solving X, then Y is undecidable
• if X reduces to Y, solving Y solves X
• define a total mapping from X to Y
• $X \leq _m Y$ - X reduces to Y - mapping reduction, solving Y solves X
• intractable - take a very long time to solve (not polynomial)
• <> notation means bitstring representation
• $= 0^n$
• $\langle M,w \rangle$ encodes a TM M together with an input w; used for questions like whether $w \in L(M)$
• KD - “known to be distinct”
• idempotent - R + R = R
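The TM definition above (transition function $\delta(q,X)=(p,Y,D)$, blanks on both sides, accept as soon as an accepting state is entered, halt-and-reject when no move is defined) can be sketched as a simulator. The toy machine in it, which accepts strings of all 0s, is a made-up example:

```java
import java.util.*;

class TmSim {
    record Key(String q, char x) {}
    record Move(String p, char y, int dir) {}   // dir: -1 = left, +1 = right

    // run until an accepting state is entered, or until no move exists (halt and reject)
    static boolean run(Map<Key, Move> delta, String q0, Set<String> accept,
                       String input, int maxSteps) {
        Map<Integer, Character> tape = new HashMap<>();   // every other cell is blank 'B'
        for (int i = 0; i < input.length(); i++) tape.put(i, input.charAt(i));
        String q = q0;
        int head = 0;
        for (int step = 0; step < maxSteps; step++) {
            if (accept.contains(q)) return true;
            Move m = delta.get(new Key(q, tape.getOrDefault(head, 'B')));
            if (m == null) return false;                  // halted without accepting
            tape.put(head, m.y());
            head += m.dir();
            q = m.p();
        }
        return false;                                     // step budget exhausted
    }

    // toy machine: scan right over 0s; a blank means accept, a 1 has no move (reject)
    static boolean allZeros(String s) {
        Map<Key, Move> delta = new HashMap<>();
        delta.put(new Key("q0", '0'), new Move("q0", '0', +1));
        delta.put(new Key("q0", 'B'), new Move("acc", 'B', +1));
        return run(delta, "q0", Set.of("acc"), s, 1000);
    }

    public static void main(String[] args) {
        System.out.println(allZeros("000")); // true
        System.out.println(allZeros("010")); // false
    }
}
```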

# ch 9 - undecidability (9.1,9.2,9.3)

• does this TM accept (the code for) itself as input?
• enumerate binary strings - add a leading 1
• express TM as binary string
• give it a number
• TM uses this for each transition
• separate transitions with 11
• diagonalization language - set of strings w_i such that w_i is not in L(M_i)
• make table with M_i as rows, w_j as cols
• complementing the diagonal gives the characteristic vector of L_d
• diagonal can’t be characteristic vector of any TM
• not RE
• recursive - complement is also recursive
• just switch accept and reject
• if language and complement are both RE, then L is recursive
• universal language - set of binary strings that encode a pair (M,w) where M is a TM, w $\in (0+1)^*$ - set of strings representing a TM and an input accepted by that TM
• there is a universal Turing machine such that L_u = L(U)
• L_u is undecidable: RE but not recursive
• halting problem - RE but not recursive
• Rice’s Thm - all nontrivial properties of the RE languages are undecidable
• property of the RE languages is a set of RE languages
• property is trivial if it is either empty or is all RE languages
• empty property $\emptyset$ is different from the property of being an empty language {$\emptyset$}
• ex. “the language is context-free, empty, finite, regular”
• however properties such as 5 states are decidable

# Ch 10 - 10.1-10.4 know the additional problems that are NP-complete

• intractable - can’t be solved in polynomial time
• NP-complete examples
1. boolean satisfiability
1. symbols like ∧, ¬, and parentheses are represented by themselves
2. x_i is represented by x followed by i in binary
• Cook’s thm - SAT is NP-complete
1. show SAT in NP
2. show all other NP reduce to SAT
• proof involves a matrix of cell/ID facts
• cols are IDs $0,1,…,p(n)$
• rows are $\alpha_0,\alpha_1,…,\alpha_{p(n)}$
• for any problem’s machine M, there is polynomial-time-converter for M that uses SAT decider to solve in polynomial time
2. 3SAT - easier to reduce things to
• AND of clauses each of which is the OR of exactly three variables or negated variables
• conjunctive normal form - if it is the AND of clauses
• conversion to cnf isn’t always polynomial time - we don’t have to convert to an equivalent expression, just one that is satisfiable exactly when the original is
1. push all negatives down the expression tree - linear
2. put it in cnf - demorgans, double negation
• literal - variable or a negated variable
• k-conjunctive normal form - k is number of literals in clauses
3. traveling salesman problem - find cycle of weight less than W
• O(m!)
4. Independent Set - graph G and a lower bound k - yes if and only if G has an independent set of k nodes
• none of them are connected by an edge
• reduction from 3SAT
5. node-cover problem
• node cover - a set of nodes such that every edge has at least one endpoint in the set
6. Undirected Hamiltonian circuit problem
• TSP with all weights 1
7. Directed Hamiltonian-Circuit Problem
8. subset sum
• is there a subset of numbers that sums to a number
• reductions must be polynomial-time reductions
• P - solvable in polynomial by deterministic TM
• NP - solvable in polynomial time by nondeterministic TM
• NP-completeness (Karp-completeness) - a problem is at least as hard as any problem in NP = for every language L’ in NP, there is a polynomial-time reduction of L’ to L
• Cook-completeness, equivalent to NP-completeness - if given a mechanism that in one unit of time would answer any question about membership of a string in L, it would be possible to recognize any language in NP in polynomial time
• NP-hard - we don’t know if L is in NP, but every problem in NP reduces to L in polynomial time
• if some NP-complete problem p is in P then P=NP
• there are running times between polynomial and exponential (like $n^{\log n}$), and we group these in with the exponential category
• machines for P languages could run forever when they don’t accept
• fix: simply stop them after a polynomial number of steps
• there are algorithms called verifiers
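The brute-force view of satisfiability above (try all assignments) can be sketched directly; the CNF encoding as arrays of signed literals is a convention chosen for this sketch:

```java
class BruteSat {
    // CNF as clauses of literals: +v means variable v, -v its negation
    // (variables numbered 1..n); tries all 2^n assignments
    static boolean satisfiable(int[][] clauses, int n) {
        for (int mask = 0; mask < (1 << n); mask++) {      // bit i of mask = value of x_{i+1}
            boolean ok = true;
            for (int[] clause : clauses) {
                boolean clauseTrue = false;
                for (int lit : clause) {
                    boolean val = ((mask >> (Math.abs(lit) - 1)) & 1) == 1;
                    if (lit < 0) val = !val;
                    if (val) { clauseTrue = true; break; }
                }
                if (!clauseTrue) { ok = false; break; }     // one false clause kills the assignment
            }
            if (ok) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int[][] sat = {{1, 2}, {-1}};                // (x1 or x2) and (not x1): satisfiable
        int[][] unsat = {{1, 2}, {-1, 2}, {-2}};     // forces x2 both true and false
        System.out.println(satisfiable(sat, 2));     // true
        System.out.println(satisfiable(unsat, 2));   // false
    }
}
```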

# more on NP Completeness

• a language A is polynomial-time reducible to a language B if there is a polynomial-time computable function f such that $w \in A \iff f(w) \in B$
• to solve a problem, efficiently transform to another problem, and then use a solver for the other problem
• satisfiability problem - is there an assignment of truth values that makes a boolean expression true?
• brute force tests every possible assignment - 2^n where n is the number of variables
• this can be mapped to all problems of NP
• ex. traveling salesman can be reduced to satisfiability
• P - set of all problems that can be solved in polynomial time
• NP - solved in polynomial time if we allow nondeterminism
• we count the running time as the length of the shortest accepting computation path
• NP-hard problem L’
1. every L in NP reduces to L’ in polynomial time
• NP-complete L’
1. L’ is NP-hard
2. L’ is in NP
• ex. graph coloring
• partitioning into equal sums
• if one NP-complete problem is in P, P=NP
• decider vs. optimizer
• decider tells whether it was solved or not
• if you keep asking it boolean questions it gives you the answer
• graph clique problem - given a graph and an integer k is there a subgraph in G that is a complete graph of size k
• this is reduction from boolean satisfiability
• graph 3-colorability
• reduction from satisfiability - prove with or gate type structure
• approximation algorithms
• find minimum
• greedy - keep going down
• genetic algorithms - pretty bad
• minimum vertex cover problem - given a graph, find a minimum set of vertices such that each edge is incident to at least one of these vertices
• NP-complete
• cannot be approximated within a factor of 1.36 of the optimal solution (unless P=NP)
• can be approximated within a factor of 2 in linear time
• pick an edge, pick its endpoints
• put them in solution
• eliminate these points and their edges from the graph
• repeat
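The pick-an-edge steps above give the classic 2-approximation for vertex cover; a minimal sketch (graph representation as a list of endpoint pairs is an assumption of this sketch):

```java
import java.util.*;

class VertexCoverApprox {
    // repeatedly take an uncovered edge, add BOTH endpoints, skip covered edges;
    // result is at most twice the size of an optimal cover
    static Set<Integer> cover(List<int[]> edges) {
        Set<Integer> c = new HashSet<>();
        for (int[] e : edges)
            if (!c.contains(e[0]) && !c.contains(e[1])) {  // edge not yet covered
                c.add(e[0]);
                c.add(e[1]);
            }
        return c;
    }

    public static void main(String[] args) {
        // path 1-2-3: optimal cover is {2}; the approximation takes both ends of one edge
        List<int[]> edges = List.of(new int[]{1, 2}, new int[]{2, 3});
        System.out.println(cover(edges).size()); // 2
    }
}
```

The factor-2 guarantee follows because the chosen edges are disjoint, and any cover must contain at least one endpoint of each of them.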
• maximum cut problem - given a graph, find a partition of the vertices maximizing the number of crossing edges
• cannot be approximated within a factor of 17/16 of the optimal solution (unless P=NP)
• can be approximated within a factor of 2
• if moving arbitrary node across partition will improve the cut, then do so
• repeat
# data structures

# lists

## arrays and strings

• start by checking for null, length 0
• ascii is 128, extended is 256

## queue - linkedlist

• has insert at back (enqueue) and remove from front (dequeue)
class Node {
    Node next;
    int val;
    public Node(int d) { val = d; }
}

• finding a loop is tricky, use visited
• reverse a linked list
• requires 3 ptrs (one temporary to store next)
• returns a pointer to the new head (the old tail)
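The three-pointer reversal above can be sketched using the Node class from these notes (wrapped in a class here so the example is self-contained):

```java
class ReverseList {
    static class Node {
        Node next;
        int val;
        Node(int d) { val = d; }
    }

    // prev, cur, and a temporary holding cur.next; returns the new head (old tail)
    static Node reverse(Node head) {
        Node prev = null;
        Node cur = head;
        while (cur != null) {
            Node next = cur.next;   // temporary so the rest of the list isn't lost
            cur.next = prev;        // flip the link
            prev = cur;
            cur = next;
        }
        return prev;
    }

    public static void main(String[] args) {
        Node a = new Node(1); a.next = new Node(2); a.next.next = new Node(3);
        Node r = reverse(a);
        System.out.println(r.val); // 3
    }
}
```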

## stack

class Stack {
    Node top;   // here Node holds an Object field named data
    Object pop() {
        if (top != null) {
            Object item = top.data;
            top = top.next;
            return item;
        }
        return null;
    }
    void push(Object item) {
        Node t = new Node(item);
        t.next = top;
        top = t;
    }
}

• sort a stack with 2 stacks
• make a new stack called ans
• pop from old
• while old element is > ans.peek(), old.push(ans.pop())
• then ans.push(old element)
• stack with min - each el stores min of things below it
• queue with 2 stacks - keep popping everything off of one and putting them on the other
• sort with 2 stacks
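The queue-with-2-stacks idea above can be sketched as follows; moving everything over only when the out-stack empties makes each element cross at most twice, so operations are amortized constant time:

```java
import java.util.ArrayDeque;
import java.util.Deque;

class TwoStackQueue {
    // enqueue pushes onto 'in'; dequeue pops from 'out', refilling 'out'
    // from 'in' only when 'out' runs empty
    private final Deque<Integer> in = new ArrayDeque<>();
    private final Deque<Integer> out = new ArrayDeque<>();

    void enqueue(int x) { in.push(x); }

    int dequeue() {
        if (out.isEmpty())
            while (!in.isEmpty()) out.push(in.pop());  // reverses order once
        return out.pop();
    }

    public static void main(String[] args) {
        TwoStackQueue q = new TwoStackQueue();
        q.enqueue(1); q.enqueue(2); q.enqueue(3);
        System.out.println(q.dequeue()); // 1
        q.enqueue(4);
        System.out.println(q.dequeue()); // 2
    }
}
```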

# trees

• Balanced binary trees are generally logarithmic
• Root: a node with no parent; there can only be one root
• Leaf: a node with no children
• Siblings: two nodes with the same parent
• Height of a node: length of the longest path from that node to a leaf - Thus, all leaves have height of zero
• Height of a tree: maximum depth of a node in that tree = height of the root
• Depth of a node: length of the path from the root to that node
• Path: sequence of nodes n1, n2, …, nk such that ni is parent of ni+1 for 1 ≤ i < k
• Length: number of edges in the path
• Internal path length: sum of the depths of all the nodes
• Binary Tree - every node has at most 2 children
• Binary Search Tree - Each node has a key value that can be compared
• Every node in left subtree has a key whose value is less than the root’s key value
• Every node in right subtree has a key whose value is greater than the root’s key value
void BST::insert(int x, BinaryNode * & curNode) {
    // pass by reference so that assigning to curNode actually modifies
    // the caller's pointer ("* &" is a reference to a pointer)
    if (curNode == NULL)
        curNode = new BinaryNode(x, NULL, NULL);
    else if (x < curNode->element)
        insert(x, curNode->left);
    else if (x > curNode->element)
        insert(x, curNode->right);
}

• BST Remove
• if no children: remove node (reclaiming memory), set parent pointer to null
• one child: Adjust pointer of parent to point at child, and reclaim memory
• two children: successor is min of right subtree
• replace node with successor, then remove successor from tree
• worst-case depth = n-1 (this happens when the data is already sorted)
• maximum number of nodes in tree of height h is 2^(h+1) - 1
• minimum height h ≥ log(n+1)-1
• Perfect Binary tree - impractical because you need the perfect amount of nodes
• all leaves have the same depth
• number of leaves 2^h

## AVL Tree

• For every node in the tree, the height of the left and right sub-trees differs at most by 1
• guarantees log(n)
• balance factor := The height of the right subtree minus the height of the left subtree
• “Unbalanced” trees: A balance factor of -2 or 2
• AVL Insert - needs to update balance factors
• same sign -> single rotation
• -2, -1 -> needs right rotation
• -2, +1 -> needs left then right
• Find: Θ(log n) time: height of tree is always Θ(log n)
• Insert: Θ(log n) time: find() takes Θ(log n), then may have to visit every node on the path back up to root to perform up to 2 single rotations
• Remove: Θ(log n): left as an exercise
• Print: Θ(n): no matter the data structure, it will still take n steps to print n elements

## Red-Black Trees

• definition
1. A node is either red or black
2. The root is black
3. All leaves are black (the leaves may be the NULL children)
4. Both children of every red node are black (thus a black node is the only possible parent for a red node)
5. Every simple path from a node to any descendant leaf contains the same number of black nodes
• properties
• The heights of the right and left subtrees can differ by at most a factor of 2
• insert (Assume node is red and try to insert)
1. The new node is the root node
2. The new node’s parent is black
3. Both the parent and uncle (aunt?) are red
4. Parent is red, uncle is black, new node is the right child of parent
5. Parent is red, uncle is black, new node is the left child of parent
• Removal
• Do a normal BST remove
• Find next highest/lowest value, put its value in the node to be deleted, remove that highest/lowest node
• Note that that node won’t have 2 children!
• We replace the node to be deleted with its left child
• This child is N, its sibling is S, its parent is P

## Splay Trees

• A self-balancing tree that keeps “recently” used nodes close to the top
• This improves performance in some cases
• Great for caches
• Not good for uniform access
• Anytime you find / insert / delete a node, you splay the tree around that node
• Perform tree rotations to make that node the new root node
• Splaying is Θ(h) where h is the height of the tree
• At worst this is linear time - Θ(n)
• We say it runs in Θ(log n) amortized time - individual operations might take linear time, but other operations take almost constant time - averages out to logarithmic time
• m operations will take O(m log n) time

## other trees

• to go through bst (without recursion) in order, use stacks
• push and go left
• if can’t go left, pop
• add new left nodes
• go right
• recursively print only at a particular level each time
• create pointers to nodes on the right
• balanced tree - no two subtree heights differ by more than 1
• (maxDepth - minDepth) <= 1
• the name trie comes from the middle of the word “retrieval” - a trie can find a word in a dictionary with only a prefix of the word
• root is empty string
• each node stores a character in the word
• if ends, full word
• need a way to tell if prefix is a word -> each node stores a boolean isWord
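The stack-based in-order traversal described at the top of this section ("push and go left; if you can't go left, pop; then go right") can be sketched as:

```java
import java.util.*;

class IterInorder {
    static class Node {
        int val;
        Node left, right;
        Node(int v) { val = v; }
        Node(int v, Node l, Node r) { val = v; left = l; right = r; }
    }

    // iterative in-order: push and go left; if you can't go left, pop, visit, go right
    static List<Integer> inorder(Node root) {
        List<Integer> out = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        Node cur = root;
        while (cur != null || !stack.isEmpty()) {
            while (cur != null) { stack.push(cur); cur = cur.left; }  // go left
            cur = stack.pop();
            out.add(cur.val);                                         // visit
            cur = cur.right;                                          // go right
        }
        return out;
    }

    public static void main(String[] args) {
        Node root = new Node(2, new Node(1), new Node(3));
        System.out.println(inorder(root)); // [1, 2, 3]
    }
}
```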

# heaps

• used for priority queue
• peek(): just look at the root node
• add(val): put it at correct spot, percolate up
• percolate - Repeatedly exchange node with its parent if needed
• expected run time: $\sum_{i=1}^{\infty} i \cdot \frac{1}{2^i} = 2$, i.e., constant on average
• pop(): put last leaf at root, percolate down
• Remove root (that is always the min!)
• Put “last” leaf node at root
• Repeatedly find smallest child and swap node with smallest child if needed.
• Priority Queue - Binary Heap is always used for Priority Queue
1. insert
• inserts with a priority
2. findMin
• finds the minimum element
3. deleteMin
• finds, returns, and removes minimum element
• perfect binary tree - binary tree with all leaf nodes at the same depth; all internal nodes have 2 children
• height h: $2^{h+1}-1$ nodes, $2^h-1$ non-leaves, and $2^h$ leaves
• Full Binary Tree
• A binary tree in which each node has exactly zero or two children.
• Min-heap - parent is min
1. Heap Structure Property: A binary heap is an almost complete binary tree, which is a binary tree that is completely filled, with the possible exception of the bottom level, which is filled left to right.
• in an array - this is faster than pointers
• left child: 2*i
• right child: (2*i)+1
• parent: floor(i/2)
• pointers need more space, are slower
• multiplying, dividing by 2 are very fast
2. Heap ordering property: For every non-root node X, the key in the parent of X is less than (or equal to) the key in X. Thus, the tree is partially ordered.
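The array layout (left child 2i, right child 2i+1, parent i/2) and the percolate operations described above can be sketched as a small min-heap:

```java
import java.util.ArrayList;
import java.util.List;

class MinHeap {
    // 1-based array: left child 2i, right child 2i+1, parent i/2
    private final List<Integer> a = new ArrayList<>(List.of(0)); // index 0 unused

    void insert(int x) {                 // add at the end, percolate up
        a.add(x);
        int i = a.size() - 1;
        while (i > 1 && a.get(i / 2) > a.get(i)) {
            swap(i, i / 2);              // exchange with parent while smaller
            i /= 2;
        }
    }

    int deleteMin() {                    // move last leaf to root, percolate down
        int min = a.get(1);              // root is always the min
        a.set(1, a.get(a.size() - 1));
        a.remove(a.size() - 1);
        int i = 1;
        while (2 * i < a.size()) {
            int child = 2 * i;
            if (child + 1 < a.size() && a.get(child + 1) < a.get(child)) child++;
            if (a.get(i) <= a.get(child)) break;
            swap(i, child);              // swap with the smaller child
            i = child;
        }
        return min;
    }

    private void swap(int i, int j) {
        int t = a.get(i); a.set(i, a.get(j)); a.set(j, t);
    }

    public static void main(String[] args) {
        MinHeap h = new MinHeap();
        h.insert(5); h.insert(1); h.insert(3);
        System.out.println(h.deleteMin()); // 1
        System.out.println(h.deleteMin()); // 3
    }
}
```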
• Compression
• Lossless compression: X = X’
• Lossy compression: X != X’
• Information is lost (irreversible)
•  Compression ratio: |X| / |Y|
•  where |X| is the number of bits (i.e., file size) of the original X and |Y| that of the compressed Y
• Huffman coding
• Compression
1. Determine frequencies
2. Build a tree of prefix codes
• no code is a prefix of another code
• start with minheap, then keep putting trees together
3. Write the prefix codes to the output
4. reread source file and write prefix code to the output file
• Decompression
1. read in prefix code - build tree
2. read in one bit at a time and follow tree
• ASCII characters - 7-bit codes stored in 8 bits, 2^7 = 128 characters
• cost - total number of bits
• “straight cost” - bits / character = log2(numDistinctChars)
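The "start with minheap, keep putting trees together" step above can be sketched with a PriorityQueue; the frequency table in main is a made-up example, and the cost function computes the total number of output bits (depth times frequency, summed over leaves):

```java
import java.util.*;

class Huffman {
    static class Node {
        final int freq;
        final Character ch;       // null for internal nodes
        final Node left, right;
        Node(int f, Character c, Node l, Node r) { freq = f; ch = c; left = l; right = r; }
    }

    // min-heap of leaves; repeatedly merge the two smallest trees into one
    static Node build(Map<Character, Integer> freqs) {
        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingInt((Node n) -> n.freq));
        for (var e : freqs.entrySet()) pq.add(new Node(e.getValue(), e.getKey(), null, null));
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();
            pq.add(new Node(a.freq + b.freq, null, a, b));
        }
        return pq.poll();
    }

    // total output bits = sum over leaves of (code length = depth) * frequency
    static int cost(Node n, int depth) {
        if (n.ch != null) return depth * n.freq;
        return cost(n.left, depth + 1) + cost(n.right, depth + 1);
    }

    public static void main(String[] args) {
        Map<Character, Integer> freqs = Map.of('a', 5, 'b', 2, 'c', 1, 'd', 1);
        System.out.println(cost(build(freqs), 0)); // 15
    }
}
```

Ties in the heap can produce different trees, but any Huffman tree achieves the same optimal total cost.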
• Priority Queue Example
• insert (x)
• deleteMin()
• findMin()
• isEmpty()
• makeEmpty()
• size()

# Hash Tables

• java: load factor = .75, default init capacity: 16, uses buckets
• string hash function: (s[0]·31^(n-1) + s[1]·31^(n-2) + … + s[n-1]) mod table_size, where n is the string length
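The 31-polynomial above, evaluated with Horner's rule and before the table-size mod, is exactly Java's String.hashCode; a small sketch (class and method names are hypothetical):

```java
class PolyHash {
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] via Horner's rule;
    // this matches Java's String.hashCode (overflow wraps in 32-bit int arithmetic)
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);
        return h;
    }

    // reduce into table bounds; floorMod keeps the index non-negative on overflowed hashes
    static int bucket(String s, int tableSize) {
        return Math.floorMod(hash(s), tableSize);
    }

    public static void main(String[] args) {
        System.out.println(hash("abc"));        // 96354, same as "abc".hashCode()
        System.out.println(bucket("abc", 11));
    }
}
```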
• Standard set of operations: find, insert, delete
• No ordering property!
• Thus, no findMin or findMax
• Hash tables store key-value pairs
• Each value has a specific key associated with it
• fixed size array of some size, usually a prime number
• A hash function takes in a “thing” (string, int, object, etc.)
• returns hash value - an unsigned integer value which is then mod’ed by the size of the hash table to yield a spot within the bounds of the hash table array
• Three required properties
1. Must be deterministic
• Meaning it must return the same value each time for the same “thing”
2. Must be fast
3. Must be evenly distributed
• implies avoiding collisions - Technically, only the first is required for correctness, but the other two are required for fast running times
• A perfect hash function has:
• No blanks (i.e., no empty cells)
• No collisions
• by contrast, lookup in a comparison-based table (e.g., a balanced BST) is at best logarithmic
• We can’t just make a very large array - we assume the key space is too large
• you can’t just hash by social security number
• hash(s) = $(\sum_{i=0}^{k-1} s_i \cdot 37^i)$ mod table_size
• you would precompute the powers of 37
• collision - putting two things into same spot in hash table
• Two primary ways to resolve collisions:
1. Separate Chaining (make each spot in the table a ‘bucket’ or a collection)
2. Open Addressing, of which there are 3 types:
1. Linear probing
2. Quadratic probing
3. Double hashing
• Separate Chaining
• each bucket contains a data structure (like a linked list)
• analysis of find
• The load factor, λ, of a hash table is the ratio of the number of elements divided by the table size
• For separate chaining, λ is the average number of elements in a bucket
• Average time on unsuccessful find: λ
• Average length of a list at hash(k)
• Average time on successful find: 1 + (λ/2)
• One node, plus half the average length of a list (not including the item)
• typical case will be constant time, but worst case is linear because everything hashes to same spot
• λ = 1
• Make hash table be the number of elements expected
• So average bucket size is 1
• Also make it a prime number
• λ = 0.75
• Java’s Hashtable but can be set to another value
• Table will always be bigger than the number of elements
• This reduces the chance of a collision!
• Good trade-off between memory use and running time
• λ = 0.5
• Uses more memory, but fewer collisions
• Open Addressing: The general idea with all of them is that, if a spot is occupied, to ‘probe’, or try, other spots in the table to use
• 3 Types:
• General: $p_i(k)$ = (hash(k) + f(i)) mod table_size
1. Linear Probing: f(i) = i
• Check spots in this order :
• hash(k)
• hash(k)+1
• hash(k)+2
• hash(k)+3
• These are all mod table_size
• find - keep probing until you find the key or hit an empty cell (or wrap back around to the start)
• problems
• cannot have a load factor > 1; as you get close to 1, you get a lot of collisions
• clustering - large blocks of occupied cells
• “holes” when an element is removed
2. Quadratic probing: f(i) = $i^2$
• hash(k)
• hash(k)+1
• hash(k)+4
• hash(k)+9
• you move out of clusters much quicker
3. Double hashing: f(i) = i * hash2(k)
• hash2 is another hash function - typically the fastest
• problem where you loop over spots that are filled - hash2 yields a factor of the table size
• solve by making table size prime
• hash(k) + 1 * hash2(k)
• hash(k) + 2 * hash2(k)
• hash(k) + 3 * hash2(k)
• a prime table size helps hash function be more evenly distributed
• problem: when the table gets too full, running time for operations increases
• solution: create a bigger table and hash all the items from the original table into the new table
• position is dependent on table size, which means we have to rehash each value
• this means we have to re-compute the hash value for each element, and insert it into the new table!
• When to rehash?
• When half full (λ = 0.5)
• When mostly full (λ = 0.75)
• Java’s hashtable does this by default
• When an insertion fails
• Some other threshold
• Cost of rehashing
• Let’s assume that the hash function computation is constant
• We have to do n inserts, and if each key hashes to the same spot, then it will be a Θ(n^2) operation!
• Although it is not likely to ever run that slow
• Removing
• You could rehash on delete
• You could put in a ‘placeholder’ or ‘sentinel’ value
• gets filled with these quickly
• perhaps rehash after a certain number of deletes
• hash functions
• MD5 is a good hash function (given a string or file contents)
• generates 128 bit hash
• when you download something, you download the MD5, your computer computes the MD5 and they are compared to make sure it downloaded correctly
• not reversible - a file with more than 128 bits can’t map 1-1 onto a 128-bit hash
• you can lookup a MD5 hash in a rainbow table - gives you what the password probably is based on the MD5 hash
• SHA (secure Hash algorithm) is much more secure
• generates hash up to 512 bits
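The open-addressing scheme from earlier in this section (linear probing, f(i) = i) can be sketched as a tiny table; this sketch ignores deletion/sentinels and rehashing, and assumes the table never fills:

```java
class LinearProbeTable {
    // linear probing: check hash(k), hash(k)+1, hash(k)+2, ... all mod table_size
    private final String[] keys;
    private final int[] vals;

    LinearProbeTable(int size) { keys = new String[size]; vals = new int[size]; }

    private int slot(String k) {
        int i = Math.floorMod(k.hashCode(), keys.length);
        while (keys[i] != null && !keys[i].equals(k))
            i = (i + 1) % keys.length;   // occupied by another key: probe the next cell
        return i;                        // assumes the table never fills completely
    }

    void put(String k, int v) { int i = slot(k); keys[i] = k; vals[i] = v; }

    Integer get(String k) {
        int i = slot(k);
        return keys[i] == null ? null : vals[i];  // empty cell => key absent
    }

    public static void main(String[] args) {
        LinearProbeTable t = new LinearProbeTable(7);
        t.put("one", 1); t.put("two", 2);
        System.out.println(t.get("two")); // 2
        System.out.println(t.get("ten")); // null
    }
}
```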
# graphical models

# big data

• marginal correlation - covariance matrix
• estimates are bad unless n ≫ d
• eigenvalues are not well-approximated
• often enforce sparsity
• ex. threshold each value in the cov matrix (set to 0 unless greater than thresh) - this threshold can depend on different things
• can also use regularization to enforce sparsity
• POET doesn’t assume sparsity
• conditional correlation - inverse covariance matrix = precision matrix

# 1 - bayesian networks

•  A and B are conditionally independent given C if $P(A,B \mid C) = P(A \mid C)P(B \mid C)$

### bayesian networks intro

• represented by directed acyclic graph
1. could get an expert to design Bayesian network
2. otherwise, have to learn it from data
• each node is random variable
• weights as tables of conditional probabilities for all possibilities
1. encodes conditional independence relationships
2. compact representation of joint prob. distr. over the variables
• markov condition - given its parents, a node is conditionally independent of its non-descendants
•  therefore the joint distribution factors: $P(X_1 = x_1,…,X_n=x_n)=\prod_{i=1}^n P(X_i = x_i \mid Parents(X_i))$
• inference - using a Bayesian network to compute probabilities
• sometimes have unobserved variables

### sampling

• exact inference is feasible in small networks, but takes a long time in large networks
• approximate inference techniques
1. learning
• prior sampling
• draw N samples from a distribution S
• approximate posterior probability based on observed values
• ex. flip a weighted coin to find out what the probabilities are
• then move to child nodes and repeat
2. inference
•  suppose we want to know P(D | !A)
• sample network N times, report probability of D being true when A is false
• more samples is better
•  rejection sampling - if we want to know P(D | !A)
• sample N times, throw out samples where A isn’t false
• return probability of D being true
• this is slow
• likelihood weighting - fix our evidence variables to their observed values, then simulate the network
• can’t just fix variables - distr. might be inconsistent
• instead we weight by probability of evidence given parents, then add to final prob
• for each observation
• if correct, Count = Count+(1*W)
• always, Total = Total+(1*W)
• return Count/Total
• this way we don’t have to throw out wrong samples
• doesn’t solve all problems - evidence only influences the choice of downstream variables
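The rejection-sampling steps above can be sketched for a toy two-node network; the network, its CPT numbers, and the class name are made-up for this sketch:

```java
import java.util.Random;

class RejectionSampling {
    // toy network A -> D with hypothetical CPTs:
    //   P(A) = 0.3, P(D|A) = 0.8, P(D|!A) = 0.2
    // estimate P(D | !A): sample the whole network, throw out samples where A is true
    static double estimate(int n, long seed) {
        Random rng = new Random(seed);
        int kept = 0, dTrue = 0;
        for (int i = 0; i < n; i++) {
            boolean a = rng.nextDouble() < 0.3;                // sample root first
            boolean d = rng.nextDouble() < (a ? 0.8 : 0.2);    // then the child given its parent
            if (a) continue;                                   // reject: evidence !A doesn't hold
            kept++;
            if (d) dTrue++;
        }
        return (double) dTrue / kept;                          // fraction of kept samples with D
    }

    public static void main(String[] args) {
        // exact answer is 0.2; the estimate converges as n grows
        System.out.println(estimate(100000, 42));
    }
}
```

Roughly 70% of the samples are rejected here, which illustrates why rejection sampling is slow when the evidence is unlikely.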

### 16 learning overview

• notation
• P* - true distribution
• $\hat{P}$ - sample distribution of P
• $\tilde{P}$ - estimated distribution P
1. density estimation - construct a model $\tilde{M}$ such that $\tilde{P}$ is close to the generating distribution $P^*$
• this can be estimated with relative entropy distance = $\mathbf{E}_{P^*}\left[ \log\frac{P^*(X)}{\tilde{P}(X)} \right]$
• also $= - \mathbf{H}_{P^*}(X) - \mathbf{E}_{P^*} [\log \tilde{P}(X)]$
• intuitively measures extent of compression loss (in bits)
• can ignore first term because it is unaffected by the model
•  concentrate on the expected log-likelihood $\ell(D \mid M) = \mathbf{E}_{P^*} [\log \tilde{P}(X)]$
• maximizes probability of data given the model
• maximizes prediction assuming we are given complete instances
• could design test suite of queries to evaluate performance on a range of queries
2. classification
• can set loss function to classification error (0/1 loss)
• this doesn’t work well for multiclass labeling
• Hamming loss - counts number of variables Y in which pred differs from ground truth
•  conditional log-likelihood = $\mathbf{E}_{x,y \sim P^*}[\log \tilde{P}(y \mid x)]$ - only measure likelihood with respect to the predicted y
3. knowledge discovery
• far more critical to assess the confidence in a prediction
• the amount of data required to estimate parameters reliably grows linearly with the number of parameters, so that the amount of data required can grow exponentially with the network connectivity
• goodness of fit - how well does the learned distribution represent the real distribution?

### 17 parameter estimation

•  assume a parametric model $P(x \mid \theta)$
• a sufficient statistic can be used to calculate the likelihood

### 18 structure learning

• structure learning - learning the structure (e.g. connections) in the model

### 20 learning undirected models

• a potential function is a non-negative function
• values with higher potential are more probable
• can maximize entropy in order to impose as little structure as possible while satisfying constraints
# graphs
• Edges are of the form (v1, v2)
• Can be ordered pair or unordered pair
• Definitions
• A weight or cost can be associated with each edge - this is determined by the application
• w is adjacent to v iff (v, w) $\in$ E
• path: sequence of vertices w1, w2, w3, …, wn such that (wi, wi+1) ∈ E for 1 ≤ i < n
• length of a path: number of edges in the path
• simple path: all vertices are distinct
• cycle:
• directed graph: path of length $\geq$ 1 such that w1 = wn
• undirected graph: same, except all edges are distinct
• connected: there is a path from every vertex to every other vertex
• loop: (v, v) $\in$ E
• complete graph: there is an edge between every pair of vertices
• digraph
• directed acyclic graph: no cycles; often called a “DAG”
• strongly connected: there is a path from every vertex to every other vertex
• weakly connected: the underlying undirected graph is connected
• For Google Maps, an adjacency matrix would be infeasible - almost all zeros (sparse)
• an adjacency list would work much better
• an adjacency matrix would work for airline routes
• detect cycle
• dfs from every vertex and keep track of visited, if repeat then cycle
• Topological Sort
• Given a directed acyclic graph, construct an ordering of the vertices such that if there is a path from vi to vj, then vj appears after vi in the ordering
• The result is a linear list of vertices
• indegree of v: number of edges (u, v) – meaning the number of incoming edges
• Algorithm
• start with something of indegree 0
• take it out, and take out the edges that start from it
• keep doing this as we take out more and more edges
• can have multiple possible topological sorts
• Shortest Path
• single-source - start somewhere, get shortest path to everywhere
• unweighted shortest path - breadth first search
• Weighted Shortest Path
• We assume no negative weight edges
• Dijkstra’s algorithm: uses similar ideas as the unweighted case
• Greedy algorithms: do what seems to be best at every decision point
• Dijkstra: v^2 (simple array implementation)
• Initialize each vertex’s distance as infinity
• Start at a given vertex s
• Update s’s distance to be 0
• Repeat
• Pick the next unknown vertex with the shortest distance to be the next v
• If no more vertices are unknown, terminate loop
• Mark v as known
• For each edge from v to adjacent unknown vertices w
• If the total distance to w is less than the current distance to w
• Update w’s distance and the path to w
• It picks the unvisited vertex with the lowest-distance, calculates the distance through it to each unvisited neighbor, and updates the neighbor’s distance if smaller. Mark visited (set to red) when done with neighbors.
• Shortest path from a start node to a finish node
1. We can just run Dijkstra until we get to the finish node
2. Have different kinds of nodes
• Assume you are starting on a “side road”
• Transition to a “main road”
• Transition to a “highway”
• Get as close as you can to your destination via the highway system
• Transition to a “main road”, and get as close as you can to your destination
• Transition to a “side road”, and go to your destination
• Traveling Salesman
• Given a number of cities and the costs of traveling from any city to any other city, what is the least-cost round-trip route that visits each city exactly once and then returns to the starting city?
• Hamiltonian path: a path in a connected graph that visits each vertex exactly once
• Hamiltonian cycle: a Hamiltonian path that ends where it started
• The traveling salesperson problem is thus to find the least weight Hamiltonian path (cycle) in a connected, weighted graph
• Minimum Spanning Tree
• Want fully connected
• Want to minimize number of links used
• We won’t have cycles
• Any solution is a tree
• Slow algorithm: Construct a spanning tree:
• Remove an edge from each cycle
• What remains has the same set of vertices but is a tree
• Spanning Trees
• Minimal-weight spanning tree: spanning tree with the minimal total weight
• Generic Minimum Spanning Tree Algorithm
• KnownVertices <- {}
• while KnownVertices does not form a spanning tree, loop:
• find edge (u,v) that is “safe” for KnownVertices
• KnownVertices <- KnownVertices U {(u,v)}
• end loop
• Prim’s algorithm
• Idea: Grow a tree by adding an edge to the “known” vertices from the “unknown” vertices. Pick the edge with the smallest weight.
• Pick one node as the root,
• Incrementally add edges that connect a “new” vertex to the tree.
• Pick the edge (u,v) where:
• u is in the tree, v is not, AND
• where the edge weight is the smallest of all edges (where u is in the tree and v is not)
• Running time: Same as Dijkstra’s: Θ(e log v)
• Kruskal’s algorithm
• Idea: Grow a forest out of edges that do not create a cycle. Pick an edge with the smallest weight.
• When optimized, it has the same running time as Prim’s and Dijkstra’s: Θ(e log v)
• unoptimized: v^2
# computer architecture

# units

• we will use only the i versions (don’t have to write i):
• K - 10^3: Ki - 1024
• M - 10^6: Mi - 1024^2
• G - 10^9: Gi - 1024^3
• convert to these: 2^27 = 128M
• log(8K)=13
• hardware is parallel by default
• amdahl’s law: tells you how much of a speedup you get
• S = 1 / ((1-a) + a/k)
• a - portion optimized, k - level of parallelization (speedup of that portion), S - total speedup
• if you really want performance increase in java, allocate a very large array, then keep track of it on your own

# numbers

• 0x prefix means hexadecimal; a leading 0 means octal
• bit - stores 0 or 1
• byte - 8 bits - 2 hex digits
• integer - almost always 32 bits
• “Big-endian”: most significant byte first (lowest address) - how we think
• 1000 0000 0000 0000 = 2^15 = 32768
• “Little-endian”: most significant byte last (highest address) - this is what computers do
• 1000 0000 0000 0000 = 2^0 = 1
• note that although all the bits are reversed, usually it is displayed with just the bytes reversed
• consider 0xdeadbeef: on a big-endian machine, that’s 0xdeadbeef; on a little-endian machine, that’s 0xefbeadde
• 0xdeadbeef is used as a memory allocation pattern by some OSes

Representing integers
• sign and magnitude - first bit specifies sign
• one’s complement - encode using n-1 bits, flip all bits if negative
• two’s complement - encode using n-1 bits, flip if negative, add 1
• only one representation for 0
• maximum: 2^(n-1) - 1; minimum: -2^(n-1)
• to negate: flip everything to the left of the rightmost 1

Floating point - like scientific notation, e.g. 3.24 * 10^-6
• mantissa - 3.24 - between 1 and the base (10)
• for binary, the mantissa must be between 1 and 2; we assume the base is 2
• 32 bits are split as follows:
• bit 1: sign bit, 1 means negative (1 bit)
• bits 2-9: exponent (8 bits)
• exponent values: 0: zeros; 1-254: exponent-127; 255: infinities, overflow, underflow, NaN
• bits 10-32: mantissa (23 bits)
• mantissa $= 1.0 + \sum_{i=1}^{23} b_i / 2^i$ - we don’t encode the leading 1. because it has to be there
• value $= (1 - 2 \cdot \text{sign}) \cdot \text{mantissa} \cdot 2^{\text{exponent} - 127}$
• the largest float has:
• 0 as the sign bit (it’s positive)
• 254 as the exponent (1111 1110) - 255 is reserved for infinities and overflows - so the exponent is 254-127 = 127
• all 1’s for the mantissa, which yields a mantissa of almost 2
• 2 * 2^127 = 2^128 ≈ 3.402823 * 10^38 (actually a little bit lower)
• minimum positive (normalized): 1 * 2^-126 = 2^-126 ≈ 1.175494 x 10^-38 (this is exact)
• floating point numbers are not spatially uniform - depending on the exponent, the difference between two successive numbers is not the same
• a union converts from one data type to another - when you write one field, it overrides the other

union foo {   // this reinterprets a float's bits
    float f;
    int *x;
} bar;

int main() {
    bar.f = 42.125;
    cout << bar.x << endl; // outputs 0x42288000 - the float's bit pattern viewed as a pointer
}                          // if you were to dereference it, bad things would happen

• never compare floating point numbers with == - even if they print the same, they might be stored differently internally
• any fraction whose denominator isn’t a power of 2 is repeating in binary

// C++ (need to #include <math.h> and compile with -lm)
#define EPSILON 0.000001
bool foo = fabs(a-b) < EPSILON;

• you could instead use a rational type or use more digits
• 64-bit doubles: 11 exponent bits, offset = 1023
• Cowlishaw encoding: use 3 bits to store a binary digit - inefficient


# x86

Assembly Language - an assembler translates text into machine code; x86 is the type of chip
• 8 registers, although you can’t use 2 of them (stack pointer and base pointer); all are 32-bit registers; 1 byte = 8 bits
• declare variables with 3 things: identifier, how big it is, and value - doesn’t give you a type
• ? means uninitialized
• x DD 1, 2, 3   ; declares arr with 3, 4-byte integers
• y TIMES 8 DB 0 ; declares 8 bytes all with value 0
• nasm assumes you are using 32 bits
• mov dest, src - more like copy
• dest and src can be: a register, a constant, a variable name, or a pointer: [ebx]
• always put square brackets around a variable
• in an address you can ADD up to two registers, add one constant, and premultiply ONE register by 2, 4, or 8
• the destination cannot be a constant (would overwrite the constant)
• you cannot access memory twice in one instruction - not enough time to do that at clock speed
• stack starts at the end of memory and goes backwards - when you push onto it, the new item is at a lower index
• ESP points to the most recently pushed item
• push: first decrements ESP (stack pointer) by 4 (stack grows down), then moves the operand onto the stack (4 bytes - we make this assumption, not always true)
• pop: pops the top element of the stack to memory or a register, then increments the stack pointer (ESP) by 4; the value is written to the parameter
• commands
• 0fH - H at the end specifies a hex number
• lea is like & (get the address of) - load effective address
• places the address of the second parameter into the first parameter
• this is faster than arithmetic because you can do things as a single command
• add, sub: a += b; inc, dec: a++; imul: a *= b
• idiv - use shift if possible; you have to load a,b into one 64-bit integer
• and, xor - bitwise
• cmp - compare two things
• je - jump when equal - specify where you are going to jump to; others: jne, jz, jg, jge, jl, jle, js
• call

subroutine may not know how many parameters are passed to it - thus, 1st arg must be at ebp+8 and the rest are pushed above it.
Every subroutine call puts return address and ebp backup on the stack


Activation Records - every time a subroutine is called, a number of things are pushed onto the stack:
• registers
• parameters
• old base/stack pointers
• local variables
• return address
The callee also pushes the callee-saved registers. The stack typically stops around 100-200 megabytes, although this can be changed.

Memory - there are two types of memory that need to be handled:
• dynamic memory (via new, malloc(), etc.) - stored on the heap
• static memory (on the stack) - where the activation records are kept

The binary program starts at the beginning of the 2^32 = 4 GB of memory
The heap starts right after this
The stack starts at the end of this 4 GB of memory, and grows backward
If they meet, you run out of memory


Buffer Overflow

void security_hole() {
    char buffer[12];
    scanf("%s", buffer); // how C handles input - no bounds check
}

The stack looks like (with sizes in parentheses):

 esi (4) 	 edi (4) 	 buffer (12) 	 ebp (4) 	 ret addr (4)

Addresses increase to the right (the stack grows to the left)
What happens if the value stored into buffer is 13 bytes long?
We overwrite one byte of ebp
What happens if the value stored into buffer is 16 bytes long?
We completely overwrite ebp
What if it is exactly 20 bytes long?
We overwrite the return address!
Buffer Overflow Attack
When you read in a string (etc.) that goes beyond the size of the buffer
You can then overwrite the return address
And set it to your own code
For example, code that is included later on in the string - overwrite ebp, overwrite ret addr with beginning of malicious code


• we are using nasm as our assembler for the x86 labs
• the output looks different when you use the compiler
• in C, you can only have one method with the same name, so C translates more cleanly into assembly
• optimization rearranges stuff to lessen memory access
• _Z3maxii: C++ name mangling - ii is the parameter list (two integers)

• in little-endian, the entire 32-bit word and its 8-bit least significant byte have the same address - this makes casting very easy
• RISC - reduced instruction set computer
• fewer and simpler instructions (maybe 50 or so)
• less chip complexity means they can run fast
• CISC - complex instruction set computer
• more and more complex instructions (300-400 or so)
• more chip complexity means harder to make run fast

Caller:
• parameters: pushed on the stack
• registers: saved on the stack
• eax, ecx, edx can be modified; ebx, edi, esi shouldn’t be
• call - places return address on stack
• local variables: placed in memory on the stack
• return value: eax

Callee: the function which is called by another function

push ebp
mov ebp, esp
sub esp, 4            ; allocate local variables
push ebx              ; back up callee-saved register (only if you use it)
mov dword [ebp-4], 1  ; load 1 into local variable

pop ebx
add esp, 4            ; deallocate local var
pop ebp
ret


# intro to C - we use ANSI standard

all C is valid C++
// comments don't work in ANSI C (use /* */)
always use +=1 not ++
compile with gcc -ansi -pedantic -Wall -Werror program.c
-Werror will stop the program from compiling on any warning
all variables have to be declared at the top of a block
int main(int argc, char* argv[]) {
    int x = 1;
    int y = 34;
    int z = y*y/x;
    x = 13;
    int w = 1; /* <- this will not work: declaration after a statement */
label_name:
    printf("omg!");
    goto label_name; /* this goes to the label_name line - don't do this, but assembly only has this */
    return 0;
}

printf(const char *format, ...)
printf("%d %f %g %s\n", ...); /* int, double (these must be explicitly doubles), double (as small as possible), string */


# compile steps

source (text) -> pre-processor -> modified source (text) -> compiler -> assembly (text) -> assembler -> binary program -> linker -> executable

1. pre-processing - deals with hashtags - sets up line numbers for errors, includes external definitions, normal defines (.i)
2. compile - turns it into assembly (.s)
• this assembly has commands with destination, src
3. assemble - turns assembly into binary with a lot of wrappers (.o)
4. link - makes the file executable, gets the code from includes together (a.out)

# strings

• char - number that fits in 1 byte
• string is an array of chars: char*
• all strings end with null character \0, bad security
• length of string doesn’t include null character
address: 10   11   12   13   14   15
value:   h    e    l    l    o    \0

# memory in C

• byte is smallest accessible memory unit - 2 hex digits (ex: 0x3a)
Bits Name Bits Name
1 bit 16 word
4 nyble 32 double word, long word
8 byte 64 quad word

theoretically would work:

void *p = 3;   /* address 3 */
*p = 0x3a;     /* value 3a (not legal C - you can't dereference a void*, but it shows the idea) */
p[0] = 0x3a;   /* value 3a */
p[3] is the same as *(p+3) - can even use negative offsets (at the end it wraps around - overflows)


in practice:

int* p;
sizeof(int) == 4, but all pointers are the same size (one machine word) - p points to the first of four consecutive bytes that form the int
indexing this pointer scales the offset by sizeof(int)
the address must be 4-byte aligned (meaning it will be a multiple of 4)

• little-endian - least significant byte first
• big-endian - most significant byte first (normal) - networks use this
• we will use little-endian, this is what most computers use
• low addresses are unused - will throw error if accessed
• top has the os - will throw error if accessed
• contains heap, stack, code globals, shared stuff

# call stack

1. local variables
2. backups
3. top pointer
4. base pointer
5. next pointer
6. parameters (sometimes)
7. return values (sometimes)

one frame - between the lines is everything the current method has - largest addresses at top, grows downwards

• parameters (backwards)

• base pointer
• previous stack base
• saved stuff
• locals
• top pointer (actually at the bottom)
• return value

• in practice, most parameters and return values are put in registers

# types

• 2’s complement: positive numbers are normal; a negative number -x is encoded as 2^n - x (one more than the bitwise complement)
• to negate: flip everything to the left of the rightmost one
• math is exactly the same as unsigned; discard extra bits

type       signed?   bits
char       ?         8
short      signed    16 (usually)
int        .         16 or 32
long       .         ≥32 and ≥ int
long long  signed    64
• everything can have an unsigned / signed in front of the type
• C will cast things without throwing error

# boolean operators

• 9 = 1001 = 1.001 x 2^3
• x && y -> {0,1}
• x & y -> {0,1,…,2^32} (bit operations)
• ^ is xor
• !x - 0 -> 1, anything else -> 0
• and is often called masking because only the things with 1s go through
• left shifting will pad with 0s
• (1 << 3) - 1 gives us three 1s
• ~((1 << 3) - 1) gives us 111…111000
• x & ~((1 << 3) - 1) gives us x1x2…..000 (keeps the high bits, zeroes the low 3)
• right shifting a signed value copies the msb (arithmetic shift)

• then we can or it in order to change those last 3 bits
• ternary operator - a ? b : c means if(a) b; else c
• bitwise version: (a & b) | (~a & c)
• only works if a is all 0s or all 1s
• !x is 1 if x==0, !!x is 1 if x!=0, so we want a = -!!x (all 1s when x is nonzero)

# ATT x86 assembly

• there are 2 hardwares
• x86-64 (desktop market)
• Arm7 (mobile market) - all instructions are like cmov
• think -> not = ex. mov $3, %rax is 3 into rax • prefixes •$34 - $is an immediate (literal) value • %rax - the contents of register rax • main - label (no prefix) - assembler turns this into an immediate • 3330(%rax,%rdi,8) - memory at (3330+rax+8*rdi) - in x86 but not y86 • you could do 23(%rax) • gcc -S will give you .S file w/ assembly • what would actually be compiled • gcc -c will give you object file • then, objdump -d will dissassemble object file and print assembly • leaves out suffixes that tell you sizes • can’t be compiled, but is informative • gcc -O3 will be fast, default is -O0, -OS optimizes for size • we call the part of x86 we use y86 • registers • general purpose registers (program registers) • PC - program counter - cannot directly change this - what line to run next • CC - condition codes - sign of last math op result • remembers whether answer was 0,-,+ • cmp - basically subtraction (works backwards, 2nd-1st), but only sets the condition codes • in assembly, arithmetic works as += • ex. imul %rdi, %rax multiplies and stores result in rdi • doesn’t really matter: eax, rax are same register but eax is bottom half of rax • on some hardwares eax is faster than rax • call example - PC=0x78 callq 0x56 • PC=0x7d next command (because callq is 5 bytes long, it could be different) • puts 7d on stack to return to at address 0x100 • this address (0x100) is subtracted by number of bytes in address (8) • this value (0x0f8) is put into rsp(in C this is always on the stack) • rsp stores address of the top of the stack • PC becomes 56 • call • movq (next PC), (%rsp) ~PC can’t actually be changed • addq$-8, %rsp
• jmp $0x56 • ret • addq$8, %rsp
• movq (%rsp), (PC) ==same as== jmp (%rsp)
• push
• mov _, (%rsp)
• sub $8, %rsp • pop does the opposite • add$8, %rsp
• movq (%rsp), _ cmp
• cmovle %5, (%rax) - move only if we are in this state

# y86 - all we use

1. halt - stops the chip
2. nop - do nothing
3. op_q
• addq, subq, andq, xorq
• takes 2 registers, stores values in second
• sub is 2nd-1st
4. jxx
• jl, jle, jg, je, jne, jmp
• takes immediate
5. movq longer PC increment because it stores offset, register always comes first (is rA)
• rrmovq (register-register, same as cmov where condition is always)
• irmovq (immediate-register)
• rmmovq
• mrmovq (memory)
6. cmovxx (register-register)
7. call
• takes immediate
• pushes the return address on the stack and jumps to the destination address.
8. ret
• pops a value and jumps to it
9. pushq
• one register
• pushes then decrements
10. popq
• one register
• programmer-visible state
• registers
• program
• rax-r14 (8 registers x86 has 15)
• rsp is the special one
• 64 bit integer (or pointer) - there is no floating point, in x86 floating point is stored in other registers
• other
• program counter (PC), instruction pointer
• 64-bit pointer
• condition codes (CC) - not important, tell us <, =, > on last operation
• only set by the op_q commands
• memory - all one byte array
• instruction
• data
• encoding (assembly -> bytes)
•  1st byte -> high-order nybble low-order nybble
• higher order is opcode (add, sub, …)
• lower-order is either instruction function or flag(le, g, …) - usually 0
• remaining bytes
• argument in little-endian order
• examples
• call $0x123 -> 80 23 01 00 00 00 00 00 00
• ret -> 90
• subq %rcx, %r11 -> 61 1b (there are 15 registers; specify a register with one nybble)
• irmovq $0x3330, %rdi -> 30 f7 30 33 00 00 00 00 00 00 (register byte always comes first after the opcode; f means no source register, destination is register 7)
• compact binary (variable-length encoding) vs. simple binary (fixed-length encoding)
• x86 vs. ARM
• people can’t decide
• compact binary - complex instruction set computer (cisc) - emphasizes programmer
• simple binary - reduced instruction set computer (risc) - emphasizes hardware
• have more complex compilers
• fixed width instructions
• lots of registers
• few memory addressing modes - no memory operands (only mrmov, rmmov)
• few opcodes
• passes parameters in registers, not the stack (usually)
• no condition codes (uses condition operands)
• in general, computers use compact and tablets/phones use simple
• if we can get away from x86 backwards compatibility, we will probably meet in the middle
• study RISC vs. CISC

# hardware

• flows when control is high
• power - everything loses power by creating heat (every gate consumes power)
• changing a transistor takes more power than leaving it
• voltage - threshold above which transistor is open
• register - on rising clock edge store input
• overclock computer - could work, or logic takes too long to get back - things break
• could be fixed with colder, more power
• chips are small because of how fast they are
• mux - selectors pick which input goes through
• out = [ guard:value; … ];
• this is a language called HCL written by the book’s authors
• equivalent to out = g1 ? input1 : g2 ? input2 : … : 0
• if first is true return that, otherwise keep going otherwise return 0

# executing instructions

• we have wires
1. register file
2. data memory
3. instruction memory
4. status output - 3-bits
• ex. popq %rbx
• todo: get icode, check if it was pop, read mem at rsp, put value in rbx, inc rsp by 8
• getting icode
• instruction in instruction memory: B0 3F
• register pP (p is inputs), (P is outputs)
• pP { pc:64 = 0;} - stores the next pc
• pc <- P_pc - the fixed functionality will create i10bytes (the next 10 instruction bytes)
• textbook: icode:ifun = M_1[PC] - gets one byte from PC
• HCL (HCL uses =): wire icode:4; icode = i10bytes[4..8] - little endian values, one byte at a time - this grabs B from B0 3F
• assume icode was b (in reality, this must be picked with a mux)
  valA 	<- R[4] 		// gets %rsp - rsp is 4th register
rA:rB	<- M_1[PC+1]	// book's notation - splits one byte into 2 nybbles: high goes to rA, low to rB; PC+1 because we want the second byte
// 3 is loaded into rA, F is loaded into rB
valE 	<- valA+8		// inc rsp by 8
valM 	<- M_8[valA] 	// send %rsp to
R[rA] 	<- valM			// writes to %rbx
R[4]	<- valE			// writes to %rsp
p_pc	=  P_pc+2		// increment PC	by 2 because popq is 2-byte instruction

• steps
1. fetch - what is wanted
2. decode - find what to do it to - read prog registers
3. execute and/or memory - do it
4. write back - tell you result

020  10                      nop
     fetch: change PC to 0x021
021  63 03                   xorq %rax, %rbx
     fetch: PC <- 0x023
     decode: read reg. file at 0,3 to get 17 and 11
     execute: 17^11 = 10001^01011 = 11010 = 26, also sets CC to >0
     write back: write 26 to regFile[3]
023  50 23 1000000000000000  mrmovq 16(%rbx), %rcx
     (the immediate 16 is 8 bytes in little-endian - first in memory is last in the value)
     fetch: read bytes, understand, PC <- 0x02d
     decode: read regFile to get (26), 13
     execute: 16+26 = 42 (to be the new address)
     memory: ask RAM for address 42, it says 0x0000000071000000 - little-endian
     write back: put 0x71000000 into regFile[2]
02d  71 0000000032651131     jle 0x3111653200000000
     fetch: valP <- 0x036
     decode: cc > 0, so no jump, PC <- valP
036  00                      halt - set STAT to HALT and the computer shuts off (STAT is always going on in the background)

push:
- reads source register
- dec rsp by 8
- writes read value to new rsp address

# hardware wires - opq example
# fetch
# 1. set pc
# 1.a. make a register to store the next PC
register qS {
pc : 64 = 0;
lt : 1 = 0;
eq : 1 = 0;
gt : 1 = 0;
}

# 2. read i10bytes
pc = S_pc;

# 3. parse out pieces of i10bytes
wire icode:4, ifun:4, rA:4, rB:4;
icode = i10bytes[4..8]; # 1st byte: 0..8  high-order nibble 4..8
ifun = i10bytes[0..4];

const OPQ = 6;
const NO_REGISTER = 0xf;

rA = [
icode == OPQ : i10bytes[12..16];
1: NO_REGISTER;
];

rB = [
icode == OPQ : i10bytes[8..12];
1: NO_REGISTER;
];

wire valP : 64;

valP = [
icode == OPQ : S_pc + 2;
1 : S_pc + 1; # picked at random
];

Stat = STAT_HLT; # fix this

# decode

# 1. set srcA and srcB

srcA = rA;
srcB = rB;

dstE = [
icode == OPQ : rB;
1 : NO_REGISTER;
];

# execute

wire valE : 64;

valE = [
icode == OPQ && ifun == 0 : rvalA + rvalB;
];

q_lt = [
icode == OPQ : valE < 0;
# ...
];

# memory

# writeback

wvalE = [
icode == OPQ : valE;
1: 0x1234567890abcdef;
];

# PC update
q_pc = valP;


# pipelining

• nonuniform partitioning - stages don’t all take same amount of time
• register - changes on the clock
• normal - output your input
• bubble - puts “nothing” in registers - register outputs nop
• stall - put output into input (same output)
• see notes on transitions
• stalling a stage (usually because we are waiting for some earlier instruction to complete)
• stall every stage before you
• bubble stage after you so nothing is done with incomplete work - this will propagate
• stages after that are normal
• bubbling a stage - the work being performed should be thrown away
• bubble stage after it
• basically send a nop - use values NO_REGISTER and 0s
• often there are some fields that don’t matter
• stalling a pipeline = stalling a stage
• bubbling a pipeline - bubble all stages
irmovq $31, %rax
addq %rax, %rax
jle ...

• the stages are offset - everything always working
• when you get a jmp, you have to wait for the thing before you to writeback. 2 possible solns
• stall decode
• forward value from the stage it is currently in
• look at online notes - save everything in register that needs to be used later

### problems

1. dependencies - outcome of one instruction depends on the outcome of another - in the software
  1. data - data needed before advancing - destination of one thing is used as source of another
  2. control - which instruction to run depends on what was done before
2. hazards - potential for a dependency to mess up the pipeline - in the hardware design
• hardware may or may not have a hazard for a dependency
• can detect them by comparing the wire that reads / writes to regfile (rA,rB / dstE) - they shouldn’t be the same because you shouldn’t be reading/writing to the same register (except when all NO_REGISTER)

### solutions

• P is PC, then FDEMW
1. stall until it finishes if there's a problem
stall_P = 1;  //stall the fetch/decode stage
bubble_E = 1; //completes and then starts a nop, gives it time to write values
2. forward values (find what will be written somewhere)
• in a 2-stage system, we have dstE and we use it to check if there’s a problem
• usually if we can check that there’s a problem, we have the right answer
• if we have the answer, put value where it should be
• this is difficult, but doesn’t slow down hardware
• we decide whether we can forward based on a pipeline diagram - boxes with stages (time is y axis, but also things are staggered)
• we need to look at when we get the value and when we need it
• we can pipeline if we need the value after we get it
• if we don’t, we need to stall
subq %rax,%rbx
jge bazzle

• we could stall until CC is set in execute of subq to fetch for jge - this is slow
3. speculative execution - we make a guess and sometimes we’ll be right (modern processors are right ~90% of the time)
• branch prediction - process of picking branch
• jumps to a smaller address are taken more often than not - this algorithm and more things make it complicated
• if we were wrong, we need to correct our work
• Example:
1	subq
2	jge 4
3	ir
4	rm

• the stage performing ir is wrong - this stage needs to be bubbled
• in a 5-stage pipeline like this, we only need to bubble decode
• ret - doesn’t know the next address to fetch until after memory stage, also can’t be predicted well
• returns are slow, we just wait
• some large processors will guess, can still make mistakes and will have to correct

### real processors

1. memory is slow and unpredictably slow (10-100 cycles is reasonable)
2. pipelines are generally 10-15 stages (Pentium 4 had 30 stages)
3. multiple functional units
• alu could take different number of cycles for addition/division
• multiple functional units lets one thing wait while another continues sending information down the pipeline
4. out-of-order execution
• compilers look at whether instructions can be done out of order
• it might start them out of order so one can compute in functional unit while another goes through
• can swap operations if the 2nd operation doesn’t depend on the 1st
5. (x86) turn the assembly into another language
• micro ops
• makes things specific to chips
• profiler - software that times software - should be used to see what is taking time in code
• hardware vs. software
• software: .c -> compiler -> assembler -> linker -> executes
• compiler - lexes, parses, O_1, O_2, …
• hardware: register transfer languages (.hcl) -> … -> circuit-level descriptions -> layout (verify this) -> mask -> add silicon in oven -> processor

# memory

• we want highest speed at lowest cost (higher speeds correspond to higher costs)
• fastest to slowest
• register - processor - about 1K
• SRAM - memory (RAM) - about 1M
• DRAM - memory (RAM) - 4-16 GB
• SSD (solid-state drives) - mobile devices
• Disk/Harddrive - filesystem - 500GB
• the first three are volatile - if you turn off power you lose them
• the last three are nonvolatile
• things have gotten a lot faster
• where we put things affects speed
• registers near ALU
• SRAM very close to CPU
• CPU also covered with heat sinks
• DRAM is large - pretty far away (some other things between them)
• locality
• temporal - close in time - the characteristic of code that repeatedly uses the same address
• spatial - close in space - the characteristic of code that uses addresses that are numerically close
• real-time performance is not big-O - a tree can be faster than hash because of locality
• caching - to keep a copy for the purpose of speed
• if we need to read a value, we want to read from SRAM (we call SRAM the cache)
• if we must read from DRAM (we call DRAM the main memory), we usually want to store the value in SRAM
• cache what was wanted most recently (good guess because of temporal locality)
• slower caches are still bigger
• cache nearby bytes to recent accesses (good guess because of spatial locality)
• simplified cache
• 4 GB RAM -> 32-bit address
• 1MB cache
• 64-bit words
• ex. addr: 0x12345678
• simple - use entire cache as one block
• chop off the bottom log(1MB)=20 bits
• send addr = 0x12300000
• fill entire cache with 1MB starting at that address
• tag cache with 0x123
• send value from cache at address offset=0x45678
• if the tag of the address is the same as the tag of the cache, then return from cache
• otherwise redo everything
• slightly better cache
• beginning of address is tag
• middle of address might give you index of block in table = log(num_blocks_in_cache)
• end of address is block offset (offset size=log(block_size))
• a (tag,block) pair is called a line
1. Fully-associative cache set of (tag,block) pairs
• table with first column being tags, second column being blocks
• to read, check tag against every tag in cache
• if found, read from that block
• else, pick a line to evict (this is discussed in operating systems)
• read DRAM into that line, read data from that line’s block
• long sets are slow! - typically 2 or 3 lines would be common
2. Direct-mapped cache - array of (tag,block) pairs
• table with 1st column tags, 2nd column blocks
• to read, check the block at the index given in the address
• if found, read from block
• else, load into that line
• good spatial locality would make the tags adjacent (you read one full block, then the next full block)
• this is faster (like a mux) instead of many = comparisons, typically big (1K-1M lines)
3. Set-associative cache - hybrid - array of sets
• indices each link to a set, each set with multiple elements
• we search through the tags of this set
• look at examples in book, know how to tell if we have hit or miss

### writing

• assume write-back, write-allocate cache
1. load block into cache (if not already) - write-allocate cache
2. change it in cache
3. two options
1. write back - wait until remove the line to update RAM
• line needs to have tag, block, and dirty bit
• dirty = wrote but did not update RAM
2. write through - update RAM now
• no-write-allocate bypasses cache and writes straight to memory if block not already in cache (typically goes with write-through cache)
• valid bits - whether or not the line contains meaningful information
• kinds of misses
1. cold miss - never looked at that line before
• valid bit is 0 or we’ve only loaded other lines into the cache
• associated with tasks that need a lot of memory only once, not much you can do
2. capacity miss - we have n lines in the cache and have read ≥ n other lines since we last read this one
• typically associated with a fully-associative cache
• code has bad temporal locality
3. conflict miss - recently read line with same index but different tag
• typically associated with a direct-mapped cache
• characteristic of the cache more than the code

### cache anatomy

• i-cache - holds instructions
• typically read only
• d-cache - holds program data
• unified cache - holds both
• associativity - number of cache lines per set

# optimization

• only need to worry about locality for accessing memory
• things that are in registers / immediates don’t matter
• compilers are often cautious
• some things can’t be optimized because you could pass in the same pointer twice as an argument
• loop unrolling - often dependencies across iterations

```c
for(int i=0;i<n;i++){ //this line is bookkeeping overhead - want to reduce this
    a[i]+=1; //temporal locality: add then store back to the same address; obviously spatial locality too
}
// unrolled
for(int i=0;i<n-2;i+=3){ //fewer comparisons, can also get benefits from vectorization (complicated)
    a[i]+=1;
    a[i+1]+=1;
    a[i+2]+=1;
}
// handle the 1-2 leftover elements
if(n%3>=1) a[n-1]+=1;
if(n%3>=2) a[n-2]+=1;
```

• n%4 = n&3
• n%8 = n&7

```java
// less error prone, but can be slower: data dependency between lines
// might not allow full parallelism (more stalling)
// (assumes n is a multiple of 3)
for(int i=0;i<n;){
    a[i]+=1; i+=1;
    a[i]+=1; i+=1;
    a[i]+=1; i+=1;
}
```
• loop order
• the order of loops can make this way faster

```c
for(i...)
    for(j...)
        a[i][j]+=1; //two memory accesses
```

• flatten arrays //this is faster, especially if width is a power of 2; end of one row and beginning of next row are spatially local
• row0 then row1 then row2 ….
• float[height*width], access with array[row*width+column]
• problem - sometimes no loop order is good

```c
for(i..)
    for(j...)
        a[i][j]=a[j][i];
```

• solution - blocking
• pick two chunks
• one we read by rows and is happy - only needs one cache line
• one we read by columns - needs blocksize cache lines (by the time we get to the second column, we want all rows to be present in cache)
```c
int bs=8;
for (bx=0;bx<N;bx+=bs) // for each block
    for(by=0;by<N;by+=bs)
        for(x=bx;x<bx+bs;x+=1) // for each element of block
            for(y=by;y<by+bs;y+=1)
                swap(a[x][y],a[y][x]); // do stuff
```

• conditions for blocking
1. the whole thing doesn’t fit in cache
2. there is no spatially local loop order
• block size must be able to fit in cache
• reassociation optimization - compiler will do this for integers, but not floats (because of round-off errors)
• a+b+c+d+e -> this is slower (addition is sequential by default, the sum of the first two is then added to the third number)
• ((a+b)+(c+d))+e -> this can do things in parallel, we have multiple adders (we can see logarithmic performance if the chip can be fully parallel)
• using methods can increase your instruction cache hit rate

# exceptions

• processor is connected to I/O Bridge which is connected to Memory Bus and I/O Bus
• these things are called the motherboard
• we don’t want to wait for I/O Bus
1. Polling - CPU periodically checks if ready
2. Interrupts - CPU asks device to tell the CPU when the device is ready
• this is what is usually done
• CPU has an interrupt pin (interrupt sent by I/O Bridge)
• steps for an interrupt
1. pause my work - save where I can get back to it
2. decide what to do next
3. do the right thing
4. resume my suspended work
• jump table - array of code addresses
• each address points to handler code
• CPU must pause, get index from bus, jump to exception_table[index]
• need the exception table, exception code in memory, need register that tells where the exception table is
• the user’s code should not be able to use exception memory
• memory
• mode register (1-bit): 2 modes
• kernel mode (operating system) - allows all instructions
• user mode - most code, blocks some instructions, blocks some memory
• can't set mode register
• can’t talk to the I/O bus
• largest addresses are kernel only
• some of this holds exception table, exception handlers
• between is user things
• smallest addresses are unused - they are null (people often try to dereference them - we want this to throw an error)
• exceptions
1. interrupts, index: bus, who causes it: I/O Bridge
• i1,i2,i3,i4 -> interrupt during i3 instruction
• let i3 finish (maybe)
• handle interrupt
• resume i4 (or rest of i3)
2. trap, index: %al, who causes it: int assembly instruction (user code)
• trap is between instructions, simple
3. fault, index: based on what failed, who causes it: failing user-mode instruction
• fault during i3
• suspend i3
• handle fault
• rerun i3 (assuming we corrected the fault - ex. allocating new memory), otherwise abort
4. abort - reaction to an exception (usually to a fault) - quits instead of resuming
• suspending
• save PC, program registers, condition codes (put them in a struct in kernel memory)
• on an exception
1. switch to kernel mode
2. suspend program
3. execute exception handler
4. resume in user mode

# processes

• (user) read file -> (kernel) send request to disk, wait, clean up -> (user) resume
• this has lots of waiting so we run another program while we wait (see pic)
• process - code with an address space
• CPU has a register that maps user addresses to physical addresses (memory pointers to each process)
• generally we don't call the kernel a process
• also had pc, prog. registers, cc, etc.
• each process has a pointer to the kernel memory
• also has more (will learn in OS)…
• context switch - changing from one process to another
• generally each core of a computer is running one process
1. freeze one process
2. let OS do some bookkeeping
3. resume another process
• takes time because of bookkeeping and cache misses on the resume
• you can time context switches: while(true) { t = getCurrentTime(); if (t jumped a lot) contextSwitches++; }

• threads are like processes that user code manages, not the kernel
• within one address space, I have 2 stacks
• save/restores registers and stack
• hardware usually has some thread support
• save/restore instructions
• a way to run concurrent threads in parallel
• python threads don’t run in parallel

# system calls

• how user code asks the kernel to do stuff
• exception table - there are 30ish, some free spots for OS to use
• system call - Linux uses exception 128 for almost all user -> kernel requests
• uses rax to decide what you are asking //used in a jump table inside the 128 exception handler
• most return 0 on success
• non-zero on error where the # is errno
• you can write assembly in C code

# software exceptions

• you can throw an exception and then you want to return to something several method calls before you
• nonlocal jump
• change PC
• and registers (all of them)
• try{} - freezes what we need for catch
• catch{} - what is frozen
• throw{} - resume
• hardware exception can freeze state

# signals, setjmps

• exceptions - caused by hardware (mostly), handled by kernel
• signal - caused by kernel, handled by user code (or kernel)
• mimics exception (usually a fault)
• user-defined signal handler known to the OS
• various signals (identified by number)
• implemented with a jump table
• we can mask (ignore) some signals
• turns hardware fault into software exception (ex. divide by 0, access memory that doesn’t exist), this way the user can handle it
• SIGINT (ctrl-c) - usually cancels, can be blocked
• ctrl-c -> interrupt -> handler -> terminal (user code) -> trap (SIGINT is action) -> handler -> sends signal
• SIGKILL - kills, can't be blocked or handled
• SIGSEGV - seg fault
• setjmp/longjmp - caused by user code, handled by user code
• functions in standard C library in setjmp.h
• setjmp saves the current state (PC, stack, registers) into a jmp_buf you pass it; longjmp jumps back to that saved state
• setjmp - succeeds first time (returns 0)
• longjmp - never returns - instead it makes setjmp return again with a different (nonzero) value
• you usually use if(setjmp(buf)==0) {...} else {handle error} - basically try-catch

# virtual memory

• 2 address spaces
1. virtual address space
• used by code
• fixed by ISA designer
2. memory management unit (MMU) - takes in a virtual address and spits out physical address
• page fault - MMU says this virtual address does not have a physical address
• when there’s a page fault, go to exception handler in kernel
• usually we go to disk
3. physical address space (cannot be discovered by code)
• used by memory chips
• constrained by size of RAM
• assume all virtual addresses have a physical address in RAM (this is not true, will come back to this)
• each process has code, globals, heap, shared functions, stack
• lots of unused at bottom, top because few programs use 2^64 bytes
• RAM - we’ll say this includes all caches
• virtual memory is usually mostly empty
• allocated in a few blocks / regions
• MMU
1. bad idea 1: could be a mapping from every virtual address to every physical address, but this wastes a lot
• instead, we split memory into pages (page is a contiguous block of addresses ~ usually 4k)
• bigger = fewer things to map, more likely to include unused addresses
• address = low-order bits: page offset, high-order bits: page number
• page offset takes log_2(page_size)
2. bad idea 2: page table - map from virtual page number -> physical page number
• we put the map in RAM, we have a register (called the PTBR) that tells us where it is
• we change the PTBR for each process
• CPU sends MMU a virtual address
• MMU splits it into a virtual page number and page offset
• takes 2 separate accesses to memory
1. uses register to read out page number from page table
• page table - array of physical page numbers, 2^numbits(virtual page numbers)
• page table actually stores page table entries (PTEs)
• PTE = PPN, read-only?, code or data?, user allowed to see it?
• MMU will check this and fault on error
2. then it sends page number and page offset and gets back data
• lookup address PTBR + VPN*numbytes(PPN)
• consider 32-bit VA, 16k page (too large)
• page offset is 14 bits
• 2^18 PTEs = 256k PTEs
• each PTE could be 4 bytes so the page table takes about 1 Megabyte
• 64-bit VA, 4k page
• 2^52 PTE -> the page table is too big to store
3. good idea: multi-level page table
• virtual address: page offset, multiple virtual page numbers $VPN_0,VPN_1,VPN_2,VPN_3$ (could have different number of these)
1. start by reading highest VPN: PTBR[VPN_3] -> PTE_3
2. read PPN_3[VPN_2] -> PTE_2
3. read PPN_2[VPN_1] -> PTE_1
4. read PPN_1[VPN_0] -> PTE_0
5. physical address = PPN_0 ++ page offset
• check at each level if valid, if unallocated/kernel memory/not usable then fault and stop looking
• the highest-level VPN comes from the highest bits of the address, likely that region is unused
• therefore we don’t have to check the other addresses
• they don’t exist so we save space, only create page tables when we need them - OS does this
• look at these in the textbook
• virtual memory ends up looking like a tree
• top table points to several tables which each point to more tables
• TLB - maps from virtual page numbers to physical page numbers
• TLB vs L1, L2, etc:
• Similarities
• They are all caches- i.e., they have an index and a tag and a valid bit (and sets)
• Differences
• TLB has a 0-bit BO (i.e., 1 entry per block; lg(1) = 0)
• TLB is not writable (hence not write-back or write-through, no dirty bit)
• TLB entries are PPN, L* entries are bytes
• TLB does VPN → PPN; the L* do PA → data

# overview

• CPU -> creates virtual address
• virtual address: 36 bits (VPN), 12 bits (PO) //other bits are disregarded
• VPN broken into 32 bits (Tag), 4 bits (set index)
• set index tells us which set in the Translation Lookaside Buffer to look at
• there are 2^4 sets in the TLB
• currently there are 4 entries per set ~ this could be different
• each entry has a valid bit
• a tag - same length as VP Tag
• value - normally called block - but here only contains one Physical page number - PPN = 40 bits ~ this could be different
• on a context switch (the PTBR changes), the TLB must be flushed / reloaded

# segments

• memory block
• kernel at top
• stack (grows down)
• shared code
• heap (grows up)
• empty at bottom

base: 0x60

read: 0xb6 val at 0x6b -> 0x3d val at 0xd6 -> ans

read: 0xa4 val at 0x6a -> 0x53 val at 0x34 -> ans

0xb3a6

read: 0xb3 val at 0x6b -> 0x3d val at 0xd3 -> 0x0f val at 0xfa -> 0x6b val at 0xb6


# quiz rvw

• commands
• floats
• labs
• In method main you declare an int variable named x. The compiler might place that variable in a register, or it could be in which region of memory? - Stack
• round to even is default
• Which Y86-64 command moves the program counter to a runtime-computed address? - ret
• [] mux defaults to 0
• caller-save register - caller must save them to preserve them
• callee-saved registers - callee must save them to edit them
• in the sequential y86 architecture valA<-eax
• valM is read out of memory - used in ret, mrmovl
• labels are turned into addresses when we assemble files
• accessing memory is slow
• most negative binary number: 100…0 (sign bit set, all other bits 0)
• floats can represent fewer numbers than unsigned ints
• 0s and -0s are same
• NaN doesn’t count
• push/pop - sub/add to %rsp, put value into (%rsp)
• opl is 32-bit, opq is 64-bit
• fetch determines what the next PC will be
• fetch reads rA,rB,icode,ifun - decode reads values from these

# labs

### strlen

```c
unsigned int strlen( const char * s ){
    unsigned int i = 0;
    while(s[i])
        i++;
    return i;
}
```


### strsep

```c
char *strsep( char **stringp, char delim ){
    char *ans = *stringp;
    if (*stringp == 0)
        return 0;
    while (**stringp != delim && **stringp != 0) /* don't need this 0 check, 0 is same as '\0' */
        *stringp += 1;
    if (**stringp == delim){
        **stringp = 0;
        *stringp += 1;
    }
    else
        *stringp = 0;
    return ans;
}
```


### lists

• always test after malloc
• singly-linked list: node* { TYPE payload, struct node *next }
• length: while(list) list = (*list).next
• allocate: malloc(sizeof(node)*length); head[i].next = (i == length-1) ? 0 : (head+i+1)
• access: (*list).payload or list[i].payload (for accessing)
• array: TYPE*
• length: while(list[i] != sentinel)
• allocate: malloc(sizeof(TYPE) * (length+1));
• access: list[i]
• range: { unsigned int length, TYPE *ptr }
• length: list.length
• allocate: list.ptr = malloc(sizeof(TYPE) * length); ans.length = length;
• access: list.ptr[i]

### bit puzzles

```c
// leastBitPos - return a mask that marks the position of the least significant 1 bit
int leastBitPos(int x) {
    return x & (~x+1);
}

int bitMask(int highbit, int lowbit) {
    int zeros = ~1 << highbit; /* 1100 0000 */
    int ones = ~0 << lowbit;   /* 1111 1000 */
    return ~zeros & ones;      /* 0011 1000 */
}

/* satAdd - adds two numbers, but when positive overflow occurs, returns the maximum
   possible value, and when negative overflow occurs, returns the minimum (most negative) value. */
// soln - overflow when the operands have the same sign and the sum and operands have different signs
int satAdd(int x, int y) {
    int x_is_neg = x >> 31;
    int y_is_neg = y >> 31;
    int sum = x + y;
    int same_sign = (x_is_neg & y_is_neg  |  ~x_is_neg & ~y_is_neg);
    int overflow = same_sign & (x_is_neg ^ (sum >> 31));
    int pos_overflow = overflow & ~x_is_neg;
    int neg = 0x1 << 31;
    int ans = ~overflow&sum | overflow & (pos_overflow&~neg | ~pos_overflow&neg);
    return ans;
}
```


### ch 1 (1.7, 1.9)

• files are stored as bytes, most in ascii
• all files are either text files or binary files
• i/o devices are connected to the bus by a controller or adapter
• processor holds PC, main memory holds program
• os-layer between hardware and applications - protects hardware and unites different types of hardware
• concurrent - instructions of one process are interleaved with another
• does a context switch to switch between processes
• concurrency - general concept of multiple simultaneous activities
• parallelism - use of concurrency to make a system faster
• virtual memory-abstraction that provides each process with illusion of full main memory
• memory - code-data-heap-shared libraries-stack
• threads allow us to have multiple control flows at the same time - switching
• multicore processor: either has multicore or is hyperthreaded (one CPU, repeated parts)
• processors can do several instructions per clock cycle
• Single-Instruction, Multiple-Data (SIMD) - ex. add four floats

### ch 2 (2.1, 2.4.2, 2.4.4)

• floating points (float, double)
• sign bit (1)
• exponent-field (8, 11)
• bias = 2^(k-1)-1 ex. 127
• normalized exponent = exp-Bias, mantissa = 1.mantissa
• denormalized: exp - all 0s
• exponent = 1-Bias, mantissa without the implicit leading 1 - represents 0 and very small values
• exp: all 1s
• infinity (if mantissa 0)
• NaN otherwise
• mantissa
• rounding
1. round-to-even - if halfway go to closest even number - avoids statistical bias
2. round-toward-zero
3. round-down
4. round-up
• leading 0 specifies octal
• leading 0x specifies hex
• leading 0b specifies binary

### ch 3 (3.6, 3.7)

• computers execute machine code
• intel processors are all back-compatible
• ISA - instruction set architecture
• control - condition codes are set by arithmetic/logical instructions (1-bit registers)
1. Zero Flag - recent operation yielded 0
2. Carry Flag - yielded carry
3. Sign Flag - yielded negative
4. Overflow Flag - had overflow (pos or neg)
• guarded-do - an initial test guards a do-while loop, skipping the body if the condition is false the first time
• instruction src, destination
• parentheses dereference a pointer
• there is a different add command for 16-bit operands than for 64-bit operands
• all instructions change the program counter
• the call instruction also changes the stack pointer (it pushes the return address)

### 4.1,4.2

• eight registers
• esp is stack pointer
• CC and PC
• 4-byte values are little-endian
• status code State
• 1 AOK
• 2 HLT
• 3 ADR - seg fault
• 4 INS - invalid instruction code
• lines starting with “.” are assembler directives
• assembly code is assembled resulting in just addresses and instruction codes
• pushl %esp - pushes the old value of %esp (from before the decrement)
• popl %esp - %esp gets the value read from memory (not the incremented pointer)
• high voltage = 1
• digital system components
1. logic
2. memory elements
3. clock signals
• mux - picks a value and lets it through
• int Out = [ s: A; 1: B; ];
• B is the default
• combinational circuit - many bits as input simultaneously
• ALU - three inputs, A, B, func
• clocked registers store individual bits or words
• RAM stores several words and uses address to retrieve them
• stored in register file

### 4.3.1-4

• SEQ - sequential processor
• stages
• fetch
• read icode,ifun <- byte 1
• maybe read rA, rB <- byte 2
• maybe read valC <- 8 bytes
• decode
• read operands usually from rA, rB - sometimes from %esp
• call these valA, valB
• execute
• adds something, called valE
• for jmp tests condition codes
• memory
• reads something from memory called valM or writes to memory
• write back
• writes up to two results to regfile
• PC update
• popl reads two copies so that it can increment before updating the stack pointer
• components: combinational logic, clocked registers (the program counter and condition code register), and random-access memories
• reading from RAM is fast
• only have to consider PC, CC, writing to data memory, regfile
• processor never needs to read back the state updated by an instruction in order to complete the processing of this instruction.
• based on icode, we can compute three 1-bit signals :
1. instr_valid: Does this byte correspond to a legal Y86 instruction? This signal is used to detect an illegal instruction.
2. need_regids: Does this instruction include a register specifier byte?
3. need_valC: Does this instruction include a constant word?

### 4.4 pipelining

• the task to be performed is divided into a series of discrete stages
• increases the throughput - # customers served per unit time
• might increase latency - time required to service an individual customer.
• when pipelining, have to add time for each stage to write to register
• time is limited by slowest stage
• more stages has diminishing returns for throughput because there is constant time for saving into registers
• latency increases with stages
• throughput approaches 1/(register time)
• we need to deal with dependencies between the stages

### 4.5.3, 4.5.8

• several copies of values such as valC, srcA
• pipeline registers D, E, M, W - lowercase letter (d, e, m, w) is the register input, uppercase is the output
• we try to keep all the info of one instruction within a stage
• merge signals for valP in call and valP in jmp as valA
• load/use hazard - (try using before loaded) one instruction reads a value from memory while the next instruction needs this value as a source operand
• we can stop this by stalling and forwarding (the use of a stall here is called a load interlock)

### 5 - optimization

• eliminate unnecessary calls, tests, memory references
• instruction-level parallelism
• profilers - tools that measure the performance of different parts of the program
• critical paths - chains of data dependencies that form during repeated executions of a loop
• compilers can only apply safe operations
• watch out for memory aliasing - two pointers designating same memory location
• functions can have side effects - calling them multiple times can have different results
• small boost from replacing function call by body of function (although this can be optimized in compiler sometimes)
• measure performance with CPE - cycles per element
• reduce procedure calls (ex. length in for loop check)
• loop unrolling - increase number of elements computed on each iteration
• enhance parallelism
• multiple accumulators
• limiting factors
• register spilling - when we run out of registers, values stored on stack
• branch prediction - has misprediction penalties, but these are uncommon
• ternary operator could make things faster
• understand memory performance
• using macros lets compiler optimize more, lessens bookkeeping

### 6.1.1, 6.2, 6.3

• SRAM is bistable as long as power is on - will fall into one of 2 positions
• DRAM loses its value ~10-100 ms
• memory controller sends row,col (i,j) to DRAM and DRAM sends back contents
• matrix organization reduces number of inputs, but slower because must use 2 steps to load row then column
• memory modules
• enhanced DRAMS
• nonvolatile memory
• ROM - read-only memories - firmware
• accessing main memory
• buses - collection of parallel wires that carry address, data, control
• locality
• locality of references to program data
• visiting things sequentially (like looping through array) - stride-1 reference pattern or sequential reference pattern
• locality of instruction fetches
• like in a loop, the same instructions are repeated
• memory hierarchy
• block-sizes for caching can differ between different levels
• when accessing memory from cache, we either get cache hit or cache miss
• if we miss we replace or evict a block
• can use random replacement or least-recently used
• cold cache - cold misses / compulsory misses - when cache is empty
• need a placement policy for level k+1 -> k (could be something like put block i into i mod 4)
• conflict miss - miss because placement policy gets rid of block you need - ex. block 0 then 8 then 0 with above placement policy
• capacity misses - the cache just can’t hold enough

### 6.4, 6.5 - cache memories & writing cache-friendly code

• Miss rate. The fraction of memory references during the execution of a program, or a part of a program, that miss. It is computed as #misses/#references.
• Hit rate. The fraction of memory references that hit. It is computed as 1 − miss rate.
• Hit time. The time to deliver a word in the cache to the CPU, including the time for set selection, line identification, and word selection. Hit time is on the order of several clock cycles for L1 caches.
• Miss penalty. Any additional time required because of a miss. The penalty for L1 misses served from L2 is on the order of 10 cycles; from L3, 40 cycles; and from main memory, 100 cycles.
• Traditionally, high-performance systems that pushed the clock rates would opt for smaller associativity for L1 caches (where the miss penalty is only a few cycles) and a higher degree of associativity for the lower levels, where the miss penalty is higher
• In general, caches further down the hierarchy are more likely to use write-back than write-through

### 8.1 Exceptions

• exceptions - partly hardware, partly OS
• when an event occurs, indirect procedure call (the exception) through a jump table called exception table to OS subroutine (exception handler).
• three possibilities
• returns to I_curr
• returns to I_next
• program aborts
• exception table - entry k contains address for handler code for exception k
• processor pushes address, some additional state
• four classes
1. interrupts
• signal from I/O device, Async, return next instruction
2. traps
• intentional exception (interface for making system calls), Sync, return next
3. faults
• potentially recoverable error (ex. page fault exception), Sync, might return curr
4. aborts
• nonrecoverable error, Sync, never returns
• examples
• general protection fault - seg fault
• machine check - fatal hardware error

### 8.2 Processes

• process - instance of program in execution
• every program runs in the context of some process (context has code, data, stack, pc, etc.)
1. logic control flow - like we have exclusive use of processor
• processes execute partially and then are preempted (temporarily suspended)
• concurrency/multitasking/time slicing - if things trade off
• parallel - concurrent and on separate things
• kernel uses context switches
2. private address space - like we have exclusive use of memory
• each process has stack, shared libraries, heap, executable

### 8.3 System Call Error Handling

• system level calls return -1, set the global integer variable errno
• this should be checked for

### 9-9.5 Virtual Memory

• address translation - converts virtual to physical address
• translated by the MMU
• VM partitions virtual memory into fixed-size blocks called virtual pages partitioned into three sets
1. unallocated
2. cached
3. uncached
• virtual pages tend to be large because the miss penalty (a disk access) is large
• DRAM will be fully associative, write-back
• each process has a page table - maps virtual pages to physical pages
• managed by OS
• has PTEs
• PTE - valid bit, n-bit address field
• valid bit - whether it's currently cached in DRAM
• if yes, address is the start of corresponding physical page
• if valid bit not set && null address - has not been allocated
• if valid bit not set && real address - points to start of virtual page on disk
• PTE - 3 permission bits
• SUP - does it need to be in kernel (supervisor) mode?
• WRITE - write access
• page fault - DRAM cache miss
• read valid bit is not set - triggers handler in kernel
• demand paging - waiting until a miss occurs to swap in a page
• malloc creates room on disk
• thrashing - not good locality - pages are swapped in and out continuously
• virtual address space is typically larger
• multiple virtual pages can be mapped to the same shared physical page (ex. everything points to printf)
• VM simplifies many things
• each process follows same basic format for its memory image
• loading executables / shared object files
• sharing
• easier to communicate with OS
• memory allocation
• physical pages don’t have to be contiguous
• memory protection
• private memories are easily isolated

### 9.6 Address Translation

• low order 4 bits serve 2,3 - fault 8c: 1000 1100 b6
• operating systems

# 1 - introduction

## 1.1 what operating systems do

• computer system - hierarchical approach = layered approach
1. hardware
2. operating system
3. application programs
4. users
• views
1. user view - OS maximizes work user is performing
2. system view
• os allocates resources - CPU time, memory, file-storage, I/O
• os is a control program - manages other programs to prevent errors
• program types
1. os is the kernel - one program always running on the computer
• only kernel can access resources provided by hardware
2. system programs - associated with OS but not in kernel
3. application programs
• middleware - set of software frameworks that provide additional services to application developers

## 1.2 computer-system organization

• when computer is booted, needs bootstrap program
• initializes things then loads OS
• also launches system processes
• ex. Unix launches “init”
• events
• hardware signals with interrupt
• software signals with system call
• interrupt vector holds addresses for all types of interrupts
• have to save address of interrupted instruction
• memory
• von Neumann architecture - uses instruction register
• main memory is RAM
• volatile - lost when power off
• secondary storage is non-volatile (ex. hard disk)
• ROM is unwriteable so static programs like bootstrap are ROM
• access
1. uniform memory access (UMA)
2. non-uniform memory access (NUMA)
• I/O
• device driver for I/O devices
• direct memory access (DMA) - transfers entire blocks of data w/out CPU intervention
• otherwise device controller must move data to its local buffer and return pointer to that
• multiprocessor systems
1. increased throughput
2. economies of scale (costwise)
3. increased reliability (fault tolerant)

## 1.3 computer-system architecture

1. single-processor system - one main cpu
• usually have special-purpose processors (e.g. keyboard controller)
2. multi-processor system / multicore system
• multicore means multi-processor on same chip
• multicore is generally faster
• multiple processors in close communication
• increased throughput
• economy of scale
• increased reliability = graceful degradation = fault tolerant
• types
1. asymmetric multiprocessing - boss processor controls the system
2. symmetric multiprocessing (SMP) - each processor performs all tasks
• more common
3. blade server - multiple independent multiprocessor systems in same chassis
3. clustered system - multiple loosely coupled cpus
• types
1. asymmetric clustering - one machine runs while other monitors it (hot-standby mode)
2. symmetric clustering - both run something
• parallel clusters
• require distributed lock manager to stop conflicting parallel operations
• can share same data via storage-area-networks
• beowulf cluster - use ordinary PCs to make cluster

## 1.4 operating-system structure

• multiprogramming - increases CPU utilization so CPU is always doing something
• keeps job pool ready on disk
• time sharing / multitasking - multiple jobs switch so fast that both can be interacted with
• requires an interactive computer system
• process - program loaded into memory
• scheduling
• job scheduling - picking jobs from job pool (disk -> memory)
• CPU scheduling - what to run first (memory -> cpu)
• memory
• processes are swapped from main memory to disk
• virtual memory allows for execution of process not in memory

## 1.5 operating-system operations

• trap / exception - software-generated interrupt
• user-mode and kernel mode (also called system mode)
• when in kernel mode, mode bit is 0
• separate mode for virtual machine manager (VMM)
• this is built into hardware
• kernel can use a timer to avoid getting stuck in user mode

## 1.6 process management

• program is passive, process is active
• process needs resources
• process is unit of work
• single-threaded process has one program counter

## 1.7 memory management

• cpu can only directly read from main memory
• computers must keep several programs in memory
• hardware design is important

## 1.8 storage management

• defines file as logical storage unit
• most programs stored on disk until loaded
• in addition to secondary storage, there is tertiary storage (like DVDs)
• caching - save frequent items on faster things
• cache coherency - make sure cached copies are properly updated with parallel processes

## 1.9 protection & security

• process can execute only within its address space
• protection - controlling access to resources
• security - defends a system from attacks
• maintain list of user IDs and group IDs
• can temporarily escalate privileges to an effective UID - setuid command

## 1.10 basic data structures

• bitmap - string of n binary digits

## 1.11 computing environments

• network computers - are essentially terminals that understand web-based computing
• distributed system - shares resources among separate computer systems
• network - communication path between two or more computers
• TCP/IP is most common network protocol
• networks
• PAN - personal-area network (like bluetooth)
• LAN - local-area network connects computers within a room, building, or campus
• WAN - wide-area network
• MAN - metropolitan-area network
• network OS provides features like file sharing across the network
• distributed OS provides less autonomy - makes it feel like one OS controls entire network
• client-server computing
1. compute-server - performs actions for user
2. file-server - stores files
• peer-to-peer computing
1. all clients w/ central lookup service, ex. Napster
2. no centralized lookup service
• uses discovery protocol - puts out request and other peer must respond
• virtualization - allows OS to run within another OS
• interpretation - run programs as non-native code (ex. java runs on JVM)
• BASIC can be compiled or interpreted
• cloud-computing - computing, storage, and applications as a service across a network
• public cloud
• private cloud
• hybrid cloud
• software as a service (SAAS)
• platform as a service (PAAS)
• infrastructure as a service (IAAS)
• cloud is behind a firewall, can only make requests to it
• embedded systems - like microwaves / robots
• have real-time OS - fixed time constraints

# 2 - OS Structures

## 2.1 os services

• for the user
• user interface - command-line interface and graphical user interface
• program execution - load a program and run it
• I/O operations - file or device
• File-system manipulation
• communications - between processes / computer systems
• error detection
• for system operation
• resource allocation
• accounting - keeping stats on users / processes
• protection / security

## 2.2 user and os interface

1. command interpreter = shell - gets and executes next user-specified command
• could contain the code to execute the command itself
• more often, executes separate system programs, such as “rm”
2. GUI

## 2.3 system calls

• system calls - provide an interface to os services
• API usually wraps system calls (ex. java)
• libc - provided by Linux/Mac OS for C
• system-call interface links API calls to system calls
• passing parameters
1. pass parameters in registers
2. parameters stored in block of memory and address passed in register
3. parameters pushed onto stack

## 2.4 system call types

1. process control - halting, ending
• lock shared data - no other process can access until released
2. file manipulation
3. device manipulation
• similar to file manipulation
4. information maintenance - time, date, dump()
• single step - a CPU mode that throws a trap after every instruction, so a debugger can step through a program
5. communications
1. message-passing model
• each computer has host name and network identifier (IP address)
• each process has process name
• daemons - system programs for receiving connections (like servers waiting for a client)
2. shared-memory model
6. protection

## 2.5 system programs

• system programs = system utilities
• some provide interfaces for system calls
• other uses
1. file management
2. status info
3. file modification
4. programming-language support
5. communications
6. background services

## 2.6 os design and implementation

• mechanism - how to do something
• want this to be general so only certain parameters change
• policy - what will be done
• os mostly in C, low-level kernel in assembly
• high-level is easier to port but slower

## 2.7 os structure

• want modules but current models aren’t very modularized
• monolithic system has performance advantages - very little overhead
• in practice everything is a hybrid
• system can be modularized with a layered approach
• layers: hardware, …, user interface
• easy to construct and debug
• hard to define layers, less efficient
• microkernel approach - used in the Mach OS
• move nonessential kernel components to system / user-level
• smaller kernel, everything communicates with message passing
• makes extending os easier, but slower functions due to system overhead
• loadable kernel modules
• more flexible - kernel modules can change
• examples (see pics)

## 2.8 os debugging

• errors are written to log file and core dump (memory snapshot) is written to file
• if kernel crashes, must save its dump to a special area
• performance tuning - removing bottlenecks
• trace listings - log of interesting events with times / parameters
• Solaris DTrace is a tool to debug and tune the os
• profiling - periodically samples instruction pointer to determine which code is being executed

## 2.9 generation

• system generation - configuring os on a computer
• usually on a CD-ROM
• lots of things must be determined (like what CPU to use)

## 2.10 system boot

• bootstrap program

# 3 - processes

## 3.1 process concept

• process - program in execution
• batch system executes jobs = processes
• time-shared system has user programs or tasks
• program is passive while process is active
• parts
• program code - text section
• program counter
• registers
• stack
• data section
• heap
• same program can have many processes
• process can be execution environment for other code (ex. JVM)
• process state
• new
• ready
• running
• waiting
• terminated
• process control block (PCB) = task control block - repository for any info that varies process to process
• process state
• program counter
• CPU registers
• CPU-scheduling information
• memory-management information
• accounting information
• I/O status information
• could include information for each thread
• parent - process that created another process

## 3.2 process scheduling

• process scheduler - selects available process for multi-tasking
• processes begin in job queue
• processes that are ready and waiting are in the ready queue until they are dispatched - usually stored as a linked list
• lots of things can happen here (fig 3_6)
• ex. make I/O request and go to I/O queue
• I/O-bound process - spends more time doing I/O
• CPU-bound process - spends more time doing computations
• each device has a list of process waiting in its device queue
• scheduler - selects processes from queues
• long-term scheduler - selects from processes on disk to load into memory
• controls the degree of multiprogramming = number of processes in memory
• has much more time than short-term scheduler
• want good mix of I/O-bound and CPU-bound processes
• sometimes this doesn’t exist
• short-term / CPU scheduler - selects from processes ready to execute and allocates CPU to one of them
• sometimes medium-term scheduler
• does swapping - remove a process from memory and later reintroduce it
• context switch - occurs when switching processes
• when interrupt occurs, kernel saves context of old process and loads saved context of new process
• context is in the PCB
• might be more or less work depending on hardware

## 3.3 operations on processes

• usually each process has unique process identifier (pid)
• linux everything starts with init process (pid=1)
• restricting a child process to a subset of the parent’s resources prevents system overload
• parent may pass along initialization data
• after creating new process
1. parent continues to execute concurrently with children
2. parent waits until some or all of its children terminate
• two address-space possibilities for the new process:
1. child is duplicate of parent (it has the same program and data as the parent).
2. child loads new program
• forking
• when fork() is called, both processes continue execution, but fork() returns 0 in the child and the child’s pid (nonzero) in the parent
• child is a copy of the parent
• after fork, usually one process calls exec() to load binary file into memory
• overrides program, doesn’t return unless error occurs
• parent can call wait() until child finishes (moves itself off ready queue until child finishes)
• on Windows, uses CreateProcess() which requires loading a new program rather than sharing address space
• STARTUPINFO - specifies properties of the new process, such as window size and handles for standard I/O
• PROCESS_INFORMATION - receives a handle and the identifiers of the newly created process
• process termination
• exit() kills process (return in main calls exit)
• process can return status value
• parent can terminate child if it knows its pid
• cascading termination - if parent dies, its children die
• zombie process - terminated but parent hasn’t called wait() yet
• remains because parent wants to know what exit status was
• if parent terminates without wait(), orphan child is assigned init as new parent (init periodically invokes wait())
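The fork()/exec()/wait() pattern above, as a minimal runnable sketch; `echo` is just an assumed example program for the child to load:

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* fork a child that execs `echo`, wait for it, return its exit status */
int spawn_and_wait(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;                          /* fork failed */
    if (pid == 0) {                         /* child: fork() returned 0 */
        execlp("echo", "echo", "hello from the child", (char *)NULL);
        _exit(127);                         /* only reached if exec fails */
    }
    int status;                             /* parent: pid is the child's pid */
    waitpid(pid, &status, 0);               /* block until the child terminates */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

If the parent skipped the waitpid() call, the child would sit as a zombie until the parent exits and init adopts it.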

## 3.4 interprocess communication

• process cooperation
• information sharing
• computation speedup
• modularity
• convenience
• interprocess communication (IPC) - allows exchange of data and info
1. shared memory - shared region of memory is established
• one process establishes region
• other process must attach to it (OS must allow this)
• less overhead (no system calls)
• suffers from cache coherency
• ex. producer consumer
• producer fills buffer and consumer empties it
• unbounded buffer - producer can keep producing indefinitely
• bounded buffer - consumer waits if empty, producer waits if full
• in points to next free position
• out points to first full position
2. message passing - messages between coordinating processes
• useful for smaller data
• easier in a distributed system
1. direct or indirect communication
• direct requires knowing the id of process to send / receive
• can be asymmetrical - need to know id of process to send to, but not receive from
• results in hard-coding
• indirect - messages are sent / received from mailboxes
• more flexible, can send message to whoever shares mailbox
• mailbox owned by process - owner receives those messages
• mailbox owned by os - unclear
2. synchronous or asynchronous communication
• synchronous = blocking
• when both send and receive are blocking = rendezvous
3. automatic or explicit buffering
• queue for messages can have 3 implementations
1. zero capacity (must be blocking)
2. bounded capacity
3. unbounded capacity
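The in/out pointer scheme above as a sketch: non-blocking variants that return 0 where a real producer/consumer would wait, with the ring holding at most BUFFER_SIZE - 1 items (full when (in + 1) % BUFFER_SIZE == out):

```c
#define BUFFER_SIZE 4          /* ring holds at most BUFFER_SIZE - 1 items */

int buffer[BUFFER_SIZE];
int in = 0;                    /* next free position (producer writes here) */
int out = 0;                   /* first full position (consumer reads here) */

int try_produce(int item) {
    if ((in + 1) % BUFFER_SIZE == out)
        return 0;              /* full: a real producer would wait here */
    buffer[in] = item;
    in = (in + 1) % BUFFER_SIZE;
    return 1;
}

int try_consume(int *item) {
    if (in == out)
        return 0;              /* empty: a real consumer would wait here */
    *item = buffer[out];
    out = (out + 1) % BUFFER_SIZE;
    return 1;
}
```

With BUFFER_SIZE = 4, three produces succeed, a fourth reports full, and items come back out in FIFO order.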

## 3.5 examples of IPC systems

1. POSIX - shared memory
2. Mach - message passing
3. Windows - shared memory for message passing

## 3.6 communication in client-server systems

1. sockets - endpoint for communication
• IP address + port number
• connecting
1. server listens on a port
2. client creates socket and requests connection to server’s port
3. server accepts connection (then usually writes data to socket)
• all ports below 1024 are well known
• connection-oriented=TCP
• connectionless = UDP
• special IP address 127.0.0.1 - loopback - refers to itself
• sockets are low-level - can only send unstructured bytes
2. remote procedure calls (RPCs) - remote message-based communication
• like IPC, but between different computers
• message addressed to an RPC daemon listening to a port
• messages are well-structured
• specifies a port - a number included at the start of a message packet
• system has many ports to differentiate different services
• uses stubs to hide details
• they marshal the parameters
• might have to convert data into external data representation (XDR) (to avoid issues like big-endian vs. little-endian)
• must make sure each message is acted on exactly once
• client must know port
1. binding info (port numbers) may be predetermined and unchangeable
2. binding can be dynamic with rendezvous daemon (matchmaker) on a fixed RPC port
3. pipes - conduit allowing 2 processes to communicate
• four issues
1. bidirectional?
2. full duplex (data can travel in both directions at same time?) or half duplex (only one way)?
3. parent-child relationship?
4. communicate over a network?
• ordinary pipe - write at one end, read at the other
• unix function pipe(int fd[])
• fd[0] is read-end and fd[1] is write-end
• only exists while a child and parent process are communicating
• therefore only on same machine
• parent and child should both close unused ends of the pipe
• on windows, called anonymous pipes
• requires security attributes
• named pipe - can be bidirectional
• called FIFOs in Unix
• only half-duplex, requires same machine
• Windows - full-duplex and can be different machines
• many processes can use them
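A minimal sketch of the ordinary-pipe pattern above: write at one end, read at the other, with each process closing its unused end:

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* child writes a message into the pipe; parent reads it into buf */
int pipe_roundtrip(char *buf, size_t len) {
    int fd[2];                      /* fd[0] = read end, fd[1] = write end */
    if (pipe(fd) == -1)
        return -1;
    pid_t pid = fork();
    if (pid == 0) {                 /* child: writer */
        close(fd[0]);               /* close unused read end */
        write(fd[1], "hello", 6);   /* 6 bytes: includes the '\0' */
        close(fd[1]);
        _exit(0);
    }
    close(fd[1]);                   /* parent: close unused write end */
    read(fd[0], buf, len);
    close(fd[0]);
    wait(NULL);                     /* reap the child */
    return 0;
}
```

Closing the unused ends matters: if the parent left fd[1] open, a reader would never see end-of-file on the pipe.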

# 4 - threads

• thread - basic unit of CPU utilization
1. program counter
2. register set
3. stack
• making a thread is quicker and less resource-intensive than making a process
• used in RPC and kernels
• benefits
1. responsiveness
2. resource sharing
3. economy
4. scalability

## 4.2 - multicore programming (skipped)

• amdahl’s law: $speedup \leq \frac{1}{S+(1-S)/N_{cores}}$
• S is serial portion
• parallelism
• data parallelism - distributing subsets of data across cores and performing same operation on each core
• task parallelism - distributing tasks across cores
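Amdahl's law from above as a helper; for example, a 25% serial portion on 4 cores caps speedup at about 2.29x:

```c
/* upper bound on speedup from Amdahl's law: 1 / (S + (1 - S) / N) */
double amdahl(double s, int n) {
    return 1.0 / (s + (1.0 - s) / n);
}
/* e.g. amdahl(0.25, 4) = 1 / (0.25 + 0.75/4) ≈ 2.2857 */
```

Note the limit as n grows: with s = 0.25 the speedup can never exceed 4, no matter how many cores are added.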

## 4.3 - multithreading models

• need relationship between user threads and kernel threads
1. many-to-one model - maps user-level threads to one kernel thread
• can’t be parallel on multicore systems
• ex. used by Green threads
2. one-to-one model
• small overhead for creating each thread
• used by Linux and Windows
3. many-to-many model
• multiplexes many user threads onto a smaller or equal number of kernel threads
• two-level model mixes a one-to-one model and a many-to-many model

## 4.4 - thread libraries

• thread library - provides programmer with an API for creating/managing threads
• asynchronous v. synchronous threading

1 - POSIX Pthreads

/* get the default attributes */
pthread_attr_init(&attr);
/* create the thread */
pthread_create(&tid, &attr, runner, argv[1]);  // runner is a func to call
/* wait for the thread to exit */
pthread_join(tid, NULL);

• shared data is declared globally
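The Pthreads pattern in this section, filled out into a self-contained sketch; the runner body summing 1..upper is an assumed example:

```c
#include <pthread.h>

int sum;                       /* shared data is declared globally */

/* the thread body; summing 1..upper is just an assumed example */
void *runner(void *param) {
    int upper = *(int *)param;
    sum = 0;
    for (int i = 1; i <= upper; i++)
        sum += i;
    return NULL;
}

/* create one thread running `runner`, wait for it, return the result */
int run_summer(int upper) {
    pthread_t tid;
    pthread_attr_t attr;

    pthread_attr_init(&attr);                     /* get the default attributes */
    pthread_create(&tid, &attr, runner, &upper);  /* create the thread */
    pthread_join(tid, NULL);                      /* wait for the thread to exit */
    return sum;
}
```

run_summer(10) returns 55; the join is what makes reading the shared `sum` safe here.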

2 - Windows

3 - Java - uses Runnable interface

## 4.5 - implicit threading (skipped)

• implicit threading - handle threading in run-time libraries and compilers
1. thread pool - number of threads at startup that sit in a pool and wait for work
2. OpenMP - set of compiler directives / API for parallel programming
• identifies parallel regions
• uses #pragma
3. Grand central dispatch - extends C
• uses dispatch queue

## 4.6 - threading issues

• fork() must decide whether to duplicate all threads or only the calling one; exec() replaces the entire process, all threads included
• signal notifies a process that a particular event has occurred
1. has a default signal handler
2. user-defined signal handler
• delivering a signal to a process: kill(pid_t pid, int signal)
• delivering a signal to a thread: pthread_kill(pthread_t tid, int signal)
• thread cancellation - terminating target thread before it has completed
1. asynchronous cancellation - one thread immediately terminates target thread
2. deferred cancellation - target thread periodically checks whether it should terminate
• pthreads uses deferred cancellation by default
• cancellation occurs only when thread reaches cancellation point
• thread-local storage - when threads need separate copies of data
• lightweight process = LWP - between user thread and kernel thread
• scheduler activation - kernel provides application with LWPs
• upcall - kernel informs application about certain events

## 4.7 - linux (skipped)

• linux process / thread are same = task
• uses clone() system call

# 5 - process synchronization

• cooperating process can affect or be affected by other executing processes
• ex. consumer/producer
• if counter++ and counter-- execute concurrently, don’t know what will happen
• this is a race condition

## 5.2 - critical-section problem

• each process has critical section where it updates common variables
• <img src="pics/5_1.png" width=40%/>
• 3 requirements
1. mutual exclusion - 2 processes can’t concurrently do critical section
2. progress - if no process is in the critical section, some waiting process must eventually be able to enter
3. bounded waiting - there is a bound on how many times others can enter before a waiting process gets its turn
• kernels
1. preemptive kernels
• more responsive
2. nonpreemptive kernels
• no race conditions

## 5.3 - peterson’s solution

• peterson’s solution
• <img src="pics/5_2.png" width=40%/>
• here i is one task and j is the other
• not guaranteed to work on modern architectures, which may reorder independent loads and stores
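The structure in the figure, sketched as C for two processes i and j = 1 - i; the flag/turn names follow the usual presentation, and as noted, modern hardware can reorder these accesses, so this shows the logic only:

```c
int flag[2] = {0, 0};  /* flag[i]: process i wants to enter */
int turn;              /* whose turn it is to defer to */

void enter_section(int i) {
    int j = 1 - i;
    flag[i] = 1;       /* announce intent */
    turn = j;          /* give the other process priority */
    while (flag[j] && turn == j)
        ;              /* busy wait while the other wants in and has the turn */
}

void leave_section(int i) {
    flag[i] = 0;       /* no longer interested */
}
```

Setting turn = j before spinning is what breaks the tie when both processes try to enter at once.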

## 5.4 - synchronization hardware

• locking - protecting critical regions using locks
• single-processor solution
• prevent interrupts while shared variable is being modified
• ex. test_and_set()
• instructions do things like swapping atomically - as one uninterruptable unit
• these are basically locked instructions
• ex. compare_and_swap()
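A sketch of the test_and_set() logic and the spinlock built from it; on real hardware the instruction executes atomically, which plain C cannot express, so this is illustrative only:

```c
/* returns the old value of *target and sets it to 1, as one step on hardware */
int test_and_set(int *target) {
    int rv = *target;
    *target = 1;
    return rv;
}

int lock = 0;          /* 0 = free, 1 = held */

void acquire(void) {
    while (test_and_set(&lock))
        ;              /* spin until the old value was 0 (lock was free) */
}

void release(void) {
    lock = 0;
}
```

Only the caller that saw the old value 0 proceeds; everyone else keeps spinning, which is exactly the busy waiting discussed in the next section.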

## 5.5 - mutex locks

• mutex: <img src="pics/5_8.png" width=40%/>
• simplest synchronization tool
• this type of mutex lock is called spinlock because it requires busy waiting - processes not in critical section are continuously looping
• good when locks are short

### 5.6 - semaphores

• semaphore S - integer variable accessed through wait() (like trying to execute) and signal() (like releasing)
• counting semaphore - unrestricted domain
• binary semaphore - 0 and 1
wait(S) {
    while (S <= 0)
        ;  // busy wait
    S--;
}

signal(S) {
    S++;
}

• to improve performance, replace busy wait by process blocking itself
• places itself into a waiting queue
• restarted when other process executes a signal() operation
typedef struct {
    int value;
    struct process *list;
} semaphore;

wait(semaphore *S) {
    S->value--;
    if (S->value < 0) {
        add this process to S->list;
        block();   // suspend the process
    }
}

signal(semaphore *S) {
    S->value++;
    if (S->value <= 0) {
        remove a process P from S->list;
        wakeup(P); // resumes execution
    }
}

• deadlocked - 2 processes are in waiting queues, can’t wakeup unless other process signals them
• indefinite blocking = starvation - could happen if we remove processes from waiting queue in LIFO order
• bottom never gets out
• priority inversion
• only occurs when processes have > 2 priorities
• usually solved with a priority-inheritance protocol
• when a process accesses resources needed by a higher-priority process, it inherits the higher priority until they are finished with the resources in question

## 5.7 - classic synchronization problems

1. bounded-buffer problem
2. readers-writers problem
• writers must have exclusive access
3. dining-philosophers problem

### 5.8 - monitors

• monitor - high-level synchronization construct
• only 1 process can run at a time
• abstract data type which includes a set of programmer-defined operations with mutual exclusion
• has condition variables
• these can only call wait() or signal()
• when a signal is encountered, 2 options
1. signal and wait
2. signal and continue
• can implement with a semaphore
• 1st semaphore: mutex - process must wait before entering and signal after leaving the monitor
• 2nd semaphore: next - signaling processes use next to suspend themselves
• 3rd semaphore: next_count = number of suspended processes
wait(mutex);
    // body of F
if (next_count > 0)
    signal(next);
else
    signal(mutex);

• conditional-wait construct can help with resuming
• x.wait(c);
• priority number c stored with name of process that is suspended
• when x.signal() is executed, process with smallest priority number is resumed next

## 5.9.4 - pthreads synchronization

#include <pthread.h>
pthread_mutex_t mutex;

/* create the mutex lock */
pthread_mutex_init(&mutex, NULL);  // NULL specifies default attributes

pthread_mutex_lock(&mutex);    // acquire the mutex lock
/* critical section */
pthread_mutex_unlock(&mutex);  // release the mutex lock

• these functions return 0 on correct operation, otherwise an error code
• POSIX specifies named and unnamed semaphores
• named semaphore has a name and can be shared by different processes
#include <semaphore.h>
sem_t sem;

/* create the semaphore and initialize it to 1 */
sem_init(&sem, 0, 1);

/* acquire the semaphore */
sem_wait(&sem);

/* critical section */

/* release the semaphore */
sem_post(&sem);


## 5.11 - deadlocks

• resource utilization
1. request
2. use
3. release
• deadlock requires 4 simultaneous conditions
1. mutual exclusion
2. hold and wait
3. no preemption
4. circular wait
• deadlocks can be described by system resource-allocation graph
• request edge - directed edge from process P to resource R means P has requested instance of resource type R
• assignment edge - R-> P
• if the graph has no cycles, not deadlocked
• if cycle, possible deadlock
• three ways to handle
1. use protocol to never enter deadlock
2. enter, detect, recover
3. ignore the problem
• developers must write code that avoids deadlocks
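A toy cycle check over a wait-for graph, following the resource-allocation-graph idea above; assuming one instance per resource type, a cycle implies deadlock. The 4-process adjacency matrix is an assumption for illustration:

```c
#define NPROC 4
int g[NPROC][NPROC];            /* g[u][v] = 1: process u waits for process v */

static int dfs(int u, int color[]) {   /* 0 = unvisited, 1 = on stack, 2 = done */
    color[u] = 1;
    for (int v = 0; v < NPROC; v++)
        if (g[u][v]) {
            if (color[v] == 1)
                return 1;              /* back edge to the stack: cycle */
            if (color[v] == 0 && dfs(v, color))
                return 1;
        }
    color[u] = 2;
    return 0;
}

int has_cycle(void) {
    int color[NPROC] = {0};
    for (int u = 0; u < NPROC; u++)
        if (color[u] == 0 && dfs(u, color))
            return 1;
    return 0;
}
```

A chain P0 → P1 → P2 is deadlock-free; adding the edge P2 → P0 closes the circular wait.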

# 7 - main memory

## 7.1 - background

• CPU can only directly access main memory and registers
• accessing memory is slower than registers
• processor must stall or use cache
• processes need separate memory spaces
1. base register - holds smallest usable address
2. limit register - specifies size of range
• os / hardware check these, throw a trap if there was error
• input queue holds processes waiting to be brought into memory
• compiler binds symbolic addresses to relocatable addresses
• memory-management unit (MMU) maps from virtual to physical address
• simple ex. add virtual address to a process’s base register = relocation register
• dynamically linked libraries - system libraries linked to user programs when the programs are run
• stub - tells how to load / locate library routine
• shared libraries - all use same library

## 7.3 - contiguous memory allocation

• contiguous memory allocation - each process has a section
• put OS in low memory and process memory in higher
• transient OS code - not often used
• ex. drivers
• can remove this and change OS memory usage by decreasing val in OS limit register
• split mem into partitions
• each partition can only have 1 process
• multiple-partition method - free partitions take a new process
• variable-partition scheme - OS keeps table of free mem
• all available mem = hole
• holes are divided between processes
1. first-fit - allocate first hole big enough
2. best-fit - allocate smallest hole that is big enough
3. worst-fit - allocate largest hole (largest leftover hole)
• worst
• external fragmentation - there is enough free mem, but it isn’t contiguous
• 50-percent rule - for N allocated blocks, another 0.5N are lost to fragmentation, so 1/3 of mem is unusable
• solved with compaction - shuffle mem to put free mem together
• can be expensive to move mem around
• internal fragmentation - extra mem a proc is allocated but not using (because given in block sizes)
• 2 types of non-contiguous solutions
1. segmentation
2. paging
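The three hole-selection strategies above as toy functions over a free-hole list; the hole sizes used in the test are assumed example data:

```c
/* each returns the index of the chosen hole, or -1 if none fits */
int first_fit(const int holes[], int n, int req) {
    for (int i = 0; i < n; i++)
        if (holes[i] >= req)
            return i;               /* first hole big enough */
    return -1;
}

int best_fit(const int holes[], int n, int req) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (holes[i] >= req && (best < 0 || holes[i] < holes[best]))
            best = i;               /* smallest hole big enough */
    return best;
}

int worst_fit(const int holes[], int n, int req) {
    int worst = -1;
    for (int i = 0; i < n; i++)
        if (holes[i] >= req && (worst < 0 || holes[i] > holes[worst]))
            worst = i;              /* largest hole */
    return worst;
}
```

For holes {100, 500, 200, 300, 600} and a 212-unit request: first-fit picks the 500 hole, best-fit the 300, worst-fit the 600.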

## 7.4 - segmentation (skip)

• segments make up logical address space
• name (or number)
• length
• logical address is a tuple
• (segment-number, offset)
• segment table
• each entry has segment base and segment limit
• doesn’t avoid external fragmentation

## 7.5 - paging (skip)

• break physical mem into fixed-size frames and logical mem into corresponding pages
• CPU address = [page number | page offset]
• page table contains base address of each page in physical mem
• usually, each process gets a page table
• <img src="pics/7_10.png" width=40%/>
• frame table keeps track of which frames are available / who owns them
• paging is prevalent
• avoids external fragmentation, but has internal fragmentation
• small page tables can be stored in registers
• usually page-table base register points to page table in mem
• has translation look-aside buffer - stores some page-table entries
• some entries are wired down - cannot be removed from TLB
• some TLBs store address-space identifiers (ASIDs)
• identify a process
• otherwise hard to contain entries for several processes
• want high hit ratio
• page-table often stores a bit for read-write or read-only
• valid-invalid bit sets whether page is in a process’s logical address space
• OR page-table length register - says how long page table is
• can share reentrant code = pure code
• non-self-modifying code

## 7.6 - page table structure (skip)

• page tables can get quite large (total mem / page size)
1. hierarchical paging - ex. two-level page table
• <img src="pics/7_18.png" width=40%/>
• also called forward-mapped page table
• unused things aren’t filled in
• for 64-bit, would generally require too many levels
2. hashed page tables
• virtual page number is hash key -> physical page number
• clustered page tables - each entry stores several pages, can be faster
3. inverted page tables
• only one page table in system
• one entry for each page/frame of memory
• <img src="pics/7_20.png" width=40%/>
• takes more time to lookup
• hash table can speed this up
• difficulty with shared memory

# 6 - cpu scheduling

• preemptive - can stop and switch a process that is currently running

### 6.3 - algorithms

1. first-come, first-served
2. shortest-job-first
• can be preemptive or non preemptive
3. priority-scheduling
• indefinite blocking / starvation
4. round-robin
• every process gets some time
5. multilevel queue scheduling
• ex. foreground and background
6. multilevel feedback queues
• allows processes to move between queues
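FCFS waiting time can be computed directly; a small sketch using the classic assumed burst times 24, 3, 3 (waits 0, 24, 27, average 17):

```c
/* average waiting time under FCFS for CPU bursts that all arrive at t = 0 */
double fcfs_avg_wait(const int burst[], int n) {
    int total_wait = 0, t = 0;
    for (int i = 0; i < n; i++) {
        total_wait += t;    /* process i waits until all earlier ones finish */
        t += burst[i];
    }
    return (double)total_wait / n;
}
```

Reordering the same bursts as 3, 3, 24 (shortest-job-first) drops the average wait to (0 + 3 + 6) / 3 = 3, which is the motivation for SJF.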

### 6.4 - thread scheduling

• process contention scope - competition for CPU takes place among threads belonging to same process
• PTHREAD_SCOPE_PROCESS - user-level threads onto available LWPs
• PTHREAD_SCOPE_SYSTEM - binds LWP for each user-level thread

### 6.5 - multiple-processor scheduling

• asymmetric vs. symmetric
• almost everything is symmetric (SMP)
• processor affinity - try to not switch too much
• load balancing - try to make sure all processes share work
1. coarse-grained - thread executes until a long-latency event, such as a memory stall
2. fine-grained - switches between threads at the boundary of an instruction cycle

### 6.6 - real-time systems

• event latency - amount of time that elapses from when an event occurs to when it is serviced
1. interrupt latency - period of time from the arrival of an interrupt at the CPU to the start of the routine that services the interrupt
2. dispatch latency
1. Preemption of any process running in the kernel
2. Release by low-priority processes of resources needed by a high-priority process
• rate-monotonic scheduling - schedules periodic tasks using a static priority policy with preemption

# 8 - virtual memory

## 8.1 - background

• lots of code is seldom used
• virtual mem allows the execution of processes that are not completely in memory
• benefits
• programs can be larger than physical mem
• more processes in mem at same time
• less swapping programs into mem
• sparse address space - virtual address spaces with holes (between heap and stack)
• <img src="pics/8_3.png" width=40%/>

## 8.2 - demand paging

• demand paging - load pages only when they are needed
• lazy pager - only swaps a page into memory when it is needed
• can use valid-invalid bit in page table to signal whether a page is in memory
• memory resident - residing in memory
• accessing page marked invalid causes page fault
• <img src="pics/8_6.png" width=40%/>
• must restart after fetching page
1. don’t let anything change while fetching
2. use registers to store state before fetching
• pure demand paging - never bring a page into memory until it is required
• programs tend to have locality of reference, so we bring in chunks at a time
• extra time when there is a page fault
1. service the page-fault interrupt
2. read in the page
3. restart the process
• effective access time is directly proportional to page-fault rate
• anonymous memory - pages not associated with a file
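The effective-access-time relation above as a one-liner; the 200 ns memory access, 8 ms fault-service time, and 1-in-1000 fault rate used below are assumed textbook-style figures:

```c
/* effective access time: (1 - p) * memory access + p * page-fault service,
   with all times in the same unit (nanoseconds here) */
double eat(double p, double ma_ns, double fault_ns) {
    return (1.0 - p) * ma_ns + p * fault_ns;
}
/* e.g. eat(0.001, 200, 8000000) = 0.999*200 + 0.001*8000000 = 8199.8 ns */
```

Even one fault per thousand accesses slows memory down by a factor of about 40, which is why the page-fault rate dominates performance.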

## 8.3 - copy-on-write

• copy-on-write - allows parent and child processes initially to share the same pages
• if either process writes, copy of shared page is created
• new pages can come from a set pool
• zero-fill-on-demand - zeroed out before being allocated
• virtual memory fork - not copy-on-write
• child uses address space of parent
• parent suspended
• meant for when child calls exec() immediately

## 8.4 - page replacement - select which frames to replace

• multiprogramming might over-allocate memory
• all programs might need all their mem at once
• buffers for I/O also use a bunch of mem
• when over-allocated, 3 options
1. terminate user process
2. swap out a process
3. page replacement
• want lowest page-fault rate
• test with reference string, which is just a list of memory references
• if no frame is free, find one not being used and free it
• write its contents to swap space
• <img src="pics/8_10.png" width=40%/>
• modify bit=dirty bit reduces overhead
• if hasn’t been modified then don’t have to rewrite it to disk
• page replacement examples
1. FIFO
• Belady’s anomaly - for some algorithms, page-fault rate may increase as number of allocate frames increases
2. optimal (OPT / MIN)
• replace the page that will not be used for the longest period of time
3. LRU - least recently used (last used)
1. implement with counters - record the time of each use, replace the page with the smallest value
2. stack of page numbers (whenever something is used, put it on top)
• stack algorithms - set of pages in memory for n frames is always a subset of the set of pages that would be in memory with n + 1 frames
• don’t suffer from Belady’s anomaly
4. LRU-approximation
• reference bit - set whenever a page is used
• can keep additional reference bits by recording reference bits at regular intervals
• second-chance algorithm - FIFO, but if ref bit is 1, set ref bit to 0 and move on to next FIFO page
• can have clock algorithm
• <img src="pics/8_17.png" width=40%/>
• enhanced second-chance - uses reference bit and modify bit
• give preference to pages that have been modified
5. counting-based - count and implement LFU (least frequently used) or MFU (most frequently used)
• page-buffering algorithms
• pool of free frames - makes things faster
• list of modified pages - written to disk whenever paging device is idle
• some programs, like databases, perform better using a raw disk - a disk partition they manage themselves instead of going through the OS
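A small FIFO page-fault counter for the algorithms above; running it on the classic reference string 1,2,3,4,1,2,5,1,2,3,4,5 gives 9 faults with 3 frames but 10 with 4, demonstrating Belady's anomaly:

```c
/* count page faults for FIFO replacement (supports up to 16 frames);
   frames start at -1, meaning empty */
int fifo_faults(const int refs[], int n, int nframes) {
    int frames[16], next = 0, faults = 0;
    for (int i = 0; i < nframes; i++)
        frames[i] = -1;
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < nframes; j++)
            if (frames[j] == refs[i]) { hit = 1; break; }
        if (!hit) {
            frames[next] = refs[i];          /* evict oldest: round-robin slot */
            next = (next + 1) % nframes;
            faults++;
        }
    }
    return faults;
}
```

Since frames only change on a fault, cycling the insertion index is equivalent to evicting the oldest page.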

## 8.6 - thrashing

• if low-priority process gets too few frames, swap it out
• thrashing - process spends more time paging than executing
• CPU utilization stops increasing
• local replacement algorithm = priority replacement algorithm - if one process starts thrashing, cannot steal frames from another
• locality model - each locality is a set of pages actively used together
• give process enough for its current locality
• working-set model - still based on locality
• defines working-set window $\delta$
• defines working set as pages in most recent $\delta$ refs
• OS adds / suspends processes according to working set sizes
• approximate with fixed-interval timer
• page-fault frequency - add / decrease pages based on target page-fault rate

## 8.8.1 - buddy system

• memory allocated with power-of-2 allocator - requests are given powers of 2
• each page is split into 2 buddies and each of those splits again recursively
• coalescing - buddies can be combined quickly

# 9 - mass-storage structure

## 9.4 - disk scheduling

• bandwidth - total number of bytes transferred, divided by time
• first-come first-served
• shortest-seek-time-first
• SCAN algorithm - disk swings side to side servicing requests on the way
• also called elevator algorithm
• also has circular-scan

## 9.5 - disk management

• low-level formatting - dividing disk into sectors that controller can read/write
• blocks have header / trailer with error-correcting codes
• bad blocks are corrupted - need to replace them with others = sector sparing = forwarding
• sector slipping - just renumbers to not index bad blocks

# 10 - file-system interface

## 10.1

• os maintains open-file table
• might require file locking
• must support different file types

## 10.2 - access methods

• simplest - sequential
• direct access = relative access
• uses relative block numbers

## 10.3

• disk can be partitioned
• two-level directory
• users are first level
• directory is 2nd level
• extend this into a tree
• acyclic makes it faster to search
• cycles require very slow garbage collection
• link - pointer to another thing

# 11 - file-system implementation

## 11.1

• file-control block (FCB) contains info about file ownership, etc.

## 11.4

• contiguous allocation
• FAT
• indexed allocation - all the pointers in 1 block

## 11.5

• keep track of free-space list
• implemented as bit map
• keep track of linked list of free space
• grouping - block stores addresses of n-1 free blocks and 1 pointer to the next block
• counting - keep track of ptr to next block and the number of free blocks after that
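A bitmap free-list sketch (assuming the common convention that bit i = 1 means block i is free):

```python
# Free-space bitmap: scan for the first set bit and mark it allocated.

def first_free(bitmap):
    for i, bit in enumerate(bitmap):
        if bit:
            return i
    return -1          # no free block

def allocate(bitmap):
    i = first_free(bitmap)
    if i >= 0:
        bitmap[i] = 0  # mark as in use
    return i
```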

# 12 - i/o systems

• bus - shared set of wires
• registers
• data-in - read by the host
• data-out
• status
• control
• interrupt chaining - each element in the interrupt vector points to the head of a list of interrupt handlers
• system calls use software interrupt
• direct memory access - read large chunks instead of one byte at a time
• device-status table
• spool - buffer for device (ex. printer) that can’t hold interleaved data
• Information Retrieval

# introduction

• building blocks of search engines
• search (user initiates)
• recommendations - proactive search engine (program initiates, e.g. Pandora, Netflix)
• information retrieval - activity of obtaining info relevant to an information need from a collection of resources
• information overload - too much information to process
• memex - device which stores records so it can be consulted with exceeding speed and flexibility (search engine)
• IR pieces
1. Indexed corpus (static)
• crawler and indexer - gathers the info constantly, takes the whole internet as input and outputs some representation of the document
• web crawler - automatic program that systematically browses web
• document analyzer - knows which section has what - takes in the metadata and outputs the condensed index; manages content to provide efficient access to web documents
2. User
• query parser - parses the search terms into managed system representation
3. Ranking
• ranking model - takes in the query representation and the indices, sorts according to relevance, outputs the results
• also need nice display
• query logs - record user’s search history
• user modeling - assess user’s satisfaction
• steps
1. repository -> document representation
2. query -> query representation
3. ranking is performed between the 2 representations and given to the user
4. evaluation - by users
• information retrieval:
1. recommendation
2. text mining

# related fields

they are all getting closer; database approximate search and information extraction convert unstructured data to structured:

| database systems | information retrieval |
| --- | --- |
| structured data | unstructured data |
| semantics are well-defined | semantics are subjective |
| structured query languages (ex. SQL) | simple keyword queries |
| exact retrieval | relevance-driven retrieval |
| emphasis on efficiency | emphasis on effectiveness |
• natural language processing - currently the bottleneck
• deep understanding of language
• cognitive approaches vs. statistical
• small scale problems vs. large
• developing areas
• currently mobile search is big - needs to use less data, everything needs to be more summarized
• interactive retrieval - like a human being, should collaborate
• core concepts
• information need - desire to locate and obtain info to satisfy a need
• query - a designed representation of user’s need
• document - representation of info that could satisfy need
• relevance - relatedness between documents and need, this is vague
• multiple perspectives: topical, semantic, temporal, spatial (ex. gas stations shouldn’t be behind you)
• Yahoo used to have system where you browsed based on structure (browsing), but didn’t have queries (querying)
• better when user doesn’t know keywords, just wants to explore
• push mode - systems push relevant info to users without a query
• pull mode - users pull out info using keywords

# web crawler

• web crawler determines upper bound for search engine
• loop over all URL’s (difficult to set its order)
• make sure it’s not visited
• read it and save it as indexed
• setItVisited
• visiting strategy
• breadth first - has to memorize all nodes on previous level
• depth first - explore the web by branch
• focused crawling - prioritize the new links by predefined strategies
• not all documents are equally important
• prioritize by in-degree
• prioritize by PageRank - breadth-first in early state then approximate periodically
• prioritize by topical relevance
• estimate the similarity by anchortext or text near anchor
• some websites provide site map for google, disallows certain pages (ex. cnn.com/robots.txt)
• some websites push info to google so it doesn’t need to be crawled (ex. news websites)
• need to revisit to get changed info
• uniform re-visiting (what google does)
• proportional re-visiting (visiting frequency is proportional to page’s update frequency)
• html parsing
• shallow parsing - only keep text between title and p tags
• automatic wrapper generation - regular expression for HTML tags’ combination
• representation
• long string has no semantic meaning
• list of sentences - sentence is just short document (recursive definition)
• list of words
• tokenization - break a stream of text into meaningful units
• several statistical methods
• bag-of-words representation
• we get frequencies, but lose grammar and order
• N-grams (improved)
• contiguous sequence of n items from a given sequence of text
• for example, keep pairs of words
• google uses n = 7
• increase vocabulary to V^n
• full text indexing
• pros: preserves all information, full automatic
• cons: vocab gap: car vs. cars, very large storage
• Zipf’s law - frequency of any word is inversely proportional to its rank in the frequency table
• frequencies decrease linearly on a log-log plot
• discrete version of power law
• stopwords - we ignore these and get meaningful part
• head words take large portion but are meaningless e.g. the, a, an
• tail words - major portion of dictionary, but rare e.g. dextrosinistral
• risk: we lost structure ex. this is not a good option -> option
• normalization
• convert different forms of a word to normalized form
• USA St. Louis -> Saint Louis
• rule based: delete period, all lower case
• dictionary based: equivalence classes ex. cell phone -> mobile phone
• stemming: ladies -> lady, referring -> refer
• risks: lay -> lie
• solutions
• Porter stemmer - pattern of vowel-consonant sequence
• Krovetz stemmer - morphological rules
• empirically, stemming still hurts performance
• modern search engines don’t do stemming or stopword removal
• more advanced NLP techniques are applied - ex. did you search for a person? location?
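The representation steps above (tokenize, bag-of-words, N-grams) can be sketched as minimal Python; normalization here is just lowercasing:

```python
from collections import Counter

def tokenize(text):
    """Break a stream of text into units (naive whitespace tokenizer)."""
    return text.lower().split()

def bag_of_words(tokens):
    """Frequencies only - grammar and word order are lost."""
    return Counter(tokens)

def ngrams(tokens, n):
    """Contiguous length-n sequences; vocabulary grows to V^n."""
    return [tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1)]
```

e.g. the bigrams (n = 2) of "this is not a good option" keep the pair ('not', 'a'), which pure bag-of-words would lose.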

# inverted index

• simple attempt
• documents have been crawled from web, tokenized/normalized, represented as bag-of-words
• try to match keywords to the documents
• space complexity O(d*v) where d = # docs, v = vocab size
• Zipf’s law: most of space is wasted so we only store the occurred words
• instead of an array, we store a linked list for each doc
• time complexity O(q · d_length · d_num) where q = length of query
• solution
• look-up table for each word, key is word, value is list of documents that contain it
• time-complexity O(q*l) where l is the average length of the list of documents containing a word
• by Zipf’s law, l ≪ d_length · d_num, so this is much faster
• data structures
• hashtable - modest size (length of dictionary)
• postings - very large - sequential access, contain docId, term freq, term position…
• compression is needed
• sorting-based inverted index construction - map-reduce
• from each doc extract tuples of (termId (key in hashtable), docId, count)
• sort by termId within each doc
• merge sort to get one list sorted by key in hashtable
• compress terms with same termId and put into hashtable
• features
• needs to support approximate search, proximity search, dynamic index update
• dynamic index update
• periodically rebuild the index - acceptable if change is small over time and missing new documents is fine
• auxiliary index
• keep index for new docs in memory
• merge to index when size exceeds threshold
• soln: multiple auxiliary indices on disk, logarithmic merging
• index compression
• save space
• increase cache efficiency
• improve disk-memory transfer rate
• coding theory: expected code length E[L] = Σ_l p(x_l) · l
• instead of storing docId, we store gap between docIDs since they are ordered
• biased distr. gives great compression: frequent words have smaller gaps, infrequent words have large gaps, so the large numbers don’t matter (Zipf’s law)
• variable-length coding: less bits for small (high frequency) integers
• more things put into index
• document structure
• title, abstract, body, bullets, anchor
• entity annotation
• these things are fed to the query
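A minimal inverted-index sketch with sorted postings and the docId-gap trick described above (toy documents; helper names are my own):

```python
from collections import defaultdict

def build_index(docs):
    """docs: list of token lists -> {term: sorted list of docIds}."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        for t in tokens:
            index[t].add(doc_id)
    return {t: sorted(ids) for t, ids in index.items()}

def to_gaps(postings):
    """Store the first docId, then differences - small ints compress well."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
```

Frequent words have dense postings and hence small gaps, which is exactly where variable-length codes save the most space.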

# query processing

• parse syntax ex. Barack AND Obama, orange OR apple
• same processing as on documents: tokenization -> normalization -> stemming -> stopword removal
• speed-up: start from lowest frequency to highest ones (easy to toss out documents)
• phrase matching “computer science”
• N-grams don’t work - the phrase could be very long
• soln: generalized postings match
• equality condition check with requirement of a position pattern between the two query terms
• ex. t2.pos - t1.pos = 1 (t1 must be immediately before t2 in any matched document)
• proximity query: t2.pos - t1.pos <= k
• spelling correction
• pick nearest alternative or pick most common alternative
• proximity between query terms
• edit distance = minimum number of edit operations to transform one string to another
• insert, replace, delete
• speed-up
• fix prefix length
• build character-level inverted index
• consider layout of a keyboard
• phonetic similarity ex. “herman” -> “Hermann”
• solve with phonetic hashing - similar-sounding terms hash to same value
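The edit-distance computation above is the standard dynamic program (insert / replace / delete, unit cost each):

```python
def edit_distance(a, b):
    """Minimum number of insert/replace/delete ops to turn a into b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i               # delete everything
    for j in range(n + 1):
        dp[0][j] = j               # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete
                           dp[i][j - 1] + 1,        # insert
                           dp[i - 1][j - 1] + cost) # replace / match
    return dp[m][n]
```

So "herman" is distance 1 from "hermann", making it a near alternative for spelling correction.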

# user

• result display
1. relevance
2. diversity
3. navigation - query suggestion, search by example
• list of links has always been there
• search engine recommendations largely bias the user
• direct answers (advanced I’m feeling lucky)
• ex. 100 cm to inches
• google was using user’s search result feedback
• spammers were abusing this
• social things have privacy concerns
• instant search (refreshes search while you’re typing)
• slightly slows things down
• carrot2 - browsing not querying
• foam tree display, has circles with sizes representing popularity
• has a learning curve
• pubmed - knows something about users, has keyword search and more
• result display
• relevance
• most users only look at top left
• this can be changed with multimedia content
• HCI is attracting more attention now
• mobile search
• multitouch
• less screen space

# ranking model

• naive boolean query “obama” AND “healthcare” NOT “news”
• unions, intersects, lists
• often over-constrained or under-constrained
• also doesn’t give you relevance of returned documents
• you can’t actually return all the documents
• instead we have rank docs for the users (top-k retrieval) with different kinds of relevance
1. vector space model (uses similarity between query and document)
• how to define similarity measure
• both doc and query represented by concept vectors
• k concepts define high-dimensional space
• element of vector corresponds to concept weight
• concepts should be orthogonal (non-overlapping in meaning)
• could use terms, n-grams, topics, usually bag-of-words
• weights: not all terms are equally important
• TF - term frequency weighting - a frequent term is more important
• normalization: tf(t,d) = 1+log(f(t,d)) if f(t,d) > 0, else 0
• or proportionally: tf(t,d) = a+(1-a)*f(t,d)/max_t f(t,d)
• IDF weighting - a term is more discriminative if it occurs only in fewer docs
• IDF(t) = 1+log(N/(d_num(t))) where N = total # docs, d_num(t) = # docs containing t
• total term frequency doesn’t work because words can frequently occur in a subset
• combining TF and IDF - most widely used
• w(t,d) = TF(t,d) * IDF(t)
• similarity measure
• Euclidean distance - penalizes longer docs too much
• cosine similarity - dot product and then normalize
• drawbacks
• assumes term independence
• assume query and doc to be the same
• lack of predictive adequacy
• lots of parameter tuning
2. probabilistic model (uses probability of relevance)
• vocabulary - set of words user can query with
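A sketch of the vector space model as described: w(t,d) = TF·IDF with TF = 1+log f and IDF = 1+log(N/d_num), ranked by cosine similarity (toy corpus; helper names are my own):

```python
import math
from collections import Counter

def tfidf_vector(tokens, df, n_docs):
    """Sparse TF-IDF vector: weight = (1 + log f) * (1 + log(N/df))."""
    return {t: (1 + math.log(f)) * (1 + math.log(n_docs / df[t]))
            for t, f in Counter(tokens).items()}

def cosine(u, v):
    """Dot product normalized by vector lengths."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["car", "insurance"], ["car", "repair"], ["best", "insurance"]]
df = Counter(t for d in docs for t in set(d))        # document frequencies
vecs = [tfidf_vector(d, df, len(docs)) for d in docs]
query = tfidf_vector(["car", "insurance"], df, len(docs))
```

The first document matches both query terms and scores highest.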

# latent semantic analysis - removes noise

• terms aren’t necessarily orthogonal in the vector space model
• synonyms: car vs. automobile
• polysems: fly (action vs. insect)
• independent concept space is preferred (axes could be sports, economics, etc.)
• constructing concept space
• automatic term expansion - cluster words based on thesaurus (WordNet does this)
• word sense disambiguation - use dictionary, word-usage context
• latent semantic analysis
• assumption - there is some underlying structure that is obscured by randomness of word choice
• random noise contaminates term-document data
• linear algebra - singular value decomposition
• m x n matrix C with rank r
• decompose into U * D * V^T, where D is an r x r diagonal matrix (like eigenvalues^2)
• U and V are orthogonal matrices
• we put the eigenvalues in D into descending order and only take the first k values to be nonzero
• this is low rank decomposition
• multiply the reduced representations of different docs to get similarity
• each doc’s column of D_k * V^T is its new representation
• principal component analysis - separate things based on direction that maximizes variance
• put query into low-rank space
• LSA can also be used beyond text
• computing the SVD costs O(M·N^2) for an M x N matrix
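A numpy-based sketch of the decomposition (assuming numpy is available; the toy term-document counts are made up): keep only the top k singular values and compare documents in the reduced concept space.

```python
import numpy as np

# rows = terms, columns = documents (toy counts for illustration)
C = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)   # C = U D V^T
k = 1
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T        # row i = doc i in concept space
```

Docs 0 and 1 (identical term use) get identical concept coordinates, while doc 2 lands at 0 on that concept.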

# probabilistic ranking principle - different approach, ML

• total probability - use Bayes’ rule over a partition
• hypothesis space H={H_1,…,H_n}, training data E
• $P(H_i \mid E) = P(E \mid H_i)P(H_i)/P(E)$
• prior = $P(H_i)$
• posterior = $P(H_i \mid E)$
• to pick the most likely hypothesis H*, we drop P(E)
• $P(H_i \mid E) \propto P(E \mid H_i)P(H_i)$
• losses - rank by descending loss
• a1 = loss(retrieved a non-relevant doc)
• a2 = loss(did not retrieve a relevant doc)
• we need to make a relevance measure function
• assume independent relevance, sequential browsing
• most existing ir research has fallen into this line of thinking
• conditional models for $P(R=1 \mid Q,D)$
• basic idea - relevance depends on how well a query matches a document
• $P(R=1 \mid Q,D) = g(Rep(Q,D), \theta)$
• linear regression
• MLE: prediction = $\arg\max_\theta P(X \mid \theta)$
• Bayesian: prediction = $\arg\max_\theta P(X \mid \theta)P(\theta)$
###### ml
• features/attributes for ranking - many things
• use logistic regression to find relevance
• little guidance on feature selection
• this model has completely taken over
###### generative models for P(R=1|Q,D)
• compute the odds $O(R=1 \mid Q,D)$ using Bayes’ rule
###### language models
• a model specifying a probability distribution over word sequences (generative model)
• too much memory for n-gram, so we use unigrams
• generate text by sampling from discrete distribution
• maximum likelihood estimation (MLE)
• sampling with replacement (like picking marbles from bag) - gives you probability distributions
• when you get a query see which document is more likely to generate the query
• MLE can’t represent unseen words (ex. ipad)
• smoothing
• we want to avoid log zero for these words, but we can’t arbitrarily add to the zero
• instead we add to the zero probabilities and subtract from the probabilities of observed words
1. additive smoothing - add a constant delta to the counts of each word
• skews the counts in favor of infrequent terms - all words are treated equally
2. absolute discounting - subtract from each nonzero word, distribute among zeros
• reference smoothing - use reference language model to choose what to add
3. linear interpolation - subtract a percentage of your probability, distribute among zeros
4. dirichlet prior/bayesian - not affected by document length
• effect of smoothing is to get rid of log(0) and to devalue very common words and add weight to infrequent words
• longer documents should borrow less because they see the more uncommon words
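A query-likelihood sketch with linear-interpolation (Jelinek-Mercer) smoothing, one of the smoothing schemes above; λ and the toy docs are my own choices:

```python
import math
from collections import Counter

def lm_score(query, doc, collection, lam=0.5):
    """log p(query | doc) with p(w|d) = (1-lam)*p_ml(w|d) + lam*p(w|collection)."""
    d, c = Counter(doc), Counter(collection)
    dlen, clen = len(doc), len(collection)
    score = 0.0
    for w in query:
        p = (1 - lam) * d[w] / dlen + lam * c[w] / clen
        score += math.log(p)   # nonzero as long as w appears in the collection
    return score

doc1 = ["obama", "healthcare", "policy"]
doc2 = ["obama", "news", "today"]
collection = doc1 + doc2
```

Smoothing lets doc2 still receive a (small) probability for "healthcare" even though it never saw the word, so the log never hits log(0).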

# retrieval evaluation

• evaluation criteria
• small things - speed, # docs returned, spelling correction, suggestions
• most important thing is satisfying user’s information need
• Cranfield experiments - retrieved documents’ relevance is a good proxy of a system’s utility in satisfying user’s information need
• standard benchmark - TREC, hosted by NIST
• elements of evaluation
1. document collection
2. set of information needs expressible as queries
3. relevance judgements - binary relevant, nonrelevant for each query-document pair
• stats
• type 1: false positive - wrongly returned
• type 2: false negative - wrongly not returned
• precision - fraction of retrieved documents that are relevant = p(relevant|retrieved) = tp/(tp+fp)
• recall - fraction of relevant documents that are retrieved = p(retrieved|relevant) = tp/(tp+fn)
• they generally trade off
• evaluation is in terms of one query
1. unordered evaluation - consider the documents unordered
• calculate the precision P and recall R
• combine them with the weighted harmonic mean: F = 1 / (a(1/P)+(1-a)(1/R)) where a assigns weights, usually pick a=1/2
• with a=1/2: F = 2/(1/P+1/R)
• we use this instead of the arithmetic mean because a value very close to 0 produces a very large denominator, dragging F down
1. ranked evaluation w/ binary relevance - consider the ranked results
• precision vs recall has sawtooth shape curve
• recall never decreases
• precision increases if we find a relevant doc, decreases if irrelevant
1. eleven-point interpolated (use recall levels 0, .1, .2, …, 1.0)
• shouldn’t really use 1.0 - not very meaningful
2. precision@k
• ignore all docs ranked lower than k
• only use relevant docs
• recall@k is problematic because it is hard to know how many docs are relevant
3. MAP - mean average precision - usually best
• considers rank position of each relevant doc
• compute p@k for each relevant doc
• average precision = average of those p@k
• mean average precision = mean over all the queries
• weakness - assumes users are interested in finding many relevant docs, requires many relevance judgements
4. MRR - mean reciprocal rank - only want one relevant doc
• uses: looking for fact, known-item search, navigational queries, query auto completion
• reciprocal rank = 1/k where k is ranking position of 1st relevant document
• mean reciprocal rank = mean over all the queries
• ranked evaluation w/ numerical relevance
• binary relevance is insufficient - highly relevant documents are more useful
• gain is accumulated starting at the top and discounted at lower ranks
• typical discount is 1/log(rank)
• DCG (discounted cumulative gain) - total gain accumulated at a particular rank position p
• DCG_p = rel_1 + sum(i=2 to p) rel_i/log_2(i)
• DCG_p = sum_{i=1 to p}(2^rel_i - 1)/(log_2(1+i)) where rel_i is usually 0 to 4
• this is what is actually used
• emphasize on retrieving highly relevant documents
• different queries have different numbers of relevant docs - have to normalize DCG
• normalized DCG - normalize by the DCG of the ideal ranking
• statistical significance tests - an observed difference could just be due to the particular queries you chose
• p-value - prob of data using null hypothesis, if p < alpha we reject null hypothesis
1. sign test
• hypothesis - difference median is zero
2. wilcoxon signed rank test
• hypothesis - data are paired and come from the same population
3. paired t-test
• difference has zero mean value
4. one-tail vs. two-tail
• lol use two-tail
• kappa statistic - measures agreement between assessors: kappa = (P(judges agree) - P(judges agree randomly)) / (1 - P(judges agree randomly))
• = 0 if they agree only by chance
• = 1 if they always agree; < 0 if they agree less than chance
• P(judges agree randomly) is computed from the marginals for yes-yes and no-no
• pooling - hard to annotate all docs - relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems
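The ranked metrics above in code form (binary relevance lists ordered by rank; DCG uses the rel_1 + Σ_{i≥2} rel_i/log_2(i) form):

```python
import math

def precision_at_k(rels, k):
    return sum(rels[:k]) / k

def average_precision(rels):
    hits, total = 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / i      # p@k at each relevant doc
    return total / hits if hits else 0.0

def reciprocal_rank(rels):
    for i, rel in enumerate(rels, start=1):
        if rel:
            return 1 / i
    return 0.0

def dcg(gains):
    return gains[0] + sum(g / math.log2(i) for i, g in enumerate(gains[1:], start=2))

def ndcg(gains):
    return dcg(gains) / dcg(sorted(gains, reverse=True))   # normalize by ideal ranking
```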

### feedback as model interpolation

• important that we take distance from Q to D not D to Q
• this is because the measure is asymmetric

### mp3

• 2^rel - rel can be 0 or 1
• whenever you change stopword removal/stemming, have to rebuild index
• otherwise, you will think they are all important

### as we may think

• there are too many published things, hard to keep track

### 19 web search basics

• client server design
1. server communicates with client via a protocol such as http in a markup language such as html
2. client - generally a browser - can ignore what it doesn’t understand
• we need to include authoritativeness when thinking about a document’s relevance
• we can view html pages as nodes and hyperlinks as directed edges
• power law: number of web pages w/ in-degree i ~ 1/(i^a)
• bowtie structure: three types of webpages IN -> SCC -> OUT
• spam - would repeat keywords to be included in searches
• there is paid inclusion
• cloaking - different page is shown to crawler than to user
• doorway page - text and metadata to rank highly - then redirects
• SEO (search engine optimizers) - consulting for helping people rank highly
• search engine marketing - how to budget different keywords
• some search engines started out without advertising
• advertising - per click, per view
• competitors can click spam the ads of opponents
• types of queries
• informational - general info
• navigational - specific website
• difficult to get size of index
• shingling - count repeating consecutive sequences
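Shingling sketch: compare documents by the Jaccard overlap of their k-word shingle sets (k = 3 here; the sentences are made up):

```python
def shingles(tokens, k=3):
    """Set of all contiguous k-token sequences."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    """|A intersect B| / |A union B| - near-duplicates score close to 1."""
    return len(a & b) / len(a | b) if a | b else 1.0

s1 = shingles("a rose is a rose is a rose".split())
s2 = shingles("a rose is a flower which is a rose".split())
```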

### 2.2, 20.1, 20.2

• hard to tokenize

### 2.3, 2.4, 4, 5.2, 5.3

• compression and vocabulary

### 1.3, 1.4 boolean retrieval

• find lists for each term, then intersect or union or complement
• lists need to be sorted by docId so we can just increment the pointers
• we start with shortest lists and do operations to make things faster
• at any point we only want to look at the smallest possible list
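The merge step for two docId-sorted postings lists (the AND case); both pointers only move forward, so the intersection costs O(len(p1) + len(p2)):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists by advancing the smaller head."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out
```

Starting the full AND query with the shortest list keeps every intermediate result (and hence every later merge) as small as possible.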

### 6.2, 6.3, 6.4 vector space model

• tf(t,d) = term frequency of term t in doc d
• uses bag of words - order doesn’t matter, just frequency
• often replaced by wf(t,d) = 1+log(tf(t,d)) if tf(t,d) > 0, else 0, because many more occurrences of a term doesn’t make the doc proportionally more relevant
• also could normalize ntf(t,d) = a + (1-a)*tf(t,d)/tf_max(d) where a is a smoothing term
• idf(t) = inverse document frequency of term t
• collection frequency = total number of occurrences of a term in the collection.
• document frequency df(t) = #docs in that contain term t.
• idf(t) = log(N/df(t)) where N = #docs
• combination weighting scheme: tf_idf(t,d) = tf(t,d)*idf(t) - (tf is actually log)
• document score = sum over terms tf_idf(t,d)
• cosine similarity = (doc1 · doc2) / (|doc1| |doc2|) - dot product normalized by vector lengths
• we want the highest possible similarity
• euclidean distance penalizes long documents too much
• similarity = cosine similarity of (query,doc)
• pivoted normalized document length? - generally penalizes long documents, but avoids over-penalizing them

### 13 text classification

• machine learning approach
• naive bayes text classification

### math

• Real Analysis

# ch 1 - the real numbers

• there is no rational number whose square is 2 <div class="collapse" id="111"> proof by contradiction </div>
• contrapositive: $\neg q \to \neg p$ - logically equivalent
• triangle inequality: $|a+b| \leq |a| + |b|$
often use $|a-b| = |(a-c)+(c-b)| \leq |a-c| + |c-b|$
• axiom of completeness - every nonempty set $A \subseteq \mathbb{R}$ that is bounded above has a least upper bound
• doesn’t work for $\mathbb{Q}$
• supremum = supA = least upper bound (similarly, infimum)
1. supA is an upper bound of A
2. if $s \in \mathbb{R}$ is another u.b. then $s \geq supA$
• can be restated as $\forall \epsilon > 0, \exists a \in A$ s.t. $s-\epsilon < a$
• nested interval property - for each $n \in N$, assume we are given a closed interval $I_n = [a_n,b_n]={ x \in \mathbb{R} : a_n \leq x \leq b_n }$ Assume also that each $I_n$ contains $I_{n+1}$. Then, the resulting nested sequence of nonempty closed intervals $I_1 \supseteq I_2 \supseteq …$ has a nonempty intersection <div class="collapse" id="141"> use AoC with x = sup{$a_n: n \in \mathbb{N}$} in the intersection of all sets</div>
• archimedean property
1. $\mathbb{N}$ is unbounded above (sup $\mathbb{N}=\infty$)
2. $\forall x \in \mathbb{R}, x>0, \exists n \in \mathbb{N}, 0 < \frac{1}{n} < x$
• $\mathbb{Q}$ is dense in $\mathbb{R}$ - for every $a,b \in \mathbb{R}, a<b$, $\exists r \in \mathbb{Q}$ s.t. $a<r<b$
• pf: want $a < \frac{m}{n} < b$
• by Archimedean property, want $\frac{1}{n} < b-a$
• corollary: the irrationals are dense in $\mathbb{R}$
• there exists a real number $r \in \mathbb{R}$ satisfying $r^2 = 2$
• pf: let r = $sup { t \in \mathbb{R} : t^2 < 2 }$. disprove $r^2<2, r^2>2$ by considering $r+\frac{1}{n},r-\frac{1}{n}$
• A ~ B if there exists f:A->B that is 1-1 and onto
• A is finite - there exists n $\in \mathbb{N}$ s.t. $\mathbb{N}_n$~A
• countable = $\mathbb{N}$~A.
• uncountable - infinite set that isn’t countable
• Q is countable
• pf: Let $A_n = { \pm \frac{p}{q}:$ where p,q $\in \mathbb{N}$ are in lowest terms with p+q=n}
• R is uncountable
• pf: Assume we can enumerate $\mathbb{R}$. Use NIP to exclude one enumerated point each time; the intersection is still nonempty, so we didn’t successfully enumerate $\mathbb{R}$
• $\frac{x}{x^2-1}$ maps $(-1,1) \to \mathbb{R}$ (a bijection, so an interval has the same cardinality as $\mathbb{R}$)
• countable union of countable sets is countable
• if $A \subseteq B$ and B countable, then A is either countable or finite
• if $A_n$ is a countable set for each $n \in \mathbb{N}$, then their union is countable
• the open interval (0,1) = ${ x \in \mathbb{R} : 0 < x < 1 }$ is uncountable
• pf: diagonalization - assume (0,1) can be enumerated; list the decimal expansions as rows of a matrix. The number whose digits differ from the diagonal in every position is in no row - contradiction.
• cantor’s thm - Given any set A, there does not exist a function f:$A \to P(A)$ that is onto
• P(A) is the set of all subsets of A

# ch 2 - sequences and series

• a sequence $(a_n)$ converges to a real number $a$ if $\forall \epsilon > 0, \exists N \in \mathbb{N}$ such that $\forall n\geq N, |a_n-a| < \epsilon$
• otherwise it diverges
• if a limit exists, it is unique
• a sequence $(x_n)$ is bounded if there exists a number M > 0 such that $|x_n| \leq M$ $\forall n \in \mathbb{N}$
• every convergent sequence is bounded
• algebraic limit thm - let lim $a_n = a$ and lim $b_n$ = b. Then
1. lim($ca_n$) = ca
2. lim($a_n+b_n$) = a+b
3. lim($a_n b_n$) = ab
4. lim($a_n/b_n$) = a/b, provided b $\neq$ 0
• pf 3: use triangle inequality, $|a_nb_n-ab| = |a_nb_n-ab_n+ab_n-ab| \leq |b_n||a_n-a| + |a||b_n-b|$
• pf 4: show $(b_n) \to b$ implies $(\frac{1}{b_n}) \to \frac{1}{b}$
• order limit thm - Assume lim $a_n = a$ and lim $b_n$ = b.
1. If $a_n \geq 0$ $\forall n \in \mathbb{N}$, then $a \geq 0$
2. If $a_n \leq b_n$ $\forall n \in \mathbb{N}$, then $a \leq b$
3. If $\exists c \in \mathbb{R}$ for which $c \leq b_n$ $\forall n \in \mathbb{N}$, then $c \leq b$
• pf 1: by contradiction
• monotone - increasing or decreasing (not strictly)
• monotone convergence thm - if a sequence is monotone and bounded, then it converges
• convergence of a series
• define $s_m=a_1+a_2+…+a_m$
• $\sum_{n=1}^\infty a_n$ converges to A $\iff (s_m)$ converges to A
• cauchy condensation test - suppose $a_n$ is decreasing and satisfies $a_n \geq 0$ for all $n \in \mathbb{N}$. Then, the series $\sum_{n=1}^\infty a_n$ converges iff the series $\sum_{n=1}^\infty 2^na_{2^n}$ converges
• p-series $\sum_{n=1}^\infty 1/n^p$ converges iff p > 1

### 2.5

• let $(a_n)$ be a sequence and $n_1<n_2<…$ be an increasing sequence of natural numbers. Then $(a_{n_1},a_{n_2},…)$ is a subsequence of $(a_n)$
• subsequences of a convergent sequence converge to the same limit as the original sequence
• can be used as divergence criterion
• bolzano-weierstrass thm - every bounded sequence contains a convergent subsequence
• pf: use NIP, keep splitting interval into two

### 2.6

• $(a_n)$ is a cauchy sequence if $\forall \epsilon > 0, \exists N \in \mathbb{N}$ such that $\forall m,n\geq N, |a_n-a_m| < \epsilon$
• cauchy criterion - a sequence converges $\iff$ it is a cauchy sequence
• cauchy sequences are bounded
• overview: AoC $\iff$ NIP $\iff$ MCT $\iff$ BW $\iff$ CC

### 2.7

• algebraic limit thm - let $\sum_{n=1}^\infty a_n$ = A, $\sum_{n=1}^\infty b_n$ = B
1. $\sum_{n=1}^\infty ca_n$ = cA
2. $\sum_{n=1}^\infty a_n+b_n$ = A+B
1. cauchy criterion for series - series converges $\iff$ $(s_m)$ is a cauchy sequence
• if the series $\sum_{n=1}^\infty a_n$ converges then lim $a_n=0$
1. comparison test
2. geometric series - $\sum_{n=0}^\infty a r^n = \frac{a}{1-r}$ for $|r| < 1$
• $s_m = a+ar+…+ar^{m-1} = \frac{a(1-r^m)}{1-r}$
1. absolute convergence test
2. alternating series test - if $(a_n)$ is (1) decreasing and (2) lim $a_n$ = 0
• then, $\sum_{n=1}^\infty (-1)^{n+1} a_n$ converges
• rearrangements: there exists one-to-one correspondence
• if a series converges absolutely, any rearrangement converges to same limit

# ch 3 - basic topology of R

### 3.1 cantor set

• C has small length, but its cardinality is uncountable
• discussion of dimensions, doubling sizes leads to 2^dimension sizes
• Cantor set is about dimension .631

### 3.2 open/closed sets

• A set O $\subseteq \mathbb{R}$ is open if for all points a $\in$ O there exists an $\epsilon$-neighborhood $V_{\epsilon}(a) \subseteq O$
• $V_{\epsilon}(a)=\{ x \in \mathbb{R} : |x-a| < \epsilon \}$
1. the union of an arbitrary collection of open sets is open
2. the intersection of a finite collection of open sets is open
• a point x is a limit point of a set A if every $\epsilon$-neighborhood $V_{\epsilon}(x)$ of x intersects the set A at some point other than x
• a point x is a limit point of a set A if and only if x = lim $a_n$ for some sequence ($a_n$) contained in A satisfying $a_n \neq x$ for all n $\in$ N
• isolated point - not a limit point
• set $F \subseteq \mathbb{R}$ closed - contains all limit points
• closed iff every Cauchy sequence contained in F has a limit that is also an element of F
• density of Q in R - for every $y \in \mathbb{R}$, there exists a sequence of rational numbers that converges to y
• closure - set with its limit points
• closure $\bar{A}$ is smallest closed set containing A
• iff set open, complement is closed
• R and $\emptyset$ are both open and closed
1. the union of a finite collection of closed sets is closed
2. the intersection of an arbitrary collection of closed sets is closed

### 3.3

• a set K $\subseteq \mathbb{R}$ is compact if every sequence in K has a subsequence that converges to a limit that is also in K
• Nested Compact Set Property - intersection of nested sequence of nonempty compact sets is not empty
• let A $\subseteq \mathbb{R}$. open cover for A is a (possibly infinite) collection of open sets whose union contains the set A.
• given an open cover for A, a finite subcover is a finite sub-collection of open sets from the original open cover whose union still manages to completely contain A
• Heine-Borel thm - let K $\subseteq \mathbb{R}$. All of the following are equivalent
1. K is compact
2. K is closed and bounded
3. every open cover for K has a finite subcover

# ch 4 - functional limits and continuity

### 4.1

• dirichlet function: 1 if x $\in \mathbb{Q}$, 0 otherwise

### 4.2 functional limits

• def 1. Let f:$A \to R$, and let c be a limit point of the domain A. We say that $lim_{x \to c} f(x) = L$ provided that for all $\epsilon$ > 0, there exists a $\delta$ > 0 s.t. whenever $0 < |x-c| < \delta$ (and x $\in$ A) it follows that $|f(x)-L| < \epsilon$
• def 2. Let f:$A \to R$, and let c be a limit point of the domain A. We say that $lim_{x \to c} f(x) = L$ provided that for every $\epsilon$-neighborhood $V_{\epsilon}(L)$ of L, there exists a $\delta$-neighborhood $V_{\delta}($c) around c with the property that for all x $\in V_{\delta}($c) different from c (with x $\in$ A) it follows that f(x) $\in V_{\epsilon}(L)$.
• sequential criterion for functional limits - Given function f:$A \to R$ and a limit point c of A, the following 2 statements are equivalent:
1. $lim_{x \to c} f(x) = L$
2. for all sequences $(x_n) \subseteq$ A satisfying $x_n \neq$ c and $(x_n) \to c$, it follows that $f(x_n) \to L$.
• algebraic limit thm for functional limits
• divergence criterion for functional limits

### 4.3 continuous functions

• a function f:$A \to R$ is continuous at a point c $\in$ A if, for all $\epsilon$>0, there exists a $\delta$>0 such that whenever $|x-c| <\delta$ (and x$\in$ A) it follows that $|f(x)-f( c)| <\epsilon$. f is continuous if it is continuous at every point in the domain A
• characterizations of continuity
• criterion for discontinuity
• algebraic continuity theorem
• if f is continuous at c and g is continous at f( c) then g $\circ$ f is continuous at c

### 4.4 continuous functions on compact sets

• preservation of compact sets - if f continuous and K compact, then f(K) is compact as well
• extreme value theorem - if f is continuous on a compact set K, then f attains a maximum and minimum value. In other words, there exist $x_0,x_1 \in K$ such that $f(x_0) \leq f(x) \leq f(x_1)$ for all x $\in$ K
•  f is uniformly continuous on A if for every $\epsilon$>0, there exists a $\delta$>0 such that for all x,y $\in$ A, $|x-y| < \delta \implies |f(x)-f(y)| < \epsilon$
•  a function f fails to be uniformly continuous on A iff there exists a particular $\epsilon_0$ > 0 and two sequences $(x_n),(y_n)$ in A satisfying $|x_n - y_n| \to 0$ but $|f(x_n)-f(y_n)| \geq \epsilon_0$
• a function that is continuous on a compact set K is uniformly continuous on K

### 4.5 intermediate value theorem

• intermediate value theorem - Let f:[a,b]$\to R$ be continuous. If L is a real number satisfying f(a) < L < f(b) or f(a) > L > f(b), then there exists a point c $\in (a,b)$ where f( c) = L
• a function f has the intermediate value property on an interval [a,b] if for all x < y in [a,b] and all L between f(x) and f(y), it is always possible to find a point c $\in (x,y)$ where f( c)=L.
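
The IVT is what justifies root bracketing: if a continuous f changes sign on [a,b], halving the interval must trap a root. A minimal sketch (the helper name and test function are illustrative, not from the text):

```python
def bisect(f, a, b, tol=1e-12):
    """Locate c in (a, b) with f(c) = 0, assuming f is continuous
    and f(a), f(b) have opposite signs (IVT guarantees c exists)."""
    fa = f(a)
    while b - a > tol:
        m = (a + b) / 2
        fm = f(m)
        if fm == 0:
            return m
        if fa * fm < 0:     # sign change on [a, m]: root is there
            b = m
        else:               # otherwise the root is in [m, b]
            a, fa = m, fm
    return (a + b) / 2

# f(x) = x^2 - 2 is continuous with f(1) < 0 < f(2), so IVT gives a root
root = bisect(lambda x: x * x - 2, 1.0, 2.0)
```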

# ch 5 - the derivative

### 5.2 derivatives and the intermediate value property

• let g: A -> R be a function defined on an interval A. Given c $\in$ A, the derivative of g at c is defined by g’( c) = $\lim_{x \to c} \frac{g(x) - g( c)}{x-c}$, provided this limit exists. Then g is differentiable at c. If g’ exists for all points in A, we say g is differentiable on A
• identity: $x^n-c^n = (x-c)(x^{n-1}+cx^{n-2}+c^2x^{n-3}+\cdots+c^{n-1})$
• differentiable $\implies$ continuous
• algebraic differentiability theorem
1. addition
2. scalar multiplying
3. product rule
4. quotient rule
• chain rule: let f:A-> R and g:B->R satisfy f(A)$\subseteq$ B so that the composition g $\circ$ f is defined. If f is differentiable at c in A and g differentiable at f( c) in B, then g $\circ$ f is differentiable at c with (g$\circ$f)’( c)=g’(f( c))*f’( c)
• interior extremum thm - let f be differentiable on an open interval (a,b). If f attains a maximum or minimum value at some point c $\in$ (a,b), then f’( c) = 0.
• Darboux’s thm - if f is differentiable on an interval [a,b], and $\alpha$ satisfies f’(a) < $\alpha$ < f’(b) (or f’(a) > $\alpha$ > f’(b)), then there exists a point c $\in (a,b)$ where f’( c) = $\alpha$
• derivative satisfies intermediate value property

### 5.3 mean value theorems

• mean value theorem - if f:[a,b] -> R is continuous on [a,b] and differentiable on (a,b), then there exists a point c $\in$ (a,b) where $f’( c) = \frac{f(b)-f(a)}{b-a}$
• Rolle’s thm - f(a)=f(b) -> f’( c)=0
• if f’(x) = 0 for all x in A, then f(x) = k for some constant k
• if f and g are differentiable functions on an interval A and satisfy f’(x) = g’(x) for all x $\in$ A, then f(x) = g(x) + k for some constant k
•  generalized mean value theorem - if f and g are continuous on the closed interval [a,b] and differentiable on the open interval (a,b), then there exists a point c $\in (a,b)$ where f(b)-f(a) g’( c) = g(b)-g(a) f’( c). If g’ is never 0 on (a,b), then can be restated $\frac{f’( c)}{g’( c)} = \frac{f(b)-f(a)}{g(b)-g(a)}$
•  given g: A -> R and a limit point c of A, we say that $lim_{x \to c} g(x) = \infty$ if, for every M > 0, there exists a $\delta$> 0 such that whenever 0 < $|x-c|$ < $\delta$ it follows that g(x) ≥ M
• L’Hospital’s Rule: 0/0 - let f and g be continuous on an interval containing a, and assume f and g are differentiable on this interval with the possible exception of the point a. If f(a) = g(a) = 0 and g’(x) ≠ 0 for all x ≠ a, then $lim_{x \to a} \frac{f’(x)}{g’(x)} = L \implies lim_{x \to a} \frac{f(x)}{g(x)} = L$
• L’Hospital’s Rule: $\infty / \infty$ - assume f and g are differentiable on (a,b) and g’(x) ≠ 0 for all x in (a,b). If $lim_{x \to a} g(x) = \infty$, then $lim_{x \to a} \frac{f’(x)}{g’(x)} = L \implies lim_{x \to a} \frac{f(x)}{g(x)} = L$
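
A quick numeric illustration of the 0/0 case (the sample function is illustrative): for f(x) = sin x and g(x) = x at a = 0, f’/g’ = cos(x)/1 → 1, so the rule predicts sin(x)/x → 1.

```python
import math

# 0/0 form at a = 0: f(x) = sin x, g(x) = x.
# L'Hospital: lim f'/g' = cos(0)/1 = 1, so lim sin(x)/x = 1.
ratios = [math.sin(x) / x for x in (1e-1, 1e-3, 1e-6)]
```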

# ch 6 - sequences and series of function

### 6.2 uniform convergence of a sequence of functions

• for each n $\in \mathbb{N}$ let $f_n$ be a function defined on a set A$\subseteq R$. The sequence ($f_n$) of functions converges pointwise on A to a function f if, for all x in A, the sequence of real numbers $f_n(x)$ converges to f(x)
•  let ($f_n$) be a sequence of functions defined on a set A$\subseteq$R. Then ($f_n$) converges uniformly on A to a limit function f defined on A if, for every $\epsilon$>0, there exists an N in $\mathbb{N}$ such that $\forall n ≥N, x \in A , |f_n(x)-f(x)| <\epsilon$
•  Cauchy Criterion for uniform convergence - a sequence of functions $(f_n)$ defined on a set A $\subseteq$ R converges uniformly on A iff $\forall \epsilon > 0 \exists N \in \mathbb{N}$ s.t. whenever m,n ≥N and x in A, $|f_n(x)-f_m(x)| <\epsilon$
• continuous limit thm - Let ($f_n$) be a sequence of functions defined on A that converges uniformly on A to a function f. If each $f_n$ is continuous at c in A, then f is continuous at c
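
The standard contrast is $f_n(x) = x^n$: it converges pointwise to 0 on [0,1), uniformly only on [0,b] with b < 1. A sketch (since $x^n$ is increasing, the sup error on [0,b] is just $b^n$; the numbers chosen are illustrative):

```python
def sup_xn(b, n):
    """sup of f_n(x) = x^n over [0, b]: x^n is increasing, so the sup is b^n."""
    return b ** n

uniform_err = sup_xn(0.9, 200)     # -> 0: convergence is uniform on [0, 0.9]
stuck_err = sup_xn(0.999999, 200)  # stays near 1: not uniform on [0, 1)
```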

### 6.3 uniform convergence and differentiation

• differentiable limit theorem - let $f_n \to f$ pointwise on the closed interval [a,b], and assume that each $f_n$ is differentiable. If $(f’_n)$ converges uniformly on [a,b] to a function g, then the function f is differentiable and f’=g
• let ($f_n$) be a sequence of differentiable functions defined on the closed interval [a,b], and assume $(f’_n)$ converges uniformly to a function g on [a,b]. If there exists a point $x_0 \in [a,b]$ for which $f_n(x_0)$ is convergent, then ($f_n$) converges uniformly on [a,b]. Moreover, the limit function f = lim $f_n$ is differentiable and satisfies f’ = g

### 6.4 series of functions

• term-by-term continuity thm - let $f_n$ be continuous functions defined on a set A $\subseteq$ R and assume $\sum f_n$ converges uniformly on A to a function f. Then, f is continuous on A.
• term-by-term differentiability thm - let $f_n$ be differentiable functions defined on an interval A, and assume $\sum f’_n(x)$ converges uniformly to a limit g(x) on A. If there exists a point $x_0 \in [a,b]$ where $\sum f_n(x_0)$ converges, then the series $\sum f_n(x)$ converges uniformly to a differentiable function f(x) satisfying f’(x) = g(x) on A. In other words, $f(x) = \sum f_n(x)$ and $f’(x) = \sum f’_n(x)$
•  Cauchy Criterion for uniform convergence of series - A series $\sum f_n$ converges uniformly on A iff $\forall \epsilon > 0 \exists N \in \mathbb{N}$ s.t. whenever n>m≥N, x in A, $|f_{m+1}(x) + f_{m+2}(x) + f_{m+3}(x) + \cdots +f_n(x)| < \epsilon$
•  Weierstrass M-Test - For each n in $\mathbb{N}$, let $f_n$ be a function defined on a set A $\subseteq$ R, and let $M_n > 0$ be a real number satisfying $|f_n(x)| ≤ M_n$ for all x in A. If $\sum M_n$ converges, then $\sum f_n$ converges uniformly on A

### 6.5 power series

• power series f(x) = $\sum_{n=0}^\infty a_n x^n = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + …$
•  if a power series converges at some point $x_0 \in \mathbb{R}$, then it converges absolutely for any x satisfying $|x| < |x_0|$
• if a power series converges pointwise on the set A, then it converges uniformly on any compact set K $\subseteq$ A
•  if a power series converges absolutely at a point $x_0$, then it converges uniformly on the closed interval [-c,c], where c = $|x_0|$
• Abel’s thm - if a power series converges at the point x = R > 0, then the series converges uniformly on the interval [0,R]. A similar result holds if the series converges at x = -R
• if $\sum_{n=0}^\infty a_n x^n$ converges for all x in (-R,R), then the differentiated series $\sum_{n=0}^\infty n a_n x^{n-1}$ converges at each x in (-R,R) as well. Consequently the convergence is uniform on compact sets contained in (-R,R).
• can take infinite derivatives

### 6.6 taylor series

• Taylor’s Formula $\sum_{n=0}^\infty a_n x^n = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + …$
• centered at 0: $a_n = \frac{f^{(n)}(0)}{n!}$
• Lagrange’s Remainder thm - Let f be differentiable N+1 times on (-R,R), define $a_n = \frac{f^{(n)}(0)}{n!}…..$
• not every infinitely differentiable function can be represented by its Taylor series (radius of convergence zero)
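
With $a_n = f^{(n)}(0)/n!$, a partial sum already tracks the function closely near 0. A sketch for sin (helper name illustrative), whose coefficients reduce to $(-1)^n/(2n+1)!$:

```python
import math

def taylor_sin(x, N):
    """Partial Taylor sum for sin centered at 0:
    sum_{n=0}^{N} (-1)^n x^(2n+1) / (2n+1)!"""
    return sum((-1) ** n * x ** (2 * n + 1) / math.factorial(2 * n + 1)
               for n in range(N + 1))

approx = taylor_sin(1.0, 10)  # 11 terms, up to x^21/21!
```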

# ch 7 - the Riemann Integral

### 7.2 def of Riemann integral

• partition of [a,b] is a finite set of points from [a,b] that includes both a and b
• lower sum L(f,P) - for each subinterval of the partition, multiply the inf of f on it by its width, then sum (the "smallest rectangles"); the upper sum U(f,P) uses the sup instead
• a partition Q is a refinement of a partition P if $P \subseteq Q$
• if $P \subseteq Q$, then L(f,P)≤L(f,Q) and U(f,P)≥U(f,Q)
• a bounded function f on the interval [a,b] is Riemann-integrable if U(f) = L(f) = $\int_a^b f$
• iff $\forall \epsilon >0$, there exists a partition P of [a,b] such that $U(f,P)-L(f,P)<\epsilon$
• U(f) = inf{U(f,P)} for all possible partitions P
• if f is continuous on [a,b] then it is integrable
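
For an increasing f, the inf and sup on each subinterval are its endpoints, so the Darboux sums are easy to compute and squeeze the integral. A sketch with $f(x)=x^2$ on [0,1] (names and the uniform partition are illustrative):

```python
def darboux_sums(f, a, b, n):
    """Lower and upper sums L(f,P), U(f,P) on a uniform n-piece partition,
    assuming f is increasing (so inf/sup on each piece are the endpoints)."""
    h = (b - a) / n
    xs = [a + i * h for i in range(n + 1)]
    L = sum(f(xs[i]) * h for i in range(n))      # left endpoints give the inf
    U = sum(f(xs[i + 1]) * h for i in range(n))  # right endpoints give the sup
    return L, U

L, U = darboux_sums(lambda x: x * x, 0.0, 1.0, 1000)  # true integral is 1/3
```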

### 7.3 integrating functions with discontinuities

• if f:[a,b]->R is bounded and f is integrable on [c,b] for all c in (a,b), then f is integrable on [a,b]

### 7.4 properties of Integral

• assume f: [a,b]->R is bounded and let c in (a,b). Then, f is integrable on [a,b] iff f is integrable on [a,c] and [c,b]. In this case we have $\int_a^b f = \int_a^c f + \int_c^b f$
• integrable limit thm - Assume that $f_n \to f$ uniformly on [a,b] and that each $f_n$ is integrable. Then, f is integrable and $lim_{n \to \infty} \int_a^b f_n = \int_a^b f$.

### 7.5 fundamental theorem of calculus

1. If f:[a,b] -> R is integrable, and F:[a,b]->R satisfies F’(x) = f(x) for all x $\in$ [a,b], then $\int_a^b f = F(b) - F(a)$
2. Let g: [a,b]-> R be integrable and for x $\in$ [a,b] define G(x) = $\int_a^x g$. Then G is continuous on [a,b]. If g is continuous at some point $c \in [a,b]$ then G is differentiable at c and G’(c) = g(c).

# overview

• convergence
1. sequences
2. series
3. functional limits
• normal, uniform
4. sequence of funcs
• pointwise, uniform
5. series of funcs
• pointwise, uniform
6. integrability
• sequential criterion - usually good for proving discontinuous
1. limit points
2. functional limits
3. continuity
4. absence of uniform continuity
• algebraic limit theorem ~ scalar multiplication, addition, multiplication, division
1. limit thm
2. sequences
3. series - can’t multiply / divide these
4. functional limits
5. continuity
6. differentiability
7. ~integrability~
• limit thms
• continuous limit thm - Let ($f_n$) be a sequence of functions defined on A that converges uniformly on A to a function f. If each $f_n$ is continuous at c in A, then f is continuous at c
• differentiable limit theorem - let $f_n \to f$ pointwise on the closed interval [a,b], and assume that each $f_n$ is differentiable. If $(f’_n)$ converges uniformly on [a,b] to a function g, then the function f is differentiable and f’=g
• convergent derivatives almost proves that $f_n \to f$
• let ($f_n$) be a sequence of differentiable functions defined on the closed interval [a,b], and assume $(f’_n)$ converges uniformly to a function g on [a,b]. If there exists a point $x_0 \in [a,b]$ for which $f_n(x_0)$ converges, then ($f_n$) converges uniformly
• integrable limit thm - Assume that $f_n \to f$ uniformly on [a,b] and that each $f_n$ is integrable. Then, f is integrable and $lim_{n \to \infty} \int_a^b f_n = \int_a^b f$.
• functions are continuous at isolated points, but limits don’t exist there
•  uniform continuity: control $|f(x)-f(y)|$
• derivative doesn’t have to be continuous
• integrable if bounded with finitely many discontinuities
• Calculus

# Single-variable calculus

Derivatives:

$\frac{d}{dx}x^n = nx^{n-1}$

$\frac{d}{dx}a^x = a^{x}ln(a)$

$\frac{d}{dx}ln(x) = 1/x$

$\frac{d}{dx}tan(x)= sec^2(x)$

$\frac{d}{dx}cot(x)= -csc^2(x)$

$\frac{d}{dx}sec(x)= sec(x)tan(x)$

$\frac{d}{dx}csc(x)= -csc(x)cot(x)$

 $\int tan = ln|sec|$
 $\int cot = ln|sin|$
 $\int sec = ln|sec+tan|$
 $\int csc = ln|csc-cot|$

$\int \frac{du}{\sqrt{a^2-u^2}} = sin^{-1}(\frac{u}{a})$

$\int \frac{du}{u\sqrt{u^2-a^2}} = \frac{1}{a}sec^{-1}(\frac{u}{a})$

$\int \frac{du}{a^2+u^2} = \frac{1}{a} tan^{-1}(\frac{u}{a})$

Continuous: left limit = right limit = value

Differentiable: continuous and no sharp points / asymptotes

L’Hospital’s - for indeterminate forms: $lim \frac{f(x)}{g(x)} = lim \frac{f’(x)}{g’(x)}$

Integration by parts: $\int{u\,dv}=uv-\int{v\,du}$, LIATE

Expansions:

$e^x = \sum{\frac{x^n}{n!}}$

$sin(x) = \sum_0^\infty{\frac{(-1)^n x^{2n+1}}{(2n+1)!}}$

$cos(x) = \sum_0^\infty{\frac{(-1)^n x^{2n}}{(2n)!}}$

Geometric Sum: $a_1\frac{1-r^{n+1}}{1-r}$ (first term $a_1$, $n+1$ terms)

# Multivariable calculus

Polar: r,$\theta$,z

Spherical: $\rho,\theta,\phi$

Clairaut’s Thm: if the mixed partials are continuous, $f_{xy}=f_{yx}$ (used to test for conservative fields)

• Chaos

# Normal forms of bifurcations

• pitchfork: $\dot{x} = \lambda x - x^3$
• subcritical pitchfork: $\dot{x} = \lambda x + x^3$
• saddle node (turning point): $\dot{x} = \lambda - x^2$
• transcritical: $\dot{x} = \lambda x - x^2$

# important figs

• period-doubling (flip bifurcation) $f = \mu x (1-x)$ ($f = \mu sin (\pi x)$ is similar)
• inverse tangent bifurcation - unstable and stable P-3 orbits coalesce, move slightly off bisector and becomes chaotic
• pendulum
• energy surface - trajectories run around the surface, not down it
• Conservative systems: 6.5
• study Hamiltonian p. 187-188
• Pendulum: 6.7
• dynamics - study of things that evolve with time
• chaos - deterministic, aperiodic, sensitive, long-term prediction impossible
1. phase space - has coordinates $x_1,…,x_n$
2. phase portrait - variable x-axis, derivative y-axis
3. bifurcation diagram - parameter x-axis, steady state y-axis
• draw separate graphs for these
• first check - look for fixed points
• for 1-D, if f’ $<$ 0 then stable
• stable f.p. = all possible ICs in a.s.b.f.n. result in trajectories that remain in a.s.b.f.n. for all time
• asymptotically stable f.p. - stable and approaches f.p. as $t\to\infty$
• hyperbolic f.p. - eigenvals aren’t strictly imaginary
• bifurcation point of f.p. - point where num solutions change or phase portraits change significantly
• globally stable - stable from any ICs
• autonomous = f is a function of x, not t
• we can always make a system autonomous by having $x_n$ = t, so $\dot{x_n}$ = 1
• dimension = number of 1st order ODEs, dimension of phase-space
• existence and uniqueness thm: if f and $\frac{\partial f}{\partial x}$ are continuous, then a unique solution exists (locally)
• linearization - used to find stability of f.p.s
• solving Hopf: use polar to get $\dot{\rho}, \dot{\theta}$
• multiply one thing by cos, one by sin, then add

• $\rho = \sqrt{x_1^2 + x_2^2}$, $\theta = tan^{-1}(\frac{x_2}{x_1})$
• Hysteresis curve - S-shaped curve of fixed branches - ruler getting larger - snap bifurcation - both axes are parameters

# Systems of Linear ODEs

• solutions are of the form $\underbar{x}(t) = \underbar{C}_1e^{\alpha_1 t} + \underbar{C}_2e^{\alpha_2 t}$
• Eigenspaces: $E^S$ (stable), $E^U$ (unstable), $E^C$ (center - zero real part) - plot eigenvectors
• how to solve these systems?
• solve eigenvectors
• positive real part - goes out
• negative real part - goes in
• bifurcation requires 0 as eigenvalue
• has imaginary component: spiral / focus
• purely imaginary - center = stable, but not a.s.
• finite velocity = $\frac{dRe(\alpha)}{d\lambda}$
• change coordinates to polar
• for $\lambda \geq 0$, solution is a stable L.C. (from either direction spirals into a circular orbit)
• attracting - any trajectory that starts within $\delta$ of $\bar{\underbar{x}}$ evolves to $\bar{\underbar{x}}$ as t $\to \infty$ (it doesn’t have to remain within $\delta$ at all times)
• stable (Lyapunov stable) - any trajectory that starts within $\delta$ remains within $\varepsilon$ for all time ($\varepsilon$ is chosen first)
• asymptotically stable - attracting and stable
• hyperbolic f.p. - iff all eigenvals of the linearization of the nds about the f.p. have nonzero real parts

# Discrete Nonlinear Dynamical Systems

• functional iteration: $x_{n+m} = f^m(x_n)$ (apply f m times)
• fixed point: $f(x^*)=x^*$
•  f.p. stable if $|\frac{df}{dx}(x^*)| <1$, unstable if $>$ 1
• check n-orbit by checking the derivative of $f^n$: $\frac{df^n}{dx}(x_0^*) = \prod_{i=0}^{n-1} \frac{df}{dx}(x_i^*)$
• period-doubling bifurcations
• superstability - orbit for which the stability-determining derivative is zero. This means that the max of the map and the point at which the max occurs are in the orbit.
• type I intermittency - exhibited by inverse tangent bifurcation
• Feigenbaum sequence - period-doubling path to chaos, keep increasing parameter until period is chaotic
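
The flip bifurcations of the logistic map are easy to see numerically: iterate past the transient and count the distinct values the orbit settles onto. A sketch (helper name, seed, and iteration counts are illustrative):

```python
def logistic_orbit(mu, x0=0.2, burn=1000, keep=8):
    """Iterate x -> mu*x*(1-x), discard a transient of `burn` steps,
    then return the distinct (rounded) values the orbit settles onto."""
    x = x0
    for _ in range(burn):
        x = mu * x * (1 - x)
    tail = []
    for _ in range(keep):
        x = mu * x * (1 - x)
        tail.append(round(x, 6))
    return sorted(set(tail))

fixed = logistic_orbit(2.8)    # stable fixed point at 1 - 1/mu
period2 = logistic_orbit(3.2)  # stable period-2 orbit after the first flip
```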

**3D Attractors**

| Type of Attractor | Sign of Exponents |
| --- | --- |
| Fixed Point | (-, -, -) |
| Limit Cycle | (0, -, -) |
| Torus | (0, 0, -) |
| Strange Attractor | (+, 0, -) |

• homoclinic orbit - connects unstable manifold of saddle point to its own stable manifold
• e.g. trajectory that starts and ends at the same fixed point
• manifolds are denoted by a W (ex. $W^S$ is the stable manifold)
• heteroclinic orbit - connects unstable manifold of fp to stable manifold of another fp

# Conservative Systems

• $F(x) = -\frac{dV}{dx}$ (by defn.)
• $m\ddot{x}+\frac{dV}{dx}=0$, multiply by $\dot{x} \to \frac{d}{dt}[\frac{1}{2}m\dot{x}^2+V(x)]=0$
• so total energy $E=\frac{1}{2}m\dot{x}^2+V(x)$
• motion of pendulum: $\frac{d^2\theta}{dt^2}+\frac{g}{L}sin\theta=0$
• nondimensionalize with $\omega=\sqrt{g/L}, \tau=\omega t \to \ddot{\theta}+sin\theta =0$
• can multiply this by $\dot{\theta}$
• $\omega$-limit $t \to \infty$
• $\alpha$-limit $t \to -\infty$
• libration - small orbit surrounding center
• system: $\dot{\theta}=\nu$, $\dot{\nu} = -sin\theta$
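
For the system $\dot{\theta}=\nu$, $\dot{\nu}=-\sin\theta$, the energy $E=\frac{1}{2}\nu^2-\cos\theta$ should stay constant along trajectories. A sketch checking this with a classical RK4 integrator (step size and initial condition are illustrative):

```python
import math

def rk4_step(f, y, h):
    """One classical Runge-Kutta (RK4) step for the autonomous system y' = f(y)."""
    k1 = f(y)
    k2 = f([y[i] + h / 2 * k1[i] for i in range(2)])
    k3 = f([y[i] + h / 2 * k2[i] for i in range(2)])
    k4 = f([y[i] + h * k3[i] for i in range(2)])
    return [y[i] + h / 6 * (k1[i] + 2 * k2[i] + 2 * k3[i] + k4[i]) for i in range(2)]

pendulum = lambda s: [s[1], -math.sin(s[0])]          # theta' = nu, nu' = -sin(theta)
energy = lambda s: 0.5 * s[1] ** 2 - math.cos(s[0])   # E = nu^2/2 - cos(theta)

state = [1.0, 0.0]        # released from rest at theta = 1 (libration)
E0 = energy(state)
for _ in range(1000):     # integrate to t = 10
    state = rk4_step(pendulum, state, 0.01)
drift = abs(energy(state) - E0)
```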

## Hamiltonian Dynamical System

• $\dot{\underbar{x}}=\frac{\partial H}{\partial y}(\underbar{x},\underbar{y})$ , $\dot{\underbar{y}}=-\frac{\partial H}{\partial x}(\underbar{x},\underbar{y})$ for some function H called the Hamiltonian
• we can only have centers (minima in the potential) and saddle points (maxima)
• separatrix - orbit that separates trapped and passing orbits
• Poincare-Bendixson Thm - can’t have chaos in a 2D system

# Ref

• $\frac{\partial}{\partial x}(f_1 * f_2 * f_3) = \frac{\partial f_1}{\partial x} f_2 f_3 + \frac{\partial f_2}{\partial x} f_1 f_3 + \frac{\partial f_3}{\partial x} f_1 f_2$
• $e^{\mu it} = cos(\mu t)+ isin(\mu t)$
• $x = A e^{(\lambda + i)t} + B e^{(\lambda - i)t} \implies x = (A’ sin(t) + B’ cos(t)) e^{\lambda t}$
• if we have $\dot{x_1},\dot{x_2}$ then we can get $x_2(x_1)$ with $\frac{dx_1}{dx_2} = \frac{\dot{x_1}}{\dot{x_2}}$
• Differential Equations

# Differential Equations

Separable: Separate and Integrate

FOLDE: $y’ + p(x)y = g(x)$, IF: $e^{\int{p(x)}dx}$

Exact: Mdx+Ndy = 0 $M_y=N_x$ Integrate Mdx or Ndy, make sure all terms are present

Constant Coefficients: Plug in $e^{rt}$, solve characteristic polynomial repeated root solutions: $e^{rt},re^{rt}$ complex root solutions: $r=a\pm bi, y=c_1e^{at} cos(bt)+c_2e^{at} sin(bt)$

SOLDE (non-constant): $py''+qy'+ry=0$

Reduction of Order: Know one solution, can find other

Undetermined Coefficients (doesn’t have to be homogenous): solve homogenous first, then plug in form of solution with variable coefficients, solve polynomial to get the coefficients

Variation of Parameters: start with homogenous solutions $y_1,y_2$ $Y_p=-y_1\int \frac{y_2g}{W(y_1,y_2)}dt+y_2\int \frac{y_1g}{W(y_1,y_2)}dt$

Laplace Transforms - for anything, best when g is noncontinuous

$\mathcal{L}(f(t))=F(s)=\int_0^\infty e^{-st}f(t)dt$

Series Solutions: More difficult

Wronskian: $W(y_1,y_2)=y_1y_2'-y_2y_1'$. W = 0 $\implies$ solns linearly dependent

Abel’s Thm: $y''+py'+qy=0 \implies W=ce^{-\int pdt}$

• linear algebra

# SVD + eigenvectors

### strang 5.1 - intro

• elimination changes eigenvalues
• eigenvector application to diff eqs $\frac{du}{dt}=Au$
• soln is exponential: $u(t) = c_1 e^{\lambda_1 t} x_1 + c_2 e^{\lambda_2 t} x_2$
• eigenvalue eqn: $Ax = \lambda x \implies (A-\lambda I)x=0$
• set $det(A-\lambda I) = 0$ to get characteristic polynomial
• eigenvalue properties
• 0 eigenvalue signals that A is singular
• eigenvalues are on the main diagonal when the matrix is triangular
• checks
1. sum of eigenvalues = trace(A)
2. prod eigenvalues = det(A)
• defective matrices - lack a full set of eigenvalues
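
For a 2x2 matrix the characteristic polynomial is $\lambda^2 - tr(A)\lambda + det(A) = 0$, so the quadratic formula gives the eigenvalues and both checks can be verified directly. A sketch (function name illustrative; real eigenvalues assumed):

```python
import math

def eig2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] via the characteristic polynomial
    lambda^2 - tr*lambda + det = 0 (assumes real eigenvalues)."""
    tr, det = a + d, a * d - b * c
    disc = math.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

l1, l2 = eig2(2.0, 1.0, 1.0, 2.0)  # symmetric example: eigenvalues 3 and 1
```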

### strang 5.2 - diagonalization

• assume A (nxn) has n eigenvectors
• S := eigenvectors as columns
• $S^{-1} A S = \Lambda$ where corresponding eigenvalues are on diagonal of $\Lambda$
• if matrix A has no repeated eigenvalues, eigenvectors are independent
• other S matrices won’t produce diagonal
• only diagonalizable if n independent eigenvectors
• not related to invertibility
• eigenvectors corresponding to different eigenvalues are lin. independent
• there are always n complex eigenvalues
• eigenvalues of $A^2$ are squared, eigenvectors remain same
• eigenvalues of $A^{-1}$ are inverse eigenvalues
• eigenvalues of the 90° rotation matrix are $\pm i$
• eigenvalues for $AB$ only multiply when A and B share eigenvectors
• diagonalizable matrices share the same eigenvector matrix S iff $AB = BA$

### strang 5.3 - difference eqs and power $A^k$

• compound interest
• solving for fibonacci numbers
• Markov matrices
• steady-state Ax = x
• corresponds to $\lambda = 1$
• stability of $u_{k+1} = A u_k$
•  stable if all eigenvalues satisfy $|\lambda_i|$ <1
•  neutrally stable if some $|\lambda_i| =1$
•  unstable if at least one $|\lambda_i|$ > 1
• Leontief’s input-output matrix
• Perron-Frobenius thm - if A is a positive matrix (positive values), so is its largest eigenvalue. Every component of the corresponding eigenvector is also positive.
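
The steady state $Ax = x$ can be found by power iteration: repeatedly applying a column-stochastic matrix drives any probability vector to the $\lambda = 1$ eigenvector. A sketch on a hypothetical 2-state chain:

```python
def steady_state(P, iters=200):
    """Steady state of a 2x2 column-stochastic Markov matrix by power
    iteration: apply P until u satisfies P u ~= u (eigenvalue 1)."""
    u = [0.5, 0.5]
    for _ in range(iters):
        u = [P[0][0] * u[0] + P[0][1] * u[1],
             P[1][0] * u[0] + P[1][1] * u[1]]
    return u

P = [[0.9, 0.2],   # columns sum to 1
     [0.1, 0.8]]
u = steady_state(P)  # solves 0.1*u0 = 0.2*u1 with u0 + u1 = 1
```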

### strang 6.3 - singular value decomposition

• SVD for any m x n matrix: $A=U\Sigma V^T$
• U (mxm) are eigenvectors of $AA^T$
• columns of V (nxn) are eigenvectors of $A^TA$
• r singular values on diagonal of $\Sigma$ (m x n) - square roots of nonzero eigenvalues of both $AA^T$ and $A^TA$
• properties
1. for PD matrices, $\Sigma=\Lambda$, $U\Sigma V^T = Q \Lambda Q^T$
• for other symmetric matrices, any negative eigenvalues in $\Lambda$ become positive in $\Sigma$
• applications
• very numerically stable because U and V are orthogonal matrices
• condition number of invertible nxn matrix = $\sigma_{max} / \sigma_{min}$
• $A=U\Sigma V^T = u_1 \sigma_1 v_1^T + … + u_r \sigma_r v_r^T$
• we can throw away columns corresponding to small $\sigma_i$
• pseudoinverse $A^+ = V \Sigma^+ U^T$

## Linear Basics

• Linear
1. Superposition f(x+y) = f(x)+f(y)
2. Proportionality $f(kx) = kf(x)$
• Vector Space
1. Closed under addition
2. Contains Identity
• Inner Product - returns a scalar
1. Linear
2. Symmetric
3. Positive definite: $\langle x,x \rangle \geq 0$, with equality iff x = 0
• Determinant - sum of products including one element from each row / column with correct sign
• Eigenvalues: $det(A-\lambda I)=0$
• Eigenvectors: Find null space of A-$\lambda$I
• Linearly Independent: $c_1x_1+c_2x_2=0 \implies c_1=c_2=0$
• Mapping $f: a \mapsto b$
• Onto (surjective): $\forall b\in B \exists a\in A \, f(a)=b$
• 1-1 (injective): $f(a_1)=f(a_2) \implies a_1=a_2$
•  norms: $\|x\|_p = (\sum_{i=1}^n |x_i|^p)^{1/p}$
• $L_0$ norm - number of nonzero elements
•  $\|x\|_1 = \sum |x_i|$
•  $\|x\|_\infty = max_i |x_i|$
•  matrix norm - given a vector norm $\|\cdot\|$, the induced matrix norm is $\|A\| = max_{x \neq 0} \|Ax\| / \|x\|$
• represents the maximum stretching that A does to a vector x
• pseudo-inverse $A^+ = (A^T A)^{-1} A^T$
• inverse
• if orthogonal, inverse is transpose
• if diagonal, inverse is invert all elements
• inverting 3x3 - transpose, find all mini dets, multiply by signs, divide by det
• to find eigenvectors, values
• $det(A-\lambda I)=0$ and solve for lambdas
• $A = QDQ^T$ where Q columns are eigenvectors
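
The pseudo-inverse formula $A^+ = (A^T A)^{-1} A^T$ can be computed by hand for a tall matrix with two columns, since the Gram matrix is 2x2 and inverts by the adjugate. A sketch (function name and example matrix are illustrative; full column rank assumed):

```python
def pinv_tall(A):
    """Pseudo-inverse A+ = (A^T A)^{-1} A^T for a full-column-rank matrix
    with 2 columns; the 2x2 Gram matrix is inverted by the adjugate formula."""
    g00 = sum(r[0] * r[0] for r in A)   # entries of G = A^T A
    g01 = sum(r[0] * r[1] for r in A)
    g11 = sum(r[1] * r[1] for r in A)
    det = g00 * g11 - g01 * g01
    inv = [[g11 / det, -g01 / det], [-g01 / det, g00 / det]]
    # A+ = G^{-1} A^T, a 2 x m matrix
    return [[inv[i][0] * r[0] + inv[i][1] * r[1] for r in A] for i in range(2)]

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Ap = pinv_tall(A)
# A+ A should be the 2x2 identity (A+ is a left inverse)
check = [[sum(Ap[i][k] * A[k][j] for k in range(3)) for j in range(2)] for i in range(2)]
```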

# Singularity

• nonsingular = invertible = nonzero determinant = null space of zero
• only square matrices
• rank of mxn matrix - max number of linearly independent columns / rows - if rank = m = n, then nonsingular
• ill-conditioned matrix - matrix is close to being singular - very small determinant
• positive semi-definite - $A \in R^{nxn}$
• intuitively is like having upwards curve
• if $\forall x \in R^n, x^TAx \geq 0$ then A is positive semi definite (PSD)
• if $\forall x \in R^n, x^TAx > 0$ then A is positive definite (PD)
• PD $\to$ full rank, invertible
• Gram matrix - G = $X^T X \implies$PSD
• if X full rank, then G is PD

# Matrix Calculus

• gradient $\nabla_A f(\mathbf{A})$- partial derivatives with respect to each element of matrix
• f has to be a function that takes matrix, returns a scalar
• output will be the same sizes as the variable you take the gradient of (in this case A)
• $\nabla^2$ is not gradient of the gradient
• examples
• $\nabla_x a^T x = a$
• $\nabla_x x^TAx = 2Ax$ (if A symmetric)
• $\nabla_x^2 x^TAx = 2A$ (if A symmetric)
• function f(x,y)
• 1st derivative is vector of derivatives
• 2nd derivative is Hessian matrix
• Math Basics

# Misc

$\Gamma(n)=(n-1)!=\int_0^\infty x^{n-1}e^{-x}dx$

$\zeta(s) = \sum_{n=1}^\infty \frac{1}{n^s}$
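
The identity $\Gamma(n)=(n-1)!$ at positive integers can be checked directly with the standard library:

```python
import math

# Gamma(n) = (n-1)! for positive integers n
gammas = [math.gamma(n) for n in range(1, 7)]
facts = [math.factorial(n - 1) for n in range(1, 7)]
```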

# stochastic processes

Stochastic - random process evolving with time

 Markov: $P(X_t=x \mid X_{t-1})=P(X_t=x \mid X_{t-1},\ldots,X_1)$

Martingale: $E[X_t \mid X_{t-1},\ldots,X_1]=X_{t-1}$

# abstract algebra

Group: set of elements endowed with operation satisfying 4 properties:

1. closed 2. identity 3. associative 4. inverses

Equivalence Relation;

1. reflexive 2. transitive 3. symmetric

# discrete math

Goldbach’s strong conjecture: Every even integer greater than 2 can be expressed as the sum of two primes (He considered one a prime).

Goldbach’s weak conjecture: All odd numbers greater than 7 are the sum of three primes.

Set - An unordered collection of items without replication

Proper subset - subset strictly contained in (not equal to) the set

A U A = A Idempotent law

Disjoint: A and B = empty set

Partition: mutually disjoint, union fills space

powerset $\mathcal{P}$(A) = set of all subsets

Converse: $q\to p$ (same as inverse: $-p \to -q$)

$p_1 \to p_2 \iff - p_1 \lor p_2$

 The greatest common divisor of two integers a and b is the largest integer d such that $d \mid a$ and $d \mid b$

Proof Techniques

1. Proof by Induction

2. Direct Proof

3. Proof by Contradiction - assume p $\land$ -q, show contradiction

4. Proof by Contrapositive - show -q $\to$ -p

# identities

$e^{-2lnx}= \frac{1}{e^{2lnx}} = \frac{1}{e^{lnx}e^{lnx}} = \frac{1}{x^2}$

$ln(xy) = ln(x)+ln(y)$

$lnx * lny = ln(x^{lny})$

$e^{\mu it} = cos(\mu t)+ isin(\mu t)$

Partial Fractions $\frac{3x+11}{(x-3)(x+2)} = \frac{A}{x-3} + \frac{B}{x+2}$

$(ax+b)^k = \frac{A_1}{ax+b}+\frac{A_2}{(ax+b)^2}+…$

$(ax^2+bx+c)^k = \frac{A_1x+B_1}{ax^2+bx+c}+…$

$cos(a\pm b) = cos(a)cos(b)\mp sin(a)sin(b)$

$sin(a \pm b) = sin(a)cos(b) \pm sin(b)cos(a)$
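
Euler's identity $e^{\mu it} = cos(\mu t) + i\,sin(\mu t)$ can be spot-checked numerically (the sample values of $\mu$ and t are arbitrary):

```python
import cmath
import math

mu, t = 0.7, 2.0
lhs = cmath.exp(1j * mu * t)                       # e^(i*mu*t)
rhs = complex(math.cos(mu * t), math.sin(mu * t))  # cos + i sin
```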

• Proofs

# proofs

• induction
• must already know formula
• doesn’t give intuition
• there are uncomputable functions e.g. Halting Problem, 3x+1 problem
• non-existence proofs
• must cover all possible scenarios, harder than existence
• pigeonhole principle
• Real Analysis

# ch 1 - the real numbers

• there is no rational number whose square is 2 <div class="collapse" id="111"> proof by contradiction </div>
• contrapositive: $-q \to -p$ - logically equivalent
•  triangle inequality: $|a+b| \leq |a| + |b|$
often use $|a-b| = |(a-c)+(c-b)| \leq |a-c|+|c-b|$
• axiom of completeness - every nonempty set $A \subseteq \mathbb{R}$ that is bounded above has a least upper bound
• doesn’t work for $\mathbb{Q}$
• supremum = supA = least upper bound (similarly, infimum)
1. supA is an upper bound of A
2. if $s \in \mathbb{R}$ is another u.b. then $s \geq supA$
• can be restated as $\forall \epsilon > 0, \exists a \in A$ $s-\epsilon < a$
• nested interval property - for each $n \in N$, assume we are given a closed interval $I_n = [a_n,b_n]={ x \in \mathbb{R} : a_n \leq x \leq b_n }$ Assume also that each $I_n$ contains $I_{n+1}$. Then, the resulting nested sequence of nonempty closed intervals $I_1 \supseteq I_2 \supseteq …$ has a nonempty intersection <div class="collapse" id="141"> use AoC with x = sup{$a_n: n \in \mathbb{N}$} in the intersection of all sets</div>
• archimedean property
1. $\mathbb{N}$ is unbounded above (sup $\mathbb{N}=\infty$)
2. $\forall x \in \mathbb{R}, x>0, \exists n \in \mathbb{N}, 0 < \frac{1}{n} < x$
• $\mathbb{Q}$ is dense in $\mathbb{R}$ - for every $a,b \in \mathbb{R}, a<b$, $\exists r \in \mathbb{Q}$ s.t. $a<r<b$
• pf: want $a < \frac{m}{n} < b$
• by Archimedean property, want $\frac{1}{n} < b-a$
• corollary: the irrationals are dense in $\mathbb{R}$
• there exists a real number $r \in \mathbb{R}$ satisfying $r^2 = 2$
• pf: let r = $sup { t \in \mathbb{R} : t^2 < 2 }$. disprove $r^2<2, r^2>2$ by considering $r+\frac{1}{n},r-\frac{1}{n}$
• A ~ B if there exists f:A->B that is 1-1 and onto
• A is finite - there exists n $\in \mathbb{N}$ s.t. $\mathbb{N}_n$~A
• countable = $\mathbb{N}$~A.
• uncountable - infinite set that isn’t countable
• Q is countable
• pf: Let $A_n = { \pm \frac{p}{q}:$ where p,q $\in \mathbb{N}$ are in lowest terms with p+q=n}
• R is uncountable
• pf: Assume we can enumerate $\mathbb{R}$ Use NIP to exclude one point from $\mathbb{R}$ each time. The intersection is still nonempty, so we didn’t successfully enumerate $\mathbb{R}$
• $\frac{x}{x^2-1}$ maps (-1,1) $\to \mathbb{R}$
• countable union of countable sets is countable
• if $A \subseteq B$ and B countable, then A is either countable or finite
• if $A_n$ is a countable set for each $n \in \mathbb{N}$, then their union is countable
• the open interval (0,1) = ${ x \in \mathbb{R} : 0 < x < 1 }$ is uncountable
• pf: diagonalization - assume (0,1) can be listed as a sequence. Write the decimal expansions as rows of an array. The number whose digits differ from the diagonal is not in the list, a contradiction.
• cantor’s thm - Given any set A, there does not exist a function f:$A \to P(A)$ that is onto
• P(A) is the set of all subsets of A
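
The diagonal trick is mechanical enough to run on a finite list: flip digit i of row i, and the result differs from every listed row. A sketch over binary strings (names and sample rows are illustrative):

```python
def diagonal_complement(rows):
    """Flip digit i of row i. The result differs from row i in position i,
    so it cannot equal any listed row (Cantor's diagonal argument)."""
    return "".join("1" if row[i] == "0" else "0" for i, row in enumerate(rows))

rows = ["0101", "1111", "0000", "1010"]
missing = diagonal_complement(rows)
```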

# ch 2 - sequences and series

•  a sequence $(a_n)$ converges to a real number if $\forall \epsilon > 0, \exists N \in \mathbb{N}$ such that $\forall n\geq N, |a_n-a| < \epsilon$
• otherwise it diverges
• if a limit exists, it is unique
•  a sequence $(x_n)$ is bounded if there exists a number M > 0 such that $|x_n| \leq M \forall n \in \mathbb{N}$
• every convergent sequence is bounded
• algebraic limit thm - let lim $a_n = a$ and lim $b_n$ = b. Then
1. lim($ca_n$) = ca
2. lim($a_n+b_n$) = a+b
3. lim($a_n b_n$) = ab
4. lim($a_n/b_n$) = a/b, provided b $\neq$ 0
•  pf 3: use triangle inequality, $|a_nb_n-ab| = |a_nb_n-ab_n+ab_n-ab| \leq |b_n||a_n-a| + |a||b_n-b|$
• pf 4: show $(b_n) \to b$ implies $(\frac{1}{b_n}) \to \frac{1}{b}$
• order limit thm - Assume lim $a_n = a$ and lim $b_n$ = b.
1. If $a_n \geq 0$ $\forall n \in \mathbb{N}$, then $a \geq 0$
2. If $a_n \leq b_n$ $\forall n \in \mathbb{N}$, then $a \leq b$
3. If $\exists c \in \mathbb{R}$ for which $c \leq b_n$ $\forall n \in \mathbb{N}$, then $c \leq b$
• pf 1: by contradiction
• monotone - increasing or decreasing (not strictly)
• monotone convergence thm - if a sequence is monotone and bounded, then it converges
• convergence of a series
• define $s_m=a_1+a_2+…+a_m$
• $\sum_{n=1}^\infty a_n$ converges to A $\iff (s_m)$ converges to A
• cauchy condensation test - suppose $a_n$ is decreasing and satisfies $a_n \geq 0$ for all $n \in \mathbb{N}$. Then, the series $\sum_{n=1}^\infty a_n$ converges iff the series $\sum_{n=1}^\infty 2^na_{2^n}$ converges
• p-series $\sum_{n=1}^\infty 1/n^p$ converges iff p > 1

### 2.5

• let $(a_n)$ be a sequence and $n_1<n_2<…$ be an increasing sequence of natural numbers. Then $(a_{n_1},a_{n_2},…)$ is a subsequence of $(a_n)$
• subsequences of a convergent sequence converge to the same limit as the original sequence
• can be used as divergence criterion
• bolzano-weierstrass thm - every bounded sequence contains a convergent subsequence
• pf: use NIP, keep splitting interval into two

### 2.6

•  $(a_n)$ is a cauchy sequence if $\forall \epsilon > 0, \exists N \in \mathbb{N}$ such that $\forall m,n\geq N, |a_n-a_m| < \epsilon$
• cauchy criterion - a sequence converges $\iff$ it is a cauchy sequence
• cauchy sequences are bounded
• overview: AoC $\iff$ NIP $\iff$ MCT $\iff$ BW $\iff$ CC

### 2.7

• algebraic limit thm - let $\sum_{n=1}^\infty a_n$ = A, $\sum_{n=1}^\infty b_n$ = B
1. $\sum_{n=1}^\infty ca_n$ = cA
2. $\sum_{n=1}^\infty (a_n+b_n)$ = A+B
1. cauchy criterion for series - series converges $\iff$ $(s_m)$ is a cauchy sequence
• if the series $\sum_{n=1}^\infty a_n$ converges then lim $a_n=0$
1. comparison test
2. geometric series - $\sum_{n=0}^\infty a r^n = \frac{a}{1-r}$
• $s_m = a+ar+…+ar^{m-1} = \frac{a(1-r^m)}{1-r}$
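A numeric sanity check (not a proof) of the partial-sum formula and the limit $\frac{a}{1-r}$ for $|r|<1$:

```python
# Check s_m = a(1 - r^m)/(1 - r) against a direct sum, and that the
# partial sums approach a/(1 - r).
a, r = 3.0, 0.5
for m in [1, 5, 10]:
    s_m = sum(a * r**n for n in range(m))
    assert abs(s_m - a * (1 - r**m) / (1 - r)) < 1e-12

s_50 = sum(a * r**n for n in range(50))
assert abs(s_50 - a / (1 - r)) < 1e-9   # close to the limit 6.0
```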
1. absolute convergence test
2. alternating series test 1. decreasing 2. lim $a_n$ = 0
• then, $\sum_{n=1}^\infty (-1)^{n+1} a_n$ converges
• rearrangements: there exists one-to-one correspondence
• if a series converges absolutely, any rearrangement converges to same limit

# ch 3 - basic topology of R

### 3.1 cantor set

• C has small length, but its cardinality is uncountable
• discussion of dimensions, doubling sizes leads to 2^dimension sizes
• Cantor set is about dimension .631

### 3.2 open/closed sets

• A set O $\subseteq \mathbb{R}$ is open if for all points a $\in$ O there exists an $\epsilon$-neighborhood $V_{\epsilon}(a) \subseteq O$
•  $V_{\epsilon}(a)=\{ x \in \mathbb{R} : |x-a| < \epsilon \}$
1. the union of an arbitrary collection of open sets is open
2. the intersection of a finite collection of open sets is open
• a point x is a limit point of a set A if every $\epsilon$-neighborhood $V_{\epsilon}(x)$ of x intersects the set A at some point other than x
• a point x is a limit point of a set A if and only if x = lim $a_n$ for some sequence ($a_n$) contained in A satisfying $a_n \neq x$ for all n $\in$ N
• isolated point - not a limit point
• set $F \subseteq \mathbb{R}$ closed - contains all limit points
• closed iff every Cauchy sequence contained in F has a limit that is also an element of F
• density of Q in R - for every $y \in \mathbb{R}$, there exists a sequence of rational numbers that converges to y
• closure - set with its limit points
• closure $\bar{A}$ is smallest closed set containing A
• iff set open, complement is closed
• R and $\emptyset$ are both open and closed
1. the union of a finite collection of closed sets is closed
2. the intersection of an arbitrary collection of closed sets is closed

### 3.3

• a set K $\subseteq \mathbb{R}$ is compact if every sequence in K has a subsequence that converges to a limit that is also in K
• Nested Compact Set Property - intersection of nested sequence of nonempty compact sets is not empty
• let A $\subseteq \mathbb{R}$. open cover for A is a (possibly infinite) collection of open sets whose union contains the set A.
• given an open cover for A, a finite subcover is a finite sub-collection of open sets from the original open cover whose union still manages to completely contain A
• Heine-Borel thm - let K $\subseteq \mathbb{R}$. All of the following are equivalent
1. K is compact
2. K is closed and bounded
3. every open cover for K has a finite subcover

# ch 4 - functional limits and continuity

### 4.1

• dirichlet function: 1 if x $\in \mathbb{Q}$, 0 otherwise

### 4.2 functional limits

•  def 1. Let f:$A \to R$, and let c be a limit point of the domain A. We say that $lim_{x \to c} f(x) = L$ provided that for all $\epsilon$ > 0, there exists a $\delta$ > 0 s.t. whenever 0 < $|x-c|$ < $\delta$ (and x $\in$ A) it follows that $|f(x)-L| < \epsilon$
• def 2. Let f:$A \to R$, and let c be a limit point of the domain A. We say that $lim_{x \to c} f(x) = L$ provided that for every $\epsilon$-neighborhood $V_{\epsilon}(L)$ of L, there exists a $\delta$-neighborhood $V_{\delta}(c)$ around c with the property that for all x $\in V_{\delta}(c)$ different from c (with x $\in$ A) it follows that f(x) $\in V_{\epsilon}(L)$.
• sequential criterion for functional limits - Given function f:$A \to R$ and a limit point c of A, the following 2 statements are equivalent:
1. $lim_{x \to c} f(x) = L$
2. for all sequences $(x_n) \subseteq$ A satisfying $x_n \neq$ c and $(x_n) \to c$, it follows that $f(x_n) \to L$.
• algebraic limit thm for functional limits
• divergence criterion for functional limits

### 4.3 continuous functions

•  a function f:$A \to R$ is continuous at a point c $\in$ A if, for all $\epsilon$>0, there exists a $\delta$>0 such that whenever $|x-c| <\delta$ (and x$\in$ A) it follows that $|f(x)-f(c)| <\epsilon$. f is continuous if it is continuous at every point in the domain A
• characterizations of continuity
• criterion for discontinuity
• algebraic continuity theorem
• if f is continuous at c and g is continuous at f(c) then g $\circ$ f is continuous at c

### 4.4 continuous functions on compact sets

• preservation of compact sets - if f continuous and K compact, then f(K) is compact as well
• extreme value theorem - if f is continuous on a compact set K, then f attains a maximum and minimum value. In other words, there exist $x_0,x_1 \in K$ such that $f(x_0) \leq f(x) \leq f(x_1)$ for all x $\in$ K
•  f is uniformly continuous on A if for every $\epsilon$>0, there exists a $\delta$>0 such that for all x,y $\in$ A, $|x-y| < \delta \implies |f(x)-f(y)| < \epsilon$
•  a function f fails to be uniformly continuous on A iff there exists a particular $\epsilon_0$ > 0 and two sequences $(x_n),(y_n)$ in A satisfying $|x_n - y_n| \to 0$ but $|f(x_n)-f(y_n)| \geq \epsilon_0$
• a function that is continuous on a compact set K is uniformly continuous on K

### 4.5 intermediate value theorem

• intermediate value theorem - Let f:[a,b]$\to R$ be continuous. If L is a real number satisfying f(a) < L < f(b) or f(a) > L > f(b), then there exists a point c $\in (a,b)$ where f(c) = L
• a function f has the intermediate value property on an interval [a,b] if for all x < y in [a,b] and all L between f(x) and f(y), it is always possible to find a point c $\in (x,y)$ where f(c)=L.

# ch 5 - the derivative

### 5.2 derivatives and the intermediate value property

• let g: A -> R be a function defined on an interval A. Given c $\in$ A, the derivative of g at c is defined by g’(c) = $\lim_{x \to c} \frac{g(x) - g(c)}{x-c}$, provided this limit exists. Then g is differentiable at c. If g’ exists for all points in A, we say g is differentiable on A
• identity: $x^n-c^n = (x-c)(x^{n-1}+cx^{n-2}+c^2x^{n-3}+…+c^{n-1})$
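A quick spot-check (my own sketch) of the factorization, which is what makes the difference quotient for $x^n$ computable:

```python
# Verify x^n - c^n = (x - c) * sum_{k=0}^{n-1} c^k x^{n-1-k} numerically
# for a few values of n at fixed x, c.
x, c = 1.7, 0.3
for n in range(1, 8):
    rhs = (x - c) * sum(c**k * x**(n - 1 - k) for k in range(n))
    assert abs((x**n - c**n) - rhs) < 1e-12
```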
• differentiable $\implies$ continuous
• algebraic differentiability theorem
2. scalar multiplying
3. product rule
4. quotient rule
• chain rule: let f:A-> R and g:B->R satisfy f(A)$\subseteq$ B so that the composition g $\circ$ f is defined. If f is differentiable at c in A and g differentiable at f(c) in B, then g $\circ$ f is differentiable at c with (g$\circ$f)’(c)=g’(f(c))*f’(c)
• interior extremum thm - let f be differentiable on an open interval (a,b). If f attains a maximum or minimum value at some point c $\in$ (a,b), then f’(c) = 0.
• Darboux’s thm - if f is differentiable on an interval [a,b], and $\alpha$ satisfies f’(a) < $\alpha$ < f’(b) (or f’(a) > $\alpha$ > f’(b)), then there exists a point c $\in (a,b)$ where f’(c) = $\alpha$
• derivative satisfies intermediate value property

### 5.3 mean value theorems

• mean value theorem - if f:[a,b] -> R is continuous on [a,b] and differentiable on (a,b), then there exists a point c $\in$ (a,b) where $f’(c) = \frac{f(b)-f(a)}{b-a}$
• Rolle’s thm - if additionally f(a)=f(b), then there exists c $\in$ (a,b) with f’(c)=0
• if f’(x) = 0 for all x in A, then f(x) = k for some constant k
• if f and g are differentiable functions on an interval A and satisfy f’(x) = g’(x) for all x $\in$ A, then f(x) = g(x) + k for some constant k
•  generalized mean value theorem - if f and g are continuous on the closed interval [a,b] and differentiable on the open interval (a,b), then there exists a point c $\in (a,b)$ where $(f(b)-f(a))g’(c) = (g(b)-g(a))f’(c)$. If g’ is never 0 on (a,b), then can be restated $\frac{f’(c)}{g’(c)} = \frac{f(b)-f(a)}{g(b)-g(a)}$
•  given g: A -> R and a limit point c of A, we say that $lim_{x \to c} g(x) = \infty$ if, for every M > 0, there exists a $\delta$> 0 such that whenever 0 < $|x-c|$ < $\delta$ it follows that g(x) ≥ M
• L’Hospital’s Rule: 0/0 - let f and g be continuous on an interval containing a, and assume f and g are differentiable on this interval with the possible exception of the point a. If f(a) = g(a) = 0 and g’(x) ≠ 0 for all x ≠ a, then $lim_{x \to a} \frac{f’(x)}{g’(x)} = L \implies lim_{x \to a} \frac{f(x)}{g(x)} = L$
• L’Hospital’s Rule: $\infty / \infty$ - assume f and g are differentiable on (a,b) and g’(x) ≠ 0 for all x in (a,b). If $lim_{x \to a} g(x) = \infty$, then $lim_{x \to a} \frac{f’(x)}{g’(x)} = L \implies lim_{x \to a} \frac{f(x)}{g(x)} = L$
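A numeric illustration (not a proof) of the 0/0 rule with f(x) = sin(x), g(x) = x at a = 0, where f’(x)/g’(x) = cos(x) → 1:

```python
import math

# Both the ratio f(x)/g(x) = sin(x)/x and the derivative ratio
# f'(x)/g'(x) = cos(x) approach the same limit L = 1 as x -> 0.
for x in [0.1, 0.01, 0.001]:
    ratio = math.sin(x) / x
    derivative_ratio = math.cos(x)
    assert abs(ratio - 1.0) < x
    assert abs(derivative_ratio - 1.0) < x
```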

# ch 6 - sequences and series of function

### 6.2 uniform convergence of a sequence of functions

• for each n $\in \mathbb{N}$ let $f_n$ be a function defined on a set A$\subseteq R$. The sequence ($f_n$) of functions converges pointwise on A to a function f if, for all x in A, the sequence of real numbers $f_n(x)$ converges to f(x)
•  let ($f_n$) be a sequence of functions defined on a set A$\subseteq$R. Then ($f_n$) converges uniformly on A to a limit function f defined on A if, for every $\epsilon$>0, there exists an N in $\mathbb{N}$ such that $\forall n ≥N, x \in A , |f_n(x)-f(x)| <\epsilon$
•  Cauchy Criterion for uniform convergence - a sequence of functions $(f_n)$ defined on a set A $\subseteq$ R converges uniformly on A iff $\forall \epsilon > 0 \exists N \in \mathbb{N}$ s.t. whenever m,n ≥N and x in A, $|f_n(x)-f_m(x)| <\epsilon$
• continuous limit thm - Let ($f_n$) be a sequence of functions defined on A that converges uniformly on A to a function f. If each $f_n$ is continuous at c in A, then f is continuous at c

### 6.3 uniform convergence and differentiation

• differentiable limit theorem - let $f_n \to f$ pointwise on the closed interval [a,b], and assume that each $f_n$ is differentiable. If $(f’_n)$ converges uniformly on [a,b] to a function g, then the function f is differentiable and f’=g
• let ($f_n$) be a sequence of differentiable functions defined on the closed interval [a,b], and assume $(f’_n)$ converges uniformly to a function g on [a,b]. If there exists a point $x_0 \in [a,b]$ for which $f_n(x_0)$ is convergent, then ($f_n$) converges uniformly. Moreover, the limit function f = lim $f_n$ is differentiable and satisfies f’ = g

### 6.4 series of functions

• term-by-term continuity thm - let $f_n$ be continuous functions defined on a set A $\subseteq$ R and assume $\sum f_n$ converges uniformly on A to a function f. Then, f is continuous on A.
• term-by-term differentiability thm - let $f_n$ be differentiable functions defined on an interval A, and assume $\sum f’_n(x)$ converges uniformly to a limit g(x) on A. If there exists a point $x_0 \in [a,b]$ where $\sum f_n(x_0)$ converges, then the series $\sum f_n(x)$ converges uniformly to a differentiable function f(x) satisfying f’(x) = g(x) on A. In other words, $f(x) = \sum f_n(x)$ and $f’(x) = \sum f’_n(x)$
•  Cauchy Criterion for uniform convergence of series - A series $\sum f_n$ converges uniformly on A iff $\forall \epsilon > 0 \exists N \in \mathbb{N}$ s.t. whenever n>m≥N, x in A, $|f_{m+1}(x) + f_{m+2}(x) + f_{m+3}(x) + …+f_n(x)| < \epsilon$
•  Weierstrass M-Test - For each n in N, let $f_n$ be a function defined on a set A $\subseteq$ R, and let $M_n > 0$ be a real number satisfying $|f_n(x)| ≤ M_n$ for all x in A. If $\sum M_n$ converges, then $\sum f_n$ converges uniformly on A
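A numeric illustration of the M-test (my own example series, $\sum \sin(nx)/n^2$ with $M_n = 1/n^2$): the tail of the series is small uniformly in x, bounded by the tail of $\sum M_n$.

```python
import math

xs = [k * 0.1 for k in range(-30, 31)]

def tail(x, m, terms=2000):
    # partial tail sum_{n=m}^{m+terms-1} sin(n x)/n^2
    return sum(math.sin(n * x) / n**2 for n in range(m, m + terms))

bound = sum(1.0 / n**2 for n in range(100, 2100))  # tail of sum M_n
worst = max(abs(tail(x, 100)) for x in xs)
assert worst <= bound   # triangle inequality: |sum f_n| <= sum M_n
assert worst < 0.02     # tail is uniformly small across all sampled x
```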

### 6.5 power series

• power series f(x) = $\sum_{n=0}^\infty a_n x^n = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + …$
•  if a power series converges at some point $x_0 \in \mathbb{R}$, then it converges absolutely for any x satisfying $|x| < |x_0|$
• if a power series converges pointwise on the set A, then it converges uniformly on any compact set K $\subseteq$ A
•  if a power series converges absolutely at a point $x_0$, then it converges uniformly on the closed interval [-c,c], where c = $|x_0|$
• Abel’s thm - if a power series converges at the point x = R > 0, then the series converges uniformly on the interval [0,R]. A similar result holds if the series converges at x = -R
• if $\sum_{n=0}^\infty a_n x^n$ converges for all x in (-R,R), then the differentiated series $\sum_{n=0}^\infty n a_n x^{n-1}$ converges at each x in (-R,R) as well. Consequently the convergence is uniform on compact sets contained in (-R,R).
• can take infinite derivatives

### 6.6 taylor series

• Taylor’s Formula $\sum_{n=0}^\infty a_n x^n = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + …$
• centered at 0: $a_n = \frac{f^{(n)}(0)}{n!}$
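A small numeric check (my own sketch) using $f(x) = e^x$, where $f^{(n)}(0) = 1$ so $a_n = 1/n!$:

```python
import math

# Partial sums of the Taylor series of e^x centered at 0 approach math.exp.
def taylor_exp(x, N):
    return sum(x**n / math.factorial(n) for n in range(N + 1))

assert abs(taylor_exp(1.0, 15) - math.e) < 1e-10
assert abs(taylor_exp(-2.0, 30) - math.exp(-2.0)) < 1e-10
```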
• Lagrange’s Remainder thm - Let f be differentiable N+1 times on (-R,R), define $a_n = \frac{f^{(n)}(0)}{n!}…..$
• not every infinitely differentiable function can be represented by its Taylor series (radius of convergence zero)

# ch 7 - the Riemann Integral

### 7.2 def of Riemann integral

• partition of [a,b] is a finite set of points from [a,b] that includes both a and b
• lower sum - sum all the possible smallest rectangles
• a partition Q is a refinement of a partition P if $P \subseteq Q$
• if $P \subseteq Q$, then L(f,P)≤L(f,Q) and U(f,P)≥U(f,Q)
• a bounded function f on the interval [a,b] is Riemann-integrable if U(f) = L(f); this common value is $\int_a^b f$
• iff $\forall \epsilon >0$, there exists a partition P of [a,b] such that $U(f,P)-L(f,P)<\epsilon$
• U(f) = inf{U(f,P)} for all possible partitions P
• if f is continuous on [a,b] then it is integrable
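A numeric illustration (my own sketch) of the $\epsilon$ criterion for f(x) = x² on [0,1] with uniform partitions; since f is increasing, the inf/sup on each subinterval are the endpoint values and U − L = (f(1)−f(0))/n → 0.

```python
# Lower and upper Riemann sums for f(x) = x^2 on [0,1], n equal subintervals.
def lower_upper(n):
    f = lambda x: x * x
    dx = 1.0 / n
    L = sum(f(i * dx) * dx for i in range(n))        # inf at left endpoint
    U = sum(f((i + 1) * dx) * dx for i in range(n))  # sup at right endpoint
    return L, U

L, U = lower_upper(1000)
assert U - L < 1e-2          # equals (f(1) - f(0))/n = 1/1000
assert abs(U - 1 / 3) < 1e-2  # both sums approach the integral 1/3
```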

### 7.3 integrating functions with discontinuities

• if f:[a,b]->R is bounded and f is integrable on [c,b] for all c in (a,b), then f is integrable on [a,b]

### 7.4 properties of Integral

• assume f: [a,b]->R is bounded and let c in (a,b). Then, f is integrable on [a,b] iff f is integrable on [a,c] and [c,b]. In this case we have $\int_a^b f = \int_a^c f + \int_c^b f$
• integrable limit thm - Assume that $f_n \to f$ uniformly on [a,b] and that each $f_n$ is integrable. Then, f is integrable and $lim_{n \to \infty} \int_a^b f_n = \int_a^b f$.

### 7.5 fundamental theorem of calculus

1. If f:[a,b] -> R is integrable, and F:[a,b]->R satisfies F’(x) = f(x) for all x $\in$ [a,b], then $\int_a^b f = F(b) - F(a)$
2. Let g: [a,b]-> R be integrable and for x $\in$ [a,b] define G(x) = $\int_a^x g$. Then G is continuous on [a,b]. If g is continuous at some point $c \in [a,b]$ then G is differentiable at c and G’(c) = g(c).
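A numeric check (my own sketch) of part 1 with f(x) = cos(x) and antiderivative F = sin: a midpoint Riemann sum over [0,1] approaches F(1) − F(0).

```python
import math

# Midpoint Riemann sum for integral of cos over [0,1] vs sin(1) - sin(0).
n = 10000
dx = 1.0 / n
riemann = sum(math.cos((i + 0.5) * dx) * dx for i in range(n))
assert abs(riemann - (math.sin(1.0) - math.sin(0.0))) < 1e-8
```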

# overview

• convergence
1. sequences
2. series
3. functional limits
• normal, uniform
4. sequence of funcs
• pointwise, uniform
5. series of funcs
• pointwise, uniform
6. integrability
• sequential criterion - usually good for proving discontinuous
1. limit points
2. functional limits
3. continuity
4. absence of uniform continuity
• algebraic limit theorem ~ scalar multiplication, addition, multiplication, division
1. limit thm
2. sequences
3. series - can’t multiply / divide these
4. functional limits
5. continuity
6. differentiability
7. ~integrability~
• limit thms
• continuous limit thm - Let ($f_n$) be a sequence of functions defined on A that converges uniformly on A to a function f. If each $f_n$ is continuous at c in A, then f is continuous at c
• differentiable limit theorem - let $f_n \to f$ pointwise on the closed interval [a,b], and assume that each $f_n$ is differentiable. If $(f’_n)$ converges uniformly on [a,b] to a function g, then the function f is differentiable and f’=g
• convergent derivatives almost proves that $f_n \to f$
• let ($f_n$) be a sequence of differentiable functions defined on the closed interval [a,b], and assume $(f’_n)$ converges uniformly to a function g on [a,b]. If there exists a point $x_0 \in [a,b]$ for which $f_n(x_0)$ is convergent, then ($f_n$) converges uniformly
• integrable limit thm - Assume that $f_n \to f$ uniformly on [a,b] and that each $f_n$ is integrable. Then, f is integrable and $lim_{n \to \infty} \int_a^b f_n = \int_a^b f$.
• functions are continuous at isolated points, but limits don’t exist there
•  uniform continuity: minimize $|f(x)-f(y)|$
• derivative doesn’t have to be continuous
• integrable if finite amount of discontinuities and bounded

### neuro

• brain basics
• Cerebrum - The cerebrum is the largest portion of the brain and is responsible for most of the brain’s function. It is divided into four sections: the temporal lobe, the occipital lobe, the parietal lobe and the frontal lobe. The cerebrum is divided into a right and left hemisphere which are connected by axons that relay messages from one to the other. Its matter is made of nerve cells which carry signals between the organ and the nerve cells which run through the body.
• Frontal Lobe - The frontal lobe is one of four lobes in the cerebral hemisphere. This lobe controls several elements including creative thought, problem solving, intellect, judgment, behavior, attention, abstract thinking, physical reactions, muscle movements, coordinated movements, smell and personality.
• Parietal Lobe - Located in the cerebral hemisphere, this lobe focuses on comprehension. Visual functions, language, reading, internal stimuli, tactile sensation and sensory comprehension are monitored here.
• Sensory Cortex - The sensory cortex, located in the front portion of the parietal lobe, receives information relayed from the spinal cord regarding the position of various body parts and how they are moving. This middle area of the brain can also be used to relay information from the sense of touch, including pain or pressure which is affecting different portions of the body.
• Motor Cortex - This helps the brain monitor and control movement throughout the body. It is located in the top, middle portion of the brain.
• Temporal Lobe: The temporal lobe controls visual and auditory memories. It includes areas that help manage some speech and hearing capabilities, behavioral elements, and language. It is located in the cerebral hemisphere.
• Wernicke’s Area- This portion of the temporal lobe is formed around the auditory cortex. While scientists have a limited understanding of the function of this area, it is known that it helps the body formulate or understand speech.
• Occipital Lobe: The occipital lobe is located in the cerebral hemisphere in the back of the head. It helps to control vision.
• Broca’s Area- This area of the brain controls the facial neurons as well as the production of speech and language. It is located in the triangular and opercular sections of the inferior frontal gyrus.
• Cerebellum
• This is commonly referred to as “the little brain,” and is considered to be older than the cerebrum on the evolutionary scale. The cerebellum controls essential body functions such as balance, posture and coordination, allowing humans to move properly and maintain their structure.

# Limbic System

• The limbic system contains glands which help relay emotions. Many hormonal responses that the body generates are initiated in this area. The limbic system includes the amygdala, hippocampus, hypothalamus and thalamus.
• Amygdala: The amygdala helps the body respond to emotions, memories and fear. It is a large portion of the telencephalon, located within the temporal lobe, which can be seen from the surface of the brain. This visible bulge is known as the uncus.
• Hippocampus: This portion of the brain is used for learning and memory, specifically converting temporary memories into permanent memories which can be stored within the brain. The hippocampus also helps people analyze and remember spatial relationships, allowing for accurate movements. This portion of the brain is located in the cerebral hemisphere.
• Hypothalamus:The hypothalamus region of the brain controls mood, thirst, hunger and temperature. It also contains glands which control the hormonal processes throughout the body.
• Thalamus: The thalamus is located in the center of the brain. It helps to control attention span and the sensing of pain, and monitors input that moves in and out of the brain to keep track of the sensations the body is feeling.

# Brain Stem

• All basic life functions originate in the brain stem, including heartbeat, blood pressure and breathing. In humans, this area contains the medulla, midbrain and pons. This is commonly referred to as the simplest part of the brain, as most creatures on the evolutionary scale have some form of brain creation that resembles the brain stem. The brain stem consists of midbrain, pons and medulla.
• Midbrain:The midbrain, also known as the mesencephalon is made up of the tegmentum and tectum. These parts of the brain help regulate body movement, vision and hearing. The anterior portion of the midbrain contains the cerebral peduncle which contains the axons that transfer messages from the cerebral cortex down the brain stem, which allows voluntary motor function to take place.
• Pons: This portion of the metencephalon is located in the hindbrain, and links to the cerebellum to help with posture and movement. It interprets information that is used in sensory analysis or motor control. The pons also creates the level of consciousness necessary for sleep.
• Medulla: The medulla or medulla oblongata is an essential portion of the brain stem which maintains vital body functions such as the heart rate and breathing.
### computational neuroscience

[toc]

# 1- introduction

### 1.1 - overview

• three types
1. descriptive brain model - encode / decode external stimuli
2. mechanistic brain cell / network model - simulate the behavior of a single neuron / network
3. interpretive (or normative) brain model - why do brain circuits operate how they do

### 1.2 - descriptive

• receptive field - the things that make a neuron fire

### 1.3 - mechanistic and interpretive

• retina has on-center / off-surround cells - stimulated by points
• then, V1 has differently shaped receptive fields
• efficient coding hypothesis - learns different combinations (e.g. lines) that can efficiently represent images
1. sparse coding (Olshausen and Field 1996)
2. ICA (Bell and Sejnowski 1997)
3. Predictive Coding (Rao and Ballard 1999)
• brain is trying to learn faithful and efficient representations of an animal’s natural environment
• same goes for auditory cortex

# 2 - neural encoding

### 2.1 - defining neural code

• extracellular
• fMRI
• averaged over space
• slow, requires seconds
• EEG
• noisy
• averaged, but faster
• multielectrode array
• record from several individual neurons at once
• calcium imaging
• cells have a calcium indicator that fluoresces when calcium enters the cell
• intracellular - can use patch electrodes
• raster plot
• replay a movie many times and record from retinal ganglion cells during movie
•  encoding: P(response|stimulus)
• tuning curve - neuron’s response (ex. firing rate) as a function of stimulus
• orientation / color selective cells are distributed in organized fashion
• some neurons fire to a concept, like “Pamela Anderson”
• retina (simple) -> V1 (orientations) -> V4 (combinations) -> ?
• also massive feedback
•  decoding: P(stimulus|response)

### 2.2 - simple encoding

•  want P(response|stimulus)
• response := firing rate r(t)
• stimulus := s
• simple linear model
• r(t) = c * s(t)
• weighted linear model - takes into account previous states weighted by f
1. temporal filtering
• r(t) = $f_0 \cdot s_t + f_1 \cdot s_{t-1} + … + f_t \cdot s_0 = \sum_k s_{t-k} f_k$ where f weights the stimulus over time
• could also make this an integral, yielding a convolution:
• r(t) = $\int_{-\infty}^t d\tau \, s(t-\tau) f(\tau)$
• a linear system can be thought of as a system that searches for portions of the signal that resemble its filter f
• leaky integrator - sums its inputs with f decaying exponentially into the past
• flaws
• no negative firing rates
• no extremely high firing rates
• adding a nonlinear function g of the linear sum can fix this
• r(t) = $g(\int_{-\infty}^t d\tau \, s(t-\tau) f(\tau))$
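A minimal linear-nonlinear sketch of the above (all parameter values are my own illustrative choices): discrete convolution of a stimulus with an exponentially decaying filter (a leaky integrator), followed by a rectifying nonlinearity g so firing rates stay nonnegative.

```python
import math
import random

random.seed(0)
tau, dt = 0.05, 0.01
f = [math.exp(-k * dt / tau) * dt for k in range(20)]   # filter into the past
s = [random.gauss(0, 1) for _ in range(200)]            # white-noise stimulus

def rate(t):
    # discrete version of r(t) = g( integral s(t - tau) f(tau) dtau )
    linear = sum(f[k] * s[t - k] for k in range(len(f)) if t - k >= 0)
    return max(0.0, 50.0 * linear)                      # g: rectification with gain 50

rates = [rate(t) for t in range(len(s))]
assert all(r >= 0 for r in rates)   # the nonlinearity removes negative rates
```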
2. spatial filtering
• r(x,y) = $\sum_{x’,y’} s_{x-x’,y-y’} f_{x’,y’}$ where f again is spatial weights that represent the spatial field
• could also write this as a convolution
• for a retinal center surround cell, f is positive for small $\Delta x$ and then negative for large $\Delta x$
• can be calculated as a narrow, large positive Gaussian + a spread-out negative Gaussian
• can combine the above to make spatiotemporal filtering
• filtering = convolution = projection
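A difference-of-Gaussians sketch of the center-surround profile mentioned above (widths and weights are illustrative): positive near $\Delta x = 0$, negative at larger $\Delta x$.

```python
import math

# Narrow, tall positive Gaussian (center) minus a broader, weaker one (surround).
def dog(dx, sigma_c=1.0, sigma_s=3.0, k=0.5):
    g = lambda sigma: math.exp(-dx**2 / (2 * sigma**2)) / sigma
    return g(sigma_c) - k * g(sigma_s)

assert dog(0.0) > 0   # excitatory center
assert dog(4.0) < 0   # inhibitory surround
```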

### 2.3 - feature selection

•  P(response|stimulus) is very hard to get
• stimulus can be high-dimensional (e.g. video)
• stimulus can take on many values
• need to keep track of stimulus over time
•  solution: sample P(response|s) for many stimuli to characterize what in the input triggers responses
• find vector f that captures features that lead to spike
• dimensionality reduction - ex. discretize
• value at each time $t_i$ is new dimension
• commonly use Gaussian white noise
• time step sets cutoff of highest frequency present
• prior distribution - distribution of stimulus
• multivariate Gaussian - Gaussian in any dimension, or any linear combination of dimensions
• look at where spike-triggering points are and calculate spike-triggered average f of features that led to spike
• use this f as filter
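A self-contained sketch of spike-triggered averaging (all names and parameters are my own): drive a threshold unit with Gaussian white noise through a known filter, then average the stimulus windows preceding each "spike"; the STA should resemble the filter.

```python
import random

random.seed(1)
true_filter = [0.1, 0.3, 1.0, 0.3]   # most weight on the second-to-last sample
s = [random.gauss(0, 1) for _ in range(20000)]

# spike whenever the filtered stimulus crosses a threshold
spikes = [t for t in range(4, len(s))
          if sum(fk * s[t - 3 + k] for k, fk in enumerate(true_filter)) > 2.0]

# spike-triggered average of the 4-sample window before each spike
sta = [sum(s[t - 3 + k] for t in spikes) / len(spikes) for k in range(4)]

# the largest STA component sits where the filter weight is largest
assert sta.index(max(sta)) == true_filter.index(max(true_filter))
```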
• determining the nonlinear input/output function g
•  replace the stimulus in P(spike|stimulus) with P(spike|$s_1$), where $s_1$ is our filtered stimulus
•  use bayes rule $g=P(spike|s_1)=\frac{P(s_1|spike)P(spike)}{P(s_1)}$
•  if $P(s_1|spike) \approx P(s_1)$ then the response doesn’t seem to have to do with the stimulus
• incorporating many features $f_1,…,f_n$
• here, each $f_i$ is a vector of weights
• $r(t) = g(f_1\cdot s,f_2 \cdot s,…,f_n \cdot s)$
• could use PCA - discovers low-dimensional structure in high-dimensional data
• each f represents a feature (maybe a curve over time) that fires the neuron

### 2.4 - variability

• hidden assumptions about time-varying firing rate and single spikes
• smooth function RFT can miss some stimuli
•  statistics of the stimulus can affect P(spike|stimulus)
• Gaussian white noise is nice because no way to filter it to get structure
• identifying good filter
•  want $P(s_f|spike)$ to differ from $P(s_f)$ where $s_f$ is calculated via the filter
• instead of PCA, could look for f that directly maximizes this difference (Sharpee & Bialek, 2004)
• Kullback-Leibler divergence - calculates difference between 2 distributions
• $D_{KL}(P(s),Q(s)) = \int ds \, P(s) \log_2 \frac{P(s)}{Q(s)}$
• maximizing KL divergence is equivalent to maximizing mutual info between spike and stimulus
• this is because we are looking for most informative feature
• this technique doesn’t require that our stimulus is white noise, so can use natural stimuli
• maximization isn’t guaranteed to uniquely converge
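A small discrete version of the KL divergence formula above (my own sketch): it is zero when the distributions match and positive otherwise.

```python
import math

# D_KL(P || Q) = sum_s P(s) log2( P(s)/Q(s) ) over a discrete alphabet.
def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.5]
q = [0.9, 0.1]
assert kl(p, p) == 0.0   # identical distributions -> zero divergence
assert kl(p, q) > 0      # distinguishable distributions -> positive divergence
```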
• modeling the noise
• need to go from r(t) -> spike times
• divide time T into n bins with p = probability of firing per bin
• over some chunk T, number of spikes follows binomial distribution (n, p)
• mean = np
• var = np(1-p)
• if n gets very large, binomial approximates Poisson
• $\lambda$ = spikes in some set time
• mean = $\lambda$
• var = $\lambda$
1. can test if distr is Poisson with Fano factor = var/mean = 1
2. interspike intervals have an exponential distribution
• if the neuron fires a lot, this can be a bad assumption (due to the refractory period)
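A simulation sketch of the binomial-to-Poisson picture above (parameters are illustrative): many small bins with firing probability p each gives spike counts whose Fano factor is near 1.

```python
import random

random.seed(2)
n_bins, p, trials = 500, 0.04, 2000   # binomial(n, p), mean count = 20

counts = [sum(1 for _ in range(n_bins) if random.random() < p)
          for _ in range(trials)]
mean = sum(counts) / trials
var = sum((c - mean) ** 2 for c in counts) / trials
fano = var / mean
# binomial Fano factor is 1 - p, which approaches the Poisson value 1
assert abs(fano - 1.0) < 0.2
```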
• generalized linear model adds explicit spike-generation / post-spike filter (Pillow et al. 2008)
• post-spike filter models refractory period
• Paninski showed that using exponential nonlinearity allows this to be optimized
• could add in firing of other neurons
• time-rescaling theorem - tests how well we have captured influences on spiking (Brown et al 2001)
• scaled ISIs $\int_{t_{i-1}}^{t_i} r(t) \, dt$ should be exponentially distributed

# 3- neural decoding

### 3.1 - neural decoding and signal detection

•  decoding: P(stimulus|response) - ex. you hear a noise and want to tell what it is
• here r = response = firing rate
• monkey is trained to move eyes in same direction as dot pattern (Britten et al. 92)
• when dots all move in same direction (100% coherence), easy
• neuron recorded in MT - tracks dots
• count firing rate when monkey tracks in right direction
• count firing rate when monkey tracks in wrong direction
• as coherence decreases, these firing rates blur
•  need to get P(+ or - | r)
• can set a threshold on r by maximizing likelihood
•  P(r|+) and P(r|-) are likelihoods
• Neyman-Pearson lemma - the likelihood ratio test is the most efficient statistic, in that it has the most power for a given size
•  $\frac{p(r|+)}{p(r|-)} > 1?$
• accumulated evidence - we can accumulate evidence over time by multiplying these probabilities
• instead we sum the logs, and compare to 0
•  $\sum_i \ln \frac{p(r_i|+)}{p(r_i|-)} > 0?$
• once we hit some threshold for this sum, we can make a decision + or -
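A sequential-evidence sketch of the accumulation scheme above (all distributions and thresholds are my own illustrative choices): Gaussian responses at +1 under "+" and −1 under "−", log likelihood ratios summed until a bound is crossed.

```python
import random

random.seed(3)

def log_lr(r):
    # log[ p(r|+) / p(r|-) ] for unit-variance Gaussians at +1 and -1 is 2r
    return 2.0 * r

def decide(true_mean, threshold=5.0):
    total = 0.0
    while abs(total) < threshold:          # accumulate evidence over time
        total += log_lr(random.gauss(true_mean, 1.0))
    return '+' if total > 0 else '-'

decisions = [decide(+1.0) for _ in range(200)]
assert decisions.count('+') > 180   # almost always correct under "+" stimuli
```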
• experimental evidence (Kiani, Hanks, & Shadlen, Nat. Neurosci 2006)
• monkey is making decision about whether dots are moving left/right
• neuron firing rates increase over time, representing integrated evidence
• neuron always seems to stop at same firing rate
• priors - ex. a tiger is much less likely than a breeze
•  scale P(+|r) by the prior P(+)
•  neuroscience ex. for photoreceptor cells, P(noise|r) is much larger than P(signal|r)
• therefore threshold on r is high to minimize total mistakes
• cost of acting/not acting
•  expected loss for predicting + when it is -: Loss$_+$ = $L_+ \cdot P[-|r]$
•  expected loss for predicting - when it is +: Loss$_-$ = $L_- \cdot P[+|r]$
• cut your losses: answer + when Loss$_+$ < Loss$_-$
•  i.e. $L_+ \cdot P[-|r] < L_- \cdot P[+|r]$
• rewriting with Bayes’ rule yields a new test:
•  $\frac{p(r|+)}{p(r|-)} > \frac{L_+ \cdot P[-]}{L_- \cdot P[+]}$
• here the loss term replaces the 1 in the Neyman-Pearson lemma

### 3.2 - population coding and bayesian estimation

• population vector - sums vectors for cells that point in different directions weighted by their firing rates
• ex. cricket cercal cells sense wind in different directions
• since neuron can’t have negative firing rate, need overcomplete basis so that can record wind in both directions along an axis
• can do the same thing for direction of arm movement in a neural prosthesis
• not general - some neurons aren’t tuned, are noisier
• not optimal - making use of all information in the stimulus/response distributions
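A population-vector sketch (my own minimal example): four cells with preferred directions 0°, 90°, 180°, 270° and rectified cosine tuning, since rates cannot be negative; the rate-weighted vector sum points back toward the stimulus direction.

```python
import math

prefs = [0.0, math.pi / 2, math.pi, 3 * math.pi / 2]  # preferred directions

def decode(theta):
    # rectified cosine tuning: each cell fires most for its preferred direction
    rates = [max(0.0, math.cos(theta - p)) for p in prefs]
    # vector sum of preferred directions weighted by firing rates
    x = sum(r * math.cos(p) for r, p in zip(rates, prefs))
    y = sum(r * math.sin(p) for r, p in zip(rates, prefs))
    return math.atan2(y, x)

for theta in [0.3, 1.2, 2.5, -2.0]:
    assert abs(decode(theta) - theta) < 1e-9   # decoded direction matches
```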
• bayesian inference
•  $p(s|r) = \frac{p(r|s)p(s)}{p(r)}$
•  maximum likelihood: the s* which maximizes p(r|s)
•  MAP = maximum a posteriori: the s* which maximizes p(s|r)
• simple continuous stimulus example
• setup
• s - orientation of an edge
• each neuron’s average firing rate=tuning curve $f_a(s)$ is Gaussian (in s)
• let $r_a$ be number of spikes for neuron a
• assume the tuning curves of the neurons span s: $\sum_a f_a(s)$ is const
• solving
• maximizing log-likelihood with respect to s - take derivative and set to 0
• soln $s^* = \frac{\sum r_a s_a / \sigma_a^2}{\sum r_a / \sigma_a^2}$
• if all the $\sigma$ are same, $s^* = \frac{\sum r_a s_a}{\sum r_a}$
• this is the population vector
• maximum a posteriori
•  $\ln p(s|r) = \ln P(r|s) + \ln p(s) - \ln P(r)$
• $s^* = \frac{T \sum r_a s_a / \sigma_a^2 + s_{prior} / \sigma^2_{prior}}{T \sum r_a / \sigma_a^2 + 1/\sigma^2_{prior}}$
• this takes into account the prior
• narrow prior makes it matter more
• doesn’t incorporate correlations in the population
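A sketch of the equal-$\sigma$ case, where the ML estimate reduces to the population vector; the preferred stimuli, tuning widths, and Poisson spiking below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

s_pref = np.linspace(-60, 60, 9)   # hypothetical preferred stimuli s_a
sigma = 20.0                       # shared tuning width

def tuning(s):
    """Gaussian tuning curves f_a(s): mean rate of each neuron at stimulus s."""
    return 30 * np.exp(-0.5 * ((s - s_pref) / sigma) ** 2)

s_true = 10.0
r = rng.poisson(tuning(s_true))    # noisy spike counts r_a

# with equal sigmas, the ML estimate is the population vector
s_ml = np.sum(r * s_pref) / np.sum(r)
```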

• decoding s -> $s^*$
• want an estimator $s_{Bayes}=s_B$ given some response r
• error function $L(s,s_{B})=(s-s_{B})^2$
•  minimize $\int ds \, L(s,s_{B}) \, p(s|r)$ by taking derivative with respect to $s_B$
•  $s_B = \int ds \, p(s|r) \, s$ - the conditional mean (spike-triggered average)
• add in spike-triggered average at each spike
• if spike-triggered average looks exponential, can never have smooth downwards stimulus
• could use 2 neurons (like in H1) and replay the second with negative sign
• LGN neurons can reconstruct a video, but with noise
• recreated 1 sec long movies - (Jack Gallant - Nishimoto et al. 2011, Current Biology)
1. voxel-based encoding model samples ton of prior clips and predicts signal
•  get $p(r|s)$
•  pick best $p(r|s)$ by comparing predicted signal to actual signal
• input is filtered to extract certain features
• filtered again to account for slow timescale of BOLD signal
2. decoding
•  maximize $p(s|r)$ by maximizing $p(r|s) p(s)$, and assume $p(s)$ uniform
• 30 signals that have highest match to predicted signal are averaged
• yields pretty good pictures

# 4 - information theory

### 4.1 - information and entropy

• surprise for seeing a spike h(p) = $-log_2 (p)$
• entropy = average information
• code might not align spikes with what we are encoding
• how much of the variability in r is encoding s
• define q as en error
•  $P(r_+ s=+)=1-q$
•  $P(r_- s=+)=q$
• similar for when s=-
• total entropy: $H(R) = - P(r_+) \log P(r_+) - P(r_-)\log P(r_-)$
•  noise entropy: $H(R|S=+) = -q \log q - (1-q) \log (1-q)$
•  mutual info $I(S;R) = H(R) - H(R|S)$ = total entropy - average noise entropy
• $= D_{KL} (P(R,S) \| P(R)P(S))$
• grandma’s famous mutual info recipe
• for each s
•  $P(R|s)$ - take one stimulus and repeat many times (or run for a long time)
•  $H(R|s)$ - noise entropy
•  $H(R|S)=\sum_s P(s) H(R|s)$
•  $H(R)$ calculated using $P(R) = \sum_s P(s) P(R|s)$
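The recipe applied to the binary channel above (error probability q; the prior P(s=+) is left as a free parameter):

```python
import numpy as np

def H(p):
    """Entropy in bits of a discrete distribution given as an array of probabilities."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def mutual_info(q, p_plus=0.5):
    """I(S;R) for the binary channel with error probability q (recipe above)."""
    p_s = np.array([p_plus, 1 - p_plus])
    # rows: s = +, - ; columns: r = +, -
    p_r_given_s = np.array([[1 - q, q],
                            [q, 1 - q]])
    p_r = p_s @ p_r_given_s                       # P(R) = sum_s P(s) P(R|s)
    noise = np.sum(p_s * np.array([H(row) for row in p_r_given_s]))
    return H(p_r) - noise                         # total - average noise entropy
```

A perfect channel (q = 0) carries the full 1 bit; q = 0.5 carries nothing.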

### 4.2 information in spike trains

1. information in spike patterns
• divide pattern into time bins of 0 (no spike) and 1 (spike)
• binary words w with letter size $\Delta t$, length T (Reinagel & Reid 2000)
• can create histogram of each word
• can calculate entropy of word
• look at distribution of words for just one stimulus
• distribution should be narrower
• calculate $H_{noise}$ - average over time with random stimuli and calculate entropy
• varied parameters of word: length of bin (dt) and length of word (T)
• there’s some limit to dt at which information stops increasing
• this represents the temporal resolution of the code - bins finer than this add no information because of spike-time jitter
• corrections for finite sample size (Panzeri, Nemenman,…)
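The word-entropy calculation can be sketched directly (the binary train and word length here are arbitrary, and finite-sample corrections are ignored):

```python
import numpy as np

def word_entropy(spikes, word_len):
    """Entropy (bits/word) of binary words of length word_len in a 0/1 spike train."""
    spikes = np.asarray(spikes)
    n = len(spikes) // word_len
    words = spikes[:n * word_len].reshape(n, word_len)
    # encode each word as an integer, then histogram word frequencies
    codes = words @ (2 ** np.arange(word_len))
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))
```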
2. information in single spikes - how much info does single spike tell us about stimulus
• don’t have to know encoding, mutual info doesn’t care
1. calculate entropy for random stimulus
• $p=\bar{r} \Delta t$ where $\bar{r}$ is the mean firing rate
2. calculate entropy for specific stimulus
•  let $P(r=1|s) = r(t) \Delta t$
•  let $P(r=0|s) = 1 - r(t) \Delta t$
• get r(t) by having stimulus on for long time
• ergodicity - a time average is equivalent to averaging over the s ensemble
• info per spike $I(r,s) = \frac{1}{T} \int_0^T dt \frac{r(t)}{\bar{r}} \log_2 \frac{r(t)}{\bar{r}}$
• timing precision reduces r(t)
• low mean spike rate -> high info per spike
• ex. rat runs through place field and only fires when it’s in place field
• spikes can be sharper, more / less frequent
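A discretized version of the info-per-spike integral; the time-varying rate used in the test is a hypothetical place-field-like profile:

```python
import numpy as np

def info_per_spike(r, dt):
    """Discretized I = (1/T) * sum dt * (r/rbar) * log2(r/rbar), in bits per spike."""
    r = np.asarray(r, float)
    T = len(r) * dt
    rbar = r.mean()
    mask = r > 0                      # bins with r(t) = 0 contribute nothing
    return (dt / T) * np.sum((r[mask] / rbar) * np.log2(r[mask] / rbar))
```

A cell firing at 4x its mean rate during a quarter of the time (and silent otherwise) yields 2 bits/spike, illustrating "low mean spike rate -> high info per spike".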

### 4.3 coding principles

• natural stimuli
• huge dynamic range - variations over many orders of magnitude (ex. brightness)
• power law scaling - structure at many scales (ex. far away things)
• efficient coding - in order to have maximum entropy output, a good encoder should match its outputs to the distribution of its inputs
• want to use each of our “symbols” (ex. different firing rates) equally often
• should assign equal areas of input stimulus PDF to each symbol
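Histogram equalization as a sketch of this matching principle: given a skewed stimulus distribution (exponential here, as an assumption), quantile boundaries assign each response symbol equal probability mass:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical stimulus with a skewed (exponential) intensity distribution
stim = rng.exponential(scale=1.0, size=100_000)

# choose response levels so each "symbol" is used equally often:
# quantile edges give equal areas of the input PDF to each symbol
n_symbols = 8
edges = np.quantile(stim, np.linspace(0, 1, n_symbols + 1))
symbols = np.clip(np.searchsorted(edges, stim, side='right') - 1,
                  0, n_symbols - 1)

counts = np.bincount(symbols, minlength=n_symbols)
```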
• adaptation to stimulus statistics
• feature adaptation (Atick and Redlich)
• spatial filtering properties in retina / LGN change with varying light levels
• at low light levels surround becomes weaker
• coding schemes
1. redundancy reduction
• population code $P(R_1,R_2)$
• entropy $H(R_1,R_2) \leq H(R_1) + H(R_2)$ - being independent would maximize entropy
2. correlations can be good
• error correction and robust coding
• correlations can help discrimination
• retina neurons are redundant (Berry, Chichilnisky)
3. more recently, sparse coding
• penalize weights of basis functions
• instead, we get localized features
• we ignored the behavioral feedback loop

# 5 - computing in carbon

### 5.1 - modeling neurons

• nernst battery
1. osmosis (for each ion)
2. electrostatic forces (for each ion)
• together these yield Nernst potential $E = \frac{k_B T}{zq} \ln \frac{[out]}{[in]}$
• T is temp
• q is ionic charge
• z is num charges
• part of voltage is accounted for by nernst battery $V_{rest}$
• yields $\tau \frac{dV}{dt} = -V + V_\infty$ where $\tau=R_mC_m=r_mc_m$
• equivalently, $\tau_m \frac{dV}{dt} = -(V-E_L) - r_m g_s(t)(V-E_s) + I_e R_m$
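Forward-Euler integration of the passive membrane equation; all parameter values below are illustrative, not from the notes:

```python
# Euler integration of tau dV/dt = -(V - E_L) + I_e * R_m toward V_inf
tau = 0.010        # membrane time constant (s)
E_L = -70e-3       # leak / resting potential (V)
R_m = 10e6         # membrane resistance (ohm)
I_e = 2e-9         # injected current (A)
dt = 1e-4          # time step (s)

V = E_L
for _ in range(int(0.1 / dt)):     # 100 ms, about 10 time constants
    V += dt / tau * (-(V - E_L) + I_e * R_m)

V_inf = E_L + I_e * R_m            # steady-state value the trace approaches
```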

### 5.3 - simplified model neurons

• integrate-and-fire neuron
• passive membrane (neuron charges)
• when V = V$_{thresh}$, a spike is fired
• then V = V$_{reset}$
• doesn’t have good modeling near threshold
• can include threshold by saying
• when V = V$_{max}$, a spike is fired
• then V = V$_{reset}$
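A minimal leaky integrate-and-fire sketch with the spike-and-reset rule above (parameter values are illustrative):

```python
# leaky integrate-and-fire: integrate the passive membrane, emit a spike and
# reset whenever V crosses threshold (illustrative parameters)
tau, E_L, R_m = 0.010, -70e-3, 10e6
V_thresh, V_reset = -54e-3, -80e-3
I_e, dt = 2.0e-9, 1e-4           # constant drive strong enough to reach threshold

V, spikes = E_L, 0
for _ in range(int(1.0 / dt)):    # simulate 1 second
    V += dt / tau * (-(V - E_L) + I_e * R_m)
    if V >= V_thresh:             # threshold crossed: spike and reset
        spikes += 1
        V = V_reset

rate = spikes / 1.0               # firing rate in Hz
```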
• modeling multiple variables
• also model a K current
• can capture things like resonance
• theta neuron (Ermentrout and Kopell)
• often used for periodically firing neurons (it fires spontaneously)

### 5.4 - a forest of dendrites

• cable theory - Kelvin
• voltage V is a function of both x and t
• separate into sections that don’t depend on x
• coupling conductances link the sections (based on area of compartments / branching)
• Rall model for dendrites
• if branches obey a certain branching ratio, can replace each pair of branches with a single cable segment with equivalent surface area and electrotonic length
• $d_1^{3/2} = d_{11}^{3/2} + d_{12}^{3/2}$
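The 3/2 rule as arithmetic; the daughter diameters here are made up:

```python
# Rall's 3/2 power rule: two daughter branches can be collapsed into the parent
# cable when d1^(3/2) = d11^(3/2) + d12^(3/2); example diameters in um
d11, d12 = 2.0, 1.5
d1 = (d11 ** 1.5 + d12 ** 1.5) ** (2 / 3)   # parent diameter satisfying the rule
```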
• dendritic computation (London and Hausser 2005)
• hippocampus - when inputs arrive at soma, similiar shape no matter where they come in = synaptic scaling
• where inputs enter influences how they sum
• dendrites can generate spikes (usually calcium) / backpropagating spikes
• ex. Jeffress model - sound localized based on timing difference between ears
• ex. direction selectivity in retinal ganglion cells - if events arive at dendrite far -> close, all get to soma at same time and add

# 6 - computing with networks

### 6.1 - modeling connections between neurons

• model effects of synapse by using synaptic conductance $g_s$ with reversal potential $E_s$
• $g_s = g_{s,max} \cdot P_{rel} \cdot P_s$
• $P_{rel}$ - probability of release given an input spike
• $P_s$ - probability of postsynaptic channel opening = fraction of channels opened
• basic synapse model
• assume $P_{rel}=1$
• model $P_s$ with kinetic model
• open based on $\alpha_s$
• close based on $\beta_s$
• yields $\frac{dP_s}{dt} = \alpha_s (1-P_s) - \beta_s P_s$
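Euler integration of the kinetic model above; the rate constants are illustrative:

```python
# dPs/dt = alpha * (1 - Ps) - beta * Ps, integrated while transmitter is present
alpha, beta = 1000.0, 200.0      # opening / closing rates (1/s), made-up values
dt = 1e-5

Ps = 0.0
for _ in range(int(0.05 / dt)):  # 50 ms, many time constants 1/(alpha+beta)
    Ps += dt * (alpha * (1 - Ps) - beta * Ps)

Ps_inf = alpha / (alpha + beta)  # steady-state open fraction
```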
• 3 synapse types
1. AMPA - well-fit by exponential
2. GABA - fit by “alpha” function - has some delay
3. NMDA - fit by “alpha” function - has some delay
• linear filter model of a synapse
• pick filter (ex. K(t) ~ exponential)
• $g_s = g_{s,max} \sum K(t-t_i)$
• network of integrate-and-fire neurons
• if 2 neurons inhibit each other, get synchrony (fire at the same time)

### 6.2 - intro to network models

• comparing spiking models to firing-rate models
• spike timing
• spike correlations / synchrony between neurons
• computationally expensive
• uses linear filter model of a synapse
• developing a firing-rate model
• replace spike train $\rho_1(t) \to u_1(t)$
• can’t make this replacement when there are correlations / synchrony?
• input current $I_s$: $\tau_s \frac{dI_s}{dt}=-I_s + \mathbf{w} \cdot \mathbf{u}$
• works only if we let K be exponential
• output firing rate: $\tau_r \frac{d\nu}{dt} = -\nu + F(I_s(t))$
• if synapses are fast ($\tau_s \ll \tau_r$)
• $\tau_r \frac{d\nu}{dt} = -\nu + F(\mathbf{w} \cdot \mathbf{u})$
• if synapses are slow ($\tau_r \ll \tau_s$)
• $\nu = F(I_s(t))$
• if static inputs (input doesn’t change) - this is like artificial neural network, where F is sigmoid
• $\nu_{\infty} = F(\mathbf{w} \cdot \mathbf{u})$
• could make these all vectors to extend to multiple output neurons
• recurrent networks
• $\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} + F(W\mathbf{u} + M \mathbf{v})$
• $-\mathbf{v}$ is decay
• $W\mathbf{u}$ is input
• $M \mathbf{v}$ is feedback
• with constant input, the steady state satisfies $\mathbf{v}_{\infty} = F(W \mathbf{u} + M \mathbf{v}_{\infty})$
• ex. edge detectors
• V1 neurons are basically computing derivatives

### 6.3 - recurrent networks

• linear recurrent network: $\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} + W\mathbf{u} + M \mathbf{v}$
• let $\mathbf{h} = W\mathbf{u}$
• want to investigate different M
• can solve eq for $\mathbf{v}$ using eigenvectors
• suppose M (NxN) is symmetric (connections are equal in both directions)
• $\to$ M has N orthogonal eigenvectors / eigenvalues
• let $e_i$ be the orthonormal eigenvectors
• output vector $\mathbf{v}(t) = \sum c_i (t) \mathbf{e_i}$
• allows us to get a closed-form solution for $c_i(t)$
• eigenvalues determine network stability
• if any $\lambda_i > 1, \mathbf{v}(t)$ explodes $\implies$ network is unstable
• otherwise stable and converges to steady-state value
• $\mathbf{v}_\infty = \sum \frac{h\cdot e_i}{1-\lambda_i} e_i$
• amplification of input projection by a factor of $\frac{1}{1-\lambda_i}$
• ex. each output neuron codes for an angle between -180 to 180
• define M as cosine function of relative angle
• excitation nearby, inhibition further away
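A sketch checking the steady-state formula for a random symmetric M with all eigenvalues below 1 (the matrix and input h are arbitrary test data):

```python
import numpy as np

# steady state of tau dv/dt = -v + h + M v for symmetric M:
# v_inf = sum_i (h . e_i) / (1 - lambda_i) * e_i
rng = np.random.default_rng(2)
N = 5
A = rng.normal(size=(N, N))
M = (A + A.T) / 2                             # symmetric connectivity
M *= 0.9 / np.max(np.abs(np.linalg.eigvalsh(M)))   # scale so max |lambda| = 0.9
h = rng.normal(size=N)

lam, E = np.linalg.eigh(M)                    # columns of E are orthonormal e_i
v_eig = E @ ((E.T @ h) / (1 - lam))           # eigenvector formula

v_direct = np.linalg.solve(np.eye(N) - M, h)  # same thing: (I - M)^{-1} h
```

Projections along eigenvectors with $\lambda_i$ near 1 are strongly amplified, which is the $\frac{1}{1-\lambda_i}$ factor above.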
• memory in linear recurrent networks
• suppose $\lambda_1=1$ and all other $\lambda_i < 1$
• then $\tau \frac{dc_1}{dt} = h \cdot e_1$ - keeps memory of input
• ex. memory of eye position in medial vestibular nucleus (Seung et al. 2000)
• integrator neuron maintains persistent activity
• nonlinear recurrent networks: $\tau \frac{d\mathbf{v}}{dt} = -\mathbf{v} + F(\mathbf{h}+ M \mathbf{v})$
• ex. rectification linearity F(x) = max(0,x)
• ensures that firing rates never go below zero
• can have eigenvalues > 1 but stable due to rectification
• can perform selective “attention”
• network performs “winner-takes-all” input selection
• gain modulation - adding constant amount to input h multiplies the output
• also maintains memory
• non-symmetric recurrent networks
• ex. excitatory and inhibitory neurons
• linear stability analysis - find fixed points and take partial derivatives
• use eigenvalues to determine dynamics of the nonlinear network near a fixed point

# 7 - networks that learn: plasticity in the brain & learning

### 7.1 - synaptic plasticity, hebb’s rule, and statistical learning

• if 2 spikes keep firing at same time, get LTP - long-term potentiation
• if input fires, but not B then could get LTD - long-term depression
• Hebb rule $\tau_w \frac{d\mathbf{w}}{dt} = \mathbf{x}v$
• $\mathbf{x}$ - input
• $v$ - output
• translates to $\mathbf{w}_{i+1}=\mathbf{w}_i + \epsilon \cdot \mathbf{x}v$
• average effect of the rule is to change based on correlation matrix $Q = \langle \mathbf{x}\mathbf{x}^T \rangle$
• covariance rule: $\tau_w \frac{d\mathbf{w}}{dt} = \mathbf{x}(v-E[v])$
• includes LTD as well as LTP
• Oja’s rule: $\tau_w \frac{d\mathbf{w}}{dt} = \mathbf{x}v- \alpha v^2 \mathbf{w}$ where $\alpha>0$
• stability
• Hebb rule - derivative of w is always positive $\implies$ w grows without bound
• covariance rule - derivative of w is still always positive $\implies$ w grows without bound
•  could add constraint that $\|\mathbf{w}\| =1$ and normalize w after every step
•  Oja’s rule - $\|\mathbf{w}\| \to 1/\sqrt{\alpha}$, so stable
• solving Hebb rule $\tau_w \frac{d\mathbf{w}}{dt} = Q w$ where Q represents correlation matrix
• write w(t) in terms of eigenvectors of Q
• lets us solve for $\mathbf{w}(t)=\sum_i c_i(0)exp(\lambda_i t / \tau_w) \mathbf{e}_i$
• when t is large, largest eigenvalue dominates
• hebbian learning implements PCA
• hebbian learning learns w aligned with principal eigenvector of input correlation matrix
• this is same as PCA
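Oja's rule run on correlated 2-D inputs: the learned w lines up with the principal eigenvector of the input correlation matrix and settles near unit norm ($\alpha = 1$). The input covariance and learning rate are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# correlated 2-D inputs whose principal axis is roughly (1, 1)/sqrt(2)
C = np.array([[1.0, 0.8],
              [0.8, 1.0]])
X = rng.multivariate_normal([0, 0], C, size=20_000)

# Oja's rule with alpha = 1: dw = eps * (v*x - v^2 * w), where v = w . x
w = np.array([1.0, 0.0])
eps = 0.005
for x in X:
    v = w @ x
    w += eps * (v * x - v ** 2 * w)

# principal eigenvector of C, for comparison
evals, evecs = np.linalg.eigh(C)
e1 = evecs[:, np.argmax(evals)]
```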

### 7.2 - intro to unsupervised learning

• most active neuron is the one whose w is closest to x
• competitive learning
• updating weights given a new input
1. pick a cluster (corresponds to most active neuron)
2. set weight vector for that cluster to running average of all inputs in that cluster
• $\Delta w = \epsilon \cdot (\mathbf{x} - \mathbf{w})$
• related to self-organizing maps = kohonen maps
• in self-organizing maps also update other neurons in the neighborhood of the winner
• update winner closer
• update neighbors to also be closer
• ex. V1 has orientation preference maps that do this
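Competitive learning sketched on two made-up clusters: only the winning unit (closest weight vector) moves toward each input:

```python
import numpy as np

rng = np.random.default_rng(4)

# two hypothetical input clusters in 2-D
data = np.vstack([rng.normal([0, 0], 0.1, size=(200, 2)),
                  rng.normal([5, 5], 0.1, size=(200, 2))])
rng.shuffle(data)

W = np.array([[1.0, 0.0],        # one weight vector per cluster unit
              [0.0, 1.0]])
eps = 0.1
for x in data:
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    W[winner] += eps * (x - W[winner])   # Delta w = eps * (x - w), winner only
```

The update makes each winning weight vector a running average of the inputs it wins, so the two units end up near the two cluster centers.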
• generative model
• prior $P(C)$
•  likelihood $P(X|C)$
•  posterior $P(C|X)$
•  mixture of Gaussians model - Gaussian assumption: $P(X|C=c)$ is Gaussian
• EM = expectation-maximization
1.  estimate $P(C|X)$ - pick what cluster each point belongs to
• for Gaussian model, each point gets a probability of belonging to each cluster
• this probability weights the change - “soft” assignment
2. learn parameters of generative model - change parameters of Gaussian (mean and variance) for clusters
• assumes you have all the points at once
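A 1-D mixture-of-Gaussians EM sketch; the cluster parameters and sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)

# synthetic data: two Gaussian clusters with a 30/70 split
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 0.5, 700)])

mu = np.array([-1.0, 1.0])       # initial means
sd = np.array([1.0, 1.0])        # initial standard deviations
pi = np.array([0.5, 0.5])        # mixing priors P(C)

for _ in range(50):
    # E step: soft responsibilities P(C|X) for every point
    lik = pi * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / sd
    resp = lik / lik.sum(axis=1, keepdims=True)
    # M step: re-estimate mean, sd, and prior of each cluster
    Nk = resp.sum(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    pi = Nk / len(x)
```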

### 7.3 - sparse coding and predictive coding

• eigenface - Turk and Pentland 1991
• eigenvectors of the input covariance matrix are good features
• can represent images using sum of eigenvectors (orthonormal basis)
• suppose you use only first M principal eigenvectors
• then there is some noise
• can use this for compression
• not good for local components of an image (e.g. parts of face, local edges)
• if you assume Gaussian noise, maximizing likelihood = minimizing squared error
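A compression sketch: project made-up high-dimensional data onto the first M eigenvectors of its covariance and reconstruct; with M matched to the number of underlying causes, only the noise is lost:

```python
import numpy as np

rng = np.random.default_rng(5)

# toy "images": 50-dimensional data generated from 3 underlying directions + noise
G = rng.normal(size=(50, 3))
data = rng.normal(size=(500, 3)) @ G.T + 0.01 * rng.normal(size=(500, 50))
data -= data.mean(axis=0)

# eigenvectors of the input covariance matrix are the features
cov = data.T @ data / len(data)
evals, evecs = np.linalg.eigh(cov)
top = evecs[:, np.argsort(evals)[::-1][:3]]   # first M = 3 principal eigenvectors

# compression: keep 3 coefficients per image, then reconstruct
coeffs = data @ top
recon = coeffs @ top.T
mse = np.mean((data - recon) ** 2)
```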
• generative model
• images X
• causes
•  likelihood $P(X=x|C=c)$
• Gaussian
• proportional to $\exp(-\|x-Gc\|^2)$
•  want posterior $P(C|X)$
• prior $p(C)$
• assume prior causes are independent
• want sparse distribution
• has heavy tail (super-Gaussian distribution)
• then $P(C) = k \cdot \prod_i \exp(g(C_i))$
• can implement sparse coding in a recurrent neural network
• Olshausen & Field, 1996 - learns receptive fields in V1
• sparse coding is a special case of predictive coding
• there is usually a feedback connection for every feedforward connection (Rao & Ballard, 1999)

# 8 -

### 8.2 - reinforcement learning - predicting rewards

• dopamine serves as brain’s reward

# ml analogies

## Brain theories

• Computational Theory of Mind
• Classical associationism
• Connectionism
• Situated cognition
• Memory-prediction framework
• Fractal Theory: https://www.youtube.com/watch?v=axaH4HFzA24
• Brain sheets are made of cortical columns (about .3mm diameter, 1000 neurons / column)
• Have ~6 layers

## brain as a computer

• Brain as a Computer – Analog VLSI and Neural Systems by Mead (VLSI – very large scale integration)
• Brain Computer Analogy
• Process info
• Signals represented by potential
• Signals are amplified = gain
• Power supply
• Knowledge is not stored in knowledge of the parts, but in their connections
• Based on electrically charged entities interacting with energy barriers
• http://en.wikipedia.org/wiki/Computational_theory_of_mind
• http://scienceblogs.com/developingintelligence/2007/03/27/why-the-brain-is-not-like-a-co/
• Brain’s storage capacity is about 2.5 petabytes (Scientific American, 2005)
• Electronics
• Voltage can be thought of as water in a reservoir at a height
• It can flow down, but the water will never reach above the initial voltage
• A capacitor is like a tank that collects the water under the reservoir
• The capacitance is the cross-sectional area of the tank
• Capacitance – electrical charge required to raise the potential by 1 volt
• Conductance = 1/ resistance = mho, siemens
• We could also say the world is a computer with individuals being the processors - with all the wasted thoughts we have, the solution is probably to identify global problems and channel people’s focus toward working on them
• Brain chip: http://www.research.ibm.com/articles/brain-chip.shtml
• Differences: What Can AI Get from Neuroscience?
• Brains are not digital
• Brains don’t have a CPU
• Memories are not separable from processing
• Asynchronous and continuous
• Details of brain substrate matter
• Feedback and Circular Causality
• Brains have lots of sensors
• Lots of cellular diversity
• NI uses lots of parallelism
• Delays are part of the computation

## Brain v. Deep Learning

• http://timdettmers.com/
• problems with brain simulations:
• Not possible to test specific scientific hypotheses (compare this to the large hadron collider project with its perfectly defined hypotheses)
• Does not simulate real brain processing (no firing connections, no biological interactions)
• Does not give any insight into the functionality of brain processing (the meaning of the simulated activity is not assessed)
• Neuron information processing parts
• Dendritic spikes are like first layer of conv net
• Neurons will typically have a genome that is different from the original genome that you were assigned to at birth. Neurons may have additional or fewer chromosomes and have sequences of information removed or added from certain chromosomes.
• http://timdettmers.com/2015/03/26/convolution-deep-learning/
• The adult brain has 86 billion neurons, about 10 trillion synapses, and about 300 billion dendrites (tree-like structures with synapses on them)
• Development

# 22 early development

• ways to study
1. top-down: rosy retrospection
2. bottom-up: e.g. LTP/LTD
3. human disease: stroke-by-stroke
4. development=ontogeny
• timeframe
• month 1 - gastrulation
• most sensitive time for mom
• month 2-5 - cells being born
• up to year 2 - axon guidance / synapse formation
1. gastrulation - process by which early embryo undergoes folds = shapes of NS
• diseases
• spina bifida - neural tube fails to seal
• folic acid (vitamin B9) supplementation can prevent this
• anencephaly - neural tube fails to close higher up
• parts
1. roofplate at top (back)
2. floorplate on bottom (stomach)
3. neural crest - pinches off top of roofplate
4. neuroblasts = classic stem cells
• asymmetric division - cells generate themselves and differentiated progeny
• ultimate stem cell - fertilized eggs
1. differentiation
• cells made by neuroblasts decide what they are going to become
• morphogens
• BMP - roofplate
• cyclopia - fatal defect in BMP
• Hedgehog (HH) - at floor plate
• Retinoids - axial, affect skin
• affected by thalidomide - helps morning sickness but causes missing limb segments
• also affected by accutane
• FGFs - axial symmetry
• Wnts - skin, gut, hair
• loss of wnts is loss of hair
• floor plate loses function after embryogenesis except glioblastoma
• measure BMP and HH gradient to figure out where you are
• treat ALS by adding HH to make more alpha motor neurons
1. dorsal direction
• roofplate makes BMP
• low HH - interneurons, sensory neurons (ex. nociceptors)
• even BMP/HH - sympathetic
• high HH - more motor neurons
• floorplate makes HH (hedgehog)
2. axial specification (anterior/posterior)
• tube swells into bulbs that become cerebellum, superior colliculus, cortex
• homeotic genes = hox genes - set of genes (transcription factors) in order on chromosome
• order corresponds to order of your body parts
• rhombomeres - segments in brainstem made by hox gene patterns
1. lineages
• when neuroblast is born, starts producing progeny (family tree of neuron types)
• very often, cells are produced in certain order
• timing: cell-cell interactions and tyrosine kinases determine order
• first alpha neurons, then GABAergic to control those, last is glia
• neural crest function
• migratory - moves out and divides:
• neuroblastoma - developed early - severe problem because missing parts of NS
• makes DRG and associated glial cells (schwann cells)
• makes sympathetic NS and target ganglia, enteric NS, parasympathetic NS targets
• makes melanocytes - know how to migrate and divide but can make melanoma (cancer)
• cortex is made inside out (6->1)
• starts with stem cells called radial glia
• cortical dysplasia - missing a layer / duplicating a layer
• small part with 2 layer 3s - severe epilepsy
5. cell death
• 1/2 of cells die in development
1. axon guidance (ch 23)
• each cell born and axon grows and are guided to a target
• dendrite basically follows same rules
1. synapse formation (ch 23, 24)
• pruning and plasticity
• NMDA receptor type
• form synapses and if they don’t look right - get rid of them
• K1/K-1 synapses breaking and forming
• after age 21, K-1 starts increasing and net loss of synapses

# 23 circuit formation

• growth cone - motile tip of axon
• actin tip
1. lamellipodium - sheet (hand)
2. filopodium - huge curves (fingers)
• chemo attraction (actin assembly) and chemo repulsion (actin disassembly)
• microtubule shaft - tubulin is much more cemented in
• Mauthner cell of tadpole - first recorded growth cone
• can’t regrow (that’s why we can’t regrow spinal cord)
• signals in growing axons
1. pioneer axons (Betz cells) are first - often die
2. follower axons (other Betz cells) can jump onto these and connect before pioneer dies
• trophic support - neuron survives on contact
• frog tectum (has superior colliculus) with map of retina:
• ephrin (EPH) repulses axon
• retinal NT -> tectum AP
• axons have different amount of EPH receptors (in retina temporal has more than nasal)
• gradient of EPH (in tectum anterior has less than posterior)
• if we flip eye upside down (on nasal-temporal axis), image will be upside down
• 3 classes of axon guidance molecules:
1. ECM/integrins
2. NCAM (homophilic—binds to another neuron that is NCAM),
• follower neurons bind to pioneer through NCAM-NCAM interactions
• involved in recognition of being some place
• 4 important ligands/receptors
1. ephrins/eph
• gradient of eph receptor
2. netrin/DCC - guidance molecule is netrin, receptor is DCC
• attracts axons to floorplate (midline)
• cells without DCC don’t cross midline
3. slit/robo - ligand is slit, receptor is robo
• chases axons off (away from midline)
• axons not destined to cross midline are born expressing robo
• axons destined to cross the midline only express robo after crossing
• if DCC (-) and robo (-) will continue wandering around
• robo 4 is associated with Tourette’s
4. semaphorins/plexins
• combinatorial code - use combinations of these to guide axons
• these are the same genes that move cancer around
• synaptic formation
• neurexins - further recognition
• turn up in autism and schizophrenia
• DSCAM
• associated with Down’s syndrome
• doesn’t use gradients
• makes different kinds of proteins by differential splicing
• competition
• neurotrophins are secreted by muscle
• in early development, a muscle fiber has many alpha motor neurons innervating it
• all innervating neurons suck up neurotrophin and whichever sucks up most, kills all the others
• eventually, each muscle fiber is innervated by one alpha motor neuron
• only enough neurotrophin in target cells for a certain number of synapses
• happens everywhere
• ex. sympathetic ganglia
• ex. sensory neurons in skin get axons to correct cell types based on neurotrophin
• merkel - BDNF
• proprioceptor - NT3
• nociceptor - NGF
• ex. muscles - produce NGF
• treating ALS with NGF hyperactivates sensory neurons with trkA -> causes chicken pox
• signals/receptors
1. NGF - trk a (Trk receptor - survival signaling pathways)
2. BDNF - trk b
3. NT3 - trk b and c
4. NT4/5 - trk b
• all bind p75 (death receptor)
• want to keep neurotrophins local, because there aren’t that many of them

# 24 plasticity in systems

• experience-dependent plasticity -
• ex. duck imprinting is non-reversible
• learning is crystallized during critical period
• CREB and protein synthesis
• NMDA receptors
• epigenetics - histones control transcription and other things
• follow Hebb’s postulate - fire together, wire together
• different eyes firing together will sync up (NMDA receptors to strengthen synapses)
• systems
1. ocular dominance
• left/right neurons terminate in adjacent zones
• LGN in cortex uses efferents just like superior colliculus
• label injected into retina can make it into cortex
• cat experiments
• some cells see only one eye, some see both
• cats need to form visual map in short critical period (<6 days)
• this is why you need cochlear implant early
• both eyes open - equal OD columns
• one eye closed - unequal OD columns
• branches coming out of LGN neurons grow more branches based on relative light exposure (they compete for eye’s real estate)
• strabismus - eye misalignment from poor coordination with one of the muscles (leads to amblyopia = “lazy eye”)
• one eye is not quite seeing
• treat with patch on good eye -> allows bad eye to catch up since eyes compete for ocular dominance columns
• more stimulus = more branches
• dye from retina goes through thalamus into cortex
• rabies virus does same thing: cell->ganglion->brain
2. tonotopic map
• connection between MSO and inferior/superior colliculus
• playing one tone increases representation
• playing white noise disorganizes map
• birdsong
• hear song 10-20 times when young - crystallized
• afterwards can’t learn new skills
3. stress
• early stress sets stress points later in life
• uses serotonin
• shifts
• superior colliculus - integrate visual, auditory, motor to get X,Y coordinate
• auditory map - plastic (but only when young)
• visual map - not plastic
• if you shift visual map (with a prism), auditory map can shift over to meet the visual
• optic neuritis - MS optic nerve disease that shifts map
• only young animals can shift unless they were shifted before and are now unadapting

# 25 repair and regeneration

1. full repair - human PNS - skin, muscles
• 1-2 mm/day growth - speed of slow axonal transport
• thinnest axons first (thermal receptors and nociceptors)
• proprioceptors last
• process
• perineurium / schwann cells surround axons - helps regeneration
• growth cones that are cut form stumps -> distal axons degenerate = Wallerian degeneration
• macrophages come in and eat up the damaged stuff
• neurotrophins are involved
• miswiring is common - regrow and may not find right target
• Bell’s palsy - loss of facial nerve - recovers with miswiring (salivary / tear)
• neuromuscular junctions (NMJ)
• damaged cells leave synaptic ghost = glia and protein matrix for nerve to regrow into
• repairs easily after heavy training
2. no repair / glial scar - human CNS
• no ghost because so spread out
• glia cover wound (scar) but can’t develop further
• has astrocytes and oligodendrocytes (types of glial cells)
• don’t support regrowth
• involved in scarring
• microglia - from immune system
• control inflammation
• release cytokines
• nogo - protein that blocks regrowth (but there are other proteins as well)
• we try repairing with shunts - piece of sciatic nerve from other part of body with schwann cells from PNS to try to repair a connection in the CNS
3. stem cell regeneration - new neurons being formed; 2 places in humans
• non-human examples
• floor plate of lizards can make new tail
• fish retina always making new cells
• canary brain part has stem cells that learn new song every year
• small C14 incorporation after early development - suggests we don’t regenerate neurons - C14 was from nuclear testing
• human areas that do regenerate
1. hippocampus
• memories you store temporarily
2. subventricular zone makes glomeruli in olfactory bulb cells
• turnover daily
• sensory neurons and their targets constantly die and regenerate
3. niche - places where stem cells stay alive
• ex. places in CNS with Wnt molecular signals
• damage control - remove these signals for apoptosis = cell death
• glutamate increase - excitotoxicity
• can stop with NMDA blockers
• induce a coma by cooling them down or GABA drugs
• cytokines increase - immune system (like neurotrophins), inflammation
• hypoxia/stress
• neurotrophin withdrawal
• in stress times neurotrophin goes down

# 26 diseases

### alzheimer’s

• overview
• age-associated - tons of people get it
• doesn’t kill you, secondary complications like pneumonia will kill you
• rate is going up
• very expensive to treat
• declarative memories are affected by Alzheimer’s
• these are memories that you know
• first 2 areas to go in Alzheimer’s
1. hippocampus
• patient HM had no hippocampus
• no anterograde memory - learning new things
• hippocampus stores 1 day of info
• offloading occurs during sleep (REM sleep) to prefrontal cortex, temporal lobe, V4
• dreaming - might see images as you are offloading
2. basal forebrain - spread synapses all over cortex
• uses Ach
• ignition key for entire cortex
• alzheimer’s characteristics only found in autopsy
• amyloid plaques
• maybe A-beta causes it
• A-beta comes from APP
• A-beta42 binds to itself
• prion (starts making more of itself)
• this cycle could be exacerbated by injury
• clumps and attracts immune system which kills local important cells
• this could cause Alzheimer’s
• rare genetic mutations in A-beta increase probability you get Alzheimer’s
• anti-inflammation may be too late
• can take drugs that increase Ach functions - ex. cholinergic agonists, cholinesterase inhibitors
• tangles
• tangles made of protein called Tau
• most people think these are just dead cells resulting from Alzheimer’s but some think they cause it

### parkinson’s

• loss of substantia nigra pars compacta dopaminergic neurons
• when you get down to 20% what you were born with
• dopaminergic neurons form melanin = dark color
• hits to head can give inflammation
• know what they need to do - don’t have enough dopamine to act
• treat with L Dopa -> something like dopamine -> take out globus pallidus
• Lewy bodies are clumps of alpha synuclein - appear at dopaminergic synapses
• clumps like A-beta42
• associated with early-onset Parkinson’s (rare), which is associated with genetic mutations
• bradykinesia - slowness of movement
• age can give Parkinson’s
• no evidence that toxins induce Parkinson’s in humans
• MPTP / pesticides can induce Parkinson’s in test animals
• 1/500 people
• memory
• The Neuron Doctrine – the neuron is the fundamental building block and elementary signaling unit of the brain
• Golgi – develops silver staining method which allows Cajal to see entire neuron
• Santiago Ramon y Cajal – Spanish anatomist who simplifies neuron forest and looks at individual neurons, develops model with dendrites, cell body, and axon
• 4 parts: neuron, synapse, connection specificity (specific neurons connect to specific others), dynamic polarization (signals travel in one direction)
• Freud looks into Cajal’s theories, but doesn’t incorporate them
• 1906 – they share Nobel despite Golgi hating Cajal’s theories
• 1955 – Cajal’s intuitions borne out conclusively
• Next Generation
• Sherrington – builds on Cajal’s work – finds that neurons integrate signals and some signals are inhibitory
• Shares Nobel with Adrian in 1932 – Adrian is younger and grateful
• Phases of Nerve Signaling
• Galvani discovers electrical activity in animals, Helmholtz measures speed of electrical signals in neurons
• Adrian measures action potentials and sees that they all have the same size and that intensity correlates with their frequency
• Bernstein (student of Helmholtz) finds that ions carry the electrical current – he investigates only the potassium ion
• Ionic hypothesis – Hodgkin, Huxley (and Katz) – find sodium, potassium in squid axon using voltage clamp, discover voltage-gated channels, win Nobel in 1963
• Interneuronal signaling
• 1920s – Dale and Loewi find that synaptic transmission is chemical - use acetylcholine in frogs
• synaptic potential – has different sizes, slower – only over synapses (can be excitatory or inhibitory, can generate action potential)
• long-term potentiation (LTP) is a persistent strengthening of synapses based on recent patterns of activity
• Eccles – believed in spark theory (synaptic transmission was electrical), but after talking to Popper disproves it with Katz and believes in soup theory (synaptic transmission is chemical)
• Glutamate – major excitatory neurotransmitter
• GABA – main inhibitory transmitter
• Katz’s lab later showed there are a few synapses that are electrical
• Katz – found that neurotransmitters were released when voltage-gated channels let in Ca ions
• Packets of neurotransmitters called quanta are released in synaptic vesicles
• confirmed in 1955
• Modern Generation
• Four Lobes
• Frontal – working memory and lots of stuff
• Temporal – auditory processing, language, and memory
• Parietal – sensory information
• Occipital – vision
• Brain maps - Marshall showed that, even though the different sensory systems carry different types of information and end up in different regions of the cerebral cortex, they share a common logic in their organization: all sensory information is organized topographically in the brain in the form of precise maps of the body’s sensory receptors
• Broca and Wernicke find that specific brain regions are in charge of specific functions
• Broca’s area- expression of language
• Wernicke’s area – perception of language
• Patient H.M. – research by Brenda Milner
• He couldn’t store new memories although he could learn new skills
• Memory is a distinct mental function, clearly separate from other perceptual, motor, and cognitive abilities.
• Short-term memory and long-term memory can be stored separately. Loss of medial temporal lobe structures, particularly loss of the hippocampus, destroys the ability to convert new short-term memory to new long-term memory.
• There is explicit and implicit memory (implicit is a collection of processes)
• Milner showed that at least one type of memory can be traced to specific places in the brain
• Early Eric Kandel
• Gets lucky start recording in hippocampus
• Decides to start recording in Aplysia – large and has only 20,000 neurons separated into nine ganglia (human brain ~ 100 billion)
• Hypothesizes that persistent changes in the strength of synaptic connections results in memory storage
• just as neurons and their synaptic connections are exact and invariant, so, too, the function of those connections is invariant.
• First, we found that the changes in synaptic strength that underlie the learning of a behavior may be great enough to reconfigure a neural network and its information-processing ability
• Second, a given set of synaptic connections between two neurons can be modified in opposite ways—strengthened or weakened—by different forms of learning.
• Third, the duration of short-term memory storage depends on the length of time a synapse is weakened or strengthened.
• Fourth, we were beginning to understand that the strength of a given chemical synapse can be modified in two ways, depending on which of two neural circuits is activated by learning—a mediating circuit or a modulatory circuit
• Learning may be a matter of combining various elementary forms of synaptic plasticity into new and more complex forms, much as we use an alphabet to form words.
• forgetting had at least two phases: a rapid initial decline that was sharpest in the first hour after learning and then a much more gradual decline that continued for about a month.
• Homosynaptic - the depression occurred in the same neural pathway that was stimulated
• Strengthening synapses = greater responses
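The two-phase forgetting curve described above can be sketched as the sum of a fast and a slow exponential decay. All weights and time constants below are illustrative assumptions (about an hour for the fast phase, about a month for the slow phase), not values fitted to the data:

```python
import math

def retention(t_hours, fast_w=0.5, fast_tau=1.0, slow_w=0.5, slow_tau=24 * 30):
    """Fraction of the learned response retained after t_hours.

    Two-phase decay: a fast component (time constant ~1 hour) and a
    slow component (time constant ~1 month). Parameters are illustrative.
    """
    return (fast_w * math.exp(-t_hours / fast_tau)
            + slow_w * math.exp(-t_hours / slow_tau))

# sharpest decline in the first hour after learning...
drop_first_hour = retention(0) - retention(1)
# ...then a much more gradual decline over the following day
drop_next_day = retention(1) - retention(25)
```

With these parameters the first hour erases more than the following 24 hours combined, matching the qualitative shape of the curve.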
• Short-term to Long-term
• Memory consolidation – short-term is subject to disruption
• Head injuries or seizures can lead to retrograde amnesia – you forget what was in your short-term memory
• Electric shocks were able to get rid of short-term memory
• A short-term memory lasting minutes is converted—by a process of consolidation that requires the synthesis of new protein—into stable, long-term memory lasting days, weeks, or even longer
• Long-term memory results in growing or shedding synapses
• As the memory fades, the number of synapses falls almost back to normal; the residual difference makes the task easier to relearn
• Recall
• Based on cues, in the case of Aplysia gill-withdrawal, external stimulus
• Short-term
• Short-term memory changes are presynaptic - during short-term habituation lasting minutes, the sensory neuron releases less neurotransmitter, and during short-term sensitization it releases more neurotransmitter
• Relies on interneurons:
• Mediating circuits – produce behavior directly, sensory neurons that innervate the siphon, the interneurons, and the motor neurons that control the gill-withdrawal reflex
• Modulatory circuits - not directly involved in producing a behavior but instead fine-tunes the behavior in response to learning by modulating—heterosynaptically—the strength of synaptic connections between the sensory and motor neurons
• For example, in gill-withdrawal sensitization, interneurons release serotonin into the presynaptic terminals of the sensory neurons
• This generates a long, slow synaptic potential in the motor neurons
• Ionotropic receptors – neurotransmitter-gated, open ion channels
• Metabotropic receptors – gated, activate enzymes; these enzymes can make cyclic AMP, act much longer
• Serotonin binds to metabotropic receptors in the presynaptic terminal of sensory neurons increasing the amount of cAMP and in turn the amount of glutamate
• This was verified in studies of Drosophila
• Sensory regions map to specific places in brain, keeping proximity
• Pavlov
• Habituation – animal repeatedly presented with neutral sensory stimulus learns to ignore it
• Sensitization – animal learns strong stimulus is dangerous and enhances its defensive reflexes
• Classical Conditioning – pair neutral stimulus with potentially dangerous stimulus, animal learns to respond to neutral stimulus
• Long-term memory
• Proteins must be made - DNA makes RNA, and RNA makes protein
• The serotonin itself was able to make new synapses grow via synthesis of proteins in the nucleus
• Genes
• There are effector genes which mediate cellular functions and regulatory genes which switch them on and off
• Genes have regulatory regions (promoters and repressor-binding sites); regulatory proteins must bind to these to switch expression on or off
• With repeated pulses of serotonin, protein kinase A moves into the nucleus and turns on a regulatory protein called CREB, which switches on some genes
• Also requires turning off other genes
• MAP kinase also moves into the nucleus and suppresses CREB-2
• Together, activating CREB and deactivating CREB-2 transfers short-term memories to long-term
• Synaptic marking – the proteins produced in the nucleus know which synapses to go to because of their short-term changes
• Proteins must be synthesized locally at the synapses
• Dormant mRNA is sent to all the synapses
• There is a protein called CPEB that is activated by serotonin and is required by the synapses to maintain protein synthesis
• Resembles a prion – special protein with dominant and recessive conformation, dominant can be harmful
• Dominant is self-perpetuating – turns recessive into dominant
• Explicit memory - depends on the elaborate neural circuitry of the hippocampus and the medial temporal lobe, and it has many more possible storage sites.
• Long-term potentiation - synaptic strengthening mechanism in the hippocampus
• long-term potentiation describes a family of slightly different mechanisms, each of which increases the strength of the synapse in response to different rates and patterns of stimulation
• glutamate acts on two different types of ionotropic receptors in the hippocampus: the AMPA receptor and the NMDA receptor
• the AMPA receptor mediates normal synaptic transmission and responds to an individual action potential in the presynaptic neuron
• the NMDA receptor responds only to extraordinarily rapid trains of stimuli and is required for long-term potentiation
• the flow of calcium ions into the postsynaptic cell acts as a second messenger (much as cyclic AMP does), triggering long-term potentiation
• the NMDA receptor allows calcium ions to flow through its channel if and only if it detects the coincidence of two neural events, one presynaptic and the other postsynaptic: the presynaptic neuron must be active and release glutamate, and the AMPA receptor in the postsynaptic cell must bind glutamate and depolarize the cell
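The AMPA/NMDA division of labor amounts to a logical AND gate. A minimal boolean sketch (the function names are mine, and the Mg2+ block detail is standard textbook physiology rather than something stated in these notes):

```python
def ampa_current(glutamate_bound: bool) -> bool:
    """AMPA receptors conduct whenever presynaptic glutamate binds."""
    return glutamate_bound

def nmda_calcium_flux(glutamate_bound: bool, postsynaptic_depolarized: bool) -> bool:
    """NMDA receptors pass Ca2+ only on the coincidence of a presynaptic
    event (glutamate release) and a postsynaptic event (depolarization,
    which relieves the channel's Mg2+ block)."""
    return glutamate_bound and postsynaptic_depolarized

# Ca2+ (the trigger for long-term potentiation) flows only in the last case
cases = [(False, False), (True, False), (False, True), (True, True)]
ltp_triggered = [nmda_calcium_flux(g, d) for g, d in cases]
```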
• explicit memory in the mammalian brain, unlike implicit memory in Aplysia or Drosophila, requires several gene regulators in addition to CREB
• Cognitive science - Kantian notion that the brain is born with a priori knowledge, “knowledge that is . . . independent of experience.”
• Visual system
• Different layers respond to different things - each cell in the primary visual cortex responds only to a specific orientation of such light-dark contours
• The brain does not simply take the raw data that it receives through the senses and reproduce it faithfully. Instead, each sensory system first analyzes and deconstructs, then restructures the raw, incoming information according to its own built-in connections and rules
• What and where are different neural pathways
• there is no single cortical area to which all other cortical areas report exclusively, either in the visual or in any other system. In sum, the cortex must be using a different strategy for generating the integrated visual image.
• Spatial Map
• the hippocampus of rats contains a representation—a map—of external space and that the units of that map are the pyramidal cells of the hippocampus, referred to as “place cells.”
• The brain sometimes codes with coordinates centered on the receiver and sometimes relative to the outside world
• Attention acts as a filter, selecting some objects for further processing
• A modulating circuit involving dopamine in the hippocampus forms spatial maps
• The dopamine comes from the cerebral cortex (a higher level part of the brain)
• Eric Kandel tried to translate his research into a cure for age-related memory loss
• Alzheimer’s: This degeneration of tissue is caused in large part by the accumulation of an abnormal material known as β-amyloid in the form of insoluble plaques in the spaces between brain cells.
• We found that drugs which activate these dopamine receptors, and thereby increase cyclic AMP, overcome the deficit in the late phase of long-term potentiation. They also reverse the hippocampus-dependent memory deficit.
• Various disorders are being solved slowly through research
• Consciousness
• consciousness in people is an awareness of self, an awareness of being aware.
• The brain does reconstruct our perception of an object, but the object perceived—the color blue or middle C on the piano—appears to correspond to the physical properties of the wavelength of the reflected light or the frequency of the emitted sound
• Claustrum is connected to a bunch of different brain parts – could regulate attention
• viewing frightening stimuli activates two different brain systems, one that involves conscious, presumably top-down attention and one that involves unconscious, bottom-up attention, or vigilance
• readiness potential can measure what a person is going to do before they know they want to do it
• Motor system

# 16 lower

• sensory in dorsal spinal cord, motor in ventral
• farther out neurons control farther out body parts (medial=trunk, lateral=arms,legs)
• one motor neuron (MN) innervates multiple fibers
• the more fibers/neuron, the less precise
• MN pool - group of MNs=motor units
• muscle tone = all your muscles are a little on, kind of like turning on the car engine and when you want to, you can move forward
• more firing = more contraction
• MN types
1. fast fatiguable - white muscle
2. fast fatigue-resistant
3. slow - red muscles, make atp
• muscles are innervated by a proportion of these MNs
• reflex
• whenever you get positive signal on one side, also get negative on other
• flexor - curl in (bicep)
• extensor - extend (tricep)
1. proprioceptors (+) - measure length - more you stretch, more firing of alpha MN to contract
• intrafusal muscle=spindle - stretches the proprioceptor so that it can measure even when muscle is already stretched
• $\gamma$ motor neuron - adjusts intrafusal muscles until they are just right
• keeps muscles tight so you know how much muscle is stretched
• if alpha fires a lot, gamma will increase as well
• high gamma allows for fast responsiveness - brainstem modulators (serotonin) also do this
• opposes muscle stretch to keep it fixed
• spindle -> activates muscles -> contracts -> turns off
• sensory neurons / gamma MNs innervate muscle spindle
• homonymous MNs go into same muscle, antagonistic muscle pushes other way
1. golgi tendon (-) measures pressure not stretch
• safety switch
• inhibits homonymous neuron so you don’t rip muscle off
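The spindle/alpha-MN loop above is a negative-feedback controller: stretch is sensed, and contraction proportional to the error opposes it. A toy sketch (gain stands in for gamma-drive sensitivity; all numbers are illustrative, not physiological):

```python
def stretch_reflex(length, target, gain=0.5, steps=50):
    """Iteratively oppose muscle stretch: the spindle reports the error,
    and the alpha MN contracts in proportion, pulling the muscle back
    toward its target length."""
    for _ in range(steps):
        error = length - target      # spindle: how stretched are we?
        length -= gain * error       # alpha MN contracts proportionally
    return length

# a sudden stretch to 1.2x is reflexively pulled back to the 1.0 target
corrected = stretch_reflex(1.2, 1.0)
# higher gain (more gamma drive) corrects faster within a few steps
fast = stretch_reflex(1.2, 1.0, gain=0.8, steps=5)
slow = stretch_reflex(1.2, 1.0, gain=0.1, steps=5)
```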
• ALS = Lou Gehrig’s disease
• MNs are degenerating - reflexes don’t work
• progressive loss of $\alpha$ MNs
• the motor neurons for the eye muscles (ex. superior rectus) are the last to go -> people use eyes to talk with a tracker
• CPG = central pattern generator
• ex. step on pin, lift up leg
• walking works even if you cut cat’s spinal cord
• collection of interneurons

# 17 upper

• cAMP is used by GPCR
• lift and hold circuit
1. ctx->lateral white matter->lateral ventral horn->limb muscles
• lateral white matter - most sensitive to injury
2. brainstem->medial white matter->medial horn->trunk
• medial white matter -> goes into trunk
• bulbospinal tracts
1. lateral and medial vestibulospinal tracts - feedback
• automated system - not much thinking
• posture - reflex
• too slow for learning surfing
2. reticular - feedforward = anticipate things before they happen
• command / control system for trunk muscles (posture)
• feedforward - not a reflex, lean back before opening drawer
• caudal pontine - feeds into spinal cord
3. colliculospinal tract
• has superior colliculus - eye muscles, neck-looking
• see ch. 20 - reflex
• corticobulbar tract (premotor->primary motor->brainstem)
• motor cortexes - this info is descending
• can override reticular reflexes in reticular formation
• premotor cortex (P2) - contains all actions you can do
• has mirror neurons that fire ahead of primary neurons
• fire if you think about it or if you do it
• primary motor cortex (P1)
• layer 1 ascending
• layer 4 input
• layer 5 - Betz cells - behave like 6 (output)
• layer 6 - descending output
• has map like S1 does
• Jacksonian march - seizure that spreads from feet to face (usually one side)
• epileptic seizure - neurons fire too much and fire neurons near them
• insular - flashes of moods
• pyriform - flashes of smells
• Betz cells - if they fire, you will do something
• dictate a goal, not single neuron to fire
• axons to ventral horn of spinal cord
• lesions
1. upper
• spasticity - unorganized leg motions
• increased tone - tight muscles
• hyperactive deep reflexes
• ex. Babinski’s sign - toes fan upward instead of curling down
• curling the foot down = normal plantar response
• more serotonin can cause this
2. lower
• hypoactive deep reflexes
• decreased tone
• severe muscle atrophy
• pathways
• Betz cell
• 90% cross midline in brainstem - control limbs
• 10% don’t cross - trunk muscles

# 18 basal ganglia (choose what you want to do)

• “who you are”
• outputs
1. brainstem
2. motor cortex
• 4 loops (last 2 aren’t really covered)
• motor loops
1. body movement loop
• SnC -> S (CP) -> (-) Gp -> (-) VA/VL -> motor cortex
2. oculomotor loop
• cortex -> caudate -> substantia nigra pars reticulata -> superior colliculus
• non-motor loops
1. prefrontal loop - daydreaming (higher-order function)
• spiny neurons corresponding to a silly idea (alien coming after you) filtered out because not fired enough
• schizophrenia - can’t filter that out
2. limbic loop - mood
• has nucleus accumbens
• can make mood better with dopamine
• substantia nigra
1. pars compacta - dopaminergic neurons (input to striatum)
• more dopamine = more strength between cortical pyramidal neurons and spiny neurons (turns up the gain)
• dopamine helps activate a spiny neuron
• may be the ones that learn (positive outcome is saved, will result in more dopamine later)
• Parkinson’s - specific loss of dopaminergic neurons
• dopaminergic neurons form melanin = dark color
• when you get down to 20% what you were born with
• know what they need to do - don’t have enough dopamine to act
• treat with L Dopa -> something like dopamine -> take out globus pallidus
• cocaine, amphetamine - too much dopamine
• Huntington’s - death of specific class of spiny neurons
• have uncontrolled actions
• Tourette’s - too much dopamine
• also alcohol
• MPPP (synthetic heroin)
• MPTP looks like dopamine but is converted to MPP+ and kills dopaminergic neurons
• treated with L Dopa to reactivate spiny neurons
2. pars reticulata
• doesn’t have dopamine (output from striatum)
• striatum - contains spiny neurons
• caudate (for vision) - output to globus pallidus and substantia nigra (pars reticulata)
• putamen - output only to globus pallidus
• each spiny neuron gets input from ~1000 cortical pyramidal cells
1. globus pallidus
• each spiny neuron connects to one globus pallidus neuron
• deja vu - a spiny neuron you haven’t fired in a while
2. VA/VL (thalamus)
• all motor actions must go through here before cortex
• has a series of commands for all actions you can do
• has a parallel set of Betz cells that will elicit those actions
• VA/VL is always firing; globus pallidus inhibits it (tonic connection)
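The body-movement loop above chains two inhibitory links (striatum inhibits globus pallidus, which inhibits VA/VL), so striatal firing disinhibits the thalamus. A sign-propagation sketch in arbitrary units (the rectified-subtraction model is my simplification):

```python
def thalamic_drive(striatal_activity, gp_tonic=1.0, thalamus_tonic=1.0):
    """Striatum inhibits the tonically active globus pallidus, which in
    turn tonically inhibits VA/VL: two minuses make a plus."""
    gp_activity = max(0.0, gp_tonic - striatal_activity)
    return max(0.0, thalamus_tonic - gp_activity)

rest = thalamic_drive(0.0)     # silent striatum: thalamus held down
moving = thalamic_drive(1.0)   # active striatum: thalamus released
```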

# 19 cerebellum (fine tuning all your motion)

• redundant system - cortex could do all of this, but would be slow
• repeated circuit - interesting for neuroscientists
• all info comes in, gets processed and goes back out
• cerebellum gets a motor efference copy
• all structures in your brain that do processing send out efferents
• cerebellum sends an efference copy back to itself with a time delay (through the inferior olive)
1. cerebrocerebellum
• deals with premotor cortex (mostly motor cortex)
2. spinocerebellum = Clarke’s nucleus, knows stretch of every muscle, many proprioceptors go straight into here
• motor cortex
• has a map of muscles
3. vestibular cerebellum - vestibular->cerebellum->vestibular
• vestibular system leans you back but if wind blows, have to adjust to that
• input
• pontine nuclei (from cortex)
• vestibular nuclei (balance)
• cuneate nucleus (somatosensory from spinal upper body)
• clarke (proprio from spinal lower body)
• processing
• cerebellar deep nuclei
• output
• deep cerebellar nuclei
• go to superior colliculus, reticular formation
• VA/VL (thalamus) - back to cortex
• red nucleus
• circuit 1 - fine-tuning
• circuit 2 - detects differences, adjusts
• cerebellum -> red nucleus (carries an efference copy) -> inferior olive -> cerebellum
• compare new copy to old copy
• cells
• purkinje cells - huge number of dendrite branches - dendritic tree is nearly flat (planar), which allows good imaging
• GABAergic
• (input) mossy fibers -(+)> granule cells (send parallel fibers) -(+)> purkinje cell -(-)> deep cerebellar nuclei (output)
1. mossy fibers -> granule cells -> parallel fibers; each purkinje cell receives ~100,000 parallel fiber inputs
2. climbing fiber - comes from inferior olive and goes back to purkinje cell (this is the efferent copy) = training signal
• loops
• deep excitatory loop (climbing/mossy) -(+)-> deep cerebellar nuclei
• cortical inhibitory loop (climbing/granule) -(+)-> purkinje
• the negative is from purkinje to deep cerebellar nuclei
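The climbing fiber’s role as a training signal is often abstracted as a Marr-Albus-style learning rule: parallel-fiber synapses that were active when the climbing fiber reports an error get depressed. A minimal sketch (the rule’s exact form and the learning rate are textbook abstractions, not from these notes):

```python
def update_weights(weights, pf_active, climbing_fiber_error, lr=0.1):
    """Depress the parallel-fiber -> Purkinje synapses that were active
    while the climbing fiber signaled an error (cerebellar LTD)."""
    return [w - lr * a * climbing_fiber_error
            for w, a in zip(weights, pf_active)]

weights = [1.0, 1.0, 1.0]
pf_active = [1.0, 0.0, 1.0]   # parallel fibers 0 and 2 fired
weights = update_weights(weights, pf_active, climbing_fiber_error=1.0)
# only the synapses active during the error are weakened
```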
• alcohol
• can create gaps between the folia
• long-term use causes degeneration = ataxia (lack of coordination)

# 20 eye movements/integration

• Broca’s view - look at people with problems
• Ramon y Cajal - look at circuits
• 5 kinds of eye movements
• use cortex, superior colliculus (visual info -> LGN -> cortex, 10% goes to brainstem)
1. saccades - constantly moving eyes around (fovea)
• ~scan at 30 Hz
• 5 Hz=200 ms for cortex to process so pause eyes (get 5-6 images)
• there is a little bit of drift
• can’t control this
• humans are better than other animals at seeing things that aren’t moving
2. VOR - vestibular ocular reflex - keeps eyes still
• use vestibular system, occurs in comatose
• fast
• works better if you move your head fast
3. optokinetic system - tracks with eyes
• ex. stick head out window of car and track objects as they go by
• slower than VOR (takes 200 ms)
• works better if slower
• reflex
• in cortex (textbooks) but probs brainstem (new)
4. smooth pursuit - can track things moving very fast
• suppress saccades and track smoothly
• only in higher apes
• area MT is highest area of motion coding and goes up and comes down multiple ways
• high speed processing isn’t understood
• could be retina processing
5. vergence - crossing your eyes
• suppresses conjugate eye movements
• we can control this
• only humans - bring objects up very close
• reading uses this
• eye muscles
• rectus
• vertical
• superior
• inferior
• use complicated vertical gaze center
• last to degenerate in ALS
• locked-in syndrome - can only move eyes vertically
• controls oculomotor nucleus
• lateral
• medial
• lateral (controlled by abducens)
• use horizontal gaze center = PPRF, which talks to the abducens - MLF connects the abducens to the opposite medial rectus muscle
• oblique - more circular motions
• superior (controlled by trochlear nucleus)
• inferior
• everything else controlled by oculomotor nucleus
• superior colliculus has visual map
• controls saccades, connects to gaze centers
• takes input from basal ganglia (oculomotor loop)
• also gets audio input from inferior colliculus (hear someone behind you and turn)
• gets strokes
• redundant with frontal eye field in secondary motor cortex
• connects to superior colliculus, gaze center, and comes back
• if you lose one of these, the other will replace it
• if you lose both, can’t saccade to that side

# 21 visceral (how you control organs, stress levels, etc.)

• parasympathetic works against sympathetic
• divisions
1. sympathetic - fight-or-flight (adrenaline)
• functions
• neurons to smooth muscle
• pupils dilate
• increases heart rate
• turn off digestive system
• 2 things with no parasympathetic counterpart
• increase BP
• sweat glands
• location
• neurons in spinal cord lateral horn
• send out neurons to sympathetic trunk (along the spinal cord)
• all outgoing connections are adrenergic
• beta agonist - activates adrenaline receptors (do this before EKG)
2. parasympathetic - relaxing (ACh)
• location
1. brainstem
1. edinger westphal nucleus - pupil-constriction
2. salivatory nucleus
3. vagus nucleus - digestive system, sexual function
4. nucleus ambiguus - heart
5. nucleus of the solitary tract
• all input/output goes through this
1. rostral part (front) - taste neurons
2. caudal part (back) contains all sensory information of viscera (ex. BP, heart rate, sexual function)
2. sacral spinal cord (bottom) - gut/bladder/genitals
• not parallel to sympathetic – poor design - may cause stress-associated diseases
• hard to make drugs with ACh
3. enteric nervous system - in your gut
• takes input through vagus nerve from vagus nucleus
• also has sensory neurons and sends afferents back to brainstem
• pathway
• insular cortex - what you care about
• amygdala - contains emotional memories
• hypothalamus - controls a lot
• mostly peptidergic neurons
• aging, digestion, mood, straight to bloodstream & CNS
• releases hormones
• ex. leptin - signals satiety, stops you eating once you’ve taken in enough calories
• reticular formation - feedforward, prepares digestion before we eat
• three examples
1. heart rate
• starts at nucleus ambiguus
• also takes input from chemoreceptors (ex. pH)
• SA node at heart generates heartbeat - balances Ach and adrenaline
• sympathetic sends info from thoracic spinal cord
• heart sends back baroreceptor afferents
2. micturition (peeing)
1. parasympathetic neurons in sacral lateral horn contract the bladder
2. turn off sympathetic NS
3. open sphincter muscle (voluntary)
• can also control this via skeletal nervous system
• circuit
• amygdala (can’t pee when nervous)
• pontine micturation center -> parasympathetic preganglionic neurons -> parasympathetic ganglionic neurons
• inhibitory local circuit neurons -> somatic MNs
3. sexual function
• Viagra turns on parasympathetic NS
• also gives temporary color blindness
• sympathetic involved in ejaculation
• temporal correlation (“Point and Shoot”)
• Neural Signaling

# 1 - introduction

## genomics

• male Drosophila uses body position and environment to add rhythmic notes to song
• female uses this to gauge male’s brain
• neural circuits make up neural systems
• neural systems serve 3 general functions
1. sensory systems
2. motor systems
3. associational systems - link the other two, higher order functions
• a gene has coding DNA (exons), non-coding DNA (introns), and regulatory DNA that controls expression
• model organisms
• cat - visual cortex
• squid and sea slug have really large neurons
• 4 species: worm C. elegans, Drosophila, zebrafish Danio rerio, mouse Mus musculus
• complete genome is available for them
• can try homologous recombination - splicing in new genes
• human genome has ~20k genes, ~14k expressed in brain, ~6k expressed only in brain
• single-gene mutations can cause diseases like microcephaly
• simulate brain as a computer
• passive cabling equation
• theoretical neuroscience
• blue brain project
• human brain project

## cellular components

• neuron doctrine by Ramon y Cajal / Sherrington replaces Golgi’s reticular theory
• Cajal uses Golgi’s silver-staining method to show neurons are distinct
• Sherrington finds synapses
• there are rare gap junctions between neurons (where there are electrical synapses)
1. neurons
• number of inputs reflects convergence
• number of targets reflects divergence
• local circuit neurons = interneurons - short axons
• projection neurons - long axons
1. glial cells - support and regeneration
• outnumber neurons 3:1
• they are stem cells - can generate new glia
• maintaining ion gradients, modulating nerve signals, modulating synaptic action, scaffolding, aiding recovery
1. astrocytes - in CNS, maintain chemical environment, retain stem cell properties
2. oligodendrocytes - in CNS, lay down myelin - in PNS, Schwann cells do this
3. microglial cells - remove debris
• glial stem cells - make more glia and sometimes neurons

## cellular diversity

• ~10^11 neurons, more glia
• histology - microscopic analysis of cells
• Nissl method - stains the nucleolus

## neural circuits

• neuropil - bundle of dendrites/axons/glia - where synaptic connectivity occurs
• afferent neuron - carries info toward the brain
• efferent neuron - carries info away
• myotatic reflex example - knee-jerk

## organization of the human nervous system

1. sensory systems
2. motor systems

3. CNS
• brain
• spinal cord
4. PNS
• sensory neurons
• somatic motor division - connect CNS to skeletal muscles
• autonomic or visceral motor division - innervate muscles / glands
• autonomic ganglia - peripheral motor neurons that take inputs from brainstem / spinal cord
• enteric system - small ganglia / neurons in gut that influence gastric motility and secretion
• sympathetic division - ganglia lie near the vertebral column and send their axons to a variety of targets
• parasympathetic division - ganglia are found near organs they innervate
• groupings
• ganglia - accumulations of cell bodies / supporting cells
• nerve - bundles of axons in PNS
• tract - bundles of CNS axons
• if they cross the brain midline they are called commissures
• nuclei - local accumulations of similar neurons
• cortex - sheet-like arrays of neurons
• gray matter - has more cell bodies
• white matter - has more axons

## neural systems

1. unity of function - divide things into different systems - ex. visual
2. representation of information - ex. vision can be topographic map, taste can be computational map
3. subdivision into subsystems - ex. vision has color, form, motion, all in parallel

## structural analysis

• lesion studies are often used
• anterograde - source to termination
• vs retrograde - terminus to source

## functional analysis of neural systems

1. electrophysiological recording - uses electrodes, neuron-level
• can determine receptive field - region in sensory space where a specific stimulus elicits a spike
2. functional brain imaging - noninvasive, records local activity
• computerized tomography (CT), magnetic resonance imaging (MRI), diffusion tensor imaging (DTI), PET, SPECT, fMRI, MEG, MSI

## analyzing complex behavior

• cognitive neuroscience - understanding higher-order functions
• neuroethology - complex behaviors of animals

# 2 - electrical signals of nerve cells

• microelectrode - fine glass tubing filled with good conductor
• all cells have a voltage difference across them
• assume resting potential - we’ll use -58 mV
• depolarized - less negative; a spike can overshoot to about +58 mV (the Na+ Nernst potential)
• potentials
1. receptor potential - (small) due to the activation of sensory neurons by external stimuli
• at terminal of dendrite
• graded - depends on how strong the input is
2. synaptic potential - (small) caused by activation of synapse
• at middle of dendrite
3. action potential - cause by the other 2
• at the axon
• active transporters create differences in concentrations of specific ions - battery
• cells can be depolarized by adding too much K+ outside
• ion channels - make membranes selectively permeable - wires
• ions
• outside: high Na, Cl; low K
• generally 10:1 ratio between inside, outside
• inside of cell has a bunch of negative proteins to balance chloride
• ions want to spread out because of entropy (then they factor in charge difference)
• $V_{ion} = \frac{58}{z} \log_{10}\left(\frac{[X]_{out}}{[X]_{in}}\right)$ mV
• calculate for each ion, if able to move
• z is charge on ion
• for potassium: $\frac{58}{1}\log_{10}(0.1) = -58$ mV
• Cl Nernst potential is actually -70 mV (even though we assumed -58 before)
• Cl works as an inhibitor - ex. alcohol lets chloride in
• whichever ion leaks, this determines the membrane potential
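The 58 mV form of the Nernst equation used above is easy to check numerically (the helper name is mine; 58 mV/decade corresponds to warm room temperature):

```python
import math

def nernst(x_out, x_in, z=1):
    """Equilibrium potential in mV: V = (58/z) * log10([X]out / [X]in)."""
    return 58.0 / z * math.log10(x_out / x_in)

# potassium with the notes' 10:1 inside:outside ratio
v_k = nernst(x_out=1.0, x_in=10.0)    # about -58 mV
# sodium is concentrated outside, so its potential is positive
v_na = nernst(x_out=10.0, x_in=1.0)   # about +58 mV
# a divalent ion (z = 2) like Ca2+ gets half the slope
v_ca = nernst(x_out=10.0, x_in=1.0, z=2)
```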
• hodgkin-huxley
• large squid escape neurons
• adding K+ outside depolarizes the cell
• adding Na+ outside raises height of spike
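Hodgkin and Huxley’s measurements are summarized by their famous model. A minimal forward-Euler sketch using the standard textbook squid-axon parameters and rate functions (these constants are assumed here, not taken from the notes):

```python
import math

# classic Hodgkin-Huxley squid-axon constants (standard textbook values)
C_M = 1.0                             # membrane capacitance, uF/cm^2
G_NA, G_K, G_L = 120.0, 36.0, 0.3     # max conductances, mS/cm^2
E_NA, E_K, E_L = 50.0, -77.0, -54.4   # reversal potentials, mV

def a_n(v): return 0.01 * (v + 55) / (1 - math.exp(-(v + 55) / 10))
def b_n(v): return 0.125 * math.exp(-(v + 65) / 80)
def a_m(v): return 0.1 * (v + 40) / (1 - math.exp(-(v + 40) / 10))
def b_m(v): return 4.0 * math.exp(-(v + 65) / 18)
def a_h(v): return 0.07 * math.exp(-(v + 65) / 20)
def b_h(v): return 1.0 / (1 + math.exp(-(v + 35) / 10))

def simulate(i_inj=10.0, t_max=50.0, dt=0.01):
    """Forward-Euler integration of the HH equations under a constant
    injected current (uA/cm^2). Returns the voltage trace in mV."""
    v = -65.0
    n = a_n(v) / (a_n(v) + b_n(v))    # start gates at steady state
    m = a_m(v) / (a_m(v) + b_m(v))
    h = a_h(v) / (a_h(v) + b_h(v))
    trace = [v]
    for _ in range(int(t_max / dt)):
        i_na = G_NA * m**3 * h * (v - E_NA)   # transient Na+ current
        i_k = G_K * n**4 * (v - E_K)          # delayed-rectifier K+
        i_l = G_L * (v - E_L)                 # leak
        v += dt * (i_inj - i_na - i_k - i_l) / C_M
        n += dt * (a_n(v) * (1 - n) - b_n(v) * n)
        m += dt * (a_m(v) * (1 - m) - b_m(v) * m)
        h += dt * (a_h(v) * (1 - h) - b_h(v) * h)
        trace.append(v)
    return trace

trace = simulate()
```

With ~10 µA/cm² of injected current the model fires repetitive all-or-none spikes that overshoot 0 mV.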

# 3 - voltage-dependent membranes

• voltage clamp - one electrode outside, one inside
• measured with reference to outside (usually more negative inside)
• keeps voltage constant
• current clamp - holds the injected current constant (often zero) and just measures the voltage
• patch-clamp - suction part of cell into pipette
• passive properties
• current injection: $V_t = V_{\infty} (1-e^{-t/ \tau})$
• voltage decay: $V_t = V_{\infty} e^{-t / \tau}$
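The two passive equations above can be evaluated directly; tau here is an illustrative 10 ms membrane time constant:

```python
import math

def charging(v_inf, t, tau):
    """During current injection: V(t) = V_inf * (1 - e^(-t/tau))."""
    return v_inf * (1 - math.exp(-t / tau))

def decay(v0, t, tau):
    """After the current stops: V(t) = V0 * e^(-t/tau)."""
    return v0 * math.exp(-t / tau)

tau = 10.0                           # ms, illustrative
v_rise = charging(10.0, tau, tau)    # one tau: ~63% of the step
v_fall = decay(10.0, tau, tau)       # one tau: ~37% remains
```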
• block Na+ current with tetrodotoxin
• block K+ current with tetraethyl-ammonium
• refractory period occurs because Na+ channels need time to recover from inactivation
• Na+ is transient, K+ is not
• myelin insulates sections - less ion loss
• called saltatory conduction
• faster and more efficient
• concentrates action potential to nodes
• without myelin ~10 m/s; with myelin ~100 m/s
• deals with JAM receptor system
• unmyelinated can be ok
• might want to regenerate
• don’t care about speed
• PNS
• Schwann cells
• loss - Guillain-Barré syndrome
• CNS
• oligodendrocytes
• loss - multiple sclerosis (MS)

# 4 - ion channel transporters

• patch electrode - pull a piece of membrane out
1. cell-attached - don’t break membrane
2. whole-cell - break membrane
3. inside-out - inside of membrane is outside electrode
4. outside-out - outside of membrane is outside electrode - this method is always preferred
• tetrodotoxin binds to outside of cell membrane
• some channels are delayed
• self-inactivating = transient - channels turn off by themselves
• take 10-20 ms
• voltage-gated channels
• Na+, K+, Ca, Cl
• ion channels are studied in frog Xenopus oocytes
• TRP channels gated by mechanical / heat
• 4 K+ channels
1. delayed rectifier
2. fast acting - shapes AP, used for hearing
3. late phase - slow ending, makes AP fire again
4. inward rectifier - open at resting potential - establishes membrane potential
• mitten model - protein rotates around
• human genes: ~10 Na, ~10 Ca, ~100 K, ~5 Cl
• there are more types of potassium channels
• channelopathies - diseases can be caused by altered ion channels
• ion transporters
• ATPase Pumps
• Na+/K+ pump
• Ca pump
• ion exchangers
• Na+/K+ pump exchanges 3Na for 2K ions
• 1/3 of body’s energy
• 2/3 of neuron’s energy
• brains use 20% of body’s energy
• Ouabain blocks this
• SERT - Na transporter
• co-transporter
• ligand-gated channels
• respond to a chemical
• usually allow Na, K, Cl to flow in and out

# 5 - synapses

## synapse types

1. electrical synapses
• gap junction channels
• ions flow through gap junction channels
• presynaptic and postsynaptic cell are almost the same
• delay is fast (0.1 ms)
• gap junction proteins have been showing up in diseases
• simple organisms have these
2. chemical synapses
• bouton - end of presynaptic axon
• spines - start of postsynaptic dendrite
• voltage-gated Ca comes in and causes vesicles to fuse with presynaptic membrane
• neurotransmitter released
• bind to ligand-gated channels which let ions flow through
• if Cl flows into postsynaptic cell - inhibition
• pumps get rid of neurotransmitters quickly

## neurotransmitter types

• released when Ca comes in due to depolarization
1. peptides
• ex. oxytocin
• require long Ca exposure
• loaded into vesicles up by the cell body - can take days to get to bouton
• neurotransmitter diffuses away - doesn’t always have specific target
• can spread to all neurons in the area (ex. substance P)
2. small & fast
• glutamate, Ach, GABA
• loaded into vesicles in bouton
• presynaptic cell takes these back up

## discovery

• neurotransmitter lifecycle
• synthesis -> receptors -> function -> removal
• important that they are removed
• 60 s to recycle
• Loewi’s experiment showed that neurotransmitters can flow through solution to synchronize heart

## synaptic transmission

• minis = MEPP - not big enough to fire the neuron
• you can treat a muscle as a post-synaptic junction
• chatter from single vesicle release
• quantal basis of neurotransmitter release - 1,2,3,etc because vesicles release as all-or-none
• synapses / vesicles are all about the same size across different species
• receptors receive these neurotransmitters differently
• each vesicle is covered with proteins
• SNAPs on plasma membrane
• SNAREs on vesicle
• ex. synaptobrevin
• botulinum toxin, tetanus toxin - block synaptobrevin - clip other proteins
• render a vesicle inactive
• they recognize each other and lock for priming - ready to release when Ca+ enters
• need to endocytose membrane to make more vesicles
• endocytosis includes clathrin which curves the membrane
• can measure single ligand-gated channel by clamping it alone
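The quantal, all-or-none release above can be sketched as a binomial draw over release sites — n, p, and the quantal size q are made-up illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative parameters: n release sites, release probability p,
# quantal size q (mV of postsynaptic response per vesicle)
n_sites, p_release, q = 10, 0.3, 0.4

# Each stimulus releases an integer number of vesicles (all-or-none), so
# EPP amplitudes cluster at multiples of the quantal size q
vesicles = rng.binomial(n_sites, p_release, size=5000)
epp = vesicles * q
print(epp.mean(), n_sites * p_release * q)  # mean amplitude ~= n*p*q
```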

# 6 - neurotransmitters

## receptors

• ionotropic

| Name | AMPA GluR-x | NMDA NR-x | Kainate | GABA-A | Glycine | Nicotinic Ach | 5HT-3 | P2X purinergic |
|--------|-------------|-----------|---------|--------|---------|---------------|-----------|----------------|
| Abbrev | AMPA | NMDA | Kainate | GABA | Glycine | nACh | Serotonin | Purines |
| Ion | Na | Na/Ca | Na | Cl | Cl | Na | Na | Na |

• metabotropic

| Name | mGlu-x | GABA-B | D-x | alpha/beta adrenergic | H-x | 5HT1-7 | Purinergic A or P | Ach Muscarinic-x |
|----------|-----------|--------|---------------------|-----------------------|-------------------------|----------------------|---------|----------------|
| Abbrev | Glutamate | GABA_B | Dopamine | NE, Epi | Histamine | Serotonin | Purines | Muscarinic |
| Function | | | cocaine, ADHD drugs | antianxiety | unknown, probably sleep | 5HT-3 is ionotropic! | | mushroom drugs |

## excitatory

1. Acetylcholine - excites muscle cells
• receptors: nAch
• Acetylcholinesterase breaks down Ach after it is released
• neurotoxins (ex. sarin) inhibit Acetylcholinesterase so Ach stays and muscles stay on (nerve gases)
• Myasthenia gravis is when you develop antibodies against your own nAch receptors
• you have trouble controlling your eyes
2. Glutamate - excites pyramidal cells
• VGLUT pumps Glutamate into vesicles
• EAAT transports extracellular Glutamate into presynaptic terminal / nearby glial cells
• Glial cells convert glutamate to glutamine (inactivates) and glutamine is taken up by the presynaptic terminal
• glutamate overload if you overload the inactivating pumps in the glial cells
• glutamate receptors
• AMPA - fast Na only
• NMDA - slow, Na and Ca and also requires depolarization

## inhibitory

3.-4. GABA / glycine
• have simple transporters that move released GABA into presynaptic terminal / Glia
• Ionotropic GABA receptors - depressants; shut down nervous system
• can bind steroids like estrogen - different effects in men / women
• alcohol binds to this, shuts things down
• immature neurons have high intracellular Cl - people don’t know why

## neuromodulators

• lifecycle molecules
• synthesis: L-DOPA, tryptophan
• reuptake: DAT, NET, SerT
• breakdown: MAO
• vesicular transport: VMAT
1. catecholamines
• pathway: DOPA -> Dopamine -> Norepinephrine -> Epinephrine
• dopamine - forming of memories
• produced by substantia nigra
• loss of these neurons -> Parkinson’s
• norepinephrine
• produced by locus coeruleus
2. serotonin = 5HT - happiness
• Tryptophan -> serotonin
• serotonin produced by Raphe nuclei
• affected by LSD
3. histamine - not well-known
• antihistamines can make you hallucinate
4. ATP - sensitivity to pain
5. neuropeptides - slow
• substance P - pain
• alpha-endorphin - analgesia (block pain)
• vasopressin - blood pressure
• thyrotropin releasing hormone (TRH) - metabolism
• neuropeptide Y - mood/aggression
6. endocannabinoids - weed
• ex. anandamide - binds to CB1 (cannabinoid receptor 1)
• retrograde signal - post back to pre - inhibit the inhibitor
• this increases the signal
7. nitric oxide - gas
• binds to guanylyl cyclase

# 7 - molecular signaling

## localization

• chemical signaling mechanisms
1. synaptic - local
• ex. Ach
2. paracrine - medium distance, neurotransmitter sprinkled and nearby targets take it up
• ex. serotonin
3. endocrine - get into your blood stream - body-wide signaling
• ex. vasopressin
• amplification = enzyme
• signal that activates enzyme amplifies signal
• cell-signaling molecules
1. cell-impermeant molecules
• need transmembrane receptors
• ex. glutamate
2. cell-permeant molecules (steroids)
• can have intracellular receptors
3. signaling molecules
• adhesion molecules - like a lock and key - binds neurons together
• spine has small neck - hard for proteins to go through it
• keeps information local
• large raft of signaling molecules keeps info local

## cellular receptors

1. ionotropic - channel-linked receptors
• neurotransmitter binds to a channel that opens
• ex. Glu ionotropic receptor
• signal is sodium coming in
2. enzyme-linked receptors
• ex. TrkA NGF receptor - Tyrosine kinase
• once it binds, it becomes an enzyme
3. metabotropic - G-protein-coupled receptors
• ex. Glu metabotropic receptor
• activate G-protein that then does something
• these require energy for G-proteins
1. Heterotrimeric G-proteins
• G-protein has 3 subunits: alpha, beta, gamma
1. Gs - binds norepinephrine
• stimulates cAMP
2. Gq - binds glutamate
• DAG (diacylglycerol) & IP3
3. Gi - binds dopamine
• inhibits cAMP
2. Monomeric G-proteins
• G-protein has just one subunit
• ex. Ras
4. intracellular receptors
• ex. estrogen receptors
• activates intracellular transcription factors

## second messengers

• Ca must be pumped out of neuron or
• Ca can be stored into internal stores in ER
• Adenylyl cyclase turns ATP into cAMP -> PKA
• Guanylyl cyclase turns GTP into cGMP -> PKG
• would use inside-out patch to study these
• Phospholipase C -> DAG -> PKC; IP3 lets Ca out of ER (too much Ca usually kills the cell)

## protein control

• kinases (on switch) add phosphate to proteins, making them active
• PKA - cAMP binds then catalytic domain can bind
• non-covalent so can diffuse into nucleus
• CAM kinase
• covalent
• protein kinase C
• covalent - very local to membrane
• phosphatases (off switch) remove phosphates

## long-term alteration

• long-term alteration requires epigenetic changes (change transcription factors)
• transcription factor CREB requires three things at once
1. Ca comes in and binds Cam kinase
2. activate Protein kinase A - this can stay in nucleus for a while
• how much to make
3. MAP kinase, ras
• on/off switch - when all of these come in at once (convergence signaling), CREB will make actin, AMPA receptors
• TrkA binds NGF (nerve growth factor, a peptide) and has 3 pathways:
1. PI 3 kinase
2. ras
3. Phospholipase C
• this displays divergence signaling
• cerebellar synapses
• mGlu inhibits AMPA with negative feedback
• lets out Ca which depresses AMPA receptor
• signal scaling - tyrosine hydroxylase makes dopamine
• more Ca in bouton activates more hydroxylase -> makes more dopamine
• Ca comes in whenever fires
• use it or lose it

# 8 - synaptic plasticity

## short-term

• measured by firing neuron before muscle and recording response
1. synaptic facilitation - frequency-dependent plasticity - fire faster, get bigger EPSPs
• Ca comes in and persists during next pulse
• ms time scale
2. synaptic depression - transmitters are depleted
• s time scale
3. synaptic potentiation/augmentation - changes in presynaptic proteins
• min time scale
• experiments
• habituation - decrease vesicles on sensory neuron after too much stimuli
• sensitization - associate two stimuli together:
• mechanism
• sensory neuron -> serotonergic neuromodulatory interneuron -> motor neuron
• interneuron releases serotonin
• cAMP produced in sensory neuron
• long term - CREB in nucleus - synapse growth
• central signal for LTM
• activates PKA
• short term - decreases K+ current
• sensory neuron doesn’t learn, neuromuscular junction gets stronger
• mutant genes associated with cAMP identified
• phosphodiesterase - if you remove this, too much cAMP

## long-term

• HM lost his memory w/out hippocampus
• site of LTP
• at night memories are moved from hippocampus (flash drive) to cortex (long-term hard drive)
• Schaffer collateral pathway is one pathway in hippocampus (also perforant pathway, mossy fiber)
• pre-stimulate with tetanus
• later when stimulated EPSP is bigger
• usually 20 ms between firing - LTP when multiple firing in less time
• Schaffer mechanism - NMDA receptor key to this
• both AMPA and NMDA exist
• NMDA -> Ca -> CAM kinase -> LTP
• Mg blocks NMDA unless already depolarized
• requires good timing!
• silent synapses
• short-term - more AMPA receptors, long-term new synapses
• all synapses born with only NMDA
• protein synthesis needed for LTP (mostly making more AMPA)
• unclear how synapse decides whether to strengthen / make more synapses
• LTD - long-term depression is opposite of LTP - long-term potentiation
• low levels of Ca lead to AMPA being endocytosed
• mGlu -> PKC -> starts LTD
• epilepsy - neurons fire together and wire together
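The NMDA coincidence requirement above — glutamate bound AND postsynaptic depolarization to relieve the Mg2+ block — as a minimal sketch; the threshold and conductance values are assumed, not physiological:

```python
def nmda_current(glutamate_bound, v_post, g_max=1.0, e_rev=0.0):
    """NMDA conductance needs BOTH presynaptic glutamate and postsynaptic
    depolarization (Mg2+ block relieved) -- a coincidence detector."""
    mg_unblocked = v_post > -50.0        # crude threshold for Mg2+ relief
    if glutamate_bound and mg_unblocked:
        return g_max * (e_rev - v_post)  # inward (positive) current
    return 0.0

print(nmda_current(True, -65.0))   # glutamate alone: still blocked -> 0
print(nmda_current(False, -30.0))  # depolarized alone: no ligand -> 0
print(nmda_current(True, -30.0))   # both at once -> Ca-carrying current flows
```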

# 9- somatosensory

## cheat sheet

• vocab
• nerve - bundle of axons
• tract - bundle of axons in CNS
• nucleus - bundle of neurons related to some function
• midline - center of nervous system
• brain tends to be lateralized - one side is given control
• ex. speak almost exclusively from left side of brain
• information processing
1. feedback (gain)
• almost always with glutamatergic / GABA
2. feedforward - anticipation
• estimate things before they happen
• adjust your behavior in advance of the world (ex. lean before you hit a table)
3. center-surround inhibition (spatial gain)
• if you touch yourself, brain enhances sensitivity of one point by suppressing information from around it

## sensory system overview

• we have dorsal root ganglia (DRG) on spinal cord
• axon goes to CNS
• dendrites go everywhere
• pseudounipolar - born bipolar but become unipolar
• dendrite goes straight into axon with cell body off to the side
• do very little processing
• dorsal horn - top layer that controls sensory information
• in the brain stem, these are called cranial ganglia
• special one is trigeminal ganglia (sensory receptors for face)
• oxytocin important clinically
• Trp channels - connected mechanically into membrane
• dermatomes
• map of sensory parts to brain
• segments of spinal cord correspond to stripes across your body
• brain to feet: cervical, thoracic, lumbar, sacral
• shingles - virus where you get stripes of sores - single DRG
• pops out the skin on the dendrite of one DRG
• peripheral damage won’t give you stripes of pain
• feeling resolution - depends on density of neurons innervating skin
• more neurons - small receptive fields
• two-point discrimination test - poke you at different points and see if you can tell if the points are different
• higher discrimination is better
• discrimination is different from sensitivity (like how it hurts when wounded)

## 4 neuron classes

• they have certain structures that tune them into certain kinds of vibrations
1. Proprioception
1. muscle spindles - on every muscle - fastest
• measures stretch on every muscle
• lets you know where your arm is
2. Golgi tendon organ
• measures tension on tendon
• safety switches - numb your body if you’re over-stressing something (make you let go of hanging on cliff)
2. Ia II - touch neurons
• superficial - most sensitive
1. Merkel: hi-res, slow adapt
2. Meissner: hi-res, fast adapt
• deeper - sense vibrations, pressure
1. Ruffini: low-res, slow adapt
2. Pacinian: low-res, fast adapt
• these are in order of depth
• diabetes - tissue loss and pain / numbness are lost
3. Adelta - fast pain
4. C fibers - pain, temperature, itch
• very slow, stay on
• no myelination
• Pruritus (itch) - newly discovered set of sensory neurons
• between pain/touch - itch neurons
• new in mice: massage neurons
• can only fire by stimulating in certain pattern
• goes to emotion center not knowledge - pleasure
• speed proportional to diameter, myelination
• some adapt slowly (you keep feeling something)
• some adapt quickly (stop feeling)
• if you move finger slightly, start firing again when changed
• better if you feel cockroach that starts moving
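Slow vs. fast adaptation above, sketched as a sustained response vs. a change-detecting response (a fast-adapting unit fires again when the stimulus moves, like the cockroach example):

```python
import numpy as np

# Constant pressure step: a slowly adapting unit tracks the stimulus;
# a fast-adapting unit fires only at changes (onset / offset)
stim = np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=float)

slow_adapting = stim                    # sustained response
fast_adapting = np.abs(np.diff(stim))   # responds only to change

print(fast_adapting)  # -> [0. 1. 0. 0. 0. 1. 0.]
```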

## pathways

• upper-body
• Cuneate nucleus - everything from the upper body goes into this first (brain stem)
• VPL - everything accumulates here in the thalamus, then goes to
• S1 cortex - primary somatic sensory cortex - this is the knowledge of where was touched
• lower-body (trunk down)
• everything in the lower body goes to Gracile nucleus - in brain stem
• special case - sensory for face
• trigeminal ganglion connects into VPM (thalamus) then goes into S1 cortex
• proprioceptive pathways
• starts in lower body
• axons split - half go up to Clarke’s nucleus
• half go back into muscles
• Clarke’s nucleus goes straight into cerebellum
• starts in upper body - goes straight into cerebellum
• thus the cerebellum has a map of where / how tense muscles are

## representation

• cortex - this is where understanding is
• dedicates area based on how many neurons coming in
• lips / hands have more area
• S1 - primary somatosensory cortex
• most body parts
• neurons from functionally distinct columns
• cortex assigns space based on how much info comes in
• after amputation and time, map grows into lost space
• map is different when different stimuli are given to fingers
• S2 - secondary somatosensory cortex
• processes and codes information from S1
• throat, tongue, teeth, jaw, gum

## pathway

• mechanosensory
1. DRG
2. Cuneate, Gracile
3. VPL
4. S1
• face mechanosensory
1. trigeminal ganglion
2. principal nucleus of trigeminal complex
3. vpm
4. S1
• proprioception
• lower body
1. muscle spindles split
2. half go to motor neurons
3. other half go to Clarke’s nucleus
4. Clarke’s nucleus -> cerebellum
• upper body - straight to the cerebellum

# 10 - nociception

## review

• chronic pain is very important clinically
• cortex - lets you know if you are sensing something
1. loss-of-function lesion - piece of cortex is lost - lose awareness
• can come from stroke, migraine-aura
2. gain-of-function lesion = excitatory lesion - like epilepsy
• cortex comes on when it shouldn’t
• increased awareness
• can come from stroke / migraine
• “sixth sense” - measuring stretch of all your muscles in cerebellum
• nociception = pain
• has nociceptors - neurons that do nociception
• thermoceptors - neurons that sense temperature
• two classes of linking receptors
1. Adelta fibers - fast pain
2. C fibers - slow and chronic
• Trp channels - mechanically or thermally gated
• let Na+ in
• trpV heat - binds capsaicin
• in the class of vanilloids
• birds not capsaicin sensitive
• trpM cold - binds menthol
• adapts in minutes - stop feeling cold after a while
• synapses of nociceptors go to the dorsal horn of the spinal cord
• nociceptive pathway goes contralateral (must cross midline) - if you cut the left side of the spinal cord, you lose mechanoception (ipsilateral) from the left and nociception (contralateral) from the right
• mechanoreceptors, by contrast, send axon up the spinal cord
• dorsal horn has laminal structure (has layers)
1. know where pain is
• somatosensory cortex
1. care about pain
• insular cortex - emotional part of brain
• whether or not you care about pain
• pairs up with other senses
• can have both loss-of-function and gain-of-function lesions in both places
• referred pain map - map that refers to a specific problem (ex. esophagus)
• visceral pain - don’t know where the pain is
• hyperalgesia - increased pain sensitivity
• pain sensing neurons are hyperactive because of inflammation
• pain sensing neuron releases substance P into Mast cell or neutrophil which releases histamine which strengthens receptor
• prostaglandins activate nociceptors
• allodynia - when mechanosensation hurts - not understood
• turning off pain - add serotonin
1. exercise
2. lack of serotonin ~ mood disorders
• central sensitization: allodynia
• these mechanisms work through interoception
• senses chemical imbalances
• phantom limbs and phantom pain - if you lose a limb and still feel pain
• mechanoreceptors inhibit nociceptors

## pathway

• nociception
• same as mechanosensory except goes all the way to thalamus
• doesn’t stop in brainstem
• crosses the midline after first synapse
• visceral pain
• axons mainline straight up, go through VPL, go straight to insular cortex

# 11 - vision (eye)

• most of visual system is to read faces
• eye
• aqueous humor
• posterior chamber
• lens
• ciliary muscles
• retina
• fovea
• optic disk
• optic nerve and retinal vessels
• to see far, stretch the lens flat; this adjustment = accommodation
• retina - rods and cones are at back
• cones - color
• retinal ganglion cells sends down signal
• 12 days to turnover whole photoreceptor disks into PE (pigment epithelium)
• PE is what the rods / cones are in
• disks containing rhodopsin (the light-sensitive protein) break off of rods / cones and are taken up by the PE
• light leads to inhibition
• melanopsin - receptor for blue light

## circuits

• accommodation - stretching lens uncrosses lines
• function photoreceptor
• usually cGMP is letting in Na/Ca
• Ca provides negative feedback here
• when light hits, retinal inside rhodopsin activates phosphodiesterase - breaks down cGMP so the channel closes and Na/Ca aren’t let in
• light on middle
• depolarizes cone
• excites on-center cells
• inhibits off-center cells
• these adjust quickly
• horizontal cells - take positive input from the photoreceptor and inhibit it back
• also inhibit the photoreceptors around it - creates contrast
• have these for each color
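The horizontal-cell lateral inhibition above can be sketched as a zero-sum center-surround kernel over a 1D luminance edge — the kernel weights are made up, but the point is that uniform regions cancel while the edge stands out:

```python
import numpy as np

# Center-minus-surround kernel (weights sum to zero, values illustrative):
# flat regions produce zero output, the luminance step produces a peak
kernel = np.array([-0.25, -0.25, 1.0, -0.25, -0.25])
edge = np.array([0.0] * 6 + [1.0] * 6)   # dark-to-bright step

response = np.convolve(edge, kernel, mode="same")
print(response)  # nonzero only near the edge (and array boundaries)
```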

## pathway

1. rods / cones
2. horizontal cells - regulate gain control, how fast it adapts, contrast adaptation
3. bipolar cells
4. amacrine cells - processing of movements
5. retinal ganglion cells
6. most go into thalamus then to cortex; a small amount go into brain stem and control mood / circadian rhythms

# 12 - central visual system

• cortex is a pizza box
• has columns
• autophagy - process by which cells eat parts of themselves
• nobel 2016
• cones - color
• 12 day cycle for processing optic disks
• photoreceptors have cyclic G-activated channel
• light shuts down photoreceptors
• cell decreases in activity
• very roughly - each cone connects to cone bipolar cell
• this gets represented by one column in the cortex
• 15-30 rods connect to 1 rod bipolar cells
• cortex has 6 layers
• each has tons of neurons, mostly pyramidal neurons
• column is a section through the 6 layers - all does about the same thing
• orientation columns responds to specific x,y
• has subregions that respond to specific orientations
• ocular dominance column - both eyes for same coordinate go to same spot
• dominated by one eye
• distance
• far cells
• tuned cells
• near cells
• V4 in temporal lobe - object recognition

## pathways

• overall
1. V1
2. V2
3. V4 or MT
• central projections
• retinal ganglions
• all go through optic stuff

# 13 - auditory system

• ear parts
• outer
• middle
• tympanic membrane
• inner
• cochlea - senses the sound
• oval window
• round window - not understood
• conductive hearing loss - in the outer/middle ear
• sensorineural hearing loss - in the cochlea
• can’t be fixed with hearing aids
• humans
• 2-5kHz ~= human speech (can sometimes hear more)
• 30-100x boost for tympanic membrane
• this differs between people
• 200x focus onto oval window
• cochlea
• 4 layers
• inner hair cells - what you hear with
• outer hair cells - generate sound
• generates noise at every frequency except one you want to hear
• otoacoustic emission - low buzz that is produced
• tinnitus - ringing in the ears
• can be internal
• can be peripheral - generated by otoacoustic emission
• high frequencies right next to cochlea
• low frequencies on distal tip
• human high frequency cells die with age
• hair cells
• bundle of cilia
• have an orientation
• kinocilium is the tallest
• tall ones are in the back
• dying hair cells - can’t be replaced
1. loud sounds
2. certain antibiotics
• auditory pathway
• MSO - medial superior olive - decides where the sounds is coming from
• takes input from right / left ear, decides which came in first
• medial geniculate complex of the thalamus
• brain shape
• folds are pretty random
• phrenology - shape of skull was based on brain
• thought it could determine personality
• false
• Heschl’s gyrus folding pattern is not random
• argument that if you have one, you are more musical
• any sound is made up of a bunch of frequencies
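That decomposition into frequencies — roughly what the cochlea does mechanically along its length — sketched with an FFT (tone frequencies and sample rate are illustrative):

```python
import numpy as np

# Build a 440 Hz + 880 Hz tone, then recover both components from its spectrum
fs = 8000                        # sample rate (Hz), illustrative
t = np.arange(fs) / fs           # 1 second of samples
sound = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

spectrum = np.abs(np.fft.rfft(sound))
freqs = np.fft.rfftfreq(sound.size, d=1 / fs)
peaks = freqs[spectrum > spectrum.max() / 4]
print(peaks)  # -> [440. 880.]
```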

## circuits

• K+ entry depolarizes hair cells (endolymph is high in K+), lets in Ca, releases vesicles
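The MSO’s left-vs-right timing comparison (which ear the sound hit first) can be sketched as a cross-correlation between the two ear signals — the delay and signal here are simulated, illustrative values:

```python
import numpy as np

# A sound arriving at the right ear first shows up as a lag between the
# two ear signals; the peak of the cross-correlation recovers that lag
rng = np.random.default_rng(1)
true_delay = 3                                 # samples, illustrative
sig = rng.standard_normal(1000)
left = np.concatenate([np.zeros(true_delay), sig])   # delayed copy
right = np.concatenate([sig, np.zeros(true_delay)])

corr = np.correlate(left, right, mode="full")
estimated = corr.argmax() - (len(right) - 1)
print(estimated)  # recovers the interaural delay
```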

# 14 - vestibular system

• very related to cochlea
• same hair cells
• differences
1. vestibular system doesn’t use cortex (you don’t think about it)
• goes right into spinal cord
2. controls eye movements
• one of the fastest circuits in the brain
• clinically important
• you have to be able to have your balance
1. each column is computational unit of the cortex
2. ocular dominance column
• one for each eye
• labyrinth and its innervation
• semicircular canals
• can only measure one axis of rotation
• remember horizontal canal - measures turning head left to right
• this measures acceleration
• like a hula hoop filled with glitter
• has ampulla at one place in the hoop
• cupula - sits over the ampulla’s hair cells
• if the “glitter” hits the cupula, it will bend the hair cells
• if you keep spinning, fluid starts moving and you stop detecting movement
• this means the canals adapt mechanically
• if you stop spinning, fluid keeps moving and system thinks you’re spinning the other way
• right horizontal canal activated by turn to the right
• same for left
• Scarpa’s ganglion - cell bodies of the vestibular nerve (hair cells synapse onto them)
• sends axons into vestibular nuclei
• lots of fluid (high in K+)
• macula - place where all the hair cells are
• Ampullae - at base of canals
• hair cells all in the same direction
• utricle and saccule - measure head tilt
• hair cells in multiple orientations
• these contain otoconia
• these are little crystals that move with gravity
• measure acceleration and tilt
• tilts do not adapt - they keep firing while you’re leaned back
• they basically report tilt / position at all times
• tiplink - connect cilia together for hair cells
• when they move, tiplink move, pull on ion channels
• motor on connected hair cell moves up and down to generate correct amount of tension
• motor uses myosin and actin
• harming these proteins can cause deafness
• both eyes must always be looking in the same direction
• also must be sitting over image for a while
• ipsilateral - same side
• contralateral - different side
• vestibular ocular reflex VOR - turn your head to the right, eyes move left
• doesn’t require cortex
• only have to learn excitatory
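The canals’ mechanical adaptation above — you stop feeling a constant spin, then feel a reversed spin when you stop — can be sketched as a leaky integrator of head acceleration; the time constant and velocities are assumed, illustrative values:

```python
import numpy as np

# Cupula deflection modeled as a leaky integrator of head acceleration:
# onset -> big signal; constant spin -> signal decays away; stopping ->
# overshoot the other way ("spinning backwards" illusion)
dt, tau = 0.01, 5.0                                       # s; illustrative
velocity = np.concatenate([np.zeros(10),
                           np.full(1990, 100.0),          # constant spin, deg/s
                           np.zeros(2000)])               # abrupt stop
accel = np.diff(velocity, prepend=0.0) / dt

deflection = np.zeros_like(accel)
for i in range(1, len(accel)):
    deflection[i] = deflection[i-1] + dt * (accel[i] - deflection[i-1] / tau)

# onset spike, near-zero during sustained spin, negative overshoot on stop
print(deflection[10], deflection[1999], deflection[2000])
```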

# 15 - chemical senses

• cAMP is used by GPCR

### research

• Research future
• keeping up to date: https://sanjayankur31.github.io/planet-neuroscience/

# questions

• problems that are solved, or soon will be
• how do single neurons compute?
• what is the connectome of a small nervous system, like that of C. elegans (300 neurons)?
• how can we image a live brain of 100,000 neurons at cellular and millisecond resolution?
• hydra was completed
• how does sensory transduction work?
• problems that we should be able to solve in the next 50 years
• can we add senses to the brain?
• like cochlear implant
• like vibrations
• how do circuits of neurons compute?
• what is the complete connectome of the mouse brain (70e6 neurons)?
• how can we image a live mouse brain at cellular and millisecond resolution?
• what causes psychiatric and neurological illness?
• how do learning and memory work?
• short-term vs. long-term
• declarative vs. non-declarative
• encodes relationships between things not things themselves
• memory retrieval
• why do we sleep and dream?
1. sleep is restorative (but then why high neural activity?)
2. allows the brain to run simulations
3. consolidating memories and forgetting
• where is consciousness?
• at this point, sounds and vision should line up (delayed appropriately)
• how do we make decisions?
• how does the brain represent abstract ideas?
• what does neural baseline activity represent?
• how does the brain solve timing?
• moving eyes
• hearing and vision time differences
• how does sensorimotor learning build a model of the world?
• problems that we should be able to solve, but who knows when
• how do brains simulate the future?
• how does the mouse brain compute?
• what is the complete connectome of the human brain (8e10 neurons)?
• how can we image a live human brain at cellular and millisecond resolution?
• how could we cure psychiatric and neurological diseases?
• how could we make everybody’s brain function best?
• brain and quantum?
• some work in quantum neural nets
• how is info coded in neural activity?
• like measuring transistors and guessing what the computer is doing
• neuron gets lots of inputs
• do glial cells and other signaling molecules compute?
• what is intelligence?
• what is iq?
• how do specialized systems integrate?
• problems we may never solve
• what are emotions?
• brain states that quickly assign values
• in the amygdala
• how does the human brain compute?
• how can cognition be so flexible and generative?
• how and why does conscious experience arise?
• thing that flickers on when you wake up that was not there
• evolutionary to manage all the different systems
• meta-questions
• what counts as an explanation of how the brain works? (and which disciplines would be needed to provide it?)
• how could we build a brain? (how do evolution and development do it?)
• what are the different ways of understanding the brain? (what is function, algorithm, implementation?)
• ref David Eaglemen article: http://discovermagazine.com/2007/aug/unsolved-brain-mysteries
• ref Adolphs 2015, “The unsolved problems of neuroscience”

# brain-machine interfaces

• surgeons won’t want to put chips into people’s brains
• only for people with serious medical conditions right now

# RNA barcoding

• allows for tagging different neurons
• can then optically get differences
• also can sequence and get differences (http://www.cell.com/neuron/pdf/S0896-6273%2816%2930421-4.pdf)
• future of electrophysiology: https://www.technologynetworks.com/neuroscience/articles/shining-a-light-on-the-future-of-electrophysiology-286992

# brain transplant

• computational hypothesis of the mind

# tms

• explored as a temporary treatment for some autism symptoms
• can change people’s minds
• Research main
• Problems: Alzheimers, PTSD, autism, addiction, MS, depression, schizophrenia

# 1 - neural decoding

## fMRI decoding

• reconstructing visual experiences from brain activity evoked by movies (Gallant, 2011)
• try doing this with music?
• typing
• like fb or neuralink
• new data (e.g. BMI?)

## spike sorting

• can be based on electrodes or calcium-imaging data
• can get ground truth with intracellular recordings
• spike sorting with GANs
• simulated datasets also work well
• http://spikefinder.codeneuro.org

## neural encoding

• cochlear implant turns sound into neural signal

# 2 - brain mapping

## structural connectomics

• random forests / CNNs for neuronal image segmentation
• uses gradient boosting with MALIS

## functional connectivity

• computational fMRI (Cohen et al. 2017)
• using graphical models with weighted-l1 regularization

# 3 - computational learning models

## neural priors

• cox train cnn w/ fMRI

## comparison to cnn

• look at features found in layers

## biophysically plausible network learning

• PCA
• anti-hebbian learning (Foldiak)
• sparse coding (Olshausen & Field)
• ICA (Sejnowski)
• adaptive synaptogenesis with inhibition

# 4 - theoretical neural coding

## action potentials

• velocity vs. energy

## linearization

• linearization PLOS
• linearization JCNeuro
• interspike interval

# 5 - cnns

• auto-encoder with sparsity rules
• rotation project
• train without flips
• neural network compression
• extracting memory with deep learning
• learning how to find the right segments of memory
• learning to decode another neural network?
• Research ref

# datasets

• senseLab: https://senselab.med.yale.edu/
• modelDB - has NEURON code
• model databases: http://www.cnsorg.org/model-database
• comp neuro databases: http://home.earthlink.net/~perlewitz/database.html
• crns data: http://crcns.org/
• hippocampus spike train data: http://crcns.org/data-sets/hc
• visual cortex data (gallant)
• allen brain atlas
• human fMRI datasets: https://docs.google.com/document/d/1bRqfcJOV7U4f-aa3h8yPBjYQoLXYLLgeY6_af_N2CTM/edit
• Kay et al 2008 has data on responses to images
• calcium imaging data: http://spikefinder.codeneuro.org/
• spikes: http://www2.le.ac.uk/departments/engineering/research/bioengineering/neuroengineering-lab/software

# data types

| | fMRI | EEG | ECoG | Local field potential (together forms microelectrode array) | single-unit | calcium imaging |
|--------------|----------|----------|-------------------|-------------------------------------------------------------|-------------|-----------------|
| scale | high | high | high | low | tiny | |
| spatial res | mid-low | very low | low | mid-low | x | |
| temporal res | very low | mid-high | high | high | super high | |
| invasiveness | non | non | yes (under skull) | very | very | |

• neural dust

# ongoing projects

• human brain project
• blue brain project - large-scale brain simulation
• european brain project
• companies
• Kernel
• Facebook neural typing interface
• IBM: project joshua blue

# conferences

• Annual Computational Neuroscience Meeting
• Statistical Analysis of Neuronal Data
• 2017
• SFN (11/11-11/15) - DC
• NIPS (12/4-12/9) - Long Beach
• 2018
• ICCV (March)
• VSS (5/18-5/23) - Florida (Always)
• ICML (7/10-7/15) - Stockholm
• CVPR (6/18-6/23) - Salt Lake City
• SFN (11/3-11/7) - San Diego
• NIPS (12/3-12/8) - Montreal
• 2019
• ICCV (March) - Korea?
• ICML (7/10-7/14) - Long Beach
• CVPR (Unknown)
• SFN (10/19-10/23) - Chicago
• NIPS (Unknown)

# areas

• Basic approaches:
• The problem of neural coding
• Spike trains, point processes, and firing rate
• Statistical thinking in neuroscience
• Overview of stimulus-response function models
• Theory of model fitting / regularization / hypothesis testing
• Bayesian methods
• Estimation of stimulus-response functionals: regression methods, spike-triggered covariance
• Variance analysis of neural response
• Estimation of SNR. Coherence
• Generalized Linear Models
• Information theoretic approaches:
• Information transmission rates and maximally informative dimensions
• Scene statistics approaches and neural modeling
• Techniques for analyzing multiple-unit recordings:
• Event sorting in electrophysiology and optical imaging
• Optophysiology cell detection
• Sparse coding/ICA methods, vanilla and methods including statistical models of nonlinear dependencies
• Methods for assessing functional connectivity
• Statistical issues in network identification
• Low-dimensional latent dynamical structure in network activity–Gaussian process factor analysis/newer methods
• Models of memory, motor control and decision making:
• Neural integrators
• Attractor networks

### stat

• Information Theory

[toc]

# Info-theory basics

### entropy

• $H(X) = - \sum p(x) log p(x) = E[h(p)]$
• $h(p)= - log(p)$
• writing $H(p)$ implies X is binary with success probability p
• intuition
• higher entropy $\implies$ more uniform
• lower entropy $\implies$ more pure
1. expectation of variable $W=W(X)$, which assumes the value $-log(p_i)$ with probability $p_i$
2. minimum, average number of binary questions (like is X=1?) required to determine value is between H(X) and H(X)+1
3. related to asymptotic behavior of sequence of i.i.d. random variables
•  $H(Y|X)=\sum_j p(x_j) H(Y|X=x_j)$
•  $H(X,Y)=H(X)+H(Y|X) =H(Y)+H(X|Y)$
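These definitions are easy to sanity-check numerically; a minimal pure-Python sketch (function name is my own):

```python
import math

def entropy(ps):
    """Shannon entropy H(X) = -sum p log2 p, using the 0*log(0) = 0 convention."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

# the uniform distribution maximizes entropy: H = log2(4) = 2 bits
print(entropy([0.25] * 4))  # → 2.0
# a "pure" (degenerate) distribution has zero entropy
print(entropy([1.0, 0.0]))
```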

### relative entropy / mutual info

• relative entropy = KL divergence - measures distance between 2 distributions
• if we knew the true distribution p of the random variable, we could construct a code with average description length H(p).
•  If, instead, we used the code for a distribution q, we would need $H(p) + D(p\|q)$ bits on the average to describe the random variable.
•  $D(p\|q) \neq D(q\|p)$
• mutual info I(X; Y)
•  $I(X; Y) = \sum_x \sum_y p(x,y) log \frac{p(x,y)}{p(x) p(y)} = D(p(x,y)\|p(x)\cdot p(y))$
•  $I(X; Y) = H(X) - H(X|Y)$
• $I(X; X) = H(X)$ so entropy sometimes called self-information
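Both quantities can be computed directly from a joint table, using $I(X;Y) = D(p(x,y)\|p(x)p(y))$; a pure-Python sketch (names are my own):

```python
import math

def kl(p, q):
    """D(p||q) = sum p log2(p/q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_info(joint):
    """I(X;Y) = D(p(x,y) || p(x)p(y)) for a joint table joint[x][y]."""
    px = [sum(row) for row in joint]                 # marginal of X
    py = [sum(col) for col in zip(*joint)]           # marginal of Y
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row) if pxy > 0
    )

# independent joint: p(x,y) = p(x)p(y), so I(X;Y) = 0
indep = [[0.25, 0.25], [0.25, 0.25]]
print(mutual_info(indep))  # → 0.0
# perfectly dependent (X = Y): I(X;Y) = H(X) = 1 bit
dep = [[0.5, 0.0], [0.0, 0.5]]
print(mutual_info(dep))    # → 1.0
```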

### chain rules

•  entropy - $H(X_1, …, X_n) = \sum_i H(X_i|X_{i-1}, …, X_1)$
•  conditional mutual info $I(X; Y|Z) = H(X|Z) - H(X|Y,Z)$
•  $I(X_1, …, X_n; Y) = \sum_i I(X_i; Y|X_{i-1}, … , X_1)$
•  conditional relative entropy $D(p(y|x)\|q(y|x)) = \sum_x p(x) \sum_y p(y|x) log \frac{p(y|x)}{q(y|x)}$
•  $D(p(x, y)\|q(x, y)) = D(p(x)\|q(x)) + D(p(y|x)\|q(y|x))$

### Jensen’s inequality

• convex - lies below any chord
• has nonnegative 2nd derivative (if twice differentiable)
• linear functions are both convex and concave
• Jensen’s inequality - if f is a convex function and X is an R.V., $E[f(X)] \geq f(E[X])$
• if f strictly convex, equality $\implies X=E[X]$
• implications
•  information inequality $D(p\|q) \geq 0$ with equality iff p(x)=q(x) for all x
•  $H(X) \leq log|\mathcal{X}|$ where $|\mathcal{X}|$ denotes the number of elements in the range of X, with equality if and only if X has a uniform distr
•  $H(X|Y) \leq H(X)$ - information can’t hurt
• $H(X_1, …, X_n) \leq \sum_i H(X_i)$

# axiomatic approach

• fundamental theorem of information theory - it is possible to transmit information through a noisy channel at any rate less than channel capacity with an arbitrarily small probability of error
• conversely, to achieve arbitrarily high reliability the transmission rate must be kept below the channel capacity
• uncertainty measure axioms
1. H(1/M,…,1/M)=f(M) is a monotonically increasing function of M
2. f(ML) = f(M)+f(L) where M,L $\in \mathbb{Z}^+$
3. grouping axiom
4. H(p,1-p) is continuous function of p
• $H(p_1,…,p_M) = - \sum p_i log p_i = E[h(p_i)]$
• $h(p_i)= - log(p_i)$
• only solution satisfying above axioms
• H(p,1-p) has max at 1/2
• lemma - Let $p_1,…,p_M$ and $q_1,…,q_M$ be arbitrary positive numbers with $\sum p_i = \sum q_i = 1$. Then $-\sum p_i log p_i \leq - \sum p_i log q_i$. Only equal if $p_i = q_i : \forall i$
• intuitively, $\sum p_i log q_i$ is maximized when $p_i=q_i$, like a dot product
• $H(p_1,…,p_M) \leq log M$ with equality iff all $p_i = 1/M$
• $H(X,Y) \leq H(X) + H(Y)$ with equality iff X and Y are independent
•  $I(X,Y)=H(Y)-H(Y|X)$
• sometimes allow p=0 by saying 0log0 = 0
• information $I(x)=log_2 \frac{1}{p(x)}=-log_2p(x)$
• reduction in uncertainty (amount of surprise in the outcome)
• if the probability of event happening is small and it happens the info is large
• entropy $H(X)=E[I(X)]=\sum_i p(x_i)I(x_i)=-\sum_i p(x_i)log_2 p(x_i)$
•  information gain $IG(X,Y)=H(Y)-H(Y|X)$
•  where $H(Y|X)=-\sum_j p(x_j) \sum_i p(y_i|x_j) log_2 p(y_i|x_j)$
• parts
1. random variable X taking on $x_1,…,x_M$ with probabilities $p_1,…,p_M$
2. code alphabet = set $a_1,…,a_D$ . Each symbol $x_i$ is assigned to finite sequence of code characters called code word associated with $x_i$
3. objective - minimize the average word length $\sum p_i n_i$ where $n_i$ is the length of the code word for $x_i$
• code is uniquely decipherable if every finite sequence of code characters corresponds to at most one message
• instantaneous code - no code word is a prefix of another code word
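The instantaneous (prefix-free) condition is mechanical to check: no codeword may be a prefix of another. A small sketch (the helper is my own, not from the text):

```python
def is_instantaneous(codewords):
    """True iff no codeword is a (proper) prefix of another codeword."""
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

print(is_instantaneous(["0", "10", "110", "111"]))  # → True
print(is_instantaneous(["0", "01", "11"]))          # → False ("0" prefixes "01")
```

Every instantaneous code is uniquely decipherable, since a decoder can emit each symbol the moment a codeword completes.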
• Linear Models

# ch 1 introduction

• regression analysis studies relationships among variables
• $Y = f(X_1,…X_i) + \epsilon$
• We can use $X_i^2$ as a term in a linear regression, but the function must be a linear combination of terms (no coefficients in the exponent, in a sine, etc.)
• regression analysis
• statement of the problem
• select potential relevant variables
• data collection
• model specification
• choice of fitting method
• model fitting
• model validation - important
• iterative process!
• regressions
• simple linear regression - univariate Y, univariate X
• multiple linear regression - univariate Y, multivariate X
• multivariate linear regression - multivariate Y
• generalized linear regression - Y isn’t normally distributed
• ANOVA - all X are categorical
• Analysis of covariance - part of X are categorical

# ch 2 simple linear regression

### basics

• $Y = \beta_0 + \beta_1X + \epsilon$
• Take samples $x_i,y_i$
• assume error $\epsilon \sim N(0,\sigma^2)$
• further assume error $\epsilon_i,\epsilon_j,…\epsilon_n$ are i.i.d
• this isn’t always the case, for example if some of the data points were correlated with each other
• given $x_i$
• $Var[Y_i] = Var[\epsilon_i] = \sigma^2$
• $Y_i \sim N(\beta_0 + \beta_1x_i , \sigma^2)$ - $Cov(Y_i,Y_j) = 0$, uncorrelated
• p-value - probability, computed assuming $H_0$ is true, of a result at least as extreme as the one observed
• want this to be low to reject

### parameter estimation (least squares)

• $\epsilon_i = y_i - \beta_0 - \beta_1x_i$
• minimize sum of squared errors
• Sums
• SSE = $\sum_1^n\epsilon_i^2 = \sum_1^n (y_i - \beta_0 - \beta_1x_i)^2$
• $S_{xx}=\sum (x_i-\bar{x})^2$
• $S_{yy}=\sum (y_i-\bar{y})^2$
• $S_{xy}=\sum (x_i-\bar{x})(y_i-\bar{y})$
• estimators
• $\hat{\beta_1}=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2}$
• $\hat{\beta_0}=\bar{y}-\hat{\beta_1}\bar{x}$
• Gauss-Markov Theorem - the least squares estimators $\hat{\beta_0}$ and $\hat{\beta_1}$ are unbiased estimators and have minimum variance among all unbiased linear estimators = best linear unbiased estimators.
• unbiased means $E[\hat{\beta}] = \beta$
• $Var(\hat{\beta_1})=\frac{\sigma^2}{\sum(x_i-\bar{x})^2}$
• $Var(\hat{\beta_0})=\frac{\sigma^2}{n}+\frac{(\bar{x}\sigma)^2}{\sum(x_i-\bar{x})^2}$
• $\hat{\sigma}^2 = MSE = \frac{SSE}{n-2}$ - n-2 since there are 2 parameters in the linear model
• sometimes we have to enforce $\beta_0=0$, there are different statistics for this
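The closed-form estimators above are easy to verify on data; a minimal pure-Python sketch (function name is my own):

```python
from statistics import mean

def fit_simple_ols(xs, ys):
    """Least squares: beta1_hat = Sxy/Sxx, beta0_hat = ybar - beta1_hat * xbar."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# data generated exactly from y = 1 + 2x recovers the coefficients
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
print(fit_simple_ols(xs, ys))  # → (1.0, 2.0)
```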

### evaluate model fitting

• SST - total sum of squares - measure of total variation in response variable
• $\sum(y_i-\bar{y})^2$
• SSR - regression sum of squares - measure of variation explained by predictors
• $\sum(\hat{y_i}-\bar{y})^2$
• SSE - measure of variation not explained by predictors
• $\sum(y_i-\hat{y_i})^2$
• SST = SSR + SSE
• $R^2 = \frac{SSR}{SST}$ - coefficient of determination
• measures the proportion of variation in Y that is explained by the predictor
• Cor(X,Y) = $\rho$ = $\frac{S_{xy}}{\sqrt{S_{xx}S_{yy}}}$
• only measures linear relationship
• measure of strength and direction of the linear association between two variables
• better for simple linear regression, doesn’t work later
• $\rho^2$ = $R^2$
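A quick numeric check of the decomposition SST = SSR + SSE and of $R^2 = SSR/SST$ (pure Python; the data and names are my own):

```python
from statistics import mean

def anova_decomposition(xs, ys):
    """Return (SST, SSR, SSE) for a simple least-squares line."""
    xbar, ybar = mean(xs), mean(ys)
    b1 = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
         / sum((x - xbar) ** 2 for x in xs)
    b0 = ybar - b1 * xbar
    yhat = [b0 + b1 * x for x in xs]
    sst = sum((y - ybar) ** 2 for y in ys)              # total variation
    ssr = sum((yh - ybar) ** 2 for yh in yhat)          # explained
    sse = sum((y - yh) ** 2 for y, yh in zip(ys, yhat)) # unexplained
    return sst, ssr, sse

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]         # roughly y = 2x
sst, ssr, sse = anova_decomposition(xs, ys)
assert abs(sst - (ssr + sse)) < 1e-9    # SST = SSR + SSE
print(ssr / sst)                        # R^2; close to 1 for nearly linear data
```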

### inference for simple linear regression

• confidence interval construction
• confidence interval (CI) is range of values likely to include true value of a parameter of interest
• confidence level (CL) - probability that the procedure used to determine CI will provide an interval that covers the value of the parameter
• $\hat{\beta_0} \pm t_{n-2,\alpha /2} * s.e.(\hat{\beta_0})$
• for $\beta_1$
• with known $\sigma$
• $\frac{\hat{\beta_1}-\beta_1}{\sigma(\hat{\beta_1})} \sim N(0,1)$
• derive CI
• with unknown $\sigma$
• $\frac{\hat{\beta_1}-\beta_1}{s(\hat{\beta_1})} \sim t_{n-2}$
• derive CI
• hypothesis testing
• t-test
• $H_0:\beta_1=b$
• $t_1 = \frac{\hat{\beta_1}-b}{s.e.(\hat{\beta_1})}$, n-2 degrees of freedom
• f-test: $H_0:\beta_1=0$
• F=MSR/MSE
• reject if F > $F_{1-\alpha;1,n-2}$
• two kinds of prediction
1. the prediction of the value of the response variable Y which corresponds to any chosen value $x_o$ of the predictor variable
2. the estimation of the mean response $\mu_o$ when X = $x_o$

### assumptions

1. There exists a linear relation between the response and predictor variable(s).
• otherwise predicted values will be biased
2. The error terms have the constant variance, usually denoted as $\sigma^2$.
• otherwise prediction / confidence intervals for Y will be affected
3. The error terms are independent, have mean 0.
• otherwise a predictor like time might have been omitted from the model
4. Model fits all observations well (no outliers).
• otherwise misleading fit
5. The errors follow a Normal distribution.
• otherwise usually ok
• assessing regression assumptions
1. look at scatterplot
2. look at residual plot
• should fall randomly near 0 with similar vertical variation, magnitudes
3. Q-Q plot / normal probability plot
• standardized residuals vs. normal scores
• values should fall near line y = x, which represents normal distribution
4. could do histogram of residuals
• look for normal curve - only works with a lot of data points
5. lack of fit test - based on repeated Y values at same X values

### variable transformations

• if assumptions don’t work, sometimes we can transform data so they work
• transform x - if residuals generally normal and have constant variance
• corrects nonlinearity
• transform y - if relationship generally linear, but non-constant error variance
• stabilizes variance
• if both problems, try y first
• Box-Cox: $Y' = Y^\lambda$ if $\lambda \neq 0$, else $log(Y)$
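A sketch of the transform exactly as stated above (note the common textbook variant is $(Y^\lambda - 1)/\lambda$ instead; this assumes $Y > 0$):

```python
import math

def box_cox(y, lam):
    """Box-Cox power transform as in the notes: y^lambda if lambda != 0, else log(y).
    Assumes y > 0."""
    return math.log(y) if lam == 0 else y ** lam

# lambda = 0.5 is a square-root transform; lambda = 0 is a log transform
print(box_cox(4.0, 0.5))   # → 2.0
print(box_cox(math.e, 0))  # log(e) = 1
```

In practice $\lambda$ is chosen to make the residuals most nearly normal with constant variance (e.g. by maximum likelihood over a grid of $\lambda$ values).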

# ch 3 multiple linear regression

• $Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + …+ \beta_p X_p + \epsilon$
• each coefficient is the contribution of $X_i$ after it has been linearly adjusted for the other predictor variables
• least squares solved to estimate regression coefficients
• unbiased $\hat{\sigma}^2 = \frac{SSE}{n-p-1}$

### matrix form

• write $Y = X\beta+\epsilon$
• each row is one $X_0,X_1,…,X_p$
• $\hat{\underline{\beta}} = (X’X)^{-1}X’Y$
• multicollinearity - sometimes no unique soln - parameter estimates have large variability
• $\hat{\sigma}^2 = \frac{SSE}{n-p-1}$ where there are p predictors
• hat matrix - $\hat{Y}=HY$
• H = $X(X’X)^{-1}X’$
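$\hat{\beta} = (X'X)^{-1}X'Y$ can be computed without a linear-algebra library by solving the normal equations $(X'X)\beta = X'Y$ with Gaussian elimination; a sketch (my own implementation, assuming X is given row-wise with a leading column of 1s for the intercept):

```python
def ols_beta(X, y):
    """Solve the normal equations (X'X) beta = X'y by Gaussian
    elimination with partial pivoting."""
    n, p = len(X), len(X[0])
    # build A = X'X and b = X'y
    A = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
         for i in range(p)]
    b = [sum(X[k][i] * y[k] for k in range(n)) for i in range(p)]
    # forward elimination
    for i in range(p):
        piv = max(range(i, p), key=lambda r: abs(A[r][i]))  # partial pivot
        A[i], A[piv] = A[piv], A[i]
        b[i], b[piv] = b[piv], b[i]
        for r in range(i + 1, p):
            f = A[r][i] / A[i][i]
            for c in range(i, p):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    # back substitution
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, p))) / A[i][i]
    return beta

# data generated exactly from y = 1 + 2*x1 + 3*x2
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 2, 1]]
y = [1, 3, 4, 6, 8]
print(ols_beta(X, y))  # ≈ [1.0, 2.0, 3.0]
```

With multicollinearity the pivot $A[i][i]$ approaches zero and the solve becomes unstable, which is the numerical face of the non-unique-solution problem mentioned above.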

### F-tests

• F statistic tests $H_0$: $\beta_1 = … = \beta_p = 0$
• reject when p ≤ .05
• $R^2$ only gets larger
• adjusted $R^2$ - divide each sum of squares by its degrees of freedom
• partial f-test: $H_0$: $\beta_2 = \beta_3 = 0$ ~ any subset of betas is 0
• tests whether that subset can be dropped from the model
• unlike the t-test, which handles only one coefficient at a time, it can test several coefficients at once
• extra sum of squares
• variables enter the model in the order you specify
• basically the partial f-test, but f is calculated from the extra sums of squares

### anova table

•  last column is $P(>|t|)$
• test is whether the statistic = 0
• default F-value is for all coefficients=0

### extra sums of squares

• regression happens in order you specify

# ch 4 multicollinearity

• multicollinearity - when predictors are highly correlated with each other
• roundoff errors
1. X’X has determinant close to zero
2. X’X elements differ substantially in magnitude
• correlation transformation - normalizes variables
• when linearly dependent, clearly can’t determine coefficients uniquely
• cannot interpret one set of regression coefficients as reflecting effects of the different predictors
• cannot extrapolate
• predicting is fine
• variance inflation factor (VIF) - measure how much the variances of the estimated regression coefficients are inflated as compared to when the predictors are not linearly related
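In general $VIF_j = 1/(1-R_j^2)$, where $R_j^2$ comes from regressing predictor j on all the others. For the special case of exactly two predictors, $R_j^2$ is just their squared correlation; a sketch under that assumption (names and data are my own):

```python
from statistics import mean

def vif_two_predictors(x1, x2):
    """With exactly two predictors, regressing one on the other gives
    R^2 = r^2 (squared correlation), so VIF = 1 / (1 - r^2)."""
    m1, m2 = mean(x1), mean(x2)
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sxx = sum((a - m1) ** 2 for a in x1)
    syy = sum((b - m2) ** 2 for b in x2)
    r2 = sxy ** 2 / (sxx * syy)
    return 1 / (1 - r2)

# uncorrelated predictors: no inflation
print(vif_two_predictors([1, 2, 3, 4], [1, 2, 2, 1]))         # → 1.0
# nearly collinear predictors: coefficient variances blow up
print(vif_two_predictors([1, 2, 3, 4], [1.1, 2.0, 3.1, 3.9]))
```

A common rule of thumb treats VIF above roughly 10 as a sign of serious multicollinearity.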

# ch 5 categorical predictors

• quantitative variable - gets a number
• qualitative variable - ex. color
• A matrix is singular if and only if its determinant is zero
• bonferroni procedure - if we do 3 tests at an overall 5% significance level, we use 5/3% for each individual test in order to keep the total at 5%

### indicator variables with 2 classes

• ancova - at least one categorical and one quantitative predictor
• indicator variables take on the value 0 or 1
• dummy coding - matrix is singular so we drop the last indicator variable - called reference class / baseline class
• additive effects assume that each predictor’s effect on the response does not depend on the value of the other predictor (as long as the other one was fixed)
• assume they have the same slope
• interaction effects allow the effect of one predictor on the response to depend on the values of other predictors.
• $y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \beta_3x_{i1}x_{i2} + \epsilon_i$
• We can use the levene.test() function, from the lawstat package. The null hypothesis for this test is that the variances are equal across all classes.

### more than 2 classes

• $\beta_0$ is the mean response for the reference class when all the other predictors $X_1, X_2, \dots$ are zero.
• $\beta_1$ is the mean response for the first class of C minus the mean response for the reference class when $X_1, X_2, \dots$ are held constant.
• The F statistic reported with the summary() function for a linear model tests if $\beta_1 = \dots = \beta_7 = 0$.

## other coding

• effect coding
• one vector is all -1s
• $\beta_0$ is the weighted average of the class averages
• orthogonal coding (not on test)

# ch 6 polynomial regression

• have to get all lower order terms
• beware of overfitting
• must center all the variables to reduce multicollinearity
• hierarchical approach - fit higher order model to check whether lower order model is adequate or not
• if a given order term is retained, all related terms of lower order must be retained
• otherwise it isn’t invariant to transformations of the columns
• interaction terms are similar to before

# ch 7 model comparison and selection

• Ockham’s razor - principle of parsimony - given two theories that describe a phenomenon equally well, we should prefer the theory that is simpler
• several different criteria
• don’t penalize many predictors
• $R^2_p$ - no penalty for extra predictors
• penalize many predictors
• adjusted $R^2_p$ - penalty
• Mallow’s $C_p$
• $AIC_p$
• $BIC_p$
• PRESS

# ch 9 model building process overview

• Probability

# Properties

• Mutually Exclusive: P(AB)=0
• Independent: P(AB) = P(A)P(B)
•  Conditional: $P(A|B) = \frac{P(AB)}{P(B)}$

# Measures

• $E[X] = \int P(x)x dx$
• $V[X] = E[(x-\mu)^2] = E[x^2]-E[x]^2$
• for unbiased estimate, divide by n-1
• $Cov[X,Y] = E[(X-\mu_X)(Y-\mu_Y)] = E[XY]-E[X]E[Y]$
• $Cor(Y,X) = \rho = \frac{Cov(Y,X)}{s_xs_y}$
• $(Cor(Y,X))^2 = R^2$
• Cov is a measure of how much 2 variables change together
• linearity
• $Cov(aX+bY,Z) = aCov(X,Z)+bCov(Y,Z)$
• $V(a_1X_1+\dots+a_nX_n) = \sum_{i=1}^{n}\sum_{j=1}^{n}a_ia_jCov(X_i,X_j)$
• if $X_1,X_2$ independent, $V(X_1-X_2) = V(X_1) + V(X_2)$
•  change of variable: if X has pdf $f$ and $X=v(Y)$, then $g(y) = f(v(y)) \left|\frac{d}{dy}v(y)\right|$
•  bivariate: $g(y_1,y_2) = f(v_1,v_2) |det(M)|$ where M in row-major is $\frac{\partial v_1}{\partial y_1}, \frac{\partial v_1}{\partial y_2}, \dots$
• $Corr(aX+b,cY+d) = Corr(X,Y)$ if a and c have same sign
• $E[h(X)] \approx h(E[X])$
• $V[h(X)] \approx h’(E[X])^2 V[X]$
• skewness = $E[(\frac{X-\mu}{\sigma})^3]$

# Moment-generating function

• $M_X(t) = E(e^{tX})$
• $E(X^r) = M_X ^ {(r )} (0)$
• sometimes you can use $ln(M_x(t))$ to find $\mu$ and $V(X)$
• $Y = aX+b \to M_Y(t) = e^{bt}M_X(at)$
• Y = $a_1X_1+a_2X_2 \to M_Y(t) = M_{X_1}(a_1t)M_{X_2}(a_2t)$ if $X_i$ independent
• probability plot - straight line is better - plot (the [100(i-.5)/n]th percentile, ith ordered observation)
• ordered statistics - variables $Y_i$ such that $Y_i$ is the ith smallest
• If X has pdf f(x) and cdf F(x), $G_n(y) = (F(y))^n$, $g_n(y) = n(F(y))^{n-1}f(y)$
• If joint, $g(y_1,…y_n) = n!f(y_1)…f(y_n)$
• $g(y_i) = \frac{n!}{(i-1)!(n-i)!}(F(y_i))^{i-1}(1-F(y_i))^{n-i}f(y_i)$

# Distributions

• Bernoulli: $f(x)= \begin{cases} p^x(1-p)^{1-x},& \text{if } x \in \{0,1\} \\ 0, & \text{otherwise} \end{cases}$

• Binomial: $f(x)= \begin{cases} {n \choose x} p^x (1-p)^{n-x},& \text{if } x \in \{0,1,\dots,n\} \\ 0, & \text{otherwise} \end{cases}$

• Exponential: $f(x)= \begin{cases} \lambda e^{-\lambda x},& \text{if } x \geq 0 \\ 0, & \text{otherwise} \end{cases}$

• Statistics

# Statistics and Sampling Distributions

• can calculate expected values of sample mean and sample $\sigma$ 2 ways: prob. rules and simulation (for simulation fix n and repeat k times)
• CLT - sample means are approximately normally distributed if n is large
• CLT: $lim_{n\to\infty}P(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}}\leq z)=P(Z\leq z) = \Phi(z)$
• CLT $\to Y = X_1 \cdots X_n$ has approximately lognormal distribution if all $P(X_i>0)=1$
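The "fix n and repeat k times" simulation recipe above, sketched for Uniform(0,1) draws (where $\mu = 0.5$ and $\sigma = 1/\sqrt{12}$; seed and sizes are my own choices):

```python
import random
import statistics

random.seed(0)  # deterministic for the sketch

def sample_means(n, k):
    """k sample means, each computed from n i.i.d. Uniform(0,1) draws."""
    return [statistics.mean(random.random() for _ in range(n)) for _ in range(k)]

means = sample_means(n=30, k=2000)
# CLT: the sample mean is approximately N(mu, sigma^2/n)
print(statistics.mean(means))   # close to mu = 0.5
print(statistics.stdev(means))  # close to sigma/sqrt(n) = (1/sqrt(12))/sqrt(30) ≈ 0.053
```

Plotting a histogram of `means` would show the familiar bell shape even though each underlying draw is uniform.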

### Law of Large Numbers

• $E(\bar{X}-\mu)^2 \to 0$ as $n \to \infty,$
•  $P(|\bar{X}-\mu| \geq \epsilon) \to 0$ as $n \to \infty$
• $T_o = X_1+…+X_n, E(T_o) = n\mu , V(T_o) = n\sigma^2$
• $E(\bar{X}) = \mu$
• $V(\bar{X}) = \frac{\sigma_x^2}{n}$
• chi-squared - finding the distribution for sums of squares of normal variables.
• if Z1,…, Zn are i.i.d. standard normal,then $Z_1^2+…+Z_n^2 = \chi_n^2$
• $(n-1)S^2/\sigma^2 \sim \chi_{n-1}^2$
• t - to use the sample standard deviation to measure precision for the mean X, we combine the square root of a chi-squared variable with a normal variable
• f - compare two independent sample variances in terms of the ratio of two independent chi-squared variables.

# Point Estimation

• point estimate - single number prediction
• point estimator - statistic that predicts a parameter
• MSE - mean squared error - $E[(\hat{\theta}-\theta)^2]$ = $V(\hat{\theta})+[E(\hat{\theta})-\theta]^2$
• bias = $E(\hat{\theta})-\theta$
• after unbiased we want MVUE (minimum variance unbiased estimator)
• Estimators: $\tilde{X}$ = Median, $X_e$ = Midrange((max+min)/2), $X_{tr(10)}=$ 10 percent trimmed mean (discard smallest and largest 10 percent)
• standard error: $\sigma_{\hat{\theta}} = \sqrt{Var(\hat{\theta})}$; its sample estimate is $s_{\hat{\theta}}$
• bootstrap - estimate the standard error computationally
• $S^2 (Unbiased)= \sum{\frac{(X_i-\bar{X})^2}{n-1}}$
• $\hat{\sigma^2} (MLE) = \sum{\frac{(X_i-\mu)^2}{n}}$
• Can calculate estimators for a distr. by calculating moments
• A statistic T = t(X1, . . ., Xn) is said to be sufficient for making inferences about a parameter $\theta$ if the joint distribution of X1, X2, . . ., Xn given that T = t does not depend upon $\theta$ for every possible value t of the statistic T.
• Neyman Factorization Thm - $t(X_1,…,X_n)$ is sufficient $\leftrightarrow f = g(t,\theta)*h(x_1,…,x_n)$
•  Estimating $h(\theta)$: if U is unbiased, T is sufficient for $\theta$, then use $U^* = E(U|T)$
• Fisher Information $I(\theta)=V[\frac{\partial}{\partial\theta}ln(f(x;\theta))]$ (for n samples, multiply by n)
• If T is unbiased estimator for $\theta$ then $V(T) \geq \frac{1}{nI(\theta)}$
• Efficiency of T is ratio of lower bound to variance of T
• hypergeometric - number of success in n draws of (without replacement) of sample with m successes and N-m failures
• negative binomial - fix number of successes, X = number of trials before rth success
• normal - standardized: $\frac{X-\mu}{\sigma}$ (mean 0 and std.dev.=1)
• gamma: $\Gamma (a) = \int_{0}^{\infty} x^{a-1}e^{-x}dx$, $\Gamma(1/2) = \sqrt{\pi}$

## MLE

• MLE - maximize $f(x_1,…,x_n;\theta_1,…\theta_m)$ - agreement with chosen distribution - often take ln(f) and then take derivative $\approx$ MVUE, but can be biased
• $\hat{\theta} =$argmax $L(\theta)$
•  Likelihood $L(\theta)=P(X_1,…,X_n|\theta)=\prod_{i=1}^n P(X_i|\theta)$
•  $log L(\theta)=\sum_i log P(X_i|\theta)$
• to maximize, set $\frac{\partial LL(\theta)}{\partial \theta} = 0$
•  for model fitting, use $\hat{\theta} = \arg\max_\theta P(\text{training data}|\theta)$
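A sketch of the recipe for Exponential($\lambda$) data, where setting the derivative of the log-likelihood to zero gives $\hat{\lambda} = 1/\bar{x}$ in closed form (the sample below is my own made-up data):

```python
import math
from statistics import mean

def exp_loglik(lam, xs):
    """log L(lambda) = n*log(lambda) - lambda*sum(x) for Exponential(lambda) data."""
    return len(xs) * math.log(lam) - lam * sum(xs)

xs = [0.5, 1.2, 0.3, 2.0, 1.0]            # hypothetical sample
closed_form = 1 / mean(xs)                # from d/dlambda log L = 0
grid = [0.01 * i for i in range(1, 500)]  # crude grid search over lambda
grid_max = max(grid, key=lambda lam: exp_loglik(lam, xs))
print(closed_form, grid_max)              # both ≈ 1.0
```

Taking the log first (as the notes say) is what makes the product of densities tractable to differentiate.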

# statistical intervals

• interval estimates come with confidence levels
• $Z=\frac{\bar{X}-\mu}{\sigma / \sqrt{n}}$
• For p not close to 0.5, use Wilson score confidence interval (has extra terms)
• confidence interval - If multiple samples of trained typists were selected and an interval constructed for each sample mean, 95 percent of these intervals contain the true preferred keyboard height

# Tests on Hypotheses

• Var($\bar{X}-\bar{Y})=\frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n}$
• tail refers to the side we reject (e.g. upper-tailed=$H_a:\theta>\theta_0$)
• $\alpha$ - type 1 - reject $H_0$ but $H_0$ true
• $\beta$ - type 2 - fail to reject $H_0$ but $H_0$ false
• we try to make the null hypothesis a statement of equality
• upper-tailed - reject large values
• $\alpha$ is computed using the probability distribution of the test statistic when $H_0$ is true, whereas determination of $\beta$ requires knowing the test statistic distribution when $H_0$ is false
• type 1 error usually more serious, pick $\alpha$ level, then constrain $\beta$
• can standardize values and test these instead
• P-value is the probability, calculated assuming that the null hypothesis is true, of obtaining a value of the test statistic at least as contradictory to $H_0$ as the value calculated from the available sample. (observed significance level)
• reject $H_0$ if p $\leq \alpha$

# Inferences Based on 2 Samples

• $\sigma_{\bar{X}-\bar{Y}} = \sqrt{\frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n}}$
• there are formulas for type 1,2 errors
• If both normal, $Z = \frac{\bar{X}-\bar{Y}-(\mu_1-\mu_2)}{\sqrt{\frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n}}}$
• If both have same variance, do a weighted average (pooled) $S_p^2 = \frac{m-1}{m+n-2}S_1^2+\frac{n-1}{m+n-2}S_2^2$
• If we have a large sample size, these expressions are basically true, we just use the sample standard deviation
• randomized controlled experiment - investigators assign subjects to the two treatments in a random fashion
• small sample sizes - two-sample t test
• $T = \frac{\bar{X}-\bar{Y}-(\mu_1-\mu_2)}{\sqrt{\frac{S_1^2}{m}+\frac{S_2^2}{n}}}$
• $\nu= \frac{(se_1^2 + se_2^2)^2}{\frac{se_1^4}{m-1}+\frac{se_2^4}{n-1}}$ (round down)
• $se_1 = \frac{s_1}{\sqrt{m}}$
• $se_2 = \frac{s_2}{\sqrt{n}}$
• two-sample t confidence interval for $\mu_1-\mu_2$ with confidence 100(1-a) percent:
• $\bar{x}-\bar{y} \pm t_{\alpha/2,v} \sqrt{\frac{s_1^2}{m}+\frac{s_2^2}{n}}$
• very hard to calculate type II errors here
• paired data - not independent
• we do a one-sample t test on the differences
• do pairing when large correlation within experimental units
• do independent-samples when correlation within pairs is not large
• proportions when m and n both large:
• $Z=\frac{\hat{p_1}-\hat{p_2}}{\sqrt{\hat{p}\hat{q}(\frac{1}{m}+\frac{1}{n})}}$ where $\hat{p}=\frac{m}{m+n}\hat{p_1}+\frac{n}{m+n}\hat{p_2}$, $\hat{q}=1-\hat{p}$
• bootstrap - computationally compute by taking samples, can use percentile intervals (sort and then pick nth from bottom/top)
• permutation tests - permute the labels on the data - p-value is the fraction of arrangements that are at least as extreme as the value computed for the original data
• for testing if two variances are equal, use $F_{\alpha,m-1,n-1}$
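The permutation test described above, sketched exactly (enumerating every relabeling rather than sampling them, which is feasible only for small samples; helper name and data are my own):

```python
from itertools import combinations
from statistics import mean

def permutation_test(x, y):
    """Exact two-sample permutation test for a difference in means:
    p-value = fraction of relabelings whose |mean difference| is at
    least as extreme as the observed one."""
    observed = abs(mean(x) - mean(y))
    pooled = x + y
    n, total = len(x), len(pooled)
    count = n_perms = 0
    for idx in combinations(range(total), n):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(total) if i not in chosen]
        if abs(mean(g1) - mean(g2)) >= observed - 1e-12:
            count += 1
        n_perms += 1
    return count / n_perms

# clearly separated groups: only the original split and its mirror are as extreme,
# so p = 2 / C(6,3) = 2/20
print(permutation_test([1.0, 1.1, 0.9], [3.0, 3.2, 2.9]))  # → 0.1
```

The `- 1e-12` guards against float round-off when comparing the observed labeling against itself.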

# ANOVA

• ANOVA - analysis of variance

### Regression and Correlations

• y - called dependent, response variable
• x - independent, explanatory, predictor variable
•  notation: $E(Y|x^*) = \mu_{Y\cdot x^*} =$ mean value of Y when $x = x^*$
• Y = f(x) + $\epsilon$
• linear: $Y=\beta_0+\beta_1 x+\epsilon$
• logistic: $odds = \frac{p(x)}{1-p(x)}=e^{\beta_0+\beta_1 x}$
• we minimize least squares: $SSE = \sum_{i=1}^n (y_i-(b_0+b_1x_i))^2$
• $b_1=\hat{\beta_1}=\frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2} = \frac{S_{xy}}{S_{xx}}$
• $b_0=\bar{y}-\hat{\beta_1}\bar{x}$
• $S_{xy}=\sum x_iy_i-\frac{(\sum x_i)(\sum y_i)}{n}$
• $S_{xx}=\sum x_i^2 - \frac{(\sum x_i)^2}{n}$
• residuals: $y_i-\hat{y_i}$
• SSE = $\sum y_i^2 - \hat{\beta}_0 \sum y_i - \hat{\beta}_1 \sum x_iy_i$
• SST = total sum of squares = $S_{yy} = \sum (y_i-\bar{y})^2 = \sum y_i^2 - (\sum y_i)^2/n$
• $r^2 = 1-\frac{SSE}{SST}=\frac{SSR}{SST}$ - proportion of observed variation that can be explained by regression
• $\hat{\sigma}^2 = \frac{SSE}{n-2}$
• $T=\frac{\hat{\beta}_1-\beta_1}{S / \sqrt{S_{xx}}}$ has a t distr. with n-2 df
• $s_{\hat{\beta_1}}=\frac{s}{\sqrt{S_{xx}}}$
• $s_{\hat{\beta}_0+\hat{\beta}_1x^*} = s\sqrt{\frac{1}{n}+\frac{(x^*-\bar{x})^2}{S_{xx}}}$
• sample correlation coefficient $r = \frac{S_{xy}}{\sqrt{S_{xx}}\sqrt{S_{yy}}}$
• this is a point estimate for population correlation coefficient = $\frac{Cov(X,Y)}{\sigma_X\sigma_Y}$
• make fisher transformation - this test statistic also tests correlation
• degrees of freedom
• one-sample T = n-1
• T procedures with paired data - n-1
• T procedures for 2 independent populations - use formula ~= smaller of n1-1 and n2-1
• variance - n-2
• use z-test if you know the standard deviation