Course:CPSC317/2013WT1/Lecture Notes

From UBC Wiki

Lecture 16

Routing

Admin

  • Exam
    • Mean:
  • Regrading
    • If there’s an addition error, let me know
    • Otherwise, regrading requests needs to be for 4 or more points
      • Takes a change of 5.5 points on the midterm to change your course mark by 1 pt
    • Rationale has to be in an email to me


  • If you didn’t do well
    • Don’t panic, assignments, final and participation are worth 82% or your grad
    • Come see me regularly at office hours
  • Course grading scheme
    • Assignments 44%
    • Final Exam 36%
    • Midterm 18%
    • PeerReview 4%


  • Percentages may move by 1 or 2 points


  • The midterm and/or final marks may be scaled up a bit if the course average is lower than I’d like

Last Lecture

  • IP layer uses the FT to find two things


  • FT:
    • CIDB block à? outgoing InterfaceID/IP and Next Hop IP


  • The link layer at that interface knows it’s own MAC address
    • but does not know the MAC address of the Next Hop IP


  • ARP module sits betw IP layer and link


  • ARP:
    • Next Hop IP à? dest MAC addr


  • Now link layer has both src and dest MAC addrs ==== Now let’s get back to routing ====


  • Data plane: mostly about forwarding
    • Uses the Forwarding Table data structure


  • Control plane: mostly about “routing”
    • Routing is the process by which routers collectively compute forwarding tables such that:
      • Every host can reach every other host
      • There are no cycles, except perhaps for short periods
    • Routing will use a data structure called a routing table
      • Much like a forwarding table but may have more info in it
      • Will depend on the routing protocol

Where do Forwarding Tables Come From?

  • Forwarding table entries can be statically configured
    • E.g., “map 12.34.158.0/24 to Serial0/0.1”


  • But, this doesn’t adapt
    • To failures
    • To new equipment
    • To the need to balance load


  • That is where routing protocols come in

Routing

  • Routing on the Internet is hierarchical
    • The internet is broken up into Autonomous Systems
      • An AS is a large IP network within a single administrative domain"
      • One organizations can have several Ases"
    • Not all networks are in an AS"
      • Some networks are stub IP networks"
      • They’re attached to exactly one AS"


  • Two main levels of routing hierarchy
    • How to route between ASes"
      • Interdomain routing or Exterior Gateway Protocols "
    • How to route within an AS or stub IP network"
      • Intradomain routing or Interior Gateway Protocols

Interdomain routing

  • As a first cut, think of each AS as “meta router”


  • Think about a given AS


  • It is connected to several neighboring ASes


  • For a given CIDR block, the given AS needs to decide on the “next hop” AS


  • For a given destination address block, the goal is for the next hop edges to form a directed tree where all paths lead to the “root” AS

Intradomain routing

  • Just worries about CIDR blocks within the AS
    • and maybe those in connected stub networks


  • The routers can exchange detailed proprietary topology and network info, e.g., link speeds


  • The paths taken to a subnet in the AS must be only through routers in the AS

Putting them together

  • Eventually you have to glue the intradomain routing decisions and the interdomain routing decisions together
    • More on that later

Routing

  • There are two main techniques for routing
    • Link State routing
      • Crucially uses Dijkstra’s algorithm for single source shortest path
      • Used in Intradomain routing
    • Distance Vector routing
      • Used in Intradomain for small networks
      • A variant used in Interdomain
  • There are two main techniques for routing
    • Link State routing
      • Crucially uses Dijkstra’s algorithm for single source shortest path
      • Used in Intradomain routing
    • Distance Vector routing
      • Used in Intradomain for small networks
      • A variant used in Interdomain

Link State Routing

  • We’ll abstract away the edge subnets and just talk about routing on a graph
  • Four steps 1. Link State Detection:
      • Each router keeps track of its incident links


  • Whether the link is up or down


  • The cost on the link 2. Link State Flooding:
      • Nodes collectively broadcast topology updates


  • After which each node has its own copy of the topology 3. Local Shortest Path Computation:
      • Every node, e.g., u, locally computes the shortest path between it and all the other nodes in the graph 4. FT Update: Every node updates it forwarding table
      • For every v, u uses the first edge on its path to v in it’s FT


  • E.g., if the first edge on the shortest path from u to v is (u,x) then u’s forwarding table for v is v: (u,x)"
  • Link state often used for intradomain routing


  • Example link state intradomain protocols
    • Open Shortest Path First (OSPF)
    • Intermediate System ** Intermediate System (IS-IS)
  • Four steps 1. Each router keeps track of its incident links
      • Whether the link is up or down
      • The cost on the link 2. Nodes collectively broadcast topology updates

(flooding)

      • After which each node has its own copy of the topology 3. Every node, e.g., u, locally computes the shortest path between it and all the other nodes in the graph 4. Every node updates it forwarding table
      • For every v, u uses the first edge on its path to v in it’s FT


  • E.g., if the first edge on the shortest path from u to v is (u,x) then u’s forwarding table for v is v: (u,x)"

Shortest-Path Tree

  • Shortest-path tree from u * Forwarding table at u u! v! w! x! y! z! s! t! v (u,v) w (u,w) x (u,w) y (u,v) z (u,v) link s (u,w) t (u,w)

Link State Routing

  • Four steps 1. Each router keeps track of its incident links
      • Whether the link is up or down
      • The cost on the link 2. Nodes collectively broadcast topology updates

(flooding)

      • After which each node has its own copy of the topology 3. Every node, e.g., u, locally computes the shortest path between it and all the other nodes in the graph 4. Every node updates it forwarding table
      • For every v, u uses the first edge on its path to v in it’s FT


  • E.g., if the first edge on the shortest path from u to v is (u,x) then u’s forwarding table for v is v: (u,x)"
  • For shortest path problem we use either the hop count or more generally edges can have different costs:
    • E.g., cost might be inversely proportional to link rate


  • We assume that the costs are not a function of current traffic in the network
    • But might be a function of long term averages of traffic

Single Source Shortest-Path Problem

  • Given: network topology with link costs
    • c(x,y): link cost from node x to node y
    • Infinity if x and y are not direct neighbors


  • Compute: least-cost paths to all nodes
    • From a given source u to all other nodes u! v!

Dijsktra’s Algorithm

1 Initialization: 2 S0 = {u} 3 for all nodes v 4 if (v is adjacent to u) 5 D0(v) = c(u,v) 6 else D0(v) = 8 For i=1 to N 9 find w not in Sk-1 with the smallest Dk-1(u,w) 10 add w to S: Sk = Sk-1 union {w} 11 update D(u,v) for all v adjacent to w and not in S: 12 Dk(u,v) = min{Dk-1(u,v), Dk-1(u,w) + c(w,v)} 13 Scope of For

Link State Routing

  • Four steps 1. Each router keeps track of its incident links
      • Whether the link is up or down
      • The cost on the link 2. Nodes collectively broadcast topology updates

(flooding)

      • After which each node has its own copy of the topology 3. Every node, e.g., u, locally computes the shortest path between it and all the other nodes in the graph 4. Every node updates it forwarding table
      • For every v, u uses the first edge on its path to v in it’s FT


  • E.g., if the first edge on the shortest path from u to v is (u,x) then u’s forwarding table for v is v: (u,x)"

Detecting Topology Changes

  • Beaconing
    • Periodic “hello” messages in both directions
    • Detect a failure after a few missed “hellos”


  • Performance trade-offs
    • Detection speed
    • Overhead on link bandwidth and CPU
    • Likelihood of false detection

“hello”!


Learning The Topology

  • The routers are using the Link State Protocol to build the routing table
    • They can’t assume they have a routing table to use to forward their local topology information to the other nodes on the network


  • How do the nodes communicate local topology information?

Broadcasting Link State

  • Idea: Flooding
    • Each node sends out it’s local link state to it’s neighbors
      • The neighbors rebroadcast it
    • In general, if a node gets link state info it rebroadcasts it to its neighbors


  • Eventually, every nodes gets info about all of the links


  • Problem: exponential number of packets
    • When do you stop rebroadcasting?


  • Solution: don’t rebroadcast an update over the link it came in on
    • Ensures an update travels on each edge exactly once
    • Can piggyback updates on one another

Broadcasting the Link State

  • Flooding
    • Node sends link-state information out its links
    • And then the next node sends out all of its links
    • … except the one where the information arrived X A C B D

(a) X A C B D (b) X A C B D (c) X A C B D (d)

  • Problem: if a node broadcasts its link state periodically, an older broadcast may reach another node at a later time than a more recent broadcast
    • How to prevent nodes from using “old” updates?


  • Solution: time-stamps and/or sequence numbers
    • They have their own problems

When to Initiate Flooding

  • Topology change
    • Link or node failure
    • Link or node recovery


  • Configuration change
    • Link cost change


  • Periodically
    • Refresh the link-state information
    • Typically (say) 30 minutes
    • Corrects for possible corruption of the data


  • Challenges
    • Packet loss
    • Out-of-order arrival


  • Reliable flooding
    • Ensure all nodes receive link-state information
    • … and that they use the latest version


  • Solutions
    • Acknowledgments and retransmissions
    • Sequence numbers
    • Time-to-live for each packet


  • Protocols are complex

When the Routers Disagree

(during transient periods)!

Convergence

  • Getting consistent routing information to all nodes
    • E.g., all nodes having the same link-state database


  • Consistent forwarding after convergence
    • All nodes have the same link-state database
    • All nodes forward packets on shortest paths
    • The next router on the path forwards to the next hop

Transient Disruptions

  • Detection delay
    • A node does not detect a failed link immediately
    • … and forwards data packets into a “blackhole”
    • Depends on timeout for detecting lost hellos


  • Inconsistent link-state database
    • Some routers know about failure before others
    • The shortest paths are no longer consistent
    • Can cause transient forwarding loops 4

Convergence Delay

  • Sources of convergence delay
    • Detection latency
    • Flooding of link-state information
    • Shortest-path computation
    • Creating the forwarding table


  • Performance during convergence period
    • Lost packets due to blackholes and TTL expiry
    • Looping packets consuming resources
    • Out-of-order packets reaching the destination


  • Very bad for VoIP, online gaming, and video

Reducing Convergence Delay

  • Faster detection
    • Smaller hello timers
    • Link-layer technologies that can detect failures


  • Faster flooding
    • Flooding immediately
    • Sending link-state packets with high-priority


  • Faster computation
    • Faster processors on the routers
    • Incremental Dijkstra’s algorithm


  • Faster forwarding-table update
    • Data structures supporting incremental updates

Scaling Link-State Routing

  • Overhead of link-state routing
    • Flooding link-state packets throughout the network
    • Running Dijkstra’s shortest-path algorithm


  • Introducing hierarchy through “areas” Area Area 1 Area Area 3 Area area border router

Where are we?

  • IP Control Plane: Routing
    • Link State protocol family
      • Flooding link state + local computation of shortest paths
      • Used for Intra-AS routing
      • When links come up and down, black holes and loops possible until convergence
    • Distance Vector protocol family
      • Share with your neighbors what you think your distance is to the other nodes in the graph
      • Used for Intra-AS and Inter-AS routing

Distance Vector Protocol

  • Abstract away IP addresses and subnet addresses again
    • We’ll just concentrate on a graph


  • Each node in the graph is going to be running the distance vector protocol
    • The goal is very similar as to the Link State protocol:
      • The protocol in node v is trying to compute the cost of the least cost path to every other node in the network, and the outgoing edge for that path
      • Unlike a link state protocol v does not compute the entire path itself
  • In DV, node v keeps it’s current estimate of the cost of the least cost paths to every other node in the network.
    • This is v’s “distance vector”
    • When v’s distance vector changes, it shares it’s distance vector with its neighbors
    • Conversely, when v receives updates from neighbors, it updates its distance vector and sends updates to its neighbors if necessary

Distance Vector Algorithm

Distributed:

  • Each node performs its iterations locally and asynchronously Iterative: each local iteration caused by:


  • Local link cost change


  • Distance vector update message from neighbor Asynchronous:


  • Each node notifies neighbors when its DV changes, or at least according to a local timer wait for (change in local link cost or message from neighbor) recompute estimates if distance to any destination has changed, notify neighbors Each node:

How do the updates work?

  • Define distance between two nodes in a graph
    • D(x,y) = cost of least-cost path from x to y

= minimum over all paths p from x to y of {the cost of p}

    • Let N(y) be the set of all neighbors of y


  • Bellman Ford Equation
    • D(x,y) = min over w in N(y) of { D(x,w) + c(w,y) }

= min over w in N(x) of { c(x,w) + D(w,y) }

  • We’re still at a prototypical node v.


  • Rather than worrying about v’s distance to every other node—it’s whole distance vector--let’s just focus on just one u.


  • Consider the neighbors of v: w1, w2,…,wk.


  • v knows the cost of its outgoing links: c(v,w1), c(v,w2),…


  • For each neighbor, v stores what that neighbor last told v about it’s current estimate of the distance to u


  • When v gets an update it uses the BF eq’s to calculate it’s its new estimate
    • D(v,u) = min over w in N(v) of { c(v,w) + D(w,u) }

Distance Vector Algorithm

Distributed:

  • Each node performs its iterations locally and asynchronously Iterative: each local iteration caused by:


  • Local link cost change


  • Distance vector update message from neighbor Asynchronous:


  • Each node notifies neighbors when its DV changes, or at least according to a local timer wait for (change in local link cost or message from neighbor) recompute estimates if distance to any destination has changed, notify neighbors Each node:

Bellman Ford

  • Why is it true?


  • In the definition of D(x,y) we need to take the minimum over *every* path between x and y.


  • Let’s do the following 1. Let’s partition all of the paths into disjoint sets 2. Find the minimum cost of a path in each partition 3. Take the minimum over the minimum in each partition
    • This is equivalent to the definition

1. What partition should we use?

    • For each w in N(y), consider the paths through w. This is a partition of all the paths from x to y 2. What’s the minimum cost in each partition?
    • Since each path goes through (w,y) each path shares the additive cost of c(w,y). The minimum of this partition is just the minimum cost from x to w + c(w,y) which is D(x,w) + c(w,y) 3. What’s the minimum over these partition values?
    • Min over w in N(y) of {D(x,w) + c(w,y)}

DV protocol!

  • Quick note:
    • DV often described as neighbors exchanging the entire distance vector (their current estimate to every other node in the network)
    • In some variants only the distance estimate that have changed are sent in updates
    • Updates can be
      • in response to changes in the topology,
      • Periodic
      • Or both

Lecture 15

IP FORWARDING AND ADDRESSING

IP ADDRESS (IPV4)

  • A unique 32-bit number


  • Represented in dotted-decimal notation
    • Each byte represented by a integer between 0 and


  • E.g., 12.34.158.00001100 00100010 10011110 12 34 158 ====HIERARCHICAL ADDRESSING: IP PREFIXES====


  • An IP address is divided into network & host portions (left and right)


  • The network portion is often called the network prefix 11001100 00100010 10011110 Network (24 bits) Host (8 bits) 204 34 158


  • Host whose IP addresses share the same network portion/prefix are part of a subnetwork


  • IP address that share the same network prefix form a block of addresses, e.g., [204.34.158.0, 204.34.158.255] 11001100 00100010 10011110 Network (24 bits) Host (8 bits) 204 34 158

Forwarding Table Look-up

  • The forwarding table is a mapping from subnetworks (i.e., blocks of addresses) to outgoing links


  • When a destination IP address comes in:
    • the router checks which subnetwork the IP address is in and retrieves the outgoing link.
      • It does that by comparing the prefix of the destination IP address with the network prefixes that define the subnetworks in its table
      • Can think of the prefix as the most significant bits of the address

CLASS-BASED ADDRESS BLOCKS

  • A Class A address block:
    • X.0.0.0 where
      • X is a byte with leading * A Class B address block:
    • X’.Y.0.0 where
      • X’ is a byte with leading o Y can be anything


  • A Class C address block:
    • X’’.Y.Z.0 where
      • X’’ is a byte with leading o Y, and Z can be anything

CLASSLESS-BASED ADDRESSING

  • Size of Class-based blocks not flexible enough


  • New scheme in 1993: classless-addressing
    • Explicitly encode the number of bits being used for the network prefix

CLASSLESS INTER-DOMAIN ROUTING (CIDR)

IP Address : 12.4.0.0 IP Mask: 255.254.0.Address 00001100!00000100!00000000!00000000! Mask 11111111!11111110!00000000!00000000! Network Prefix for hosts Use two 32-bit numbers to represent a network. CIDR block = (Network IP address, Mask) Written as 12.4.0.0/

SUBNET MASKS

  • With the subnet mask the router can quickly:
    • perform a bit-wise-and of the mask and the destination IP address
    • and then compare the result to the Network IP address

CIDR BLOCKS

  • Note that a CIDR Block address really denotes a block of contiguous addresses:


  • E.g., 193.34.158.0/23!
    • The last 32-23 = 9 bits are wildcards!
    • That’s 512 IP addresses!
    • [193.34.158.0 to 193.34.160.0) = !
    • [ 193.34.158.0 to 193.34.159.255]

DELEGATION AND AGGREGATION

  • CIDR block addressing allows for flexible block sizes:
    • To better match the needs of each network


  • Address blocks delegated down from:
    • IANA to RIRs to chain of provider/customer relationships


  • Allows FTs to aggregate address blocks


  • But need the longest prefix match rule to break ties

LONGEST PREFIX MATCH FORWARDING

  • Forwarding tables in IP routers
    • Maps each IP prefix to output interfaces


  • When a dest IP matches more than one network prefix/CIDR block
    • Choose the CIDR block with the longest prefix 4.0.0.0/4.83.128.0/201.10.0.0/201.10.6.0/126.255.103.0/201.10.6.destination forwarding table Serial0/0.outgoing link

LONGEST PREFIX MATCH

  • Recall address prefixes/CIDR blocks are blocks of addresses


  • The “longest prefix match” is the smallest or most specific address block to which the dest IP belongs

WORKING WITH CIDR ADDRESSES

/n Length of prefix

  1. of

wildcard bits Size of block Size of block in base Size - /30 2 22 = 4 0.0.0.4 0.0.0./29 3 23=8 0.0.0.8 0.0.0./28 4 24=16 0.0.0.16 0.0.0./27 5 25=32 0.0.0.32 0.0.0./26 6 26=64 0.0.0.64 0.0.0./25 7 27=128 0.0.0.128 0.0.0./24 8 28=256 0.0.1.0 0.0.0./23 9 29=21*28=512 0.0.2.0 0.0.1./22 10 210=22*28=1024 0.0.4.0 0.0.3./21 11 211=23*28=2048 0.0.8.0 0.0.7.

  • Determine if 221.75.18.5 is in: a. 221.72.0.0/b. 221.73.128.0
  • What is the range of addresses represented by: a. 221.72.0.0/** 32-14 = 18 wildcard bits
    • So the address block has 2^{18} addresses starting at 221.72.0.** 2^18 = 2^2*2^{16} = 0.4.0.** So, the last address (exclusive) is
    • 221.72.0.0 + 0.4.0.0 = 221.76.0.** The range is
    • [221.72.0.0, 221.76.0.0) =
    • [221.72.0.0, 221.75.255.255]
  • What is the range of addresses represented by: b. 221.73.128.0/** 32-17 = 15 wildcard bits
    • So the address block has 2^{15} addresses starting at 221.73.128.** 2^15 = 2^7*2^8 = 0.0.128.** So, the last address (exclusive) is
    • 221.73.128.0 + 0.0.128.0 = 221.74.0.** The range is
    • [221.73.128.0, 221.74.0.0) =
    • [221.73.128.0, 221.73.255.255]

IP ADDRESSES FOR INTERFACES

  • IP addresses are for network interfaces not for hosts


  • Why?

IP ADDRESSES ARE FOR INTERFACES!

  • IP addresses are for network interfaces not for hosts


  • Why?


  • If a router is going to route packets between two networks, it has to have an address in each network 223.1.1.223.1.1.223.1.1.223.1.1.4 223.1.2.223.1.2.223.1.2.223.1.3.1 223.1.3.223.1.3.223.1.1.0/24 223.1.2.0/223.1.3.0/

SUBNETS

  • IP address:
    • subnet part (high order bits)
    • host part (low order bits)


  • What’s a subnet ?
    • device interfaces with same subnet part of IP address
    • can physically reach each other without intervening router 223.1.1.223.1.1.223.1.1.223.1.1.4 223.1.2.223.1.2.223.1.2.223.1.3.1 223.1.3.223.1.3.network consisting of 3 subnets subnet 223.1.1.0/223.1.2.0/223.1.3.0/
  • To determine the physical subnets, detach each router interface, creating islands of isolated networks.


  • The IP addresses of all interfaces in a subnet
    • share the same address prefix
    • are in the same address block 223.1.1.0/24 223.1.2.0/223.1.3.0/

How many? 223.1.1.223.1.1.223.1.1.223.1.2.1 223.1.2.223.1.2.223.1.3.1 223.1.3.223.1.3.223.1.1.223.1.7.223.1.7.223.1.8.1 223.1.8.223.1.9.223.1.9.

SPECIAL IP ADDRESSES

  • Private address blocks
    • 10.0.0.0/8, 172.16.0.0/12,192.168.0.0/** Can be used inside a private network
    • A router is never supposed to forward a packet with a private source or dest addr to the public Internet
    • A router sitting between a private and public network has to do something clever to translate between private and public addresses—more on this later in the course


  • Loopback
    • 127.0.0.0/** Can be used inside of a host (e.g., web server and browser running on the same host)


  • Directed Multicast
    • Network part followed by all 1s
    • Directed broadcast to a specific network:
    • Only as dest addr
    • E.g. 12.34.158.255 in 12.34.158.0/* In-network multicast
    • All 1’s
    • Can never be forwarded outside of the subnet


  • “This host”
    • When a host doesn’t know its IP address but needs to put something in the source addr field
    • 0.0.0.* And a few others….

NEXT HOP

  • We’ve been saying that a forwarding table is a mapping from an address prefix to an outgoing interface
    • What happens when several routers are connected to a LAN?
      • Need a way to say which of the other routers should receive the packet
    • This is common in “peering points” where routers of multiple ISPs share packets back and forth
    • Also true for wifi and classic ethernet LANS
    • The IP should not make any assumptions about whether the link layer is pt to pt or broadcast


CIDR block IP interface Interface ID Next hop Flags 201.98.20.0/10 25.8.16.1 Eth0 25.8.16. Simplified Forwarding Table!


    • What happens at the destination router?
      • The subnet in which the destination host resides is called the directly connected network
      • How do you know you’re there?
      • What’s the next hop?

DIRECTLY CONNECTED

CIDR block IP interface Interface ID Next hop Flags 201.98.20.0/10 25.8.16.1 Eth0 25.8.16.15.22.8.0/25 15.22.8.5 Eth1 --- Simplified Forwarding Table!

NEXT HOP

    • What happens at the destination router?
      • How do you know you’re there?
    • If your dest IP addr matches on a CIDR block AND
    • The IP addr of the output IP Interface in that row of the FT is in the CIDR block AND
    • There is no value in the next hop field
      • Sometimes this is also indicated by a flag in that row of the FT
      • What is the next hop?
    • The next hop in this case is just the dest IP addr of the datagram

Connection between Link and IP layer

  • Before moving on with routing, I want to back up just a bit


  • We talked about:
    • IP addresses associated with interfaces
      • So every interface has an associated MAC address and IP address
    • The notion of an IP subnet
    • The need for “next hop” in the routing table
    • A little bit about what the destination router does

Subnets and LANs

  • When you go hop by hop at the network layer across an IP network:
    • The source and destination IP addresses stay the same the entire time
    • BUT
    • The source and destination MAC address (both) change at each hop

Example: A Sending a Packet to B

How does host A send an IP packet to B? A R B A sends packet to R, and R sends packet to B.!

Host A Sends Packet Through R

  • IP header
    • Source: 111.111.111.** Dest: 222.222.222.A R B


A R B A’s FT gives 111.111.111.110 as Next Hop for 222….

Next Hop

CIDR block IP interface Interface ID Next hop Flags 222.222.222.128/25 111.111.111.111 Eth0 111.111.111.Dest IP 222.222.222.222 matches this row of the forwarding table!

Crafting the link layer frame

CIDR block IP interface Interface ID Next hop Flags 222.222.222.128/25 111.111.111.111 Eth0 111.111.111.* Passing the IP packet down to the link layer for Eth0!


  • IP interface/Interface ID à? src MAC addr!


  • Next hop IP à? dest MAC addr !

!

  • Host has map of Next hop IP to dest MAC addr !


  • ARP table/ARP protocol!

Host A Sends Packet Through R

  • IP header
    • Source: 111.111.111.** Dest: 222.222.222.* Ethernet frame
    • Source: 74-29-9C-E8-FF-** Dest: E6-E9-00-17-BB-4B A R B

R Decides how to Forward Packet

  • Router R’s adapter receives the packet
    • R extracts the IP packet from the Ethernet frame
    • R sees the IP packet is destined to 222.222.222.A R B
  • Router R consults its forwarding table
    • Dest addr matches 222.222.222.0/24 row of FT
    • Outgoing interface: 222….** Directly connected so Next hop = dest IP A R B

Next Hop

CIDR block IP interface Interface ID Next hop Flags 222.222.222.128/25 222.222.222.220 Eth1 --- Dest IP 222.222.222.222 matches this row of the forwarding table!

R constructs frame

  • Outgoing interface 222..220à?src MAC addr 1A::9B


  • Next hop = dest IP =222..220à?dest MAC addr 49::2A A R B

Router R Wants to Forward Packet

  • IP header
    • From A: 111.111.111.** To B: 222.222.222.* Ethernet frame
    • From R: 1A-23-F9-CD-06-9B
    • To B: 49-BD-D2-C7-56-2A A R B

Question

  • How does the host know the IP address of the gateway router for its subnet?


  • How does the router know the MAC addresses of the IP hosts on its directly connected subnets?


  • Two protocols are needed:
    • ARP: Address Resolution Protocol
    • DHCP: Dynamic Host Configuration Protocol


  • We’ll cover DHCP later in the term

Address Resolution Protocol Table

  • Every node maintains an ARP table
    • A mapping from IP address to MAC address


  • Consult the table when sending a packet
    • Get Next Hop IP from IP Forwarding table
    • Map that IP address to a MAC address via the ARP Table
    • Encapsulate IP packet in a frame using that MAC addr as the dest addr

ARP Protocol

  • But, what if the IP address is not in the table?


  • Via link layer broadcast sender asks: “Who has IP address 1.2.3.156?”
    • Special ARP query packet encapsulated in a broadcast link layer frame
    • Broadcast means dest MAC addr is ff:..:ff


  • Node/Interface with IP addr in query responds: “I have that IP addr and my MAC address is 58-23-D7-FA-20-B0” in an ARP response
    • The ARP response packet is sent via a non-broadcast frame back to the querier
    • Sender caches the result in its ARP table


  • No need for network administrator to get involved

ARP Table Size

  • If ARP had to keep an entry for every IP outside of its subnet, its table could get large


  • But it only needs to know the MAC addr for the next hop IP
    • For a stub subnetwork, Next Hop IP is either the dest IP address or the IP address of the gateway router


  • How does a host know, e.g., after boot,
    • What is the IP address is of my gateway router?
    • How do I know whether a dest IP should go to the gateway router or whether it is the Next Hop IP?
      • i.e., how do I know what subnet/CIDR block I’m sitting in?


  • Host will learn both of these from DHCP

TOPICS

  • Circuit Switching


  • Switching inside the hub, e.g., crossbar switch


  • Multiplexing: FDM, TDM, Stat muxing


  • Packet routing; Hop by hop routing; store and forward routing


  • Internet Architecture philosophy:
    • End to End philosophy: network routers pretty much just forward packets
    • Internetworking: make it easy to glue local networks together


  • Network Effect


  • Main tradeoffs be packet routing and circuit switching


  • 5 layer reference model: what layer is responsible for what


  • Reliability techniques:
    • Error Detection/correction
    • Acknowledgement protocols


  • Link layer protoocls
    • Classic ethernet, switched ethernet, wifi


  • IP forwarding and addressing


  • IP and Link Layer working together

Lecture 14

IP FORWARDING AND ADDRESSING

IP PACKET STRUCTURE

4-bit Version 4-bit Header Length 8-bit Type of Service (TOS) 16-bit Total Length (Bytes) 16-bit Identification 3-bit Flags 13-bit Fragment Offset 8-bit Time to Live (TTL) 8-bit Protocol 16-bit Header Checksum 32-bit Source IP Address 32-bit Destination IP Address Options (if any) Payload

IP HEADER: TO AND FROM ADDRESSES

  • Two IP addresses
    • Source IP address (32 bits)
    • Destination IP address (32 bits)


  • Destination address
    • Unique identifier for the receiving host
    • Allows each node to make forwarding decisions


  • Source address
    • Unique identifier for the sending host
    • Recipient can decide whether to accept packet
    • Enables recipient to send a reply back to source

IP ADDRESS (IPV4)

  • A unique 32-bit number


  • Represented in dotted-decimal notation
    • Each byte represented by a integer between 0 and


  • E.g., 12.34.158.00001100 00100010 10011110 12 34 158 ====SCALABILITY CHALLENGE====

HIERARCHICAL ADDRESSING: IP PREFIXES

  • An IP address is divided into network & host portions (left and right)


  • The network portion is often called the network prefix 11001100 00100010 10011110 Network (24 bits) Host (8 bits) 204 34 158
  • Host whose IP addresses share the same network portion/prefix are part of a subnetwork


  • IP address that share the same network prefix form a block of addresses, e.g., [204.34.158.0, 204.34.158.255] 11001100 00100010 10011110 Network (24 bits) Host (8 bits) 204 34 158

Forwarding Table Look-up

  • The forwarding table is a mapping from subnetworks (i.e., blocks of addresses) to outgoing links


  • When a destination IP address comes in, the router checks which subnetwork the IP address is in and retrieves the outgoing link.
    • It does that by comparing the prefix of the destination IP address with the network prefixes that define the subnetworks in its table
    • Can think of the prefix as the most significant bits of the address

LOCAL MANAGEMENT

  • Can locally add hosts to a subnet w/o needing any FT’s to be updated host! host! host! LAN 1!

...! host! host! host! LAN 2! ...! router! router! router! WAN! WAN! 1.2.3.4 1.2.3.7 1.2.3.156 5.6.7.8 5.6.7.9 5.6.7.1.2.3.* 5.6.7.*

Hierarchical Addressing and Forwarding

  • Similarity with Postal Addresses/Forwarding
    • Both post offices and routers use the “most significant bits” of the hierarchical address to make their forwarding decision
      • They ignore the low order bits


  • E.g., a regional PO ignores the street address and focuses on city name, county, province, country."


  • A router ignores the host portion of the address"


  • Differences
    • Postal addresses and forwarding have many levels of hierarchy
    • IP addresses just have one
      • But there is more than one level of hierarchy in the system

WHICH PREFIX

  • Suppose a packet with dest IP addr comes in:
    • X.Y.Z.W


  • Suppose there are subnets in the routing table of the form:
    • X.*.*.*
    • X.Y.*.*
    • X.Y.Z.*


  • What to do?
    • Try #1: Disallow this
    • Add rules so that the three size subnets can’t be prefixes of one another

ADDRESS BLOCK SIZES: EARLY DAYS

  • Class-Based Addressing
    • Class A:
      • X.*.*.* ~2^{24} = ~16 Mil addresses/block
      • X = any 8 bits starting with 0 128 Class A address blocks
    • Class B:
      • X’.Y.*.* ~2^{16} = ~64K addresses/block
      • X’ = any 8 bits starting with 10 2^{14} = ~16K address blocks
    • Class C:
      • X’’.Y.Z.* ~2^{8} = ~256 addresses/block
      • X’’ = any 8 bits starting with 110 2^{21} = ~2 Mil address blocks

REPRESENTATION IN THE FT

  • A Class A address block:
    • X.0.0.0 where X is a byte with leading 0: 0*******


  • A Class B address block:
    • X’.Y.0.0 where X’ is a byte with leading 10: 10******


  • A Class C address block:
    • X’’.Y.Z.0 where X’’ is a byte with leading 110: 110

TABLE LOOK UP

  • When an IP packet comes in:
    • If first bit of dest addr = o Compare first 8 bits with all the Class A prefixes in the table
    • Else if first two bits are o Compare the first 16 bits with all of the Class B prefixes in the table
    • Else if the first three bits are o Compare the first 24 bits with all of the Class C prefixes


  • “Compare with all” means find a match as efficiently as possible

PROBLEMS WITH CLASS-BASED ADDRESSING

  • Suppose an organization/subnet had 25K machines


  • A class B address would overprovision by

~40K address

  • Would need ~100 Class C addresses


  • Routing tables were getting too big

MORE FLEXIBLE ADDRESSING SCHEME

  • Needed a more flexible addressing scheme
    • Needed more address block sizes


  • What about an address block for every power of two?


  • Instead of using a prefix free encoding in the address, use an additional number
    • Use an explicit encoding of the number of bits of the network prefix
      • Which also gives an explicit encoding of the number of wildcard bits for the host portion of an IP address

CLASSLESS INTER DOMAIN ROUTING

  • The human readable way of specifying that the network prefix is n bits:
    • a.b.c.d/n, e.g, 193.34.158.0/23 !
    • The bit representation of a.b.c.d is assumed have 32-n lsb 0’s.


  • A router does not store the network address prefix as 193.34.158.0/23!


  • How does it store it?
    • It stores it as a pair of 32 bit numbers
      • an IP address and a subnet mask

CLASSLESS INTER-DOMAIN ROUTING (CIDR)

IP Address : 12.4.0.0 IP Mask: 255.254.0.Address 00001100!00000100!00000000!00000000! Mask 11111111!11111110!00000000!00000000! Network Prefix for hosts Use two 32-bit numbers to represent a network. CIDR block = (Network IP address, Mask) Written as 12.4.0.0/

SUBNET MASKS

  • With the subnet mask the router can quickly perform a bit-wise-and of the mask and the destination IP address
    • and then compare the result to the Network IP address

EXAMPLE

  • Suppose
    • 193.34.158.0/23! Is in the forwarding table


  • Look up request for destination IP addr
    • 193.34.158.* Router takes the bitwise-and of
    • dest IP addr: 193.34.158.5 and!
    • the network mask: 255.255.254.0, !
    • and gets 193.34.158.0 !


  • Compares this with network IP address!
    • In this case, it gets a match with 193.34.158.0 in the forwarding table!

CIDR ADDRESSING

  • Many legal CIDR address blocks could use the same Network IP address.
    • In the example,
    • 12.4.0.0/** 12.4.0.0/** 12.4.0.0/** …
    • 12.4.0.0/* But not all prefixes sizes are legit
    • 12.4.0.0/13 is not a legit CIDR block. Why?

CIDR BLOCKS

  • Note that a CIDR Block address really denotes a block of contiguous addresses:


  • E.g., 193.34.158.0/23!
    • The last 32-23 = 9 bits are wildcards!
    • That’s 512 IP addresses!
    • [193.34.158.0 to 193.34.160.0) = !
    • [ 193.34.158.0 to 193.34.159.255] !

SCALABILITY: ADDRESS AGGREGATION

Provider is given 201.10.0.0/201.10.0.0/22 201.10.4.0/24 201.10.5.0/24 201.10.6.0/Provider


  • The FT’s in the rest of the internet only need to have 201.10.0.0./21 in them


  • The provider’s FT’s will forward packets for the smaller networks on appropriately

IANA

  • How does a provider get a CIDR block of addresses?


  • Internet Assigned Number Authority
    • IANA gives big blocks to Regional Internet Registries
    • RIR gives CIDR blocks to providers
    • Providers may give sub CIDR blocks to their customers

REGIONAL INTERNET REGISTRIES

CIDR: HIERARCHAL ADDRESS ALLOCATION

  • Prefixes are key to Internet scalability
    • Address allocated in contiguous address blocks (prefixes)
    • Routing protocols and packet forwarding based on prefixes
    • Today, routing tables contain ~350,000 prefixes

PROBLEM WITH CIDR: CUSTOMERS KEEPING ADDRESS BLOCKS

PROBLEM WITH CIDR: MULTIHOMING

* Multi-homed customer with 201.10.6.0/23 has two providers. Provider 2 is a fail-over provider
  • Provider 2 needs to have a FT entry for 201.10.6.0/23 going to the green link"
  • When the yellow subnet fails over to the green provider, the FT of some upstream routers will have both 201.10.0.0/21 and 201.10.6.0/23


  • The IP address 201.10.6.5 will match both"

PREFIX MATCH FORWARDING

  • Forwarding tables in IP routers
    • Maps each IP prefix to output interfaces


  • When a dest IP matches more than one network prefix/CIDR block
    • Choose the CIDR block with the longest prefix 4.0.0.0/4.83.128.0/201.10.0.0/201.10.6.0/126.255.103.0/201.10.6.destination forwarding table Serial0/0.outgoing link

LONGEST PREFIX MATCH

  • Recall address prefixes/CIDR blocks are blocks of addresses


  • The “longest prefix match” is the smallest or most specific address block to which the dest IP belongs

Lecture 13

IP FORWARDING AND ADDRESSING

ROUTER ARCHITECTURE OVERVIEW

Two key router functions:

  • Run routing protocols to construct their forwarding table


  • Forward datagrams from incoming to outgoing links using their forwarding table

INPUT PORT FUNCTIONS

  • Decentralized switching


  • given datagram dest., lookup output port using forwarding table in input port memory


  • goal: complete input port processing at ‘line speed’


  • queuing: if datagrams arrive faster than forwarding rate into switch fabric Physical layer: bit-level reception Data link layer: e.g., Ethernet

OUTPUT PORTS

  • Buffering required when datagrams arrive from fabric faster than the transmission rate


  • Scheduling discipline chooses among queued datagrams for transmission

IP PACKET STRUCTURE

IP HEADER: VERSION, LENGTH, TOS

  • Version number (4 bits)
    • Indicates the version of the IP protocol
    • Necessary to know what other fields to expect
    • Typically “4” (for IPv4), and sometimes “6” (for IPv6)


  • Header length (4 bits)
    • Number of 32-bit/4 byte words in the header
    • Typically “5” (for a 20-byte IPv4 header)
    • Can be more when “IP options” are used


  • Type-of-Service (8 bits)
    • Allow packets to be treated differently based on needs
    • E.g., low delay for audio, high bandwidth for bulk transfer

IP HEADER: LENGTH, FRAGMENTS, TTL

  • Total length (16 bits)
    • Number of bytes in the packet
    • Maximum size is 63,535 bytes (216 -1)
    • … though underlying links may impose harder limits


  • Fragmentation information (32 bits)
    • Packet identifier, flags, and fragment offset
    • Supports dividing a large IP packet into fragments
    • … in case a link cannot handle a large IP packet


  • Time-To-Live (8 bits)
    • Used to identify packets stuck in forwarding loops
    • … and eventually discard them from the network

IP HEADER: MORE ON TIME-TO-LIVE (TTL)

  • Potential robustness problem
    • Forwarding loops can cause packets to cycle forever
    • Confusing if the packet arrives much later


  • Time-to-live field in packet header
    • TTL field decremented by each router on the path
    • Packet is discarded when TTL field reaches 0…
    • …and “time exceeded” message is sent to the source

==== IP HEADER FIELDS: TRANSPORT PROTOCOL

  • Protocol (8 bits)
    • Identifies the higher-level protocol
      • E.g., “6” for the Transmission Control Protocol

(TCP)

      • E.g., “17” for the User Datagram Protocol (UDP)
    • Important for demultiplexing at receiving host
      • Indicates which protocol is to be given IP payload

IP HEADER: CHECKSUM ON THE HEADER

  • Checksum (16 bits)
    • Sum of all 16-bit words in the IP packet header
      • Mod 2^{16}
    • Receiving router discards corrupted packets
      • Sending host will retransmit the packet, if needed
    • Recomputed every hop since TTL changes

+ = + = Mismatch

IP HEADER: TO AND FROM ADDRESSES

  • Two IP addresses
    • Source IP address (32 bits)
    • Destination IP address (32 bits)


  • Destination address
    • Unique identifier for the receiving host
    • Allows each node to make forwarding decisions


  • Source address
    • Unique identifier for the sending host
    • Recipient can decide whether to accept packet
    • Enables recipient to send a reply back to source

IP ADDRESS (IPV4)

  • A unique 32-bit number


  • Represented in dotted-decimal notation


  • E.g., 12.34.158.00001100 00100010 10011110 12 34 158

SCALABILITY CHALLENGE

SCALABILITY IMPROVED

HIERARCHICAL ADDRESSING: IP PREFIXES

  • An IP address is divided into network & host portions (left and right)


  • The network portion is often called the network prefix


  • All the hosts with the same network portion/prefix are part of a subnetwork address block 11001100 00100010 10011110 Network (24 bits) Host (8 bits) 204 34 158

Forwarding Table Look-up

  • The forwarding table is a mapping from subnetworks (i.e., blocks of addresses) to outgoing links


  • When a destination IP address comes in, the router checks which subnetwork the IP address is in and retrieves the outgoing link.
    • It does that by comparing the prefix of the IP address with the prefixes that define the subnetworks in its table
    • Can think of the prefix as the most significant bits of the address

Hierarchical Addressing and Forwarding

  • Similarity with Postal Addresses/Forwarding
    • Both post offices and routers use the “most significant bits” of the hierarchical address to make their forwarding decision
      • They ignore the low order bits


  • E.g., a regional PO ignores the street address and focuses on city name, county, province, country."


  • A router ignores the host portion of the address"


  • Differences
    • Postal addresses and forwarding have many levels of hierarchy
    • IP addresses just have one
      • But there is more than one level of hierarchy in the system— more on this later

Lecture 17 - Routing

IP Routing

IP Routing

  • Data plane: mostly about forwarding
    • Uses the Forwarding Table data structure


  • Control plane: mostly about “routing”
    • Routing is the process by which routers collectively compute forwarding tables such that:
      • Every host can reach every other host
      • There are no cycles, except perhaps for short periods
    • Routing will use a data structure called a routing table
      • Much like a forwarding table but may have more info in it
      • Will depend on the routing protocol

Routing

  • Routing on the Internet is hierarchical
    • The internet is broken up into Autonomous Systems
      • An AS is a large IP network within a single administrative domain"
      • One organizations can have several Ases"
    • Not all networks are in an AS"
      • Some networks are stub IP networks"
      • They’re attached to exactly one AS"


  • Two main levels of routing hierarchy
    • How to route between ASes"
      • Interdomain routing or Exterior Gateway Protocols"
    • How to route within an AS or stub IP network"
      • Intradomain routing or Interior Gateway Protocols


  • There are two main techniques for routing
    • Link State routing
      • Crucially uses Dijkstra’s algorithm for single source shortest path
      • Used in Intradomain routing
    • Distance Vector routing
      • Used in Intradomain for small networks
      • A variant used in Interdomain

Distance Vector Protocol

  • In DV, node v keeps it’s current estimate of the cost of the least cost paths to every other node in the network.
    • This is v’s “distance vector”
    • When v’s distance vector changes, it shares it’s distance vector with its neighbors
    • Conversely, when v receives updates from neighbors, it updates its distance vector and sends updates to its neighbors if necessary

Distance Vector: Initialization

graphic

Bad News Travels Slowly

graphic

Poisson Reverse

  • If your best route for a destination d is via a neighbor N,
    • You tell N that the cost to get to d from you is infinity
    • You tell all of your other neighbours the actual cost of getting to d from you


  • This prevents cycles of length two during transients in the DV protocol
    • but it does not prevent longer cycles

Routing Information Protocol (RIP)

  • Distance vector protocol
    • Nodes send distance vectors every 30 seconds
    • … or, when an update causes a change in routing


  • Link costs in RIP
    • All links have cost ** Valid distances of 1 through ** … with 16 representing infinity
    • Small “infinity” à? smaller “counting to infinity” problem


  • RIP is limited to fairly small networks

Similarities of LS and DV Routing

  • Shortest-path routing
    • Metric-based, using link weights


  • As such, commonly used inside an organization
    • RIP and OSPF are mostly used as intradomain protocols

Differences btw LS and DV

Message complexity

  • LS: each link sends O(E log n) bits per round of updates


  • DV: each link sends O(n log n) bits per round of updates Speed of Convergence


  • LS: relatively fast


  • DV: convergence time varies
    • Count-to-infinity problem

What works for IGP and EGP

table

Interdomain Routing and Autonomous Systems (ASes)

EGP/Interdomain Routing

  • Internet is divided into Autonomous Systems
    • Distinct regions of administrative control
    • Routers/links managed by a single “institution”
    • Service provider, company, university, …


  • Hierarchy of Autonomous Systems
    • Large, tier-1 provider with a nationwide backbone
    • Medium-sized regional provider with smaller backbone
    • Small network run by a single company or university


  • Interaction between Autonomous Systems
    • Internal topology is not shared between ASes
    • … but, neighboring ASes interact to coordinate routing

Autonomous System Numbers

AS Numbers are 16 bit values.

  • UBC: * U of T: * Stanford: * AT&T: 7018, 6341, 5074, …


  • Rogers: 812, 2491, 20453, …


  • Telus: 852, 24914, … Currently over 20,000 in use.

AS Number Trivia

  • AS number is a 16-bit quantity
    • So, 65,536 unique AS numbers


  • Some are reserved (e.g., for private AS numbers)
    • So, only 64,510 are available for public use


  • Managed by ICANN
    • Gives blocks of 1024 to Regional Internet Registries


  • RIRs assign AS numbers to institutions


  • Recently started assigning 32-bit AS #s (2007)

AS-level Topology

    • Destinations are IP prefixes (e.g., 12.0.0.0/8)
    • Nodes are Autonomous Systems (ASes)
    • Possibly multiple links between neighbours
      • Different types of links for different business relationships
  • EGP is about trying to find directed trees in the AS graph
    • There may be more than one link between two Ases
    • Really a tree where edges are between border routers
      • Alternate between inter-AS edges and intra-AS edges


  • EGP needs a directed tree for every CIDR block


  • EGP must scale
    • Prefixes: 400,000, and growing
    • ASes: 20,000+ visible ones, and 40K allocated
    • Routers: at least in the millions…


  • EGP must allow for ASes to implement business relationships
    • E.g., some paths preferred/disallowed because of business relationships


  • EGP must provide privacy
    • Should not divulge the interior topology of the AS
    • Should not divulge business relationships between ASes


  • There can only be one EGP
    • All the ASes have to run the same protocol

What works for EGP?

Link State Distance Vector Interior Gateway Protocols OSPF IS-IS RIP Exterior Gateway Protocols ? ?

LS or DV for EGP?

  • Problem with LS
    • Every change in AS topology in origin AS floods entire AS graph with update on all CIDR blocks originated by that AS
      • DV updates only percolate as far as they affect the shortest path
    • No internet-wide notion of cost of edges between ASes
    • With just hop count, not all shortest paths in AS graph are consistent with business relationships
      • Shortest path between some pairs of ASes allowed/ not allowed because of business relationships
  • Problem with DV
    • Slow convergence
    • Also based on shared notion of link weights and shortest paths

What to do?

  • Start with Distance Vector and modify/generalize
    • Improve convergence, reduce loops
    • Allow for flexible policy
      • Best path based on more factors than one number
      • To Implement business policies


  • Don’t have to accept announcements from all neighbours"


  • Don’t’ have to send announcements to all neighbours"


  • Can vary import and export policy for announcements per neighbour"
  • Generalization 1:
    • Don’t base “least cost” or “best path” on a single integer
    • Base it on a sequence of (four) “attributes”: (a,b,c,d)
      • Each attribute has an order relation
      • More about attributes later
    • Announcements: CIDR block, sequence of attributes
    • Store most recent announcement from each neighbor
    • If a router gets two announcements for the same CIDR block
      • Take the one that’s better on the first attribute
      • If equal on first attr, take the one that’s better on the second
      • If equal on the second as well, take the one that’s better on the third, etc.
  • Generalization 2: Import/Export Policies
    • Filter/modify incoming announcements:
      • Don’t have to use/accept announcements from all neighbors
      • Can even modify attributes from neighbors before comparing
      • Policies can be dependent on the peer AS or more specifically the peer router
    • Always accept announcements from internal BGP neighbors
    • Import Polices effect how traffic flows *out* of the AS/ border router
  • Generalization 2: Import/Export Policies
    • Filter outgoing announcements:
      • Don’t have to send announcement to all all neighbors
      • Can send different announcements to different neigbors
      • Can modify attributes of an announcement to a neighbor
    • If announcement was triggered by an announcement from an external neighbour
      • Send it to all internal BGP neighbours
    • if announcement was triggered by an announcement from an internal neighbour
      • Do not send it to the internal BGP neighbors
      • Avoids routing loops inside an AS
    • Export Policies effect how traffic flows into the AS/border router

Updates: 1 and 2 together

  • Get an update announcement for a prefix from a neighbor


  • Apply Import Policy
    • Filter it out or modify some of the attributes


  • Compare with existing announcements, find best one


  • If “best” has changed


  • Apply Export Policy
    • Send out properly filtered/modified announcement per neighbor

Border Gateway Protocol (BGP)

THE Interdomain Routing Protocol for the Internet

Border Gateway Protocol

  • 1989 : BGP-1 [RFC 1105], replacement for EGP


  • 1990 : BGP-2 [RFC 1163]


  • 1991 : BGP-3 [RFC 1267]


  • 1995 : BGP-4 [RFC 1771], support for CIDR


  • 2006 : BGP-4 [RFC 4271], update

BGP Sessions

  • BGP is run between “border routers”


  • BGP is run on top of TCP


  • Each border router may have multiple Exterior neighbors


  • Each border router has a BGP session with every other border router in the AS
    • These are “interior” neighbors


  • A BGP session starts by making announcements for all their prefixes to each other
    • Thereafter, incremental updates are sent periodically

BGP Operations

Establish session on TCP port Exchange all active routes Exchange incremental updates ASASWhile connection is ALIVE exchange route UPDATE messages BGP session

BGP Attributes

BGP Attributes used for computing “best” announcement

  • Local Preference: bigger is better


  • AS Path: shorter is better


  • Multi-exit Discriminator


  • IGP metric: smaller is better Other Attributes for Routing and Filtering:


  • Next Hop IP


  • Community ID


  • ….

BGP Attributes used for computing “best” announcement

  • Local Preference: bigger is better


  • AS Path: shorter is better


  • Multi-exit Discriminator


  • IGP metric: smaller is better Other Attributes for Routing and Filtering:


  • Next Hop IP


  • Community ID


Local Pref: Multihomed Backups

graphic Forces outbound traffic to take primary link, unless link is down. AS primary link backup link

AS Path

  • AS Path:
    • The sequence of AS numbers from the “origin” AS of a prefix
    • Most recent is on the left, origin is rightmost
    • E.g., 14.25.19.0/24: AS_Path(54, 20, 10)


  • Allows protocol to filter out loops
    • E.g., if router in AS 20 got the above announcement from a router in AS 54, it would throw the announcement out
    • Improves convergence of protocol


  • Output modification:
    • When a router chooses a route, it prepends its AS number to the AS_Path attribute before announcing it to its external neighbors


  • Other than input and output filtering, most important factor for external BGP

ASPATH Attribute

AS128.112.0.0/AS Path = AS Sprint AS Ebone AT&T AS Global Crossing 128.112.0.0/AS Path = 7018 128.112.0.0/AS Path = 3549 7018 AS 128.112.0.0/Princeton Prefix Originated AS RIPE NCC RIS project AS Global Access 128.112.0.0/AS Path = 7018 128.112.0.0/AS Path = 1239 7018 128.112.0.0/AS Path = 1129 1755 1239 7018 128.112.0.0/AS Path = 1755 1239 7018 ==== Next Hop IP ====

  • When two ASes have multiple connections, need more than just knowing what the next hop AS is


  • Also, need an IP address in order to stitch together the IGP info and the BGP info
    • Next Hop IP stays the same for announcements to internal BGP neighbors
    • More on this next lecture

Next Hop IP Attribute

  • Primary AS graph attributes
    • AS path (e.g., “7018 88”)
    • Next-hop IP address (e.g., 12.127.0.121)


  • Both stay unmodified for internal announcements AS Princeton 128.112.0.0/AS path = Next Hop = 192.0.2.AS AT&T AS Yale 192.0.2.128.112.0.0/AS path = 7018 Next Hop = 12.127.0.12.127.0.121

Lecture 19 - Routing

Routing

Routing

  • Routing on the Internet is hierarchical
    • The internet is broken up into Autonomous Systems
      • An AS is a large IP network within a single administrative domain"
      • One organizations can have several Ases"
    • Not all networks are in an AS"
      • Some networks are stub IP networks"
      • They’re attached to exactly one AS"


  • Two main levels of routing hierarchy
    • How to route between ASes"
      • Interdomain routing or Exterior Gateway Protocols"
    • How to route within an AS or stub IP network"
      • Intradomain routing or Interior Gateway Protocols


  • There are two main techniques for routing
    • Link State routing
      • Crucially uses Dijkstra’s algorithm for single source shortest path
      • Used in Intradomain routing
    • Distance Vector routing
      • Used in Intradomain for small networks
      • A variant used in Interdomain

What works for IGP and EGP

Link State Distance Vector Interior Gateway Protocols OSPF IS-IS RIP Exterior Gateway Protocols ? ? Interdomain Routing and Autonomous Systems (ASes)!

Interdomain Routing and Autonomous Systems (Ases)

AS-level Topology

    • Destinations are IP prefixes (e.g., 12.0.0.0/8)
    • Nodes are Autonomous Systems (ASes)
    • Possibly multiple links between neighbours
      • Different types of links for different business relationships 7 Server Farm Network Client UBC

Requirements for Interdomain Routing

  • EGP is about trying to find directed trees in the AS graph
    • There may be more than one link between two Ases
    • Really a tree where edges are between border routers
      • Alternate between inter-AS edges and intra-AS edges


  • EGP needs a directed tree for every CIDR block


  • EGP must scale
    • Prefixes: 400,000, and growing
    • ASes: 20,000+ visible ones, and 40K allocated
    • Routers: at least in the millions… UBC CS 317 2013WRequirements for Interdomain Routing?!


  • EGP must allow for ASes to implement business relationships
    • E.g., some paths preferred/disallowed because of business relationships


  • EGP must provide privacy
    • Should not divulge the interior topology of the AS
    • Should not divulge business relationships between ASes


  • There can only be one EGP
    • All the ASes have to run the same protocol

Border Gateway Protocol (BGP)

THE Interdomain Routing Protocol for the Internet!

What to do?

  • Start with Distance Vector and modify/generalize
    • Improve convergence, reduce loops
    • Allow for flexible policy
      • Best path based on more factors than one number
      • To Implement business policies


  • Don’t have to accept announcements from all neighbours"


  • Don’t’ have to send announcements to all neighbours"


  • Can vary import and export policy for announcements per neighbour"
  • Generalization 1:
    • Don’t base “least cost” or “best path” on a single integer
    • Base it on a sequence of (four) “attributes”: (a,b,c,d)
      • Each attribute has an order relation
      • More about attributes later
    • Announcements: CIDR block, sequence of attributes
    • Store most recent announcement from each neighbor
    • If a router gets two announcements for the same CIDR block
      • Take the one that’s better on the first attribute
      • If equal on first attr, take the one that’s better on the second
      • If equal on the second as well, take the one that’s better on the third, etc.
  • Generalization 2: Import/Export Policies
    • Filter/modify incoming announcements:
      • Don’t have to use/accept announcements from all neighbors
      • Can even modify attributes from neighbors before comparing
      • Policies can be dependent on the peer AS or more specifically the peer router
    • Always accept announcements from internal BGP neighbors
    • Import Polices effect how traffic flows *out* of the AS/ border router
  • Generalization 2: Import/Export Policies
    • Filter outgoing announcements:
      • Don’t have to send announcement to all all neighbors
      • Can send different announcements to different neigbors
      • Can modify attributes of an announcement to a neighbor
    • If announcement was triggered by an announcement from an external neighbour
      • Send it to all internal BGP neighbours
    • if announcement was triggered by an announcement from an internal neighbour
      • Do not send it to the internal BGP neighbors
      • Avoids routing loops inside an AS
    • Export Policies effect how traffic flows into the AS/border router

Updates: 1 and 2 together

  • Get an update announcement for a prefix from a neighbor


  • Apply Import Policy
    • Filter it out or modify some of the attributes


  • Compare with existing announcements, find best one


  • If “best” has changed


  • Apply Export Policy
    • Send out properly filtered/modified announcement per neighbor

BGP Operations

Establish session on TCP port Exchange all active routes Exchange incremental updates ASASWhile connection is ALIVE exchange route UPDATE messages BGP session

BGP Attributes

BGP Attributes used for computing “best” announcement

  • Local Preference: bigger is better


  • AS Path: shorter is better


  • Multi-exit Discriminator


  • IGP metric: smaller is better Other Attributes for Routing and Filtering:


  • Next Hop IP


  • Community ID


  • ….

BGP Attributes used for computing “best” announcement

  • Local Preference: bigger is better


  • AS Path: shorter is better


  • Multi-exit Discriminator


  • IGP metric: smaller is better Other Attributes for Routing and Filtering:


  • Next Hop IP


  • Community ID


  • ….

BGP Attributes used for computing “best” announcement

  • Local Preference: bigger is better


  • AS Path: shorter is better


  • Multi-exit Discriminator


  • IGP metric: smaller is better Other Attributes for Routing and Filtering:


  • Next Hop IP


  • Community ID


  • ….

Next Hop IP

  • When two ASes have multiple connections
    • need more than just knowing what the next hop AS is


  • Need an IP address in order to stitch together the IGP info and the BGP info
    • Next Hop IP stays the same for announcements to internal BGP neighbors


  • This is the Next Hop IP in the BGP routing table not the Next Hop IP in the forwarding table

Next Hop IP Attribute

  • Primary AS graph attributes
    • AS path (e.g., “7018 88”)
    • Next-hop IP address (e.g., 12.127.0.121)


  • Both stay unmodified for internal announcements AS Princeton 128.112.0.0/AS path = Next Hop = 192.0.2.AS AT&T AS Yale 192.0.2.128.112.0.0/AS path = 7018 Next Hop = 12.127.0.12.127.0.==== Stitching IGP and BGP ====


  • Distribute BGP Routing Info:
    • Each border router’s “best route” (i.e., sequence of attributes) for each of its CIDR blocks is distributed to all the routers in an AS
    • Every router decides on its own “best route” from the ones it hears
      • Every router decides on a (BGP) Next Hop IP for each prefix in the ASes border routers BGP routing tables


  • IGP Routing Info:
    • Every router will have a Next Hop IP for each subnet

“owned” by the AS, including subnets that border routers use to talk to their external neighbors

  • Goal: for a given router and for each prefix in the BGP routing table,
    • find the proper Next Hop IP for the forwarding table for that router


  • Step 1:
    • For a prefix A/n from the BGP routing table, find its

(BGP) Next Hop IP—call this B

  • Step 2:
    • Find the longest prefix match for B among all the prefixes in the IGP routing table
    • Call this C/m
  • Step 3:
    • Find the Next Hop IP for C/m in the IGP routing table— call this D


  • Step 4:
    • Install the prefix A/n with Next Hop IP D into the router’s forwarding table

Business Relationships and Export Policies

Example Topology

Business Relationships

  • Customer-Provider
    • The provider is “higher up” in the AS hierarchy
    • The provider provides access to the parts of the Internet the customer can’t access itself
    • The customer pays the provider for this access
      • The cost of this access is often a function of the bandwidth and other service characteristics
  • Peering Relationship
    • AS A and B may agree to exchange traffic as follows:
    • Traffic from A and A’s customers to B or B’s customers can go across the peering link from A to B
    • Traffic from B and B’s customers to A or A’s customers can go across the peering link from B to A
    • Often no money is exchanged between peers as long as the amount of traffic flowing in each direction is approximately the same
    • If an AS peers with one or more other ASes, it can reduce the amount of traffic coming from or going to its provider
    • This allows it to pay less to its provider—this is the whole point! ==== Business Relationships ====


  • Tier 1 providers are required to peer with every other tier provider
    • In order to handle all of the tier 1 peering traffic, each pair of tier providers have many high speed peering links and routers between them
  • Peers do not transit traffic
    • Suppose A and B are peers, and B and C are peers
    • B will not pass traffic from A through to C
      • B reasons that A and C are trying to use B’s infrastructure to reduce their own costs of having to use their providers
      • This is extra traffic through B’s network that is not to or from B or its customers—B is not benefiting from this traffic
    • If A and C want to exchange traffic w/o using their providers they must establish their own peering link(s) and peering agreement

The “Peering” Relationship

peer peer provider customer Peers provide transit between their respective customers Peers do not provide transit between peers traffic Peers (often) do not exchange $$$ allowed traffic NOT allowed

Export Policies

  • What does this mean for how BGP announcements are made?


  • When an AS gets an announcement for an address block from a provider,
    • it makes an announcement for that address block to all of its customers (with the AS PATH properly updated)
    • it does *not* announce this block to its peers UBC CS 317 2013W1 provider customer Announcement from the provider Traffic to the provider peer


  • Announcement distributed down to customers


  • Announcement not distributed to peers
  • When an AS gets an announcement for an address block from a customer,
    • it makes an announcement for that address block to all of its other customers (with the AS PATH properly updated)
    • It makes an announcement to all of its peers
    • It makes an announcement to its provider

customer provider Traffic to the customer announcements traffic Announcement from the customer

  • When an AS gets an announcement for an address block from a peer,
    • it makes an announcement for that address block to all of its customers (with the AS PATH properly updated)
    • It does *not* makes an announcement to its other peers
    • It does *not* makes an announcement to its provider UBC CS 317 2013W1 peer peer Traffic to/from the peer and its customers d announcements traffic

Conclusions

  • BGP is solving a hard problem
    • Routing protocol operating at a global scale
    • With tens of thousands of independent networks
    • That each have their own policy goals
    • And all want fast convergence


  • Key features of BGP
    • Multiple attributes for comparing announcements for CIDR blocks
    • Incremental updates (announcements and withdrawals)
    • Policies applied at import and export of routes

Transport Protocols

Where are we?

Packet Routing vs Circuit Switching Big Pros:

    • Stat Muxing: more efficient use of individual links, better match for many/most computer communications
    • Core just focuses on forwarding packets efficiently and little else
    • Computation of paths decoupled from decision for accepting data into the network
      • Call admission in circuit switching vs accept every packet and provide it with best effort service

Big Table

Big Cons

  • Need per packet endhost addressing


  • Need multiplexing within the endhost


  • Packet Corruption/Drops


  • Out of order delivery


  • No Flow initiation


  • No Flow Agreement between end hosts


  • No network wide policing of traffic entering the network


  • Variable Delay/ Throughput UBC CS 317 2013WAdding services back in to overcome the cons ! Big Cons Possible Services


  • Need per packet endhost addressing


  • Need multiplexing within the endhost


  • Packet Corruption/Drops


  • Out of order delivery


  • No Flow initiation


  • No Flow Agreement between end hosts


  • No network wide policing of traffic entering the network


  • Variable Delay/ Throughput


  • Network wide IP addressing


  • Multiplexing Service


  • Reliable Packet Delivery


  • In Order Delivery


  • Flow Initiation


  • Flow Control


  • Congestion Control


  • ??

Introduced Layered Architecture

Application Layer Providing end-to-end applications Transport Layer Providing end-to-end network services to applications Multiplexing among applications IP Layer Forwarding datagrams end-to-end over the network Multiplexing among transport layers Link Layer Moving frames over a physical layer Multiplexing among network layers Physical Layer Moving bits over physical channel ==== Big Table ==== Big Cons Possible Services Where?

  • Need per packet endhost addressing


  • Need multiplexing within the endhost


  • Packet Corruption/Drops


  • Out of order delivery


  • No Flow initiation


  • No Flow Agreement between end hosts


  • No network wide policing of traffic entering the network


  • Variable Delay/ Throughput


  • Network wide IP addressing


  • Multiplexing Service


  • Reliable Packet Delivery


  • In Order Delivery


  • Flow Initiation


  • Flow Control


  • Congestion Control


  • ??


  • IP Layer


  • Transport


  • Link/Transport


  • Transport


  • Transport


  • Transport


  • Transport

Application Layer Transport Layer—This is where we are IP Layer--Covered Link Layer--Covered Physical Layer—Electrical Engineering

Transport Protocols

  • Provide logical communication between application processes running on different hosts


  • Run on end hosts
    • Sender: breaks application messages into segments, and passes to transport layer
    • Receiver: reassembles segments into messages, passes to application layer


  • Multiple transport protocols available to applications
    • Internet: TCP and UDP application transport network data link physica l application transport network data link physica l network data link physica l network data link physica l network data link physica l network data link physica l network data link physica l Modified from Rexford

Multiplexing!

  • Recall Postal Mail


  • Canada Post delivers parcels building to building
    • using the building address


  • The building “mail room” delivers parcels within the building
    • using room # or person’s name on envelope


  • IP: end-host to end-host delivery
    • using IP address


  • Transport: within end-host delivery
    • using “port” name
  • Delivery within the endhost:
    • Deliver data to running applications
    • OS gives apps process IDs
      • These are dynamic, not stable
    • Can’t expect apps on remote endhosts to know the PID
    • Need something like a stable name
      • This is called a “port number”
    • Need mapping between port number and PID

Interface

  • “The protocol stack”
    • Link/IP/Transport
    • Runs as part of the operating system—”the kernel”


  • Set of interfaces between OS and applications called system calls


  • Special system calls between applications that want to use the network and the kernel
    • called “sockets”
    • sockets will do bookkeeping, e.g., mapping between port numbers and PIDs
    • Sockets manage interface between an app and a transport layer protocol, e.g., signaling about when data has arrived, more data can be pushed out, etc.

Demultiplexing: port numbers

Web server! (port 80)! Client host! Server host 128.2.194.242! Echo server! (port 7)! Service request for! 128.2.194.242:80! (i.e., the Web server)! Client! OS! Modified from Rexford

Port Numbers

  • Where do port numbers come from?


  • The Internet Engineering Task Force—IETF—maintains a list of “well known” ports
    • port 25 for mail
    • port 80 for http


  • Can be manually configured/privately agreed to


  • Can be negotiated in a “control” protocol


  • Can be embedded in a link


  • Can be retrieved from a naming service (that itself has a well known port #)
    • What port number is the server a.b.c.d using for mail?

User Datagram Protocol (UDP)

  • Datagram messaging service
    • Demultiplexing of messages: port numbers
    • Detecting corrupted messages: checksum
    • That’s it!


  • Lightweight communication between processes
    • Send messages to and receive them from a socket
    • Avoid overhead and delays of ordered, reliable delivery SRC port DST port checksum length DATA Modified from Rexford

Why Would Anyone Use UDP?

  • Fine control over what data is sent and when
    • As soon as an application process writes into the socket
    • … UDP will package the data and send the packet


  • No delay for connection establishment
    • UDP just blasts away without any formal preliminaries
    • … which avoids introducing any unnecessary delays


  • No connection state
    • No allocation of buffers, parameters, sequence #s, etc.
    • … making it easier to handle many active clients at once


  • Small packet header overhead
    • UDP header is only eight bytes long Modified from Rexford

Popular Applications That Use UDP

  • Multimedia streaming
    • Retransmitting lost/corrupted packets is not worthwhile
    • By the time the packet is retransmitted, it’s too late
    • E.g., telephone calls, video conferencing, gaming


  • Simple query protocols like Domain Name System
    • Overhead of connection establishment is overkill
    • Easier to have the application retransmit if needed

“Address for www.cnn.com?” “12.3.4.15”

Transmission Control Protocol (TCP)

  • Stream-of-bytes service
    • Sends and receives a stream of bytes, not messages
    • Sequence numbers to reorder data


  • Flow initiation
    • Explicit set-up and tear-down of TCP session


  • In-order delivery
    • Need sequence numbers


  • Reliability
    • Checksums to detect corrupted data
    • New problem: dropped packets
    • Acks, time-outs and retransmissions using sequence numbers
  • Flow control
    • Send as fast as the receiver can receive but not faster
    • Need to have lots of unacknowledged packets in flight to accomplish this
    • Need sequence numbers to do this—but we have them already


  • Congestion control
    • Adapt to network congestion for the greater good

Reliable Delivery When Packets Can Be Dropped

Simple ARQ revisited

Time Packet ACK Timeout

  • Possibility of packet loss requires sender to keep a timer
    • Timeout a lot like a NACK
    • When packet or ack is lost, timer will be tripped
    • When timer is tripped, packet/ ack not necessarily lost, could just be slow Sender Receiver

Reasons for Retransmission

Packet Timeout Packet ACK Timeout Packet Timeout Packet ACK Timeout Packet ACK Timeout Packet ACK Timeout ACK lost DUPLICATE PACKET Packet lost Early timeout DUPLICATE PACKETS Modified from Rexford

Protocol with drops

  • Is counting a timeout as a nack in the old protocol enough?


  • No, as per the example in class,
    • a delayed ack can lead to an ambiguity on the sender’s side


  • The solution is to label the acks
    • The acks carry a sequence number


  • It turns out that once you have labeled acks, you don’t need nacks for corrupted packets


  • We won’t explicitly define the stop and wait protocol for this
    • One such protocol is in the textbook

Stop and Wait is Reliable but…

  • Stop-and-wait is inefficient
    • Only one TCP segment is “in flight” at a time
    • Especially bad when delay-bandwidth product is high


  • Numerical example
    • 1KB packet
    • 10 Mbps link (rate R) with a 80 msec propagation time

(round trip time/2) Modified from Rexford

  • Numerical example
    • 1KB packet
    • 10 Mbps link (rate R) with a 80 msec propagation time (round trip time/2)
    • Number of possible bits “in flight” on the link at any time
      • 10 Mbps x 80 msec = 800 Kb
      • This is the Delay-bandwidth product = R x RTT/** Number of bits actually in flight
      • 8Kb
    • Utilization: 1/** Even worse since half the time (during the ack) no bits are on the line from sender to receiver Modified from Rexford

Throughput of stop and wait

  • Parameters:
    • Packets of length L bits
    • Rate of the link of R bits per sec
    • Round trip time of RTT
    • Assume ack packet has no transmission delay


  • Time between the start of sending a packet and the start of sending the next packet:
    • Transmission delay of packet + RTT = (L/R) + RTT


  • Data sent in that time period:
    • L


  • Effective Rate: L / ((L/R) + RTT)
  • Effective Rate: L / ((L/R) + RTT)

= R / (1 + B )

  • Where B is the ratio of the possible bits on the wire in both directions to the length of the packet B = RTT x R / L


  • When RTT x R >> L:
    • B is big, and
    • the effective rate is much smaller than the link rate

Reliability + Performance

  • Reliability is an important goal in computing systems, but…


  • Performance is an equally important goal


  • The challenge in systems is to achieve a high degree of both at the same time
  • Two performance tracks:
    • Find a “good” time-out value
    • “Pipeline” packets

Time out value

  • Too small
    • Wasted retransmissions that use up network resources


  • Too big
    • Could be waiting for too long to retransmit
    • Lower throughput


  • Try to achieve adaptive window size
    • by measuring the (recent) roundtrip time on packets and acknowledgements
    • And taking some kind of running average
      • This is what TCP does
      • Different implementations do it slightly differently

How Long Should Sender Wait?

  • TCP sets timeout as a function of the RTT
    • Expect ACK to arrive after an RTT “round-trip time”
    • … plus a fudge factor to account for queuing


  • But, how does the sender know the RTT?
    • Sample the RTT by watching the ACKs and
    • Take a “running” average
      • NewAveRTT = a * NewSampleRTT + (1 –a ) *OldAveRTT
    • Compute timeout: TimeOut = 2 * EstimatedRTT Modified from Rexford

Example RTT Estimation

RTT: gaia.cs.umass.edu to fantasia.eurecom.fr 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 time (seconnds) RTT (milliseconds) SampleRTT Estimated RTT

  • The most important technique for improving performance for a reliable delivery protocol is pipelining
    • The protocols that do this are called sliding window protocols
      • We’ll need packets to have sequence numbers

Sliding Window—Sender side

  • Basic Idea: Allow sender to always have up to W packets that have been sent but not yet acknowledged
    • Say W is 10. Idea is that S can send out packets 1 through 10 w/o waiting for any acknowledgements to arrive
    • When the ack for packet 1 arrives it can send out packet ** When the ack for pckt 2 arrives, the sender can send packet 12, etc.
    • If L is the number of the last acknowledged packet than the sender can send out up to packet number L + W
    • The window of consecutive packet numbers that S can have outstanding “slides” forward as the acknowledgements arrive


  • This seems like just an additive gain
    • Maybe the sender finishes about W units early
    • It’s actually a multiplicative gain 73

Lecture 20/21 - Transport Protocols Continued

Transport Protocols

Big Table

table

5 Layer Reference Model

model

User Datagram Protocol (UDP)

  • Datagram messaging service
    • Demultiplexing of messages: port numbers
    • Detecting corrupted messages: checksum
    • That’s it!


  • Lightweight communication between processes
    • Send messages to and receive them from a socket
    • Avoid overhead and delays of ordered, reliable delivery SRC port DST port checksum length DATA Modified from Rexford 5

Transmission Control Protocol (TCP)

  • Stream-of-bytes service
    • Sends and receives a stream of bytes, not messages
    • Sequence numbers to reorder data


  • Reliability
    • Checksums to detect corrupted data
    • New problem: dropped packets
    • Acks, time-outs and retransmissions using sequence numbers


  • Flow initiation
    • Explicit set-up and tear-down of TCP session


  • In-order delivery
    • Need sequence numbers


  • Flow control
    • Send as fast as the receiver can receive but not faster
    • Need to have lots of unacknowledged packets in flight to accomplish this
    • Need sequence numbers to do this—but we have them already


  • Congestion control
    • Adapt to network congestion for the greater good

Stop and Wait is Reliable but…

  • Stop-and-wait is inefficient
    • Only one TCP segment is “in flight” at a time
    • Especially bad when delay-bandwidth product is high Modified from Rexford

Reliability + Performance

  • The most important technique for improving performance for a reliable delivery protocol is pipelining
    • The protocols that do this are called sliding window protocols
    • We’ll need packets to have sequence numbers

Sliding Window-Performance

  • Basic Idea: Allow sender to always have up to W packets that have been sent but not yet acknowledged


  • Main point: The bigger the window, the higher the maximum throughput
    • (up to a point)

Sliding Window-Reliability

  • Ok, great. We know how to get some performance, but what about reliability?


  • TCP is mostly based on Go Back N
    • But unlike Go Back N, TCP
      • does not resend the window as soon as it gets one duplicate ack
      • reduces the size of its window before retransmitting its window
      • Is byte oriented rather than packet oriented
    • We’ll call the packet based version of TCP: Resend Window (RW)


  • For TCP/RW, if the sender
    • Receives a triple dup ack, or
    • The window times out


  • The sender assumes a packet has been dropped and
    • It reduces the size of the window
    • then retransmits all of the current data in the window

Rule for RW Receiver, Simple version

  • Whenever you get a good packet, always acknowledge the largest consecutive packet that you have received
    • Example: Receiver has received and ack’ed packets with seq nums:
      • 1,2,3,4
    • Receives packets with seq nums 6 and 7
      • It acks 4 twice (which is the third ack sent for 4)
      • It now has packets
      • 1,2,3,4,x,6,7
    • Now it receives packet 5
      • It acks packet 7

Rules for Sender, Simple version

  • Keep track of the largest ack that you’ve seen
    • You can assume that R has received all of the packets up through and including that packet number
      • You may get acks smaller than your current max but you can ignore them


  • Keep track of how many times you see duplicate acks for the current max
    • This indicates dropped or out of order packets


  • Keep a count down timer for the window
    • Start the timer for a “new” window
      • When the start of the window advances, or
      • When a retransmission of the window begins


  • When you get an ack that is the new max:
    • Slide your window forward using the current window size
    • You may be able to accept more packets from the layer above or immediately send buffered packets
      • You can have packets new_max + 1 through new_max + W sent but unacknowledged
    • Restart the countdown timer


  • If you get three duplicate acks for the current max
    • You do assume a packet has been dropped
    • (You reduce your window a la congestion control)
    • You retransmit the packets in the current window
      • This is called “fast retransmit” since you don’t wait for the timer to time out


  • If the counter timer expires,
    • You assume that a packet has been lost
    • (Reduce your window a la congestion control)
    • Resend the packets in the window

TCP Services

  • In-order delivery, stream-of-bytes service
    • Sends and receives a stream of bytes, not messages
    • Sequence numbers to reorder data


  • Flow initiation
    • Explicit set-up and tear-down of TCP session


  • Reliability
    • Checksums to detect corrupted data
    • New problem: dropped packets
    • Acks, time-outs and retransmissions using sequence numbers


  • Congestion control
    • Adapt to network congestion for the greater good


  • Flow control
    • Send as fast as the receiver can receive but not faster
    • Need to have lots of unacknowledged packets in flight to accomplish this
    • Need sequence numbers to do this—but we have them already

TCP Segment

  • IP packet
    • No bigger than Maximum Transmission Unit (MTU)
    • E.g., up to 1500 bytes on an Ethernet


  • TCP packet
    • IP packet with a TCP header and data inside
    • TCP header is typically 20 bytes long


  • TCP segment
    • No more than Maximum Segment Size (MSS) bytes
    • E.g., up to 1460 consecutive bytes from the stream IP Hdr IP Data TCP Data (segment) TCP Hdr

TCP Acknowledgments

picture

Flow Control and Congestion Control

Sender Window

  • The sender keeps track of two values that will change based on various events
    • The flow control window, also called the receive window, rwnd
    • The congestion control window, cwnd


  • Actual window size is the minimum of these two values


  • Whenever either of the values changes, S recalculates the min


  • Recall that window size and throughput are roughly proportional 20 Flow Control


  • Assume the Receiver can keep B in-order bytes buffered


  • TCP puts the newest (largest byte numbers) bytes into the buffer in order


  • An application takes the oldest (smallest byte numbers) bytes out of the buffer

TCP Buffers

  • If the application is consuming bytes at a slower rate than the sender is sending,
    • then the Receiver’s buffer will overflow


  • The receiver needs to get the sender to slow down


  • If R tells S to reduce its window, than S will slow down!

Flow Control

  • Even more pragmatically, the Receiver only has so much room left it’s buffer
    • It has to tell the sender that it can’t send any more bytes than the # of bytes left in its buffer

RWND on the Reciever

Flow Control-Receiver side

  • R keeps track of the max consecutive byte
    • It already does this
    • This is the packet it sends its cumulative ack for


  • R keeps track of the min consecutive byte in the buffer
    • This is updated every time the app pulls data out


  • R keeps track of the room left in the buffer
    • = B ** (max ** min + 1)
    • This is the current rwnd


  • When R sends an ack it includes the current rwnd in the ack 26

TCP Header

picture

Flow Control-Sender Side

  • When S gets an ack, it uses the value of rwnd in the ack to update its local value of rwnd
    • And then it recalculates its window value
      • (the new rwnd may or may not update the window depending of the relative value of cwnd)

Flow Control

  • Rwnd goes up when an app pulls two or more packets worth of data out between consecutive packets coming into R


  • Rwnd goes down when two or more consecutive packets come into R w/o the app pulling a packets worth of data out of the buffer

Congestion Control

Congestion

  • Congestion on a link: when the incoming rate to the link is greater than the outgoing rate of the link for a sustained enough period of time to cause:
    • the buffers to grow large
      • Which causes increased delay
    • or the buffers to fill
      • Which causes packets to be dropped


  • Congestion in a network: when one or more links in the network is experiencing congestion

Congestion collapse

  • In a circuit switched network you’ll never get congestion!
    • You don’t admit traffic into the network unless you are guaranteed you can handle it


  • In a packet routed network packet drops waists lots of resources
    • If a packet survives many hops but gets dropped later, it used up valuable space earlier


  • Reliable delivery mechanisms in a packet routed network create a positive feedback loop
    • One drop causes many packets to be retransmitted which may increase congestion enough to cause more packets to be dropped, etc.
    • This is called “congestive collapse”

Collective back off

  • To fight the positive feedback effect, the senders must collectively and cooperatively back off their rate
    • But we don’t want to back off too much otherwise we won’t be using the capacity of the network
    • This is a lot like the situation with collisions in Ethernet
      • A fixed small retransmission window might create continuous collisions
      • A fixed large retransmission window might be inefficient
      • Ethernet used a mechanism that tried to find the right window size dynamically


  • We’ll need to do something dynamic here too

Congestion feedback

  • Need congestion feedback
    • Hosts can’t back off their sending rate w/o a reliable signal that there is congestion on a link they’re using


  • Feedback from the network?
    • The routers could give them explicit feedback when a link buffer is getting full
    • Either by sending an explicit control message back to the sender or piggybacking some info on the packet and letting the receiver relay something back


  • But IP had the “end to end” philosophy
    • keep the routers simple, they just do forwarding

Congestion feedback

  • Using RTT at the Sender
    • A sender could keep track of RTT and as it gets bigger it could back off
    • There have been some proposals to do this but they are not widely adopted


  • Using packet drops at the Sender
    • If sender knew that one or more packets to the receiver had been dropped, it could use that as a solid indication of congestion on one of the links
    • That’s conservative because the drop may have been due to a very transient spike in usage of the link
      • But conservative is good in this case!


  • The sender makes an assumption that a packet has been dropped if
    • the timer times out or
    • there’s a triple duplicate ack


  • That’s how the sender “knows” that a packet has been dropped—we’ll use that for our indication of congestion


  • Once a sender assumes there’s congestion because of a time out or duplicate acks, how does it back off?


  • By decreasing the sliding window, of course!

Congestion Control—Sender Side

  • In TCP, all congestion control happens at the sender side


  • If S detects possible congestion
    • with a time out or triple duplicate ack,


  • It reduces its send rate
    • by reducing its window size
      • Actually it reduces cwnd and then finds min of cwnd and rwnd


  • Congestion control tied to reliability/efficiency sliding window protocol
    • This is natural since window/send rate, packet drops and congestion are interrelated

Hunting for the right window size

  • Once the sender’s sliding window is cut down, does it stay down?
    • No
    • It can’t be that the only direction for adjusting the window is down
    • We’d never use resources efficiently that way


  • Basic idea:
    • Increase window gradually when there’s no evidence of congestion
    • Decrease window quickly when there is evidence
    • This “hunts” for an appropriate rate and adapts to changing traffic conditions

Congestion Control--Details

  • Triple duplicate ack vs time out
    • How we decrease and then subsequently increase is different depending on whether we get a triple duplicate ack or a time out

Triple Duplicate Ack and AIMD

  • A triple duplicate ack means that some packets are getting through so the congestion is not horrible
    • No need to reduce cwnd too much


  • After a triple duplicate ack,
    • cut cwnd in half, and retransmit the window
      • This is called “fast retransmit” since you aren’t waiting for a timeout to retransmit


  • Then increase window size by ~one MSS every RTT until a congestion event
    • This is called Additive Increase, Multiplicative Decrease

Leads to the TCP “Sawtooth”

picture

AIMD

  • Fairness
    • Remember fairness as in one of the main three things we want when we’re sharing a resource:
    • Performance, reliability, fairness


  • AIMD is reasonably fair
    • At steady state, AIMD will bounce around a fair rate
    • Argument in book


  • AIMD is adaptive/dynamic
    • In addition, AIMD will adapt the value that it is hunting around as traffic increases or decreases over the links that a TCP session is using

How should a flow start?

  • Let’s interrupt the discussion of congestion for a second


  • Additive increase begs the question


  • How should a flow start?
    • Need to increase slowly when we suspect we’re getting close to congestion
    • But when we start the flow, we are far from congestion
    • And if we just increase slowly from the start—we’d all get impatient!!

How Should a New Flow Start

picture

“Slow Start” Phase

  • Start with a small congestion window
    • Initially, CWND is 1 Max Segment Size (MSS)
    • So, initial sending rate is ~MSS/RTT


  • That could be pretty wasteful
    • Might be much less than the actual bandwidth
    • Linear increase takes a long time to accelerate


  • Slow-start phase (really “fast start”)
    • Sender starts at a slow rate (hence the name)
    • … but increases the rate exponentially
    • … until the first loss event

Slow Start in Action

picture

Slow Start and the TCP Sawtooth

picture

Congestion Control--Details

  • Triple duplicate ack vs time out
    • How we decrease and then subsequently increase is different depending on whether we get a triple duplicate ack or a time out

Time out

  • After a time out, cut cwnd down to 1 MSS
    • Treat time out as an indication of serious congestion
    • Increase cwnd exponentially for a while
      • Increasing additively from 1 MSS may take too long to get close to the right send rate
    • Then increase additively


  • “for a while” means until you get to half the cwnd in effect at the time out

Repeating Slow Start After Timeout

picture

Congestion Control Recap

  • Congestion control in TCP is all at the sender side


  • Indications of congestion
    • Triple duplicate acks: some congestion
    • Timeout: significant congestion


  • Reduce cwnd substantially (by ½) after indication of some congestion (triple dup ack)
    • to substantially reduce sending rate and help the network quickly move out of congestion


  • Reduce cwnd to starting value (1 MSS) after indication of significant congestion (window timeout)
    • Need to back off sending rate drastically if there is significant congestion in the network


  • Increase cwnd as long as there is no indication of congestion
    • Traffic should make effective use of network resources
    • Increase cwnd quickly (exponentially) when it’s very small
    • Increase cwnd gradually (linearly) when it might be near a congestion point, based on history


  • The mechanisms for increasing and decreasing cwnd
    • Allow the sender to hunt for a reasonable sending rate based on feedback from the network
    • Allow the senders rate to adapt to changing network conditions

Lecture 23 - Part 1

Names, Addresses, and Bootstrapping

Three Kinds of host/interface Identifiers

  • Host name (e.g., www.cnn.com)
    • Mnemonic name appreciated by humans
    • Provides little (if any) information about location
    • Hierarchical, variable # of alpha-numeric characters
    • Application layer


  • IP address (e.g., 64.236.16.20)
    • Numerical address appreciated by routers
    • Related to host’s current location in the topology
    • Hierarchical name space of 32 bits
    • IP Layer


  • MAC address (e.g., 00-15-C5-49-04-A9)
    • Numerical address appreciated within local area network
    • Unique, hard-coded in the adapter when it is built
    • Flat name space of 48 bits
    • Link Layer

Separating Names and IP Addresses

  • Names are easier (for us) to remember
    • www.cnn.com vs. 64.236.16.20


  • Referencing/indirection is good--IP addresses can change underneath
    • Move www.cnn.com to 173.15.201.39
      • renumbering when changing providers


  • Name could map to multiple IP addresses
    • www.cnn.com to multiple replicas of the Web site


  • Map to different addresses in different places
    • Address of a nearby copy of the Web site
      • to reduce latency, or return different content


  • Multiple names for the same address
      • aliases like ee.mit.edu and cs.mit.edu

Separating IP and MAC Addresses

  • LANs are designed for arbitrary network protocols
    • Not just for IP (e.g., IPX, Appletalk, X.25, …)
      • Though now IP is the main game in town
    • Different LANs may have different addressing schemes
      • Though now Ethernet address is the main game in town


  • A host may move to a new location
    • So, cannot simply assign a static IP address
      • Since IP addresses depend on host’s position in topology
    • Instead, must reconfigure the adapter
      • To assign it an IP address based on its current location


  • Must identify the adapter during bootstrap process
    • Need to talk to the adapter to assign it an IP address

Mapping Between Identifiers

  • Domain Name System (DNS)
    • Given a host name, provide the IP address
    • Given an IP address, provide the host name


  • Dynamic Host Configuration Protocol (DHCP)
    • At boot, host only has a MAC address
    • To automate the boot-strapping process assign a unique IP address and tell host other stuff about the Local Area Network


  • Address Resolution Protocol (ARP)
    • Given an IP address, provide the MAC address
    • To enable communication within the Local Area Network

Domain Name System (DNS)

Proposed in 1983 by Paul Mockapetris

Outline

  • Computer science concepts underlying DNS
    • Indirection: names in place of addresses
    • Hierarchy: in names, addresses, and servers
    • Caching: of mappings from names to/from addresses


  • DNS software components
    • DNS resolvers
    • DNS servers


  • DNS queries
    • Iterative queries
    • Recursive queries


  • DNS caching based on time-to-live (TTL)

Strawman Solution #1: Local File

  • Original name to address mapping
    • One file contained the mapping from names to IP addresses
    • SRI(?) kept main copy
    • Downloaded regularly
    • /etc/hosts


  • Count of hosts was increasing: moving from a machine per domain to machine per user
    • Many more downloads
    • Many more updates

Strawman Solution #2: Central Server

  • Central server
    • One place where all mappings are stored
    • All queries go to the central server


  • Many practical problems
    • Single point of failure
    • High traffic volume
    • Distant centralized database
    • Single point of update
    • Does not scale


  • Need a distributed, hierarchical collection of servers

Domain Name System (DNS)

  • Properties of DNS
    • Hierarchical name space divided into zones
    • Distributed over a collection of DNS servers


  • Hierarchy of DNS servers
    • Root servers
    • Top-level domain (TLD) servers
    • Authoritative DNS servers


  • Performing the translations
    • Local DNS servers
    • Resolver software

DNS Root Servers

TLD and Authoritative DNS Servers

  • Top-level domain (TLD) servers
    • Generic domains (e.g., com, org, edu)
    • Country domains (e.g., uk, fr, ca, jp)
    • Typically managed professionally
      • Network Solutions maintains servers for “com”
      • Educause maintains servers for “edu”


  • .CA started at UBC
    • John Demco, member of the technical staff personally administered the registration/management of .ca names from 1987 until 2000
    • Now managed by Canadian Internet Registration Authority


  • Authoritative DNS servers
    • Provide public records for hosts at an organization
    • For the organization’s servers (e.g., Web and mail)
    • Can be maintained locally or by a service provider

Distributed Hierarchical Database

graphic

Using DNS

  • Local DNS server (“default name server”)
    • Usually near the end hosts who use it
    • Local hosts configured with local server (e.g., /etc/resolv.conf) or learn the server via DHCP


  • Client application
    • Extract server name (e.g., from the URL)
    • Do gethostbyname() to trigger resolver code


  • Server application
    • Extract client IP address from socket
    • Optional gethostbyaddr() to translate into name

Recursive vs. Iterative Queries

  • Recursive query
    • Ask server to get answer for you
      • request 1 and response 8
  • Iterative query
    • Ask server who to ask next
      • all other request-response pairs

DNS Caching

  • Performing all these queries take time
    • And all this before the actual communication takes place
      • 1-second latency before starting Web download
  • Caching can substantially reduce overhead
    • The top-level servers very rarely change
    • Popular sites (e.g., www.cnn.com) visited often
    • Local DNS server often has the information cached
  • How DNS caching works
    • DNS servers cache responses to queries
    • Responses include a “time to live” (TTL) field
    • Server deletes the cached entry after TTL expires

Negative Caching

  • Remember things that don’t work
    • Misspellings like www.cnn.comm and www.cnnn.com
    • These can take a long time to fail the first time
    • Mark as unresolvable with some TTL
    • Good to remember that they don’t work so the failure takes less time the next time around

DNS Resource Records

DNS: distributed db storing resource records (RR)

  • Type=NS
    • name is domain (e.g. foo.com)
    • value is hostname of authoritative name server for this domain RR format: (name, value, type, ttl)
  • Type=A
    • name is hostname
    • value is IP address
  • Type=CNAME
    • name is alias name for some “canonical” (the real) name www.ibm.com is really servereast.backup2.ibm.com
    • value is canonical name
  • Type=MX
    • value is name of mailserver associated with name

DNS Protocol

  • query and reply messages, both with same message format Message header


  • Identification: 16 bit # forquery, reply to query uses same #


  • Flags:
    • Query or reply
    • Recursion desired
    • Recursion available
    • Reply is authoritative

Reliability

  • DNS servers are replicated
    • Name service available if at least one replica is up
    • Queries can be load balanced between replicas


  • UDP used for queries
    • UDP port 53
    • Need reliability: implemented on top of UDP


  • Try alternate servers on timeout
    • Exponential backoff when retrying same server


  • Same identifier for all queries
    • Don’t care which server responds

Inserting Resource Records into DNS

  • Example: just created startup “FooBar”


  • Get static CIDR block from ISP
    • assign IP addresses to your authoritative name servers (primary and secondary)


  • Register foobar.com at Network Solutions
    • Provide registrar with names and IP addresses of name servers
    • Registrar inserts two RRs per server into the com TLD server:
      • (foobar.com, dns1.foobar.com, NS)
      • (dns1.foobar.com, 212.212.212.1, A)


  • Put in authoritative server dns1.foobar.com
    • Type A record for www.foobar.com
    • Type MX record for foobar.com


  • Play with “dig” on the web

Bootstraping an End Host

DHCP and ARP

How to

  • When the host boots:
    • How does it get an IP address?
      • Can’t be hard coded at the factory—the factory doesn’t know what subnet the device is going to join
      • Could be hand configured every boot-up—but that would be a pain
    • How does it get the IP address of the router and DNS server?
    • How does it get the MAC address of the router, and the other MAC addresses on the LAN for that matter?

Avoiding Manual Configuration

  • Dynamic Host Configuration Protocol (DHCP)
    • Learn IP address, subnet mask, gateway router, DNS servers
  • Address Resolution Protocol (ARP)
    • Learn mapping between IP & MAC addresses of other hosts on LAN in order to be able to create proper frames

Key Ideas in Both Protocols

  • Broadcasting: when in doubt, shout
    • When you don’t know how to identify the right host broadcast query to all hosts in the local-area-network


  • Caching: remember the past for a while
    • Store the information you learn to reduce overhead
    • Remember your own address & other host’s addresses


  • Soft state: … but eventually forget the past
    • Associate a time-to-live field with the information and either refresh or discard the information
    • Key for robustness in the face of unpredictable change
  • We’ve done this before in the LAN setting with switched

Bootstrapping Problem

  • Host doesn’t have an IP address yet
    • So, host doesn’t know what source IP address to use in IP packets


  • Host doesn’t know the IP addr of the DNS server to query for a dest IP address


  • Host doesn’t know the IP addr of gateway router interface
    • Can’t send packets out of LAN w/o it


  • Host doesn’t know which IPs stay on the LAN and which need to go through the gateway router


  • Host doesn’t know the link layer address of any interfaces

Broadcasting

  • But we do know how to broadcast on the LAN
    • Send a “discover” query with your MAC address as the source and with destination address: ff:ff:ff:ff:ff:ff
    • All adapters on the LAN receive the packet including the appropriate server(s)


  • And it is possible to broadcast in IP within a LAN
    • Use dest IP addr = 255.255.255.255

DHCP

  • DHCP is a client-server protocol,
    • the client initiates the session; the server is “waiting” for session initiations by clients


  • Client and server code are in the application layer
    • DHCP uses the UDP transport layer
    • A DHCP server listens on UDP port 67
    • A DHCP client uses UDP port 68

Dynamic Host Configuration Protocol

graphic

Response from the DHCP Server

  • Multiple servers may respond
    • Multiple servers on the same broadcast media
    • Each may respond with an offer
    • The client can decide which offer to accept


  • Accepting one of the offers
    • Client broadcasts a DHCP request echoing the parameters and the other servers see they were not chosen


  • The DHCP server responds with an ACK to

Deciding What IP Address to Offer

  • Server as centralized configuration database
    • All parameters are statically configured in the server
      • a dedicated IP address for each MAC address
    • Avoids complexity of configuring hosts directly while still having a permanent IP address per host


  • Or, dynamic assignment of IP addresses
    • Server maintains a pool of available addresses and assigns them to hosts on demand
    • Leads to less configuration complexity and more efficient use of the pool of addresses
    • Though, it is harder to track the same host over time

Time out, Release, Renew

  • Each lease is accompanied by a TTL
    • Important in order to be able to reallocate unused address


  • A long running host does not need to get a new IP addr after the TTL
    • It issues a Renew command to the DHCP server


  • A host may purposefully give up its IP addr by issuing a Release command
    • On graceful shut down, or leaving a network
    • To debug/fix a network problem
    • If you perform a DHCP Discover soon after you will likely get the same IP addr back

So, Now the Host Knows a Few Things

  • Its IP address
  • Its subnet Mask
  • The IP address of gateway router
  • The IP address of DNS server


  • And its IP forwarding table is simple
    • Everything goes out its one network interface
      • Except packets with dest addr in 127.0.0.0/8, i.e., localhost Ips
    • If the dest IP is outside the subnet, the next hop is the IP addr of the gateway router


  • But it doesn’t know any link layer addresses
    • In principle, could still route all frames out the adpater using dest MAC addr ff:ff:ff:ff:ff:ff—but there must be a better way…

Address Resolution Protocol Table

  • Every node maintains an ARP table
    • A mapping from IP address to MAC address


  • Consult the table when sending a packet
    • Get Next Hop IP from IP Forwarding table
    • Map that IP address to a MAC address via the ARP Table
    • Encapsulate IP packet in a frame using that MAC addr as the dest addr ARP Protocol

ARP Protocol

  • But, what if the IP address is not in the table?


  • Via link layer broadcast sender asks: “Who has IP address 1.2.3.156?”
    • Special ARP query packet encapsulated in a broadcast link layer frame


  • Node/Interface with IP addr in query responds: “I have that IP addr and my MAC address is 58-23-D7-FA-20-B0” in an ARP response
    • The ARP response packet is sent via a non-broadcast frame back to the querier
    • Sender caches the result in its ARP table


  • No need for network administrator to get involved

ARP Table Size

  • If ARP has to keep an entry for every IP outside of its subnet, its table could get large


  • Host learns gateway router interface IP address from DHCP
    • Can learn MAC address via ARP


  • Host learns the subnet mask from DHCP so can tell which packets are destined for outside the subnet
    • For these packets, the next hop IP is router interface IP
    • Create dest MAC addr of frame by consulting ARP table for that IP

Back to DHCP

  • A host that has just booted has an empty ARP table


  • How does it create frames for the DHCP packets?


  • Src MAC
    • Each host knows its own MAC addr and so can populate the src mac addr field of each frame


  • Dst MAC
    • Every dst IP in the DHCP protocol is the broadcast IP addr
    • This automatically maps to the broadcast MAC addr

Dynamic Host Configuration Protocol

graphic

Lecture 23 - Middleboxes, Firewalls,

Middleboxes

IP Principles

  • Internetworking principle
    • Global, scalable addressing
    • No packet “admission”, no global circuit computation
      • Path computations independent of packet forwarding
    • Open standards


  • End-to-end principle
    • Simple core/simple packet forwarding
      • Routing nodes simply forward packets (rather than modifying or filtering them)
    • Complexity, application support at end hosts, not in routers

Internet Reality

  • Security concerns
    • Discarding suspicious or unwanted packets
    • Detecting suspicious traffic


  • IP address depletion
    • Dynamic assignment of IP addresses
    • Private addresses (10.0.0.0/8, 192.168.0.0/16, …)


  • Host mobility
    • Changes in IP addresses as hosts move


  • Performance concerns
    • Controlling how link bandwidth is allocated
    • Storing popular content near the clients

Middleboxes

  • Middleboxes are intermediaries
    • Interposed in-between the communicating hosts
    • Often without knowledge of one or both parties


  • Examples
    • Network address translators
    • Firewalls
    • Traffic shapers
    • Intrusion detection systems
    • Web proxies
    • Application accelerators

Two Views of Middleboxes

  • An abomination
    • Violation of layering
    • Cause confusion in reasoning about the network
    • Responsible for many subtle bugs
    • Create roadblocks for some end-to-end applications


  • A practical necessity
    • Solving real and pressing problems
    • Needs that are not likely to go away


  • The real world is bound to conflict with

architectural purity

Firewalls

Network Effect

  • The Internetworking/end-to-end architecture enabled the Network Effect
    • The value to join a network is proportional to the number hosts/services reachable on the network
    • When a host joins the network, the value of the network increases


  • BUT, it also enabled the the Network Effect for (in)security
    • Each node that you can reach, can reach you—and is potentially malicious
    • Every new host joining the network is potentially a new attack vector
    • The “attack surface” grows as the network grows

Vulnerabilities and Trust Domains

  • Perhaps in an ideal world if no software vulnerabilities existed and authentication/authorization processes were perfect it would be fine to allow every Internet host even those with malicious intent to contact every other Internet host


  • Even in this ideal world, the malicious hosts could mount a Denial of Service attack
    • make so many requests to a given server as to make it unusable by the honest hosts


  • But we don’t live in an ideal world
    • Software vulnerabilities can be exploited via network-enabled Internet attacks
    • Bogus authorizations/privileges can be obtained and escalated


  • Organizations have data on servers and desktops and cpu cycles on application servers that they want to protect
    • They can’t rely solely on the end hosts to protect these resources—the end-to-end model is too idealistic here
    • They need some kind of defense in depth

Internet Attacks: Break-Ins

  • Breaking into a host
    • Outsider exploits a vulnerability in the end host with the goal of changing the behavior of the host


  • Example
    • Bad guys know a Web server has a buffer-overflow bug and, say, send an HTTP request with a long URL allowing them to run their own code


  • Motivations for break-ins
    • Take over the machine to launch other attacks
    • Steal information stored on the machine
    • Modify/replace the content the site normally returns

Internet Attacks: Denial of Service

  • Denial-of-service attacks
    • Outsider overwhelms the host with unsolicited traffic with the goal of preventing any useful work


  • Example: attacks by botnets
    • Bad guys take over a large collection of hosts and program these hosts to send traffic to your host Leading to excessive traffic


  • Motivations for denial-of-service attacks
    • Malice (e.g., just to be mean)
    • Revenge (e.g., for some past perceived injustice)
    • Greed (e.g., blackmailing)

Organizational boundaries

  • Pre Internet, organizations:
    • enforced boundaries/access around their physical property and buildings
      • Fences, security gates/guards, ID badges
    • enforced access to rooms, facilities, and paper assets,
      • via keys


  • Post Internet
    • Very natural for organizations to draw a line around their networks and enforce access via a “security gate”
    • Called a firewall

Firewall

  • Isolates an organization’s internal net from larger Internet, allowing some packets to pass, blocking others.

Packet Filtering

  • Internal network connected to Internet via firewall


  • Firewall filters packet-by-packet, based on:
    • Source IP address, destination IP address
    • TCP/UDP source and destination port numbers
    • ICMP message type
    • TCP SYN and ACK bits


  • Examples
    • Block all packets with IP protocol field = 17 and with either source or dest port = 23.
      • All incoming and outgoing UDP flows blocked
      • All Telnet connections are blocked
    • Block inbound TCP packets with SYN but no ACK
      • Prevents external clients from making TCP connections with internal clients
      • But allows internal clients to connect to outside
    • Block all packets with TCP port of Doom3

Firewall Configuration

  • Firewall applies a set of rules to each packet
    • To decide whether to permit or deny the packet


  • Each rule is a test on the packet
    • Comparing IP and TCP/UDP header fields and deciding whether to permit or deny


  • Order matters
    • Logic is defined as a list of rules
    • The packet is tested starting at the top of the list
    • Once the packet matches a rule, the decision is done


Example:

  • Alice runs a network in 222.22.0.0/16
    • Wants to let Bob’s school access certain hosts
      • Bob’s hosts are on 111.11.0.0/16
      • Bob’s hosts should be able to access hosts on 222.22.22.0/24
    • Alice doesn’t trust Trudy, inside Bob’s network
      • Trudy is on 111.11.11.0/24
    • Alice doesn’t want any other traffic from Internet
  • Rules
    • Don’t let Trudy’s machines in
      • Deny (src = 111.11.11.0/24, dst = 222.22.0.0/16)
    • Let rest of Bob’s network in to special dsts
      • Permit (src=111.11.0.0/16, dst = 222.22.22.0/24)
    • Block the rest of the world
      • Deny (src = 0.0.0.0/0, dst = 0.0.0.0/0)

A Variation: Traffic Management

  • Permit vs. deny is too binary a decision
    • Maybe better to classify the traffic based on rules and then handle the classes of traffic differently


  • Traffic shaping (rate limiting)
    • Limit the amount of bandwidth for certain traffic
      • rate limit on Web or P2P traffic


  • Separate queues
    • Use rules to group related packets
    • then do round-robin scheduling across the groups
      • separate queue for each internal IP address

Firewall Implementation Challenges

  • Per-packet handling
    • Must inspect every packet
    • Challenging on very high-speed links


  • Complex filtering rules
    • May have large # of rules
    • Filtering may be stateful
      • Don't let in any TCP packet unless the flow originated inside
    • May do application level filtering
      • Filtering on blacklisted urls, P2P app detection


  • Location of firewalls
    • Complex firewalls near the edge, at lower speed
    • Simpler firewalls in the core, at higher speed

Firewalls are not full proof

  • tunneling over http
  • lots of mobile devices go in and out of a a fire-walled network


Examples:

  • Enterprise wifi can be snooped from afar
    • Pringles directional antennae for wardriving
    • First practical WEP attack from my group at AT&T
    • WPA/WPA2 still vulnerable to password attacks
  • filtering dorm access to a server
    • Firewall rule based on IP addresses of dorms and the server IP address and port number
    • Problem: users may log in to another machine
      • connect from the dorms to another host and then onward to the blocked server


  • filtering P2P based on port #s
    • Firewall rule based on TCP/UDP port numbers
      • allow only port 80 (e.g., Web) traffic
    • Problem: software using non-traditional ports
      • write P2P client to use port 80 instead

Network Address Translation

History of NATs

  • IP address space depletion
    • Clear in early 90s that 232 ~ 4 billion addresses not enough
    • Work began on a successor to IPv4—IPv6


  • In the meantime…
    • Share addresses among numerous devices without requiring changes to existing hosts


  • Meant to provide temporary relief
    • Intended as a short-term remedy
    • Now, NAT are very widely deployed. Much more so than IPv6

Private Addresses

  • IANA set aside several address blocks for the purposes of private addressing
    • 10.0.0.0/8
    • 172.16.0.0/12
    • 192.168.0.0/16


  • Any organization can use these addresses. No delegation/permission from IANA required


  • Since lots of orgs can be using then at the same time, these addresses are not publically routable
    • They’re ignored/dropped by routers
    • Not propagated via BGP

Port Numbers

  • The so-called well-know port numbers: 0—1023
    • Defined by IANA/IETF
    • Well known services
    • Used as destination port numbers


  • The range of port numbers from 1024 to 49151 are the registered ports
    • They are assigned by IANA for specific service upon application by a requesting entity


  • IANA suggests the range 49152 to 65535 for private and dynamic ports
    • These are used as ephemeral ports--destination port numbers picked at random for each connection

What if Both Hosts Contact Same Site?

  • Suppose hosts contact the same destination
    • both hosts open a socket with local port 3345 to destination 128.119.40.186 on port 80


  • NAT has to give the packets the same source address
    • All packets sent to the public Internet have source address 138.76.29.7
    • This is the only publicly addressable IP addr
    • The 10.0.0.0/8 addresses are not publicly addressable


  • Problems
    • How can destinations differentiate between senders?
    • How can return traffic get back to the correct hosts?

NAT

  • The dst IP and port can’t be changed by the NAT box
    • The packet has to get to where it is supposed to go!


  • The src IP has to be the external IP of the NAT box
    • The destination host needs a public IP address to send response packets to


  • The only thing the NAT box has a choice about is the src port
    • The NAT box must make the src ports different so that return packets can be distinguished

Port-Translating NAT

  • Map outgoing packets
    • Replace source IP address with external public NAT address
    • Replace source port number with a NAT generated port number
    • Remote hosts respond using (NAT IP address, NAT port #) for dst IP and port


  • Maintain a translation table
    • Store map of (src IP, src port) to (NAT port #)
    • This map needs to be one-to-one: each unique IP_addr, port # pair must map to a different NAT port #


  • Map incoming packets
    • Consult the translation table
    • Convert (dst port) into (dst IP, dst port)
    • Local host receives the incoming packet

Maintaining the Mapping Table

  • For each outgoing packet, check if there’s an entry in the table
    • If not create a new entry/mapping


  • Eventually, need to delete the map entry
    • But when to remove the binding?


  • If no packets arrive within a time window then delete the mapping to free up the port #s
    • At risk of disrupting a temporarily idle connection


  • Yet another example of “soft state”
    • removing state if not refreshed for a while

Many Kinds of NATs

  • Some NATs also effectively include firewall rules:
    • Only allow return packets from src IP_addr/src_port#
    • Only allow return packet from src IP_addr
    • Allow any return packet with a (dest) NAT port# in the table


  • Some corporate NATs have many external IP addresses
    • Map from private IP/private port to NAT IP/NAT port is maintained in a table and is one-to-one


  • NATs provide some security and hide the structure of internal networks

Where is NAT Implemented?

  • Home router (e.g., Linksys box)
    • Integrates router, DHCP server, NAT, etc.
    • Use single IP address from the service provider and have a bunch of hosts hiding behind it


  • Campus or corporate network
    • NAT at the connection to the Internet
    • Share a collection of public IP addresses
    • Avoid complexity of renumbering end hosts and local routers when changing service providers

Running Servers Behind NATs

  • Running servers is still possible
    • Admittedly with a bit more difficulty


  • By explicit configuration of the NAT box
    • internal service at <dst 138.76.29.7, dst-port 80> mapped to <dst 10.0.0.1, dst-port 80>


  • More challenging for P2P applications
    • Especially if both peers are behind NAT boxes


  • Though solutions are possible here as well
    • Existing work-arounds (e.g., in Skype)
    • Ongoing work on “NAT traversal” techniques

Objections Against NATs

  • Practical Objections
    • Port #s are meant to identify sockets
    • Yet, NAT uses them to identify end hosts
    • Makes it hard to run a server behind a NAT


  • Principled Objections
    • Routers are not supposed to look at port #s
      • Network layer should care only about IP header and not be looking at the port numbers at all
    • NAT violates the end-to-end argument
      • Network nodes should not modify the packets
    • IPv6 is a cleaner solution
    • But NATs work and here now
    • I suspect NATs will be around for quite some time

Summary

  • Middleboxes address important problems
    • Blocking unwanted traffic
    • Getting by with fewer IP addresses
    • Making fair use of network resources
    • Improving end-to-end performance


  • Middleboxes cause problems of their own
    • No longer globally unique IP addresses
    • No longer can assume network simply delivers packets
    • Harder to deploy end-to-end applications and UDP applications