Course:CPSC317/2013WT1/Lecture Notes

Lecture 16

Routing

Admin

Exam
- Mean:
Regrading
- If there’s an addition error, let me know
- Otherwise, regrading requests needs to be for 4 or more points
  - Takes a change of 5.5 points on the midterm to change your course mark by 1 pt
- Rationale has to be in an email to me

If you didn’t do well
- Don’t panic, assignments, final and participation are worth 82% or your grad
- Come see me regularly at office hours

Course grading scheme
- Assignments 44%
- Final Exam 36%
- Midterm 18%
- PeerReview 4%

Percentages may move by 1 or 2 points

The midterm and/or final marks may be scaled up a bit if the course average is lower than I’d like

Last Lecture

IP layer uses the FT to find two things

FT:
- CIDB block à? outgoing InterfaceID/IP and Next Hop IP

The link layer at that interface knows it’s own MAC address
- but does not know the MAC address of the Next Hop IP

ARP module sits betw IP layer and link

ARP:
- Next Hop IP à? dest MAC addr

Now link layer has both src and dest MAC addrs ==== Now let’s get back to routing ====

Data plane: mostly about forwarding
- Uses the Forwarding Table data structure

Control plane: mostly about “routing”
- Routing is the process by which routers collectively compute forwarding tables such that:
  - Every host can reach every other host
  - There are no cycles, except perhaps for short periods
- Routing will use a data structure called a routing table
  - Much like a forwarding table but may have more info in it
  - Will depend on the routing protocol

Where do Forwarding Tables Come From?

Forwarding table entries can be statically configured
- E.g., “map 12.34.158.0/24 to Serial0/0.1”

But, this doesn’t adapt
- To failures
- To new equipment
- To the need to balance load

That is where routing protocols come in

Routing

Routing on the Internet is hierarchical
- The internet is broken up into Autonomous Systems
  - An AS is a large IP network within a single administrative domain"
  - One organizations can have several Ases"
- Not all networks are in an AS"
  - Some networks are stub IP networks"
  - They’re attached to exactly one AS"

Two main levels of routing hierarchy
- How to route between ASes"
  - Interdomain routing or Exterior Gateway Protocols "
- How to route within an AS or stub IP network"
  - Intradomain routing or Interior Gateway Protocols

Interdomain routing

As a first cut, think of each AS as “meta router”

Think about a given AS

It is connected to several neighboring ASes

For a given CIDR block, the given AS needs to decide on the “next hop” AS

For a given destination address block, the goal is for the next hop edges to form a directed tree where all paths lead to the “root” AS

Intradomain routing

Just worries about CIDR blocks within the AS
- and maybe those in connected stub networks

The routers can exchange detailed proprietary topology and network info, e.g., link speeds

The paths taken to a subnet in the AS must be only through routers in the AS

Putting them together

Eventually you have to glue the intradomain routing decisions and the interdomain routing decisions together
- More on that later

Routing

There are two main techniques for routing
- Link State routing
  - Crucially uses Dijkstra’s algorithm for single source shortest path
  - Used in Intradomain routing
- Distance Vector routing
  - Used in Intradomain for small networks
  - A variant used in Interdomain

There are two main techniques for routing
- Link State routing
  - Crucially uses Dijkstra’s algorithm for single source shortest path
  - Used in Intradomain routing
- Distance Vector routing
  - Used in Intradomain for small networks
  - A variant used in Interdomain

Link State Routing

We’ll abstract away the edge subnets and just talk about routing on a graph

Four steps 1. Link State Detection:
- - Each router keeps track of its incident links

Whether the link is up or down

The cost on the link 2. Link State Flooding:
- - Nodes collectively broadcast topology updates

After which each node has its own copy of the topology 3. Local Shortest Path Computation:
- - Every node, e.g., u, locally computes the shortest path between it and all the other nodes in the graph 4. FT Update: Every node updates it forwarding table
  - For every v, u uses the first edge on its path to v in it’s FT

E.g., if the first edge on the shortest path from u to v is (u,x) then u’s forwarding table for v is v: (u,x)"

Link state often used for intradomain routing

Example link state intradomain protocols
- Open Shortest Path First (OSPF)
- Intermediate System ** Intermediate System (IS-IS)

Four steps 1. Each router keeps track of its incident links
- - Whether the link is up or down
  - The cost on the link 2. Nodes collectively broadcast topology updates

(flooding)

- - After which each node has its own copy of the topology 3. Every node, e.g., u, locally computes the shortest path between it and all the other nodes in the graph 4. Every node updates it forwarding table
  - For every v, u uses the first edge on its path to v in it’s FT

E.g., if the first edge on the shortest path from u to v is (u,x) then u’s forwarding table for v is v: (u,x)"

Shortest-Path Tree

Shortest-path tree from u * Forwarding table at u u! v! w! x! y! z! s! t! v (u,v) w (u,w) x (u,w) y (u,v) z (u,v) link s (u,w) t (u,w)

Link State Routing

Four steps 1. Each router keeps track of its incident links
- - Whether the link is up or down
  - The cost on the link 2. Nodes collectively broadcast topology updates

(flooding)

- - After which each node has its own copy of the topology 3. Every node, e.g., u, locally computes the shortest path between it and all the other nodes in the graph 4. Every node updates it forwarding table
  - For every v, u uses the first edge on its path to v in it’s FT

E.g., if the first edge on the shortest path from u to v is (u,x) then u’s forwarding table for v is v: (u,x)"

For shortest path problem we use either the hop count or more generally edges can have different costs:
- E.g., cost might be inversely proportional to link rate

We assume that the costs are not a function of current traffic in the network
- But might be a function of long term averages of traffic

Single Source Shortest-Path Problem

Given: network topology with link costs
- c(x,y): link cost from node x to node y
- Infinity if x and y are not direct neighbors

Compute: least-cost paths to all nodes
- From a given source u to all other nodes u! v!

Dijsktra’s Algorithm

1 Initialization: 2 S0 = {u} 3 for all nodes v 4 if (v is adjacent to u) 5 D0(v) = c(u,v) 6 else D0(v) = 8 For i=1 to N 9 find w not in Sk-1 with the smallest Dk-1(u,w) 10 add w to S: Sk = Sk-1 union {w} 11 update D(u,v) for all v adjacent to w and not in S: 12 Dk(u,v) = min{Dk-1(u,v), Dk-1(u,w) + c(w,v)} 13 Scope of For

Link State Routing

Four steps 1. Each router keeps track of its incident links
- - Whether the link is up or down
  - The cost on the link 2. Nodes collectively broadcast topology updates

(flooding)

- - After which each node has its own copy of the topology 3. Every node, e.g., u, locally computes the shortest path between it and all the other nodes in the graph 4. Every node updates it forwarding table
  - For every v, u uses the first edge on its path to v in it’s FT

E.g., if the first edge on the shortest path from u to v is (u,x) then u’s forwarding table for v is v: (u,x)"

Detecting Topology Changes

Beaconing
- Periodic “hello” messages in both directions
- Detect a failure after a few missed “hellos”

Performance trade-offs
- Detection speed
- Overhead on link bandwidth and CPU
- Likelihood of false detection

“hello”!

Learning The Topology

The routers are using the Link State Protocol to build the routing table
- They can’t assume they have a routing table to use to forward their local topology information to the other nodes on the network

How do the nodes communicate local topology information?

Broadcasting Link State

Idea: Flooding
- Each node sends out it’s local link state to it’s neighbors
  - The neighbors rebroadcast it
- In general, if a node gets link state info it rebroadcasts it to its neighbors

Eventually, every nodes gets info about all of the links

Problem: exponential number of packets
- When do you stop rebroadcasting?

Solution: don’t rebroadcast an update over the link it came in on
- Ensures an update travels on each edge exactly once
- Can piggyback updates on one another

Broadcasting the Link State

Flooding
- Node sends link-state information out its links
- And then the next node sends out all of its links
- … except the one where the information arrived X A C B D

(a) X A C B D (b) X A C B D (c) X A C B D (d)

Problem: if a node broadcasts its link state periodically, an older broadcast may reach another node at a later time than a more recent broadcast
- How to prevent nodes from using “old” updates?

Solution: time-stamps and/or sequence numbers
- They have their own problems

When to Initiate Flooding

Topology change
- Link or node failure
- Link or node recovery

Configuration change
- Link cost change

Periodically
- Refresh the link-state information
- Typically (say) 30 minutes
- Corrects for possible corruption of the data

Challenges
- Packet loss
- Out-of-order arrival

Reliable flooding
- Ensure all nodes receive link-state information
- … and that they use the latest version

Solutions
- Acknowledgments and retransmissions
- Sequence numbers
- Time-to-live for each packet

Protocols are complex

When the Routers Disagree

(during transient periods)!

Convergence

Getting consistent routing information to all nodes
- E.g., all nodes having the same link-state database

Consistent forwarding after convergence
- All nodes have the same link-state database
- All nodes forward packets on shortest paths
- The next router on the path forwards to the next hop

Transient Disruptions

Detection delay
- A node does not detect a failed link immediately
- … and forwards data packets into a “blackhole”
- Depends on timeout for detecting lost hellos

Inconsistent link-state database
- Some routers know about failure before others
- The shortest paths are no longer consistent
- Can cause transient forwarding loops 4

Convergence Delay

Sources of convergence delay
- Detection latency
- Flooding of link-state information
- Shortest-path computation
- Creating the forwarding table

Performance during convergence period
- Lost packets due to blackholes and TTL expiry
- Looping packets consuming resources
- Out-of-order packets reaching the destination

Very bad for VoIP, online gaming, and video

Reducing Convergence Delay

Faster detection
- Smaller hello timers
- Link-layer technologies that can detect failures

Faster flooding
- Flooding immediately
- Sending link-state packets with high-priority

Faster computation
- Faster processors on the routers
- Incremental Dijkstra’s algorithm

Faster forwarding-table update
- Data structures supporting incremental updates

Scaling Link-State Routing

Overhead of link-state routing
- Flooding link-state packets throughout the network
- Running Dijkstra’s shortest-path algorithm

Introducing hierarchy through “areas” Area Area 1 Area Area 3 Area area border router

Where are we?

IP Control Plane: Routing
- Link State protocol family
  - Flooding link state + local computation of shortest paths
  - Used for Intra-AS routing
  - When links come up and down, black holes and loops possible until convergence
- Distance Vector protocol family
  - Share with your neighbors what you think your distance is to the other nodes in the graph
  - Used for Intra-AS and Inter-AS routing

Distance Vector Protocol

Abstract away IP addresses and subnet addresses again
- We’ll just concentrate on a graph

Each node in the graph is going to be running the distance vector protocol
- The goal is very similar as to the Link State protocol:
  - The protocol in node v is trying to compute the cost of the least cost path to every other node in the network, and the outgoing edge for that path
  - Unlike a link state protocol v does not compute the entire path itself

In DV, node v keeps it’s current estimate of the cost of the least cost paths to every other node in the network.
- This is v’s “distance vector”
- When v’s distance vector changes, it shares it’s distance vector with its neighbors
- Conversely, when v receives updates from neighbors, it updates its distance vector and sends updates to its neighbors if necessary

Distance Vector Algorithm

Distributed:

Each node performs its iterations locally and asynchronously Iterative: each local iteration caused by:

Local link cost change

Distance vector update message from neighbor Asynchronous:

Each node notifies neighbors when its DV changes, or at least according to a local timer wait for (change in local link cost or message from neighbor) recompute estimates if distance to any destination has changed, notify neighbors Each node:

How do the updates work?

Define distance between two nodes in a graph
- D(x,y) = cost of least-cost path from x to y

= minimum over all paths p from x to y of {the cost of p}

- Let N(y) be the set of all neighbors of y

Bellman Ford Equation
- D(x,y) = min over w in N(y) of { D(x,w) + c(w,y) }

= min over w in N(x) of { c(x,w) + D(w,y) }

We’re still at a prototypical node v.

Rather than worrying about v’s distance to every other node—it’s whole distance vector--let’s just focus on just one u.

Consider the neighbors of v: w1, w2,…,wk.

v knows the cost of its outgoing links: c(v,w1), c(v,w2),…

For each neighbor, v stores what that neighbor last told v about it’s current estimate of the distance to u

When v gets an update it uses the BF eq’s to calculate it’s its new estimate
- D(v,u) = min over w in N(v) of { c(v,w) + D(w,u) }

Distance Vector Algorithm

Distributed:

Each node performs its iterations locally and asynchronously Iterative: each local iteration caused by:

Local link cost change

Distance vector update message from neighbor Asynchronous:

Each node notifies neighbors when its DV changes, or at least according to a local timer wait for (change in local link cost or message from neighbor) recompute estimates if distance to any destination has changed, notify neighbors Each node:

Bellman Ford

Why is it true?

In the definition of D(x,y) we need to take the minimum over *every* path between x and y.

Let’s do the following 1. Let’s partition all of the paths into disjoint sets 2. Find the minimum cost of a path in each partition 3. Take the minimum over the minimum in each partition
- This is equivalent to the definition

1. What partition should we use?

- For each w in N(y), consider the paths through w. This is a partition of all the paths from x to y 2. What’s the minimum cost in each partition?
- Since each path goes through (w,y) each path shares the additive cost of c(w,y). The minimum of this partition is just the minimum cost from x to w + c(w,y) which is D(x,w) + c(w,y) 3. What’s the minimum over these partition values?
- Min over w in N(y) of {D(x,w) + c(w,y)}

DV protocol!

Quick note:
- DV often described as neighbors exchanging the entire distance vector (their current estimate to every other node in the network)
- In some variants only the distance estimate that have changed are sent in updates
- Updates can be
  - in response to changes in the topology,
  - Periodic
  - Or both

Lecture 15

IP FORWARDING AND ADDRESSING

IP ADDRESS (IPV4)

A unique 32-bit number

Represented in dotted-decimal notation
- Each byte represented by a integer between 0 and

E.g., 12.34.158.00001100 00100010 10011110 12 34 158 ====HIERARCHICAL ADDRESSING: IP PREFIXES====

An IP address is divided into network & host portions (left and right)

The network portion is often called the network prefix 11001100 00100010 10011110 Network (24 bits) Host (8 bits) 204 34 158

Host whose IP addresses share the same network portion/prefix are part of a subnetwork

IP address that share the same network prefix form a block of addresses, e.g., [204.34.158.0, 204.34.158.255] 11001100 00100010 10011110 Network (24 bits) Host (8 bits) 204 34 158

Forwarding Table Look-up

The forwarding table is a mapping from subnetworks (i.e., blocks of addresses) to outgoing links

When a destination IP address comes in:
- the router checks which subnetwork the IP address is in and retrieves the outgoing link.
  - It does that by comparing the prefix of the destination IP address with the network prefixes that define the subnetworks in its table
  - Can think of the prefix as the most significant bits of the address

CLASS-BASED ADDRESS BLOCKS

A Class A address block:
- X.0.0.0 where
  - X is a byte with leading * A Class B address block:
- X’.Y.0.0 where
  - X’ is a byte with leading o Y can be anything

A Class C address block:
- X’’.Y.Z.0 where
  - X’’ is a byte with leading o Y, and Z can be anything

CLASSLESS-BASED ADDRESSING

Size of Class-based blocks not flexible enough

New scheme in 1993: classless-addressing
- Explicitly encode the number of bits being used for the network prefix

CLASSLESS INTER-DOMAIN ROUTING (CIDR)

IP Address : 12.4.0.0 IP Mask: 255.254.0.Address 00001100!00000100!00000000!00000000! Mask 11111111!11111110!00000000!00000000! Network Prefix for hosts Use two 32-bit numbers to represent a network. CIDR block = (Network IP address, Mask) Written as 12.4.0.0/

SUBNET MASKS

With the subnet mask the router can quickly:
- perform a bit-wise-and of the mask and the destination IP address
- and then compare the result to the Network IP address

CIDR BLOCKS

Note that a CIDR Block address really denotes a block of contiguous addresses:

E.g., 193.34.158.0/23!
- The last 32-23 = 9 bits are wildcards!
- That’s 512 IP addresses!
- [193.34.158.0 to 193.34.160.0) = !
- [ 193.34.158.0 to 193.34.159.255]

DELEGATION AND AGGREGATION

CIDR block addressing allows for flexible block sizes:
- To better match the needs of each network

Address blocks delegated down from:
- IANA to RIRs to chain of provider/customer relationships

Allows FTs to aggregate address blocks

But need the longest prefix match rule to break ties

LONGEST PREFIX MATCH FORWARDING

Forwarding tables in IP routers
- Maps each IP prefix to output interfaces

When a dest IP matches more than one network prefix/CIDR block
- Choose the CIDR block with the longest prefix 4.0.0.0/4.83.128.0/201.10.0.0/201.10.6.0/126.255.103.0/201.10.6.destination forwarding table Serial0/0.outgoing link

LONGEST PREFIX MATCH

Recall address prefixes/CIDR blocks are blocks of addresses

The “longest prefix match” is the smallest or most specific address block to which the dest IP belongs

WORKING WITH CIDR ADDRESSES

/n Length of prefix

of

wildcard bits Size of block Size of block in base Size - /30 2 22 = 4 0.0.0.4 0.0.0./29 3 23=8 0.0.0.8 0.0.0./28 4 24=16 0.0.0.16 0.0.0./27 5 25=32 0.0.0.32 0.0.0./26 6 26=64 0.0.0.64 0.0.0./25 7 27=128 0.0.0.128 0.0.0./24 8 28=256 0.0.1.0 0.0.0./23 9 29=21*28=512 0.0.2.0 0.0.1./22 10 210=22*28=1024 0.0.4.0 0.0.3./21 11 211=23*28=2048 0.0.8.0 0.0.7.

Determine if 221.75.18.5 is in: a. 221.72.0.0/b. 221.73.128.0

What is the range of addresses represented by: a. 221.72.0.0/** 32-14 = 18 wildcard bits
- So the address block has 2^{18} addresses starting at 221.72.0.** 2^18 = 2^2*2^{16} = 0.4.0.** So, the last address (exclusive) is
- 221.72.0.0 + 0.4.0.0 = 221.76.0.** The range is
- [221.72.0.0, 221.76.0.0) =
- [221.72.0.0, 221.75.255.255]

What is the range of addresses represented by: b. 221.73.128.0/** 32-17 = 15 wildcard bits
- So the address block has 2^{15} addresses starting at 221.73.128.** 2^15 = 2^7*2^8 = 0.0.128.** So, the last address (exclusive) is
- 221.73.128.0 + 0.0.128.0 = 221.74.0.** The range is
- [221.73.128.0, 221.74.0.0) =
- [221.73.128.0, 221.73.255.255]

IP ADDRESSES FOR INTERFACES

IP addresses are for network interfaces not for hosts

Why?

IP ADDRESSES ARE FOR INTERFACES!

IP addresses are for network interfaces not for hosts

Why?

If a router is going to route packets between two networks, it has to have an address in each network 223.1.1.223.1.1.223.1.1.223.1.1.4 223.1.2.223.1.2.223.1.2.223.1.3.1 223.1.3.223.1.3.223.1.1.0/24 223.1.2.0/223.1.3.0/

SUBNETS

IP address:
- subnet part (high order bits)
- host part (low order bits)

What’s a subnet ?
- device interfaces with same subnet part of IP address
- can physically reach each other without intervening router 223.1.1.223.1.1.223.1.1.223.1.1.4 223.1.2.223.1.2.223.1.2.223.1.3.1 223.1.3.223.1.3.network consisting of 3 subnets subnet 223.1.1.0/223.1.2.0/223.1.3.0/

To determine the physical subnets, detach each router interface, creating islands of isolated networks.

The IP addresses of all interfaces in a subnet
- share the same address prefix
- are in the same address block 223.1.1.0/24 223.1.2.0/223.1.3.0/

How many? 223.1.1.223.1.1.223.1.1.223.1.2.1 223.1.2.223.1.2.223.1.3.1 223.1.3.223.1.3.223.1.1.223.1.7.223.1.7.223.1.8.1 223.1.8.223.1.9.223.1.9.

SPECIAL IP ADDRESSES

Private address blocks
- 10.0.0.0/8, 172.16.0.0/12,192.168.0.0/** Can be used inside a private network
- A router is never supposed to forward a packet with a private source or dest addr to the public Internet
- A router sitting between a private and public network has to do something clever to translate between private and public addresses—more on this later in the course

Loopback
- 127.0.0.0/** Can be used inside of a host (e.g., web server and browser running on the same host)

Directed Multicast
- Network part followed by all 1s
- Directed broadcast to a specific network:
- Only as dest addr
- E.g. 12.34.158.255 in 12.34.158.0/* In-network multicast
- All 1’s
- Can never be forwarded outside of the subnet

“This host”
- When a host doesn’t know its IP address but needs to put something in the source addr field
- 0.0.0.* And a few others….

NEXT HOP

We’ve been saying that a forwarding table is a mapping from an address prefix to an outgoing interface
- What happens when several routers are connected to a LAN?
  - Need a way to say which of the other routers should receive the packet
- This is common in “peering points” where routers of multiple ISPs share packets back and forth
- Also true for wifi and classic ethernet LANS
- The IP should not make any assumptions about whether the link layer is pt to pt or broadcast

CIDR block IP interface Interface ID Next hop Flags 201.98.20.0/10 25.8.16.1 Eth0 25.8.16. Simplified Forwarding Table!

- What happens at the destination router?
  - The subnet in which the destination host resides is called the directly connected network
  - How do you know you’re there?
  - What’s the next hop?

DIRECTLY CONNECTED

CIDR block IP interface Interface ID Next hop Flags 201.98.20.0/10 25.8.16.1 Eth0 25.8.16.15.22.8.0/25 15.22.8.5 Eth1 --- Simplified Forwarding Table!

NEXT HOP

- What happens at the destination router?
  - How do you know you’re there?
- If your dest IP addr matches on a CIDR block AND
- The IP addr of the output IP Interface in that row of the FT is in the CIDR block AND
- There is no value in the next hop field
  - Sometimes this is also indicated by a flag in that row of the FT
  - What is the next hop?
- The next hop in this case is just the dest IP addr of the datagram

Connection between Link and IP layer

Before moving on with routing, I want to back up just a bit

We talked about:
- IP addresses associated with interfaces
  - So every interface has an associated MAC address and IP address
- The notion of an IP subnet
- The need for “next hop” in the routing table
- A little bit about what the destination router does

Subnets and LANs

When you go hop by hop at the network layer across an IP network:
- The source and destination IP addresses stay the same the entire time
- BUT
- The source and destination MAC address (both) change at each hop

Example: A Sending a Packet to B

How does host A send an IP packet to B? A R B A sends packet to R, and R sends packet to B.!

Host A Sends Packet Through R

IP header
- Source: 111.111.111.** Dest: 222.222.222.A R B

A R B A’s FT gives 111.111.111.110 as Next Hop for 222….

Next Hop

CIDR block IP interface Interface ID Next hop Flags 222.222.222.128/25 111.111.111.111 Eth0 111.111.111.Dest IP 222.222.222.222 matches this row of the forwarding table!

Crafting the link layer frame

CIDR block IP interface Interface ID Next hop Flags 222.222.222.128/25 111.111.111.111 Eth0 111.111.111.* Passing the IP packet down to the link layer for Eth0!

IP interface/Interface ID à? src MAC addr!

Next hop IP à? dest MAC addr !

!

Host has map of Next hop IP to dest MAC addr !

ARP table/ARP protocol!

Host A Sends Packet Through R

IP header
- Source: 111.111.111.** Dest: 222.222.222.* Ethernet frame
- Source: 74-29-9C-E8-FF-** Dest: E6-E9-00-17-BB-4B A R B

R Decides how to Forward Packet

Router R’s adapter receives the packet
- R extracts the IP packet from the Ethernet frame
- R sees the IP packet is destined to 222.222.222.A R B

Router R consults its forwarding table
- Dest addr matches 222.222.222.0/24 row of FT
- Outgoing interface: 222….** Directly connected so Next hop = dest IP A R B

Next Hop

CIDR block IP interface Interface ID Next hop Flags 222.222.222.128/25 222.222.222.220 Eth1 --- Dest IP 222.222.222.222 matches this row of the forwarding table!

R constructs frame

Outgoing interface 222..220à?src MAC addr 1A::9B

Next hop = dest IP =222..220à?dest MAC addr 49::2A A R B

Router R Wants to Forward Packet

IP header
- From A: 111.111.111.** To B: 222.222.222.* Ethernet frame
- From R: 1A-23-F9-CD-06-9B
- To B: 49-BD-D2-C7-56-2A A R B

Question

How does the host know the IP address of the gateway router for its subnet?

How does the router know the MAC addresses of the IP hosts on its directly connected subnets?

Two protocols are needed:
- ARP: Address Resolution Protocol
- DHCP: Dynamic Host Configuration Protocol

We’ll cover DHCP later in the term

Address Resolution Protocol Table

Every node maintains an ARP table
- A mapping from IP address to MAC address

Consult the table when sending a packet
- Get Next Hop IP from IP Forwarding table
- Map that IP address to a MAC address via the ARP Table
- Encapsulate IP packet in a frame using that MAC addr as the dest addr

ARP Protocol

But, what if the IP address is not in the table?

Via link layer broadcast sender asks: “Who has IP address 1.2.3.156?”
- Special ARP query packet encapsulated in a broadcast link layer frame
- Broadcast means dest MAC addr is ff:..:ff

Node/Interface with IP addr in query responds: “I have that IP addr and my MAC address is 58-23-D7-FA-20-B0” in an ARP response
- The ARP response packet is sent via a non-broadcast frame back to the querier
- Sender caches the result in its ARP table

No need for network administrator to get involved

ARP Table Size

If ARP had to keep an entry for every IP outside of its subnet, its table could get large

But it only needs to know the MAC addr for the next hop IP
- For a stub subnetwork, Next Hop IP is either the dest IP address or the IP address of the gateway router

How does a host know, e.g., after boot,
- What is the IP address is of my gateway router?
- How do I know whether a dest IP should go to the gateway router or whether it is the Next Hop IP?
  - i.e., how do I know what subnet/CIDR block I’m sitting in?

Host will learn both of these from DHCP

TOPICS

Circuit Switching

Switching inside the hub, e.g., crossbar switch

Multiplexing: FDM, TDM, Stat muxing

Packet routing; Hop by hop routing; store and forward routing

Internet Architecture philosophy:
- End to End philosophy: network routers pretty much just forward packets
- Internetworking: make it easy to glue local networks together

Network Effect

Main tradeoffs be packet routing and circuit switching

5 layer reference model: what layer is responsible for what

Reliability techniques:
- Error Detection/correction
- Acknowledgement protocols

Link layer protoocls
- Classic ethernet, switched ethernet, wifi

IP forwarding and addressing

IP and Link Layer working together

Lecture 14

IP FORWARDING AND ADDRESSING

IP PACKET STRUCTURE

4-bit Version 4-bit Header Length 8-bit Type of Service (TOS) 16-bit Total Length (Bytes) 16-bit Identification 3-bit Flags 13-bit Fragment Offset 8-bit Time to Live (TTL) 8-bit Protocol 16-bit Header Checksum 32-bit Source IP Address 32-bit Destination IP Address Options (if any) Payload

IP HEADER: TO AND FROM ADDRESSES

Two IP addresses
- Source IP address (32 bits)
- Destination IP address (32 bits)

Destination address
- Unique identifier for the receiving host
- Allows each node to make forwarding decisions

Source address
- Unique identifier for the sending host
- Recipient can decide whether to accept packet
- Enables recipient to send a reply back to source

IP ADDRESS (IPV4)

A unique 32-bit number

Represented in dotted-decimal notation
- Each byte represented by a integer between 0 and

E.g., 12.34.158.00001100 00100010 10011110 12 34 158 ====SCALABILITY CHALLENGE====

HIERARCHICAL ADDRESSING: IP PREFIXES

An IP address is divided into network & host portions (left and right)

The network portion is often called the network prefix 11001100 00100010 10011110 Network (24 bits) Host (8 bits) 204 34 158

Host whose IP addresses share the same network portion/prefix are part of a subnetwork

IP address that share the same network prefix form a block of addresses, e.g., [204.34.158.0, 204.34.158.255] 11001100 00100010 10011110 Network (24 bits) Host (8 bits) 204 34 158

Forwarding Table Look-up

The forwarding table is a mapping from subnetworks (i.e., blocks of addresses) to outgoing links

When a destination IP address comes in, the router checks which subnetwork the IP address is in and retrieves the outgoing link.
- It does that by comparing the prefix of the destination IP address with the network prefixes that define the subnetworks in its table
- Can think of the prefix as the most significant bits of the address

LOCAL MANAGEMENT

Can locally add hosts to a subnet w/o needing any FT’s to be updated host! host! host! LAN 1!

...! host! host! host! LAN 2! ...! router! router! router! WAN! WAN! 1.2.3.4 1.2.3.7 1.2.3.156 5.6.7.8 5.6.7.9 5.6.7.1.2.3.* 5.6.7.*

Hierarchical Addressing and Forwarding

Similarity with Postal Addresses/Forwarding
- Both post offices and routers use the “most significant bits” of the hierarchical address to make their forwarding decision
  - They ignore the low order bits

E.g., a regional PO ignores the street address and focuses on city name, county, province, country."

A router ignores the host portion of the address"

Differences
- Postal addresses and forwarding have many levels of hierarchy
- IP addresses just have one
  - But there is more than one level of hierarchy in the system

WHICH PREFIX

Suppose a packet with dest IP addr comes in:
- X.Y.Z.W

Suppose there are subnets in the routing table of the form:
- X.*.*.*
- X.Y.*.*
- X.Y.Z.*

What to do?
- Try #1: Disallow this
- Add rules so that the three size subnets can’t be prefixes of one another

ADDRESS BLOCK SIZES: EARLY DAYS

Class-Based Addressing
- Class A:
  - X.*.*.* ~2^{24} = ~16 Mil addresses/block
  - X = any 8 bits starting with 0 128 Class A address blocks
- Class B:
  - X’.Y.*.* ~2^{16} = ~64K addresses/block
  - X’ = any 8 bits starting with 10 2^{14} = ~16K address blocks
- Class C:
  - X’’.Y.Z.* ~2^{8} = ~256 addresses/block
  - X’’ = any 8 bits starting with 110 2^{21} = ~2 Mil address blocks

REPRESENTATION IN THE FT

A Class A address block:
- X.0.0.0 where X is a byte with leading 0: 0*******

A Class B address block:
- X’.Y.0.0 where X’ is a byte with leading 10: 10******

A Class C address block:
- X’’.Y.Z.0 where X’’ is a byte with leading 110: 110

TABLE LOOK UP

When an IP packet comes in:
- If first bit of dest addr = o Compare first 8 bits with all the Class A prefixes in the table
- Else if first two bits are o Compare the first 16 bits with all of the Class B prefixes in the table
- Else if the first three bits are o Compare the first 24 bits with all of the Class C prefixes

“Compare with all” means find a match as efficiently as possible

PROBLEMS WITH CLASS-BASED ADDRESSING

Suppose an organization/subnet had 25K machines

A class B address would overprovision by

~40K address

Would need ~100 Class C addresses

Routing tables were getting too big

MORE FLEXIBLE ADDRESSING SCHEME

Needed a more flexible addressing scheme
- Needed more address block sizes

What about an address block for every power of two?

Instead of using a prefix free encoding in the address, use an additional number
- Use an explicit encoding of the number of bits of the network prefix
  - Which also gives an explicit encoding of the number of wildcard bits for the host portion of an IP address

CLASSLESS INTER DOMAIN ROUTING

The human readable way of specifying that the network prefix is n bits:
- a.b.c.d/n, e.g, 193.34.158.0/23 !
- The bit representation of a.b.c.d is assumed have 32-n lsb 0’s.

A router does not store the network address prefix as 193.34.158.0/23!

How does it store it?
- It stores it as a pair of 32 bit numbers
  - an IP address and a subnet mask

CLASSLESS INTER-DOMAIN ROUTING (CIDR)

IP Address : 12.4.0.0 IP Mask: 255.254.0.Address 00001100!00000100!00000000!00000000! Mask 11111111!11111110!00000000!00000000! Network Prefix for hosts Use two 32-bit numbers to represent a network. CIDR block = (Network IP address, Mask) Written as 12.4.0.0/

SUBNET MASKS

With the subnet mask the router can quickly perform a bit-wise-and of the mask and the destination IP address
- and then compare the result to the Network IP address

EXAMPLE

Suppose
- 193.34.158.0/23! Is in the forwarding table

Look up request for destination IP addr
- 193.34.158.* Router takes the bitwise-and of
- dest IP addr: 193.34.158.5 and!
- the network mask: 255.255.254.0, !
- and gets 193.34.158.0 !

Compares this with network IP address!
- In this case, it gets a match with 193.34.158.0 in the forwarding table!

CIDR ADDRESSING

Many legal CIDR address blocks could use the same Network IP address.
- In the example,
- 12.4.0.0/** 12.4.0.0/** 12.4.0.0/** …
- 12.4.0.0/* But not all prefixes sizes are legit
- 12.4.0.0/13 is not a legit CIDR block. Why?

CIDR BLOCKS

Note that a CIDR Block address really denotes a block of contiguous addresses:

E.g., 193.34.158.0/23!
- The last 32-23 = 9 bits are wildcards!
- That’s 512 IP addresses!
- [193.34.158.0 to 193.34.160.0) = !
- [ 193.34.158.0 to 193.34.159.255] !

SCALABILITY: ADDRESS AGGREGATION

Provider is given 201.10.0.0/201.10.0.0/22 201.10.4.0/24 201.10.5.0/24 201.10.6.0/Provider

The FT’s in the rest of the internet only need to have 201.10.0.0./21 in them

The provider’s FT’s will forward packets for the smaller networks on appropriately

IANA

How does a provider get a CIDR block of addresses?

Internet Assigned Number Authority
- IANA gives big blocks to Regional Internet Registries
- RIR gives CIDR blocks to providers
- Providers may give sub CIDR blocks to their customers

REGIONAL INTERNET REGISTRIES

CIDR: HIERARCHAL ADDRESS ALLOCATION

Prefixes are key to Internet scalability
- Address allocated in contiguous address blocks (prefixes)
- Routing protocols and packet forwarding based on prefixes
- Today, routing tables contain ~350,000 prefixes

PROBLEM WITH CIDR: CUSTOMERS KEEPING ADDRESS BLOCKS

PROBLEM WITH CIDR: MULTIHOMING

* Multi-homed customer with 201.10.6.0/23 has two providers. Provider 2 is a fail-over provider

Provider 2 needs to have a FT entry for 201.10.6.0/23 going to the green link"

When the yellow subnet fails over to the green provider, the FT of some upstream routers will have both 201.10.0.0/21 and 201.10.6.0/23

The IP address 201.10.6.5 will match both"

PREFIX MATCH FORWARDING

Forwarding tables in IP routers
- Maps each IP prefix to output interfaces

When a dest IP matches more than one network prefix/CIDR block
- Choose the CIDR block with the longest prefix 4.0.0.0/4.83.128.0/201.10.0.0/201.10.6.0/126.255.103.0/201.10.6.destination forwarding table Serial0/0.outgoing link

LONGEST PREFIX MATCH

Recall address prefixes/CIDR blocks are blocks of addresses

The “longest prefix match” is the smallest or most specific address block to which the dest IP belongs

Lecture 13

IP FORWARDING AND ADDRESSING

ROUTER ARCHITECTURE OVERVIEW

Two key router functions:

Run routing protocols to construct their forwarding table

Forward datagrams from incoming to outgoing links using their forwarding table

INPUT PORT FUNCTIONS

Decentralized switching

given datagram dest., lookup output port using forwarding table in input port memory

goal: complete input port processing at ‘line speed’

queuing: if datagrams arrive faster than forwarding rate into switch fabric Physical layer: bit-level reception Data link layer: e.g., Ethernet

OUTPUT PORTS

Buffering required when datagrams arrive from fabric faster than the transmission rate

Scheduling discipline chooses among queued datagrams for transmission

IP PACKET STRUCTURE

IP HEADER: VERSION, LENGTH, TOS

Version number (4 bits)
- Indicates the version of the IP protocol
- Necessary to know what other fields to expect
- Typically “4” (for IPv4), and sometimes “6” (for IPv6)

Header length (4 bits)
- Number of 32-bit/4 byte words in the header
- Typically “5” (for a 20-byte IPv4 header)
- Can be more when “IP options” are used

Type-of-Service (8 bits)
- Allow packets to be treated differently based on needs
- E.g., low delay for audio, high bandwidth for bulk transfer

IP HEADER: LENGTH, FRAGMENTS, TTL

Total length (16 bits)
- Number of bytes in the packet
- Maximum size is 63,535 bytes (216 -1)
- … though underlying links may impose harder limits

Fragmentation information (32 bits)
- Packet identifier, flags, and fragment offset
- Supports dividing a large IP packet into fragments
- … in case a link cannot handle a large IP packet

Time-To-Live (8 bits)
- Used to identify packets stuck in forwarding loops
- … and eventually discard them from the network

IP HEADER: MORE ON TIME-TO-LIVE (TTL)

Potential robustness problem
- Forwarding loops can cause packets to cycle forever
- Confusing if the packet arrives much later

Time-to-live field in packet header
- TTL field decremented by each router on the path
- Packet is discarded when TTL field reaches 0…
- …and “time exceeded” message is sent to the source

==== IP HEADER FIELDS: TRANSPORT PROTOCOL

Protocol (8 bits)
- Identifies the higher-level protocol
  - E.g., “6” for the Transmission Control Protocol

(TCP)

- - E.g., “17” for the User Datagram Protocol (UDP)
- Important for demultiplexing at receiving host
  - Indicates which protocol is to be given IP payload

IP HEADER: CHECKSUM ON THE HEADER

Checksum (16 bits)
- Sum of all 16-bit words in the IP packet header
  - Mod 2^{16}
- Receiving router discards corrupted packets
  - Sending host will retransmit the packet, if needed
- Recomputed every hop since TTL changes

+ = + = Mismatch

IP HEADER: TO AND FROM ADDRESSES

Two IP addresses
- Source IP address (32 bits)
- Destination IP address (32 bits)

Destination address
- Unique identifier for the receiving host
- Allows each node to make forwarding decisions

Source address
- Unique identifier for the sending host
- Recipient can decide whether to accept packet
- Enables recipient to send a reply back to source

IP ADDRESS (IPV4)

A unique 32-bit number

Represented in dotted-decimal notation

E.g., 12.34.158.00001100 00100010 10011110 12 34 158

SCALABILITY CHALLENGE

SCALABILITY IMPROVED

HIERARCHICAL ADDRESSING: IP PREFIXES

An IP address is divided into network & host portions (left and right)

The network portion is often called the network prefix

All the hosts with the same network portion/prefix are part of a subnetwork address block 11001100 00100010 10011110 Network (24 bits) Host (8 bits) 204 34 158

Forwarding Table Look-up

The forwarding table is a mapping from subnetworks (i.e., blocks of addresses) to outgoing links

When a destination IP address comes in, the router checks which subnetwork the IP address is in and retrieves the outgoing link.
- It does that by comparing the prefix of the IP address with the prefixes that define the subnetworks in its table
- Can think of the prefix as the most significant bits of the address

Hierarchical Addressing and Forwarding

Similarity with Postal Addresses/Forwarding
- Both post offices and routers use the “most significant bits” of the hierarchical address to make their forwarding decision
  - They ignore the low order bits

E.g., a regional PO ignores the street address and focuses on city name, county, province, country."

A router ignores the host portion of the address"

Differences
- Postal addresses and forwarding have many levels of hierarchy
- IP addresses just have one
  - But there is more than one level of hierarchy in the system— more on this later

Lecture 17 - Routing

IP Routing

Data plane: mostly about forwarding
- Uses the Forwarding Table data structure

Control plane: mostly about “routing”
- Routing is the process by which routers collectively compute forwarding tables such that:
  - Every host can reach every other host
  - There are no cycles, except perhaps for short periods
- Routing will use a data structure called a routing table
  - Much like a forwarding table but may have more info in it
  - Will depend on the routing protocol

Routing

Routing on the Internet is hierarchical
- The internet is broken up into Autonomous Systems
  - An AS is a large IP network within a single administrative domain"
  - One organizations can have several Ases"
- Not all networks are in an AS"
  - Some networks are stub IP networks"
  - They’re attached to exactly one AS"

Two main levels of routing hierarchy
- How to route between ASes"
  - Interdomain routing or Exterior Gateway Protocols"
- How to route within an AS or stub IP network"
  - Intradomain routing or Interior Gateway Protocols

There are two main techniques for routing
- Link State routing
  - Crucially uses Dijkstra’s algorithm for single source shortest path
  - Used in Intradomain routing
- Distance Vector routing
  - Used in Intradomain for small networks
  - A variant used in Interdomain

Distance Vector Protocol

In DV, node v keeps it’s current estimate of the cost of the least cost paths to every other node in the network.
- This is v’s “distance vector”
- When v’s distance vector changes, it shares it’s distance vector with its neighbors
- Conversely, when v receives updates from neighbors, it updates its distance vector and sends updates to its neighbors if necessary

Distance Vector: Initialization

graphic

Bad News Travels Slowly

graphic

Poisson Reverse

If your best route for a destination d is via a neighbor N,
- You tell N that the cost to get to d from you is infinity
- You tell all of your other neighbours the actual cost of getting to d from you

This prevents cycles of length two during transients in the DV protocol
- but it does not prevent longer cycles

Routing Information Protocol (RIP)

Distance vector protocol
- Nodes send distance vectors every 30 seconds
- … or, when an update causes a change in routing

Link costs in RIP
- All links have cost ** Valid distances of 1 through ** … with 16 representing infinity
- Small “infinity” à? smaller “counting to infinity” problem

RIP is limited to fairly small networks

Similarities of LS and DV Routing

Shortest-path routing
- Metric-based, using link weights

As such, commonly used inside an organization
- RIP and OSPF are mostly used as intradomain protocols

Differences btw LS and DV

Message complexity

LS: each link sends O(E log n) bits per round of updates

DV: each link sends O(n log n) bits per round of updates Speed of Convergence

LS: relatively fast

DV: convergence time varies
- Count-to-infinity problem

What works for IGP and EGP

table

Interdomain Routing and Autonomous Systems (ASes)

EGP/Interdomain Routing

Internet is divided into Autonomous Systems
- Distinct regions of administrative control
- Routers/links managed by a single “institution”
- Service provider, company, university, …

Hierarchy of Autonomous Systems
- Large, tier-1 provider with a nationwide backbone
- Medium-sized regional provider with smaller backbone
- Small network run by a single company or university

Interaction between Autonomous Systems
- Internal topology is not shared between ASes
- … but, neighboring ASes interact to coordinate routing

Autonomous System Numbers

AS Numbers are 16 bit values.

UBC: * U of T: * Stanford: * AT&T: 7018, 6341, 5074, …

Rogers: 812, 2491, 20453, …

Telus: 852, 24914, … Currently over 20,000 in use.

AS Number Trivia

AS number is a 16-bit quantity
- So, 65,536 unique AS numbers

Some are reserved (e.g., for private AS numbers)
- So, only 64,510 are available for public use

Managed by ICANN
- Gives blocks of 1024 to Regional Internet Registries

RIRs assign AS numbers to institutions

Recently started assigning 32-bit AS #s (2007)

AS-level Topology

- Destinations are IP prefixes (e.g., 12.0.0.0/8)
- Nodes are Autonomous Systems (ASes)
- Possibly multiple links between neighbours
  - Different types of links for different business relationships
EGP is about trying to find directed trees in the AS graph
- There may be more than one link between two Ases
- Really a tree where edges are between border routers
  - Alternate between inter-AS edges and intra-AS edges

EGP needs a directed tree for every CIDR block

EGP must scale
- Prefixes: 400,000, and growing
- ASes: 20,000+ visible ones, and 40K allocated
- Routers: at least in the millions…

EGP must allow for ASes to implement business relationships
- E.g., some paths preferred/disallowed because of business relationships

EGP must provide privacy
- Should not divulge the interior topology of the AS
- Should not divulge business relationships between ASes

There can only be one EGP
- All the ASes have to run the same protocol

What works for EGP?

Link State Distance Vector Interior Gateway Protocols OSPF IS-IS RIP Exterior Gateway Protocols ? ?

LS or DV for EGP?

Problem with LS
- Every change in AS topology in origin AS floods entire AS graph with update on all CIDR blocks originated by that AS
  - DV updates only percolate as far as they affect the shortest path
- No internet-wide notion of cost of edges between ASes
- With just hop count, not all shortest paths in AS graph are consistent with business relationships
  - Shortest path between some pairs of ASes allowed/ not allowed because of business relationships

Problem with DV
- Slow convergence
- Also based on shared notion of link weights and shortest paths

What to do?

Start with Distance Vector and modify/generalize
- Improve convergence, reduce loops
- Allow for flexible policy
  - Best path based on more factors than one number
  - To Implement business policies

Don’t have to accept announcements from all neighbours"

Don’t’ have to send announcements to all neighbours"

Can vary import and export policy for announcements per neighbour"

Generalization 1:
- Don’t base “least cost” or “best path” on a single integer
- Base it on a sequence of (four) “attributes”: (a,b,c,d)
  - Each attribute has an order relation
  - More about attributes later
- Announcements: CIDR block, sequence of attributes
- Store most recent announcement from each neighbor
- If a router gets two announcements for the same CIDR block
  - Take the one that’s better on the first attribute
  - If equal on first attr, take the one that’s better on the second
  - If equal on the second as well, take the one that’s better on the third, etc.

Generalization 2: Import/Export Policies
- Filter/modify incoming announcements:
  - Don’t have to use/accept announcements from all neighbors
  - Can even modify attributes from neighbors before comparing
  - Policies can be dependent on the peer AS or more specifically the peer router
- Always accept announcements from internal BGP neighbors
- Import Polices effect how traffic flows *out* of the AS/ border router

Generalization 2: Import/Export Policies
- Filter outgoing announcements:
  - Don’t have to send announcement to all all neighbors
  - Can send different announcements to different neigbors
  - Can modify attributes of an announcement to a neighbor
- If announcement was triggered by an announcement from an external neighbour
  - Send it to all internal BGP neighbours
- if announcement was triggered by an announcement from an internal neighbour
  - Do not send it to the internal BGP neighbors
  - Avoids routing loops inside an AS
- Export Policies effect how traffic flows into the AS/border router

Updates: 1 and 2 together

Get an update announcement for a prefix from a neighbor

Apply Import Policy
- Filter it out or modify some of the attributes

Compare with existing announcements, find best one

If “best” has changed

Apply Export Policy
- Send out properly filtered/modified announcement per neighbor

Border Gateway Protocol (BGP)

THE Interdomain Routing Protocol for the Internet

Border Gateway Protocol

1989 : BGP-1 [RFC 1105], replacement for EGP

1990 : BGP-2 [RFC 1163]

1991 : BGP-3 [RFC 1267]

1995 : BGP-4 [RFC 1771], support for CIDR

2006 : BGP-4 [RFC 4271], update

BGP Sessions

BGP is run between “border routers”

BGP is run on top of TCP

Each border router may have multiple Exterior neighbors

Each border router has a BGP session with every other border router in the AS
- These are “interior” neighbors

A BGP session starts by making announcements for all their prefixes to each other
- Thereafter, incremental updates are sent periodically

BGP Operations

Establish session on TCP port Exchange all active routes Exchange incremental updates ASASWhile connection is ALIVE exchange route UPDATE messages BGP session

BGP Attributes

BGP Attributes used for computing “best” announcement

Local Preference: bigger is better

AS Path: shorter is better

Multi-exit Discriminator

IGP metric: smaller is better Other Attributes for Routing and Filtering:

Next Hop IP

Community ID

….

BGP Attributes used for computing “best” announcement

Local Preference: bigger is better

AS Path: shorter is better

Multi-exit Discriminator

IGP metric: smaller is better Other Attributes for Routing and Filtering:

Next Hop IP

Community ID

Local Pref: Multihomed Backups

graphic Forces outbound traffic to take primary link, unless link is down. AS primary link backup link

AS Path

AS Path:
- The sequence of AS numbers from the “origin” AS of a prefix
- Most recent is on the left, origin is rightmost
- E.g., 14.25.19.0/24: AS_Path(54, 20, 10)

Allows protocol to filter out loops
- E.g., if router in AS 20 got the above announcement from a router in AS 54, it would throw the announcement out
- Improves convergence of protocol

Output modification:
- When a router chooses a route, it prepends its AS number to the AS_Path attribute before announcing it to its external neighbors

Other than input and output filtering, most important factor for external BGP

ASPATH Attribute

AS128.112.0.0/AS Path = AS Sprint AS Ebone AT&T AS Global Crossing 128.112.0.0/AS Path = 7018 128.112.0.0/AS Path = 3549 7018 AS 128.112.0.0/Princeton Prefix Originated AS RIPE NCC RIS project AS Global Access 128.112.0.0/AS Path = 7018 128.112.0.0/AS Path = 1239 7018 128.112.0.0/AS Path = 1129 1755 1239 7018 128.112.0.0/AS Path = 1755 1239 7018 ==== Next Hop IP ====

When two ASes have multiple connections, need more than just knowing what the next hop AS is

Also, need an IP address in order to stitch together the IGP info and the BGP info
- Next Hop IP stays the same for announcements to internal BGP neighbors
- More on this next lecture

Next Hop IP Attribute

Primary AS graph attributes
- AS path (e.g., “7018 88”)
- Next-hop IP address (e.g., 12.127.0.121)

Both stay unmodified for internal announcements AS Princeton 128.112.0.0/AS path = Next Hop = 192.0.2.AS AT&T AS Yale 192.0.2.128.112.0.0/AS path = 7018 Next Hop = 12.127.0.12.127.0.121

Lecture 19 - Routing

Routing

Routing on the Internet is hierarchical
- The internet is broken up into Autonomous Systems
  - An AS is a large IP network within a single administrative domain"
  - One organizations can have several Ases"
- Not all networks are in an AS"
  - Some networks are stub IP networks"
  - They’re attached to exactly one AS"

Two main levels of routing hierarchy
- How to route between ASes"
  - Interdomain routing or Exterior Gateway Protocols"
- How to route within an AS or stub IP network"
  - Intradomain routing or Interior Gateway Protocols

There are two main techniques for routing
- Link State routing
  - Crucially uses Dijkstra’s algorithm for single source shortest path
  - Used in Intradomain routing
- Distance Vector routing
  - Used in Intradomain for small networks
  - A variant used in Interdomain

What works for IGP and EGP

Link State Distance Vector Interior Gateway Protocols OSPF IS-IS RIP Exterior Gateway Protocols ? ? Interdomain Routing and Autonomous Systems (ASes)!

Interdomain Routing and Autonomous Systems (Ases)

AS-level Topology

- Destinations are IP prefixes (e.g., 12.0.0.0/8)
- Nodes are Autonomous Systems (ASes)
- Possibly multiple links between neighbours
  - Different types of links for different business relationships 7 Server Farm Network Client UBC

Requirements for Interdomain Routing

EGP is about trying to find directed trees in the AS graph
- There may be more than one link between two Ases
- Really a tree where edges are between border routers
  - Alternate between inter-AS edges and intra-AS edges

EGP needs a directed tree for every CIDR block

EGP must scale
- Prefixes: 400,000, and growing
- ASes: 20,000+ visible ones, and 40K allocated
- Routers: at least in the millions… UBC CS 317 2013WRequirements for Interdomain Routing?!

EGP must allow for ASes to implement business relationships
- E.g., some paths preferred/disallowed because of business relationships

EGP must provide privacy
- Should not divulge the interior topology of the AS
- Should not divulge business relationships between ASes

There can only be one EGP
- All the ASes have to run the same protocol

Border Gateway Protocol (BGP)

THE Interdomain Routing Protocol for the Internet!

What to do?

Start with Distance Vector and modify/generalize
- Improve convergence, reduce loops
- Allow for flexible policy
  - Best path based on more factors than one number
  - To Implement business policies

Don’t have to accept announcements from all neighbours"

Don’t’ have to send announcements to all neighbours"

Can vary import and export policy for announcements per neighbour"

Generalization 1:
- Don’t base “least cost” or “best path” on a single integer
- Base it on a sequence of (four) “attributes”: (a,b,c,d)
  - Each attribute has an order relation
  - More about attributes later
- Announcements: CIDR block, sequence of attributes
- Store most recent announcement from each neighbor
- If a router gets two announcements for the same CIDR block
  - Take the one that’s better on the first attribute
  - If equal on first attr, take the one that’s better on the second
  - If equal on the second as well, take the one that’s better on the third, etc.
Generalization 2: Import/Export Policies
- Filter/modify incoming announcements:
  - Don’t have to use/accept announcements from all neighbors
  - Can even modify attributes from neighbors before comparing
  - Policies can be dependent on the peer AS or more specifically the peer router
- Always accept announcements from internal BGP neighbors
- Import Polices effect how traffic flows *out* of the AS/ border router

Generalization 2: Import/Export Policies
- Filter outgoing announcements:
  - Don’t have to send announcement to all all neighbors
  - Can send different announcements to different neigbors
  - Can modify attributes of an announcement to a neighbor
- If announcement was triggered by an announcement from an external neighbour
  - Send it to all internal BGP neighbours
- if announcement was triggered by an announcement from an internal neighbour
  - Do not send it to the internal BGP neighbors
  - Avoids routing loops inside an AS
- Export Policies effect how traffic flows into the AS/border router

Updates: 1 and 2 together

Get an update announcement for a prefix from a neighbor

Apply Import Policy
- Filter it out or modify some of the attributes

Compare with existing announcements, find best one

If “best” has changed

Apply Export Policy
- Send out properly filtered/modified announcement per neighbor

BGP Operations

Establish session on TCP port Exchange all active routes Exchange incremental updates ASASWhile connection is ALIVE exchange route UPDATE messages BGP session

BGP Attributes

BGP Attributes used for computing “best” announcement

Local Preference: bigger is better

AS Path: shorter is better

Multi-exit Discriminator

IGP metric: smaller is better Other Attributes for Routing and Filtering:

Next Hop IP

Community ID

….

BGP Attributes used for computing “best” announcement

Local Preference: bigger is better

AS Path: shorter is better

Multi-exit Discriminator

IGP metric: smaller is better Other Attributes for Routing and Filtering:

Next Hop IP

Community ID

….

BGP Attributes used for computing “best” announcement

Local Preference: bigger is better

AS Path: shorter is better

Multi-exit Discriminator

IGP metric: smaller is better Other Attributes for Routing and Filtering:

Next Hop IP

Community ID

….

Next Hop IP

When two ASes have multiple connections
- need more than just knowing what the next hop AS is

Need an IP address in order to stitch together the IGP info and the BGP info
- Next Hop IP stays the same for announcements to internal BGP neighbors

This is the Next Hop IP in the BGP routing table not the Next Hop IP in the forwarding table

Next Hop IP Attribute

Primary AS graph attributes
- AS path (e.g., “7018 88”)
- Next-hop IP address (e.g., 12.127.0.121)

Both stay unmodified for internal announcements AS Princeton 128.112.0.0/AS path = Next Hop = 192.0.2.AS AT&T AS Yale 192.0.2.128.112.0.0/AS path = 7018 Next Hop = 12.127.0.12.127.0.==== Stitching IGP and BGP ====

Distribute BGP Routing Info:
- Each border router’s “best route” (i.e., sequence of attributes) for each of its CIDR blocks is distributed to all the routers in an AS
- Every router decides on its own “best route” from the ones it hears
  - Every router decides on a (BGP) Next Hop IP for each prefix in the ASes border routers BGP routing tables

IGP Routing Info:
- Every router will have a Next Hop IP for each subnet

“owned” by the AS, including subnets that border routers use to talk to their external neighbors

Goal: for a given router and for each prefix in the BGP routing table,
- find the proper Next Hop IP for the forwarding table for that router

Step 1:
- For a prefix A/n from the BGP routing table, find its

(BGP) Next Hop IP—call this B

Step 2:
- Find the longest prefix match for B among all the prefixes in the IGP routing table
- Call this C/m
Step 3:
- Find the Next Hop IP for C/m in the IGP routing table— call this D

Step 4:
- Install the prefix A/n with Next Hop IP D into the router’s forwarding table

Business Relationships and Export Policies

Example Topology

Business Relationships

Customer-Provider
- The provider is “higher up” in the AS hierarchy
- The provider provides access to the parts of the Internet the customer can’t access itself
- The customer pays the provider for this access
  - The cost of this access is often a function of the bandwidth and other service characteristics

Peering Relationship
- AS A and B may agree to exchange traffic as follows:
- Traffic from A and A’s customers to B or B’s customers can go across the peering link from A to B
- Traffic from B and B’s customers to A or A’s customers can go across the peering link from B to A
- Often no money is exchanged between peers as long as the amount of traffic flowing in each direction is approximately the same
- If an AS peers with one or more other ASes, it can reduce the amount of traffic coming from or going to its provider
- This allows it to pay less to its provider—this is the whole point! ==== Business Relationships ====

Tier 1 providers are required to peer with every other tier provider
- In order to handle all of the tier 1 peering traffic, each pair of tier providers have many high speed peering links and routers between them
Peers do not transit traffic
- Suppose A and B are peers, and B and C are peers
- B will not pass traffic from A through to C
  - B reasons that A and C are trying to use B’s infrastructure to reduce their own costs of having to use their providers
  - This is extra traffic through B’s network that is not to or from B or its customers—B is not benefiting from this traffic
- If A and C want to exchange traffic w/o using their providers they must establish their own peering link(s) and peering agreement

The “Peering” Relationship

peer peer provider customer Peers provide transit between their respective customers Peers do not provide transit between peers traffic Peers (often) do not exchange $$$ allowed traffic NOT allowed

Export Policies

What does this mean for how BGP announcements are made?

When an AS gets an announcement for an address block from a provider,
- it makes an announcement for that address block to all of its customers (with the AS PATH properly updated)
- it does *not* announce this block to its peers UBC CS 317 2013W1 provider customer Announcement from the provider Traffic to the provider peer

Announcement distributed down to customers

Announcement not distributed to peers

When an AS gets an announcement for an address block from a customer,
- it makes an announcement for that address block to all of its other customers (with the AS PATH properly updated)
- It makes an announcement to all of its peers
- It makes an announcement to its provider

customer provider Traffic to the customer announcements traffic Announcement from the customer

When an AS gets an announcement for an address block from a peer,
- it makes an announcement for that address block to all of its customers (with the AS PATH properly updated)
- It does *not* makes an announcement to its other peers
- It does *not* makes an announcement to its provider UBC CS 317 2013W1 peer peer Traffic to/from the peer and its customers d announcements traffic

Conclusions

BGP is solving a hard problem
- Routing protocol operating at a global scale
- With tens of thousands of independent networks
- That each have their own policy goals
- And all want fast convergence

Key features of BGP
- Multiple attributes for comparing announcements for CIDR blocks
- Incremental updates (announcements and withdrawals)
- Policies applied at import and export of routes

Transport Protocols

Where are we?

Packet Routing vs Circuit Switching Big Pros:

- Stat Muxing: more efficient use of individual links, better match for many/most computer communications
- Core just focuses on forwarding packets efficiently and little else
- Computation of paths decoupled from decision for accepting data into the network
  - Call admission in circuit switching vs accept every packet and provide it with best effort service

Big Table

Big Cons

Need per packet endhost addressing

Need multiplexing within the endhost

Packet Corruption/Drops

Out of order delivery

No Flow initiation

No Flow Agreement between end hosts

No network wide policing of traffic entering the network

Variable Delay/ Throughput UBC CS 317 2013WAdding services back in to overcome the cons ! Big Cons Possible Services

Need per packet endhost addressing

Need multiplexing within the endhost

Packet Corruption/Drops

Out of order delivery

No Flow initiation

No Flow Agreement between end hosts

No network wide policing of traffic entering the network

Variable Delay/ Throughput

Network wide IP addressing

Multiplexing Service

Reliable Packet Delivery

In Order Delivery

Flow Initiation

Flow Control

Congestion Control

??

Introduced Layered Architecture

Application Layer Providing end-to-end applications Transport Layer Providing end-to-end network services to applications Multiplexing among applications IP Layer Forwarding datagrams end-to-end over the network Multiplexing among transport layers Link Layer Moving frames over a physical layer Multiplexing among network layers Physical Layer Moving bits over physical channel ==== Big Table ==== Big Cons Possible Services Where?

Need per packet endhost addressing

Need multiplexing within the endhost

Packet Corruption/Drops

Out of order delivery

No Flow initiation

No Flow Agreement between end hosts

No network wide policing of traffic entering the network

Variable Delay/ Throughput

Network wide IP addressing

Multiplexing Service

Reliable Packet Delivery

In Order Delivery

Flow Initiation

Flow Control

Congestion Control

??

IP Layer

Transport

Link/Transport

Transport

Transport

Transport

Transport

Application Layer Transport Layer—This is where we are IP Layer--Covered Link Layer--Covered Physical Layer—Electrical Engineering

Transport Protocols

Provide logical communication between application processes running on different hosts

Run on end hosts
- Sender: breaks application messages into segments, and passes to transport layer
- Receiver: reassembles segments into messages, passes to application layer

Multiple transport protocols available to applications
- Internet: TCP and UDP application transport network data link physica l application transport network data link physica l network data link physica l network data link physica l network data link physica l network data link physica l network data link physica l Modified from Rexford

Multiplexing!

Recall Postal Mail

Canada Post delivers parcels building to building
- using the building address

The building “mail room” delivers parcels within the building
- using room # or person’s name on envelope

IP: end-host to end-host delivery
- using IP address

Transport: within end-host delivery
- using “port” name

Delivery within the endhost:
- Deliver data to running applications
- OS gives apps process IDs
  - These are dynamic, not stable
- Can’t expect apps on remote endhosts to know the PID
- Need something like a stable name
  - This is called a “port number”
- Need mapping between port number and PID

Interface

“The protocol stack”
- Link/IP/Transport
- Runs as part of the operating system—”the kernel”

Set of interfaces between OS and applications called system calls

Special system calls between applications that want to use the network and the kernel
- called “sockets”
- sockets will do bookkeeping, e.g., mapping between port numbers and PIDs
- Sockets manage interface between an app and a transport layer protocol, e.g., signaling about when data has arrived, more data can be pushed out, etc.

Demultiplexing: port numbers

Web server! (port 80)! Client host! Server host 128.2.194.242! Echo server! (port 7)! Service request for! 128.2.194.242:80! (i.e., the Web server)! Client! OS! Modified from Rexford

Port Numbers

Where do port numbers come from?

The Internet Engineering Task Force—IETF—maintains a list of “well known” ports
- port 25 for mail
- port 80 for http

Can be manually configured/privately agreed to

Can be negotiated in a “control” protocol

Can be embedded in a link

Can be retrieved from a naming service (that itself has a well known port #)
- What port number is the server a.b.c.d using for mail?

User Datagram Protocol (UDP)

Datagram messaging service
- Demultiplexing of messages: port numbers
- Detecting corrupted messages: checksum
- That’s it!

Lightweight communication between processes
- Send messages to and receive them from a socket
- Avoid overhead and delays of ordered, reliable delivery SRC port DST port checksum length DATA Modified from Rexford

Why Would Anyone Use UDP?

Fine control over what data is sent and when
- As soon as an application process writes into the socket
- … UDP will package the data and send the packet

No delay for connection establishment
- UDP just blasts away without any formal preliminaries
- … which avoids introducing any unnecessary delays

No connection state
- No allocation of buffers, parameters, sequence #s, etc.
- … making it easier to handle many active clients at once

Small packet header overhead
- UDP header is only eight bytes long Modified from Rexford

Popular Applications That Use UDP

Multimedia streaming
- Retransmitting lost/corrupted packets is not worthwhile
- By the time the packet is retransmitted, it’s too late
- E.g., telephone calls, video conferencing, gaming

Simple query protocols like Domain Name System
- Overhead of connection establishment is overkill
- Easier to have the application retransmit if needed

“Address for www.cnn.com?” “12.3.4.15”

Transmission Control Protocol (TCP)

Stream-of-bytes service
- Sends and receives a stream of bytes, not messages
- Sequence numbers to reorder data

Flow initiation
- Explicit set-up and tear-down of TCP session

In-order delivery
- Need sequence numbers

Reliability
- Checksums to detect corrupted data
- New problem: dropped packets
- Acks, time-outs and retransmissions using sequence numbers
Flow control
- Send as fast as the receiver can receive but not faster
- Need to have lots of unacknowledged packets in flight to accomplish this
- Need sequence numbers to do this—but we have them already

Congestion control
- Adapt to network congestion for the greater good

Reliable Delivery When Packets Can Be Dropped

Simple ARQ revisited

Time Packet ACK Timeout

Possibility of packet loss requires sender to keep a timer
- Timeout a lot like a NACK
- When packet or ack is lost, timer will be tripped
- When timer is tripped, packet/ ack not necessarily lost, could just be slow Sender Receiver

Reasons for Retransmission

Packet Timeout Packet ACK Timeout Packet Timeout Packet ACK Timeout Packet ACK Timeout Packet ACK Timeout ACK lost DUPLICATE PACKET Packet lost Early timeout DUPLICATE PACKETS Modified from Rexford

Protocol with drops

Is counting a timeout as a nack in the old protocol enough?

No, as per the example in class,
- a delayed ack can lead to an ambiguity on the sender’s side

The solution is to label the acks
- The acks carry a sequence number

It turns out that once you have labeled acks, you don’t need nacks for corrupted packets

We won’t explicitly define the stop and wait protocol for this
- One such protocol is in the textbook

Stop and Wait is Reliable but…

Stop-and-wait is inefficient
- Only one TCP segment is “in flight” at a time
- Especially bad when delay-bandwidth product is high

Numerical example
- 1KB packet
- 10 Mbps link (rate R) with a 80 msec propagation time

(round trip time/2) Modified from Rexford

Numerical example
- 1KB packet
- 10 Mbps link (rate R) with a 80 msec propagation time (round trip time/2)
- Number of possible bits “in flight” on the link at any time
  - 10 Mbps x 80 msec = 800 Kb
  - This is the Delay-bandwidth product = R x RTT/** Number of bits actually in flight
  - 8Kb
- Utilization: 1/** Even worse since half the time (during the ack) no bits are on the line from sender to receiver Modified from Rexford

Throughput of stop and wait

Parameters:
- Packets of length L bits
- Rate of the link of R bits per sec
- Round trip time of RTT
- Assume ack packet has no transmission delay

Time between the start of sending a packet and the start of sending the next packet:
- Transmission delay of packet + RTT = (L/R) + RTT

Data sent in that time period:
- L

Effective Rate: L / ((L/R) + RTT)
Effective Rate: L / ((L/R) + RTT)

= R / (1 + B )

Where B is the ratio of the possible bits on the wire in both directions to the length of the packet B = RTT x R / L

When RTT x R >> L:
- B is big, and
- the effective rate is much smaller than the link rate

Reliability + Performance

Reliability is an important goal in computing systems, but…

Performance is an equally important goal

The challenge in systems is to achieve a high degree of both at the same time

Two performance tracks:
- Find a “good” time-out value
- “Pipeline” packets

Time out value

Too small
- Wasted retransmissions that use up network resources

Too big
- Could be waiting for too long to retransmit
- Lower throughput

Try to achieve adaptive window size
- by measuring the (recent) roundtrip time on packets and acknowledgements
- And taking some kind of running average
  - This is what TCP does
  - Different implementations do it slightly differently

How Long Should Sender Wait?

TCP sets timeout as a function of the RTT
- Expect ACK to arrive after an RTT “round-trip time”
- … plus a fudge factor to account for queuing

But, how does the sender know the RTT?
- Sample the RTT by watching the ACKs and
- Take a “running” average
  - NewAveRTT = a * NewSampleRTT + (1 –a ) *OldAveRTT
- Compute timeout: TimeOut = 2 * EstimatedRTT Modified from Rexford

Example RTT Estimation

RTT: gaia.cs.umass.edu to fantasia.eurecom.fr 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 time (seconnds) RTT (milliseconds) SampleRTT Estimated RTT

The most important technique for improving performance for a reliable delivery protocol is pipelining
- The protocols that do this are called sliding window protocols
  - We’ll need packets to have sequence numbers

Sliding Window—Sender side

Basic Idea: Allow sender to always have up to W packets that have been sent but not yet acknowledged
- Say W is 10. Idea is that S can send out packets 1 through 10 w/o waiting for any acknowledgements to arrive
- When the ack for packet 1 arrives it can send out packet ** When the ack for pckt 2 arrives, the sender can send packet 12, etc.
- If L is the number of the last acknowledged packet than the sender can send out up to packet number L + W
- The window of consecutive packet numbers that S can have outstanding “slides” forward as the acknowledgements arrive

This seems like just an additive gain
- Maybe the sender finishes about W units early
- It’s actually a multiplicative gain 73

Lecture 20/21 - Transport Protocols Continued

Transport Protocols

Big Table

table

5 Layer Reference Model

model

User Datagram Protocol (UDP)

Datagram messaging service
- Demultiplexing of messages: port numbers
- Detecting corrupted messages: checksum
- That’s it!

Lightweight communication between processes
- Send messages to and receive them from a socket
- Avoid overhead and delays of ordered, reliable delivery SRC port DST port checksum length DATA Modified from Rexford 5

Transmission Control Protocol (TCP)

Stream-of-bytes service
- Sends and receives a stream of bytes, not messages
- Sequence numbers to reorder data

Reliability
- Checksums to detect corrupted data
- New problem: dropped packets
- Acks, time-outs and retransmissions using sequence numbers

Flow initiation
- Explicit set-up and tear-down of TCP session

In-order delivery
- Need sequence numbers

Flow control
- Send as fast as the receiver can receive but not faster
- Need to have lots of unacknowledged packets in flight to accomplish this
- Need sequence numbers to do this—but we have them already

Congestion control
- Adapt to network congestion for the greater good

Stop and Wait is Reliable but…

Stop-and-wait is inefficient
- Only one TCP segment is “in flight” at a time
- Especially bad when delay-bandwidth product is high Modified from Rexford

Reliability + Performance

The most important technique for improving performance for a reliable delivery protocol is pipelining
- The protocols that do this are called sliding window protocols
- We’ll need packets to have sequence numbers

Sliding Window-Performance

Basic Idea: Allow sender to always have up to W packets that have been sent but not yet acknowledged

Main point: The bigger the window, the higher the maximum throughput
- (up to a point)

Sliding Window-Reliability

Ok, great. We know how to get some performance, but what about reliability?

TCP is mostly based on Go Back N
- But unlike Go Back N, TCP
  - does not resend the window as soon as it gets one duplicate ack
  - reduces the size of its window before retransmitting its window
  - Is byte oriented rather than packet oriented
- We’ll call the packet based version of TCP: Resend Window (RW)

For TCP/RW, if the sender
- Receives a triple dup ack, or
- The window times out

The sender assumes a packet has been dropped and
- It reduces the size of the window
- then retransmits all of the current data in the window

Rule for RW Receiver, Simple version

Whenever you get a good packet, always acknowledge the largest consecutive packet that you have received
- Example: Receiver has received and ack’ed packets with seq nums:
  - 1,2,3,4
- Receives packets with seq nums 6 and 7
  - It acks 4 twice (which is the third ack sent for 4)
  - It now has packets
  - 1,2,3,4,x,6,7
- Now it receives packet 5
  - It acks packet 7

Rules for Sender, Simple version

Keep track of the largest ack that you’ve seen
- You can assume that R has received all of the packets up through and including that packet number
  - You may get acks smaller than your current max but you can ignore them

Keep track of how many times you see duplicate acks for the current max
- This indicates dropped or out of order packets

Keep a count down timer for the window
- Start the timer for a “new” window
  - When the start of the window advances, or
  - When a retransmission of the window begins

When you get an ack that is the new max:
- Slide your window forward using the current window size
- You may be able to accept more packets from the layer above or immediately send buffered packets
  - You can have packets new_max + 1 through new_max + W sent but unacknowledged
- Restart the countdown timer

If you get three duplicate acks for the current max
- You do assume a packet has been dropped
- (You reduce your window a la congestion control)
- You retransmit the packets in the current window
  - This is called “fast retransmit” since you don’t wait for the timer to time out

If the counter timer expires,
- You assume that a packet has been lost
- (Reduce your window a la congestion control)
- Resend the packets in the window

TCP Services

In-order delivery, stream-of-bytes service
- Sends and receives a stream of bytes, not messages
- Sequence numbers to reorder data

Flow initiation
- Explicit set-up and tear-down of TCP session

Reliability
- Checksums to detect corrupted data
- New problem: dropped packets
- Acks, time-outs and retransmissions using sequence numbers

Congestion control
- Adapt to network congestion for the greater good

Flow control
- Send as fast as the receiver can receive but not faster
- Need to have lots of unacknowledged packets in flight to accomplish this
- Need sequence numbers to do this—but we have them already

TCP Segment

IP packet
- No bigger than Maximum Transmission Unit (MTU)
- E.g., up to 1500 bytes on an Ethernet

TCP packet
- IP packet with a TCP header and data inside
- TCP header is typically 20 bytes long

TCP segment
- No more than Maximum Segment Size (MSS) bytes
- E.g., up to 1460 consecutive bytes from the stream IP Hdr IP Data TCP Data (segment) TCP Hdr

TCP Acknowledgments

picture

Flow Control and Congestion Control

Sender Window

The sender keeps track of two values that will change based on various events
- The flow control window, also called the receive window, rwnd
- The congestion control window, cwnd

Actual window size is the minimum of these two values

Whenever either of the values changes, S recalculates the min

Recall that window size and throughput are roughly proportional 20 Flow Control

Assume the Receiver can keep B in-order bytes buffered

TCP puts the newest (largest byte numbers) bytes into the buffer in order

An application takes the oldest (smallest byte numbers) bytes out of the buffer

TCP Buffers

If the application is consuming bytes at a slower rate than the sender is sending,
- then the Receiver’s buffer will overflow

The receiver needs to get the sender to slow down

If R tells S to reduce its window, than S will slow down!

Flow Control

Even more pragmatically, the Receiver only has so much room left it’s buffer
- It has to tell the sender that it can’t send any more bytes than the # of bytes left in its buffer

RWND on the Reciever

Flow Control-Receiver side

R keeps track of the max consecutive byte
- It already does this
- This is the packet it sends its cumulative ack for

R keeps track of the min consecutive byte in the buffer
- This is updated every time the app pulls data out

R keeps track of the room left in the buffer
- = B ** (max ** min + 1)
- This is the current rwnd

When R sends an ack it includes the current rwnd in the ack 26

TCP Header

picture

Flow Control-Sender Side

When S gets an ack, it uses the value of rwnd in the ack to update its local value of rwnd
- And then it recalculates its window value
  - (the new rwnd may or may not update the window depending of the relative value of cwnd)

Flow Control

Rwnd goes up when an app pulls two or more packets worth of data out between consecutive packets coming into R

Rwnd goes down when two or more consecutive packets come into R w/o the app pulling a packets worth of data out of the buffer

Congestion Control

Congestion

Congestion on a link: when the incoming rate to the link is greater than the outgoing rate of the link for a sustained enough period of time to cause:
- the buffers to grow large
  - Which causes increased delay
- or the buffers to fill
  - Which causes packets to be dropped

Congestion in a network: when one or more links in the network is experiencing congestion

Congestion collapse

In a circuit switched network you’ll never get congestion!
- You don’t admit traffic into the network unless you are guaranteed you can handle it

In a packet routed network packet drops waists lots of resources
- If a packet survives many hops but gets dropped later, it used up valuable space earlier

Reliable delivery mechanisms in a packet routed network create a positive feedback loop
- One drop causes many packets to be retransmitted which may increase congestion enough to cause more packets to be dropped, etc.
- This is called “congestive collapse”

Collective back off

To fight the positive feedback effect, the senders must collectively and cooperatively back off their rate
- But we don’t want to back off too much otherwise we won’t be using the capacity of the network
- This is a lot like the situation with collisions in Ethernet
  - A fixed small retransmission window might create continuous collisions
  - A fixed large retransmission window might be inefficient
  - Ethernet used a mechanism that tried to find the right window size dynamically

We’ll need to do something dynamic here too

Congestion feedback

Need congestion feedback
- Hosts can’t back off their sending rate w/o a reliable signal that there is congestion on a link they’re using

Feedback from the network?
- The routers could give them explicit feedback when a link buffer is getting full
- Either by sending an explicit control message back to the sender or piggybacking some info on the packet and letting the receiver relay something back

But IP had the “end to end” philosophy
- keep the routers simple, they just do forwarding

Congestion feedback

Using RTT at the Sender
- A sender could keep track of RTT and as it gets bigger it could back off
- There have been some proposals to do this but they are not widely adopted

Using packet drops at the Sender
- If sender knew that one or more packets to the receiver had been dropped, it could use that as a solid indication of congestion on one of the links
- That’s conservative because the drop may have been due to a very transient spike in usage of the link
  - But conservative is good in this case!

The sender makes an assumption that a packet has been dropped if
- the timer times out or
- there’s a triple duplicate ack

That’s how the sender “knows” that a packet has been dropped—we’ll use that for our indication of congestion

Once a sender assumes there’s congestion because of a time out or duplicate acks, how does it back off?

By decreasing the sliding window, of course!

Congestion Control—Sender Side

In TCP, all congestion control happens at the sender side

If S detects possible congestion
- with a time out or triple duplicate ack,

It reduces its send rate
- by reducing its window size
  - Actually it reduces cwnd and then finds min of cwnd and rwnd

Congestion control tied to reliability/efficiency sliding window protocol
- This is natural since window/send rate, packet drops and congestion are interrelated

Hunting for the right window size

Once the sender’s sliding window is cut down, does it stay down?
- No
- It can’t be that the only direction for adjusting the window is down
- We’d never use resources efficiently that way

Basic idea:
- Increase window gradually when there’s no evidence of congestion
- Decrease window quickly when there is evidence
- This “hunts” for an appropriate rate and adapts to changing traffic conditions

Congestion Control--Details

Triple duplicate ack vs time out
- How we decrease and then subsequently increase is different depending on whether we get a triple duplicate ack or a time out

Triple Duplicate Ack and AIMD

A triple duplicate ack means that some packets are getting through so the congestion is not horrible
- No need to reduce cwnd too much

After a triple duplicate ack,
- cut cwnd in half, and retransmit the window
  - This is called “fast retransmit” since you aren’t waiting for a timeout to retransmit

Then increase window size by ~one MSS every RTT until a congestion event
- This is called Additive Increase, Multiplicative Decrease

Leads to the TCP “Sawtooth”

picture

AIMD

Fairness
- Remember fairness as in one of the main three things we want when we’re sharing a resource:
- Performance, reliability, fairness

AIMD is reasonably fair
- At steady state, AIMD will bounce around a fair rate
- Argument in book

AIMD is adaptive/dynamic
- In addition, AIMD will adapt the value that it is hunting around as traffic increases or decreases over the links that a TCP session is using

How should a flow start?

Let’s interrupt the discussion of congestion for a second

Additive increase begs the question

How should a flow start?
- Need to increase slowly when we suspect we’re getting close to congestion
- But when we start the flow, we are far from congestion
- And if we just increase slowly from the start—we’d all get impatient!!

How Should a New Flow Start

picture

“Slow Start” Phase

Start with a small congestion window
- Initially, CWND is 1 Max Segment Size (MSS)
- So, initial sending rate is ~MSS/RTT

That could be pretty wasteful
- Might be much less than the actual bandwidth
- Linear increase takes a long time to accelerate

Slow-start phase (really “fast start”)
- Sender starts at a slow rate (hence the name)
- … but increases the rate exponentially
- … until the first loss event

Slow Start in Action

picture

Slow Start and the TCP Sawtooth

picture

Congestion Control--Details

Triple duplicate ack vs time out
- How we decrease and then subsequently increase is different depending on whether we get a triple duplicate ack or a time out

Time out

After a time out, cut cwnd down to 1 MSS
- Treat time out as an indication of serious congestion
- Increase cwnd exponentially for a while
  - Increasing additively from 1 MSS may take too long to get close to the right send rate
- Then increase additively

“for a while” means until you get to half the cwnd in effect at the time out

Repeating Slow Start After Timeout

picture

Congestion Control Recap

Congestion control in TCP is all at the sender side

Indications of congestion
- Triple duplicate acks: some congestion
- Timeout: significant congestion

Reduce cwnd substantially (by ½) after indication of some congestion (triple dup ack)
- to substantially reduce sending rate and help the network quickly move out of congestion

Reduce cwnd to starting value (1 MSS) after indication of significant congestion (window timeout)
- Need to back off sending rate drastically if there is significant congestion in the network

Increase cwnd as long as there is no indication of congestion
- Traffic should make effective use of network resources
- Increase cwnd quickly (exponentially) when it’s very small
- Increase cwnd gradually (linearly) when it might be near a congestion point, based on history

The mechanisms for increasing and decreasing cwnd
- Allow the sender to hunt for a reasonable sending rate based on feedback from the network
- Allow the senders rate to adapt to changing network conditions

Lecture 23 - Part 1

Names, Addresses, and Bootstrapping

Three Kinds of host/interface Identifiers

Host name (e.g., www.cnn.com)
- Mnemonic name appreciated by humans
- Provides little (if any) information about location
- Hierarchical, variable # of alpha-numeric characters
- Application layer

IP address (e.g., 64.236.16.20)
- Numerical address appreciated by routers
- Related to host’s current location in the topology
- Hierarchical name space of 32 bits
- IP Layer

MAC address (e.g., 00-15-C5-49-04-A9)
- Numerical address appreciated within local area network
- Unique, hard-coded in the adapter when it is built
- Flat name space of 48 bits
- Link Layer

Separating Names and IP Addresses

Names are easier (for us) to remember
- www.cnn.com vs. 64.236.16.20

Referencing/indirection is good--IP addresses can change underneath
- Move www.cnn.com to 173.15.201.39
  - renumbering when changing providers

Name could map to multiple IP addresses
- www.cnn.com to multiple replicas of the Web site

Map to different addresses in different places
- Address of a nearby copy of the Web site
  - to reduce latency, or return different content

Multiple names for the same address
- - aliases like ee.mit.edu and cs.mit.edu

Separating IP and MAC Addresses

LANs are designed for arbitrary network protocols
- Not just for IP (e.g., IPX, Appletalk, X.25, …)
  - Though now IP is the main game in town
- Different LANs may have different addressing schemes
  - Though now Ethernet address is the main game in town

A host may move to a new location
- So, cannot simply assign a static IP address
  - Since IP addresses depend on host’s position in topology
- Instead, must reconfigure the adapter
  - To assign it an IP address based on its current location

Must identify the adapter during bootstrap process
- Need to talk to the adapter to assign it an IP address

Mapping Between Identifiers

Domain Name System (DNS)
- Given a host name, provide the IP address
- Given an IP address, provide the host name

Dynamic Host Configuration Protocol (DHCP)
- At boot, host only has a MAC address
- To automate the boot-strapping process assign a unique IP address and tell host other stuff about the Local Area Network

Address Resolution Protocol (ARP)
- Given an IP address, provide the MAC address
- To enable communication within the Local Area Network

Domain Name System (DNS)

Proposed in 1983 by Paul Mockapetris

Outline

Computer science concepts underlying DNS
- Indirection: names in place of addresses
- Hierarchy: in names, addresses, and servers
- Caching: of mappings from names to/from addresses

DNS software components
- DNS resolvers
- DNS servers

DNS queries
- Iterative queries
- Recursive queries

DNS caching based on time-to-live (TTL)

Strawman Solution #1: Local File

Original name to address mapping
- One file contained the mapping from names to IP addresses
- SRI(?) kept main copy
- Downloaded regularly
- /etc/hosts

Count of hosts was increasing: moving from a machine per domain to machine per user
- Many more downloads
- Many more updates

Strawman Solution #2: Central Server

Central server
- One place where all mappings are stored
- All queries go to the central server

Many practical problems
- Single point of failure
- High traffic volume
- Distant centralized database
- Single point of update
- Does not scale

Need a distributed, hierarchical collection of servers

Domain Name System (DNS)

Properties of DNS
- Hierarchical name space divided into zones
- Distributed over a collection of DNS servers

Hierarchy of DNS servers
- Root servers
- Top-level domain (TLD) servers
- Authoritative DNS servers

Performing the translations
- Local DNS servers
- Resolver software

DNS Root Servers

13 root servers (see http://www.root-servers.org/)
Root servers tell where the TLD servers are

TLD and Authoritative DNS Servers

Top-level domain (TLD) servers
- Generic domains (e.g., com, org, edu)
- Country domains (e.g., uk, fr, ca, jp)
- Typically managed professionally
  - Network Solutions maintains servers for “com”
  - Educause maintains servers for “edu”

.CA started at UBC
- John Demco, member of the technical staff personally administered the registration/management of .ca names from 1987 until 2000
- Now managed by Canadian Internet Registration Authority

Authoritative DNS servers
- Provide public records for hosts at an organization
- For the organization’s servers (e.g., Web and mail)
- Can be maintained locally or by a service provider

Distributed Hierarchical Database

graphic

Using DNS

Local DNS server (“default name server”)
- Usually near the end hosts who use it
- Local hosts configured with local server (e.g., /etc/resolv.conf) or learn the server via DHCP

Client application
- Extract server name (e.g., from the URL)
- Do gethostbyname() to trigger resolver code

Server application
- Extract client IP address from socket
- Optional gethostbyaddr() to translate into name

Recursive vs. Iterative Queries

Recursive query
- Ask server to get answer for you
  - request 1 and response 8
Iterative query
- Ask server who to ask next
  - all other request-response pairs

DNS Caching

Performing all these queries take time
- And all this before the actual communication takes place
  - 1-second latency before starting Web download

Caching can substantially reduce overhead
- The top-level servers very rarely change
- Popular sites (e.g., www.cnn.com) visited often
- Local DNS server often has the information cached

How DNS caching works
- DNS servers cache responses to queries
- Responses include a “time to live” (TTL) field
- Server deletes the cached entry after TTL expires

Negative Caching

Remember things that don’t work
- Misspellings like www.cnn.comm and www.cnnn.com
- These can take a long time to fail the first time
- Mark as unresolvable with some TTL
- Good to remember that they don’t work so the failure takes less time the next time around

DNS Resource Records

DNS: distributed db storing resource records (RR)

Type=NS
- name is domain (e.g. foo.com)
- value is hostname of authoritative name server for this domain RR format: (name, value, type, ttl)
Type=A
- name is hostname
- value is IP address
Type=CNAME
- name is alias name for some “canonical” (the real) name www.ibm.com is really servereast.backup2.ibm.com
- value is canonical name
Type=MX
- value is name of mailserver associated with name

DNS Protocol

query and reply messages, both with same message format Message header

Identification: 16 bit # forquery, reply to query uses same #

Flags:
- Query or reply
- Recursion desired
- Recursion available
- Reply is authoritative

Reliability

DNS servers are replicated
- Name service available if at least one replica is up
- Queries can be load balanced between replicas

UDP used for queries
- UDP port 53
- Need reliability: implemented on top of UDP

Try alternate servers on timeout
- Exponential backoff when retrying same server

Same identifier for all queries
- Don’t care which server responds

Inserting Resource Records into DNS

Example: just created startup “FooBar”

Get static CIDR block from ISP
- assign IP addresses to your authoritative name servers (primary and secondary)

Register foobar.com at Network Solutions
- Provide registrar with names and IP addresses of name servers
- Registrar inserts two RRs per server into the com TLD server:
  - (foobar.com, dns1.foobar.com, NS)
  - (dns1.foobar.com, 212.212.212.1, A)

Put in authoritative server dns1.foobar.com
- Type A record for www.foobar.com
- Type MX record for foobar.com

Play with “dig” on the web

Bootstraping an End Host

DHCP and ARP

How to

When the host boots:
- How does it get an IP address?
  - Can’t be hard coded at the factory—the factory doesn’t know what subnet the device is going to join
  - Could be hand configured every boot-up—but that would be a pain
- How does it get the IP address of the router and DNS server?
- How does it get the MAC address of the router, and the other MAC addresses on the LAN for that matter?

Avoiding Manual Configuration

Dynamic Host Configuration Protocol (DHCP)
- Learn IP address, subnet mask, gateway router, DNS servers
Address Resolution Protocol (ARP)
- Learn mapping between IP & MAC addresses of other hosts on LAN in order to be able to create proper frames

Key Ideas in Both Protocols

Broadcasting: when in doubt, shout
- When you don’t know how to identify the right host broadcast query to all hosts in the local-area-network

Caching: remember the past for a while
- Store the information you learn to reduce overhead
- Remember your own address & other host’s addresses

Soft state: … but eventually forget the past
- Associate a time-to-live field with the information and either refresh or discard the information
- Key for robustness in the face of unpredictable change
We’ve done this before in the LAN setting with switched

Bootstrapping Problem

Host doesn’t have an IP address yet
- So, host doesn’t know what source IP address to use in IP packets

Host doesn’t know the IP addr of the DNS server to query for a dest IP address

Host doesn’t know the IP addr of gateway router interface
- Can’t send packets out of LAN w/o it

Host doesn’t know which IPs stay on the LAN and which need to go through the gateway router

Host doesn’t know the link layer address of any interfaces

Broadcasting

But we do know how to broadcast on the LAN
- Send a “discover” query with your MAC address as the source and with destination address: ff:ff:ff:ff:ff:ff
- All adapters on the LAN receive the packet including the appropriate server(s)

And it is possible to broadcast in IP within a LAN
- Use dest IP addr = 255.255.255.255

DHCP

DHCP is a client-server protocol,
- the client initiates the session; the server is “waiting” for session initiations by clients

Client and server code are in the application layer
- DHCP uses the UDP transport layer
- A DHCP server listens on UDP port 67
- A DHCP client uses UDP port 68

Dynamic Host Configuration Protocol

graphic

Response from the DHCP Server

Multiple servers may respond
- Multiple servers on the same broadcast media
- Each may respond with an offer
- The client can decide which offer to accept

Accepting one of the offers
- Client broadcasts a DHCP request echoing the parameters and the other servers see they were not chosen

The DHCP server responds with an ACK to

Deciding What IP Address to Offer

Server as centralized configuration database
- All parameters are statically configured in the server
  - a dedicated IP address for each MAC address
- Avoids complexity of configuring hosts directly while still having a permanent IP address per host

Or, dynamic assignment of IP addresses
- Server maintains a pool of available addresses and assigns them to hosts on demand
- Leads to less configuration complexity and more efficient use of the pool of addresses
- Though, it is harder to track the same host over time

Time out, Release, Renew

Each lease is accompanied by a TTL
- Important in order to be able to reallocate unused address

A long running host does not need to get a new IP addr after the TTL
- It issues a Renew command to the DHCP server

A host may purposefully give up its IP addr by issuing a Release command
- On graceful shut down, or leaving a network
- To debug/fix a network problem
- If you perform a DHCP Discover soon after you will likely get the same IP addr back

So, Now the Host Knows a Few Things

Its IP address
Its subnet Mask
The IP address of gateway router
The IP address of DNS server

And its IP forwarding table is simple
- Everything goes out its one network interface
  - Except packets with dest addr in 127.0.0.0/8, i.e., localhost Ips
- If the dest IP is outside the subnet, the next hop is the IP addr of the gateway router

But it doesn’t know any link layer addresses
- In principle, could still route all frames out the adpater using dest MAC addr ff:ff:ff:ff:ff:ff—but there must be a better way…

Address Resolution Protocol Table

Every node maintains an ARP table
- A mapping from IP address to MAC address

Consult the table when sending a packet
- Get Next Hop IP from IP Forwarding table
- Map that IP address to a MAC address via the ARP Table
- Encapsulate IP packet in a frame using that MAC addr as the dest addr ARP Protocol

ARP Protocol

But, what if the IP address is not in the table?

Via link layer broadcast sender asks: “Who has IP address 1.2.3.156?”
- Special ARP query packet encapsulated in a broadcast link layer frame

Node/Interface with IP addr in query responds: “I have that IP addr and my MAC address is 58-23-D7-FA-20-B0” in an ARP response
- The ARP response packet is sent via a non-broadcast frame back to the querier
- Sender caches the result in its ARP table

No need for network administrator to get involved

ARP Table Size

If ARP has to keep an entry for every IP outside of its subnet, its table could get large

Host learns gateway router interface IP address from DHCP
- Can learn MAC address via ARP

Host learns the subnet mask from DHCP so can tell which packets are destined for outside the subnet
- For these packets, the next hop IP is router interface IP
- Create dest MAC addr of frame by consulting ARP table for that IP

Back to DHCP

A host that has just booted has an empty ARP table

How does it create frames for the DHCP packets?

Src MAC
- Each host knows its own MAC addr and so can populate the src mac addr field of each frame

Dst MAC
- Every dst IP in the DHCP protocol is the broadcast IP addr
- This automatically maps to the broadcast MAC addr

Dynamic Host Configuration Protocol

graphic

Lecture 23 - Middleboxes, Firewalls,

Middleboxes

IP Principles

Internetworking principle
- Global, scalable addressing
- No packet “admission”, no global circuit computation
  - Path computations independent of packet forwarding
- Open standards

End-to-end principle
- Simple core/simple packet forwarding
  - Routing nodes simply forward packets (rather than modifying or filtering them)
- Complexity, application support at end hosts, not in routers

Internet Reality

Security concerns
- Discarding suspicious or unwanted packets
- Detecting suspicious traffic

IP address depletion
- Dynamic assignment of IP addresses
- Private addresses (10.0.0.0/8, 192.168.0.0/16, …)

Host mobility
- Changes in IP addresses as hosts move

Performance concerns
- Controlling how link bandwidth is allocated
- Storing popular content near the clients

Middleboxes

Middleboxes are intermediaries
- Interposed in-between the communicating hosts
- Often without knowledge of one or both parties

Examples
- Network address translators
- Firewalls
- Traffic shapers
- Intrusion detection systems
- Web proxies
- Application accelerators

Two Views of Middleboxes

An abomination
- Violation of layering
- Cause confusion in reasoning about the network
- Responsible for many subtle bugs
- Create roadblocks for some end-to-end applications

A practical necessity
- Solving real and pressing problems
- Needs that are not likely to go away

The real world is bound to conflict with

architectural purity

Firewalls

Network Effect

The Internetworking/end-to-end architecture enabled the Network Effect
- The value to join a network is proportional to the number hosts/services reachable on the network
- When a host joins the network, the value of the network increases

BUT, it also enabled the the Network Effect for (in)security
- Each node that you can reach, can reach you—and is potentially malicious
- Every new host joining the network is potentially a new attack vector
- The “attack surface” grows as the network grows

Vulnerabilities and Trust Domains

Perhaps in an ideal world if no software vulnerabilities existed and authentication/authorization processes were perfect it would be fine to allow every Internet host even those with malicious intent to contact every other Internet host

Even in this ideal world, the malicious hosts could mount a Denial of Service attack
- make so many requests to a given server as to make it unusable by the honest hosts

But we don’t live in an ideal world
- Software vulnerabilities can be exploited via network-enabled Internet attacks
- Bogus authorizations/privileges can be obtained and escalated

Organizations have data on servers and desktops and cpu cycles on application servers that they want to protect
- They can’t rely solely on the end hosts to protect these resources—the end-to-end model is too idealistic here
- They need some kind of defense in depth

Internet Attacks: Break-Ins

Breaking into a host
- Outsider exploits a vulnerability in the end host with the goal of changing the behavior of the host

Example
- Bad guys know a Web server has a buffer-overflow bug and, say, send an HTTP request with a long URL allowing them to run their own code

Motivations for break-ins
- Take over the machine to launch other attacks
- Steal information stored on the machine
- Modify/replace the content the site normally returns

Internet Attacks: Denial of Service

Denial-of-service attacks
- Outsider overwhelms the host with unsolicited traffic with the goal of preventing any useful work

Example: attacks by botnets
- Bad guys take over a large collection of hosts and program these hosts to send traffic to your host Leading to excessive traffic

Motivations for denial-of-service attacks
- Malice (e.g., just to be mean)
- Revenge (e.g., for some past perceived injustice)
- Greed (e.g., blackmailing)

Organizational boundaries

Pre Internet, organizations:
- enforced boundaries/access around their physical property and buildings
  - Fences, security gates/guards, ID badges
- enforced access to rooms, facilities, and paper assets,
  - via keys

Post Internet
- Very natural for organizations to draw a line around their networks and enforce access via a “security gate”
- Called a firewall

Firewall

Isolates an organization’s internal net from larger Internet, allowing some packets to pass, blocking others.

Packet Filtering

Internal network connected to Internet via firewall

Firewall filters packet-by-packet, based on:
- Source IP address, destination IP address
- TCP/UDP source and destination port numbers
- ICMP message type
- TCP SYN and ACK bits

Examples
- Block all packets with IP protocol field = 17 and with either source or dest port = 23.
  - All incoming and outgoing UDP flows blocked
  - All Telnet connections are blocked
- Block inbound TCP packets with SYN but no ACK
  - Prevents external clients from making TCP connections with internal clients
  - But allows internal clients to connect to outside
- Block all packets with TCP port of Doom3

Firewall Configuration

Firewall applies a set of rules to each packet
- To decide whether to permit or deny the packet

Each rule is a test on the packet
- Comparing IP and TCP/UDP header fields and deciding whether to permit or deny

Order matters
- Logic is defined as a list of rules
- The packet is tested starting at the top of the list
- Once the packet matches a rule, the decision is done

Example:

Alice runs a network in 222.22.0.0/16
- Wants to let Bob’s school access certain hosts
  - Bob’s hosts are on 111.11.0.0/16
  - Bob’s hosts should be able to access hosts on 222.22.22.0/24
- Alice doesn’t trust Trudy, inside Bob’s network
  - Trudy is on 111.11.11.0/24
- Alice doesn’t want any other traffic from Internet
Rules
- Don’t let Trudy’s machines in
  - Deny (src = 111.11.11.0/24, dst = 222.22.0.0/16)
- Let rest of Bob’s network in to special dsts
  - Permit (src=111.11.0.0/16, dst = 222.22.22.0/24)
- Block the rest of the world
  - Deny (src = 0.0.0.0/0, dst = 0.0.0.0/0)

A Variation: Traffic Management

Permit vs. deny is too binary a decision
- Maybe better to classify the traffic based on rules and then handle the classes of traffic differently

Traffic shaping (rate limiting)
- Limit the amount of bandwidth for certain traffic
  - rate limit on Web or P2P traffic

Separate queues
- Use rules to group related packets
- then do round-robin scheduling across the groups
  - separate queue for each internal IP address

Firewall Implementation Challenges

Per-packet handling
- Must inspect every packet
- Challenging on very high-speed links

Complex filtering rules
- May have large # of rules
- Filtering may be stateful
  - Don't let in any TCP packet unless the flow originated inside
- May do application level filtering
  - Filtering on blacklisted urls, P2P app detection

Location of firewalls
- Complex firewalls near the edge, at lower speed
- Simpler firewalls in the core, at higher speed

Firewalls are not full proof

tunneling over http
lots of mobile devices go in and out of a a fire-walled network

Examples:

Enterprise wifi can be snooped from afar
- Pringles directional antennae for wardriving
- First practical WEP attack from my group at AT&T
- WPA/WPA2 still vulnerable to password attacks

filtering dorm access to a server
- Firewall rule based on IP addresses of dorms and the server IP address and port number
- Problem: users may log in to another machine
  - connect from the dorms to another host and then onward to the blocked server

filtering P2P based on port #s
- Firewall rule based on TCP/UDP port numbers
  - allow only port 80 (e.g., Web) traffic
- Problem: software using non-traditional ports
  - write P2P client to use port 80 instead

Network Address Translation

History of NATs

IP address space depletion
- Clear in early 90s that 232 ~ 4 billion addresses not enough
- Work began on a successor to IPv4—IPv6

In the meantime…
- Share addresses among numerous devices without requiring changes to existing hosts

Meant to provide temporary relief
- Intended as a short-term remedy
- Now, NAT are very widely deployed. Much more so than IPv6

Private Addresses

IANA set aside several address blocks for the purposes of private addressing
- 10.0.0.0/8
- 172.16.0.0/12
- 192.168.0.0/16

Any organization can use these addresses. No delegation/permission from IANA required

Since lots of orgs can be using then at the same time, these addresses are not publically routable
- They’re ignored/dropped by routers
- Not propagated via BGP

Port Numbers

The so-called well-know port numbers: 0—1023
- Defined by IANA/IETF
- Well known services
- Used as destination port numbers

The range of port numbers from 1024 to 49151 are the registered ports
- They are assigned by IANA for specific service upon application by a requesting entity

IANA suggests the range 49152 to 65535 for private and dynamic ports
- These are used as ephemeral ports--destination port numbers picked at random for each connection

What if Both Hosts Contact Same Site?

Suppose hosts contact the same destination
- both hosts open a socket with local port 3345 to destination 128.119.40.186 on port 80

NAT has to give the packets the same source address
- All packets sent to the public Internet have source address 138.76.29.7
- This is the only publicly addressable IP addr
- The 10.0.0.0/8 addresses are not publicly addressable

Problems
- How can destinations differentiate between senders?
- How can return traffic get back to the correct hosts?

NAT

The dst IP and port can’t be changed by the NAT box
- The packet has to get to where it is supposed to go!

The src IP has to be the external IP of the NAT box
- The destination host needs a public IP address to send response packets to

The only thing the NAT box has a choice about is the src port
- The NAT box must make the src ports different so that return packets can be distinguished

Port-Translating NAT

Map outgoing packets
- Replace source IP address with external public NAT address
- Replace source port number with a NAT generated port number
- Remote hosts respond using (NAT IP address, NAT port #) for dst IP and port

Maintain a translation table
- Store map of (src IP, src port) to (NAT port #)
- This map needs to be one-to-one: each unique IP_addr, port # pair must map to a different NAT port #

Map incoming packets
- Consult the translation table
- Convert (dst port) into (dst IP, dst port)
- Local host receives the incoming packet

Maintaining the Mapping Table

For each outgoing packet, check if there’s an entry in the table
- If not create a new entry/mapping

Eventually, need to delete the map entry
- But when to remove the binding?

If no packets arrive within a time window then delete the mapping to free up the port #s
- At risk of disrupting a temporarily idle connection

Yet another example of “soft state”
- removing state if not refreshed for a while

Many Kinds of NATs

Some NATs also effectively include firewall rules:
- Only allow return packets from src IP_addr/src_port#
- Only allow return packet from src IP_addr
- Allow any return packet with a (dest) NAT port# in the table

Some corporate NATs have many external IP addresses
- Map from private IP/private port to NAT IP/NAT port is maintained in a table and is one-to-one

NATs provide some security and hide the structure of internal networks

Where is NAT Implemented?

Home router (e.g., Linksys box)
- Integrates router, DHCP server, NAT, etc.
- Use single IP address from the service provider and have a bunch of hosts hiding behind it

Campus or corporate network
- NAT at the connection to the Internet
- Share a collection of public IP addresses
- Avoid complexity of renumbering end hosts and local routers when changing service providers

Running Servers Behind NATs

Running servers is still possible
- Admittedly with a bit more difficulty

By explicit configuration of the NAT box
- internal service at <dst 138.76.29.7, dst-port 80> mapped to <dst 10.0.0.1, dst-port 80>

More challenging for P2P applications
- Especially if both peers are behind NAT boxes

Though solutions are possible here as well
- Existing work-arounds (e.g., in Skype)
- Ongoing work on “NAT traversal” techniques

Objections Against NATs

Practical Objections
- Port #s are meant to identify sockets
- Yet, NAT uses them to identify end hosts
- Makes it hard to run a server behind a NAT

Principled Objections
- Routers are not supposed to look at port #s
  - Network layer should care only about IP header and not be looking at the port numbers at all
- NAT violates the end-to-end argument
  - Network nodes should not modify the packets
- IPv6 is a cleaner solution
- But NATs work and here now
- I suspect NATs will be around for quite some time

Summary

Middleboxes address important problems
- Blocking unwanted traffic
- Getting by with fewer IP addresses
- Making fair use of network resources
- Improving end-to-end performance

Middleboxes cause problems of their own
- No longer globally unique IP addresses
- No longer can assume network simply delivers packets
- Harder to deploy end-to-end applications and UDP applications