Overviews

  • The fundamental limit on amount of information that can travel through a information channel is lossy (delete or alter bits)

    solution to single information channel packet bit error:

    1. error-correcting codes.
    2. ARQ (Automatic Repeated Request) protocol, send it again.

ARQ and retransmission

  • on multi-hop cascade channel, problems? package reordering, package duplicate, package drops.
  • package drops?

    resend it again.

    • how to ensure receivers receiving a certain packet?

      acks.

      • how long should sender wait for an ack? hard to decide. solve it later.
      • what if an ack is lost? can't distinguish, just resend.
      • what if a packet is received but with error? same as dropped. sender won't send ack.
      • how to ensure receivers receiving same packets senders send?

        1. error-correcting code, too complicated.
        2. error-detecting code, checksum, CRC etc.
  • package duplicate? assign each packet a unique sequence number, discard if duplicate.
  • package reordering? check sequence number, discard if not desired.
  • stop and wait protocol, throughput = packetSize / RTT (with no errors), low network occupation.
  • how to improve efficiency?

    goal: keep network busy.

    solution: sends multiple packets at a time.

    • for senders, Q: how many packets? how to keep records of ack timers? how to buffer packets not yet acked?
    • for receivers, Q: how to keep out of order packets in buffer?

    more subtle problems:

    1. what if sender sends faster than receiver receives?
    2. what if routers in middle can't handle rates senders send at?

Sliding window

manage complexity.

  • how large the window should be?

    problems to solve:

    • receiver can't handle sender's data rate?

      flow control.

      • how can sender know how fast it can send?

        two ways:

        • rate-based, gives sender a data rate allocation, and send ensures data never exceeds the allocation. preferred in streaming application and could be used with broadcast and multicast.
        • window-based, window size varied over time. mostly used with sliding window.

          • how to adjust window size? receiver signal sender windows size, a.k.a, window advertisement or window update.
          • throughput? WindowSize * WindowNum / RTT bit/s
    • network in between can't handle sender's data rate?

      congestion control, a special form of flow control.

      • rates? besides explicit signaling, another option is implicit signaling. sender tries to guess based on evidence.
  • related to queuing theory, still a major research topic.

Decide retransmission timeout

  • retry time should be the sum of

    • time of data transfer from sender to receiver.
    • time of receiver processing data.
    • time of ack transfer from receiver to sender.
    • time of sender processing ack.
  • problem is:

    • none of above are known or easy to know.
    • every of above changes over time.
  • solution? RTT estimation. retransmission time should be set larger than estimated time. how larger?

TCP Overview

A variation of ARQ.

characteristics:

  • connection-oriented(two-end)
  • reliable
  • byte stream abstraction

questions:

  • how to run on IP which is a protocol based on packet? packetization and repacketization, split chunks into what TCP consider best-sized chunks (usually fit into a single IP datagram), chunk called a segment
  • how to deal with packet duplicate or packet out of order? sequence number
  • how to ensure receiver receive same packet as sender sends? checksum. NOT strong enough. application layer error protection needed if data is large.
  • packet drop? retransmission. when sends a batch of segments, TCP normally set one ack timer. doesn't' set timer for every segments. when ack arrives, update timer.
  • ack? cumulative acks, ack N means all bytes up to N(exclusive) have already been received. more robust, later ack will cover previous ack.
  • full-duplex.

TCP header

20 bytes in total, includes:

  • src port, dest port. 4 bytes. (src ip, src port, dest ip, dest port) uniquely identifies a connection.
  • sequence number. 4 bytes.

    first bytes offset in this segment. tcp number each bytes with a sequence number. when connection is being established, SYN bit is turned on. sequence number contains the initial sequence number, which is randomly chosen.

    • why ISN not 0 or 1? security reason.

    sequence number + 1 is the first bytes of data being sent. SYN itself consume a sequence number.

  • ack number. 4 bytes. next sequence number the receiver expects to receive(sequence number of last received bytes + 1). valid only if ACK bit is on. indicate largest byte received in order at receiver. modern tcp has selective acknowledgement.
  • header length 4 bit. used for variable option. in 32-bit word. minimum is 5, maximum is
  • reserve 4bit.
  • flags 1 byte.

    CWR
    congestion window reduced.
    ECE
    ECN echo.
    URG
    Urgent.
    ACK
    acknowlegment.
    PSH
    push.
    RST
    reset connection.
    SYN
    synchronize sequence number.
    FIN
    sender of segment is finished sending data to peer.
  • window size 2 bytes used for flow control. max 65535 bytes. More on Window Scale option to further to increase this value.
  • TCP checksum 2 bytes. mandatory for sender to calculate and store, verified by receiver.
  • urgent pointer. 2 bytes. plus current sequence number is the last byte of urgent data.

Popular options:

  • maximum segment size(MSS). specify MSS sender willing to received in reverse direction.

data portion is option, e.g. pure ACK, window update.

TCP connection management

tcp connection is a 4 tuple (src ip, src port, dest ip, dest port)

tcp connection management includes:

  • connection establishment exchange several options.
  • data transfer
  • connection termination

challenge:

  • how to handle state transitions robustly?

Connection establishment and termination

standard procedures, three way handshakes:

  • connection establishment

    1. active opener sends SYN with ISN A
    2. passive opener receives SYN, respond with ACK (A + 1), SYN with ISN B
    3. active receives SYN, respond with ACK (B + 1)

    purpose:

    • let each end know a connection is starting
    • exchange ISN
    • exchange OPTION
  • connection closed

    1. active closer sends a FIN + ACK K
    2. passive closer receives FIN + ACK, sends ACK (K + 1).
    3. passive closer sends FIN + ACK L
    4. active closer receives FIN + ACK L, sends ACK (L + 1).
  • TCP half close

    • needs API support.
    • done sending data, but able to receive data.
  • simultaneous close and open both sides receive a SYN after sending one.

    simultaneous open will exchange 4 segments.

    1. A -> SYN X
    2. B -> SYN Y
    3. SYN X -> B, B -> SYN Y + ACK X + 1
    4. SYN Y -> A, A -> SYN X + ACK Y + 1

    simultaneous close:

    1. A -> FIN K + ACK L
    2. B -> FIN L + ACK K
    3. FIN K + ACK L -> A, A -> ACK K + 1, SEQ = L
    4. FIN L + ACK K -> B, B -> ACK L + 1, SEQ = K

ISN

  • Main target to solve: tcp segments may routed into network and show up later and disrupt connection.
  • Solution: choose ISN carefully. ISN should change over time.
  • Purpose: new sequence number must not be allowed to overlap between different instantiations of the same connection
  • Possible problems:

    1. sequence number may wrap

    employ application layer checksum

    1. knowing ip, port and seq number, one can forge tcp segments to interrupt tcp connection.

    choose ISN in a random way.

  • modern system implementation

    semirandom way. linux choose to:

    1. clock-based scheme, but start clock at random offset for each connection.
    2. random offset chosed by cryptographically hashed function on connection identifier.

    32 bit ISN, first 8 bit sequence number of secret, remaining generated by the hash

    property: hard to guess, increase over time

Connection Establishment Timeout

when connection can't be established, etc., server host is down, SYN needs to timeout.

Linux uses a way called exponential backoff, each backoff is always twice the previous, maximum backoff can be configured.

TCP options

  • MSS, maximum segment size. largest segments TCP is willing to receive from its peer. TCP header and IP header not inclusive.A limit, not negotiation.
  • SACK, SACK permitted and SACK block. Selective ACK.
  • WINDOW SCALE OPTION, increase window advertisement size
  • TIMESTAMP, monotonically increased value.

    • actions:

      1. senders put 32bit value in Timestamp Value field, receiver echoes back unchanged in Timestamp Echo Reply
    • purpose:

      1. allowing sender to calculate an estimate of connections'RTT for each ACK received.
      2. protection againest wrapped sequence number.
  • User Timout, amount of time a TCP sender is willing to wait for an ACK of outstanding data before concluding that remote end has failed.
  • TCP-AO. enhance security of TCP.

TCP PMTU Discovery

  1. when connection is established, TCP MTU = min(minimum of MTU of outgoing interface, MSS announced by other end)
  2. then all ip packets sent with DF flags on. If PTB received, TCP use smaller segment size and retransmit.
  3. due to the dynamically changing size of MTU, when time passed since last decrease, a larger value can be tried.

possible problems in an Internet environment:

  • blank holes, no PTB.

TCP State Transitions

diagram

  • TimeWait (2 MSL wait) MSL, Maximum Segment Lifetime, value chosen by implementation.

    • Rule: after sends last ack active closer should wait 2MSL in case ACK is lost.
    • Retransimit ACK triggered when receiving another FIN.
    • Effects:

      1. In 2MSL time, connection identitity(4 tupless) can't be reused, unless

        • 2 MSL time is over
        • new connection's ISN exceeds highest sequence number used by last instantiation
        • Timestamp for disambiguation is used.
    • Purpose: avoid segments from last instantiation interfares new one.
    • Bypass: Berkeley sockets API, SO_REUSEADDR option.
  • Quiet Time

    2MSL only works when host doesn't crash.

    • Protection: TCP should wait for MSL before creating new connections after crash or reboot.
  • FIN_WAIT_2 problem: might cause infinite wait. solution: timer, move to CLOSED when expired.

Reset Segment

reset segment is sent when segment arrives is not correct for referenced connection. usually results in fast teardown.

scenarios:

  • nonexistent port.

in correct reset segment:

  1. ACK must be set.
  2. ACK number should be in valid window.

FIN: orderly release RST: abortive release properties:

  1. any queued data is thrown away. reset segment is sent immediately.
  2. receiver can know the other end did an abort.

half-open connection means: one end has closed connection without the knowledge of the other end. happens when crash.

problem: the other end may never detect.

solved with Keep-Alive option in TCP.

Time-wait assassination:

  1. one TCP in time-wait states.
  2. RST segment comes and kills the TCP.

solution: most system don't respond to RST in Time-wait states.

TCP server options

  1. restricting foreign endpoints
  2. incoming connection queue. rule: new connections may be in one of two distinct states before they are made available before made available to an application:

    1. connections not yet completed but a SYN has been received(SYN_RCVD states)
    2. connections already in ESTABLISHED state but not yet accepted by Applications.

    one queues for each.

    with Berkley API, when application is told that a connection is arrived, TCP's three-way hand shake is already over.

Timeout and Retransimission

TCP use retransimission to guarantee correctness. Performance is heavily related to when to retransmission.

TCP use two factors to decide:

  • Timeout or timer-based
  • information from ACKs, i.e., fast restransmission, which is much more efficient.

Setting RTO

RTO :: retransmission timeout, based on the measurement of RTT

RTT estimation:

estimated for each TCP.

  • classic method SRTT, smoothed RTT. EWMA (exponentially weighted moving average or low-pass filter) SRTT <- alpha(SRTT) + (1 - alpha)RTT RTO = min(ubound, max(lbound, SRTT * beta)) flaws: can't keep up wide fluctuation. cause unnecessary retransmissions when actual RTO is much larger than estimated.
  • standard method

    estimate both RTT and estimation to set RTT. M is each RTT measurement.

    srtt <- (1 - g) srtt + (g)M

    rttvar <- (1 - h) rttvar + (h) (|M - srtt|)

    RTO = srtt + 4(rttvar)

    • clock granularity and RTO bounds TCP has running "clock", length of TCP clock "tick" is called granularity. In RFC 6298: RTO = max(strtt + max(G, 4(rttVar)), 1000) where G is granularity. Thus RTO is at least 1s.
    • initial value RFC 6298: initial RTO should be 1s. When first RTT measurement M arrives, srtt <- M rttvar <- M / 2.
    • retransmission ambiguity problem.

      sender sends packet 1. sender timeout. sender resend packet 2. sender receive ACK. How to tell which packet ACK is for?

      unless Timestamp option is used, ACK has no indication of which copy is being ACKed.

      Karn's Algorithm:

      1. when timeout and retransmission occurs, when ACK arrives, RTT estimator shouldn't be updated.
      2. To decrease network load, use backoff factor to RTO, doubles each time a subsequence retransmission occurs. Doubling continues until an ACK is received for a segment that was not retransmitted.
    • RTT estimation with Timestamp option
  • Linux method

    • 1ms granularity
    • TSOPT