Monday, April 30, 2012

Reducing Security Handoff Overhead with Opportunistic Key Caching



The good news is that the 802.1X mechanisms can be taken out of the picture for handoffs, for wireless architectures with a controller (or a large number of radios in one access point). This mechanism, available today from many vendors, is known as opportunistic key caching (OKC). The name comes from the main concept underlying the technology. Once a client has authenticated with the RADIUS server and has a PMK, there is no reason for it to negotiate a new one just to hand off; it needs only a new PTK for the new access point. The term "opportunistic" is used because the mechanism was designed to be a simple extension of 802.11i, and the client is not made aware that OKC is enabled. If it works, it works. If not, no problems arise except the increased time required for doing the handshake.
The main protocol for OKC is identical to ordinary key caching. The only difference is that whereas ordinary key caching requires that the client be returning to an access point where it had already performed 802.1X, opportunistic key caching requires only that the new access point somehow have access to the PMK, even though it was created on a different access point.
How can this work? The PMK, if you recall, does not have any information unique to the wireless network within it. It is a function purely of the EAP protocol in use between the wireline RADIUS server and the wireless client. There is no intrinsic reason that the same PMK cannot be used for different access points, as long as the following two restrictions are held to: the PMK must never be transmitted as plaintext or using weak encryption, and the PMK must not have expired.
In practice, opportunistic key caching implementations never move around the PMK. Instead, these implementations take advantage of the architecture of the WPA2 protocol and how it interacts with 802.1X. 802.1X doesn't know about clients and access points. Instead, it uses a different language, in which the role of the user is held by a supplicant, and the role of the network is held by an authenticator. The mapping of the supplicant to real devices is clear: the supplicant is a part of the client. The authenticator, on the other hand, has flexibility built in. For standalone access point architectures, the authenticator is a part of the access point. For controller-based architectures, however, the authenticator is almost always in the controller.
Now we get a sense for the scale of opportunistic key caching. The PMK was originally created in the authenticator, and most opportunistic key caching architectures leave the PMK inside the authenticator, never to come out. For controller-based architectures, the controller generates the PTK within the authenticator, and then distributes it to the encryption engine, which may be located locally in the controller or in the access points. With opportunistic key caching, then, the only change is to allow a client with a PMK to associate to a new access point, and to use the PMK for the new connection as if it had been negotiated on that access point.
There is no addition of protocols or state changes in opportunistic key caching, which explains why it is so prevalent within network implementations. The only changes are to clients, which have to create a new PMKID, based on the original PMK, when they associate to a new access point, and to the authenticator, which needs to overlook the fact that no PMKID was ever created for this PMK on the new access point, create the new one, and then continue as if nothing unusual had happened.
You should look for wireless clients and network infrastructure that support opportunistic key caching when rolling out a voice mobility network. OKC has been generally embraced by the industry, though there are a few notable exceptions, and is generally used as the solution to the 802.1X overhead.
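As a minimal sketch of the two changes just described, the following assumes the 802.11i definition of the PMKID (the first 128 bits of an HMAC-SHA1 over the string "PMK Name", the access point address, and the client address, keyed by the PMK). The `AuthenticatorCache` class, its method names, and the MAC addresses used with it are hypothetical and only illustrate the idea that the PMK stays inside the authenticator while a fresh PMKID is computed for each new access point.

```python
import hashlib
import hmac


def pmkid(pmk: bytes, ap_mac: bytes, client_mac: bytes) -> bytes:
    """PMKID as defined in 802.11i: HMAC-SHA1-128(PMK, "PMK Name" || AA || SPA)."""
    digest = hmac.new(pmk, b"PMK Name" + ap_mac + client_mac, hashlib.sha1).digest()
    return digest[:16]  # keep the first 128 bits


class AuthenticatorCache:
    """Hypothetical controller-side cache; the PMK never leaves this object."""

    def __init__(self):
        self._pmk_by_client = {}

    def store(self, client_mac: bytes, pmk: bytes) -> None:
        # Called once, after the full 802.1X exchange on the first access point.
        self._pmk_by_client[client_mac] = pmk

    def accept_okc(self, client_mac: bytes, new_ap_mac: bytes, offered: bytes) -> bool:
        # The PMKID was never created for this access point, so compute it now
        # from the cached PMK and compare it with what the client offered.
        pmk = self._pmk_by_client.get(client_mac)
        if pmk is None:
            return False  # no cached PMK: fall back to full 802.1X
        expected = pmkid(pmk, new_ap_mac, client_mac)
        return hmac.compare_digest(expected, offered)
```

A client that roams would call `pmkid()` with the new access point's address and place the result in its reassociation request. If `accept_okc()` returns `False`, nothing breaks; the full 802.1X exchange simply runs, which matches the "if it works, it works" behavior described earlier.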

Thursday, April 26, 2012

The Wi-Fi Break-Before-Make Handoff



Basic Wi-Fi handoffs are always either break-before-make or just-in-time. In other words, there is no ability for a wireless phone to decide on a handoff and establish a relationship with a new access point without disconnecting from the previous one. The rules of 802.11 are rather simple here: no client is allowed to be associated to two access points at the same time (that is, to send an Association message to one while maintaining data connectivity to another). The reason for this is to remove any ambiguity as to which access point should forward wireline traffic destined to the client; otherwise, both access points would have the requirement of receiving the client's traffic, which would not work in a switched wireline environment.
However, almost all of the important protocols for Wi-Fi happen only after a data connection has been established. This prevents clients from gaining much of a head start on establishing a connection when the old one is at risk.
Let's look at the contents of the Wi-Fi handoff protocol itself, step by step.
  1. Once a client has decided to hand off, it need not break the connection to the original access point, but it must not use it any longer.
  2. The client has the option of sending a Disassociation message to the old access point, a good practice that lets the old access point free up network resources.
  3. At this point, if the new access point is on a different channel, the client will change the channel of its receiver.
  4. If the new channel is a DFS channel, the client is required to wait until it receives a beacon frame from the access point, unless it has recently heard one as a part of a passive scanning procedure.
  5. The client will send an Authentication message to the new access point, establishing the beginnings of a relationship with this new access point, but not yet enabling data services.
  6. The access point will respond with its own Authentication message, accepting the client. A rejection can occur if load balancing is enabled, and the access point decides that it is oversubscribed, or if key state tables in the access point are full.
  7. The client will send a Reassociation Request message to the access point, requesting data services.
  8. The access point will send a Reassociation Response message to the client. If the message has a status code for success, the client is now associated with and connected to this access point, and only this access point. Controller-based wireless architectures will usually ensure this by immediately destroying any connection that may have been left over if step 2 has not been performed. The access point may reject the association if it is oversubscribed, or if the additional services the client requests (mostly security or quality-of-service) in the Reassociation Request will not be supported.
    At this point, the client is associated and data services are available. Usually, the access point or controller behind it will send a broadcast frame, spoofed to appear as if it were sent by the client, to the connected Ethernet switch, informing it of the client's presence on that particular link and not on any one that may have been used previously.
    If no security is employed, skip ahead to the admission control mechanisms, towards the end of the list. If PSK security is employed, skip ahead to the four-way handshake. Otherwise, if 802.1X and RADIUS authentication is employed (WPA/WPA2 Enterprise), we'll continue immediately next.
  9. The access point and client can only exchange EAP messages at this point. The client may solicit the EAP exchange with an optional EAP Start message.
  10. The access point will request the client to log in with an EAP Request Identity message.
  11. Depending on the EAP method required by the RADIUS server on the network, the client and access point will continue to exchange a number of data frames, all EAPOL.
  12. The access point relays the RADIUS server's EAP Success or EAP Failure message. If this is a failure, the access point will also likely send a Deauthentication or Disassociation message to the client, to kick it off of the access point.
    At this point, the client and access point have agreed on the pairwise master key (PMK), based on the key material generated during the RADIUS exchange and sent to the access point when the authentication process concluded. But, the access point and client still need to generate a per-connection, pairwise transient key (PTK), which will be used to do the actual encryption. Pre-shared key (PSK) networks skipped the listed EAP exchanges, and use the PSK as the master key.
  13. The access point sends the first message in the RSN (802.11i) four-way handshake. This is an EAPOL Key frame.
  14. The client sends the second message in the four-way handshake.
  15. The access point sends the third message in the four-way handshake.
  16. The client sends the fourth message in the four-way handshake.
    At this point, all data services are enabled, and the client and access point can exchange data frames. However, if a call is in progress, and WMM Admission Control is enabled, the client is required to request the voice resources before it can send or receive a single voice packet with priority. Until this point, both sides may either buffer the packets or send the voice packets as best-effort. 
  17. The client sends the access point an ADDTS Request Action frame, with a TSPEC that specifies the over-the-air resources that both the upstream and downstream part of the voice call will occupy.
  18. The access point weighs whether it has enough resources to accept or deny the request. It sends an ADDTS Response Action frame with the results.
  19. If the request was successful, the client and access point will be sending voice traffic and the call successfully handed off. On the other hand, if the request fails, the client will disconnect from the access point with a Disassociation message, because, although it is allowed to remain on the access point, it can't send or receive any voice traffic.
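To make the PMK-to-PTK step (steps 13 through 16) concrete, here is a rough sketch of the pairwise key expansion as defined in 802.11i: a pseudorandom function built on HMAC-SHA1, fed with both MAC addresses and both handshake nonces. The 48-byte output length corresponds to CCMP; the function names and any input values are illustrative assumptions, not part of any particular product.

```python
import hashlib
import hmac


def prf(key: bytes, label: bytes, data: bytes, length: int) -> bytes:
    """802.11i PRF: concatenate HMAC-SHA1(key, label || 0x00 || data || counter)."""
    out = b""
    counter = 0
    while len(out) < length:
        block = label + b"\x00" + data + bytes([counter])
        out += hmac.new(key, block, hashlib.sha1).digest()
        counter += 1
    return out[:length]


def derive_ptk(pmk: bytes, ap_mac: bytes, client_mac: bytes,
               anonce: bytes, snonce: bytes, length: int = 48) -> bytes:
    """Derive the per-connection PTK from the PMK (48 bytes assumed, as for CCMP)."""
    # Addresses and nonces are ordered (smaller first) so both sides
    # compute the same input regardless of who they are.
    data = (min(ap_mac, client_mac) + max(ap_mac, client_mac)
            + min(anonce, snonce) + max(anonce, snonce))
    return prf(pmk, b"Pairwise key expansion", data, length)
```

Because the access point's address and fresh nonces go into the derivation, the same PMK yields a different PTK on every access point and on every connection, which is why the four-way handshake is repeated after each handoff even when the master key is unchanged.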
Hopefully, everything went well and the handoff completed. On the other hand, if any of the processes failed, the connection is broken. The old connection was abandoned early on: in step 8 for sure, and in step 2 for more charitable clients. In order to not drop the phone call, the phone will need to restart the process from the beginning with another access point, perhaps the original access point it just left if no other is available.
You will notice that the client has a lot of work to do to make the handoff successful, and there are many places where the procedure can go wrong. Even if every request were to be accepted, any loss of some of the messages can cause long timeouts, often up to a second, as each side waits to make sure that no messages are passing each other by.
If nothing at all is done to optimize this transition, the handoff mechanics can take an additional second or two, on top of the second or so taken by the scanning process before the handoff decision was made. In the worst case, the 802.1X communication can take a number of seconds.
Part of the issue is that the mechanisms are nearly the same for a handoff as they are for when the client initially connects. This lack of memory within the network within basic Wi-Fi prevents any optimizations and requires a fresh start each time.
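The handoff steps listed above can be tallied with a small sketch like the one below. It only enumerates the mandatory over-the-air messages for each security mode; the optional Disassociation and EAP Start frames are left out, and `eap_round_trips` is a stand-in for however many request/response pairs the chosen EAP method needs, which the text does not specify. The function name and parameters are hypothetical.

```python
def handoff_messages(security="open", eap_round_trips=0, admission_control=False):
    """Return the ordered (sender, message) pairs for one basic Wi-Fi handoff.

    security: "open", "psk", or "802.1x" (names chosen for this sketch).
    eap_round_trips: assumed number of EAP request/response pairs after identity.
    """
    msgs = [
        ("client", "Authentication"),
        ("ap", "Authentication"),
        ("client", "Reassociation Request"),
        ("ap", "Reassociation Response"),
    ]
    if security == "802.1x":
        msgs += [("ap", "EAP Request Identity"), ("client", "EAP Response Identity")]
        msgs += [("ap", "EAP Request"), ("client", "EAP Response")] * eap_round_trips
        msgs += [("ap", "EAP Success")]
    if security in ("psk", "802.1x"):
        # Four-way handshake: the access point sends messages 1 and 3.
        msgs += [("ap", "4-way message 1"), ("client", "4-way message 2"),
                 ("ap", "4-way message 3"), ("client", "4-way message 4")]
    if admission_control:
        msgs += [("client", "ADDTS Request"), ("ap", "ADDTS Response")]
    return msgs


for mode in ("open", "psk", "802.1x"):
    count = len(handoff_messages(mode, eap_round_trips=4, admission_control=True))
    print(mode, count)
```

Each of these messages is a place where a lost frame can cost a timeout, which is one way to see why the 802.1X case dominates the handoff time.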

Sunday, April 22, 2012

When Scanning Happens | Inter-Access Point Handoffs



The client's handoff is only as good as its scanning table. The more the client scans, the more accurate the information it receives, and the better decision the client can make, thus ensuring a more robust call. However, scanning can cost as much in call quality as it saves, and most certainly diminishes battery life. So how do phones determine when to scan?
The most obvious way for a client to decide to scan is for it to be forced to scan. If the phone loses connection with the access point that it is currently attached to, then it will have no choice but to reach out and look for new options. Clients mainly determine that they have lost the connection with their current access point in three different ways.
The first method is to observe the beacons for loss. As mentioned earlier, beacon frames are transmitted on specific intervals, by default every 102.4ms. Because the beacons have such a strict transmission pattern, clients (even sleeping clients) know when to wake up to catch a beacon. In fact, they need to do this regularly, as a part of the power saving mechanisms built into the standard. A client can still miss a beacon, for two reasons: either the beacon frame was collided with (and, because beacon frames are sent as broadcast, there are no retransmissions), or the client is out of the range that the beacons' data rates allow. Therefore, clients will usually observe the beacon loss rate. If the client finds itself unable to receive enough beacons according to its internal thresholds, it can declare the access point either lost or possibly suffering from heavy congestion, and thus trigger a new scan, as well as deprioritize the access point in the scanning table. The loss thresholds used in real clients are often based on a combination of two or more different types of thresholds, such as triggering a scan if a certain number of beacons are lost consecutively, as well as triggering if a certain percentage is lost over time. These thresholds are unlikely to be directly specifiable by the user or administrator.
The second method is to observe data transmissions for loss. This can be done for received or transmitted frames. However, it is difficult for a client to accurately determine how many received frames have been lost, given that the only evidence of a retransmission prior to a lost frame is the setting of the Retry bit in the frame's header, something that is not even required in the newer 802.11n radios. Therefore, clients tend to monitor transmission retries. The retry process is invoked for a frame whenever the transmitter receives no acknowledgment for it. Retransmissions are performed for both collisions and adapting to out-of-range conditions; because the transmitter does not know which problem caused the loss, both are handled by the transmitter simultaneously reducing the transmit data rate, in hopes of extending range, and increasing backoff, in hopes of avoiding further collisions for this one frame. Should a series of back-to-back frames be retransmitted until they time out, the client may decide that the root cause is being out of range of the access point. Again, the thresholds required are not typically visible or exposed to the user or administrator.
Voice clients tend to be more proactive in the process of scanning. The two methods just described are for when the client has strong evidence that it is departing the range of the access point. However, because the scanning process itself can take as long as it does, clients may choose to initiate the scan before the client has disconnected. (This may sound like the beginnings of a make-before-break handoff scheme, but read on to Section 6.2.3, where we see that such a scheme does not, in fact, happen.) Clients may choose to start scanning proactively when the signal strength from the access point begins to dip below a predetermined threshold (the signal strength itself is usually measured directly for the beacons). Or, they may take into account increasing, but not yet disruptive, losses for data. Or, they may take into account observed information about channel conditions, such as an increasing noise floor or the encountering of a higher density of competing clients, to trigger the scan. In any event, the client is attempting to make some sort of preprogrammed expense/reward tradeoff. This tradeoff is often related to the problems of handoff, as mentioned shortly.
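A combined beacon-loss threshold of the kind just described might look like the following sketch. The class name and the specific numbers (five consecutive losses, 30 percent loss over a 50-beacon window) are invented for illustration; real clients keep their thresholds internal.

```python
from collections import deque


class BeaconLossMonitor:
    """Hypothetical scan trigger combining two beacon-loss thresholds."""

    def __init__(self, max_consecutive=5, window=50, max_loss_fraction=0.3):
        self.max_consecutive = max_consecutive
        self.max_loss_fraction = max_loss_fraction
        self.history = deque(maxlen=window)  # True = beacon heard, False = missed
        self.consecutive_lost = 0

    def record(self, received: bool) -> bool:
        """Record one expected beacon; return True if a scan should be triggered."""
        self.history.append(received)
        self.consecutive_lost = 0 if received else self.consecutive_lost + 1

        # Threshold 1: too many beacons missed in a row.
        if self.consecutive_lost >= self.max_consecutive:
            return True

        # Threshold 2: too large a fraction missed over a full window.
        window_full = len(self.history) == self.history.maxlen
        lost = self.history.count(False)
        return window_full and lost / len(self.history) >= self.max_loss_fraction
```

The client would call `record()` once per expected beacon time, roughly every 102.4ms with the default interval. The consecutive rule reacts quickly to walking out of range, while the windowed rule catches a congested or marginal access point that still gets some beacons through.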
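The two simultaneous reactions to a lost frame, and the resulting scan trigger, can be sketched as follows. The contention window doubling follows the usual 802.11 rule, with the 15 and 1023 limits assumed here for an OFDM radio without WMM. Stepping down exactly one data rate per retry, the retry limit of seven, and the threshold of three failed frames are assumptions made purely for illustration; real rate adaptation is vendor specific.

```python
RATES_MBPS = [54, 48, 36, 24, 18, 12, 9, 6]  # 802.11a/g OFDM data rates
CW_MIN, CW_MAX = 15, 1023                    # assumed contention window limits


def retry_schedule(start_rate=54, retry_limit=7):
    """Return (rate_mbps, contention_window) for the first try and each retry."""
    idx = RATES_MBPS.index(start_rate)
    cw = CW_MIN
    schedule = []
    for _ in range(retry_limit + 1):
        schedule.append((RATES_MBPS[idx], cw))
        # Assumption: drop one rate per retry, hoping to extend range...
        idx = min(idx + 1, len(RATES_MBPS) - 1)
        # ...and double the contention window, hoping to avoid collisions.
        cw = min(2 * (cw + 1) - 1, CW_MAX)
    return schedule


def out_of_range_suspected(frame_delivered, threshold=3):
    """True if `threshold` frames in a row exhausted their retries (hypothetical rule)."""
    run = 0
    for delivered in frame_delivered:
        run = 0 if delivered else run + 1
        if run >= threshold:
            return True
    return False
```

For example, `retry_schedule()` starts at 54 Mbps with a window of 15 and ends at 6 Mbps with a window of 1023, which shows how a single frame can consume a large amount of airtime before the client concludes anything about range.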
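One simple form of a proactive trigger is a threshold on smoothed beacon signal strength. The sketch below uses an exponentially weighted moving average so that a single weak beacon does not start a scan; the -70 dBm threshold and the smoothing factor are arbitrary values chosen for illustration, since the actual tradeoff is preprogrammed by each client vendor.

```python
def should_scan_proactively(beacon_rssi_dbm, threshold_dbm=-70.0, alpha=0.25):
    """Return True if the smoothed beacon signal strength is below the threshold.

    beacon_rssi_dbm: signal strengths of recent beacons, oldest first.
    alpha: weight given to the newest sample (assumed value).
    """
    smoothed = None
    for rssi in beacon_rssi_dbm:
        if smoothed is None:
            smoothed = rssi
        else:
            smoothed = alpha * rssi + (1 - alpha) * smoothed
    return smoothed is not None and smoothed < threshold_dbm
```

A client could combine this with loss counters or noise floor measurements; the point is only that the trigger fires while the connection is still usable.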
Scanning may also happen in the background, for no reason at all. This is less common in voice clients, where the desire to ensure battery life acts as a deterrent, but nevertheless is employed from time to time. The main reason to do this sort of background scanning is to ensure that the client's scanning table is generally not as stale, or to serve as a failsafe in case the triggered scanning behavior does not go off as expected. One of the chief problems with determining when to scan is that the client has no way of knowing whether it is moving or how fast it may be moving. A phone held in the hands of a forklift driver can rapidly go from having been standing still for many minutes to racing by at 15 miles per hour in a warehouse. This sort of scanning, not being triggered, is the least likely to lead to a change in access point selection, but may still serve its appropriate place in a network. For data clients, as a comparison, this form of background scanning, triggered for no reason, is often driven by the operating system. Windows-based systems often scan, for example, every 65 seconds, just to ensure that the operating system has a good sense of the networks that are available, in case the user should want to hop from one network to another. This sort of scanning causes a noticeable hit in performance for a short period of time on a periodic basis.

Thursday, April 19, 2012

The Scanning Process | Inter-Access Point Handoffs



The scanning table's contents come from beacons and probe responses. Scanning is a process that can be requested explicitly by the user, often by performing an operation that is labeled "Reconnect," "Update," or "Scan." But far more often, scanning is a process that happens in the background or when the client decides that it is needed. To understand why the client makes those choices, we will need to look at the mechanisms of scanning itself.
There are two ways that the scanning table can be updated. When a client is associated to an access point, it has the ability to gather information about other access points on that channel. Especially when the client is not in power save mode, the client will usually ask its hardware to let it receive all beacon frames from any access point. Each beacon frame is then used to update the scanning table entry for that access point.
On the other hand, the client may want to survey other channels to find out what other access point options are out there. To do this, the client clearly needs to leave the channel of its access point for at least a small amount of time. Therefore, before engaging in this process, the client will usually tell the access point that it is going into power save mode, even though it is doing no such thing. That way, the access point will buffer traffic for the client, who can then look around the network with impunity.
When the client changes channels, it has two methods it can use to find out about the access points. The quickest method is to send out the probe request mentioned earlier. This probe request contains the SSID the client desires (with the option of a null SSID, an empty string, if the client wants to learn about all SSIDs), and is picked up by all access points in range that support the SSID and wish to make themselves known to the client. Each access point that wishes to answer and that supports the SSID in question will respond with a probe response, a frame that is nearly identical to a beacon but is sent, unicast, directly to the client that asked for it. This procedure is called active scanning, though it can also be called probing, given the name of the frames that carry out the procedure. The other option is called passive scanning, and, as the name suggests, involves the client sending no frames at all. Instead, the client waits around for a beacon. Keep in mind that passive scanning clients do not know, ahead of time, how many access points are on a channel or when these access points may transmit the beacons. Therefore, a client may need to wait for at least one beacon period to maximize its chances of seeing beacons from every access point of possible interest.
In these two ways, the client goes from channel to channel, collecting as much information as possible about the available networks.
Clients may choose between active or passive scanning for a number of reasons. The advantage of active scanning is that the client will get definitive answers about the access points that are on that channel and in range in short order. Sometimes the client needs to send more than one probe request, just to make sure that none of those broadcast frames were lost because of transient RF effects or collisions. But the process itself concludes rather quickly. Furthermore, active scanning with probe requests is the only way to learn about which access points serve SSIDs that are hidden, where hidden SSIDs are not put in beacons and require the user to enter the SSID by hand. On the other hand, active scanning comes with two major penalties. The first one is for sheer network overhead. A probe request can trigger a storm of probe responses to the client, all of which take up valuable airtime. Especially when there is a network fluctuation (access point reboots, power outages, or RF interference), all of the probes pile onto an already fragile network, making traffic significantly worse. The second penalty is that active scanning is simply not allowed on the majority of the 5GHz channels. Any channel that is in a DFS band cannot be used with active scanning. Instead, the client is always required to wait for a beacon (an enabling signal), to know that the channel is allowed for operation, does not have a radar, and thus can be used. (Note that, once a client has an enabling signal, it is allowed to proceed with a probe request to discover hidden SSIDs. However, the time hit has been taken, and the process is no faster than a normal passive scan.)
Therefore, to better understand scanning, we need to look at the timing of scanning. Active scanning, of course, is the quicker process, but it too has a delay. Active scanning is limited by a probe delay, required by the standard to prevent clients from tuning into a channel in the middle of an existing transmission. The potential problem is that a client abruptly tuning into a channel might not be able to detect that a transmission is under way—carrier sense mechanisms that are based on detecting the preamble will miss out, and thus produce a false reading of a clear channel. Thus, if the client were then to send a probe request, the client could very well destroy the ongoing transmission and lose out on the access points' seeing the probe request, because of a collision. As it turns out, many voice clients set that probe delay to a trivial value, in order to not have to wait. But the common value for that delay is 12ms, which is a long time in the world of voice. Passive scanning is worse. Most access points send their beacons every 102.4ms, or as close as they can get. This means that a client who tunes to a channel has a good chance of having to wait 50ms just to get a beacon, and may have to wait the entire 100ms in the worst case, for just that one access point.
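The timing comparison can be worked through numerically. The sketch below uses the 102.4ms beacon interval, the 12ms probe delay, and the 20ms voice packet interval mentioned in this chapter; the 5ms allowance for probe responses to arrive is a made-up figure, and the client is assumed to tune in at a uniformly random point in the beacon cycle.

```python
BEACON_INTERVAL_MS = 102.4  # default beacon interval
PROBE_DELAY_MS = 12.0       # common probe delay
VOICE_INTERVAL_MS = 20.0    # typical spacing between voice packets


def passive_wait_ms(offset_ms, beacon_interval_ms=BEACON_INTERVAL_MS):
    """Wait until the next beacon, given the time since the last one at tune-in."""
    return beacon_interval_ms - offset_ms


def mean_passive_wait_ms(beacon_interval_ms=BEACON_INTERVAL_MS, samples=1000):
    """Average wait over evenly spread tune-in offsets (uniform arrival assumed)."""
    offsets = [(i + 0.5) * beacon_interval_ms / samples for i in range(samples)]
    return sum(passive_wait_ms(o, beacon_interval_ms) for o in offsets) / samples


def active_wait_ms(probe_delay_ms=PROBE_DELAY_MS, response_wait_ms=5.0):
    """Probe delay plus a hypothetical allowance for probe responses to arrive."""
    return probe_delay_ms + response_wait_ms


def fits_between_voice_packets(wait_ms, voice_interval_ms=VOICE_INTERVAL_MS):
    return wait_ms < voice_interval_ms
```

Under these assumptions the average passive wait comes out at half the beacon interval, about 51ms, with a worst case of the full 102.4ms, while an active scan with the standard probe delay only just fits inside one 20ms voice interval.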
The timescale that dominates, for voice mobility, is the voice packet arrival interval. Normally, that value is 20ms (though it can be 30ms in some cases). A client will usually want to get all of the scanning it can get done in those 20ms, so that it can return to its original channel and not miss the next voice packet. Certainly, the client will not want to take 100ms unless it has to, because 100ms is a long enough jitter that it can be quite noticeable. Again, this tends to make active scanning the choice for voice clients, who are always in a hurry to learn about new access points.
If the client is going to scan between the voice packets, then the client's ability to scan will probably be limited to one channel at a time. When limited this way, the client may take up to a second, easily, to scan every possible channel. There are 11 channels in 2.4GHz, 9 non-DFS channels in 5GHz, and 11 more in the DFS bands, for a total of 31 channels to scan (or 23 channels if clients make the assumption that service is provided only on channels 1, 6, and 11 in the 2.4GHz band). Of course, scanning is also a battery-intensive process, and so a client may choose to spread out the scanning activity over time.
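The channel arithmetic above, and a lower bound on the sweep time it implies, can be checked with a few lines. The sketch assumes exactly one channel is visited per 20ms voice interval and ignores radio settling time and battery-driven pauses, both of which push the real figure higher.

```python
CHANNEL_COUNTS = {
    "2.4GHz": 11,
    "5GHz non-DFS": 9,
    "5GHz DFS": 11,
}


def channels_to_scan(only_1_6_11=False):
    """Total channels to visit; optionally assume 2.4GHz service only on 1, 6, 11."""
    counts = dict(CHANNEL_COUNTS)
    if only_1_6_11:
        counts["2.4GHz"] = 3
    return sum(counts.values())


def sweep_time_ms(n_channels, voice_interval_ms=20):
    """Lower bound on a full sweep at one channel per voice packet interval."""
    return n_channels * voice_interval_ms


print(channels_to_scan(), sweep_time_ms(channels_to_scan()))          # 31 620
print(channels_to_scan(True), sweep_time_ms(channels_to_scan(True)))  # 23 460
```

Even the optimistic 460ms to 620ms range is most of a second, which is consistent with the estimate that a full scan can easily take up to a second once settling time is added.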
Furthermore, the process of changing channels is not always instantaneous. Depending on the radio chip vendor, some clients will have to wait through a multimillisecond radio settling and configuration time, reprogramming the various aspects of the radio in order to ensure proper transmission on the new channel. This adds additional padding time to the individual scanning channel transitions.
Overall, this scanning delay is a major source of handoff delays, and some methods for reducing the scanning time have been created, which we will examine shortly.

Sunday, April 15, 2012

The Scanning Table



Let's look at the scanning table in a bit more detail. This table is primarily a list of access point addresses (BSSIDs), and the parameters that the access point advertises. The 802.11 standard lists at least some parameters that may be useful to hold in the client's scanning table, as in Table 1.
Table 1: Scanning table contents from 802.11 
Field | Meaning
BSSID | The Ethernet address of the access point's service for this SSID
SSID | The SSID text string
BSS Type | Whether the access point is a real access point, or an ad hoc device
Beacon Period | Number of microseconds between beacons
DTIM Period | How many beacons must go by before broadcast/multicast frames are sent
Timestamp | The time the last beacon or probe response was scanned for this client
Local Time | The value of the access point's time counter
Physical Parameters | What type of radio the access point is using, and how it is configured
Channel | The channel of the access point
Capabilities | The capabilities the access point advertises in the Capabilities field
Basic Rate/MCS Set | The minimum rates (and MCS for 802.11n) that this client must support to gain entry
Operational Rate/MCS Set | The allowed rates (and MCS for 802.11n) that this client can use once it associates
Country | The country and regional information for the radio
Security Information | The required security algorithms
Load | How loaded the access point reports itself to be
WMM Parameters | The WMM parameters that the client must use once it associates
Other Information | Depends on the standards that the client and access point support
This table contains the fields taken from the access point's beacons and probe responses. Most of the information is necessary for the client to possess before it can associate, because this information contains parameters that the client needs to adopt upon association. By looking at this table, clients can easily see which access points have the right SSID, but will not allow the client to associate. Examples are access points that require a higher grade of security than the client is configured for, or require a more advanced radio (such as 802.11n) than the client supports. Most of the time, however, a properly configured network will not advertise anything that would prevent a properly configured client from entering.
In addition to all of this mostly static, configuration information that the access point reports, clients may collect other information that they may themselves find useful when deciding to which access point they should associate. This information is unique to the client, based on environmental factors. Generally, this information (not that in Table 1) is far more important in determining how a client chooses where to hand off or associate to. Table 2 contains some more frequent examples of information that different clients may choose to collect. Again, there is no standard here; clients may collect whatever information they want. Roughly, the information they collect is divided into two types: information observed about the access point, and information observed about the channel the access point is on. This split is necessary, because clients have to choose which channel to use as a part of choosing which access point to associate to. Properties like noise floor or observed over-the-air activity belong to the channel at the point in place and time that the client is in. On the other hand, some properties belong directly to the access point without regard to channel, such as the power level at which the client sees the access point's beacon frames. Furthermore, some of the per-access-point information may have been collected from previous periods when the client had been associated to that access point, and measured the quality of the connection.
Table 2: Other possible scanning table contents 
Field | Meaning
Signal Strength | The power level of the beacon or probe response from the access point
Channel Noise | The measured noise floor value on the channel the access point is on
Channel Activity | How often the channel the access point is on is busy
Number of Observed Clients | How many clients are on the channel the access point is on
Beacon Loss Rate | How often beacons are missed on that channel, even though they are expected
Probe Request Loss Rate | How many times probe requests had to be sent to get a probe response
Previous Data Loss Rate | If associated earlier, how much loss was present between the access point and client
Probe Request Needed | Whether the client needed to send a probe request
The scanning table is something that the client maintains over time, as a fluid, "living" menu of options. One of the challenges the client has is in determining how old, or stale, the information may be—especially the performance information—and whether it has observed that channel or access point long enough to have some confidence in what it has seen. This is a constant struggle, and different clients (even different software versions from the same client vendor) can have widely different ways of judging how much of the table to trust and whether it needs to get new information. This is one of the sources of the variability present in Wi-Fi.
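As a minimal sketch of such a living table, the code below keeps one entry per BSSID with a handful of the fields from Tables 1 and 2 and ignores entries that have not been refreshed recently. The field selection, the 10-second staleness limit, and the pick-the-strongest-signal rule are all assumptions for illustration; as noted above, every client judges staleness and candidate quality in its own way.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ScanEntry:
    bssid: str          # from the beacon or probe response (Table 1)
    ssid: str
    channel: int
    signal_dbm: float   # observed by the client (Table 2)
    last_seen_s: float  # client-local time the entry was last refreshed


class ScanningTable:
    """Hypothetical client-side scanning table with a simple staleness rule."""

    def __init__(self, max_age_s: float = 10.0):
        self.max_age_s = max_age_s
        self.entries: Dict[str, ScanEntry] = {}

    def update(self, entry: ScanEntry) -> None:
        # A newer beacon or probe response replaces the old entry for that BSSID.
        self.entries[entry.bssid] = entry

    def fresh(self, now_s: float) -> List[ScanEntry]:
        return [e for e in self.entries.values()
                if now_s - e.last_seen_s <= self.max_age_s]

    def best_candidate(self, ssid: str, now_s: float) -> Optional[ScanEntry]:
        # Assumed policy: strongest signal among fresh entries for the SSID.
        candidates = [e for e in self.fresh(now_s) if e.ssid == ssid]
        return max(candidates, key=lambda e: e.signal_dbm, default=None)
```

Calling `best_candidate()` with a table that holds only stale entries returns `None`, which is the situation in which a client has no choice but to scan again before it can hand off.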