VM Junkie

March 3, 2010

Anatomy of the DevTap bug (PC over IP)

Filed under: view, vmware — Justin Emerson @ 11:53 am

Hello all, thought I’d post a little interesting info here.

In ESX 4.0 update 1 (with no patches), there is an interesting bug that affects PCoIP connections on View. I thought I’d illuminate a little information about this bug as I think it does a good job of illustrating how the PCoIP software sender works with VMware’s underlying hypervisor.

In VMware ESX 3.5 Update 4 and vSphere 4.0 Update 1, there is a virtual device that’s part of each VM’s hardware called DevTap. DevTap is a way for other processes to “tap” into the device layer, specifically video and audio. In the case of PCoIP, it’s used to do screen scraping (for the PCoIP sender) as well as audio. You can see an example of this by looking at Device Manager when you connect to a VM with PCoIP.

The audio stuff isn’t terribly interesting, but what it does with the information obtained from the video card is.

As described in Scott Davis’s blog article here, PCoIP will use different kinds of codecs for different types of objects. It will also prioritize things like the currently selected window. But how does it obtain this information?

Most remote display protocols either run in the Presentation layer or scrape data directly from the video buffer. An example of the former is RDP: it reads the data that is supposed to be drawn and draws it to a virtual video device. This is why you can run RDP at a higher resolution than the physical video card on the machine could support. An example of the latter would be VNC, which reads the data that’s in the video card’s frame buffer and sends that over the wire. RDP’s advantage in this case is that it gets data earlier in the drawing process, so it can be more descriptive with lines and shapes (write text here, draw a circle here), whereas VNC can only go by what’s already been drawn, so it has to work with bitmaps and images, not instructions on what to draw. On the other hand, because VNC is only scraping the screen, it’s completely platform independent. That’s why you can get VNC servers for Windows, Mac, Linux… practically anything.
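To make the difference concrete, here is a toy sketch in Python. It is not how RDP or VNC are actually implemented; the 8x8 "screen", the FillRect command and the tile size are all made up for illustration. It just shows that a command-style sender can ship the instruction itself, while a scraper only learns which tiles of pixels changed.

```python
from dataclasses import dataclass

WIDTH, HEIGHT, TILE = 8, 8, 4  # toy 8x8 screen split into 4x4 tiles (assumed sizes)

@dataclass(frozen=True)
class FillRect:
    """A presentation-layer style instruction: 'draw a rectangle here'."""
    x: int
    y: int
    w: int
    h: int
    color: int

def render(frame, cmd):
    """Apply a drawing command to a framebuffer and return the new frame."""
    new = [row[:] for row in frame]
    for yy in range(cmd.y, cmd.y + cmd.h):
        for xx in range(cmd.x, cmd.x + cmd.w):
            new[yy][xx] = cmd.color
    return new

def dirty_tiles(before, after):
    """Scraper's view: compare two frames and list the tiles that changed."""
    changed = []
    for ty in range(0, HEIGHT, TILE):
        for tx in range(0, WIDTH, TILE):
            if any(before[y][x] != after[y][x]
                   for y in range(ty, ty + TILE)
                   for x in range(tx, tx + TILE)):
                changed.append((tx, ty))
    return changed

blank = [[0] * WIDTH for _ in range(HEIGHT)]
cmd = FillRect(x=1, y=1, w=2, h=2, color=7)
drawn = render(blank, cmd)

# Command-style sender: ships the instruction (five small integers).
command_payload = cmd
# Scraper-style sender: only knows which tiles differ, so it ships their pixels.
scraped_payload = dirty_tiles(blank, drawn)
```

The scraper never sees "a rectangle"; it just sees that one tile has different pixels than before, which is why it ends up working with bitmaps.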

PCoIP is somewhere in the middle. The actual protocol itself was originally a pure hardware solution, so at its core PCoIP is a screen scraping solution like VNC. The difference is that PCoIP has many different algorithms it can use to encode different parts of the screen, so a lot of work goes into figuring out which codec is best for a particular region of the screen. Again, see the linked blog article above for details on this. VMware’s software implementation, though, is Windows only and will not work except on a virtual machine running on ESX 3.5u4+ or 4.0u1+. This is because the screen scraping is actually being given “hints” from farther up in the stack. One of the interesting behaviors of PCoIP is that it’s aware of where your mouse cursor is and what your selected window is. This means that the window in focus will get updates faster and at a higher priority than background elements. How does it know what you’re doing? That’s from the hints being sent to the PCoIP sender from the View Agent, which does have information from the Presentation layer.
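As a rough illustration of what a focus hint could buy a screen scraper, here is a small Python sketch. To be clear, this is my own guess at the general idea, not PCoIP’s actual algorithm: the Rect type, the order_updates function and the coordinates are all hypothetical. It simply sends changed regions that overlap the focused window before everything else.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    x: int
    y: int
    w: int
    h: int

    def intersects(self, other):
        """True if the two rectangles overlap by at least one pixel."""
        return (self.x < other.x + other.w and other.x < self.x + self.w and
                self.y < other.y + other.h and other.y < self.y + self.h)

def order_updates(dirty, focused_window):
    """Put dirty regions that overlap the focused window first.

    sorted() is stable, so regions keep their original order within
    each priority group.
    """
    return sorted(dirty, key=lambda r: 0 if r.intersects(focused_window) else 1)

# Hypothetical hint from the agent: the window currently in focus.
focused = Rect(100, 100, 400, 300)

# Hypothetical regions the scraper found to have changed this frame.
dirty = [
    Rect(0, 0, 64, 64),      # background, top-left corner
    Rect(120, 150, 64, 64),  # inside the focused window
    Rect(600, 500, 64, 64),  # background, bottom-right
    Rect(450, 350, 64, 64),  # overlaps the focused window's corner
]

ordered = order_updates(dirty, focused)
```

Without the hint, the scraper would have no reason to treat any of those four regions differently, since a frame buffer has no idea which pixels belong to the window you care about.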

After it gets the hints and knows where stuff is, the PCoIP sender service then accesses the DevTap device to read the contents of the video buffer. This has the effect of being able to see what’s coming out of the “virtual monitor” in a way that’s more efficient than reading directly from the frame buffer. This is why the software PCoIP sender is only available in a VM – because it needs the DevTap device to access the virtual monitor.

Once the PCoIP connection is made, a “monitor blanking” command is sent to the virtual monitor so that someone with the vSphere client can’t snoop on someone’s virtual machine. Unfortunately, this is where the bug comes in. Sometimes, when the monitor blanking command is issued, or when you resize your PCoIP window and it changes the resolution of the VM, the DevTap device crashes. This is so bad that the only way to recover from it is to kill the VM’s process or reboot the ESX host. Resetting or powering off the VM will hang at 95%, as outlined in this forum thread. A user in that thread recently said that the newest patches for ESX 4.0u1 in January may fix this bug; however, I’ve seen it occur after installing this patch, and at VMware Partner Exchange they were still fighting this bug in the View lab sessions, so I’m not sure it’s resolved yet. It is, however, a very rare occurrence, so I wouldn’t lose much sleep over it. I just thought that outlining the cause of this bug would be a good opportunity to explain some of the reasons why PCoIP is so clever.

DISCLAIMER: I am not a VMware engineer. Some of this information has been obtained from sources in bits and pieces, and some of it was just from deduction and putting the pieces together. The definitive source for all info on PCoIP is Warren Ponder, whose blog is here.

UPDATE: And as of today (March 3rd) VMware JUST released a series of patches for ESX and ESXi 4.0, one of which fixes this particular bug. From the KB article:

Changing the resolution of the guest operating system over a PCoIP connection (desktops managed by View 4.0) might cause the virtual machine to stop responding.
Symptoms: The following symptoms might be visible:

  • When you try to connect to the virtual machine through a vCenter Server console, a black screen appears with the Unable to connect to MKS: vmx connection handshake failed for vmfs {VM Path} message.
  • Performance graphs for CPU and memory usage in vCenter Server drop to 0.
  • Virtual machines cannot be powered off or restarted.

So get patchin’ folks!


1 Comment

  1. Nice article… VMJunkie is one of my day-to-day followup.

    Comment by Steve Lavoie — March 4, 2010 @ 5:57 am
