The Ultimate Guide on USB Audio

Introduction

USB, or Universal Serial Bus, has been a standard interface on personal computers for more than two decades. Its ubiquity makes it easy to connect peripherals such as microphones, speakers, external drives, and webcams. This article focuses on USB audio, a digital audio standard for personal computers, smartphones, and tablets that supports audio peripherals such as speakers, microphones, and mixers.

USB Basics

USB follows a protocol where the host computer initiates transfers to devices (like USB speakers). Each transfer targets a specific device and a specific endpoint on that device. There are four types of transfers: bulk transfers, isochronous transfers, interrupt transfers, and control transfers.

Bulk transfers are used to reliably transfer data between the host and the device. All USB transfers include a CRC (checksum) to indicate if an error has occurred. In bulk transfers, the recipient must check the CRC. If the CRC is correct, the transfer is acknowledged, and the data is considered error-free. If the CRC is incorrect, the transfer is not acknowledged and will be retried. If the device is not ready to accept data, it can send a negative acknowledgment, NAK, causing the host to retry the transfer. Bulk transfers are not considered time-critical and are scheduled around the more time-sensitive transfer types discussed below.

Isochronous transfers are used to transfer data in real time between the host and the device. When the host sets up an isochronous endpoint, it reserves a specific amount of bandwidth for that endpoint and performs input or output transfers at regular intervals. For example, the host might output 1K bytes of data to the device every 125 microseconds. Since a fixed and limited bandwidth is allocated, there is no time to resend data if a transfer problem occurs. The data carries a normal CRC, but if the recipient detects an error, there is no retransmission mechanism.
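The contrast between bulk and isochronous error handling can be sketched in a few lines of Python. This is a toy model, not a USB stack: `zlib.crc32` stands in for the CRC16 that USB data packets actually carry, and the function names and the `corrupted_attempts` and `corrupted` parameters are hypothetical, chosen only for illustration.

```python
import zlib


def make_packet(payload: bytes) -> tuple[bytes, int]:
    """Attach a checksum to the payload (CRC32 as a stand-in for USB's CRC16)."""
    return payload, zlib.crc32(payload)


def crc_ok(data: bytes, crc: int) -> bool:
    """Recipient-side check: does the data still match its checksum?"""
    return zlib.crc32(data) == crc


def corrupt(data: bytes) -> bytes:
    """Flip every bit of the first byte to simulate a transmission error."""
    return bytes([data[0] ^ 0xFF]) + data[1:]


def bulk_transfer(payload: bytes, corrupted_attempts: int, max_attempts: int = 8):
    """Retry until the CRC checks out; returns (data, attempts used)."""
    for attempt in range(1, max_attempts + 1):
        data, crc = make_packet(payload)
        if attempt <= corrupted_attempts:  # simulated bad attempts
            data = corrupt(data)
        if crc_ok(data, crc):
            return data, attempt  # acknowledged, data is error-free
        # not acknowledged: the host tries again
    raise IOError("bulk transfer failed after retries")


def isochronous_transfer(payload: bytes, corrupted: bool):
    """One attempt only; a CRC failure means the data is simply lost."""
    data, crc = make_packet(payload)
    if corrupted:
        data = corrupt(data)
    return data if crc_ok(data, crc) else None


print(bulk_transfer(b"audio", corrupted_attempts=2))
print(isochronous_transfer(b"audio", corrupted=True))
```

In this model the bulk transfer succeeds on its third attempt, while the corrupted isochronous transfer is dropped; an audio device would typically have to conceal the gap itself.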

Interrupt transfers are used by the host to poll devices periodically and check whether something of interest has occurred. For example, the host might poll an audio device to check whether the MUTE button has been pressed. The name "interrupt transfers" can be confusing because they do not interrupt anything. However, regular polling provides functionality similar to host interrupts.

Control transfers are very similar to bulk transfers. Control transfers can be acknowledged, rejected, and delivered in a non-real-time manner. Control transfers are used for operations outside the normal data flow, such as querying device capabilities or endpoint status. Describing how device capabilities are reported is beyond the scope of this article, but we will say that there are predefined classes, such as "USB Audio Class" or "USB Mass Storage Class," that enable cross-platform interoperability.

All transfers are scheduled in units of USB frames. A full-speed frame lasts 1 ms; high-speed USB divides each frame into eight 125 μs microframes (this article uses "frame" loosely for both). Every frame or microframe is marked by a Start-of-Frame (SOF) packet sent by the host. An isochronous or interrupt endpoint is normally serviced at most once per frame.

USB Audio

USB audio uses isochronous transfers, interrupt transfers, and control transfers to transmit audio data. Isochronous transfers are used to transfer audio data, interrupt transfers are used to monitor the availability of the audio clock, and control transfers are used to adjust settings such as volume and sampling rate.

Figure 1: Transmission between host and USB device. Isochronous transfer input and output are used to transmit audio data, control transfers are used to set parameters, and interrupt transfers are used for status monitoring.

The data requirements of a USB audio system depend on the number of channels, the number of bits per sample, and the sampling rate. Typical channel counts are 2 (stereo), 6 (5.1), or more for studio and DJ applications. The typical sample size is 24 bits, with 16 bits used for traditional audio and 32 bits for high-quality audio. Typical sampling rates are 44.1, 48, 96, and 192 kHz, with the higher rates used for high-quality audio.

Suppose we design a stereo speaker with a sampling rate of 96 kHz and a sample size of 24 bits. To simplify data transmission on the host and device, the 24-bit value is usually padded with a zero byte, so the total data throughput is 96,000 x 2 channels x 4 bytes = 768,000 bytes/second. Isochronous endpoints operate at 125μs intervals—or 8000 transfers per second. Dividing the required bytes per second by the number of transfers per second gives the number of bytes per isochronous transfer: 768,000 / 8,000 = 96 bytes per transfer.
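The arithmetic above can be checked with a short Python sketch. The function name and its parameters are illustrative assumptions; the default of 8,000 transfers per second corresponds to the 125 μs interval used in the example.

```python
def stream_requirements(sample_rate_hz: int, channels: int,
                        bytes_per_sample: int,
                        transfers_per_second: int = 8000):
    """Return (bytes per second, bytes per isochronous transfer)."""
    bytes_per_second = sample_rate_hz * channels * bytes_per_sample
    return bytes_per_second, bytes_per_second / transfers_per_second


# Stereo, 96 kHz, 24-bit samples padded to 4 bytes each.
total, per_transfer = stream_requirements(96_000, 2, 4)
print(total, per_transfer)
```

Trying other combinations (for example 44.1 kHz) shows that the per-transfer figure is not always a whole number of samples, which is one reason the packet size has to be allowed to vary.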

USB Clock Synchronization

A significant challenge in digital audio is agreeing on a common notion of time. We said that USB frames occur 8,000 times per second and set the speaker to play 96,000 samples per second. This works only if the speaker and host agree on the length of one second. USB audio provides three modes to ensure the host and speaker are synchronized in time:

  • In synchronous mode, the length of one second is defined by the host. The host sends data at a given rate, and the device must match this rate exactly.
  • In asynchronous mode, the opposite is true: the device sets the definition of one second, and the host must match the device.
  • In adaptive mode, the data stream determines the clock.
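Although descriptors are outside the scope of this article, it may help to see where the three modes show up in practice. In the standard USB 2.0 endpoint descriptor, the `bmAttributes` byte encodes the transfer type in bits 1..0 and, for isochronous endpoints, the synchronization type in bits 3..2. The following Python sketch decodes those two fields; the function name and the string labels are illustrative assumptions.

```python
# Field values as laid out in the USB 2.0 standard endpoint descriptor.
TRANSFER_TYPES = ("control", "isochronous", "bulk", "interrupt")
SYNC_TYPES = ("none", "asynchronous", "adaptive", "synchronous")


def decode_bm_attributes(bm_attributes: int) -> tuple[str, str]:
    """Return (transfer type, synchronization type) for an endpoint."""
    transfer = TRANSFER_TYPES[bm_attributes & 0x03]      # bits 1..0
    sync = SYNC_TYPES[(bm_attributes >> 2) & 0x03]       # bits 3..2
    return transfer, sync


for value in (0x05, 0x09, 0x0D):
    print(hex(value), decode_bm_attributes(value))
```

The synchronization bits are only meaningful for isochronous endpoints; for the other transfer types they are expected to be zero.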

Adaptive and synchronous modes are less than ideal: personal computers are notoriously poor at maintaining a stable clock, and a setup often involves other audio sources, such as external digital decks, that bring clocks of their own. In asynchronous mode, the master clock is either an external clock source or a low-jitter clock in the device. Both typically rely on crystal-based PLLs.

Thus, there are at least two independent clocks in the system: one is the USB clock driven by the host at 8,000 transfers per second, and the other is the sampling clock driven by an external source at, for example, 96,000 Hz.

These two clocks will differ slightly in frequency and will drift over time. As a result, the average number of audio samples per USB frame will be slightly more or less than the nominal value. For example, at our 96,000 Hz sampling rate the nominal value is 12 samples per 125 μs frame, but the measured average might be 12.001. To ensure the host sends the correct amount of data, neither too much nor too little, the host requests the current sampling rate through an interrupt endpoint. Every few milliseconds, the device reports the average rate over the preceding interval as a 16.16 fixed-point number. If the average was 12.001 samples per frame, a value of 0x000C0041 (65536 × 12.001 ≈ 786,497.5, truncated to an integer) will be reported.
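The 16.16 fixed-point conversion can be illustrated with a minimal Python sketch. The function names are hypothetical, and truncation (rather than rounding) is assumed because it reproduces the value quoted above.

```python
def to_fixed_16_16(samples_per_frame: float) -> int:
    """Encode a rate as 16.16 fixed point: 16 integer bits, 16 fractional bits."""
    return int(samples_per_frame * 65536)  # int() truncates toward zero


def from_fixed_16_16(value: int) -> float:
    """Decode a 16.16 fixed-point value back to samples per frame."""
    return value / 65536


feedback = to_fixed_16_16(12.001)
print(f"0x{feedback:08X}")            # 0x000C0041
print(from_fixed_16_16(feedback))     # slightly below 12.001 due to truncation
print(feedback.to_bytes(4, "little").hex())  # byte order on the wire: 41000c00
```

The upper 16 bits (0x000C) hold the whole number of samples, 12, and the lower 16 bits (0x0041) hold the fraction in units of 1/65536. USB transmits multi-byte fields in little-endian order, hence the byte sequence in the last line.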

With this average rate, the host can calculate when to send an extra sample during transfers; in this example, eight transfers per second will carry an extra sample. Additionally, the host can use this value to synchronize itself with the audio device. This allows host applications, such as DVD players, to keep video and audio in sync. Without this, the audio would gradually run ahead of the video, and after two hours the sound would be roughly 0.6 seconds ahead of the picture (8 samples per second × 7,200 seconds = 57,600 samples at 96 kHz).
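One common way for a host to turn the reported rate into per-transfer sample counts is a fractional accumulator. The Python sketch below assumes that approach; it is not taken from any particular driver, and the function name is hypothetical.

```python
def samples_per_transfer(feedback_16_16: int, transfers: int) -> list[int]:
    """Decide how many samples to send in each of `transfers` transfers.

    The fractional part of the 16.16 feedback value is accumulated; each
    time it overflows into the integer part, one extra sample is sent.
    """
    counts = []
    accumulator = 0
    for _ in range(transfers):
        accumulator += feedback_16_16
        counts.append(accumulator >> 16)   # whole samples due in this transfer
        accumulator &= 0xFFFF              # keep only the leftover fraction
    return counts


counts = samples_per_transfer(0x000C0041, 8000)  # one second of transfers
print(sum(counts), counts.count(13))
```

Because 0x000C0041 is a truncated encoding of 12.001, this sketch sends 13 samples in 7 of the first 8,000 transfers (96,007 samples in total), which is close to the eight extra samples per second that an exact 12.001 would give.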

To maintain a short feedback loop, the trick is to avoid unnecessarily buffering audio and feedback packets. Any extra buffering introduces delay in the report, making it more challenging to maintain smooth flow. This means the underlying USB protocol stack and USB audio protocol stack should be tightly coupled, with no intermediate buffering. While this is challenging to achieve on application processors, it is easily accomplished if the software is implemented on embedded processors with predictable execution times.

Overall, maintaining a consistent notion of time is crucial in digital audio. USB audio provides three modes (synchronous, asynchronous, and adaptive) to keep the host and peripherals synchronized. Asynchronous mode is generally the most robust, because the sampling clock comes from an external source or a low-jitter clock in the device rather than from the host.

Multiple Clock Sources

In complex devices like mixers, several interfaces may each supply a sampling clock. USB audio allows designers to implement a clock selector that chooses which of these input clocks provides the sampling rate.

The clock selector has multiple input clocks (for example, a clock recovered from an S/PDIF connection, a local oscillator, and a clock recovered from an ADAT connection). Through control transfers, the user can select which one to use, such as the clock from the S/PDIF connection.
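Conceptually, a clock selector is a switch with several inputs and one output. The following Python sketch models that idea only; the class names, the zero-based indexing, and the `valid` flag are assumptions made for illustration and do not reproduce the USB Audio Class 2.0 request format.

```python
from dataclasses import dataclass


@dataclass
class ClockSource:
    name: str
    valid: bool  # whether a usable clock is currently present on this input


class ClockSelector:
    """Chooses which input clock provides the sampling rate."""

    def __init__(self, sources: list[ClockSource]):
        self.sources = list(sources)
        self.selected = 0

    def select(self, index: int) -> None:
        """Switch to another input, as a control transfer might request."""
        if not 0 <= index < len(self.sources):
            raise ValueError("no such clock input")
        if not self.sources[index].valid:
            raise ValueError("selected clock input is not valid")
        self.selected = index

    @property
    def current(self) -> ClockSource:
        return self.sources[self.selected]


selector = ClockSelector([
    ClockSource("local oscillator", valid=True),
    ClockSource("S/PDIF input", valid=True),
    ClockSource("ADAT input", valid=False),  # nothing plugged in
])
selector.select(1)
print(selector.current.name)
```

Rejecting an input with no valid clock is a design choice in this sketch; a real device might instead fall back to its local oscillator.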

Compliance and Native Support

Compliance with the USB Audio Class 2.0 standard enables seamless integration of devices with operating systems, allowing control of parameters such as volume and sampling rate through standard OS dialogs.

Conclusion

With the advantages of high-speed USB 2.0, USB Audio Class 2.0 achieves low-latency transmission between PCs and audio devices, ensuring high throughput and excellent sound quality. USB audio class is suitable for various devices, from complex mixers to surround sound systems, PC speakers, and microphones.

FAQ

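Q: What types of USB transfers are used in USB audio?
A: USB audio uses isochronous transfers for the audio data, interrupt transfers for status monitoring, and control transfers for settings such as volume and sampling rate. Bulk transfers, the fourth USB transfer type, are not part of this set.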

Q: How does USB audio ensure synchronization between the host and peripheral devices?
A: USB audio ensures synchronization by providing three modes: synchronous, asynchronous, and adaptive. Among these, asynchronous mode is generally the most robust, because the sampling clock comes from an external source or a low-jitter clock in the device rather than from the host.

Q: How many clocks are typically used in a USB audio system?
A: A USB audio system typically uses at least two clocks: a USB clock and an external clock for the sampling rate.

Q: What is the function of the clock selector in a USB audio device?
A: The clock selector in a USB audio device allows the user to choose the input clock source, which provides the sampling rate.

Q: Why is adherence to the USB Audio Class 2.0 crucial for seamless integration into operating systems?
A: Adherence to the USB Audio Class 2.0 standard ensures that the device seamlessly integrates into the operating system and allows control over parameters like volume and sampling rate through standard OS dialogs.

Integration Support

In China, XMOS's partner, Pawpaw Technology, provides localized support and solution evaluation for your projects. For assistance, please visit pawpaw.cn or email sales@pawpaw.cn.

In summary, USB audio is a versatile and reliable digital audio transmission standard that offers high fidelity and low latency. By understanding the complexities of USB audio, users can fully leverage its features to achieve a seamless audio experience.