Enterprise-Grade Media Encoding Using the Yo Smartphone App: A White Paper

by Vijith Assar

A Guide To Converting Empty Transients Into Stateful Bits


  1. Abstract
  2. Precedents and Prior Art
    2.1) Pulse Code Modulation
    2.2) The Nyquist Theorem
    2.3) Sony® Direct Stream Digital™
  3. Format Design
  4. Obstacles
  5. Credits


Yo is a single-purpose mobile app which has become very popular even though all it can do is send empty notifications that carry no content (see Figure 1). Much of its use has taken the form of a lightweight social network of sorts, in which friends and acquaintances exchange notifications and then separately interpret the implied meaning.


Figure 1: Yo mobile app

Because a notification mechanism is more useful than the app’s unusual marketing successes might suggest, participants in events such as coding competitions have tried to put the app’s functionality to more pragmatic use, and some of the results are now offered as services inside the app. Yo was also recently added as a supported output mechanism by IFTTT, a popular internet service which allows users to aggregate other services and create customized new multiplatform functionality. Similarly, this paper puts forth a theoretical specification for a high-resolution file format for audio storage which could be built atop Yo and would outperform most existing encoders, including uncompressed WAV and AIFF as well as compressed FLAC, MP3, AAC, and other algorithms such as those used by streaming music services.


In order to understand the specification for a proposed Yo-based encoder, one must first understand the following three basic principles of digital audio.


Pulse code modulation (hereafter, PCM) is the most common method of representing audio. It is stored directly in formats like WAV, and it is what compressed formats like MP3 and AAC decode to for playback. It works by taking a series of successive measurements called “samples,” each of which is a snapshot of the sound wave at a specific point in time. These can then be re-assembled and played back to create a replica of the sound wave over time.

It may help to visualize PCM as a sort of “graph paper” (see Figure 2) onto which the sound wave is to be drawn, progressing from left to right across the grid. The rows are bits, stacked vertically to provide the detail in each individual snapshot. The columns are samples, one following after the next to provide the detail over time.
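The graph paper picture can be sketched in a few lines of Python. This is a minimal illustration, not part of any proposed format: the function name pcm_encode and its parameters are our own, and it assumes a pure sine tone as the input.

```python
import math

def pcm_encode(frequency_hz, duration_s, sample_rate=44_100, bit_depth=16):
    """Sample a sine wave and round each snapshot to a signed integer.

    sample_rate sets how many columns the grid has per second;
    bit_depth sets how many rows are available in each column.
    """
    max_level = 2 ** (bit_depth - 1) - 1  # largest positive value, 32767 at 16 bits
    n_samples = round(duration_s * sample_rate)
    return [
        round(max_level * math.sin(2 * math.pi * frequency_hz * n / sample_rate))
        for n in range(n_samples)
    ]

# Ten milliseconds of a 440 Hz tone on the default grid.
samples = pcm_encode(440, 0.01)
```

Each integer in the list is one column of the grid; playing them back at the same rate approximately reconstructs the original tone.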


Figure 2: Pulse Code Modulation

So-called “CD quality” audio is PCM which uses 16-bit samples and takes 44,100 of them per second; the latter number is called the “sample rate.” High-resolution contexts such as professional recordings often use 24-bit samples and higher sample rates ranging from 88,200 to 192,000 samples per second.
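Multiplying the two dimensions of the grid gives the raw data rate. The helper below is a hypothetical illustration; the channels parameter is an assumption of ours, since the text above describes a single stream.

```python
def pcm_bits_per_second(sample_rate, bit_depth, channels=1):
    """Raw PCM data rate: one bit_depth-sized sample per channel per tick."""
    return sample_rate * bit_depth * channels

cd_mono = pcm_bits_per_second(44_100, 16)                # 705,600 bits per second
cd_stereo = pcm_bits_per_second(44_100, 16, channels=2)  # 1,411,200 bits per second
high_res = pcm_bits_per_second(192_000, 24)              # 4,608,000 bits per second
```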


Sound waves are defined by oscillation, with higher frequencies such as high-pitched keys on the far right of a piano oscillating more rapidly than lower frequencies. In order to detect that something has oscillated back and forth, you have to measure it at least twice. Thus, the Nyquist Theorem states that in order to avoid unwanted audible distortions, the sample rate of a PCM audio file must be at least twice as fast as the highest frequency it contains. Further generalizing this point, increasing the sample rate at which a waveform is digitally measured also increases the amount of information that can be conveyed.
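The distortion in question is called aliasing: a tone above half the sample rate is recorded as a different, lower tone. The sketch below uses the standard folding formula for a pure tone; the function names are ours.

```python
def nyquist_limit(sample_rate):
    """Highest frequency a given sample rate can capture without aliasing."""
    return sample_rate / 2

def apparent_frequency(frequency_hz, sample_rate):
    """Frequency a sampled pure tone appears to have once aliasing is applied."""
    nearest_multiple = sample_rate * round(frequency_hz / sample_rate)
    return abs(frequency_hz - nearest_multiple)

# A 1,000 Hz tone survives CD-rate sampling unchanged, but a 30,000 Hz tone
# lies above the 22,050 Hz limit and comes out at 14,100 Hz instead.
safe = apparent_frequency(1_000, 44_100)
aliased = apparent_frequency(30_000, 44_100)
```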


Throughout the short history of digital audio, PCM has been the leader, but a serious competitor finally emerged in 1999 when Sony introduced its Direct Stream Digital™ (hereafter, DSD) method of audio encoding. DSD never really took off commercially beyond the short-lived Super Audio Compact Disc album format, but for a while in the mid-2000s it captured the attention of audio technology professionals, many of whom thought it sounded significantly better than PCM.

DSD is a completely different paradigm from PCM. The individual samples are only 1-bit, each operating as a binary on/off switch with no capacity to represent any detail in measurement. However, the sample rate is considerably higher than PCM — 2,822,400 samples per second for conventional use, or up to 22,579,200 samples per second for high-resolution contexts like professional recordings.

With DSD encoding the wave is represented by the delta of timing changes between state flips of the one-bit sample; the speed at which the on/off value of the bit is changing is equivalent to the slope of the waveform curve (see Figure 3).


Figure 3: Sony® Direct Stream Digital™

A single bit can thus be used to represent a very complex sound wave, provided that the sample rate is extremely fast.
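To make the one-bit idea concrete, here is a first-order delta-sigma modulator, a standard textbook way of producing a 1-bit stream. It is an illustration only: it is not Sony's specification, production converters use more elaborate designs, and the function names are ours. In this sketch the proportion of "on" bits over a stretch of time tracks the level of the input.

```python
def delta_sigma_encode(samples):
    """Turn samples in the range -1.0..1.0 into a stream of single bits."""
    integrator = 0.0  # accumulated difference between input and output so far
    feedback = 0.0    # previous output bit, expressed as -1.0 or +1.0
    bits = []
    for sample in samples:
        integrator += sample - feedback
        bit = 1 if integrator >= 0 else 0
        feedback = 1.0 if bit else -1.0
        bits.append(bit)
    return bits

def average_level(bits):
    """Estimate the signal level from the density of 1 bits."""
    return sum(2 * b - 1 for b in bits) / len(bits)

# A steady input of 0.5 yields a stream in which three bits out of four are on.
bits = delta_sigma_encode([0.5] * 1000)
level = average_level(bits)
```

Averaging over a window is the crudest possible decoder; it only works because the bits arrive far faster than the signal changes, which is why the sample rate has to be so high.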


A Yo-based audio encoder would largely just re-implement DSD using the app’s notification system. However, unlike the single bit required for DSD, Yo’s notifications are stateless transient events which disappear once they’ve been received, rather than stable bits which can retain their on/off state. This problem can be rectified through the use of a second complementary notification.

The second notification can be obtained by simply doubling the sample rate; per the Nyquist Theorem, this increases the amount of information that can be encoded. Two stateless Yo events can then be used to store a persistent bit value.

The transformation from Yo event to bit state can happen in one of two ways. The first option involves interpolating two Yo events into a single oscillation of a one-bit square wave, with the second Yo validating the first. The second option uses the first Yo to convey timing, and then uses the presence or absence of the second expected Yo to convey the on/off status of the bit.
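The second option can be sketched as follows. Everything here is hypothetical: the slot numbering, the function names, and the assumption that every Yo arrives in the slot it was sent in. Each bit gets two consecutive slots, an even one for the timing Yo and an odd one for the optional data Yo.

```python
def encode_presence(bits):
    """Return the slot numbers in which a Yo must be sent."""
    slots = []
    for index, bit in enumerate(bits):
        slots.append(2 * index)          # timing Yo, sent for every bit
        if bit:
            slots.append(2 * index + 1)  # data Yo, sent only for an "on" bit
    return slots

def decode_presence(slots):
    """Rebuild the bit values from the slots in which a Yo was received."""
    received = set(slots)
    bit_count = max(received) // 2 + 1 if received else 0
    bits = []
    for index in range(bit_count):
        if 2 * index not in received:
            raise ValueError(f"timing Yo missing for bit {index}")
        bits.append(1 if 2 * index + 1 in received else 0)
    return bits

original = [1, 0, 0, 1, 1, 0]
recovered = decode_presence(encode_presence(original))
```

Because the timing Yo is always present, a run of "off" bits is still distinguishable from silence, and a lost timing Yo is detected rather than silently misread.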

With either strategy, Nyquist-like principles from PCM can be used to increase the bandwidth of the waveform and attach to each Yo an additional data point, which is itself also conveyed with a second Yo. This turns the app’s empty notifications into bit-like constructs which reliably convey a state, and those pseudo-bits can then be used to build an audio signal using the DSD model of waveform representation.



Audio playback devices rely on an internal or external component called a “word clock” to manage the speed at which samples are read. Naturally, high sample rates require high-end word clocks. Widespread consumer adoption of Yo-based audio would depend on the development of a low-cost, high-speed word clock which could be easily incorporated in common devices such as phones and personal audio players. The built-in audio decoding hardware in such devices typically supports only PCM and its lower sample rates.


Clearly, any service which depends on aggregation of Yos must first be able to access them. Yo is first and foremost an app for mobile devices, but mobile operating systems typically mediate direct access to push notifications such as the Yos. Encoding audio directly on a mobile device in response to incoming Yos would therefore require either native support within the Yo app, or else a new software layer which can route push notifications from the Yo app to the companion encoder. The latter arrangement would require deep hooks into the operating system and thus could likely only be developed with vendor cooperation. Barring support within the Yo app or the cooperation of smartphone vendors, encoding could still happen remotely and finished files could be transferred to the device upon completion, after which the platform’s native audio handling and existing tools such as JACK and Audiobus could route the output stream as necessary.


Yo does provide a data exchange endpoint for developers to use when building tools that integrate with the service, such as the proposed new audio format. However, it is also customary for technology companies to limit the rate at which such endpoints can be accessed, so as to prevent spam and abuse. The current service rate limit allows third-party tools to send one Yo per minute, equivalent to roughly 0.0167 Yos per second (hereafter, YPS). The proposed Yo-based audio format requires a DSD-level sample rate which has been doubled to allow Nyquist-based bit state construction and validation. This works out to at least 5.64 million YPS, or up to 45.2 million YPS for high-resolution contexts like professional recordings. Our inquiries to the app’s parent company about raising the maximum rate limit for sending Yos by 270,950,399,900 percent were not acknowledged. Thus, the primary obstacle to a useful Yo-based digital audio format is Yo itself.
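The rates involved follow from the DSD sample rates quoted earlier and the one-Yo-per-minute limit, and can be checked with exact arithmetic. This is a small sketch; the constant and function names are ours.

```python
from fractions import Fraction

DSD_RATE = 2_822_400               # samples per second, conventional DSD
DSD_RATE_HIGH_RES = 22_579_200     # samples per second, high-resolution DSD
YOS_PER_BIT = 2                    # two stateless Yos per persistent bit
RATE_LIMIT_YPS = Fraction(1, 60)   # one Yo per minute, kept as an exact fraction

def required_yps(dsd_rate):
    """Yos per second needed to carry a DSD stream at the given sample rate."""
    return dsd_rate * YOS_PER_BIT

def percent_increase(target, current):
    """Percentage by which `current` must be raised to reach `target`."""
    return (Fraction(target) - current) / current * 100

minimum = required_yps(DSD_RATE)
maximum = required_yps(DSD_RATE_HIGH_RES)
shortfall = percent_increase(maximum, RATE_LIMIT_YPS)
```

Using Fraction rather than floating point keeps the one-sixtieth rate exact, so the final percentage is not affected by rounding.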

[Yo-based Encoder Company name pending] is currently seeking 1.5 million dollars in seed-stage venture funding.


Vijith Assar (@vijithassar) is a writer and programmer who previously worked as an audio engineer, recently helped build custom media archival software for Bob Dylan, and occasionally contributes to the recording studio industry magazine Tape Op.