Limitations of Ogg Framing and Possible Extensions to the Format

Alex Stewart <alex@foogod.com>

October 14, 2002

Author's Note: The following description of the Ogg format is based on my understanding of the published documentation. I have not compared it against any actual implementation (I don't have time to dig through a bunch of code to figure out what it does), and the Ogg documentation isn't stellar, so I may have gotten something wrong. If so, feel free to let me know.

This is not intended to be a criticism of the Ogg format. Ogg is a specialized format which is well designed for certain applications, such as Vorbis streaming and storage. This document is simply intended to highlight some of the problems that the Ogg format might encounter when being used for other applications than its current common uses, and discuss what would be needed to make the format work in some more general situations.

This document only discusses issues regarding the framing of data. There are other issues involving stream identification and per-stream metadata (stream headers, start/stop info, indexing, etc) which might require some enhancement from the current Ogg specification, but are not covered by this analysis.

This document analyzes the Ogg framing format for multimedia data, identifies some of the commonly acknowledged shortcomings of the format for certain types of applications, and proposes possible changes to the format to remove these limitations. It then compares the result to existing general-purpose multimedia encapsulation formats, to determine what useful changes could be made to those formats, or whether an improved Ogg framing format would be useful as an alternative to them.

A Brief Description of Ogg

The Ogg format consists of a series of "pages". Each page contains a header, and a series of "segments". Each segment contains raw stream data and can be between 0 and 255 bytes in length.

The page header contains the following information:

  1. An identifying tag "OggS" (this is used for locating the next page if the stream is corrupted)
  2. Stream structure version (1 byte)
  3. flags (1 byte)
  4. "absolute granule position" (8 bytes)
  5. Stream ID (4 bytes)
  6. Page sequence number (4 bytes)
  7. CRC (4 bytes)
  8. Segment count (1 byte) and segment sizes ("segment count" bytes)

This is then immediately followed by "segment count" segments' worth of data, each segment being the length specified by its entry in the "segment sizes" array.
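As a rough illustration, the header layout listed above can be read with a few lines of Python. This is only a sketch: the little-endian byte order and the signed granule position are taken from the published Ogg framing documentation rather than from the description here, the name parse_page and the dictionary keys are made up for this example, and the CRC is read but not verified.

```python
import struct

# Fixed part of the page header, in the order listed above: tag, version,
# flags, granule position, stream ID, sequence number, CRC, segment count.
# Assumption: little-endian fields, granule position treated as signed.
_FIXED = struct.Struct("<4sBBqIIIB")  # 4+1+1+8+4+4+4+1 = 27 bytes


def parse_page(buf, offset=0):
    """Parse one page at `offset`; return (header, payload, next_offset)."""
    tag, version, flags, granule, stream_id, seq, crc, nsegs = \
        _FIXED.unpack_from(buf, offset)
    if tag != b"OggS":
        raise ValueError("not at a page boundary")

    # The segment size table follows the fixed fields, one byte per segment.
    table_start = offset + _FIXED.size
    seg_sizes = list(buf[table_start:table_start + nsegs])
    data_start = table_start + nsegs
    data_end = data_start + sum(seg_sizes)
    if len(seg_sizes) != nsegs or data_end > len(buf):
        raise ValueError("truncated page")

    header = {
        "version": version,
        "flags": flags,
        "granule_position": granule,
        "stream_id": stream_id,
        "sequence": seq,
        "crc": crc,  # read only; not checked in this sketch
        "segment_sizes": seg_sizes,
    }
    return header, bytes(buf[data_start:data_end]), data_end
```

Calling parse_page repeatedly with the returned next_offset walks an intact stream page by page; resynchronizing after corruption (searching for the next "OggS" tag) is left out.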

The Ogg format also has a concept of "packets", which are essentially independent of pages. Packets are not required to start or stop on page boundaries: a single packet may span multiple pages, and multiple packets may be contained within the same page. Packets always start and end on segment boundaries, and the end of a packet is indicated by a segment with fewer than 255 bytes in it.
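The packet rule just described (a packet ends at the first segment shorter than 255 bytes) can be sketched as follows. The function name split_packets and the "carry" convention for a packet that continues from the previous page are assumptions made for this example, not something the Ogg documentation prescribes.

```python
def split_packets(seg_sizes, payload, carry=b""):
    """Split one page's payload into packets.

    `carry` is the unfinished tail returned for the previous page of the
    same stream. Returns (complete_packets, unfinished_tail).
    """
    packets = []
    current = bytearray(carry)
    pos = 0
    for size in seg_sizes:
        current += payload[pos:pos + size]
        pos += size
        if size < 255:
            # A short segment (possibly of length 0) terminates the packet.
            packets.append(bytes(current))
            current = bytearray()
    # Whatever is left ended on a 255-byte segment and continues on the
    # next page of this stream.
    return packets, bytes(current)
```

Note that a packet whose length is an exact multiple of 255 needs a trailing zero-length segment to mark its end, which is why segment sizes of 0 are meaningful.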

Packets are the form in which data is passed to and from codecs. Packets are therefore basically equivalent to "frames" in most other conventions. Ogg is unusual in that the primary stream boundaries (pages) are not in any way related to data boundaries (packets/frames), while in most other formats these boundaries are highly correlated.

For a (slightly) more detailed description of the Ogg framing format, see the official documentation at http://xiph.org/ogg/doc/framing.html.

Advantages of the Ogg Framing Method

One of the biggest advantages to the Ogg approach over typical framing models is that very small packet sizes can be encapsulated with a minimum of overhead. This is often important for highly-compressed audio formats such as Vorbis.

Pages are CRC-checked to detect corruption, and can contain anywhere from 0 bytes to slightly under 64K of data, allowing flexibility in the granularity of integrity checking. As pages are not tied in any way to packet boundaries, the granularity of CRC checking can be determined completely independently from the size of the actual data chunks being manipulated.
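The figures behind these two advantages follow directly from the one-byte segment count and one-byte segment sizes. The arithmetic below is a back-of-the-envelope sketch; the constant and function names are illustrative, and the overhead estimate assumes a page filled entirely with equal-sized, non-empty packets.

```python
MAX_SEGMENTS = 255      # segment count is a single byte
MAX_SEGMENT_SIZE = 255  # each segment size is a single byte
FIXED_HEADER = 27       # sum of the fixed header fields listed earlier

# Largest possible page payload: "slightly under 64K".
MAX_PAYLOAD = MAX_SEGMENTS * MAX_SEGMENT_SIZE
MAX_PAGE = FIXED_HEADER + MAX_SEGMENTS + MAX_PAYLOAD


def lacing_bytes(packet_size):
    """Number of segment-size entries needed to describe one packet."""
    return packet_size // MAX_SEGMENT_SIZE + 1


def overhead_fraction(packet_size):
    """Header and lacing bytes per payload byte for a page packed full
    of equal-sized packets (packet_size must be greater than zero)."""
    per_packet = lacing_bytes(packet_size)
    count = MAX_SEGMENTS // per_packet
    overhead = FIXED_HEADER + count * per_packet
    return overhead / (count * packet_size)
```

Under these assumptions a full page carries at most 65025 bytes of data, and a page packed with 50-byte packets spends only about 2% as many bytes on header and lacing as it does on payload.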

Limitations of the Ogg Framing Method

"absolute granule position" very vaguely defined

As with some other areas of the Ogg specification, the meaning of the "absolute granule position" field (the one field in the header which could conceivably be useful for determining presentation timing and appropriate seek locations without decoding all the data) is defined in a codec-dependent manner, making it effectively impossible to design a codec-agnostic Ogg-stream-handling application that needs to know stream timing/synchronization information, and making the logic required to properly synchronize different format streams (such as audio and video) significantly more complicated.

No per-packet sequencing/presentation information (timecodes, etc)

Since packets do not correspond in any way to pages, and packets carry no information beyond the raw data and an implied length, there is no way to specify presentation information (timecodes, etc.) for an individual packet, or other useful information such as sequence numbers or ordering information. This information can be specified to some extent on a per-page basis, but since pages bear little or no correlation to packets, applications must often guess at the appropriate timing of a given packet within a multi-packet page. Page sequencing information is available (from the "page sequence number"), so it is possible to determine when a page has been lost, for example, but there is no way to know how many packets were lost with that page. This makes many types of generic stream manipulation difficult or impossible to do cleanly, and with some codecs could make correct presentation and error recovery significantly more difficult.

No way to identify "keyframes", or other per-packet info useful for stream handling

While not important for many audio codecs (such as Vorbis), keyframe information can be very important for long-term temporally-coded data formats such as most video data. For any application which wishes to seek within an Ogg stream containing such data, it is completely hit-and-miss whether it gets clean (or possibly even usable) results. This information really needs to be correlated with specific packets (frames), so simply adding it to the page header would not be sufficient.

Multiplexing granularity is tied to CRC granularity

Because CRCs and stream IDs are stored at the same level, it's not possible to have one CRC-protected block with data from multiple streams in it. Since multiple streams' data are often highly temporally correlated, and in complex scenarios may even be data-correlated (requiring data from one stream to accurately decode another), in many situations it would be desirable to multiplex streams with fairly fine granularity. Under the Ogg format, however, this necessarily requires the overhead of a lot of additional integrity-checking data which really isn't needed that frequently. (It's a minor issue, but as we'll see, it also fits in with other issues later.)

Altering the Ogg Framing Method

The following are some proposals for appropriate ways that the Ogg format could be modified if somebody were to wish to use it as a basis for more general applications without the above limitations. These are not necessarily the only ways to fix these problems, but are intended to be reasonable approaches to remedying the limitations without unduly encumbering the advantages already present in the format.

Absolute Granule Position

Perhaps the easiest issue to address is the vague definition of "Absolute Granule Position" in the page headers. To be properly useful as a header field for seeking and synchronization of streams, a value such as this really should be determined in a codec-independent way, preferably independent even of data type (audio, video, etc.), to aid in synchronization of disparate types without needing to be aware of stream details. The logical way to fix this would be to change this field to represent a time stamp, or "timecode", for presentation of the encoded data instead, with a suitably small granularity of values (millisecond precision should be adequate for any data in most multimedia applications). This allows it to serve a dual purpose: not only stream seeking and synchronization, but presentation scheduling as well, for applications which care about such things.

Per-Packet Meta-Information

There are a few ways to approach this. One would be to add header information to each segment, but this would be horribly inefficient. A better approach would be to add header information to the first segment of each packet, but this has several drawbacks, notably that it would significantly complicate parsing and would significantly increase the overhead for small-packet streams which do not require this sort of meta-information (such as Vorbis).

An obvious, and more easily parseable, approach would be to add support for this sort of information to the page header, and add constraints for correlation between pages and packets as it applies to packet-meta-information. One drawback to this would be that it would require a one-page-per-packet encoding for streams which require this information to be specified for every packet. Some of the data contained in the page header really does not need to be repeated for every packet (CRC checking, for example), even when such packets do require information like keyframe flags and timestamps.

This leads us neatly into another related limitation:

Multiplexing Granularity is Tied to CRC Granularity

A logical approach to add flexibility both for one-page-per-packet streams and for separating multiplexing and CRC granularity would be to separate the CRC checking data out from the page level and create a larger level above pages to contain integrity-checking information. This "cluster" level would allow CRC checking to still be done on an independent scale from other aspects of the stream, as appropriate for the application. Also, as locating the beginning of blocks by characteristic data (tags, etc) only makes sense at the same level as the CRC checking is performed, the initial tag ("OggS" or whatever) could be moved to the cluster level along with the CRC check (some other things like "stream structure version" would obviously make more sense here too). The cluster header would also need some way to identify exactly how many bytes the CRC applies to (since it's no longer tied to just one page), but at this larger scale a couple more bytes for an additional length code are not terribly significant overhead, so this is not necessarily a big problem.

Increasing Efficiency at the Page Level

With the need in some types of streams to have one header per packet, this naturally implies the need for one page per packet for such streams. In these situations, it makes sense to try to optimize the size of the page header to reduce the overhead disadvantage of this approach. If integrity checking information is moved to a larger layer, this is a good start, but there are a few other things that could be trimmed without significant penalties on the overall format.

For starters, 8 bytes for the "absolute granule position" might have been reasonable for some interpretations of that value, given the ambiguous definition in the Ogg specification, but with it formalized to a timecode value with millisecond precision, 64 bits is really a bit excessive. A 32-bit value allows for about 49.7 days of continuous stream duration (2^32 milliseconds). This, combined with a protocol for handling rollover appropriately, should be adequate for all potential streaming applications, and saves us several bytes in the header.
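The 49.7-day figure, and one possible way of handling rollover, can be sketched as follows. The rollover scheme shown here (treating each new timecode as the nearest value to the previous one, modulo 2^32) is purely an assumption for illustration, as no particular protocol is specified above; the names range_days and TimecodeUnwrapper are likewise hypothetical.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000
WRAP = 1 << 32  # number of distinct 32-bit timecode values


def range_days(bits):
    """Days of stream time covered by a millisecond counter of `bits` bits."""
    return (1 << bits) / MS_PER_DAY


class TimecodeUnwrapper:
    """Extend wrapping 32-bit millisecond timecodes to unbounded integers.

    Assumes consecutive timecodes are less than 2**31 ms (about 24.8 days)
    apart, so the shorter way around the circle is always the right one.
    """

    def __init__(self):
        self._last = None

    def feed(self, tc32):
        if self._last is None:
            self._last = tc32
        else:
            # Signed distance from the previous value, in [-2**31, 2**31).
            delta = (tc32 - self._last) & (WRAP - 1)
            if delta >= WRAP // 2:
                delta -= WRAP
            self._last += delta
        return self._last
```

A reader would feed each page's 32-bit timecode through one unwrapper per stream and use the returned value for seeking and scheduling, so small backward steps and a wrap past zero are both handled without special cases.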

It also seems rather unlikely that anybody would need more than 256 streams per file. The 4-byte stream ID code therefore seems a bit excessive. Reducing this to 1 byte would also save us a fair amount of header space.

Note that the most common case for needing one page per packet is video data, which also typically involves fairly large packet sizes (often on the order of 2-5K per packet). At these packet sizes, lacing information takes up a significant amount of space and, since we're constrained to one packet per page anyway, does not actually contribute any useful information over a simple length value. Replacing this information with a 16-bit length (indicated by a bit in the flags field) could save 10-20 (or more) bytes of header space for each packet.

This brings the page header size down to a fixed size of around 12 bytes. Note that this means that for packet sizes over about 2.5K, it's actually more efficient to represent them with one page per packet than it would be with Ogg-style lacing, and it has the added benefit of per-packet timecode and flagging information (not supported by Ogg) with no real disadvantages for such packet sizes.
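The break-even point can be checked with a little arithmetic. The sketch below assumes the 12 bytes break down as flags (1), timecode (4), stream ID (1), page sequence number (4) and length (2), which is my reading of the reductions proposed above, and it counts only the lacing bytes on the Ogg side. The Ogg page header itself, shared among all packets on a page, is not modelled and would add a little more to the Ogg cost.

```python
# Assumed field sizes for the reduced page header (in bytes):
# flags, timecode, stream ID, page sequence number, 16-bit length.
PROPOSED_HEADER = 1 + 4 + 1 + 4 + 2


def lacing_bytes(packet_size):
    """Segment-size entries Ogg-style lacing needs for one packet."""
    return packet_size // 255 + 1


def break_even():
    """Smallest packet size whose lacing bytes alone cost at least as
    much as the fixed per-packet header."""
    size = 1
    while lacing_bytes(size) < PROPOSED_HEADER:
        size += 1
    return size
```

Counting lacing alone, the two costs meet at 2805 bytes, which is in the same region as the "about 2.5K" figure; the amortized share of the Ogg page header pushes the real crossover somewhat lower.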

Summary and Observations Related to MCF

Taking into account the suggested changes above, we are left with a structure resembling the following:

  Cluster
    Identifying tag
    Stream structure version
    CRC
    Length
    Page
      Flags (including keyframe, etc.)
      Timecode (previously "absolute granule position")
      Stream ID
      Page sequence number
      Size (either lacing information or a length value)
      Segment
      ...
    ...

This modified framing arrangement retains almost all the current advantages of the Ogg format (indeed, in the simplified case of one page per cluster it is almost identical in functionality and overhead to the current Ogg framing model, with the exception of minor things like not supporting 4 billion multiplexed streams), while adding considerable flexibility and reasonably efficient solutions to the primary drawbacks of the current Ogg format.

The observant (and well versed) at this point may notice that we are approaching a framing format rather similar to that of "cluster data" within an MCF file. This was not intentional, but given this notable result it seems prudent to highlight the few differences between the two formats:

It is interesting to note that when logical extensions were made to the Ogg format to address some of the deficiencies it currently has for certain applications, the result came out fairly close to the current MCF implementation, even though, as far as I know, the current MCF spec was not based on an Ogg starting point (though it clearly does have influences, such as lacing, taken from portions of Ogg).

It would undoubtedly be possible to start from the current Ogg specification, and address the limitations in different ways from those described here to arrive at a format significantly different from MCF, but based on the above analysis it seems unlikely that such a format would have much in the way of significant advantages over what has been presented here.

As mentioned above, a few small issues were brought up as a result of comparing a modified Ogg format to the current MCF format (most notably allowing lacing to span blocks), but it appears that MCF framing is already pretty close to what one would need if one were to turn Ogg framing into a more generally useful format, and it therefore might make some sense to use MCF as a starting point for future development along these lines.