Guide To DTV Standards and Training: MPEG-4
CODING OF MOVING PICTURES AND AUDIO
Graphics courtesy of Moving Picture Experts Group

For even greater detail and to download full specifications from ISO: 
http://mpeg.telecomitalialab.com/standards/mpeg-4/mpeg-4.htm

Overview

MPEG-4 is an ISO/IEC standard developed by MPEG (Moving Picture Experts Group), the committee that also developed the MPEG-1 and MPEG-2 standards. These standards made interactive video on CD-ROM, DVD and Digital Television possible. MPEG-4 was finalized in October 1998 and became an International Standard in the first months of 1999. The fully backward compatible extensions under the title of MPEG-4 Version 2 were frozen at the end of 1999 and acquired formal International Standard status early in 2000. Several extensions have been added since, and work on some specific work items is still in progress.

MPEG-4 builds on the proven success of three fields: 

  • Digital television;
  • Interactive graphics applications (synthetic content);
  • Interactive multimedia (World Wide Web, distribution of and access to content) 

MPEG-4 provides the standardized technological elements enabling the integration of the production, distribution and content access paradigms of these three fields.


Table of Contents

1 Scope and features of the MPEG-4 standard
  1.1 Coded representation of media objects
  1.2 Composition of media objects
  1.3 Description and synchronization of streaming data for media objects
  1.4 Delivery of streaming data
  1.5 Interaction with media objects
  1.6 Management and Identification of Intellectual Property
2 Versions in MPEG-4
3 Major Functionalities in MPEG-4
  3.1 Transport
  3.2 DMIF
  3.3 Systems
  3.4 Audio
  3.5 Visual
4 Extensions Underway
  4.1 IPMP Extensions
  4.2 The Animation Framework eXtension, AFX
  4.3 Multi User Worlds
  4.4 Advanced Video Coding
  4.5 Audio Extensions
5 Profiles in MPEG-4
  5.1 Visual Profiles
  5.2 Audio Profiles
  5.3 Graphics Profiles
  5.4 Scene Graph Profiles
  5.5 MPEG-J Profiles
  5.6 Object Descriptor Profile
6 Verification Testing: checking MPEG’s performance
  6.1 Video
  6.2 Audio
7 The MPEG-4 Industry Forum
8 Licensing of patents necessary to implement MPEG-4 (Skipped)
9 Deployment of MPEG-4
10 Detailed technical description of MPEG-4 DMIF and Systems
  10.1 Transport of MPEG-4
  10.2 DMIF
  10.3 Demultiplexing, synchronization and description of streaming data
  10.4 Advanced Synchronization (FlexTime) Model
  10.5 Syntax Description
  10.6 Binary Format for Scene description: BIFS
  10.7 User interaction
  10.8 Content-related IPR identification and protection
  10.9 MPEG-4 File Format
  10.10 MPEG-J
  10.11 Object Content Information
11 Detailed technical description of MPEG-4 Visual
  11.1 Natural Textures, Images and Video
  11.2 Structure of the tools for representing natural video
  11.3 The MPEG-4 Video Image Coding Scheme
  11.4 Coding of Textures and Still Images
  11.5 Synthetic Objects
12 Detailed technical description of MPEG-4 Audio
  12.1 Natural Sound
  12.2 Synthesized Sound
13 Detailed Description of current development
  13.1 IPMP Extensions
  13.2 The Animation Framework eXtension, AFX
  13.3 Multi User Worlds (Skipped)
  13.4 Advanced Video Coding (Skipped)
  13.5 Audio Extensions (Skipped)
14 Glossary and Acronyms


1. Scope and features of the MPEG-4 standard

The MPEG-4 standard provides a set of technologies to satisfy the needs of authors, service providers and end users alike. 

  • For authors, MPEG-4 enables the production of content that has far greater reusability and flexibility than is possible today with individual technologies such as digital television, animated graphics, World Wide Web (WWW) pages and their extensions. Also, it is now possible to better manage and protect content owner rights.
  • For network service providers MPEG-4 offers transparent information, which can be interpreted and translated into the appropriate native signaling messages of each network with the help of relevant standards bodies. The foregoing, however, excludes Quality of Service considerations, for which MPEG-4 provides a generic QoS descriptor for different MPEG-4 media. The exact translations from the QoS parameters set for each media to the network QoS are beyond the scope of MPEG-4 and are left to network providers. Signaling of the MPEG-4 media QoS descriptors end-to-end enables transport optimization in heterogeneous networks.
  • For end users, MPEG-4 brings higher levels of interaction with content, within the limits set by the author. It also brings multimedia to new networks, including those employing relatively low bitrate, and mobile ones. 

For all parties involved, MPEG seeks to avoid a multitude of proprietary, non-interworking formats and players. 

MPEG-4 achieves these goals by providing standardized ways to:

  1. represent units of aural, visual or audiovisual content, called “media objects”. These media objects can be of natural or synthetic origin; this means they could be recorded with a camera or microphone, or generated with a computer;
  2. describe the composition of these objects to create compound media objects that form audiovisual scenes;
  3. multiplex and synchronize the data associated with media objects, so that they can be transported over network channels providing a QoS appropriate for the nature of the specific media objects; and 
  4. interact with the audiovisual scene generated at the receiver’s end.

The following sections illustrate the MPEG-4 functionalities described above, using the audiovisual scene depicted in Figure 1.

1.1 Coded representation of media objects

MPEG-4 audiovisual scenes are composed of several media objects, organized in a hierarchical fashion. At the leaves of the hierarchy, we find primitive media objects, such as:

  • Still images (e.g. as a fixed background);
  • Video objects (e.g. a talking person, without the background);
  • Audio objects (e.g. the voice associated with that person, background music).

MPEG-4 standardizes a number of such primitive media objects, capable of representing both natural and synthetic content types, which can be either 2- or 3-dimensional. In addition to the media objects mentioned above and shown in Figure 1, MPEG-4 defines the coded representation of objects such as:

  • Text and graphics;
  • Talking synthetic heads and associated text used to synthesize the speech and animate the head; animated bodies to go with the faces;
  • Synthetic sound.

A media object in its coded form consists of descriptive elements that allow handling the object in an audiovisual scene as well as of associated streaming data, if needed. It is important to note that in its coded form, each media object can be represented independent of its surroundings or background.
The coded representation of media objects is as efficient as possible while taking into account the desired functionalities. Examples of such functionalities are error robustness, easy extraction and editing of an object, or having an object available in a scalable form.

1.2 Composition of media objects

Figure 1 explains the way in which an audiovisual scene in MPEG-4 is described as composed of individual objects. The figure contains compound media objects that group primitive media objects together. Primitive media objects correspond to leaves in the descriptive tree while compound media objects encompass entire sub-trees. As an example: the visual object corresponding to the talking person and the corresponding voice are tied together to form a new compound media object, containing both the aural and visual components of that talking person. 

Such grouping allows authors to construct complex scenes, and enables consumers to manipulate meaningful (sets of) objects.

More generally, MPEG-4 provides a standardized way to describe a scene, allowing for example to:

  • Place media objects anywhere in a given coordinate system;
  • Apply transforms to change the geometrical or acoustical appearance of a media object;
  • Group primitive media objects in order to form compound media objects;
  • Apply streamed data to media objects, in order to modify their attributes (e.g. a sound, a moving texture belonging to an object; animation parameters driving a synthetic face);
  • Change, interactively, the user’s viewing and listening points anywhere in the scene.

The scene description builds on several concepts from the Virtual Reality Modeling Language (VRML) in terms of both its structure and the functionality of object composition nodes, and extends it to fully enable the aforementioned features.
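
The hierarchy described above can be sketched as a small tree in which leaves stand for primitive media objects and inner nodes for compound ones. This is only an illustration of the grouping idea, not BIFS or VRML syntax; the `MediaObject` class, its `offset` field and the object names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

Point = Tuple[float, float]


@dataclass
class MediaObject:
    """A node in the scene tree; nodes without children are primitive objects."""
    name: str
    offset: Point = (0.0, 0.0)  # position relative to the parent (hypothetical field)
    children: List["MediaObject"] = field(default_factory=list)

    def is_primitive(self) -> bool:
        return not self.children

    def flatten(self, origin: Point = (0.0, 0.0)) -> Iterator[Tuple[str, Point]]:
        """Yield (name, absolute position) for every primitive object."""
        x, y = origin[0] + self.offset[0], origin[1] + self.offset[1]
        if self.is_primitive():
            yield self.name, (x, y)
        for child in self.children:
            yield from child.flatten((x, y))


# Compound object: the talking person groups a video object and its voice.
person = MediaObject("person", offset=(100.0, 50.0), children=[
    MediaObject("video: talking person"),
    MediaObject("audio: voice"),
])
scene = MediaObject("scene", children=[
    MediaObject("still image: background"),
    person,
])

# Moving the compound object moves both of its components together.
person.offset = (120.0, 50.0)
leaves = list(scene.flatten())
```

Because the transform is stored on the compound node, a single change repositions the person's picture and voice at once, which is the kind of manipulation of "meaningful (sets of) objects" the text refers to.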

Figure 1 - an example of an MPEG-4 Scene 

1.3 Description and synchronization of streaming data for media objects

Media objects may need streaming data, which is conveyed in one or more elementary streams. An object descriptor identifies all streams associated to one media object. This allows handling hierarchically encoded data as well as the association of meta-information about the content (called ‘object content information’) and the intellectual property rights associated with it.

Each stream itself is characterized by a set of descriptors for configuration information, e.g., to determine the required decoder resources and the precision of encoded timing information. Furthermore, the descriptors may carry hints about the Quality of Service (QoS) the stream requests for transmission (e.g., maximum bit rate, bit error rate, priority, etc.).

Synchronization of elementary streams is achieved through time stamping of individual access units within elementary streams. The synchronization layer manages the identification of such access units and the time stamping. Independent of the media type, this layer allows identification of the type of access unit (e.g., video or audio frames, scene description commands) in elementary streams, recovery of the media object’s or scene description’s time base, and it enables synchronization among them. The syntax of this layer is configurable in a large number of ways, allowing use in a broad spectrum of systems. 
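
As a rough sketch of synchronization by time stamping, the example below merges access units from two elementary streams into one presentation order. It is not the normative sync layer syntax: the `AccessUnit` class, the per-stream clock resolutions (90000 and 1000 ticks per second) and the payload labels are assumptions chosen for illustration.

```python
import heapq
from dataclasses import dataclass
from fractions import Fraction
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class AccessUnit:
    stream: str      # elementary stream the unit belongs to
    timestamp: int   # time stamp in ticks of that stream's clock
    payload: str


def presentation_order(
    streams: Sequence[Tuple[int, List[AccessUnit]]]
) -> List[Tuple[Fraction, AccessUnit]]:
    """Merge streams by time. Each entry is (ticks per second, access units),
    and the units of each stream are assumed to be in time order already."""
    timed = [
        [(Fraction(au.timestamp, resolution), au) for au in units]
        for resolution, units in streams
    ]
    # Comparing on the common time base (seconds) lets streams with
    # different timestamp precision be aligned with each other.
    return list(heapq.merge(*timed, key=lambda item: item[0]))


video = [AccessUnit("video", 0, "frame 0"), AccessUnit("video", 3600, "frame 1")]
audio = [AccessUnit("audio", 0, "frame 0"), AccessUnit("audio", 20, "frame 1"),
         AccessUnit("audio", 40, "frame 2")]

order = presentation_order([(90000, video), (1000, audio)])
```

Each stream keeps its own timestamp precision, and the merge only needs the resolution from the stream's configuration to place every access unit on a shared timeline.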

1.4 Delivery of streaming data

The synchronized delivery of streaming information from source to destination, exploiting different QoS as available from the network, is specified in terms of the synchronization layer and a delivery layer containing a two-layer multiplexer, as depicted in Figure 2.

The first multiplexing layer is managed according to the DMIF specification, part 6 of the MPEG-4 standard. (DMIF stands for Delivery Multimedia Integration Framework.) This multiplex may be embodied by the MPEG-defined FlexMux tool, which allows grouping of Elementary Streams (ESs) with a low multiplexing overhead. Multiplexing at this layer may be used, for example, to group ESs with similar QoS requirements, or to reduce the number of network connections or the end-to-end delay.

The “TransMux” (Transport Multiplexing) layer in Figure 2 models the layer that offers transport services matching the requested QoS. Only the interface to this layer is specified by MPEG-4 while the concrete mapping of the data packets and control signaling must be done in collaboration with the bodies that have jurisdiction over the respective transport protocol. Any suitable existing transport protocol stack such as (RTP)/UDP/IP, (AAL5)/ATM, or MPEG-2’s Transport Stream over a suitable link layer may become a specific TransMux instance. The choice is left to the end user/service provider, and allows MPEG-4 to be used in a wide variety of operation environments.

Figure 2 - The MPEG-4 System Layer Model

Use of the FlexMux multiplexing tool is optional and, as shown in Figure 2, this layer may be empty if the underlying TransMux instance provides all the required functionality. The synchronization layer, however, is always present.

With regard to Figure 2, it is possible to:

  • Identify access units, transport timestamps and clock reference information and identify data loss;
  • Optionally interleave data from different elementary streams into FlexMux streams;
  • Convey control information to:
      • Indicate the required QoS for each elementary stream and FlexMux stream;
      • Translate such QoS requirements into actual network resources;
      • Associate elementary streams to media objects;
      • Convey the mapping of elementary streams to FlexMux and TransMux channels.

Parts of the control functionalities are available only in conjunction with a transport control entity like the DMIF framework. 
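
The idea of interleaving several elementary streams into one stream with little overhead can be sketched as follows. The framing used here (one byte for a channel index, one byte for the payload length) is an assumption made for the example and should not be read as the normative FlexMux syntax; the function names `mux` and `demux` are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def mux(packets: Iterable[Tuple[int, bytes]]) -> bytes:
    """Interleave (channel, payload) packets into a single byte stream."""
    out = bytearray()
    for channel, payload in packets:
        if not 0 <= channel <= 255 or len(payload) > 255:
            raise ValueError("channel and payload length must each fit in one byte")
        out += bytes([channel, len(payload)]) + payload  # 2 bytes of overhead
    return bytes(out)


def demux(stream: bytes) -> Dict[int, List[bytes]]:
    """Recover the payloads of each channel from a well-formed muxed stream."""
    channels: Dict[int, List[bytes]] = defaultdict(list)
    pos = 0
    while pos < len(stream):
        channel, length = stream[pos], stream[pos + 1]
        channels[channel].append(stream[pos + 2 : pos + 2 + length])
        pos += 2 + length
    return dict(channels)


muxed = mux([(1, b"video-0"), (2, b"audio-0"), (1, b"video-1")])
recovered = demux(muxed)
```

With this framing every packet costs two bytes on top of its payload, which gives a feel for why grouping streams with similar QoS requirements into one multiplexed stream can be cheap.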

1.5 Interaction with media objects

In general, the user observes a scene that is composed following the design of the scene’s author. Depending on the degree of freedom allowed by the author, however, the user has the possibility to interact with the scene. Operations a user may be allowed to perform include:

  • Change the viewing/listening point of the scene, e.g. by navigation through a scene;
  • Drag objects in the scene to a different position;
  • Trigger a cascade of events by clicking on a specific object, e.g. starting or stopping a video stream;
  • Select the desired language when multiple language tracks are available;

More complex kinds of behavior can also be triggered, e.g. a virtual phone rings, the user answers and a communication link is established.

1.6 Management and Identification of Intellectual Property

It is important to have the possibility to identify intellectual property in MPEG-4 media objects. Therefore, MPEG has worked with representatives of different creative industries in the definition of syntax and tools to support this. A full elaboration of the requirements for the identification of intellectual property can be found in ‘Management and Protection of Intellectual Property in MPEG-4’, which is publicly available from the MPEG home page.

MPEG-4 incorporates the identification of intellectual property by storing unique identifiers, which are issued by international numbering systems (e.g. ISAN, ISRC, etc.). These numbers can be applied to identify a current rights holder of a media object. Since not all content is identified by such a number, MPEG-4 Version 1 offers the possibility to identify intellectual property by a key-value pair (e.g. “composer”/“John Smith”). Also, MPEG-4 offers a standardized interface, integrated tightly into the Systems layer, to people who want to use systems that control access to intellectual property. With this interface, proprietary control systems can be easily amalgamated with the standardized part of the decoder.

2. Versions in MPEG-4

MPEG-4 Version 1 was approved by MPEG in December 1998; version 2 was frozen in December 1999. After these two major versions, more tools were added in subsequent amendments that could be qualified as versions, even though they are harder to recognize as such. Recognizing the versions is not too important, however; it is more important to distinguish Profiles. Existing tools and profiles from any version are never replaced in subsequent versions; technology is always added to MPEG-4 in the form of new profiles. Figure 3 below depicts the relationship between the versions. Version 2 is a backward compatible extension of Version 1, and version 3 is a backward compatible extension of Version 2 – and so on. The versions of all major parts of the MPEG-4 Standard (Systems, Audio, Video, DMIF) were synchronized; after that, the different parts took their own paths.




Figure 3 - relation between MPEG-4 Versions

The Systems layer of later versions is backward compatible with all earlier versions. In the area of Systems, Audio and Visual, new versions add Profiles but do not change existing ones. In fact, it is very important to note that existing systems will always remain compliant, because Profiles will never be changed in retrospect, and neither will the Systems Syntax, at least not in a backward-incompatible way.

3. Major Functionalities in MPEG-4

This section lists, in an itemized fashion, the major functionalities that the different parts of the MPEG-4 Standard offer in the finalized MPEG-4 Version 1. Descriptions of the functionalities can be found in the following sections.

3.1 Transport

In principle, MPEG-4 does not define transport layers. In a number of cases, adaptation to a specific existing transport layer has been defined:

  • Transport over MPEG-2 Transport Stream (this is an amendment to MPEG-2 Systems)
  • Transport over IP (In cooperation with IETF, the Internet Engineering Task Force)

3.2 DMIF

DMIF, or Delivery Multimedia Integration Framework, is an interface between the application and the transport, that allows the MPEG-4 application developer to stop worrying about that transport. A single application can run on different transport layers when supported by the right DMIF instantiation. 

MPEG-4 DMIF supports the following functionalities:

  • A transparent MPEG-4 DMIF-application interface irrespective of whether the peer is a remote interactive peer, broadcast or local storage media.
  • Control of the establishment of FlexMux channels
  • Use of homogeneous networks between interactive peers: IP, ATM, mobile, PSTN, Narrowband ISDN.
  • Support for mobile networks, developed together with ITU-T
  • UserCommands with acknowledgment messages.
  • Management of MPEG-4 Sync Layer information. 

3.3 Systems

As explained above, MPEG-4 defines a toolbox of advanced compression algorithms for audio and visual information. The data streams (Elementary Streams, ES) that result from the coding process can be transmitted or stored separately, and need to be composed so as to create the actual multimedia presentation at the receiver side. 

The systems part of the MPEG-4 addresses the description of the relationship between the audio-visual components that constitute a scene. The relationship is described at two main levels. 

  • The Binary Format for Scenes (BIFS) describes the spatio-temporal arrangements of the objects in the scene. Viewers may have the possibility of interacting with the objects, e.g. by rearranging them on the scene or by changing their own point of view in a 3D virtual environment. The scene description provides a rich set of nodes for 2-D and 3-D composition operators and graphics primitives.
  • At a lower level, Object Descriptors (ODs) define the relationship between the Elementary Streams pertinent to each object (e.g. the audio and the video stream of a participant in a videoconference). ODs also provide additional information such as the URL needed to access the Elementary Streams, the characteristics of the decoders needed to parse them, intellectual property and others.
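
To make the role of an Object Descriptor more concrete, here is a minimal data-structure sketch of the videoconference example. It is not the binary descriptor syntax of the standard: the class names, field names, decoder labels and the `example.com` URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class StreamDescriptor:
    es_id: int
    stream_type: str            # e.g. "audio" or "visual"
    decoder: str                # hypothetical label for the decoder needed
    url: Optional[str] = None   # where to fetch the stream, if it is remote


@dataclass
class ObjectDescriptor:
    od_id: int
    streams: List[StreamDescriptor] = field(default_factory=list)
    ip_info: Optional[str] = None  # intellectual property identification

    def required_decoders(self) -> Set[str]:
        """Decoders a terminal needs before it can present this object."""
        return {s.decoder for s in self.streams}


# One participant in a videoconference: an audio and a video stream
# that belong to the same media object.
participant = ObjectDescriptor(
    od_id=1,
    streams=[
        StreamDescriptor(101, "audio", "speech-decoder",
                         url="https://example.com/participant/audio"),
        StreamDescriptor(102, "visual", "video-decoder",
                         url="https://example.com/participant/video"),
    ],
    ip_info="composer/John Smith",
)

needed = participant.required_decoders()
```

A terminal could inspect such a descriptor before opening any stream, for instance to check that it has the decoders the object requires.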

Other issues addressed by MPEG-4 Systems:

  • A standard file format supports the exchange and authoring of MPEG-4 content 
  • Interactivity, including: client and server-based interaction; a general event model for triggering events or routing user actions; general event handling and routing between objects in the scene, upon user or scene triggered events. 
  • Java (MPEG-J) can be used to query the terminal and its environment, and there is also a Java application engine to code 'MPEGlets'.
  • A tool for interleaving of multiple streams into a single stream, including timing information (FlexMux tool).
  • A tool for storing MPEG-4 data in a file (the MPEG-4 File Format, ‘MP4’)
  • Interfaces to various aspects of the terminal and networks, in the form of Java APIs (MPEG-J)
  • Transport layer independence. Mappings to relevant transport protocol stacks, like (RTP)/UDP/IP or MPEG-2 transport stream can be or are being defined jointly with the responsible standardization bodies.
  • Text representation with international language support, font and font style selection, timing and synchronization.
  • The initialization and continuous management of the receiving terminal’s buffers.
  • Timing identification, synchronization and recovery mechanisms.
  • Datasets covering identification of Intellectual Property Rights relating to media objects

3.4 Audio

MPEG-4 Audio facilitates a wide variety of applications which could range from intelligible speech to high quality multichannel audio, and from natural sounds to synthesized sounds. In particular, it supports the highly efficient representation of audio objects consisting of:

3.4.1 General Audio Signals 

Support for coding general audio ranging from very low bitrates up to high quality is provided by transform coding techniques. With this functionality, a wide range of bitrates and bandwidths is covered. It starts at a bitrate of 6 kbit/s and a bandwidth below 4 kHz and extends to broadcast quality audio from mono up to multichannel. High quality can be achieved with low delays. Parametric Audio Coding allows sound manipulation at low speeds. Fine Granularity Scalability (FGS) offers a scalability resolution down to 1 kbit/s per channel.

3.4.2 Speech signals

Speech coding can be done using bitrates from 2 kbit/s up to 24 kbit/s using the speech coding tools. Lower bitrates, such as an average of 1.2 kbit/s, are also possible when variable rate coding is allowed. Low delay is possible for communications applications. When using the HVXC tools, speed and pitch can be modified under user control during playback. If the CELP tools are used, a change of the playback speed can be achieved by using an additional tool for effects processing.

3.4.3 Synthetic Audio

MPEG-4 Structured Audio is a language to describe 'instruments' (little programs that generate sound) and 'scores' (input that drives those objects). These objects are not necessarily musical instruments, they are in essence mathematical formulae, that could generate the sound of a piano, that of falling water – or something 'unheard' in nature. 

3.4.4 Synthesized Speech

Scalable TTS coders have a bitrate range from 200 bit/s to 1.2 kbit/s, which allows a text, or a text with prosodic parameters (pitch contour, phoneme duration, and so on), to be used as input to generate intelligible synthetic speech.

3.5 Visual

The MPEG-4 Visual standard allows the hybrid coding of natural (pixel based) images and video together with synthetic (computer generated) scenes. This enables, for example, the virtual presence of videoconferencing participants. To this end, the Visual standard comprises tools and algorithms supporting the coding of natural (pixel based) still images and video sequences as well as tools to support the compression of synthetic 2-D and 3-D graphic geometry parameters (i.e. compression of wire grid parameters, synthetic text). 

The subsections below give an itemized overview of the functionalities that the tools and algorithms of the MPEG-4 Visual standard support.

3.5.1 Formats Supported

The following formats and bitrates are supported by MPEG-4 Visual:

  • Bitrates: typically between 5 kbit/s and more than 1 Gbit/s
  • Formats: progressive as well as interlaced video
  • Resolutions: typically from sub-QCIF to 'Studio' resolutions (4k x 4k pixels)

3.5.2 Compression Efficiency

  • For all bit rates addressed, the algorithms are very efficient. This includes the compact coding of textures with a quality adjustable between "acceptable" for very high compression ratios up to "near lossless". 
  • Efficient compression of textures for texture mapping on 2-D and 3-D meshes.
  • Random access of video to allow functionalities such as pause, fast forward and fast reverse of stored video.

3.5.3 Content-Based Functionalities

  • Content-based coding of images and video allows separate decoding and reconstruction of arbitrarily shaped video objects.
  • Random access of content in video sequences allows functionalities such as pause, fast forward and fast reverse of stored video objects.
  • Extended manipulation of content in video sequences allows functionalities such as warping of synthetic or natural text, textures, image and video overlays on reconstructed video content. An example is the mapping of text in front of a moving video object where the text moves coherently with the object.

3.5.4 Scalability of Textures, Images and Video

  • Complexity scalability in the encoder allows encoders of different complexity to generate valid and meaningful bitstreams for a given texture, image or video.
  • Complexity scalability in the decoder allows a given texture, image or video bitstream to be decoded by decoders of different levels of complexity. The reconstructed quality, in general, is related to the complexity of the decoder used. This may entail that less powerful decoders decode only a part of the bitstream. 
  • Spatial scalability allows decoders to decode a subset of the total bitstream generated by the encoder to reconstruct and display textures, images and video objects at reduced spatial resolution. A maximum of 11 levels of spatial scalability are supported in so-called 'fine-granularity scalability', for video as well as textures and still images.
  • Temporal scalability allows decoders to decode a subset of the total bitstream generated by the encoder to reconstruct and display video at reduced temporal resolution. A maximum of three levels are supported.
  • Quality scalability allows a bitstream to be parsed into a number of bitstream layers of different bitrate such that the combination of a subset of the layers can still be decoded into a meaningful signal. The bitstream parsing can occur either during transmission or in the decoder. The reconstructed quality, in general, is related to the number of layers used for decoding and reconstruction.
  • Fine Grain Scalability – a combination of the above in fine grain steps, up to 11 steps
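
The layered decoding described in the scalability items above can be sketched as a simple selection of layers under a bitrate budget. The example assumes, as a simplification not stated in the text, that layers are ordered base first and that each enhancement layer depends on all layers below it; the layer labels and bitrates are made up for illustration.

```python
from typing import List, Tuple

Layer = Tuple[str, int]  # (label, bitrate in kbit/s)


def select_layers(layers: List[Layer], budget_kbps: int) -> List[str]:
    """Keep the base layer and as many enhancement layers as fit the budget."""
    chosen: List[str] = []
    used = 0
    for label, rate in layers:  # assumed order: base layer first
        if used + rate > budget_kbps:
            break  # higher layers build on this one, so stop here
        chosen.append(label)
        used += rate
    return chosen


layers = [("base", 64), ("enhancement 1", 64), ("enhancement 2", 128)]

narrow = select_layers(layers, 150)   # only part of the bitstream fits
wide = select_layers(layers, 256)     # the whole bitstream fits
```

The same selection could run in a network node during transmission or in the decoder itself; in both cases the reconstructed quality follows the number of layers that were kept.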

3.5.5 Shape and Alpha Channel Coding

  • Shape coding assists the description and composition of conventional images and video as well as arbitrarily shaped video objects. Applications that benefit from binary shape maps with images are content-based image representations for image databases, interactive games, surveillance, and animation. There is an efficient technique to code binary shapes. A binary alpha map defines whether or not a pixel belongs to an object. It can be ‘on’ or ‘off’.
  • ‘Gray Scale’ or ‘alpha’ Shape Coding. An alpha plane defines the ‘transparency’ of an object, which is not necessarily uniform; it can vary over the object, so that, e.g., edges are more transparent (a technique called feathering). Multilevel alpha maps are frequently used to blend different layers of image sequences. Other applications that benefit from associated binary alpha maps with images are content-based image representations for image databases, interactive games, surveillance, and animation. 
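
The difference between a binary alpha map and a gray scale alpha plane can be illustrated by compositing one row of samples over a background. This sketch only shows how decoded alpha values might be used for blending, not how shape is coded in the bitstream; the 8-bit sample values and the rounding rule are assumptions of the example.

```python
from typing import List

Row = List[int]


def composite(fg: Row, bg: Row, alpha: Row) -> Row:
    """Blend one row of 8-bit samples: alpha 255 shows fg, alpha 0 shows bg."""
    return [
        (a * f + (255 - a) * b + 127) // 255  # integer blend with rounding
        for f, b, a in zip(fg, bg, alpha)
    ]


fg = [200, 200, 200, 200]   # samples of the video object
bg = [40, 40, 40, 40]       # samples of the background

binary_alpha = [0, 255, 255, 0]   # each pixel is either outside or inside the object
gray_alpha = [0, 128, 255, 128]   # partially transparent (feathered) edges

binary_result = composite(fg, bg, binary_alpha)
gray_result = composite(fg, bg, gray_alpha)
```

With the binary map the object has hard edges, while the gray scale plane mixes object and background at the borders.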

3.5.6 Robustness in Error Prone Environments

Error resilience allows accessing image and video over a wide range of storage and transmission media. This includes the useful operation of image and video compression algorithms in error-prone environments at low bitrates (i.e., less than 64 kbit/s). There are tools that address both the band-limited nature and error resiliency aspects of access over wireless networks.

3.5.7 Face and Body Animation

The ‘Face and Body Animation’ tools in the standard allow sending parameters that can define, calibrate and animate synthetic faces and bodies. These models themselves are not standardized by MPEG-4, only the parameters are, although there is a way to send, e.g., a well-defined face to a decoder. 

The tools include:

  • Definition and coding of face and body animation parameters (model independent):
      • Feature point positions and orientations to animate the face and body definition meshes
      • Visemes, or visual lip configurations equivalent to speech phonemes
  • Definition and coding of face and body definition parameters (for model calibration):
      • 3-D feature point positions
      • 3-D head calibration meshes for animation
      • Personal characteristics
  • Facial texture coding

3.5.8 Coding of 2-D Meshes with Implicit Structure

2-D mesh coding includes:

  • Mesh-based prediction and animated texture transfiguration
  • 2-D Delaunay or regular mesh formalism with motion tracking of animated objects
  • Motion prediction and suspended texture transmission with dynamic meshes
  • Geometry compression for motion vectors:
      • 2-D mesh compression with implicit structure & decoder reconstruction

3.5.9 Coding of 3-D Polygonal Meshes

MPEG-4 provides a suite of tools for coding 3-D polygonal meshes. Polygonal meshes are widely used as a generic representation of 3-D objects. The underlying technologies compress the connectivity, geometry, and properties such as shading normals, colors and texture coordinates of 3-D polygonal meshes. 

The Animation Framework eXtension (AFX, see further down) will provide more elaborate tools for 2D and 3D synthetic objects. 

4. Extensions underway

MPEG is currently working on a number of extensions:

4.1 IPMP Extension

 

4.2 The Animation Framework eXtension, AFX

The Animation Framework eXtension (AFX – pronounced ‘effects’) provides an integrated toolbox for building attractive and powerful synthetic MPEG-4 environments. The framework defines a collection of interoperable tool categories that collaborate to produce a reusable architecture for interactive animated content. In the context of AFX, a tool represents functionality such as a BIFS node, a synthetic stream, or an audio-visual stream.

AFX utilizes and enhances existing MPEG-4 tools, while keeping backward-compatibility, by offering:

  • Higher-level descriptions of animations (e.g. inverse kinematics)
  • Enhanced rendering (e.g. multi-texturing, procedural texturing)
  • Compact representations (e.g. piecewise curve interpolators, subdivision surfaces)
  • Low bitrate animations (e.g. using interpolator compression and dead-reckoning)
  • Scalability based on terminal capabilities (e.g. parametric surfaces tessellation)
  • Interactivity at user level, scene level, and client-server session level
  • Compression of representations for static and dynamic tools

Compression of animated paths and animated models is required for improving the transmission and storage efficiency of representations for dynamic and static tools. 

4.3 Multi User Worlds

 

4.4 Advanced Video Coding

Work is ongoing on MPEG-4 part 10, 'Advanced Video Coding'. This codec is being developed jointly with ITU-T, in the so-called Joint Video Team (JVT). The JVT unites the standards world's video coding experts in a single group. The work currently underway is based on earlier work in ITU-T on 'H.26L'. H.26L and MPEG-4 part 10 will be the same. (H.26L will be renamed when it is done. The final name may be H.264, but that is not yet certain.) MPEG-4 AVC/H.26L is slated to be ready by the end of 2002.

4.5 Audio extensions

There are two work items underway for improving audio coding efficiency even further.

a) Bandwidth extension

Bandwidth extension is a tool that gives a better quality perception over the existing audio signal, while keeping the existing signal backward compatible.

MPEG is investigating bandwidth extensions, and may standardize tools for one or both of:

  1. General audio signals, to extend the capabilities currently provided by MPEG-4 general audio coders. 
  2. Speech signals, to extend the capabilities currently provided by MPEG-4 speech coders.

A single technology that addresses both of these signals is preferred. This technology shall be both forward and backward compatible with existing MPEG-4 technology. In other words, an MPEG-4 decoder can decode an enhanced stream and a new technology decoder can decode an MPEG-4 stream. There are two possible configurations for the enhanced stream: MPEG-4 AAC streams can carry the enhancement information in the DataStreamElement, while all MPEG-4 systems know the concept of elementary streams, which allows a second Elementary Stream for a given audio object to contain the enhancement information.

b) Parametric coding

The MPEG-4 standard already provides a parametric coding scheme for coding of general audio signals for low bit-rates (HILN, "Harmonic Individual Lines and Noise"). The extension investigates parametric coding of general audio signals for the higher quality range, to extend the capabilities currently provided by HILN. Whenever possible this technology will build upon the existing MPEG-4 HILN technology. 

5. Profiles in MPEG-4

MPEG-4 provides a large and rich set of tools for the coding of audio-visual objects. In order to allow effective implementations of the standard, subsets of the MPEG-4 Systems, Visual, and Audio tool sets have been identified, that can be used for specific applications. These subsets, called ‘Profiles’, limit the tool set a decoder has to implement. For each of these Profiles, one or more Levels have been set, restricting the computational complexity. The approach is similar to MPEG-2, where the most well known Profile/Level combination is ‘Main Profile @ Main Level’. A Profile@Level combination allows: 

  • a codec builder to implement only the subset of the standard he needs, while maintaining interworking with other MPEG-4 devices built to the same combination, and
  • checking whether MPEG-4 devices comply with the standard (‘conformance testing’). 

Profiles exist for various types of media content (audio, visual, and graphics) and for scene descriptions. MPEG does not prescribe or advise combinations of these Profiles, but care has been taken that good matches exist between the different areas.
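
As a rough illustration of how a Profile@Level label supports interworking, the check can be sketched as follows. The `ProfileLevel` class, the `can_decode` helper and the rule used (same Profile, stream Level not higher than the decoder's Level) are assumptions made for this example. Real Levels bound several parameters rather than a single number, and the sketch ignores the fact that some Profiles are supersets of others.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfileLevel:
    """A hypothetical Profile@Level label, for example Simple@L1."""
    profile: str
    level: int

    def __str__(self):
        return f"{self.profile}@L{self.level}"


def can_decode(decoder, stream):
    """Assumed rule: same Profile, and the stream's Level does not
    exceed the Level the decoder was built for."""
    return decoder.profile == stream.profile and stream.level <= decoder.level


if __name__ == "__main__":
    decoder = ProfileLevel("Simple", 2)
    candidates = (
        ProfileLevel("Simple", 1),
        ProfileLevel("Simple", 3),
        ProfileLevel("Core", 1),
    )
    for stream in candidates:
        print(f"{decoder} decodes {stream}: {can_decode(decoder, stream)}")
```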

5.1 Visual Profiles

The visual part of the standard provides profiles for the coding of natural, synthetic, and synthetic/natural hybrid visual content. There are five profiles for natural video content:

  1. The Simple Visual Profile provides efficient, error resilient coding of rectangular video objects, suitable for applications on mobile networks, such as PCS and IMT2000.
  2. The Simple Scalable Visual Profile adds support for coding of temporal and spatial scalable objects to the Simple Visual Profile. It is useful for applications which provide services at more than one level of quality due to bit-rate or decoder resource limitations, such as Internet use and software decoding.
  3. The Core Visual Profile adds support for coding of arbitrary-shaped and temporally scalable objects to the Simple Visual Profile. It is useful for applications such as those providing relatively simple content-interactivity (Internet multimedia applications).
  4. The Main Visual Profile adds support for coding of interlaced, semi-transparent, and sprite objects to the Core Visual Profile. It is useful for interactive and entertainment-quality broadcast and DVD applications.
  5. The N-Bit Visual Profile adds support for coding video objects having pixel-depths ranging from 4 to 12 bits to the Core Visual Profile. It is suitable for use in surveillance applications.
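
The "adds support ... to" relationships in the list above can be modelled as sets of tools. This is a sketch only: the tool names are informal labels taken from the descriptions above, not identifiers from the standard, and the `covers` helper is hypothetical. Real conformance also depends on object types and Levels.

```python
# Informal tool labels derived from the profile descriptions above.
SIMPLE = frozenset({"rectangular objects", "error resilience"})
SIMPLE_SCALABLE = SIMPLE | {"temporal scalability", "spatial scalability"}
CORE = SIMPLE | {"arbitrary shape", "temporal scalability"}
MAIN = CORE | {"interlaced", "semi-transparent objects", "sprites"}
N_BIT = CORE | {"4 to 12 bit pixel depth"}

PROFILES = {
    "Simple": SIMPLE,
    "Simple Scalable": SIMPLE_SCALABLE,
    "Core": CORE,
    "Main": MAIN,
    "N-Bit": N_BIT,
}


def covers(decoder_profile, content_profile):
    """True if every tool of the content profile is in the decoder profile."""
    return PROFILES[content_profile] <= PROFILES[decoder_profile]


if __name__ == "__main__":
    for name in PROFILES:
        decodable = [other for other in PROFILES if covers(name, other)]
        print(f"{name} covers: {', '.join(decodable)}")
```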

The profiles for synthetic and synthetic/natural hybrid visual content are:

  1. The Simple Facial Animation Visual Profile provides a simple means to animate a face model, suitable for applications such as audio/video presentation for the hearing impaired.
  2. The Scalable Texture Visual Profile provides spatial scalable coding of still image (texture) objects useful for applications needing multiple scalability levels, such as mapping texture onto objects in games, and high-resolution digital still cameras. 
  3. The Basic Animated 2-D Texture Visual Profile provides spatial scalability, SNR scalability, and mesh-based animation for still image (texture) objects and also simple face object animation.
  4. The Hybrid Visual Profile combines the ability to decode arbitrary-shaped and temporally scalable natural video objects (as in the Core Visual Profile) with the ability to decode several synthetic and hybrid objects, including simple face and animated still image objects. It is suitable for various content-rich multimedia applications.

Version 2 adds the following Profiles for natural video:

  1. The Advanced Real-Time Simple Profile (ARTS) provides advanced error-resilient coding of rectangular video objects using a back channel, and improved temporal resolution stability with low buffering delay. It is suitable for real-time coding applications such as videophone, teleconferencing and remote observation.
  2. The Core Scalable Profile adds support for coding of temporal and spatial scalable arbitrarily shaped objects to the Core Profile. The main functionality of this profile is object based SNR and spatial/temporal scalability for regions or objects of interest. It is useful for applications such as the Internet, mobile and broadcast. 
  3. The Advanced Coding Efficiency (ACE) Profile improves the coding efficiency for both rectangular and arbitrary shaped objects. It is suitable for applications such as mobile broadcast reception, the acquisition of image sequences (camcorders) and other applications where high coding efficiency is requested and small footprint is not the prime concern.

The Version 2 profiles for synthetic and synthetic/natural hybrid visual content are:

  1. The Advanced Scaleable Texture Profile supports decoding of arbitrary-shaped texture and still images including scalable shape coding, wavelet tiling and error-resilience. It is useful for applications that require fast random access as well as multiple scalability levels and arbitrary-shaped coding of still objects. Examples are fast content-based still image browsing on the Internet, multimedia-enabled PDAs, and Internet-ready high-resolution digital still cameras. 
  2. The Advanced Core Profile combines the ability to decode arbitrary-shaped video objects (as in the Core Visual Profile) with the ability to decode arbitrary-shaped scalable still image objects (as in the Advanced Scaleable Texture Profile.) It is suitable for various content-rich multimedia applications such as interactive multimedia streaming over Internet.
  3. The Simple Face and Body Animation Profile is a superset of the Simple Face Animation Profile, adding - obviously - body animation.

In subsequent Versions, the following Profiles were added:

  1. The Advanced Simple Profile looks much like Simple in that it has only rectangular objects, but it has a few extra tools that make it more efficient: B-frames, ¼ pel motion compensation, extra quantization tables and global motion compensation.
  2. The Fine Granularity Scalability Profile allows truncation of the enhancement layer bitstream at any bit position so that delivery quality can easily adapt to transmission and decoding circumstances. It can be used with Simple or Advanced Simple as a base layer.
  3. The Simple Studio Profile is a profile with very high quality for usage in Studio editing applications. It only has I frames, but it does support arbitrary shape and in fact multiple alpha channels. Bitrates go up to almost 2 Gigabit per second.
  4. The Core Studio Profile adds P frames to Simple Studio, making it more efficient but also requiring more complex implementations. 
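
The idea behind truncating an enhancement layer at any bit position can be shown with a toy bit-plane coder. This is an assumption-laden sketch, not the Fine Granularity Scalability bitstream syntax: the `encode` and `decode` functions are hypothetical, values are non-negative integers, and transforms, sign handling and entropy coding are left out. Bits are sent most significant plane first, so any prefix of the stream yields a coarser but usable reconstruction.

```python
def encode(values, num_planes):
    """Flatten values into a bit list, most significant bit-plane first."""
    bits = []
    for plane in range(num_planes - 1, -1, -1):
        bits.extend((v >> plane) & 1 for v in values)
    return bits


def decode(bits, num_planes, count):
    """Rebuild `count` values from any prefix of the encoded bit list.

    Bits that were cut off are treated as zero.
    """
    values = [0] * count
    for k, bit in enumerate(bits):
        plane_index, value_index = divmod(k, count)
        shift = num_planes - 1 - plane_index
        values[value_index] |= bit << shift
    return values


if __name__ == "__main__":
    residuals = [5, 3, 6, 1]
    stream = encode(residuals, num_planes=3)
    for cut in (4, 6, len(stream)):
        print(cut, decode(stream[:cut], num_planes=3, count=len(residuals)))
```

Cutting the stream later never makes a reconstructed value worse, which is what lets delivery quality follow the available bit-rate.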

5.2 Audio Profiles

Four Audio Profiles have been defined in MPEG-4 V.1:

  1. The Speech Profile provides HVXC, which is a very-low bit-rate parametric speech coder, a CELP narrowband/wideband speech coder, and a Text-To-Speech interface. 
  2. The Synthesis Profile provides score driven synthesis using SAOL and wavetables and a Text-to-Speech Interface to generate sound and speech at very low bitrates.
  3. The Scalable Profile, a superset of the Speech Profile, is suitable for scalable coding of speech and music for networks, such as Internet and Narrow band Audio Digital Broadcasting (NADIB). The bitrates range from 6 kbit/s to 24 kbit/s, with bandwidths between 3.5 and 9 kHz.
  4. The Main Profile is a rich superset of all the other Profiles, containing tools for natural and synthetic Audio.

Another four Profiles were added in MPEG-4 V.2:

  1. The High Quality Audio Profile contains the CELP speech coder and the Low Complexity AAC coder including Long Term Prediction. Scalable coding can be performed by the AAC Scalable object type. Optionally, the new error resilient (ER) bitstream syntax may be used.
  2. The Low Delay Audio Profile contains the HVXC and CELP speech coders (optionally using the ER bitstream syntax), the low-delay AAC coder and the Text-to-Speech interface TTSI. 
  3. The Natural Audio Profile contains all natural audio coding tools available in MPEG-4, but not the synthetic ones.
  4. The Mobile Audio Internetworking Profile (MAUI) contains the low-delay and scalable AAC object types including TwinVQ and BSAC. This profile is intended to extend communication applications using non-MPEG speech coding algorithms with high quality audio coding capabilities.

5.3 Graphics Profiles

Graphics Profiles define which graphical and textual elements can be used in a scene. These profiles are defined in the Systems part of the standard:

  1. The Simple 2-D Graphics Profile provides only those graphics elements of the BIFS tool that are necessary to place one or more visual objects in a scene.
  2. The Complete 2-D Graphics Profile provides two-dimensional graphics functionalities and supports features such as arbitrary two-dimensional graphics and text, possibly in conjunction with visual objects. 
  3. The Complete Graphics Profile provides advanced graphical elements such as elevation grids and extrusions and allows creating content with sophisticated lighting. It enables applications such as complex virtual worlds that exhibit a high degree of realism.
  4. The 3D Audio Graphics Profile sounds like a contradiction in terms, but really isn’t. This profile does not propose visual rendering, but graphics tools are provided to define the acoustical properties of the scene (geometry, acoustic absorption, diffusion, transparency of the material). This profile is used for applications that do environmental spatialization of audio signals. (See Section 12.1.7)

5.3.1 Profiles under Definition or Consideration

The following profiles were under development at the time of writing this Overview; their inclusion in the standard was highly likely, but not guaranteed.

  1. The Simple 2D+Text profile looks like simple 2D, adding the BIFS nodes to display text which can be colored or transparent. Like simple 2D, this is a useful profile for low-complexity audiovisual devices. 
  2. The Core 2D Profile supports fairly simple 2D graphics and text. Meant for set tops and similar devices, it can do such things as picture-in-picture, video warping for animated advertisements, logos, and so on.
  3. The Advanced 2D profile contains tools for advanced 2D graphics. Using it, one can implement cartoons, games, advanced graphical user interfaces, and complex, streamed graphics animations.
  4. The X3D Core profile is the only 3D profile that is likely to be added to MPEG-4. It is compatible with Web3D’s X3D core profile under development [Web3D], and it gives a rich environment for games, virtual worlds and other 3D applications.

5.4 Scene Graph Profiles

Scene Graph Profiles (or Scene Description Profiles), defined in the Systems part of the standard, allow audiovisual scenes with audio-only, 2-dimensional, 3-dimensional or mixed 2-D/3-D content. 

  1. The Audio Scene Graph Profile provides for a set of BIFS scene graph elements for usage in audio only applications. The Audio Scene Graph profile supports applications like broadcast radio.
  2. The Simple 2-D Scene Graph Profile provides for only those BIFS scene graph elements necessary to place one or more audio-visual objects in a scene. The Simple 2-D Scene Graph profile allows presentation of audio-visual content with potential update of the complete scene but no interaction capabilities. The Simple 2-D Scene Graph profile supports applications like broadcast television.
  3. The Complete 2-D Scene Graph Profile provides for all the 2-D scene description elements of the BIFS tool. It supports features such as 2-D transformations and alpha blending. The Complete 2-D Scene Graph profile enables 2-D applications that require extensive and customized interactivity.
  4. The Complete Scene Graph Profile provides the complete set of scene graph elements of the BIFS tool. The Complete Scene Graph profile enables applications like dynamic virtual 3-D world and games.
  5. The 3D Audio Scene Graph Profile provides the tools for three-dimensional sound positioning in relation either to the acoustic parameters of the scene or to its perceptual attributes. The user can interact with the scene by changing the position of the sound source, by changing the room effect or by moving the listening point. This Profile is intended for usage in audio-only applications. 

5.4.1 Profiles under definition

At the time of writing, the following profiles were likely to be defined:

  1. The basic 2D profile provides basic 2D composition for very simple scenes with only audio and visual elements. Only basic 2D composition and audio and video node interfaces are included. These nodes are required to put an audio or a video object in the scene. 
  2. The Core 2D profile has tools for creating scenes with visual and audio objects using basic 2D composition. Included are quantization tools, local animation and interaction, 2D texturing, Scene tree updates, and the inclusion of subscenes through weblinks. Also included are interactive service tools (ServerCommand, MediaControl, and MediaSensor), to be used in video-on-demand services. 
  3. The Advanced 2D profile forms a full superset of the basic 2D and core 2D profiles. It adds scripting, the PROTO tool, BIFS-Anim for streamed animation, local interaction and local 2D composition as well as advanced audio. 
  4. The Main 2D profile adds the FlexTime model to Core 2D, as well as Layer2D and WorldInfo nodes and all input sensors. This profile was designed to be an interoperability point with tSMIL (see [SMIL]). It provides a very rich set of tools for highly interactive applications on, e.g., the World Wide Web. This name might still change.
  5. The X3D core profile was designed to be a common interworking point between the Web3D specifications [Web3D] and the MPEG-4 standard. The same profile will be in a Web3D specification. It includes the nodes for an implementation of 3D applications on a low-footprint engine, reckoning with the limitations of software renderers.

5.5 MPEG-J Profiles

Two MPEG-J Profiles exist, Personal and Main:

  1. Personal - a lightweight package for personal devices.

The Personal profile addresses a range of constrained devices including mobile and portable devices. Examples of such devices are cell video phones, PDAs, and personal gaming devices. This profile includes the following packages of MPEG-J APIs: 

  1. Network 
  2. Scene 
  3. Resource 

  2. Main - includes all the MPEG-J APIs.

The Main profile addresses a range of consumer devices including entertainment devices. Examples of such devices are set top boxes, computer based multimedia systems etc. It is a superset of the Personal profile. Apart from the packages in the Personal profile, this profile includes the following packages of the MPEG-J APIs:

  1. Decoder 
  2. Decoder Functionality
  3. Section Filter and Service Information
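
The superset relationship between the two MPEG-J Profiles can be written down as package sets. The package names come from the lists above; the `smallest_profile` helper is a hypothetical convenience for this sketch and is not part of the MPEG-J APIs.

```python
# Package sets as listed above for the two MPEG-J Profiles.
PERSONAL = frozenset({"Network", "Scene", "Resource"})
MAIN = PERSONAL | {
    "Decoder",
    "Decoder Functionality",
    "Section Filter and Service Information",
}

# Ordered from the smaller profile to the larger one.
MPEGJ_PROFILES = (("Personal", PERSONAL), ("Main", MAIN))


def smallest_profile(required_packages):
    """Return the first profile that contains every required package,
    or None if some package is in neither profile."""
    required = set(required_packages)
    for name, packages in MPEGJ_PROFILES:
        if required <= packages:
            return name
    return None


if __name__ == "__main__":
    print(smallest_profile(["Scene", "Network"]))
    print(smallest_profile(["Scene", "Decoder"]))
```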

5.6 Object Descriptor Profile

The Object Descriptor Profile includes the following tools:

  • Object Descriptor (OD) tool
  • Sync Layer (SL) tool
  • Object Content Information (OCI) tool
  • Intellectual Property Management and Protection (IPMP) tool

Currently, only one profile is defined that includes all these tools. The main reason for defining this profile is not subsetting the tools, but rather defining levels for them. This applies especially to the Sync Layer tool, as MPEG-4 allows multiple time bases to exist. In the context of Levels for this Profile, restrictions can be defined, e.g. to allow only a single time base.


We attempt to provide the best and most accurate information possible to our customers. Information contained herein is provided "as is" and subject to change without notice.