For even
greater detail and to download full specifications from ISO:
http://mpeg.telecomitalialab.com/standards/mpeg-4/mpeg-4.htm
Overview
MPEG-4 is an ISO/IEC standard developed by MPEG
(Moving Picture Experts Group), the committee that also developed the MPEG-1
and MPEG-2 standards. Those standards made interactive video on CD-ROM, DVD and
Digital Television possible. MPEG-4 was finalized in October 1998 and became an
International Standard in the first months of 1999. The fully backward
compatible extensions under the title of MPEG-4 Version 2 were frozen at the end
of 1999, and acquired formal International Standard status early in 2000.
Several extensions have been added since, and work on some specific work items
is still in progress.
MPEG-4
builds on the proven success of three fields:
- Digital television;
- Interactive graphics applications (synthetic
content);
- Interactive multimedia (World Wide Web,
distribution of and access to content)
MPEG-4 provides the standardized technological
elements enabling the integration of the production, distribution and content
access paradigms of these three fields.
Table of Contents
1 Scope and features of the MPEG-4 standard
  1.1 Coded representation of media objects
  1.2 Composition of media objects
  1.3 Description and synchronization of streaming data for media objects
  1.4 Delivery of streaming data
  1.5 Interaction with media objects
  1.6 Management and Identification of Intellectual Property
2 Versions in MPEG-4
3 Major Functionalities in MPEG-4
  3.1 Transport
  3.2 DMIF
  3.3 Systems
  3.4 Audio
  3.5 Visual
4 Extensions Underway
  4.1 IPMP Extensions
  4.2 The Animation Framework eXtension, AFX
  4.3 Multi User Worlds
  4.4 Advanced Video Coding
  4.5 Audio Extensions
5 Profiles in MPEG-4
  5.1 Visual Profiles
  5.2 Audio Profiles
  5.3 Graphics Profiles
  5.4 Scene Graph Profiles
  5.5 MPEG-J Profiles
  5.6 Object Descriptor Profile
6 Verification Testing: checking MPEG’s performance
  6.1 Video
  6.2 Audio
7 The MPEG-4 Industry Forum
8 Licensing of patents necessary to implement MPEG-4 (Skipped)
9 Deployment of MPEG-4
10 Detailed technical description of MPEG-4 DMIF and Systems
  10.1 Transport of MPEG-4
  10.2 DMIF
  10.3 Demultiplexing, synchronization and description of streaming data
  10.4 Advanced Synchronization (FlexTime) Model
  10.5 Syntax Description
  10.6 Binary Format for Scene description: BIFS
  10.7 User interaction
  10.8 Content-related IPR identification and protection
  10.9 MPEG-4 File Format
  10.10 MPEG-J
  10.11 Object Content Information
11 Detailed technical description of MPEG-4 Visual
  11.1 Natural Textures, Images and Video
  11.2 Structure of the tools for representing natural video
  11.3 The MPEG-4 Video Image Coding Scheme
  11.4 Coding of Textures and Still Images
  11.5 Synthetic Objects
12 Detailed technical description of MPEG-4 Audio
  12.1 Natural Sound
  12.2 Synthesized Sound
13 Detailed Description of current development
  13.1 IPMP Extensions
  13.2 The Animation Framework eXtension, AFX
  13.3 Multi User Worlds (Skipped)
  13.4 Advanced Video Coding (Skipped)
  13.5 Audio Extensions (Skipped)
14 Glossary and Acronyms
1. Scope and features of the
MPEG-4 standard
The MPEG-4 standard provides a set of
technologies to satisfy the needs of authors, service providers and end users
alike.
- For authors, MPEG-4 enables the production of
content that has far greater reusability and flexibility than is
possible today with individual technologies such as digital television,
animated graphics, World Wide Web (WWW) pages and their extensions. Also, it
is now possible to better manage and protect content owner
rights.
- For network service providers MPEG-4 offers
transparent information, which can be interpreted and translated into the
appropriate native signaling messages of each network with the help of
relevant standards bodies. The foregoing, however, excludes Quality of Service
considerations, for which MPEG-4 provides a generic QoS descriptor for
different MPEG-4 media. The exact translations from the QoS parameters set for
each media to the network QoS are beyond the scope of MPEG-4 and are left to
network providers. Signaling of the MPEG-4 media QoS descriptors end-to-end
enables transport optimization in heterogeneous networks.
- For end users, MPEG-4 brings higher levels of
interaction with content, within the limits set by the author. It also brings
multimedia to new networks, including those employing relatively low bitrate,
and mobile ones.
For all parties involved, MPEG seeks to avoid a
multitude of proprietary, non-interworking formats and players.
MPEG-4 achieves these goals by providing
standardized ways to:
- represent units of aural, visual or
audiovisual content, called “media objects”. These media objects can be of
natural or synthetic origin; this means they could be recorded with a camera
or microphone, or generated with a computer;
- describe the composition of these objects to
create compound media objects that form audiovisual scenes;
- multiplex and synchronize the data associated
with media objects, so that they can be transported over network channels
providing a QoS appropriate for the nature of the specific media objects;
and
- interact with the audiovisual scene generated
at the receiver’s end.
The following sections illustrate the MPEG-4
functionalities described above, using the audiovisual scene depicted in Figure
1.
1.1 Coded representation
of media objects
MPEG-4 audiovisual scenes are composed of
several media objects, organized in a hierarchical fashion. At the leaves of the
hierarchy, we find primitive media objects, such as:
- Still images (e.g. as a fixed
background);
- Video objects (e.g. a talking person, without
the background);
- Audio objects (e.g. the voice associated with
that person, background music).
MPEG-4 standardizes a number of such primitive
media objects, capable of representing both natural and synthetic content types,
which can be either 2- or 3-dimensional. In addition to the media objects
mentioned above and shown in Figure 1, MPEG-4 defines the coded representation
of objects such as:
- Text and graphics;
- Talking synthetic heads and associated text
used to synthesize the speech and animate the head; animated bodies to go with
the faces;
- Synthetic sound.
A media object in its coded form consists of
descriptive elements that allow handling the object in an audiovisual scene as
well as of associated streaming data, if needed. It is important to note that in
its coded form, each media object can be represented independent of its
surroundings or background. The coded representation of media objects is as
efficient as possible while taking into account the desired functionalities.
Examples of such functionalities are error robustness, easy extraction and
editing of an object, or having an object available in a scalable
form.
1.2 Composition of media
objects
Figure 1 explains the way in which an
audiovisual scene in MPEG-4 is described as composed of individual objects. The
figure contains compound media objects that group primitive media objects
together. Primitive media objects correspond to leaves in the descriptive tree
while compound media objects encompass entire sub-trees. As an example: the
visual object corresponding to the talking person and the corresponding voice
are tied together to form a new compound media object, containing both the aural
and visual components of that talking person.
Such grouping allows
authors to construct complex scenes, and enables consumers to manipulate
meaningful (sets of) objects.
More generally, MPEG-4 provides a standardized
way to describe a scene, allowing for example to:
- Place media objects anywhere in a given
coordinate system;
- Apply transforms to change the geometrical or
acoustical appearance of a media object;
- Group primitive media objects in order to
form compound media objects;
- Apply streamed data to media objects, in
order to modify their attributes (e.g. a sound, a moving texture belonging to
an object; animation parameters driving a synthetic face);
- Change, interactively, the user’s viewing and
listening points anywhere in the scene.
The scene description builds on several concepts
from the Virtual Reality Modeling Language (VRML), in terms of both its
structure and the functionality of object composition nodes, and extends it to
fully enable the aforementioned features.
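The hierarchy of primitive and compound media objects described in this section can be pictured as a small tree. The following Python sketch is illustrative only: the class names `MediaObject` and `CompoundObject` and the helper `leaves` are assumptions made for this example and do not correspond to the MPEG-4 scene description syntax.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MediaObject:
    """A primitive media object: a leaf of the scene tree (hypothetical class)."""
    name: str
    position: Tuple[float, float] = (0.0, 0.0)  # placement in the parent's coordinate system


@dataclass
class CompoundObject:
    """A compound media object: groups primitive or other compound objects."""
    name: str
    children: List[object] = field(default_factory=list)


def leaves(node) -> List[str]:
    """Return the names of all primitive media objects below a node."""
    if isinstance(node, MediaObject):
        return [node.name]
    names: List[str] = []
    for child in node.children:
        names.extend(leaves(child))
    return names


# The talking person groups a video object and the associated voice.
person = CompoundObject("person", [MediaObject("sprite"), MediaObject("voice")])
scene = CompoundObject("scene", [MediaObject("background"), person])
```

Calling `leaves(person)` returns only the two objects grouped under the talking person, which is the kind of meaningful subset a consumer could manipulate as a unit.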
Figure 1 - an example of an MPEG-4
Scene
1.3 Description and
synchronization of streaming data for media objects
Media objects may need streaming data, which is
conveyed in one or more elementary streams. An object descriptor identifies all
streams associated to one media object. This allows handling hierarchically
encoded data as well as the association of meta-information about the content
(called ‘object content information’) and the intellectual property rights
associated with it.
Each stream itself is characterized by a set of
descriptors for configuration information, e.g., to determine the required
decoder resources and the precision of encoded timing information. Furthermore,
the descriptors may carry hints about the Quality of Service (QoS) the stream
requests for transmission (e.g., maximum bit rate, bit error rate,
priority).
Synchronization of elementary streams is achieved through
time stamping of individual access units within elementary streams. The
synchronization layer manages the identification of such access units and the
time stamping. Independent of the media type, this layer allows identification
of the type of access unit (e.g., video or audio frames, scene description
commands) in elementary streams, recovery of the media object’s or scene
description’s time base, and it enables synchronization among them. The syntax
of this layer is configurable in a large number of ways, allowing use in a broad
spectrum of systems.
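As a rough illustration of synchronization through time stamps, the Python sketch below merges access units from two elementary streams into one presentation order. The `AccessUnit` class, the function name `presentation_order` and the use of millisecond time stamps on a shared time base are all assumptions of this sketch; the actual synchronization layer syntax is configurable and is not modelled here.

```python
import heapq
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class AccessUnit:
    """One time-stamped unit of an elementary stream (hypothetical model)."""
    timestamp_ms: int   # time stamp on a shared time base (assumed unit: ms)
    stream_id: int
    kind: str           # e.g. "video", "audio", "scene"


def presentation_order(*streams: Iterable[AccessUnit]) -> List[AccessUnit]:
    """Merge streams that are each sorted by time stamp into one ordered list."""
    return list(heapq.merge(*streams, key=lambda au: au.timestamp_ms))


video = [AccessUnit(0, 1, "video"), AccessUnit(40, 1, "video"), AccessUnit(80, 1, "video")]
audio = [AccessUnit(0, 2, "audio"), AccessUnit(24, 2, "audio"), AccessUnit(48, 2, "audio")]
merged = presentation_order(video, audio)
```

Because every access unit carries its own time stamp, the merge does not need to know the media type, which mirrors the media-independent role of the layer described above.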
1.4 Delivery of streaming
data
The synchronized delivery of streaming
information from source to destination, exploiting different QoS as available
from the network, is specified in terms of the synchronization layer and a
delivery layer containing a two-layer multiplexer, as depicted in Figure
2.
The first multiplexing layer is managed
according to the DMIF specification, Part 6 of the MPEG-4 standard (DMIF stands
for Delivery Multimedia Integration Framework). This multiplex may be embodied by
the MPEG-defined FlexMux tool, which allows grouping of Elementary Streams (ESs)
with a low multiplexing overhead. Multiplexing at this layer may be used, for
example, to group ESs with similar QoS requirements, or to reduce the number of
network connections or the end-to-end delay.
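The idea of grouping elementary streams with similar QoS requirements can be sketched as follows. This is not the FlexMux syntax: the function `group_by_qos`, the stream identifiers and the QoS class labels are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_qos(streams: List[Tuple[int, str]]) -> Dict[str, List[int]]:
    """Group elementary stream IDs by a (hypothetical) QoS class label.

    Each group could then share one multiplexed channel, so the number of
    network connections equals the number of distinct QoS classes.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for es_id, qos_class in streams:
        groups[qos_class].append(es_id)
    return dict(groups)


# (ES id, QoS class): both the ids and the class labels are made up.
streams = [(1, "low-delay"), (2, "low-delay"), (3, "best-effort"), (4, "low-delay")]
channels = group_by_qos(streams)
```

In this toy case four elementary streams would need two connections instead of four.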
The “TransMux” (Transport
Multiplexing) layer in Figure 2 models the layer that offers transport services
matching the requested QoS. Only the interface to this layer is specified by
MPEG-4 while the concrete mapping of the data packets and control signaling must
be done in collaboration with the bodies that have jurisdiction over the
respective transport protocol. Any suitable existing transport protocol stack
such as (RTP)/UDP/IP, (AAL5)/ATM, or MPEG-2’s Transport Stream over a suitable
link layer may become a specific TransMux instance. The choice is left to the
end user/service provider, and allows MPEG-4 to be used in a wide variety of
operation environments.
Figure 2 - The MPEG-4 System Layer
Model
Use of the FlexMux multiplexing tool is optional
and, as shown in Figure 2, this layer may be empty if the underlying TransMux
instance provides all the required functionality. The synchronization layer,
however, is always present.
With regard to Figure 2, it is possible
to:
- Identify access units, transport timestamps
and clock reference information, and identify data loss;
- Optionally interleave data from different
elementary streams into FlexMux streams;
- Convey control information to:
  - Indicate the required QoS for each elementary
  stream and FlexMux stream;
  - Translate such QoS requirements into actual
  network resources;
  - Associate elementary streams to media
  objects;
  - Convey the mapping of elementary streams to
  FlexMux and TransMux channels.
Parts of the control functionalities are
available only in conjunction with a transport control entity like the DMIF
framework.
1.5 Interaction with
media objects
In general, the user observes a scene that is
composed following the design of the scene’s author. Depending on the degree of
freedom allowed by the author, however, the user has the possibility to interact
with the scene. Operations a user may be allowed to perform include:
- Change the viewing/listening point of the
scene, e.g. by navigation through a scene;
- Drag objects in the scene to a different
position;
- Trigger a cascade of events by clicking on a
specific object, e.g. starting or stopping a video stream;
- Select the desired language when multiple
language tracks are available.
More complex kinds of behavior can also be
triggered, e.g. a virtual phone rings, the user answers and a communication link
is established.
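One way to picture author-controlled interaction is a table of routes from clickable objects to actions. The sketch below is a toy model under that assumption; the `Scene` and `VideoStream` classes and their methods are hypothetical and are not taken from the standard.

```python
from typing import Callable, Dict, List


class Scene:
    """Toy scene that routes user clicks to author-defined handlers (hypothetical)."""

    def __init__(self) -> None:
        self._routes: Dict[str, List[Callable[[], None]]] = {}

    def on_click(self, object_name: str, handler: Callable[[], None]) -> None:
        """The author decides which objects react to clicks."""
        self._routes.setdefault(object_name, []).append(handler)

    def click(self, object_name: str) -> int:
        """Run the handlers for an object; objects without routes ignore the click."""
        handlers = self._routes.get(object_name, [])
        for handler in handlers:
            handler()
        return len(handlers)


class VideoStream:
    """Stand-in for a video stream that can be started or stopped."""

    def __init__(self) -> None:
        self.playing = False

    def toggle(self) -> None:
        self.playing = not self.playing


video = VideoStream()
scene = Scene()
scene.on_click("play_button", video.toggle)
```

Clicking an object for which the author defined no route has no effect, which reflects the point that interaction stays within the limits the author allows.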
1.6 Management and
Identification of Intellectual Property
It is important to have the possibility to
identify intellectual property in MPEG-4 media objects. Therefore, MPEG has
worked with representatives of different creative industries in the definition
of syntax and tools to support this. A full elaboration of the requirements for
the identification of intellectual property can be found in ‘Management and
Protection of Intellectual Property in MPEG-4’, which is publicly available from
the MPEG home page.
MPEG-4 incorporates the identification of intellectual
property by storing unique identifiers, which are issued by international
numbering systems (e.g. ISAN, ISRC). These numbers can be applied to
identify a current rights holder of a media object. Since not all content is
identified by such a number, MPEG-4 Version 1 offers the possibility to identify
intellectual property by a key-value pair (e.g. "composer"/"John Smith"). Also,
MPEG-4 offers a standardized interface, integrated tightly into the
Systems layer, to people who want to use systems that control access to
intellectual property. With this interface, proprietary control systems can be
easily amalgamated with the standardized part of the decoder.
2. Versions in
MPEG-4
MPEG-4 Version 1 was approved by MPEG
in December 1998; version 2 was frozen in December 1999. After these two major
versions, more tools were added in subsequent amendments that could be qualified
as versions, even though they are harder to recognize as such. Recognizing the
versions is not too important, however; it is more important to distinguish
Profiles. Existing tools and profiles from any version are never replaced in
subsequent versions; technology is always added to MPEG-4 in the form of new
profiles. Figure 3 below depicts the relationship between the versions. Version
2 is a backward compatible extension of Version 1, and version 3 is a backward
compatible extension of Version 2 – and so on. The versions of all major parts
of the MPEG-4 Standard (Systems, Audio, Video, DMIF) were synchronized; after
that, the different parts took their own paths.
Figure 3 - relation between
MPEG-4 Versions
The Systems layer of later
versions is backward compatible with all earlier versions. In the areas of
Systems, Audio and Visual, new versions add Profiles but do not change existing
ones. In fact, it is very important to note that existing systems will
always remain compliant, because Profiles will never be changed in
retrospect, and neither will the Systems syntax, at least not in a
backward-incompatible way.
3. Major Functionalities in
MPEG-4
This section lists, in an itemized fashion,
the major functionalities that the different parts of the MPEG-4 Standard offer
in the finalized MPEG-4 Version 1. Descriptions of the functionalities can be
found in the following sections.
3.1 Transport
In principle, MPEG-4 does not define transport
layers. In a number of cases, adaptation to a specific existing transport layer
has been defined:
- Transport over MPEG-2 Transport Stream (this
is an amendment to MPEG-2 Systems)
- Transport over IP (In cooperation with IETF,
the Internet Engineering Task Force)
3.2 DMIF
DMIF, or Delivery Multimedia Integration
Framework, is an interface between the application and the transport that
allows the MPEG-4 application developer to stop worrying about that transport. A
single application can run on different transport layers when supported by the
right DMIF instantiation.
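The principle of hiding the transport behind one interface can be sketched as below. This is only an analogy in Python: the `Delivery` base class, its `fetch` method and the two implementations are hypothetical and do not reproduce the DMIF-Application Interface defined by the standard.

```python
from abc import ABC, abstractmethod
from typing import Dict


class Delivery(ABC):
    """Hypothetical delivery interface; not the normative DMIF API."""

    @abstractmethod
    def fetch(self, channel: str) -> bytes:
        """Return the data available on a named channel."""


class LocalFileDelivery(Delivery):
    """Stands in for local storage: data is held in a dictionary."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files

    def fetch(self, channel: str) -> bytes:
        return self._files[channel]


class BroadcastDelivery(Delivery):
    """Stands in for a broadcast source that repeats one payload on every channel."""

    def __init__(self, payload: bytes) -> None:
        self._payload = payload

    def fetch(self, channel: str) -> bytes:
        return self._payload


def application(delivery: Delivery) -> int:
    """The application sees only the interface, never the transport behind it."""
    return len(delivery.fetch("video"))
```

The same `application` function runs unchanged against either implementation, which is the property the text attributes to a suitable DMIF instantiation.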
MPEG-4 DMIF supports the following
functionalities:
- A transparent MPEG-4 DMIF-application
interface irrespective of whether the peer is a remote interactive peer,
broadcast or local storage media.
- Control of the establishment of FlexMux
channels
- Use of homogeneous networks between
interactive peers: IP, ATM, mobile, PSTN, Narrowband ISDN.
- Support for mobile networks, developed
together with ITU-T
- UserCommands with acknowledgment
messages.
- Management of MPEG-4 Sync Layer
information.
3.3 Systems
As explained above, MPEG-4 defines a toolbox of
advanced compression algorithms for audio and visual information. The data
streams (Elementary Streams, ES) that result from the coding process can be
transmitted or stored separately, and need to be composed so as to create the
actual multimedia presentation at the receiver side.
The Systems
part of MPEG-4 addresses the description of the relationship between the
audio-visual components that constitute a scene. The relationship is described
at two main levels.
- The Binary Format for Scenes (BIFS) describes
the spatio-temporal arrangements of the objects in the scene. Viewers may have
the possibility of interacting with the objects, e.g. by rearranging them on
the scene or by changing their own point of view in a 3D virtual environment.
The scene description provides a rich set of nodes for 2-D and 3-D composition
operators and graphics primitives.
- At a lower level, Object Descriptors (ODs)
define the relationship between the Elementary Streams pertinent to each
object (e.g. the audio and the video stream of a participant in a
videoconference). ODs also provide additional information such as the URL
needed to access the Elementary Streams, the characteristics of the decoders
needed to parse them, intellectual property information and others.
Other issues addressed by MPEG-4
Systems:
- A standard file format supports the exchange
and authoring of MPEG-4 content
- Interactivity, including: client and
server-based interaction; a general event model for triggering events or
routing user actions; general event handling and routing between objects in
the scene, upon user or scene triggered events.
- Java (MPEG-J) can be used to query the
terminal and the support its environment offers; there is also a Java
application engine to code 'MPEGlets'.
- A tool for interleaving of multiple streams
into a single stream, including timing information (FlexMux tool).
- A tool for storing MPEG-4 data in a file (the
MPEG-4 File Format, ‘MP4’)
- Interfaces to various aspects of the terminal
and networks, in the form of Java APIs (MPEG-J)
- Transport layer independence. Mappings to
relevant transport protocol stacks, like (RTP)/UDP/IP or MPEG-2 transport
stream can be or are being defined jointly with the responsible
standardization bodies.
- Text representation with international language
support, font and font style selection, timing and synchronization.
- The initialization and continuous management
of the receiving terminal’s buffers.
- Timing identification, synchronization and
recovery mechanisms.
- Datasets covering identification of
Intellectual Property Rights relating to media objects
3.4 Audio
MPEG-4 Audio facilitates a wide variety of
applications which could range from intelligible speech to high quality
multichannel audio, and from natural sounds to synthesized sounds. In
particular, it supports the highly efficient representation of audio objects
consisting of:
3.4.1 General Audio
Signals
Support for coding general audio ranging from
very low bitrates up to high quality is provided by transform coding techniques.
With this functionality, a wide range of bitrates and bandwidths is covered. It
starts at a bitrate of 6 kbit/s and a bandwidth below 4 kHz and extends to
broadcast quality audio from mono up to multichannel. High quality can be
achieved with low delays. Parametric Audio Coding allows sound manipulation at
low speeds. Fine Granularity Scalability (FGS) is also available, with a
scalability resolution down to 1 kbit/s per channel.
3.4.2 Speech signals
Speech coding can be done using bitrates from 2
kbit/s up to 24 kbit/s using the speech coding tools. Lower bitrates, such as an
average of 1.2 kbit/s, are also possible when variable rate coding is allowed.
Low delay is possible for communications applications. When using the HVXC
tools, speed and pitch can be modified under user control during playback. If
the CELP tools are used, a change of the playback speed can be achieved by using
an additional tool for effects processing.
3.4.3 Synthetic Audio
MPEG-4 Structured Audio is a language to
describe 'instruments' (little programs that generate sound) and 'scores' (input
that drives those objects). These objects are not necessarily musical
instruments; they are in essence mathematical formulae that could generate the
sound of a piano, that of falling water – or something 'unheard' in
nature.
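The split between 'instruments' and 'scores' can be imitated in a few lines of ordinary Python. The sketch below is not written in the Structured Audio language itself; the sample rate, the `sine_instrument` function and the tuple layout of the score are assumptions chosen for illustration.

```python
import math
from typing import Callable, List, Tuple

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)

Instrument = Callable[[float, float], List[float]]


def sine_instrument(frequency_hz: float, duration_s: float) -> List[float]:
    """An 'instrument': a small program that turns parameters into samples."""
    count = int(SAMPLE_RATE * duration_s)
    return [math.sin(2 * math.pi * frequency_hz * n / SAMPLE_RATE) for n in range(count)]


def render(score: List[Tuple[Instrument, float, float]]) -> List[float]:
    """A 'score': a list of events, each driving an instrument in sequence."""
    samples: List[float] = []
    for instrument, frequency_hz, duration_s in score:
        samples.extend(instrument(frequency_hz, duration_s))
    return samples


score = [(sine_instrument, 440.0, 0.5), (sine_instrument, 880.0, 0.25)]
audio = render(score)
```

Only the formula and the short event list need to be conveyed; the samples themselves are generated at the receiving end.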
3.4.4 Synthesized Speech
Scalable TTS coders operate at bitrates from 200 bit/s to 1.2 kbit/s. They
take a text, or a text with prosodic parameters (pitch contour,
phoneme duration, and so on), as input to generate intelligible synthetic
speech.
3.5 Visual
The MPEG-4 Visual standard allows the hybrid
coding of natural (pixel based) images and video together with synthetic
(computer generated) scenes. This enables, for example, the virtual presence of
videoconferencing participants. To this end, the Visual standard comprises tools
and algorithms supporting the coding of natural (pixel based) still images and
video sequences as well as tools to support the compression of synthetic 2-D and
3-D graphic geometry parameters (i.e. compression of wire grid parameters,
synthetic text).
The subsections below give an itemized overview
of the functionalities that the tools and algorithms of the MPEG-4 Visual
standard support.
3.5.1 Formats Supported
The following formats and bitrates are
supported by MPEG-4 Visual:
- Bitrates: typically between 5 kbit/s and more
than 1 Gbit/s
- Formats: progressive as well as interlaced
video
- Resolutions: typically from sub-QCIF to
'Studio' resolutions (4k x 4k pixels)
3.5.2 Compression Efficiency
- For all bit rates addressed, the algorithms
are very efficient. This includes the compact coding of textures with a
quality adjustable between "acceptable" for very high compression ratios up to
"near lossless".
- Efficient compression of textures for texture
mapping on 2-D and 3-D meshes.
- Random access of video to allow
functionalities such as pause, fast forward and fast reverse of stored
video.
3.5.3 Content-Based
Functionalities
- Content-based coding of images and
video allows separate decoding and reconstruction of arbitrarily shaped video
objects.
- Random access of content in video
sequences allows functionalities such as pause, fast forward and fast reverse
of stored video objects.
- Extended manipulation of content in
video sequences allows functionalities such as warping of synthetic or natural
text, textures, image and video overlays on reconstructed video content. An
example is the mapping of text in front of a moving video object where the
text moves coherently with the object.
3.5.4 Scalability of Textures, Images and
Video
- Complexity scalability in the encoder
allows encoders of different complexity to generate valid and meaningful
bitstreams for a given texture, image or video.
- Complexity scalability in the decoder
allows a given texture, image or video bitstream to be decoded by decoders of
different levels of complexity. The reconstructed quality, in general, is
related to the complexity of the decoder used. This may entail that less
powerful decoders decode only a part of the bitstream.
- Spatial scalability allows decoders to
decode a subset of the total bitstream generated by the encoder to reconstruct
and display textures, images and video objects at reduced spatial resolution.
A maximum of 11 levels of spatial scalability are supported in so-called
'fine-granularity scalability', for video as well as textures and still
images.
- Temporal scalability allows decoders
to decode a subset of the total bitstream generated by the encoder to
reconstruct and display video at reduced temporal resolution. A maximum of
three levels are supported.
- Quality scalability allows a bitstream
to be parsed into a number of bitstream layers of different bitrate such that
the combination of a subset of the layers can still be decoded into a
meaningful signal. The bitstream parsing can occur either during transmission
or in the decoder. The reconstructed quality, in general, is related to the
number of layers used for decoding and reconstruction.
- Fine Grain Scalability – a combination
of the above in fine grain steps, up to 11 steps
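The common idea behind these scalability modes, decoding only a subset of the layers in a bitstream, can be sketched as a simple selection under a bitrate budget. The function `select_layers`, the layer names and the bitrates are invented for this example, and the rule that each layer depends on all lower layers is a simplifying assumption.

```python
from typing import List, Tuple


def select_layers(layers: List[Tuple[str, int]], budget_kbps: int) -> List[str]:
    """Pick the base layer plus as many consecutive enhancement layers as fit.

    `layers` is ordered from base layer upwards; each entry is
    (name, bitrate in kbit/s). Layers are assumed to depend on all lower
    layers, so selection stops at the first layer that does not fit.
    """
    chosen: List[str] = []
    used = 0
    for name, rate in layers:
        if used + rate > budget_kbps:
            break
        chosen.append(name)
        used += rate
    return chosen


# Layer names and bitrates are invented for the example.
layers = [("base", 64), ("enh1", 64), ("enh2", 128)]
```

The same selection could run in a network node during transmission or in the decoder, matching the two places where the text says bitstream parsing can occur.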
3.5.5 Shape and Alpha Channel
Coding
- Shape coding assists the description
and composition of conventional images and video as well as arbitrarily shaped
video objects. Applications that benefit from binary shape maps with images
are content-based image representations for image databases, interactive
games, surveillance, and animation. There is an efficient technique to code
binary shapes. A binary alpha map defines whether or not a pixel belongs to an
object. It can be ‘on’ or ‘off’.
- ‘Gray Scale’ or ‘alpha’ Shape
Coding. An alpha plane defines the
‘transparency’ of an object, which is not necessarily uniform; it can vary
over the object, so that, e.g., edges are more transparent (a technique called
feathering). Multilevel alpha maps are frequently used to blend different
layers of image sequences. Other applications that benefit from associated
binary alpha maps with images are content-based image representations for
image databases, interactive games, surveillance, and animation.
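The difference between a binary alpha map and a gray-scale alpha plane shows up when an object is composited over a background. The sketch below blends one row of pixels; the function name `blend` and the 8-bit convention (255 for opaque, 0 for transparent) are assumptions of this example rather than details taken from the standard.

```python
from typing import List


def blend(foreground: List[int], background: List[int], alpha: List[int]) -> List[int]:
    """Composite one row of pixels using an 8-bit alpha map.

    alpha = 255 means the object pixel is fully opaque, 0 means fully
    transparent (this 0..255 convention is an assumption of the sketch).
    A binary alpha map is the special case where alpha is only 0 or 255.
    """
    return [
        (a * fg + (255 - a) * bg + 127) // 255  # rounded integer blend
        for fg, bg, a in zip(foreground, background, alpha)
    ]


row = blend(foreground=[200, 200, 200], background=[0, 0, 0], alpha=[255, 128, 0])
```

Intermediate alpha values near an object's edge produce the gradual transition that the text calls feathering.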
3.5.6 Robustness in Error Prone
Environments
Error resilience allows accessing image
and video over a wide range of storage and transmission media. This includes the
useful operation of image and video compression algorithms in error-prone
environments at low bit-rates (i.e., less than 64 kbit/s). There are tools that
address both the band-limited nature and error resiliency aspects of access over
wireless networks.
3.5.7 Face and Body Animation
The ‘Face and Body Animation’ tools in the
standard allow sending parameters that can define, calibrate and animate
synthetic faces and bodies. These models themselves are not standardized by
MPEG-4, only the parameters are, although there is a way to send, e.g., a
well-defined face to a decoder.
The tools include:
- Definition and coding of face and body
animation parameters (model independent):
  - Feature point positions and orientations to
  animate the face and body definition meshes
  - Visemes, or visual lip configurations
  equivalent to speech phonemes
- Definition and coding of face and body
definition parameters (for model calibration):
  - 3-D feature point positions
  - 3-D head calibration meshes for
  animation
  - Personal characteristics
- Facial texture coding
3.5.8 Coding of 2-D Meshes with Implicit
Structure
2-D mesh coding includes:
- Mesh-based prediction and animated texture
transfiguration:
  - 2-D Delaunay or regular mesh formalism with
  motion tracking of animated objects
  - Motion prediction and suspended texture
  transmission with dynamic meshes
- Geometry compression for motion
vectors:
  - 2-D mesh compression with implicit structure
  and decoder reconstruction
3.5.9 Coding of 3-D Polygonal
Meshes
MPEG-4 provides a suite of tools for coding 3-D
polygonal meshes. Polygonal meshes are widely used as a generic representation
of 3-D objects. The underlying technologies compress the connectivity, geometry,
and properties such as shading normals, colors and texture coordinates of 3-D
polygonal meshes.
The Animation Framework eXtension (AFX, see
further down) will provide more elaborate tools for 2D and 3D synthetic
objects.
4. Extensions Underway
MPEG is currently working on a number of
extensions:
4.1 IPMP
Extension
4.2 The Animation
Framework eXtension, AFX
The Animation Framework eXtension (AFX –
pronounced ‘effects’) provides an integrated toolbox for building attractive and
powerful synthetic MPEG-4 environments. The framework defines a collection of
interoperable tool categories that collaborate to produce a reusable
architecture for interactive animated contents. In the context of AFX, a tool
represents functionality such as a BIFS node, a synthetic stream, or an
audio-visual stream.
AFX utilizes and enhances existing MPEG-4 tools,
while keeping backward-compatibility, by offering:
- Higher-level descriptions of animations (e.g.
inverse kinematics)
- Enhanced rendering (e.g. multi-texturing,
procedural texturing)
- Compact representations (e.g. piecewise curve
interpolators, subdivision surfaces)
- Low bitrate animations (e.g. using
interpolator compression and dead-reckoning)
- Scalability based on terminal capabilities
(e.g. parametric surfaces tessellation)
- Interactivity at user level, scene level, and
client-server session level
- Compression of representations for static and
dynamic tools
Compression of animated paths and animated
models is required for improving the transmission and storage efficiency of
representations for dynamic and static tools.
4.3 Multi User
Worlds
4.4
Advanced Video Coding
Work is ongoing on MPEG-4 Part 10, 'Advanced
Video Coding'. This codec is being developed jointly with ITU-T, in the
so-called Joint Video Team (JVT). The JVT unites the standards world's video
coding experts in a single group. The work currently underway is based on
earlier work in ITU-T on 'H.26L'. H.26L and MPEG-4 Part 10 will be the same.
(H.26L will be renamed when it is done. The final name may be H.264, but that is
not yet certain.) MPEG-4 AVC/H.26L is slated to be ready by the end of
2002.
4.5 Audio
extensions
There are two work items underway for improving
audio coding efficiency even further.
a) Bandwidth extension
Bandwidth extension is a tool that gives a
better quality perception over the existing audio signal, while keeping the
existing signal backward compatible.
MPEG is investigating bandwidth extensions, and
may standardize extensions for one or both of:
- General audio signals, to extend the
capabilities currently provided by MPEG-4 general audio coders.
- Speech signals, to extend the capabilities
currently provided by MPEG-4 speech coders.
A single technology that addresses both of these
signals is preferred. This technology shall be both forward and backward
compatible with existing MPEG-4 technology. In other words, an MPEG-4 decoder
can decode an enhanced stream and a new technology decoder can decode an MPEG-4
stream. There are two possible configurations for the enhanced stream: MPEG-4
AAC streams can carry the enhancement information in the DataStreamElement,
while all MPEG-4 systems know the concept of elementary streams, which allows a
second Elementary Stream for a given audio object, containing the enhancement
information.
b) Parametric coding
The MPEG-4 standard already provides a
parametric coding scheme for coding of general audio signals for low bit-rates (HILN, "Harmonic Individual Lines and Noise"). The extension investigates
parametric coding of general audio signals for the higher quality range, to
extend the capabilities currently provided by HILN. Whenever possible this
technology will build upon the existing MPEG-4 HILN technology.
5. Profiles in MPEG-4
MPEG-4 provides a large and rich set of tools
for the coding of audio-visual objects. In order to allow effective
implementations of the standard, subsets of the MPEG-4 Systems, Visual, and
Audio tool sets have been identified, that can be used for specific
applications. These subsets, called ‘Profiles’, limit the tool set a decoder has
to implement. For each of these Profiles, one or more Levels have been set,
restricting the computational complexity. The approach is similar to MPEG-2,
where the most well known Profile/Level combination is ‘Main Profile @ Main
Level’. A Profile@Level combination allows:
- a codec builder to implement only the subset
of the standard he needs, while maintaining interworking with other MPEG-4
devices built to the same combination, and
- checking whether MPEG-4 devices comply with
the standard (‘conformance testing’).
Profiles exist for various types of media
content (audio, visual, and graphics) and for scene descriptions. MPEG does not
prescribe or advise combinations of these Profiles, but care has been taken that
good matches exist between the different areas.
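As a rough illustration of how a Profile@Level combination acts as an interworking point, the following sketch treats a Profile as a set of tools and a Level as a complexity limit. The profile name, tool names and bit-rate figure are invented for the example and are not taken from the standard; real Levels constrain more than one quantity.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ConformancePoint:
    profile: str               # hypothetical profile name
    tools: FrozenSet[str]      # tools the profile obliges a decoder to implement
    level: int
    max_bitrate_kbps: int      # one stand-in for the Level's complexity limits


@dataclass(frozen=True)
class Stream:
    tools_used: FrozenSet[str]
    bitrate_kbps: int


def can_decode(decoder: ConformancePoint, stream: Stream) -> bool:
    """A stream is decodable if it stays inside both the tool set and the limits."""
    within_profile = stream.tools_used <= decoder.tools
    within_level = stream.bitrate_kbps <= decoder.max_bitrate_kbps
    return within_profile and within_level


decoder = ConformancePoint("ProfileX", frozenset({"tool_a", "tool_b"}), level=1,
                           max_bitrate_kbps=100)

ok_stream = Stream(frozenset({"tool_a"}), bitrate_kbps=80)
extra_tool_stream = Stream(frozenset({"tool_a", "tool_c"}), bitrate_kbps=80)
too_fast_stream = Stream(frozenset({"tool_a"}), bitrate_kbps=500)
```

Here `can_decode(decoder, ok_stream)` holds, while the other two streams fail: one uses a tool outside the Profile, the other exceeds the Level.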
5.1 Visual Profiles
The visual part of the standard provides
profiles for the coding of natural, synthetic, and synthetic/natural hybrid
visual content. There are five profiles for natural video content:
- The Simple Visual Profile provides
efficient, error resilient coding of rectangular video objects, suitable for
applications on mobile networks, such as PCS and IMT2000.
- The Simple Scalable Visual Profile
adds support for coding of temporal and spatial scalable objects to the Simple
Visual Profile. It is useful for applications which provide services at more
than one level of quality due to bit-rate or decoder resource limitations,
such as Internet use and software decoding.
- The Core Visual Profile adds support
for coding of arbitrary-shaped and temporally scalable objects to the Simple
Visual Profile. It is useful for applications such as those providing
relatively simple content-interactivity (Internet multimedia
applications).
- The Main Visual Profile adds support
for coding of interlaced, semi-transparent, and sprite objects to the Core
Visual Profile. It is useful for interactive and entertainment-quality
broadcast and DVD applications.
- The N-Bit Visual Profile adds support
for coding video objects having pixel-depths ranging from 4 to 12 bits to the
Core Visual Profile. It is suitable for use in surveillance
applications.
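The "adds support ... to" relationships in the list above form a small inheritance tree. A minimal sketch, using shorthand capability labels that paraphrase the list (the labels themselves are not terms from the standard), could record each profile's parent and additions and derive the full capability set by walking up the tree:

```python
from typing import Dict, Optional, Set, Tuple

# profile name -> (parent profile, capabilities added on top of the parent)
NATURAL_VIDEO_PROFILES: Dict[str, Tuple[Optional[str], Set[str]]] = {
    "Simple": (None, {"rectangular objects", "error resilience"}),
    "Simple Scalable": ("Simple", {"temporal scalability", "spatial scalability"}),
    "Core": ("Simple", {"arbitrary shape", "temporal scalability"}),
    "Main": ("Core", {"interlaced", "semi-transparent objects", "sprite objects"}),
    "N-Bit": ("Core", {"4 to 12 bit pixel depth"}),
}


def capabilities(profile: Optional[str]) -> Set[str]:
    """Union of the capabilities of a profile and all of its ancestors."""
    result: Set[str] = set()
    while profile is not None:
        parent, added = NATURAL_VIDEO_PROFILES[profile]
        result |= added
        profile = parent
    return result


main_caps = capabilities("Main")
```

With this table, `capabilities("Main")` contains everything in `capabilities("Core")`, which in turn contains everything in `capabilities("Simple")`, whereas spatial scalability appears only under Simple Scalable.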
The profiles for synthetic and synthetic/natural
hybrid visual content are:
- The Simple Facial Animation Visual
Profile provides a simple means to animate a face model, suitable for
applications such as audio/video presentation for the hearing impaired.
- The Scalable Texture Visual Profile
provides spatial scalable coding of still image (texture) objects useful for
applications needing multiple scalability levels, such as mapping texture onto
objects in games, and high-resolution digital still cameras.
- The Basic Animated 2-D Texture Visual
Profile provides spatial scalability, SNR scalability, and mesh-based
animation for still image (textures) objects and also simple face object
animation.
- The Hybrid Visual Profile combines the
ability to decode arbitrary-shaped and temporally scalable natural video
objects (as in the Core Visual Profile) with the ability to decode several
synthetic and hybrid objects, including simple face and animated still image
objects. It is suitable for various content-rich multimedia
applications.
Version 2 adds the following Profiles for
natural video:
- The Advanced Real-Time Simple Profile
(ARTS) provides advanced error resilient coding techniques of rectangular
video objects using a back channel and improved temporal resolution stability
with low buffering delay. It is suitable for real time coding
applications, such as videophones, tele-conferencing and remote
observation.
- The Core Scalable Profile adds support
for coding of temporal and spatial scalable arbitrarily shaped objects to the
Core Profile. The main functionality of this profile is object based SNR and
spatial/temporal scalability for regions or objects of interest. It is useful
for applications such as the Internet, mobile and broadcast.
- The Advanced Coding Efficiency (ACE)
Profile improves the coding efficiency for both rectangular and arbitrary
shaped objects. It is suitable for applications such as mobile broadcast
reception, the acquisition of image sequences (camcorders) and other
applications where high coding efficiency is requested and small footprint is
not the prime concern.
The Version 2 profiles for synthetic and
synthetic/natural hybrid visual content are:
- The Advanced Scalable Texture Profile
supports decoding of arbitrary-shaped texture and still images including
scalable shape coding, wavelet tiling and error-resilience. It is useful for
applications that require fast random access as well as multiple scalability
levels and arbitrary-shaped coding of still objects. Examples are fast
content-based still image browsing on the Internet, multimedia-enabled PDAs,
and Internet-ready high-resolution digital still cameras.
- The Advanced Core Profile combines the
ability to decode arbitrary-shaped video objects (as in the Core Visual
Profile) with the ability to decode arbitrary-shaped scalable still image
objects (as in the Advanced Scalable Texture Profile). It is suitable for
various content-rich multimedia applications such as interactive multimedia
streaming over the Internet.
- The Simple Face and Body Animation
Profile is a superset of the Simple Face Animation Profile, adding -
obviously - body animation.
In subsequent Versions, the following Profiles
were added:
- The Advanced Simple Profile looks much
like Simple in that it has only rectangular objects, but it has a few extra
tools that make it more efficient: B-frames, ¼ pel motion compensation, extra
quantization tables and global motion compensation.
- The Fine Granularity Scalability
Profile allows truncation of the enhancement layer bitstream at any bit
position so that delivery quality can easily adapt to transmission and
decoding circumstances. It can be used with Simple or Advanced Simple as a
base layer.
- The Simple Studio Profile is a profile
with very high quality for usage in Studio editing applications. It only has I
frames, but it does support arbitrary shape and in fact multiple alpha
channels. Bitrates go up to almost 2 Gigabit per second.
- The Core Studio Profile adds P frames
to Simple Studio, making it more efficient but also requiring more complex
implementations.
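The truncation property described for the Fine Granularity Scalability Profile can be sketched as follows. This is only a toy model under stated assumptions: layers are represented as strings of '0' and '1' characters, the function name and the bit budget are hypothetical, and the base layer is assumed to be sent in full before any enhancement bits.

```python
def truncate_enhancement(base: str, enhancement: str, budget_bits: int) -> str:
    """Return the bits to send: the whole base layer, then as many
    enhancement-layer bits as still fit into the budget."""
    if budget_bits < len(base):
        raise ValueError("budget too small for the base layer")
    room = budget_bits - len(base)
    # Slicing may cut the enhancement layer at any bit position.
    return base + enhancement[:room]


base_layer = "1010"
enhancement_layer = "110011"

tight = truncate_enhancement(base_layer, enhancement_layer, 7)
generous = truncate_enhancement(base_layer, enhancement_layer, 100)
base_only = truncate_enhancement(base_layer, enhancement_layer, 4)
```

A tighter budget simply yields a shorter enhancement prefix, which is what lets delivery quality follow the available transmission capacity.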
5.2 Audio Profiles
Four Audio Profiles have been defined in MPEG-4
V.1:
- The Speech Profile provides HVXC,
which is a very-low bit-rate parametric speech coder, a CELP
narrowband/wideband speech coder, and a Text-To-Speech interface.
- The Synthesis Profile provides score
driven synthesis using SAOL and wavetables and a Text-to-Speech Interface to
generate sound and speech at very low bitrates.
- The Scalable Profile, a superset of
the Speech Profile, is suitable for scalable coding of speech and music for
networks, such as Internet and Narrow band Audio Digital Broadcasting (NADIB).
The bitrates range from 6 kbit/s to 24 kbit/s, with bandwidths between 3.5
and 9 kHz.
- The Main Profile is a rich superset of
all the other Profiles, containing tools for natural and synthetic
Audio.
Another four Profiles were added in MPEG-4
V.2:
- The High Quality Audio Profile
contains the CELP speech coder and the Low Complexity AAC coder including Long
Term Prediction. Scalable coding can be performed by the AAC Scalable
object type. Optionally, the new error resilient (ER) bitstream syntax may be
used.
- The Low Delay Audio Profile contains
the HVXC and CELP speech coders (optionally using the ER bitstream syntax),
the low-delay AAC coder and the Text-to-Speech interface TTSI.
- The Natural Audio Profile contains all
natural audio coding tools available in MPEG-4, but not the synthetic
ones.
- The Mobile Audio Internetworking
Profile (MAUI) contains the low-delay and scalable AAC object types including
TwinVQ and BSAC. This profile is intended to extend communication applications
using non-MPEG speech coding algorithms with high quality audio coding
capabilities.
5.3 Graphics Profiles
Graphics Profiles define which graphical and
textual elements can be used in a scene. These profiles are defined in the
Systems part of the standard:
- The Simple 2-D Graphics Profile provides
for only those graphics elements of the BIFS
tool that are necessary to place one or more visual objects in a scene.
- The Complete 2-D Graphics Profile provides
two-dimensional graphics
functionalities and supports features such as arbitrary two-dimensional
graphics and text, possibly in conjunction with visual objects.
- The Complete Graphics Profile provides
advanced graphical elements such as elevation grids
and extrusions and allows creating content with sophisticated lighting. The
Complete Graphics profile enables applications such as complex virtual worlds
that exhibit a high degree of realism.
- The 3D Audio Graphics Profile sounds
like a contradiction in terms, but really isn’t. This profile does not propose
visual rendering, but graphics tools are provided to define the acoustical
properties of the scene (geometry, acoustic absorption, diffusion,
transparency of the material). This profile is used for applications that do
environmental spatialization of audio signals. (See Section 12.1.7)
5.3.1 Profiles under Definition or Consideration
The following profiles were under development at
the time of writing this Overview; their inclusion in the standard was highly
likely, but not guaranteed.
- The Simple 2D+Text profile looks like
simple 2D, adding the BIFS nodes to display text which can be colored or
transparent. Like simple 2D, this is a useful profile for low-complexity
audiovisual devices.
- The Core 2D Profile supports fairly
simple 2D graphics and text. Meant for set tops and similar devices, it can do
such things as picture-in-picture, video warping for animated advertisements,
logos, and so on.
- The Advanced 2D profile contains tools
for advanced 2D graphics. Using it, one can implement cartoons, games,
advanced graphical user interfaces, and complex, streamed graphics
animations.
- The X3D Core profile is the only 3D
profile that is likely to be added to MPEG-4. It is compatible with Web3D’s
X3D core profile under development [Web3D], and it gives a rich environment
for games, virtual worlds and other 3D applications.
5.4 Scene Graph Profiles
Scene Graph Profiles (or Scene Description
Profiles), defined in the Systems part of the standard, allow audiovisual scenes
with audio-only, 2-dimensional, 3-dimensional or mixed 2-D/3-D
content.
- The Audio Scene Graph Profile provides
for a set of BIFS scene graph elements for usage in audio only applications.
The Audio Scene Graph profile supports applications like broadcast
radio.
- The Simple 2-D Scene Graph Profile
provides for only those BIFS scene graph elements necessary to place one or
more audio-visual objects in a scene. The Simple 2-D Scene Graph profile
allows presentation of audio-visual content with potential update of the
complete scene but no interaction capabilities. The Simple 2-D Scene Graph
profile supports applications like broadcast television.
- The Complete 2-D Scene Graph Profile
provides for all the 2-D scene description elements of the BIFS tool. It
supports features such as 2-D transformations and alpha blending. The Complete
2-D Scene Graph profile enables 2-D applications that require extensive and
customized interactivity.
- The Complete Scene Graph Profile
provides the complete set of scene graph elements of the BIFS tool. The
Complete Scene Graph profile enables applications like dynamic virtual 3-D
worlds and games.
- The 3D Audio Scene Graph Profile
provides the tools for three-dimensional sound positioning in relation either to
acoustic parameters of the scene or its perceptual attributes. The user can
interact with the scene by changing the position of the sound source, by
changing the room effect or moving the listening point. This Profile is
intended for usage in audio-only applications.
5.4.1 Profiles under definition
At the time of writing, the following profiles
were likely to be defined:
- The basic 2D profile provides basic 2D
composition for very simple scenes with only audio and visual elements. Only
basic 2D composition and audio and video node interfaces are included. These
nodes are required to put an audio or a video object in the
scene.
- The Core 2D profile has tools for creating
scenes with visual and audio objects using basic 2D composition. Included are
quantization tools, local animation and interaction, 2D texturing, Scene tree
updates, and the inclusion of subscenes through weblinks. Also included are
interactive service tools (ServerCommand, MediaControl, and MediaSensor), to
be used in video-on-demand services.
- The Advanced 2D profile forms a full superset
of the basic 2D and core 2D profiles. It adds scripting, the PROTO tool,
BIFS-Anim for streamed animation, local interaction and local 2D composition as
well as advanced audio.
- The Main 2D profile adds the FlexTime model
to Core 2D, as well as Layer2D and WorldInfo nodes and all input sensors. This
profile was designed to be an interoperability point with tSMIL (see [SMIL]).
It provides a very rich set of tools for highly interactive applications on,
e.g., the World Wide Web. This name might still change.
- The X3D core profile was designed to be a
common interworking point between the Web3D specifications [Web3D] and the MPEG-4
standard. The same profile will be in a Web3D specification. It includes
the nodes for an implementation of 3D applications on a low-footprint engine,
reckoning with the limitations of software renderers.
5.5 MPEG-J Profiles
Two MPEG-J Profiles exist, Personal and
Main:
- Personal - a lightweight package for
personal devices.
The Personal profile addresses a range of
constrained devices including mobile and portable devices. Examples of such
devices are cell video phones, PDAs, and personal gaming devices. This profile
includes the following packages of MPEG-J APIs:
- Network
- Scene
- Resource
- Main - includes all the MPEG-J
APIs.
The Main profile addresses a range of consumer
devices including entertainment devices. Examples of such devices are set top
boxes, computer based multimedia systems etc. It is a superset of the Personal
profile. Apart from the packages in the Personal profile, this profile
includes the following packages of the MPEG-J APIs:
- Decoder
- Decoder Functionality
- Section Filter and Service
Information
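Because the Main profile is a superset of the Personal profile, the package lists above can be written down as nested sets. The sketch below uses the package names exactly as listed; the `profiles_supporting` helper is a hypothetical convenience for illustration and is not part of the MPEG-J APIs.

```python
from typing import Dict, FrozenSet, Iterable, List

PERSONAL: FrozenSet[str] = frozenset({"Network", "Scene", "Resource"})
MAIN: FrozenSet[str] = PERSONAL | frozenset({
    "Decoder",
    "Decoder Functionality",
    "Section Filter and Service Information",
})

PROFILES: Dict[str, FrozenSet[str]] = {"Personal": PERSONAL, "Main": MAIN}


def profiles_supporting(required: Iterable[str]) -> List[str]:
    """Names of the MPEG-J profiles that include every required package."""
    needed = set(required)
    return [name for name, packages in PROFILES.items() if needed <= packages]


scene_only = profiles_supporting({"Scene"})
needs_decoder = profiles_supporting({"Scene", "Decoder"})
```

An application that needs only the Scene package can target either profile, while one that also needs the Decoder package is limited to Main.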
5.6 Object Descriptor Profile
The Object Descriptor Profile includes the following
tools:
- Object Descriptor (OD) tool
- Sync Layer (SL) tool
- Object Content Information (OCI) tool
- Intellectual Property Management and
Protection (IPMP) tool
Currently, only one profile is defined that includes all
these tools. The main reason for defining this profile is not subsetting the
tools, but rather defining levels for them. This applies especially to the Sync
Layer tool, as MPEG-4 allows multiple time bases to exist. In the context of
Levels for this Profile, restrictions can be defined, e.g. to allow only a
single time base.
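A Level restriction of this kind could be checked as sketched below. The `ElementaryStream` fields and the `max_time_bases` parameter are assumptions made for the example; the actual Level definitions are given in the standard and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ElementaryStream:
    es_id: int         # hypothetical stream identifier
    time_base_id: int  # hypothetical identifier of the time base the stream follows


def conforms_to_level(streams: Iterable[ElementaryStream], max_time_bases: int) -> bool:
    """True if the streams together use no more distinct time bases than allowed."""
    distinct_time_bases = {s.time_base_id for s in streams}
    return len(distinct_time_bases) <= max_time_bases


shared_clock = [ElementaryStream(1, time_base_id=0), ElementaryStream(2, time_base_id=0)]
two_clocks = [ElementaryStream(1, time_base_id=0), ElementaryStream(2, time_base_id=1)]

single_ok = conforms_to_level(shared_clock, max_time_bases=1)
double_rejected = conforms_to_level(two_clocks, max_time_bases=1)
```

Under a Level that allows only a single time base, the first set of streams passes and the second does not.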